The WhelanLabs Search Engine Manager is a GUI for a preconfigured installation of Apache Nutch. Its main purpose is to provide a simple, Windows-based search engine for organizations whose web-based resources are inaccessible to traditional search engines.
The WhelanLabs Search Engine Manager is freely available from the following sites:
Version 2.0:
http://www.freewarefiles.com/WhelanLabs-Search-Engine-Manager_program_47202.html
It’s Free!
Figure 1 - Management Page for Search Engine Site Management - version 2.0
Figure 2 - Management GUI Page for Crawler Management - version 2.0
| File Type | MIME Type | Extension(s) |
|---|---|---|
| Adobe Flash Files (AKA ‘Shockwave Flash’) | application/x-shockwave-flash | .swf |
| Adobe Portable Document Format | application/pdf | .pdf |
| ASCII Text Files | text/plain | .txt, .text |
| BZIP Over ZIP Compressed File Archive File | application/x-bzip2 | .boz |
| C Shell Script Files | application/x-csh | .csh |
| Compressed GZIP files | application/x-gzip | .gz |
| eXtensible Markup Language Files | application/xml | .xml |
| eXtensible Markup Language Files | text/xml | .xml |
| HTML Files | text/html | .html, .htm |
| IETF SGML document Files | text/sgml | .sgm |
| JavaScript Files | application/x-javascript | .js |
| Kspread Spreadsheet Application Files | application/x-kspread | .ksp |
| Kword Word Processor Files | application/x-kword | .kwd, .kwt |
| Microsoft Excel Files | application/vnd.ms-excel | .xls |
| Microsoft PowerPoint Files | application/vnd.ms-powerpoint | .ppt |
| Microsoft Rich Text Files | text/rtf | .rtf |
| Microsoft Word Files | application/msword | .doc |
| OASIS Open Document Master Documents | application/vnd.oasis.opendocument.text-master | .odm |
| OASIS Open Document Presentation Templates | application/vnd.oasis.opendocument.presentation-template | .otp |
| OASIS Open Document Presentations | application/vnd.oasis.opendocument.presentation | .odp |
| OASIS Open Document Spreadsheet Templates | application/vnd.oasis.opendocument.spreadsheet-template | .ots |
| OASIS Open Document Spreadsheets | application/vnd.oasis.opendocument.spreadsheet | .ods |
| OASIS Open Document Text Files | application/vnd.oasis.opendocument.text | .odt |
| OASIS Open Document Text template for HTML | application/vnd.oasis.opendocument.text-web | .oth |
| OASIS Open Document Text Templates | application/vnd.oasis.opendocument.text-template | .ott |
| OpenOffice Calc Files | application/vnd.sun.xml.calc | .sxc |
| OpenOffice Calc template Files | application/vnd.sun.xml.calc.template | .stc |
| OpenOffice Impress Files | application/vnd.sun.xml.impress | .sxi |
| OpenOffice Impress Template Files | application/vnd.sun.xml.impress.template | .sti |
| OpenOffice Writer Files | application/vnd.sun.xml.writer | .sxw |
| OpenOffice Writer Template Files | application/vnd.sun.xml.writer.template | .stw |
| PostScript Files | application/postscript | .ps |
| Really Simple Syndication Files | application/rss+xml | .rss |
| Rich Text Files | text/richtext | .rt |
| Tab Separated Values Files | text/tab-separated-values | .tsv |
| XHTML Files | application/xhtml+xml | .xhtml |
| ZIP files | application/zip | .zip |
| Protocol | Protocol String |
|---|---|
| Regular Web Pages | HTTP:// |
| Secure HTTP Web Pages | HTTPS:// |
| File Transfer Protocol | FTP:// |
| E-mail Links | MAILTO:// |
| Local Files | FILE:// |
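As a rough illustration of how the two tables above might be used together, the following Python sketch checks whether a candidate URL uses one of the listed protocol strings and looks up its MIME type by extension. This is not part of the Search Engine Manager or of Nutch: the function name `describe_url`, the example URLs, and the small `SAMPLE_MIME_TYPES` subset are assumptions made only for this example.

```python
import posixpath
from urllib.parse import urlparse

# Protocol strings from the protocol table, lowercased and without "://".
SUPPORTED_SCHEMES = {"http", "https", "ftp", "mailto", "file"}

# Illustrative subset of the file-type table (extension -> MIME type).
# A complete mapping would list every row of the table.
SAMPLE_MIME_TYPES = {
    ".html": "text/html",
    ".htm": "text/html",
    ".txt": "text/plain",
    ".doc": "application/msword",
    ".xls": "application/vnd.ms-excel",
    ".odt": "application/vnd.oasis.opendocument.text",
    ".zip": "application/zip",
}


def describe_url(url):
    """Return (scheme_supported, mime_type_or_None) for a candidate URL."""
    parsed = urlparse(url)
    scheme_supported = parsed.scheme.lower() in SUPPORTED_SCHEMES
    extension = posixpath.splitext(parsed.path)[1].lower()
    return scheme_supported, SAMPLE_MIME_TYPES.get(extension)


if __name__ == "__main__":
    for candidate in (
        "http://intranet.example/docs/report.doc",
        "FILE:///C:/docs/index.html",
        "gopher://host/readme.txt",
    ):
        print(candidate, describe_url(candidate))
```

A check like this could be run over a seed list before a crawl to spot URLs that fall outside the listed protocols.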
No maximum limits are known to have been published for Nutch. Anecdotal evidence suggests that systems with 100-200 million documents exist. The main constraint when sizing a system is the resources available.
Current data suggests that, on average, the crawler takes about 4 hours to process 1 million URLs, and each 1 million URLs requires about 1.2 GB of index space. [Note: these numbers will be updated based on additional reports from the field. To report your results, please post to the WhelanLabs SearchEngine Manager Forum.]
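The averages above can be turned into a quick planning estimate. The Python sketch below simply scales the quoted figures linearly with the number of URLs; linear scaling is an assumption of this example, not a measured property of the crawler, and the function name `estimate_crawl` is made up for illustration.

```python
# Averages quoted in the text; treat them as rough planning figures only.
HOURS_PER_MILLION_URLS = 4.0
GB_PER_MILLION_URLS = 1.2


def estimate_crawl(url_count):
    """Return (crawl_hours, index_gb), assuming both scale linearly."""
    millions = url_count / 1_000_000
    return millions * HOURS_PER_MILLION_URLS, millions * GB_PER_MILLION_URLS


if __name__ == "__main__":
    for urls in (250_000, 1_000_000, 5_000_000):
        hours, gigabytes = estimate_crawl(urls)
        print(f"{urls:>9,} URLs: about {hours:.1f} h crawl, {gigabytes:.1f} GB index")
```

Actual results will depend on hardware, network speed, and document sizes, so figures like these are best used only as a starting point.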
The WhelanLabs Search Engine Manager is essentially a mash-up of technologies with an administrative user interface added. The technology stack for the application is:

Figure 1: Technology Stack for WhelanLabs Search Engine Manager
There are a few areas within the application that merit mention. They are:
Use of templates: To support changes to the configuration of underlying components, the application uses configuration file templates in several places, which allow the manager to overwrite the content of the configuration files. Most templates are used to configure Apache Tomcat and Apache Nutch. As a consequence, direct modifications to the configuration files may later be overwritten by actions performed by the manager.
Apache Nutch configuration settings: The default Nutch configuration settings have been changed to provide features that were seen as needed but missing from the out-of-the-box (OOTB) Nutch configuration. Specifically, the range of searchable document types has been greatly extended, and the maximum amount of a file to be indexed has been increased to a size that I feel will produce fewer misses in searches without being so big as to overload the system.
Java and Cygwin: For a variety of reasons, the Java and Cygwin components are not part of the shipped installer. Instead, the installer checks whether they are installed on the local system and, if so, uses those instances. If they are not installed locally, the installer will aid the administrator in installing them from their respective Internet locations. Additionally, the use of Cygwin is isolated via a WhelanLabs-specific mount point to avoid disturbing other uses of Cygwin and to promote general application isolation.
Outgrowing the Application: It is entirely possible that sites might outgrow the Search Engine Manager application due to special needs (related to configuration, management, performance, or a host of other things). I do not believe that there are any problems with this; it might in fact be a logical evolution as administrators become more familiar with how the system works. Outgrowing the application might come in phases, starting with ‘under the covers’ configuration, moving to direct command-line invocation of the shell scripts, and ending with complete abandonment of the administrative UI and replacement of some of the third-party components. While I accept and encourage this type of advancement, the Search Engine Manager will likely continue to cater primarily to the newbie, and ongoing development efforts will focus accordingly.
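To illustrate why hand edits to generated configuration files can be lost under the template approach described in the list above, here is a minimal Python sketch. It does not reproduce the manager's actual templates or code, which are not shown here; the template text, the `port` placeholder, the file name, and the function names are all hypothetical.

```python
import tempfile
from pathlib import Path
from string import Template

# Hypothetical template text; "port" is an illustrative placeholder only.
CONNECTOR_TEMPLATE = Template('<Connector port="$port" protocol="HTTP/1.1" />\n')


def write_config(target, settings):
    """Render the template and overwrite the target configuration file."""
    target.write_text(CONNECTOR_TEMPLATE.substitute(settings), encoding="utf-8")


def demo():
    """Show a direct edit being lost when the file is regenerated."""
    with tempfile.TemporaryDirectory() as workdir:
        config = Path(workdir) / "server-fragment.xml"
        write_config(config, {"port": 8080})

        # An administrator edits the generated file directly...
        edited = config.read_text(encoding="utf-8") + "<!-- hand edit -->\n"
        config.write_text(edited, encoding="utf-8")

        # ...and the edit disappears the next time the file is regenerated.
        write_config(config, {"port": 8081})
        return config.read_text(encoding="utf-8")


if __name__ == "__main__":
    print(demo())
```

Under a scheme like this, lasting changes would need to go into the template or the settings passed to it rather than into the generated file.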
Question: How do I get my questions answered?
Answer: Post a question to the forum maintained at http://n2.nabble.com/WhelanLabs-SearchEngine-Manager-f1641671.html
A forum has been established on Nabble to answer questions related to this application. The forum is at http://n2.nabble.com/WhelanLabs-SearchEngine-Manager-f1641671.html
As for the underlying components, their links are as follows: