http://www.whelanlabs.com

 

Update: Due to recent changes in how Cygwin works with mount points, I have decided to discontinue this tool rather than undertake the large effort of rewriting sections to fit the new model. I find it unfortunate that Cygwin made these changes in its existing 1.5 branch, but I am unable to expend the effort needed to work with the "updated" versions.

 

 

 

Overview:

The WhelanLabs Search Engine Manager is a graphical front end to a preconfigured installation of Apache Nutch. Its main purpose is to provide a simple Windows-based search engine for organizations whose web-based resources are inaccessible to traditional search engines.

 

The WhelanLabs Search Engine Manager is freely available from the following sites:

 

Version 2.0: 

http://www.freewarefiles.com/WhelanLabs-Search-Engine-Manager_program_47202.html

 

Features:

Cost:

It’s Free!

 

Management GUI:

 

Figure 1 - Management Page for Search Engine Site Management - version 2.0

 

 

Figure 2 - Management GUI Page for Crawler Management - version 2.0

 

Administrative Features:

 Supported File Types:

 

File Type | MIME Type | Extension(s)
Adobe Flash Files (AKA ‘Shockwave Flash’) | application/x-shockwave-flash | .swf
Adobe Portable Document Format | application/pdf | .pdf
ASCII Text Files | text/plain | .txt, .text
BZIP Over ZIP Compressed File Archive File | application/x-bzip2 | .boz
C Shell Script Files | application/x-csh | .csh
Compressed GZIP files | application/x-gzip | .gz
eXtensible Markup Language Files | application/xml | .xml
eXtensible Markup Language Files | text/xml | .xml
HTML Files | text/html | .html, .htm
IETF SGML document Files | text/sgml | .sgm
JavaScript Files | application/x-javascript | .js
Kspread Spreadsheet Application Files | application/x-kspread | .ksp
Kword Word Processor Files | application/x-kword | .kwd, .kwt
Microsoft Excel Files | application/vnd.ms-excel | .xls
Microsoft PowerPoint Files | application/vnd.ms-powerpoint | .ppt
Microsoft Rich Text Files | text/rtf | .rtf
Microsoft Word Files | application/msword | .doc
OASIS Open Document Master Documents | application/vnd.oasis.opendocument.text-master | .odm
OASIS Open Document Presentation Templates | application/vnd.oasis.opendocument.presentation-template | .otp
OASIS Open Document Presentations | application/vnd.oasis.opendocument.presentation | .odp
OASIS Open Document Spreadsheet Templates | application/vnd.oasis.opendocument.spreadsheet-template | .ots
OASIS Open Document Spreadsheets | application/vnd.oasis.opendocument.spreadsheet | .ods
OASIS Open Document Text Files | application/vnd.oasis.opendocument.text | .odt
OASIS Open Document Text template for HTML | application/vnd.oasis.opendocument.text-web | .oth
OASIS Open Document Text Templates | application/vnd.oasis.opendocument.text-template | .ott
OpenOffice Calc Files | application/vnd.sun.xml.calc | .sxc
OpenOffice Calc template Files | application/vnd.sun.xml.calc.template | .stc
OpenOffice Impress Files | application/vnd.sun.xml.impress | .sxi
OpenOffice Impress Template Files | application/vnd.sun.xml.impress.template | .sti
OpenOffice Writer Files | application/vnd.sun.xml.writer | .sxw
OpenOffice Writer Template Files | application/vnd.sun.xml.writer.template | .stw
PostScript Files | application/postscript | .ps
Really Simple Syndication Files | application/rss+xml | .rss
Rich Text Files | text/richtext | .rt
Tab Separated Values Files | text/tab-separated-values | .tsv
XHTML Files | application/xhtml+xml | .xhtml
ZIP files | application/zip | .zip

 

Supported Protocols:

 

Protocol | Protocol String
Regular Web Pages | HTTP://
Secure HTTP Web Pages | HTTPS://
File Transfer Protocol | FTP://
E-mail Links | MAILTO://
Local Files | FILE://

 

Sizing and Capacity:

No maximum limits have been published for Nutch, as far as I know. Anecdotal evidence suggests that systems with 100-200 million documents exist. The main sizing question is the resources available.

 

 

Current data suggests that, on average, the crawler should take 4 hours to process each 1 million URLs, and each 1 million URLs should take 1.2 GB of space to index. [Note: these numbers will be updated based on additional reports from the field. To report your results, please post to The WhelanLabs SearchEngine Manager Forum.]
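As a rough planning aid, the averages quoted above can be turned into a small calculation. This is only a sketch that assumes both figures scale linearly with the number of URLs; the function and constant names are made up for illustration, and real crawls will vary with hardware and content.

```python
# Averages quoted in this section; treat them as ballpark figures only.
HOURS_PER_MILLION_URLS = 4.0
GB_PER_MILLION_URLS = 1.2


def estimate_crawl(url_count):
    """Return (crawl_hours, index_gb), assuming linear scaling with URL count."""
    millions = url_count / 1_000_000
    return millions * HOURS_PER_MILLION_URLS, millions * GB_PER_MILLION_URLS


# Example: a site collection of 5 million URLs.
hours, index_gb = estimate_crawl(5_000_000)
print(f"{hours:.1f} hours, {index_gb:.1f} GB")  # prints "20.0 hours, 6.0 GB"
```

So a 5 million URL crawl would need roughly 20 hours and about 6 GB of index space under these assumptions.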

 

Architecture and Design:

The WhelanLabs Search Engine Manager is basically a mash-up of technologies with an administrative user interface added. The technology stack for the application is:

 

Figure 1: Technology Stack for WhelanLabs Search Engine Manager

 

 

There are a few areas within the application that merit mention. They are:

 

Use of templates: To support changes to the configuration of the underlying components, the application uses configuration file templates in several places, which lets the manager overwrite the content of the configuration files. Most of the templates configure Apache Tomcat and Apache Nutch. One consequence is that direct modifications to the configuration files might later be overwritten by actions performed through the manager.
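To make the template idea concrete, here is a minimal sketch of how a template-driven configuration file behaves. It is not the manager's actual code: the template text, the `$http_port` placeholder, the file name, and the helper functions are all hypothetical. It only illustrates why a hand edit to a generated file is lost the next time the file is regenerated.

```python
import os
import tempfile
from string import Template

# Hypothetical template fragment; the tool's real templates are not shown here.
CONNECTOR_TEMPLATE = Template('<Connector port="$http_port" protocol="HTTP/1.1" />\n')


def render_config(template, settings):
    """Fill the template placeholders with the given settings."""
    return template.substitute(settings)


def write_config(path, template, settings):
    """Regenerate the file from the template, replacing whatever was there."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(render_config(template, settings))


with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "connector-fragment.xml")  # hypothetical name
    write_config(target, CONNECTOR_TEMPLATE, {"http_port": 8080})

    # Simulate an administrator editing the generated file by hand.
    with open(target, "a", encoding="utf-8") as handle:
        handle.write("<!-- hand edit -->\n")

    # A later change made through the manager regenerates the file.
    write_config(target, CONNECTOR_TEMPLATE, {"http_port": 9090})
    with open(target, encoding="utf-8") as handle:
        final_text = handle.read()

print(final_text)  # the hand edit is gone; only the rendered template remains
```

If a setting needs to survive, the safer place to change it in a scheme like this is the template or the manager's own settings, not the generated file.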

 

Apache Nutch configuration settings: The default Nutch configuration settings have been changed to add features that were seen as needed but missing from the out-of-the-box (OOTB) Nutch configuration. Specifically, the range of searchable document types has been greatly extended, and the maximum amount of a file to be indexed has been increased to a size that I feel will produce fewer misses in searches without being so big as to overload the system.
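For readers who have not seen Nutch configuration before, overrides are written as name/value properties inside a `configuration` element. The sketch below builds such a fragment for the two standard Nutch content-limit properties, `http.content.limit` and `file.content.limit`. Which properties and values this application actually ships with is not stated here, so the 1 MB figure and the helper name are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET


def build_overrides(properties):
    """Return a Nutch-style <configuration> fragment for the given properties."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


# Illustrative values only: raise the per-document content limit to 1 MB.
overrides = build_overrides({
    "http.content.limit": 1048576,
    "file.content.limit": 1048576,
})
print(overrides)
```

A larger limit means more of each long document gets indexed, at the cost of more fetch time, memory, and index space per document.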

 

Java and Cygwin: For a variety of reasons, the Java and Cygwin components are not part of the shipped installer. Instead, the installer checks whether they are installed on the local system and, if so, uses those instances. If they are not installed locally, the installer will aid the administrator in installing them from their respective Internet locations. Additionally, the use of Cygwin is isolated via a WhelanLabs-specific mount point in order to avoid disturbing other uses of Cygwin and to promote general application isolation.
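The check-then-install behavior described above can be sketched as follows. How the real installer detects Java and Cygwin is not described in this document; the version below simply assumes both expose an executable on the PATH (`java`, and Cygwin's `cygpath` utility), and the function names are hypothetical.

```python
import shutil


def find_components(executables, which=shutil.which):
    """Map each executable name to its full path, or None if it is not found."""
    return {name: which(name) for name in executables}


def missing_components(found):
    """Return the names that still need to be installed, in sorted order."""
    return sorted(name for name, path in found.items() if path is None)


# Assumption: a PATH lookup is a good enough stand-in for the installer's check.
found = find_components(["java", "cygpath"])
for name in missing_components(found):
    print(f"{name} not found; it would need to be installed first")
```

The `which` parameter exists only so the lookup can be swapped out, for example to test the logic without depending on what is installed on the local machine.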

 

Outgrowing the Application: It is entirely possible that sites might outgrow the Search Engine Manager application due to special needs (related to configuration, management, performance, or a host of other things). I do not believe that there is any problem with this; it might in fact be a logical evolution as administrators become more familiar with how the system works. Outgrowing the application might come in phases, starting with ‘under the covers’ configuration, moving to direct command-line invocation of the shell scripts, and ending with complete abandonment of the administrative UI and replacement of some of the third-party components. While I accept and encourage this type of advancement, the Search Engine Manager will likely continue to cater primarily to the newbie and focus ongoing development efforts accordingly.

FAQ:

Question: How do I get my questions answered?

Answer: Post a question to the forum maintained at http://n2.nabble.com/WhelanLabs-SearchEngine-Manager-f1641671.html

External Links:

A forum has been established on Nabble to answer questions related to this application. The forum is at http://n2.nabble.com/WhelanLabs-SearchEngine-Manager-f1641671.html

As for the underlying components, their links are as follows:

 
