The jar files required by the Spencer components can be found at the links below (correct at the time of writing); although they are required by this application, they are not otherwise affiliated with it in any way. The versions referred to here have been proven to work with Spencer - newer versions may offer improved performance and stability, or they may prove to be incompatible. Feel free to test new versions and let me know how you fare.
This document describes the workings of the Spencer filesystem indexing system. Spencer has three components:

The Indexer (spendex)
The search interface (spencer)
The browse interface (spencer)

This document takes Spendex and Spencer in turn as they are fundamentally different applications. The Spencer web application, which provides both the search and browse interfaces, is discussed below. The suite requires a MySQL database in which the index is stored and an SMB (Samba) share which can be accessed over the network. In addition, it requires a standards-compliant servlet container such as Tomcat with an appropriate Java JDK (1.4.2_06 or later) in which to deploy the search/browse tools. The filestore host server must also have an appropriate JRE (1.4.2_06 or later) installed to run the indexer.
Database
The database used in developing this application was MySQL 4.0.18-nt, and this version or later should work well with Spencer. Create a new database by running: mysqladmin -p create <database> where <database> should be replaced with the name of the database you wish to use (e.g. spencer). You will be prompted for the root password during this process. Next, create the database structure by placing the supplied spencer.sql file in the MySQL bin folder and typing: mysql -p <database> < spencer.sql again replacing <database> with the name of your newly created database. All done, now let's start building the index.
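For example, if you choose to call the database spencer, the two commands (run from the MySQL bin folder, with your chosen name substituted in) would be:

mysqladmin -p create spencer
mysql -p spencer < spencer.sql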
Spendex
The indexer must be executed on the server where the files reside, and the index must be populated (at least partially) before the search and browse interfaces will work. The initial indexing exercise will take some time. My reference implementation of around 330,000 files in around 50Gb of storage took about three days to create on a dual 800MHz Xeon 2GB RAM box. First create a new folder outside of the file hierarchy to be indexed - for example, e:\bin (this of course refers to a Windows server - there is no reason why this will not work on a UNIX/Linux box). Now place the requisite Spencer files (spendex.jar, spencer.properties and the log4j.xml configuration referenced in the scripts below) in this folder.
At this point, open spencer.properties in a text editor and amend the settings appropriately:

server = Name or IP of the server on which the database resides
database = Name of the database you created above
username = A user with rights to the database (read/write/delete)
password = The user's password
rootDir = The location of the root of the file hierarchy on the local server (NB: Replace single backslashes with double backslashes - e.g. c:\\Documents and Settings)
logdepth = The verbosity of the output, between 0 and 4. I recommend 0 or 1 unless you are experiencing problems.
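A completed spencer.properties might therefore look something like this (the server name, credentials and root directory below are placeholders - substitute your own values):

server = dbserver01
database = spencer
username = spenceruser
password = secret
rootDir = e:\\shared\\documents
logdepth = 1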
Now find the files indicated at the top of this document on the Internet; these are required third-party libraries. Download them and place them in the same folder as the other files.
Now it is time to create a script/batch file to actually execute the indexing process. On Windows the appropriate batch file might look something like this:

e:
cd bin
"C:\Program Files\Java\j2re1.4.2_05\bin\java.exe" -Xms250M -Xmx500M -Dlog4j.configuration=file:///e:/bin/log4j.xml -cp "e:\bin\spendex.jar";"e:\bin\poi-2.5.1-final-20040804.jar";"e:\bin\mysql-connector-java-3.0.15-ga-bin.jar";"e:\bin\xsdlib.jar";"e:\bin\tm-extractors-0.4.jar";"e:\bin\PDFBox-0.6.7a.jar";"e:\bin\log4j-1.2.8.jar";"e:\bin\jxl.jar" spendex.Index >e:\bin\log.txt 2>&1
The redirectors at the end ensure that all of the output is recorded in the log.txt file for subsequent perusal. Note that for very large indices with a high logging level, this file can easily reach 100Mb very quickly. The best way to ensure that your index is maintained is to use your operating system to schedule execution of the batch file/script at regular intervals. I choose to run it overnight every night by creating an appropriately scheduled Windows Scheduled Task which points at the batch file. You could use a cron job in UNIX (an example entry is shown below). Periodically, check the log file and make sure that nothing too untoward is going on.
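A minimal crontab entry along those lines might look like this (the script path and the 1am nightly schedule are illustrative - here the indexer is assumed to be wrapped in a shell script at /opt/spencer/spendex.sh):

0 1 * * * /opt/spencer/spendex.sh >> /opt/spencer/log.txt 2>&1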
Running the batch file will begin the long task of building the index, and you can now get on with deploying the search/browse interfaces (see below). The indexing task will recursively examine every file and folder in the filesystem, starting at the root that you specify in the properties file. If the file/folder already exists in the database and the date stamp on it has not changed since the last index, nothing is done. If it is not in the database or the time stamp has changed, however, the file/folder is (re)added to the database and, if the file extension is recognised (MS Office, OpenOffice/StarOffice, PDF, zip or plain/marked-up text), the text is stripped out of the file and the words are counted, sorted, cross-referenced and added to the index.
There are a couple of other features of the indexer:

Running it with spendex.getCommonWords instead of spendex.Index in the command line populates the commons table of the database with the most common words currently listed. An example script might look like:
e:
cd bin
"C:\Program Files\Java\j2re1.4.2_05\bin\java.exe" -Xms250M -Xmx500M -Dlog4j.configuration=file:///e:/bin/log4j.xml -cp "e:\bin\spendex.jar";"e:\bin\poi-2.5.1-final-20040804.jar";"e:\bin\mysql-connector-java-3.0.15-ga-bin.jar";"e:\bin\xsdlib.jar";"e:\bin\tm-extractors-0.4.jar";"e:\bin\PDFBox-0.6.7a.jar";"e:\bin\log4j-1.2.8.jar";"e:\bin\jxl.jar" spendex.getCommonWords 250 >e:\bin\common.txt 2>&1
This will populate 250 rows of the table (note the 250 in the command - there's the clue!).
A less useful option forces a re-index of certain file types in a rather dramatic way. Use this:

<classpath> spendex.delete aaa_bbb_ccc

(where <classpath> stands for the same java invocation and -cp argument used in the scripts above) to DELETE all files with aaa, bbb or ccc as their extensions AND all of the associated word indices. Next time an index runs, the deleted files will be re-indexed from scratch. This is only really useful if there has been a problem with selected portions of the index.
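As an illustrative example, a batch file to purge (say) Word and Excel documents so that they are re-indexed on the next run would follow the same pattern as the scripts above (the doc_xls extensions and the output file name are just examples):

e:
cd bin
"C:\Program Files\Java\j2re1.4.2_05\bin\java.exe" -Xms250M -Xmx500M -Dlog4j.configuration=file:///e:/bin/log4j.xml -cp "e:\bin\spendex.jar";"e:\bin\poi-2.5.1-final-20040804.jar";"e:\bin\mysql-connector-java-3.0.15-ga-bin.jar";"e:\bin\xsdlib.jar";"e:\bin\tm-extractors-0.4.jar";"e:\bin\PDFBox-0.6.7a.jar";"e:\bin\log4j-1.2.8.jar";"e:\bin\jxl.jar" spendex.delete doc_xls >e:\bin\delete.txt 2>&1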
The Web App
The Spencer.war file can be deployed using your preferred method into an appropriate servlet container (I recommend Tomcat), and the required third-party jar files (see the links at the top of this document) must be present in the Tomcat classpath or the Spencer lib folder.
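For example, with a default Tomcat installation on Windows the simplest deployment is to copy the war into Tomcat's webapps folder and let it deploy automatically (the Tomcat path here is illustrative):

copy Spencer.war "C:\Program Files\Apache Software Foundation\Tomcat 5.5\webapps"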
Once deployed, the first page to visit MUST be http://server/spencer/admin.jsp. This is where you specify the connection and look-and-feel parameters of the app. It will look pretty ugly the first time you use it, but to prettify it quickly, enter spencer.css in the stylesheet box and click 'save' for an easier-on-the-eye experience. Most of the fields on this page are pretty self-explanatory. Those in the database section will usually match those that you specified in spencer.properties above. Those in the second section need to refer to the server, the share and an account with appropriate read access to the contents. The final portions are HTML fragments that will be included in every page and can be used to customise the look and feel of your implementation.
All done! Have a browse, have a search once the indexing is complete and see what you find.