The jar files required by the Spencer components can be found at the links below (correct at the time of writing); although they are required by this application, they are not otherwise affiliated with it in any way. The versions referred to here have been proven to work with Spencer - newer versions may offer improved performance and stability, or they may prove to be incompatible. Feel free to test new versions and let me know how you fare.
This document describes the workings of the Spencer filesystem indexing system. Spencer has three components:

The Indexer (spendex)
The search interface (spencer)
The browse interface (spencer)

This document takes Spendex and Spencer in turn as they are fundamentally different applications. The Spencer web application, which provides both the search and browse interfaces, is discussed below. The suite requires a MySQL database in which the index is stored and an SMB (Samba) share which can be accessed over the network. In addition, it requires a standards-compliant servlet container such as Tomcat with an appropriate Java JDK (1.4.2_06 or later) in which to deploy the search/browse tools. The filestore host server must also have an appropriate JRE (1.4.2_06 or later) installed to run the indexer.
Database
The database used in developing this application was MySQL 4.0.18-nt, and this version or later should work well with Spencer. Create a new database by running: mysqladmin -p create <database> where <database> should be replaced with the name of the database you wish to use (e.g. spencer). You will be prompted for the root password during this process. Next, create the database structure by placing the supplied spencer.sql file in the MySQL bin folder and typing: mysql -p <database> < spencer.sql again replacing <database> with the name of your newly created database. All done, now let's start building the index.
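For example, if you choose to call the database spencer, the two commands (run from the MySQL bin folder, with your chosen name substituted in) would be:

mysqladmin -p create spencer
mysql -p spencer < spencer.sql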
Spendex
The indexer must be executed on the server where the files reside, and the index must be populated (at least partially) before the search and browse interfaces will work. The initial indexing exercise will take some time. My reference implementation of around 330,000 files in around 50Gb of storage took about three days to create on a dual 800MHz Xeon 2GB RAM box. First create a new folder outside of the file hierarchy to be indexed - for example, e:\bin (this of course refers to a Windows server - there is no reason why this will not work on a UNIX/Linux box). Now place the requisite Spencer files (spendex.jar, spencer.properties and the log4j.xml configuration referenced in the scripts below) in this folder.
At this point, open spencer.properties in a text editor and amend the settings appropriately:

server = Name or IP of the server on which the database resides
database = Name of the database you created above
username = A user with rights to the database (read/write/delete)
password = The user's password
rootDir = The location of the root of the file hierarchy on the local server (NB: Replace single backslashes with double backslashes - e.g. c:\\Documents and Settings)
logdepth = The verbosity of the output, between 0 and 4. I recommend 0 or 1 unless you are experiencing problems.
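A completed spencer.properties might therefore look something like this (the server name, credentials and root directory below are placeholders - substitute your own values):

server = dbserver01
database = spencer
username = spenceruser
password = secret
rootDir = e:\\shared\\documents
logdepth = 1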
Now find the files indicated at the top of this document on the Internet; these are required third-party libraries. Download them and place them in the same folder as the other files.
Now it is time to create a script/batch file to actually execute the indexing process. On Windows the appropriate batch file might look something like this:

e:
cd bin
"C:\Program Files\Java\j2re1.4.2_05\bin\java.exe" -Xms250M -Xmx500M -Dlog4j.configuration=file:///e:/bin/log4j.xml -cp "e:\bin\spendex.jar";"e:\bin\poi-2.5.1-final-20040804.jar";"e:\bin\mysql-connector-java-3.0.15-ga-bin.jar";"e:\bin\xsdlib.jar";"e:\bin\tm-extractors-0.4.jar";"e:\bin\PDFBox-0.6.7a.jar";"e:\bin\log4j-1.2.8.jar";"e:\bin\jxl.jar" spendex.Index >e:\bin\log.txt 2>&1
The redirectors at the end ensure that all of the output is recorded in the log.txt file for subsequent perusal. Note that for very large indices with a high logging level, this file can easily reach 100Mb very quickly. The best way to ensure that your index is maintained is to use your operating system to schedule execution of the batch file/script at regular intervals. I choose to run it overnight every night by creating an appropriately scheduled Windows Scheduled Task which points at the batch file. You could use a cron job in UNIX (an example entry is shown below). Periodically, check the log file and make sure that nothing too untoward is going on.
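A minimal crontab entry along those lines might look like this (the script path and the 1am nightly schedule are illustrative - here the indexer is assumed to be wrapped in a shell script at /opt/spencer/spendex.sh):

0 1 * * * /opt/spencer/spendex.sh >> /opt/spencer/log.txt 2>&1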
Running the batch file will begin the long task of building the index, and you can now get on with deploying the search/browse interfaces (see below). The indexing task will recursively examine every file and folder in the filesystem, starting at the root that you specify in the properties file. If the file/folder already exists in the database and the date stamp on it has not changed since the last index, nothing is done. If it is not in the database or the time stamp has changed, however, the file/folder is (re)added to the database and, if the file extension is recognised (MS Office, OpenOffice/StarOffice, PDF, zip or plain/marked-up text), the text is stripped out of the file and the words are counted, sorted, cross-referenced and added to the index.
There are a couple of other features of the indexer:

Running it with spendex.getCommonWords instead of spendex.Index in the command line populates the commons table of the database with the most common words currently listed. An example script might look like:
e:
cd bin
"C:\Program Files\Java\j2re1.4.2_05\bin\java.exe" -Xms250M -Xmx500M -Dlog4j.configuration=file:///e:/bin/log4j.xml -cp "e:\bin\spendex.jar";"e:\bin\poi-2.5.1-final-20040804.jar";"e:\bin\mysql-connector-java-3.0.15-ga-bin.jar";"e:\bin\xsdlib.jar";"e:\bin\tm-extractors-0.4.jar";"e:\bin\PDFBox-0.6.7a.jar";"e:\bin\log4j-1.2.8.jar";"e:\bin\jxl.jar" spendex.getCommonWords 250 >e:\bin\common.txt 2>&1
This will populate 250 rows of the table (note the 250 in the command - there's the clue!).
A less useful option forces a re-index of certain file types in a rather dramatic way. Use this:

<classpath> spendex.delete aaa_bbb_ccc

(where <classpath> stands for the same java invocation and -cp argument used in the scripts above) to DELETE all files with aaa, bbb or ccc as their extensions AND all of the associated word indices. Next time an index runs, the deleted files will be re-indexed from scratch. This is only really useful if there has been a problem with selected portions of the index.
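As an illustrative example, a batch file to purge (say) Word and Excel documents so that they are re-indexed on the next run would follow the same pattern as the scripts above (the doc_xls extensions and the output file name are just examples):

e:
cd bin
"C:\Program Files\Java\j2re1.4.2_05\bin\java.exe" -Xms250M -Xmx500M -Dlog4j.configuration=file:///e:/bin/log4j.xml -cp "e:\bin\spendex.jar";"e:\bin\poi-2.5.1-final-20040804.jar";"e:\bin\mysql-connector-java-3.0.15-ga-bin.jar";"e:\bin\xsdlib.jar";"e:\bin\tm-extractors-0.4.jar";"e:\bin\PDFBox-0.6.7a.jar";"e:\bin\log4j-1.2.8.jar";"e:\bin\jxl.jar" spendex.delete doc_xls >e:\bin\delete.txt 2>&1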
The Web App
The Spencer.war file can be deployed using your preferred method into an appropriate servlet container (I recommend Tomcat), and the required third-party jar files (see the links at the top of this document) must be present in the Tomcat classpath or the Spencer lib folder.
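For example, with a default Tomcat installation on Windows the simplest deployment is to copy the war into Tomcat's webapps folder and let it deploy automatically (the Tomcat path here is illustrative):

copy Spencer.war "C:\Program Files\Apache Software Foundation\Tomcat 5.5\webapps"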
Once deployed, the first page to visit MUST be http://server/spencer/admin.jsp. This is where you specify the connection and look-and-feel parameters of the app. It will look pretty ugly the first time you use it, but to prettify it quickly, enter spencer.css in the stylesheet box and click 'save' for an easier-on-the-eye experience. Most of the fields on this page are pretty self-explanatory. Those in the database section will usually match those that you specified in spencer.properties above. Those in the second section need to refer to the server, the share and an account with appropriate read access to the contents. The final portions are HTML fragments that will be included in every page and can be used to customise the look and feel of your implementation.
All done! Have a browse, have a search once the indexing is complete and see what you find.