Saturday, April 06, 2013

Apache Lucene - Brief Concept & Code Snippet

Searching ... a simple concept backed by complex algorithms. That's how I see it. I have always been fascinated by that complexity and wanted to implement and use some sort of basic search.
So I ended up using Apache's Lucene library for search functionality. In this post I will walk through some basic terminology and concepts of this library, along with small code snippets to initialize and make use of it.

Lucene - it is a text-based search library. The source information used to build an index can be anything: a database, a file system, or web sites. You can feed in information from any source you like to index.

Data is indexed by Lucene using the "inverted indexing" technique. That means it "retrieves the documents related to a keyword instead of scanning every document for a keyword".
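To make the inverted-index idea concrete, here is a minimal sketch in plain Java (no Lucene involved); the class and method names are mine, purely for illustration:

```java
import java.util.*;

// Toy inverted index: keyword -> set of IDs of documents containing it.
// Looking up a keyword is a single map access instead of a scan over
// every document.
public class InvertedIndexSketch {
    private final Map<String, Set<Integer>> index = new HashMap<String, Set<Integer>>();

    public void add(int docId, String text) {
        for (String token : text.toLowerCase().split("\\s+")) {
            Set<Integer> postings = index.get(token);
            if (postings == null) {
                postings = new TreeSet<Integer>();
                index.put(token, postings);
            }
            postings.add(docId);
        }
    }

    public Set<Integer> search(String keyword) {
        Set<Integer> postings = index.get(keyword.toLowerCase());
        return postings == null ? Collections.<Integer>emptySet() : postings;
    }
}
```

Lucene's real postings lists also store term frequencies and positions (that is how scoring and phrase queries work), but the lookup principle is the same.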

On Apache's website, the below two things are available for download:
  • Lucene - an engine/library that programmers can use to build customized search.
  • Solr - a war file that non-programmers can use. It can be deployed directly on Tomcat, Jetty, or any other web server.
Lucene Concepts
  • Document - The unit of searching/indexing, just like a row when we fire SQL queries.
  • Fields - A document consists of one or more fields - like columns in a row.
  • Searching - Done using the "QueryParser" class.
  • Queries - Lucene has its own mini query language. It also has the ability to add weight to fields, known as boosting.
  • Building Indexes - To build an index, Lucene needs a directory on the file system where the information can be stored. While indexing records, we need to specify
    • which fields need to be stored, and
    • which fields need to be indexed.
  • Directory - FSDirectory is the abstract class which points to the index directory. Its direct subclasses, plus the in-memory RAMDirectory, are listed below.
    • It is recommended to let Lucene pick the implementation class based on the environment (FSDirectory.open does exactly that).
    • SimpleFSDirectory - poor concurrent performance.
    • NIOFSDirectory - a poor choice on Windows due to a JRE bug.
    • MMapDirectory - also affected by a JRE bug (closed index files may not be unmapped immediately).
    • RAMDirectory - an in-memory index; it cannot handle huge indexes.
    • There are a few more implementation classes provided by Lucene, which can easily be located in the Javadocs shipped with it.
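To make the query mini language and boosting concrete, here are a few illustrative query strings (the field names are just examples):

```
title:gravity                   term query against the title field
artist:"john mayer"             phrase query
title:gravity^2 OR album:live   boost: title matches score twice as high
```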
Coming to the coding part.
  • Index Directory
File idxFile = new File(INDEX_DIR); //e.g. /work/index/
Directory idxDir = FSDirectory.open(idxFile);
  • Prepare Analyzer
Map<String, Analyzer> map = new HashMap<String, Analyzer>(); // field name to Analyzer pairs
map.put(FILE_NAME, new StandardAnalyzer(Version.LUCENE_36));
Analyzer analyzer = new PerFieldAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_36), map);
  • Index Writer Configuration
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36, analyzer);
  • Index Writer
IndexWriter indexWriter = new IndexWriter(idxDir, config);
Before we go further, let's take a quick peek at Analyzer.
  • It builds up token streams.
  • It represents a policy for extracting index terms from text.
  • Subclasses
    • PerFieldAnalyzerWrapper
    • ReusableAnalyzerBase
      • KeywordAnalyzer
      • PatternAnalyzer
      • WhitespaceAnalyzer
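To see what an Analyzer actually produces, here is a small sketch that prints the tokens StandardAnalyzer extracts from a piece of text. It assumes lucene-core 3.6.x on the classpath; the class name and sample text are mine:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class AnalyzerPeek {
    // Run the given text through StandardAnalyzer and collect the tokens.
    public static List<String> tokens(String text) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
        TokenStream ts = analyzer.tokenStream("field", new StringReader(text));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        List<String> out = new ArrayList<String>();
        ts.reset();
        while (ts.incrementToken()) {
            out.add(term.toString());
        }
        ts.end();
        ts.close();
        return out;
    }
}
```

StandardAnalyzer lowercases the input and drops English stop words such as "the", so "The Quick Brown Fox" should come back as quick, brown, fox.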
Moving forward with code snippets to
  • Create documents to be stored in the index.
Document document = new Document();
Field fPath = new Field(FULL_PATH, path, Field.Store.YES, Field.Index.toIndex(true, true, false)); // indexed, analyzed, norms kept
document.add(fPath); // Add a field to the document. Repeat for every field that needs to be stored.
  • Add document to index
indexWriter.addDocument(document);
Fine-tune the above process to build your index. Once the index is ready, let's gear up for searching...
  • Initialize index directory
IndexReader ir = IndexReader.open(idxDir);
IndexSearcher searcher = new IndexSearcher(ir);
  • Prepare fields analyzers
Map<String, Analyzer> map = new HashMap<String, Analyzer>();
map.put(FILE_NAME, new StandardAnalyzer(Version.LUCENE_36));
Analyzer analyzer = new PerFieldAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_36), map);
  • Query Parser
QueryParser qp = new MultiFieldQueryParser(Version.LUCENE_36, new String[]{FILE_NAME, TITLE, ALBUM, ARTIST}, analyzer);// FILE_NAME, TITLE, ALBUM, ARTIST are the fields of Document used in my example
Query query = qp.parse(srchText); // pass in the search keyword(s)
TopDocs res = searcher.search(query, 10);
  • Iterate through results
for (ScoreDoc sc : res.scoreDocs) {
    Document doc = searcher.doc(sc.doc); // sc.doc gives the document ID
    String fileName = doc.getFieldable(FILE_NAME).stringValue(); // read a field value
}
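Putting the pieces together, here is a compact end-to-end sketch: index one document in memory, then search it back. It assumes lucene-core 3.6.x on the classpath; the class name, field name, and sample values are mine, not from the post:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class LuceneRoundTrip {
    // Index a single document with one field, run a query against it,
    // and return the number of hits.
    public static int indexAndSearch(String fieldValue, String queryText) throws Exception {
        RAMDirectory dir = new RAMDirectory(); // in-memory index, fine for a demo
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);

        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(Version.LUCENE_36, analyzer));
        Document doc = new Document();
        doc.add(new Field("title", fieldValue, Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.close(); // commits the index and releases the write lock

        IndexReader reader = IndexReader.open(dir);
        IndexSearcher searcher = new IndexSearcher(reader);
        Query query = new QueryParser(Version.LUCENE_36, "title", analyzer).parse(queryText);
        TopDocs hits = searcher.search(query, 10);
        searcher.close();
        reader.close();
        return hits.totalHits;
    }
}
```

For real data you would swap RAMDirectory for FSDirectory.open(...) and keep the IndexWriter around instead of reopening it per document.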
Hope this helps you get an implementation rolling quickly.
Thanks !!!