Saturday, April 06, 2013

Apache Lucene - Brief Concept & Code Snippet

Searching ... a simple concept backed by complex algorithms. That's how I see it. I have always been fascinated by that complexity and wanted to implement and use some sort of basic search.
So I ended up using Apache's Lucene library for search functionality. In this post I will walk through some basic terminology and concepts of this library, along with small code snippets to initialize and make use of it.

Lucene - it is a text-based search library. The source information used to build an index can be anything: a database, a file system, or web sites. You can feed in information from any source you like to index.

Data is indexed by Lucene using the "inverted indexing" technique. That means it "retrieves the documents related to a keyword instead of scanning every document for a keyword".
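To make the inverted-index idea concrete, here is a minimal sketch in plain Java (no Lucene involved); the class and method names are mine, purely for illustration:

```java
import java.util.*;

// Toy inverted index: keyword -> set of IDs of documents containing it.
// Looking up a keyword is a single map access instead of a scan over
// every document.
public class InvertedIndexSketch {
    private final Map<String, Set<Integer>> index = new HashMap<String, Set<Integer>>();

    public void add(int docId, String text) {
        for (String token : text.toLowerCase().split("\\s+")) {
            Set<Integer> postings = index.get(token);
            if (postings == null) {
                postings = new TreeSet<Integer>();
                index.put(token, postings);
            }
            postings.add(docId);
        }
    }

    public Set<Integer> search(String keyword) {
        Set<Integer> postings = index.get(keyword.toLowerCase());
        return postings == null ? Collections.<Integer>emptySet() : postings;
    }
}
```

Lucene's real postings lists also store term frequencies and positions (that is how scoring and phrase queries work), but the lookup principle is the same.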

On Apache's website, the below two things are available for download:
  • Lucene - an engine/library that programmers can use to build customized search.
  • Solr - a war file that non-programmers can use. It can be deployed directly on Tomcat, Jetty, or any other web server.
Lucene Concepts
  • Document - The unit of searching/indexing, just like a row when we fire SQL queries.
  • Fields - A document consists of one or more fields - like columns in a row.
  • Searching - Done using the "QueryParser" class.
  • Queries - Lucene has its own mini query language. It also has the ability to add weight to fields, known as boosting.
  • Building Indexes - To build an index, Lucene needs a directory on the file system where the information can be stored. While indexing records, we need to specify
    • which fields need to be stored, and
    • which fields need to be indexed.
  • Directory - FSDirectory is the abstract class which points to the index directory. Its direct subclasses, plus the in-memory RAMDirectory, are listed below.
    • It is recommended to let Lucene pick the implementation class based on the environment (FSDirectory.open does exactly that).
    • SimpleFSDirectory - poor concurrent performance.
    • NIOFSDirectory - a poor choice on Windows due to a JRE bug.
    • MMapDirectory - also affected by a JRE bug (closed index files may not be unmapped immediately).
    • RAMDirectory - an in-memory index; it cannot handle huge indexes.
    • There are a few more implementation classes provided by Lucene, which can easily be located in the Javadocs shipped with it.
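To make the query mini language and boosting concrete, here are a few illustrative query strings (the field names are just examples):

```
title:gravity                   term query against the title field
artist:"john mayer"             phrase query
title:gravity^2 OR album:live   boost: title matches score twice as high
```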
Coming to the coding part.
  • Index Directory
File idxFile = new File(INDEX_DIR); //e.g. /work/index/
Directory idxDir = FSDirectory.open(idxFile);
  • Prepare Analyzer
Map<String, Analyzer> map = new HashMap<String, Analyzer>(); // field name to Analyzer pairs
map.put(FILE_NAME, new StandardAnalyzer(Version.LUCENE_36));
Analyzer analyzer = new PerFieldAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_36), map);
  • Index Writer Configuration
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36, analyzer);
  • Index Writer
IndexWriter indexWriter = new IndexWriter(idxDir, config);
Before we go further, let's take a quick peek at Analyzer.
  • It builds up token streams.
  • It represents a policy for extracting index terms from text.
  • Subclasses
    • PerFieldAnalyzerWrapper
    • ReusableAnalyzerBase
      • KeywordAnalyzer
      • PatternAnalyzer
      • WhitespaceAnalyzer
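To see what an Analyzer actually produces, here is a small sketch that prints the tokens StandardAnalyzer extracts from a piece of text. It assumes lucene-core 3.6.x on the classpath; the class name and sample text are mine:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class AnalyzerPeek {
    // Run the given text through StandardAnalyzer and collect the tokens.
    public static List<String> tokens(String text) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
        TokenStream ts = analyzer.tokenStream("field", new StringReader(text));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        List<String> out = new ArrayList<String>();
        ts.reset();
        while (ts.incrementToken()) {
            out.add(term.toString());
        }
        ts.end();
        ts.close();
        return out;
    }
}
```

StandardAnalyzer lowercases the input and drops English stop words such as "the", so "The Quick Brown Fox" should come back as quick, brown, fox.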
Moving forward with code snippets to
  • Create documents to be stored in the index.
Document document = new Document();
Field fPath = new Field(FULL_PATH, path, Field.Store.YES, Field.Index.toIndex(true, true, false)); // indexed, analyzed, norms kept
document.add(fPath); // Add a field to the document. Repeat for every field that needs to be stored.
  • Add document to index
indexWriter.addDocument(document);
Fine-tune the above process to build your index. Once the index is ready, let's gear up for searching...
  • Initialize index directory
IndexReader ir = IndexReader.open(idxDir);
IndexSearcher searcher = new IndexSearcher(ir);
  • Prepare fields analyzers
Map<String, Analyzer> map = new HashMap<String, Analyzer>();
map.put(FILE_NAME, new StandardAnalyzer(Version.LUCENE_36));
Analyzer analyzer = new PerFieldAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_36), map);
  • Query Parser
QueryParser qp = new MultiFieldQueryParser(Version.LUCENE_36, new String[]{FILE_NAME, TITLE, ALBUM, ARTIST}, analyzer);// FILE_NAME, TITLE, ALBUM, ARTIST are the fields of Document used in my example
Query query = qp.parse(srchText); // pass in the search keyword(s)
TopDocs res = searcher.search(query, 10);
  • Iterate through results
for (ScoreDoc sc : res.scoreDocs) {
    Document doc = searcher.doc(sc.doc); // sc.doc gives the document ID
    String fileName = doc.getFieldable(FILE_NAME).stringValue(); // read a field value
}
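Putting the pieces together, here is a compact end-to-end sketch: index one document in memory, then search it back. It assumes lucene-core 3.6.x on the classpath; the class name, field name, and sample values are mine, not from the post:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class LuceneRoundTrip {
    // Index a single document with one field, run a query against it,
    // and return the number of hits.
    public static int indexAndSearch(String fieldValue, String queryText) throws Exception {
        RAMDirectory dir = new RAMDirectory(); // in-memory index, fine for a demo
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);

        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(Version.LUCENE_36, analyzer));
        Document doc = new Document();
        doc.add(new Field("title", fieldValue, Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.close(); // commits the index and releases the write lock

        IndexReader reader = IndexReader.open(dir);
        IndexSearcher searcher = new IndexSearcher(reader);
        Query query = new QueryParser(Version.LUCENE_36, "title", analyzer).parse(queryText);
        TopDocs hits = searcher.search(query, 10);
        searcher.close();
        reader.close();
        return hits.totalHits;
    }
}
```

For real data you would swap RAMDirectory for FSDirectory.open(...) and keep the IndexWriter around instead of reopening it per document.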
Hope this helps you get an implementation rolling quickly.
Thanks !!!