Analysis of the Lucene 4 Source Code, Part 2: Introduction to Lucene


Lucene is a high-performance, scalable information retrieval library that makes it easy to add full-text search to your own applications. Simply put, Lucene provides the core functionality of a search engine.

A search engine takes a query entered by a user and finds the documents relevant to it. With millions or billions of documents, scanning each one sequentially with direct string matching is far too slow to be usable, which is why the index was invented. An index, put simply, is a mapping from words to the documents that contain them, so that the relevant documents can be found quickly from a word. Lucene therefore has two basic functions: (1) building an index over documents; (2) using that index to quickly find the documents relevant to a user's query.

The index is the core of a modern search engine; indexing is the process of converting source data into index files that are very convenient to query. Why is the index so important? Suppose you want to find the documents containing a keyword in a large collection. Without an index, you would have to read the documents into memory one after another and check whether each contains the keyword, which takes a long time, whereas a search engine returns results within milliseconds. The index makes this possible. You can think of it as a data structure that gives you fast random access to the keywords stored in it and, through them, to the documents associated with each keyword. Lucene uses a mechanism called an inverted index: it maintains a table of words/phrases, and for each word/phrase in the table there is a list of the documents that contain it. When a user enters a query, the search results can therefore be obtained very quickly.
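
The word-to-documents table described above can be sketched in a few lines of plain Java. This is only a toy model to make the idea concrete, not how Lucene stores its index; the class name TinyInvertedIndex and its methods are invented for this illustration, and splitting on whitespace stands in for real text analysis.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

/** Toy inverted index (hypothetical class): maps each word to the sorted IDs of the documents containing it. */
final class TinyInvertedIndex {
  private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

  /** Splits the text on whitespace, lowercases each word, and records docId under it. */
  public void add(int docId, String text) {
    for (String word : text.toLowerCase(Locale.ROOT).split("\\s+")) {
      if (!word.isEmpty()) {
        postings.computeIfAbsent(word, k -> new TreeSet<>()).add(docId);
      }
    }
  }

  /** Returns the IDs of the documents containing the word, or an empty set if the word is unknown. */
  public SortedSet<Integer> lookup(String word) {
    return postings.getOrDefault(word.toLowerCase(Locale.ROOT), Collections.emptySortedSet());
  }

  public static void main(String[] args) {
    TinyInvertedIndex index = new TinyInvertedIndex();
    index.add(1, "Lucene builds an inverted index");
    index.add(2, "An index maps words to documents");
    System.out.println(index.lookup("index"));   // [1, 2]
    System.out.println(index.lookup("lucene"));  // [1]
    System.out.println(index.lookup("solr"));    // []
  }
}
```

A lookup is a single hash-table access regardless of how many documents were added, which is the reason an index beats scanning every document in turn.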

To index documents, Lucene provides five basic classes: Document, Field, IndexWriter, Analyzer, and Directory. Each is introduced below:

Document

Document describes a document, which here can be an HTML page, an email, or a text file. A Document object consists of multiple Field objects. You can think of a Document object as a record in a database, with each Field object being one field of that record.

Field

A Field object describes one attribute of a document. For example, the header and the content of an email can be described by two separate Field objects.

Analyzer

Before a document is indexed, its content must be split into tokens; this work is done by an Analyzer. Analyzer is an abstract class with multiple implementations, and you choose the one appropriate to the language and the application. The Analyzer passes the tokenized content to IndexWriter, which builds the index.
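
To give a feel for what tokenization involves, here is a minimal sketch in plain Java. It is not Lucene's StandardAnalyzer and does not use the Lucene API; the class TinyAnalyzer, its stop-word list, and its splitting rule are all assumptions made for the illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy analyzer (hypothetical class): splits text into lowercase tokens and drops a few stop words. */
final class TinyAnalyzer {
  // A tiny, made-up stop-word list; real analyzers ship language-specific lists.
  private static final Set<String> STOP_WORDS =
      new HashSet<>(Arrays.asList("a", "an", "the", "is", "of"));

  /** Splits on runs of characters that are neither letters nor digits, then normalizes each piece. */
  public static List<String> analyze(String text) {
    List<String> tokens = new ArrayList<>();
    for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
      String token = raw.toLowerCase(Locale.ROOT);
      if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
        tokens.add(token);
      }
    }
    return tokens;
  }

  public static void main(String[] args) {
    System.out.println(analyze("The Analyzer is part of Lucene."));  // [analyzer, part, lucene]
  }
}
```

The same normalization has to be applied to the query text at search time, otherwise the query tokens will not match the indexed tokens.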

IndexWriter

IndexWriter is the core class Lucene uses to create an index. Its role is to add Document objects to the index one by one.

Directory

This class represents the storage location of a Lucene index. It is an abstract class with two commonly used implementations: FSDirectory, which stores the index in the file system, and RAMDirectory, which stores the index in memory.

The following simple code builds an index:


import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Date;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class IndexFiles {

  private IndexFiles() {}

  /** Index all text files under a directory. */
  public static void main(String[] args) {
    String usage = "java org.apache.lucene.demo.IndexFiles"
                 + " [-index INDEX_PATH] [-docs DOCS_PATH] [-update]\n\n"
                 + "This indexes the documents in DOCS_PATH, creating a Lucene index "
                 + "in INDEX_PATH that can be searched with SearchFiles";
    String indexPath = "index";
    String docsPath = null;
    boolean create = true;
    for (int i = 0; i < args.length; i++) {
      if ("-index".equals(args[i])) {
        // Directory the index is written to
        indexPath = args[i + 1];
        i++;
      } else if ("-docs".equals(args[i])) {
        // Directory containing the documents to index
        docsPath = args[i + 1];
        i++;
      } else if ("-update".equals(args[i])) {
        // Update an existing index instead of creating a new one
        create = false;
      }
    }

    if (docsPath == null) {
      System.err.println("Usage: " + usage);
      System.exit(1);
    }

    final File docDir = new File(docsPath);
    if (!docDir.exists() || !docDir.canRead()) {
      System.out.println("Document directory '" + docDir.getAbsolutePath()
          + "' does not exist or is not readable, please check the path");
      System.exit(1);
    }

    Date start = new Date();
    try {
      System.out.println("Indexing to directory '" + indexPath + "'...");

      // Directory the index is written to
      Directory dir = FSDirectory.open(new File(indexPath));
      // Analyzer that tokenizes the document text
      Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
      // Configuration for the IndexWriter
      IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_40, analyzer);

      if (create) {
        // Create a new index in the directory, removing any
        // previously indexed documents:
        iwc.setOpenMode(OpenMode.CREATE);
      } else {
        // Add new documents to an existing index:
        iwc.setOpenMode(OpenMode.CREATE_OR_APPEND);
      }

      // Optional: for better indexing performance, if you
      // are indexing many documents, increase the RAM
      // buffer.  But if you do this, increase the max heap
      // size to the JVM (eg add -Xmx512m or -Xmx1g):
      //
      // iwc.setRAMBufferSizeMB(256.0);

      // The IndexWriter takes the index directory and the writer configuration
      IndexWriter writer = new IndexWriter(dir, iwc);
      // Build the index from the document directory using the writer
      indexDocs(writer, docDir);

      // NOTE: if you want to maximize search performance,
      // you can optionally call forceMerge here.  This can be
      // a terribly costly operation, so generally it's only
      // worth it when your index is relatively static (ie
      // you're done adding documents to it):
      //
      // writer.forceMerge(1);

      writer.close();

      Date end = new Date();
      System.out.println(end.getTime() - start.getTime() + " total milliseconds");

    } catch (IOException e) {
      System.out.println(" caught a " + e.getClass() +
          "\n with message: " + e.getMessage());
    }
  }

  /**
   * Indexes the given file using the given writer, or if a directory is given,
   * recurses over files and directories found under the given directory.
   *
   * NOTE: This method indexes one document per input file.  This is slow.  For good
   * throughput, put multiple documents into your input file(s).  An example of this is
   * in the benchmark module, which can create "line doc" files, one document per line,
   * using the
   * <a href="../../../../../contrib-benchmark/org/apache/lucene/benchmark/byTask/tasks/WriteLineDocTask.html"
   * >WriteLineDocTask</a>.
   *
   * @param writer Writer to the index where the given file/dir info will be stored
   * @param file The file to index, or the directory to recurse into to find files to index
   * @throws IOException If there is a low-level I/O error
   */
  static void indexDocs(IndexWriter writer, File file) throws IOException {
    // Do not try to index files that cannot be read
    if (file.canRead()) {
      if (file.isDirectory()) {
        // A directory: recurse into each entry
        String[] files = file.list();
        // list() returns null if an I/O error occurs
        if (files != null) {
          for (int i = 0; i < files.length; i++) {
            indexDocs(writer, new File(file, files[i]));
          }
        }
      } else {
        // A regular file: index it as one document
        FileInputStream fis;
        try {
          fis = new FileInputStream(file);
        } catch (FileNotFoundException fnfe) {
          // The file could not be opened; skip it
          return;
        }

        try {
          // Make a new, empty document
          Document doc = new Document();

          // Add the path of the file as a field named "path".  Use a
          // field that is indexed (i.e. searchable), but don't tokenize
          // the field into separate words and don't index term frequency
          // or positional information:
          Field pathField = new StringField("path", file.getPath(), Field.Store.YES);
          doc.add(pathField);

          // Add the last modified date of the file as a field named "modified".
          // Use a LongField that is indexed (i.e. efficiently filterable with
          // NumericRangeFilter).  This indexes to milli-second resolution, which
          // is often too fine.  You could instead create a number based on
          // year/month/day/hour/minutes/seconds, down to the resolution you require.
          // For example the long value 2011021714 would mean
          // February 17, 2011, 2-3 PM.
          doc.add(new LongField("modified", file.lastModified(), Field.Store.NO));

          // Add the contents of the file to a field named "contents".  Specify a Reader,
          // so that the text of the file is tokenized and indexed, but not stored.
          // Note that the reader below assumes the file is in UTF-8 encoding.
          // If that's not the case searching for special characters will fail.
          doc.add(new TextField("contents", new BufferedReader(new InputStreamReader(fis, "UTF-8"))));

          if (writer.getConfig().getOpenMode() == OpenMode.CREATE) {
            // New index, so we just add the document (no old document can be there):
            System.out.println("adding " + file);
            writer.addDocument(doc);
          } else {
            // Existing index (an old copy of this document may have been indexed) so
            // we use updateDocument instead to replace the old one matching the exact
            // path, if present:
            System.out.println("updating " + file);
            writer.updateDocument(new Term("path", file.getPath()), doc);
          }

        } finally {
          fis.close();
        }
      }
    }
  }
}

With Lucene, searching is as convenient as indexing. In the previous part, we built an index over a directory of text documents; now we search that index to find the documents containing a keyword or phrase. Lucene provides several basic classes for this: IndexSearcher, Term, Query, TermQuery, and TopDocs (which replaced the Hits class of older releases). Their roles are introduced below.

Query

Query is an abstract class with multiple implementations, such as TermQuery, BooleanQuery, and PrefixQuery. A Query represents the user's query string in a form that Lucene can execute; QueryParser, used in the code below, performs the conversion from string to Query.
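
As a rough intuition for how a compound query such as BooleanQuery relates to simpler ones, the sketch below combines two sets of matching document IDs. It is a conceptual model in plain Java only; Lucene's real implementation also scores documents and iterates over the index differently. The class TinyBooleanSearch and the document IDs are made up for the example.

```java
import java.util.Arrays;
import java.util.SortedSet;
import java.util.TreeSet;

/** Toy model (hypothetical class) of combining the results of two sub-queries. */
final class TinyBooleanSearch {
  /** Documents matched by both sub-queries (both clauses required). */
  public static SortedSet<Integer> and(SortedSet<Integer> left, SortedSet<Integer> right) {
    SortedSet<Integer> result = new TreeSet<>(left);
    result.retainAll(right);  // set intersection
    return result;
  }

  /** Documents matched by at least one sub-query (clauses optional). */
  public static SortedSet<Integer> or(SortedSet<Integer> left, SortedSet<Integer> right) {
    SortedSet<Integer> result = new TreeSet<>(left);
    result.addAll(right);  // set union
    return result;
  }

  public static void main(String[] args) {
    // Made-up document IDs matching each of two words
    SortedSet<Integer> lucene = new TreeSet<>(Arrays.asList(1, 2, 5));
    SortedSet<Integer> index = new TreeSet<>(Arrays.asList(2, 3, 5));
    System.out.println(and(lucene, index));  // [2, 5]
    System.out.println(or(lucene, index));   // [1, 2, 3, 5]
  }
}
```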

Term

Term is the basic unit of search. A Term object holds two String values: the name of a field and the text to look for in it. A Term is created with a statement such as: Term term = new Term("fieldName", "queryWord"); the first parameter names the Field of the document to search in, and the second is the keyword to search for.

TermQuery

TermQuery is a subclass of the abstract class Query and is the most basic query type Lucene supports. A TermQuery is created with a statement such as: TermQuery termQuery = new TermQuery(new Term("fieldName", "queryWord")); its constructor accepts one parameter, a Term object.
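
What a term lookup amounts to can be sketched without Lucene: the field name selects one table of words, and the word selects the documents. The class TinyTermSearch and its methods are hypothetical names for this illustration and are not part of the Lucene API. Note that TermQuery does not run the word through an Analyzer, so the sketch also matches the word exactly as given.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

/** Toy per-field term lookup (hypothetical class), keyed first by field name and then by word. */
final class TinyTermSearch {
  private final Map<String, Map<String, SortedSet<Integer>>> postingsByField = new HashMap<>();

  /** Indexes the whitespace-separated, lowercased words of one field of one document. */
  public void index(int docId, String field, String text) {
    Map<String, SortedSet<Integer>> postings =
        postingsByField.computeIfAbsent(field, k -> new HashMap<>());
    for (String word : text.toLowerCase(Locale.ROOT).split("\\s+")) {
      if (!word.isEmpty()) {
        postings.computeIfAbsent(word, k -> new TreeSet<>()).add(docId);
      }
    }
  }

  /** Looks the word up exactly as given (no lowercasing), restricted to one field. */
  public SortedSet<Integer> termQuery(String field, String word) {
    return postingsByField
        .getOrDefault(field, Collections.emptyMap())
        .getOrDefault(word, Collections.emptySortedSet());
  }

  public static void main(String[] args) {
    TinyTermSearch search = new TinyTermSearch();
    search.index(1, "contents", "Lucene in action");
    search.index(2, "contents", "search engines");
    search.index(2, "title", "Lucene");
    System.out.println(search.termQuery("contents", "lucene"));  // [1]
    System.out.println(search.termQuery("title", "lucene"));     // [2]
    System.out.println(search.termQuery("contents", "Lucene"));  // [] because the index holds "lucene"
  }
}
```

The last lookup returns nothing because the indexed words were lowercased while the query word was not, which is the usual reason a hand-built TermQuery finds no documents in a field that was indexed through an Analyzer.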

IndexSearcher

IndexSearcher is used to search an index that has already been built. It opens the index in read-only mode, so multiple IndexSearcher instances can operate on the same index at once.

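TopDocs (formerly Hits)

In older Lucene releases, search results were held in a Hits object. Hits was removed in Lucene 3.0; in Lucene 4, IndexSearcher.search returns a TopDocs object, whose scoreDocs array holds the matching documents and whose totalHits field holds the total number of matches.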

Here is simple search code:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Date;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class SearchFiles {

  private SearchFiles() {}

  /** Simple command-line based search demo. */
  public static void main(String[] args) throws Exception {
    String usage =
        "Usage:\tjava org.apache.lucene.demo.SearchFiles [-index dir] [-field f] [-repeat n] [-queries file] [-query string] [-raw] [-paging hitsPerPage]\n\nSee http://lucene.apache.org/core/4_1_0/demo/ for details.";
    if (args.length > 0 && ("-h".equals(args[0]) || "-help".equals(args[0]))) {
      System.out.println(usage);
      System.exit(0);
    }

    String index = "index";
    String field = "contents";
    String queries = null;
    int repeat = 0;
    boolean raw = false;
    String queryString = null;
    int hitsPerPage = 10;

    for (int i = 0; i < args.length; i++) {
      if ("-index".equals(args[i])) {
        index = args[i + 1];
        i++;
      } else if ("-field".equals(args[i])) {
        field = args[i + 1];
        i++;
      } else if ("-queries".equals(args[i])) {
        queries = args[i + 1];
        i++;
      } else if ("-query".equals(args[i])) {
        queryString = args[i + 1];
        i++;
      } else if ("-repeat".equals(args[i])) {
        repeat = Integer.parseInt(args[i + 1]);
        i++;
      } else if ("-raw".equals(args[i])) {
        raw = true;
      } else if ("-paging".equals(args[i])) {
        hitsPerPage = Integer.parseInt(args[i + 1]);
        if (hitsPerPage <= 0) {
          System.err.println("There must be at least 1 hit per page.");
          System.exit(1);
        }
        i++;
      }
    }

    // Open the index read-only and wrap it in a searcher
    IndexReader reader = DirectoryReader.open(FSDirectory.open(new File(index)));
    IndexSearcher searcher = new IndexSearcher(reader);
    // Use the same analyzer as at indexing time so query tokens match indexed tokens
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);

    // Read queries from a file if one was given, otherwise from standard input
    BufferedReader in = null;
    if (queries != null) {
      in = new BufferedReader(new InputStreamReader(new FileInputStream(queries), "UTF-8"));
    } else {
      in = new BufferedReader(new InputStreamReader(System.in, "UTF-8"));
    }
    QueryParser parser = new QueryParser(Version.LUCENE_40, field, analyzer);
    while (true) {
      if (queries == null && queryString == null) {
        // prompt the user
        System.out.println("Enter query: ");
      }

      String line = queryString != null ? queryString : in.readLine();

      // End of input
      if (line == null) {
        break;
      }

      line = line.trim();
      if (line.length() == 0) {
        break;
      }

      Query query = parser.parse(line);
      System.out.println("Searching for: " + query.toString(field));

      if (repeat > 0) {
        // repeat & time as benchmark
        Date start = new Date();
        for (int i = 0; i < repeat; i++) {
          searcher.search(query, null, 100);
        }
        Date end = new Date();
        System.out.println("Time: " + (end.getTime() - start.getTime()) + "ms");
      }

      doPagingSearch(in, searcher, query, hitsPerPage, raw, queries == null && queryString == null);

      if (queryString != null) {
        break;
      }
    }
    reader.close();
  }

  /**
   * This demonstrates a typical paging search scenario, where the search engine presents
   * pages of size n to the user. The user can then go to the next page if interested in
   * the next hits.
   *
   * When the query is executed for the first time, then only enough results are collected
   * to fill 5 result pages. If the user wants to page beyond this limit, then the query
   * is executed another time and all hits are collected.
   */
  public static void doPagingSearch(BufferedReader in, IndexSearcher searcher, Query query,
                                    int hitsPerPage, boolean raw, boolean interactive) throws IOException {

    // Collect enough docs to show 5 pages
    TopDocs results = searcher.search(query, 5 * hitsPerPage);
    ScoreDoc[] hits = results.scoreDocs;

    int numTotalHits = results.totalHits;
    System.out.println(numTotalHits + " total matching documents");

    int start = 0;
    int end = Math.min(numTotalHits, hitsPerPage);

    while (true) {
      if (end > hits.length) {
        System.out.println("Only results 1 - " + hits.length + " of " + numTotalHits + " total matching documents collected.");
        System.out.println("Collect more (y/n) ?");
        String line = in.readLine();
        // Treat end of input like an empty answer
        if (line == null || line.length() == 0 || line.charAt(0) == 'n') {
          break;
        }

        // Run the query again, this time collecting all hits
        hits = searcher.search(query, numTotalHits).scoreDocs;
      }

      end = Math.min(hits.length, start + hitsPerPage);

      for (int i = start; i < end; i++) {
        if (raw) {
          // output raw format
          System.out.println("doc=" + hits[i].doc + " score=" + hits[i].score);
          continue;
        }

        // Load the stored fields of the matching document
        Document doc = searcher.doc(hits[i].doc);
        String path = doc.get("path");
        if (path != null) {
          System.out.println((i + 1) + ". " + path);
          String title = doc.get("title");
          if (title != null) {
            System.out.println("   Title: " + title);
          }
        } else {
          System.out.println((i + 1) + ". " + "No path for this document");
        }
      }

      if (!interactive || end == 0) {
        break;
      }

      if (numTotalHits >= end) {
        boolean quit = false;
        while (true) {
          System.out.print("Press ");
          if (start - hitsPerPage >= 0) {
            System.out.print("(p)revious page, ");
          }
          if (start + hitsPerPage < numTotalHits) {
            System.out.print("(n)ext page, ");
          }
          System.out.println("(q)uit or enter number to jump to a page.");

          String line = in.readLine();
          // Treat end of input like "quit"
          if (line == null || line.length() == 0 || line.charAt(0) == 'q') {
            quit = true;
            break;
          }
          if (line.charAt(0) == 'p') {
            start = Math.max(0, start - hitsPerPage);
            break;
          } else if (line.charAt(0) == 'n') {
            if (start + hitsPerPage < numTotalHits) {
              start += hitsPerPage;
            }
            break;
          } else {
            // Page numbers entered by the user are 1-based
            int page = Integer.parseInt(line);
            if ((page - 1) * hitsPerPage < numTotalHits) {
              start = (page - 1) * hitsPerPage;
              break;
            } else {
              System.out.println("No such page");
            }
          }
        }
        if (quit) break;
        end = Math.min(numTotalHits, start + hitsPerPage);
      }
    }
  }
}
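
The paging arithmetic in doPagingSearch (a start offset of (page - 1) * hitsPerPage, with the end clamped to the number of hits) can be isolated into a small helper. The sketch below is plain Java and independent of Lucene; the class PageWindow and the method forPage are hypothetical names introduced only for this illustration.

```java
import java.util.Arrays;

/** Computes which hit positions belong to a given result page (hypothetical helper). */
final class PageWindow {
  /**
   * Returns {start, end}, a half-open range [start, end) of 0-based hit positions
   * for the given 1-based page. The range is empty if the page lies past the last hit.
   */
  public static int[] forPage(int page, int hitsPerPage, int totalHits) {
    if (page < 1 || hitsPerPage < 1 || totalHits < 0) {
      throw new IllegalArgumentException("page and hitsPerPage must be >= 1, totalHits >= 0");
    }
    // Use long arithmetic so that large page numbers cannot overflow int
    long offset = (long) (page - 1) * hitsPerPage;
    int start = (int) Math.min(totalHits, offset);
    int end = (int) Math.min(totalHits, (long) start + hitsPerPage);
    return new int[] {start, end};
  }

  public static void main(String[] args) {
    System.out.println(Arrays.toString(forPage(1, 10, 23)));  // [0, 10]
    System.out.println(Arrays.toString(forPage(3, 10, 23)));  // [20, 23]
    System.out.println(Arrays.toString(forPage(4, 10, 23)));  // [23, 23]
  }
}
```

With 23 hits and 10 hits per page, the third page holds only positions 20 to 22, and a fourth page is empty, which corresponds to the "No such page" case in the demo.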


Posted by Darnell at November 15, 2013 - 1:05 AM