January 1, 2021

Lucene Index in AEM - Part 3

This post illustrates the use of analyzers in full-text search with a sample use case.
Apache Lucene Analyzers:
As the name suggests, an analyzer analyzes text both at indexing time and at search time (during query execution).
  • An analyzer examines the text of fields and generates a token stream. It can be either
    • a single Java class, or
    • composed of a series of Tokenizer and Filter Java classes.
  • A Tokenizer breaks the data into lexical units, or tokens.
  • Filters then examine these tokens and, based on their configuration, amend them, discard them, or create new ones.
  • Series of Tokenizer + Filters => Analyzer
  • Ready-made Analyzer classes, Tokenizers and Filters are available out of the box (OOB). Depending on the requirement, we can use either a ready-made Analyzer class or a Tokenizer + Filter combination (Analyzer via composition).
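To make the "Tokenizer + Filters => Analyzer" idea concrete, here is a minimal sketch in plain Java. It does not use the Lucene API; the class name AnalyzerSketch, its methods and the stop-word list are all hypothetical and only mimic how a tokenizer feeds a chain of filters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Simplified stand-in for a Tokenizer + Filter chain; this is NOT the Lucene API.
final class AnalyzerSketch {

    // Hypothetical stop-word list, for illustration only.
    static final Set<String> STOP_WORDS = Set.of("a", "an", "the", "of", "and", "in");

    // "Tokenizer": split the text on anything that is not a letter or a digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // "Filter" that amends tokens: convert each one to lowercase.
    static List<String> lowerCase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(token.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // "Filter" that discards tokens: drop the stop words.
    static List<String> removeStopWords(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (!STOP_WORDS.contains(token)) {
                out.add(token);
            }
        }
        return out;
    }

    // "Analyzer": the tokenizer followed by the filters, applied in order.
    static List<String> analyze(String text) {
        return removeStopWords(lowerCase(tokenize(text)));
    }

    public static void main(String[] args) {
        // prints [biking, hiking, activities]
        System.out.println(analyze("The Biking and Hiking activities"));
    }
}
```

The order of the filters matters in this sketch: lowercasing runs before stop-word removal so that "The" is matched against the lowercase stop-word list.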
Examples:
Analyzer -

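  • StandardAnalyzer (org.apache.lucene.analysis.standard.StandardAnalyzer)
  • Tokenizes the text, converts tokens to lowercase and removes stop words; it is the most commonly used analyzer. In Lucene 3.1 and later, keeping URLs and email addresses as single tokens is handled by UAX29URLEmailTokenizer rather than by this analyzer.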
Tokenizer -
  • Standard (org.apache.lucene.analysis.standard.StandardTokenizerFactory)
  • Splits the text field into tokens, treating whitespace and punctuation as delimiters.
Filters -
  • Stopwords Filter (org.apache.lucene.analysis.core.StopFilterFactory) - Removes stop words
  • Lowercase Filter (org.apache.lucene.analysis.core.LowerCaseFilterFactory) - Converts tokens to lowercase
  • PorterStem Filter (org.apache.lucene.analysis.en.PorterStemFilterFactory) - Reduces tokens to their stems
In the Lucene full-text index definition:
  • If we opt for a ready-made Analyzer class, its fully qualified Java class name must be specified in a property called "class" (highlighted in the demo video).
  • If we use a Tokenizer and Filter combination, the name without the Factory suffix can be used: Standard for StandardTokenizerFactory and PorterStem for PorterStemFilterFactory (also highlighted in the demo video).
  • Note: If we use an Analyzer via composition, the class property needs to be removed.
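The naming convention above (factory class name minus its suffix) can be sketched as a small helper. This is only an illustration of the convention for tokenizer and filter factories as described in this post; FactoryNames and shortName are hypothetical names, not part of Lucene or Oak.

```java
// Hypothetical helper illustrating the "name without Factory suffix" convention.
final class FactoryNames {

    // Suffixes stripped in this sketch; only tokenizer and filter factories are covered.
    private static final String[] SUFFIXES = {"TokenizerFactory", "FilterFactory"};

    // Returns the short name for a simple or fully qualified factory class name.
    static String shortName(String className) {
        // Drop the package part, if any.
        String simple = className.substring(className.lastIndexOf('.') + 1);
        for (String suffix : SUFFIXES) {
            if (simple.endsWith(suffix)) {
                return simple.substring(0, simple.length() - suffix.length());
            }
        }
        // Not a recognised factory name: return it unchanged.
        return simple;
    }

    public static void main(String[] args) {
        // prints Standard
        System.out.println(shortName("org.apache.lucene.analysis.standard.StandardTokenizerFactory"));
        // prints PorterStem
        System.out.println(shortName("org.apache.lucene.analysis.en.PorterStemFilterFactory"));
    }
}
```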
Use case:
We will look at a common need in full-text search: synonym and stemming support.
The we-retail DAM assets include:
/content/dam/we-retail/en/activities (which in turn contains biking, climbing, hiking, etc.)
We will create a stemming filter as part of the analyzers in the Lucene full-text index, so that assets related to "activities" are fetched when we search using a related word form such as "activity".
/content/dam/we-retail/en/products/apparel (which in turn contains gloves, coats, pants and so on under the apparel category)
We will create a Synonym filter as part of the analyzers in the Lucene full-text index, so that assets related to "apparel" are fetched when we search using synonyms such as "clothing" or "garments".

Stemming:
  • The PorterStem Filter is used with the Standard Tokenizer (Analyzer via composition).
  • Alternatively, the EnglishAnalyzer class, which includes PorterStemFilter, is used (direct Analyzer class).
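Stemming works because the same analyzer runs on both the indexed text and the query, so different word forms end up as the same term. The sketch below shows that idea in plain Java with a deliberately naive plural-stripping rule. It is not the Porter algorithm and not the Lucene API; StemSketch and its methods are hypothetical.

```java
import java.util.Locale;

// Toy illustration of stemming at index time and query time; NOT the Porter algorithm.
final class StemSketch {

    // Naive rule set: "-ies" becomes "-y", and a trailing "s" (but not "ss") is dropped.
    static String stem(String token) {
        String t = token.toLowerCase(Locale.ROOT);
        if (t.length() > 4 && t.endsWith("ies")) {
            return t.substring(0, t.length() - 3) + "y";
        }
        if (t.length() > 3 && t.endsWith("s") && !t.endsWith("ss")) {
            return t.substring(0, t.length() - 1);
        }
        return t;
    }

    // An indexed term and a query term match when both reduce to the same stem.
    static boolean matches(String indexedTerm, String queryTerm) {
        return stem(indexedTerm).equals(stem(queryTerm));
    }

    public static void main(String[] args) {
        // prints true
        System.out.println(matches("activities", "activity"));
        // prints false
        System.out.println(matches("hiking", "biking"));
    }
}
```

The real PorterStem filter applies a much larger rule set and often produces stems that are not dictionary words, but the matching principle is the same.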
Synonym:
The Synonym Filter is used with the Standard Tokenizer (Analyzer via composition).
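Conceptually, a synonym filter adds extra tokens to the stream so that a search for "clothing" or "garments" can also match "apparel". The following plain-Java sketch illustrates that expansion. The synonym table, the class SynonymSketch and the expand method are hypothetical, and this is not how the Lucene Synonym filter is implemented or configured.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Toy illustration of synonym expansion on a token stream; NOT the Lucene API.
final class SynonymSketch {

    // Hypothetical synonym table for the use case: both words map to "apparel".
    static final Map<String, List<String>> SYNONYMS = Map.of(
            "clothing", List.of("apparel"),
            "garments", List.of("apparel"));

    // Keeps every original token and appends its synonyms right after it.
    static List<String> expand(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(token);
            out.addAll(SYNONYMS.getOrDefault(token, List.of()));
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [winter, clothing, apparel]
        System.out.println(expand(List.of("winter", "clothing")));
    }
}
```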


By aem4beginner
