January 1, 2021

Lucene Index in AEM - Part 1

A Lucene index supports both property constraints and full-text constraints. Depending on the index definition, it can be used to evaluate property constraints, full-text constraints, and path restrictions, and to support sorting.

Lucene Index Definition/Structure - High level:
Mandatory Properties
Name | Type | Value
type | String | lucene
async | String[] | Possible values - async, nrt, fulltext-async
Optional/Supporting Properties
compatVersion | Long | 2
By default, Oak uses a Lucene index implementation that supports neither property constraints nor index-time aggregation. To use these features, set this property to 2.
blobSize | Long | 32768 (32 KB, default value)
Size in bytes of the chunks into which index files are split when they are stored in the NodeStore.
maxFieldLength | Long | 10000 (default value)
Number of terms indexed per field
name | String | Name of the index
This name is used in log messages.
indexPath | String | Path of the index definition
If the index definition named customluceneIndex is defined under /oak:index in the repo, then /oak:index/customluceneIndex is the value for this property.
includedPaths | String[] | List of paths to be included in the indexing
Only nodes under these paths will be indexed.
excludedPaths | String[] | List of paths to be excluded from indexing
Nodes under these paths will not be indexed.
queryPaths | String[] | List of paths for which this index is to be used
The index is picked only for queries whose path predicate falls under one of the paths provided here.
evaluatePathRestrictions | Boolean | false (default)
If set to true, the index itself evaluates path restrictions, so a path predicate in the query is applied while results are fetched from the index.
Example:
Suppose we search for the text "we-retail" under the DAM path /content/dam/we-retail.
Without this property, the index returns every result that contains the text "we-retail", and the query engine then filters out the results that are not under /content/dam/we-retail.
With this property set to true, the index returns only the results under that path.
codec | String | Name of the Lucene codec to use
Full-text Lucene indexing uses OakCodec by default, which disables compression, so the index size grows.
To enable compression, set this property to Lucene46.
Example: the OOB full-text Lucene index available at /oak:index/lucene
refresh | Boolean | true
Refreshes the stored index definition.
On the next async job execution cycle, the index definition is refreshed, and this property is removed once the refresh is done.
functionName | String | Name used to enable index usage with native query support
For native queries (rep:native), we can specify the index type. (Supported values are lucene and solr.)
With Lucene, if multiple Lucene indexes are available and we want a specific one to serve our query, we can set this functionName property to a meaningful name that acts as an identifier for the index.
This name is then used in native queries.
Example:
//*[rep:native('functionNameValue', 'native search query expression')]
The index definition with this functionName will be picked for query execution.
useIfExists | String | Useful in blue-green deployments when using the Composite Node Store
(Since Oak version 1.10.0)
In AEM, version 6.5 ships with Oak 1.10.2.
Properties/Nodes that get created automatically
reindex | Boolean | false
reindexCount | Long | 1 the very first time, when the Lucene index is created and the first async job runs
The number is incremented by 1 every time a reindex is triggered.
(+) indexRules (Node) | nt:unstructured | This node, with its properties and a few child nodes, is created automatically when we create a Lucene index with the mandatory properties.
Significance:
Used to define the node types, and the properties of those node types, to be indexed as part of this index definition.
It can have any number of nodes defining node types, and each of those can in turn have any number of nodes defining the respective node type's properties.
Example: the OOB cqPageLucene index has indexRules defined for the node type "cq:Page" and for properties of cq:Page such as jcr:title and cq:lastModified (each of these properties is a child node under the node cq:Page).
/oak:index/cqPageLucene/indexRules/cq:Page
Other additional child nodes as part of a Lucene index
(+) aggregates (Node) | nt:unstructured | Defined based on the primary node type and relative path patterns
It can have any number of node types, and each in turn can have include(n) rules that define relative paths.
Significance:
Includes the content of descendant nodes in a single node, which makes it easier to search content that is scattered across multiple nodes.
If we would like to index the jcr:content (cq:PageContent) of a cq:Page up to a certain depth, we can make use of the aggregates node.
Example: cqPageLucene has aggregates defined for the node type "cq:PageContent", with include0 to include3 (4 nodes) defining paths up to the desired depth. Each include rule represents one more level of hierarchy below cq:PageContent.
/oak:index/cqPageLucene/aggregates/cq:PageContent
(+) analyzers (Node) | nt:unstructured | Option to specify an Analyzer class directly or via composition (defining Tokenizers and Filters)
Significance:
Analyzers are used to analyze text both while indexing and while searching via query execution.
An analyzer converts the given text into smaller units called tokens (with the help of Tokenizers and Filters) to make searching easier.
There are many built-in analyzers that extract keywords from text, convert it to lower case, remove stop words/common words, etc.
The most commonly used OOB analyzer is StandardAnalyzer (org.apache.lucene.analysis.standard.StandardAnalyzer), which filters out stop words and punctuation and converts text to lower case. It can also recognize URLs.
Usage: full-text search scenarios that need features like synonyms and stemming support.
Will try to create a custom use case illustrating this in upcoming posts.
(+) tika (Node) | nt:unstructured | Oak uses Apache Tika to extract text from binary content.
Usage: again in the full-text scenario, for displaying related binary results as part of the search.
Example: a search for the text "we-retail" that displays related images, PDFs, or any other related binary content.
Will try to create a custom use case illustrating this in upcoming posts.
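
To picture how the rows above fit together, here is a rough sketch that models an index definition as a plain Python dict and prints it as a node tree. It is only an illustration, not how Oak stores or reads the definition. The index name customluceneIndex is the one used in the indexPath row; the includedPaths value is an assumed example, and jcr:primaryType = oak:QueryIndexDefinition is not covered in the table (it is the node type Oak uses for index definitions).

```python
# Illustrative model only: dict values that are dicts stand for child nodes,
# everything else stands for a property on the index definition node.
index_definition = {
    "jcr:primaryType": "oak:QueryIndexDefinition",  # not listed in the table above
    "type": "lucene",                               # mandatory
    "async": ["async"],                             # mandatory
    "compatVersion": 2,                             # property constraints + aggregation
    "evaluatePathRestrictions": True,               # optional, false by default
    "includedPaths": ["/content/dam/we-retail"],    # assumed example path
    "indexRules": {"jcr:primaryType": "nt:unstructured"},
    "aggregates": {"jcr:primaryType": "nt:unstructured"},
    "analyzers": {"jcr:primaryType": "nt:unstructured"},
    "tika": {"jcr:primaryType": "nt:unstructured"},
}


def render(name, node, depth=0):
    """Return the tree as indented lines: '+' marks a node, '-' marks a property."""
    pad = "  " * depth
    lines = [f"{pad}+ {name}"]
    for key, value in node.items():
        if isinstance(value, dict):
            lines.extend(render(key, value, depth + 1))
        else:
            lines.append(f"{pad}  - {key} = {value!r}")
    return lines


tree = render("/oak:index/customluceneIndex", index_definition)
print("\n".join(tree))
```

The "+" and "-" markers follow the same convention as the (+) prefix used for nodes in the table.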

The table above gives a high-level view of the Lucene index definition and of the purpose of index rules, aggregates, analyzers, and tika.
Each of these in turn has further configuration (child nodes and their respective properties); upcoming posts will cover those details for better clarity.
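
As a small taste of that child-node configuration, the sketch below generates the include0 to include3 rules described for aggregates/cq:PageContent, one rule per level of depth. The helper aggregate_includes is hypothetical, and the use of a path property with wildcard patterns such as "*" and "*/*" is an assumption about how the relative paths are written, so please compare it against the actual cqPageLucene definition on your instance.

```python
def aggregate_includes(depth):
    """Build include0..include(depth-1), each one level deeper than the last.

    Assumption: every include node carries a 'path' property holding a
    relative wildcard pattern ('*', '*/*', ...).
    """
    return {
        f"include{i}": {"path": "/".join(["*"] * (i + 1))}
        for i in range(depth)
    }


# Four levels below cq:PageContent, as in the cqPageLucene example.
aggregates = {"cq:PageContent": aggregate_includes(4)}

for name, props in aggregates["cq:PageContent"].items():
    print(name, "->", props["path"])
```

Each extra include rule widens the aggregate by one more level of descendants, which is why the depth is controlled simply by how many rules are defined.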

As the next step, we will create a custom Lucene property index with the mandatory properties.

Use case: Get all assets that have the "cq:parentPath" property.
path=/content/dam/we-retail
type=dam:AssetContent
1_property=cq:parentPath
1_property.operation=exists
p.limit=-1
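
For trying these predicates out, the sketch below turns them into a request URL for the QueryBuilder JSON servlet (/bin/querybuilder.json) and also writes down an XPath query with the same intent. The host localhost:4502 is an assumed local author instance, and the XPath string is written by hand for illustration; it is not necessarily character for character what QueryBuilder generates.

```python
from urllib.parse import urlencode

# The predicates from the use case above, unchanged.
predicates = {
    "path": "/content/dam/we-retail",
    "type": "dam:AssetContent",
    "1_property": "cq:parentPath",
    "1_property.operation": "exists",
    "p.limit": "-1",
}

# Assumption: a local author instance on the default port.
base_url = "http://localhost:4502/bin/querybuilder.json"
url = base_url + "?" + urlencode(predicates)

# Hand-written XPath with the same intent: nodes of the given type under
# the path that have the property set.
xpath = "/jcr:root{path}//element(*, {type})[@{prop}]".format(
    path=predicates["path"],
    type=predicates["type"],
    prop=predicates["1_property"],
)

print(url)
print(xpath)
```

The XPath form is handy for checking in the query explain tooling which index, if any, gets picked for this query.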


By aem4beginner
