Indexing Bogging AEM Down? Disable Apache Tika!

Recently, we were investigating a CPU spike on an Adobe Experience Manager (AEM) publish server. After some research, we came across log entries indicating that indexing had caused the spike.

Adobe Experience Manager is more than a content management system that serves content in response to user requests. It also includes more powerful functionality, such as Apache Lucene indexing, which enables full-text search across content in the repository.

Behind the scenes, Apache Lucene indexing fetches the documents in the repository and indexes them based on their metadata and text content. The index update thread wakes up every five seconds to look for content updates. To create the indexes, it uses Apache Tika, a content analysis tool, to extract internal details of each document, such as its metadata and text.

In the real world, many companies do not rely on AEM's search functionality. They opt instead for enterprise-wide search implementations such as Adobe Search and Promote or Apache Solr, and in these scenarios all text parsing is handled by the third-party engine. So do we still need Apache Tika to parse documents in AEM? The answer is no. It is not required, and disabling Apache Tika parsing inside AEM can reduce the CPU spike.

So, how do you disable document parsing by Apache Tika inside AEM? You don't even need to disable the Apache Tika bundles. Tika parsers are configured in XML, so in AEM all we need is a simple configuration file under the Oak Lucene index node.

To disable Apache Tika document indexing in AEM, follow these steps:
1. Open CRXDE Lite.
2. Navigate to /oak:index/lucene.
3. Under the lucene node, create an nt:unstructured node named tika.
4. Under the tika node, create a file node named config.xml.
5. Open config.xml and add the entry below:

<properties>
  <parsers>
    <parser class="org.apache.tika.parser.EmptyParser">
      <mime>application/zip</mime>
      <mime>application/msword</mime>
      <mime>application/vnd.ms-excel</mime>
      <mime>application/pdf</mime>
    </parser>
  </parsers>
</properties>

Repeat steps 3 through 5 for /oak:index/damAssetLucene, then save all changes.

In the above example, we are disabling text extraction from ZIP, Microsoft Word, Microsoft Excel, and PDF files. During indexing, no text will be extracted from files of these types.
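Because a malformed config.xml is easy to miss when pasting into CRXDE Lite, it can help to check the entry locally first. The following is a minimal sketch using only Python's standard library, not AEM or Tika itself; the function name `disabled_mime_types` is hypothetical, and the embedded XML is a copy of the entry from step five.

```python
import xml.etree.ElementTree as ET
from typing import List

# Parser class used in the article to switch off text extraction.
EMPTY_PARSER = "org.apache.tika.parser.EmptyParser"

# Copy of the config.xml entry from step five.
CONFIG_XML = """<properties>
  <parsers>
    <parser class="org.apache.tika.parser.EmptyParser">
      <mime>application/zip</mime>
      <mime>application/msword</mime>
      <mime>application/vnd.ms-excel</mime>
      <mime>application/pdf</mime>
    </parser>
  </parsers>
</properties>"""


def disabled_mime_types(config_xml: str) -> List[str]:
    """Return the MIME types mapped to EmptyParser (hypothetical helper).

    Raises xml.etree.ElementTree.ParseError if the XML is not well-formed.
    """
    root = ET.fromstring(config_xml)
    types = []
    for parser in root.findall("./parsers/parser"):
        if parser.get("class") == EMPTY_PARSER:
            for mime in parser.findall("mime"):
                if mime.text and mime.text.strip():
                    types.append(mime.text.strip())
    return types


if __name__ == "__main__":
    for mime in disabled_mime_types(CONFIG_XML):
        print(mime)
```

This only confirms that the XML is well-formed and shows which types it targets. Whether AEM picks up the configuration still has to be verified on the server.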

Below is the image showing the configuration:

You can find a complete list of content types on the IANA website. Following the example above, add any MIME types that you want to exclude from text extraction to the config.xml entry in step five.
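If the list of excluded types grows, the entry can be generated rather than edited by hand. Here is a rough sketch, again using only Python's standard library; `build_tika_config` is a hypothetical helper, and the two extra MIME types in the usage example are illustrative additions that do not appear in the configuration above.

```python
import xml.etree.ElementTree as ET


def build_tika_config(mime_types):
    """Build a config.xml string that maps each MIME type to EmptyParser.

    Hypothetical helper: trims and lower-cases each type, skips blanks,
    and drops duplicates while preserving the input order.
    """
    properties = ET.Element("properties")
    parsers = ET.SubElement(properties, "parsers")
    parser = ET.SubElement(
        parsers, "parser", {"class": "org.apache.tika.parser.EmptyParser"}
    )
    seen = set()
    for mime in mime_types:
        mime = mime.strip().lower()
        if mime and mime not in seen:
            seen.add(mime)
            ET.SubElement(parser, "mime").text = mime
    # ET.indent is only available on Python 3.9+; output is valid either way.
    if hasattr(ET, "indent"):
        ET.indent(properties)
    return ET.tostring(properties, encoding="unicode")


if __name__ == "__main__":
    print(build_tika_config([
        "application/zip",
        "application/msword",
        "application/vnd.ms-excel",
        "application/pdf",
        # Illustrative additions, not part of the original example:
        "application/vnd.ms-powerpoint",
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ]))
```

The printed XML can then be pasted into config.xml under both index nodes.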
Please leave a comment below if you have any questions about indexing or performance-related issues.

Reference: https://blogs.perficient.com/2017/05/08/indexing-bogging-aem-down-disable-apache-tika/

AEM Tutorials for Beginners

April 1, 2020