May 2, 2020

Getting Started: SOLR Indexing In AEM 6.3

Introduction
At Code & Theory, we have experience with several indexing solutions, such as SOLR, Elasticsearch, and Amazon CloudSearch. We make a recommendation based on each client’s needs, expertise, and stack. On our latest AEM project, we decided to go with SOLR. The main reasons were:
  • It’s an Apache project, and AEM is built on Apache projects (namely Felix, Jackrabbit, and Sling).
  • The Java client SolrJ and its dependencies are distributed as OSGi bundles and can be easily deployed to the Felix container.
  • It exposes a REST API, giving us the option of querying directly from the front end or through SolrJ on the back end.
When it comes to indexing AEM content using SOLR, success rests on several factors: a good taxonomy, an extensible suite of OSGi service components, good UX to create components that leverage the indexed data, and a scalable SOLR deployment. At Code & Theory, we do all this for our clients. This how-to, however, is targeted at AEM developers and architects who want to start integrating with SOLR. We’ll use Docker to run SOLR, so within just a few minutes you’ll have a SOLR instance up and running, and shortly after that you’ll be indexing some content. Finally, we’ll point out one little trick we used to index the textual content of a WCM Page.

Prerequisites
  • AEM 6.3 + SP2
  • Docker for your particular platform
  • Maven 3
Create an AEM 6.3 Project
Create a new AEM project using the AEM Maven archetype. I am using version 13, as that is the version that creates an AEM 6.3 + SP2 project. Refer to the archetype’s README if you are on another version. Run the following command:


echo Y | \
mvn org.apache.maven.plugins:maven-archetype-plugin:2.4:generate \
 -DarchetypeGroupId=com.adobe.granite.archetypes \
 -DarchetypeArtifactId=aem-project-archetype \
 -Dversion=1.0-SNAPSHOT \
 -DarchetypeVersion=13 \
 -DarchetypeCatalog=https://repo.adobe.com/nexus/content/groups/public/ \
 -DgroupId=org.aem.demo \
 -DartifactId=aem-solr \
 -DappsFolderName=aem-solr \
 -DartifactName=aem-solr \
 -DcomponentGroupName=aem-solr \
 -DconfFolderName=aem-solr \
 -DcssId=aem-solr \
 -DpackageGroup=aem-solr \
 -DsiteName=aem-solr \
 -DcontentFolderName=aem-solr

Run SOLR In Docker
Create docker-compose.yml
Create a file called docker-compose.yml in the aem-solr folder and write the following contents into it. This creates a container from the SOLR Alpine image, pre-creates a core named aemsolr, and stores the data on your host drive so that nothing is lost if the container shuts down. The official SOLR image on Docker Hub is really flexible.

version: "3.3"
services:
    solr:
        image: solr:7.3.1-alpine
        ports:
            - "8983:8983"
        volumes:
            - ./solrdata:/opt/solr/server/solr/mycores
        entrypoint:
          - docker-entrypoint.sh
          - solr-precreate
          - aemsolr

Start SOLR
In the root of the aem-solr folder, where you created docker-compose.yml, run the command below. Then verify SOLR is up and running by opening the web console at http://localhost:8983.
$ docker-compose up -d

Create a Dependency Content Package
We need a few dependencies that do not ship with AEM. Luckily, they are already distributed as OSGi bundles, so all we need to do is deploy them into the Felix container. We will create a separate content package for this. We could instead embed them directly into our core bundle, but deploying them separately is the better practice because it allows for easier upgrades.

Parent pom.xml dependencyManagement updates
Locate the parent pom.xml under the aem-solr folder and add the following dependencies under the <dependencyManagement> node. Get in the habit of specifying your dependency versions in the <dependencyManagement> section of the parent POM. It makes for easier maintenance and upgrades.

<!-- SolrJ -->
<dependency>
    <groupId>org.apache.servicemix.bundles</groupId>
    <artifactId>org.apache.servicemix.bundles.solr-solrj</artifactId>
    <version>7.2.1_1</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.servicemix.bundles</groupId>
    <artifactId>org.apache.servicemix.bundles.noggit</artifactId>
    <version>0.8_1</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.zookeeper</groupId>
    <artifactId>zookeeper</artifactId>
    <version>3.4.10</version>
    <scope>provided</scope>
</dependency>

Create dependencies content package Maven project
Create a folder called dependencies under the aem-solr folder. In the dependencies folder, create this pom.xml:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <parent>
        <groupId>org.aem.demo</groupId>
        <artifactId>aem-solr</artifactId>
        <version>1.0-SNAPSHOT</version>
        <relativePath>../pom.xml</relativePath>
    </parent>

    <artifactId>aem-solr.dependencies</artifactId>
    <packaging>content-package</packaging>
    <name>aem-solr - Dependencies</name>
    <description>Dependency bundles package for aem-solr</description>

    <build>
        <plugins>
            <plugin>
                <groupId>com.day.jcr.vault</groupId>
                <artifactId>content-package-maven-plugin</artifactId>
                <extensions>true</extensions>
                <configuration>
                    <verbose>true</verbose>
                    <failOnError>true</failOnError>
                    <group>aem-solr</group>
                    <!-- embed everything which has the same group id -->
                    <!-- nevertheless it only filters from the list of given dependencies. -->
                    <embeddeds>
                        <embedded>
                            <groupId>org.apache.servicemix.bundles</groupId>
                            <target>/apps/system/install</target>
                            <filter>true</filter>
                        </embedded>
                        <embedded>
                            <groupId>org.apache.zookeeper</groupId>
                            <target>/apps/system/install</target>
                            <filter>true</filter>
                        </embedded>
                    </embeddeds>
                </configuration>
            </plugin>
        </plugins>
    </build>

    <dependencies>
        <!-- https://mvnrepository.com/artifact/org.apache.servicemix.bundles/org.apache.servicemix.bundles.solr-solrj -->
        <dependency>
            <groupId>org.apache.servicemix.bundles</groupId>
            <artifactId>org.apache.servicemix.bundles.solr-solrj</artifactId>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.servicemix.bundles/org.apache.servicemix.bundles.noggit -->
        <dependency>
            <groupId>org.apache.servicemix.bundles</groupId>
            <artifactId>org.apache.servicemix.bundles.noggit</artifactId>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.zookeeper/zookeeper -->
        <dependency>
            <groupId>org.apache.zookeeper</groupId>
            <artifactId>zookeeper</artifactId>
        </dependency>
    </dependencies>
</project>

Update Parent pom.xml module list
Add the new dependencies project to the list of modules in the parent pom.xml:
<modules>
    <module>dependencies</module>
    <module>core</module>
    <module>ui.apps</module>
    <module>ui.content</module>
    <module>it.tests</module>
    <module>it.launcher</module>
</modules>

Deploy The AEM Project
Now run mvn clean install -PautoInstallPackage -Padobe-public. Open the Felix console at http://localhost:4502/system/console/bundles and you will see that the 3 bundles have been deployed and started. You can view the sample content at http://localhost:4502/content/aem-solr/en.html.



Index Your First Resource
Update the core bundle’s pom.xml
Now that you are about to use SolrJ in Java code, you’ll need to update the dependencies of the core bundle. Locate the core project’s pom.xml and add the following dependency:

<!-- SolrJ -->
<dependency>
    <groupId>org.apache.servicemix.bundles</groupId>
    <artifactId>org.apache.servicemix.bundles.solr-solrj</artifactId>
    <scope>provided</scope>
</dependency>

Create a Sling Servlet
Create a new Sling Servlet in the core bundle. All it does is index the resource’s path, title, and description.

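import java.io.IOException;

import javax.servlet.Servlet;
import javax.servlet.ServletException;

import org.apache.sling.api.SlingHttpServletRequest;
import org.apache.sling.api.SlingHttpServletResponse;
import org.apache.sling.api.servlets.HttpConstants;
import org.apache.sling.api.servlets.SlingSafeMethodsServlet;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.common.SolrInputDocument;
import org.osgi.service.component.annotations.Component;

import com.day.cq.commons.LabeledResource;

@Component(service = Servlet.class,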
           name = "Property Index SOLR Servlet",
           property = { "sling.servlet.methods=" + HttpConstants.METHOD_GET,
                        "sling.servlet.resourceTypes=aem-solr/components/structure/page",
                        "sling.servlet.selectors=property",
                        "sling.servlet.extensions=index" })
public final class PropertyIndexSolrServlet extends SlingSafeMethodsServlet {

    @Override
    protected void doGet(final SlingHttpServletRequest request, final SlingHttpServletResponse response)
            throws
            ServletException,
            IOException {

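        final LabeledResource lblResource = request.getResource()
                                                   .adaptTo(LabeledResource.class);
        if (lblResource == null) {
            // adaptTo() returns null when the resource cannot be adapted, so there is nothing to index.
            response.sendError(SlingHttpServletResponse.SC_NOT_FOUND);
            return;
        }
        final SolrInputDocument document = new SolrInputDocument();
        // The resource path doubles as the document's unique id.
        document.setField("id", lblResource.getPath());
        document.setField("title_s", lblResource.getTitle());
        document.setField("description_s", lblResource.getDescription());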

        try (SolrClient client = getClient()) {

            new UpdateRequest().add(document)
                               .commit(client, "aemsolr");

        } catch (final SolrServerException e) {
            throw new ServletException(e);
        }
    }

    private static HttpSolrClient getClient() {

        return new HttpSolrClient.Builder().withBaseSolrUrl("http://localhost:8983/solr")
                                           .build();
    }
}

Execute the Servlet & Verify SOLR Index
The servlet will respond to the following url: http://localhost:4502/content/aem-solr/en/jcr:content.property.index.
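
Why does this URL reach the servlet? Sling splits the request path into a resource path, selectors, and an extension, then matches them against the servlet’s registration properties. The sketch below is a deliberately simplified model of that split, not Sling’s actual resolver (which also handles suffixes and checks which resource really exists). The class name SlingPathSketch is hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Simplified, illustrative model of Sling's URL decomposition (not the real resolver). */
final class SlingPathSketch {

    final String resourcePath;
    final List<String> selectors;
    final String extension;

    private SlingPathSketch(String resourcePath, List<String> selectors, String extension) {
        this.resourcePath = resourcePath;
        this.selectors = selectors;
        this.extension = extension;
    }

    static SlingPathSketch parse(String requestPath) {
        // Only dots in the last path segment separate selectors and extension.
        int lastSlash = requestPath.lastIndexOf('/');
        int firstDot = requestPath.indexOf('.', lastSlash + 1);
        if (firstDot < 0) {
            return new SlingPathSketch(requestPath, Collections.<String>emptyList(), "");
        }
        String resourcePath = requestPath.substring(0, firstDot);
        String[] parts = requestPath.substring(firstDot + 1).split("\\.");
        // The last dot-separated part is the extension; everything before it is a selector.
        List<String> selectors = Arrays.asList(parts).subList(0, parts.length - 1);
        return new SlingPathSketch(resourcePath, selectors, parts[parts.length - 1]);
    }

    public static void main(String[] args) {
        SlingPathSketch p = parse("/content/aem-solr/en/jcr:content.property.index");
        System.out.println(p.resourcePath + " | " + p.selectors + " | " + p.extension);
    }
}
```

For the URL above, the selector is property and the extension is index, which line up with the sling.servlet.selectors and sling.servlet.extensions values in the @Component annotation.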

After running it, verify the SOLR document was created. Go to the SOLR web console at http://localhost:8983/solr/#/aemsolr/query and click on the Execute Query button at the bottom of the query page. You should see your document in the list of results.
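
If you would rather check from code or a browser address bar than from the web console, the same lookup can be expressed as a plain HTTP request against the core’s select handler. The following is a minimal sketch that only builds the URL. The class and method names are hypothetical, and it assumes the document id is the jcr:content path and contains no double quotes.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Illustrative helper that builds a SOLR select URL for a single document id. */
final class SolrSelectUrl {

    static String forId(String solrBaseUrl, String core, String id) {
        // Quote the id so that ':' and '/' in the path are not read as query syntax.
        String query = "id:\"" + id + "\"";
        try {
            return solrBaseUrl + "/" + core + "/select?q=" + URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to exist on every JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(forId("http://localhost:8983/solr", "aemsolr",
                                 "/content/aem-solr/en/jcr:content"));
    }
}
```

Opening the printed URL in a browser should return the indexed document as JSON, provided the servlet has already run.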

Beyond Just Indexing Properties
If you load up the sample content page at http://localhost:4502/content/aem-solr/en.html, you’ll notice that it has some Lorem Ipsum content. Where and how is this content stored? To make a long story short, this page has been implemented using sling:resourceSuperType="core/wcm/components/page/v2/page". Open up the content in CRX/DE to view the structure: http://localhost:4502/crx/de/index.jsp#/content/aem-solr/en/jcr%3Acontent/root. Getting the page’s title and description was simple enough, but how do we index pages that can have an arbitrary number of child components in a responsive grid structure like the one used by the Core WCM Components? At best it would require an intimate knowledge of the taxonomy and a lot of if statements!

We had a similar situation with one of our clients. Their textual content was stored in several child components within a parsys, usually placed there by content authors. To capture the textual content without getting too deep into the taxonomy, we leveraged SlingRequestProcessor to process requests through Sling and get the rendered HTML.

Parent pom.xml dependencyManagement updates
We are going to leverage the Jsoup HTML parser so that we can programmatically extract the textual content from the HTML we render. Locate the parent pom.xml under the aem-solr folder and add the following dependency under the <dependencyManagement> node.

<!-- JSoup -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.11.3</version>
    <scope>provided</scope>
</dependency>

Dependencies pom.xml embeddeds update
Add the following to the <dependencies> node of the dependencies content package project:
<!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
</dependency>

And the following to the <embeddeds> node:
<embedded>
    <groupId>org.jsoup</groupId>
    <target>/apps/system/install</target>
    <filter>true</filter>
</embedded>

Core pom.xml dependencies update
Add the following to the <dependencies> node of the core bundle project:
<!-- JSoup -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <scope>provided</scope>
</dependency>

Create a Sling Servlet
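import java.io.ByteArrayOutputStream;
import java.io.IOException;

import javax.servlet.Servlet;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import org.apache.sling.api.SlingHttpServletRequest;
import org.apache.sling.api.SlingHttpServletResponse;
import org.apache.sling.api.resource.Resource;
import org.apache.sling.api.resource.ResourceResolver;
import org.apache.sling.api.servlets.HttpConstants;
import org.apache.sling.api.servlets.SlingSafeMethodsServlet;
import org.apache.sling.engine.SlingRequestProcessor;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.common.SolrInputDocument;
import org.jsoup.Jsoup;
import org.osgi.service.component.annotations.Component;
import org.osgi.service.component.annotations.Reference;

import com.day.cq.contentsync.handler.util.RequestResponseFactory;
import com.day.cq.wcm.api.WCMMode;
import com.google.common.collect.ImmutableMap;

@Component(service = Servlet.class,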
           name = "Rendering Index SOLR Servlet",
           property = { "sling.servlet.methods=" + HttpConstants.METHOD_GET,
                        "sling.servlet.resourceTypes=aem-solr/components/structure/page",
                        "sling.servlet.selectors=rendering",
                        "sling.servlet.extensions=index" })
public final class RenderingIndexSolrServlet extends SlingSafeMethodsServlet {

    @Reference
    private RequestResponseFactory requestResponseFactory;

    @Reference
    private SlingRequestProcessor requestProcessor;

    @Override
    protected void doGet(final SlingHttpServletRequest request, final SlingHttpServletResponse response)
            throws
            ServletException,
            IOException {

        final Resource resource = request.getResource();
        final SolrInputDocument document = new SolrInputDocument();
        document.setField("id", resource.getPath());
        // A map value with the "set" modifier turns this into a partial update that only touches body_s.
        document.addField("body_s", ImmutableMap.of("set", getText(resource)));

        try (SolrClient client = getClient()) {

            new UpdateRequest().add(document)
                               .commit(client, "aemsolr");

        } catch (final SolrServerException e) {
            throw new ServletException(e);
        }
    }

    private String getText(final Resource resource)
            throws
            ServletException,
            IOException {

        final String uri = String.format("%s.html", resource.getPath());
        final HttpServletRequest request = this.requestResponseFactory.createRequest(HttpConstants.METHOD_GET, uri);
        WCMMode.DISABLED.toRequest(request);
        try (ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            final HttpServletResponse response = this.requestResponseFactory.createResponse(out);
            final ResourceResolver resourceResolver = resource.getResourceResolver();
            this.requestProcessor.processRequest(request, response, resourceResolver);
            final String html = out.toString("UTF-8");
            return Jsoup.parse(html)
                        .text();
        }
    }

    private static HttpSolrClient getClient() {

        return new HttpSolrClient.Builder().withBaseSolrUrl("http://localhost:8983/solr")
                                           .build();
    }
}

Create the servlet above in the core bundle project. It leverages SlingRequestProcessor to render the resource as HTML, and the Jsoup parser to extract the text-only content from that HTML. It also uses SOLR’s partial updates feature to update the existing document that would have been created by the previous servlet. Otherwise we would have had to fetch it, update it, and save it, or recreate it completely.
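
To make the effect of the partial update concrete, here is a toy model of what the "set" modifier does to a stored document. This is plain Java over maps, not SOLR or SolrJ code. It models only the "set" modifier, and the class name and sample field values are hypothetical.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model of a "set" partial update: untouched fields survive, "set" fields are replaced. */
final class AtomicUpdateSketch {

    static Map<String, Object> apply(Map<String, Object> stored, Map<String, Object> update) {
        Map<String, Object> merged = new LinkedHashMap<>(stored);
        for (Map.Entry<String, Object> field : update.entrySet()) {
            Object value = field.getValue();
            if (value instanceof Map && ((Map<?, ?>) value).containsKey("set")) {
                // {"set": x} replaces (or creates) just this one field.
                merged.put(field.getKey(), ((Map<?, ?>) value).get("set"));
            } else {
                // Plain values, such as the id, are carried over as-is.
                merged.put(field.getKey(), value);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Object> stored = new LinkedHashMap<>();
        stored.put("id", "/content/aem-solr/en/jcr:content");
        stored.put("title_s", "Example title");             // hypothetical value
        stored.put("description_s", "Example description"); // hypothetical value

        Map<String, Object> update = new LinkedHashMap<>();
        update.put("id", "/content/aem-solr/en/jcr:content");
        update.put("body_s", Collections.singletonMap("set", "Lorem ipsum"));

        System.out.println(apply(stored, update));
    }
}
```

The merged document keeps title_s and description_s from the first servlet and gains body_s, which is the behavior the rendering servlet relies on.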

Execute the Servlet & Verify SOLR Index
The servlet will respond to the following url: http://localhost:4502/content/aem-solr/en/jcr:content.rendering.index.

After running it, verify the SOLR document was updated. Go to the SOLR web console at http://localhost:8983/solr/#/aemsolr/query and click on the Execute Query button at the bottom of the query page. You should see your document in the list of results, now with a body_s field.

Conclusion
The examples given used servlets as a quick way to illustrate how to index a resource. In practice there are many ways to accomplish this. In our previous projects we encapsulated the Resource to SolrInputDocument mapping in an AdapterFactory, with a suite of supporting OSGi service components to control what content got indexed and how.

Then we adapted resources to SolrInputDocument within event handlers, workflow processes, and Sling jobs. But why stop at resources? Other content we’ve indexed includes PDFs and, yes, even images. With the index data in place, our UX team designed page components to do everything from site search to recirculation of article and news pages.
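
As a rough illustration of keeping the mapping in one place, the sketch below maps page properties to index fields using plain maps. It is not the AdapterFactory described above and uses no AEM or SolrJ types. The class name and the property-to-field table are hypothetical, with the field names borrowed from the first servlet.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative, dependency-free stand-in for a Resource-to-document mapper. */
final class PageDocumentMapper {

    // Hypothetical mapping from JCR property names to SOLR field names.
    private static final Map<String, String> FIELD_BY_PROPERTY = new LinkedHashMap<>();
    static {
        FIELD_BY_PROPERTY.put("jcr:title", "title_s");
        FIELD_BY_PROPERTY.put("jcr:description", "description_s");
    }

    static Map<String, Object> toDocument(String path, Map<String, Object> properties) {
        Map<String, Object> document = new LinkedHashMap<>();
        document.put("id", path);
        for (Map.Entry<String, String> mapping : FIELD_BY_PROPERTY.entrySet()) {
            Object value = properties.get(mapping.getKey());
            if (value != null) {
                // Skip properties the page does not define instead of indexing nulls.
                document.put(mapping.getValue(), value);
            }
        }
        return document;
    }

    public static void main(String[] args) {
        Map<String, Object> properties = new LinkedHashMap<>();
        properties.put("jcr:title", "Example title"); // hypothetical value
        System.out.println(toDocument("/content/aem-solr/en/jcr:content", properties));
    }
}
```

With the mapping isolated like this, an event handler, workflow step, or job only needs to call one method and hand the result to the indexing client.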

You can find the completed project on GitHub.


By aem4beginner
