October 2, 2020

Demystifying shortened and extension-less URLs in AEM

Your company has decided to migrate their web presence to Adobe Experience Manager and you’re getting to the tail end of the project. This is usually the point when you realize your URLs need to be shortened, because, let’s be frank: who wants to see “/content” in their URL? And, whilst we’re at it, you should probably get rid of that “html” extension as well.

So the problem we’re trying to solve is how to turn a URL like http://acme.com/content/acme/en/about.html into http://acme.com/about. There are various ways of going about it and naturally, there are trade-offs with each approach. In this post, I’m going to summarise each approach and its tradeoffs. 

Web Server Rewrite Rules
One approach is to use a rewrite module in your web server (e.g. mod_rewrite). The module rewrites each incoming URL to a path that can be resolved to a resource in AEM before handing the request over to the Dispatcher. This is fairly easy to do by defining a VirtualHost per domain and rewriting the URLs according to the domain. There are many resources available on how to achieve this, so I won’t go into the details.


Request processing flow when web server rewrites incoming URLs (source) 
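
To make the rewrite concrete, here is a minimal sketch of the transformation such a rule performs, modelled in plain Java rather than in web server configuration. The class name InboundRewrite, the content root /content/acme/en and the rule that a path without an extension receives ".html" are my assumptions based on the acme.com example above. A real setup would express the same logic as rewrite rules inside each VirtualHost.

```java
// Hypothetical model of an inbound rewrite rule; this is not web server configuration.
final class InboundRewrite {

    // Assumed content root for the acme.com example site.
    private static final String CONTENT_ROOT = "/content/acme/en";

    // Turns a shortened request path such as "/about" into a path AEM can resolve.
    static String toContentPath(String requestPath) {
        // Paths that are already fully qualified are passed through untouched.
        if (requestPath.startsWith("/content/")) {
            return requestPath;
        }
        // The site root maps to the language root page.
        if (requestPath.equals("/")) {
            return CONTENT_ROOT + ".html";
        }
        // Only append ".html" when the last path segment has no extension.
        String lastSegment = requestPath.substring(requestPath.lastIndexOf('/') + 1);
        String extension = lastSegment.contains(".") ? "" : ".html";
        return CONTENT_ROOT + requestPath + extension;
    }

    public static void main(String[] args) {
        System.out.println(toContentPath("/about")); // /content/acme/en/about.html
    }
}
```

The sketch ignores trailing slashes, selectors and query strings, all of which a production rule set would need to handle.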

Benefits
The advantage of this approach is that the Dispatcher cache path mirrors the content path, so you will not have any issues invalidating content on activation. Moreover, rewrite rules are quite powerful, so you can do some pretty fancy things. 

Drawbacks
However, this also means that your AEM instances are not aware of the link rewriting logic, so you will most likely have to create a custom Link Transformer to rewrite links within your pages before serving them up to the end-users.

Alternatively, you can create a Tag Library to rewrite links, but that will not work for URLs contained in authored content. Either way, this is not much fun: it requires more work on your part and duplicates the link rewriting logic.
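
The duplicated logic is essentially the web server rule run in reverse. Below is a rough sketch of the path handling that a custom Link Transformer or tag library would need. LinkShortener and its content root are hypothetical names of mine, and the code deliberately leaves out the Sling rewriter plumbing that would call it.

```java
// Hypothetical helper that a custom link transformer or tag library could call.
final class LinkShortener {

    // Assumed content root; it must match the prefix the web server adds on the way in.
    private static final String CONTENT_ROOT = "/content/acme/en";

    // Shortens an internal content path, e.g. "/content/acme/en/about.html" becomes "/about".
    static String shorten(String href) {
        boolean insideSite = href.equals(CONTENT_ROOT)
                || href.startsWith(CONTENT_ROOT + "/")
                || href.startsWith(CONTENT_ROOT + ".");
        // External links and paths outside the site are left as they are.
        if (!insideSite) {
            return href;
        }
        String shortened = href.substring(CONTENT_ROOT.length());
        // Drop the ".html" extension, mirroring the rule that appends it on incoming requests.
        if (shortened.endsWith(".html")) {
            shortened = shortened.substring(0, shortened.length() - ".html".length());
        }
        // The language root itself becomes the site root.
        return shortened.isEmpty() ? "/" : shortened;
    }

    public static void main(String[] args) {
        System.out.println(shorten("/content/acme/en/about.html")); // /about
    }
}
```

Links that carry a query string or fragment keep their ".html" extension in this sketch, which is one more detail you would have to keep in sync with the web server rules.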

Sling Resource Resolution
The next approach leverages the Sling resource resolution mechanism that comes with AEM. With this approach, the resource resolution engine associates a path with a content resource. The benefit here is that not only can AEM resolve incoming requests to content paths, but it can also rewrite links to their shortened versions automagically.


Request processing flow when AEM rewrites incoming URLs (source)

The most straightforward way to achieve this setup is to configure the “JCR Resource Resolver Factory” service by defining the mappings using the “URL Mappings” (a.k.a. resource.resolver.mapping) property.

Using the we.retail sample site as an example, once you apply the configuration below, you can browse to the equipment page using the following URL: http://localhost:4502/equipment.html

<?xml version="1.0" encoding="UTF-8"?>
<jcr:root xmlns:sling="http://sling.apache.org/jcr/sling/1.0" xmlns:jcr="http://www.jcp.org/jcr/1.0"
jcr:primaryType="sling:OsgiConfig"
resource.resolver.searchpath="[/apps,/libs,/apps/foundation/components/primary,/libs/foundation/components/primary]"
resource.resolver.manglenamespaces="{Boolean}true"
resource.resolver.allowDirect="{Boolean}true"
resource.resolver.required.providers="[org.apache.sling.jcr.resource.internal.helper.jcr.JcrResourceProviderFactory]"
resource.resolver.virtual="[/:/]"
resource.resolver.mapping="[/-/,/content/we-retail/us/en/-/]"
resource.resolver.map.location="/etc/map"
resource.resolver.default.vanity.redirect.status="302"/>


It’s also important to note that a “reverse” mapping is created from this entry, meaning that the built-in “Day CQ Link Checker Transformer” can shorten links within our HTML pages automatically (provided link rewriting is enabled). You may also combine this with the “Strip HTML Extension” property of the built-in transformer in order to remove the “.html” extension from your links (note: you will need a web server to append the “.html” extension to your incoming requests so that they can be processed).

<?xml version="1.0" encoding="UTF-8"?>
<jcr:root xmlns:sling="http://sling.apache.org/jcr/sling/1.0" xmlns:jcr="http://www.jcp.org/jcr/1.0"
jcr:primaryType="sling:OsgiConfig"
linkcheckertransformer.strictExtensionCheck="{Boolean}false"
linkcheckertransformer.rewriteElements="[a:href,area:href,form:action]"
linkcheckertransformer.disableRewriting="{Boolean}false"
linkcheckertransformer.disableChecking="{Boolean}false"
linkcheckertransformer.stripHtmltExtension="{Boolean}true"
linkcheckertransformer.mapCacheSize="{Long}5000"/>


The resulting HTML markup for we.retail navigation will look like the snippet below:
<ul class="nav navbar-nav navbar-center">
<li class="visible-xs">
<a href="/">we.<strong class="text-primary">Retail</strong></a>
</li>
<li>
<a href="/experience">Experience</a>
</li>
<li>
<a href="/men">Men</a>
</li>
<li>
<a href="/women">Women</a>
</li>
<li>
<a href="/equipment">Equipment</a>
</li>
</ul>


Benefits

This method works great for sites that have simple URL handling requirements and do not need to support multi-tenancy. It’s very easy to configure (only 2 configuration files) and doesn’t require AEM to be aware of the various domains it’s serving content to. 

Drawbacks
Whilst this approach is easy to configure, this simplicity comes at a cost:
It will not work for websites where the URLs do not nicely match up with the JCR content because it doesn’t support regular expressions.
It doesn’t support cross-site links because this rewriting method is not domain-aware.

If a path is duplicated across multiple sites, it will resolve to the first match. For example, given Geometrixx Outdoors and Geometrixx, when requesting http://localhost:4502/company.html, the resource resolution will either resolve to /content/geometrixx-outdoors/en/company or /content/geometrixx/en/company, depending on which mapping was defined first.
When relying on the LinkCheckerTransformer to provide you with extension-less URLs, the paths returned by your calls to ResourceResolver#map on the backend will still contain the “html” extension. This may not be desired when rendering links to be contained in emails.
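
The “first match wins” behaviour can be illustrated with a small model. This is a simplified sketch of ordered prefix mappings, not the actual Sling implementation: FirstMatchModel is a hypothetical class, the repository is simulated with a set of paths, and every mapping is assumed to map the external prefix “/” to one internal prefix.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Simplified model of ordered prefix mappings; not the actual Sling implementation.
final class FirstMatchModel {

    // Tries each internal prefix in configuration order and returns the first existing resource.
    static Optional<String> resolve(String requestPath, List<String> internalPrefixes, Set<String> repository) {
        // Drop the extension to get the resource path, e.g. "/company.html" becomes "/company".
        int dot = requestPath.lastIndexOf('.');
        String resourcePath = dot > requestPath.lastIndexOf('/') ? requestPath.substring(0, dot) : requestPath;
        for (String prefix : internalPrefixes) {
            // Replace the external prefix "/" with the internal prefix.
            String candidate = prefix + resourcePath.substring(1);
            if (repository.contains(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Set<String> repository = Set.of(
                "/content/geometrixx-outdoors/en/company",
                "/content/geometrixx/en/company");
        List<String> mappings = List.of("/content/geometrixx-outdoors/en/", "/content/geometrixx/en/");
        // The mapping defined first wins, so the Geometrixx page is unreachable via "/company.html".
        System.out.println(resolve("/company.html", mappings, repository).orElse("not found"));
    }
}
```

Swapping the order of the two prefixes flips the result, which is exactly why this approach does not suit multi-site setups.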

Pulling out the big guns
Another approach is to define Sling Mappings under /etc/map. Sling Mappings are made up of a series of nodes that dictate how the mapping should function for each configured domain. They take more time to set up but are a great compromise because they are almost as powerful as rewrite rules and also make your AEM application aware of the link processing logic.

Here’s what our Sling mappings would look like for we.retail site:
{
    "jcr:primaryType": "sling:Folder",
    "weretail.com": {
        "jcr:primaryType": "sling:Mapping",
        "sling:internalRedirect": [
            "/content/we-retail/us/en"
        ],
        "weretail_com_content": {
            "jcr:primaryType": "sling:Mapping",
            "sling:match": "(.+)$",
            "sling:internalRedirect": [
                "/content/we-retail/us/en/$1",
                "/$1"
            ]
        },
        "reverse_mapping_content": {
            "jcr:primaryType": "sling:Mapping",
            "sling:match": "$1",
            "sling:internalRedirect": [
                "/content/we-retail/us/en/(.*).html"
            ]
        },
        "reverse_mapping_content_nohtml": {
            "jcr:primaryType": "sling:Mapping",
            "sling:match": "$1",
            "sling:internalRedirect": [
                "/content/we-retail/us/en/(.*)"
            ]
        },
        "reverse_mapping_root": {
            "jcr:primaryType": "sling:Mapping",
            "sling:match": "$",
            "sling:internalRedirect": [
                "/content/we-retail/us/en(.html)?"
            ]
        }
    },
    "weretail_com_root": {
        "jcr:primaryType": "sling:Mapping",
        "sling:match": "weretail.com$",
        "sling:internalRedirect": [
            "/content/we-retail/us/en.html"
        ]
    }
}

This set of Sling Mappings has more entries than strictly necessary, but the extra entries buy you the following:
Links created by selecting a resource with the pathfield widget, where no “html” extension is provided, are rewritten as well
Calls to ResourceResolver#map return the same URL as the one shown to the end user, because the result does not depend on any transformer to rewrite links

The resulting HTML markup for the we.retail navigation will look like the snippet below:
<ul class="nav navbar-nav navbar-center">
<li class="visible-xs">
<a href="http://weretail.com/">we.<strong class="text-primary">Retail</strong></a>
</li>
<li>
<a href="http://weretail.com/experience">Experience</a>
</li>
<li>
<a href="http://weretail.com/men">Men</a>
</li>
<li>
<a href="http://weretail.com/women">Women</a>
</li>
<li>
<a href="http://weretail.com/equipment">Equipment</a>
</li>
</ul>

Note: The links shown above are absolute links because no web server was set up with the “weretail.com” domain to serve this content. If a web server were set up, the LinkCheckerTransformer would generate internal links like “/equipment” instead of “http://weretail.com/equipment”.

Benefits
Leverages Sling resource resolution so the rules that are defined in /etc/map will be used by the LinkCheckerTransformer to shorten URLs within web pages.
You can leverage capturing groups to perform sophisticated resource resolution/mapping rules.
With this method, the LinkCheckerTransformer will be able to generate cross-site links. This means that all links will be internal unless the content path points to another website, in which case an absolute link will be rendered. 
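
To show what a capturing group does in such a mapping, here is a minimal sketch using java.util.regex. It is only a model: CaptureGroupMapping is a hypothetical class, the pattern joins the “weretail.com” node name with the “(.+)$” match from the mappings above, and it assumes the match runs against a plain “host/path” string. I have also escaped the dot in the domain, which the mappings above do not do.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplified model of a regex-based mapping entry; not how Sling evaluates /etc/map internally.
final class CaptureGroupMapping {

    // Assumed simplification: the pattern is matched against "host/path" only.
    private static final Pattern MATCH = Pattern.compile("^weretail\\.com/(.+)$");

    // "$1" is replaced with whatever the first capturing group matched.
    private static final String INTERNAL_REDIRECT = "/content/we-retail/us/en/$1";

    static String toInternalPath(String hostAndPath) {
        Matcher matcher = MATCH.matcher(hostAndPath);
        // When the entry does not match, the input is returned unchanged in this sketch.
        return matcher.matches() ? matcher.replaceFirst(INTERNAL_REDIRECT) : hostAndPath;
    }

    public static void main(String[] args) {
        // Prints /content/we-retail/us/en/equipment.html
        System.out.println(toInternalPath("weretail.com/equipment.html"));
    }
}
```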

Drawbacks
It can be a bit tricky to understand the various properties that can be applied to Sling Mappings
The domain-awareness can make it difficult to maintain the mappings in an environment where AEM instances are provisioned on the fly (may need to use a script to generate the mappings dynamically when an instance is allocated a domain name) 
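
As a rough idea of what such a generation script could look like, the sketch below builds the forward mapping entries for a given domain and content root. MappingGenerator is a hypothetical helper of mine, it omits the reverse mapping entries for brevity, and it does not cover how the resulting JSON gets imported into the repository.

```java
// Hypothetical generator for a per-domain Sling Mapping definition (forward entries only).
final class MappingGenerator {

    static String generate(String domain, String contentRoot) {
        // Follows the naming used above, e.g. "weretail.com" gives "weretail_com_content".
        String nodeName = domain.replace('.', '_') + "_content";
        return String.join("\n",
                "{",
                "  \"jcr:primaryType\": \"sling:Folder\",",
                "  \"" + domain + "\": {",
                "    \"jcr:primaryType\": \"sling:Mapping\",",
                "    \"sling:internalRedirect\": [\"" + contentRoot + "\"],",
                "    \"" + nodeName + "\": {",
                "      \"jcr:primaryType\": \"sling:Mapping\",",
                "      \"sling:match\": \"(.+)$\",",
                "      \"sling:internalRedirect\": [\"" + contentRoot + "/$1\", \"/$1\"]",
                "    }",
                "  }",
                "}");
    }

    public static void main(String[] args) {
        // The domain would come from your provisioning tooling when the instance is allocated.
        System.out.println(generate("weretail.com", "/content/we-retail/us/en"));
    }
}
```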

General Sling Processing Gotchas
As you can see, Sling Resource Resolution is a powerful tool for URL handling. However, there are a few points to watch out for:
When rewriting a link, if the path provided cannot be resolved to a resource, the link will not be rewritten at all. You will most likely run into this problem if you are using a mix of Sling Filters and request forwards to render your content. A way around this is to write a custom Transformer to force the resource mapping to occur.
The Dispatcher cache path will not match the content path. More specifically, the content will be cached using the shortened path but when a cache invalidation request is sent, the full path will be provided. A way around this is to write rewrite rules on the webserver to correct the path to be invalidated or to use the Dispatcher Flush Rules provided by ACS Commons (as of version 1.5.0, regular expressions are supported for more complex invalidation logic).
Whilst Sling does a lot for you, you will still need to set up some rewrite rules to append the “.html” extension in order for your content to be cached by the Dispatcher.
The LinkCheckerTransformer works just like any other transformer in the rewriting pipeline: it responds to SAX events from the HTML parser. This means that it may not rewrite all links within your HTML page unless the parser is configured to generate an event for the tag in question. For example, input tags may be used to hold open redirects – these will not be rewritten unless the INPUT tag is added to the HTML parser and input:value is added to the list of rewrite elements.
The HTML markup must be valid or else link rewriting will cease to work when it comes across invalid markup. 
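
The rewrite elements point is easy to picture as a lookup of element:attribute pairs. The sketch below is a toy model only: RewriteElementsModel is a hypothetical class, and it does not cover the other half of the requirement, namely that the HTML parser must emit an event for the tag in the first place.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.stream.Collectors;

// Toy model of the "rewrite elements" list; not the LinkCheckerTransformer itself.
final class RewriteElementsModel {

    private final Set<String> rewriteElements;

    // Takes entries in the element:attribute form used by the configuration, e.g. "a:href".
    RewriteElementsModel(String... configured) {
        this.rewriteElements = Arrays.stream(configured)
                .map(String::toLowerCase)
                .collect(Collectors.toSet());
    }

    // An attribute is only considered for rewriting when its element:attribute pair is listed.
    boolean isRewritten(String element, String attribute) {
        return rewriteElements.contains((element + ":" + attribute).toLowerCase());
    }

    public static void main(String[] args) {
        RewriteElementsModel defaults = new RewriteElementsModel("a:href", "area:href", "form:action");
        System.out.println(defaults.isRewritten("a", "href"));      // true
        System.out.println(defaults.isRewritten("input", "value")); // false
    }
}
```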

Conclusion
Hopefully, this provided you with some insight into URL handling in AEM. It is a tricky topic and it gets even more complex when vanity URLs are thrown into the mix. Avoid them if possible – they get unruly very quickly! If you’re going to take away anything from this post, it should be that the Sling Resource Resolution mechanism can do the heavy lifting for you, so I highly recommend you consider it when implementing your URL handling processes.



By aem4beginner
