AEM Dispatcher

The dispatcher is designed as a caching, security, and load balancing layer of the AEM stack. This is the service that the end-user will connect to when viewing the site.

The dispatcher is made up of two distinct parts: the web server and the dispatcher plugin. While this page only discusses the dispatcher plugin, some restrictions should be put in place in the web server itself. Stripping off query parameters, for example, should be configured at the web server level rather than left to the dispatcher, as should redirection to error documents using the ErrorDocument directive.

Load Balancing
While load balancing is a feature of the dispatcher configuration, it is recommended to use an external load balancer and use the dispatcher as a cache.

The load-balancing settings for the dispatcher are covered later in the Configuration section.

Security
The security layer of the dispatcher lets you control which URLs go through to the publisher. It is recommended to set a global block and then open up access to the required content. If the filter setup does not allow a piece of content, the dispatcher returns a 404 error. Any error pages you have defined in AEM are not shown in that case; a default dispatcher error page is rendered instead.

The security section can be removed if you are using sling mappings in the publishers to redirect your content, however, some rules should always be in place.

Caching
Caching is the main use of the dispatcher. It can be configured to cache any set of files or, like the security features, to ignore certain content. By default it does not cache pages that require authorisation; however, this can and should be changed.

Configuration
This section will discuss the configuration of the dispatcher as well as what configuration should be added for a good AEM system. As mentioned earlier if you use sling mappings and want custom error pages then a lot of the filters should be removed.

The dispatcher is an Apache httpd module and is configured using a file called dispatcher.any. This is the main file, but as I will explain, you can include additional files to keep the configuration for multiple applications cleaner.

The indentation of instructions in the dispatcher configuration is very important. Although each sub-instruction is enclosed in {}, the dispatcher will complain if the indentation is wrong, or you may get unexpected results. It is recommended to use a tab to indent all instructions.

For more in-depth configuration details go to the Dispatcher configuration documentation

Structure
The structure of the files is very simple: every command starts with a /, and where it has sub-instructions they are wrapped in {}, similar to a JSON structure.

# name of the dispatcher
/name "internet-server"
# each farm configures a set of (load-balanced) renders
/farms
	{
	$include "author-farm.any"
	$include "publish-farm.any"
	}

The above is a basic example of a default dispatcher.any file. In any section, an external file can be included in the configuration by using $include "path". As this is not a dispatcher instruction it doesn't have to start with /. The include function also allows several files to be included at once, e.g. $include "path*.any" will include any file starting with path and ending with .any.

It is good practice to segregate the configuration of different applications so a suggested folder structure is

dispatcher - the dispatcher.any file and any shared configuration
app1 - the configuration for app1
app2 - the configuration for app2

etc.
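
As a rough sketch of that layout, the top-level dispatcher.any could pull in the farm files of each application. The absolute paths and file names below are made up for illustration and are not taken from any standard install, so adjust them to wherever your configuration lives.

```
# dispatcher/dispatcher.any
/name "internet-server"
/farms
	{
	# hypothetical paths: one folder per application
	$include "/etc/httpd/conf.dispatcher.d/app1/farm.any"
	# a wildcard pulls in every matching file in the folder
	$include "/etc/httpd/conf.dispatcher.d/app2/*-farm.any"
	}
```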

Instructions
The following is a list of the main instructions available for configuring the dispatcher

name

Denotes the name of the dispatcher

farms

A list of the available applications
Each application should be installed as its own farm on the dispatcher

clientheaders

A list of headers that get passed on from the end-user to the publisher
Because header requirements are generally specific to the server this should be in a shared file
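
A minimal sketch of a clientheaders section follows. The header list is only an illustration, not a recommended set; pass through whatever your application actually needs.

```
# inside a farm; the list could equally live in a shared file pulled in with $include
/clientheaders
	{
	"referer"
	"user-agent"
	"accept-language"
	"cookie"
	"authorization"
	}
```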

virtualhosts

A list of the addresses that make up this farm
Any request for an address in the virtualhosts section will use the configuration in this farm
A request is matched to the most specific virtual host, e.g. if you have application.com in farm1 and application.com/abc/* in farm2:

Calls to https://application.com/def.html will go to farm1
Calls to https://application.com/abc/page.html will go to farm2

Should list all hosts that will be part of the application
Can be either an IP address or a DNS entry of the dispatcher server
Can specify just the host or the scheme as well, e.g. abc.com or https://abc.com
Can specify wildcards, e.g. *.application.com
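
The farm1/farm2 example above could look roughly like this. The farm names are the ones used in the example; the two sections sit in two different farms and are only shown together here for comparison.

```
# in farm1: catches every request for the host and its subdomains
/virtualhosts
	{
	"application.com"
	"*.application.com"
	}

# in farm2: more specific, so it wins for anything under /abc/
/virtualhosts
	{
	"application.com/abc/*"
	}
```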

sessionmanagement

Support for session management and authentication
Requires allowAuthorized in the cache instruction to be set to "0"
directory

The directory where session information will be stored
Will be created if it doesn't exist
Mandatory field

encode

How the session information is encoded. Use "md5" or "hex"
"md5" is the default setting if not added

header

The name of the http header or cookie that stores the authorization information
values start with HTTP: for headers
values start with Cookie: for cookies

timeout

The number of seconds until the session times out after it has been last updated
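
Put together, a sessionmanagement section might look like the sketch below. The directory path and timeout value are assumptions chosen for illustration.

```
/sessionmanagement
	{
	# hypothetical location; created if it doesn't exist
	/directory "/usr/local/apache/.sessions"
	/encode "md5"
	# read the authorization information from an HTTP header
	/header "HTTP:authorization"
	# seconds of inactivity before the session expires
	/timeout "800"
	}
```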

renders

The servers that provide rendered pages (typically AEM publish instances).
Because render requirements are generally specific to the server this should be in a shared file
Each render server is named in its own configuration
Having multiple renders will tell the dispatcher to act as a load balancer
hostname

The hostname or ip address of the server

port

The port the server is accessible on

timeout

The time to wait for a connection, in milliseconds
Defaults to "0", which waits forever

receiveTimeout

The time to wait for content, in milliseconds
Defaults to "600000" (10 minutes); set to "0" to wait forever

ipv4

Specifies if the server is on ipv4 or ipv6
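
A sketch of a renders section with two publish instances follows. The render names, hostnames and timeout values are hypothetical; listing more than one render is what makes the dispatcher load balance between them.

```
/renders
	{
	/publish1
		{
		# hypothetical host
		/hostname "publish1.internal"
		/port "4503"
		# connection timeout in milliseconds
		/timeout "10000"
		# response timeout in milliseconds
		/receiveTimeout "600000"
		/ipv4 "1"
		}
	/publish2
		{
		/hostname "publish2.internal"
		/port "4503"
		}
	}
```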

filter

Defines the URLs to which Dispatcher enables access.
This is the core section for handling restrictions and stopping attempts at invalid content going to the publishers
Each rule is on a separate line with a numbered rule
A good practice is to use a whitelist scenario i.e. deny everything and then open up what you need
Any url that is blocked by the dispatcher will return a standard 404 response but with no content.
Wildcard support:

*
Matches zero or more contiguous instances of any character in the string. The final character of the match is determined by either of the following situations:
A character in the string matches the next character in the pattern, and the pattern character has the following characteristics:
Not a *
Not a ?
A literal character (including space) or a character class
The end of the pattern is reached.

?
Matches any single character. Use outside character classes.

[ and ]
Mark the beginning and end of a character class.
Character classes can include one or more character ranges and single characters.
e.g. [a-z], [0-9]
A match occurs if the target character matches any of the characters in the character class, or within a defined range.
If the closing bracket is not included, the pattern produces no matches.

!
Negates the character or character class that follows.
Use only for negating characters and character ranges inside character classes.
Equivalent to the ^ wildcard.

^
Negates the character or character range that follows.
Use only for negating characters and character ranges inside character classes.
Equivalent to the ! wildcard character.

type

The type of rule, either deny or allow

glob

A single rule to specify all parameter types
When used none of the following url instructions are valid
rules are in the format of METHOD URL QUERYPARAMS
can take wildcards in any of the 3 sections
If a section is not added then it is assumed to be * other than query which is assumed to be empty

method

The type of request this rule is for
Is not valid if a glob instruction has been set
Can be GET, POST, PUT, OPTIONS, HEAD, DELETE

version

The HTTP version requested
Is not valid if a glob instruction has been set
It is not recommended to restrict the http version

url

The url structure for the rule
Is not valid if a glob instruction has been set
Supports wildcards

query

The query parameters to allow through
query parameters are in the form of name=value
Supports wildcards for both name and value
It is bad practice to use query strings within AEM as this also makes multiple cache values available
Is accumulative so if you deny something at /abc after allowing something at /abc/def then both will be denied
It is bad practice to send query parameters to any page as it will make the page noncacheable. Query parameters should be stripped off the request by the webserver

selectors

A list of selectors to add to the rule
A single string in () with | between elements
Selectors are additional information before the extension
i.e abc.def.html would be for the page abc with the selector def

extension

A list of extensions to add to the rule
A single string in () with | between elements

suffix

A list of suffixes to add to the rule
A single string in () with | between elements
Suffixes are additional information after the extension
e.g. abc.html/def would be for the page abc with the suffix def
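
The deny-everything-then-allow approach described above can be sketched as follows. The rule numbers, content paths, and selector and extension lists are illustrative assumptions, not a complete or recommended rule set. Later rules are applied on top of earlier ones, so the order matters.

```
/filter
	{
	# block everything first
	/0001 { /type "deny" /glob "*" }
	# then open up only what the site needs (hypothetical paths)
	/0002 { /type "allow" /method "GET" /url "/content/*" /extension '(html|json)' }
	/0003 { /type "allow" /method "GET" /url "/etc.clientlibs/*" /extension '(css|js)' }
	# close off selectors that expose raw repository data
	/0004 { /type "deny" /url "/content/*" /selectors '(infinity|tidy|sysview|docview)' /extension '(json|xml|html)' }
	}
```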

vanity_urls

Configures access to vanity URLs.
Allows the dispatcher to check for changes in vanity urls in AEM.
url

The url on the publisher to get vanity url information

file

The local file to store the vanity url cache

delay

The interval, in seconds, between checks of the vanity urls
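
A sketch of the section follows. The url shown is the vanity URL service path commonly used on AEM publish instances; the file location and delay are assumptions. The url itself also has to be let through by the filter section.

```
/vanity_urls
	{
	/url "/libs/granite/dispatcher/content/vanityUrls.html"
	# hypothetical local cache file
	/file "/tmp/vanity_urls"
	# re-check every 300 seconds
	/delay 300
	}
```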

propagateSyndPost

Support for the forwarding of syndication requests.
By default, this is disabled as syndication messages are meant for the dispatcher
Can be enabled by putting a value of "1"
Must have post requests enabled

cache

Configures caching behavior.
This is the second most important part of the dispatcher config, it defines everything that is cached
You should only cache items that are static
docroot

The location the cached files will be stored
The value must be the exact same path as the document root of the web server so that Dispatcher and the webserver handle the same files.
The web server is responsible for delivering the correct status code when the dispatcher cache file is used, that's why it is important that it can find it as well.

statfile

Identifies the file to use as the statfile.
The dispatcher uses this file to register the time of the most recent content update.
The statfile can be any file on the webserver.
if the statfileslevel attribute is configured then the statfile is ignored

serveStaleOnError

If set to "1" the dispatcher will keep serving the cached (stale) files until the publisher returns valid content for the request.

allowAuthorized

Controls whether requests that contain authentication information (an authorization header, or a cookie named authorization or login-token) are cached
Setting it to "1" allows caching of pages with authorization

rules

Specifies the pages to cache according to the document path
The dispatcher will never cache a page in any of the following cases:

The HTTP method is not GET.
Other common methods are POST for form data and HEAD for the HTTP header.

The request URI contains a question mark ("?").
This usually indicates a dynamic page, such as a search result that does not need to be cached.

The file extension is missing.
The web server needs the extension to determine the document type (the MIME-type).

The authentication header is set.
This can be configured with the allowAuthorized instruction.

The AEM instance responds with any of the following headers:
no-cache
no-store
must-revalidate

Rules are similar to filters, but only support the glob functionality
glob

The url structure for the rule
Supports wildcards

type

The type of rule
Either allow or deny
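
A minimal sketch of a cache section using the instructions so far is shown below. The docroot, statfile path and rule paths are hypothetical; the docroot has to match the DocumentRoot of your web server.

```
/cache
	{
	# must be the same path as the web server's document root
	/docroot "/opt/dispatcher/cache"
	# hypothetical statfile location
	/statfile "/tmp/dispatcher-website.stat"
	/serveStaleOnError "1"
	/allowAuthorized "0"
	/rules
		{
		# cache everything...
		/0000 { /glob "*" /type "allow" }
		# ...except a section that changes too often (hypothetical path)
		/0001 { /glob "/content/app1/news/*" /type "deny" }
		}
	}
```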

statfileslevel

If configured the statfile instruction is ignored
The dispatcher creates .stat files in each folder from the docroot folder down to the level that you specify. The docroot folder is level 0
When a page is updated the dispatcher locates the folder on the file path that is at the statfileslevel, and invalidates all files below that folder

invalidate

Allows configuration to automatically invalidate cached files
Uses the same structure as the cache rules
With automatic invalidation, the dispatcher does not delete cached files after a content update but checks their validity when they are next requested
Documents in the cache that are not auto-invalidated will remain in the cache until a content update explicitly deletes them.
Automatic invalidation is typically used for HTML pages
HTML pages often contain links to other pages, making it difficult to determine whether a content update affects a page.
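
As an illustration, the fragment below would sit inside the cache section. The level of "2" is an arbitrary choice for the example: with it, .stat files are kept in the docroot and two folder levels below it.

```
# inside /cache
/statfileslevel "2"
/invalidate
	{
	# nothing is auto-invalidated by default...
	/0000 { /glob "*" /type "deny" }
	# ...except HTML pages, which are re-checked after a content update
	/0001 { /glob "*.html" /type "allow" }
	}
```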

invalidateHandler

Allows you to define a script which is called for each invalidation request received by Dispatcher
Allows the dispatcher to clear content from a CDN
The script will be called with 3 parameters

Handle

The content path that is invalidated

Action

The replication action, e.g. Activate, Deactivate

Action Scope

The replication action's scope
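
Wiring up a handler is a single line inside the cache section. The script path here is hypothetical; the script itself receives the three parameters listed above and is where a call to a CDN purge would go.

```
# inside /cache; hypothetical script path
/invalidateHandler "/opt/dispatcher/scripts/invalidate.sh"
```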

allowedClients

Limits the clients that can call the dispatcher flush function
Should be limited to the dispatcher's publish servers
glob

The server address
Can use wildcards

type

The type of rule
Either allow or deny
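
A sketch that only lets known servers flush the cache is shown below. The second address is a made-up publish server IP.

```
# inside /cache
/allowedClients
	{
	# refuse flush requests from everyone...
	/0001 { /glob "*.*.*.*" /type "deny" }
	# ...except the local machine and the publish server (hypothetical IP)
	/0002 { /glob "127.0.0.1" /type "allow" }
	/0003 { /glob "10.0.0.15" /type "allow" }
	}
```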

ignoreUrlParams

Defines which url parameters are ignored or allowed to be part of the cache structure
This is generally not required as the query parameters should be ignored at the webserver
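
If you do need it, a sketch might look like this. The parameter name is hypothetical. Note the slightly counter-intuitive meaning: "allow" marks a parameter as ignored, so the page is still cached, while "deny" means the parameter matters and the request is not cached.

```
# inside /cache
/ignoreUrlParams
	{
	# every parameter prevents caching by default
	/0001 { /glob "*" /type "deny" }
	# a tracking parameter that does not change the page (hypothetical name)
	/0002 { /glob "utm_campaign" /type "allow" }
	}
```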

headers

Defines which response headers are cached at the dispatcher

enableTTL

When set to "1" the dispatcher uses the Cache-Control: max-age or Expires headers to determine the validity of the cache
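
A short sketch of both instructions inside the cache section follows. The header list is an example, not a required set.

```
# inside /cache
/headers
	{
	"Cache-Control"
	"Content-Disposition"
	"Content-Type"
	"Expires"
	"Last-Modified"
	}
# honour Cache-Control: max-age / Expires from the publisher
/enableTTL "1"
```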

statistics

Defining statistic categories for load-balancing calculations
Defines categories of files for which Dispatcher scores the responsiveness of each render. The dispatcher uses the scores to determine which render to send a request to
categories

Define a category for each type of document for which you want to keep statistics for render selection. The /statistics section contains a /categories section
glob

The structure of the path

unavailablePenalty

Defines the time in tenths of a second that is applied to the render statistics when a connection to the renderer fails
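
A sketch with two categories and a penalty is shown below. The category names and the penalty value are arbitrary choices for the example.

```
/statistics
	{
	/categories
		{
		# score HTML responses separately from everything else
		/html { /glob "*.html" }
		/others { /glob "*" }
		}
	}
# add 0.1 seconds to a render's score when a connection to it fails
/unavailablePenalty "1"
```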

stickyConnectionsFor

Defines the paths for which the dispatcher makes sure a user's requests always go to the same server

health_check

The URL to use to determine service availability.
url

The url to call on the publisher for a health check

retryDelay

The delay before retrying a failed connection.

failover

Resend requests to different renders when the original request fails.
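
The last few farm-level instructions could be combined roughly as follows. The paths and values are assumptions for illustration only.

```
# keep a user's requests under this path on the same render (hypothetical path)
/stickyConnectionsFor "/products"
/health_check
	{
	# hypothetical page on the publisher that answers when it is healthy
	/url "/health_check.html"
	}
# wait 1 second before retrying a failed connection
/retryDelay "1"
# try another render when the original request fails
/failover "1"
```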

Apache Config
The dispatcher generally runs on an Apache httpd server. It can run on a number of other web servers, but Apache httpd is the most common and allows for some extra configuration.

In your dispatcher config, you will generally restrict access to certain pages within AEM. To do this, put a set of rewrite rules in place:

RewriteEngine on
LogLevel info rewrite:info
RewriteCond %{REQUEST_URI} ^/crx [OR]
RewriteCond %{REQUEST_URI} ^/apps [OR]
RewriteCond %{REQUEST_URI} ^/home [OR]
RewriteCond %{REQUEST_URI} ^/tmp [OR]
RewriteCond %{REQUEST_URI} ^/var [OR]
RewriteCond %{REQUEST_URI} ^/libs.*(?<!/granite/csrf/token.json)$
RewriteRule ^(.*)$ /error$1.html [R,L]

This tells Apache to redirect any url starting with /crx, /apps, /home, /tmp, /var or /libs (unless it is /libs/granite/csrf/token.json) to /error followed by the original path and .html.

The reason for putting .html at the end is that it turns the request into one for an HTML page no matter what extension was passed, so AEM handles it with its standard HTML error handler pages.

If you are directing calls to your author via the dispatcher, do not use this Apache configuration. It blocks access to non-client-facing libraries and services, so your author will not work correctly.

AEM Tutorials for Beginners

May 13, 2020