May 13, 2020

AEM Dispatcher

The dispatcher is designed as a caching, security, and load balancing layer of the AEM stack. This is the service that the end-user will connect to when viewing the site.

The dispatcher is made up of two distinct parts: the web server and the dispatcher plugin. While this page only discusses the dispatcher plugin, some restrictions should be put in place in the web server itself. For example, stripping off query parameters and redirecting error documents with the ErrorDocument directive should be configured at the web server level and not left to the dispatcher.

Load Balancing
While load balancing is a feature of the dispatcher configuration, it is recommended to use an external load balancer and use the dispatcher as a cache.

Configuring the load balancing settings for the dispatcher is covered later in the Configuration section.

Security
The security layer of the dispatcher allows you to set which URLs go through to the publisher. It is recommended to set a global block and then open up access to the required content. If the filter setup does not allow a piece of content, the dispatcher will return a 404 error. Any error pages you have defined in AEM will not be shown to the user in that case; a default dispatcher error page is rendered instead.

The security section can be removed if you are using Sling mappings in the publishers to redirect your content. However, some rules should always be in place.
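
As a rough illustration of the "block everything, then open up" approach, a filter section could look something like the sketch below. The site root /content/mysite, the rule numbers, and the selector and extension lists are placeholders I have made up for the example, so adjust them to your own application.

```
/filter
	{
	# Global block: deny every request first
	/0001 { /type "deny" /url "*" }
	# Open up GET requests for pages and assets under a hypothetical site root
	/0002 { /type "allow" /method "GET" /url "/content/mysite/*" /extension '(html|css|js|png|jpg|gif)' }
	# Open up client libraries
	/0003 { /type "allow" /method "GET" /url "/etc.clientlibs/*" }
	# A later rule overrides an earlier one, so this blocks data-dumping
	# selectors even on paths that were allowed above
	/0004 { /type "deny" /selectors '(feed|infinity|tidy)' /extension '(json|xml)' }
	}
```

Anything that no allow rule matches stays blocked and gets the 404 described above.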

Caching
Caching is the main use of the dispatcher. It can be configured to cache any set of files or, like the security features, to ignore certain content. By default it will not cache any pages that require authorisation. However, this can and should be changed.
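
To give an idea of what this looks like, here is a minimal sketch of a cache section. The docroot path, the search page glob, and the IP address are assumptions made up for the example; the instructions themselves are described in the Instructions list further down.

```
/cache
	{
	# Must be the same directory as the web server's document root (placeholder path)
	/docroot "/var/www/html"
	# "0" leaves caching of authorised requests switched off; "1" switches it on
	/allowAuthorized "0"
	/rules
		{
		# Cache everything...
		/0000 { /glob "*" /type "allow" }
		# ...except a hypothetical dynamic search page
		/0001 { /glob "/content/mysite/*/search.html" /type "deny" }
		}
	/invalidate
		{
		# Only HTML pages are invalidated automatically after a content update
		/0000 { /glob "*" /type "deny" }
		/0001 { /glob "*.html" /type "allow" }
		}
	/allowedClients
		{
		# Only the (placeholder) publish server may flush the cache
		/0000 { /glob "*" /type "deny" }
		/0001 { /glob "10.0.0.21" /type "allow" }
		}
	}
```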

Configuration
This section discusses the configuration of the dispatcher as well as what configuration should be added for a good AEM system. As mentioned earlier, if you use Sling mappings and want custom error pages then a lot of the filters should be removed.

The dispatcher is an Apache httpd module and is configured using a file called dispatcher.any. This is the main file, but as I will explain, you can include additional files to keep the configuration for multiple applications cleaner.

The indentation of instructions in the dispatcher configuration is very important. While each sub-instruction is enclosed in {}, the dispatcher will complain if the indentation is wrong, or you may get unexpected results. It is recommended to use a tab to indent all instructions.

For more in-depth configuration details, go to the Dispatcher configuration documentation.

Structure
The structure of the files is very simple. Every instruction starts with a /, and where it has sub-instructions they are wrapped in {}, similar to a JSON structure.

# name of the dispatcher
/name "internet-server"
# each farm configures a set of (load-balanced) renders
/farms
{
$include "author-farm.any"
$include "publish-farm.any"
}


The above is a basic example of a default dispatcher.any file. In any section, an external file can be included in the configuration by using $include "path". As this is not a dispatcher instruction, it doesn't have to start with /. The include function also allows the inclusion of multiple files at the same time, e.g. $include "path*.any" will include any file starting with path and ending with .any.

It is good practice to segregate the configuration of different applications, so a suggested folder structure is:

dispatcher - the dispatcher.any file and any shared configuration
app1 - the configuration for app1
app2 - the configuration for app2

etc.
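
With that folder structure, the main file could pull in one farm file per application, roughly as sketched below. The file names farm.any and clientheaders.any, the farm name app1, and the relative paths are all assumptions for illustration. How relative paths are resolved depends on your installation, so absolute paths may be the safer choice.

```
# dispatcher/dispatcher.any
/name "internet-server"
/farms
	{
	# One farm file per application, kept in that application's own folder
	$include "app1/farm.any"
	$include "app2/farm.any"
	}

# app1/farm.any (hypothetical file; app2/farm.any would follow the same pattern)
/app1
	{
	/clientheaders
		{
		# Shared configuration lives in the dispatcher folder
		$include "dispatcher/clientheaders.any"
		}
	# ... the remaining app1 instructions go here
	}
```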

Instructions
The following is a list of the main instructions available for configuring the dispatcher:
  • name
    • Denotes the name of the dispatcher
  • farms
    • A list of the available applications
    • Each application should be installed as its own farm on the dispatcher
  • clientheaders
    • A list of headers that get passed on from the end-user to the publisher
    • Because header requirements are generally specific to the server this should be in a shared file
  • virtualhosts
    • A list of the addresses that make up this farm
    • Any request for an address in the virtualhosts section will use the configuration in this farm
    • A request is matched to the most specific virtual host, e.g. if you have application.com in farm1 and application.com/abc/* in farm2
      • Calls to https://application.com/def.html will go to farm1
      • Calls to https://application.com/abc/page.html will go to farm2
    • Should list all hosts that will be part of the application
    • Can be either an IP address or DNS entry of the dispatcher server
    • Can specify just the host or the scheme as well, e.g. abc.com or https://abc.com
    • Can specify wildcards, e.g. *.application.com
  • sessionmanagement
    • Support for session management and authentication
    • Requires allowAuthorized in the cache instruction to be set to "0"
    • directory
      • The directory where session information will be stored
      • Will be created if it doesn't exist
      • Mandatory field
    • encode
      • How the session information is encoded. Use "md5" or "hex"
      • "md5" is the default setting if not added
    • header
      • The name of the http header or cookie that stores the authorization information
      • values start with HTTP: for headers
      • values start with Cookie: for cookies
    • timeout
      • The number of seconds until the session times out after it has been last updated
  • renders
    • The servers that provide rendered pages (typically AEM publish instances)
    • Because render requirements are generally specific to the server, this should be in a shared file
    • Each render server is named in its own configuration
    • Having multiple renders tells the dispatcher to act as a load balancer
    • hostname
      • The hostname or IP address of the server
    • port
      • The port the server is accessible on
    • timeout
      • The time in milliseconds to wait for a connection
      • Defaults to "0", which waits forever
    • receiveTimeout
      • The time in milliseconds to wait for content
      • Defaults to "600000" (10 minutes); "0" waits forever
    • ipv4
      • Specifies whether the render's address is resolved as IPv4 ("1") or IPv6 ("0", the default)
  • filter
    • Defines the URLs to which the dispatcher enables access
    • This is the core section for handling restrictions and stopping requests for invalid content from reaching the publishers
    • Each rule is on a separate line with a numbered rule
    • A good practice is to use a whitelist scenario, i.e. deny everything and then open up what you need
    • Rules are evaluated in order and the last matching rule wins, so if you deny something at /abc after allowing something at /abc/def then both will be denied
    • Any URL that is blocked by the dispatcher will return a standard 404 response but with no content
    • Wildcard support:
      • *
        • Matches zero or more contiguous instances of any character in the string. The final character of the match is determined by either of the following situations:
          • A character in the string matches the next character in the pattern, and the pattern character has the following characteristics:
            • Not a *
            • Not a ?
            • A literal character (including space) or a character class
          • The end of the pattern is reached
      • ?
        • Matches any single character. Use outside character classes.
      • [ and ]
        • Mark the beginning and end of a character class
        • Character classes can include one or more character ranges and single characters
        • e.g. [a-z], [0-9]
        • A match occurs if the target character matches any of the characters in the character class, or falls within a defined range
        • If the closing bracket is not included, the pattern produces no matches
      • !
        • Negates the character or character class that follows
        • Use only for negating characters and character ranges inside character classes
        • Equivalent to the ^ wildcard
      • ^
        • Negates the character or character range that follows
        • Use only for negating characters and character ranges inside character classes
        • Equivalent to the ! wildcard
    • type
      • The type of rule, either deny or allow
    • glob
      • A single rule to specify all parameter types
      • When used, none of the following URL instructions are valid
      • Rules are in the format of METHOD URL QUERYPARAMS
      • Can take wildcards in any of the 3 sections
      • If a section is not added then it is assumed to be *, other than query which is assumed to be empty
    • method
      • The type of request this rule is for
      • Is not valid if a glob instruction has been set
      • Can be GET, POST, PUT, OPTIONS, HEAD, DELETE
    • version
      • The HTTP version requested
      • Is not valid if a glob instruction has been set
      • It is not recommended to restrict the HTTP version
    • url
      • The URL structure for the rule
      • Is not valid if a glob instruction has been set
      • Supports wildcards
    • query
      • The query parameters to allow through
      • Query parameters are in the form of name=value
      • Supports wildcards for both name and value
      • It is bad practice to rely on query strings within AEM, as a page requested with query parameters is not cacheable by default. Query parameters should be stripped off the request by the web server
    • selectors
      • A list of selectors to add to the rule
      • A single string in () with | between elements
      • Selectors are additional information before the extension
      • e.g. abc.def.html would be for the page abc with the selector def
    • extension
      • A list of extensions to add to the rule
      • A single string in () with | between elements
    • suffix
      • A list of suffixes to add to the rule
      • A single string in () with | between elements
      • Suffixes are additional information after the extension
      • e.g. abc.html/def would be for the page abc with the suffix /def
  • vanity_urls
    • Configures access to vanity URLs
    • Allows the dispatcher to check for changes in vanity URLs in AEM
    • url
      • The URL on the publisher to get vanity URL information
    • file
      • The local file to store the vanity URL cache
    • delay
      • The time in seconds between checks of the vanity URLs
  • propagateSyndPost
    • Support for the forwarding of syndication requests
    • By default, this is disabled as syndication messages are meant for the dispatcher
    • Can be enabled by putting a value of "1"
    • Must have POST requests enabled
  • cache
    • Configures caching behavior
    • This is the second most important part of the dispatcher config; it defines everything that is cached
    • You should only cache items that are static
    • The statfileslevel, invalidate, invalidateHandler, allowedClients, ignoreUrlParams, headers and enableTTL instructions listed after this one are also set inside the cache section
    • docroot
      • The location where the cached files will be stored
      • The value must be the exact same path as the document root of the web server so that the dispatcher and the web server handle the same files
      • The web server is responsible for delivering the correct status code when the dispatcher cache file is used, which is why it is important that it can find the file as well
    • statfile
      • Identifies the file to use as the statfile
      • The dispatcher uses this file to register the time of the most recent content update
      • The statfile can be any file on the web server
      • If the statfileslevel attribute is configured then the statfile is ignored
    • serveStaleOnError
      • If set to "1" the dispatcher will retain the cached files until the publisher returns valid content for the request
    • allowAuthorized
      • Controls whether requests that contain any of the following authentication information are cached
        • The authorization header
        • A cookie named authorization
        • A cookie named login-token
      • Setting it to "1" allows caching of pages with authorization
    • rules
      • Specifies the pages to cache according to the document path
      • The dispatcher will never cache a page under the following conditions
        • The HTTP method is not GET
          • Other common methods are POST for form data and HEAD for the HTTP header
        • The request URI contains a question mark ("?")
          • This usually indicates a dynamic page, such as a search result that does not need to be cached
        • The file extension is missing
          • The web server needs the extension to determine the document type (the MIME type)
        • The authentication header is set
          • This can be configured with the allowAuthorized instruction
        • The AEM instance responds with one of the following headers:
          • no-cache
          • no-store
          • must-revalidate
      • Rules are similar to filters, but only support the glob functionality
      • glob
        • The URL structure for the rule
        • Supports wildcards
      • type
        • The type of rule
        • Either allow or deny
  • statfileslevel
    • If configured the statfile instruction is ignored
    • The dispatcher creates .stat files in each folder from the docroot folder down to the level that you specify. The docroot folder is level 0
    • When a page is updated the dispatcher locates the folder on the file path that is at the statfileslevel, and invalidates all files below that folder
  • invalidate
    • Allows configuration to automatically invalidate cached files
    • Uses the same structure as the cache rules
    • With automatic invalidation, the dispatcher does not delete cached files after a content update but checks their validity when they are next requested
    • Documents in the cache that are not auto-invalidate will remain in the cache until a content update explicitly deletes them.
    • Automatic invalidation is typically used for HTML pages
    • HTML pages often contain links to other pages, making it difficult to determine whether a content update affects a page.
  • invalidateHandler
    • Allows you to define a script which is called for each invalidation request received by the dispatcher
    • Can be used, for example, to clear content from a CDN
    • The script will be called with 3 parameters
      • Handle
        • The content path that is invalidated
      • Action
        • The replication action, e.g. Activate, Deactivate
      • Action Scope
        • The replication action's scope
  • allowedClients
    • Limits the clients that can call the dispatcher flush function
    • Should be limited to the dispatcher's publish servers
    • glob
      • The server address
      • Can use wildcards
    • type
      • The type of rule
      • Either allow or deny
  • ignoreUrlParams
    • Defines which URL parameters are ignored or allowed to be part of the cache structure
    • This is generally not required as the query parameters should be removed at the web server
  • headers
    • Defines which response headers are cached at the dispatcher
  • enableTTL
    • When set to "1", the dispatcher uses the Cache-Control: max-age or Expires headers to determine the validity of the cached file
  • statistics
    • Defining statistic categories for load-balancing calculations
    • Defines categories of files for which Dispatcher scores the responsiveness of each render. The dispatcher uses the scores to determine which render to send a request
    • categories
      • Define a category for each type of document for which you want to keep statistics for render selection. The /statistics section contains a /categories section
      • glob
        • The structure of the path
  • unavailablePenalty
    • Defines the time in tenths of a second that is applied to the render statistics when a connection to the render fails
  • stickyConnectionsFor
    • Defines a list of paths for which the dispatcher will make sure a user's requests always go to the same server
  • health_check
    • The URL to use to determine service availability
    • url
      • The URL to call on the publisher for a health check
  • retryDelay
    • The delay in seconds before retrying a failed connection
  • failover
    • Resend requests to different renders when the original request fails
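
Putting several of these instructions together, a single publish farm might look something like the sketch below. Treat it as an illustration only: the farm name, host names, IP addresses, paths, included file names, and all numeric values are placeholders I have chosen, and the filter and cache sections are pulled in from separate (hypothetical) files to keep it short.

```
/publishfarm
	{
	/clientheaders
		{
		$include "clientheaders.any"
		}
	/virtualhosts
		{
		"www.application.com"
		"*.application.com"
		}
	/renders
		{
		# Two renders, so the dispatcher load balances between them
		/publish1
			{
			/hostname "10.0.0.21"
			/port "4503"
			# Both timeouts are in milliseconds
			/timeout "10000"
			/receiveTimeout "600000"
			}
		/publish2
			{
			/hostname "10.0.0.22"
			/port "4503"
			}
		}
	/filter
		{
		$include "filters.any"
		}
	/cache
		{
		$include "cache.any"
		}
	/statistics
		{
		/categories
			{
			# Score HTML pages separately from everything else
			/html { /glob "*.html" }
			/others { /glob "*" }
			}
		}
	# Keep a user on the same render for a hypothetical account area
	/stickyConnectionsFor "/content/mysite/account"
	/health_check { /url "/health_check.html" }
	/retryDelay "1"
	/unavailablePenalty "1"
	/failover "1"
	}
```
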
Apache Config
The dispatcher generally runs on an Apache httpd server. While it can be any one of a number of web servers, Apache httpd is the most common and provides for some extra configuration.

In your dispatcher config, you will generally restrict access to certain pages within AEM. To do this we put a set of rewrite rules in place:

RewriteEngine on
LogLevel info rewrite:info
RewriteCond %{REQUEST_URI} ^/crx [OR]
RewriteCond %{REQUEST_URI} ^/apps [OR]
RewriteCond %{REQUEST_URI} ^/home [OR]
RewriteCond %{REQUEST_URI} ^/tmp [OR]
RewriteCond %{REQUEST_URI} ^/var [OR]
RewriteCond %{REQUEST_URI} ^/libs.*(?<!/granite/csrf/token.json)$
RewriteRule ^(.*)$ /error$1.html [R,L]
This tells Apache to redirect any URL starting with /crx, /apps, /home, /tmp, /var, or /libs (as long as it doesn't end with /libs/granite/csrf/token.json) to /error, followed by the original path and .html.

The reason for putting .html at the end is that it turns the request into one for an HTML page no matter what extension was passed, so AEM will handle it with the standard error handler pages for HTML.

If you are directing calls to your author via the dispatcher then do not use this Apache configuration. Otherwise your author will not work correctly, as it blocks access to non-client-facing libraries and services.


By aem4beginner
