October 1, 2020

Robots.txt file in AEM websites

When we think about AEM websites, SEO is one of the major considerations. To ensure crawlers can index the site properly, we need a sitemap.xml and a robots.txt that points the crawlers to the corresponding sitemap.xml.

A robots.txt file lives at the root of the website. The image below illustrates its role: it acts as the entry point for crawlers and ensures they access only the content we have explicitly allowed.

(Image: robots.txt in AEM websites)
Let us see how we can implement a robots.txt file on an AEM website. There are many ways to do this; below is one of the easiest.

Say we have a multi-lingual website with language roots /en, /fr, /gb and /in.
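These short paths assume the usual setup where content under /content/[sitename]/<language> is exposed at a shortened public URL, for example:

/content/[sitename]/en  ->  https://[sitename]/en/
/content/[sitename]/fr  ->  https://[sitename]/fr/
/content/[sitename]/gb  ->  https://[sitename]/gb/
/content/[sitename]/in  ->  https://[sitename]/in/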

Let us see how we can enable robots.txt in our case.

Add robots.txt in Author
Log in to CRXDE and create a file called 'robots.txt' under the path /content/dam/[sitename].
Ensure the following lines are added to the 'robots.txt' on the author instance, then publish it. A command-line alternative is sketched after the file contents.

#Any search crawler can crawl our site
User-agent: *

#Allow only below mentioned paths
Allow: /en/
Allow: /fr/
Allow: /gb/
Allow: /in/
#Disallow everything else
Disallow: /

#Crawl all sitemaps mentioned below
Sitemap: https://[sitename]/en/sitemap.xml
Sitemap: https://[sitename]/fr/sitemap.xml
Sitemap: https://[sitename]/gb/sitemap.xml
Sitemap: https://[sitename]/in/sitemap.xml
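If you prefer the command line over CRXDE, the file can also be uploaded and activated with curl; below is a rough sketch against a local author instance (host, port, credentials and [sitename] are placeholders, and it assumes the standard AEM asset-upload and replication servlets).

# Upload robots.txt into the DAM folder on the author instance
curl -u admin:admin -X POST -F file=@"robots.txt" http://localhost:4502/content/dam/[sitename].createasset.html

# Activate (publish) the uploaded file
curl -u admin:admin -X POST -F cmd="Activate" -F path="/content/dam/[sitename]/robots.txt" http://localhost:4502/bin/replicate.json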

Now publish the robots.txt.

Add OSGi configurations for URL mapping

Now add the below entry in the OSGi web console (ConfigMgr), under 'Apache Sling Resource Resolver Factory'.

Add the below mapping in the 'URL Mappings' section:
/content/dam/sitename/robots.txt>/robots.txt$
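The same mapping can also be shipped with your code base instead of being edited by hand in the console. Below is a rough sketch, assuming the 'Apache Sling Resource Resolver Factory' PID org.apache.sling.resourceresolver.impl.ResourceResolverFactoryImpl and the .cfg.json format, deployed to the publish run mode (for example under ui.config as config.publish/org.apache.sling.resourceresolver.impl.ResourceResolverFactoryImpl.cfg.json); on older setups a sling:OsgiConfig node with the same properties works as well.

{
  "resource.resolver.mapping": [
    "/:/",
    "/content/dam/sitename/robots.txt>/robots.txt$"
  ]
}

The "/:/" entry keeps the default identity mapping in place; the second entry is the same one added in the console above.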

Add rewrite rule / allow access to robots.txt via the dispatcher

Finally, allow the crawlers to reach robots.txt through the dispatcher.

Add allow rule for robots.txt in dispatcher
/0010 { /type "allow" /url "/robots.txt"}
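For context, this rule goes inside the /filter section of your farm in dispatcher.any. A minimal sketch is shown below; the rule numbers and the default deny rule are illustrative, so pick a number that does not clash with your existing filters.

/filter
  {
  # Deny everything by default
  /0001 { /type "deny" /url "*" }
  # ... existing allow rules for the site ...
  # Allow crawlers to fetch robots.txt
  /0010 { /type "allow" /url "/robots.txt" }
  }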

When you hit https://www.[sitename]/robots.txt, you should see the robots.txt file on the public domain.
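A quick way to verify this from the command line (the domain is a placeholder):

# Should return 200 with the contents of the robots.txt served from the DAM
curl -i https://www.[sitename]/robots.txt

# The sitemaps referenced in it should also be reachable
curl -I https://[sitename]/en/sitemap.xml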

Now any search engine that tries to access our site will find the robots.txt and recognize whether it has permission to crawl the site and which areas of the site are open to crawling.

Some sample robots.txt usage is given below.

# Disallow Googlebot accessing example.com/directory1/... and example.com/directory2/...
# but allow access to subdirectories -> directory2/subdirectory1/...
# All other directories on the site are allowed by default.
User-agent: googlebot
Disallow: /directory1/
Disallow: /directory2/
Allow: /directory2/subdirectory1/

# Block the entire site from xyzcrawler.
User-agent: xyzcrawler
Disallow: /

Let me know via the comments section if you find a better way to do this.


By aem4beginner
