AEM / Adobe CQ5 : Set Up CQ Cluster and Troubleshoot

Set Up CQ Cluster
There are several ways to set up a cluster in CQ. The two most common are:
§ Automatic Cluster Join through UI
§ Manual Cluster Join
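To understand clustering, it helps to know the files involved: cluster_node.id, cluster.properties, clustered.txt, and bootstrap.properties, all of which are discussed below.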

Cluster Join Through UI:
§ To add a node through the UI, go to the welcome screen and click "Clustering", or go to <HOST>:<PORT>/libs/granite/cluster/content/admin.html

§ Then enter the master URL: http://<HOST>:<PORT>/crx/explorer
§ Then enter the username and password.
§ Then click Join.

§ Once the nodes are in the cluster, you can always go back to <HOST>:<PORT>/libs/granite/cluster/content/admin.html to check the status of the cluster.
§ If you join the cluster through the UI, you will see a directory such as /crx-quickstart/crx.XXX on the slave node's file system. This is essentially the repository for the cluster.

Manual Cluster Join:
§ These steps assume that you are creating a shared-nothing cluster.
§ To join a cluster manually, first take a file system backup of any node in the cluster. This means stopping a cluster node and backing up the folder that contains the quickstart jar file, the crx-quickstart folder, and your license file. You can also take an online backup (we will cover this later).
§ Put the file system backup on another machine. (Depending on how you took your backup, you can restore the whole file system.)
§ Then go to the /crx-quickstart/repository folder and remove the cluster_node.id file.
§ Open the /crx-quickstart/repository/cluster.properties file and add the IP address of the master instance to the addresses property. You can also run echo "addresses=x.x.x.x" >> crx-quickstart/repository/cluster.properties where x.x.x.x is the IP of the master instance.
§ Make sure that Cluster_Id is the same on both the master and the slave instance.
§ Start the slave instance.
§ Check the logs of both the master and the slave. In the master log you should see a message saying that the slave is connected as soon as the slave is started. In the slave log you should see a message about connecting to the master.
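The two file edits in the manual join can be scripted. The following is a minimal Python sketch, assuming the restored copy keeps its repository under crx-quickstart/repository as described above. The function name `prepare_slave_repository` is hypothetical, and the sketch does not verify Cluster_Id or start the instance.

```python
from pathlib import Path


def prepare_slave_repository(repository_dir: str, master_ip: str) -> None:
    """Prepare a restored copy of a cluster node so it can join as a slave.

    Hypothetical helper: it only mirrors the two manual file edits
    described in the steps above.
    """
    repo = Path(repository_dir)

    # Remove the copied node id file, as the manual steps require.
    node_id_file = repo / "cluster_node.id"
    if node_id_file.exists():
        node_id_file.unlink()

    # Append the master address, like `echo "addresses=..." >> cluster.properties`.
    # A newline is added first if the existing file does not end with one,
    # so the new property never ends up glued to the previous line.
    properties_file = repo / "cluster.properties"
    existing = ""
    if properties_file.exists():
        existing = properties_file.read_text(encoding="utf-8")
    separator = "" if existing == "" or existing.endswith("\n") else "\n"
    with properties_file.open("a", encoding="utf-8") as handle:
        handle.write(f"{separator}addresses={master_ip}\n")
```

For example, `prepare_slave_repository("crx-quickstart/repository", "x.x.x.x")` would be run on the new machine after restoring the backup and before starting the slave.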

Troubleshoot CQ Clustering
Here are some common questions and answers that help in understanding clustering better.
§ Question: Where does the crx.xxx directory get created?
§ Answer: On the slave node, when it first joins the cluster. It becomes the current repository. Note that the existence of this directory does not mean that the node is a slave. See below for how to determine which node is the master.

§ Question: How can I determine which node is the master?
§ Answer: Go to http://<host>:<port>/crx/config/cluster.jsp on any node.

§ Question: All my instances are down. How can I determine which one was the last master?
§ Answer: If all the instances are down, the clustered.txt file is present only on the slave nodes (if everything is fine). The instance that does not have a clustered.txt file is the master node.
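This check is easy to automate when several nodes are involved. Below is a minimal Python sketch, assuming you can reach each node's current repository directory from one place (for example through mounted copies). The function names and the node labels are hypothetical.

```python
from pathlib import Path
from typing import Dict, List


def is_slave(repository_dir: str) -> bool:
    """Return True if the clustered.txt marker exists in the repository directory."""
    return (Path(repository_dir) / "clustered.txt").is_file()


def find_master_candidates(repository_dirs: Dict[str, str]) -> List[str]:
    """Return the labels of the nodes that have no clustered.txt marker.

    `repository_dirs` maps an arbitrary node label to the path of that
    node's current repository directory (an assumption of this sketch).
    """
    return [name for name, repo in repository_dirs.items() if not is_slave(repo)]
```

In a healthy cluster the result should contain exactly one node. An empty list means every node has the marker, which is the power failure case covered in the last question of this post.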

§ Question: How can I determine which is my current repository directory?
§ Answer: Check the repository.home property in bootstrap.properties on the node. If there is no crx.xxx directory, then crx-quickstart is the current directory.
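Reading that property can be scripted as well. This is a minimal Python sketch, assuming bootstrap.properties uses simple key=value lines. It does not handle the full Java properties syntax (escapes, ":" separators, line continuations), and the function name is hypothetical.

```python
from pathlib import Path
from typing import Optional


def read_repository_home(bootstrap_file: str) -> Optional[str]:
    """Return the value of repository.home, or None if the property is absent."""
    text = Path(bootstrap_file).read_text(encoding="utf-8")
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith(("#", "!")):
            continue
        key, separator, value = line.partition("=")
        if separator and key.strip() == "repository.home":
            return value.strip()
    return None
```

The caller passes the path to the bootstrap.properties file of the node in question. A return value of None means the property was not found in that file.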

§ Question: Are writes always performed through the master?
§ Answer: Yes.

§ Question: What if the master in a cluster goes down?
§ Answer: The slave will become the master. If you have multiple nodes in the cluster, one of the slaves will become the master based on an election. To become the master, the slave removes the clustered.txt file from /crx-quickstart/crx.XXX and switches to the master role.

§ Question: What if the old master then comes back online?
§ Answer: The current master will continue to be the master in the cluster. The old master will become a slave, and you can verify this by the existence of clustered.txt under its /repository folder.

§ Question: What if the current master (the old slave) goes down again?
§ Answer: The current slave will become the master (if there are multiple nodes, one of the slaves will become the current master based on an election).

§ Question: What is the best way to install a hotfix (HF) on a cluster node?
§ Answer: Install it on the master (use the method above to determine which node is the master) -> let it sync to the slave -> check the slave's package manager to make sure it is installed -> click the reinstall option for the CRX hotfix package in the slave's package manager -> stop the master -> make sure it is down -> stop the slave -> make sure it is down -> on an instance that has a crx.XXX folder, check the current repository in the bootstrap.properties file and then copy crx-quickstart/crx.XXX/patches to crx-quickstart/repository (or install the jar file manually on the slave instance) -> start the master -> make sure it is up -> start the slave -> check the repository version by going to the repository configuration and searching for jcr.repository.version

§ Question: At some point I want to run the instance as a stand-alone system and make crx-quickstart my current directory. What should I do?
§ Answer: On the master instance, where there is no crx.xxx folder, you probably don't have to do anything. On a slave instance that has a crx.xxx folder, first make sure which one is the current repository (you can do that by checking the bootstrap.properties file). Then make sure that your system is stopped -> rename the repository folder under the crx-quickstart folder -> rename crx.xxx to repository -> move it to the crx-quickstart folder -> delete the bootstrap.properties file -> delete cluster* under crx-quickstart/repository -> delete revision.log -> delete tarJournal -> restart the system. Note that if you want to keep crx.xxx as the current directory, you don't have to do anything.

§ Question: What about tar optimization on a cluster instance? (We will cover this later)
§ Answer: Tar optimization always runs on the master node in a cluster environment. If you are optimizing tar files in a cluster, you need to ensure that the tar optimization times are set to the same value on all cluster nodes. For example, <param name="autoOptimizeAt" value="1:00-4:00"/>
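Whether all nodes use the same optimization window can be checked with a small script. The following Python sketch assumes you have already read each node's configuration XML into a string. It looks only for param elements named autoOptimizeAt, wherever they occur, and makes no assumption about which file or parent element holds them. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def auto_optimize_windows(config_xml: str) -> List[str]:
    """Return the values of all autoOptimizeAt params found in the XML text."""
    root = ET.fromstring(config_xml)
    return [
        param.get("value", "")
        for param in root.iter("param")
        if param.get("name") == "autoOptimizeAt"
    ]


def optimization_times_match(configs: Dict[str, str]) -> bool:
    """Return True if every node's configuration lists the same window(s).

    `configs` maps an arbitrary node label to that node's configuration XML.
    """
    windows = {tuple(auto_optimize_windows(xml)) for xml in configs.values()}
    return len(windows) <= 1
```

If `optimization_times_match` returns False, compare the output of `auto_optimize_windows` per node to see which one differs.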

§ Question: How about Datastore Garbage collection? (We will cover this later)
§ Answer: See http://dev.day.com/content/kb/home/Crx/CrxSystemAdministration/DataStoreGarbageCollection.html for that.

§ Question: What if the master is stopped in the middle of the sync process?
§ Answer: If this is a graceful stop, the master gives the slave 60000 ms to sync up. If the slave finishes syncing before that, the master stops once the sync is complete. Check the cluster system properties to see how to configure this time.

§ Question: How can I make sure that one particular node is always the master in a cluster?
§ Answer: You need to set "preferredMaster" to "true" for that node. For more information, please check http://dev.day.com/docs/en/crx/current/administering/persistence_managers.html

§ Question: How does replication work in a clustered environment?
§ Answer: Similar to a write operation, replication is delegated to the master if it is triggered from the slave.

§ Question: OK, I understand the normal scenario, but what happens to the cluster when there is a network issue?
§ Answer: If you are not sure about the network connection, or there are frequent network problems between cluster nodes, shared-nothing clustering is not recommended. If you choose shared-nothing clustering anyway, the slave will try to read from the master, and when it has been unable to do so for some time you will get a "Read from master timed out." error and the slave will be disconnected.

§ Question: Then what should I do when there is a network issue?
§ Answer: You should stop the slave and restart it when the network is back to normal. Another option is to set the "becomeMasterOnTimeout" parameter on the slave (in repository.xml). This makes the slave a master when the timeout happens. The problem is that you would then have two masters at the same time, so this is not highly recommended.

§ Question: What happens if my cluster instances are in different time zones?
§ Answer: It is not recommended to have cluster instances in different time zones. It can create problems with tar optimization, backup and restore, garbage collection, and data tar file timestamps (mismatches).

§ Question: How do I recover from a power failure, or from a case where all nodes have a clustered.txt file and it is difficult to determine which one was the last master?
§ Answer: If the file 'clustered.txt' exists on all cluster nodes (for example because a power failure caused all cluster nodes to stop at the same time, or an online backup was restored on all cluster nodes), then the file needs to be deleted on one of the cluster nodes. To find out where to delete the marker file, compare the size of the last data*.tar files of the default workspace, the version workspace, and the tarJournal directory. The file clustered.txt needs to be deleted on the cluster node that has more data than the other cluster nodes (or on any cluster node if all cluster nodes have the same amount of data).
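The size comparison can be sketched in Python. The sketch below makes several assumptions: you pass in, per node, the paths of the default workspace, version workspace, and tarJournal directories yourself; the "last" data*.tar file is the one with the highest number in its name; and summing the three sizes is a reasonable way to rank nodes. Treat the result as a hint to verify by hand, not as a definitive answer. All function names are hypothetical.

```python
import re
from pathlib import Path
from typing import Dict, List


def _tar_number(path: Path) -> int:
    """Extract the first number in a tar file name, or -1 if there is none."""
    match = re.search(r"\d+", path.stem)
    return int(match.group()) if match else -1


def last_tar_size(directory: str) -> int:
    """Size in bytes of the highest-numbered data*.tar file, or 0 if none exists."""
    tar_files = sorted(Path(directory).glob("data*.tar"), key=_tar_number)
    return tar_files[-1].stat().st_size if tar_files else 0


def node_with_most_data(nodes: Dict[str, List[str]]) -> str:
    """Return the label of the node whose last tar files are largest in total.

    `nodes` maps a node label to the list of directories to inspect.
    On a tie, the first node in the mapping is returned.
    """
    totals = {
        name: sum(last_tar_size(directory) for directory in directories)
        for name, directories in nodes.items()
    }
    return max(totals, key=lambda name: totals[name])
```

The node returned by `node_with_most_data` is the candidate on which clustered.txt would be deleted.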


March 17, 2020