April 16, 2020

Setup TarMK Cold Standby in AEM 6

With the move from Jackrabbit to Oak, AEM 6 introduces the long-awaited TarMK Cold Standby architecture. It is offered as an alternative to the traditional master-slave setup, to reduce risk in failover situations.

Note: TarMK Cold Standby sync is linear from the primary to the standby node, without any repository corruption check. This means that if the primary is corrupted, the standby will be corrupted as well. By design, the standby instance is an exact copy of the primary and cannot help if the primary instance gets corrupted.

After completing this tutorial you will have a clear understanding of:
How TarMK Cold Standby works.
How to set up TarMK Cold Standby in AEM.
How to debug the first-time sync between the primary and standby instances.
How to switch from primary to standby during failover.
Advantages and disadvantages of using TarMK Cold Standby.

How TarMK Cold Standby works:

The TarMK Cold Standby approach allows one or more standby AEM instances to connect to a primary instance. A standby instance is a live working copy of the primary (master) repository and ensures a quick switchover without data loss if the primary becomes unavailable for any reason.

On the primary AEM instance, a TCP port is opened to listen for incoming messages. The standby instance sends two types of messages to the primary instance:
A message requesting the segment ID of the current head.
A message requesting segment data with a specified ID.
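
Because the standby initiates every request, a quick sanity check before starting the sync is to confirm that the primary's TCP port accepts connections from the standby host. The following is a minimal sketch: the helper name port_is_open is hypothetical, and it only tests TCP reachability; it does not speak the cold standby protocol.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

Run from the standby host, port_is_open("127.0.0.1", 8023) would use the example host and port from the configuration in this tutorial; replace both with your own values.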

Note: Standby instances do not serve any requests, because they run in sync-only mode. Only the Felix Console is accessible on a standby instance, for configuring OSGi services and components.

The figure below shows a typical TarMK Cold Standby deployment architecture:


Note: It is recommended to configure a load balancer between the dispatcher and the primary instance, and the load balancer should direct all traffic to the primary instance only.

In addition to the above architecture, you can also map a network drive and schedule a job, run daily or once or twice a week, that backs up the latest crx-quickstart folder of the primary. This gives you an additional backup to restore from when the primary instance is corrupted, because with linear sync the standby becomes corrupted as soon as the primary does.
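
The periodic backup could be scripted in many ways. Below is a minimal sketch, assuming it runs at a time when copying the folder is safe for your setup and that the paths you pass in are your own; backup_crx_quickstart is a hypothetical helper name, not an AEM tool.

```python
import shutil
import time
from pathlib import Path


def backup_crx_quickstart(source_dir: str, backup_root: str) -> Path:
    """Archive source_dir into backup_root as a timestamped .tar.gz file."""
    source = Path(source_dir)
    if not source.is_dir():
        raise FileNotFoundError(f"not a directory: {source}")
    target_root = Path(backup_root)
    target_root.mkdir(parents=True, exist_ok=True)
    # The timestamp keeps successive backups from overwriting each other.
    stamp = time.strftime("%Y%m%d-%H%M%S")
    base_name = target_root / f"{source.name}-{stamp}"
    # make_archive appends the extension and returns the archive path.
    archive = shutil.make_archive(
        str(base_name), "gztar", root_dir=str(source.parent), base_dir=source.name
    )
    return Path(archive)
```

The script itself does no scheduling; it would be triggered by whatever scheduler the host already provides.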

Setup TarMK Cold Standby in AEM
In this tutorial, I have used the default node store (segment store) for data storage.

Follow the steps below to set up the primary instance:
Install AEM 6.1 and let it create the crx-quickstart folder.
Shut down the instance and copy the crx-quickstart (installation) folder from the primary to the standby instance.


Note: It is advisable to give descriptive names to the folders, such as aem-primary and aem-standby, to tell them apart.
Create a folder named install at aem-primary/crx-quickstart/install. If this folder already exists, check and delete the content inside it.
Create a folder install.primary (or any descriptive name) at aem-primary/crx-quickstart/install/install.primary to store node store or data store related configuration files.
Create a config file named org.apache.jackrabbit.oak.plugins.segment.standby.store.StandbyStoreService.config and enter the values below:

org.apache.sling.installer.configuration.persist=B"false"
mode="primary"
port=I"8023"
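
Creating the folder and file by hand works, but the same step can be scripted. The sketch below writes the values listed above into the install folder. write_primary_config is a hypothetical helper, and the folder name install.primary is only the example name used in this tutorial.

```python
from pathlib import Path

CONFIG_NAME = (
    "org.apache.jackrabbit.oak.plugins.segment.standby.store."
    "StandbyStoreService.config"
)

# Values copied from the tutorial; B"..." marks a Boolean and I"..." an Integer.
PRIMARY_CONFIG = (
    'org.apache.sling.installer.configuration.persist=B"false"\n'
    'mode="primary"\n'
    'port=I"8023"\n'
)


def write_primary_config(crx_quickstart: str, folder: str = "install.primary") -> Path:
    """Create install/<folder> under crx_quickstart and write the config file."""
    target_dir = Path(crx_quickstart) / "install" / folder
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / CONFIG_NAME
    target.write_text(PRIMARY_CONFIG, encoding="utf-8")
    return target
```

For example, write_primary_config("aem-primary/crx-quickstart") would create the file under aem-primary/crx-quickstart/install/install.primary.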

Start the primary instance with the “primary” run mode. For example, on Windows set:

set CQ_RUNMODE=primary,crx3,crx3tar

Create a new log file to store logs related to TarMK sync from the primary.
Go to Felix console.
Create a new Apache Sling Logging Logger for the org.apache.jackrabbit.oak.plugins.segment package.
Set log level to DEBUG.

Note: Change the log level from DEBUG to ERROR or INFO after the first sync, otherwise your TarMK log file size will increase very rapidly.

Follow the steps below to set up the standby instance:
Now go to the standby instance and run the jar file under the aem-standby folder.
Create the same logging configuration as on the primary instance.
Once done, stop the instance.
Check and delete the content under the folder aem-standby/crx-quickstart/install.
Create a folder install.standby (or any descriptive name) at aem-standby/crx-quickstart/install/install.standby to store node store or data store related configuration files.
Create a config file named org.apache.jackrabbit.oak.plugins.segment.standby.store.StandbyStoreService.config and enter the values below:

org.apache.sling.installer.configuration.persist=B"false"
mode="standby"
primary.host="127.0.0.1"
port=I"8023"
secure=B"false"
interval=I"5"

Note: Change primary.host to the IP address of the primary instance.
Create a config file named org.apache.jackrabbit.oak.plugins.segment.SegmentNodeStoreService.config and enter the values below:

name="Oak-Tar"
service.ranking=I"100"
standby=B"true"
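
A mismatch between the primary and standby StandbyStoreService files (for example, different ports) is an easy mistake to make. As a rough sketch, the following parses the simple key=value lines used in this tutorial and reports obvious inconsistencies. Both function names are hypothetical, and the parser only handles the B"...", I"..." and plain string forms shown above, not the full .config syntax.

```python
import re

# Matches key=value lines where the value is "text", B"text" or I"text".
_LINE = re.compile(r'^\s*([\w.]+)\s*=\s*([BI]?)"(.*)"\s*$')


def parse_config(text: str) -> dict:
    """Parse the subset of the .config format used in this tutorial."""
    values = {}
    for line in text.splitlines():
        match = _LINE.match(line)
        if not match:
            continue  # skip blank or unsupported lines
        key, type_code, raw = match.groups()
        if type_code == "B":
            values[key] = raw.lower() == "true"
        elif type_code == "I":
            values[key] = int(raw)
        else:
            values[key] = raw
    return values


def check_pair(primary: dict, standby: dict) -> list:
    """Return a list of human-readable problems; empty means none found."""
    problems = []
    if primary.get("mode") != "primary":
        problems.append("primary config does not set mode=primary")
    if standby.get("mode") != "standby":
        problems.append("standby config does not set mode=standby")
    if primary.get("port") != standby.get("port"):
        problems.append("port differs between primary and standby")
    if not standby.get("primary.host"):
        problems.append("standby config has no primary.host")
    return problems
```

Reading both files, passing their contents through parse_config, and then calling check_pair gives a short list of things to fix before the first start.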
Start the standby instance with the “standby” run mode. For example, on Windows set:

set CQ_RUNMODE=standby,crx3,crx3tar

The above configurations can also be done from the Felix Console:
Go to Felix Console -> Configuration Manager.
Search for “Apache Jackrabbit Oak TarMK Cold Standby service”.
Change the settings, save them, and restart the instance for the changes to take effect.


Note: Once saved, AEM creates a config file like the one we created above and stores all the configuration values at \aem-primary\crx-quickstart\launchpad\config\org\apache\jackrabbit\oak\plugins\segment\standby\store.

Debug First time Sync between Primary and Standby Instance:

Once the setup is complete and the standby instance starts syncing with the primary instance, you can verify that the sync started properly by comparing your logs with the debug logs below.

Standby Instance Logs:
Open tarmk-coldstandby.log on the standby instance. You will see log entries like the ones below for reading a segment, receiving a segment, and writing a segment, which means the sync has started.

*DEBUG* [defaultEventExecutorGroup-2-1] org.apache.jackrabbit.oak.plugins.segment.standby.store.StandbyStore trying to read segment ec1f739c-0e3c-41b8-be2e-5417efc05266

*DEBUG* [nioEventLoopGroup-3-1] org.apache.jackrabbit.oak.plugins.segment.standby.codec.SegmentDecoder received type 1 with id ec1f739c-0e3c-41b8-be2e-5417efc05266 and size 262144


*DEBUG* [defaultEventExecutorGroup-2-1] org.apache.jackrabbit.oak.plugins.segment.standby.store.StandbyStore got segment ec1f739c-0e3c-41b8-be2e-5417efc05266 with size 262144

*DEBUG* [defaultEventExecutorGroup-2-1] org.apache.jackrabbit.oak.plugins.segment.file.TarWriter Writing segment ec1f739c-0e3c-41b8-be2e-5417efc05266 to /mnt/crx/author/crx-quickstart/repository/segmentstore/data00016a.tar


Open error.log on the standby instance. You will see the sync configuration and its status reported as started.

*INFO* [FelixStartLevel] org.apache.jackrabbit.oak.plugins.segment.standby.store.StandbyStoreService started standby sync with 10.20.30.40:8023 at 5 sec.

Primary Instance Logs:
Open tarmk-coldstandby.log on the primary instance. You will see log entries like the ones below; here the client is our standby instance.

*DEBUG* [nioEventLoopGroup-3-2] org.apache.jackrabbit.oak.plugins.segment.standby.store.CommunicationObserver got message ‘s.d45f53e4-0c33-4d4d-b3d0-7c552c8e3bbd’ from client c7a7ce9b-1e16-488a-976e-627100ddd8cd

*DEBUG* [nioEventLoopGroup-3-2] org.apache.jackrabbit.oak.plugins.segment.standby.server.StandbyServerHandler request segment id d45f53e4-0c33-4d4d-b3d0-7c552c8e3bbd

*DEBUG* [nioEventLoopGroup-3-2] org.apache.jackrabbit.oak.plugins.segment.standby.server.StandbyServerHandler sending segment d45f53e4-0c33-4d4d-b3d0-7c552c8e3bbd to /10.20.30.40:34998

*DEBUG* [nioEventLoopGroup-3-2] org.apache.jackrabbit.oak.plugins.segment.standby.store.CommunicationObserver did send segment with 262144 bytes to client c7a7ce9b-1e16-488a-976e-627100ddd8cd
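
To get a rough idea of how much data the primary has served, the "did send segment with N bytes" entries can be summed per client. This is only a sketch that assumes the log line format shown above; the function name is hypothetical.

```python
import re
from typing import Iterable

# Captures the byte count and the client id from CommunicationObserver lines.
_SENT = re.compile(r"did send segment with (\d+) bytes to client (\S+)")


def bytes_sent_per_client(lines: Iterable[str]) -> dict:
    """Sum the bytes reported as sent, keyed by client id."""
    totals = {}
    for line in lines:
        match = _SENT.search(line)
        if match:
            size, client = int(match.group(1)), match.group(2)
            totals[client] = totals.get(client, 0) + size
    return totals
```

An open file object for the primary's tarmk-coldstandby.log can be passed directly as lines, since iterating a file yields one line at a time.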



Note: Once these log entries stop appearing, you can assume that the sync is complete.
You can monitor the log entry below to see which tar file is currently being synchronized.

*DEBUG* [defaultEventExecutorGroup-156-1] org.apache.jackrabbit.oak.plugins.segment.file.TarWriter Writing segment 3a03fafc-d1f9-4a8f-a67a-d0849d5a36d5 to /<<CQROOTDIRECTORY>>/crx-quickstart/repository/segmentstore/data00014a.tar
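
Rather than scanning the log by eye, the most recent TarWriter entry can be pulled out with a small script. The sketch below assumes the line format shown above; last_tar_file is a hypothetical helper name.

```python
import re
from typing import Iterable, Optional

# Captures the segment id and the target tar file from TarWriter lines.
_WRITING = re.compile(r"TarWriter Writing segment (\S+) to (\S+)")


def last_tar_file(lines: Iterable[str]) -> Optional[str]:
    """Return the tar file named in the last 'Writing segment' line, if any."""
    last = None
    for line in lines:
        match = _WRITING.search(line)
        if match:
            last = match.group(2)
    return last
```

If the function returns None, no segment writes were found in the lines it was given.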


Switch from primary to standby during failover:
Steps to follow when the primary instance goes down or crashes in a production environment:
Remove the primary instance from the load balancer, if you are using one.
Stop the standby instance and bring it up as the primary instance (by changing the run mode to primary).
Note: Take a backup of the standby crx-quickstart folder if you suspect the primary instance is corrupted and you want to keep the current instance as primary and create a new standby instance.
Add the new primary instance to the load balancer.

Advantages and Disadvantages of using TarMK Cold Standby

Advantages of using the TarMK Cold Standby architecture:
It is robust: it uses checksums on all packets to detect damaged packets and handles network-related issues automatically.
As all instances run within the same intranet, a security breach becomes difficult. Furthermore, we can restrict the IP range allowed to access the primary or standby instance from the Felix Console.
The failover process is simple and fast.

Disadvantages of using the TarMK Cold Standby architecture:
It is not multi-threaded, so multiple cores cannot speed up the sync process.
One server is idle most of the time.
The failover is not triggered automatically.


By aem4beginner
