Showing posts with label Data Store. Show all posts

May 10, 2020

DataStore and NodeStore in AEM

In Adobe Experience Manager (AEM), binary data can be stored independently from the content nodes. The binary data is stored in a data store, whereas content nodes are stored in a node store.
Both data stores and node stores can be configured using OSGi configuration. Each OSGi configuration is referenced using a persistent identifier (PID).  

Node Store
Currently, there are two storage implementations available in AEM 6: Tar storage and MongoDB storage.
Configure the node store by creating a configuration file, named after the node store option you want to use, in the crx-quickstart/install folder. The two options are:
Segment node store
Document node store

Segment Node Store

The segment node store is the basis of Adobe’s TarMK implementation in AEM6. It uses the org.apache.jackrabbit.oak.segment.SegmentNodeStoreService PID for configuration.  

You can configure the following options:
repository.home: Path to repository home under which repository-related data is stored. By default, segment files are stored under the crx-quickstart/segmentstore directory.
tarmk.size: Maximum size of a segment in MB. The default maximum is 256MB.
customBlobStore: Boolean value indicating that a custom data store is used. The default value is false. 
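As a sketch, the options above could be set in a file named org.apache.jackrabbit.oak.segment.SegmentNodeStoreService.config placed in crx-quickstart/install (the values below are illustrative, not required settings):

```
# crx-quickstart/install/org.apache.jackrabbit.oak.segment.SegmentNodeStoreService.config
# Illustrative values only -- adjust for your deployment.
repository.home=crx-quickstart/repository
tarmk.size=256
customBlobStore=false
```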

The Tar storage uses tar files. It stores the content as various types of records within larger segments. Journals are used to track the latest state of the repository.
It was built around several key design principles:

Immutable Segments
The content is stored in segments that can be up to 256KiB in size. They are immutable, which makes it easy to cache frequently accessed segments and reduce system errors that may corrupt the repository.
Each segment is identified by a unique identifier (UUID) and contains a continuous subset of the content tree. In addition, a segment can reference content in other segments; each segment keeps a list of the UUIDs of the segments it references.

Locality
Related records like a node and its immediate children are usually stored in the same segment. This makes searching the repository very fast and avoids most cache misses for typical clients that access more than one related node per session.

Compactness
The formatting of records is optimized for size to reduce IO costs and to fit as much content in caches as possible.
Document Node Store
The document node store is the basis of AEM’s MongoMK implementation. It uses the org.apache.jackrabbit.oak.plugins.document.DocumentNodeStoreService PID.
The following configuration options are available:
mongouri: The MongoDB URI used to connect to the MongoDB database. The default is mongodb://localhost:27017.
db: Name of the Mongo database. The default is Oak. However, new AEM 6 installations use aem-author as the default database name.
cache: The cache size in MB. This is distributed among the various caches used in the DocumentNodeStore. The default is 256.
changesSize: Size in MB of the capped collection used in Mongo for caching the diff output. The default is 256.
customBlobStore: Boolean value indicating that a custom data store will be used. The default is false.
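For example, a document node store configuration file (org.apache.jackrabbit.oak.plugins.document.DocumentNodeStoreService.config in crx-quickstart/install) might look like this sketch; the host, port, and database name are placeholders:

```
# crx-quickstart/install/org.apache.jackrabbit.oak.plugins.document.DocumentNodeStoreService.config
# Illustrative values only.
mongouri=mongodb://localhost:27017
db=aem-author
cache=256
customBlobStore=false
```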
The MongoDB storage leverages MongoDB for sharding and clustering. The repository tree is kept in one MongoDB database where each node is a separate document.  

It has several particularities:
Revisions

For each update (commit) of the content, a new revision is created. A revision is basically a string that consists of three elements:
A timestamp derived from the system time of the machine it was generated on
A counter to distinguish revisions created with the same timestamp
The cluster node id where the revision was created
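As an illustration only (the exact string encoding is an Oak implementation detail, so treat this as a sketch), a revision identifier can be thought of as the three elements above concatenated, for example a hex timestamp, a counter, and the cluster node id:

```shell
# Sketch of composing a revision-like identifier from its three parts.
# The real Oak encoding is internal; the values here are illustrative.
timestamp=$(printf '%x' 1589104800000)  # system time in ms, rendered as hex
counter=0                               # distinguishes revisions in the same ms
cluster_id=1                            # id of the cluster node that created it
echo "r${timestamp}-${counter}-${cluster_id}"
```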
Branches
Branches are supported, which allows clients to stage multiple changes and make them visible with a single merge call.
Previous documents
MongoDB storage adds data to a document with every modification. However, it only deletes data if a cleanup is explicitly triggered. Old data is moved to previous documents when a certain threshold is met. Previous documents only contain immutable data, which means they only contain committed and merged revisions.
Cluster node metadata
Data about active and inactive cluster nodes is kept in the database in order to facilitate cluster operations.

When to use MongoDB with AEM
MongoDB will typically be used for supporting AEM author deployments where one of the following criteria is met:
More than 1000 unique users per day;
More than 100 concurrent users;
High volumes of page edits;
Large rollouts or activations. 

The criteria above are only for the author instances and not for any publish instances which should all be TarMK based. The number of users refers to authenticated users, as author instances do not allow unauthenticated access.
If the criteria are not met, then a TarMK active/standby deployment is recommended to address availability. Generally, MongoDB should be considered in situations where the scaling requirements are more than what can be achieved with a single item of hardware.  

Data Store
AEM supports the following data store implementations:
FileDataStore
Amazon’s Simple Storage Service (S3)
Microsoft’s Azure storage service

FileDataStore
FileDataStore has been present since Jackrabbit 2. It provides a way to store binary data as normal files on the file system. It uses the org.apache.jackrabbit.oak.plugins.blob.datastore.FileDataStore PID.

These configuration options are available:
repository.home: Path to the repository home under which repository-related data is stored. By default, binary files are stored under the crx-quickstart/repository/datastore directory.
path: Path to the directory under which the files are stored. If specified, it takes precedence over the repository.home value.
minRecordLength: The minimum size, in bytes, of a file stored in the data store. Binary content smaller than this value is inlined.
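A minimal FileDataStore configuration file sketch (crx-quickstart/install/org.apache.jackrabbit.oak.plugins.blob.datastore.FileDataStore.config; the path and threshold are illustrative):

```
# crx-quickstart/install/org.apache.jackrabbit.oak.plugins.blob.datastore.FileDataStore.config
# Illustrative values only.
path=./crx-quickstart/repository/datastore
minRecordLength=4096
```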

Amazon’s Simple Storage Service (S3)
AEM can be configured to store data in Amazon’s Simple Storage Service (S3). It is configured via a file named org.apache.jackrabbit.oak.plugins.blob.datastore.S3DataStore.config.

Microsoft’s Azure storage service
AEM can be configured to store data in Microsoft’s Azure storage service. It is configured via a file named org.apache.jackrabbit.oak.plugins.blob.datastore.AzureDataStore.config.
Quick Notes:

Binary data (Data Store) + content nodes (Node Store) = TAR implementation

Node Store -> Segment Store (TAR)

By default, AEM stores content in the segment store located at crx-quickstart/repository/segmentstore. If you need to change the default path and configuration, add the config file to the install folder before installation.
Node Store -> Document Store (MongoDB)

Every commit creates a new revision in MongoDB.
Data Store -> FileDataStore (file system)
Data Store -> S3 (cloud)
Data Store -> Azure (cloud)



By aem4beginner

May 5, 2020

Shared Data Store

The AEM platform starting from AEM 6 is based on a Jackrabbit OAK repository (replacing the Jackrabbit 2.X repository of previous versions). This repository can be split into two different storage elements: the Node Store and the Data Store (also called Blob Store).

The node store contains the metadata and references for all information in the repository, whereas the data store contains all information bigger than a predefined size (this size is configurable; the standard is 4KB). So all data bigger than this size is stored in the data store and not in the node store.
For example, it usually contains images, assets, and other binary data.

As you can imagine, one thing to take into account is that the data store may grow a lot, having even terabytes of data for a big site. This means that if we have an author instance and several publish instances, we need to store this big amount of data for each server.

In order to solve this issue, we can use the shared datastore approach. This approach consists of having a single data store, which is shared between the publish instances and possibly also with the author instance (in this case every file should have a flag saying whether it is published or not).

The schema can be seen in the following image:



In this way, we have only one data store, with the corresponding saving of disk space. Another advantage is that the replication process can be faster, since once we publish a page, we don’t also have to replicate the binary data.

On the other hand, we need to take into account that maintenance of this approach becomes more complex: we have to pay attention to the shared nature of the data when we run the garbage collector process, so as not to remove active content.

How to configure Shared Data Store:
– Create the data store configuration file on each instance that is required to share the data store. In each configuration file, point to the same data store.

– You can validate the configuration by looking for a unique file added to the data store by each repository that shares it, with the format repository-[UUID], where the UUID is a unique identifier of each individual repository.

– Also, we can change the “Serialization Type” of the “Publish” replication agent from “Default” to “Binary Less” and add an additional argument (binaryless=true) to the replication agent’s “Transport URI”, meaning that the binary itself does not have to be transported across the network, resulting in faster replication.
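The repository-[UUID] validation step can be sketched as follows; the directory and UUIDs below are made up for illustration:

```shell
# Each repository sharing a datastore drops a marker file named
# repository-<UUID> into it. Simulate two registered repositories
# (the path and UUIDs are hypothetical) and count the markers.
DATASTORE=/tmp/shared-datastore-demo
mkdir -p "$DATASTORE"
touch "$DATASTORE/repository-4a1b2c3d-0000-0000-0000-000000000001"
touch "$DATASTORE/repository-4a1b2c3d-0000-0000-0000-000000000002"
ls "$DATASTORE" | grep -c '^repository-'   # one marker per sharing repository
```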


By aem4beginner

April 22, 2020

Managing AEM Datastore adobe official document

Follow the below adobe document:
https://helpx.adobe.com/content/dam/help/en/experience-manager/kt/eseminars/gems/aem-managing-aem-datastore/_jcr_content/main-pars/download_section/download-1/managing_aem_datastoreoct17.pdf


By aem4beginner

How to Troubleshoot Rapid DataStore Growth in AEM 6.3

Statement: Rapid Datastore growth

Solution:
Possible Reasons:

No DataStore GC has run for a while
Lucene indexes stored in the DataStore may cause repository growth disproportionate to the content

AEM 6.4 adds an enhancement for active deletion of unused Lucene blobs; with regularly scheduled deletes during the day, repository growth due to Lucene blobs is kept in check.

Mitigation:
Ensure that DataStore GC is enabled and runs weekly
Consider increasing the DataStore GC frequency (say, twice weekly) and scheduling it during off-peak hours for periods when a large number of uploads is expected
Ensure that revision clean-up is enabled and working
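If you prefer scheduling the GC yourself rather than relying on the Weekly Maintenance Window, the JMX invocation can be driven from cron; this is a sketch in which the host, port, and credentials are placeholders:

```
# Illustrative crontab entry: full DataStore GC every Sunday at 02:00.
# Host, port, and credentials are placeholders -- substitute your own.
0 2 * * 0 curl -u admin:admin -X POST --data markOnly=false "http://localhost:4502/system/console/jmx/org.apache.jackrabbit.oak%3Aname%3Drepository+manager%2Ctype%3DRepositoryManagement/op/startDataStoreGC/boolean"
```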


By aem4beginner

Where Form-Posted Data Is Stored in AEM

Statement: Where the data is stored when you have a custom component which accepts user inputs

Solution:
Let’s use the following as an example:


Assume above the form is inside a paragraph system and the radio group is set like the following:


So once the user submits the form, the data is generated into CRX at the /content/usergenerated folder. For the example above, it is stored at:

This is how the data looks:


By aem4beginner

How to perform Datastore Consistency Check in AEM

Statement:
  • How to find datastore consistency
  • How to fix data store inconsistencies.
Solution:
How to find datastore consistency:



  • Observe the log files as below in the RUN log console; they show that the consistency check is done and zero errors were found.
How to fix data store inconsistencies.
  • If the run above reported any repository inconsistencies, select only the Fix inconsistencies option; otherwise, choose all of the options below to find inconsistencies and fix them:
  • Traversal check
  • Fix inconsistencies
  • Log each node
  • Datastore consistency check




That's it!


By aem4beginner

April 20, 2020

Data Store Garbage Collection AEM 6.x

Running Data Store Garbage Collection
There are three ways of running data store garbage collection, depending on the data store setup on which AEM is running:
Via Revision Cleanup - a garbage collection mechanism usually used for node store cleanup.
Via Data Store Garbage Collection - a garbage collection mechanism specific to external data stores, available on the Operations Dashboard.
Via the JMX Console.

If TarMK is being used as both the node store and data store, then Revision Cleanup can be used for garbage collection of both node stores and data stores.

However if an external data store is configured such as File System Data Store, then data store garbage collection must be explicitly triggered separate from Revision Cleanup.

Datastore garbage collection can be triggered either via the Operations Dashboard or the JMX Console.

The below table shows the data store garbage collection type that needs to be used for all the supported data store deployments in AEM 6:


Node Store   Data Store            Garbage Collection Mechanism
TarMK        TarMK                 Revision Cleanup (binaries are in-lined with the Segment Store)
TarMK        External Filesystem   Data Store Garbage Collection task via the Operations Dashboard, or the JMX Console
MongoDB      MongoDB               Data Store Garbage Collection task via the Operations Dashboard, or the JMX Console
MongoDB      External Filesystem   Data Store Garbage Collection task via the Operations Dashboard, or the JMX Console
Running Data Store Garbage Collection via the Operations Dashboard
The built-in Weekly Maintenance Window, available via the Operations Dashboard, contains a built-in task to trigger the Data Store Garbage Collection at 1 am on Sundays.

If you need to run data store garbage collection outside of this time, it can be triggered manually via the Operations Dashboard.

Before running data store garbage collection you should check that no backups are running at the time.

    Open the Operations Dashboard via Navigation -> Tools -> Operations -> Maintenance.
    Click or tap the Weekly Maintenance Window.
    Select the Data Store Garbage Collection task and then click or tap the Run icon.



Datastore garbage collection runs and its status is displayed in the dashboard.
Note:
The Data Store Garbage Collection task will only be visible if you have configured an external file data store. See Configuring node stores and data stores in AEM 6 for information on how to set up a file data store.

Running Data Store Garbage Collection via the JMX Console
This section is about manually running data store garbage collection via the JMX Console. If your installation is set up without an external data store, then this does not apply to your installation. Instead, see the instructions on how to run Revision cleanup under Maintaining the Repository.

Note:
If you are running TarMK with an external data store, it is required you run Revision Cleanup first in order for garbage collection to be effective.
To run a garbage collection:
In the Apache Felix OSGi Management Console, highlight the Main tab and select JMX from the following menu.

Next, search for and click the Repository Manager MBean (or go to http://host:port/system/console/jmx/org.apache.jackrabbit.oak%3Aname%3Drepository+manager%2Ctype%3DRepositoryManagement).

Click startDataStoreGC(boolean markOnly).

Enter "true" for the markOnly parameter if required:
markOnly (boolean): Set to true to only mark references and not sweep in the mark and sweep operation. This mode is to be used when the underlying BlobStore is shared between multiple different repositories. For all other cases, set it to false to perform full garbage collection.
Click Invoke. CRX runs the garbage collection and indicates when it has completed.
Note:
The data store garbage collection will not collect files that have been deleted in the last 24 hours.


Note: The data store garbage collection task will only start if you have configured an external file data store. If an external file data store has not been configured, the task will return the message Cannot perform operation: no service of type BlobGCMBean found after invoking. See Configuring node stores and data stores in AEM 6 for information on how to set up a file data store.
Automating Data Store Garbage Collection
If possible, data store garbage collection should be run when there is little load on the system, for example in the morning.

The built-in Weekly Maintenance Window, available via the Operations Dashboard, contains a built-in task to trigger the Data Store Garbage Collection at 1 am on Sundays. You should also check that no backups are running at this time. The start of the maintenance window can be customized via the dashboard as necessary.

Note:
The reason not to run it concurrently is so that old (and unused) data store files are also backed up, so that if it is required to roll back to an old revision, the binaries are still there in the backup.
If you don't wish to run data store garbage collection with the Weekly Maintenance Window in the Operations Dashboard, it can also be automated using the wget or curl HTTP clients. The following is an example of how to automate data store garbage collection by using curl:
Caution:
In the following example curl commands various parameters might need to be configured for your instance; for example, the hostname (localhost), port (4502), admin password (xyz) and various parameters for the actual data store garbage collection.
Here is an example curl command to invoke data store garbage collection via the command line:

curl -u admin:admin -X POST --data markOnly=true http://localhost:4503/system/console/jmx/org.apache.jackrabbit.oak"%"3Aname"%"3Drepository+manager"%"2Ctype"%"3DRepositoryManagement/op/startDataStoreGC/boolean
The curl command returns immediately.

References:
https://helpx.adobe.com/experience-manager/6-4/sites/administering/using/data-store-garbage-collection.html


By aem4beginner

Configuring node stores and data stores in AEM 6

URL:
https://helpx.adobe.com/experience-manager/6-4/sites/deploying/using/data-store-config.html#Introduction


By aem4beginner

Checking Data Store Consistency in AEM 6.4

The data store consistency check will report any data store binaries that are missing but are still referenced. To start a consistency check, follow these steps:
  1. Go to the JMX console. For information on how to use the JMX console, see this article.
  2. Search for the BlobGarbageCollection MBean and click it.
  3. Click the checkConsistency() link.
After the consistency check is complete, a message will show the number of binaries reported as missing. If the number is greater than 0, check the error.log for more details on the missing binaries.

Below you will find an example of how the missing binaries are reported in the logs:
11:32:39.673 INFO [main] MarkSweepGarbageCollector.java:600
Consistency check found [1] missing blobs

11:32:39.673 WARN [main] MarkSweepGarbageCollector.java:602 Consistency check failure in the the blob store : DataStore backed BlobStore [org.apache.jackrabbit.oak.plugins.blob.datastore.OakFileDataStore], check missing candidates
in file /tmp/gcworkdir-1467352959243/gccand-1467352959243


By aem4beginner

April 16, 2020

Configure Data Store and Node Store in AEM 6

In this article we are going to learn how to configure the Data Store and Node Store in AEM 6. We all know how to install AEM, but it’s really important to know which type of configuration is best for which scenario, and what the different ways are to configure the data store and node store in AEM 6.1.

Major differences between CQ5.x and AEM6.x:
AEM6.x implements the Oak repository, whereas older CQ5 uses CRX2.
AEM6.x uses a MicroKernel; CQ5.x uses a Persistence Manager.
Custom re-indexing is possible in AEM 6.x, depending on the queries. (More on this in a later article.)
Sightly is introduced in AEM6.x, whereas CQ5 uses JSP.
Prerequisite:
AEM 6.1 jar with valid License file.
Decide which Node and Datastore is required for your project.
Configure different Data Store in AEM 6:

There are basically two types of data store available in AEM 6: the Amazon S3 bucket and the File Datastore.


Amazon S3 Bucket Data Store:
This type of storage requires an account with Amazon. Use this type to store large amounts of data in an external S3 bucket.

Config file name :
org.apache.jackrabbit.oak.plugins.blob.datastore.S3DataStore.config

Basic config:
accessKey=<provided by Amazon>
secretKey=<provided by Amazon>
s3Bucket=<provided by Amazon>
s3Region=<provided by Amazon>
s3EndPoint=<provided by Amazon>
connectionTimeout=120000
socketTimeout=120000
maxConnections=40
maxErrorRetry=10
writeThreads=20
cacheSize=<size in bytes>
concurrentUploadsThreads=10
asyncUploadLimit=100
cachePurgeTrigFactor=0.95d
path=~/datastore


File Datastore:

This method stores all binary data in the same local file system.

Config file name :
org.apache.jackrabbit.oak.plugins.blob.datastore.FileDataStore.config

Basic config:
path=~/datastore
minRecordLength=<values in bytes>


Configure Node Store types in AEM 6:

Document Node store:
This type requires MongoDB to be configured. A MongoDB setup is typically used for high availability (HA) of instances.

Config file name :
org.apache.jackrabbit.oak.plugins.document.DocumentNodeStoreService.config

Basic config:
mongouri=mongodb://<hostname>:<port>
db=<db name>
customBlobStore=false

Feel free to drop a comment if you face any issues while implementing MongoDB.

Segment node store:
This method stores metadata and properties in the TarMK implementation. By default, AEM uses the segment store.

Offline compaction details are covered later in this article.

Config file name :
org.apache.jackrabbit.oak.plugins.segment.SegmentNodeStoreService.config

Basic config:
customBlobStore=true

NOTE: By default, the segment store folder is created under the /repository folder. If you need to change the path, use repository.home in the config file.

Steps to install AEM 6:
Move the jar and license file to the appropriate folder.
Rename the jar file as “cq6-p<portnumber>.jar” # specify a 4- or 5-digit port number
Unpack the jar file:

java -jar cq6-p<portnumber>.jar -unpack

Sample output:
[pradeep@host ~]$ java -jar cq6-author-p4507.jar -unpack

Loading quickstart properties: default

Loading quickstart properties: instance

Setting properties from filename '/home/pradeep/cq6-author-p4507.jar'

Option '-quickstart.server.port' set to '4507' from filename cq6-author-p4507.jar

Verbose mode - stdout/err not redirected to files, and stdin not closed

ResourceProvider paths=[/gui, /gui/default]

quickstart.build

quickstart.properties not found, initial install

UpgradeUtil.handleInstallAndUpgrade has mode INSTALL

Saving build number in quickstart.properties

Upgrade: no files to restore from pre-upgrade backup

31 files extracted from jar file

Running chmod +x /home/pradeep/crx-quickstart/bin/start

Running chmod +x /home/pradeep/crx-quickstart/bin/stop

Running chmod +x /home/pradeep/crx-quickstart/bin/status

Running chmod +x /home/pradeep/crx-quickstart/bin/quickstart

Not starting the Quickstart server as the -unpack option is set

Quickstart files unpacked, server startup scripts can be found under /home/pradeep/crx-quickstart

Once extracted properly, check that the crx-quickstart folder is present in the same path.
Inside the crx-quickstart folder, create a folder named “install”.
Create the config file and configuration as described above.
Once done, start the instance.
Check that all bundles are in an active state and review the error log.
Issues while configuring the S3 bucket:

After all the configuration was made as above, we faced issues related to the Amazon config and the local cache path not being created properly.

After a lot of verification of the config file, we finally identified that the config placed under the install folder was not taking effect in our OSGi console.

Open the config matching your setup and verify that the configuration is overridden properly.



Sample S3 config :


Sample Segment store config:


Offline compaction:
Some may face disk space issues in the segment store folder.

To reduce space, AEM provides a compaction tool. This post explains offline compaction techniques.

Steps to perform offline compaction:
Download and install the latest oak-run. Please visit the URL below to check for updates:
https://repository.apache.org/content/repositories/releases/org/apache/jackrabbit/oak-run/
Stop the AEM instance.
Back up the instance.
Check the segment store size before running the commands.
Run the below command
java -jar oak-run-x.x.xx.jar checkpoints <segmentstore path>
java -jar oak-run-x.x.xx.jar checkpoints <segmentstore path> rm-unreferenced
java -jar oak-run-x.x.xx.jar compact <segmentstore path>
Start the instance.
Check the segment store size.

Explanation:
The first command identifies the older checkpoints.
The second command checks for unreferenced checkpoints and removes them.
The third command compacts the segment store.

NOTE: Once you are familiar with these steps, consider implementing them in a script.

Feel free to drop a comment or write to us about MongoDB setup and configuration issues.


By aem4beginner