Fixing an out of sync CQ Cluster

Every so often, a node in a CQ cluster falls out of sync with its master. This can happen because of a loss of network connectivity, a power loss, or any number of other causes. When the master and slave are out of sync, it becomes impossible to start both nodes. (Note the word "start": if the cause was a network connectivity issue, both nodes may keep running in a non-clustered state.) When this happens, as with any server issue, the first place to check is the logs. Typically, a message like the following is displayed:

*ERROR* RepositoryImpl: Failed to initialize workspace 'crx.default' (RepositoryImpl.java, line 540)
javax.jcr.RepositoryException: Cannot instantiate persistence manager com.day.crx.persistence.tar.TarPersistenceManager
at com.day.crx.core.CRXRepositoryImpl$CRXWorkspaceInfo.createPersistenceManager(CRXRepositoryImpl.java:1101)
at com.day.crx.core.CRXRepositoryImpl$CRXWorkspaceInfo.doInitialize(CRXRepositoryImpl.java:1117)
at org.apache.jackrabbit.core.RepositoryImpl$WorkspaceInfo.initialize(RepositoryImpl.java:1998)
at org.apache.jackrabbit.core.RepositoryImpl.initStartupWorkspaces(RepositoryImpl.java:533)
at com.day.crx.core.CRXRepositoryImpl.initStartupWorkspaces(CRXRepositoryImpl.java:279)
at org.apache.jackrabbit.core.RepositoryImpl.<init>(RepositoryImpl.java:342)
at com.day.crx.core.CRXRepositoryImpl.<init>(CRXRepositoryImpl.java:225)
at com.day.crx.core.CRXRepositoryImpl.<init>(CRXRepositoryImpl.java:267)
at com.day.crx.core.CRXRepositoryImpl.create(CRXRepositoryImpl.java:185)
Caused by: java.io.IOException: This cluster node and the master are out of sync. Operation stopped.
Please ensure the repository is configured correctly.
To continue anyway, please delete the index and data tar files on this cluster node and restart.
Please note the Lucene index may still be out of sync unless it is also deleted.

 

This indicates that the cluster is out of sync and that the server needs to be re-indexed. You can remove the previous indexes without issue by deleting the following:

  1. /crx-quickstart/repository/version/index*.tar
  2. /crx-quickstart/repository/workspaces/crx.default/index
  3. /crx-quickstart/repository/workspaces/crx.default/index*.tar
  4. /crx-quickstart/repository/repository/index
  5. /crx-quickstart/repository/tarJournal/index*.tar


You will also need to remove the other (tar) journal entries for the server before it can resynchronize with the master. The journal files can be found under crx-quickstart/repository/tarJournal/data*.tar. Typically, when a server is out of sync, I log in to the slave server, shut it down, take an offline backup of the content (read: copy the crx-quickstart folder and compress it), and then remove all of the index and journal tar files. 99% of the time, this resolves any synchronization issues.
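If you would rather script the cleanup than do it by hand, here is a rough Python sketch of the steps above. It is only an illustration, not an official tool: the function names (find_stale_files, backup, clean_slave) and the install_root argument are my own assumptions, and the paths are simply the ones listed in this post, taken relative to wherever crx-quickstart lives. It assumes the slave is already shut down, and it defaults to a dry run so you can review the list before anything is deleted.

```python
import glob
import os
import shutil

# Index locations listed in this post, relative to the install root
# (the directory that contains crx-quickstart).
INDEX_PATTERNS = [
    "crx-quickstart/repository/version/index*.tar",
    "crx-quickstart/repository/workspaces/crx.default/index",
    "crx-quickstart/repository/workspaces/crx.default/index*.tar",
    "crx-quickstart/repository/repository/index",
    "crx-quickstart/repository/tarJournal/index*.tar",
]
# Tar journal entries, also from this post.
JOURNAL_PATTERNS = ["crx-quickstart/repository/tarJournal/data*.tar"]


def find_stale_files(install_root):
    """Return every existing path that matches the patterns above."""
    matches = set()
    for pattern in INDEX_PATTERNS + JOURNAL_PATTERNS:
        matches.update(glob.glob(os.path.join(install_root, pattern)))
    return sorted(matches)


def backup(install_root, archive_base):
    """Offline backup: compress the crx-quickstart folder to <archive_base>.tar.gz.

    Keep archive_base outside crx-quickstart so the archive does not
    end up inside the folder being archived.
    """
    return shutil.make_archive(
        archive_base, "gztar", root_dir=install_root, base_dir="crx-quickstart"
    )


def clean_slave(install_root, dry_run=True):
    """Remove index and journal files; with dry_run=True only print them."""
    targets = find_stale_files(install_root)
    for path in targets:
        if dry_run:
            print("would remove", path)
        elif os.path.isdir(path):
            shutil.rmtree(path)  # the Lucene index locations are directories
        else:
            os.remove(path)
    return targets
```

A typical run would be backup("/opt/cq", "/backups/slave-backup") followed by clean_slave("/opt/cq") to review the list, then clean_slave("/opt/cq", dry_run=False) once you are happy with it. Both paths in that example are placeholders for your own environment.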

The quick and dirty approach (for the 1%)

If you don't have time to attempt the above (really, it is typically pretty fast), you can take the "quick and dirty" approach, as I like to call it. This requires a recent backup of the master node. There is a well-written Adobe post on this method, which can be found here: http://www.wemblog.com/2012/01/how-to-fix-out-of-sync-cluster-issue-cq.html

Good luck!
