
Verifying Git Data Integrity

As a Git administrator, you’re probably familiar with the git fsck command, which checks the integrity of a Git database. In a large deployment, however, you may have several mirrors in different locations supporting users and build automation (the record I’ve heard so far is over 50 mirrors). You can run git fsck on each one as normal maintenance, but even if all of the mirrors have intact databases, how do you make sure that all of the mirrors are consistent with the master repository? You need a simple way to verify Git data integrity for all repositories at all sites.

That’s quite a difficult question to answer. If you have 20 or 30 mirrors, you want to know whether any of them is out of sync with the master. Inconsistencies may arise if replication is lagging behind, or if there is some more subtle corruption.

Git MultiSite provides a simple consistency checker to answer this question quickly. (Bear in mind that Git MultiSite nodes are all writable peer nodes; it does not use a master-slave paradigm. But the ability to make sure that all peer nodes are consistent is equally valuable.) The consistency checker can be invoked for any repository from the administration console.

The consistency checker computes a SHA1 ID over the values of all the current refs in the repository on each replicated peer node. This SHA1 is tied to the Global Sequence Number (GSN), which uniquely identifies each proposal in Git MultiSite’s Distributed Coordination Engine. The result looks like this:

[Screenshot: consistency check results for three nodes]

First, I see that the GSN matches across all three nodes. I’m now confident that they’re all reporting results at the same point in the sequence, when the same transactions should be present on all nodes. In other words, I’m able to rule out any inconsistencies due to network lag.

More importantly, I see that the SHA1 for the second node doesn’t match the other two. That’s a red flag, and it means that I should immediately investigate what’s wrong on that node.
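
To make the idea of a SHA1 over the refs concrete, here is a minimal sketch in Python. It is not Git MultiSite’s actual algorithm, which isn’t described here. It assumes the digest is computed over sorted "object ID, ref name" lines, and the function name refs_digest and the sample ref values are made up for illustration. On a real repository, the mapping could be filled from the output of git for-each-ref --format='%(objectname) %(refname)'.

```python
import hashlib


def refs_digest(refs):
    """Return a SHA-1 hex digest over a mapping of ref name -> object ID.

    Hypothetical helper: the line format and ordering are assumptions,
    not the format Git MultiSite uses.
    """
    h = hashlib.sha1()
    # Sort by ref name so the digest does not depend on enumeration order.
    for name in sorted(refs):
        h.update(f"{refs[name]} {name}\n".encode("utf-8"))
    return h.hexdigest()


# Made-up ref values for three nodes; node_c has a different master tip.
node_a = {"refs/heads/master": "a" * 40, "refs/tags/v1.0": "b" * 40}
node_b = {"refs/tags/v1.0": "b" * 40, "refs/heads/master": "a" * 40}
node_c = {"refs/heads/master": "c" * 40, "refs/tags/v1.0": "b" * 40}

print(refs_digest(node_a) == refs_digest(node_b))  # True: same refs, different order
print(refs_digest(node_a) == refs_digest(node_c))  # False: one ref differs
```

Because every ref value feeds into the digest, a single ref pointing at a different commit on one node is enough to change that node’s SHA1.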

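Now consider this example:

[Screenshot: consistency check results in which the third node reports GSN 23 and the other two nodes report GSN 29]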

Notice that the third node is reporting an earlier GSN (23 versus 29) compared to the other two nodes. That tells me that this node is lagging behind, which may be expected if it’s connected over a WAN and always running 2-3 minutes behind the other nodes.
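
The way I read these two reports can be sketched as a small Python function. This is only an illustration under stated assumptions: the NodeReport type, the classify function, and the digest strings are hypothetical, and treating the most common SHA1 as the reference value is my assumption rather than documented product behavior. Only the GSN values 23 and 29 come from the example above.

```python
from collections import Counter
from typing import NamedTuple


class NodeReport(NamedTuple):
    node: str
    gsn: int   # Global Sequence Number the result was computed at
    sha1: str  # digest over the node's refs


def classify(reports):
    """Split nodes into lagging ones and ones whose digest mismatches."""
    if not reports:
        return {"lagging": [], "mismatched": []}
    latest = max(r.gsn for r in reports)
    # A node at an earlier GSN is behind; its digest is not comparable yet.
    lagging = [r.node for r in reports if r.gsn < latest]
    current = [r for r in reports if r.gsn == latest]
    # Assumption: the most common digest among up-to-date nodes is the
    # reference. A tie is ambiguous and would need manual investigation.
    reference, _ = Counter(r.sha1 for r in current).most_common(1)[0]
    mismatched = [r.node for r in current if r.sha1 != reference]
    return {"lagging": lagging, "mismatched": mismatched}


# First example: same GSN everywhere, second node's digest differs.
first = [NodeReport("node1", 29, "aaa"), NodeReport("node2", 29, "bbb"),
         NodeReport("node3", 29, "aaa")]
# Second example: third node is at GSN 23 while the others are at 29.
second = [NodeReport("node1", 29, "aaa"), NodeReport("node2", 29, "aaa"),
          NodeReport("node3", 23, "ccc")]

print(classify(first))   # {'lagging': [], 'mismatched': ['node2']}
print(classify(second))  # {'lagging': ['node3'], 'mismatched': []}
```

The point of checking the GSN first is that a digest mismatch only signals a problem when both nodes are at the same GSN; a lagging node is expected to differ until it catches up.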

Running a distributed SCM environment is very difficult, and the consistency check is another way that Git MultiSite makes things easier for you. Check out a free trial and see for yourself!

 

 


About Randy DeFauw

Randy DeFauw is Director of Product Marketing for WANdisco’s ALM products. He focuses on understanding in detail how WANdisco’s products help solve real-world problems, and has a deep background in development tools and processes. Prior to joining WANdisco he worked in product management, marketing, consulting, and development. He has several years of experience applying Subversion and Git workflows to modern development challenges.

