
Git Data Mining with Hadoop


Detecting Cross-Component Commits

Sooner or later every Git administrator starts to dabble in simple reporting and data mining. The questions we need to answer are driven by developers (who’s the most active developer?) and the business (show me who’s been modifying the code we’re trying to patent), and range from the simple (which files were modified during this sprint?) to the complex (how many commits led to regressions later on?). But here’s a key fact: you probably don’t know in advance all the questions you’ll eventually want to answer. That’s why I decided to explore Git data mining with Hadoop.

We may not normally think of Git data as ‘Big Data’. In terms of sheer volume, Git repositories don’t qualify. In several other respects, however, I think Git data is a perfect candidate for analysis with Big Data tools:

  • Git data is loosely structured. There is interesting data in commit messages, in commit events intercepted by hooks, in authentication data from HTTP and SSH daemons, and in other ALM tools. I may also want to correlate data from several Git repositories. I’m probably not tracking all of these data sources consistently, and I may not even know right now how the pieces will eventually fit together. I wouldn’t know how to design a schema today that will answer every question I could ever dream up.

  • While any single Git repository is fairly small, the aggregate data from hundreds of repositories with several years of history would be challenging for traditional repository analysis tools to handle. For many SCM systems the ‘reporting replica’ is busier than the master server!

Getting Started

As a first step I decided to use Flume to stream Git commit events (as seen by a post-receive hook) to HDFS. I set up Flume with a netcat source connected to an HDFS sink via a file channel. Note that the git prefix on each property is the agent name, which must match the name given to the Flume agent when it is started. The flume.conf looks like:

git.sources = git_netcat
git.channels = file_channel
git.sinks = sink_to_hdfs
# Define / Configure source
git.sources.git_netcat.type = netcat
git.sources.git_netcat.bind = 0.0.0.0
git.sources.git_netcat.port = 6666
# HDFS sink
git.sinks.sink_to_hdfs.type = hdfs
git.sinks.sink_to_hdfs.hdfs.fileType = DataStream
git.sinks.sink_to_hdfs.hdfs.path = /flume/git-events
git.sinks.sink_to_hdfs.hdfs.filePrefix = gitlog
git.sinks.sink_to_hdfs.hdfs.fileSuffix = .log
git.sinks.sink_to_hdfs.hdfs.batchSize = 1000
# Use a file channel, which buffers events on disk for durability
git.channels.file_channel.type = file
git.channels.file_channel.checkpointDir = /var/flume/checkpoint
git.channels.file_channel.dataDirs = /var/flume/data
# Bind the source and sink to the channel
git.sources.git_netcat.channels = file_channel
git.sinks.sink_to_hdfs.channel = file_channel
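
With the agent running, a quick way to smoke-test the pipeline is to push a fake event at the netcat source and check that a gitlog file appears under /flume/git-events. Here’s a minimal sketch in Python; the host, port, and field values are made up to match the config above and the field order the hook (described below) will use:

import socket

# One fake pipe-delimited commit event; every value is illustrative.
event = ("1371215000|jdoe|demo-repo|update|commit|branch|master|"
         "0000000|abc1234|blob1,blob2|modA/a.c,modB/b.c\n")

# Connect to the Flume netcat source defined in flume.conf and send it.
sock = socket.create_connection(("localhost", 6666))
sock.sendall(event.encode("utf-8"))
sock.close()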

The Git Hook

I used the post-receive-email template as a starting point, since it already contains the basic logic for interpreting the data the hook receives. The hook eventually yields several pieces of information:

  • timestamp

  • author

  • repo ID

  • action

  • rev type

  • ref type

  • ref name

  • old rev

  • new rev

  • list of blobs

  • list of file paths

Do I really care about all of this information? I don’t know yet, and that’s exactly why I’m just stuffing the data into HDFS. I may not need all of it today, but I might a couple of years down the road.
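
For context, here is a minimal sketch of the front end of such a hook. Git feeds post-receive one “<old-rev> <new-rev> <ref-name>” line per updated ref on standard input, and the file paths can then be pulled out with git diff-tree. This is illustrative, not the exact code from the template:

#!/usr/bin/env python
# Sketch of the input-parsing step in a post-receive hook.
import sys
import time
from subprocess import check_output

timestamp = int(time.time())
for line in sys.stdin:
    # Each stdin line describes one updated ref: "<oldrev> <newrev> <refname>"
    oldrev, newrev, refname = line.strip().split()
    # File paths touched by the new revision; --no-commit-id suppresses
    # the leading commit hash, -r recurses into subdirectories.
    paths = check_output(['git', 'diff-tree', '--no-commit-id',
                          '--name-only', '-r', newrev]).splitlines()

A push can of course deliver several commits per ref; the template’s logic handles walking the oldrev..newrev range, which this sketch glosses over.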

Once I’ve marshaled all the data, I stream it to Flume via nc:

from subprocess import Popen, PIPE, STDOUT

# Build one pipe-delimited record; blobs and paths are themselves
# comma-delimited lists within their fields.
nc_data = "{0}|{1}|{2}|{3}|{4}|{5}|{6}|{7}|{8}|{9}|{10}\n".format(
    timestamp, author, projectdesc, change_type, rev_type,
    refname_type, short_refname, oldrev, newrev, ",".join(blobs),
    ",".join(paths))

# Hand the record to nc on stdin; NC_IP and NC_PORT identify the
# Flume netcat source (the port must be a string in the argv list).
p = Popen(['nc', NC_IP, NC_PORT], stdout=PIPE, stdin=PIPE, stderr=STDOUT)
nc_out = p.communicate(input=nc_data)[0]
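
One caveat worth designing for: post-receive runs synchronously at the end of every push, so the hook should never hang or fail a push just because the Flume agent is down. A sketch of a more defensive version of the same call (the -w flag gives nc a connect timeout):

import sys
from subprocess import Popen, PIPE, STDOUT

try:
    # -w 2 makes nc give up after two seconds instead of hanging the push
    p = Popen(['nc', '-w', '2', NC_IP, NC_PORT],
              stdout=PIPE, stdin=PIPE, stderr=STDOUT)
    p.communicate(input=nc_data)
    if p.returncode != 0:
        sys.stderr.write("warning: could not deliver event to Flume\n")
except OSError:
    sys.stderr.write("warning: nc not available, event dropped\n")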

The First Query

Now that I have Git data streaming into HDFS via Flume, I decided to tackle a question I always find interesting: how isolated are Git commits? In other words, does a typical commit touch only one part of a repository, or does it touch files in several parts of the code? If you work in a component-based architecture you’ll recognize the value of detecting cross-component activity.

I decided to use Pig to analyze the data, and started by registering the data as a table with HCatalog. Making it an external table means that dropping the table later won’t delete the underlying Flume data.

hcat -e "CREATE EXTERNAL TABLE GIT_LOGS(time STRING, author STRING, \
  repo_id STRING, action STRING, rev_type STRING, ref_type STRING, \
  ref_name STRING, old_rev STRING, new_rev STRING, blobs STRING, paths STRING) \
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' \
  STORED AS TEXTFILE LOCATION '/flume/git-events';"

Now for the fun part: some Pig Latin! How you actually detect cross-component activity will vary with the structure of your code; that’s part of the reason it’s so difficult to come up with a canned schema in advance. But for a simple example, let’s say I want to detect any commit that touches files in two component directories, modA and modB. The list of file paths in each commit is a comma-delimited field, so some data manipulation is required if we’re to avoid too much regular expression fiddling (conveniently, Pig’s TOKENIZE splits on commas by default).

-- load from hcat
raw = LOAD 'git_logs' using org.apache.hcatalog.pig.HCatLoader();

-- tuple, BAG{tuple,tuple}
-- new_rev, BAG{p1,p2}
bagged = FOREACH raw GENERATE new_rev, TOKENIZE(paths) as value;
DESCRIBE bagged;

-- tuple, tuple
-- tuple, tuple
-- new_rev, p1
-- new_rev, p2
bagflat = FOREACH bagged GENERATE $0, FLATTEN(value);
DESCRIBE bagflat;

-- create list that only has first path of interest
modA = FILTER bagflat by $1 matches '^modA/.*';
DESCRIBE modA;

-- create list that only has second path of interest
modB = FILTER bagflat by $1 matches '^modB/.*';
DESCRIBE modB;

-- So now we have lists of commits that hit each of the paths of interest.  Join them...
-- new_rev, p1, new_rev, p2
bothMods = JOIN modA by $0, modB by $0;
DESCRIBE bothMods;

-- join on new_rev
joined = JOIN raw by new_rev, bothMods by $0;
DESCRIBE joined;

-- now that we've joined, we have the rows of interest and can discard the extra fields from bothMods
final = FOREACH joined GENERATE $0, $1, $2, $3, $4, $5, $6, $7, $8, $9, $10;
DESCRIBE final;
DUMP final;

As the Pig script illustrates, I manipulated the data into a new structure with one row per file per commit. That made it easier to operate on the file path data: I built lists of commits containing files under each path of interest, then used a couple of joins to isolate the commits containing files under both. Note that a commit touching several files in each directory will appear once per matching pair, so a DISTINCT on the commit ID would tidy the final output. There are certainly other ways to get the same result, but this method was simple and effective.
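
The modA/modB pair is hard-coded above. When exploring, it can be handy to ask the generic question instead: which commits touch more than one top-level directory at all? Here’s a sketch of that check in plain Python, run against a log file pulled down with hadoop fs -get, assuming the pipe-delimited field order the hook writes:

# Generic cross-component check over a local copy of a gitlog file.
# A commit is flagged when its file paths span more than one top-level
# directory. Field positions follow the hook's nc_data format.
def cross_component_commits(log_file):
    hits = []
    with open(log_file) as f:
        for line in f:
            fields = line.rstrip("\n").split("|")
            new_rev, paths = fields[8], fields[10]
            # Top-level directory of each path, e.g. "modA/x/y.c" -> "modA"
            components = set(p.split("/")[0] for p in paths.split(",") if "/" in p)
            if len(components) > 1:
                hits.append((new_rev, sorted(components)))
    return hits

for rev, mods in cross_component_commits("gitlog.1371215000.log"):
    print("{0} spans {1}".format(rev, ", ".join(mods)))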

In A Picture

A simplified data flow diagram shows how data makes its way from a Git commit into HDFS and eventually out again in a report.

[Figure: Data Flow]

What Next?

This simple example shows some of the power of putting Git data into Hadoop. Without knowing in advance exactly what I wanted to do, I was able to capture some important Git data and manipulate it after the fact. Hadoop’s analysis tools make it easy to work with data that isn’t well structured in advance, and of course I could take advantage of Hadoop’s scalability to run my query on a data set of any size. In the future I could take advantage of data from other ALM tools or authentication systems to flesh out a more complete report. (The next interesting question on my mind is whether commits that span multiple components have a higher defect rate than normal and require more regression fixes.)

Using Hadoop for Git data mining may seem like overkill at first, but I like to have the flexibility and scalability of Hadoop at my fingertips in advance.


About Randy DeFauw

Randy DeFauw is Director of Product Marketing for WANdisco’s ALM products. He focuses on understanding in detail how WANdisco’s products help solve real-world problems, and has a deep background in development tools and processes. Prior to joining WANdisco he worked in product management, marketing, consulting, and development. He has several years of experience applying Subversion and Git workflows to modern development challenges.

