
Detecting Dependency Trends in Components Using R and Hadoop


As I’ve been experimenting with Flume to ingest ALM data into a Hadoop cluster, I’ve made a couple of interesting observations.

First, the Hadoop ecosystem makes it easy for any team to start using these tools to gather data from disparate ALM sources. You don’t need big enterprise data warehouse (EDW) tools – just Flume and a small Hadoop cluster, or even just a VM from one of the Hadoop vendors to get started. These tools are free and easy to use in a small deployment, and you simply scale everything up as your needs grow.

Second, once the data is in Hadoop, you have access to the growing set of free data analysis tools for Hadoop, ranging from Hive and Pig to scripted MapReduce jobs and more powerful tools like R.

My most recent experiment used the RMR package (rmr2) from Revolution Analytics, which provides a bridge between R, MapReduce, and HDFS. In this case, I had already used Flume to ingest Git commit data from a couple of related Git repositories, and I decided to look for unusual relationships in the commit activity for the components in the system, including:

  • The most active components

  • The number of commits that affected more than one component

  • Which pairs of components tended to see work in the same commit

I often find that last item particularly interesting, as it may indicate dependencies between components that aren’t otherwise obvious.
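To make the pairing scheme concrete, here’s what the mapper (shown further below) emits for a single hypothetical commit; the component names are made up for illustration:

# components touched by one commit
lcomps = sort(unique(c("spec", "app", "doc")))   # "app" "doc" "spec"

# one key per component pair
multis = c()
for(j in 1:(length(lcomps) - 1)) {
  for(jj in (j + 1):length(lcomps)) {
    multis = append(multis, paste0(lcomps[j], "-", lcomps[jj]))
  }
}

c(lcomps, multis, "MULTI")
# "app" "doc" "spec" "app-doc" "app-spec" "doc-spec" "MULTI"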

I had all the Git data stored on HDFS, so I used a ‘word count’-style MapReduce task to provide the counts. A partial R script is shown below.

# libraries
require(rmr2)

dfs.git = mapreduce(
  input = "/user/admin/git",
  map = function(k, v) {
    comps = c()

    # each input record is one commit; v arrives as a data frame
    for(i in 1:nrow(v)) {
      lcomps = c()

      # … some cleanup work to extract components ...
      lcomps = append(lcomps, component)
      lcomps = sort(unique(lcomps))
      numUnique = length(lcomps)

      # emit a key for each pair of components touched in the same commit
      multis = c()
      if(numUnique > 1) {
        for(j in 1:(numUnique - 1)) {
          for(jj in (j + 1):numUnique) {
            multis = append(multis, paste0(lcomps[j], "-", lcomps[jj]))
          }
        }
      }
      lcomps = append(lcomps, multis)

      # flag commits that span more than one component
      if(numUnique > 1) {
        lcomps = append(lcomps, "MULTI")
      }

      comps = append(comps, lcomps)
    }
    keyval(comps, 1)
  },
  reduce = function(k, vv) {
    keyval(k, sum(vv))
  })
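One caveat: the map function assumes each input split arrives as a data frame with one row per commit record. That depends on the input format; if the Flume sink wrote the Git data as delimited text, something along these lines would work (a sketch only; the actual column layout depends on how the data was ingested):

# hypothetical input format for comma-delimited text records
git.format = make.input.format("csv", sep = ",")

# then pass it to the job:
#   mapreduce(input = "/user/admin/git", input.format = git.format, ...)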

Now that I’ve got these counts for each component and component pair, I can easily pull them back into R for further manipulation.

out = from.dfs(dfs.git)
comps = unlist(keys(out))
count = unlist(values(out))
results = data.frame(comps = comps, count = count)
results = results[order(results$count, decreasing = TRUE), ]

# keep only the most active components and pairs
r = results[results$count > 250, ]
barplot(r$count, names.arg = r$comps, las = 3, col = "blue")

I’ll just focus on the most active components and pairs, which I can see in this plot.

Anything interesting there? Maybe. It certainly looks like the ‘app’ component is far and away the busiest, so perhaps it’s ripe for refactoring. I also notice that ‘app’ and ‘spec’ tend to be updated in the same commit, and there’s a lot of cross-component work (“MULTI”) going on. And what’s missing? Well, the ‘doc’ module isn’t updated very often alongside other components. Perhaps we’re not being diligent about documenting test cases right away.
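To go beyond eyeballing the plot, a simple co-change ratio puts a number on that coupling. This is a quick sketch on top of the results data frame above (not part of the MapReduce job); it assumes both components appear in the results and that pair keys use the “a-b” naming from the mapper:

# how often two components change together, relative to the less
# active of the two (a value near 1.0 means they almost always
# change in the same commit)
cochange = function(results, a, b) {
  ab = sort(c(a, b))
  pair = paste0(ab[1], "-", ab[2])
  pc = results$count[results$comps == pair]
  if(length(pc) == 0) return(0)   # never changed together
  pc / min(results$count[results$comps == a],
           results$count[results$comps == b])
}

cochange(results, "app", "spec")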

But the main point is that I can now do some interesting data exploration with a minimum amount of work and no investment in an EDW.

So even if your ALM data isn’t ‘Big Data’ yet, you can still take advantage of the flexibility, low barriers to entry, and scalability of the Hadoop ecosystem. You’ll have some fairly interesting realizations before you know it!

 


About Randy DeFauw

Randy DeFauw is Director of Product Marketing for WANdisco’s ALM products. He focuses on understanding in detail how WANdisco’s products help solve real world problems, and has deep background in development tools and processes. Prior to joining WANdisco he worked in product management, marketing, consulting, and development. He has several years of experience applying Subversion and Git workflows to modern development challenges.

