May 19, 2009
More Mining Software Repositories
I tried to talk with as many people from MSR as I could about the Hadley Centre codebase and what might be possible. I’m encouraged, since it seems like anything I can do with respect to mining the version control system, even if it’s recreating another result, would be interesting.
Subversion is probably the VCS that the most work has been done on, and there’s a reasonable amount of work on identifying closely coupled modules. One fellow I need to look up presented a poster (with short paper) on a case study of applying state-of-the-art technology to an industrial codebase (in Java). He told me a little about the things I should look for in Subversion, and that I should really understand the workflow in order to draw sound conclusions. I kind of understand that work at Hadley is done on branches and then merged, which would make things easiest, but I’ll need to verify that.
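As a note to myself, here is a rough sketch of the simplest version of the coupling idea: count how often two files change in the same commit. It assumes the history has been exported with `svn log -v --xml`. The sample paths and the `max_files` cutoff are made up for illustration, and nothing here reflects the actual Hadley repository layout or workflow.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from itertools import combinations


def changed_paths_per_commit(svn_log_xml):
    """Return one set of changed paths per <logentry> in `svn log -v --xml` output."""
    root = ET.fromstring(svn_log_xml)
    return [
        {p.text for p in entry.findall("./paths/path")}
        for entry in root.findall("logentry")
    ]


def co_change_counts(commits, max_files=50):
    """Count how often each pair of paths appears in the same commit.

    Commits touching more than `max_files` paths are skipped, on the
    assumption (to be checked against the real workflow) that they are
    branch merges or bulk edits rather than evidence of coupling.
    """
    counts = Counter()
    for paths in commits:
        if len(paths) > max_files:
            continue
        # Sort so each pair is always stored in the same order.
        for pair in combinations(sorted(paths), 2):
            counts[pair] += 1
    return counts


# Hypothetical log fragment; the paths are invented for this example.
SAMPLE_LOG = """<log>
<logentry revision="1"><paths>
<path action="M">/trunk/atmos/radiation.f90</path>
<path action="M">/trunk/atmos/cloud.f90</path>
</paths></logentry>
<logentry revision="2"><paths>
<path action="M">/trunk/atmos/radiation.f90</path>
<path action="M">/trunk/atmos/cloud.f90</path>
<path action="M">/trunk/config/run.nml</path>
</paths></logentry>
<logentry revision="3"><paths>
<path action="M">/trunk/ocean/mixing.f90</path>
</paths></logentry>
</log>"""

commits = changed_paths_per_commit(SAMPLE_LOG)
counts = co_change_counts(commits)
```

If work really is merged from branches, a single merge commit would lump many unrelated files together, which is why the workflow matters so much for interpreting these counts.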
Some work has used static analysis to identify coupling, which might work with Fortran. It wouldn’t capture connections with data files or with configuration options, though. Some of the source code searching could be useful, particularly for the connections between code and data files and configuration file options.
If this is at all successful, I think that a recommended list of files, functions, data files, or configurations to consider alongside a proposed change might be immediately useful. If there’s some sort of meta-tagging in the VCS, the list could be both improved and validated based on some very simple feedback or logging mechanisms.
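A minimal sketch of what such a recommendation could look like, assuming the history is already available as one set of changed paths per commit. The function name, the thresholds, and the file names are all hypothetical, and this is plain co-change counting rather than any particular published technique.

```python
from collections import Counter


def recommend(commits, target, min_support=2, top_n=5):
    """Rank paths that historically changed together with `target`.

    Returns (path, support, confidence) tuples. Support is the number of
    commits containing both paths; confidence is support divided by the
    number of commits that touched `target`.
    """
    touching = [paths for paths in commits if target in paths]
    together = Counter(p for paths in touching for p in paths if p != target)
    ranked = [
        (path, n, n / len(touching))
        for path, n in together.items()
        if n >= min_support
    ]
    # Highest support first; ties broken alphabetically for stable output.
    ranked.sort(key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


# Hypothetical history; a data or configuration file is just another path.
HISTORY = [
    {"radiation.f90", "cloud.f90"},
    {"radiation.f90", "cloud.f90", "run.nml"},
    {"radiation.f90", "run.nml"},
    {"ocean.f90"},
]

suggestions = recommend(HISTORY, "radiation.f90")
```

Logging whether a developer actually touched the suggested files would give the simple feedback signal mentioned above.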
In a very long-term kind of way, this can tie into deciding how “necessary” the couplings are. I think if we tie this to the idea of mapping the code back to physical processes, then we can see how many clusters of highly coupled (say, as in changed together often) files tie back to related physical processes. It might be possible to say something about what the clusters correspond to, and whether any of them are “historical accidents” as opposed to processes which are necessarily related. Or they could be Conway’s Law kind of effects, to mention the workshop from today. It might be that some of the coupling groups are related to a key person’s area of expertise, or to related areas of climate science that do not describe tightly coupled processes (areas where maybe there should be some sort of pseudo-inheritance? Maybe?)
To tie the code back to formulae and results, Greg has suggested provenance tools. This I’ll need to sort out more.
Specifics all to come – this is more of a “To Do” list.