Observations on software: from "State of the Nation" to nitty-gritty technical details.

2010/04/01

Distributed version control

Recently, Joel on Software published a blog entry discussing the advantages a distributed version control system [DVCS] has over a version control system [VCS] with its dependency on a central repository.  The key point, Joel says, is that "with distributed version control, the distributed part is actually not the most interesting part," but "that these systems think in terms of changes, not in terms of versions."  Joel doesn't get it.

There is no magic merge bullet going on here; Mercurial (which has a great name (see definition 4)) doesn't track every refactoring operation the developer performs.  It doesn't know (or care) that the developer moved a function from file A to file B.  Rather, just like any other VCS it simply tracks the changes to the individual files A and B.  While it can and should know that stuff was removed from file A and that stuff was added to file B, it cannot and should not know whether what was removed from A was transferred over to B.  So, what is it that really differentiates a DVCS from other VCS applications?
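To make the per-file point concrete, here is a minimal sketch using Python's standard difflib module.  The file names a.py and b.py and their contents are made up for illustration, and this is not how Mercurial stores changes internally; it only shows that diffing each file on its own yields a removal in one and an addition in the other, with nothing recording that the two are related.

```python
import difflib

# Hypothetical contents before and after moving helper() from a.py to b.py.
a_before = ["def helper():\n", "    return 42\n", "\n",
            "def main():\n", "    return helper()\n"]
a_after = ["def main():\n", "    return helper()\n"]
b_before = ["def other():\n", "    return 0\n"]
b_after = ["def other():\n", "    return 0\n", "\n",
           "def helper():\n", "    return 42\n"]


def changed_lines(before, after, name):
    """Return (removed, added) lines from a unified diff of a single file."""
    # The first two lines of a unified diff are the '---' / '+++' file headers.
    diff = list(difflib.unified_diff(before, after, name, name))[2:]
    removed = [line[1:] for line in diff if line.startswith("-")]
    added = [line[1:] for line in diff if line.startswith("+")]
    return removed, added


a_removed, a_added = changed_lines(a_before, a_after, "a.py")
b_removed, b_added = changed_lines(b_before, b_after, "b.py")

# Each file's diff is computed independently of the other.
print("removed from a.py:", a_removed)
print("added to b.py:", b_added)
```

The removed and added lines happen to match, but that is something a human reader notices, not something either diff records.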

According to Joel's own Mercurial manual the developer should use the tool as follows:
  1. check in often
  2. branch to test new approaches (branch to spike)
  3. synchronize branches quickly
  4. propagate changes (push and pull) often
  5. add new code into the VCS quickly
Now, here's the interesting thing.  All of these are factors described as "best practices" for Software Configuration Management [SCM].  What a surprise.

It isn't (just) about whether the tool tracks changesets or file versions; it is about whether the tool facilitates best practices for SCM.  It turns out that Mercurial does an outstanding job across the board:
  1. it makes cloning (branching) incredibly fast and easy.  This encourages developers to branch quickly and often when working on spikes or adding new features.
  2. it makes pushing and pulling fast and easy.  This reduces the probability of sharing workspaces and allows easy synchronization of new code between developers.
  3. it differentiates between commits (which are local to the current repository) and push/pulls (where code is shared).  This encourages developers to commit more frequently, because the fear of "breaking the build" is removed.
  4. it separates the operation of pulling changes from other developers from the operation of merging those changes into your own work.  This means that a developer can accept a change into their repository at any time without having to worry about breaking their own build, until such time as they are ready to perform the merge.
  5. it enables developers working on a new feature to distribute changes between themselves without affecting the central repository [CR] (yes, CRs are allowed in Mercurial; they just don't take as "central" a role as with other VCS tools).  Again, fear of "breaking the build" is completely removed.
  6. it does an outstanding job merging changes, in part because merges occur more frequently than in other systems, and in part because it does have an excellent merge algorithm.
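The separation between commit, pull, merge, and push in the list above can be sketched with a toy model.  This is not Mercurial's implementation or API: the Repo class, its methods, and the changeset names are all hypothetical, a changeset is reduced to an id plus its parent ids, and a merge is recorded in one step where the real tool asks for a separate commit.

```python
class Repo:
    """Toy repository: changeset ids mapped to their parent ids (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        self.changesets = {}  # changeset id -> tuple of parent ids
        self.working = None   # changeset the working copy is based on
        self._counter = 0

    def _new_id(self):
        self._counter += 1
        return f"{self.name}-{self._counter}"

    def clone(self, name):
        """Copy the whole history into a new, independent repository."""
        copy = Repo(name)
        copy.changesets = dict(self.changesets)
        copy.working = self.working
        return copy

    def commit(self):
        """Record a changeset locally; no other repository is touched."""
        cid = self._new_id()
        self.changesets[cid] = (self.working,) if self.working else ()
        self.working = cid
        return cid

    def pull(self, other):
        """Fetch missing changesets; the working copy is left alone."""
        for cid, parents in other.changesets.items():
            self.changesets.setdefault(cid, parents)

    def push(self, other):
        """Pushing is the same transfer in the opposite direction."""
        other.pull(self)

    def heads(self):
        """Changesets that are not the parent of any other changeset."""
        parents = {p for ps in self.changesets.values() for p in ps}
        return sorted(c for c in self.changesets if c not in parents)

    def merge(self, other_head):
        """Join the working changeset with another head, when the developer chooses."""
        cid = self._new_id()
        self.changesets[cid] = (self.working, other_head)
        self.working = cid
        return cid


central = Repo("central")
central.commit()
alice, bob = central.clone("alice"), central.clone("bob")

alice.commit()
alice.commit()                       # two local commits
central_after_alice = len(central.changesets)

bob.commit()
bob.pull(alice)                      # developer-to-developer, no CR involved
bob_working_after_pull = bob.working
bob_heads_after_pull = bob.heads()   # two heads until bob decides to merge

bob.merge("alice-2")
bob.push(central)                    # only now does the CR see anything
```

In this sketch Alice's commits never reach the central repository on their own, and Bob's working copy stays where it was after the pull; both things change only when a developer explicitly merges or pushes.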
I find it interesting that SCM best practices have been around for many years (the above abstract and its corresponding paper were written over a decade ago), but they are not well understood in the developer community.  Developers have been repeatedly told that using some kind of VCS is good; the problem has simply been getting developers on board with this.  SCM suffers because most available VCSs actually discourage their own use.

For example, compare Mercurial with Subversion (a popular open-source VCS).
  1. the latter has no cloning operation; to "clone" a repository you must perform a complete checkout of the source tree from the Central Repository [CR].  Checking code out from the CR is a lengthy operation, because a full copy of everything has to be transferred (probably across a network) to your development machine.
  2. the Mercurial "clone" operation actually creates a new branch.  In Subversion you "copy" an entire directory tree to another location in your repository.  The former is local; the latter requires interaction with the CR, which means that a branch in Subversion is forever embedded in the SVN database, even if it was for a spike.
  3. Subversion has no concept of push/pull; you commit code directly to the CR.  The only ways to get new code to other developers are to create and email diffs (which may or may not break when applied to their working copy, depending on the changes they have made) or to commit your changes.  The former is painful; the latter raises the specter of breaking the build and is hence relatively distasteful.
  4. it has no concept of a "local" commit; the only mechanism available to the developer to "place a stake in the ground" is to commit code to the CR.  Again, this raises the specter of breaking the build.
  5. it requires that changes be merged prior to the next commit.  If you branch, then merging isn't required, but branches are more painful to use and are merged less frequently.
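For contrast, here is an equally rough sketch of the centralized workflow described in the list above.  CentralRepo and WorkingCopy are hypothetical names, not Subversion's API, and the model is deliberately coarse: it treats the whole tree as out of date after any other commit, whereas the real tool checks only the paths being committed.

```python
class CentralRepo:
    """Toy central repository: one shared, linear revision counter."""

    def __init__(self):
        self.revision = 0


class WorkingCopy:
    """Toy working copy that remembers which revision it was checked out at."""

    def __init__(self, central):
        self.central = central
        self.base = central.revision

    def update(self):
        """Bring in everyone else's changes (and, in real life, merge them)."""
        self.base = self.central.revision

    def commit(self):
        """Committing publishes immediately; there is no local-only step."""
        if self.base != self.central.revision:
            raise RuntimeError("out of date: update and merge first")
        self.central.revision += 1
        self.base = self.central.revision
        return self.central.revision


central = CentralRepo()
alice, bob = WorkingCopy(central), WorkingCopy(central)

alice.commit()            # instantly visible to the whole team
try:
    bob.commit()          # refused: bob has not merged alice's change yet
    rejected = False
except RuntimeError:
    rejected = True

bob.update()              # the merge has to happen before the commit
bob_revision = bob.commit()
```

Every "stake in the ground" here is a change to the shared history, which is exactly the pressure the list describes.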
As you can see, the central role that the CR takes with Subversion actually hurts.  I know from personal experience that I tend to refrain from checking stuff in for long periods of time when working in the Subversion realm.  I'm going out on a limb here, but I suspect this is a common problem among developers.  We have heard for years that "breaking the build" is bad, so the natural tendency is to avoid doing things that might break the build, like committing changes to a CR.

In any case, using Subversion (and similar VCS tools) actually reduces the likelihood that developers will follow SCM best practices, because proper use of the VCS actually encourages you to do things that cause pain.  DVCS tools, on the other hand, make branching to explore new ideas and add new features a natural, meaningful way of thinking about and working with your code.  That is the real reason that DVCS tools deserve praise.
