Thursday, 23 October 2014

Thoughts on the demise of CodePlex and Mercurial

I've been an enthusiastic user of CodePlex ever since it first launched in 2006. 14 of my open source projects are hosted there, including my "main" contribution to .NET open source, NAudio, and my most downloaded project of all time, Skype Voice Changer.

CodePlex was for me a huge improvement over SourceForge, where I had initially attempted to host NAudio back in 2003, but never actually succeeded in figuring out how to use CVS on Windows. Thanks to the TFS to SVN bridge, CodePlex source control was easy to work with using TortoiseSVN, and offered discussion forums, bug tracking, release hosting, and even ClickOnce hosting, which I make use of for a number of projects.

I was particularly excited in 2010 when CodePlex started to support Mercurial for source control. I was just awakening to the amazing power and flexibility of DVCS, and I quickly settled on Mercurial in preference to Git - it just seemed to play a lot better with Windows, had a simpler command line syntax, and wasn't blocked by my work firewall. So I made the switch to Mercurial for NAudio in 2011.

Git vs Mercurial

It became obvious soon after making my decision that Git was definitely winning the popularity contest in the DVCS space. Behind the sometimes arcane command line syntax lay an incredibly powerful feature set, and slowly but surely, thanks to tools like SourceTree and GitHub for Windows, the developer experience on Windows improved and overtook Mercurial.

A case in point would be Microsoft’s decision to support Git natively in Visual Studio. Thanks to the built-in integration, it is trivially easy to enable Git source control for every project you create. For a long time I hoped that Mercurial support would follow, especially since Microsoft had appeared to back it in the past through CodePlex, but they made it clear that Mercurial support was never coming.

Likewise with Windows Azure, when Microsoft added the ability to deploy using DVCS, it was Git that was supported, and Mercurial users were left out in the cold again. I believe that has actually now been rectified, but for me at least, the damage had been done. I’ve used Git for almost all my new projects for over a year now, and only really use Mercurial for my legacy projects. It’s obvious that if I want to integrate with the latest tooling, I need to be using Git, not Mercurial.

GitHub vs CodePlex

Although CodePlex added Mercurial hosting back in 2010, it was obvious that they were rapidly losing users to GitHub, and in 2012, they finally added Git support. In theory this should have revived CodePlex as the premier hosting site for .NET open source, but it became apparent that they were falling behind in other areas too. GitHub’s forking and pull request system is very slick, and GitHub pages is a much nicer option than the rather clunky wiki system that CodePlex uses for documentation.

For a while it looked like CodePlex was fighting back, with regular updates of new features, but the CodePlex blog has had no news to announce for over a year now, and perhaps more of an indictment, GitHub has become the hosting provider of choice for the new and exciting open source projects coming out of Microsoft, such as the new ASP.NET vNext project. There are some notable exceptions such as Roslyn (which is considering a move to GitHub) and Visual F# tools (which has a top-voted feature request to move to GitHub).

Does CodePlex offer any benefits over GitHub? Well there are a few. I like having a Discussion forum separate from my Issues list. The ClickOnce hosting is useful for several of my projects. And I can’t complain about the modest income stream that their DeveloperMedia ad integration allows you to tap into if you have a popular project. But GitHub is a clear winner in most other respects.

Standardisation vs Competition

Now it could be considered a good thing that Git has won the DVCS war and GitHub has won the open source hosting war. It allows us all to embrace them and standardise on one way of working, saving time learning multiple tools and workflows. But there is a part of me that feels reluctant to walk away from Mercurial and CodePlex, as a lack of competition in these spaces will ultimately leave us poorer. If there is no viable competition, what will drive Git and GitHub to keep innovating, and meeting the needs of all their users?

For example, GitHub at one point unexpectedly decided to ditch their uploads feature. This immediately made it unsuitable for hosting lots of the sorts of projects that I work on, which are released as a set of binaries. It looks like they have remedied the situation now with a releases feature, but for me that did highlight a danger that they were already in such a position of strength they could afford to make a decision that would be hugely unpopular with many of their users.

I’m also uneasy about the way that GitHub has become established in the minds of many developers as the only place that counts when evaluating someone’s contribution to open source. My CoderWall page simply ignores all my CodePlex work and focuses entirely on a few peripheral projects I have hosted on GitHub and BitBucket. My OpenHub (formerly Ohloh) page does at least attempt to track my CodePlex work, but somehow only picks up a very limited subset of my actual commit history (apparently I did almost nothing in the last two years). I’d rather they showed nothing about my commit history than a misrepresentation. I’ve also read numerous blogs proclaiming that you should only hire a developer after checking their GitHub contributions. So it is concerning that all the work I have done on CodePlex counts for nothing in the minds of some, simply because I did it on the wrong hosting site with the wrong DVCS tool. Hopefully the new Microsoft MVP criteria won’t take the same blinkered approach.

Time to Transition?

So I find myself at the end of 2014 wondering whether the time has come to migrate NAudio to GitHub. I was initially against the idea, but it would certainly make it easier to accept contributions (very few people are willing to learn Mercurial), and GitHub pages would be a great platform to build improved documentation on. And all of a sudden these tools that attempt to “rank” you as a contributor to open source would finally recognise me as having done something!

But part of me wishes that the likes of CodePlex and Mercurial would have a renaissance, and that new DVCS (Veracity?) and alternative open source hosting sites like the excellent BitBucket would continue to grow and flourish, providing real competition to Git and GitHub and spurring them on to more innovation.

I’d love to know your thoughts on this in the comments. Have you transitioned your open source development to Git and GitHub, and why / why not?

TLDR: Git is awesome and GitHub is awesome but the software development community is poorer for there being no viable competition.

Monday, 27 January 2014

Announcing Understanding Distributed Version Control

I’m very pleased to announce that my third Pluralsight course has been published today. This one is entitled “Understanding Distributed Version Control”. Regular followers of my blog will know this is a subject I often write about, and one I have spoken on at various developer groups. This course draws and expands on that material to provide what I hope will be a really accessible introduction to what Distributed Version Control systems are, how they work, what the workflow is, and why you should consider using them.

The course is aimed at anyone who is interested in finding out what all the fuss is about DVCS. I know that when I started investigating DVCS, it seemed quite confusing at first, so I have tried to make the type of course that I wish I had seen back then. I focus in particular on explaining the way that DVCS systems store the revision history (in a graph structure known as a “DAG”). Personally I think that once this concept becomes clear in your mind, much of the complexity of DVCS goes away.

I have also tried to show what benefits there are to switching from centralised to distributed version control systems. I know many developers quite sensibly take an “if it ain’t broke don’t fix it” attitude, and so can be reluctant to consider such a fundamental shift in the way they use version control. I’ve tried to show that whether you work alone on a “single developer project”, or work on open source, or work in a large team of developers commercially, DVCS has some very compelling benefits. I’ve also tried to be open about some of the limitations of DVCS compared with centralised tools. I am very enthusiastic about the power and potential of DVCS, but I do recognise that there are still some rough edges in the tooling at the moment.

The final thing I try to do in this course is give you a feel for the general workflow involved in using DVCS day to day. I’ve done this by showing demos in both Mercurial and Git. I quite deliberately chose to use two different DVCS tools to show that the main principles are the same, irrespective of the particular tool you choose. Since there are many other great tutorials as well as Pluralsight courses going into more detail on the specifics of individual DVCS such as Git or Mercurial, I didn’t want to repeat that material. Really my course is designed as a precursor to those courses, helping you understand the big picture, before you get into the nitty-gritty of learning the command line syntax for various operations.

I hope you enjoy the course and find it helpful. I love to hear feedback from people who watched the course, so let me know how you get on with it. I’m always wanting to know how I can improve my courses, so any constructive criticism will also be gratefully received.

Saturday, 16 November 2013

How to Convert a Mercurial Repository to Git on Windows

There are various guides on the internet to converting a Mercurial repository into a git one, but I found that they tended to assume you had certain things installed that might not be there on a Windows PC. So here’s how I did it, with TortoiseHg installed for Mercurial, and using the version of git that comes with GitHub for Windows. (Both hg and git need to be in your path to run the commands shown here.)

Step 1 – Clone hg-git

hg-git is a Mercurial extension. This seems to be the official repository. Clone it locally:

hg clone https://bitbucket.org/durin42/hg-git

Step 2 - Add hg-git as an extension

You now need to add hg-git as a Mercurial extension. This can be done either by editing the mercurial.ini file that TortoiseHg puts in your user folder, or by enabling it for just this one repository, by editing (or creating) the hgrc file in the .hg folder and adding the following configuration:

[extensions]
hggit = c:/users/mark/code/temp/hg-git/hggit

Step 3 – Create a bare git repo to convert into

You now need a git repository to convert into. If you already have one created on GitHub or BitBucket, then this step is unnecessary. But if you want to push to a local git repository, then it needs to be a “bare” repo, or you’ll get this error: “abort: git remote error: refs/heads/master failed to update”. Here’s how you make a bare git repository:

git init --bare git_repo

Step 4 – Push to Git from Mercurial

Navigate to the Mercurial repository you wish to convert. The hg-git documentation says you need to run the following one-time configuration:

hg bookmarks hg

And having done that, you can now push to your git repository, with the following simple command:

hg push path\to\git_repo

You should see that all your commits have been pushed to git, and you can navigate into the bare repository folder and do a git log to make sure it worked.
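
For example, a quick sanity check (assuming the bare repository folder is called git_repo, as above):

cd git_repo
git log --oneline --graph --all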

Step 5 – Get a non-bare git repository

If you followed step 3 and converted into a bare repository, you might want to convert to a regular git repository. Do that by cloning it again:

git clone git_bare_repo git_regular_repo

Tuesday, 11 September 2012

New Open Source Project–GraphPad

I’ve open sourced a simple utility I wrote earlier this year, when I was preparing to give a talk on DVCS. It’s called GraphPad, and it’s a simple tool for visualising Directed Acyclic Graphs, with the ability to define your own with a basic syntax, or import history from a Git or Mercurial repository.

[image: screenshot of GraphPad visualising a repository DAG]

The idea behind it was that I would be able to use it in my talk to show what the DAG would look like in various scenarios. The really tricky thing is coming up with a good node layout algorithm, and mine is extremely rudimentary.

What’s more, there are some very nice ones being developed now, particularly for Git, such as SeeGit or the very impressive “dragon” tool that comes with the Git Source Control Provider for Visual Studio, both of which are also WPF applications. Mine does at least have the distinction of being the only one I know of that also works with Mercurial.

For now, I am not actively developing this project, but I thought I’d open source it in case anyone has a use for it and wants to take it a bit further (the next step was to make the nodes show the commit details in a nice looking tooltip for Git/Hg repositories, as well as showing which nodes the various branch and head/tip labels are pointing at).

You can find it here on bitbucket.

Friday, 20 April 2012

Useful Git and Mercurial Links

Following on from my talk on DVCS at NxtGenUG last night, here are some links to the tools I talked about.

Git Install on Windows

Git users on Windows should start by installing msysgit. Visual Studio integration is available with VisualGit. To visualise your repository, keep an eye on SeeGit, which is still in its early stages, but could develop into a very useful tool. There is also a forthcoming GitHub for Windows application (still in private beta) that should be worth the wait. Finally, Posh-Git provides good PowerShell integration for git.

Mercurial Install on Windows

Get started with Mercurial by installing TortoiseHg, and if you want Visual Studio integration there is VisualHg. Posh-Hg provides PowerShell integration.

Converting Repositories

Both Mercurial and git support importing history from other version control systems. In Mercurial, see the convert extension, which supports many other source control systems. You can also import from TFS and SourceSafe.

For git users there are similar tools. Here’s some help for converting VSS to git, TFS to git and Subversion to git.

Publishing a Repository from Your Local Machine

There are occasions when you want another developer to pull from a repository on your local machine. With Mercurial, you can use hg serve to launch a web server, and with git, you use git daemon.
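
For example (a sketch – hg serve is run from inside the repository, git daemon from the folder above it; the port and flags shown are just common defaults):

hg serve --port 8000
git daemon --base-path=. --export-all

Other developers can then pull from you over HTTP (Mercurial) or over the git:// protocol (git daemon’s default, on port 9418).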

Open Source Git and Mercurial Hosting

SourceForge, Google Code, Bitbucket and CodePlex all support both git and Mercurial open source projects. Then there is GitHub, which only supports git, but has a deserved reputation for being one of the best DVCS hosting sites out there.

Private Git and Mercurial Hosting

A number of sites allow you to pay for hosting of private repositories. Both GitHub and Bitbucket offer paid options. Bitbucket has the distinction of offering unlimited free private repositories (with a limit on the number of users). Others worth looking at include Repository Hosting and Unfuddle. Fog Creek have a product called Kiln which is based on Mercurial.

If you want to host the repositories on your own server, you can do so, but the hosting software tends to be more primitive than that offered by sites like GitHub, so you miss out on features like user management, forking, pull requests etc. There is an open source project called RhodeCode you can use to host Mercurial repositories. It would be great if a product similar to GitHub were available for you to install locally, but I don’t know of any such product at the moment (let me know in the comments if such a thing does exist – Gitorious is the closest I have seen so far).

Other DVCS products worth keeping an eye on

Although git is very much out in front at the moment, I think there is still plenty of scope for innovation in the DVCS space, so it may not be the final word. Three to keep your eye on for the future are Fossil, PlasticSCM, and Veracity, which look to fill some of the gaps in existing DVCS systems, like better visualisation, integration with bug tracking systems and IDEs, or capabilities such as exclusive file locks or archiving of large binary files.

Tuesday, 17 April 2012

Understanding Distributed Version Control Systems

This post is based on a short article I wrote to explain DVCS to my colleagues at work (hence the comparison with TFS). For those wanting to go further, I strongly recommend Eric Sink’s book Version Control By Example. I’ll be presenting some of this material at an upcoming session at NxtGenUG in Southampton.

The last decade has seen a lot of innovation in version control systems, with the rise of the Distributed Version Control System the most notable development. Distributed Version Control Systems (DVCS) are an idea that was first implemented in the 90s, but in the last five years, open source DVCS such as Git and Mercurial have rapidly started to gain market share over the more established centralized version control systems, such as SourceSafe, TFS or Subversion.


The Centralized Model

In the centralised version control systems (CVCS) that you are already familiar with, every developer connects to the central server and asks for the latest version of the source code. They then work locally on their machine until they have completed a feature, at which point they check in. As soon as they check in, those changes are made available on the central server for other developers. If someone else checked in before them, they must get latest and merge before they can check in.


Understanding DVCS with DAGs

To understand Distributed Version Control systems, you need to think of your repository in terms of a directed acyclic graph (a “DAG”). Each node in the graph represents a revision, and each arrow points from a revision to the revision it was based on (its ‘parent’ node), so the arrows go in the opposite direction to time.

Imagine that our repository has had three commits (1, 2 and 3). This can be represented by the following simple DAG. Revision 3 is a change based on revision 2, and revision 2 is a change based on revision 1.

[image: the SERVER’s DAG: 3 → 2 → 1]

Cloning

With a CVCS, we would do a get latest and just get the state of all files at revision 3. However, in the world of DVCS, we do a clone instead. This means that we take a copy of the whole DAG. (n.b. This makes some people concerned about speed, but a clone is only done once, and DVCS are well known for being much faster than CVCS for most operations). Suppose BILL decides to take a clone from the server. Now his local copy looks like this:

[image: BILL’s clone, identical to the SERVER: 3 → 2 → 1]

In other words, it is identical to the SERVER. This is why people say that you don’t need to have a central server for DVCS, since we could take away SERVER and nothing would be lost. However, in practice, it usually makes sense to designate one computer as the central repository.
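
In Mercurial, for example, the whole clone is a single command (the URL here is just a hypothetical central server):

hg clone https://server/central-repo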


Committing

Now BILL wants to make some changes. He writes some code and performs a commit. Now his DAG has an extra node in it:

[image: BILL’s DAG with a new local commit: 4 → 3 → 2 → 1]

However, unlike a checkin in TFS or another CVCS, nothing has been sent to the server. Revision 4 is only on BILL’s machine.

BILL can actually carry on developing and do another commit. Now he has two local revisions that the SERVER doesn’t have.

[image: BILL’s DAG: 5 → 4 → 3 → 2 → 1]

Already, the benefits should be obvious. It is like BILL has his own personal branch to work on. If Revision 5 was a mistake, he can roll back to Revision 4 and try again. He is able to make regular and often small checkins without fear that he will break the build by committing to the SERVER.
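
In Mercurial, each of those local commits is simply (a sketch):

hg commit -m "first part of my feature"

No network access is needed – the new revision is recorded only in BILL’s local DAG.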


Pushing

At some point, BILL is ready to share his work with everyone else so he does a push. This simply compares his DAG (which contains Revisions 1-5) with the SERVER’s DAG (which currently still only has revisions 1-3). When he does a push, the DVCS works out that the server needs revisions 4 and 5, so it appends them onto its own DAG. Now the server’s DAG is identical to BILL’s again.

[image: the SERVER’s DAG after the push: 5 → 4 → 3 → 2 → 1]
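
In Mercurial, assuming BILL’s clone still remembers the SERVER as its default path (which it does after a clone), the push is simply:

hg push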


Pulling

But what if someone else got in there before BILL? Maybe the SERVER now looks like this, with revisions 10 and 11 having been pushed from someone else:

[image: the SERVER’s DAG: 11 → 10 → 3 → 2 → 1]

Maybe we would like this to happen when we push:

[image: a hypothetical DAG with BILL’s revisions 4 and 5 appended after revision 11]

But this is not allowed (and in fact not possible). Revision 4’s parent is Revision 3, not Revision 11, so we can’t just stick it on the end of our DAG. What will happen is much the same as with a CVCS, which will block you from checking in, and tell you that you need to do a get latest to merge those changes into your working copy. A DVCS will warn you that you probably don’t want to do a push because it will create two “heads”, and instead you should do a pull and a merge.

So BILL does a pull, which is the opposite of a push. It looks at the SERVER and sees what nodes it has that he doesn’t. In this case, it is revision 10 and revision 11. They are pulled from the SERVER and added to his DAG. But notice that now both Revision 4 and Revision 10 are changes to Revision 3, and it means that our repository now has two “heads” - revision 5 and revision 11.

[image: BILL’s DAG with two heads: 5 → 4 → 3 and 11 → 10 → 3]

The good news for BILL is that his local commits, 4 and 5 are still perfectly safe and intact. Unlike a get latest with CVCS, no automatic merging has taken place that could overwrite or break anything he has already done. The merge takes place as a separate step.
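
Again in Mercurial, a sketch of what BILL runs:

hg pull     # fetches revisions 10 and 11, but leaves the working copy alone
hg heads    # would now list the two heads, revisions 5 and 11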


Merging

Now that BILL has the latest changes from the SERVER, he performs a merge. If there are no conflicts then this is quite trivial. If someone else has changed the same bit of code as him, then he must use the typical merge tools you are already familiar with to select which change is the correct one – there is no getting round this. However, once he has done the merge, he then makes another local commit, in this case, revision 6. Now his repository has only one head and is ready to be pushed to the SERVER.

[image: BILL’s DAG after the merge: revision 6 with parents 5 and 11]
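
In Mercurial, the merge step looks something like this (a sketch):

hg merge                          # merge the other head into the working copy
hg commit -m "merge server work"  # creates revision 6
hg push                           # now safe: the SERVER ends up with a single head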


What are the benefits of DVCS over CVCS?

Hopefully some of the benefits of DVCS over CVCS are already apparent, but let me list a few.

First, it promotes little and often checkins. With a CVCS, you only want to check in if you are 100% sure that your work is fully tested and won’t break the build. This encourages developers to work sometimes for weeks at a time without checking in. With DVCS, you can check in dozens of times a day, and still only push to the SERVER when you are ready. (n.b. it is up to you whether you want to combine all your local checkins into one before you push to the server. Different developers have different philosophies about this).

Second, it gives every developer multiple personal branches. It is common to have to context switch in the middle of working on something (“drop everything and fix this bug right now”). Whilst TFS has Shelvesets and Workspaces that allow you to separate out multiple tasks you are working on, they are rather cumbersome to use, and end up getting neglected. With DVCS, it is trivially easy to create another local branch (or clone, depending on your preference), to work on the new feature. You can even merge freely between your local branches if necessary.
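
In Mercurial, switching to a quick bugfix branch might look like this (a sketch using named branches – bookmarks work just as well):

hg branch bugfix-1234       # start a local branch for the urgent fix
hg commit -m "fix the bug"  # commit the fix on that branch
hg update default           # switch back to the feature you were working on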

Third, it allows ad-hoc teams. You don’t have to push everything via the central server. If you are working with another developer and want to share some work-in-progress changes you have made, you can simply pull from each other. Despite both developers now having the same revisions in their local DAG, each one will only get pushed to the server once, and will be attributed to the developer who wrote the code. TFS shelvesets cannot do this.

Fourth, it gives complete branching flexibility. With TFS you can only safely merge into a parent or child branch. This restriction is completely removed with DVCS. You are free to pull or push to whatever branch you want. Obviously there will still need to be processes in place that say what branches ought to be pushed to.

Fifth, it allows for much more flexible handling of merges. Unlike with a CVCS you are not forced to deal with them the moment you do a get latest. Your local repository can have two heads for a time, allowing you to defer the merge until you are ready. It also no longer means that the first developer to check in wins, while the second is lumbered with the merge. You can ask the developer who made the conflicting changes to pull from your local repository, perform the merge, and then you can pull that merge back from them.

Sixth, it allows disconnected working. If you need to work from home without network access, you can still commit locally and push when you have access to the network. This is also good for outsourced teams. They can make commits to their clone of your repository, and you can pull those revisions into your own without ever needing to give them commit access to your own server.

Seventh, it is great for backup. The central server could suffer a catastrophic disk failure, but it can be instantly recreated by cloning from a developer’s repository. Also, a developer can easily backup their work in progress by pushing to a repository on another computer.


What are the limitations of DVCS?

So far I have painted a very positive picture of DVCS. Is there anything it doesn’t do well?

One feature offered by some CVCS that you will probably lose is the ability to lock files so that only one person can work on them at a time. This is most often needed for binary assets. Obviously if your developers are disconnected from a central server, they have no way of knowing whether someone else is also working on a file. However, some DVCS have extensions allowing you to manage files that need to be locked via a central server.

DVCS are not typically a good choice if you want to store very large files or to have very large repositories where people might not want to do a get latest of every folder in the repository. Most users of DVCS simply develop practices that don’t require storing huge binaries in source control, and split vast projects up into smaller sub-projects.

With a DVCS, history is immutable. If you want to modify revision 2, or expunge it altogether from the history of your source control, you can get into trouble, as you effectively end up with a completely new DAG. When the next user does a push, the old revision 2 and its descendants will come back. You have to make the change on a central repository and then ask all your developers to destroy their local repositories and re-clone from the central one to permanently change history.
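
Mercurial’s strip command illustrates the problem (a sketch – strip ships with the mq extension, which must be enabled first):

hg strip -r 2    # removes revision 2 and all its descendants from the local DAG

This only affects the local repository; any clone that still contains the old revisions can push them straight back.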


How do I get started?

There’s nothing stopping you trying out DVCS for yourself, and I strongly recommend it for any small projects you write. Even if you are the only developer it is great to have a commit history to remind you what you were doing, and the ability to go back in time if you make a mistake. I use Mercurial for all my personal projects, and if you ever collaborate on open source projects you will likely end up needing to learn git at some point, as it is the most popular.

Monday, 20 February 2012

To rebase or not to rebase, that is the question

In recent weeks I’ve made my first code contributions to open source projects on GitHub. A number of GitHub hosted projects like their contributors to rebase their changes before issuing a pull request. Since I normally use Mercurial, which doesn’t enable rebasing by default (although it is easily turned on with an extension), I haven’t used rebase much except to experiment with it. And rebase is notorious for causing problems if you use it incorrectly.

Why would you want to rebase?

1. Simplified history

The main argument for rebasing is that it makes the history of your repository much simpler. Instead of seeing lots of short-lived branches and merge commits, you get one single linear history for your branch.

This might seem like a case of OCD on the part of the developers wanting rebase, but unfortunately, the task of looking back through source control history is something we’ve all had to do at some time or another, so the desire for it to appear as simple as possible is understandable.

If, like me, you quite often forget to do a pull before making a change to your local clone, you may end up requiring a merge when you do get round to doing a push. And more often than not, it is a trivial merge, since your change doesn’t conflict with the most recent commits that you forgot to fetch. In this scenario, rebase is able to present your change as though you had remembered to do the pull before starting work, and eliminates the superfluous merge commit.
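
In git, this is exactly what a pull with the rebase option gives you (a sketch, assuming origin/master is the branch you forgot to pull from):

git pull --rebase origin master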

Also if you work on a feature for a few weeks, you might end up with lots of intermediate merge commits as you keep pulling from the source repository. Rebasing simplifies the history of your feature by eliminating the many (often trivial) merge commits.

2. Commit combining

At the same time as rebasing, it is a common practice to combine your commits into one. Again, this is more about keeping the history of your repository nice and clean. If you made 20 commits to implement a single feature, perhaps with a few of them going in directions that you later backtracked out of, do you really want all those to be pushed up into the master repository?

Again, at some point in the future, someone will be trawling back through history trying to find where something went wrong, and it is a waste of their time to look at a revision that perhaps contains a silly mistake (might not even build), when your work could be combined into one revision.

Having your feature contained in a single commit also makes life simpler for the code reviewer, since they can just look at the diff for that one revision (although your tools ought to be able to show you a combined diff for a range of revisions – I noticed that github does this very elegantly with pull requests that contain multiple revisions).

Some people have called commit combining a “pretend you are perfect” feature – where you show the final outcome of your work without revealing all the stupid mistakes you made along the way. But that shouldn’t be the point of combining commits. You do it to save time for whoever in the future needs to look back at the history of the project.
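
In git, commit combining is typically done with an interactive rebase, something like this (a sketch, assuming your feature branch was taken from master):

git rebase -i master

This opens an editor listing your commits; marking all but the first as “squash” collapses them into a single commit.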

What are the dangers?

So if rebase makes our history simpler, why would some people not want to use it? There are a few dangers associated with rebasing and combining commits.

1. Rebasing published revisions

The biggest danger with rebase is rebasing already published revisions. This is most likely to happen if you accidentally rebase the wrong way round (rebasing changes you pulled in from elsewhere onto your own work), something I imagine would be quite easy to do by mistake for a beginner with DVCS. Or maybe you forgot or unintentionally already pushed the revisions you are about to rebase to a public repository.

Doing this means that the revisions you rebased, rather than disappearing, will keep coming back again, and end up getting merged back in alongside the rebased versions of the same change. It can be very hard with a DVCS to get rid of revisions you no longer care about once they have been published.

In Mercurial 2.1, there is a new feature called phases, which gives the repository the ability to know which revisions have been published. This means that commands like rebase can now refuse to work on published revisions, making it a much safer command to use. It will be interesting to see how well this works in practice, and if it does work, maybe Mercurial will allow rebase and other history-changing extensions to be enabled by default. Having said that, since (unlike git) Mercurial by default pushes and pulls all heads, you might find you end up sharing a work-in-progress branch earlier than you intended.
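
You can inspect and change a revision’s phase from the command line (commands available from Mercurial 2.1 onwards):

hg phase -r .                  # show the phase of the current revision
hg phase --draft --force -r 5  # force a published revision back to draft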

2. Loss of information provided by a merge commit

One of the benefits of rebase, the removal of the merge commit, is also one of its dangers. It is possible for a rebase to complete successfully, with no merge conflicts, but the resulting code to be broken (e.g. one developer adds a call to functionX, while another developer renames functionX to functionY).

The original commit you made may well have been working perfectly and passing all its unit tests when it was in its own branch, but now it has been rebased there is a commit with your name against it that doesn’t build or contains bugs.

With an explicit merge commit, it is much easier to identify the point at which things went wrong. This remains the main reason why I am not convinced that rebase should be a major part of my workflow. The important thing to remember is that a rebase is just like a merge – it needs to be tested before it can be considered a success, even if there were no conflicts.

3. Loss of intermediate revisions

The goal of rebasing and collapsing is to get rid of intermediate revisions. But I wonder whether you could shoot yourself in the foot by over-enthusiastic collapsing of multiple revisions into one. Often after getting a feature working, I might quickly go over the code and do some last minute cleanup, refactoring a few class names, deleting TODO comments etc. But what if I accidentally break something in these final commits? If I collapse to a single commit it is too late to do a revert of the offending revision, or to rewind to the last good revision and make the changes correctly. Keeping your intermediate commits allows you to backtrack to the last good point.

Summary

In short, I think rebase is a useful tool to have available, but one that should be used with caution. Innovations like Mercurial’s phases could make it much safer, but on the whole, I prefer my source control history to show what really happened, rather than what I would have liked to happen in an ideal world.

As always, I welcome your thoughts on this in the comments.

Thursday, 16 February 2012

Fork First or Just in Time?

I’ve been following the progress of Code 52 a bit over recent weeks. It is an audacious attempt to create a new open source project every week for a year. They also seem willing to accept code contributions, so I was tempted to download a few of their projects and make some minor improvements.

I’ve been using Mercurial for my projects at CodePlex for some time now, but I hadn’t used git in anger, and since Code52 store all their projects on the very impressive github, it gave me a good excuse to learn.

I read a few tutorials on how to fork a repository in github. The official guide is good, and Scott Hanselman wrote a great post on how to contribute to Code 52 projects. But one thing they all have in common is a workflow of Fork, Clone, Commit, [optional: Pull & Merge], Push, Pull Request. This has always struck me as being the wrong way round. (CodePlex recommends essentially the same workflow for Mercurial.)

Fork First

The reason I don’t like this workflow, is that it assumes the first thing I want to do is create a fork. But that’s not how I typically interact with an open source project. My workflow goes like this:

  1. I come across a new open source project and maybe I find it interesting
  2. Often I will just want to download compiled binaries, but maybe I want to explore the code to see how it was implemented
  3. I clone it (git/hg clone) and maybe I will get round to playing with it later
  4. I attempt to build it locally and maybe it succeeds on my machine (surprising how often it doesn’t)
  5. I attempt to use it and maybe I find a bug or I wish it had a new feature
  6. I report the bug or feature to the developers, and maybe I think I could fix it myself
  7. I explore the source code, and maybe I understand it well enough to make a change
  8. I begin coding a fix/feature, and maybe I get it working
  9. I realise my code needs cleaning up before I issue a pull request, and maybe I get round to doing so
  10. If I have made it this far, now is the time I am ready to push to a public fork and issue a pull request. I estimate I get to this step on less than 1 percent of open source projects I come across.

As you can see, it is only at step 10 that I need to have a fork, but the tutorials all want me to make my fork at step 3. This results in lots of projects having multiple forks that have never been pushed to. Or have been pushed to but no pull request ever submitted, leaving you wondering what the status of the changes is.

Just in time fork

In my opinion, forks (which are really just publicly visible clones), should be made just in time. Currently, Code 52’s Pretzel project has 47 forks, and as far as I can tell, many (most?) of them have had no changes pushed to them at all. (In fact, a nice github feature would be to hide forks that have not been pushed to yet, and to highlight forks that have pull requests outstanding).

The just in time fork workflow isn’t difficult. First clone from the main repository. Think of this clone as your private fork if that helps.

git clone https://github.com/Code52/pretzel.git

Once you decide to make some changes to your repo, you can make a branch to work on (not strictly necessary, but recommended).

git checkout -b my-new-feature

Now work away on your feature. You can pull in changes from the master repository, and optionally merge them into your working branch whenever you like.
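
Pulling in those upstream changes might look like this (a sketch – since we cloned from the main repository, origin points at it):

git checkout master
git pull origin master    # bring master up to date with the main repository
git checkout my-new-feature
git merge master          # optionally merge the latest changes into your branch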

Once you are sure that you want to contribute to the project, at this point, you create your public fork on github. Now you add it as a remote:

git remote add myfork https://[email protected]/myfork/pretzel.git

You can now easily push to your github fork. I think it is probably best to also have a feature branch on your github fork, which means that if you wanted to contribute another unrelated feature, you could do that in another branch, and have two pull requests outstanding that weren’t dependent on each other.

git push myfork my-new-feature

The github web UI makes it very easy to issue a pull request from a branch.

Summary

Why create dozens of unused forks when it is straightforward to create them at the point they are needed? Am I missing some important reason why you shouldn’t work like this? Let me know in the comments.