Sunday, December 21, 2014

Git: What are diffs and hunks?

When I was learning Git for the first time many years ago, one of the features that made me go, "Wow!! That's something I have really wanted all these years!" was the ability to choose which changes to commit among all the changes in a given file. I hadn’t seen this in the other version control systems I’d used, which were CVS and SVN.
Here’s an example of what I am trying to illustrate. Suppose I have a file named Employee.java with the following contents,
class Employee {
     private String firstName;
     private String lastName;

     Employee(String firstName, String lastName) {
          this.firstName = firstName;
          this.lastName = lastName;
     }

     public boolean equals(Employee e) {
          if (!(e instanceof Employee))
               return false;
          return e.firstName.equals(this.firstName) && e.lastName.equals(this.lastName);
     }
}
Ignore the fact that there's no hashCode() implementation, please!!
You decide to add more functionality to Employee.java, namely, a grade instance variable and a toString() method that describes who the employee is and what grade they work at. Employee.java now looks like this:

class Employee {

     private String firstName;
     private String lastName;
     private String grade;

     Employee(String firstName, String lastName, String grade) {
          this.firstName = firstName;
          this.lastName = lastName;
          this.grade = grade;
     }

     public boolean equals(Employee e) {
          if (!(e instanceof Employee))
               return false;
          return e.firstName.equals(this.firstName) && e.lastName.equals(this.lastName);
     }

     public String toString() {
          return "I am " + this.firstName + " " + this.lastName + ", working as " + this.grade;
     }
}
Ignore the fact that grade is not part of equals(), please!!
When you do a git diff on Employee.java, this is what you get:

When you do a git add at this point, all the newly introduced code is staged for the next commit. Let’s say you want to commit the toString() method separately. In other VCSs, that's not simple: you would have to maintain two copies of Employee.java, one introducing the grade variable and another introducing toString(). This is cumbersome, but in Git it is very easy. You just do
git add -p
which lets you choose which pieces of the change to stage. For the above example, doing git add -p would give you


At this point, keying in 'y' adds this piece to the index (keying in 'n' skips it), after which the next piece of changed code is shown.


and so on…
When I learnt this, I thought, "All that’s fine, but what is the word ‘hunk’ doing there in ‘Stage this hunk?’? What does it mean anyway?"
To know what a hunk is, you’ll have to know more about the output of the diff command. Note that we are not talking about git diff, but just diff.

Understanding the diff command

diff is the Unix command that generates a report documenting the differences between two files. According to Wikipedia, given two files a and b, with b being an updated version of a, diff basically reports what changes should be made to a to turn it into b.
The report that diff generates can be in 3 forms. They are: a) Edit script, b) Context format, or c) Unified format. With git diff, we get the Unified format.
The unified format, explained in short, goes like this:
The entire output of diff is called ‘diff’. That’s why people often say, “Send me the diff”. They are actually asking for the output of the diff command.
A diff begins with two lines that indicate the two files being compared. The first line begins with ‘---’ and indicates the original file, while the second line begins with ‘+++’ and indicates the newer file. Added lines are prefixed with a ‘+’ symbol, deleted lines are prefixed with a ‘-’ symbol, and unchanged lines shown for context are prefixed with a space. A modified line is represented as a deletion followed by an addition.
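To see these pieces without involving Git at all, here is a minimal sketch using Python's standard difflib module, which can emit the unified format. The file names and the fruit lists are made up purely for illustration.

```python
import difflib

# Hypothetical "before" and "after" contents, one list item per line.
old = ["apple\n", "banana\n", "cherry\n"]
new = ["apple\n", "blueberry\n", "cherry\n", "date\n"]

# unified_diff yields the '---' and '+++' header lines first,
# then a range information line, then the prefixed body lines.
diff_lines = list(
    difflib.unified_diff(old, new, fromfile="a/fruits.txt", tofile="b/fruits.txt")
)

print("".join(diff_lines), end="")
```

In the printed report, the changed line shows up as a ‘-banana’ line followed by a ‘+blueberry’ line, the new last line shows up as ‘+date’, and the untouched lines are prefixed with a single space.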
Now, when a change occurs to a file, the change can be:  a) in only one line, b) in consecutive lines, or c) in lines spread all over the file.
Thus, the receiver of a diff would like to know which line numbers in the original unchanged file were changed. Hence, it is enough if the output of diff includes a special line that indicates the starting line position of the change, as well as the destination line position, followed by the actual changes. The destination line position is included since earlier changes in the same diff could have pushed the original line further down the file.
However, especially in open-source projects, it is possible that two separate users change the same line of a file. When integrating these two changes, line numbers alone are not useful. You also need some context, by which we mean a few lines before and after the changed line. This helps when applying conflicting changes like the ones just described, as the context can be used to work out how the second change should fit on top of the first.
The unified format handles both by providing context around the changed lines, and by providing a special line that indicates where in the file the first line shown is located and how many lines are shown (context lines plus changed lines). To indicate that these special lines are only for the receiver’s understanding and are not part of the changes themselves, the unified format surrounds them with ‘@@’ symbols. Such lines are called range information lines. The format of a range information line is:
@@ -<<starting line number in original file,number of lines shown from original file>> +<<starting line number in modified file,number of lines shown from modified file>> @@
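As a rough illustration of that format, a range information line can be pulled apart with a regular expression. The function name parse_range_line below is my own invention, and the sketch assumes the usual convention that a missing line count means 1.

```python
import re

# Matches lines such as "@@ -12,5 +13,9 @@ class Employee {".
# The ",count" part is optional; when absent the count is taken to be 1.
RANGE_LINE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def parse_range_line(line):
    """Return (old_start, old_count, new_start, new_count) as integers."""
    match = RANGE_LINE.match(line)
    if match is None:
        raise ValueError("not a range information line: " + repr(line))
    old_start, old_count, new_start, new_count = match.groups()
    return (
        int(old_start),
        int(old_count) if old_count is not None else 1,
        int(new_start),
        int(new_count) if new_count is not None else 1,
    )


print(parse_range_line("@@ -1,6 +1,7 @@"))
```

Anything after the closing ‘@@’ is ignored by this sketch, since it is not part of the four numbers.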

Understanding Employee.java diff

This should now help us understand the output of git diff that we did on Employee.java earlier. Let’s take a look at it again:

The first two lines that you see,
diff --git a/Employee.java b/Employee.java
index b2ea747..cbdaf9e 100644
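are generated by Git. The first names the two versions being compared, and the second records the abbreviated object hashes of those two versions along with the file mode. The actual diff output begins after them, so let's ignore these two lines and move on to the diff.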

The first two lines in the diff,
--- a/Employee.java
+++ b/Employee.java
are the two files that diff is trying to compare. Employee.java is prefixed with ‘a/’ and ‘b/’ in the two lines because Git is comparing the last staged copy of Employee.java (which, with nothing staged yet, is the same as the copy in HEAD) with your working copy. Git represents these two versions as if they lived in two folders, ‘a/’ and ‘b/’, just as a way of differentiating them. In reality, if you had used plain diff, you would have provided two files physically present on the filesystem.

The first range information line is:
@@ -1,6 +1,7 @@
In the range information line, the “-1,6” indicates that the portion of the original file shown starts at the first line of the file and covers 6 lines. The “+1,7” indicates that the portion of the new file shown starts at the first line of the file and covers 7 lines. Why 7? Because of the addition of the grade variable, which is present only in the new file.
The second range information line is:
@@ -12,5 +13,9 @@ class Employee {
In this range information line, the “-12,5” indicates that the portion of the original file shown starts at the 12th line of the file and covers 5 lines. The “+13,9” indicates that the portion of the new file shown starts at the 13th line of the file and covers 9 lines. Why is the starting line position in the new file 13? Because the grade variable added earlier pushed everything below it down by one line. Why 9 lines rather than 5? Because of the addition of the toString() method, which is present only in the new file. The text after the closing ‘@@’, class Employee {, is a hint that Git adds to show which enclosing block the change falls in.
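The arithmetic behind those two range information lines can be checked with a few lines of Python. This is only a sketch: the tuples are typed in by hand from the two range lines discussed here, and the function name explain_ranges is hypothetical.

```python
# (old_start, old_count, new_start, new_count) for each range information line,
# copied from "@@ -1,6 +1,7 @@" and "@@ -12,5 +13,9 @@".
ranges = [(1, 6, 1, 7), (12, 5, 13, 9)]


def explain_ranges(ranges):
    """For each range, report the net lines added and the expected new start."""
    shift = 0  # net lines added by all earlier ranges
    report = []
    for old_start, old_count, new_start, new_count in ranges:
        net_added = new_count - old_count
        report.append(
            {
                "expected_new_start": old_start + shift,
                "actual_new_start": new_start,
                "net_lines_added": net_added,
            }
        )
        shift += net_added
    return report


for entry in explain_ranges(ranges):
    print(entry)
```

The first range adds one net line (7 - 6), which is exactly why the second range starts at line 13 in the new file instead of 12. The second range adds four net lines (9 - 5).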

So what’s a hunk?

Now that you’ve understood the diff output, it becomes easy to understand hunks. A hunk is simply a range information line together with the change information that follows it, up to the next range information line. So when git add -p asks “Stage this hunk?”, it is asking whether to stage that one block of changes.
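To make the definition concrete, here is a small sketch that splits a unified diff into its hunks by starting a new group at every range information line. The sample diff is generated with Python's difflib from made-up file contents, and split_hunks is a hypothetical helper, not something Git provides.

```python
import difflib


def split_hunks(diff_lines):
    """Split unified diff lines into (file_header_lines, list_of_hunks)."""
    header, hunks = [], []
    for line in diff_lines:
        if line.startswith("@@"):
            hunks.append([line])    # a range information line opens a new hunk
        elif hunks:
            hunks[-1].append(line)  # body line belongs to the current hunk
        else:
            header.append(line)     # the '---' and '+++' lines before any hunk
    return header, hunks


# Hypothetical 20-line file with two changes that are far apart.
old = ["line %d\n" % i for i in range(1, 21)]
new = list(old)
new[1] = "line two\n"
new[17] = "line eighteen\n"

diff_lines = list(difflib.unified_diff(old, new, fromfile="a/f.txt", tofile="b/f.txt"))
header, hunks = split_hunks(diff_lines)

for number, hunk in enumerate(hunks, start=1):
    print("hunk", number, "->", hunk[0].strip(), "with", len(hunk) - 1, "body lines")
```

Because the two changes are too far apart for their context lines to touch, the diff contains two hunks, and each one could be accepted or rejected on its own.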