Comparing Binary Files

Linux is rich in ways to compare and analyze text files. The diff command will compare two files for you, and highlight the differences. It can even provide a few lines on either side of the changes to provide some context around the changed lines. And the colordiff command adds color to make visually parsing the differences even easier.

Developers and authors use diff to highlight the differences between different versions of program source code files, or draft texts. It’s fast and easy, and you don’t need any technical skills to see the differences between strings of text.

In the world of binary files, things aren’t so simple. Binary files are not composed of plain text. They’re made up of many bytes containing numeric values. If it’s a compressed file such as a gzipped TAR archive or a ZIP file, those values represent the compressed files that are stored inside the archive file, along with the tables of symbols that are required for the decompression and extraction of the files.

If the binary file is an executable file, the numeric values of the file’s bytes are interpreted as such things as machine-code instructions for the CPU, metadata, labels, or encoded data. Changes to a binary file or a library file are likely to lead to differences in behavior when the binary executes or is used by another application.

It’s easy to spoof the creation or modification date and time of a file. That means there could be two versions of a file that have the same name, file size—if the changes replace existing content byte for byte—and date stamps. And yet, one of the files may have been altered.
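As a rough illustration of how little the metadata proves, the sketch below builds two small stand-in files and copies one file's timestamps onto the other with touch -r. The file names and contents are made up for the example, and stat -c assumes the GNU version of stat found on most Linux systems.

```shell
dir=$(mktemp -d)

# Two hypothetical 16-byte files that differ in a single byte.
printf 'library-v1.0\000\001\002\003' > "$dir/original.so"
printf 'library-v1.0\000\001\377\003' > "$dir/altered.so"

# Set an arbitrary modification time on the original (1 Jan 2020, 12:00),
# then copy its timestamps onto the altered file with touch -r.
touch -t 202001011200.00 "$dir/original.so"
touch -r "$dir/original.so" "$dir/altered.so"

# Size in bytes (%s) and modification time (%y) now match.
original_meta=$(stat -c '%s bytes, modified %y' "$dir/original.so")
altered_meta=$(stat -c '%s bytes, modified %y' "$dir/altered.so")
echo "original: $original_meta"
echo "altered:  $altered_meta"

# cmp -s prints nothing; its exit status is 0 only when the contents are identical.
if cmp -s "$dir/original.so" "$dir/altered.so"; then
    contents="identical"
else
    contents="different"
fi
echo "contents: $contents"

rm -rf "$dir"
```

The two metadata lines come out the same even though the contents do not, which is why size and date stamps alone can't be trusted.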

Secure Hash Algorithms

A secure hash algorithm is a mathematical algorithm. It scans all the bytes in a file and applies a mathematical transform to them to generate a fixed-length hash value. SHA-256, for example, produces a 256-bit value, usually written as 64 hexadecimal digits. The same file will always produce the same hash. Even a one-byte difference will result in a radically different hash.

You’ll often see the hash of a file displayed on its download page. You should generate a hash for the file once you’ve downloaded it. If it is different from the hash displayed on the webpage, you know that the file is compromised. It has either been tampered with and substituted for the genuine file—to make people download the tainted file—or it has been corrupted in transit.
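The check described above can be scripted with the --check option of sha256sum. This is a minimal sketch: the file name download.iso is hypothetical, and the "published" checksum is computed locally only so the example is self-contained.

```shell
dir=$(mktemp -d)

# Stand-in for a downloaded file (hypothetical name and contents).
printf 'pretend this is an installer image\n' > "$dir/download.iso"

# Stand-in for the checksum a download page would publish. In real use you
# would paste the published line into this file instead of computing it.
sha256sum "$dir/download.iso" > "$dir/published.sha256"

# --check re-hashes every file listed and compares it with the stored hash.
intact=$(sha256sum --check "$dir/published.sha256")
echo "$intact"

# Simulate tampering or corruption by appending one byte, then check again.
printf 'X' >> "$dir/download.iso"
damaged=$(sha256sum --check "$dir/published.sha256" 2>/dev/null)
echo "$damaged"

rm -rf "$dir"
```

The first check should end in OK and the second in FAILED. sha256sum also returns a non-zero exit status on a mismatch, which is handy in scripts.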

On our test computer, we have two copies of the same file, a shared library. The files have been renamed so that they can be in the same directory. In theory, these files should be the same. After all, they’re supposed to be the same version of the shared library.

The files have the same size, the same date stamps, and the same time stamps. To the casual observer, they will appear to be the same. Let’s use the sha256sum command and generate a hash for each file.
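The original command isn't reproduced here, so the following is a sketch of what that step might look like. It uses two small made-up files, first.so and second.so, standing in for the library copies.

```shell
dir=$(mktemp -d)

# Hypothetical stand-ins for the two library copies: same length, one byte apart.
printf 'shared-lib\000\001\002\003\004\005' > "$dir/first.so"
printf 'shared-lib\000\001\002\377\004\005' > "$dir/second.so"

# One line per file: the hash, then the file name.
sha256sum "$dir/first.so" "$dir/second.so"

# Keep only the hash field so the two values can be compared in a script.
hash_first=$(sha256sum < "$dir/first.so" | cut -d ' ' -f 1)
hash_second=$(sha256sum < "$dir/second.so" | cut -d ' ' -f 1)
if [ "$hash_first" = "$hash_second" ]; then
    verdict="hashes match"
else
    verdict="hashes differ"
fi
echo "$verdict"

rm -rf "$dir"
```

Each SHA-256 hash is printed as 64 hexadecimal characters, so even a quick visual comparison of the first and last few characters is usually enough to spot a mismatch.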

The hashes are completely different, clearly indicating that there are differences between the two files. If the website shows the hash of the genuine file, you can discard the file that doesn’t match.

Finding the Differences

If you want to look at the changes, there are ways to do that too. You don’t need to be able to decompile the file, nor to understand assembly or machine code just to see the modifications. Understanding what those changes mean, and what their purpose is, of course, would require deeper technical knowledge. But simply knowing how substantial the changes are can be indicative of what’s happened to the file.

If we use diff on the two binary files, we’ll get a response that is a little underwhelming.
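Here is a small sketch of that, assuming GNU diff and two hypothetical stand-in files that contain NUL bytes so that diff treats them as binary.

```shell
dir=$(mktemp -d)

# Hypothetical stand-ins for the two library copies: same length, one byte apart.
printf 'shared-lib\000\001\002\003\004\005' > "$dir/first.so"
printf 'shared-lib\000\001\002\377\004\005' > "$dir/second.so"

# With binary input, diff only reports whether the files differ.
diff_report=$(diff "$dir/first.so" "$dir/second.so")
echo "$diff_report"

rm -rf "$dir"
```

The output is a single line saying that the binary files differ, with no indication of where or by how much.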

We already knew the files were different. Let’s try cmp.
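A minimal sketch of the cmp call, again with two hypothetical 16-byte stand-in files. The byte and line numbers it reports belong to these toy files, not to the library discussed in the text.

```shell
dir=$(mktemp -d)

# Hypothetical stand-ins: the 14th byte is \003 in one file and \377 in the other.
printf 'shared-lib\000\001\002\003\004\005' > "$dir/first.so"
printf 'shared-lib\000\001\002\377\004\005' > "$dir/second.so"

# With no options, cmp stops at the first difference and reports its position.
cmp_report=$(cmp "$dir/first.so" "$dir/second.so")
echo "$cmp_report"

rm -rf "$dir"
```

For these files the report points at position 14 on line 1. Depending on the locale, GNU cmp labels the position as "byte" or "char".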

This tells us a tiny bit more. The first byte that differs between the two files is byte number 13451. That is, counted from the start of the binary file, with the first byte numbered 1, byte 13451 is different in the two binary files. So 13451 is the position of the first difference, counted from the start of the file.

Just by chance, throughout the file, there will be bytes that contain the hexadecimal value 0x0A, which is 10 in decimal. This is the line feed character that Linux uses in text files as the end-of-line character. The cmp command encountered 131 bytes with this value between the start of the binary file and the location of the first difference. So it thinks it is on line 132. It really doesn’t mean anything in this context.
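To see where the line number comes from, the sketch below counts the line feed bytes that precede the first difference and adds one. The files are hypothetical stand-ins with two line feeds placed before the differing byte.

```shell
dir=$(mktemp -d)

# Hypothetical files: two line feeds (\n) appear before the byte that differs.
printf 'one\ntwo\nthree\000\001\002\003' > "$dir/first.so"
printf 'one\ntwo\nthree\000\001\377\003' > "$dir/second.so"

cmp_report=$(cmp "$dir/first.so" "$dir/second.so")
echo "$cmp_report"

# Byte number of the first difference: first field of the first line of cmp -l.
first_diff=$(cmp -l "$dir/first.so" "$dir/second.so" | head -n 1 | awk '{ print $1 }')

# Count the line feed bytes that come before that position.
newlines=$(head -c "$((first_diff - 1))" "$dir/first.so" | tr -cd '\n' | wc -c)

echo "first differing byte: $first_diff"
echo "line feeds before it: $((newlines))"
echo "line number cmp reports: $((newlines + 1))"

rm -rf "$dir"
```

The computed line number should agree with the one in the cmp report, confirming that it is just a count of line feed bytes plus one.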

If we add the -l (verbose) option we’ll start to get useful information.

All of the differing bytes are listed. The byte number or offset, the value from the first file, and the value from the second file are shown, with one byte per line of output.

The byte values are shown in octal, instead of the usual hexadecimal format used with binary files. Nonetheless, we’ve learned something else. All the changed bytes are in one contiguous sequence. Their offsets are incremented by one for each byte.
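If octal is awkward to read, the listing can be converted to hexadecimal with the shell's printf. This is a sketch using hypothetical stand-in files with three adjacent differing bytes. Prefixing each value with 0 makes printf read it as an octal number.

```shell
dir=$(mktemp -d)

# Hypothetical stand-ins: bytes 12, 13 and 14 differ, forming one contiguous run.
printf 'shared-lib\000\001\002\003\004\005' > "$dir/first.so"
printf 'shared-lib\000\101\102\103\004\005' > "$dir/second.so"

# Raw listing: byte number, then the value in each file, both in octal.
cmp -l "$dir/first.so" "$dir/second.so"

# Same listing with the values rewritten as hexadecimal.
hex_listing=$(cmp -l "$dir/first.so" "$dir/second.so" |
    while read -r position first_val second_val; do
        printf '%d 0x%02X 0x%02X\n' "$position" "0$first_val" "0$second_val"
    done)
echo "$hex_listing"

rm -rf "$dir"
```

The positions in the first column increase by one per line, which is what a single contiguous block of changes looks like.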

The hexdump tool will dump a binary file to the terminal window. If we use the -C (canonical) option the output will list on each line the offset, the values of 16 bytes at that offset, and—if there is one—the ASCII representation of the byte values.
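A quick sketch of the canonical format, using a hypothetical 16-byte file so the whole dump fits on one line:

```shell
dir=$(mktemp -d)

# Hypothetical 16-byte file: ten printable characters followed by six control bytes.
printf 'shared-lib\000\001\002\003\004\005' > "$dir/first.so"

# -C prints the offset, sixteen byte values in hex, and an ASCII column.
canonical=$(hexdump -C "$dir/first.so")
echo "$canonical"

rm -rf "$dir"
```

In the right-hand column the printable bytes appear as shared-lib, and each non-printable byte is shown as a dot. The final line of the dump holds only the total length of the file as an offset.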

We can use the output from hexdump as input to diff, letting diff work as though it were reading two text files.
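One way to do this, assuming a bash shell, is process substitution, which lets diff read the output of each hexdump as if it were a file. The sketch below uses hypothetical stand-in files and hands the command to bash explicitly so that it also works when the surrounding script is run by a plain POSIX sh.

```shell
dir=$(mktemp -d)

# Hypothetical stand-ins for the two library copies: same length, one byte apart.
printf 'shared-lib\000\001\002\003\004\005' > "$dir/first.so"
printf 'shared-lib\000\001\002\377\004\005' > "$dir/second.so"

# At an interactive bash prompt you would type:
#   diff <(hexdump first.so) <(hexdump second.so)
# <( ... ) is a bash feature, so the same command is passed to bash here.
hexdiff=$(bash -c 'diff <(hexdump "$1") <(hexdump "$2")' compare \
    "$dir/first.so" "$dir/second.so")
echo "$hexdiff"

rm -rf "$dir"
```

Lines starting with < come from the first file and lines starting with > from the second. Only the dump lines that contain a changed byte are shown.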

diff finds the lines that are different and shows the hexadecimal byte values from the first file above the values from the second file. The offset of the first line is 0x3480, or 13440 in decimal. Earlier, cmp told us the first change occurred at byte 13451, which is 0x348B. That actually matches what we see here.

The hexdump output that diff displays is in two-byte blocks. The first block holds bytes 0 and 1 from the offset of 0x3480, and the second block holds bytes 2 and 3 from the offset. Block 6 will hold bytes 0xA and 0xB, or 10 and 11 in decimal. Those are bytes 13450 and 13451. And we can see that this is the first block that differs. The first five pairs of bytes are the same in both files.

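However, because hexdump counts offsets from zero and cmp counts bytes from one, what cmp calls byte 13451 is at offset 13450 in the hexdump output. And to make matters even more confusing, the default hexdump format shows each two-byte block as a 16-bit value, so on little-endian machines such as x86 PCs the byte order in each block is reversed. The bytes are actually listed in this order: 1 and 0, 3 and 2, 5 and 4, 7 and 6, and so on.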
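The offset arithmetic in the last few paragraphs can be checked with shell arithmetic. The variable names below are made up for the example. The only input is the byte number 13451 reported by cmp.

```shell
cmp_byte=13451                        # 1-based position reported by cmp
offset=$((cmp_byte - 1))              # 0-based offset, as hexdump counts
line_start=$((offset / 16 * 16))      # hexdump shows 16 bytes per line
index=$((offset - line_start))        # position within that line, from 0
block=$((index / 2 + 1))              # two-byte block within the line, from 1

printf 'offset of the byte : %d (0x%X)\n' "$offset" "$offset"
printf 'start of dump line : %d (0x%X)\n' "$line_start" "$line_start"
printf 'index in the line  : %d (0x%X)\n' "$index" "$index"
printf 'two-byte block     : %d\n' "$block"
```

This puts the changed byte at offset 0x348A, on the dump line that starts at 0x3480, in the sixth two-byte block.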

The command is also computationally expensive—two hexdumps and a diff all at once—especially if the files being compared are large.

But if hexdump -C can send an ASCII version of the binary file to the terminal window, why don’t we redirect the output to text files, and then compare those two text files with diff?
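A sketch of that approach, with hypothetical stand-in files and made-up names for the two text files:

```shell
dir=$(mktemp -d)

# Hypothetical stand-ins: bytes 12, 13 and 14 differ between the two files.
printf 'shared-lib\000\001\002\003\004\005' > "$dir/first.so"
printf 'shared-lib\000\101\102\103\004\005' > "$dir/second.so"

# Write each canonical dump to a text file.
hexdump -C "$dir/first.so"  > "$dir/first.txt"
hexdump -C "$dir/second.so" > "$dir/second.txt"

# Compare the two dumps as ordinary text.
text_diff=$(diff "$dir/first.txt" "$dir/second.txt")
echo "$text_diff"

rm -rf "$dir"
```

Because the canonical format lists bytes one at a time in file order, there is no byte swapping to untangle, and the ASCII column makes the changed bytes easier to spot. The text files can also be kept for later reference.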

The difference between the two files is displayed in two short extracts. There’s an ASCII representation alongside them. There will be a pair of extracts for each difference between the files. In this example, there’s only one difference.

That’s all very fine, but wouldn’t it be great if there was something that did all that for you?

VBinDiff

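The VBinDiff program can be installed from the usual repositories for all of the major distributions. The package is typically named vbindiff. To install it on Ubuntu, use this command: sudo apt install vbindiff

On Fedora, you need to type: sudo dnf install vbindiff

Manjaro users need to use pacman: sudo pacman -S vbindiff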

To use the program, pass the names of the two binary files on the command line.

The terminal-based application opens, showing both files in a scrolling view.

You can use the mouse scroll wheel or the “UpArrow”, “DownArrow”, “Home”, “End”, “PageUp”, and “PageDown” keys to move through the files. Both files will scroll.

Hit the “Enter” key to jump to the first difference. The difference is highlighted in both files.

If there were more differences, hitting “Enter” would display the next difference. Pressing “q” or “Esc” will exit the program.

What’s the Difference?

If you’re working on a computer that belongs to someone else and you’re not allowed to install any packages, you can use cmp, diff, and hexdump. If you need to capture the output for further processing, these are the tools to use, too.

But if you are permitted to install packages, VBinDiff makes your workflow easier and faster. And in fact, using VBinDiff with a single binary file is an easy and convenient way to browse through binary files, which is a nice bonus.
