What's so special about the Hamming Distance?

The hamming distance is the number of bits different between two bit strings.

Let’s say you have two bit strings:

1 0 0 1 1 0 1 1

1 1 0 0 1 1 0 1

We can see the difference in these two strings, the number of changes required to change one bit string to another is the hamming distance. In our example the hamming distance is 4.

It’s cool and all but how is it useful?

Repeated XOR

Before we jump the gun, let’s talk about a simple cryptographic encryption, repeated XOR.

Repeated XOR encryption is when you take a key x and XOR your string y using x in a rotated manner.

Let’s say y = “foobargoo” and x = 0xf2, 0x9c, 0x31.

The encrypted XOR would be:

f    o    o    b    a    r    g    o    o  
^ 0xf2 0x9c 0x31 0xf2 0x9c 0x31 0xf2 0x9c 0x31

The result is 0x94 0xf3 0x5e 0x90 0xfd 0x43 0x90 0xf3 0x5e (r).

Now the question is how do we decrypt r to get back y when we don’t know what x is.

This is easily solvable when we know the length of x (KEYSIZE). We just split y into buckets of length x. We now solve for each bucket independently by trying all possible values for that bucket between 0 to 255 and picking the value which give us the most English looking answer.

This works because A = B ^ C means B = A ^ C (associative).

0x94 0xf3 0x5e 0x90 0xfd 0x43 0x90 0xf3 0x5e

Bucket 1: 0x94 0x90 0x90
Bucket 2: 0xf3 0xfd 0xf3
Bucket 3: 0x5e 0x43 0x5e

All well and good. But how do we solve it when we don’t know the length of x.

The first step seems to be discovering the length of x. Because by doing that, we can easily solve the problem with the above solution.

Now that’s where we use the hamming distance.

Hamming Distance to discover keysize in repeating XOR

So, to figure out the KEYSIZE we iterate over possible KEYSIZE (let’s say 2 to 10) take the result r and split it into buckets of size KEYSIZE and the one which has the least hamming distance between groups should be the required KEYSIZE.

For our example our correct key size was 3. Let’s see what happens when we try for KEYSIZE 2, 3 and 4, to see how hamming distance helps with correct KEYSIZE.

KEYSIZE 2

Bucket 1: 0x94 0xf3
Bucket 2: 0x5e 0x90
Bucket 3: 0xfd 0x43
Bucket 4: 0x90 0xf3
Bucket 5: 0x5e

The hamming distance between Bucket 1 and Bucket 2 is: 8
Normalized distance: 8/2 = 4

KEYSIZE 3

Bucket 1: 0x94 0xf3 0x5e
Bucket 2: 0x90 0xfd 0x43
Bucket 3: 0x90 0xf3 0x5e

The hamming distance between Bucket 1 and Bucket 2 is: 8
Normalized distance: 8/3 = 2.67

KEYSIZE 4

Bucket 1: 0x94 0xf3 0x5e 0x90
Bucket 2: 0xfd 0x43 0x90 0xf3

The hamming distance between Bucket 1 and Bucket 2 is: 16
Normalized distance: 16/4 = 4

We normalize because we want to ideally find the hamming distance per byte but since we’re taking KEYSIZE elements, we normalize.

We can always take multiple buckets and average out our results (we’ll still have to normalize against the KEYSIZE). This will give similar results.


So this clearly shows how using the hamming distance lets us discover the KEYSIZE.
The way this works is that english text in ASCII is within a close range and hence has lower average hamming distance between two alphabets than two random bytes between 0 to 255.

Let’s calculate the average of hamming distance over all a-zA-Z:

func main() {
	hdist := 0
	count := 0
	for i := 65; i <= 127; i++ {
		for j := 65; j <= 127; j++ {
			hdist = hdist + HammingDistance(byte(i), byte(j))
			count += 1
			if j == 90 {
				j = 96
			}	
		}
		if i == 90 {
			i = 96
		}
	}
	fmt.Println(float64(hdist)/float64(count))
}

Gives the output: 2.99

Whereas running an average of hamming distance over all 256 values:

func main() {
	hdist := 0
	count := 0
	for i := 0; i <= 256; i++ {
		for j := 0; j <= 256; j++ {
			hdist = hdist + HammingDistance(byte(i), byte(j))
			count += 1
		}
	}
	fmt.Println(float64(hdist)/float64(count))
}

Gives the output: 3.99


So let’s say we have a string MJHNPY XOR’d with key KK1K2. This means that our result is:

(M ^ K) (J ^ K1) (H ^ K2) (N ^ K) (P ^ K1) (Y ^ K2)

If we guess the KEYSIZE to be 2, we’d be doing

(M ^ K) ^ (H ^ K2)

That is both M and H would be in the first bucket. This would give an estimated value of 3.99 because K and K2 are random values.

whereas if we picked KEYSIZE to be 3, we’d be doing

(M ^ K) ^ (N ^ K)

(M ^ N)

Here both M and N are in the first bucket. The K gets eliminated because X ^ Y ^ Y = X.

This would give an estimated value of 2.99 because here we only have M and N and both fall under english ASCII.


Where did I learn this?

https://cryptopals.com/sets/1/challenges/6