Attacking Hash Functions in Theory
Learn about the birthday paradox and how it relates to the probability that hash value collisions will occur.
We noted that, from the attacker’s perspective, finding collisions is normally the ‘easiest’ attack to conduct. It’s protection against collisions that primarily determines the output lengths of practical hash functions. That’s why we’ll focus on finding collisions during our discussion of hash function attacks.
We’ll focus on the following question: how many bits long should the hash of a message be for it to be regarded as secure? Since we’re really worried about collision resistance, this question could be phrased as follows: how long does a hash have to be before finding collisions becomes hard?
Throughout this discussion, we’ll keep in mind that, as we’ll see later, a popular application of hash functions is to apply them before creating a digital signature on a message. In other words, the digital signature is on the hash of the message and not the message itself. The consequences of finding a collision are serious for this type of application. If two messages can be found with the same hash, a user could sign one message and then claim that the signature was on the other.
The danger of a very small hash
A very small hash is a bad idea. For example, suppose that before digitally signing the message ‘Bruce owes Sheila $10,’ we hash it using a 2-bit hash function. There are only four possible values for the resulting hash of this message—00, 01, 10, or 11.
Sheila now receives this digitally signed message, and being the manipulative type, she decides to change the message to ‘Bruce owes Sheila $100.’ Of course, Sheila does not have the correct signature key, so she cannot digitally sign this message. However, because there are only four possible hashes, there is a 25% chance that the hash of the modified message is the same as the hash of the original message. If it is, Sheila doesn’t have to do anything further to conduct her attack.
Since:
$h(Bruce \space owes \space Sheila \space \$ 10) = h(Bruce \space owes \space Sheila \space \$ 100)$
the signature on the message ‘Bruce owes Sheila $100’ is the same as the signature on the message ‘Bruce owes Sheila $10.’ So now Sheila claims she received the second message, and Bruce is in financial trouble. Consequently, the output length of a hash function must be long enough that ‘getting lucky’ with the hash is not a reasonable attack strategy.
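The situation is easy to simulate. The sketch below uses a toy 2-bit hash built by truncating SHA-256 (an illustration only, not a real hash design); with just four possible outputs, any two distinct messages collide with probability roughly 25%:

```python
import hashlib

def tiny_hash(message: str, bits: int = 2) -> int:
    # Toy hash for illustration: keep only the top `bits` bits of SHA-256.
    # No real system uses a 2-bit hash; this just makes collisions visible.
    return hashlib.sha256(message.encode()).digest()[0] >> (8 - bits)

h1 = tiny_hash("Bruce owes Sheila $10")
h2 = tiny_hash("Bruce owes Sheila $100")
print(h1, h2)  # each value is one of 0, 1, 2, 3
```

Whether these two particular messages collide depends on the hash, but with only four buckets Sheila needs no skill at all, just a one-in-four stroke of luck.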
The danger of a small hash
If we have a slightly longer hash, say a 10-bit hash, we might reasonably regard ‘getting lucky’ as unlikely. In this case, there are $2^{10}$ possible different hashes, which is about 1,000. So there is only about a 0.1% chance of getting lucky by just guessing the hash value.
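As a quick check of the arithmetic:

```python
n_bits = 10
n_hashes = 2 ** n_bits   # 1,024 possible hash values
p_lucky = 1 / n_hashes   # chance a single guess matches a given hash
print(n_hashes, f"{p_lucky:.4%}")  # 1024 0.0977%
```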
Suppose now, however, that a clerk approves payments by digitally signing them on behalf of an organization that has approved the use of a 10-bit hash function. A certain Fred C. Piper intends to approach the clerk with a request for a payment of $200 to himself that he is confident will be authorized. Knowing that a 10-bit hash function is used, he can conduct the following attack:

First, he generates 1,000 variants of a request for a payment of $200 by adjusting the format of the request. For example:
a. Pay Fred Piper $200
b. Pay F. Piper $200
c. Pay F.C. Piper two hundred dollars
d. Pay F.C. Piper two hundred dollars only
e. Pay two hundred dollars to Mr. Fred Piper
f. Etc.

He now generates 1,000 variants of a request for payment of $8,000 in a similar way:
a. Pay Fred Piper $8,000
b. Pay F. Piper $8,000
c. Pay F.C. Piper eight thousand dollars
d. Pay F.C. Piper eight thousand dollars only
e. Pay eight thousand dollars to Mr. Fred Piper
f. Etc.

He calculates the hash of all 2,000 of these different messages. Since there are only 1,000 different possible values of the hash, there’s a very good chance (although it’s not 100%) that he will discover that the hash for one of the $200 messages is the same as the hash for one of the $8,000 messages. We will assume he’s successful (and he almost certainly will be), and he discovers the following:
$h(Pay \space Fred \space Piper \space the \space sum \space of \space \$200) = h(We \space agree \space to \space pay \space F. \space C. \space Piper \space \$8,000)$

He approaches the clerk with the request ‘Pay Fred Piper the sum of $200.’ The clerk agrees to authorize the payment by digitally signing this request (which involves hashing the request message and then signing the hash).

Fred now sends his bank the digital signature on the hash of ‘Pay Fred Piper the sum of $200,’ but claims it’s on the message ‘We agree to pay F. C. Piper $8,000.’ Fred has a valid digital signature on this message.
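Fred’s attack is easy to reproduce. The sketch below uses a hypothetical 10-bit hash (SHA-256 truncated to its top 10 bits) and, instead of genuine rewordings, pads each message with a varying number of trailing spaces as a stand-in for the formatting variants listed above:

```python
import hashlib

def hash10(message: str) -> int:
    # Hypothetical 10-bit hash: the top 10 bits of SHA-256 (illustration only).
    return int.from_bytes(hashlib.sha256(message.encode()).digest()[:2], "big") >> 6

# 1,000 variants of the $200 request, indexed by their 10-bit hash.
low = {}
for i in range(1000):
    msg = "Pay Fred Piper $200" + " " * i
    low[hash10(msg)] = msg

# Look for an $8,000 variant whose hash matches one of the $200 variants.
collision = None
for i in range(1000):
    msg = "Pay F. C. Piper $8,000" + " " * i
    if hash10(msg) in low:
        collision = (low[hash10(msg)], msg)
        break

print(collision)  # almost certainly a matching ($200, $8,000) pair
```

With only 1,024 possible hash values and 2,000 candidate messages, a cross-collision turns up with near certainty, which is exactly why Fred’s 2,000 hash computations are enough.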
There are two reasons why this attack was successful:

1. In this example, there are many potentially meaningful collisions. The nature of the example is more illustrative than realistic. A good way to reduce the number of meaningful collisions would be to insist, as many organizations do, that payment orders are formatted in a particular way and not left as open as this one was.

2. The length of the hash is still too short. Even though ‘getting lucky’ at guessing the hash is now a little more difficult than in our first example, it’s still easy to find collisions by conducting a relatively small amount of work (in this case, 2,000 hash function computations).
So the lesson is that a hash must be long enough that finding collisions is ‘infeasible.’ We now consider how many bits long a hash must be for this to be true.