MD5 in Password-Based Encryption

First, a word of warning: I am no cryptography expert. But I’ve read a lot about cryptography, written a few implementations of cryptographic algorithms, and am not generally confused by cryptographic terminology.

People like myself seem to be an oddly rare breed. Most programmers I meet seem to think of cryptography as magic dust one sprinkles on a problem, and then — poof — security happens. Worse than those are programmers that read up on cryptography without understanding basic concepts.

I had a run-in with one such person the other day, who discussed concerns about using MD5 over SHA1 in an implementation of password-based encryption. The person was obviously aware that MD5 is vulnerable to collisions1. I raised the point that the implementation had weaknesses that are much, much easier to exploit than any weakness in a cryptographic hash function, which makes worrying about the hash function used rather pointless2.

In this post, I’d like to explain how exactly cryptographic hash functions are used in password-based encryption, and why the strength of the hash function does not matter all that much.

Hash Functions

First, let’s consider what a hash function is. According to Wikipedia:

A hash function is any well-defined procedure or mathematical function that converts a large, possibly variable-sized amount of data into a small datum, usually a single integer.
http://en.wikipedia.org/wiki/Hash_function

Hash functions have other properties that make them useful for various different applications, but let’s ignore them for now, and examine a hash function so simple it probably shouldn’t be called a hash function: the modulo operation.

Here’s a tiny sample program (in Python) that displays results of a modulo operation on various large-ish integer values:

for i in range(2000, 3000):
  print i, "-", i % 8

Yes, it’s terribly unexiting. Let me explain the choice of the particular numbers in this little program:

  1. I’m taking each number modulo 8. The output of the program clearly shows that results range from 0 to 7. Those values can be held in 3 bits.
  2. I’m then subjecting numbers between 2000 and 3000 to this operation. Those numbers cannot be held by a single byte (8 bit) any longer, so are clearly larger than the 3 bit result value.
  3. 2000 can be represented in 11 bits, but 3000 can only be represented in 12 bits. The input size is thus variable.

And that simply satisfies the definition of a hash function above. It also very easily illustrates a point about hash functions: because the input size is larger than the output size, different inputs will invariably produce the same output. That’s what’s called a collision.

Cryptographic Hash Functions

In order to avoid collisions, cryptographic hash functions are a lot more complex than the modulo operation. In order to be useful in a cryptographic context (and again citing Wikipedia), a hash function must have four ideal properties:

  • it is easy to compute the hash value for any given message,
  • it is infeasible to find a message that has a given hash,
  • it is infeasible to modify a message without changing its hash,
  • it is infeasible to find two different messages with the same hash.

The modulo operation satisfies the first requirement. The second requirement is satisfied insofar as the size of the input could be anything, but it’s terribly easy to narrow down the range of inputs: if the divisor in the modulo operation is N, then every Nth value is a candidate. If you know anything about the probably size of the input, finding the corresponding message to a hash isn’t all that hard. The third and fourth properties are very clearly not satisfied by the modulo operation. Add any multiple of N to the input, and you modified the message without changing it’s hash, thereby having found two different messages with the same hash.

All of the above weaknesses of the module operation as a cryptographic hash function are due to the linearity of the result values. That’s not the only problem with the modulo operation as a cryptographic hash function, of course, but the first and most evident one.

So in order for a hash function to work as a cryptographic hash function, it’s output must appear as random as possible — that is, minor changes to the input will produce vastly different output. That property will become important later on, just remember it for now.

Password-Based Encryption

So what about password-based encryption? How do hash functions factor into that?

First of all, in password-based encryption you’re using a symmetric cipher, meaning that the key for encryption and decryption are the same. Then, there are several different cipher modes, in which output from one encrypted block may or may not serve as input to another encrypted block. That does not really matter very much beyond the fact that in most cases, your password will only be used to encrypt the first block of plaintext.

Be that as it may, it’s reasonable to consider an attempt to break into password-based encryption as an attempt to decrypt a fixed-length block of data with a key we’re trying to find.

  1. Incidentally, so is SHA1. []
  2. To be precise, the implementation makes phishing so simple that you might as well scrap it if you’re trying to protect Joe User. If you’re trying to protect an average computer geek, it’ll probably be good enough. []