## Wednesday, March 31, 2010

### Cryptography - MD5

For the first Part of this - hopefully ongoing - series, I decided to look at the MD5 hash algorithm. It’s one of the most commonly used cryptographic algorithms out there and I would claim that nearly everyone has a password somewhere that is stored with an MD5 or similar hash.

MD5 stands for Message-Digest Algorithm 5, and is - as already mentioned and you probably already knew - a hash algorithm.

The MD5 hash algorithm is in simple terms a deterministic function (or blackbox) that will calculate a 128-Bit hash value from a given string of well-nigh any length … yeah, I had to read this sentence over a few times, and it’s just rubbish. If I wanted to write something like that, I could’ve gone Wikipedia. So let’s crack this one open.

You feed the MD5-Box a string of any length you want. This “string” doesn’t have to be alphanumeric of course, any stream of bits and bytes is just fine, like the bitstream of a file, for instance. The output string has always a length of 128 bits and is usually noted as a string of 32 octets, like this one: “B5A8AD3A9CDD6A6953FCBE6975FDE734″ (try guessing what I typed in though).

One of the most important things about hashes is, that they are so-called one-way-functions, meaning, they only encrypt stuff, and can’t - and must not - be decrypted. So hashes are often used for storing passwords in a databases. The same plaintext will always be hashed to the same cipher text with MD5, so all you have to do to check if your password and the stored (hashed) password are identical is to compute the hash of the given password and compare it with the stored one.

There are several demands a good hash-function has to meet in order not to get cracked in the first two hours of its lifetime.
The first one is, that a minor change in the plaintext (like “ghacks” and “gHacks”) should have a big impact on the computed hash (”D1B81FBDEB51C3A850E37177A5A22498″ and “DB3E20DC88EF0B6CA6A8FD5DA448D323″). If the difference would be only minor, and I know the plaintext and hash of “ghacks” (which I do, of course), and have the hash of “gHacks” without the knowledge of its plaintext, I could easily guess it.

The second very important demand is that a hash-function produces a much smaller memory imprint than the original stream. If you hash an 11MB installer to verify its integrity and have to download another 10MB of hash file as well, it’s pretty useless. There are lots of other points to keep an eye on, but these will (and have to) suffice.

As I mentioned already, hash-functions such as MD5 are most commonly used to store passwords without actually storing them in plaintext, and to verify the integrity of files. When you put a file online, just compute the hash and publish it together (but separate) with the file. Ever user would be able to determine if the downloaded file has been tampered with by simply comparing the hash of the downloaded file with the one published on the website.

Now I’d like to say something about security and known (and partly successful) attacks against hashes and MD5 in particular.
Due to the reduction (a 2MB file gets reduced to a 32-octet hash), information gets lost. This gets perfectly clear, if you take a look at the numbers. There are only 2^128 possible hash values, but infinite possible plaintexts. So in a best-case-scenario, after hashing plaintext numbers (2^128)+1 you have at least two plaintexts getting mapped on one and the same hash value.

So the first attack tries to make use of this very fact. When the same hash value is calculated from two different plaintexts, it is called a collision. Depending on the scenario of the attack using collisions, the birthday paradox comes in handy as well, increasing the attackers chance of success.

That would mean that you do not attempt to break the encryption or guess the user’s password when trying to crack a password, but just try to create another password that leads to the very same hash value, granting you access to the account. Of course, knowledge of the hashed password is required, but without that information, most attacks on modern ciphers are more than just tricky.

Edit: please take a look at comments for more clarification on the types of attacks mentioned above.

The second attack is based on a brute-force attack, which is basically “try all possible keys/passwords”. Depending on the numbers this could take some time. Let’s say you’ve already acquired the target hash value and your machine is able to try 100 keys per ms. That would make 100.000 keys per second, and 6.000.000 keys per minute. 2^128 hash values. That’s 3.4E38. We’re talking “age of the universe in seconds”-numbers here.

But there’s more to it than meets the eye. There are several options to reduce the available possibilities. Can you reduce the amount of possible plaintexts maybe? Maybe the password only allows to be 8 alphanumeric letters long? Can you have a look at the used algorithm and find something that may help you further? Do you know part of the plaintext? Maybe a name of son/wife/pet? Then you could combine it with a dictionary-attack. Every bit of information helps reducing the number of possibilities further, which in the end leads to a situation like this:

The following is a description of an attack to crack the user passwords of windows accounts (up to XP), and implemented in a near-perfect way by ophcrack. If interested, do make sure to check this tutorial, it’s quite fascinating and yet unbelievably scary.

Windows saves hash values of the user passwords, but if a password is longer than 7 signs, it gets broken up into chunks of length <= 7. Then the chunks get converted to uppercase only. Microsoft used DES for creating the hashes, but there’s no difference regarding this kind of attack.

So the attacker knows pretty much about the plaintext and can reduce its possibilities by a great deal. Now a computer starts calculating all possible hash values for this particular range of plaintexts (up to 7 digits, uppercase, numbers and some special characters only) and stores them in a database. Once finished, the database is from about 0.7 to 4 GB in size and can be easily transported using a thumb drive or a DVD.

Now all the attacker needs is a few minutes alone with the target computer and it’s done. Again, check the tutorial mentioned above, it kinda blew my mind. 1.7 minutes was the average time in this experiment for cracking a password to your windows account. Ouch.

Since I read and heard all of the above some time ago, I started wondering about the benefits and risks of using MD5. Most security experts discourage the use of MD5 nowadays for its known vulnerability to collision attacks. It should be replaced by something like the SHA-1 or since it is kind of outdated as well the even newer SHA-512. But that doesn’t help against the attack last mentioned, apart from increasing the possible hash values to even greater dimensions.

After some time, I found this very helpful article about spicing up your hashs to be more secure. I have to say though, these tips are NOT increasing the security of your hash function in a mathematical way. Luckily, the real world’s not all about math, so I think they are an easy way to get some extra security.

Edit: Please keep in mind that the tutorial posted here is not a perfect implementation of salts. It’s - as always - a source for ideas, not a perfect solution. But I always like it more if it’s explained like that, easy and understandable and in a rather digestible way. Please correct me if I’m mistaken.

If you want to screw around with MD5 a bit, here’s a link to an applet where you can do just that (SHA-1 as well). Switch to MD5, enter some text and press “Text digest”. Try guessing my hash from above (reaaaal easy), if you like and post the answer in the comments. First to score gets a cookie.