Ben Laurie blathering

Compression Violates Semantic Security

There’s been quite a lot of noise about the still not-fully-disclosed CRIME attack on TLS recently. But, fully disclosed or not, I think we can say with certainty that it turns out that compression is a problem.

The interesting thing, to me at least, is that, in retrospect, this is completely obvious. In cryptography, we have standards that we hold encryption algorithms to, and one of these is semantic security. In short, this means that an attacker should learn nothing (other than length[1]) about a plaintext, given its ciphertext. One way this is often phrased is as a game: given two plaintexts of equal lengths, and one ciphertext made from one of the two plaintexts, then an attacker, who knows everything about the algorithm other than the key, should not be able to guess better than chance which of the two plaintexts was used.

It is obvious that, in general, if compression is used, this game can only go the attacker’s way: the length of the ciphertext must reveal something about the content of the plaintext. This is because, in general, not all texts can compress – indeed, if some plaintexts come out shorter, there must also be some that come out longer. So, since the attacker knows what compression algorithm is in use, he can tell which of the two plaintexts was used by the length of the ciphertext, in general (note that there may be pairs of plaintexts for which this is not true, but in general, there are pairs where the lengths are different). And thus he wins the game, which shows that compression simply cannot be used in a system giving semantic security[2].

And we expect TLS to give semantic security, just like all modern crypto. So, it should’ve been obvious from the start that compression was a non-runner. Why did we not realise? I think the answer to that question would be very interesting indeed. Also, what else do we do now that obviously violates semantic security?

[1] Sometimes even academics admit to real world constraints!

[2] Pedants might argue that actually, yes, you can use compression: just pad everything to the longest a plaintext could compress to. As I’ve noted above, if the compression works at all (that is, some texts are reduced in length), then some texts must actually expand. Which means that you must pad to longer than the original length. So, yeah, pedant, you can use compression, but only if it actually expands!


  1. The rationale I recalled was that compress-before-encrypt gave an attacker less ciphertext to work with. Perhaps it was an artifact of an era where symmetric ciphers were slow, weaker and less generally well understood?

    Comment by djm — 20 Sep 2012 @ 17:03

  2. I think even a longer ciphertext could end up being faster if the plaintext was compressed. So imagine that the plaintext message is 1K, the cipher(plaintext) is 1K, compress(plaintext) is .2K and cipher(pad(compress(plaintext))) is 1.5K. The browser would only need to download the first .2K before it started displaying the page (since the rest is just padding). It would still continue downloading the padding, so that attackers wouldn’t know what the compressed length was, but the user would experience the page loading as fast as if it was compressed.

    The big flaw would be if the user took a next action, the attacker might be able to back-calculate how much of the page they had received how quickly, but humans introduce enough randomness that this would be a far more difficult attack.

    Comment by Aaron Swartz — 20 Sep 2012 @ 17:37

  3. Well, it was realized that compression violates semantic security: RFC 3749 (TLS compression) “security considerations” section says pretty much so:

    “However, combining compression with encryption can sometimes reveal information that would not have been revealed without compression: data that is the same length before compression might be a different length after compression, so adversaries that observe the length of the compressed data might be able to derive information about the corresponding uncompressed data.”

    But that’s of course far from realizing that there’s a practical real-world attack.

    I wonder if there are practical attacks against compression in other protocols? There’s compression in SSH transport layer and IPsec ESP, too…

    Comment by Pasi — 20 Sep 2012 @ 19:21

  4. I think the old logic is, compression increases each plaintext’s entropy-per-bit. If you assume both plaintexts in the game have the same length after compression, then this cannot possibly make the attacker’s job easier. However, this should only matter if your cipher is weak anyway; modern ciphers are designed to offer the adversary no advantage even if plaintext A is all-bits-zero and plaintext B is all-bits-one.

    On some level I feel that this is just another special case of “plaintext length is in fact sensitive information” — consider the attacks on VoIP with variable-bit-rate codecs, which can (not always) manage to recover the whole conversation just from packet length and timing! And compression is too valuable, engineering-wise, to give up on. We need to figure out what we can do about obscuring plaintext length in general, and at what cost.

    Comment by Zack — 20 Sep 2012 @ 19:26

  5. Why is semantic security defined only in terms of same-length pairs of plaintexts? The length equality is a piece of knowledge which we are apparently assuming is available to the attacker. A real message stream would have messages of various lengths; after compression, it will also have a distribution of messages of various lengths. It is not transparent which pairs of messages had equal plaintext lengths.

    Comment by Joseph Boyle — 4 Oct 2012 @ 10:52

  6. Note that you can compress and partially pad. You do not need to pad to to the longest message length, you just need to pad enough to be “usually adequate”. You can deal with the overly long cases by delaying some of the information.

    The thing you want to achieve here is a uniform presentation of encrypted information. The amount of padding needed winds up being a quality issue (and a content issue, since you can learn about some of the kinds of content do not compress well and this can allow you to avoid some of them). Note also that silence can compress well.

    Anyways, if your padding is enough to make your packets longer than your average compressed packet, that’s probably a good starting point.

    Comment by rdm — 4 Oct 2012 @ 22:39

RSS feed for comments on this post.

Sorry, the comment form is closed at this time.

Powered by WordPress