Ben Laurie blathering

Can HMRC and NAO Protect Their Own Data?

(Update: an ORG colleague correctly points out I should say “redacted” not “elided” – thanks, Louise!)

In the wake of the HMRC disaster (nicely summarised by Kim Cameron), the National Audit Office has published scans of correspondence relating to the lost data.

First of all, it’s notable that everyone involved seems far more concerned about cost than about privacy. But an interesting question arises in relation to the redactions made to protect the “innocent”. Once more, NAO and HMRC have shown their lack of competence in these matters…

A few years ago it was a popular pastime to recover redacted data from such documents, using a variety of techniques, from the hilarious cut’n’paste attacks (where the redacted data had not been removed, merely covered over with black graphics) to the much more interesting typography-related attacks. These work backwards from the way that computers typeset. For each font, there are lookup tables that give the exact width of each character, along with adjustments for particular pairs of characters (for example, “fe” often has less of a gap between the characters than the widths of the two letters alone would indicate). This means that if you can accurately measure the width of some text, it is possible to deduce which characters must have made up the text (and often the order in which they must appear). Obviously this isn’t guaranteed to give a single result, but it often gives a very small number of possibilities, which can then be further reduced by other evidence (such as grammar or spelling).
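
To make the width-matching idea concrete, here is a rough Python sketch. Everything in it is invented for illustration: the advance widths, the two kerning pairs, the shortlist of names and the measured width are all assumptions, not values taken from the NAO document or from any real font. A real attempt would load the metrics of the font actually used and measure the gap from the scan.

```python
# Hypothetical advance widths (in 1/1000 em) for an invented font.
# Real values would come from the metrics of the font used in the document.
ADVANCE = {" ": 278}
for chars, width in [("ijl", 222), ("frt", 300), ("acesz", 500),
                     ("bdghknopquvxy", 556), ("mw", 800)]:
    for ch in chars:
        ADVANCE[ch] = width
for ch in "ABCDEFGHIJKLMNOPQRSTUVWXYZ":
    ADVANCE[ch] = 667

# Hypothetical kerning adjustments for particular pairs of characters.
KERNING = {("f", "e"): -20, ("a", "v"): -15}


def text_width(text):
    """Sum the per-character widths, then apply the pair adjustments."""
    width = sum(ADVANCE[ch] for ch in text)
    width += sum(KERNING.get(pair, 0) for pair in zip(text, text[1:]))
    return width


def candidates(measured, words, tolerance):
    """Keep the words whose typeset width is within tolerance of the gap."""
    return [w for w in words if abs(text_width(w) - measured) <= tolerance]


# An assumed shortlist of names and an assumed measurement of 2210 units.
NAMES = ["Alice", "Bob", "Carol", "Dave", "Eve", "Mallory"]
print(candidates(2210, NAMES, tolerance=5))   # a precise measurement
print(candidates(2210, NAMES, tolerance=40))  # a sloppier measurement
```

With these made-up numbers the tight tolerance leaves only "Dave", while the looser one also lets "Carol" through, which is the "very small number of possibilities" situation: the better the measurement, the shorter the list.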

It seems HMRC and NAO are entirely ignorant of these attacks, since they have left themselves wide open to them. For example, on page 5 of the PDF, take the first line “From: redacted (Benefits and Credits)”. We can easily measure the gap between “:” and “(”, which must span a space, one or more words (presumably names) and another space. From this measurement we can probably make a good shortlist of possible names.

Even more promising is line 3, “cc: redacted@…”. In this case the space between the “:” and the “@” must be filled by characters that make a legal email address and contain no spaces. Another target is the second line of the letter itself, “redacted has passed this over to me for my views”. Here we can measure the gap between the left-hand margin and the first character of “has”, and fit into that space a capital letter followed by some other letters, with no spaces. Should be pretty easy to recover that name.
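
These structural constraints are easy to apply mechanically before any measuring is done. The Python sketch below is only an illustration: the two patterns are deliberately simplified assumptions (real email syntax allows more characters, and real names include hyphens, apostrophes and internal capitals), and the sample strings are invented.

```python
import re

# Assumption: a conservative pattern for the part of an address before "@":
# letters, digits, dots, underscores and hyphens, and no spaces.
LOCAL_PART = re.compile(r"[A-Za-z0-9._-]+")

# Assumption: a simple name is one capital letter followed by lowercase letters.
SIMPLE_NAME = re.compile(r"[A-Z][a-z]+")


def plausible_local_parts(strings):
    """Keep strings that could fill the gap between 'cc: ' and '@'."""
    return [s for s in strings if LOCAL_PART.fullmatch(s)]


def plausible_names(strings):
    """Keep strings that could start a sentence as a single name."""
    return [s for s in strings if SIMPLE_NAME.fullmatch(s)]


print(plausible_local_parts(["john.smith", "john smith", "j_smith", ""]))
print(plausible_names(["Smith", "smith", "Mc Coy", "Jones"]))
```

Only candidates that pass a filter like this would then need to be checked against the measured width, which shrinks the search considerably.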

And so on.

This clearly demonstrates that those who are entrusted with our data have absolutely no idea of the threats it faces, nor the countermeasures one should take to avoid those threats.


  1. […] Google’s Ben Laurie has a revealing link to correspondence published by the National Auditing Office relating to HMRC’s recent identity disaster.  […]

    Pingback by IdentityBlog - Digital Identity, Privacy, and the Internet's Missing Identity Layer — 24 Nov 2007 @ 2:25

  2. UK govt quoted in the Danish press as saying that they didn’t know where these lost cds are but no data was in the hands of criminals.

    If accurate – and it sounds about right – then that’s a pretty interesting thing to say.

    There is a mention of passwords which implies encryption but passwords that can be passed on in an email? Sounds like xyz123 to me but I guess it could be secure.

    Comment by robin — 24 Nov 2007 @ 7:52

  3. […] to add: In a more-serious-but-still-highly-amusing note, Ben Laurie points out that the redactions in the PDF aren’t actually good enough to conceal the names of the […]

    Pingback by i blog, you blog, they blog, weblog » Typeface of doom — 24 Nov 2007 @ 19:33

  4. Is it really so easy? Please show us the results of doing it, and tell us how long it took.

    Comment by Martyn Thomas — 25 Nov 2007 @ 19:22

  5. 2 clever by 1/2 🙂

    Comment by Peter Laurie — 27 Nov 2007 @ 8:51

  6. […] government waited nearly a month before revealing that they had lost personal data on 25 million UK citizens. Presumably they could have waited forever if they’d thought they’d get away with […]

    Pingback by Links » Notification on Personal Data Breaches — 14 Dec 2007 @ 14:17

  7. Thank you for opening my eyes to that particular technique. I guess it could work for small gaps but the number of possible combinations of characters to fill larger gaps would make it pointless, even allowing for proper grammar and spelling. Since names are often redacted, and names can have all sorts of weird spellings, I guess that makes it harder again.

    Your method assumes that the redaction is applied automatically. It won’t work if someone simply uses a marker pen or scissors manually on the original and scans it. Too much variation.

    Furthermore, you mentioned differences in character spacing due to the width of the character and variable gaps (called Kerning). Another variable is the number of characters on a line: if the paragraph is fully justified, the computer adjusts the gaps to ‘even out’ the text to fill the line. What this means is that your best sample of known letters to measure the width consists of the individual lines containing each redacted section, rather less than the entire unredacted letter sample. But hopefully the word processor applies its spacing rules systematically – and you may be lucky enough to find the text left or right justified (including the last line of a paragraph).

    Yet more variables would be the use of bold, italics and font size changes in the redacted sections.

    Even the vertical spacing between lines could give clues: some word processors adjust the line heights to leave space for accented characters, super/subscripts and perhaps other situations such as underlines, capitals, over/under-hanging etc.

    The cryptanalytic technique of ‘cribs’ (known plaintext) might help: if you can figure out or guess a redacted word such as a name, that word becomes a likely candidate for other redacted sections, reducing the number of possibilities to check. This is a common failing of all redactions: all instances of a word such as a sensitive name are likely to be redacted throughout the document.

    As to the time needed to do all this, it is surely an ideal candidate for computerization, in which case it could be done in a flash once the program is written and debugged (at least to the level of potential redaction candidates for human selection, then further rounds of analysis …). The specification and programming is no trivial matter though.

    Finally, this is all moot if people continue simply sending CDs/DVDs of unredacted unencrypted data in the post, leaving them in their cars/offices/homes to be stolen, leaving them in taxis & buses, falling prey to social engineering attackers who request the data under false pretences or hackers who sniff the systems and networks etc. etc.

    Thanks for getting me thinking this morning.


    Comment by Gary — 5 Apr 2008 @ 23:10
