Links

Ben Laurie blathering


rss2email, quilt and FeedBurner

(It recently occurred to me that I rarely talk about what I do best, which is write code. So, this is an experimental kind of post wherein I write in far too much detail about some piece of coding. I’d be interested to know whether people want to read this kind of stuff.)

Recently, FeedBurner did something that I found irritating – they removed the author’s name from the HTML in their RSS feed and instead put it in its own field. This didn’t actually break my RSS reading setup, which uses rss2email to convert RSS to, err, email (though apparently I am not the only one affected by changes in FeedBurner’s RSS feed) but it did mean I could no longer tell who had written any particular post on, for example, BoingBoing.

Today I got annoyed enough about this to decide to do something about it. Since, of course, I am using open source tools, I can fix them. Normally the way I would proceed with this would be to compare the version I am currently running against the original source, using diff, of course, then upgrade to the latest version and apply my changes to it.

This is always a slightly painful process, so over the years I have played with a couple of ways to make it less painful. Well, usually. Sometimes you have to do a make distclean or some other variant to avoid getting generated files in the diff.

One early experiment was to use CVS vendor branches. I’ve never really got on well with this, for various reasons. Firstly, the standard advice for merging vendor changes into the main tree is to run

$ cvs checkout -jFSF:yesterday -jFSF wdiff

pretty obviously this only works if you don’t import more than once a day (though you can fix that using tags), but my main problem is that I’ve always found this command completely meaningless. Which is perhaps why I suffered from my other problem with this approach, which was that over time my tree appeared to gradually drift away from both the vendor source and my patches, in apparently random ways.

More recently, I’ve tended to just grab the tarball, unpack it, rename it (typically to <package>-ben), unpack it a second time, and make my changes to the -ben version. Then when I’m done I can do

$ make clean
$ cd ..
$ diff -urN <package> <package>-ben

and presto, a patch. One snag with this scheme has always been that you then end up with one monolithic patch for everything. This causes two issues. Firstly, when I want to apply the patch to a new version, it’s hard to see which changes go together, especially when they span multiple files, so it can get tricky to make sure I resolve conflicts correctly. Secondly, if I want to contribute the patches back upstream, which I often do, developers usually want patches separated by functionality, so they can review them more easily.
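For a single file, roughly what diff -u produces can be sketched with Python’s standard difflib module. This is only an illustration under stated assumptions: the a/ and b/ labels and the two-line config text are made up for the example, and it does not walk directory trees the way diff -urN does.

```python
import difflib

def unified_patch(old_text, new_text, path):
    """Return a unified diff that turns old_text into new_text.

    The a/ and b/ prefixes are illustrative labels, not real directories.
    """
    return "".join(difflib.unified_diff(
        old_text.splitlines(keepends=True),
        new_text.splitlines(keepends=True),
        fromfile="a/" + path,
        tofile="b/" + path,
    ))

# Made-up before/after contents for a single file.
old = "SMTP_SEND = 1\nAUTHREQUIRED = 0\n"
new = "SMTP_SEND = 0\nAUTHREQUIRED = 0\n"
patch = unified_patch(old, new, "config.py")
print(patch)
```

Every changed file ends up in the same output stream, which is exactly why the result is one monolithic patch.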

It turns out that this is hardly a new problem, and a friend of mine recently turned me on to quilt. quilt is pretty cool. It automates the production of diffs. It has the idea of a “stack” of patches, so I can divide stuff up according to functionality, and have a patch for each, which I can apply and unapply at the drop of a hat. The patches themselves just live as, well, patchfiles, so I can send them in emails and stuff without any problems. So, for my inaugural use of quilt, I decided to attempt my rss2email upgrade using it.
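The “stack” idea can be pictured with a toy model. To be clear, this is not how quilt is implemented: the PatchStack class is hypothetical, and each “patch” here is just a single text substitution rather than a real diff.

```python
class PatchStack:
    """Toy model of a patch stack; NOT quilt's actual implementation."""

    def __init__(self, base, patches):
        self.base = base          # pristine upstream text
        self.patches = patches    # ordered list of (name, old, new)
        self.applied = 0          # how many patches are currently applied

    def push(self):
        """Apply the next unapplied patch, if there is one."""
        if self.applied < len(self.patches):
            self.applied += 1

    def pop(self):
        """Unapply the topmost applied patch, if there is one."""
        if self.applied > 0:
            self.applied -= 1

    def text(self):
        """Return the base text with the applied patches layered on top."""
        result = self.base
        for name, old, new in self.patches[:self.applied]:
            if old not in result:
                raise ValueError("patch %s does not apply" % name)
            result = result.replace(old, new, 1)
        return result

# Patch names and contents are invented for the illustration.
stack = PatchStack(
    "SMTP_SEND = 1\nVERBOSE = 0\n",
    [("first.diff", "SMTP_SEND = 1", "SMTP_SEND = 0"),
     ("second.diff", "VERBOSE = 0", "VERBOSE = 1")],
)
stack.push()
stack.push()
both = stack.text()   # both patches applied
stack.pop()
one = stack.text()    # only the first patch applied
```

The point of the model is that the pristine text is never edited directly: the working tree is always the base plus however many patches are currently pushed.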

Unfortunately, despite my claim above to be somewhat organised about patching software, it turns out that I didn’t actually save the original version of rss2email that I started from, and I can’t find it on the web, either. I blame rss2email’s somewhat eccentric distribution method, which doesn’t start with a tarball, but instead just hands you links to individual files. I seem to remember I had to seek some of them out first time around, too. In the end I decided to just start from scratch. I know what I want, so I just need to keep hacking until I get it.

Step one is to add the convenience script I use to run rss2email, r2e. First off, tell quilt we’re making a new patch

$ quilt new add_r2e.diff

now add the new file to the patch

$ quilt add r2e

once that’s done, I can create r2e (apparently I have to do the add before the actual creation), and get quilt to update the patch accordingly

$ quilt refresh

and if I want, take a look at it

$ quilt diff
Index: rss2email/r2e
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ rss2email/r2e 2007-10-07 11:50:44.000000000 +0100
@@ -0,0 +1,4 @@
+#!/bin/sh
+# need this line to run installed version from cron...
+#cd ~/.rss2email/
+/usr/local/bin/python rss2email.py feeds.dat $*

Next, I have a somewhat different version of config.py from the distributed one, so

$ quilt new my_config.diff
$ quilt add config.py
...edit config.py...
$ quilt refresh

an interesting thing to note here is that as I went along I wanted to make further changes to config.py even though I now had other patches stacked on top of this one. A cute feature of quilt is that you can still do that, so long as later patches don’t make conflicting changes, by making the edit, then doing

$ quilt refresh my_config.diff

If later patches do conflict, then you can either pop patches until you get back to this one, make your change, refresh, then push, resolving conflicts as you go, or create a new patch at the top of the stack that makes the change. Which I’d do would depend on whether the change fits logically in the existing patch or not. The patch isn’t very fascinating, but for completeness, here it is

$ quilt diff -P my_config.diff
Index: rss2email/config.py
===================================================================
--- rss2email.orig/config.py 2007-10-07 11:59:57.000000000 +0100
+++ rss2email/config.py 2007-10-07 13:01:33.000000000 +0100
@@ -1,5 +1,6 @@
-SMTP_SEND = 1
-SMTP_SERVER = "my.mailserver.com"
-AUTHREQUIRED = 0
-SMTP_USER="username"
-SMTP_PASS="password"
+SMTP_SEND = 0
+#SMTP_SERVER = "my.mailserver.com"
+#AUTHREQUIRED = 0
+#SMTP_USER="username"
+#SMTP_PASS="password"
+HTML_MAIL = 1

Next, I wanted to be able to make changes to the config for debugging, without having to keep different versions of the config file for “production” and debug versions. So, I decided to add a second “local config” file, called, amazingly, local_config.py.

$ quilt new local_config.diff
$ quilt add rss2email.py
$ quilt add local_config.py
... edit ...
$ quilt refresh

Slightly cheating here, I am anticipating my next change, which is to add more verbosity, so I can see what’s going on. Here’s the output from quilt when asked to show this patch a bit later in the process

$ quilt diff -P local_config.diff
Index: rss2email/local_config.py
===================================================================
--- /dev/null   1970-01-01 00:00:00.000000000 +0000
+++ rss2email/local_config.py   2007-10-07 12:08:33.000000000 +0100
@@ -0,0 +1,2 @@
+VERBOSE = 1
+VERYVERBOSE = 1
Index: rss2email/rss2email.py
===================================================================
--- rss2email.orig/rss2email.py 2007-10-07 12:07:15.000000000 +0100
+++ rss2email/rss2email.py      2007-10-07 12:32:31.691629000 +0100
@@ -206,6 +206,12 @@
 except:
        pass
 
+# Read options from local config file, if present (useful for debugging)
+try:
+       from local_config import *
+except:
+       pass
+
 ### Import Modules ###
 
 import cPickle as pickle, md5, time, os, traceback, urllib2, sys, types
Warning: more recent patches modify files in patch local_config.diff

Note the handy warning at the end.
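The optional-override pattern in that patch can be sketched in modern Python as follows. This is a variation, not the patch itself: the load_overrides helper and the settings dictionary are hypothetical, and it catches ImportError specifically, whereas the bare except in the patch would also silently hide a syntax error in the local file.

```python
import importlib

# Defaults that a distributed config might ship with (illustrative values).
settings = {"VERBOSE": 0, "VERYVERBOSE": 0}

def load_overrides(module_name, settings):
    """Overlay upper-case names from an optional module onto settings."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return settings  # no local file present: keep the defaults
    for name in dir(module):
        if name.isupper():
            settings[name] = getattr(module, name)
    return settings

# The module name below is deliberately one that does not exist.
settings = load_overrides("local_config_that_does_not_exist", settings)
```

With a real local_config.py on the import path, passing "local_config" would pick up VERBOSE and VERYVERBOSE from it instead.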

I try to avoid ever having to rely on my memory (though I do still find I sometimes have to think hard to remember the name of a piece of software I only occasionally use, so I can find it on my disk again – any suggestions?), so the next thing I do is add a Makefile for testing

$ quilt new testing.diff
$ quilt add Makefile
... edit ...
$ quilt refresh

and the diff, by yet another means

$ cat patches/testing.diff 
Index: rss2email/Makefile
===================================================================
--- /dev/null   1970-01-01 00:00:00.000000000 +0000
+++ rss2email/Makefile  2007-10-07 12:20:07.000000000 +0100
@@ -0,0 +1,5 @@
+test:
+       rm -f feeds.dat
+       ./r2e new ben@links.org
+       ./r2e add http://www.boingboing.net/atom.xml
+       ./r2e run --no-send

(quilt maintains the patches/ directory for you). Finally I’m ready to do some real work! I want to know what would be sent in email, and what the parsed RSS looks like. I think you have got the hang of creating patches by now, so I’ll just show you the patch itself…

$ cat patches/verbosity.diff 
Index: rss2email/rss2email.py
===================================================================
--- rss2email.orig/rss2email.py 2007-10-07 12:32:31.691629000 +0100
+++ rss2email/rss2email.py      2007-10-07 12:48:45.000000000 +0100
@@ -430,11 +430,18 @@
 #@timelimit(FEED_TIMEOUT)
 def parse(url, etag, modified):
        if PROXY == '':
-               return feedparser.parse(url, etag, modified)
+               parsed = feedparser.parse(url, etag, modified)
        else:
                proxy = urllib2.ProxyHandler( {"http":PROXY} )
-               return feedparser.parse(url, etag, modified, handlers = [proxy])
-
+               parsed = feedparser.parse(url, etag, modified, handlers = [proxy])
+       if VERYVERBOSE:
+               import pprint
+               pp = pprint.PrettyPrinter(indent = 2)
+               for entry in parsed['entries']:
+                       print "++++++++++"
+                       pp.pprint(entry)
+                       print "++++++++++"
+       return parsed
 
 ### Program Functions ###
 
@@ -707,7 +714,17 @@
                if action == "run": 
                        if args and args[0] == "--no-send":
                                def send(sender, recipient, subject, body, contenttype, extraheaders=None, smtpserver=None):
-                                       if VERBOSE: print 'Not sending:', unu(subject)
+                                       if VERYVERBOSE:
+                                               print "From: ", sender
+                                               print "To: ", recipient
+                                               print "Subject: ", subject
+                                               print "Content-type: ", contenttype
+                                               for hdr in extraheaders.keys():
+                                                       print hdr, ": ", extraheaders[hdr]
+                                               print
+                                               print unu(body)
+                                               print "-------------------"
+                                       elif VERBOSE: print 'Not sending:', unu(subject)
 
                        if args and args[-1].isdigit(): run(int(args[-1]))
                        else: run()

Now I can see what is going on!
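One detail worth knowing when adapting this kind of debug output: pprint.pprint() writes its output itself and returns None, so anything that wants the formatted text as a string should use pprint.pformat() instead. A minimal sketch, with a hypothetical dump_entries helper and a hand-written entry standing in for real parsed feed data:

```python
import pprint

# Hand-written stand-in for one parsed feed entry.
entry = {
    "author": "Xeni Jardin",
    "title": "China's net cops apparently trying to block RSS",
}

def dump_entries(entries):
    """Return a debug dump of parsed entries, one delimited block each."""
    blocks = []
    for e in entries:
        blocks.append("++++++++++\n" + pprint.pformat(e, indent=2) + "\n++++++++++")
    return "\n".join(blocks)

dump = dump_entries([entry])
print(dump)

# pprint.pprint prints the structure and returns None.
result = pprint.pprint(entry)
```

Returning a string rather than printing directly also makes it easy to send the dump to a log file later.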

(At this point, I get less Popper and more Feyerabend, as I am now writing this post as I work on the code, instead of after the fact)

I can’t actually remember the changes I made to the original rss2email so, as I said, I am results-oriented here. My first complaint is that the author no longer appears in the output, and if I do a make, I can see that this is still true, even using the updated version, as this sample shows

From: "Boing Boing" <bozo@dev.null.invalid>
To: ben@links.org
Subject: China's net cops apparently trying to block RSS
Content-type: html
Date : Sun, 07 Oct 2007 12:01:44 -0000
User-Agent : rss2email

China’s net cops apparently trying to block RSS

I’ve been poring through emails and web comments from Boing Boing tv viewers today, and noticed a number of messages that read more or less like this:

Hi, I’m in mainland China, and for some reason I can’t subscribe to subscribe to Boing Boing tv‘s RSS feed. — and come to think of it, I can’t subscribe to feeds for Boing Boing or Boing Boing Gadgets, either. Dude WTF?

Our RSS feeds are not broken, nor are they the only ones affected, not by a long shot. According to various reports, authorities in China are attempting to block *all* RSS feeds to keep out information that may be critical of the nation’s government. Link to item on Ars Technica.

URL: http://feeds.feedburner.com/~r/boingboing/iBag/~3/166381853/china-blocks-all-rss.html

Note that this isn’t quite exactly what was output – I removed FeedBurner’s snoopy images. More on that later. But as you can see, no mention of an author (though the output is quite a bit prettier than I’m used to). Looking at the parsed RSS feed, though, I see

{ 'author': u'Xeni Jardin',
  'content': [ { 'base': 'http://feeds.feedburner.com/boingboing/iBag',
                 'language': None,
                 'type': 'text/html',
                 .
                 .
                 .

At this point I should note that the version of rss2email I’ve been running up to now did not, as far as I can tell, in any way process this field. Also, I’ve exchanged email with BoingBoing and they say they haven’t changed anything. I conclude, therefore, that FeedBurner has, as people suspect, probably changed the format (from including the author’s name in the post content to only having it in the markup). However, the new version does look for author information, which it tries to include as the “From” field in the email. Here’s what it does

def getName(r, entry):
	"""Get the best name."""

	feed = r.feed
	if r.url in OVERRIDE_FROM.keys():
		return OVERRIDE_FROM[r.url]
	
	name = feed.get('title', '')

	if 'name' in entry.get('author_detail', []): # normally {} but py2.1
		if entry.author_detail.name:
			if name: name += ": "
			det=entry.author_detail.name
			try:
			    name +=  entry.author_detail.name
			except UnicodeDecodeError:
			    name +=  unicode(entry.author_detail.name, 'utf-8')

	elif 'name' in feed.get('author_detail', []):
		if feed.author_detail.name:
			if name: name += ", "
			name += feed.author_detail.name
	
	return name

which would work fine, if only there were an author_detail field!
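To make the failure concrete, here is a cut-down stand-in for getName that works on plain dictionaries. It is a sketch only: it drops the OVERRIDE_FROM lookup and the feed-level author branch, and the dictionaries are hand-written rather than real feedparser output.

```python
def get_name(feed, entry):
    """Simplified stand-in for getName, using plain dicts."""
    name = feed.get("title", "")
    detail = entry.get("author_detail", {})
    if detail.get("name"):
        if name:
            name += ": "
        name += detail["name"]
    return name

feed = {"title": "Boing Boing"}

# Only the plain 'author' key is present, as in the parsed entry above.
without_detail = get_name(feed, {"author": "Xeni Jardin"})

# What happens once an author_detail record exists.
with_detail = get_name(
    feed,
    {"author": "Xeni Jardin", "author_detail": {"name": "Xeni Jardin"}},
)
```

Here without_detail comes out as just the feed title, while with_detail carries the author as well.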

The short answer is that this is a bug in feedparser.py (this is another really good reason for using quilt: this particular patch will have to go to someone different to get incorporated).

$ cat patches/add_author.diff 
Index: rss2email/feedparser.py
===================================================================
--- rss2email.orig/feedparser.py        2006-01-11 05:00:52.000000000 +0000
+++ rss2email/feedparser.py     2007-10-07 15:29:10.000000000 +0100
@@ -976,7 +976,10 @@
             author = context.get(key)
             if not author: return
             emailmatch = re.search(r'''(([a-zA-Z0-9\_\-\.\+]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?))''', author)
-            if not emailmatch: return
+            context.setdefault('%s_detail' % key, FeedParserDict())
+            if not emailmatch:
+                context['%s_detail' % key]['name'] = author
+                return
             email = emailmatch.group(0)
             # probably a better way to do the following, but it passes all the tests
             author = author.replace(email, '')
@@ -987,7 +990,6 @@
             if author and (author[-1] == ')'):
                 author = author[:-1]
             author = author.strip()
-            context.setdefault('%s_detail' % key, FeedParserDict())
             context['%s_detail' % key]['name'] = author
             context['%s_detail' % key]['email'] = email

and now the mail header looks like this

From: "Boing Boing: Xeni Jardin"
To: ben@links.org
Subject: China's net cops apparently trying to block RSS
Content-type: html
Date : Sun, 07 Oct 2007 14:29:17 -0000
User-Agent : rss2email
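The behaviour the feedparser patch aims for, namely that an author string with no email address still yields a detail record with a name, can be sketched like this. The author_detail function and the EMAIL pattern are hypothetical simplifications (feedparser’s own regular expression is stricter), and the address in the second call is invented.

```python
import re

# Deliberately simplified address pattern; not the one feedparser uses.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def author_detail(author):
    """Split an author string into a dict with 'name' and maybe 'email'."""
    match = EMAIL.search(author)
    if not match:
        # The case the patch adds: a bare name still produces a record.
        return {"name": author.strip()}
    email = match.group(0)
    # Drop the address, then any brackets that were wrapped around it.
    name = author.replace(email, "").strip().strip("()<>").strip()
    return {"name": name, "email": email}

bare = author_detail("Xeni Jardin")
full = author_detail("Xeni Jardin (xeni@example.com)")
```

The first call returns a record containing only the name; the second returns both name and email.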

Yay, we have an author! I even like the idea of it being in the from field. At this point I could probably stop but another thing has been irritating me, and that’s FeedBurner’s web bugs at the end of each post. So, I’m going to remove them. They look like this

<p><a href="http://feeds.feedburner.com/~a/boingboing/iBag?a=GNLz23"><img src="http://feeds.feedburner.com/~a/boingboing/iBag?i=GNLz23" border="0" /></a></p><img src="http://feeds.feedburner.com/~r/boingboing/iBag/~4/166381853" height="1" width="1" />

It’s always a bit tricky removing something like this – you want to be sure you don’t accidentally remove some other similar-looking stuff. Regular expressions are the answer, of course. They are, however, a bastard to debug since a mistake anywhere causes the whole thing to not match. My technique is to start at the left-hand end and extend the expression a piece at a time as I get it working. The one hint I have for Python is that using an “r” like this, r'\w+', preserves backslashes. Anyway, here’s the patch…

$ cat patches/remove_feedburner_webbugs.diff 
Index: rss2email/rss2email.py
===================================================================
--- rss2email.orig/rss2email.py 2007-10-07 14:51:42.000000000 +0100
+++ rss2email/rss2email.py      2007-10-07 17:03:55.000000000 +0100
@@ -58,6 +58,10 @@
 # 0: Just use the DEFAULT_FROM email instead.
 USE_PUBLISHER_EMAIL = 0
 
+# 1: Remove FeedBurner web bugs (only works if HTML_MAIL = 1)
+# 0: don't
+REMOVE_FEEDBURNER_WEB_BUGS = 1
+
 # 1: Use SMTP_SERVER to send mail.
 # 0: Call /usr/sbin/sendmail to send mail.
 SMTP_SEND = 0
@@ -297,6 +301,17 @@
 
 ### Parsing Utilities ###
 
+def maybeRemoveFeedBurnerWebBugs(html):
+       """If enabled, remove FeedBurner's web bugs from the HTML supplied"""
+
+       if not REMOVE_FEEDBURNER_WEB_BUGS:
+               return html
+
+       import re
+       return re.sub(r'<p><a href="http://feeds.feedburner.com/~a/[^/]+/iBag\?a=\w+"><img src="http://feeds.feedburner.com/~a/[^/]+/iBag\?i=\w+" border="0" /></a></p><img src="http://feeds.feedburner.com/~r/[^/]+/iBag/[^"]+" height="1" width="1" />',
+                     '', html)
+
+
 def getContent(entry, HTMLOK=0):
        """Select the best content from an entry, deHTMLizing if necessary.
        If raw HTML is best, an ('HTML', best) tuple is returned. """
@@ -321,7 +336,7 @@
        if conts:
                if HTMLOK:
                        for c in conts:
-                               if contains(c.type, 'html'): return ('HTML', c.value)
+                               if contains(c.type, 'html'): return ('HTML', maybeRemoveFeedBurnerWebBugs(c.value))
 
                if not HTMLOK: # Only need to convert to text if HTML isn't OK
                        for c in conts:

If I’d done this in feedparser.py then it would also work for text-mode emails, probably. I could probably be persuaded to put the patch there instead.
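The piece-at-a-time approach to building the regular expression can be sketched as below. It is an illustration rather than the patch: the PIECES list, first_failing_piece and remove_web_bugs are hypothetical names, the dots in the hostname are escaped here, and the sample string is modelled on the web bug shown earlier with the leading <p> that the pattern in the patch expects.

```python
import re

# The pattern, split into pieces so each prefix can be tested on its own.
# Raw strings (the r prefix) keep the backslashes intact for the re module.
PIECES = [
    r'<p><a href="http://feeds\.feedburner\.com/~a/[^/]+/iBag\?a=\w+">',
    r'<img src="http://feeds\.feedburner\.com/~a/[^/]+/iBag\?i=\w+" border="0" />',
    r'</a></p>',
    r'<img src="http://feeds\.feedburner\.com/~r/[^/]+/iBag/[^"]+" height="1" width="1" />',
]

def first_failing_piece(html):
    """Return the index of the first piece that breaks the match, or None."""
    for i in range(1, len(PIECES) + 1):
        if not re.search("".join(PIECES[:i]), html):
            return i - 1
    return None

def remove_web_bugs(html, enabled=True):
    """Strip the web-bug markup; return the HTML untouched when disabled."""
    if not enabled:
        return html
    return re.sub("".join(PIECES), "", html)

sample = (
    '<p>Post body.</p>'
    '<p><a href="http://feeds.feedburner.com/~a/boingboing/iBag?a=GNLz23">'
    '<img src="http://feeds.feedburner.com/~a/boingboing/iBag?i=GNLz23" border="0" />'
    '</a></p>'
    '<img src="http://feeds.feedburner.com/~r/boingboing/iBag/~4/166381853" height="1" width="1" />'
)
cleaned = remove_web_bugs(sample)
```

When the full pattern fails on a real feed, first_failing_piece points at the piece to look at, which is much quicker than staring at one long expression.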

Anyway, now I’m done, so some retroactive tweakery of the makefile, to include an install target, and also to make sure that the patches are available to anyone reading this, yielding a new version of the makefile patch…

$ cat patches/testing.diff 
Index: rss2email/Makefile
===================================================================
--- /dev/null   1970-01-01 00:00:00.000000000 +0000
+++ rss2email/Makefile  2007-10-07 17:16:37.000000000 +0100
@@ -0,0 +1,18 @@
+test:: newfeed
+       ./r2e run --no-send
+
+testmail:: newfeed
+       ./r2e run
+
+newfeed::
+       rm -f feeds.dat
+       ./r2e new ben@links.org
+       ./r2e add http://www.boingboing.net/atom.xml
+
+install::
+       cd ~/.rss2email && tar cvfz /tmp/r2e-backup.tgz .
+       cp *.py ~/.rss2email
+
+patches::
+       cd patches && tar cvfz /tmp/r2e-patches.tgz .
+       scp /tmp/r2e-patches.tgz sump2.links.org:files

4 Comments

  1. I like the post, and I like the suggestion to look into quilt.

    So thanks.

    Comment by lars — 7 Oct 2007 @ 19:01

  2. Good stuff, Ben. I hadn’t heard of quilt either. Thumbs up for techie coding posts 🙂

    Comment by Pat Patterson — 7 Oct 2007 @ 20:03

  3. If you liked quilt, have a look at mercurial queues.
    http://hgbook.red-bean.com/hgbookch12.html.

    They are an extension to mercurial that allow you to maintain a patch set just like quilt does. But, once a patch is applied normal mercurial commands like ‘hg log’ and ‘hg diff’ work as if the patches were normal revisions of the repo.

    And it’s fast too. And you can version control your patches. And you get the mercurial merge support, etc.

    Comment by David Roussel — 10 Oct 2007 @ 13:48

  4. […] Caja is out in the wild, and I can’t use Google’s internal development tools, I find quilt is coming in handy (why not mercurial queues? I’d prefer it, but the version I can easily […]

    Pingback by Links » Quilt and SVN: A Slightly Unhappy Marriage — 16 Nov 2007 @ 5:49
