For a while, it's been apparent to me that Jabber was occasionally dropping messages. Last week I finally got annoyed enough to investigate it in earnest.
Unfortunately, I started off on entirely the wrong track, and blamed GTalk (sorry, guys!) – but much investigation later, with help from some very patient friends (you know who you are: thanks!), I found that it was my own Jabber server that was to blame.
However, it was not an easy journey. First of all, how do you tell that messages are being dropped? I am pretty certain my server has been dropping messages since before Christmas – i.e. at least four weeks, and I am fairly certain it has been doing it ever since I first built it – which must be a year or two now. Could it really drop messages for that long without anyone noticing? It seems to me, in retrospect, that it could!
A wise friend of mine once said, “you know, 90% of what we say to each other could be completely different and it would make no difference”. This is even more true for IM. We send messages out. Sometimes we get answers. Sometimes we don’t. If we don’t, well, the other guy was away, or not interested, or got busy and forgot to respond. It’s fine. It was probably one of the 90%. When it’s one of the 10%, well, then we say it again. And this time we get an answer, and we’re both happy. So, you can go on for years and not notice that stuff is missing.
It wasn’t until I started badgering my friends to tell me when they thought messages were going missing that it became clear that they were, indeed. And not just a few – a lot! I now know that it was dropping about 50% of incoming messages (i.e. messages sent to me) and no outgoing messages. God knows what kind of rude bastard my friends think I am by now! An interesting feature is that it would drop them in batches – i.e. drop for 5 minutes, forward for 5 minutes, drop for 5 minutes and so on. If it had been every second message it would have been apparent sooner, I suspect, because the conversation would be quite choppy.
But even knowing that messages were being dropped was not the end of the story. How do you figure out what is to blame? In the typical scenario, because I run my own server, there are at least 3 connections and 4 pieces of software that could be at fault.
- The other guy’s client,
- the connection from that to their server,
- their server,
- the connection from their server to my server,
- my server,
- the connection from my server to my client,
- my client
As I said above, I started at the wrong end – with GTalk. With some help, it became apparent that GTalk was unlikely to be to blame (and because it was upstream from the other guy’s client, we could eliminate that, too). So the next easiest target to look at was my server, which I examined with the help of tcpdump and Wireshark. The investigation was complicated by both OTR and SSL, which make it very hard to interpret and track messages. Luckily the server-to-server connection was in plain text (which is one reason I use OTR), so it could be done, with difficulty. It helped that my Jabber daemon turned out to be the culprit: I could see messages coming in in the traces, with no corresponding activity on the server-to-client connection. Sometimes.
To cut a long story short, after much poking at my existing Jabber server, which was jabberd14, I decided to replace it with jabberd2. But before I did that, I wanted to be really sure that jabberd14 was to blame, and that jabberd2 would fix it. So, I wanted the Jabber equivalent of ping. To my amazement, there appears to be no such thing! There is a Jabber ping extension, but I can’t find anything that uses it. Which is the final reason I am writing this blog post: I wrote a pair of scripts, Ping.pl and Pong.pl, that will do a Jabber ping test. Feel free to use them. And if you are using jabberd14, I’d really like to know if you, too, get message drops…
I was planning to make them count and produce statistics and such, but I got lazy. Since you can see both ends, eyeballing them is enough to let you know what’s going on – Ping does count how many it got back, though, so you can leave it running without watching it all the time. To run them, you need two Jabber accounts, one on the suspect server and the other elsewhere. You can run them like this:
./Ping.pl suspect-server.com account1 password1 firstname.lastname@example.org
./Pong.pl otherserver.org account2 password2
Pong will actually answer multiple Pings running simultaneously. Ping pings every 10 seconds. Output should be reasonably obvious. Because Jabber does store-and-forward, Ping will ignore Pongs from a different session. And because they use different resources, you can use the same account at both ends, if you want. Like I say, I’d be really interested to hear from anyone that experiences drops – a couple of hundred pings was always enough to show them when I was testing.
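For anyone who wants to build something similar, here is a rough sketch of the bookkeeping described above: tag each ping with a per-run session id and a sequence number, ignore replies from any other session, and count what came back. It is written in Python rather than Perl and is not the code from Ping.pl or Pong.pl. The `ping <session> <seq>` message format and the names `PingTracker` and `pong_for` are made up for illustration, and sending the messages over Jabber is left out entirely.

```python
import uuid


class PingTracker:
    """Tracks pings sent in one run and the pongs that come back."""

    def __init__(self, session_id=None):
        # A fresh random id per run lets us recognise stale replies that
        # the server stored and delivered to a later session.
        self.session_id = session_id or uuid.uuid4().hex
        self.sent = 0
        self.answered = set()

    def next_ping(self):
        """Return the body of the next ping message (hypothetical format)."""
        self.sent += 1
        return f"ping {self.session_id} {self.sent}"

    def handle_reply(self, body):
        """Record a pong; return True only if it belongs to this session."""
        parts = body.split()
        if len(parts) != 3 or parts[0] != "pong":
            return False
        if parts[1] != self.session_id:
            return False  # reply to a different (probably older) session
        try:
            seq = int(parts[2])
        except ValueError:
            return False
        if not 1 <= seq <= self.sent:
            return False
        self.answered.add(seq)
        return True

    def dropped(self):
        """Sequence numbers that were sent but never answered."""
        return sorted(set(range(1, self.sent + 1)) - self.answered)


def pong_for(body):
    """What the answering side would send back for a ping body, or None."""
    parts = body.split()
    if len(parts) == 3 and parts[0] == "ping":
        return f"pong {parts[1]} {parts[2]}"
    return None
```

In a real test, something would call `next_ping()` every 10 seconds, send the result as a message, and feed every incoming message body to `handle_reply()`. At the end, `dropped()` lists the pings that never got an answer.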
Oh yeah, and the good news: jabberd2 has now answered over 500 pings without a single drop. So, if you felt ignored, I hope things will improve!