Inside the Massive 711 Million Record Onliner Spambot Dump
by Troy Hunt
August 30, 2017
Last week I was contacted by someone alerting me to the presence of a spam list. A big one. That’s a bit of a relative term though because whilst I’ve loaded “big” spam lists into Have I been pwned (HIBP) before, the largest to date has been a mere 393m records and belonged to River City Media. The one I’m writing about today is 711m records which makes it the largest single set of data I’ve ever loaded into HIBP. Just for a sense of scale, that’s almost one address for every single man, woman and child in all of Europe. This blog post explains everything I know about it.
Firstly, the guy who contacted me is Benkow moʞuƎq and he’s done some really interesting malware and spambot analysis in the past. During our communication over the last week, I had a read of his piece on Spambot safari #2 – Online Mail System which is a good example of the sort of work he’s been doing (it’s also a good example of how dodgy some of this spammer code is!) He went on to explain how he’d located a machine used by the “Onliner Spambot” and pointed me to a path on an IP address with directory listing enabled:
I’ve obfuscated a bunch of info here because as of the time of writing, the server is still up and I don’t want to give away any information that could result in the data being spread further. The IP address is actually based in the Netherlands and Benkow and I have been in touch with a trusted source there who’s communicating with law enforcement in an attempt to get it shut down ASAP. Until that time, I’m not going to share file names in their entirety although I’ll certainly describe anything of relevance in them.
Before I dive into the data, Benkow has posted a dedicated piece on the mechanics of this spambot that’s worth a read. You can also find a great story on ZDNet from Zack Whittaker which is a good overview of the situation. The gap I want to fill here is to explain what I can about the data because there’ll be a very large number of people finding themselves on HIBP and wondering what on earth is going on. If you haven’t already read Benkow’s piece, there are 2 important classes of data you need to understand:
- Email addresses. That’s it – just masses and masses of email addresses used to deliver spam to. In some cases, a single file may contain tens or even hundreds of millions of addresses.
- Email addresses and passwords. Benkow explains that these are used in an attempt to abuse the owners’ SMTP server in order to deliver spam. I also believe that many of these may simply be aggregations from various other breach sources I’ll talk about a little later on.
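To make the distinction between those two classes concrete, here’s a minimal sketch of how rows in a file could be bucketed. It’s purely illustrative: the colon delimiter, the `record_class` and `tally` names and the sample rows are all assumptions of mine, not the actual layout of the files in the dump.

```python
from collections import Counter


def record_class(line: str, delimiter: str = ":") -> str:
    """Bucket one row as address-only, address-plus-password or unrecognised.

    ASSUMPTION: credentials are stored as "address<delimiter>password".
    The real files may use a different layout entirely.
    """
    line = line.strip()
    if "@" not in line:
        return "unrecognised"
    address, sep, secret = line.partition(delimiter)
    # Only call it a credential pair if the delimiter was found, the part
    # before it looks like an address and something follows it.
    if sep and "@" in address and secret:
        return "address_and_password"
    return "address_only"


def tally(lines):
    """Count how many rows fall into each class."""
    return Counter(record_class(line) for line in lines)


if __name__ == "__main__":
    # Made-up sample rows, not data from the dump.
    sample = ["someone@example.com", "other@example.com:hunter2", "garbage"]
    print(tally(sample))
```

Running something like this over a file would give a rough feel for whether it’s a pure delivery list or a list of credentials intended for abusing SMTP servers.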
Getting on to the data itself, the first place to start is with an uncomfortable truth: my email address is in there. Twice:
That first file is the 14GB one from the earlier directory listing whilst the second is 131MB. In many cases, I found the same data in both the former larger file and a subsequent smaller one. Interestingly, as you can see from the suffix above, both refer to “UK” (I’m certainly not from the United Kingdom) whilst others refer to “AU” (although I’m not in there). There are no other 2 letter country codes represented in the file names but clearly when we’re talking many hundreds of millions of addresses here, a heap of them are from other locations so take those suffixes with a grain of salt.
One of the files with the “NewFile_” prefix contained over 43k rows associated with the Roads and Maritime Services of my neighbouring state here in Australia:
Every row contains RMSETollDontReply@rms.nsw.gov.au in quotes followed by “support@” and then predominantly .com.au domains, albeit with over 13k .ru domains. This email address is used to send notifications relating to the “E-Tag” device installed on your car windscreen so that you can pay tolls. I know this because I’ve received a bunch of them in the past:
I’ll take a stab at it and say that there aren’t many legitimate drivers using the New South Wales toll road system with Russian email addresses! Clearly, the constant alias on every one of these accounts is auto-generated. Interestingly, I saw a similar pattern with the B2B USA Businesses spam list I loaded last month with many comments like this:
There’s also some pretty poorly parsed data in there which I suspect may have been scraped off the web. For example, Employeesemail@example.com appears twice:
The first file is the same one my own email address was in and the second is the same file name structure albeit with a different number in it. And if you’re wondering why I’ve publicly listed someone else’s address, it’s because it’s already publicly listed:
But of course, the data in the dump has a bunch of junk prefixed to the address, junk which appears to be an HTML file name and may indicate the “address” was scraped off the web and the parsing simply wasn’t done very well. The point here is that there’s going to be a bunch of addresses here that simply aren’t very well-formed so whilst the “711 million” headline is technically accurate, the number of real humans in the data is going to be somewhat less.
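As a rough illustration of why poorly parsed rows inflate the headline number, here’s a small sketch that separates strings that don’t even have the shape of an address from ones that look like they’ve had a page name glued on the front. The patterns, the `classify` function and the list of file extensions are hypothetical heuristics of mine, not how HIBP processes data, and the sample addresses are made up.

```python
import re

# Deliberately loose shape check: a single "@", no whitespace and at least
# one dot in the domain. This is not a full RFC 5322 validator.
ADDRESS_SHAPE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s.]+$")

# ASSUMPTION: a web file extension inside the local part suggests that an
# HTML file name was stuck to the address when it was scraped.
SCRAPE_JUNK = re.compile(r"\.(html?|php|aspx?)", re.IGNORECASE)


def classify(candidate: str) -> str:
    """Return 'malformed', 'suspect' or 'plausible' for one candidate string."""
    candidate = candidate.strip()
    if not ADDRESS_SHAPE.match(candidate):
        return "malformed"
    local_part = candidate.rsplit("@", 1)[0]
    if SCRAPE_JUNK.search(local_part):
        return "suspect"
    return "plausible"


if __name__ == "__main__":
    # Made-up examples, not rows from the dump.
    for row in ["someone@example.com",
                "contact.htmlsomeone@example.com",
                "not an address"]:
        print(row, "->", classify(row))
```

A heuristic like this can only flag the junk, it can’t reliably work out where the file name ends and the real alias begins, which is why those rows are best thought of as noise rather than as people.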