This is an excerpt from my blog post Working With Large Data Sets...


Over the past 18 months I’ve moved from working on the SMTP proxy to working on our other systems, all of which make use of the data we collect from each connection. It’s a fair amount of data: the record for each connection can be up to 2 KB in size. Our servers receive approximately 1,000 of these records per second, a rate that stays fairly constant because our customers are distributed around the globe. If you compare that to Twitter’s peak of 3,283 tweets per second (each a maximum of 140 characters), you can see it’s not a small amount of data that we are dealing with here.
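To put rough numbers on that comparison, here is a back-of-the-envelope sketch in Python. It uses only the figures quoted above, takes the record size as 2 kilobytes (2,048 bytes) and treats it as an upper bound, and assumes one byte per tweet character, which is a simplification of mine. The variable names are just for illustration.

```python
# Figures quoted in the post.
RECORDS_PER_SECOND = 1000        # connection records received per second
MAX_RECORD_BYTES = 2 * 1024      # "up to 2 KB" per record (upper bound)
TWEETS_PER_SECOND_PEAK = 3283    # Twitter's quoted peak
MAX_TWEET_CHARS = 140

# Assumption for illustration only: one byte per tweet character.
BYTES_PER_CHAR = 1

SECONDS_PER_DAY = 24 * 60 * 60

# Upper-bound byte rates for each stream.
our_bytes_per_second = RECORDS_PER_SECOND * MAX_RECORD_BYTES
twitter_bytes_per_second = TWEETS_PER_SECOND_PEAK * MAX_TWEET_CHARS * BYTES_PER_CHAR

ratio = our_bytes_per_second / twitter_bytes_per_second
our_bytes_per_day = our_bytes_per_second * SECONDS_PER_DAY

print(f"Our stream (upper bound): {our_bytes_per_second:,} bytes/s")
print(f"Twitter at peak:          {twitter_bytes_per_second:,} bytes/s")
print(f"Ratio:                    {ratio:.1f}x")
print(f"Our stream per day:       {our_bytes_per_day / 1e9:.1f} GB (upper bound)")
```

Under those assumptions the sustained stream comes out at a little over four times Twitter's peak byte rate. Since 2 KB is the maximum record size rather than the average, the real figure is likely lower.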


I recently set out to scientifically prove the benefits of throttling, which is our technology for slowing down connections in order to detect spambots, which are kind enough to disconnect quite quickly when they see a slow connection. Due to the nature of the data we had, I needed to work with a long time range of data to show that an IP that appeared on Spamhaus had previously been throttled and disconnected, and then measure how long it took to appear on Spamhaus. I set up a job to pre-process the data of a selected set of customers and arbitrarily decided that 66 days would be a good amount to process, as this was 2 months plus a little breathing room. I knew from experience that it might take 2 months for a bad IP to be picked up by Spamhaus.
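The measurement described here can be sketched in a few lines of Python. This is not the actual job, which ran over a far larger data set; it is a minimal illustration that assumes two hypothetical inputs: a list of (IP, timestamp) pairs for connections that were throttled and then disconnected, and a mapping from IP to the time it first appeared on Spamhaus. The function name, the input shapes, and the sample data (reserved documentation IP addresses with made-up dates) are all my own inventions for the example.

```python
from datetime import datetime, timedelta


def listing_lead_times(throttle_events, first_listed):
    """Return {ip: timedelta} from first throttled disconnect to first listing.

    throttle_events: iterable of (ip, datetime) pairs (hypothetical format).
    first_listed: dict mapping ip -> datetime of first Spamhaus listing
                  (hypothetical format).
    Only IPs that were throttled *before* they were listed are included.
    """
    # Find the earliest throttled disconnect for each IP.
    first_throttled = {}
    for ip, seen_at in throttle_events:
        if ip not in first_throttled or seen_at < first_throttled[ip]:
            first_throttled[ip] = seen_at

    # Keep only IPs whose first throttle precedes their listing.
    lead_times = {}
    for ip, listed_at in first_listed.items():
        seen_at = first_throttled.get(ip)
        if seen_at is not None and seen_at < listed_at:
            lead_times[ip] = listed_at - seen_at
    return lead_times


# Made-up sample data, purely for illustration.
events = [
    ("192.0.2.1", datetime(2010, 7, 3, 9, 0)),
    ("192.0.2.1", datetime(2010, 7, 1, 12, 0)),
    ("198.51.100.7", datetime(2010, 8, 1, 0, 0)),
]
listed = {
    "192.0.2.1": datetime(2010, 7, 15, 12, 0),    # throttled first, then listed
    "198.51.100.7": datetime(2010, 7, 20, 0, 0),  # listed before we throttled it
    "203.0.113.9": datetime(2010, 7, 1, 0, 0),    # never throttled by us
}

result = listing_lead_times(events, listed)
for ip, lead in result.items():
    print(ip, "was throttled", lead.days, "days before it was listed")
```

Only the first sample IP counts as evidence, because it is the only one that was throttled before it was listed. With lead times potentially running up to 2 months, the window of data has to be at least that long, which is why a 66-day range makes sense.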


I extracted 28,204,693 distinct IPs, some of which were seen over a million times in this data set.


Comment by Phil Whelan on September 28, 2010 at 3:02pm
Hi Tom. I have not worked with Twitter data for a long time, so it's probably very different now. You should check out the Twitter API as that's how you'll retrieve the data, unless you're willing to spend the millions of dollars a year it costs to get the Twitter firehose (every tweet). See http://apiwiki.twitter.com

© 2019 AnalyticBridge.com is a subsidiary and dedicated channel of Data Science Central LLC