Before becoming a co-founder of Distil Networks (now Imperva Bot Management), my background was in writing bots that scraped web pages. Every day I was deploying new bots that logged into websites, scraped their data, and dumped it all into my local database. None of this was done for malicious reasons, but I was still launching 10,000+ requests an hour at a server that probably didn’t get that many requests in a day.
Eventually, they blocked my IP. The hours of work I spent writing the perfect scraper went down the drain.
Until I took 10 seconds and changed my IP. After that I was back to scraping.
The problem is that (nowadays in particular) an IP address is cheap, easy to come by, and most importantly, available on demand. If every IP available at my house got blocked, I’d use my work IP. If all available work IPs got blocked: off to the cloud! I could spin up a new Amazon EC2 instance or Windows Azure instance. Exhausted all those options? Time for other cloud providers (Google, Rackspace, HP, etc.) or the Tor network.
This is why we talk a lot with our customers about the cat and mouse game of bot detection and mitigation. For every IP you block, there are millions of other potential IPs someone could be using to scrape your data or attack your website. Chasing down every IP and finding every possible entry point is beyond frustrating – it’s downright maddening. Worst of all, it’s something thousands upon thousands of website owners are doing daily.
This was a problem we knew we wanted to solve when we first started building Distil Networks. If we were really going to help our customers, we had to do more than just automated IP blocking. With that in mind, we stopped looking at where the request was coming from and started looking at what the bot was doing, what capabilities the bot had, and how the bot was traversing your website. This led to our bot signatures: IP-address-agnostic combinations of all this information that we can use to define a bot, regardless of where it originally came from.
Currently, we generate a signature for every device that accesses a page we protect. Once a bot has been detected, we take that bot’s specific signature and propagate it out to our worldwide network to ensure that no matter where a bot shows up, we can see it and stop it.
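To make the idea concrete, here is a minimal sketch in Python of what an IP-agnostic signature could look like. It is not our actual implementation: the `RequestTraits` fields, the `signature` function, and the `blocked_signatures` set are all hypothetical names chosen for illustration, and a real system would use far more signals. The point is only that the IP address is never part of the input, so the same client hashes to the same value wherever it connects from.

```python
import hashlib
import json
from dataclasses import dataclass, asdict
from typing import Tuple


@dataclass(frozen=True)
class RequestTraits:
    """Hypothetical traits observed for one client. Note: no IP address field."""
    user_agent: str
    header_order: Tuple[str, ...]  # HTTP header names in the order they were sent
    runs_javascript: bool          # capability: does the client execute JS?
    accepts_cookies: bool          # capability: does the client keep cookies?
    pages_per_minute: int          # behavior: coarse traversal rate


def signature(traits: RequestTraits) -> str:
    """Hash the traits into a stable identifier that does not depend on the IP."""
    # Serialize with sorted keys so equal traits always produce the same bytes.
    payload = json.dumps(asdict(traits), sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Assumed shared store of signatures already identified as bots.
blocked_signatures = set()


def is_blocked(traits: RequestTraits) -> bool:
    """Check a client against the known-bot signatures, ignoring its IP."""
    return signature(traits) in blocked_signatures


# The same scraper, seen once from a home connection and once from the cloud,
# yields identical traits and therefore an identical signature.
scraper = RequestTraits(
    user_agent="ExampleScraper/1.0",
    header_order=("Host", "User-Agent", "Accept"),
    runs_javascript=False,
    accepts_cookies=False,
    pages_per_minute=170,
)
blocked_signatures.add(signature(scraper))
```

In this sketch, changing IP addresses does nothing for the scraper, because `is_blocked(scraper)` depends only on what the client does and what it is capable of.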
All of this probably makes it seem as though we don’t value IP information, but we do. If we don’t have any previous record of a device signature, we activate a level within our bot blocking technology that looks at the owner of a given IP address and uses that to determine a baseline likelihood that a request is coming from a bot. For example, if a request comes from Comcast in Washington DC, more often than not it’s just a normal person browsing your webpage. If that same request came from Amazon East or SoftLayer Dallas, our network will begin to monitor the connection more closely, since the requestor is far more likely to be a bot. It’s by no means the be-all and end-all of detection, but it is a data point we’d be foolish to ignore.
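As a rough illustration of that fallback, the sketch below assigns a starting likelihood based on who owns the IP. Everything in it is an assumption made for the example: the owner categories, the lookup tables, the threshold, and especially the numbers, which are invented and not measured values.

```python
# Hypothetical mapping from IP owner to the kind of network it runs.
OWNER_TYPE = {
    "Comcast": "residential_isp",
    "Amazon": "cloud_hosting",
    "SoftLayer": "cloud_hosting",
}

# Invented baseline likelihoods that a request from each kind of network is a bot.
BASELINE_BY_OWNER_TYPE = {
    "residential_isp": 0.1,
    "cloud_hosting": 0.7,
}

# Used when the owner is unknown to us (also an invented value).
DEFAULT_BASELINE = 0.3


def baseline_bot_likelihood(ip_owner: str) -> float:
    """Return a starting estimate based only on who owns the IP address."""
    owner_type = OWNER_TYPE.get(ip_owner)
    return BASELINE_BY_OWNER_TYPE.get(owner_type, DEFAULT_BASELINE)


def needs_close_monitoring(ip_owner: str, signature_seen_before: bool,
                           threshold: float = 0.5) -> bool:
    """Apply the IP-owner fallback only when there is no signature history."""
    if signature_seen_before:
        # A known signature is the stronger signal, so the IP fallback stays out of it.
        return False
    return baseline_bot_likelihood(ip_owner) >= threshold
```

With these made-up numbers, a first-time visitor from Comcast is left alone, while a first-time visitor from Amazon or SoftLayer is watched more closely until its signature tells us more.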
Basically, blocking by IP address can be effective, but it is a temporary solution at best. Unless you catch one of these attacks as it’s occurring, there’s no guarantee that the IP you spent hours scouring your server logs for is even the IP that the next bot attempt will come from. It’s like a game of Whack-A-Mole that never ends.
We truly believe that signature-based blocking is the future of bot security online. If you’re currently fighting a war against a bot that’s attacking or scraping your website, we can help.