The Idiot’s Guide to Site Scraping

Web bots perform a host of simple and repetitive tasks automatically and effectively. They do a great job monitoring the internet, for example. They help search engines collect data, they keep an eye on your website’s health and security, and they provide valuable metrics along the way.

More controversial, however, are the bots that comb the web looking for content to copy or steal. This is called site scraping, and it’s becoming a huge problem.

Web Scrapers That Made the News

Two years ago, a (now defunct) startup ran afoul of Facebook when it continued accessing the social media site’s servers even after it was warned to stop. And currently, LinkedIn is in a contentious court battle to protect its public profiles from a small site scraping company.

Content is king on the internet. And collecting and connecting intellectual property is often a building block for an organization hoping to establish itself in the marketplace. Up until recently, site scraping was ubiquitous and therefore (somewhat) tolerated. But now it’s a critical issue for companies that want to safeguard their proprietary content and boost their security.

But don’t get us wrong; not all web scraping is bad. In many cases, data owners want to distribute their information to as many people as possible, so in some cases it’s okay to scrape content. Government agencies, for example, want public information disseminated openly. And aggregation websites for travel, hotels, and concert tickets all hope their digital data reaches the widest audience. Site scraping helps with that.

It’s easy to see why companies like Facebook and LinkedIn have a problem with it. They view scraping as a way for someone to steal intellectual property, price lists, customer lists, insurance pricing, and other datasets. Why should LinkedIn, for example, give away information from its 467 million users when it sells that very data to recruiters and other business customers?

Web Scraping: A Double-edged Sword

Site scraping is a hot topic because it’s such a powerful tool. In the right hands, it automates the gathering and dissemination of valuable information. Every time you use a search engine looking for a recipe or a bus schedule, you can thank your friendly web bot for doing its job.

In the wrong hands, unfortunately, web scraping bots lead to theft of intellectual property or an unfair competitive edge. An online entity targeted by a scraper can suffer severe financial losses, especially if it’s a business that strongly relies on competitive pricing models or deals in content distribution. The security risks of an intrusive bot cannot be overstated.

Since all scraping bots have the same purpose (to access site data), it can sometimes be difficult to distinguish between legitimate and malicious bots. Here are two key ways to tell the difference.

  1. Legitimate bots identify the organization they scrape for. Googlebot, for example, declares in its HTTP User-Agent header that it belongs to Google. Malicious bots, by contrast, impersonate legitimate traffic by forging that header.
  2. Legitimate bots will usually abide by a site’s robots.txt file, which lists the pages a bot can and cannot access. Bots with a nefarious agenda will crawl the website regardless of what the site operator allows. (A minimal sketch of both checks follows this list.)
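As a rough illustration, here is a minimal Python sketch of those two checks. The user-agent string, site, and path are hypothetical placeholders, and real classification logic is far more involved.

```python
# Minimal sketch of the two checks above.
# The user-agent string, site, and path below are hypothetical placeholders.
import urllib.robotparser


def identifies_operator(user_agent: str) -> bool:
    """Crude check: well-behaved crawlers name their operator in the User-Agent."""
    known_crawlers = ("googlebot", "bingbot", "duckduckbot")
    return any(name in user_agent.lower() for name in known_crawlers)


def allowed_by_robots(site: str, path: str, user_agent: str) -> bool:
    """Ask the site's robots.txt whether this user agent may fetch the path."""
    parser = urllib.robotparser.RobotFileParser()
    parser.set_url(f"{site}/robots.txt")
    parser.read()
    return parser.can_fetch(user_agent, f"{site}{path}")


ua = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
print(identifies_operator(ua))                                   # True
print(allowed_by_robots("https://example.com", "/private/", ua))
```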

Beyond these two examples, what can WebOps teams do to protect content from site scrapers? To be honest, the increased sophistication in malicious scraper bots has rendered common security software ineffective. The headless browser bot, for example, can mask itself as a human and fly under the radar of most mitigation solutions.

As more money is invested in developing new, more sophisticated bots that can better impersonate browsers or even other bots, traditional classification and mitigation tools are being rendered useless.

Scraping Is Resource Intensive and Expensive

Web scraper bots are resource-intensive, requiring servers with substantial processing power. Legitimate scraping bot operators invest heavily in computing resources.

A bad actor will often use a botnet to do the job of scraping a competitor’s site. A botnet is a collection of computers infected with the same malware and controlled from a central hub. The owners of those computers are usually unaware that their machines are powering the botnet. By harnessing a large number of infected computers, perpetrators gain the resources to conduct large-scale scraping for little or no cost, making it a far cheaper option than scraping sites legitimately.

Protecting Against Web Scraping

To stay ahead of attackers, modern products need to combine different technologies to create a toolkit that can be frequently improved to face new threats. Unfortunately, as in many other security cat-and-mouse games, there’s no silver bullet.

Modern products usually implement some or all of the following technologies to be able to correctly classify those site scraping bots:

Device fingerprinting: By passively examining the HTTP request structure and actively probing the client with tools such as JavaScript, modern products can build a fingerprint of each client accessing the site, even when several clients (sometimes including compromised computers) share the same IP address. This information gives us a clue as to whether a visitor is a human or a bot, and whether it’s safe or malicious.
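To give a sense of the passive side of this technique, here is a minimal sketch that derives a fingerprint from request attributes. The header choices and the optional TLS value are illustrative only, not a description of any particular product’s logic.

```python
# Minimal sketch of passive fingerprinting: hash the request attributes that
# tend to stay stable for a given client, independently of its IP address.
# The header choices and the optional TLS value are illustrative only.
import hashlib


def passive_fingerprint(headers: dict, tls_fingerprint: str = "") -> str:
    """Derive a stable identifier from header order, header values, and TLS data."""
    parts = [
        ",".join(headers),                      # header order itself is a useful signal
        headers.get("User-Agent", ""),
        headers.get("Accept-Language", ""),
        headers.get("Accept-Encoding", ""),
        tls_fingerprint,                        # e.g. a JA3 hash, if the edge provides one
    ]
    return hashlib.sha256("|".join(parts).encode()).hexdigest()


request_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
}
print(passive_fingerprint(request_headers))
```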

IP reputation: Information about the history of the source IP is also valuable. Visits from IP addresses with a history of being used in attacks are treated with suspicion and are more likely to be scrutinized closely.
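A minimal sketch of this idea follows; the networks, scores, and threshold are made-up examples (using reserved documentation ranges), not real reputation data.

```python
# Minimal sketch: score a visitor by its source IP's history.
# The networks, scores, and threshold are made-up examples (documentation ranges).
import ipaddress

SUSPICIOUS_NETWORKS = {
    ipaddress.ip_network("203.0.113.0/24"): 0.9,   # seen in earlier scraping campaigns
    ipaddress.ip_network("198.51.100.0/24"): 0.4,  # occasional abuse reports
}


def ip_risk(source_ip: str) -> float:
    """Return the highest risk score of any known-bad network containing this IP."""
    addr = ipaddress.ip_address(source_ip)
    return max(
        (score for network, score in SUSPICIOUS_NETWORKS.items() if addr in network),
        default=0.0,
    )


if ip_risk("203.0.113.55") > 0.5:
    print("flag this visitor for closer inspection")
```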

Behavior analysis: Tracking the ways that visitors interact with a website can reveal abnormal behavioral patterns, such as a suspiciously aggressive rate of requests and illogical browsing patterns. This step helps identify bots that try to pass themselves off as human visitors.
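One simple behavioral signal is request rate. The sketch below, with arbitrary window and threshold values, flags a visitor (identified here by a fingerprint like the one above) that requests pages faster than a human plausibly could; real systems also weigh navigation order, mouse and scroll events, and similar signals.

```python
# Minimal sketch of one behavioral signal: an aggressively high request rate.
# The window size and threshold are arbitrary example values.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 10
MAX_REQUESTS_PER_WINDOW = 30

recent_requests = defaultdict(deque)   # visitor fingerprint -> recent request timestamps


def looks_automated(fingerprint: str) -> bool:
    """Record a request and report whether this visitor exceeds the rate threshold."""
    now = time.time()
    hits = recent_requests[fingerprint]
    hits.append(now)
    while hits and now - hits[0] > WINDOW_SECONDS:
        hits.popleft()                 # drop requests that fell out of the window
    return len(hits) > MAX_REQUESTS_PER_WINDOW
```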

In this post we reviewed the technologies used to classify visitors and filter out sophisticated site scraping bots. In future posts we will address possible courses of action once those bots are detected.

Web bots comprise nearly half of the activity on the internet these days. And our data tells us that malicious bot traffic (which includes site scraping) represents 66 percent of that chunk. You can find out more about protecting your site from harmful bot traffic with our bot mitigation security solution.