The customer is the leading Directory Service in Belgium. Owning a database of more than 8 million telephone numbers and addresses (both private and business listings), the customer provides directory services mainly via two channels: a call center with operators who can deliver personalized service, and self-service search via its website.
Bot operators jeopardizing the integrity of the directory assistance database
The customer Directory Service has a mandate from the Belgian Institute for Postal Services and Telecommunications to provide a service to help people look up the names, phone numbers and addresses of Belgian residents and business entities. 1207/1307 makes this information available through a printed phone book, a call center, and various websites and mobile applications. The multilingual websites allow people to query the information for themselves. This web-based service is free for consumers and offered as a subscription service for businesses.
The data that populates the directory services database comes from all telecom operators in Belgium. Although the data is ultimately available to the public without charge via the printed phone book and the web, there is value in maintaining the integrity of the database because the customer has a commercial service to sell data enrichment to business customers. In other words, the customer loses revenue if bot operators scrape the database.
Data loss devaluing the commercial service that 1207/1307 is chartered to provide
1207/1307 was experiencing frequent and extensive attempts at data theft via automated bots that would repeatedly query the database and scrape the HTML data returned by each query. Tom, IT Project Manager, surmises that data stolen by illicit business entities could be used to provide a competing directory assistance service. “So we think our data might be used in a call center elsewhere,” says Tom. It’s also possible the data would be sold on the black market. Either way, the loss of data could devalue the commercial service that Directory Services is chartered to provide.
The only way the company knew that data theft was being attempted was by observing excessive activity in the database search engine. “We have a person who monitors the data warehouse,” says Tom. “He would notice an elevated amount of query traffic. For example, he would observe that someone was doing an excessive number of queries on, say, bakeries. Whereas on a typical day we might have tens of queries on bakeries, we would see 800 queries in a day. By observing those kinds of strange behaviors, we could detect the scraping behavior. But this was not an efficient way of working.”
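The manual check Tom describes, comparing a day's query count for a term against its usual level, can be sketched in a few lines. This is only an illustration of the idea, not the customer's actual monitoring process: the function name `flag_suspicious_terms`, the `multiplier` and `min_count` thresholds, and the sample data are all assumptions.

```python
from collections import Counter


def flag_suspicious_terms(todays_queries, baseline_daily_counts,
                          multiplier=10, min_count=100):
    """Return {term: count} for terms queried far more often than usual.

    A term is flagged when today's count is at least `min_count` and more
    than `multiplier` times its typical daily count. Both thresholds are
    illustrative assumptions, not values from the case study.
    """
    counts = Counter(term.lower() for term in todays_queries)
    flagged = {}
    for term, count in counts.items():
        # Terms with no history are treated as having a baseline of 1.
        baseline = baseline_daily_counts.get(term, 1)
        if count >= min_count and count > multiplier * baseline:
            flagged[term] = count
    return flagged


# Hypothetical data: "tens" of bakery queries on a normal day, 800 today.
baseline = {"bakeries": 40, "plumbers": 60}
todays = ["bakeries"] * 800 + ["plumbers"] * 55

print(flag_suspicious_terms(todays, baseline))
```

A check like this only fires after the queries have already been served, which matches Tom's point that the approach detected scraping but did not prevent it.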
Manual bot blocking efforts and legal actions were both expensive and ineffective
Tom says they had bots coming in from everywhere, but they had no accurate, real-time detection mechanism. “We didn’t have a clue of the magnitude of the problem at the time,” he says. “When we did detect bots, we made some legal cases out of them but lawsuits are problematic because they take a lot of time and money to research. The lawyers directed us to find a technological way to just stop the bot activity and the thefts.”
Tom says they tried IP blocking, but it proved to be counterproductive. “By blocking web visitors at the IP level, we’d end up trapping large swaths of people working at large organizations,” says Tom. “Because they are behind a NATed firewall, the users all have the same IP address. So, we’d get a lot of traffic from one IP address and we couldn’t tell who was really behind it. We had to abandon the IP check, and we didn’t have anything better to stop our bot problem.”
“We needed a solution that would help us early in the process, something that would stop the bots from getting on the website in the first place,” explains Tom. His team brainstormed on how they could do this on their own, but they couldn’t come up with a universal approach to solve the problem. They decided to look for specialized software to stop bots at the website’s front door.
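The problem with IP-level blocking can be shown with a small simulation. This is a sketch under stated assumptions: the request volumes, the limit of 500 requests, the IP addresses (taken from the reserved documentation ranges), and the function name `blocked_ips` are all hypothetical.

```python
from collections import Counter


def blocked_ips(requests, limit):
    """Return the set of source IPs whose request count exceeds `limit`.

    `requests` is a list of (ip, user) pairs. Only the IP is used for the
    decision, which is exactly the weakness being illustrated.
    """
    per_ip = Counter(ip for ip, _user in requests)
    return {ip for ip, count in per_ip.items() if count > limit}


# Hypothetical office: 200 employees, 3 searches each, one shared NAT address.
office = [("203.0.113.10", f"employee-{i}") for i in range(200) for _ in range(3)]
# Hypothetical scraper: 600 requests from a single address.
bot = [("198.51.100.7", "scraper")] * 600

blocked = blocked_ips(office + bot, limit=500)
locked_out = {user for ip, user in office if ip in blocked}

print(sorted(blocked))
print(len(locked_out), "legitimate users blocked")
```

Seen only by source address, the office and the scraper look identical, so any limit low enough to stop the bot also locks out every employee behind the shared address.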
Imperva’s deep experience in stopping data theft from web scraping bots proved invaluable
They identified potential providers of a suitable solution. The first company they contacted was Imperva. According to Tom, “We got a demo from Imperva and then did a trial of the software, which went very well. We were happy with the results and went into negotiations to bring the Imperva Bot Management (formerly Distil Networks) product in-house. We installed the virtual appliance in our own data center, so now we’re up and running, quite happily.” Tom says the full implementation took about a week.
Tom is most pleased with the hands-off way that the Imperva Bot Management solution works. “We wanted something that would run unattended, to automatically stop the bots,” explains Tom. “We don’t have to monitor what’s happening all the time. We look at the portal to see that it’s working and how much bot traffic it’s stopping. We were astounded by the amount. Imperva Bot Management is very good at its job.”
“We did have one incident where a legitimate user was blocked from doing its searches as it exceeded our device-based rate limits,” Tom says. “This legitimate user does a lot of searches on our website, so we went into the Imperva portal and whitelisted it. That resolved the issue very quickly. I expect we’ll have to do the same for other legitimate users, but that’s an easy fix.”
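The combination Tom mentions, a per-device rate limit plus a whitelist for known heavy users, can be sketched generically as follows. This is not Imperva's implementation or API. The class `DeviceRateLimiter`, its fixed-window counting, and the device identifiers are assumptions made purely for illustration.

```python
class DeviceRateLimiter:
    """Fixed-window request counter keyed by a device identifier.

    Hypothetical sketch: devices on the allowlist bypass the limit,
    everyone else is refused once they exceed `max_requests` per window.
    """

    def __init__(self, max_requests, allowlist=None):
        self.max_requests = max_requests
        self.allowlist = set(allowlist or [])
        self.counts = {}

    def allow(self, device_id):
        """Record one request and return True if it should be served."""
        if device_id in self.allowlist:
            return True
        self.counts[device_id] = self.counts.get(device_id, 0) + 1
        return self.counts[device_id] <= self.max_requests

    def reset_window(self):
        """Start a new counting window (for example, once per minute)."""
        self.counts.clear()


limiter = DeviceRateLimiter(max_requests=3)

# A legitimate heavy user trips the limit on the fourth request.
before = [limiter.allow("heavy-user") for _ in range(5)]

# Adding the device to the allowlist resolves it immediately.
limiter.allowlist.add("heavy-user")
after = limiter.allow("heavy-user")

print(before, after)
```

Keying the limit on a device identifier rather than an IP address avoids the shared-address problem, and the allowlist gives operators a quick override for legitimate high-volume users.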
Imperva reports reveal some interesting facts behind the bot attacks
The customer operates two websites – one in French and one in Dutch – but the data behind them is the same. An Imperva report revealed that the French-language website is attacked much more often. “We do get more scraping on the French side, but that’s to be expected since the French-speaking world is bigger than the Dutch-speaking world,” says Tom. It stands to reason that the bots would be aimed at the website of the more prominent language.
Imperva reports also revealed some unexpected information. “What was curious to us is that we have mobile versions of our sites, which are a lot simpler. The structure would make the mobile versions easier to scrape, but we learned the bots were focused on our desktop sites. There’s a lot less bad bot activity on our mobile websites. We didn’t know that before,” says Tom.
From an IT perspective, the Imperva solution has completely resolved the customer’s data theft risk. The bad bots are stopped automatically before they can enter the websites. No one needs to be hands-on with the product for it to do its job. As Tom says with delight, “It just works.”
One-third of the web activity – all of the bad traffic – is now gone. This provides a performance boost to the infrastructure behind the directory assistance websites. Now the infrastructure can easily support the projected 20% annual growth of legitimate web traffic.
Paolo, the Channel Manager for Electronic Media at customer Directory Service, looks at the value of the Imperva software from a business perspective. “If someone comes in and steals our data through scraping, we don’t know exactly what they do with the data. They would probably resell it to their own customers, so they would make a profit on our data, and they don’t pay a cent to us to do that. Of course, that’s totally illegal. The big value of the Imperva software is that it allows us to block these people, so we don’t lose our data or the revenue from selling it.”
In addition, the customer no longer devotes time and resources to lengthy investigations and lawsuits to go after data thieves because the thefts have simply stopped. This, alone, saves the company upwards of €100,000 per year (not including legal costs).
Recently, Tom made an internal presentation to the members of the customer IT staff. The team was very impressed with the fact that Imperva provides full visibility into false positives. For example, over the 30-day period shown below, Imperva Bot Management served over 8 million CAPTCHAs, but only 96 were solved. That’s a false positive rate of roughly 0.0012% (about 0.00001173 when expressed as a fraction).
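The arithmetic behind the CAPTCHA figures can be checked with a short calculation. It treats solved CAPTCHAs as a proxy for challenged humans and assumes exactly 8,000,000 CAPTCHAs served. Since the text says "over 8 million", the true rate is slightly lower than the result here. The function name `false_positive_rate` is illustrative.

```python
def false_positive_rate(solved, served):
    """Return solved CAPTCHAs as a fraction of all CAPTCHAs served."""
    if served <= 0:
        raise ValueError("served must be positive")
    return solved / served


captchas_served = 8_000_000  # assumption: lower bound for "over 8 million"
captchas_solved = 96         # figure quoted in the text

fraction = false_positive_rate(captchas_solved, captchas_served)
percent = fraction * 100

print(f"fraction: {fraction:.8f}")
print(f"percent:  {percent:.4f}%")
```

Under that assumption the rate is a fraction of about 0.000012, or roughly 0.0012%, which is about one solved CAPTCHA per 83,000 served.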
Bot-driven data theft using web scraping is a significant problem across the Internet. Any organization that has valuable content presented via a public website is vulnerable to attacks resulting in loss of data and, often, loss of revenue. The best way to solve the problem is to prevent the bots from ever reaching the target website in the first place. For the customer, Imperva Bot Management filters out the malicious bot traffic before it can reach the Directory Service websites, preserving the integrity of the data for legitimate users and customers.