Recently I had a chance to observe a 1st generation bot mitigator in a content scraping technology bakeoff, and I thought it worthwhile to share the outcome of this particularly nasty attack campaign.
The Problems 1st Generation Bot Mitigators Face
When a scraping attack targets an API endpoint, 1st generation tools struggle because there is no web page or mobile SDK on which to install the necessary agent. Let’s start with why that is a problem:
- Scraping can happen through APIs – You can’t get the device telemetry the old bot mitigators require from an API consumer, because an API offers nowhere to install the agent. This leads to some devastating outcomes.
The Situation Observed
We were working with a large eCommerce site that kept finding its original content on a competitor’s site and wanted a quick solution. The site deployed both a 1st generation bot mitigation tool and the Cequence Application Security Platform to see how well each solution could prevent content scraping attacks.
To mitigate problems 1 and 2 above, the 1st generation bot mitigation tool was heavily customized to work as follows:
- Every HTTP GET request to any URL was intercepted by an inline device and inspected to determine if a cookie from the Bot Mitigation tool was present.
- If a cookie was present and still valid, the request was forwarded upstream to the origin server.
- The 2nd HTTP GET request is intercepted by the 1st gen Bot Mitigation solution and the end-user device telemetry is analyzed in real-time to determine if the requester is a bot or not.
- If the request is deemed to be coming from a bot, then a response is sent based on custom-configured policies.
- If the request is deemed to be coming from a legitimate user, the response contains a newly-generated HTTP Cookie for subsequent requests. This cookie is valid for 24 hours.
- Upon receiving the HTTP Cookie as a response to the 2nd HTTP GET request, the system generates a 3rd HTTP request, which contains the HTTP cookie that allows the 1st gen Bot Mitigation solution to validate that the request is coming from a legitimate user.
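The cookie gate described in the steps above can be sketched roughly as follows. This is a minimal illustration, not the vendor's implementation: the HMAC-signed cookie format, `SECRET_KEY`, the function names, and the action labels are all hypothetical, and the "collect telemetry" branch for a request without a cookie is an assumption about the step the list does not spell out. Only the 24-hour validity comes from the test itself.

```python
import hashlib
import hmac
from typing import Optional, Tuple

SECRET_KEY = b"demo-secret"        # hypothetical signing key
COOKIE_TTL_SECONDS = 24 * 60 * 60  # the 24-hour validity described above


def _sign(issued: str) -> str:
    return hmac.new(SECRET_KEY, issued.encode(), hashlib.sha256).hexdigest()


def issue_cookie(now: float) -> str:
    """Mint a signed cookie that carries its own issue time."""
    issued = str(int(now))
    return f"{issued}.{_sign(issued)}"


def cookie_is_valid(cookie: Optional[str], now: float) -> bool:
    """A cookie is valid if its signature checks out and it is under 24h old."""
    if not cookie or "." not in cookie:
        return False
    issued, sig = cookie.split(".", 1)
    if not hmac.compare_digest(sig.encode(), _sign(issued).encode()):
        return False
    return issued.isdigit() and now - int(issued) < COOKIE_TTL_SECONDS


def handle_get(
    cookie: Optional[str], looks_human: Optional[bool], now: float
) -> Tuple[str, Optional[str]]:
    """Return (action, cookie_to_set) for one intercepted GET request.

    looks_human is None when no device telemetry has been received yet.
    """
    if cookie_is_valid(cookie, now):
        return ("forward_to_origin", None)
    if looks_human is None:
        # Assumption: a request without a cookie first triggers telemetry collection.
        return ("collect_telemetry", None)
    if not looks_human:
        return ("apply_bot_policy", None)
    # Legitimate user: hand out a cookie, which the client presents on its next request.
    return ("set_cookie_and_retry", issue_cookie(now))
```

Note that in this sketch the validity check depends only on the cookie's signature and age, not on which client presents it, and that a first-time visitor needs extra round trips before reaching the origin.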
The Results from the 1st Generation Bot Mitigation Test
The workflow put into place by the 1st gen bot mitigation solution is complex enough to need a diagram, but its first result was simple and undesired: a significant impact on customer experience. The page load time for a first-time visitor to the website increased more than 7X to 1.6 seconds on average. For retailers, this is an unacceptable user delay as it may result in lost shoppers. In addition, it is a significant financial penalty considering the page optimization investments in CDNs and other tools to improve page load time by mere milliseconds. The bottom line: in an attempt to prevent bot attacks, all page load time improvements were lost.
Despite the significant page load impact, there was still a hope that the 1st gen bot mitigation solution could successfully stop scraping. There was an immediate and clear drop in scraping attacks at the outset of the solution deployment, but it was very short-lived. The attackers quickly figured out how they were being blocked and they retooled their methods.
Could 1st Gen Tool Stop the Retooled Attack?
Within a week, the attackers came back with a vengeance. It seemed as if there were multiple actors involved in the follow-on attacks, and they utilized different techniques at different times.
- The first technique – leverage cookies: To avoid slowing page loads for every page and every user, the 1st gen bot mitigator issued cookies that remained valid for 24 hours, and the attackers took advantage of that. They used real browsers to generate a batch of cookies and then embedded those cookies in their automated scraping scripts (aka bad bots). These cookie-bearing bots ran very aggressively without being blocked for 24-hour periods, and the attackers repeated the process at will.
- The second technique – use the mobile endpoints: The second set of bad actors took advantage of the fact that there was no protection on the mobile endpoints. Even after putting in some basic rate-limits and geo-fencing, the attackers used residential proxy networks and spread their attacks over millions of IPs to circumvent those controls.
- The third technique – go direct to the APIs: The last observed technique attacked the APIs, the most vulnerable threat vector because it had almost no protection. Very soon, 57% of the site’s API traffic came from scrapers. It was easy for the bots to enumerate the item IDs and scrape all the relevant content associated with those items.
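To see why the basic rate limits mentioned in the second technique did not help, consider a simplified per-IP limit. The sketch below is illustrative only: the limit of 100 requests per window, the request counts, and the placeholder addresses are made-up values, not figures from the test.

```python
from collections import Counter
from typing import Iterable


def blocked_requests(request_ips: Iterable[str], per_ip_limit: int) -> int:
    """Count requests rejected by a per-IP limit within a single time window."""
    seen: Counter = Counter()
    blocked = 0
    for ip in request_ips:
        seen[ip] += 1
        if seen[ip] > per_ip_limit:
            blocked += 1
    return blocked


TOTAL_REQUESTS = 10_000  # hypothetical scraping volume in one window
PER_IP_LIMIT = 100       # hypothetical rate limit

# The same volume sent from one address versus spread across 1,000 addresses.
single_source = ["203.0.113.7"] * TOTAL_REQUESTS
spread_out = [f"proxy-{i % 1000}" for i in range(TOTAL_REQUESTS)]

print(blocked_requests(single_source, PER_IP_LIMIT))  # 9900
print(blocked_requests(spread_out, PER_IP_LIMIT))     # 0
```

Each address in the spread-out case sends only 10 requests, so none crosses the limit even though the total volume is identical. With millions of residential proxy IPs available, per-IP counters see almost nothing unusual.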
The most devastating result from the test with the 1st generation bot mitigation tool was that it blocked 43% of requests from search engine bots, which hurt SEO and led to a decline in user activity and overall demand.
The Wrap-up and the Alternative
In summary, 1st generation bot mitigation tools that rely heavily on signals from end-user devices can’t defend against web content scraping. It’s like fitting a square peg in a round hole. You run the risk of compromising user experience, potentially breaking your SEO, and still not solving the scraping problem – all while burning precious cycles deploying a heavy-handed solution.
You can read more about how Bot Defense stops content scraping here.