The Danger of Web Scraping – And How to Prevent It

June 13, 2019 | by Ameya Talwalkar

Many of today’s hyper-connected organizations face the challenge of addressing web scraping attacks in an efficient and scalable manner. The impact of these attacks can be wide-ranging, from overspending on infrastructure to a devastating loss of intellectual property. Of all the automated business logic abuse attacks, content scraping is the most difficult to prevent. Here are three reasons why:

  1. Scraping attacks can happen anywhere within the domain. Whereas other automated forms of business logic abuse are targeted at certain applications and related endpoints, scraping can be directed at any application or endpoint within the domain. For example, account takeover/credential stuffing attacks target applications that are user credential-based, and denial of inventory attacks focus on checkout applications, whereas scraping is more wide-reaching in its end goal. The challenge with preventing a scraping attack becomes one of breadth: can your detection and mitigation approach encompass all of your public-facing applications, even application endpoints that have dynamically generated URIs?
     If you are trying to prevent scraping with a bot mitigation tool that requires application instrumentation, you are forced to inject an agent into every web application and endpoint within your domain. The impacts of this approach are twofold:
    • If the URI is dynamically generated, page load time constraints may limit the ability to add an agent and absorb its processing overhead.
    • Injecting an agent into the page adds delay and complexity to the application development and deployment workflow.
  2. Scraping attacks are (primarily) HTTP GET-based. Automated web scraping attacks execute by sending simple HTTP GET requests to the targeted URIs. On a typical domain, HTTP GET requests represent 99% of all transactions, which means that your bot mitigation approach must have the capacity to process all HTTP GET transactions. This requirement has both scalability and efficacy impacts.
    • Scale: Most bot mitigation approaches include an appliance component that is designed with HTTP POST transaction capacities in mind. As such, it either cannot scale or requires significant oversizing to handle all site/domain traffic, especially for medium to large sites.
    • Efficacy: Because these approaches rely on HTTP POST to carry device fingerprinting data, they will miss the majority of the attack signals emanating from HTTP GET requests.
  3. Scraping attacks leverage application APIs and endpoints. The use of API endpoints is rapidly becoming a critical element in the move towards a more rapid, iterative application development workflow. The same information that mobile customers, partners and aggregators may consume from a rich web-based interface is also available via the API endpoints. When attackers face resistance from web applications, they simply switch to the API endpoints to achieve their goal.
     The challenge facing first-generation bot mitigation tools in preventing scraping attacks that target API endpoints is that there is no page or app on which to install an agent or SDK. The API consumers are themselves bots, so it’s almost impossible to integrate JavaScript or a mobile SDK.

How Cequence Security Prevents Scraping

Our award-winning Application Security Platform (ASP) uses CQAI, a patented machine learning analytics engine, to automatically analyze and profile all the transactions hitting your web, mobile and API-based applications. This architectural approach eliminates the need for application instrumentation and provides you with the insight and intelligence to detect and prevent automated bot attacks and application vulnerability exploits targeting your public-facing applications. Deployed as a Docker-based, software-only application, Cequence ASP scales easily to address capacity demands and can be deployed in the data center, the cloud or a hybrid environment.

  • Discover: Deployed in front of your public-facing applications, typically in the DMZ, CQAI analyzes ALL transactions for ALL application paths being used by clients. This allows us to correlate information across the entire application tier, tracking user and device behavior across the entire site. This means we have complete visibility into all of the potential scraping targets – web, mobile and API-based – within your domain.
  • Detect: Our software-based approach allows the deployment to be sized for your environment and to scale easily to address any spikes in transaction volume. This means we are able to analyze both HTTP GET and HTTP POST requests across the entire application tier, detecting clusters of common/repeated behavior indicative of scraping, independent of the geo-location, IP and device information presented by the clients. Since our approach does not require application instrumentation, scraping detection has zero performance impact on the actual application.
  • Defend: In some cases, scraping is both allowed and encouraged, while in others, it is viewed as malicious. Financial institutions must allow users to aggregate their information; in some countries, this is required by law. Travel and hospitality sites allow aggregate shopping sites to scrape information to promote sales. In contrast, competitive scraping, where the entire site is copied in order to compete more aggressively, must be stopped. Our mitigation mechanisms allow you to select the response that makes the most sense: slowing down aggregators using rate limits, sending fake information to competitors using our HoneyTrap™ technology, or outright blocking if the scraping is too volumetric.
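
To make the response options in the Defend step concrete, here is a minimal Python sketch. It is an illustration only, not Cequence's implementation: it assumes traffic has already been classified upstream, and the class labels, the `TokenBucket` class and the `choose_response` function are hypothetical names introduced for this example. It shows a token-bucket rate limit slowing down an allowed aggregator, while competitive and volumetric scrapers receive a decoy or a block.

```python
class TokenBucket:
    """Allows bursts of up to `capacity` requests, then `rate` requests per second."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity    # start with a full bucket
        self.updated = now        # timestamp of the last refill

    def allow(self, now: float) -> bool:
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def choose_response(client_class: str, bucket: TokenBucket, now: float) -> str:
    """Map a (hypothetical) upstream classification to a mitigation action."""
    if client_class == "competitive_scraper":
        return "decoy"     # serve fake information
    if client_class == "volumetric_scraper":
        return "block"     # reject outright
    if client_class == "aggregator":
        # Allowed, but slowed down once the bucket is empty.
        return "serve" if bucket.allow(now) else "throttle"
    return "serve"         # everything else passes through
```

With `TokenBucket(rate=1.0, capacity=2.0)`, three back-to-back aggregator requests at the same instant yield `serve`, `serve`, `throttle`; one second later the bucket has refilled enough to serve one more request. In practice there would be one bucket per aggregator, and how clients get classified is the hard part that this sketch deliberately leaves out.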

Cequence Scraping Prevention in Action

One of our social media customers processes over 1 billion transactions per day through our platform to prevent Account Takeovers (ATO), Fake Account Creation and Reputation Bombing (fake likes, fake comments, etc.). This customer had always suspected that they had a scraping problem, but had no visibility into it. Like most social media sites, they allowed search engine and social network crawlers to crawl their site. On our platform, CQAI was able to uncover competitive scraping from China hiding among legitimate crawlers. These competitive scraping bots were using the same toolkits and common libraries as the legitimate crawlers, down to the same User-Agent strings in their requests, so they appeared to be legitimate crawlers. CQAI was able to separate the (legitimate) crawlers from the (competitive) scrapers and generate a unique bot fingerprint that allowed this customer to block the attack, even as the bad actors changed IP addresses and other request parameters.
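
The idea of a fingerprint that survives IP changes can be illustrated with a short Python sketch. This is not how CQAI builds its fingerprints, which the case above does not describe. It assumes, purely for illustration, that the order of HTTP header names and the Accept-Encoding value stay stable for a given client toolkit; the `fingerprint` and `group_by_fingerprint` functions and the request dictionary layout are hypothetical.

```python
import hashlib
from collections import defaultdict


def fingerprint(header_names, accept_encoding):
    """Hash traits assumed to be stable for one toolkit.

    The IP address and User-Agent are deliberately left out: the IP rotates,
    and the User-Agent is copied from legitimate crawlers.
    """
    material = "|".join(name.lower() for name in header_names)
    material += "#" + accept_encoding.lower()
    return hashlib.sha256(material.encode("utf-8")).hexdigest()[:16]


def group_by_fingerprint(requests):
    """Return {fingerprint: set of source IPs seen with that fingerprint}."""
    groups = defaultdict(set)
    for req in requests:
        fp = fingerprint(req["header_names"], req["accept_encoding"])
        groups[fp].add(req["ip"])
    return dict(groups)
```

A fingerprint that appears from many unrelated IP addresses is a candidate for closer inspection, and a block keyed on the fingerprint keeps working when the client switches IP address. A real system would combine far more signals than the two used here.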

Ameya Talwalkar

President, Chief Executive Officer & Founder
