What is Web Scraping & How to Prevent It

Many of today’s hyper-connected organizations face the challenge of detecting and preventing web scraping attacks efficiently and at scale. In this blog, we’ll share how a comprehensive approach combining API security and bot management can help mitigate this problem. The approach leverages behavioral fingerprinting to continuously track sophisticated attacks, supported by an API threat intelligence database made up of over 100 million records.

What is Web Scraping?

Web scraping is a method of gathering content from a website, typically using automated bots. Scraping bots are everywhere these days, from search engine bots, to price comparison bots, to news aggregator bots. There are also malicious uses of web scraping, such as undercutting a competitor’s prices, theft of intellectual property, and, more recently, scraping by AI bots that ingest content to train their models. Malicious web scraping bots can have impacts beyond the obvious, such as increased infrastructure costs and skewed marketing and sales metrics.

The Impacts of Web Scraping Attacks

The impact of web scraping attacks can be wide-ranging, from overspending on infrastructure to devastating data extraction and loss of intellectual property. Of all the automated business logic abuse attacks, content scraping is the most difficult to prevent. Here are three reasons why.

They Can Happen Anywhere Within the Domain

Whereas other automated forms of business logic abuse target specific applications and their endpoints, scraping can be directed at any application or endpoint within the domain. For example, credential stuffing and other account takeover attacks target credential-based applications, and denial of inventory attacks focus on checkout applications and their API requests. Scraping is more wide-reaching in its end goal. The challenge in preventing a web scraper attack becomes one of breadth: can your detection and mitigation approach encompass all your public-facing applications, even the application endpoints that have dynamically generated URIs? If you are trying to prevent scraping with a bot management tool that requires application instrumentation, you are forced to inject an agent into every web application and endpoint within your domain. The impacts of this approach are manifold:

If the URI is dynamically generated, page load times may limit the ability to add an agent and the associated processing burden.
The injection of an agent to the page adds delay and complexities to the application development and deployment workflow.
Applications and APIs that can’t be modified with JavaScript or SDKs remain unprotected.

They’re Primarily HTTP GET-Based

Automated web scraper attacks work by sending simple HTTP GET requests to the targeted URIs. On a typical domain, HTTP GET requests represent 99% of all transactions, which means that your bot mitigation approach must have the capacity to process all HTTP GET transactions. This requirement introduces both scalability and efficacy challenges.

Scale: Most bot mitigation approaches either cannot scale or require significant oversizing to handle all site/domain traffic, especially for medium to large sites.
Efficacy: Many bot mitigation tools rely on HTTP POST to send device fingerprinting logic, which means they will miss most of the attack signals emanating from HTTP GET requests.
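
To make the breadth and GET-heavy nature of scraping concrete, here is a minimal sketch of one possible heuristic: flag clients whose traffic is almost entirely GET and spans many distinct paths. It is an illustration only, not a description of any product's detection logic. The function name, the tuple layout, and both thresholds are assumptions chosen for the example.

```python
from collections import defaultdict


def flag_broad_get_clients(requests, min_distinct_paths=50, min_get_share=0.95):
    """Return client keys whose traffic is nearly all GET and covers many paths.

    `requests` is an iterable of (client_key, method, path) tuples.
    Both thresholds are illustrative assumptions, not recommended values.
    """
    get_paths = defaultdict(set)  # client -> distinct paths fetched with GET
    totals = defaultdict(int)     # client -> total request count
    gets = defaultdict(int)       # client -> GET request count

    for client, method, path in requests:
        totals[client] += 1
        if method.upper() == "GET":
            gets[client] += 1
            get_paths[client].add(path)

    flagged = []
    for client, total in totals.items():
        get_share = gets[client] / total
        if len(get_paths[client]) >= min_distinct_paths and get_share >= min_get_share:
            flagged.append(client)
    return sorted(flagged)


if __name__ == "__main__":
    crawler = [("client-a", "GET", f"/item/{i}") for i in range(60)]
    shopper = [
        ("client-b", "GET", "/item/1"),
        ("client-b", "POST", "/cart"),
        ("client-b", "GET", "/checkout"),
    ]
    print(flag_broad_get_clients(crawler + shopper))
```

A heuristic like this only works if every GET request is observed, which is why the capacity to process all GET traffic matters. In practice the client key would need to be something more durable than an IP address, since scrapers rotate addresses freely.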

Bad Bots Leverage Application APIs & Endpoints

The use of API endpoints has become a critical element in the move towards a more rapid, iterative application development workflow. The same information that mobile customers, partners, and aggregators may consume from a rich web-based interface is also available via API endpoints. When a web- or data-scraping attack faces resistance from web applications, the attacker simply shifts to using API endpoints to achieve their goal. The challenge facing traditional bot mitigation tools in preventing web scraper attacks that target API endpoints is that there is no page or SDK to install an agent on. The API consumers are themselves bots, so it’s almost impossible to integrate JavaScript or a mobile SDK.

How Cequence Security Prevents Web Scraping with Accurate Bot Detection

Cequence Unified API Protection (UAP) keeps business logic abuse from striking at your web apps, mobile apps, and their underlying API infrastructure.

Behavioral Fingerprinting: A Bot Detection Tool

Cequence leverages behavioral fingerprinting to continuously track sophisticated attacks, even as adversaries retool to avoid detection. Cequence’s behavioral fingerprinting distinguishes human from synthetic traffic, good bots from bad bots, and anomalous from malicious sessions. Traffic with similar behavioral characteristics is grouped together, enabling accurate detection of malicious activity. This approach is far more accurate than relying on IP addresses or other easily evaded signals.
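
As a rough illustration of the general idea of grouping traffic by behavior rather than by IP address, the sketch below hashes a few behavior-only traits of a session and groups sessions that share the result. This is not Cequence's implementation. The session fields (`header_order`, `timestamps`, `fetches_assets`, `ip`) and the choice of traits are hypothetical, picked only to keep the example small.

```python
import hashlib
from collections import defaultdict
from statistics import median


def behavioral_fingerprint(session):
    """Hash behavior-only traits; IP address and User-Agent are deliberately excluded."""
    stamps = session["timestamps"]
    gaps = [later - earlier for earlier, later in zip(stamps, stamps[1:])]
    pace = round(median(gaps), 1) if gaps else 0.0  # typical seconds between requests
    traits = (
        tuple(name.lower() for name in session["header_order"]),  # order headers were sent in
        pace,
        session["fetches_assets"],  # whether CSS, JS, or images were requested
    )
    return hashlib.sha256(repr(traits).encode()).hexdigest()[:12]


def group_by_behavior(sessions):
    """Map each fingerprint to the list of source IPs that exhibited it."""
    groups = defaultdict(list)
    for session in sessions:
        groups[behavioral_fingerprint(session)].append(session["ip"])
    return dict(groups)


if __name__ == "__main__":
    sessions = [
        {"ip": "198.51.100.7", "header_order": ["Host", "Accept", "User-Agent"],
         "timestamps": [0.0, 0.5, 1.0], "fetches_assets": False},
        {"ip": "203.0.113.42", "header_order": ["Host", "Accept", "User-Agent"],
         "timestamps": [10.0, 10.5, 11.0], "fetches_assets": False},
        {"ip": "192.0.2.15", "header_order": ["Host", "User-Agent", "Accept", "Cookie"],
         "timestamps": [0.0, 4.2, 19.0], "fetches_assets": True},
    ]
    for fingerprint, ips in group_by_behavior(sessions).items():
        print(fingerprint, ips)
```

The first two sessions come from different addresses but land in the same group because their request pacing, header order, and asset behavior match. Changing the IP address alone does not change the fingerprint, which is the property that makes this kind of signal harder to evade.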

Employing analysis powered by artificial intelligence and machine learning, Cequence analyzes incoming traffic to detect even hard-to-spot business logic abuse targeting your web, mobile and API-based applications. This analysis is translated into out-of-the-box policies that provide highly effective protection on day one. Cequence’s network-based approach eliminates the need for application instrumentation and provides you with the insight and intelligence to detect and prevent automated bot attacks and application vulnerability exploits targeting your public facing applications.

Cequence Web Scraping Prevention Deployment

Cequence can protect your APIs and web applications in as little as 15 minutes and can immediately begin reducing the operational burden associated with preventing attacks that can result in fraud, data loss, and business disruption.

Deployed in front of your public-facing applications, typically in the DMZ, Cequence analyzes ALL transactions for ALL application paths being used by clients. This allows us to correlate information across the entire application tier, tracking user and device access and behavior across the entire site. This means we have complete visibility into all the potential scraping targets within your network, whether web-, mobile-, or API-based.

The Cequence UAP network-based approach enables deployments to be sized for your environment and to scale easily to absorb spikes in transaction volume. This means we can analyze both the HTTP GET and HTTP POST methods across the entire application tier, detecting clusters of common or repeated behavior indicative of scraping, independent of the geo-location, IP, and device information presented by the clients. Since our approach does not require application instrumentation, scraping detection has zero performance impact on the application itself.

In some cases, scraping is both allowed and encouraged, while in others, it is viewed as malicious. Financial institutions must allow users to access and aggregate their information, and in some countries this is required by law. Travel and hospitality sites allow aggregate shopping sites to scrape information to promote sales. In contrast, competitive scraping, where the entire site is copied in order to compete more aggressively, must be stopped. Our mitigation mechanisms allow you to select the response that makes the most sense: slowing down aggregators using rate limits, sending fake information to competitors, or blocking outright if the scraping volume is too high.
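
The three responses named above can be thought of as a small decision policy. The sketch below is one hypothetical way to express it, not a product configuration format. The category labels, the `block_threshold` value, and the decision to check volume first are all assumptions made for illustration.

```python
from enum import Enum


class Action(Enum):
    ALLOW = "allow"
    RATE_LIMIT = "rate_limit"  # slow the client down
    DECEIVE = "deceive"        # serve fake information
    BLOCK = "block"


def choose_action(category, requests_per_minute, block_threshold=1000):
    """Pick a mitigation for a client already classified as a scraper.

    `category` is a hypothetical label such as "aggregator" or "competitor";
    `block_threshold` is an illustrative requests-per-minute cutoff.
    """
    if requests_per_minute >= block_threshold:
        return Action.BLOCK        # too volumetric, regardless of who is asking
    if category == "aggregator":
        return Action.RATE_LIMIT   # sanctioned, but kept at a sustainable pace
    if category == "competitor":
        return Action.DECEIVE      # unsanctioned; the copied data loses its value
    return Action.ALLOW


if __name__ == "__main__":
    for category, rpm in [("aggregator", 120), ("competitor", 120), ("competitor", 5000)]:
        print(category, rpm, choose_action(category, rpm).value)
```

Checking volume before category means that even a sanctioned aggregator is blocked once it exceeds the cutoff. Whether that ordering is right depends on the business relationship, so it should be treated as a tunable choice rather than a rule.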

Web Scraping Prevention in the Social Media Industry

One of our social media customers processes over one billion transactions per day to prevent account takeovers (ATO), fake account creation, and reputation bombing involving fake likes and fake comments. This customer always suspected that they had a scraping problem but had no visibility into it. Like most websites and social media sites, they allowed search engine and social network crawlers to crawl their site. Cequence enabled the customer to uncover competitive scraping bots from China hiding among legitimate crawlers. These bots were using the same toolkits and common libraries as the legitimate crawlers, down to the same user agent strings in their requests, so they appeared to be legitimate crawlers. With Cequence, the customer was able to distinguish the legitimate bots from the malicious scrapers and generate a unique bot behavioral fingerprint that allowed the customer to block the attack, even as the bad actors changed IP addresses and other request parameters.

Contact us for a personalized demo – let us show you how the Cequence approach is the most effective for API security and bot management – including web scraping.
