What is scraping?
Web scraping, also known as web harvesting or web data extraction, is the use of automation to collect data or processed output from an application for use elsewhere.
How does scraping work?
Scraping uses automation, typically bots, targeted at specific URLs or APIs. The scraping itself is either static or dynamic, and the intended outcome may be malicious or legitimate.
- Static scraping: Bots will traverse all the paths from a certain base or root URL, scraping all of the content from each path to create a local replica of the site. The goal is to recreate the site or the page for the bad actor to then use.
- Dynamic scraping: A more advanced form of scraping that requires interaction with the web application through its backend APIs (e.g., supplying a location or enumerating IDs) to generate the targeted content. Dynamic scraping might be used to obtain the price of an airline ticket or a shared ride, where the origin and destination must be supplied to produce the scraping target (the price in this case).
In cases where the content targeted for scraping (dynamic or static) requires authorized access, the bad actor will first establish a fake account, either by using bots to fill in the account signup form or, more commonly, by using automation directly against the related APIs. This two-pronged approach to scraping may have been how bad actors collected 60 million LinkedIn profiles, some of which included emails.
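The static traversal described above can be sketched as a breadth-first walk over same-site links. This is only an illustration: the `FAKE_SITE` dictionary, the `shop.example` URLs, and the `crawl` and `LinkParser` names are all assumptions made for the example, and the "site" lives in memory so nothing is fetched over the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical in-memory "site" so the sketch runs without network access.
FAKE_SITE = {
    "https://shop.example/": '<a href="/a">A</a> <a href="/b">B</a>',
    "https://shop.example/a": '<a href="/b">B</a> <a href="https://other.example/">ext</a>',
    "https://shop.example/b": '<a href="/">home</a>',
}


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(root, fetch):
    """Breadth-first traversal of every same-host page reachable from root.

    `fetch` is any callable that returns the HTML for a URL, or None.
    """
    host = urlparse(root).netloc
    replica = {}            # URL -> HTML: the "local replica" of the site
    queue = deque([root])
    while queue:
        url = queue.popleft()
        if url in replica:
            continue        # already copied this path
        html = fetch(url)
        if html is None:
            continue        # unknown page, skip it
        replica[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href)   # resolve relative links
            if urlparse(target).netloc == host and target not in replica:
                queue.append(target)      # stay on the same host
    return replica


replica = crawl("https://shop.example/", FAKE_SITE.get)
print(sorted(replica))
```

The same request pattern (one client walking every path under a root URL) is what a defender would look for when trying to recognize static scraping in traffic logs.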
What are scraping goals and commonly targeted industries?
One goal is simply to copy content from websites in order to create fake, duplicate websites and content elsewhere. Commonly targeted industries include retail and social media.
- Retail: Bad actors will use static scraping against an entire retail site to create a replica in some other part of the world. Their motivation is to sell identical-looking but fake goods at the same or possibly a lower price.
- Social Media: Combining fake account creation with both static and dynamic scraping, bad actors will target established social media organizations of all types (e.g., professional networks, dating networks, job/contract work sites, friend networks) to steal profiles and artificially expand their own user base.
Another goal is to collect sensitive information such as pricing to gain a competitive advantage, primarily in the retail industry. Using dynamic scraping, bad actors will routinely collect pricing and sale information for high-demand, low-inventory items from competing retail sites. The information is then used to undercut the competitor with a deeper discount.
Intellectual Property Theft
Using a combination of static scraping and fake account creation, the bad actor will steal intellectual property and proprietary or trademarked goods and services, including media files (e.g., movies, songs, TV shows, and other forms of media). Online media companies are commonly targeted, with a bad actor creating a fake account and then downloading the files for their own profitable use elsewhere.
Dynamic scraping is also used for aggregation: collecting pricing information from a wide range of sites and indexing merchandise and prices from various sources for comparison by end users. In some industries aggregation is encouraged, while in others it is discouraged.
- Travel and hospitality: Scraping by aggregators is encouraged because it facilitates sales, helping to maximize inventory depletion.
- Retail: Dynamic scraping is used to aggregate data from multiple online retailers and suggest the best price to consumers, in some cases based on the aggregator user's own query.
Static scraping and legitimate accounts are employed to aggregate an individual's financial information into a single place. As with travel and hospitality, aggregation is encouraged by many banks, credit card companies, and retirement management organizations. Financial aggregators are encouraged to use dedicated APIs like OFX but will often bypass them, scraping content from the main account pages in order to get more contextual information about users, which they sell to advertisers.
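The retail aggregation case above reduces to indexing collected offers by item and keeping the lowest price. A minimal sketch follows, assuming the offers have already been collected; the `OFFERS` data, the retailer names, and the `best_offers` function are hypothetical and exist only for illustration.

```python
# Hypothetical offers, as an aggregator might hold them after collection.
OFFERS = [
    {"retailer": "store-a", "item": "headphones", "price": 59.99},
    {"retailer": "store-b", "item": "headphones", "price": 54.50},
    {"retailer": "store-a", "item": "keyboard", "price": 80.00},
    {"retailer": "store-c", "item": "keyboard", "price": 85.25},
]


def best_offers(offers):
    """Return a dict mapping each item to its lowest-priced offer."""
    best = {}
    for offer in offers:
        current = best.get(offer["item"])
        # Keep the first offer seen, or replace it when a cheaper one appears.
        if current is None or offer["price"] < current["price"]:
            best[offer["item"]] = offer
    return best


cheapest = best_offers(OFFERS)
for item, offer in sorted(cheapest.items()):
    print(item, offer["retailer"], offer["price"])
```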
How does scraping impact an organization?
Scraping negatively impacts organizations in several ways.
- Lost sales and a weakened competitive advantage when competitors use stolen content, including high-resolution images, to market and sell the same or counterfeit products at lower prices.
- Weakened brand awareness and loss of customer confidence when popular products and services are more readily available at a lower price (or for free) elsewhere.
In cases where scraping is supported and encouraged, the organization must strike a balance between allowing aggregation and ensuring that the infrastructure, and ultimately the customer experience, is not negatively affected. In some cases, scraping is only allowed in off-peak hours.
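An off-peak policy of this kind can be expressed as a simple time-window check. The sketch below is an assumption-laden illustration: the 01:00 to 05:00 window, the constant names, and the `scraping_allowed` function are invented for the example and do not come from any particular product.

```python
from datetime import time

# Hypothetical off-peak window; real values depend on the site's traffic.
OFF_PEAK_START = time(1, 0)
OFF_PEAK_END = time(5, 0)


def scraping_allowed(now, start=OFF_PEAK_START, end=OFF_PEAK_END):
    """Return True if `now` (a datetime.time) falls in the off-peak window.

    The window is half-open: start is included, end is excluded.
    """
    if start <= end:
        return start <= now < end
    # The window wraps past midnight, e.g. 22:00 to 04:00.
    return now >= start or now < end


print(scraping_allowed(time(2, 30)))   # inside the default window
print(scraping_allowed(time(12, 0)))   # outside the default window
```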
Preventing Web Scraping with Cequence Security
Cequence Application Security Platform prevents scraping using a patented machine-learning analytics engine that characterizes your entire web application, including the associated APIs, allowing you to detect malicious scraping and mitigate it by blocking it. Scraping activity that is encouraged can be rate limited, thereby minimizing the impact on the existing infrastructure.
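Rate limiting in general is often implemented with a token bucket. The sketch below shows that generic technique only; it is not a description of how the Cequence platform works, and the `TokenBucket` class, its parameters, and the fake clock used in the demonstration are assumptions made for illustration.

```python
import time


class TokenBucket:
    """Generic token-bucket limiter: `rate` tokens/second, up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock              # injectable so the demo is deterministic
        self.tokens = float(capacity)   # start with a full bucket
        self.updated = clock()

    def allow(self):
        """Return True and spend one token if the request may proceed."""
        now = self.clock()
        elapsed = now - self.updated
        # Refill for the time elapsed, never exceeding the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Demonstration with a fake clock: 1 request/second, bursts of up to 2.
fake_now = [0.0]
bucket = TokenBucket(rate=1, capacity=2, clock=lambda: fake_now[0])
burst = [bucket.allow() for _ in range(3)]   # third request exceeds the burst
fake_now[0] = 1.0                            # one second later
after_wait = bucket.allow()                  # one token has been refilled
print(burst, after_wait)
```

In practice one bucket would be kept per client (for example per API key or per aggregator partner), so that encouraged scraping is slowed rather than blocked.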