Web scraping, also known as content scraping, web harvesting or web data extraction, is the use of automation to collect data or processed output from a public-facing application. Common content scraping targets include organizations in retail, financial services, travel, hospitality and social media. In some cases, content scraping is encouraged, requiring organizations to strike a balance between allowing the practice and ensuring that the infrastructure, and ultimately the customer experience, is not negatively impacted. In other cases, where content scraping is deemed malicious, the result can include stolen intellectual property, declining sales, lost customers and brand degradation.
Retail Content Scraping: Legitimate or Malicious?
Digitally connected organizations often experience a combination of malicious scraping and search engine abuse that exhibits the following characteristics:
- Multiple masking or evasive techniques disguise the activity, including browser spoofing, forgery and sophisticated user agent rotation.
- The search queries target every single web application URI across multiple locations, and the patterns appear too perfect and too fast to be human.
- The queries originate from a wide range of locations that don’t match the locations named in the search queries themselves.
- Many of the targeted URIs and inventory items queried do not exist, placing significant strain on the organization's infrastructure.
Taken collectively, these characteristics provide strong evidence that the intent of the scraping and search activity is malicious.
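Some of the characteristics listed above can be checked against ordinary request logs. The following is a minimal sketch, not a production detector: the `Request` record, the `scraping_signals` function and all three thresholds are hypothetical and would need tuning against real traffic. It groups requests by source address, which is a simplification, since the activity described here is spread across many locations.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Request:
    """One log entry (hypothetical shape, adapt to your own log format)."""
    client_ip: str     # source address of the request
    user_agent: str    # User-Agent header as sent
    timestamp: float   # seconds since epoch
    status: int        # HTTP response status code


# Assumed thresholds for illustration only.
MAX_USER_AGENTS = 3          # distinct User-Agent strings per client
MAX_REQUESTS_PER_SEC = 5.0   # average request rate over the observed span
MAX_NOT_FOUND_RATIO = 0.3    # share of requests that hit missing URIs


def scraping_signals(requests):
    """Return {client_ip: [signal names]} for clients that trip a heuristic."""
    by_client = defaultdict(list)
    for req in requests:
        by_client[req.client_ip].append(req)

    flagged = {}
    for ip, reqs in by_client.items():
        signals = []

        # User agent rotation: one client presenting many identities.
        if len({r.user_agent for r in reqs}) > MAX_USER_AGENTS:
            signals.append("user_agent_rotation")

        # Too fast to be human: average rate over the observed time span.
        times = [r.timestamp for r in reqs]
        span = max(times) - min(times)
        if span > 0 and len(reqs) / span > MAX_REQUESTS_PER_SEC:
            signals.append("high_request_rate")

        # Queries for URIs that do not exist, counted as 404 responses.
        not_found = sum(1 for r in reqs if r.status == 404)
        if not_found / len(reqs) > MAX_NOT_FOUND_RATIO:
            signals.append("many_missing_uris")

        if signals:
            flagged[ip] = signals
    return flagged
```

No single signal is conclusive on its own; as the text notes, it is the combination that suggests malicious intent, so a client that trips several heuristics at once deserves more attention than one that trips a single check.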
Content scraping can be highly automated, targeting APIs directly, as well as mobile or web apps.