The two sides of web scraping: When data collection becomes a double-edged sword

Emerging AI technology often relies on methods of data collection – such as web scraping – that can become a double-edged sword when used without safeguards and transparency, or in ways that are unlawful. These methods have helped achieve several key victories for digital rights, but they can also be exploitative.

By Hermes Center (guest author) · April 17, 2024

Everyone wants a piece of OpenAI’s innovative generative AI pie. This has come at the expense of two core principles dearly valued by privacy and digital rights civil society organisations:

  1. Accountability for the spread of harmful or false information,
  2. User control over personal data.

When deploying their latest technologies, companies have largely overlooked these principles, presenting them as allegedly necessary trade-offs. Still, nothing seems able to burst the generative AI bubble.

This blog will address the second core principle, namely user control over personal data.

The reflections in this article were triggered by the Italian Data Protection Authority’s public consultation regarding web scraping. This is the same authority that once imposed a national ban on ChatGPT because it had been trained on personal data, a practice that violates data protection laws. Italian citizens had not consented to OpenAI’s use of their data, and the company had not met any of the other legal grounds.

In March 2024, the same Authority asked OpenAI whether the training process for Sora (the new, soon-to-be-released generative AI video tool) had exploited personal data. In an interview with the Wall Street Journal, Mira Murati, OpenAI’s Chief Technology Officer (CTO), claimed she didn’t know where the videos came from. This response was preposterous. No CTO of a company whose success depends on the quality of the data used to train its models can pretend not to know the answer.

We can all agree that ethical training methods exist: ones that don’t involve secrecy, and which respect and protect people’s personal data. But the competitive tech landscape often rewards companies that avoid criticism, even if it means ignoring ethical concerns. Between the transparent-and-honest and the closed-and-abusive approaches to AI development, the current context of cutthroat venture capital investment seems to reward the latter.

Scraping vs. spidering: not just a lexical mistake

One of the reasons the open web is so functional is that it allows for the indexing of public content. This is done by constantly fetching new content to extract relevant keywords. It’s the search engines’ very job, and it’s automatic, periodic, and necessary for the functioning of the modern web.

This practice is called spidering, also sometimes referred to as crawling (check out this TikTok video). Spidering is a technique employed by search engines such as Google and Bing, and by the Internet Archive.

It became increasingly controversial after the rise of Common Crawl – an open repository of web crawl data – whose collected data formed one of the core training sets for GPT-3 and GPT-4.

The fact that website owners never gave consent to have their content ‘spidered’ has been wilfully ignored by companies aiming to build large language models. These companies seem to misinterpret information that authors freely publish on their websites, in their own interest, as information that can be freely used for any technological experiment, such as training Large Language Models (LLMs).

Consent is one of the six legal grounds that could, in theory, justify processing this personal data under the General Data Protection Regulation (GDPR), even if it is clearly not a value that crawlers and their customers have ever respected. The debate about whether AI training can be a legitimate processing of personal data is really about how profit-driven companies can force it to fit into one or more of those six legal grounds. For us, that is a forced interpretation of the law, aimed at fitting the AI narrative and the society that cheers it on.

A similar abusive situation occurred when Clearview AI exploited the idea that whatever is online can be freely taken and carelessly used against the interests of those who shared it. Clearview AI’s product revolves around profiling people and selling their data; OpenAI’s technology amasses huge amounts of data without people’s consent. In the public perception, Clearview AI’s product is meant for wealthy clients to profit from, while OpenAI’s product is freely available to the masses.

This type of data collection is called uninvited spidering, because, arguably, a website that wants to be indexed intends to get traffic, not to have its knowledge extracted and its traffic hijacked by a chatbot.

Being hijacked is the opposite of a website’s interest, which is for its content to be consumed on the website itself. OpenAI, and today’s standard Large Language Model (LLM), also do not report any references. The exploitation is plain to see: it gives nothing back to the original author, and it cannot be justified with the same reasoning as being indexed.

Scraping is the action often blamed for this, although, technically speaking, that is not the correct terminology. The most tech-savvy readers might also consider the complex pipeline that includes parsing, spidering, data mining, and data enrichment. Let’s unpack the differences between these two technical actions – spidering and scraping.

Spidering (Wikipedia) is a massive automated action applied across websites. A web crawler can work over any website – let’s say edri.org – and expects to find standard HTML, from which it extracts some meaning and, especially, new links to crawl, recursively.
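As a minimal sketch of what that looks like in practice – an illustration of the generic technique, not the pipeline of any particular search engine or AI company – here is a tiny crawler using the common Python `requests` and `beautifulsoup4` libraries:

```python
# Minimal sketch of a web crawler (spider): it treats any site the same way,
# fetching pages, extracting generic text, and above all discovering new
# links to follow recursively.
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def crawl(start_url: str, max_pages: int = 50) -> dict[str, str]:
    """Breadth-first crawl from start_url, staying on the same domain."""
    domain = urlparse(start_url).netloc
    queue, seen, pages = [start_url], set(), {}
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        response = requests.get(url, timeout=10)
        soup = BeautifulSoup(response.text, "html.parser")
        # A crawler only needs standard HTML: visible text plus outgoing links.
        pages[url] = soup.get_text(" ", strip=True)
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"])
            if urlparse(link).netloc == domain:
                queue.append(link)
    return pages


# e.g. crawl("https://edri.org")
```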

Scraping, on the other hand, is something site-specific; if we scrape products from amazon.com, it is because we can extract the name of the seller, the price of a product, and basically produce machine-readable information. Scraping edri.org would require a different configuration. Scraping is a selective process that targets specific pages to extract and semantically enrich data.
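By contrast, a scraper is written against one specific site’s page structure. The sketch below shows the idea; the CSS selectors are illustrative placeholders and may not match any real product page’s markup:

```python
# Minimal sketch of a scraper: unlike a crawler, it knows the layout of one
# particular site and turns its pages into machine-readable records.
import requests
from bs4 import BeautifulSoup


def scrape_product(url: str) -> dict[str, str | None]:
    """Extract structured fields from a single product page."""
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")

    def text_of(selector: str) -> str | None:
        node = soup.select_one(selector)
        return node.get_text(strip=True) if node else None

    # Each field is tied to markup that exists only on this particular site;
    # scraping a different site (say, edri.org) would need different selectors.
    return {
        "name": text_of("#productTitle"),
        "price": text_of(".product-price"),
        "seller": text_of(".seller-name"),
    }
```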

Over the years, numerous projects have drawn attention for using scraping as an investigative, evidence-gathering tactic. As a method, it has helped academic researchers, journalists, and non-profit organisations derive meaningful insights from otherwise unstructured web pages. Such evidence produces informative material for general audiences, and reports that help democratic authorities better outline and understand the hidden logic of big tech algorithms.

The utility of scraping is crucial for scrutinising and holding accountable the algorithms that govern all online platforms – large and small.

If the tech is content-agnostic, look at the values

If scraping can be both exploitative and useful, how do we differentiate between the problematic scraping and the kind we need to protect? We can do so by considering the size of the data, the objectives of the scraping, and the safeguards used while doing it.

Abuse happens when scraping is massive and scaled, with the goal of collecting as much as possible. It is a problem because it is untargeted. If the scraping output is a personal profile, it is a problem because you cannot process people’s profiles based on data collected without consent. If scraping is used to harvest content regardless of its source – ignoring accuracy, licences, and timeliness – it produces an untrustworthy result: a mashup of facts, factoids, and outright disinformation.

If we use scraping instead to examine an institution of power (whether a health department’s COVID-19 statistics or the manipulative logic of Instagram), so long as it is done in accordance with data protection rules, that is a positive use of scraping. If the goal is not to study people and attribute something to them, but to understand content, patterns, and trends, that is acceptable too.

Context is key to differentiating between the problematic and the useful kinds of scraping. It is similar to the concept of ‘transparency’: it plays a valid role when applied to an institution of power, but has a completely different flavour when used against citizens. The former is a political act meant to guarantee oversight, while the latter is simply abusive surveillance that makes people feel less safe.

Similarly, consider reverse engineering, another technique often labelled a tool for stealing corporate secrets. Among digital rights defenders, however, it is one of the few affordable mechanisms for investigating closed-source software.

Drawing this line to see the differences between the various ways tech is used – to hurt or to help – is key, especially when it comes to potent methods that are double-edged swords. Indexing can offer positive societal benefits, such as holding power to account. But its use by companies and governments has frequently been abusive and unlawful, leading to mass exploitation of our personal data. By understanding both the tools and the legal framework in which they sit, we can make sure that in the era of ChatGPT, our entire digital histories don’t become lunch for LLMs.

Contribution by: Alessandra Bormioli and Claudio Agosti, EDRi member, Hermes Center for Transparency and Digital Human Rights