The Legality of Web Scraping: The New York Times vs. OpenAI | Imperva

The New York Times vs. OpenAI: A Turning Point for Web Scraping?

In a recent blog post, we covered the blurry lines of legality surrounding web scraping and how the advent of artificial intelligence (AI) and large language models (LLMs) further complicates the matter. Shortly after that post was published, a significant legal development began unfolding: The New York Times (NYT) filed a lawsuit against OpenAI and Microsoft over alleged copyright infringement. This landmark legal battle could redefine the boundaries of copyright law as it applies to AI. It underscores, once more, just how real and immediate the issue of web scraping is becoming.

This follow-up post examines whether this lawsuit could set the benchmark for the legality of scraping copyrighted work to train AI. What about other cases of scraping copyrighted work? And how can policymakers update copyright laws to balance protecting creativity and fostering AI development?

The Case

On December 27, 2023, in a move that could set a new precedent for web scraping, The New York Times sued OpenAI, the creator of the popular chatbot ChatGPT, along with Microsoft. The newspaper alleges that OpenAI used its material without permission to train its AI models. The Times is seeking “fair value” for the use of its content, which it believes its negotiations with OpenAI have not delivered.

OpenAI, on the other hand, argues that its mass scraping of the internet, including articles from The Times, is protected under the legal doctrine of “fair use” in the US Copyright Act. This doctrine allows for the reuse of materials without permission in certain instances, including for research and teaching. According to the lawsuit, OpenAI maintains that its conduct is permitted as “fair use” because its unlicensed use of copyrighted content to train GenAI models serves a new “transformative” purpose.

However, The Times argues that OpenAI’s use of its content does not meet the “transformative” criteria required for fair use. In the lawsuit, The Times’s lawyers state, “There is nothing ‘transformative’ about using The Times’s content without payment to create products that substitute for The Times and steal audiences away from it.”

The Significance of the Lawsuit

This lawsuit is a significant development in the ongoing debate about the legality and ethics of web scraping, particularly in the context of AI and LLMs. It highlights the real-world challenges organizations face in protecting their content and intellectual property in the digital age.

The New York Times’ lawsuit raises critical questions about the boundaries of data collection for AI training. While AI models like GPT-3 and Copilot rely on vast amounts of data to generate coherent outputs, the source of this data and its use can have significant legal and ethical implications.

Moreover, this lawsuit underscores the potential risks associated with web scraping. As AI systems and LLMs are trained on scraped data, they may inadvertently amplify and proliferate private information, posing potential risks to individuals and society. The lack of transparency in how this data is used and the difficulty in removing data once it’s been incorporated into a model raise additional ethical concerns.

The lawsuit also highlights the issue of “hallucinations” by AI models, where they produce information that sounds believable but is completely fabricated. These “hallucinations” can cause significant harm, especially when such information is amplified to millions through search engine results.

The Outcome and Its Potential Impact

The outcome of this lawsuit could have profound implications for both traditional media and AI development. If the court rules in favor of The New York Times, it could set a precedent that using copyrighted material for training AI models without the owner’s consent is illegal. At the same time, such a ruling could hamper the development of AI, as training these models requires vast amounts of data.

On the other hand, if OpenAI’s fair use defense is upheld, it could pave the way for more extensive use of copyrighted material in AI development. However, this could also lead to increased web scraping activities, potentially infringing upon the rights of content creators and publishers.

The Need for Updated Copyright Laws

As AI continues to evolve, it’s clear that our current copyright laws may not be equipped to handle the complexities that come with it. The New York Times vs. OpenAI case highlights the urgent need for policymakers to update these laws to protect content creators’ rights while facilitating AI innovation.

For instance, the EU’s AI Act places legal obligations on providers of general-purpose AI models regarding the use of copyrighted works. Such measures could provide a framework for other jurisdictions to update their copyright laws, striking a balance between protecting creativity and fostering AI development.

Conclusion

While it’s clear that web scraping has become an integral part of AI development, it’s equally clear there needs to be more transparency and regulation around this practice. The New York Times’ lawsuit against OpenAI underscores the complex legal and ethical challenges posed by web scraping in the age of AI. As AI technologies continue to evolve and rely on vast amounts of data for training, it is crucial to establish clear legal guidelines that balance data needs with respect for copyright laws and privacy rights.

Whether or not The New York Times’ lawsuit will set a new standard for web scraping remains to be seen. Regardless of the outcome, it is clear that the conversation around the legality of web scraping is far from over. As businesses and legal systems grapple with these issues, staying informed and navigating this complex landscape with caution and responsibility is crucial.

Organizations Must Proactively Protect Their Data

While we await the court’s decision and its implications, businesses need to take proactive measures to protect their digital assets today. Until the law catches up with the technology, companies must rely on technological solutions to prevent web scraping and data theft.

Imperva Advanced Bot Protection is a market-leading bot management solution that safeguards businesses from today’s most sophisticated bot attacks. It protects all entry points (websites, mobile apps, and APIs) against every OWASP automated threat, including web scraping.

Our multi-layered approach to bot detection includes machine-learning models explicitly tailored to detect web scraping. We provide complete visibility and control over human, good bot, and bad bot traffic, offering multiple response options for each. Most importantly, we do this without imposing unnecessary friction on legitimate users, ensuring the smooth flow of business-critical traffic to your applications.