Blocking GPTBot and Other Means of Curbing Plagiarism in the AI Era

California Gazette Staff

Many people haven’t had to think about plagiarism since before they graduated. But for digital marketers in particular, the rise of AI has brought the issue of plagiarism back into the spotlight.

Site owners want their websites to be crawlable so they can be indexed by search engines like Google. But what about other crawlers—specifically, those that scour data to help train generative AI models like ChatGPT?

Should they be allowed to crawl any site with impunity? What about paywalled and membership areas? Firms have been scraping the web for years in order to feed their AI models, and recently, organizations have been pushing back by blocking GPTBot, Common Crawl, and other crawlers.
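For readers curious what blocking these crawlers looks like in practice, the usual mechanism is a site's robots.txt file. The sketch below is illustrative only: the rules, the example.com URLs, and the /members/ path are hypothetical, and the helper function allowed is our own name, not part of any library. It uses the user-agent tokens GPTBot (OpenAI) and CCBot (Common Crawl), then checks the rules with Python's standard urllib.robotparser module.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block two AI-related crawlers site-wide,
# and keep every other crawler out of an assumed /members/ area.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /members/
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    story = "https://example.com/news/story"    # placeholder URL
    members = "https://example.com/members/area"  # placeholder URL
    for agent in ("GPTBot", "CCBot", "Googlebot"):
        print(agent, "story:", allowed(agent, story),
              "members:", allowed(agent, members))
```

Keep in mind that robots.txt is a voluntary convention. It works only for crawlers that choose to honor it, which is part of why the long-term question raised here remains open.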

Is this a viable long-term solution for sites that don’t want their content repurposed without permission? Is the promise of generative AI so great that this type of sanctioned plagiarism will become the price of admission for every site on the web?

Here’s a look at how AI-driven plagiarism is impacting the online ecosystem.

Why Are Websites Blocking GPTBot and Other Crawlers?

Not every entity has the same level of interest in preventing plagiarism, whether it’s carried out by people or AI.

But for journalistic organizations such as Reuters, the New York Times, and many others whose bread and butter is original content, plagiarism poses a risk to their very business model. After all, these businesses invest heavily in producing high-quality content, and they have a right to charge for access to that content.

To fully understand the risks of plagiarism stemming from the rise of AI, the state of the news industry must also be taken into account.

Legacy media companies have faced significant revenue declines, and the cutbacks that follow, ever since their core audiences started looking for information online before opening a newspaper or turning on the TV or radio.

AI-powered search and generative AI tools both pose real threats to content creators. But the dollars-and-cents risk isn’t the only reason organizations are choosing to block crawlers that scrape data for AI training.

There are also plenty of ethical concerns over what exactly AI firms do with that data.

Even when content isn’t behind a paywall or inside a guarded membership area, it still drives traffic and revenue for the site. Many publishers believe they should have a say in whether and how that content is used, especially for training AI, a use that the content creators clearly never contemplated.

AI Plagiarism Can Harm Any Website

Consider an exclusive news story, for example. Whether it’s repurposed in whole without permission on a competing site (classic plagiarism), or in part as an element of an answer generated in response to a search query, the effect on the original publisher is the same: users no longer need to visit their site to access that information.

Sure, the search engine providing the AI-powered answer might include a link to the original story. But given that such answers are tailored to each user’s specific query, how likely is it that, having read the answer to their question, a user would still have a reason to click through to the publisher’s site?

News publishers aren’t the only ones at risk. Suppose a skincare company invests in SEO by adding blog posts, FAQs, and other high-quality content to its eCommerce site in order to capture better SERP rankings and, ultimately, higher click-through and conversion rates.

It’s a move that’s likely to have the desired effects. But if the content that users find so valuable can be read elsewhere, whether on a competitor’s site or as part of a generative search response, the impact is the same: the brand loses valuable traffic and revenue.

Forthcoming Plagiarism Control Initiatives

The concerns surrounding this issue are rooted in principles as well as economics.

When regulators and legislators finally get around to having hearings about AI-driven plagiarism, the conversation will likely focus on the anti-competitive impact, since this is the more tangible of the two angles.

What plagiarism control measures will ultimately look like depends on a variety of factors. But it wouldn’t be surprising to see the private sector work proactively to address the issue, as this could help rebuild some of the public’s eroded confidence in big tech.

Ask anyone working at Google or Microsoft and they’ll tell you that plagiarism is already not permissible—policies are in place and violators are held to account.

But when it comes to AI-driven plagiarism, there’s a whole lot of gray area.

As to how the private sector will approach the issue, keep in mind that search engines in particular will want to prioritize the user experience. Search leaders view AI-powered search and AI assistants on devices as a massive value add for users, so a vigorous defense of this tech can be expected.

The Simplest Way to Avoid Plagiarism Issues

There’s no question that AI can help digital marketers and SEO professionals work more efficiently and effectively. But when it comes to generative AI for content creation, special care must be taken.

It’s not just that, down the road, AI-generated content could run the risk of violating plagiarism policies. From a practical SEO standpoint, relying on quick, cheap content created by AI does little, if anything, to differentiate a brand from its competitors.

Marketing teams can certainly use AI to brainstorm topic ideas, outline pieces of content, and even write summaries or descriptions. But it’s vital that a human editor review every piece of AI-generated content for accuracy, clarity, and quality.

There’s an absolute glut of low-quality AI-written content out there right now. It all looks the same, and it’s all similarly ineffective because it delivers little to no value for users.

Regardless of what plagiarism controls we may see implemented in the future, there’s an easy way for marketers to avoid the issue entirely: Create unique, high-quality content without trying to borrow from or imitate your competitors.

This type of 10X content will continue to win the day and drive the most meaningful results across verticals.
