OpenAI Unveils GPTBot Web Crawler: How to Block It

OpenAI has released a new web crawler called GPTBot. It is one of the ways OpenAI will gather material to improve its future GPT large language models (LLMs), the technology behind ChatGPT, which is among the most capable AI systems released to date. Like the crawlers that Google and Bing send out to index the web, GPTBot visits websites and collects their content, and that content may then be used to train future AI models.

If you allow GPTBot to access your site, your content helps those models improve faster. OpenAI keeps track of everything GPTBot crawls. A web crawler is a means to an end: it lets huge amounts of data be fed to LLMs, which can learn from it quickly.

The company says that pages crawled by GPTBot are filtered before use. It removes sources that sit behind a paywall, sources known to gather personally identifiable information (PII), and text that violates OpenAI's policies.

At the same time, OpenAI has published instructions on how to block GPTBot. You can disallow GPTBot in your site's robots.txt file, and you can even block its IP addresses. Blocking GPTBot from crawling your site is the first step towards keeping OpenAI from using your data to train its LLMs.
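For the IP route, the check itself is simple to sketch in Python. The ranges below are placeholders taken from the reserved documentation blocks, not GPTBot's real addresses, and the names BLOCKED_RANGES and is_blocked are made up for this illustration. In practice you would substitute whatever ranges OpenAI publishes and enforce them in your firewall or web server.

```python
import ipaddress

# PLACEHOLDER ranges (reserved documentation blocks), NOT GPTBot's real IPs.
# Replace them with the ranges OpenAI publishes for its crawler.
BLOCKED_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_blocked(client_ip: str) -> bool:
    """Return True if client_ip falls inside any blocked range."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in BLOCKED_RANGES)


if __name__ == "__main__":
    for ip in ("192.0.2.17", "203.0.113.5"):
        print(ip, "blocked" if is_blocked(ip) else "allowed")
```

Unlike robots.txt, which relies on the crawler choosing to obey it, an IP block is enforced on your side, but it only works as long as the list of ranges is kept up to date.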

Here’s what you can do to keep your data from being used to train LLMs:

  • To block GPTBot from your whole site, add it to your site’s robots.txt:

User-agent: GPTBot
Disallow: /

  • If you want to block only some parts of your website from being read by GPTBot, allow and disallow individual directories:

User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
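Before you publish the rules, it can help to check that they do what you expect. Here is a minimal sketch that uses Python's standard urllib.robotparser module to test both rule sets above. The example.com URLs and the helper name build_parser are made up for illustration, and Python's parser is only one interpretation of robots.txt, so GPTBot's own handling may differ in edge cases.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in memory (no network request)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Rule set 1: block GPTBot from the entire site.
FULL_BLOCK = """\
User-agent: GPTBot
Disallow: /
"""

# Rule set 2: allow one directory, block another.
PARTIAL_BLOCK = """\
User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
"""

full = build_parser(FULL_BLOCK)
partial = build_parser(PARTIAL_BLOCK)

if __name__ == "__main__":
    # example.com is a stand-in for your own domain.
    print(full.can_fetch("GPTBot", "https://example.com/any-page"))
    print(partial.can_fetch("GPTBot", "https://example.com/directory-1/page.html"))
    print(partial.can_fetch("GPTBot", "https://example.com/directory-2/page.html"))
```

Keep the User-agent line and its rules on consecutive lines. Some parsers, Python's included, treat a blank line as the end of a group, and a rule cut off from its User-agent line is ignored.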

The release comes as OpenAI already faces a pile of lawsuits.

What are the allegations against OpenAI and other AI companies?

  1. One allegation in the lawsuits is that AI tools steal data from users without permission.
  2. A copyright infringement case has also been filed against them.
  3. Some companies, such as Stack Overflow, Twitter, and Reddit, plan to charge AI companies for access to their data.
  4. Companies like Adobe plan to mark data as not for training and are pushing for an anti-impersonation law.
  5. OpenAI and other AI companies have already reached an agreement with the White House. Under it, the companies would develop a watermarking system to let people know when content is AI-generated.
  6. However, these companies made no promise to stop using internet data for training.

Author Profile

Ajay Kumar
Ajay Kumar is an accomplished writer known for crafting immersive and compelling stories that capture the imagination.
