To address the privacy and copyright disputes surrounding data extracted from the public web, OpenAI has announced a web crawler called GPTBot, which will collect the data needed for artificial intelligence training in a more transparent way.
OpenAI said that GPTBot identifies itself with a user-agent token and a full user-agent string, so site operators can recognize the crawler. The public web content it collects will be used only to improve future artificial intelligence models, and content behind a paywall will be excluded.
Website operators who do not want GPTBot to crawl their content, for example because their pages contain a large amount of personal information, only need to add a "GPTBot" entry to the site's robots.txt file. The same file can be used to customize which parts of the site GPTBot may crawl. OpenAI also provides a way to block GPTBot outright by restricting access from the IP address ranges it uses.
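As a rough illustration of the robots.txt approach described above, the sketch below defines two example rule sets and checks them with Python's standard `urllib.robotparser` module. The rules follow the usual robots.txt conventions (`User-agent`, `Allow`, `Disallow`). The `/public/` and `/private/` paths, the `example.com` URLs, and the helper name `gptbot_allowed` are assumptions made for this example, not values taken from OpenAI.

```python
from urllib.robotparser import RobotFileParser

# Example rules that keep GPTBot away from the whole site.
BLOCK_ALL = """\
User-agent: GPTBot
Disallow: /
"""

# Example rules that allow one directory and block another.
# The directory names are hypothetical placeholders.
PARTIAL = """\
User-agent: GPTBot
Allow: /public/
Disallow: /private/
"""


def gptbot_allowed(robots_txt: str, url: str) -> bool:
    """Return True if the given robots.txt text lets GPTBot fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("GPTBot", url)


if __name__ == "__main__":
    print(gptbot_allowed(BLOCK_ALL, "https://example.com/article"))
    print(gptbot_allowed(PARTIAL, "https://example.com/public/page"))
    print(gptbot_allowed(PARTIAL, "https://example.com/private/page"))
```

Checking rules this way only confirms how a standards-following parser reads the file. Whether a crawler honors the rules depends on the crawler itself.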
In the past, many websites were already configured to keep search engines from crawling their pages. As artificial intelligence (AI) technology continues to grow, more and more AI training relies on large amounts of public data. This has heightened concerns among website operators that their content will be used for AI training, which could expose valuable data or compromise privacy. As a result, they are asking AI technology providers to access web data in a reasonable manner.


