AI Companies Found to Be Ignoring Website Scraping Protocols, Continuing to Extract Data

Tech & AI | June 22, 2024, 3:03 p.m.

Perplexity, which describes itself as a "free AI search engine," is facing backlash for allegedly scraping content from websites without permission. After Forbes and Wired accused the company of stealing and republishing their stories, Reuters reported that other AI firms are also bypassing robots.txt files, the instructions websites publish to tell crawlers which pages they may access, to gather data for training their technologies. TollBit, a startup that connects publishers with AI companies, warned publishers about the practice. OpenAI and Anthropic, the creators of popular chatbots, were also implicated. Perplexity's CEO, Aravind Srinivas, denied that the company ignores robots.txt but admitted that it uses third-party crawlers. Wired's investigation found that Perplexity's chatbot produced text closely resembling published articles, sometimes with inaccuracies. Srinivas defended the company's actions, arguing that the Robots Exclusion Protocol is not legally binding and suggesting that publishers and AI companies need a new kind of relationship. He also insinuated that Wired deliberately crafted its prompts to produce negative results.
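
For readers unfamiliar with how robots.txt works in practice, the following is a minimal sketch of the check a compliant crawler would run before fetching a page, using Python's standard `urllib.robotparser` module. The bot name `ExampleAIBot`, the sample rules, and the `example.com` URLs are hypothetical and chosen only for illustration; they do not describe any company named in this story. The sketch also shows why the protocol is voluntary: nothing stops a crawler from simply skipping this check.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks one named AI crawler from the whole site
# and keeps every other crawler out of /private/ only.
SAMPLE_ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = build_parser(SAMPLE_ROBOTS_TXT)
    for agent, url in [
        ("ExampleAIBot", "https://example.com/news/story"),
        ("OtherBot", "https://example.com/news/story"),
        ("OtherBot", "https://example.com/private/draft"),
    ]:
        print(agent, url, may_fetch(rules, agent, url))
```

A crawler that honors the protocol would call something like `may_fetch` before every request and skip any URL for which it returns `False`. Because the rules are keyed to the user agent a crawler announces, a crawler that identifies itself under a different name, or operates through a third party, is matched against different rules.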