Google announced this morning that it has posted a Request for Comments to the Internet Engineering Task Force (IETF) to formalize the Robots Exclusion Protocol (REP) specification, which has served as an informal standard for the internet for 25 years.
The announcement. Google wrote on its blog: “Together with the original author of the protocol, webmasters, and other search engines, we’ve documented how the REP is used on the modern web, and submitted it to the IETF. The proposed REP draft reflects over 20 years of real-world experience of relying on robots.txt rules, used both by Googlebot and other major crawlers, as well as about half a billion websites that rely on REP.”
Nothing is changing. I asked Gary Illyes from Google, who was part of this announcement, if anything is changing and he said: “No, nothing at all.”
So why do this? Because the Robots Exclusion Protocol has never been a formal standard, there is no official or definitive guide for keeping it up to date or for ensuring that a specific syntax is followed. Every major search engine has adopted robots.txt as a crawling directive, yet it has never been an official standard. That is going to change.
Google open sources its robots.txt parser. Alongside the announcement, Google said it is open sourcing the component of its systems that parses robots.txt files. “We open sourced the C++ library that our production systems use for parsing and matching rules in robots.txt files,” Google said. You can see this library on GitHub today if you like.
Why we care. Nothing specific is changing today, but making the protocol a formal standard does open up the chance for things to change. Keep in mind that the internet has relied on it for 25 years without it being an official standard, so it isn’t clear what will or may change in the future. But for now, if you are building your own crawler, you can use Google’s robots.txt parser to help you.
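For readers who want a feel for what "parsing and matching rules" means for a crawler, here is a minimal sketch. It does not use Google's C++ library; it uses Python's standard-library `urllib.robotparser` instead, which may differ from Google's parser in edge cases. The robots.txt content, the bot names (`examplebot`, `otherbot`) and the `example.com` URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: examplebot
Disallow: /drafts/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A crawler with no group of its own falls back to the "*" group.
other_public = parser.can_fetch("otherbot", "https://example.com/index.html")
other_private = parser.can_fetch("otherbot", "https://example.com/private/a.html")

# A crawler named in its own group follows that group, not the "*" group.
example_drafts = parser.can_fetch("examplebot", "https://example.com/drafts/post.html")
example_private = parser.can_fetch("examplebot", "https://example.com/private/a.html")

print(other_public, other_private, example_drafts, example_private)
```

A crawler would typically run a check like `can_fetch` before requesting each URL and skip any URL for which it returns `False`.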