Google announced on its Webmaster Central Blog that it will no longer support a number of robots.txt directives starting September 1, 2019. Robots.txt is a file websites use to regulate and restrict the content harvested by crawlers and indexers. Not all crawlers and indexers are desirable. Some, such as Google or Bing, gather content so that their users can find your site through their search service, which in turn directs traffic back to the content source. Others merely extract content for their own benefit.
Webmasters can deter abusive traffic using rate-limiting tools.
Google indexing a site is usually a good thing because it’s the No. 1 search engine on the internet. But indexing and crawling come at a price… bandwidth. Each time a robot (an automated program), or simply “bot”, crawls a website, it adds to the traffic. Not everyone has unlimited bandwidth and resources. When a bot bombards a site with simultaneous requests, the site can become temporarily unavailable to real users, or even crash. This can pose a problem for service providers. Robots.txt helps manage bot requests. Directives such as Crawl-Delay give webmasters the ability to regulate the interval between a bot’s requests. Google has vast resources, so it has no problem drawing content en masse. But for websites that run on limited resources, it’s a pain.
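To make the Crawl-Delay idea concrete, here is a minimal sketch of how a well-behaved crawler might read that directive, using Python’s standard urllib.robotparser module. The robots.txt content, the bot name ExampleBot, and the /private/ path are made up for illustration; whether a given crawler honors the directive is entirely up to that crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: asks one bot to wait 30 seconds between
# requests and to stay out of /private/.
ROBOTS_TXT = """\
User-agent: ExampleBot
Crawl-delay: 30
Disallow: /private/
"""


def load_rules(text: str) -> RobotFileParser:
    """Parse robots.txt content that is already in memory."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# Seconds the bot is asked to wait between requests (None if not set).
delay = rules.crawl_delay("ExampleBot")

# Whether the bot may fetch a given path.
private_allowed = rules.can_fetch("ExampleBot", "/private/page.html")
public_allowed = rules.can_fetch("ExampleBot", "/index.html")

print(delay, private_allowed, public_allowed)
```

A polite crawler would sleep for the returned delay between requests. Nothing forces it to, which is why robots.txt alone cannot stop a bot that ignores the directive.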
Here’s a snapshot of a Google robot crawl session. The first column shows the date and time of each request; the second shows the interval since the previous request. There are hundreds of these requests, made consecutively at an average interval of three seconds. This isn’t fun.
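For readers who want to produce the same two columns from their own logs, here is a small sketch of how the interval column could be computed. The timestamps below are invented for illustration and are not taken from our logs; the timestamp format is also an assumption.

```python
from datetime import datetime

# Hypothetical request times for a single bot (illustrative values only).
timestamps = [
    "2019-07-03 10:00:00",
    "2019-07-03 10:00:03",
    "2019-07-03 10:00:05",
    "2019-07-03 10:00:09",
]


def intervals_in_seconds(stamps):
    """Return the gap in seconds between each request and the previous one."""
    times = [datetime.strptime(s, "%Y-%m-%d %H:%M:%S") for s in stamps]
    return [(later - earlier).total_seconds()
            for earlier, later in zip(times, times[1:])]


gaps = intervals_in_seconds(timestamps)
average_gap = sum(gaps) / len(gaps)

# Print each request (after the first) alongside the gap that preceded it.
for stamp, gap in zip(timestamps[1:], gaps):
    print(stamp, gap)
print("average interval:", average_gap)
```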
At bluebay700.com, robots from search engines and “media partners” are welcome to index the site, but they have to keep a nominal traffic footprint, with an interval of at least thirty seconds between requests. The site wasn’t built to cater to bots. Google is adamant about getting internet content producers to do its bidding, yet it isn’t always diplomatic. This is yet another example.
Yesterday we announced that we’re open-sourcing Google’s production robots.txt parser. It was an exciting moment that paves the road for potential Search open sourcing projects in the future! Feedback is helpful, and we’re eagerly collecting questions from developers and webmasters alike. One question stood out, which we’ll address in this post:
Why isn’t a code handler for other rules like crawl-delay included in the code?
The internet draft we published yesterday provides an extensible architecture for rules that are not part of the standard. This means that if a crawler wanted to support their own line like “unicorns: allowed”, they could. To demonstrate how this would look in a parser, we included a very common line, sitemap, in our open-source robots.txt parser.
While open-sourcing our parser library, we analyzed the usage of robots.txt rules. In particular, we focused on rules unsupported by the internet draft, such as crawl-delay, nofollow, and noindex. Since these rules were never documented by Google, naturally, their usage in relation to Googlebot is very low. Digging further, we saw their usage was contradicted by other rules in all but 0.001% of all robots.txt files on the internet. These mistakes hurt websites’ presence in Google’s search results in ways we don’t think webmasters intended.
In the interest of maintaining a healthy ecosystem and preparing for potential future open source releases, we’re retiring all code that handles unsupported and unpublished rules (such as noindex) on September 1, 2019.
(Source: Gary, Google Webmaster Central Blog)
Fortunately, robots.txt isn’t the only tool at our disposal for managing abusive visitors. We’re removing Googlebot from our Indexer Whitelist, meaning its traffic will be subject to our rate-limiting gatekeeper, which temporarily denies rapid successive requests. If the behavior persists, the gatekeeper will automatically block it permanently.
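For the curious, the gatekeeper idea can be sketched in a few lines of Python. This is an illustration of the general technique, not our actual gatekeeper: the class name Gatekeeper, the three-strike threshold, and the injectable clock are all assumptions made for the example. Only the thirty-second minimum interval comes from the policy described above.

```python
import time
from typing import Callable, Dict, Set


class Gatekeeper:
    """Deny requests that arrive too soon; block repeat offenders for good."""

    def __init__(self, min_interval: float = 30.0, max_strikes: int = 3,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._min_interval = min_interval   # required seconds between requests
        self._max_strikes = max_strikes     # assumed threshold before a permanent block
        self._clock = clock                 # injectable so the logic can be tested
        self._last_seen: Dict[str, float] = {}
        self._strikes: Dict[str, int] = {}
        self._blocked: Set[str] = set()

    def allow(self, client: str) -> bool:
        """Return True if this request from `client` should be served."""
        if client in self._blocked:
            return False
        now = self._clock()
        last = self._last_seen.get(client)
        # Record every attempt, so a bot that keeps hammering stays denied.
        self._last_seen[client] = now
        if last is None or now - last >= self._min_interval:
            self._strikes[client] = 0       # well-behaved again: reset strikes
            return True
        # Too soon: deny temporarily and count a strike.
        self._strikes[client] = self._strikes.get(client, 0) + 1
        if self._strikes[client] >= self._max_strikes:
            self._blocked.add(client)       # persistent offender: permanent block
        return False
```

In use, a web server would call allow() with something that identifies the visitor (an IP address or user-agent string, for example) and answer with an error status such as 429 when it returns False. A bot that waits out the interval is served normally; one that keeps retrying too quickly ends up on the permanent block list.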
Here at bluebay700.com, we’re not concerned about search engine impressions. We prefer organic growth like word of mouth or referrals from loyal readers. Frankly, it seems Bing.com is doing a better job at indexing and impressions. With Bing, webmasters and developers don’t have to alter their designs to conform to a corporate standard that keeps changing. Remember, Google is not the internet. Google wouldn’t care, and neither do we.
This latest move by Google is disappointing but not entirely surprising. With great power comes great ego.