The Ever Changing Design of the Web
Some updates have been made to this story.
The most important thing, though, is that both Bing and Google no longer search for the string .well-known, always turning it into "well known", making this entire post kinda dumb... (Which is bizarre, for at one time Bing's top result was a reference to the RFC noted below.) [December 22, 2017]
"Layers" make up the underlying protocols of the Internet, but the connections between machines are more like "Webs". Many such Webs make up the World Wide Web, and they have been constantly changing: from HTML meta data definitions to file formats like CSS, SVG and WOFF, to broader media support and, of course, mobile devices.
Google "warned" everybody about their search engine's support of mobile, giving a specific date. All the impact seen here (from a website developer's point of view) was googlebot making requests for:
Which shows how Google's search engine does not just read web pages and follow the links found therein. Sometimes it makes requests for resources (files) of its own accord, like the few mentioned already.
I noticed the other day that googlebot had some more things it was looking for:
/.well-known/apple-app-site-association
/apple-app-site-association
/.well-known/assetlinks.json
sigh, here they go again
This time they freaked out many web developers as can be seen in the many threads about these requests at sites like Server Fault, Stack Exchange, Stack Overflow, Apple Developer and even Google Products Forum. The basic questions were mostly of the "What are these? Should I be worried?" nature.
For what is the ".well-known"-folder? @ Server Fault
What is the .well-known/ directory ... @ Stack Exchange
Requests to /.well-known/apple-app-site-association ... @ Stack Overflow
Incoming requests for /.well-known/apple-app-site ... @ Apple Developer
Sometimes an answer came along pointing to the reason – which is given below – but sometimes people were, as I call these happenings, stumbling around in the dark.
Such A Drag
Such requests by Google (and whatever other bots doing so, though Google seems to be the only one hitting our websites) are harmless, except that they can be such a drag – figuratively and literally.
A drag on some people's time as they immediately turn to the "Worried Wide Web" for answers, often thinking something is wrong. (And often, perhaps, using Google.) Many an hour spent typing into web forum textareas...
A thread on The Joomla! Forum, a question about what to do with "a lot of these requests in [the] log", had this warning:
"Either create that file or create a 403 forbidden rule in your .htaccess file. Don't leave it like this as 404s in Joomla result in additional load."
(Too many such CMS packages load their entire code base – upwards of tens of megabytes in Joomla!'s case – before figuring out that a request is a "Not Found". More on that later...)
Joomla's website has since been updated to not issue their "Not Found" page for /.well-known/.
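The forum's ".htaccess" suggestion can be implemented with Apache's mod_alias; a sketch – the exact paths to block are taken from the requests listed earlier, and whether a 403 is the right answer for your site is a judgment call:

```apache
# Answer these probe paths with a cheap 403 before the CMS ever loads.
<IfModule mod_alias.c>
  RedirectMatch 403 "^/\.well-known/"
  RedirectMatch 403 "^/apple-app-site-association$"
</IfModule>
```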
A Quick Fix
Add these lines to your robots.txt file:

User-agent: *
Disallow: /.well-known/
Disallow: /apple-app-site-association
And cross your fingers that googlebot adheres to them. (They do.)
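You can check that the rules really cover googlebot's probe paths with Python's standard-library robots.txt parser (the example.com host is just a placeholder):

```python
from urllib import robotparser

# Parse the same rules we just added to robots.txt.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /.well-known/",
    "Disallow: /apple-app-site-association",
])

# The probe paths are blocked; normal pages are still crawlable.
print(rp.can_fetch("Googlebot", "https://example.com/.well-known/assetlinks.json"))  # False
print(rp.can_fetch("Googlebot", "https://example.com/index.html"))  # True
```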
What These Are
Oh, yeah. What these are. They are defined in "Well-Known Uniform Resource Identifiers" (RFC 5785):
It is increasingly common for Web-based protocols to require the discovery of policy or other information about a host ("site-wide metadata") before making a request. For example, the Robots Exclusion Protocol http://www.robotstxt.org/...
The recently released iOS 9.3 update implements RFC 5785. Because of this, devices running iOS 9.3 will first request /.well-known/apple-app-site-association for the apple-app-site-association file that is required to implement Universal Links and Shared Web Credentials. If the file is not found in this location, then the device will request the file in the root of the web server, as with prior releases of iOS 9.
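For reference, the apple-app-site-association file Apple describes is a small JSON document. A minimal sketch of the iOS 9-era Universal Links format – the team ID, bundle ID and paths here are placeholders, not a real app:

```json
{
  "applinks": {
    "apps": [],
    "details": [
      {
        "appID": "ABCDE12345.com.example.app",
        "paths": ["/articles/*"]
      }
    ]
  }
}
```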
What Is Wrong
So, there is the how and why of them. But there are several things wrong here.
- Google's sudden (though it started in 2016) requesting of millions of non-existent resources throughout the Web. ("Of course there are going to be millions of 404s!")
- The penchant of Web developers for not designing their code to gracefully, and quickly, detect and handle 404s.
- Google's not informing anyone about these requests.
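The second item – frameworks doing all their bootstrapping before discovering a request is a 404 – can be mitigated by short-circuiting unknown paths at the very front of the request cycle. A minimal WSGI sketch; the path set and responses are illustrative assumptions, not any particular CMS's API:

```python
import os

# Paths this hypothetical site actually serves; anything else 404s
# immediately, before any heavy framework code is loaded.
KNOWN_PATHS = {"/", "/index.html", "/robots.txt"}

def front_controller(environ, start_response):
    path = environ.get("PATH_INFO", "/")
    if path not in KNOWN_PATHS and not os.path.isfile("." + path):
        # Cheap early exit: no database, no template engine, no megabytes
        # of CMS code just to say "Not Found".
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"Not Found"]
    # ...only now hand the request to the full application...
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"OK"]
```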
For the last item: obviously many people were involved with the design and implementation of "Well-Known Uniform Resource Identifiers", yet Google calls them "Google Digital Asset Links" (as if the idea and design were Google's).
That not every Web developer follows all the incremental design changes to the Web cannot be considered shocking. But since even some people participating in Google's Webmaster program were blindsided by this, Google could have done more to explain what they did. (They may have, sometime, somewhere...)
Google could have (should have, in my opinion) changed their User-Agent string from its default:
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
to something more targeted (split to fit the page):
Mozilla/5.0 (compatible; GoogleDigitalAssets/1.0; +https://developers.google.com/digital-asset-links/v1/)
Slowing Down The Web
Many such Web resources are being "farmed out", or "cloud hosted", which means millions of websites all linking to singly hosted corporate URIs – and hundreds of server farms growing to hundreds of acres all over the world just to keep up.
And They Know Who You Are
To say nothing of "Advertisers, Traffic Trackers, Web Counters and Web Analytics" of all kinds.
With the result that when you visit one website, you are also visiting dozens more at the same time, each one keeping track of your whereabouts on the Web at all times.
Thanks, Google, you started all this. Thanks a lot.