Over the holidays in the United States, several articles were shared about an alleged leak of data related to Google rankings. Early articles about the leak focused on “confirming” Rand Fishkin’s long-held beliefs, but little attention was paid to the context of the information and what it actually means.
Context matters: Document AI Warehouse
The leaked document appears to be related to a public Google Cloud platform called Document AI Warehouse, which is used to analyze, organize, search and store data. The public documentation is titled Document AI Warehouse Overview. A Facebook post states that the “leaked” data is the “internal version” of the publicly viewable Document AI Warehouse documentation. That is the context of this data.
Screenshot: Document AI Warehouse
@DavidGQuaid tweeted:
“I think this is clearly an external API for creating a document warehouse, as the name suggests”
This seems to throw cold water on the idea that the “leaked” data represents internal Google Search information.
As far as we know at this time, the “leaked data” shares a similarity with what is in the public Document AI Warehouse page.
Leak of internal Search data?
The original post on SparkToro does not say that the data comes from Google Search. It says the person who sent the data to Rand Fishkin is the one who made this claim.
One of the things I admire about Rand Fishkin is that he is meticulously precise in his writing, especially when it comes to caveats. Rand specifically notes that it is the person who provided the data who claims it comes from Google Search. There is no proof, only an assertion.
He writes:
“I received an email from someone claiming to have access to a massive leak of API documentation from Google’s Search division.”
Fishkin himself does not claim that the data has been confirmed by former Googlers as coming from Google Search. He writes that the person who emailed the data made this claim.
“The email further claimed that these leaked documents had been confirmed as authentic by former Google employees, and that these ex-employees and others had shared additional private information about Google’s search operations.”
Fishkin writes about a later video meeting in which the leaker revealed that his contact with former Googlers took place when they met at a search industry event. Again, we have to take the leaker’s word about the ex-Googlers, and that what they said came after a careful review of the data and was not a casual comment.
Fishkin writes that he contacted three former Googlers about this. What is notable is that these former Googlers did not explicitly confirm that the data was internal to Google Search. They only confirmed that the data looks like internal Google information, not that it comes from Google Search.
Fishkin writes what former Googlers told him:
- “I didn’t have access to this code when I worked there. But it certainly seems legit.”
- “It has all the characteristics of an internal Google API.”
- “It’s a Java-based API. And someone spent a lot of time following Google’s internal standards for documentation and naming.”
- “I would need more time to be sure, but it matches the internal documentation that I am familiar with.”
- “Nothing I’ve seen in a brief review suggests this is anything but legitimate.”
Saying something came from Google Search and saying it came from Google are two different things.
Keep an open mind
It’s important to keep an open mind about the data, as much about it remains unconfirmed. For example, it is unclear whether this is an internal document of the Search team. For this reason, it’s probably not a good idea to treat this data as actionable SEO advice.
Additionally, it is not advisable to analyze the data specifically to confirm long-held beliefs. That is how we become trapped in confirmation bias.
A definition of confirmation bias:
“Confirmation bias is the tendency to seek out, interpret, favor, and recall information in a way that confirms or supports one’s prior beliefs or values.”
Confirmation bias can cause a person to deny things that are empirically true. For example, there is the decades-old idea that Google automatically prevents a new site from ranking, a theory known as the Sandbox. Yet every day, people report that their new sites and pages are ranking in the top ten of Google search results almost immediately.
But if you’re a strong believer in the sandbox, then an actual observable experience like that will be discounted, no matter how many people observe the opposite experience.
Brenda Malone, Independent Senior SEO Technical Strategist and Web Developer (LinkedIn profile), messaged me regarding the sandbox claims:
“I know personally, from experience, that the sandbox theory is false. I just indexed a personal blog with two articles in two days. There is no way a small two-article site could have been indexed using the sandbox theory.”
The takeaway here is that even if the documentation turns out to come from Google Search, the wrong way to analyze the data is to look for confirmation of long-held beliefs.
What is the Google data leak?
There are five things to consider regarding the leaked data:
- The context of the information leak is unknown. Is it related to Google Search? Is it for other purposes?
- The purpose of the data. Was the information used for actual search results? Or was it used for internal data management or manipulation?
- Former Googlers have not confirmed that the data is specific to Google Search. They have only confirmed that it appears to be from Google.
- Keep an open mind. If you go looking for justifications for long-held beliefs, guess what? You will find them everywhere. This is called confirmation bias.
- Evidence suggests that the data is linked to an external API to create a document warehouse.
What others are saying about the “leaked” documents
Ryan Jones, someone who not only has extensive SEO experience but also a tremendous understanding of IT, shared some reasonable observations about the so-called data leak.
Ryan tweeted:
“We don’t know if it’s for production or for testing. I guess it’s mainly for testing potential changes.
We don’t know what is used for web or other verticals. Some things can only be used for a Google home or news etc.
We don’t know what an input is to an ML algorithm and what we are training against. I assume clicks are not a direct input but are used to train a model to predict clickability. (Apart from trend boosts)
I’m also assuming that some of these fields only apply to training datasets and not to all sites.
Am I saying Google didn’t lie? No way. But let’s look at this leak objectively and without bias.”
@DavidGQuaid tweeted:
“We also don’t know if this is a Google search or a Google Cloud document retrieval.
The APIs seem to choose – this is not how I expect the algorithm to be run – what if an engineer wants to ignore all these quality checks – it looks like I want to create a Content warehouse application for my company’s knowledge base.”
Is the “leaked” data related to Google Search?
Currently, there is no hard evidence that this “leaked” data actually comes from Google Search. There is great ambiguity as to the purpose of the data. It’s worth noting that there is some evidence to suggest that this data is just an “external API for creating a document warehouse, as the name suggests,” and has nothing to do with how websites are ranked in Google Search.
The conclusion that this data does not come from Google Search is not definitive at this time, but it is the direction in which the wind of evidence seems to be blowing.