How Google Indexing Works (Definitive Guide for SEO)
Let's scratch the surface of one of the fundamental topics in search: how does indexing work in the most popular web search engine? Understanding this will help us make informed decisions during keyword research and content writing.
After studying many subjects, I came to the conclusion that the way to learn faster and better is to study the fundamentals of a subject. In today's fast-paced world, we want to learn what's new and what we need to accomplish our daily tasks, so studying fundamentals can seem like a waste of time. However, I think that sometimes we need to slow down a bit and study fundamentals so that we can move faster tomorrow.
Google Search
Before web pages appear in the Google search results, they need to be:
- Crawled
- Indexed
This guide is about the indexing part of the process. A thorough understanding of indexing will help you create web pages that rank higher in Google search and other search engines.
Crawling is out of scope here. You can follow our blog to read a detailed article on how Google crawling works.
Let's start with lexical indexing.
Lexical Indexing
Lexical (or, put simply, textual) indexing works with words without analyzing the meaning of the sentences and paragraphs in the article.
In the beginning, web search engines were lexical-only search engines. They looked for occurrences of the keywords in articles.
This was the time when SEOs used keyword stuffing to rank articles that didn't provide value or were even written about different topics (if you have ever seen invisible keywords, you know what I am talking about).
Even though modern search engines use many indexing techniques, lexical search is still an important part of their algorithms. That makes sense, right?
Let's see how lexical indexing works.
One Word Indices
When Google indexes a page, it reads all the words on the page (content + meta attributes). Afterward, it creates a word-to-page mapping for each unique word on the page.
When we search for a single word, these simple indices are used to build the search engine result pages.
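The word-to-page mapping described above is commonly called an inverted index. Here is a minimal sketch of the idea in Python. The page URLs, the texts, and the naive tokenizer are assumptions made for illustration only; this is not how Google actually implements it.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into words (a deliberately naive tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_one_word_index(pages):
    """Map each unique word to the set of page URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        # set() keeps one entry per unique word, however often it repeats
        for word in set(tokenize(text)):
            index[word].add(url)
    return index

# Hypothetical pages, used only for illustration.
pages = {
    "example.com/a": "How Google indexing works",
    "example.com/b": "How Bing crawls the web",
    "example.com/c": "Apple pie recipe",
}

index = build_one_word_index(pages)
print(sorted(index["how"]))    # pages that contain the word "how"
print(sorted(index["apple"]))  # pages that contain the word "apple"
```

A one-word search then becomes a single dictionary lookup, which is why the page content does not need to be reread at query time.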
We can see this if we search for some of the most common words in English. We will get over 20 billion results for each word in the table. A search for "apple" will bring around 3 billion results.
Search engine users search for one-word phrases in the following cases:
- They want to find the word definition
- Branded searches. They want to find the website of the brand.
- Navigational searches. We put part of the web address (the domain name) in the browser's search bar, and Google provides the address in the first position of the first search engine result page. Then the users click on the first result and land on the website. They know the web address, but they are too lazy to type it in full.
If our article has 500 unique words, it will be included in 500 search indices with one-word keys.
Now let's consider two-word indices.
Two Word Indices
Let's start with an example and pick the part of this page's title: "How Google".
We get 12 billion results.
The search for "How Google" gives us the index of all pages that have the words "how" and "Google" in any part of the page (even far away from each other). Of course, the pages where the words "How" and "Google" appear next to each other will rank higher.
This two-word index can be easily constructed from two one-word indices in the following manner.
Pick all pages that appear in both the "How" index and the "Google" index.
Two-word, three-word, and many more indices can be constructed with this technique from the one-word indices. For each n-word index, previously constructed indices will be used, and pages' content does not need to be reread.
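The construction above is a set intersection. Here is a small sketch, assuming each one-word index is stored as a set of page IDs. The IDs and the helper name `n_word_index` are hypothetical.

```python
from functools import reduce

# Hypothetical one-word indices: word -> set of page IDs that contain it.
one_word_index = {
    "how": {1, 2, 3, 5},
    "google": {1, 3, 4},
    "bing": {2, 5},
    "works": {1, 4, 5},
}

def n_word_index(words, index):
    """Return the pages that contain every word, by intersecting one-word indices."""
    postings = [index.get(word.lower(), set()) for word in words]
    if not postings:
        return set()
    return reduce(set.intersection, postings)

how_google = n_word_index(["How", "Google"], one_word_index)
how_bing = n_word_index(["How", "Bing"], one_word_index)

# A three-word index can reuse the two-word result instead of starting over.
how_google_works = how_google & one_word_index["works"]

print(how_google, how_bing, how_google_works)
```

Note that the intersection only tells us which pages contain all the words. Ranking those pages (for example, preferring pages where the words are adjacent) is a separate step.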
To continue the keyword popularity experiment, let's search for the phrase "How Bing".
We get around 500 million results.
Returning to the popularity of search phrases, we can see that the index for "How Google" is around 24 times bigger than the index for "How Bing" (12 billion results versus 500 million). This means that the words "How" and "Google" appear together on about 24x more pages than the words "How" and "Bing".
As I said, we can do this process for any n-word search.
Here is an illustration of the indices up to the 3-word indices.
Starting from three-word phrases, search queries can be meaningful sentences. This is where other kinds of indexing kick in to improve on the results generated using only lexical indices.
Semantical Indexing
Semantical indexing started when search engines got smarter.
At some point, Google started to understand the meaning of sentences and articles using Natural Language Processing (NLP) algorithms.
In addition, the Google knowledge graph (we will talk about this later) helped with the relationships between objects.
Semantical Indexing helps Google answer most of our queries. Let's look at some of the techniques of semantic indexing.
Synonym System
One of the first logical ideas for a comprehensive search is to use synonyms, as described in the "Meaning of your query" section of this guide from Google.
If you use the main keyword in your articles too many times, Google might think that you are intentionally stuffing the keyword, which may hurt your rankings.
For this reason, you might want to use synonyms.
By the way, there is an easy way to find synonyms for your keywords.
When Google has synonyms for parts of our search query, it highlights those synonyms on the search engine result page (SERP).
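To make the synonym idea concrete, here is a rough sketch of query expansion: each query word is allowed to match itself or any of its synonyms. The synonym table is hypothetical and tiny, and the matching rule is an assumption for illustration. A real search engine learns these relationships at scale.

```python
# Hypothetical synonym table, for illustration only.
SYNONYMS = {
    "jacket": {"coat"},
    "buy": {"purchase"},
}

def expand_query(query, synonyms=SYNONYMS):
    """Return, for each query word, the set of terms that may match it."""
    expanded = []
    for word in query.lower().split():
        expanded.append({word} | synonyms.get(word, set()))
    return expanded

def matches(page_text, query):
    """A page matches if it contains at least one accepted term for every query word."""
    page_words = set(page_text.lower().split())
    return all(terms & page_words for terms in expand_query(query))

print(expand_query("buy jacket"))
print(matches("purchase a warm coat online", "buy jacket"))
print(matches("warm winter hat", "buy jacket"))
```

In this sketch, a page that never uses the exact words "buy" or "jacket" can still match the query through "purchase" and "coat".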
The "Focus keyword synonyms" field in the Yoast plugin helps you analyze how the synonyms of the main keyword are used on the page. Here's how.
When you know the synonyms for your primary or focus keyword, you can add them to the "Focus keyword synonyms" list of the Yoast plugin, which is included in the Modern blogging solution and is available on any WordPress website where you have installed the Yoast plugin.
This will help Yoast in the keyword usage evaluation.
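To give a feel for what a keyword usage evaluation might look at, here is a simplified sketch that counts how often a focus keyword and its synonyms occur in a text. This is not Yoast's actual algorithm. The function name `keyword_usage` and the density formula (phrase hits divided by total words) are assumptions for illustration.

```python
import re

def normalize(text):
    """Lowercase the text and keep only word characters, separated by single spaces."""
    return " ".join(re.findall(r"[a-z0-9']+", text.lower()))

def keyword_usage(text, keyword, synonyms=()):
    """Count occurrences of the focus keyword and each synonym, plus a rough density."""
    body = normalize(text)
    total_words = len(body.split())
    counts = {}
    for phrase in (keyword, *synonyms):
        pattern = r"\b" + re.escape(normalize(phrase)) + r"\b"
        counts[phrase] = len(re.findall(pattern, body))
    # Rough density: phrase hits per word of text (an assumed, simplified metric).
    density = sum(counts.values()) / total_words if total_words else 0.0
    return counts, density

text = (
    "Google indexing maps words to pages. Search indexing is fast. "
    "Indexing by Google matters."
)
counts, density = keyword_usage(text, "google indexing", synonyms=("search indexing",))
print(counts, round(density, 3))
```

Counting synonyms together with the focus keyword shows whether a topic is covered without repeating one exact phrase over and over.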
Knowledge Graph
It is worth noting that Google's Knowledge Graph alone is enough to answer some queries.
For example, if we search "Who wrote the Harry Potter", the answer ("J.K. Rowling") will be provided directly from the knowledge graph.
The search for "Who directed the Harry Potter" will return the names of the directors of the Harry Potter movies, again from the knowledge graph.
On the other hand, if we search for "who directed the last harry potter", the answer will be "David Yates".
Here you can see how the answers were inferred for these three queries.
As we can see in all of the examples, nodes and edges of the graph representation helped answer the queries.
Hence, Google does not need to look up the index of the query to provide the answer. However, Google will still show the index results alongside the answer provided from the knowledge graph.
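The nodes-and-edges idea can be sketched with a handful of (subject, relation, object) triples. The schema, the relation names, and the helper functions below are assumptions for illustration, and only two of the films are included to keep the example short. Google's real Knowledge Graph is far larger and its internals are not public.

```python
# Hypothetical mini knowledge graph: (subject, relation, object) triples.
TRIPLES = [
    ("Harry Potter", "written_by", "J.K. Rowling"),
    ("Harry Potter", "has_film", "Philosopher's Stone"),
    ("Harry Potter", "has_film", "Deathly Hallows - Part 2"),
    ("Philosopher's Stone", "directed_by", "Chris Columbus"),
    ("Philosopher's Stone", "released", 2001),
    ("Deathly Hallows - Part 2", "directed_by", "David Yates"),
    ("Deathly Hallows - Part 2", "released", 2011),
]

def objects(subject, relation):
    """Follow one kind of edge from a node and return the nodes it points to."""
    return [o for s, r, o in TRIPLES if s == subject and r == relation]

def who_wrote(series):
    return objects(series, "written_by")

def who_directed(series):
    """Collect the directors of every film in the series, without duplicates."""
    directors = []
    for film in objects(series, "has_film"):
        for director in objects(film, "directed_by"):
            if director not in directors:
                directors.append(director)
    return directors

def who_directed_last(series):
    """Find the most recently released film, then follow its directed_by edge."""
    films = objects(series, "has_film")
    last_film = max(films, key=lambda film: objects(film, "released")[0])
    return objects(last_film, "directed_by")

print(who_wrote("Harry Potter"))
print(who_directed("Harry Potter"))
print(who_directed_last("Harry Potter"))
```

Each question is answered by walking edges in the graph: one hop for the author, two hops for the directors, and two hops plus a comparison of release years for the last film.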
Latent Semantic Indexing
The name of this term sounds scary. But the concept is not frightening and is easy to understand.
Note that LSI is just one technology for semantic indexing, developed at Bellcore (Bell Communications Research) in the late 1980s. Google doesn't use this technology itself; it uses its own techniques that serve a similar purpose.
We use "Latent Semantic Indexing" and "Semantic indexing" interchangeably in this article because "Latent Semantic Indexing" and its abbreviation (LSI) are more popular. The majority of people search for these terms more than for "semantic indexing". I want this article to rank for the popular version of the term. That's why I use the term "Latent Semantic Indexing". This blog is an SEO blog, after all ;)
Semantic Indexing is the process of finding relationships between words and content.
For example, if I talk about "indexing" in my blog post, but the article does not contain the words "search", "google", "bing", or other words that hint to the algorithm that my page is about search engine indexing, then the algorithm will be confused and may think that my page is about database indexing, book indexing, or any other type of indexing.
As a result, the algorithm may rank my article for "database indexing" queries.
On the other hand, if my page is about "database indexing", it is an excellent idea for me to have database vendor names (like "MySQL") and database terms (like "DB", "database", etc.) in my article.
The words that are related to the primary keyword through Latent Semantic Indexing are called LSI keywords or semantically relevant keywords.
LSI builds relationships between words that aren't synonyms.
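The disambiguation described above can be illustrated with a toy example that guesses what kind of "indexing" a page is about from the related words it contains. To be clear, this is not real LSI (which relies on a matrix decomposition of term-document statistics); it is a simplified overlap count, and the topic names and term lists are hypothetical.

```python
import re

# Hypothetical related-term lists, loosely based on the examples in this article.
RELATED_TERMS = {
    "search engine indexing": {"search", "google", "bing", "crawl", "serp"},
    "database indexing": {"database", "db", "mysql", "query", "table"},
}

def guess_topic(text, related_terms=RELATED_TERMS):
    """Score each topic by how many of its related terms appear in the text."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    scores = {topic: len(terms & words) for topic, terms in related_terms.items()}
    best = max(scores, key=scores.get)
    # With no related terms at all, the topic stays ambiguous.
    return (best if scores[best] > 0 else None), scores

print(guess_topic("Indexing speeds up every MySQL database query"))
print(guess_topic("How Google and Bing handle indexing for search"))
print(guess_topic("Indexing is useful"))
```

The last call shows the problem from the example: a page that mentions only "indexing" gives the algorithm nothing to decide with.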
This article published on WordStream provides a great explanation of what LSI keywords are and how they differ from synonyms. Here's an excerpt from the article.
For example, a synonym for the word “jacket” would be “coat”. However, LSI keywords for “jacket” would include words like: reversible, winter, feather down, warm, padded, puffer, and so on.
We can use LSI keyword research tools to find LSI keywords or figure them out ourselves.
It's a great idea to make sure that we use LSI keywords in the important parts of our articles, namely, in headers, in the first paragraphs right after the subheaders, in image alt attributes, etc.
Note that if we write a thorough article around a subject, we will naturally include LSI keywords in the text. However, keeping LSI in mind will help us create a better outline and better headers, and write articles that rank better.
Natural Language Processing
Natural Language Processing (NLP) is a field of artificial intelligence that relies heavily on machine learning. With NLP, machines try to understand human language.
Google uses NLP to understand both search queries and webpage content.
Using Google's Cloud Natural Language product demo, we can get a glimpse of how text processing works for the content and the search queries.
You can paste an example search query or a paragraph from your content and see how the NLP algorithm understands it.
You can switch between the Entities, Sentiment, Syntax (Grammar), and Categories tabs to see different views of the analysis.
NLP in search queries
Before touching on the topic of NLP analysis of the search queries, we need to imagine that there is a process that maps search queries to the keys of the index. For example, when we misspell words in the search query, and the index of the misspelled variation is empty, Google will show the index of the corrected query.
The misspelling detection algorithm is an NLP algorithm that works not only on the word level but also on the sentence level. The meaning of the sentence helps figure out the possible correct words that are misspelled.
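The mapping from a misspelled query to an existing index key can be sketched with Python's standard `difflib` module. This is a word-level approximation only: it ignores the sentence-level meaning mentioned above, and the vocabulary list and the cutoff value are assumptions for illustration.

```python
import difflib

# Hypothetical vocabulary: the words that exist as keys in our index.
INDEX_KEYS = ["google", "indexing", "search", "engine", "featured", "snippet"]

def correct_word(word, vocabulary=INDEX_KEYS, cutoff=0.75):
    """Return the closest known word, or the word itself if nothing is close enough."""
    if word in vocabulary:
        return word
    candidates = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return candidates[0] if candidates else word

def correct_query(query):
    """Map every word of the query to an index key when a close match exists."""
    return " ".join(correct_word(word) for word in query.lower().split())

print(correct_query("gogle indxing"))
print(correct_query("zzzz"))  # no close match, so the word is left unchanged
```

A production spell checker would also weigh the surrounding words, so that the same typo can be corrected differently depending on the rest of the query.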
NLP in content
There are many ways in which NLP helps with indexing the content. One of the most significant features is passage indexing. With passage indexing, Google finds passages in web pages that answer specific queries.
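As a very rough illustration of the idea, the sketch below splits a page into passages and picks the one that shares the most terms with the query. Real passage ranking relies on much more sophisticated language understanding. The stopword list, the paragraph-based splitting, and the overlap score are all assumptions made for this example.

```python
import re

# Hypothetical, very short stopword list.
STOPWORDS = {"a", "an", "the", "is", "what", "of", "in", "to", "and"}

def terms(text):
    """Return the set of lowercase words in the text, minus stopwords."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}

def best_passage(page_text, query):
    """Return the passage sharing the most terms with the query, or None if none match."""
    passages = [p.strip() for p in page_text.split("\n\n") if p.strip()]
    if not passages:
        return None
    query_terms = terms(query)
    scored = [(len(query_terms & terms(p)), p) for p in passages]
    score, passage = max(scored, key=lambda pair: pair[0])
    return passage if score > 0 else None

page = (
    "Our blog covers many SEO topics.\n\n"
    "A featured snippet is a short answer shown above the regular search results.\n\n"
    "Follow us for more articles."
)
print(best_passage(page, "what is a featured snippet"))
```

The point of the sketch is that the answer is located at the passage level, so a relevant paragraph can be found even when it sits deep inside a long article.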
Passage indexing is highly beneficial for users because it tries to find exact answers to the questions they ask. Often, the answers appear in the Featured Snippets.
If you are looking for a short answer, this might be enough.
You can visit the pages in the search engine result page (SERP) if you are looking for more than the short answer.
Sometimes, the passage is highlighted by Google in the page selected for the short answer in the Featured Snippet. Note that the passage can be anywhere in the article, and highlighting it will help us find it. Here is an example of an article in the Featured Snippet.
When we click on the article that provides the answer to our search query "featured snippet", the answer will be highlighted in the text of the page.
Pages that provide the best answers (according to Google) to the questions about the topic that they cover can appear in what SEOs call position zero (winning the Featured Snippet).
This is a massive win for a page. It is also a fair game, because pages that set out to provide the best answers for users get rewarded.
As we have said, there are many ways that NLP is used for content indexing. I think that the usage of NLP in content indexing will continue to grow, which will reward content writers who write the best articles for their users.
Conclusion
We have only scratched the surface, using an extremely simplified picture of how Google indexing works. The purpose of this article was to help you write articles that are optimized for both search engines and readers.
Usually, we use reverse keyword research before writing content. We check what people are searching for and what keywords the articles on the first page of Google use.
This article and a further understanding of the indexing logic may help us make informed decisions and combine them with the tools and techniques that we already use.