What Are Search Engine Crawlers, and How Do They Work?
If we were to personify a search engine crawler, it would be a pirate. When you think about it, they’re very similar: pirates journey out and follow clues to find treasure, and search engine crawl-arrrs follow links on pages to add content to a search index.
A pirate is only as good as their collection of treasure, and the same is true for search engines and the contents of their index. If a search engine crawler can’t find some pages by following links, your users won’t find them from a search.
If you want your users to find as many new and updated pages as possible from a search, you need to ensure that your search index contains the latest content from your site. But staying on top of this can get tricky. Some search solutions offer periodic sitewide recrawling, while others require you to manually push up-to-date content via an API.
A standout feature that sets Sajari apart from other search providers is Instant Indexing.
With Instant Indexing, any new content you publish or update on your site will be added to your index (what we call search Collections) immediately. Similarly, any time you delete content and mark the page with a valid 404 status code, Sajari will automatically remove it from your Collection.
Instant Indexing is a fantastic tool for marketers to ensure that their search index and site content match, making site search as relevant as possible for the end user. It’s a pirate’s dream, being told where new, undiscovered treasure lies.
But how does it work? This post will give an overview of what search engine crawling is, how Sajari’s Instant Indexing works, and why it is a feature you must consider when selecting a search provider.
What is a search crawler?
A search crawler is a bot that scans web pages and adds them to a search index. “Scanning” means retrieving a copy of each page’s HTML, which the search engine then uses to determine relevance for a search query.
When a site is indexed for the first time, the Sajari crawler will visit a nominated domain and sitemap (more on these to come). The crawler will first scan the domain’s homepage and add it to a search index. Then the crawler will visit every link on that page and add each of those pages too. The crawler will continue scanning pages and following links until all accessible pages are in your Collection.
In a few seconds, the crawler can visit hundreds of pages and add these to a search index. This method of following links to scan new pages is exactly how web search engines like Google work too.
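To make the link-following idea concrete, here is a minimal sketch of that kind of breadth-first crawl, written in JavaScript for Node 18+ (which has a built-in fetch). This is a simplified illustration, not Sajari’s actual crawler: it extracts links with a crude regex, while a production crawler would parse HTML properly, respect robots.txt, and rate-limit its requests.

```js
// Minimal sketch of link-following crawling (breadth-first).
// Assumes Node 18+ for the global fetch API.
async function crawlSite(startUrl) {
  const origin = new URL(startUrl).origin;
  const queue = [startUrl];            // pages waiting to be scanned
  const seen = new Set([startUrl]);    // URLs we've already queued
  const index = new Map();             // url -> HTML: our toy "Collection"

  while (queue.length > 0) {
    const url = queue.shift();
    const res = await fetch(url);
    if (!res.ok) continue;             // skip 404s and other errors
    const html = await res.text();
    index.set(url, html);              // "scan" the page into the index

    // Follow every link on the page that points to the same site.
    for (const match of html.matchAll(/href="([^"#]+)"/g)) {
      const link = new URL(match[1], url).href; // resolve relative URLs
      if (link.startsWith(origin) && !seen.has(link)) {
        seen.add(link);
        queue.push(link);
      }
    }
  }
  return index;
}

// Example: crawlSite("https://www.example.com").then(i => console.log(i.size));
```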
An overview of how Instant Indexing crawls pages, ready to be searched using Sajari.
However, the downside is that some pages on your site may not be linked to from anywhere else, so the crawler won’t be able to add these pages or products to your Collection. We call these pages “content islands”, as they are sections of your site that the crawler can’t reach.
. . .
So our pirate crawler has raised the sails, assembled his crew, and is ready to crawl the seven seas for booty. He hears about an island with buried treasure, so he sails across and grabs as much as he can. There he finds a clue pointing to another island, so he sails home, drops off his treasure, and voyages to this new island.
He keeps on following directions to new islands, grabbing all the treasure, and sailing home and dropping it off. However, what he doesn’t know is that there is an even bigger island, filled with even more treasure, that he has no idea about.
. . .
It’s easy to forget to link to every page on your site. We find that some of our users’ sites have hidden subdomains, and the crawler can’t follow a link to add these to a search Collection.
A good way to ensure that a search crawler finds as many pages as possible is by including an up-to-date XML sitemap on your site, accessible from the homepage. An XML sitemap is a file that lists all of the URLs on your site and was first introduced as a way to ensure search crawlers find as many web pages as possible.
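For illustration, a minimal sitemap following the sitemaps.org protocol looks like the sketch below; the URLs and dates are placeholders for your own pages.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- One <url> entry per page you want crawlers to find -->
  <url>
    <loc>https://www.example.com/</loc>
    <lastmod>2021-01-01</lastmod>
  </url>
  <!-- Pages with no inbound links still get discovered this way -->
  <url>
    <loc>https://www.example.com/hidden-landing-page</loc>
    <lastmod>2021-01-15</lastmod>
  </url>
</urlset>
```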
. . .
Now when our pirate visits a new island on his quest, he finds an XML treasure map that shows him where ALL the islands are. He is able to chart a course to every island, grab all the treasure from each, and bring it back home.
. . .
The advantage of Instant Indexing, though, is that the crawler doesn’t rely on following page links or periodically crawling an up-to-date XML sitemap. Instead, the crawler is specifically alerted when a page has been created or changed.
How does Instant Indexing work?
To start using Sajari Instant Indexing, all you need to do is copy the code from inside your Console and paste it on your site. It works like this:

1. You add the Instant Indexing code to your site’s page template.
2. You publish a new page on your site and view it for the first time.
3. When the page loads, the Instant Indexing code on it “pings” the Sajari crawler.
4. The crawler immediately comes to your site, scans the page, and adds it to your search Collection.
Similarly, you may have an existing page that already contains the Sajari Instant Indexing code. If this page has been changed and its URL is linked from another page that is loaded, the crawler will be pinged immediately and will add the updated version to your Collection.
The “ping” here is a snippet of JavaScript that contains your individual Project and Collection details. When this code is loaded in a browser, it alerts our crawler to come along and say hello.
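The real snippet is generated for you in the Console, so the sketch below is purely a hypothetical illustration of how a ping of this kind can work: a tiny script fires a request carrying the current page URL plus Project and Collection IDs. The endpoint, parameter names, and IDs here are made up, not Sajari’s actual code.

```html
<!-- Hypothetical ping sketch - NOT the actual Sajari snippet,
     which you copy from your Console. -->
<script>
  (function () {
    // Placeholder IDs; the real values are specific to your account.
    var project = "YOUR_PROJECT_ID";
    var collection = "YOUR_COLLECTION_ID";

    // Fire-and-forget request telling the crawler which page just
    // loaded, so it can come along and (re)scan that URL.
    var ping = new Image();
    ping.src = "https://ping.example.invalid/index" +
      "?project=" + encodeURIComponent(project) +
      "&collection=" + encodeURIComponent(collection) +
      "&url=" + encodeURIComponent(window.location.href);
  })();
</script>
```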
. . .
Now our pirate is moored in his favorite cove, resting and enjoying his large collection of treasure. Out of nowhere, he receives a letter with the coordinates of a brand new island that contains more treasure. He weighs anchor immediately, follows these coordinates, grabs the treasure, and comes straight back home. Easy!
. . .
If you have Instant Indexing code across your site, you don’t need to worry about a crawler following links to find all of your pages. It will be alerted when a page has been published and viewed, or when the content on that page has changed.
The Sajari crawler will also automatically delete pages from a Collection if they return a valid 404 status code. Usually, this happens when a site is reindexed, as the crawler will detect the error and strip these pages from your Collection.

If you have Instant Indexing code on your site, though, this happens straight away. The crawler will detect that a page’s status has changed (in this case, that the page has been marked as a 404). Sajari’s crawler will come to this page, identify it as deleted, and then remove it from your Collection.
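The important detail is that a deleted page must return a real 404 status code, not a styled “not found” page served with a 200, or the crawler has no way to know the page is gone. As a rough sketch of what that means server-side, here is a minimal Express example (illustrative only; the in-memory page store stands in for your CMS):

```js
// Illustrative Express sketch: deleted pages return a genuine 404 status.
const express = require("express");
const app = express();

// Hypothetical in-memory content store standing in for your CMS.
const pages = {
  about: "<h1>About us</h1>",
  pricing: "<h1>Pricing</h1>",
};

app.get("/:slug", (req, res) => {
  const html = pages[req.params.slug];
  if (!html) {
    // A real 404 status code tells the crawler this page is gone,
    // so it can be stripped from the Collection. A "not found" page
    // served with a 200 status would keep the stale copy indexed.
    return res.status(404).send("Page not found");
  }
  res.send(html);
});

app.listen(3000);
```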
Instant Indexing is necessary for site search.
Much like a pirate, you want your crawler to find as much treasure as possible. If a crawler can’t find your new or updated content or products, these won’t be found by your users from a search.
Instead of relying on periodic crawling or remembering to manually add a URL to a search index, Instant Indexing ensures that your search index is up-to-date and matches the content on your site.
Sajari will still regularly crawl your site every three to seven days if no instant updates are detected, and you still have the option to trigger a reindex with our crawler whenever you like. If you’re looking to index a larger site with more pages, Enterprise plans give the option to have your own dedicated indexing queue. This way you can set how often your site is crawled, what pages are included, and how many pages are indexed at any one time.
The bottom line, though, is that you need Instant Indexing on your site. We built Sajari to be as painless and automated as possible, because site search shouldn’t be a tool that needs constant work and maintenance to perform optimally.
Instant Indexing takes this manual configuration away to make sure new and updated pages are always included in your Collection, with zero work for you. Instant Indexing means your brand’s content is surfaced to users immediately - and any pirate worth their salt wants to find as much buried treasure as possible.