You add the best, most thoughtfully created content on your website, and nobody notices it. Not even the search engines!
Even the best content adds no value to your business unless a relevant audience sees it. Unless you are running ads to direct traffic to your content, your best bet for attracting traffic is for search engines to notice your content, index it, and ultimately rank it.
However, in a lot of cases, search engines overlook some amazing content on many websites, and that’s usually because of poor management of the crawl budget by the website owners.
What Is Crawl Budget?
In simple terms, the crawl budget of a website refers to the number of pages of that website that a search engine will crawl through in a specific amount of time. Search engines ‘crawl’ websites to understand the content on their pages in order to index them in the correct categories and rank them for the relevant keywords.
Having said that, owners of small websites usually don’t need to worry about their crawl budget. Google’s (and other search engines’) crawling abilities are so advanced that the crawl budget usually doesn’t become a problem unless a website has over 10,000 indexed pages.
Why do search engines need a crawl budget?
Search engines like Google are powered by immense computing power. They have an incredible amount of capability to crawl several websites at once. However, there are over 56 billion web pages indexed through the Google Search Engine. Crawling them all at the same frequency, with consistency, is simply not possible.
This is why search engines must assign crawl budgets to different websites.
How is the crawl budget assigned?
Search engines primarily use two factors while deciding the crawl budget:
- Host Load or Crawl Limit: This is the amount of crawling that a website can possibly handle, without messing with the user experience of the actual, human visitors. This factor also depends on the website owner’s preferences.
- Crawl Scheduling or Crawl Demand: Using crawl demand, the search engine decides which URLs are worth crawling or recrawling. This factor is usually dependent on the popularity of the web page and the frequency at which it is updated.
It is also worth noting that the crawl budget is not limited to web pages. We refer to web pages in this guide for ease of understanding. In reality, the crawl budget is spent on all sorts of files associated with web pages. These include JavaScript and CSS files, and even mobile page variants, hreflang variants and PDF files.
What is crawl limit/host load?
The crawl limit of a website is calculated by the search engines using a variety of factors. However, the performance of the website and the host of the website are the two most crucial factors here.
If the requested URLs of a website frequently return server errors or time out, then the crawl limit of the website is usually lowered. Similarly, if you have a large website that is hosted on a shared server, your crawl limit will be minimal, because the crawl limit is decided at the host level. In other words, all the websites on a shared server share the same crawl limit or host load.
This means that migrating your website to a dedicated server may significantly improve your crawl limit. It also means that if you have a desktop version of your website and a mobile version, and both are hosted on the same server, they will share one crawl limit.
What is crawl demand/crawl scheduling?
As mentioned earlier, crawl demand or crawl scheduling is about deciding whether or not a specific webpage is worth recrawling. Just as the crawl limit is decided based on a variety of factors, crawl demand also depends on multiple factors. Some of the most prominent ones are:
- The popularity of the webpage. If the page has a lot of external backlinks and internal links pointing to it, the search engine will assume that the page is an important one. Similarly, the popularity of the webpage is also measured by the number of queries that it is ranking for in the search results.
- The frequency with which content on the page is updated. Pages that have not been updated for a while don’t need to be recrawled and the crawl budget is better spent on other pages.
Besides the above-mentioned frequency, the likelihood of a page receiving an update is a factor that influences the crawl demand for a page. For instance, a terms and conditions page on a website is usually not expected to be updated in the near future. On the other hand, a product category page may be updated frequently. So, the category page will be higher in the priority list when deciding crawl scheduling.
By now, the importance of a crawl budget for a search engine should be clear. However, keeping track of the crawl budget (and optimizing it) is almost equally important from a website owner’s perspective.
Want to know why? Read on.
Crawl Budget From A Website Owner’s Perspective
Remember the scenario we discussed at the very beginning of this article?
When your crawl budget is not correctly optimized, the search engines may not pay the necessary attention to the right pages. As a result, your website may fail to attract traffic through your best content.
Another way a low or poorly optimized crawl budget hurts your website’s SEO performance is by delaying your progress. If the right pages are not prioritized, it may take a long time for search engines to notice new updates on your pages. As a result, the updates on your website may not even get indexed for a long time, let alone bring about more organic traffic and better rankings.
Simply put, a poorly optimized crawl budget will hurt the SEO performance of your website because it will prevent the search engines from efficiently and effectively crawling your website.
On the other end of the spectrum, simply optimizing your crawl budget and making sure you enable the search engines to maximize the budget allocated to you can lead to significant improvements in your website’s SEO performance.
This is especially true for large websites that have pages running in the thousands. Having said that, the same website optimization steps that lead to a maximized crawl budget will definitely help smaller websites improve their SEO performance.
So even if you don’t have a large, every-day-new-content-publishing website, hang on tight. The upcoming sections are going to be very relevant to SEO growth.
How To Check Crawl Budget?
The simplest way to check crawl budget is to head to your Google Search Console. Once you have logged in and chosen the concerned website, choose the “Crawl” option present in the left-hand side menu. In the drop-down menu, click on the “Crawl Stats” option. (In the current version of Search Console, the Crawl Stats report is found under “Settings” instead.)
You will be able to see the number of pages Google crawls on your website every day.
Alternatively, you can look at the server logs of your website. There, you can count the number of hits you are getting from the Google crawlers in a day.
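As a rough illustration of the server log approach, here is a minimal Python sketch that counts Googlebot requests per day. It assumes your server writes the common "combined" access log format and that the crawler identifies itself with "Googlebot" in the user agent; the sample lines and the function name are made up for the example. Keep in mind that user agents can be faked, so treat the numbers as an estimate.

```python
import re
from collections import Counter

# Matches the date inside a timestamp such as [10/Oct/2023:13:55:36 +0000].
# This assumes the combined access log format; adjust it for your server.
DATE_RE = re.compile(r"\[(\d{2}/[A-Za-z]{3}/\d{4}):")


def googlebot_hits_per_day(log_lines):
    """Count log lines per day whose text mentions Googlebot."""
    hits = Counter()
    for line in log_lines:
        if "Googlebot" not in line:
            continue
        match = DATE_RE.search(line)
        if match:
            hits[match.group(1)] += 1
    return hits


# Hypothetical sample lines, not taken from a real server.
sample_log = [
    '66.249.66.1 - - [10/Oct/2023:13:55:36 +0000] "GET /shoes HTTP/1.1" 200 5120 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.7 - - [10/Oct/2023:13:56:01 +0000] "GET /shoes HTTP/1.1" 200 5120 "-" '
    '"Mozilla/5.0 (Windows NT 10.0)"',
    '66.249.66.1 - - [11/Oct/2023:08:12:09 +0000] "GET /hats HTTP/1.1" 200 4096 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
]

for day, count in sorted(googlebot_hits_per_day(sample_log).items()):
    print(day, count)
```

In practice you would pass the lines of your real log file to the function instead of the sample list.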
Factors That Waste Crawl Budget
There are a number of reasons your website may be wasting its crawl budget. Take a look at the following list and make sure you are not making these mistakes:
Poor Internal Linking Structure
In one of the previous sections, we discussed how the popularity of a page influences its crawl demand. Looking at the number of links pointing to a page is one of Google’s favorite ways to measure the popularity of web pages. This is why your internal linking structure can have a huge influence over how your website is crawled.
In other words, webpages that have a lot of internal links pointing to them will be considered “important”, and will be crawled more often. Following the same logic, pages that have fewer links pointing to them will be given much less importance, and will be crawled much less often.
This is why, from a crawl budget perspective, it is not recommended to build a hierarchical internal linking structure. Such a structure is only able to place a few web pages at the top of the hierarchy. Most of the crawl budget is spent on this handful of pages, while even the pages directly below them in the hierarchy don’t get the attention they need to boost the SEO performance of the entire website.
Instead of following a hierarchy-based approach, try to maintain a “flat” website architecture. In other words, set up your internal linking in such a way that all of your important web pages have some internal links pointing to them.
Since recently crawled pages tend to perform better in search engine results, it is still a good idea to have more links pointed to your most important pages.
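To get a feel for how your internal links are distributed, you could count how many pages link to each URL. The sketch below is only an illustration: it assumes you already have a map of which page links to which (for example, exported from a crawling tool), and the `site` data and function names are hypothetical.

```python
from collections import Counter


def inbound_link_counts(link_graph):
    """Return how many distinct pages link to each page.

    link_graph maps a page URL to the list of internal URLs it links to.
    """
    counts = Counter({page: 0 for page in link_graph})
    for source, targets in link_graph.items():
        for target in set(targets):
            if target != source:  # a page linking to itself does not count
                counts[target] += 1
    return dict(counts)


def under_linked(link_graph, min_links=1):
    """List pages that receive fewer than min_links internal links."""
    counts = inbound_link_counts(link_graph)
    return sorted(page for page, n in counts.items() if n < min_links)


# Hypothetical site structure used only for illustration.
site = {
    "/": ["/shoes", "/hats", "/about"],
    "/shoes": ["/", "/shoes/red-sneakers"],
    "/hats": ["/", "/shoes"],
    "/about": ["/"],
    "/shoes/red-sneakers": ["/shoes"],
    "/old-guide": [],
}

print(inbound_link_counts(site))
print(under_linked(site))
```

Pages that show up in the `under_linked` list are good candidates for a few extra internal links, provided they are pages you actually want crawled.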
High Load Times
Ensuring lower load times for your pages is an underrated strategy for maximizing your crawl budget. Think about it: if even a handful of the pages on your website take a long time to load, the search engine will be able to crawl fewer pages in a given day.
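A quick back-of-the-envelope calculation shows the effect. The sketch below uses a deliberately simplified model that is our own assumption, not how any search engine actually schedules crawling: the crawler spends a fixed number of seconds on your site per day and fetches pages one after another.

```python
def pages_crawled(crawl_seconds_per_day, avg_response_seconds):
    """Estimate pages fetched per day under a simplified sequential model."""
    if avg_response_seconds <= 0:
        raise ValueError("avg_response_seconds must be positive")
    return int(crawl_seconds_per_day // avg_response_seconds)


# Assumed figure for illustration: 600 seconds of crawling per day.
daily_crawl_seconds = 600

print(pages_crawled(daily_crawl_seconds, 0.5))  # fast pages: 1200
print(pages_crawled(daily_crawl_seconds, 2.0))  # slow pages: 300
```

Under these assumptions, pages that respond four times slower get four times fewer crawls in the same time window.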
It is also important to remember that longer load times will also negatively influence your website’s overall SEO performance. Load time or speed is, after all, a known ranking factor.
Longer load times don’t just hurt your crawl budget; they also mess with the user experience on your website. A poor website user experience can result in lower conversion rates.
For these reasons, it is important for website owners and SEOs to keep track of the load times of different pages of the websites they are managing. To do this, you can use a variety of tools like GTmetrix and Pingdom.
This information is also available in the “Site Speed” section under the “Behavior” category in Google Analytics.
Just like slow-loading pages, pages that often time out can also mess with your crawl budget. You can check for them in the “Crawl Stats” section of your Google Search Console.
Incorrect URLs in XML Sitemap
Providing the search engine with a well-laid-out sitemap containing links to all of your important pages is a great way to help it spend your crawl budget efficiently.
If you have already created and submitted a sitemap, then the first course of action for you should be to check for XML sitemap issues in Google Search Console. You can find these issues listed in the “Crawl” section, under the “Sitemaps” tab.
If you are creating a sitemap from scratch, it is recommended that you break it down into smaller sitemaps, each dedicated to a section of your website. This way, making sure all important pages that need to be indexed are present in the XML sitemap will be easier.
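One possible way to split a sitemap by section is to group URLs by the first part of their path and build one small XML sitemap per group. The following Python sketch assumes that your sections map to the first path segment; the example URLs and function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from urllib.parse import urlparse

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def group_by_section(urls):
    """Group URLs by their first path segment ("root" for the home page)."""
    sections = defaultdict(list)
    for url in urls:
        path = urlparse(url).path.strip("/")
        section = path.split("/")[0] if path else "root"
        sections[section].append(url)
    return dict(sections)


def build_sitemap(urls):
    """Return a minimal XML sitemap string listing the given URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical URLs used only for illustration.
all_urls = [
    "https://example.com/",
    "https://example.com/shoes/red-sneakers",
    "https://example.com/shoes/blue-boots",
    "https://example.com/blog/crawl-budget",
]

for section, section_urls in group_by_section(all_urls).items():
    print(section, build_sitemap(section_urls))
```

Each generated string could then be saved as its own file, for example one sitemap for the shoes section and one for the blog.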
Lots Of Non-Indexable Pages
If a page is non-indexable, and the search engine is still spending time crawling it, then that’s a clear wastage of the crawl budget.
But what are non-indexable pages?
While these will obviously include pages with the noindex directive applied to them, there may be other pages that can be considered non-indexable, such as:
- Pages that can’t be found/accessed (4XX errors)
- Pages that show server errors (5XX errors)
Once you have made a list of all the non-indexable pages on your website, make sure you are removing all references to them. This includes getting rid of the links pointing to these pages, removing their URLs from the XML sitemap, and removing all Hreflang and pagination references.
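The cleanup described above can be sketched in a few lines of Python. This is only an illustration under some assumptions: you already have the status code and noindex flag for each page (for instance, from a site crawl), and the data structures and names below are invented for the example.

```python
def non_indexable(pages):
    """Return URLs that return 4XX/5XX errors or carry a noindex directive.

    pages maps a URL to {"status": int, "noindex": bool}.
    """
    return {
        url
        for url, info in pages.items()
        if info["status"] >= 400 or info["noindex"]
    }


def references_to_clean_up(pages, sitemap_urls, link_graph):
    """Find sitemap entries and internal links pointing at non-indexable pages."""
    bad = non_indexable(pages)
    in_sitemap = sorted(bad & set(sitemap_urls))
    linking = sorted(
        (source, target)
        for source, targets in link_graph.items()
        for target in targets
        if target in bad
    )
    return in_sitemap, linking


# Hypothetical crawl data used only for illustration.
pages = {
    "/": {"status": 200, "noindex": False},
    "/shoes": {"status": 200, "noindex": False},
    "/old-sale": {"status": 404, "noindex": False},
    "/cart": {"status": 200, "noindex": True},
    "/reports": {"status": 500, "noindex": False},
}
sitemap_urls = ["/", "/shoes", "/old-sale"]
link_graph = {"/": ["/shoes", "/cart"], "/shoes": ["/old-sale", "/"]}

print(references_to_clean_up(pages, sitemap_urls, link_graph))
```

The first list tells you which sitemap entries to remove, and the second tells you which links (source page, target page) to take out.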
Having said that, there is one more category of pages that can be considered non-indexable. Let’s learn about them in the next section.
Broken Links And Redirect Chains
It is not difficult to imagine how broken links that lead nowhere waste the search engines’ time and your website’s crawl budget. Clearly, these add to the number of non-indexable pages being crawled on your website’s crawl budget. Therefore, getting rid of broken links is one of the most basic steps one takes when it comes to optimizing the crawl budget.
However, redirects, despite often being just as harmful to the crawl budget as broken links, are often overlooked. If a redirect leads to a legitimate page that you want to get crawled, then there may not be much of a problem. However, long chains of redirects tend to waste the crawl budget because Google’s bots typically don’t follow more than five redirects.
Not to forget, chains of redirects often lead to longer page load times, which further wastes your crawl budget and negatively affects your website’s overall SEO performance. For these reasons, it is best to get rid of all instances of long chains of redirects on your website.
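If you have a list of your redirects (old URL to new URL), spotting long chains and loops is straightforward. Here is a minimal Python sketch; the `redirects` map is hypothetical, and the default limit of five hops simply mirrors the figure mentioned above rather than any official setting.

```python
def redirect_chain(start, redirects, max_hops=5):
    """Follow redirects from start and report how the chain ends.

    redirects maps a URL to the URL it redirects to.
    Returns (chain, status) where status is "ok", "loop" or "too_long".
    """
    chain = [start]
    seen = {start}
    current = start
    while current in redirects:
        current = redirects[current]
        chain.append(current)
        if current in seen:
            return chain, "loop"
        seen.add(current)
        if len(chain) - 1 > max_hops:
            return chain, "too_long"
    return chain, "ok"


# Hypothetical redirect rules used only for illustration.
redirects = {
    "/a": "/b",
    "/b": "/c",
    "/c": "/d",
    "/x": "/y",
    "/y": "/x",
}

print(redirect_chain("/a", redirects))  # three hops, ends at /d
print(redirect_chain("/x", redirects))  # loops back to /x
```

Any chain with more than one hop can usually be shortened by pointing the first URL straight at the final destination.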
Poor, Shallow Content
It seems that having high quality, genuinely valuable content on your website is important from almost every aspect of SEO, including even technical aspects like crawl budget.
Pages with poor or shallow content rarely rank well in search results. Crawl budget spent on such pages is wasted, because it takes away precious time that could have been spent crawling or recrawling pages that have great content and a real chance at ranking well.
It is important for website owners to understand that Google’s (and other search engines’) algorithms are smart enough to identify whether or not the updates made to a website’s pages are actually adding value.
In other words, trying to force the crawlers to repeatedly crawl your website by making trivial updates is not a good idea.
Instead, it is better to focus on adding more relevant content to your pages, content that adds more value to the information already present on those pages. As the scope of the information on that page grows, it will automatically start triggering responses for more search terms. Meaning, it will start gaining importance and popularity. As a result it will probably get crawled more often.
Accessible URLs With Parameters
URLs with parameters are usually generated to accommodate user actions, such as adding a filter to a search on an eCommerce website. Since these aren’t really new pages, crawling them makes no real difference and thus, the crawl budget is wasted on such pages.
Moreover, when search engines gain access to URLs with parameters, they can end up generating a virtually infinite number of URLs, further wasting precious crawling time. Having said that, using URLs with parameters is often important on the websites that actually use them.
That’s why it isn’t necessary to get rid of them, as long as you can prevent the search engines from gaining access to them. To ensure this, you can take the following two steps:
- Use the robots.txt file to instruct search engines not to crawl URLs with parameters. The parameter handling settings in Google Search Console were historically an alternative for specifying the URLs that you don’t want the search engine to crawl, but Google retired that tool in 2022.
- On the filter links, make use of the nofollow attribute (rel="nofollow"). This tells the search engine not to follow the link placed there.
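Before blocking anything, it helps to see how many parameter variations point at the same underlying page. The Python sketch below groups URLs by their address without the query string. It is only an illustration, and the example URLs and function names are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit


def strip_query(url):
    """Return the URL without its query string and fragment."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def group_parameter_urls(urls):
    """Map each base URL to the parameterized variants found for it."""
    groups = defaultdict(list)
    for url in urls:
        if urlsplit(url).query:
            groups[strip_query(url)].append(url)
    return dict(groups)


# Hypothetical crawled URLs used only for illustration.
crawled = [
    "https://example.com/shoes",
    "https://example.com/shoes?color=red",
    "https://example.com/shoes?color=red&size=9",
    "https://example.com/shoes?size=9&color=red",
    "https://example.com/hats?sort=price",
]

for base, variants in group_parameter_urls(crawled).items():
    print(base, len(variants))
```

Base URLs with many variants are the ones where blocking parameters is likely to save the most crawl budget.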
Duplicate Content
Duplicate content, if not handled correctly, can hurt your website’s SEO performance in more ways than one. Having search engine crawlers crawl through duplicate content will waste your crawl budget and will take away the necessary attention from your important pages that contain original content.
For this reason, it is recommended to minimize instances of duplicate content on your website. Duplicate content checkers will make the job easier and quicker.
For the duplicate content that cannot be removed, you can take the following steps to ensure the search engines are not crawling through it:
- Make internal search pages inaccessible to search engine crawlers by using the robots.txt file.
- If you have multiple domain variants (www and non-www variants), make sure you are redirecting all of them to a single preferred version.
- Prevent the creation of dedicated pages for images.
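If you want a quick first pass at finding duplicates yourself, you could compare a fingerprint of each page’s text. The Python sketch below is a simple illustration that only catches pages whose text is identical after normalizing whitespace and letter case; near-duplicates need more advanced techniques. The page texts and function names are hypothetical.

```python
import hashlib
import re
from collections import defaultdict


def content_fingerprint(text):
    """Hash the text after collapsing whitespace and lowercasing it."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicates(pages):
    """Return groups of URLs whose normalized text is identical.

    pages maps a URL to the visible text of that page.
    """
    by_hash = defaultdict(list)
    for url, text in pages.items():
        by_hash[content_fingerprint(text)].append(url)
    return [sorted(urls) for urls in by_hash.values() if len(urls) > 1]


# Hypothetical page texts used only for illustration.
page_texts = {
    "/shoes": "Red sneakers for   everyday wear.",
    "/shoes?ref=home": "red sneakers for everyday wear.",
    "/hats": "Wool hats for winter.",
}

print(find_duplicates(page_texts))
```

Each group in the result is a set of URLs serving the same content, so only one of them needs to be crawled and indexed.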
Taking the above-mentioned actions will help you maximize the crawl budget allotted to your website by the search engines. However, none of these tips will help you actually increase the crawl budget allotted to your website.
Is that even possible?
The short answer is ‘yes’. Continue to the next section of this guide for the long answer.
Increasing The Crawl Budget For Your Website
Since the search engines decide how they distribute the crawl budget between websites and between the specific pages of a website, website owners can only do so much to truly increase their crawl budget.
With that said, as a website owner, you aren’t completely helpless.
In an interview held over a decade ago, Matt Cutts of Google said the following:
“The best way to think about it is that the number of pages that we crawl is roughly proportional to your PageRank. So if you have a lot of incoming links on your root page, we’ll definitely crawl that. Then your root page may link to other pages, and those will get PageRank and we’ll crawl those as well. As you get deeper and deeper in your site, however, PageRank tends to decline.
Another way to think about it is that the low PageRank pages on your site are competing against a much larger pool of pages with the same or higher PageRank. There are a large number of pages on the web that have very little or close to zero PageRank. The pages that get linked to a lot tend to get discovered and crawled quite quickly. The lower PageRank pages are likely to be crawled not quite as often.”
While Google has discontinued publicly showing the PageRank of different webpages, the above excerpt makes it clear that domain and page authority are correlated with crawl budget.
In other words, domains and web pages with higher authority will get crawled more often.
Now, there are many ways to build this authority, with collecting backlinks from other relevant and high authority websites topping the list.
For more ways to increase your website’s authority, you can check out this list of SEO best practices.