Understanding Google Webmaster Tools 404 Errors

Posted on 9 May 2012 by Richard Falconer in Digital Marketing, Thoughts, Tips



Google Webmaster Tools is a great resource, but its error reports can be quite scary to site owners or SEOs who don’t understand them. This lack of understanding can lead to additional work for little or no benefit. One of the main areas where this happens is the “Not found” section. Returning an error code doesn’t necessarily mean there is an error – often a 404 is the correct response. Google reports all the URLs that it checks regardless of whether they are actually erroneous – that’s for you to decide.

[Image: the GWT “Mark as fixed” button]

Google recently updated the crawl errors section. Previously you could download all of the errors via the web interface but this function is now limited to the “Top” 1,000 errors. “Top” is in quotes as this is how Google describes these errors. In reality, Google tends to show lots of less important URLs here, often URLs which only exist because Googlebot invented them in error whilst trying to crawl form field options or JavaScript.

The standard SEO response to 404 errors in Webmaster Tools is often “301 redirect all the errors”. Sometimes the SEO doesn’t even bother to investigate the errors first. The idea that Google should never discover 404 pages on a site, or that when it does, the page should always be redirected, is absurd. Add to this the likelihood that redirecting a page to a completely different page won’t pass on the value of any inbound links, and it’s obvious that a single blanket solution is not going to reduce the errors to zero – not that zero errors is a realistic or worthwhile aim in the first place.

I suspect the bad advice of redirecting all 404 errors has its roots in something Matt Cutts said when Google started to report 404 errors back in October 2008. In an article entitled “Free Links to Your Site”, Matt wrote the following:

“I can’t believe a new feature from Google isn’t getting more notice, because it converts already-existing links to your site into much higher quality links, for free.”

He followed it up with:

“Some of the easiest links you’ll ever get are when people tried to link to you and just messed up.”

Over the years Matt has become accustomed to everything he says being misinterpreted by an industry that too often understands little beyond links=rankings. As a result he has become understandably more careful with his words. But I digress…

There is a school of thought that having a large number (or a large percentage) of 404 errors reduces Google’s perception of the quality of a site, resulting in lower rankings. It’s difficult to confirm whether or not this is the case. Bill Slawski wrote about a relevant IBM patent here, but Google has stated in the past that URLs returning 404 errors do not, in themselves, harm rankings, and I’ve personally seen a site’s reported 404 errors jump overnight to hundreds of thousands with no ranking changes. In Google’s words:

“If some URLs on your site 404, this fact alone does not hurt you or count against you in Google’s search results.”

Do 404s Hurt My Site?

Thinking about it from the search engine’s perspective, would it make sense for a site which is in every other way authoritative for a term to be ranked lower because part of the site has many low-value pages which are removed when they become out of date? I personally think it’s unlikely, simply because there are so many legitimate reasons this can happen – there would be too many false positives. However, there are plenty of other reasons to investigate 404 errors and make changes accordingly.

If you remove important pages, or the site goes offline regularly and displays 404s in the process, this will affect your site traffic. It may also affect rankings through the loss of inbound and internal links, and possibly through a decrease in trust in the site (although, as mentioned above, Google says 404s themselves are not a problem). Running the site consistently and ensuring a good user experience works from a search engine perspective too.

[Image: 404 errors flow chart]

Some tips:

  1. First things first – make sure your 404 pages are set up correctly. Make them friendly and useful and ensure they return a 404 HTTP status code, not a 200. Here’s an excellent tutorial on creating a more useful 404 page. (The first sketch after this list shows tips 1–3 together.)
  2. If or when the site must go offline for any reason, serve a 503 (Service Unavailable) response rather than a 404, so search engines know the outage is temporary.
  3. Add tracking code to your 404 page. This way you know exactly which missing pages are receiving traffic, allowing you to prioritise any redirects. You’ll probably find errors that Google isn’t aware of, too.
  4. Crawl your site regularly for broken links (internal and outbound) and fix them; there’s a minimal crawler sketch after this list.
  5. Set up a system to download errors from the Webmaster Tools API. Google provides PHP code to do this.
  6. Once downloaded, use a crawler such as Xenu or Screaming Frog to visit each of the errors, and remove any that return a 200 HTTP status from your list (see the re-check sketch after this list).
  7. You can use the “Mark as fixed” function to expedite the re-crawl of those pages, but this might be tricky since Google only provides 1,000 of the errors. There is a handy filter function which can help where a common problem has been fixed.
  8. Filter the list of URLs into different types and investigate each type. I tend to find they fall into the following basic categories, though you can probably find more categories and sub-categories (a rough categoriser sketch also follows this list):
    • Template / CMS issues – These are broken links generated by your own site. In extreme cases the problem can exist site-wide. The answer is generally to fix the links. It might also be worth implementing a rule to redirect the affected URLs, but this probably isn’t necessary: once the links are fixed, it’s unlikely any traffic or link value will flow through those URLs.
    • Malformed external links – Sort the URLs alphabetically and you’ll probably find links from external sites with whitespace characters at the end, or other strange characters and strings. Example: http://www.example.com/%20 These might be worth redirecting, depending on the linking sites or the traffic involved. You might be able to design some regex rules to cover the most common patterns.
    • Google’s invented URLs – Google will take strings of characters from option fields and append them to your domain; it’ll do the same with JavaScript (even the Google Analytics snippet) and will even try to crawl random URLs it finds on the web. Since forums, blogs etc. often shorten long URLs with an ellipsis, you might find URLs that look like http://www.example.com/stuff/an…things.aspx Once identified, these should be ignored. There is no need to redirect these types of errors or fix any code unless you are desperate to decrease the number of errors you see in Webmaster Tools. If that’s the case, you still shouldn’t ;-)
    • Retired URLs – Pages that no longer exist on the site. Again, prioritise these by traffic and link value before deciding whether it’s worth redirecting them. Also review what happens when pages are retired: are they simply dumped, with no consideration given to the user experience? Perhaps there’s something better that could be done, and that may be more important than constantly adding new redirects. The solution needs to be tailored to the particular problem, but there is generally some way to cut down on the number of users landing on 404 pages.
  9. Once you’ve identified the various types of problems, you are in a position to decide which ones to fix based on their importance (traffic, links, blah, blah), how easy they are to fix, and whether the fix scales across a large number of errors – a template problem, for example.
  10. Use the Pareto Principle, AKA the 80/20 rule. Most likely you can get most of the value from a small number of fixes; getting rid of every error on a large, authoritative, old site would be a huge job. (The final sketch below shows the idea.)
  11. Having completed this work, you will better understand why important errors are appearing and can introduce strategies to combat the problem, be it CMS editor training, adjustments to dynamic pages, changing the way out-of-stock products are handled, or a thousand other things.
  12. If your site has been rebuilt in the past you might find URLs that were never redirected. When building new sites or pages, put significant thought into the URL structures to avoid changes in future. See Tim Berners-Lee’s Cool URIs Don’t Change, which is as relevant now as it was in 1998 (and still lives at the same URI).
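
The sketches below are minimal illustrations in Python rather than the site’s own code – your stack could just as easily be PHP or anything else, so treat them as sketches of the ideas, not drop-in implementations. This first one assumes a hypothetical Flask app and shows tips 1–3 together: a friendly 404 template that still returns a genuine 404 status, server-side logging of the missing URL and referrer so redirects can be prioritised by real traffic, and a 503 with a Retry-After header for planned downtime. The template name, log path and MAINTENANCE_MODE flag are all made up for illustration.

```python
import datetime

from flask import Flask, render_template, request

app = Flask(__name__)
MAINTENANCE_MODE = False  # flip to True during planned downtime


@app.errorhandler(404)
def page_not_found(error):
    # Tip 3: log the missing URL and its referrer so redirects can be
    # prioritised by the traffic they actually receive.
    with open("404.log", "a") as log:
        log.write("%s\t%s\t%s\n" % (
            datetime.datetime.utcnow().isoformat(),
            request.url,
            request.referrer or "-",
        ))
    # Tip 1: a friendly, useful page, but with a genuine 404 status code.
    return render_template("404.html"), 404


@app.before_request
def maintenance_check():
    # Tip 2: during downtime answer 503, not 404, so crawlers treat the
    # outage as temporary; Retry-After hints at when to check back.
    if MAINTENANCE_MODE:
        return "Down for maintenance, back soon.", 503, {"Retry-After": "3600"}
```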
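
For tip 4, a toy crawler that checks the status of every link it finds but only follows links within the site itself. The start URL is a placeholder; a real job would also need politeness delays, robots.txt handling and so on, and a tool like Xenu or Screaming Frog does all of this for you.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

import requests

START = "http://www.example.com/"  # placeholder start page


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start, limit=500):
    seen, queue, broken = set(), [start], []
    while queue and len(seen) < limit:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            broken.append((url, "connection failed"))
            continue
        if response.status_code >= 400:
            broken.append((url, response.status_code))
            continue
        # Outbound links are checked above, but only follow internal ones.
        if urlparse(url).netloc == urlparse(start).netloc:
            parser = LinkParser()
            parser.feed(response.text)
            for link in parser.links:
                absolute = urljoin(url, link)
                if absolute.startswith(("http://", "https://")):
                    queue.append(absolute)
    return broken


for url, reason in crawl(START):
    print(reason, url)
```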
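
For tip 6, a sketch that re-checks a downloaded list of error URLs and keeps only those that still fail. The errors.txt filename is an assumption about how you exported the list.

```python
import requests

# errors.txt: one URL per line, exported from Webmaster Tools.
with open("errors.txt") as handle:
    urls = [line.strip() for line in handle if line.strip()]

still_broken = []
for url in urls:
    try:
        # HEAD reads the status without downloading the body; some
        # servers mishandle HEAD, in which case switch to GET.
        status = requests.head(url, allow_redirects=True, timeout=10).status_code
    except requests.RequestException:
        status = "unreachable"
    if status != 200:
        still_broken.append((url, status))

for url, status in still_broken:
    print(status, url)
```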
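
For tip 8, a rough categoriser along the lines described above. The patterns are illustrative guesses – trailing %20s, ellipses from truncated links, stray JavaScript punctuation – and will need tuning to the URLs you actually see.

```python
import re

# Crude patterns for the categories above; tune them to your own URLs.
RULES = [
    ("malformed external link", re.compile(r"(%20|\s)+$")),
    ("truncated/invented URL", re.compile(r"…|\.\.\.")),
    ("javascript fragment", re.compile(r"[+);'{]")),
]


def categorise(url):
    for label, pattern in RULES:
        if pattern.search(url):
            return label
    return "manual review (template issue or retired page?)"


with open("errors.txt") as handle:
    for line in handle:
        url = line.strip()
        if url:
            print(categorise(url), url, sep="\t")
```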
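
Finally, for tip 10, the Pareto idea in a few lines: sort the errors by the traffic they actually receive (joined from the 404 log in the first sketch, say) and stop once roughly 80% of the hits are covered. The figures here are invented for illustration.

```python
# (url, monthly 404 hits) pairs - invented numbers for illustration.
errors_with_hits = [
    ("/old-product", 900),
    ("/retired-guide", 450),
    ("/%20", 80),
    ("/stuff/an…things.aspx", 3),
]

total = sum(hits for _, hits in errors_with_hits)
covered = 0
for url, hits in sorted(errors_with_hits, key=lambda pair: pair[1], reverse=True):
    covered += hits
    print(hits, url)
    if covered >= 0.8 * total:
        break  # this handful of URLs accounts for ~80% of 404 traffic
```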

Richard Falconer
