    #1
  1. Philip@SearchBenefit.com
    SEO Chat Good Citizen (1000 - 1499 posts)

    Join Date
    Oct 2009
    Location
    Massachusetts, USA
    Posts
    1,388
    Rep Power
    1010

    Google Crawling and Indexation 101


    Last updated November 16, 2013.

    What Are Google Crawling and Indexing?
    Google finds, explores, stores and sorts all the indexable pages of the Web to make them findable through search. The process of discovery and exploration is called crawling. Google uses several robust programs called bots, or robots, to crawl the whole web. The main one is Googlebot, but there are other major ones, including the Google blog bot. The process of storing the web pages for referencing in search, and sorting them in some appropriate order, is called indexing.

    Why Do They Matter?
    If your site is not crawled properly or if its pages aren't indexed, you will be unfindable on Google. An insufficient Google crawl rate and incomplete indexation are the scourge of many websites, especially large and new ones. In this forum and others, many members report indexation issues and ask how to solve them. The same is the case with SEO clients. My advice here is focused on Google, but the same or very similar general principles apply to crawling and indexation by Bing and Yahoo! as well.

    Issues, Best Practices, Troubleshooting
    First of all, Google indexation is hard to measure for a large site. There can be false alarms having to do with people using Google's site: operator, which is supposed to report the site's indexation count. It works well for small sites but is wildly unreliable for large ones and tends to severely underreport the count. Webmaster Tools is better for this, but it can also be unreliable. If your site is enormous, there is simply no certain way of knowing how many pages Google has indexed. For additional helpful data, check Google Analytics to see the total number of pages that have received visits. I also recommend that you manually run cache: checks of all your most important pages and of various random secondary pages to get a further idea of how your site is doing on the indexation front.
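    If you want to semi-automate those manual checks, here is a minimal Python sketch (my own illustration, not an official tool) that simply builds the cache: and site: query URLs for a hand-maintained list of important pages so you can open and eyeball them one by one. The file name important-urls.txt is a made-up placeholder.

    Code:
    # check_index.py -- build Google cache:/site: query URLs for a list of pages.
    # "important-urls.txt" is a hypothetical file with one URL per line.
    from urllib.parse import quote_plus

    def google_query(q):
        """Return a Google search URL for an arbitrary query string."""
        return "https://www.google.com/search?q=" + quote_plus(q)

    with open("important-urls.txt") as f:
        urls = [line.strip() for line in f if line.strip()]

    for url in urls:
        print("cache check:", google_query("cache:" + url))
        print("index check:", google_query("site:" + url))

    Open the printed URLs in your browser; a missing cache result for an important page is worth investigating.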

    The Google crawl rate cannot be reliably controlled, but it can be influenced by the following positive factors (listed here roughly in descending order of importance).
    Domain importance. Google's Matt Cutts, interviewed by Eric Enge, has acknowledged that your site's crawl rate and depth of crawling are roughly proportional to PR. SEOs have long known this.
    Backlinks. PR is computed based on backlinks, which are absolutely central to indexation. If a site's page count is growing fast but the site is not earning enough new links, this may suggest to Google that the content is of low quality (guaranteed to reduce your crawl and indexation rates).
    Deep Linking. Backlinks to individual pages (so-called "deep linking") are an effective way to ensure the indexation of those pages and their retention in the main Google index (as distinct from the supplementary index). Internal links to the same pages also help. Make sure that at least your most important pages get enough of both kinds of links. These need to be followed links (i.e. they should not carry the rel="nofollow" attribute).
    Site navigation and hierarchy. To the extent possible, a flat site hierarchy should be used. (An exemplary illustration is fanbase.com, with all the main categories appearing in the top-level navigation, enabling quick drilldown to individual pages.) This means (a) as few subdomains, subfolders and subdirectories as possible and (b) that all important pages must be reachable via the fewest clicks possible from the home page (more than 3-4 clicks is problematic).
    XML sitemaps. This is a must. Here is one good tool -- xml-sitemaps.com -- for generating sitemaps; there are others too, and a minimal generator sketch appears after this list. Submit your sitemaps to the search engines via webmaster tools. Further notes:
    o Sitemaps generally support <changefreq> and <priority> elements, whose use may influence the crawl, although the impact is likely to be minor.
    o Check WMT for sitemap errors and fix them.
    o Michael Gray has recommended creating small sitemaps (of 100 pages or less) to supplement your regular sitemaps and help get new content indexed faster. He has found using a dedicated sitemap for fresh content to be highly effective.
    o In addition to sitemaps, you can use the "Fetch as Googlebot" feature in Google Webmaster Tools: its effect on indexing can be similar to that of submitting a sitemap.
    Duplicate content reduction. In general, duplicate content on a site is not a significant problem and does not entail "Google penalties" even after the Panda updates, unless that duplicate content is spammy. Nevertheless, you should maintain a healthy economy and minimize duplicate content on your site. Especially on very large sites, high-volume duplicate content (identical pages sitting under different URLs) can confuse Google and impede proper indexing. One classic example of duplication occurs under different forms of site URLs: those that include the www. subdomain and those that don't (e.g. www.example.com/file1.html and example.com/file1.html typically have the same content). The way to handle this and other kinds of duplication is via some form of URL canonicalization (see next item).
    URL canonicalization means creating a single SEO-friendly and user-friendly URL for each page and letting Google know that that URL is canonical. The SEO reasons for canonicalization are various and go beyond indexation issues: (1) Google, in spite of occasional denials, may assign less importance to pages whose URLs contain extra slashes (subdirectories); (2) Google may sometimes have difficulties with URLs that are parameter-laden; (3) long, ugly URLs are a turnoff for site visitors; (4) a clear, well-structured, consistent URL convention is best for the user, for branding and for SEO; (5) canonicalization consolidates PageRank and link equity to the canonical version of the page, giving it a better chance to rank. Depending on your platform, various rewrite engines (see en.wikipedia.org/wiki/Rewrite_engine) can be used to automate the rewriting of URLs from "ugly" into friendly ones. URL canonicalization can be performed in any of three ways:
    o 301-redirect ("moved permanently") all duplicate URLs to the canonical one. IMHO this is the most reliable method of canonicalization, but it may have certain overheads.
    o rel="canonical": Place a link of the form <link rel="canonical" href="http://example.com/canonical-url-example.html"> at the end of the <head> of each duplicate page. (Yes, it's OK for the canonical version to include this link to itself; and no, there is no limit on how many canonical links you can have.)
    o "Display URLs as": the effect of this setting in the Google Webmaster Tools is similar to that of rel="canonical" and is the easiest option if you prefer not to write any code.
    URL stability and page uniqueness. While the issues surrounding duplicate content are fairly well known, one potential problem that is rarely discussed is the opposite. The term I have coined for it is multitasking URLs. Some applications may display different dynamically generated content under the same URL (for example, content specific to the user's geographical location). Additionally, the title tags for such pages may also be generated on the fly and contradict one another. I have seen this lead to a variety of indexation and search issues. For best results, the content of each page, whether dynamic or static, must be unique and must appear under its own proper, unique and stable URL and title tag.
    Unique title tags. If you use the same title tags across multiple pages, Google may assume that those pages are duplicate and be reluctant to index them. Make your titles unique.
    Manual crawl rate setting. Google's Webmaster Tools offer a choice between letting Google determine the crawl rate automatically and setting it manually via a slider. Although setting it manually to the maximum is unlikely to boost the crawl rate dramatically, it may bring about a marginal improvement.
    Original content. It's good for all your important pages to have significant and unique original content.
    Updates, feeds, pinging. Frequent content updates both site-wide and on individual pages can significantly improve the crawl rate. Further, exporting RSS feeds and implementing automated search engine pinging have a beneficial effect. Pinging resources include pingomatic.com and pingler.com.
    Social Media. Links from social media, although they are nofollow, help Google discover and index new content. Including sharing buttons on your pages and promoting them on social media sites can help get your pages into the index faster.
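    Since a few of the points above come down to telling Google exactly which URLs exist, here is the sitemap sketch mentioned in the XML sitemaps item: a minimal Python illustration of the sitemap file format with <changefreq> and <priority> filled in. The example.com URLs and values are placeholders; a real generator (xml-sitemaps.com, a CMS plugin, etc.) will crawl your site for you.

    Code:
    # make_sitemap.py -- write a minimal sitemap.xml for a hand-maintained URL list.
    # All URLs, change frequencies and priorities below are placeholders.
    from datetime import date

    urls = [
        # (loc, changefreq, priority)
        ("http://www.example.com/", "daily", "1.0"),
        ("http://www.example.com/widgets/", "weekly", "0.8"),
        ("http://www.example.com/widgets/blue-widget.html", "monthly", "0.5"),
    ]

    today = date.today().isoformat()
    entries = []
    for loc, changefreq, priority in urls:
        entries.append(
            "  <url>\n"
            "    <loc>%s</loc>\n"
            "    <lastmod>%s</lastmod>\n"
            "    <changefreq>%s</changefreq>\n"
            "    <priority>%s</priority>\n"
            "  </url>" % (loc, today, changefreq, priority)
        )

    sitemap = (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        + "\n".join(entries)
        + "\n</urlset>\n"
    )

    with open("sitemap.xml", "w") as f:
        f.write(sitemap)
    print("Wrote sitemap.xml with %d URLs" % len(urls))

    Submit the resulting file via Webmaster Tools as described above.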

    Technical Note
    The most important update to Google's indexing system is Google Caffeine, previewed in August 2009 and fully rolled out on June 8, 2010. It replaced the old multi-layered, largely static index with a system that crawls and indexes the web dynamically, in manageable segments, practically in real time. Caffeine also started paying attention to signals from Facebook and Twitter.

    FURTHER DETAIL
    I have dated most of the sources below. As far as I know, the information in them is still current and accurate. If you find new relevant information, please let me know and I'll update this post.

    Google's official explanation of crawling and indexing basics.

    Google's official tips for troubleshooting crawling and indexing.
    support.google.com/webmasters/answer/34441?hl=en

    Additional reading. Eric Enge (a well-known SEO) interviews Matt Cutts on related matters.
    stonetemple.com/articles/interview-matt-cutts-012510.shtml
    (March 2010 but the info is still current.)

    Matt Cutts covers the basics of crawling and indexing here and goes into some interesting details:
    youtube.com/watch?v=KyCYyoGusqs
    (April 2012)

    It's a common myth that you must disallow the crawling of your scripts and CSS. Matt Cutts says don't disallow the crawling of your JavaScript and CSS:
    youtube.com/watch?v=LW3pjQeCqqk
    (March 2013)

    Google officially announced on Nov 1, 2013 that it is starting to index mobile applications just like websites.
    googlewebmastercentral.blogspot.com/2013/10/indexing-apps-just-like-websites.html

    And here is a Matt Cutts video in which he covers some aspects of mobile indexing and puts to rest worries about "duplicate content" arising from the current state of mobile indexing.
    youtube.com/watch?v=mY9h3G8Lv4k

    Here he goes over the basics of crawling and covers details of the Google cache date:
    youtube.com/watch?v=8lmZS7TknQc
    (April 2011)

    Matt Cutts explores the importance of video indexing and video sitemaps.
    youtube.com/watch?v=NMe4qStOJyU

    When quoting from this sticky, please refer to the source.

    Comments on this post

    • prasunsen agrees : Good list
    • jsteele823 agrees
    • Jesus Nofollow agrees : didn't read, pre-emptive postive rep
    • Lb1878 agrees
    • pagi agrees : out of rep for you but nice info and a very manageable read!
    • googler agrees : Another great contribution!
    • himanshu160 agrees
    • manray agrees
    • seomonkeymanocp agrees
    • nichita2008 agrees : Great !
    • Lord Quas agrees : Very useful. Thanks for sharing.
    • ChillDot agrees
    • seoitc agrees : thanks!
    • number8pie agrees
    Last edited by PhilipSEO; Nov 16th, 2013 at 07:49 PM.
  2. #2
  3. Super Moderator
    SEO Chat Super Genius (4500 - 4999 posts)

    Join Date
    May 2007
    Posts
    4,503
    Rep Power
    1923
    Originally Posted by PhilipSEO
    @Fathom & @JSteele: this is as short as I think it should be, how'd I do?
    Better, but I'd cut out a few points that only have marginal effect.

    In fact, I'd cut it down to:

    If your site has indexation problems:
    1) Shoot for higher-quality backlinks.
    2) Double-check duplicate content.
    3) Follow Google's Webmaster Guidelines.

    But that may be too simple.
  4. #3
  5. Contributing User
    SEO Chat Adventurer (500 - 999 posts)

    Join Date
    Sep 2009
    Location
    Rotterdam, The Netherlands
    Posts
    992
    Rep Power
    327
    Sitemap:

    You can also submit an RSS feed with your latest products, which Google will convert into an XML sitemap. Combined with pinging, this will make sure Google knows about your newest content. Michael Gray is right as far as I can tell.
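    For anyone who wants to see what such a feed looks like, here is a minimal Python sketch that writes an RSS 2.0 file of the newest products. The shop name, URLs and products are made-up placeholders, not anything from this thread.

    Code:
    # latest_products_feed.py -- minimal RSS 2.0 feed of the newest products.
    # All names and URLs below are placeholders.
    from email.utils import formatdate

    products = [
        ("Blue Widget", "http://www.example.com/widgets/blue-widget.html"),
        ("Red Widget", "http://www.example.com/widgets/red-widget.html"),
    ]

    items = ""
    for title, link in products:
        items += (
            "    <item>\n"
            "      <title>%s</title>\n"
            "      <link>%s</link>\n"
            "      <pubDate>%s</pubDate>\n"
            "    </item>\n" % (title, link, formatdate(usegmt=True))
        )

    feed = (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<rss version="2.0">\n'
        '  <channel>\n'
        '    <title>Example Shop - New Products</title>\n'
        '    <link>http://www.example.com/</link>\n'
        '    <description>Latest products</description>\n'
        + items
        + '  </channel>\n'
        '</rss>\n'
    )

    with open("latest-products.xml", "w") as f:
        f.write(feed)

    Submit the feed URL in Webmaster Tools just like a sitemap, then ping whenever it changes.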

    Internal linking:

    Homepage -> List new products/articles

    Category page -> List new products/articles in that category

    Product/article page -> List related products/articles

    Use your stronger pages to link to fresh or deeper content, provided it is relevant.

    General:
    Identify poor content. Poor content can be internal search results, tag pages, or archives with dated content; make sure your site architecture reflects this (either noindex such pages, or give them far fewer links and less prominence).

    Short url structure notes:
    I'd really recommend working with at least one category, especially for larger sites. You can keep category pages in the site navigation for a little prominence. With an ultraflat site:

    site.com/article-title

    It is harder IMO to get a great structure and a way to list all articles. If you list your articles at:

    site.com/blog/

    It would be nonsense IMO to put the articles at the root.

    site.com/blog/article-title

    is flat enough, while still letting users peel back the URL and see the structure.

    If you add a breadcrumb (always a good idea), IMO it should reflect the URL structure: Home » Blog » Article Title.
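    Here is a tiny Python sketch of that idea (purely illustrative, assuming URLs shaped like site.com/blog/article-title): derive the breadcrumb trail from the URL path so the two can never drift apart.

    Code:
    # breadcrumb.py -- derive the breadcrumb trail from the URL path.
    from urllib.parse import urlparse

    def breadcrumb(url, home_label="Home"):
        parts = [p for p in urlparse(url).path.strip("/").split("/") if p]
        trail = [home_label]
        for part in parts:
            # turn "article-title" into "Article Title"
            trail.append(part.replace("-", " ").title())
        return " » ".join(trail)

    print(breadcrumb("http://site.com/blog/article-title"))
    # -> Home » Blog » Article Title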

    Using categories you can also put the 5 latest products from that category in the main navigation dropdown, ensuring sitewide prominence of new products:

    | Smartphone |
    Google Nexus 1
    Apple Iphone
    HTC Desire
    Blackberry Bold
    Samsung LG
    ...<a>More Smartphones</a>...

    Remove Broken Links

    Google might halt a crawl if it happens upon broken links.

    Xenu's Link Sleuth can find your broken links and generate a report.
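    Xenu is the proper tool for this; purely as an illustration of the idea, here is a minimal Python sketch that fetches one page, pulls out its links and reports anything that doesn't answer with a 200. It assumes the third-party requests package, and the start URL is a placeholder.

    Code:
    # broken_links.py -- minimal single-page broken-link report.
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    import requests

    start_url = "http://www.example.com/"  # placeholder

    class LinkCollector(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                href = dict(attrs).get("href")
                if href and not href.startswith(("#", "mailto:", "javascript:")):
                    self.links.append(urljoin(start_url, href))

    page = requests.get(start_url, timeout=10)
    collector = LinkCollector()
    collector.feed(page.text)

    for link in sorted(set(collector.links)):
        try:
            status = requests.head(link, allow_redirects=True, timeout=10).status_code
        except requests.RequestException:
            status = "error"
        if status != 200:
            print(status, link)

    A real checker would recurse over the whole site and respect robots.txt, which is exactly what Xenu does for you.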

    Dynamic fresh unique content
    If Google revisits (parts of) your site and no content has changed, the revisit frequency will incrementally lower to reflect this. If Google happens upon lower-quality copied or rewritten content, it might crawl further, but it will put those pages in the secondary index.

    Provide unique content and keep it up to date, to force Google to revisit more often.

    Crawl rate setting
    As far as I can tell this setting is useless for getting a deeper crawl or a higher revisit frequency. Setting the value higher than your server can handle (Googlebot might crawl thousands of pages, movies and images in a few minutes) can cause your server to go down or slow to a very annoying speed.

    Page loading/server speed
    Don't make Googlebot wait 5+ seconds to index a page. The bigger the site, the more server resources are required to provide a smooth crawl and user experience.
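    If you just want a rough number for that, here is a small standard-library Python sketch that times a full download of a few key pages and flags anything over the 5-second mark mentioned above. The URLs are placeholders.

    Code:
    # crawl_speed.py -- rough response-time check for a few key URLs.
    import time
    import urllib.request

    urls = [
        "http://www.example.com/",
        "http://www.example.com/widgets/",
    ]

    for url in urls:
        start = time.time()
        with urllib.request.urlopen(url, timeout=30) as resp:
            resp.read()  # download the full page body
        elapsed = time.time() - start
        flag = "  <-- slow, investigate" if elapsed > 5 else ""
        print("%.2fs  %s%s" % (elapsed, url, flag))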

    Pagination and listings
    For listing products on a category page, remember that more is better. If you list 12 products per page, that's a lot more pages Google has to crawl and index than if you put 30+ products on a page.

    For pagination, an increased listing should lower the number of pages. Write out all the pages in the pagination, not

    [1] 2 3 ... 15 Next > Last >

    but

    [1] 2 3 4 5 6

    The first way may seem like smart sculpting, but listing every page makes things easier for Googlebot and allows for a deeper crawl.

    Comments on this post

    • Lb1878 agrees : Nice supplimental post
    • PhilipSEO agrees : excellent addenda!
    • aslamyasfeen agrees : Its good post and also i more need this type information right now
    Last edited by Jesus Nofollow; May 16th, 2010 at 01:18 PM.

    Following these recommendations should increase the likelihood that your site will show up consistently in the search results.
  6. #4
  7. Dancin with the devil
    SEO Chat Hero (2000 - 2499 posts)

    Join Date
    Mar 2007
    Location
    USA
    Posts
    2,301
    Rep Power
    515
    A rare look "behind the scenes" of a website. You don't see that too often around here as a lot of focus is often put on external/off-site practices. Well done and thanks!

    One thing about the amount of products on a page... I thought it was better practice to keep the pages shorter for your visitors as they may not make it to the end of a long page? I also thought that the same went for search engines but it does make sense to have a longer page vs a multi-page as the risk seems higher that you'll lose the bot by making it follow additional pages.

    Comments on this post

    • Maximlis agrees
    Last edited by Lb1878; May 16th, 2010 at 04:44 PM.
    "Give a man a fish and you feed him for a day. Teach a man to fish and you feed him for a lifetime." Chinese Proverb
  8. #5
  9. Contributing User
    SEO Chat Adventurer (500 - 999 posts)

    Join Date
    Sep 2009
    Location
    Rotterdam, The Netherlands
    Posts
    992
    Rep Power
    327
    Originally Posted by Lb1878
    One thing about the amount of products on a page... I thought it was better practice to keep the pages shorter for your visitors as they may not make it to the end of a long page?
    This might be a usability issue. As far as I can tell, for conversion you give as much information as is needed for a purchase. Good, effective landing pages can be very long (Amazon Kindle, SEOmoz subscribe).
    I also thought that the same went for search engines but it does make sense to have a longer page vs a multi-page as the risk seems higher that you'll lose the bot by making it follow additional pages.
    It is also:

    if you don't canonicalize your category product listings, you have pagination like:

    /widgets/
    /widgets/2/
    /widgets/3/

    /widgets/ is always much stronger than /widgets/2/ in a good site architecture. If you only list 12 products on a fast-growing site, then you are promoting too few product pages with your strongest page. On larger sites these category pages are strong, and throwing that away on listing just 12 products makes for a worse flow of link equity than there could be.

    http://www.seomoz.org/blog/whiteboard-friday-a-farewell-to-pagination

    http://www.seomoz.org/blog/pagination-best-practices-for-seo-user-experience
  10. #6
  11. Dancin with the devil
    SEO Chat Hero (2000 - 2499 posts)

    Join Date
    Mar 2007
    Location
    USA
    Posts
    2,301
    Rep Power
    515
    Makes sense, point taken. Thanks for posting and giving additional references. I'll be reading that later.

    Comments on this post

    • Maximlis agrees : Helpful
    Last edited by Lb1878; May 17th, 2010 at 03:31 PM.
  12. #7
  13. Philip@SearchBenefit.com
    SEO Chat Good Citizen (1000 - 1499 posts)

    Join Date
    Oct 2009
    Location
    Massachusetts, USA
    Posts
    1,388
    Rep Power
    1010
    Jesus Nofollow, why do I get the nagging feeling that even between the two of us we have missed something? ;)
  14. #8
  15. No Profile Picture
    Registered User
    SEO Chat Explorer (0 - 99 posts)

    Join Date
    May 2010
    Posts
    4
    Rep Power
    0
    I thought Google doesn't crawl nofollow links? How is this possible?
  16. #9
  17. Philip@SearchBenefit.com
    SEO Chat Good Citizen (1000 - 1499 posts)

    Join Date
    Oct 2009
    Location
    Massachusetts, USA
    Posts
    1,388
    Rep Power
    1010
    Originally Posted by rafderamas
    I thought Google doesn't crawl nofollow links? How is this possible?
    It's a myth; nofollow links are followed just fine, they simply don't pass any link juice. That's all there is to it.

    Comments on this post

    • Nickfb76 agrees : beet me too it by a min - great info Philip!
    • manray agrees
  18. #10
  19. Contributing User
    SEO Chat Adventurer (500 - 999 posts)

    Join Date
    Jan 2009
    Posts
    809
    Rep Power
    232
    Originally Posted by rafderamas
    I thought Google doesn't crawl nofollow links? How is this possible?
    Google says that they don't pass link value or "authority" through links with a nofollow attribute.

    Comments on this post

    • Maximlis agrees
    Follow me on Twitter for SEO tips @NickLeRoy
  20. #11
  21. Contributing User
    SEO Chat Adventurer (500 - 999 posts)

    Join Date
    Mar 2005
    Posts
    869
    Rep Power
    368
    It's good to see how many of these factors are directly within a webmaster's control. I don't think of domain importance/PR and backlinks as "directly" within a webmaster's control because you really can't force people to link to you; all you can do is post great content and hope the links follow. But practically everything else is directly within the hands of whoever has the authority to make the appropriate changes to the site.

    So if Google isn't crawling you as well as you'd like, you can at least go through this checklist to determine whether it's your own fault, and fix it.
  22. #12
  23. Extremely Googled
    SEO Chat Good Citizen (1000 - 1499 posts)

    Join Date
    Dec 2004
    Posts
    1,029
    Rep Power
    496
    Originally Posted by biggig
    The thing with my website is that I see only very few pages actually getting crawled. Can this be a problem with my site structure? My web URL is kayaclinic.com, and I have just redirected my homepage to index.aspx; the redirect is a meta refresh. Can that be a problem?
    I hate to say it but you've done this entirely wrong.

    1. Google does not follow a meta refresh. The preferred method is to have your web server return a 301 redirect (a quick way to check what yours actually returns is sketched below).
    2. Unless there's some specific reason to do so, I tend to leave the URL as http://www.domain.com instead of redirecting to index. This makes things neater.
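    To see for yourself what a redirect actually returns, here is a minimal Python sketch (assuming the third-party requests package; the URL is a placeholder, not the poster's site). A proper server-side redirect shows up as a 301/302 with a Location header, while a meta refresh shows up as a plain 200 with the refresh tag buried in the HTML.

    Code:
    # redirect_check.py -- real 301 or just a meta refresh?
    import requests

    url = "http://www.example.com/"  # placeholder
    resp = requests.get(url, allow_redirects=False, timeout=10)

    print("Status:", resp.status_code)
    print("Location header:", resp.headers.get("Location"))

    if resp.status_code in (301, 302):
        print("Server-side redirect - fine for Googlebot.")
    elif 'http-equiv="refresh"' in resp.text.lower():
        print("Meta refresh only - replace it with a server-side 301.")
    else:
        print("No redirect detected.")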

    Originally Posted by PhillipSEO
    I also recommend that you manually run cache: checks of all your most important pages and of various random secondary pages to get a further idea of how your site is doing on the indexation front.
    One tip on this: If you have the Google Toolbar and have enabled the PageRank meter, you can visit your page, click the meter and find cache: right there. I've always found this handy.

    Domain importance. Google’s Matt Cutts has recently admitted, interviewed by Eric Enge, that your site's crawl rate and depth of crawling are roughly proportional to PR. SEOs have long known this.
    Backlinks. PR is computed based on backlinks, which are absolutely central to indexation. If a site's page count is growing fast but the site is not earning enough new links, this may suggest to Google that the content is of low quality (guaranteed reduce your crawl and indexation rates).
    Deep Linking. Backlinks to individual pages (so-called "deep linking") are an effective way to ensure the indexation of those pages and their keep in the main Google index (as distinct from the supplementary index). Internal links to the same pages also help. Make sure that at least your most important pages get enough of both kinds of links. These need to be followed links (i.e. they should not contain the rel="nofollow" attribute).
    I spotted a typo (in red in quote).
    I've always seen links as a command to the bot to come crawl your page (not your site, just that page). While the bot is not obliged to do this in a 1:1 fashion, viewing links as a request to crawl your page, and not your site, means that you tend to view deep linking as more essential since it's likely that your deep pages are only linked in the structure of your site and not anywhere else (remember, internal linking structure is just as important as external). I hope that helps.

    Comments on this post

    • PhilipSEO agrees : thanks and i agree (& typo fixed)
  24. #13
  25. No Profile Picture
    Registered User
    SEO Chat Explorer (0 - 99 posts)

    Join Date
    Mar 2010
    Posts
    9
    Rep Power
    0
    "Use your stronger pages to link to fresh or deeper content, albeit relevant."


    I don't really understand this sentence. What does "fresh" mean? Does it mean news or articles that are updated every day, or the new product pages?
  26. #14
  27. No Profile Picture
    SBR
    Contributing User
    SEO Chat Discoverer (100 - 499 posts)

    Join Date
    May 2010
    Location
    Boca Raton, Florida
    Posts
    253
    Rep Power
    54
    Originally Posted by dylansmith
    "Use your stronger pages to link to fresh or deeper content, albeit relevant."


    I don't really understand this sentence. What does "fresh" mean? Does it mean news or articles that are updated every day, or the new product pages?
    Fresh in this context refers to the newest content.

    In other words your best pages should link to your newest pages.

    Actually I think this is backwards, your newest stuff should reference your older material.
    Florida Internet Marketing

    Google Maps SEO and Local Optimization Guide

    “I don't know what this means. I don't think it means anything.”
    ~Eddie Vedder
  28. #15
  29. Super Moderator
    SEO Chat Super Genius (4500 - 4999 posts)

    Join Date
    May 2007
    Posts
    4,503
    Rep Power
    1923
    The "best pages" linking to the "newer pages" is to help them be crawled and indexed faster.

    Comments on this post

    • Devjeet agrees
    • Zagek agrees
    • VineetKumar agrees