There has been a lot of discussion lately (especially since Google's last webmaster conference call) about noindex pages getting indexed anyway. The cause is that the pages are also blocked by robots.txt, so the bots never fetch them and never see the meta noindex tag in the page.
For a little background, Matt Cutts explains in comments here
http://www.mattcutts.com/blog/noind...#comment-135566
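For anyone who wants to see the mechanics, here is a rough Python sketch of the conflict. It models a strictly robots.txt-compliant crawler, not how any real engine works. The robots.txt rules, the sample page, the example.com URL, the "ExampleBot" user agent, and the crawl_decision helper are all made up for illustration.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks everything, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

# Hypothetical page carrying a meta noindex.
PAGE_HTML = """\
<html><head>
<title>Example</title>
<meta name="robots" content="noindex">
</head><body>Hello</body></html>
"""


class RobotsMetaParser(HTMLParser):
    """Sets self.noindex if a <meta name="robots"> tag contains noindex."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        directives = [d.strip().lower() for d in (attr.get("content") or "").split(",")]
        if "noindex" in directives:
            self.noindex = True


def crawl_decision(robots_txt, url, html, user_agent="ExampleBot"):
    """Describe what a strictly compliant crawler would learn about url."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    if not rules.can_fetch(user_agent, url):
        # The page body is never downloaded, so the meta tag is invisible.
        return "blocked: page never fetched, noindex never seen"
    parser = RobotsMetaParser()
    parser.feed(html)
    return "fetched: noindex honored" if parser.noindex else "fetched: indexable"


print(crawl_decision(ROBOTS_TXT, "http://example.com/", PAGE_HTML))
print(crawl_decision("", "http://example.com/", PAGE_HTML))
```

With the blocking rules the crawler stops before it ever reads the page, so the noindex has no effect. With an empty robots.txt the page is fetched and the noindex is seen. The URL can still end up listed in the first case if the engine learns about it from links elsewhere.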
I've done some research into the matter and have come up with a theory on how this is handled by G (as well as Y and MSN). I posted on the blog, but thought I'd share here as well.
At delicious.com there is a meta noindex on the home page, but all files are blocked by robots.txt. So if you do a search on Google for “delicious”, the new home page shows up, but with no snippet or meta description, because the page wasn’t crawled. There is a page title associated with it, though not because it actually is the page title. It’s there because it’s the domain name and it is used in anchor text. (Google will attribute a page title to any page that doesn’t have one, and it’s often the name of the root.)
If you search for “delicious” on MSN, the page doesn’t come up, so it appears that MSN is accessing the file (despite what robots.txt says) and finding the noindex meta.
Yahoo! gives a page title and a description, but the description isn’t in the meta description or found anywhere else on the page. It’s pulled from the Yahoo! directory listing of the site. Yahoo! tends to assign Y directory data to pages that don’t have a description of their own (and often even when they do). So it would appear that Y also follows the robots.txt directive. Of course delicious.com is a Yahoo! property, so you could draw a different conclusion, but this is my thought on the subject.
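To lay the reasoning out side by side, here is a small Python sketch that turns the three observations above into the inference drawn from each. The Observation class, its field names, and the wording of the conclusions are my own shorthand. It only encodes the theory in this post and is not confirmed behavior for any engine.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Observation:
    engine: str
    listed: bool                   # does the home page appear in results?
    snippet_source: Optional[str]  # "page", "directory", or None for no snippet


def infer_behavior(obs: Observation) -> str:
    """Map one observed result to the behavior it is consistent with."""
    if not obs.listed:
        # Assumption from the post: absence means the noindex was read.
        return "fetched the page and honored noindex"
    if obs.snippet_source == "page":
        return "fetched the page but ignored noindex"
    # Listed with no text taken from the page itself.
    return "obeyed robots.txt and listed the URL from external data"


observations = [
    Observation("Google", listed=True, snippet_source=None),
    Observation("MSN", listed=False, snippet_source=None),
    Observation("Yahoo!", listed=True, snippet_source="directory"),
]

for obs in observations:
    print(f"{obs.engine}: {infer_behavior(obs)}")
```

Keep in mind that "not listed" has other possible explanations, such as an engine that obeys robots.txt and simply declines to list URLs it has not crawled, so the MSN conclusion is the shakiest of the three.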
It's interesting to note how the three engines handle this differently. I don't know that this will have a major impact on any of my strategy going forward, but it could offer other insights into how the algos work. For example, notice how del.icio.us/ is still ranking #1 for the search even though there is a 301 in place to the new URL at delicious.com. Very interesting. I wonder if this is common practice in these instances?
I'd love to hear anyone's thoughts, reactions, or algo speculations, as I am still trying to figure out for myself whether this really points to anything of importance or is just something to file under the "hmm, that's interesting" folder.