Search Engine Articles
 
  #1  
September 29th, 2003, 03:36 PM
ducani
Guest
ROBOTS.TXT Primer


There is often confusion as to the role and usage of the robots.txt file. I thought it would be a good idea to dispel some myths and highlight what robots.txt files are all about.

Discuss this article in this thread. You can read the article here.

  #2  
September 30th, 2003, 09:42 PM
wimbledon
Contributing User
Join Date: Aug 2003
Location: London, U.K.
Posts: 84
My understanding is that robots.txt is a standard voluntarily supported by robots and, as such, surely would not be honoured by malicious email harvesters. The article implies that, for instance, EmailCollector would keep out of your site if it were disallowed in the robots.txt file. How about converting email addresses into ASCII strings? Would that work as an alternative? If a disallow does not keep out bad robots, what would be the best way to keep them out?
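
One reading of "converting email addresses into ASCII strings" is to write each character as an HTML numeric character reference, so the raw address never appears in the page source. Here is a minimal Python sketch under that assumption; the function name `encode_email` and the sample address are made up for illustration. A harvester that decodes entities would still read the address, so this probably only raises the bar a little.

```python
import html


def encode_email(address: str) -> str:
    """Return the address with every character written as an HTML numeric entity."""
    return "".join(f"&#{ord(ch)};" for ch in address)


if __name__ == "__main__":
    encoded = encode_email("user@example.com")  # hypothetical address
    print(encoded)
    # A browser (or html.unescape) turns the entities back into the plain address.
    print(html.unescape(encoded))
```

The encoded string can be placed in the page text or in a mailto: link; browsers display it as the normal address.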

  #3  
October 2nd, 2003, 04:15 AM
iprogram
Junior Member
Join Date: Aug 2003
Location: box
Posts: 6
Since EmailCollector is the bad guy, will it check robots.txt before collecting email addresses?
I use the following method to block it in my .htaccess file:
SetEnvIfNoCase User-Agent "^EmailCollector" bad_bot
deny from env=bad_bot
The full list is below:
SetEnvIfNoCase User-Agent "^DigOut4U" bad_bot
SetEnvIfNoCase User-Agent "^DISCoFinder" bad_bot
SetEnvIfNoCase User-Agent "^eCatch" bad_bot
SetEnvIfNoCase User-Agent "^e-collector" bad_bot
SetEnvIfNoCase User-Agent "^EirGrabber" bad_bot
SetEnvIfNoCase User-Agent "^EmailCollector" bad_bot
SetEnvIfNoCase User-Agent "^EmailSiphon" bad_bot
SetEnvIfNoCase User-Agent "^EmailWolf " bad_bot
SetEnvIfNoCase User-Agent "^ExtractorPro" bad_bot
SetEnvIfNoCase User-Agent "^GetWebPage" bad_bot
SetEnvIfNoCase User-Agent "^Mister PiX " bad_bot
SetEnvIfNoCase User-Agent "^Offline Explorer" bad_bot
SetEnvIfNoCase User-Agent "^PageDown" bad_bot
SetEnvIfNoCase User-Agent "^SiteMapper" bad_bot
SetEnvIfNoCase User-Agent "^SiteSnagger " bad_bot
SetEnvIfNoCase User-Agent "^SuperBot " bad_bot
SetEnvIfNoCase User-Agent "^Teleport" bad_bot
SetEnvIfNoCase User-Agent "^Teleport Pro" bad_bot
SetEnvIfNoCase User-Agent "^Web2Map" bad_bot
SetEnvIfNoCase User-Agent "^WebAuto" bad_bot
SetEnvIfNoCase User-Agent "^WebCapture" bad_bot
SetEnvIfNoCase User-Agent "^WebCopier" bad_bot
SetEnvIfNoCase User-Agent "^Web Downloader" bad_bot
SetEnvIfNoCase User-Agent "^Webdupe" bad_bot
SetEnvIfNoCase User-Agent "^WebFetch" bad_bot
SetEnvIfNoCase User-Agent "^webfetcher" bad_bot
SetEnvIfNoCase User-Agent "^WebFountain" bad_bot
SetEnvIfNoCase User-Agent "^WebHook " bad_bot
SetEnvIfNoCase User-Agent "^Web Image" bad_bot
SetEnvIfNoCase User-Agent "^WebMiner" bad_bot
SetEnvIfNoCase User-Agent "^WebMirror" bad_bot
SetEnvIfNoCase User-Agent "^WebReaper" bad_bot
SetEnvIfNoCase User-Agent "^WebSauger" bad_bot
SetEnvIfNoCase User-Agent "^Webster" bad_bot
SetEnvIfNoCase User-Agent "^WebStripper" bad_bot
SetEnvIfNoCase User-Agent "^Web Sucker" bad_bot
SetEnvIfNoCase User-Agent "^WebWhacker" bad_bot
SetEnvIfNoCase User-Agent "^Website eXtractor" bad_bot
SetEnvIfNoCase User-Agent "^WebZIP" bad_bot
SetEnvIfNoCase User-Agent "^Wget" bad_bot
SetEnvIfNoCase User-Agent "^Xaldon WebSpider" bad_bot

deny from env=bad_bot
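
To see which User-Agent strings a list like the one above would catch, the matching can be imitated outside Apache. This is a rough Python sketch, not Apache's own code: it assumes each pattern is an anchored regular expression matched without regard to case, and it uses only four of the prefixes from the list. The names `BAD_BOT_PATTERNS` and `is_bad_bot` are made up for illustration.

```python
import re

# A few of the prefixes from the list above; each is treated as an
# anchored, case-insensitive regular expression.
BAD_BOT_PATTERNS = ["^EmailCollector", "^EmailSiphon", "^WebZIP", "^Wget"]


def is_bad_bot(user_agent: str) -> bool:
    """Return True if any pattern matches the User-Agent, ignoring case."""
    return any(re.search(p, user_agent, re.IGNORECASE) for p in BAD_BOT_PATTERNS)


if __name__ == "__main__":
    for ua in ["Wget/1.9", "emailcollector", "Mozilla/4.0 (compatible; MSIE 6.0)"]:
        print(ua, "->", "blocked" if is_bad_bot(ua) else "allowed")
```

Because every pattern starts with ^, a robot that sends a browser-like string, or puts its name after "Mozilla/...", would not be matched.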

  #4  
October 5th, 2003, 04:07 PM
Webby
Moderator
Join Date: Feb 2003
Location: Hannover, Germany
Posts: 1,384
Most actually do follow the robots exclusion protocol (including EmailCollector), especially if they are commercial email harvesting scripts. Not doing so would result in a lot of abuse emails and give the company concerned a very bad rep.
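
For reference, this is roughly what "following the protocol" means on the robot's side: it reads robots.txt and asks whether its own name may fetch a URL before requesting it. A small Python sketch using the standard `urllib.robotparser` module; the robots.txt content, the `may_fetch` helper and the example.com URL are assumptions for illustration, not taken from the article.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out EmailCollector, allow everyone else.
ROBOTS_TXT = """\
User-agent: EmailCollector
Disallow: /

User-agent: *
Disallow:
"""


def may_fetch(user_agent: str, url: str) -> bool:
    """Ask the parsed rules whether this robot name may request the URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "http://www.example.com/contact.html"
    for bot in ["EmailCollector", "Googlebot"]:
        print(bot, "->", "may fetch" if may_fetch(bot, url) else "must stay out")
```

Nothing on the server enforces this check; a robot that never performs it is not affected by the file at all.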

Having said that, you are right that it is best to be sure by also blocking a spider with .htaccess if possible.
The major problem is that there are so many new ones that keeping up with them all is a pain. Email harvesting is here to stay, unfortunately.

  #5  
October 7th, 2003, 09:32 PM
polarmate
Junior Member
Join Date: Apr 2003
Location: Chicagoland
Posts: 2
Just curious, Webby, how do you feel about having part of the URL mentioned in your article, which I believe is your own web site, get linked to an ad that is totally unrelated to your web site? It does not show on Netscape or Opera but shows on IE6. The ad that I see now is AOL for Broadband. Earlier it was MSN, as in the screen capture below.
Attached Images
File Type: gif urlhotlink.gif (5.0 KB, 507 views)

  #6  
October 8th, 2003, 07:21 AM
Webby
Moderator
Join Date: Feb 2003
Location: Hannover, Germany
Posts: 1,384
Interesting. I'm not seeing it now. Did it use to link anywhere other than my robots.txt?
We are talking about the article here, right?

  #7  
October 8th, 2003, 01:30 PM
polarmate
Junior Member
Join Date: Apr 2003
Location: Chicagoland
Posts: 2
Yes, it's only on the articles - on ALL the articles that I checked - and I see it right now on the word 'internet' in your URL and on other words like 'search engine', 'website', 'email', 'cgi', etc. The ad on the word 'internet' is MSN TV again.

This 'technology' is called IntelliTXT, from Vibrant Media:
URL
Quote:
These unobtrusive and relevant links provide the reader with further product information while offering the publisher alternative revenue streams.

Our proprietary technology automates the analysis and categorization of content, identifies the most appropriate marketing message to deliver, and dynamically serves advertising messages to the right user at the right time.

The difference between this and scumware is that this is built into the web site itself and is occurring with the permission of the web site owner.

The web site owner has to insert the following code to make these appear (the code below is taken from URL):
Code:
<!-- start Vibrant Media IntelliTxt script section -->
<style type="text/css">
.iTt{
  FONT-FAMILY:       Verdana, Arial, Helvetica;
  FONT-SIZE:         11px;
  FONT-STYLE:        normal;
  FONT-WEIGHT:       normal;
  COLOR:             black;
  BACKGROUND-COLOR:  lightyellow;
  BORDER:            black 1px solid;
  PADDING:           2px;
}
</style>
<span id="iTt" class="iTt" style="visibility:hidden;position:absolute;"></span>
<script defer="true" language="javascript" src="http://itxt.vibrantmedia.com/system/liveintellitxt.asp?IPID=110&MK=5&FG=black"></script>
<!-- end Vibrant Media IntelliTxt script section -->

Make sure your browser supports CSS - it shows on IE6 but not on Netscape or Opera.
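
An author who wants to check whether a page carrying their article includes this script could scan the page source for it. A rough Python sketch, assuming the script is always loaded from a host containing `vibrantmedia.com`, as in the snippet above; the class name, the `uses_intellitxt` helper and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class ScriptSrcCollector(HTMLParser):
    """Collect the src attribute of every <script> tag in a page."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case.
        if tag == "script":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def uses_intellitxt(page_html: str) -> bool:
    """Return True if any script on the page is loaded from vibrantmedia.com."""
    collector = ScriptSrcCollector()
    collector.feed(page_html)
    return any("vibrantmedia.com" in src.lower() for src in collector.sources)


if __name__ == "__main__":
    sample = '<script src="http://itxt.vibrantmedia.com/system/liveintellitxt.asp?IPID=110"></script>'
    print(uses_intellitxt(sample))
```

This only inspects the static HTML; a script added to the page later by other JavaScript would not be seen.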

Were you informed that this sort of linking would occur from words on your articles? I guess from now on, authors of articles will need to add another clause to the permissions they give to third party publishers.

© 2003-2008 by Developer Shed. All rights reserved. DS Cluster 2 hosted by Hostway