#1
ROBOTS.TXT Primer
There is often confusion about the role and use of the robots.txt file. I thought it would be a good idea to dispel some myths and highlight what robots.txt files are all about. Discuss this article in this thread. You can read the article here.
#2
My understanding is that robots.txt is a standard that robots support voluntarily, so surely it would not be honoured by malicious email harvesters. The article implies that, for instance, EmailCollector would keep out of your site by being disallowed in the robots.txt file. How about converting email addresses into ASCII strings? Would that work as an alternative? If a disallow does not keep out bad robots, what would be the best way to keep them out?
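One way to read the "ASCII strings" idea is to write each character of the address as a decimal HTML character reference, so the raw address never appears in the page source. That reading is an assumption, and so are the function name `obfuscate_email` and the sample address. A minimal sketch:

```python
import html


def obfuscate_email(address: str) -> str:
    """Encode every character as a decimal HTML character reference.

    Hypothetical helper for illustration; "user@example.com" below is a
    placeholder address, not one taken from the thread.
    """
    return "".join(f"&#{ord(ch)};" for ch in address)


if __name__ == "__main__":
    encoded = obfuscate_email("user@example.com")
    # The page source now contains only numeric references, no literal "@".
    print(encoded)
    # A browser, or a harvester that decodes entities first, recovers the
    # original address with a single call.
    print(html.unescape(encoded))
```

Because decoding is a one-line operation, this probably only stops harvesters that scan the raw source for a literal address. It is an obstacle rather than a block.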
#3
Since EmailCollector is the bad guy, will it check robots.txt before collecting email addresses?
I use the following method to block it in my .htaccess file:
Code:
SetEnvIfNoCase User-Agent "^EmailCollector" bad_bot
deny from env=bad_bot
A full list is below:
Code:
SetEnvIfNoCase User-Agent "^DigOut4U" bad_bot
SetEnvIfNoCase User-Agent "^DISCoFinder" bad_bot
SetEnvIfNoCase User-Agent "^eCatch" bad_bot
SetEnvIfNoCase User-Agent "^e-collector" bad_bot
SetEnvIfNoCase User-Agent "^EirGrabber" bad_bot
SetEnvIfNoCase User-Agent "^EmailCollector" bad_bot
SetEnvIfNoCase User-Agent "^EmailSiphon" bad_bot
SetEnvIfNoCase User-Agent "^EmailWolf " bad_bot
SetEnvIfNoCase User-Agent "^ExtractorPro" bad_bot
SetEnvIfNoCase User-Agent "^GetWebPage" bad_bot
SetEnvIfNoCase User-Agent "^Mister PiX " bad_bot
SetEnvIfNoCase User-Agent "^Offline Explorer" bad_bot
SetEnvIfNoCase User-Agent "^PageDown" bad_bot
SetEnvIfNoCase User-Agent "^SiteMapper" bad_bot
SetEnvIfNoCase User-Agent "^SiteSnagger " bad_bot
SetEnvIfNoCase User-Agent "^SuperBot " bad_bot
SetEnvIfNoCase User-Agent "^Teleport" bad_bot
SetEnvIfNoCase User-Agent "^Teleport Pro" bad_bot
SetEnvIfNoCase User-Agent "^Web2Map" bad_bot
SetEnvIfNoCase User-Agent "^WebAuto" bad_bot
SetEnvIfNoCase User-Agent "^WebCapture" bad_bot
SetEnvIfNoCase User-Agent "^WebCopier" bad_bot
SetEnvIfNoCase User-Agent "^Web Downloader" bad_bot
SetEnvIfNoCase User-Agent "^Webdupe" bad_bot
SetEnvIfNoCase User-Agent "^WebFetch" bad_bot
SetEnvIfNoCase User-Agent "^webfetcher" bad_bot
SetEnvIfNoCase User-Agent "^WebFountain" bad_bot
SetEnvIfNoCase User-Agent "^WebHook " bad_bot
SetEnvIfNoCase User-Agent "^Web Image" bad_bot
SetEnvIfNoCase User-Agent "^WebMiner" bad_bot
SetEnvIfNoCase User-Agent "^WebMirror" bad_bot
SetEnvIfNoCase User-Agent "^WebReaper" bad_bot
SetEnvIfNoCase User-Agent "^WebSauger" bad_bot
SetEnvIfNoCase User-Agent "^Webster" bad_bot
SetEnvIfNoCase User-Agent "^WebStripper" bad_bot
SetEnvIfNoCase User-Agent "^Web Sucker" bad_bot
SetEnvIfNoCase User-Agent "^WebWhacker" bad_bot
SetEnvIfNoCase User-Agent "^Website eXtractor" bad_bot
SetEnvIfNoCase User-Agent "^WebZIP" bad_bot
SetEnvIfNoCase User-Agent "^Wget" bad_bot
SetEnvIfNoCase User-Agent "^Xaldon WebSpider" bad_bot
deny from env=bad_bot
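To make the matching behaviour of the directives above easier to see, here is a rough Python approximation of what they do: a case-insensitive regular expression test anchored at the start of the User-Agent header. It is a sketch, not Apache's implementation. The name `is_bad_bot` is hypothetical, and only a handful of the patterns from the list are copied in.

```python
import re

# A small subset of the patterns from the .htaccess list, copied verbatim.
BAD_BOT_PATTERNS = [
    "^EmailCollector",
    "^EmailSiphon",
    "^WebZIP",
    "^Wget",
    "^Teleport",
]

# re.IGNORECASE stands in for the "NoCase" part of SetEnvIfNoCase.
BAD_BOT_REGEXES = [re.compile(p, re.IGNORECASE) for p in BAD_BOT_PATTERNS]


def is_bad_bot(user_agent: str) -> bool:
    """Return True if the User-Agent string matches any listed pattern."""
    return any(rx.search(user_agent) for rx in BAD_BOT_REGEXES)


if __name__ == "__main__":
    for ua in (
        "EmailCollector/1.0",
        "wget/1.21",
        "Mozilla/4.0 (compatible; MSIE 6.0)",
        "Mozilla/5.0 (compatible; Wget)",
    ):
        print(f"{ua!r}: blocked={is_bad_bot(ua)}")
```

The last sample shows the limit of this approach. Because the patterns are anchored with `^` and the User-Agent header is supplied by the client, a harvester that announces itself as a browser is not matched at all.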
#4
Most actually do follow the robots exclusion protocol (including EmailCollector), especially if they are commercial email harvesting scripts. Not to do so would result in a lot of abuse emails and give the company concerned a very bad rep.
Having said that, you are right that it is best to be sure by also blocking a spider with .htaccess if possible. The major problem is that there are so many new ones that keeping up with them all is a pain. Email harvesting is here to stay, unfortunately.
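For reference, this is roughly what "following the robots exclusion protocol" looks like from the robot's side, sketched with Python's standard `urllib.robotparser`. The robots.txt content, the `example.com` URLs, and the `ExampleBot` name are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut EmailCollector out entirely and keep
# every other robot out of /cgi-bin/.
ROBOTS_TXT = """\
User-agent: EmailCollector
Disallow: /

User-agent: *
Disallow: /cgi-bin/
"""


def may_fetch(user_agent: str, url: str) -> bool:
    """Ask the parsed robots.txt whether this user agent may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(may_fetch("EmailCollector", "http://example.com/contact.html"))
    print(may_fetch("ExampleBot", "http://example.com/contact.html"))
    print(may_fetch("ExampleBot", "http://example.com/cgi-bin/mail.cgi"))
```

The check runs entirely inside the robot. A well-behaved one asks before fetching, and one that never asks is unaffected, which is why a server-side block is still worth having as a second line of defence.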
#5
Just curious, Webby, how do you feel about having part of the URL mentioned in your article, which I believe is your own web site, get linked to an ad that is totally unrelated to your web site? It does not show in Netscape or Opera but shows in IE6. The ad that I see now is AOL for Broadband. Earlier it was MSN, as in the screen capture below.
#6
Interesting. I'm not seeing it now. Did it use to link anywhere other than my robots.txt?
We are talking about the article here, right?
#7
Yes, it's only on the articles - on ALL the articles that I checked - and I see it right now on the word 'internet' in your URL and on other words like 'search engine', 'website', 'email', 'cgi' etc. The ad on the word 'internet' is MSN TV again.
This 'technology' is called IntelliTXT from Vibrant Media URL Quote:
The difference between this and scumware is that this is built into the web site itself and is occurring with the permission of the web site owner. The web site owner has to insert the following code to make these appear (code below taken from URL ):
Code:
<!-- start Vibrant Media IntelliTxt script section -->
<style type="text/css">
.iTt{
FONT-FAMILY: Verdana, Arial, Helvetica;
FONT-SIZE: 11px;
FONT-STYLE: normal;
FONT-WEIGHT: normal;
COLOR: black;
BACKGROUND-COLOR: lightyellow;
BORDER: black 1px solid;
PADDING: 2px;
}
</style>
<span id="iTt" class="iTt" style="visibility:hidden;position:absolute;"></span>
<script defer="true" language="javascript" src="http://itxt.vibrantmedia.com/system/liveintellitxt.asp?IPID=110&MK=5&FG=black"></script>
<!-- end Vibrant Media IntelliTxt script section -->
Make sure your browser supports CSS - it shows in IE6 but not in Netscape or Opera. Were you informed that this sort of linking would occur from words in your articles? I guess from now on, authors of articles will need to add another clause to the permissions they give to third-party publishers.