Search Engine Spiders
 
  #1  
June 19th, 2007, 10:32 PM
zoddy
Robots.txt question

I have a WordPress blog, and I am trying to set it up so that certain directories, such as cgi-bin and wp-admin, are not indexed by spiders.

I put the robots.txt file in the root of the domain, and I know that rules such as

Disallow: /cgi-bin/
Disallow: /wp-admin

would work fine in normal circumstances. My question is this: my blog wasn't installed at the root but in a folder under the root, so the location of wp-admin is actually
/blog/wp-admin

If I wanted to disallow that, what is the correct way to do it?
Does Disallow: /blog/wp-admin make sense, or am I way off?

  #2  
June 19th, 2007, 10:37 PM
djstreet
Generally you can use htaccess to block spiders from folder access.

User-agent: *
Disallow: /blog/wp-admin/

That's the way to go assuming it's been indexed already.
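
To make the prefix matching concrete, here is a minimal sketch using Python's standard urllib.robotparser. The host example.com, the agent name ExampleBot and the helper is_allowed are placeholders made up for illustration, not anything from this thread. It suggests that a rule written for /wp-admin does not cover /blog/wp-admin/, because a Disallow value is compared against the start of the URL path.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, agent: str = "ExampleBot") -> bool:
    """Parse robots.txt text and report whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Rule written as if WordPress lived at the site root.
root_rule = "User-agent: *\nDisallow: /wp-admin\n"
# Rule that includes the /blog/ folder the blog is actually installed in.
blog_rule = "User-agent: *\nDisallow: /blog/wp-admin/\n"

admin_url = "http://example.com/blog/wp-admin/"  # placeholder host

# Does the root-style rule leave the admin folder fetchable?
print(is_allowed(root_rule, admin_url))
# Does the /blog/ rule leave it fetchable?
print(is_allowed(blog_rule, admin_url))
# Is the rest of the blog still open under the /blog/ rule?
print(is_allowed(blog_rule, "http://example.com/blog/"))
```

This only shows how one well-behaved parser reads the file. A crawler that ignores robots.txt is not affected by any of it.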

  #3  
June 19th, 2007, 10:45 PM
zoddy
Are you saying to add

User-agent: *
Disallow: /blog/wp-admin/

to my .htaccess? I have never heard of that. Is that what you meant? I am really just trying to figure out whether having a subdirectory causes problems, or whether the robot doesn't care and just won't index /wp-admin no matter what directory it is in.

Can I do this?

User-agent: *
Disallow: */wp-content/
Disallow: */wp-admin/
Disallow: */wp-includes/
Disallow: */wp-
Disallow: */feed/
Disallow: /trackback/
Disallow: /cgi-bin/


[QUOTE=djstreet]Generally you can use htaccess to block spiders from folder access.

User-agent: *
Disallow: /blog/wp-admin/

That's the way to go assuming it's been indexed already.[/QUOTE]

Last edited by zoddy : June 19th, 2007 at 11:18 PM.

  #4  
June 20th, 2007, 06:09 AM
Doodlebug
Anybody know of a way to block a crawler that seems to ignore robots.txt - I have included a particular crawler to be blocked but it still keeps visiting.

  #5  
June 20th, 2007, 06:37 AM
blade007
Hi

One way I can think of is to password-protect those directories. If you have Linux hosting with cPanel, you can add a username and password to those directories, and that will block EVERYTHING.

/wp-content/
/wp-admin/
/wp-includes/

  #6  
June 20th, 2007, 10:45 AM
dzine
Quote:
Originally Posted by zoddy
...
Disallow: */whatever
...

No you can't. Don't use wildcards in the Disallow rules if you want to produce a valid and universally effective robots.txt file.
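
One way to see the problem is to feed the proposed rules to a parser that has no wildcard support. The sketch below uses Python's standard urllib.robotparser, which as far as I know only does plain prefix matching. The host example.com and the agent name ExampleBot are placeholders. Some crawlers accept * as an extension, so this only shows what a strict parser may do with such a line.

```python
from urllib.robotparser import RobotFileParser

# One wildcard-style rule as proposed earlier in the thread, plus one plain rule.
rules = """\
User-agent: *
Disallow: */wp-admin/
Disallow: /cgi-bin/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# The "*" is treated as a literal character, so no real path starts with it.
wildcard_blocked = not parser.can_fetch(
    "ExampleBot", "http://example.com/blog/wp-admin/"
)
# The plain prefix rule is matched against the start of the path.
plain_blocked = not parser.can_fetch(
    "ExampleBot", "http://example.com/cgi-bin/test.cgi"
)

print("wildcard rule blocks /blog/wp-admin/:", wildcard_blocked)
print("plain rule blocks /cgi-bin/test.cgi:", plain_blocked)
```

Spelling out the full path, as in Disallow: /blog/wp-admin/, avoids depending on which crawlers understand the extension.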

  #7  
June 20th, 2007, 06:47 PM
googler
If you want to block certain directories from being indexed, the best way is to lock them with a username and password.

Putting the following in robots.txt will work for spiders that obey robots.txt. For the ones that don't, you will have to find their IP addresses and block them.

User-agent: *
Disallow: /blog/wp-admin/

  #8  
August 6th, 2007, 06:33 PM
awaken
Quote:
Generally you can use htaccess to block spiders from folder access.


zoddy, I think djstreet meant to say the robots.txt file, not htaccess.

  #9  
September 21st, 2007, 10:54 AM
jazajay
Quote:
Originally Posted by Doodlebug
Anybody know of a way to block a crawler that seems to ignore robots.txt - I have included a particular crawler to be blocked but it still keeps visiting.


Yeah stick it in your .htaccess file as:

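# Needed once before any rewrite rules, unless it is already set elsewhere in the file
RewriteEngine On
# IndexIgnore only hides files from auto-generated directory listings; it does not block bots
IndexIgnore *
RewriteCond %{HTTP_USER_AGENT} ^(bot you want to block)
RewriteRule ^.* - [F,L]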

That way all the bot will get is a 404 or 500 if it tries to request any of your pages, I forget which and don't really care as long as they get one.

Here's a list of all the ones I know. They are even in alphabetical order, and they are a load of spam bots. Feel free to copy and paste the list into your .htaccess file to help fight spam. This will block a load of spam bots that look for email addresses and the like to sell without your consent. Googlebot, Slurp (Yahoo), MSNbot, and Ask's bot won't be blocked.

What's yours, so I can add it to my list?

Does anyone else know any I've missed, so I can add them to my .htaccess file to help fight spam?

If you add it to the list below, make sure you put [OR] at the end of the line. If you use it on its own rather than in the list, you don't need the [OR].

IndexIgnore *
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto:craftbot@yahoo.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [OR]
RewriteCond %{HTTP_USER_AGENT} ^Custo [OR]
RewriteCond %{HTTP_USER_AGENT} ^DISCo [OR]
RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [OR]
RewriteCond %{HTTP_USER_AGENT} ^eCatch [OR]
RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [OR]
RewriteCond %{HTTP_USER_AGENT} ^FlashGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetRight [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetWeb! [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [OR]
RewriteCond %{HTTP_USER_AGENT} ^GrabNet [OR]
RewriteCond %{HTTP_USER_AGENT} ^Grafula [OR]
RewriteCond %{HTTP_USER_AGENT} ^HMView [OR]
RewriteCond %{HTTP_USER_AGENT} HTTrack [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Stripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} Indy\ Library [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^InterGET [OR]
RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [OR]
RewriteCond %{HTTP_USER_AGENT} ^JetCar [OR]
RewriteCond %{HTTP_USER_AGENT} ^JOC\ Web\ Spider [OR]
RewriteCond %{HTTP_USER_AGENT} ^larbin [OR]
RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [OR]
RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [OR]
RewriteCond %{HTTP_USER_AGENT} ^Navroad [OR]
RewriteCond %{HTTP_USER_AGENT} ^NearSite [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetAnts [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Octopus [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Navigator [OR]
RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [OR]
RewriteCond %{HTTP_USER_AGENT} ^pavuk [OR]
RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [OR]
RewriteCond %{HTTP_USER_AGENT} ^RealDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^ReGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [OR]
RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Surfbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [OR]
RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebAuto [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebFetch [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebGo\ IS [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebLeacher [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebReaper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebSauger [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ Quester [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^Widow [OR]
RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus
RewriteRule ^.* - [F,L]
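
The [OR] rule described before the list (every condition carries [OR] except the last one) can be sketched as a small generator. This is only an illustration: the function name rewrite_block is made up, and it escapes spaces the same way the list does but does not escape other regex characters in a bot name.

```python
def rewrite_block(agents):
    """Build RewriteCond lines; every condition except the last carries [OR]."""
    if not agents:
        # With no conditions the RewriteRule would apply to every visitor.
        raise ValueError("need at least one user-agent name")
    lines = []
    for index, name in enumerate(agents):
        # A space must be backslash-escaped inside the pattern.
        pattern = "^" + name.replace(" ", "\\ ")
        is_last = index == len(agents) - 1
        flag = "" if is_last else " [OR]"
        lines.append("RewriteCond %{HTTP_USER_AGENT} " + pattern + flag)
    lines.append("RewriteRule ^.* - [F,L]")
    return "\n".join(lines)


print(rewrite_block(["BlackWidow", "Download Demon", "Zeus"]))
```

Generating the block from a plain list of names makes it harder to leave a stray [OR] on the final condition when a new bot is appended.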

  #10  
September 26th, 2007, 01:58 PM
wildernessD
Quote:
Originally Posted by Doodlebug
Anybody know of a way to block a crawler that seems to ignore robots.txt - I have included a particular crawler to be blocked but it still keeps visiting.


http://baremetal.com/gadgets/htaccess/
http://evolt.org/article/A_Cheesy_htaccess_Tutorial/18/226/evolt.org
http://www.edginet.org/techie/website/htaccess.html
http://www.javascriptkit.com/howto/htaccess.shtml
http://www.serverwatch.com/tutorials/article.php/10825_1127711_1
http://brainstormsandraves.com/mobile/20051009.195659.shtml
http://www.the-art-of-web.com/system/rewrite/

Regex can be a powerful tool when used properly in rewrites.

  #11  
September 26th, 2007, 02:02 PM
wildernessD
Quote:
Originally Posted by jazajay
Yeah stick it in your .htaccess file as:

IndexIgnore *
RewriteCond %{HTTP_USER_AGENT} ^(bot you want to block)
RewriteRule ^.* - [F,L]

That way all the bot will get is a 404 or 500 if it tries to request any of your pages, I forget which and don't really care as long as they get one.

Heres a list of all the ones I know, they are even in alphabetical order, they are a load of spam bots. Feel free to copy and paste the list into your .htaccess file to help fight spam. This will block a load of spam bots that look for email addresses etc... to sell w/o your consent. Googlebot,Slurp(Yahoo),MSNbot, and ask's bot wont be blocked.

Whats yours so I can add it to my list?

Any one else know any I've missed so I can upload them to my .htaccess file to help fight spam?

If you add it to the list below make sure you add the [OR] at the end. If you don't you won't need the [OR].

IndexIgnore *
RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebAuto [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebFetch [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebGo\ IS [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebLeacher [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebReaper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebSauger [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ Quester [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus
RewriteRule ^.* - [F,L]


All of the above fourteen "Web" lines may be replaced just as effectively with a SINGLE line.

RewriteCond %{HTTP_USER_AGENT} ^Web [OR]

Which reads: IF the User-Agent BEGINS with "Web", then deny.
It may be made even more effective by adding the NC flag before OR, as in [NC,OR], which makes the match NOT case sensitive.

Your list has some other similar and deprecated entries that could be reduced the same way.
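
The claim can be checked outside Apache with a short Python sketch. Python's re module is used here as a stand-in for the regex engine behind RewriteCond, which is an assumption that holds for a pattern this simple, and the browser-style string in others is made up.

```python
import re

# The fourteen "Web..." names from the quoted list.
web_agents = [
    "Web Image Collector", "Web Sucker", "WebAuto", "WebCopier", "WebFetch",
    "WebGo IS", "WebLeacher", "WebReaper", "WebSauger", "Website eXtractor",
    "Website Quester", "WebStripper", "WebWhacker", "WebZIP",
]
# Names that the single pattern should leave alone.
others = ["Zeus", "Mozilla/5.0 (compatible; ExampleBrowser)"]

anchored = re.compile(r"^Web")                    # like: ^Web
anchored_nc = re.compile(r"^Web", re.IGNORECASE)  # like: ^Web [NC]

# Every listed "Web..." name is caught by the one pattern.
print(all(anchored.search(ua) for ua in web_agents))
# Zeus is not, so it still needs its own line.
print(any(anchored.search(ua) for ua in others))
# A lower-case variant only matches once case is ignored.
print(bool(anchored.search("webzip")), bool(anchored_nc.search("webzip")))
```

The trade-off is that ^Web also denies any legitimate client whose User-Agent happens to begin with "Web".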

  #12  
September 26th, 2007, 02:07 PM
wildernessD
Quote:
That way all the bot will get is a 404 or 500 if it tries to request any of your pages, I forget which and don't really care as long as they get one.


This is incorrect.
What they get is a 403, access denied.

404 means the file does not exist. If your rewrites are generating 404s, then you have an error in your rules.

If anybody who visits your site(s) gets a 500 error, then EVERYBODY who visits your site will get it as well (yourself included).
500 is an internal server error: the server could not process the request.

After making adjustments to .htaccess, it is good practice to verify that your site(s) still function, because a single mistyped character in the syntax may stop your entire site from working for anybody.
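
One way to do that check is to request a page with different User-Agent strings and compare the status codes. The sketch below uses only Python's standard library. The function name status_for is made up, and the URL and agent names you pass in are your own choice.

```python
import urllib.error
import urllib.request


def status_for(url: str, user_agent: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code the server sends to the given User-Agent."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx and 5xx responses; the code is still what we want.
        return err.code
```

With rules like the ones in this thread in place, calling status_for on your own site with a browser-style agent should give 200 and with a blocked name such as "WebZIP" should give 403. A 500 for both would point to a broken .htaccess.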
Comments on this post
dzine agrees!
jazajay agrees: Thanks for the correction

  #13  
September 26th, 2007, 08:53 PM
jazajay
Oh, thanks for the correction. It does make sense that they would get a 403, now that you mention it. Cheers, I owe you one.

As long as they get something other than "here, please spam me", I'm happy.

I'll implement that code ASAP, cheers.

Any more you would like to add to the list?
