#1
Robots.txt question
I have a WordPress blog and I am trying to set it up so that certain directories, such as cgi-bin and wp-admin, are not indexed by spiders.
I put the robots.txt file in the root of the domain, and I know that rules such as

Disallow: /cgi-bin/
Disallow: /wp-admin

would work fine in normal circumstances. My question is this: my blog wasn't installed at the root but in a folder in the root, so the location of wp-admin is actually /blog/wp-admin. If I wanted to disallow that, what is the correct way to do it? Does

Disallow: /blog/wp-admin

make sense, or am I way off?
#2
Generally you can use htaccess to block spiders from folder access.

User-agent: *
Disallow: /blog/wp-admin/

That's the way to go, assuming it's been indexed already.
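One way to sanity-check a rule like this is Python's standard urllib.robotparser, which matches Disallow values as plain path prefixes. The sketch below is only an illustration: the two sample paths are made up, and a real crawler may not behave exactly like this parser.

```python
from urllib.robotparser import RobotFileParser


def parse_rules(text):
    """Build a parser from robots.txt text (no network access needed)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# Rule written for a blog installed in /blog/ rather than the site root.
with_subfolder = parse_rules("User-agent: *\nDisallow: /blog/wp-admin/\n")

# Rule that would only suit a blog installed at the site root.
root_only = parse_rules("User-agent: *\nDisallow: /wp-admin/\n")

# Hypothetical paths, used only to exercise the rules.
admin_page = "/blog/wp-admin/index.php"
blog_post = "/blog/hello-world/"

print(with_subfolder.can_fetch("*", admin_page))  # is the admin area crawlable?
print(with_subfolder.can_fetch("*", blog_post))   # are ordinary posts crawlable?
print(root_only.can_fetch("*", admin_page))       # does /wp-admin/ cover /blog/wp-admin/?
```

With this parser, the subfolder has to be spelled out in the rule: a rule for /wp-admin/ does not cover /blog/wp-admin/.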
#3
You are saying to add

User-agent: *
Disallow: /blog/wp-admin/

to my htaccess? I have never heard of that. Is that what you meant?

I am really just trying to figure out whether having a subdirectory causes problems, or whether the robot doesn't care and just won't index /wp-admin no matter what directory it is in.

Can I do this?

User-agent: *
Disallow: */wp-content/
Disallow: */wp-admin/
Disallow: */wp-includes/
Disallow: */wp-
Disallow: */feed/
Disallow: /trackback/
Disallow: /cgi-bin/

[QUOTE=djstreet]Generally you can use htaccess to block spiders from folder access. User-agent: * Disallow: /blog/wp-admin/ That's the way to go assuming it's been indexed already.[/QUOTE]

Last edited by zoddy : June 19th, 2007 at 11:18 PM.
#4
Does anybody know of a way to block a crawler that seems to ignore robots.txt? I have listed a particular crawler to be blocked, but it still keeps visiting.
#5
Hi,
One way I can think of is password-protecting those directories. If you have Linux hosting with cPanel, you can add a username and password to those directories and it will block EVERYTHING:

/wp-content/
/wp-admin/
/wp-includes/
#6
No, you can't. Don't use wildcards in the Disallow rules if you want to produce a valid and universally effective robots.txt file.
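Under the original robots.txt convention, a Disallow value is a plain path prefix, so a leading * is treated as a literal character that no URL path starts with. Some large search engines document wildcard support as an extension, but a crawler that follows only the original convention would compare the rule roughly as in this sketch. The helper name is_disallowed and the sample path are hypothetical.

```python
def is_disallowed(path, disallow_rules):
    """Original-style matching: a rule blocks a path only when the path
    starts with the rule text, character for character."""
    return any(rule and path.startswith(rule) for rule in disallow_rules)


wildcard_rules = ["*/wp-admin/", "*/wp-content/"]
explicit_rules = ["/blog/wp-admin/", "/blog/wp-content/"]

path = "/blog/wp-admin/index.php"  # hypothetical path for illustration

# The wildcard form never matches, because no path begins with "*".
print(is_disallowed(path, wildcard_rules))

# Spelling out the full prefix matches as intended.
print(is_disallowed(path, explicit_rules))
```

Listing each directory with its full path, including /blog/, is the form that every prefix-matching crawler understands.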
#7
If you want to block certain directories from being indexed, the best way is to lock them with a username and password.
Putting the following in robots.txt will work for spiders that obey robots.txt. For the ones that don't, you will have to find their IP addresses and block them.

User-agent: *
Disallow: /blog/wp-admin/
#8
zoddy, I think djstreet meant to say the robots.txt file, not htaccess.
#9
Yeah, stick it in your .htaccess file as:

IndexIgnore *
RewriteCond %{HTTP_USER_AGENT} ^(bot you want to block)
RewriteRule ^.* - [F,L]

That way all the bot will get is a 404 or 500 if it tries to request any of your pages. I forget which, and don't really care as long as they get one.

Here's a list of all the ones I know, in alphabetical order. They are a load of spam bots. Feel free to copy and paste the list into your .htaccess file to help fight spam. This will block a load of spam bots that look for email addresses etc. to sell without your consent. Googlebot, Slurp (Yahoo), MSNbot, and Ask's bot won't be blocked.

What's yours, so I can add it to my list? Does anyone else know any I've missed, so I can add them to my .htaccess file?

If you add a bot to the list below, make sure its line ends with [OR]. Only the last RewriteCond before the RewriteRule goes without [OR].

IndexIgnore *
# mod_rewrite must be switched on for the rules below to take effect
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto:craftbot@yahoo.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [OR]
RewriteCond %{HTTP_USER_AGENT} ^Custo [OR]
RewriteCond %{HTTP_USER_AGENT} ^DISCo [OR]
RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [OR]
RewriteCond %{HTTP_USER_AGENT} ^eCatch [OR]
RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [OR]
RewriteCond %{HTTP_USER_AGENT} ^FlashGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetRight [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetWeb! [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [OR]
RewriteCond %{HTTP_USER_AGENT} ^GrabNet [OR]
RewriteCond %{HTTP_USER_AGENT} ^Grafula [OR]
RewriteCond %{HTTP_USER_AGENT} ^HMView [OR]
RewriteCond %{HTTP_USER_AGENT} HTTrack [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Stripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} Indy\ Library [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^InterGET [OR]
RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [OR]
RewriteCond %{HTTP_USER_AGENT} ^JetCar [OR]
RewriteCond %{HTTP_USER_AGENT} ^JOC\ Web\ Spider [OR]
RewriteCond %{HTTP_USER_AGENT} ^larbin [OR]
RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [OR]
RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [OR]
RewriteCond %{HTTP_USER_AGENT} ^Navroad [OR]
RewriteCond %{HTTP_USER_AGENT} ^NearSite [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetAnts [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Octopus [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Navigator [OR]
RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [OR]
RewriteCond %{HTTP_USER_AGENT} ^pavuk [OR]
RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [OR]
RewriteCond %{HTTP_USER_AGENT} ^RealDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^ReGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [OR]
RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Surfbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [OR]
RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebAuto [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebFetch [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebGo\ IS [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebLeacher [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebReaper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebSauger [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ Quester [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^Widow [OR]
RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus
RewriteRule ^.* - [F,L]
#10
http://baremetal.com/gadgets/htaccess/
http://evolt.org/article/A_Cheesy_htaccess_Tutorial/18/226/evolt.org
http://www.edginet.org/techie/website/htaccess.html
http://www.javascriptkit.com/howto/htaccess.shtml
http://www.serverwatch.com/tutorials/article.php/10825_1127711_1
http://brainstormsandraves.com/mobile/20051009.195659.shtml
http://www.the-art-of-web.com/system/rewrite/

Regex can be a powerful tool when used properly in rewrites.
#11
The fourteen lines in the list that begin with ^Web may be replaced just as effectively with a SINGLE line:

RewriteCond %{HTTP_USER_AGENT} ^Web [OR]

This reads: IF the User-Agent BEGINS with "Web", then deny. It may even be made more effective by adding the NC flag, as in [NC,OR], which means NOT case sensitive. You have some similar, redundant entries elsewhere in the list that could be reduced the same way.
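The consolidation can be checked outside Apache with an ordinary regular expression. This is only a rough sketch: Python's re module stands in for mod_rewrite's pattern matching, and "WebFriendlyBrowser" is an invented agent name used to show a possible side effect.

```python
import re

# The fourteen user-agent prefixes from the list that start with "Web".
web_agents = [
    "Web Image Collector", "Web Sucker", "WebAuto", "WebCopier", "WebFetch",
    "WebGo IS", "WebLeacher", "WebReaper", "WebSauger", "Website eXtractor",
    "Website Quester", "WebStripper", "WebWhacker", "WebZIP",
]

# Rough equivalent of: RewriteCond %{HTTP_USER_AGENT} ^Web [NC,OR]
single_rule = re.compile(r"^Web", re.IGNORECASE)

# Every one of the fourteen entries is caught by the single pattern.
blocked = [agent for agent in web_agents if single_rule.search(agent)]
print(len(blocked) == len(web_agents))

# The broader pattern also catches agents that were never on the list.
print(bool(single_rule.search("WebFriendlyBrowser")))

# Entries with a different prefix, such as Wget, still need their own line.
print(bool(single_rule.search("Wget")))
```

The trade-off is that a short prefix blocks anything starting with those letters, so it is worth confirming that no wanted client identifies itself that way.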
#12
This is incorrect. What they get is a 403, access denied. A 404 means the file does not exist. If your rewrites are generating these, then you have an error in your rules. If anybody who visits your site(s) gets a 500 error, then EVERYBODY who visits your site will as well (yourself included). A 500 means the server hit an error and is not functioning. After making adjustments to htaccess, it is good practice to verify that your site(s) still function, because a single mistyped character in the syntax may stop your entire site from working for anybody.
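The verification step can be scripted. The following is a minimal sketch using only the Python standard library. The function names status_for and explain are made up for this example, and the wording of each explanation is an interpretation of the post above rather than anything Apache reports.

```python
from http import HTTPStatus
import urllib.error
import urllib.request


def status_for(url, user_agent):
    """Request url with the given User-Agent and return the HTTP status code."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urllib raises for 4xx/5xx responses; the code is still available.
        return error.code


def explain(code):
    """Map a status code to what it suggests about the .htaccess rules."""
    if code == HTTPStatus.FORBIDDEN:
        return "403: request denied, which is what the [F] flag produces"
    if code == HTTPStatus.NOT_FOUND:
        return "404: nothing exists at that path"
    if code == HTTPStatus.INTERNAL_SERVER_ERROR:
        return "500: server error, often a typo in .htaccess"
    return f"{code}: {HTTPStatus(code).phrase}"
```

As a usage idea, calling explain(status_for(url, "Wget")) against your own site should report a 403 once the rules are active, while the same call with an ordinary browser User-Agent string should still report 200. If both report 500, the .htaccess edit itself is the likely culprit.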
#13
Oh, thanks for the correction. It does make sense that they would get a 403, now you mention it. Cheers, I owe you one.
As long as they get something other than "here, please spam me", I'm happy. I'll implement that code ASAP, cheers. Any more you would like to add to the list?