#1
Rogue bots
This isn't directly search related, so I hope it's an OK topic for me to post. If not, please delete this thread or whatever.
Anyway... I noticed regular activity from bots with UAs like this: Java/1.5.0_06. That's the plain old Sun Java built-in URL fetcher, so someone has built a crawling tool in Java. Great. What's not good is that it totally disregards robots.txt.
Out of curiosity, I built a massive spider-trap Servlet which creates an infinite network of seemingly meaningful HTML. I excluded it using robots.txt so no legitimate crawler would go into it. Lo and behold, these Java bots did fall into it, and would spend hours making one or two requests per second, loading meaningless pages with meaningless email addresses on them.
I assume these are email address harvesters? If so, they sure got a lot of addresses out of my bot trap. Addresses like williamtjones@..., etc. This isn't relevant to SEO, it's just something I'm curious about.
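For readers who want to see the shape of such a trap, here is a minimal sketch in Python rather than the poster's Java Servlet, which was not shared. The `/trap/` prefix, the `example.com` domain, the name lists, and the `trap_page` function are all hypothetical. Each page is generated from a hash of its own path, so the same URL always returns the same junk, and every link leads deeper into the trap.

```python
import hashlib
import random

# Hypothetical values for illustration; the original post gives none of these.
TRAP_PREFIX = "/trap/"
DOMAIN = "example.com"
FIRST = ["william", "mary", "james", "linda", "robert"]
LAST = ["jones", "smith", "brown", "taylor", "clark"]
LETTERS = "abcdefghijklmnopqrstuvwxyz"

# robots.txt entry that keeps well-behaved crawlers out of the trap.
ROBOTS_TXT = f"User-agent: *\nDisallow: {TRAP_PREFIX}\n"


def trap_page(path: str, links: int = 5, emails: int = 3) -> str:
    """Return a deterministic junk HTML page for any path under the trap."""
    # Seed from the path so a given URL always yields the same page.
    seed = int.from_bytes(hashlib.sha256(path.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    parts = ["<html><body>"]
    for _ in range(emails):
        # first name + middle initial + last name, all of it fake
        name = rng.choice(FIRST) + rng.choice(LETTERS) + rng.choice(LAST)
        addr = f"{name}@{DOMAIN}"
        parts.append(f'<p><a href="mailto:{addr}">{addr}</a></p>')
    for _ in range(links):
        # Every link points to another trap URL, so the link graph never ends.
        target = f"{TRAP_PREFIX}{rng.getrandbits(64):016x}"
        parts.append(f'<p><a href="{target}">more</a></p>')
    parts.append("</body></html>")
    return "\n".join(parts)
```

Nothing is stored on the server: the "infinite network" is just a function of the requested path, which keeps the trap cheap to run while a bot wastes its own time.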
#2
I like the way you think. Flooding spam bots with bogus addresses reduces their effectiveness.
If you want to block any user agent, you can do so in a .htaccess file. Nothing forces a bot to follow the robots.txt protocol, but genuine bots do. The user agent is only a parameter set on the client system, so it's easily spoofed. If you so wished, you could surf the web as Googlebot.
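To illustrate why user-agent blocking is weak, here is a small Python sketch. It builds a request that claims to be Googlebot (nothing is actually sent) and runs a naive substring block list of the kind a server might apply. The URL, the block list, and the `is_blocked` helper are assumptions made for illustration, not anything from this thread.

```python
import urllib.request

# Hypothetical target URL; the User-Agent value is simply whatever the client sets.
req = urllib.request.Request(
    "http://example.com/",
    headers={
        "User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; "
                      "+http://www.google.com/bot.html)"
    },
)

# No network traffic happens here; this only reads back the header we set.
claimed = req.get_header("User-agent")


def is_blocked(user_agent: str, blocked_substrings=("Java/",)) -> bool:
    """Naive server-side check: block when the UA contains a listed substring."""
    return any(s in user_agent for s in blocked_substrings)
```

A bot that reports `Java/1.5.0_06` would be caught by this check, but the same bot with a spoofed header would pass straight through it.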
#3
Quote: I like the way you think.
Thank you!
Quote: Flooding spam bots with bogus addresses reduces their effectiveness.
Exactly. If even 1% of sites ran bogus email spam bot traps, it would be basically impossible for spam bots to harvest email addresses. It seems to be working for me. I think they have blacklisted my server, which is exactly what I wanted them to do, because I'm not seeing any of these bots anymore. I think there is a shared blacklist, because I used to get those connections from quite a few IP addresses. Now there are none. These bots would crawl literally 100,000 pages at a time on my server, all of which were full of junk. That would basically neuter the bot.
Quote: If you want to block any user agent, you can do so in a .htaccess file.
(Well, I'm not using Apache HTTPD, but anyway...) The Java UA is a perfectly legitimate UA. If I have an applet that wants to get resources from my site, that's what it would use. If someone wrote a web browser or some type of web viewer in Java, it might use Java 1.... as a UA. That's fine. Instead of blocking them, I'll give them plenty of pages to look at!
Quote: Nothing forces a bot to follow the robots.txt protocol, but genuine bots do.
Any bot that doesn't follow it is going to find a LOT of web pages to crawl on my server.
Quote: The user agent is only a parameter set on the client system, so it's easily spoofed. If you so wished, you could surf the web as Googlebot.
Right, curl -A ' googlebot whatever'
I'm just glad that my spider trap seems to have gotten my site on some spammer blacklist. By the way, all the fake emails were from my real domain. I'm hoping that if they have hundreds of thousands of non-working addresses from my real domain, they'll end up deleting my entire domain from their lists.
I know that some of these bots are also content scrapers. Well, they sure got a lot of fresh content from me! And some of them are rogue research tools. Whether they are business analysis tools, RIAA copyright crawlers, or whatever, I don't need them on my site if they can't respect robots.txt.
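For contrast, this is roughly what a well-behaved crawler does before fetching a URL, sketched with Python's standard `urllib.robotparser`. The robots.txt content, the `/trap/` path, and the `may_fetch` helper are hypothetical; the thread does not say what the real trap URL was.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the /trap/ path is an assumption for illustration.
rules = """\
User-agent: *
Disallow: /trap/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)  # parse rules from memory instead of fetching them


def may_fetch(user_agent: str, url: str) -> bool:
    """Return True if robots.txt allows this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)
```

A crawler that makes this check never enters the trap, which is why only bots that ignore robots.txt end up in it.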
#4
It's weird: those bots are back, and now they are avoiding the spam trap. I don't get why someone would care enough about my site to carefully configure the bot to crawl it and avoid the trap. It's not like my site has the Coca-Cola recipe on it.
I guess I need to keep changing the trap URL on the site. It annoys me to have these bots totally violating robots.txt. I wish I knew what that bot is.