SEO Test and Experimentation
 
  #1  
July 23rd, 2007, 02:37 PM
javaweb
Rogue bots

This isn't directly search related, so I hope this is an ok topic for me to post. If not, please delete this thread or whatever.

Anyway...

I noticed regular activity from bots with UAs like this: Java/1.5.0_06. That's the default User-Agent sent by plain old Sun Java's built-in URL fetching, so someone has built a crawling tool in Java. Great. What's not so great is that it totally disregards robots.txt. Out of curiosity, I built a massive spider-trap Servlet that creates an infinite network of seemingly meaningful HTML. I excluded it in robots.txt so no legitimate crawler would go to it. Lo and behold, these Java bots fell into it and would spend hours making one or two requests per second, loading meaningless pages with meaningless email addresses on them.

I assume these are email address harvesters? If so, they sure got a lot of addresses out of my bot trap. Addresses like williamtjones@... etc.

This isn't relevant to SEO, it's just something I'm curious about.
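
For anyone curious what such a trap can look like, here is a minimal sketch in Python rather than a Servlet. It is not the code described above: the word lists, the `trap_page` function, and the `example.com` domain are all made up for illustration. The idea is that each trap URL seeds its own random generator, so the same URL always returns the same page and every page links to further trap URLs.

```python
import hashlib
import random

# Hypothetical word lists; any vocabulary would do.
FIRST = ["william", "mary", "james", "linda", "robert", "susan"]
LAST = ["jones", "smith", "brown", "taylor", "miller", "davis"]
WORDS = ["market", "report", "update", "garden", "travel", "review", "guide", "news"]
LETTERS = "abcdefghijklmnopqrstuvwxyz"


def trap_page(path: str, domain: str = "example.com", links: int = 5, emails: int = 3) -> str:
    """Return HTML for a trap URL; the same path always yields the same page."""
    # Seed a private RNG from the path so the page is stable across requests.
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    base = path.rstrip("/")
    parts = ["<html><body>"]
    # Bogus addresses in the first-name + middle-initial + last-name style.
    for _ in range(emails):
        addr = f"{rng.choice(FIRST)}{rng.choice(LETTERS)}{rng.choice(LAST)}@{domain}"
        parts.append(f'<p>Contact <a href="mailto:{addr}">{addr}</a></p>')
    # Links that lead one level deeper into the trap, so the crawl never ends.
    for _ in range(links):
        slug = f"{rng.choice(WORDS)}-{rng.choice(WORDS)}-{rng.randrange(10**6)}"
        title = slug.replace("-", " ")
        parts.append(f'<p><a href="{base}/{slug}">{title}</a></p>')
    parts.append("</body></html>")
    return "\n".join(parts)
```

Because nothing is stored, the "infinite network" costs no disk space; the server just computes whatever page the bot asks for.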

  #2  
July 24th, 2007, 07:43 PM
tstolber
I like the way you think. Flooding spam bots with bogus addresses reduces their effectiveness.

If you want to block any user agent, you can do so in a .htaccess file.

The robots.txt protocol isn't required to be followed but genuine bots do so.

The user agent is only a parameter set on the client system, so it's easily spoofed. If you so wished, you could surf the web as Googlebot.
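
On the point about robots.txt being voluntary: the check a well-behaved crawler makes before each fetch can be sketched with Python's standard `urllib.robotparser`. The `/trap/` path and the `example.com` URLs below are assumptions for illustration, not the real rules from this thread.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that fences off a trap directory for every crawler.
ROBOTS_TXT = """\
User-agent: *
Disallow: /trap/
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

A compliant bot calls something like `allowed("Java/1.5.0_06", "http://example.com/trap/page1")`, gets `False`, and skips the URL. Nothing on the server enforces this, so a bot that never asks will walk straight in.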

  #3  
July 26th, 2007, 06:11 PM
javaweb
Quote:
Originally Posted by tstolber
I like the way you think.


Thank you!

Quote:
Originally Posted by tstolber
Flooding spam bots with bogus addresses reduces their effectiveness.


Exactly. If even 1% of sites ran bogus email spam bot traps, it would be basically impossible for spam bots to harvest email addresses.

It seems to be working for me. I think they have blacklisted my server, which is exactly what I wanted them to do, because I'm no longer seeing any of these bots. I think there is a shared blacklist, because I used to get those connections from quite a few IP addresses. Now there are none. These bots would crawl literally 100,000 pages at a time on my server, all of which were full of junk, and that would basically neuter the bot.

Quote:
Originally Posted by tstolber
If you want to block any user agent, you can do so in a .htaccess file.


(Well, I'm not using Apache HTTPD, but anyway...) The Java UA is a perfectly legitimate UA. If I have an applet that wants to get resources from my site, that's what it would use. If someone wrote a web browser or some type of web viewer in Java, it might use Java 1.... as a UA. That's fine. Instead of blocking them, I'll give them plenty of pages to look at!

Quote:
Originally Posted by tstolber
The robots.txt protocol isn't required to be followed but genuine bots do so.


Any bot that doesn't follow it is going to find a LOT of web pages to crawl on my server.

Quote:
Originally Posted by tstolber
The user agent is only a parameter set on the client system, so it's easily spoofed. If you so wished, you could surf the web as Googlebot.


Right, curl -A ' googlebot whatever'
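
The same thing can be sketched in Python to show that the User-Agent is just a header the client fills in. The `build_request` helper is made up for this example, and the block only constructs the request without sending it.

```python
from urllib.request import Request


def build_request(url: str, user_agent: str) -> Request:
    """Build (but do not send) a request that claims to be user_agent."""
    # The server only ever sees this string; it cannot verify who sent it.
    return Request(url, headers={"User-Agent": user_agent})


req = build_request(
    "http://example.com/",
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
)
```

Passing `req` to `urllib.request.urlopen` would send it with that header, which is why blocking on the UA string alone only stops clients that don't bother to lie.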

I'm just glad that my spider trap seems to have gotten my site on some spammer blacklist.

By the way, all the fake emails were from my real domain. I'm hoping that if there are hundreds of thousands of non-working addresses from my real domain, they'll end up deleting my entire domain from their lists.

I know that some of these bots are also content-scrapers. Well, they sure got a lot of fresh content from me! Some of them are rogue research tools. Whether they are business analysis tools, RIAA copyright crawlers, or whatever, I don't need them on my site if they can't respect robots.txt.

  #4  
August 10th, 2007, 06:53 PM
javaweb
It's weird, those bots are back, and now they are avoiding the spam trap. I don't get why someone would care enough about my site to carefully configure the bot to crawl it and avoid the trap. It's not like my site has the Coca Cola recipe on it.

I guess I need to keep changing the trap URL on the site. It annoys me to have these bots totally violating the robots.txt.

I wish I knew what that bot is.
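
One way to keep changing the trap URL without editing the site by hand is to derive it from the date. This is only a sketch of that idea: the secret key, the daily rotation, the seven-day window, and the function names are all assumptions, not something from the setup described in this thread.

```python
import hashlib
import hmac
from datetime import date, timedelta


def trap_path(secret: bytes, day: date) -> str:
    """Derive that day's trap directory from a secret key and the date."""
    mac = hmac.new(secret, day.isoformat().encode("ascii"), hashlib.sha256)
    return f"/{mac.hexdigest()[:12]}/"


def robots_txt(secret: bytes, today: date, keep_days: int = 7) -> str:
    """Disallow today's trap path plus the paths from the previous days."""
    lines = ["User-agent: *"]
    # Older paths stay disallowed so crawlers with a cached robots.txt
    # or a stale link are not caught by mistake.
    for back in range(keep_days):
        lines.append(f"Disallow: {trap_path(secret, today - timedelta(days=back))}")
    return "\n".join(lines) + "\n"
```

The site would link to `trap_path(secret, today)` from its pages and serve `robots_txt(secret, today)`. A bot that was hand-configured to skip last week's trap URL would then have to be reconfigured every day, while bots that actually read robots.txt never notice the change.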
