#1
  1. SEO Consultant
    SEO Chat Genius (4000 - 4499 posts)

    Join Date
    Jul 2004
    Location
    Minneapolis, MN, USA
    Posts
    4,315
    Rep Power
    1248

    Standard robots.txt file


    Hi all

    I want to use a standard robots.txt file for every site I produce. Obviously I want all of the search engines to be able to view all of the site (in most cases).

    What I want to stop are rogue, bandwidth-eating spiders that obey the robots.txt protocol. There is one user agent type (with several variants) that I currently block.

    User-agent: Microsoft URL Control 6.00.8862
    Disallow: /

    User-agent: Microsoft URL Control 5.01.4511
    Disallow: /

    User-agent: Microsoft URL Control 6.00.8169
    Disallow: /

    User-agent: Microsoft URL Control
    Disallow: /

    User-agent: URL Control
    Disallow: /
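
A quick way to sanity-check rules like these before deploying them is to run them through Python's standard `urllib.robotparser`. This is only a sketch: the `is_allowed` helper, the example URL and the "7.00.0000" variant are made up for illustration, and a real crawler may match user agents differently from Python's parser.

```python
from urllib.robotparser import RobotFileParser

# The rules from the post above, reproduced verbatim.
ROBOTS_TXT = """\
User-agent: Microsoft URL Control 6.00.8862
Disallow: /

User-agent: Microsoft URL Control 5.01.4511
Disallow: /

User-agent: Microsoft URL Control 6.00.8169
Disallow: /

User-agent: Microsoft URL Control
Disallow: /

User-agent: URL Control
Disallow: /
"""


def is_allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt lets user_agent fetch url (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical URL; the last "URL Control" variant is invented for the test.
    url = "http://www.example.com/index.html"
    for agent in ("Microsoft URL Control 6.00.8862",
                  "Microsoft URL Control 7.00.0000",
                  "Googlebot"):
        print(agent, "->", "allowed" if is_allowed(agent, url) else "blocked")
```

Python's parser matches the record name as a case-insensitive substring of the requesting agent, so under that parser the short "URL Control" record already covers the versioned variants. Whether a given spider behaves the same way is something only its logs will tell you.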

    I was wondering, does anyone have their own list and/or experiences of robots that obey this protocol but are not worth letting into your site?

    I know that I can block the ones that don't obey the protocol with .htaccess.

    Thanks
  2. #2
  3. Since 1984
    SEO Chat Discoverer (100 - 499 posts)

    Join Date
    Feb 2006
    Location
    Hastings, South East UK
    Posts
    356
    Rep Power
    14

    Here is a list of all the SE bots I could find.


    I guess you could use your stats programme, or whatever you use, to work out which ones send you traffic and which ones don't. Here's the most comprehensive list of agents I could muster up:

    acme.spider
    ahoythehomepagefinder
    aleksika spider
    ia_archiver
    alkaline
    emcspider
    antibot
    arachnophilia
    architext
    aretha
    ariadne
    arks
    aspider
    atn.txt
    atomz
    auresys
    awbot
    backrub
    baiduspider
    bigbrother
    bjaaland
    blackwidow
    blogsphere
    isspider
    blogshares bot
    blogvisioneye
    blogwatcher
    blogwise.com-metachecker
    bloodhound
    bobby
    bordermanager
    boris
    bravobrian bstop
    brightnet
    bspider
    bumblebee
    catvschemistryspider
    calif[^r]
    cassandra
    ccgcrawl
    checkbot
    christcrawler
    churl
    cj spider
    cmc
    collective
    combine
    computer_and_automation_research_institute_crawler
    robi
    conceptbot
    coolbot
    cosmixcrawler
    crawlconvera
    cscrawler
    cusco
    cyberspyder
    cydralspyder
    daviesbot
    deepindex
    denmex websearch
    deweb
    blindekuh
    dienstspider
    digger
    webreader
    cgireader
    diibot
    digout4u
    directhit
    dnabot
    downes/referrers
    download_express
    dragonbot
    dwcp
    e-collector
    e-societyrobot
    ebiness
    echo
    eit
    elfinbot
    emacs
    enterprise_search
    esther
    evliyacelebi
    exabot
    exactseek
    exalead ng
    ezresult
    fangcrawl
    fast-webcrawler
    fastbuzz.com
    faxobot
    feedster crawler
    felix
    fetchrover
    fido
    [^a]fish
    flurry
    fdse
    fouineur
    franklin locator
    freecrawl
    frontier
    funnelweb
    gaisbot
    galaxybot
    gama
    gazz
    gcreep
    getbot
    puu
    geturl
    gigabot
    gnodspider
    golem
    googlebot
    gornker
    grapnel
    griffon
    gromit
    grub-client
    hambot
    hatena antenna
    havindex
    octopus
    hometown
    htdig
    htmlgobble
    pitkow
    hyperdecontextualizer
    finnish
    irobot
    iajabot
    ibm
    illinois state tech labs
    imagelock
    incywincy
    informant
    infoseek
    infoseeksidewinder
    infospider
    ilse
    ingrid
    slurp
    inspectorwww
    intelliagent
    cruiser
    internet ninja
    myweb
    internetseer
    iron33
    israelisearch
    javabee
    jbot
    jcrawler
    jeeves
    jennybot
    jetbot
    jobo
    jobot
    joebot
    jumpstation
    justview
    katipo
    kdd
    kilroy
    fireball
    ko_yappo_robot
    labelgrabber.txt
    larbin
    legs
    linkidator
    linkbot
    linkchecker
    linkfilter.net url verifier
    linkscan
    linkwalker
    lockon
    logo_gif
    lycos
    mac finder
    macworm
    magpie
    marvin
    mattie
    mediafox
    mediapartners-google
    mercator
    mercubot
    merzscope
    mindcrawler
    moget
    momspider
    monster
    mixcat
    motor
    mozdex
    msiecrawler
    msnbot
    muscatferret
    mwdsearch
    my little bot
    naverrobot
    naverbot
    meshexplorer
    nederland.zoek
    netresearchserver
    netcarta
    netcraft
    netmechanic
    netscoop
    newscan-online
    nextopiabot
    nhse
    nitle blog spider
    nomad
    gulliver
    npbot
    nutch
    nzexplorer
    obidos-bot
    occam
    sitegrabber
    openfind
    orb_search
    overture-webcrawler
    packrat
    pageboy
    parasite
    patric
    pegasus
    perlcrawler
    perman
    petersnews
    pka
    phantom
    piltdownman
    pimptrain
    pioneer
    pipeliner
    plumtreewebaccessor
    polybot
    pompos
    poppi
    iconoclast
    pjspider
    portalb
    psbot
    quepasacreep
    raven
    rbse
    redalert
    resumerobot
    roadrunner
    rhcs
    robbie
    robofox
    francoroute
    robozilla
    roverbot
    rules
    safetynetrobot
    scooter
    search_au
    searchprocess
    searchspider
    seekbot
    semanticdiscovery
    senrigan
    sgscout
    shaggy
    shaihulud
    sherlock-spider
    shoutcast
    sift
    simbot
    ssearcher
    site-valet
    sitespider
    sitetech
    slcrawler
    slysearch
    smartspider
    snooper
    solbot
    soziopath
    space bison
    spanner
    speedy
    spiderbot
    spiderline
    spiderman
    spiderview
    spider_monkey
    splatsearch.com
    spry
    steeler
    suke
    suntek
    surveybot
    sven
    syndic8
    szukacz
    tach_bw
    tarantula
    tarspider
    techbot
    technoratibot
    templeton
    teoma_agent1
    teradex
    jubii
    northstar
    w3index
    perignator
    python
    tkwww
    webmoose
    wombat
    webfoot
    wanderer
    worm
    timbobot
    titan
    titin
    tlspider
    turnitinbot
    ucsd
    udmsearch
    ultraseek
    unlost_web_crawler
    urlck
    vagabondo
    valkyrie
    victoria
    visionsearch
    voila
    voyager
    vspider
    vwbot
    w3m2
    wmir
    wapspider
    appie
    wallpaper
    waypath scout
    core
    web downloader
    webbandit
    webbase
    webcatcher
    webcompass
    webcopy
    webcraftboot
    webfetcher
    webfilter
    webgather
    weblayers
    weblinker
    webmirror
    webquest
    webrace
    webreaper
    websnarf
    webspider
    wolp
    webstripper
    webtrends link analyzer
    webvac
    webwalk
    webwalker
    webwatch
    wz101
    wget
    whatuseek
    whowhere
    ferret
    wired-digital
    wisenutbot
    wwwc
    xenu link sleuth
    xget
    cosmos
    yahoo
    yandex
    zao
    zeus
    zyborg


    May not be of any use to you whatsoever...
  4. #3
  5. SEO Consultant
    SEO Chat Genius (4000 - 4499 posts)

    Join Date
    Jul 2004
    Location
    Minneapolis, MN, USA
    Posts
    4,315
    Rep Power
    1248
    Thanks, that's a useful list but not what I am looking for.

    Basically, any of those could potentially bring me traffic, so I want them to be able to index my site.

    What I want to stop are known email harvesters, screen scrapers, etc. that obey robots.txt - possibly not many.
    I just wondered whether anyone has a standard list, or if anyone has had a particular experience with a bad bot consuming bandwidth.
    I did with the one I posted - it was consuming something like 42% of the total bandwidth! I blocked it with robots.txt but it didn't listen, so I had to block it in .htaccess.
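
For anyone wanting to find their own bandwidth hogs, here is a rough sketch of how a per-agent share like that could be worked out from an access log. It assumes the log is in Apache "combined" format; the `bandwidth_share` function and the sample lines are invented for illustration and are not taken from any real log.

```python
import re
from collections import Counter

# Matches the tail of a combined-format line:
# "request" status bytes "referer" "user-agent"
LOG_RE = re.compile(r'"[^"]*" \d{3} (\d+|-) "[^"]*" "([^"]*)"\s*$')


def bandwidth_share(lines):
    """Map each user agent to its fraction of total bytes sent, largest first."""
    bytes_by_agent = Counter()
    for line in lines:
        match = LOG_RE.search(line)
        if not match:
            continue  # skip lines that are not in combined format
        size, agent = match.groups()
        # A size of "-" means no body was sent (e.g. a 304 response).
        bytes_by_agent[agent] += 0 if size == "-" else int(size)
    total = sum(bytes_by_agent.values())
    if total == 0:
        return {}
    return {agent: sent / total for agent, sent in bytes_by_agent.most_common()}


# Invented sample lines, for illustration only.
SAMPLE = [
    '1.2.3.4 - - [10/Mar/2006:10:00:00 +0000] "GET / HTTP/1.1" 200 4200 "-" "Microsoft URL Control - 6.00.8862"',
    '5.6.7.8 - - [10/Mar/2006:10:00:01 +0000] "GET /a HTTP/1.1" 200 3800 "-" "Googlebot/2.1"',
    '5.6.7.9 - - [10/Mar/2006:10:00:02 +0000] "GET /b HTTP/1.1" 304 - "-" "msnbot/1.0"',
    '5.6.7.8 - - [10/Mar/2006:10:00:03 +0000] "GET /c HTTP/1.1" 200 2000 "-" "Googlebot/2.1"',
]

if __name__ == "__main__":
    for agent, share in bandwidth_share(SAMPLE).items():
        print(f"{share:6.1%}  {agent}")
```

In practice you would pass an open log file instead of `SAMPLE`, e.g. `bandwidth_share(open("access.log"))`, where the file name is whatever your server actually writes to.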

    By the way, from what I found it's nothing to do with Microsoft - thought I should point that out.
  6. #4
  7. Since 1984
    SEO Chat Discoverer (100 - 499 posts)

    Join Date
    Feb 2006
    Location
    Hastings, South East UK
    Posts
    356
    Rep Power
    14
    Originally Posted by tstolber
    Thanks, that's a useful list but not what I am looking for.
    Back to the drawing board...

    I'm not going to pretend I understand what it is you are looking for, then. That's not your fault btw, it's mine. I'm a links bod, I know very little about anything else... oh, apart from World of Warcraft... but that's not going to help you with your bandwidth, now is it...

    Just tell me to shut up and I will
