
Googlebot ignores robots.txt? #20

Open
yarikoptic opened this issue Sep 20, 2018 · 5 comments

Comments

@yarikoptic
Member

Last line in the Apache log file:

66.249.65.212 - - [20/Sep/2018:12:29:58 -0400] "GET /crcns/.git/objects/29/f8a0ae8c2ad4e7534b12f3cb68b9e8247b1933 HTTP/1.1" 200 1745 "-" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

$> cat robots.txt 
Agent: *
Disallow: /abide
Disallow: /abide2
Disallow: /adhd200
Disallow: /allen-brain-observatory
Disallow: /balsa
Disallow: /corr
Disallow: /crcns
Disallow: /datapackage.json
Disallow: /dbic
Disallow: /devel
Disallow: /dicoms
Disallow: /.git
Disallow: /.gitattributes
Disallow: /.gitmodules
Disallow: /hbnssi
Disallow: /index.html
Disallow: /indi
Disallow: /kaggle
Disallow: /labs
Disallow: /neurovault
Disallow: /nidm
Disallow: /openfmri
Disallow: /singularity
Disallow: /workshops

$> whois 66.249.65.212

#
# ARIN WHOIS data and services are subject to the Terms of Use
# available at: https://www.arin.net/whois_tou.html
#
# If you see inaccuracies in the results, please report at
# https://www.arin.net/resources/whois_reporting/index.html
#
# Copyright 1997-2018, American Registry for Internet Numbers, Ltd.
#


NetRange:       66.249.64.0 - 66.249.95.255
CIDR:           66.249.64.0/19
NetName:        GOOGLE
NetHandle:      NET-66-249-64-0-1
Parent:         NET66 (NET-66-0-0-0-0)
...

and robots.txt is being fetched by Googlebot:

$> grep robots.txt datasets.datalad.org-access-comb.log | grep Google
66.249.79.206 - - [18/Sep/2018:05:31:22 -0400] "GET /robots.txt HTTP/1.1" 200 538 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.79.204 - - [19/Sep/2018:05:34:02 -0400] "GET /robots.txt HTTP/1.1" 200 538 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.79.97 - - [19/Sep/2018:18:08:17 -0400] "GET /robots.txt HTTP/1.1" 200 4030 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.79.204 - - [20/Sep/2018:05:36:22 -0400] "GET /robots.txt HTTP/1.1" 200 538 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

@aqw - any clue what is going on?

  • did I misspecify robots.txt, maybe?
  • the access pattern from that host is interesting in that it selectively accesses only some datasets, but maybe that is just because the crawl is farmed out to a bunch of Googlebot instances

The overall goal is to forbid bots from crawling .git/ directories, but I have found no way to do that.

@aqw

aqw commented Sep 20, 2018

@yarikoptic I believe it's supposed to be User-Agent: * rather than Agent: *
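For what it's worth, the difference can be demonstrated with Python's stdlib urllib.robotparser (a minimal sketch; the URL is illustrative):

```python
from urllib.robotparser import RobotFileParser

# As deployed: "Agent:" is not a recognized field, so the parser starts
# no user-agent record and the Disallow lines are silently dropped.
broken = RobotFileParser()
broken.parse(["Agent: *", "Disallow: /crcns"])

# Corrected: "User-agent: *" opens a record that the rules attach to.
fixed = RobotFileParser()
fixed.parse(["User-agent: *", "Disallow: /crcns"])

url = "http://datasets.datalad.org/crcns/.git/objects/29/f8a0ae"
print(broken.can_fetch("Googlebot", url))  # True: crawl allowed
print(fixed.can_fetch("Googlebot", url))   # False: crawl blocked
```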

@aqw

aqw commented Sep 20, 2018

As for .git directories, that should be easy with a wildcard. Untested, but something akin to:

Disallow: /*/.git/

@yarikoptic
Member Author

@yarikoptic I believe it's supposed to be User-Agent: * rather than Agent: *

(image: bill-murray-banging-head)

@yarikoptic
Member Author

As for .git directories, that should be easy with a wildcard. Untested, but something akin to:

Disallow: /*/.git/

Whenever I looked into this before, I could not find clarity, e.g. from https://en.wikipedia.org/wiki/Robots_exclusion_standard#Universal_%22*%22_match:

Universal "*" match
The Robot Exclusion Standard does not mention anything about the "*" character in the Disallow:
statement. Some crawlers like Googlebot recognize strings containing "*", while MSNbot and Teoma
interpret it in different ways.

but even there it is not clear how the "*" is matched, e.g. we have .git at a number of levels. I guess I could add

Disallow: /.git/
Disallow: /*/.git/
Disallow: /*/*/.git/
Disallow: /*/*/*/.git/
Disallow: /*/*/*/*/.git/
Disallow: /*/*/*/*/*/.git/
Disallow: /*/*/*/*/*/*/.git/

to cover some of the levels.

THANKS ;)

@aqw

aqw commented Sep 21, 2018

Yeah, there isn't clarity; it's more of a living standard. Bots don't /have/ to follow your rules, they just should. And if they don't, you can ban them.

Wildcard support looks to be common and globs across directories, so you shouldn't need a glob per level. Perhaps the extra rules will help some less sophisticated bots, though.
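Python's stdlib parser is one example of such a less sophisticated consumer: it does plain prefix matching and percent-encodes "*" instead of treating it as a wildcard, so only the explicit per-level rules take effect there (a sketch; paths are illustrative):

```python
from urllib.robotparser import RobotFileParser

# The stdlib parser matches percent-encoded path prefixes literally,
# so "*" becomes "%2A" and never acts as a wildcard.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /*/.git/",      # wildcard rule: ignored by this parser
    "Disallow: /crcns/.git/",  # explicit per-level rule: honored
])

wild = "http://datasets.datalad.org/openfmri/.git/config"
exact = "http://datasets.datalad.org/crcns/.git/config"
print(rp.can_fetch("Googlebot", wild))   # True: wildcard not understood
print(rp.can_fetch("Googlebot", exact))  # False: literal prefix matches
```

Googlebot itself documents support for "*" in robots.txt paths, so the single wildcard rule should suffice for Google; the per-level rules are only a fallback for simpler parsers like this one.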
