Add more entries to robots.txt #2044

Merged (2 commits) on Nov 26, 2024

YoshiRulz
Collaborator

It should be apparent why each of these isn't useful to crawl.

  • Might want to keep /Wiki/ViewSource for archival purposes?
  • Might want to block /Forum/Posts/User/ for humans who aren't logged in?
  • The false positive thing: spec, Google's nicer docs

@Masterjun3
Collaborator

  • Why disallow /Account/? It makes sense to me that people would google "tasvideos register" to find how to register an account.
  • And /Forum/Topics/Create/ needs authorization so we don't need to block it here. Otherwise I wonder why you don't also block wiki editing, post creation, userfile creation, and all the other creation stuff. Seems too much.
  • Why block /Forum/Posts/User/ ?
  • I don't understand the TODO. We know each entry has an implicit trailing wildcard; that's why we can block /Movies-, for example.
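
The implicit trailing wildcard mentioned in the last point can be checked with Python's standard-library `urllib.robotparser`, which does plain prefix matching. This is only a sketch: the host `example.org`, the bot name, and the page names after `/Movies-` are made up for illustration, and the two-line rule set is not the project's actual robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical two-line robots.txt; only the /Movies- prefix comes from the discussion.
rules = [
    "User-agent: *",
    "Disallow: /Movies-",
]

parser = RobotFileParser()
parser.parse(rules)

# Any path that starts with the disallowed prefix is blocked.
blocked = parser.can_fetch("ExampleBot", "https://example.org/Movies-Stars")
# A path that merely shares part of the prefix is still allowed.
allowed = parser.can_fetch("ExampleBot", "https://example.org/Movies")

print(blocked, allowed)
```

Because the rule is a prefix, a single entry covers every URL that begins with `/Movies-`, with no explicit `*` needed.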

@YoshiRulz
Collaborator Author

  1. Oh right, I'll revert that.
  2. It was in a list of highest-trafficked pages you shared recently. I think those others should be blocked, but if you don't want them listed here, I did have a solution using rel="nofollow" in another branch.
  3. This was also in said list. It's not really a useful thing to index: crawling /Forum/Topics/{id} will include the posters' names with their posts, so in theory they should be searchable that way if someone wanted. But I think searching by poster is prone to abuse, hence my suggestion in OP.
  4. The "correct" way is to put a $ after the path fragment (and then if there are subpages, include a separate .../* entry). I haven't bothered.
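
As a rough illustration of point 4, the sketch below implements the two special characters most crawlers honor in robots.txt paths: `*` (any run of characters) and a trailing `$` (end of URL). The function name `rule_matches` is hypothetical, the paths are illustrative only, and real crawlers also handle percent-encoding and rule precedence, which this ignores.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a robots.txt path pattern matches a URL path."""
    # A trailing '$' anchors the pattern to the end of the path.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # '*' matches any run of characters; every other character is literal.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += r"\Z"
    # re.match anchors at the start, so an unanchored pattern acts as a prefix.
    return re.match(regex, path) is not None


# Without '$' the entry is a prefix, so it also catches a longer, unrelated path.
print(rule_matches("/Wiki/ViewSource", "/Wiki/ViewSourceOther"))
# With '$' only the exact path matches.
print(rule_matches("/Wiki/ViewSource$", "/Wiki/ViewSourceOther"))
# Subpages then need their own entry ending in '/*'.
print(rule_matches("/Wiki/ViewSource/*", "/Wiki/ViewSource/HomePage"))
```

The anchored entry plus a separate `.../*` entry together avoid the false positive that a bare prefix would produce.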

@adelikat
Collaborator

adelikat commented Nov 23, 2024

2. I would recommend not listing this or any other page that requires auth: it isn't necessary, since robots will not be logged in, and there are too many such pages to list. Also, I don't think usage statistics need to play a big role here. If it makes sense for something to be crawled, it should be crawled. If it needs to be more performant as a result, we can address that.

4. Then at least put that information in the comment, so that someone who isn't an expert in robot crawling can act on it. But I would recommend not bothering with it.

Collaborator

@Masterjun3 Masterjun3 left a comment

I have committed some removals as discussed, so that this PR can be merged.

@adelikat adelikat merged commit ee2a5a8 into TASVideos:main Nov 26, 2024
1 check passed
@YoshiRulz YoshiRulz deleted the more-less-robots branch November 26, 2024 22:02
3 participants