Recognizing URLs delimited other than with whitespace? #293

Open
8573 opened this issue Mar 19, 2020 · 2 comments
Labels
enhancement New feature or request

Comments

@8573

8573 commented Mar 19, 2020

I note that url-bot seems to decline to recognize a substring of an IRC message as a URL if that substring is not delimited on each side by whitespace (or by the start or end of the IRC message). Seeing no evidence in the existing issue tickets that this has been questioned previously, I would like to suggest that this may be overly conservative.

In particular, Appendix C of RFC 3986, the current IETF specification for URIs, suggests delimiting URLs with double quotation marks or with < and > as alternatives to whitespace (I tend to follow the latter suggestion), and recommends:

For robustness, software that accepts user-typed URI [sic] should attempt to recognize and strip [...] delimiters [...]
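As a rough illustration of that recommendation, here is a minimal sketch of stripping one layer of delimiters from a whitespace-separated token. The function name `strip_delimiters` is hypothetical and this is not url-bot's code. It handles only the double-quote and angle-bracket forms, plus the legacy `URL:` prefix that Appendix C mentions as formerly recommended.

```rust
/// Strip one layer of RFC 3986 Appendix C style delimiters from a token.
/// Hypothetical helper for illustration; not part of url-bot.
fn strip_delimiters(token: &str) -> &str {
    let t = token.trim();
    // Delimiter pairs suggested in Appendix C: <...> and "...".
    for &(open, close) in &[('<', '>'), ('"', '"')] {
        // Require at least two characters so a lone `"` is left untouched.
        if t.len() >= 2 && t.starts_with(open) && t.ends_with(close) {
            // Both delimiters are single-byte ASCII, so byte slicing is safe here.
            let inner = &t[1..t.len() - 1];
            // Drop the legacy "URL:" prefix if present.
            return inner.strip_prefix("URL:").unwrap_or(inner);
        }
    }
    t
}
```

A token with no recognized delimiter pair is returned unchanged apart from whitespace trimming, so this could sit in front of an existing whitespace-splitting parser.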

@nuxeh
Owner

nuxeh commented Apr 8, 2020

Hi, thanks for the suggestion, and yes, I totally agree. The parser currently used in url-bot is rather simplistic: it only splits message strings on whitespace. Getting good results, or adhering more closely to the spec, would clearly require a more complex parser than what we have.
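To make the gap concrete, here is a sketch of a scanner that looks for a scheme anywhere in the message instead of splitting on whitespace first. It is an assumption-laden illustration, not url-bot's parser: the name `find_urls`, the `SCHEMES` list, and the choice of terminating characters (whitespace, `<`, `>`, `"`) are all made up for this example.

```rust
/// Schemes this sketch looks for (an assumption; a real parser would cover more).
const SCHEMES: [&str; 2] = ["https://", "http://"];

/// Find URL-like substrings anywhere in `message`, ending each one at
/// whitespace or at one of the delimiters `<`, `>`, `"`.
/// Hypothetical helper for illustration only.
fn find_urls(message: &str) -> Vec<&str> {
    let mut urls = Vec::new();
    let mut rest = message;
    loop {
        // Earliest occurrence of any known scheme in the unscanned remainder.
        let start = match SCHEMES.iter().filter_map(|s| rest.find(*s)).min() {
            Some(i) => i,
            None => break,
        };
        let candidate = &rest[start..];
        // The URL ends at the first whitespace or delimiter character.
        let end = candidate
            .find(|c: char| c.is_whitespace() || matches!(c, '<' | '>' | '"'))
            .unwrap_or(candidate.len());
        urls.push(&candidate[..end]);
        // Continue scanning after this candidate.
        rest = &candidate[end..];
    }
    urls
}
```

With this approach, a message such as `see <https://example.com/a>, thanks` yields `https://example.com/a` even though the URL is not surrounded by whitespace. It does not validate the result or handle parentheses and trailing punctuation, which is part of why a dedicated crate is attractive.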

As it turns out, there is a crate, urlocate, which appears to be designed for exactly this: extracting URLs from surrounding context. It also has no dependencies, which is nice. So I think it is a good candidate and worth investigating as a direction for this.

@nuxeh nuxeh added the enhancement New feature or request label Apr 8, 2020
@nuxeh
Owner

nuxeh commented Mar 15, 2021

https://www.cbsnews.com/news/scientists-335-million-seed-sperm-egg-samples-moon-noahs-ark/--

is another example of a case where a better parser might give better results.
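Assuming the trailing `--` in that example is surrounding punctuation rather than part of the URL, one possible post-processing step is to trim punctuation from the end of each candidate. The sketch below uses a hypothetical function name and an arbitrary character set. It is a heuristic, since a URL can legitimately end in `-` or `)`.

```rust
/// Trim characters from the end of a URL candidate that are more likely to be
/// surrounding punctuation than part of the URL.
/// Heuristic and hypothetical; the character set is an assumption.
fn trim_trailing_punctuation(candidate: &str) -> &str {
    candidate.trim_end_matches(|c: char| {
        matches!(
            c,
            '-' | '.' | ',' | ';' | ':' | '!' | '?' | '\'' | '"' | ')' | ']' | '>'
        )
    })
}
```

Applied to the message above, this would leave the URL ending in `noahs-ark/`, because `/` is not in the trimmed set.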
