Recognizing URLs delimited other than with whitespace? #293

8573 · 2020-03-19T01:42:54Z

I note that url-bot seems to decline to recognize a substring of an IRC message as a URL if that substring is not delimited on each side by whitespace (or by the start or end of the IRC message). Seeing no evidence in the existing issue tickets that this has been questioned previously, I would like to suggest that this may be overly conservative.

In particular, Appendix C of the current IETF RFC on URLs, RFC 3986, suggests delimiting URLs not only with whitespace but also with double quotation marks or with < and > (I tend to follow the latter suggestion), and recommends that,

For robustness, software that accepts user-typed URI [sic] should attempt to recognize and strip [...] delimiters [...]
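As a rough illustration of the stripping that the appendix recommends, here is a minimal sketch using only the Rust standard library. The function name `strip_uri_delimiters` is hypothetical and is not part of url-bot. The sketch assumes the candidate has already been isolated as a single token, and it also drops the legacy "URL:" prefix that the same appendix mentions.

```rust
/// Hypothetical helper (not url-bot code): removes one layer of the
/// delimiters described in RFC 3986, Appendix C, from a single token.
fn strip_uri_delimiters(token: &str) -> &str {
    let t = token.trim();
    // Try <...> first, then "...". If neither pair matches, keep the token.
    let unwrapped = t
        .strip_prefix('<')
        .and_then(|s| s.strip_suffix('>'))
        .or_else(|| t.strip_prefix('"').and_then(|s| s.strip_suffix('"')))
        .unwrap_or(t)
        .trim();
    // Drop the legacy "URL:" prefix (with or without a trailing space).
    unwrapped
        .strip_prefix("URL:")
        .map(str::trim_start)
        .unwrap_or(unwrapped)
}
```

With this sketch, `<https://example.com/a>` and `"https://example.com/a"` both reduce to `https://example.com/a`, while a token without matching delimiters is returned unchanged.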

nuxeh · 2020-04-08T07:57:50Z

Hi, thanks for the suggestion, and yes, I totally agree. The parser currently used in url-bot is rather simplistic: it only splits message strings on whitespace. Getting good results, or adhering more closely to the spec, would clearly require a more complex parser than what we have.

As it turns out, there is a crate, urlocate, which seems to be designed for just what would be needed: extracting URLs from their surrounding context. That makes it a good candidate. It also has no dependencies, which is nice. So I'm thinking it could be worth investigating as a direction to go with this.
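Independently of any particular crate, the idea of extracting URLs from context rather than splitting on whitespace can be sketched with the standard library alone. This is only an illustration under stated assumptions: `find_urls` and `next_scheme` are hypothetical names, only `http://` and `https://` are recognized, and a URL is taken to end at whitespace or at one of `<`, `>`, `"`.

```rust
/// Hypothetical sketch: finds candidate http(s) URLs anywhere in a message,
/// whether or not they are surrounded by whitespace.
fn find_urls(message: &str) -> Vec<&str> {
    let mut urls = Vec::new();
    let mut rest = message;
    while let Some(start) = next_scheme(rest) {
        let candidate = &rest[start..];
        // Assumption: whitespace, '<', '>' and '"' always terminate a URL.
        let end = candidate
            .find(|c: char| c.is_whitespace() || matches!(c, '<' | '>' | '"'))
            .unwrap_or(candidate.len());
        urls.push(&candidate[..end]);
        // Continue scanning after the URL just found.
        rest = &candidate[end..];
    }
    urls
}

/// Byte offset of the earliest "http://" or "https://" in `s`, if any.
fn next_scheme(s: &str) -> Option<usize> {
    match (s.find("http://"), s.find("https://")) {
        (Some(a), Some(b)) => Some(a.min(b)),
        (a, b) => a.or(b),
    }
}
```

For a message such as `see <https://example.com/a>, thanks`, whitespace splitting yields the token `<https://example.com/a>,`, whereas this scan yields `https://example.com/a`. Other wrappers, such as parentheses, are not handled here and would need further rules.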

nuxeh · 2021-03-15T19:56:53Z

https://www.cbsnews.com/news/scientists-335-million-seed-sperm-egg-samples-moon-noahs-ark/--

is another example where a better parser might give better results.
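One heuristic a better parser might apply to a case like this is to trim trailing punctuation from a candidate URL. The sketch below assumes that the trailing `--` in the example above was not meant to be part of the link, which the comment does not state explicitly. The function name `trim_trailing_punctuation` and the chosen character set are illustrative assumptions, and because a hyphen is a legal URL character, this rule can also cut a URL that really does end in one.

```rust
/// Hypothetical heuristic (not url-bot code): drops trailing characters that
/// are more likely to be sentence punctuation than part of the URL.
fn trim_trailing_punctuation(url: &str) -> &str {
    // Assumption: none of these characters is intended as the final
    // character of a URL pasted into a chat message.
    url.trim_end_matches(|c: char| matches!(c, '.' | ',' | ';' | ':' | '!' | '?' | '-'))
}
```

Applied to the URL above, this removes only the final `--` and leaves the path ending in `noahs-ark/`; the hyphens inside the path are untouched because trimming stops at the first character outside the set.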

nuxeh added the enhancement New feature or request label Apr 8, 2020
