-
Notifications
You must be signed in to change notification settings - Fork 59
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add published_at to the extracted attributes. #58
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the contribution, greatly appreciated! Two small comments and one bigger, regarding performance and possible false positives. What do you think?
|
||
defp strategy(:text, html_tree) do | ||
html_tree | ||
|> Floki.find("div, p, span") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't it too broad? Floki.text
has "deep: true" by default, so what you're doing here is a lot.
Eg. for
<div>
<p>Hello world</p>
<div>
<div>
<span>Testing</span>
</div>
<div>
You're trying to parse dates 5 times, checking the same text multiple times. You can imagine how inefficient it can get with some deeply-nested HTMLs. Not to mention with that approach any valid date is considered published at, which IMO seems a bit too much?
Eg in redability.js published_at seems to be extracted only from meta tags: https://github.com/mozilla/readability/pull/813/files.
Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I agree it's not ideal. There's just so many that don't include it in meta tags or <time>
elements, eg. the elixir example in the fixtures.
I did consider making it optional by allowing the consumer to optionally include the date_time_parser
package. What do you think? I'm happy to remove it if you'd rather not have it at all though!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@philipbrown I think I'd either remove it or make disabled by default?
Or maybe we can make it use Floki.text(doc, deep: false)
by default?
Sth like opts: [find_in_text: :deep | true | false]
.
Also. Wonder how DateTimeParser works exaclty. If there is a valid date inside the text, will it find it?
If yes, maybe we could take the nested text from the outer-most div and run parser on it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I'm just going to remove it. It's more of a nice to have, and it's hard to be confident that it's going to work consistently so I think it's better to just not have it.
|
||
html_tree | ||
|> Floki.find(selector) | ||
|> Enum.flat_map(&Floki.attribute(&1, "content")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Floki.find + flat_map(&Floki.attribute/2) = Floki.attribute/3 . This should simplify these cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah nice 🙏 I didn't realise I could do that! I've pushed a fix for this.
defp parse(value) do | ||
Enum.find_value([:datetime, :date, :string], fn parser -> | ||
parse(parser, value) | ||
end) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It took me a moment to understand how find_value exactly works with that approach.
Why not
parse(:datetime, value) || parse(:date, value) || parse(:string, value)
or maybe even (just braindumping, not sure about that :D)
with {:error, _} <- DateTime.from_iso8601(value),
{:error, _} <- Date.from_iso8601(value),
{:error, _} <- DateTimeParser.parse(value) do
nil
else
{:ok, datetime, _} -> datetime
{:ok, datetime} -> datetime
_ -> nil
end
Probably separate functions is clearer and you can use only one of them 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah you're right, I've changed it to:
parse(:datetime, value) || parse(:date, value) || parse(:string, value)
I'm not a fan of with..else
5310cae
to
bdec438
Compare
bdec438
to
eb258f3
Compare
LGTM 💜 Will release next version as soon as I have a little bit of time, so in a day or two. Thank you @philipbrown ! |
I've added extracting the
published_at
from the article.