Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add published_at to the extracted attributes. #58

Merged
merged 1 commit into from
Nov 5, 2024

Conversation

philipbrown
Copy link
Contributor

I've added extracting the published_at from the article.

Copy link
Collaborator

@Valian Valian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the contribution, greatly appreciated! Two small comments and one bigger, regarding performance and possible false positives. What do you think?


defp strategy(:text, html_tree) do
html_tree
|> Floki.find("div, p, span")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it too broad? Floki.text has "deep: true" by default, so what you're doing here is a lot.

Eg. for

<div>
  <p>Hello world</p>
  <div>
    <div>
       <span>Testing</span>
    </div>
  <div>

You're trying to parse dates 5 times, checking the same text multiple times. You can imagine how inefficient it can get with some deeply-nested HTMLs. Not to mention with that approach any valid date is considered published at, which IMO seems a bit too much?

Eg in redability.js published_at seems to be extracted only from meta tags: https://github.com/mozilla/readability/pull/813/files.

Thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I agree it's not ideal. There's just so many that don't include it in meta tags or <time> elements, eg. the elixir example in the fixtures.

I did consider making it optional by allowing the consumer to optionally include the date_time_parser package. What do you think? I'm happy to remove it if you'd rather not have it at all though!

Copy link
Collaborator

@Valian Valian Nov 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@philipbrown I think I'd either remove it or make disabled by default?

Or maybe we can make it use Floki.text(doc, deep: false) by default?

Sth like opts: [find_in_text: :deep | true | false].

Also. Wonder how DateTimeParser works exaclty. If there is a valid date inside the text, will it find it?

If yes, maybe we could take the nested text from the outer-most div and run parser on it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'm just going to remove it. It's more of a nice to have, and it's hard to be confident that it's going to work consistently so I think it's better to just not have it.


html_tree
|> Floki.find(selector)
|> Enum.flat_map(&Floki.attribute(&1, "content"))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Floki.find + flat_map(&Floki.attribute/2) = Floki.attribute/3 . This should simplify these cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah nice 🙏 I didn't realise I could do that! I've pushed a fix for this.

defp parse(value) do
Enum.find_value([:datetime, :date, :string], fn parser ->
parse(parser, value)
end)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It took me a moment to understand how find_value exactly works with that approach.

Why not

parse(:datetime, value) || parse(:date, value) || parse(:string, value)

or maybe even (just braindumping, not sure about that :D)

with {:error, _} <- DateTime.from_iso8601(value),
     {:error, _} <- Date.from_iso8601(value), 
     {:error, _} <- DateTimeParser.parse(value) do
     nil
else 
     {:ok, datetime, _} -> datetime
     {:ok, datetime} -> datetime
     _ -> nil
 end       

Probably separate functions is clearer and you can use only one of them 😅

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah you're right, I've changed it to:

parse(:datetime, value) || parse(:date, value) || parse(:string, value)

I'm not a fan of with..else

@Valian
Copy link
Collaborator

Valian commented Nov 5, 2024

LGTM 💜 Will release next version as soon as I have a little bit of time, so in a day or two. Thank you @philipbrown !

@Valian Valian merged commit e9a80fc into keepcosmos:master Nov 5, 2024
4 checks passed
@philipbrown philipbrown deleted the add-published-at branch November 6, 2024 08:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants