Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix low-res images in Medium pages that uses lazy loading #588

Merged
merged 11 commits into from
Apr 3, 2020
Merged

Fix low-res images in Medium pages that uses lazy loading #588

merged 11 commits into from
Apr 3, 2020

Conversation

RadhiFadlillah
Copy link
Contributor

@RadhiFadlillah RadhiFadlillah commented Mar 26, 2020

Several months ago, an issue is submitted to my project that uses go-readability, the port of readability.js for Go. I think I've fixed it there, so might as well put it here as well.

The summary of that issue is, Medium pages lazy load the image using JavaScript. However, they also provide the high quality image inside noscript tag in case the user disabled the JS in their browser. Unfortunately, right now Readability removes all noscript tags without exception.

By the way, this issue is only happened when Readability is used in server, and works fine in Firefox reader mode.


To fix this issue, I decide to unwrap img from inside noscript if it fulfills two criteria :

  • the previous direct element sibling of the noscript is img; and
  • noscript only contains exactly one img element.

To prevent wrong unwrapping, I also did these steps before unwrapping the noscript :

  • Find all div that contains exactly one image, then put it out outside the div. This is done to
    prevent case where a noscript has direct previous sibling of a div that doesn't have any use except containing a single img.
  • Find all img that doesn't have any sources. This is done, because in Medium they use empty img as placeholder. If it doesn't removed, after unwrapping we will ended up with two image, both the low quality and high quality image.

For sample, I use this Medium article and here is the result :

Current Readability.js After PR

Unfortunately, the test is failed for 5 pages :

  • bug-1255978
  • citylab-1
  • mozilla-1
  • seattletimes-1
  • yahoo-1

However, from those failed tests, all changes is only happened in its DOM structure, yet visually there are (almost) no change as can be seen below :

bug-1255978
Current Readability.js After PR
citylab-1
Current Readability.js After PR
mozilla-1
Current Readability.js After PR
seattletimes-1
Current Readability.js After PR
yahoo-1
Current Readability.js After PR

So, what do you think ? If you think it's fine, I will update the expected result for those tests later.

@RadhiFadlillah RadhiFadlillah marked this pull request as ready for review March 26, 2020 14:25
@gijsk
Copy link
Contributor

gijsk commented Mar 27, 2020

Thanks for the patch, this looks interesting!

I'm a bit confused because the yahoo testcase seems to miss images in your screenshot?

I don't mind the small markup changes, but I'd prefer to see the diffs of the resulting expected.html cases (which you can regenerate with npm run test/generate-testcase.js <slug> ) as part of this patch.

@RadhiFadlillah
Copy link
Contributor Author

@gijsk I've updated the test for those pages.

I'm a bit confused because the yahoo testcase seems to miss images in your screenshot?

No, it just Puppeeter being weird when taking screenshot. If you view it in browser normally, the Youtube image is still there.

Copy link
Contributor

@gijsk gijsk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really interesting! I left some comments, mostly relating to some grammar/spelling, but there's a few concrete issues:

  • this won't improve which image we use everywhere - though I think we can take the patch, it'd be nice if we kept the source of the image we dropped, so that consumers of the readability markup could choose to e.g. do a HEAD request on both images to determine their size, or apply some heuristic to the URL or something.
  • there are some tricky issues with looping over elements and removing them...

<p>Get fast and easy access to the features you use most in the new menu. Open the “Customize” panel to add, move or remove any button you want. Keep your favorite features — add-ons, private browsing, Sync and more — one quick click away.</p>
<p><img src="http://mozorg.cdn.mozilla.net/media/img/firefox/desktop/customize/designed-redesigned-high-res.6efd60766484.png" data-processed="false" data-src="//mozorg.cdn.mozilla.net/media/img/firefox/desktop/customize/designed-redesigned.fbd3ee9402e6.png" data-high-res="true" data-high-res-src="//mozorg.cdn.mozilla.net/media/img/firefox/desktop/customize/designed-redesigned-high-res.6efd60766484.png" alt="" id="designed-mobile" /> </p>
<p><img src="http://mozorg.cdn.mozilla.net/media/img/firefox/desktop/customize/designed-redesigned.fbd3ee9402e6.png" alt="" id="designed-mobile" />
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it's hard to know what would happen here today, because the original images no longer load and the page has disappeared off google, but the file names here indicate that we're now loading the mobile version of the image instead of the high-res one. That's the opposite of what you wanted to accomplish, right?

I'm not sure there's a way to really solve both cases, short of trying to come up with heuristics to determine which URL is better - or creating some kind of pluggability where the consumer of the library can fetch both and determine which they want... which seems tricky, because it more or less entails starting to use await and promises in the library, which is its own project.

Could we keep the old/other source available on the image in a separate attribute, so that consumers of the resulting markup could still make their own decisions later? That might be an easier way of giving control to the library consumers without needing too many architectural changes inside readability.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in d32f39f

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit puzzled because the image for which this conversation was started still only has 1 attribute?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this conflicts with the code from #542 in _fixLazyImages, because your patch is replacing the original image with the one from inside the noscript element, without preserving the data-src etc. attributes.

I'll add a suggestion on how to fix this.

@RadhiFadlillah RadhiFadlillah requested a review from gijsk April 1, 2020 10:11
@gijsk
Copy link
Contributor

gijsk commented Apr 2, 2020

Apologies for the delay here; it's code freeze crunch time and I'm stretched a little thin at the moment. I hope to be able to look at this later today, worst case tomorrow.

Copy link
Contributor

@gijsk gijsk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies again for the slow review. I have one more nitpick to really preserve high-res images here, and then I think we're good.

<p>Get fast and easy access to the features you use most in the new menu. Open the “Customize” panel to add, move or remove any button you want. Keep your favorite features — add-ons, private browsing, Sync and more — one quick click away.</p>
<p><img src="http://mozorg.cdn.mozilla.net/media/img/firefox/desktop/customize/designed-redesigned-high-res.6efd60766484.png" data-processed="false" data-src="//mozorg.cdn.mozilla.net/media/img/firefox/desktop/customize/designed-redesigned.fbd3ee9402e6.png" data-high-res="true" data-high-res-src="//mozorg.cdn.mozilla.net/media/img/firefox/desktop/customize/designed-redesigned-high-res.6efd60766484.png" alt="" id="designed-mobile" /> </p>
<p><img src="http://mozorg.cdn.mozilla.net/media/img/firefox/desktop/customize/designed-redesigned.fbd3ee9402e6.png" alt="" id="designed-mobile" />
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit puzzled because the image for which this conversation was started still only has 1 attribute?

<p>Get fast and easy access to the features you use most in the new menu. Open the “Customize” panel to add, move or remove any button you want. Keep your favorite features — add-ons, private browsing, Sync and more — one quick click away.</p>
<p><img src="http://mozorg.cdn.mozilla.net/media/img/firefox/desktop/customize/designed-redesigned-high-res.6efd60766484.png" data-processed="false" data-src="//mozorg.cdn.mozilla.net/media/img/firefox/desktop/customize/designed-redesigned.fbd3ee9402e6.png" data-high-res="true" data-high-res-src="//mozorg.cdn.mozilla.net/media/img/firefox/desktop/customize/designed-redesigned-high-res.6efd60766484.png" alt="" id="designed-mobile" /> </p>
<p><img src="http://mozorg.cdn.mozilla.net/media/img/firefox/desktop/customize/designed-redesigned.fbd3ee9402e6.png" alt="" id="designed-mobile" />
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this conflicts with the code from #542 in _fixLazyImages, because your patch is replacing the original image with the one from inside the noscript element, without preserving the data-src etc. attributes.

I'll add a suggestion on how to fix this.

@gijsk
Copy link
Contributor

gijsk commented Apr 3, 2020

Looks great, thank you!

@gijsk gijsk merged commit 668a3a1 into mozilla:master Apr 3, 2020
@PalmerAL
Copy link
Contributor

I'm having a bit of trouble getting this to work as expected; when I run it against a Medium page, the output includes both the low-res and the high-res images.

If I regenerate the lazy-image-1 testcase, I get a different source file than what's in this PR, so it's possible that Medium's changed their site in the past few days:

@RadhiFadlillah I'll try to look into fixing this soon if you don't have time to.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants