Incorrect loading of iframes + incorrect parse / converting tables #150

Open
ntrippar opened this issue Nov 1, 2024 · 1 comment

ntrippar commented Nov 1, 2024

If we try to parse, say, a Medium post, the gist code does not show up. I believe this is because the iframes are not loaded immediately; you need to scroll through the page in the browser for them to render. One solution could be to detect every iframe on the page and scroll to each of them so that they load.


curl 'https://r.jina.ai/https://edoconti.medium.com/offline-policy-evaluation-run-fewer-better-a-b-tests-60ce8f93fa15' \
	-H "Authorization: Bearer TOKEN" \
	-H "X-No-Cache: true" \
	-H "X-Timeout: 60" \
	-H "X-With-Iframe: true"

The parser itself also mishandles the gist code. For example, for the URL above, the iframe for one of the gists is https://edoconti.medium.com/media/f10f007fac2ec7a4c0662ac12428a7fe

curl 'https://r.jina.ai/https://edoconti.medium.com/media/f10f007fac2ec7a4c0662ac12428a7fe' \
	-H "Authorization: Bearer TOKEN" \
	-H "X-No-Cache: true" \
	-H "X-Timeout: 60" \
	-H "X-With-Iframe: true"

It parses incorrectly and returns raw HTML tags:


Title: sample-push-notification-policy.py – Medium

URL Source: https://edoconti.medium.com/media/f10f007fac2ec7a4c0662ac12428a7fe

Markdown Content:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. [Learn more about bidirectional Unicode characters](https://github.co/hiddenchars)

[Show hidden characters](https://edoconti.medium.com/media/%7B%7BrevealButtonHref%7D%7D)

<table data-hpc="" data-tab-size="8" data-paste-markdown-skip="" data-tagsearch-lang="Python" data-tagsearch-path="sample-push-notification-policy.py"><tbody><tr><td id="file-sample-push-notification-policy-py-L1" data-line-number="1"></td><td id="file-sample-push-notification-policy-py-LC1"><span>def</span> <span>get_push_send_probabilities</span>(<span>context</span>):</td></tr><tr><td id="file-sample-push-notification-policy-py-L2" data-line-number="2"></td><td id="file-sample-push-notification-policy-py-LC2"><span>epsilon</span> <span>=</span> <span>0.10</span></td></tr><tr><td id="file-sample-push-notification-policy-py-L3" data-line-number="3"></td><td id="file-sample-push-notification-policy-py-LC3"></td></tr><tr><td id="file-sample-push-notification-policy-py-L4" data-line-number="4"></td><td id="file-sample-push-notification-policy-py-LC4"><span>if</span> <span>context</span>[<span>"days_since_app_open"</span>] <span>&gt;</span> <span>1</span>:</td></tr><tr><td id="file-sample-push-notification-policy-py-L5" data-line-number="5"></td><td id="file-sample-push-notification-policy-py-LC5"><span>return</span> {<span>"send"</span>: <span>1</span> <span>-</span> <span>epsilon</span>, <span>"dont_send"</span>: <span>epsilon</span>}</td></tr><tr><td id="file-sample-push-notification-policy-py-L6" data-line-number="6"></td><td id="file-sample-push-notification-policy-py-LC6"></td></tr><tr><td id="file-sample-push-notification-policy-py-L7" data-line-number="7"></td><td id="file-sample-push-notification-policy-py-LC7"><span>return</span> {<span>"send"</span>: <span>epsilon</span>, <span>"dont_send"</span>: <span>1</span> <span>-</span> <span>epsilon</span>}</td></tr></tbody></table>

nomagick (Member) commented

Resources on this page are lazy-loaded. Also, the gist iframes use a table for layout, so they cannot be transformed into a code block or a typical markdown table.

We have introduced a script injection mechanism to our API.
Inside the page, we also provide these utility functions and event:

- waitForSelector(selector: string): Promise<HTMLElement>
  waits for the selector to appear in the DOM
- simulateScroll(): void
  simulates scrolling to the bottom of the page to trigger lazy-loaded elements
- "mutationIdle" event on document
  fires when the DOM has been free of mutations for 200 ms
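
As a rough illustration of how these utilities might be combined, the sketch below composes a page script that waits for a selector and then triggers a scroll. The helper name `wait_then_scroll_script` is hypothetical, and it assumes both `waitForSelector` and `simulateScroll` are reachable as globals inside the page; it only builds the script text and does not call the API.

```python
import json


def wait_then_scroll_script(selector: str) -> str:
    """Return JavaScript that scrolls once `selector` appears in the DOM.

    Hypothetical helper: it assumes the page-side utilities named in this
    thread (waitForSelector, simulateScroll) are available as globals.
    """
    # json.dumps yields a double-quoted, escaped string literal that is
    # also valid JavaScript, so selectors containing quotes stay intact.
    js_selector = json.dumps(selector)
    return f"waitForSelector({js_selector}).then(() => simulateScroll());"


# Example: wait for the first iframe before scrolling.
script = wait_then_scroll_script("iframe")
print(script)
```

The resulting string is meant to be passed as the injected page script; whether waiting for a selector is preferable to listening for "mutationIdle" depends on the page.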

For the gist formatting, we introduced an x-with-iframe: quoted parameter that injects iframe contents as blockquote sections.

So, for your URL, use this command:

curl 'https://r.jina.ai/https://edoconti.medium.com/offline-policy-evaluation-run-fewer-better-a-b-tests-60ce8f93fa15' \
  -H 'x-with-iframe: quoted' \
  -H 'x-timeout: 60' \
  --data-urlencode 'injectPageScript=document.addEventListener("mutationIdle", window.simulateScroll);'

The new parameters are yet to be documented but are already usable.
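
For anyone calling the API from code rather than curl, here is a minimal Python sketch of the same request using only the standard library. The function name `build_reader_request` and the constants are illustrative, not part of the API. It assumes that a form-encoded POST body matches what curl's `--data-urlencode` sends. The request is only built here, not sent.

```python
from urllib.parse import parse_qs, urlencode
from urllib.request import Request

READER_PREFIX = "https://r.jina.ai/"

# Page script from the curl example: scroll whenever the DOM goes idle.
SCROLL_ON_IDLE = 'document.addEventListener("mutationIdle", window.simulateScroll);'


def build_reader_request(target_url: str, page_script: str, timeout_s: int = 60) -> Request:
    """Build (but do not send) a POST request mirroring the curl command."""
    # Form-encode the script, as curl's --data-urlencode would.
    body = urlencode({"injectPageScript": page_script}).encode("ascii")
    return Request(
        READER_PREFIX + target_url,
        data=body,
        headers={
            "x-with-iframe": "quoted",
            "x-timeout": str(timeout_s),
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )


request = build_reader_request(
    "https://edoconti.medium.com/offline-policy-evaluation-run-fewer-better-a-b-tests-60ce8f93fa15",
    SCROLL_ON_IDLE,
)

# Decode the body again to confirm the script survives form encoding.
decoded_script = parse_qs(request.data.decode("ascii"))["injectPageScript"][0]
print(request.get_method(), request.full_url)
```

To actually send it, pass `request` to `urllib.request.urlopen` with a client-side timeout somewhat longer than the x-timeout value.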
