GUIDE: No Need for JavaScript #61
dynabler started this conversation in Show and tell
It's always worth checking if you can scrape without JavaScript rendering. Avoiding JavaScript means faster turn-around and less code required. Here are some hints to point you in the right direction.
Website Source Nodes
HTML = root node
HTML is called the root node, since every other element is nested inside it.
HEAD
The head of a website is reserved for computers, bots, crawlers etc. It contains all kinds of information that is NOT visible to humans.
BODY
The body gives a website its shape: how it looks and how information is presented. Often a CSS framework is used and the elements carry no semantic meaning (div, div, div instead of section, article, footer).
HEAD = clean data
You can find a lot of data in the HEAD of a website, stripped of all the clutter you normally have to deal with when using CSS selectors on the body. It's worth a look before moving on to body scraping.
You can find two types of data in HEAD:
OpenGraph Meta Tags are used by Facebook, Twitter/X and others to display image cards when the page is shared by users.
titles
<title> and <meta og:title> can hold the clean title you're looking for. <link rel='canonical'> holds the clean URL of the page.
images
An icon is in <link rel=icon>. If available, you can also get a good-sized icon by looking for <link rel="apple-touch-icon" sizes="144x144">.
High resolution images are available in <meta og:image>. Some pages have multiple <meta og:image> tags. Each can be accompanied by an og:image:alt.
music & video data
OpenGraph gives websites the ability to share info about music and video: OpenGraph Music and OpenGraph Video.
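As a rough illustration of the HEAD-first approach, here is a minimal sketch that pulls the title, canonical URL, icons and OpenGraph tags out of a page using only Python's standard library. The class name HeadMetaParser, the helper parse_head and the sample HTML are made up for this example; real pages may put OpenGraph values in a name attribute instead of property, which the sketch also accepts.

```python
from html.parser import HTMLParser


class HeadMetaParser(HTMLParser):
    """Collect title, canonical URL, icons and og:* tags found inside <head>."""

    ICON_RELS = {"icon", "shortcut icon", "apple-touch-icon"}

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.in_title = False
        self.title = ""
        self.canonical = None
        self.icons = []  # list of (rel, sizes, href)
        self.og = {}     # property -> list of content values (og:image can repeat)

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "head":
            self.in_head = True
        elif not self.in_head:
            return  # ignore everything outside <head>
        elif tag == "title":
            self.in_title = True
        elif tag == "link":
            rel = (a.get("rel") or "").lower()
            if rel == "canonical":
                self.canonical = a.get("href")
            elif rel in self.ICON_RELS:
                self.icons.append((rel, a.get("sizes"), a.get("href")))
        elif tag == "meta":
            prop = a.get("property") or a.get("name") or ""
            if prop.startswith("og:") and a.get("content") is not None:
                self.og.setdefault(prop, []).append(a["content"])

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False
        elif tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def parse_head(html):
    """Hypothetical helper: feed an HTML string and return the filled parser."""
    parser = HeadMetaParser()
    parser.feed(html)
    parser.close()
    return parser


# Hypothetical sample page, not taken from a real website.
sample = """<html><head>
<title>Example Product | Shop</title>
<link rel="canonical" href="https://example.com/product">
<link rel="icon" href="/favicon.ico">
<link rel="apple-touch-icon" sizes="144x144" href="/touch-144.png">
<meta property="og:title" content="Example Product">
<meta property="og:image" content="https://example.com/a.jpg">
<meta property="og:image:alt" content="Front view">
<meta property="og:image" content="https://example.com/b.jpg">
</head><body><div>...</div></body></html>"""

meta = parse_head(sample)
print(meta.title)
print(meta.canonical)
print(meta.icons)
print(meta.og)
```

Storing each og:* property as a list keeps every og:image when a page declares more than one.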
No JavaScript Browser
Google Chrome allows you to have multiple profiles. Add a new one and disable JavaScript in it. You can use this profile to check whether a website has a non-JavaScript version.
Stuff to look for
A website may show you a message that JavaScript is required, but you should check whether that's really the case. More often than not, the message is only there so the owner can guarantee a good user experience. You can browse a website with JavaScript, but scrape it without JavaScript.
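The same check can be done from code: take the raw HTML as the server delivers it and see whether the data you want is already there, whatever the warning says. Below is a minimal sketch with the standard library only. The function names visible_text and works_without_js and the sample HTML are assumptions for illustration; for a real page you would first download the raw HTML, for example with urllib.request.urlopen.

```python
import re


def visible_text(html):
    """Drop <script> and <style> blocks, then strip all remaining tags."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", text).strip()


def works_without_js(html, expected_phrases):
    """Return True if every expected phrase is present in the unrendered HTML text."""
    text = visible_text(html).lower()
    return all(phrase.lower() in text for phrase in expected_phrases)


# Hypothetical raw response: it shows a JavaScript warning,
# yet the product data is already present in the markup.
raw = """<html><body>
<noscript>Please enable JavaScript to use this site.</noscript>
<script>window.app = {price: "hidden"};</script>
<h1>Blue Widget</h1><span class="price">19.99 EUR</span>
</body></html>"""

print(works_without_js(raw, ["Blue Widget", "19.99 EUR"]))
```

If the function returns False for phrases you can see in a normal browser, that content is probably injected by JavaScript and a plain HTTP request will not be enough.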