Let's set an example.
Say we have a website https://example.com
that hosts galleries of pictures and videos curated by users. We want to download all the images and video files in the gallery, and we also want to save some text information, like the username of the curator.
An example gallery, located at https://example.com/gallery/1337
would look something like this:
<html>
<head>
<title>My awesome gallery</title>
</head>
<body>
<h1>My awesome gallery</h1>
<p>
Curated by <a href="https://example.com/user/everdrone">@everdrone</a>
</p>
<p>Wednesday, August 17, 2022</p>
<div>
<h1>images</h1>
<img src="https://cdn.example.com/img/jpg/94257478745" />
<img src="https://cdn.example.com/img/jpg/20239846093" />
<img src="https://cdn.example.com/img/jpg/39808447626" />
<video
src="https://cdn.example.com/video/1337/39808447626_small.mp4"
></video>
</div>
</body>
</html>
To get the gallery curator we could parse the url to get whatever comes after user/
. We also have the title both inside <title>
and inside <h1>
, the video and image urls are also easily accessible.
Let's start by generating a new configuration file using grab config generate
. Remove everything inside the new grab.hcl
file and write the following:
global {
location = "/home/<username>/Downloads/grab"
}
Replace the path with any directory you want, we will refer to this path as global.location
.
This will tell grab where to store everything: sites, images, videos, information.
Let's add a site
block called example
, then an attribute test
with the regular expression that will check if we actually are on https://example.com
.
In there we also want to have two asset
blocks, one for images and one for videos.
global {
location = "/home/<username>/Downloads/grab"
}
site "example" {
test = ":\\/\\/example\\.com"
asset "image" {
pattern = "<img\\ssrc=\"([^\"]+)"
capture = 1
find_all = true
}
asset "video" {
pattern = "<video\\ssrc=\"(?P<video_url>[^\"]+)"
capture = "video_url"
find_all = true
}
}
Note
To learn how to escaping regular expressions in HCL, refer to the section below.
If you're familiar with Regular Expressions, the patterns above should be pretty easy to understand, but we'll go through them here anyway.
The asset[image].pattern
expression captures whatever comes after <img src="
until it finds another double quote, so it will get us the entire image url.
The expression in asset[video].pattern
does the same thing but with a video
tag, and it uses a named capture group, for the sake of this example.
Notice how the capture
attribute is a number when the regular expression doesn't use named captures, and it's a string in the video
asset because we want to capture the (?P<video_url>)
group.
Warning
If thepattern
attribute contains named groups, you must set thecapture
to a string representing the name of the group.
Use an integercapture
only if yourpattern
does not contain named groups.To learn more about Go's regexp syntax see the official documentation.
We also want set find_all
to true
because we expect to find multiple video and image urls on the site page.
Now let's run the program and see what we get.
grab get https://example.com/gallery/1337
Internally Grab stores the following urls extracted with the regular expressions defined in each asset
block:
https://cdn.example.com/img/jpg/94257478745
https://cdn.example.com/img/jpg/20239846093
https://cdn.example.com/img/jpg/39808447626
https://cdn.example.com/video/1337/39808447626_small.mp4
After Grab finishes downloading everything, we can browse the global.location
directory to find out that a new example
directory has been created. This directory is named after the site "example"
block.
If we navigate inside the example
directory we will see the following files:
94257478745
20239846093
39808447626
39808447626_small.mp4
The files without extension are the images, and then we have one video file.
In addition to asset
blocks, site can also contain info
blocks.
global {
location = "/home/<username>/Downloads/grab"
}
site "example" {
test = ":\\/\\/example\\.com"
info "curator" {
pattern = "\\/user\\/(\\w+)"
capture = 1
from = body
}
info "gallery_id" {
pattern = "gallery\\/(\\d+)"
capture = 1
from = url
}
}
Info blocks behave like asset
blocks: the pattern
matches against some text, and capture
determines the group to extract.
In addition we can specify if we want to match the pattern
against the body
or the url
by setting the from
attribute. If nothing is specified, the url
will be used.
Some websites require a certain set of header to be specified to access a page, or even just for user tracking.
Grab allows us to specify a network
block, either at the root level (inside global
), site level (inside each site
) or asset level (inside each asset
).
global {
location = "/home/<username>/Downloads/grab"
network {
retries = 3
}
}
site "example" {
test = ":\\/\\/example\\.com"
network {
retries = 3
headers = {
"User-Agent" = "Mozilla/5.0 ..."
}
}
asset "video" {
pattern = "<video\\ssrc=\"(?P<video_url>[^\"]+)"
capture = "video_url"
find_all = true
network {
inherit = false
timeout = 10000
}
}
# ...
}
All network
blocks can contain the following attributes:
headers
-map[string]string
: a dictionary of headers to send along each request.retries
-int
: how many times the client will send the same request, if the first attempt fails.timeout
-int
: how long the client will wait before giving up on a request, in milliseconds.inherit
-bool
: should this network block inherit missing properties from the parentnetwork
block?
Note
Theinherit
property is not available for theglobal.network
block, since there is nothing to inherit from.
Let's organize our downloads by making Grab create subdirectories, so that for any other gallery than the one located at https://example.com/gallery/1337
, we get a directory named with the gallery id.
global {
location = "/home/<username>/Downloads/grab"
}
site "example" {
test = ":\\/\\/example\\.com"
asset "image" {
pattern = "<img\\ssrc=\"([^\"]+)"
capture = 1
find_all = true
}
asset "video" {
pattern = "<video\\ssrc=\"(?P<video_url>[^\"]+)"
capture = "video_url"
find_all = true
}
subdirectory {
pattern = "gallery\\/(\\d+)"
capture = 1
from = url
}
}
By adding a subdirectory
block inside our site block, we tell grab to create a new directory with the name extracted from the url, by capturing the first group of the expression gallery\\/(\\d+)
, so in this case, all the assets will be located at <global.location>/example/1337
.
Note
Thesubdirectory
block, if defined, will tell the program to download the assets into<global.location>/<site.name>/<subdirectory>/<filename>
If no subdirectory is specified, the assets will be saved to<global.location>/<site.name>/<filename>
There are still a few issues to uncover:
- The images have no extension, but the url suggests that they are encoded using the
jpeg
format. - The video url ends with
_small
, but we know that there's a better quality option ending with_large
.
So we will update our configuration file to transform the filename of the image
assets by moving the /jpg/
that comes before the name into the extension name. And for the video
block, we need to transform the url but replacing _small
with _large
, before the file is downloaded.
To achieve this, Grab offers the transform
block.
global {
location = "/home/<username>/Downloads/grab"
}
site "example" {
test = ":\\/\\/example\\.com"
asset "image" {
pattern = "<img\\ssrc=\"([^\"]+)"
capture = 1
find_all = true
transform filename {
pattern = "(https?[^\\?\\#]+)(.*)"
replace = "$${1}.jpg"
}
}
asset "video" {
pattern = "<video\\ssrc=\"(?P<video_url>[^\"]+)"
capture = "video_url"
find_all = true
transform url {
pattern = "(.+)_small(\\.\\w+)"
replace = "$${1}_large$${2}"
}
}
subdirectory {
pattern = "gallery\\/(\\d+)"
capture = 1
from = url
}
}
The transform
block is essentially just a RegExp replace operation performed on the URL.
To better explain the process of transform blocks, here are a flow diagram and some pseudocode:
stateDiagram-v2
A: Set default destination
B: Replace url
C: Download asset to destination
D: Replace destination
state if_filename <<choice>>
state if_url <<choice>>
[*] --> A
A --> if_url
if_url --> B : with "transform url"
if_url --> if_filename : without
B --> if_filename
if_filename --> C : without
if_filename --> D : with "transform filename"
D --> C
C --> [*]
Go pseudocode
// loop through every url matching the asset pattern
for _, url := range parsed_urls {
// by default, the destination is the last element of the original url
destination := basename(url)
// do we have a "transform url" block?
if has_transform_url {
// perform substitution of the original url with regex
url = regexp.Replace(pattern, replacement, url)
}
// do we have a "transform filename" block?
if has_transform_filename {
// perform substitution of the original url with regex
destination = regexp.Replace(pattern, replacement, url)
}
// download with the network options
response := fetch(url)
// save the file
return write_file(destination, response)
}
The replace
attribute uses the same syntax from Go's RegExp standard library package, and just like with backslash escapes, there's a gotcha about escaping.
As mentioned above, HCL offers multiple advantages over other configuration languages, including string interpolation or templating.
While very useful for multiple purposes, it can create problems when dealing with Regular Expressions, since the syntax often includes backslashes and dollar signs that would be interpreted differently by the HCL parser.
Some sequences and character classes in RegExp require a backslash. In HCL we should always escape backslashes, like we do with JSON. The following table demonstrates how some HCL strings are interpreted and passed down to Go (if valid).
HCL String | Go String | RegExp Token |
---|---|---|
\\ |
\ |
Backslash |
\ |
INVALID HCL | |
\\d |
\d |
Perl character class |
\" |
\" |
Double quote character |
\\. |
\. |
Dot character |
\. |
INVALID HCL | |
. |
. |
Any character |
\\n |
\n |
New line character |
\n |
INVALID HCL | |
(\") |
(\") |
Group capturing a double quote character |
(") |
INVALID HCL |
Take the following expression:
(https?)://(\w+)\.(\w+)
if applied to the string: https://github.com
, it produces the following groups:
group0 => https://github.com
group1 => https
group2 => github
group3 => com
Considering the example above, here's a conversion table from HCL to Go to its RegExp engine.
HCL String | Go String | RegExp Output |
---|---|---|
$1 |
$1 |
https |
${1} |
1 |
1 |
$${1} |
${1} |
https |
$${3}.$${2} |
${3}.${2} |
com.github |
${3}.${2} |
3_2 |
3_2 |
$${1}://$${2}.$${3} |
${1}://${2}.${3} |
https://github.com |
${1}://${2}.${3} |
1://2.3 |
1://2.3 |