
Add process overview to README
freddyheppell committed Oct 13, 2023
1 parent 1c776d2 commit 2735e3f
Showing 1 changed file with 65 additions and 2 deletions.
67 changes: 65 additions & 2 deletions README.md
@@ -29,7 +29,7 @@ This tool takes a scrape of the API JSON and HTML pages of the site.

The scrape should be in a 'merged pages' format, i.e. the pages of the list endpoint should be iterated and each page merged together into one list. This can be done by a tool such as [WPJSONScraper](https://github.com/freddyheppell/wp-json-scraper).

- The following files should be placed in a directory. They names may be prefixed by a consistent string (e.g. to record the date).
+ The following files should be placed in a directory. Their names may be prefixed by a consistent string (e.g. to record the date).

| File Name | Endpoint |
| ----------------- | -------------------------------------- |
@@ -106,4 +106,67 @@ The extractor can also be used as a library instead of on the command line.

Typically, you would instantiate a new [`WPExtractor`](src/extractor/extract.py) instance and call its `extract` method. The dataframes can then be accessed as class attributes or exported with the `export` method.

- An example usage is available in the CLI script ([`extractor.cli`](src/extractor/cli.py))
+ An example usage is available in the CLI script ([`extractor.cli`](src/extractor/cli.py)).

When using this approach, it's possible to use customised translation pickers (see the `translation_pickers` argument of `WPExtractor`). These should be child classes of [`extractor.parse.translations.LangPicker`](src/extractor/parse/translations/_pickers.py).
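
For orientation, a minimal sketch of library use is shown below; the constructor arguments, attribute names and paths are illustrative assumptions, and [`extractor.cli`](src/extractor/cli.py) shows the real invocation:

```python
from extractor.extract import WPExtractor

# NOTE: keyword and attribute names here are assumptions for illustration only;
# consult extractor.cli for the actual constructor signature.
wp = WPExtractor(
    json_root="scrape/json",    # hypothetical: directory of merged API JSON files
    scrape_root="scrape/html",  # hypothetical: root of the HTML scrape
)
wp.extract()

posts = wp.posts   # dataframes are exposed as attributes (attribute name assumed)
wp.export("out/")  # export the dataframes (output argument assumed)
```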

## Extraction Overview

This section gives an overview of the data extraction process.

### 1. Scrape Crawling

Website scraping tools may store a webpage at a path that is not easily derived from its URL, for example because of path length limits. We therefore crawl the scrape directory and build a mapping from URL to path.

For every HTML file at any depth in the scrape directory, we:
1. Perform a limited parse of only the `link` and `meta` tags in the file's head.
2. Attempt to extract a valid URL from a `link` tag with `rel="alternate"` or a `canonical` meta tag.
3. Check that the URL has not previously been seen; warn and skip the file if it has.
4. Add the URL to the map with the absolute path of the file.

This map is then saved as `url_cache.json` in the scrape directory. If an existing cache file is detected, it is used instead of crawling again.
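
As a rough sketch of this crawl step (a simplified illustration, not the tool's exact implementation; the helper name and duplicate handling are assumptions):

```python
import json
from pathlib import Path

from bs4 import BeautifulSoup, SoupStrainer


def build_url_cache(scrape_dir: Path) -> dict[str, str]:
    """Map each page's canonical/alternate URL to the absolute path of its HTML file."""
    url_map: dict[str, str] = {}
    for html_path in scrape_dir.rglob("*.html"):
        # Limited parse: only consider <link> tags rather than the whole document.
        soup = BeautifulSoup(
            html_path.read_text(errors="ignore"),
            "lxml",
            parse_only=SoupStrainer("link"),
        )
        url = None
        for tag in soup.find_all("link"):
            rel = tag.get("rel") or []
            if "canonical" in rel or "alternate" in rel:
                url = tag.get("href")
                break
        if not url:
            continue
        if url in url_map:
            print(f"Warning: {url} already seen, skipping {html_path}")
            continue
        url_map[url] = str(html_path.resolve())
    (scrape_dir / "url_cache.json").write_text(json.dumps(url_map, indent=2))
    return url_map
```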

### 2. Content Extraction

Each type of content (posts, pages, media, etc.) is now extracted in turn.

The extraction process is applied to all posts at once, performing the following steps in order:
1. Extract raw text from the HTML-formatted title and excerpt.
2. Parse the HTML content from the API response input.
3. Parse the HTML content from the scrape file, if one was found for the link during the crawl.
4. Extract the post's language and translations from the scrape file.
    * Translations are detected using the translation pickers in the [`extractor.parse.translations`](src/extractor/parse/translations) module.
    * Custom pickers can be added by [using this tool as a library](#using-as-a-library).
    * Any extracted translations are stored as unresolved links.
5. Add the post's link to the link registry.
6. Using the parsed API content response, extract:
    * Internal links (stored as unresolved links)
    * External links (stored as resolved links)
    * Embeds (`iframe` tags)
    * Images (stored as unresolved media if internal, resolved media if external), including their source URL, alt text and caption (if they are in a `figure`)
    * Raw text content, via the following process (a sketch follows this list):
        1. Remove tags which contain unwanted text (e.g. `figcaption`)
        2. Replace `<br>` and `<p>` tags with newline characters
        3. Combine all page text
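
A minimal sketch of that raw-text step (only `figcaption` is named above; the other removed tags are assumptions):

```python
from bs4 import BeautifulSoup


def extract_raw_text(html: str) -> str:
    """Strip unwanted tags, convert structural breaks to newlines, then combine the text."""
    soup = BeautifulSoup(html, "lxml")
    # 1. Remove tags whose text should not appear in the plain-text output.
    for tag in soup.find_all(["figcaption", "script", "style"]):
        tag.decompose()
    # 2. Replace <br> tags with newlines and terminate each <p> with one.
    for br in soup.find_all("br"):
        br.replace_with("\n")
    for p in soup.find_all("p"):
        p.append("\n")
    # 3. Combine all remaining page text.
    return soup.get_text().strip()
```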

Other types are extracted in similar ways. Any additional user-supplied fields with HTML formatting (such as media captions) are also extracted as plain text.

### 3. Translation Normalisation and Link Resolution

Translations are normalised by checking, for every translation relation (e.g. `en` -> `fr`), that the reverse relation also exists. If not, it is added.
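
Conceptually, this is a symmetry check. A simplified sketch over language-code pairs (the real tool operates on the extracted dataframes):

```python
def normalise_translations(relations: set[tuple[str, str]]) -> set[tuple[str, str]]:
    """Ensure every (source, target) translation relation also has its reverse."""
    # e.g. {("en", "fr")} becomes {("en", "fr"), ("fr", "en")}
    return relations | {(dst, src) for src, dst in relations}
```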


After all types have been processed, the link registry is used to resolve the unresolved links, translations and media.

For every resolution, the following steps are performed:
1. Remove the `preview_id` query parameter from the URL if present.
2. Attempt to look up the URL in the link registry.
3. If unsuccessful, use a heuristic to detect category slugs in the URL and try without them.
    * We do this in case sites have removed category slugs from the URL at some point.
4. If unsuccessful, warn that the URL is unresolvable.

For each resolved link, translation, or media item, a destination is set containing its normalised URL, data type, and ID.
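
A simplified sketch of resolving a single URL (the registry structure and the category-slug heuristic shown here are assumptions):

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def resolve(url: str, registry: dict[str, tuple[str, str]]):
    """Look a URL up in the link registry, returning (data type, ID) or None."""
    # 1. Remove the preview_id query parameter if present.
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != "preview_id"]
    parts = parts._replace(query=urlencode(query))
    url = urlunsplit(parts)

    # 2. Attempt a direct lookup in the registry.
    if url in registry:
        return registry[url]

    # 3. Heuristic retry: assume the first path segment may be a category slug.
    segments = parts.path.strip("/").split("/")
    if len(segments) > 1:
        retry = urlunsplit(parts._replace(path="/" + "/".join(segments[1:])))
        if retry in registry:
            return registry[retry]

    # 4. Warn that the URL is unresolvable.
    print(f"Warning: could not resolve {url}")
    return None
```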

### 4. Export

For each type, a subset of its columns is exported to its own JSON file.
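
For instance (a sketch only; the column subsets and file layout are illustrative assumptions):

```python
import pandas as pd

# Hypothetical column subsets; the real tool defines its own per type.
EXPORT_COLUMNS = {
    "posts": ["id", "link", "title", "text", "language"],
    "media": ["id", "link", "alt", "caption"],
}


def export_type(df: pd.DataFrame, name: str, out_dir: str) -> None:
    """Subset this type's columns and write them to a JSON file."""
    df[EXPORT_COLUMNS[name]].to_json(f"{out_dir}/{name}.json", orient="records", indent=2)
```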
