Merge pull request #47 from GateNLP/release/1.1.0
Release 1.1.0
freddyheppell authored Aug 13, 2024
2 parents 35e80da + 1c73af1 commit 8328435
Showing 149 changed files with 114,143 additions and 981 deletions.
18 changes: 18 additions & 0 deletions .github/CONTRIBUTING.md
@@ -40,6 +40,24 @@ $ make lint

Both library code and tests are linted (although tests are slightly less restrictive, see `pyproject.toml`).

### Docstring Validation

Additional checks for missing `Returns` and `Raises` sections in the docstrings can be run with:

```shell-session
$ make doclint
```

This will raise false positives for returns in abstract methods, and should be considered experimental.

## Type Checking

To check types with mypy, run:

```shell-session
$ make types
```

## Branch Management

Generally your contribution should be made to the `dev` branch. We will then merge it into `main` only when it's time to release.
2 changes: 1 addition & 1 deletion .github/workflows/lint.yml
@@ -4,7 +4,7 @@ on:
push:
branches: ["main", "dev"]
pull_request:
branches: ["main", "dev"]
branches: ["main", "dev", "release/*"]

permissions:
contents: read
4 changes: 3 additions & 1 deletion .github/workflows/test.yml
@@ -4,7 +4,7 @@ on:
push:
branches: ["main", "dev"]
pull_request:
branches: ["main", "dev"]
branches: [ "main", "dev", "release/*" ]

permissions:
contents: read
@@ -32,5 +32,7 @@ jobs:
run: poetry install --no-interaction
- name: Build Project
run: poetry build
- name: Type check
run: poetry run mypy src/wpextract
- name: Run tests
run: poetry run pytest
10 changes: 9 additions & 1 deletion Makefile
@@ -6,17 +6,25 @@ format:
lint:
poetry run ruff check --fix

.PHONY: doclint
doclint:
poetry run ruff check --preview --select DOC

.PHONY: docdev
docdev:
poetry run mkdocs serve --watch src

.PHONY: types
types:
poetry run mypy src/wpextract

.PHONY: testonly
testonly:
poetry run pytest

.PHONY: testcov
testcov:
poetry run coverage run -m pytest
poetry run coverage run --branch -m pytest

.PHONY: covreport
covreport: testcov
74 changes: 61 additions & 13 deletions docs/advanced/multilingual.md
@@ -36,15 +36,15 @@ Currently the following plugins are supported:

??? example
```html
--8<-- "tests/parse/translations/test_pickers/polylang_widget.html:struct"
--8<-- "tests/parse/translations/data/polylang_widget.html:struct"
```


- Adding to the navbar as a custom dropdown[^dropdown]

??? example
```html
--8<-- "tests/parse/translations/test_pickers/polylang_custom_dropdown.html:struct"
--8<-- "tests/parse/translations/data/polylang_custom_dropdown.html:struct"
```

**Does not support**:
@@ -55,7 +55,7 @@

## Adding Support

!!! info "See also"
!!! info "Note"
To use additional pickers, you must [use WPextract as a library](library.md).

Support can be added by creating a new picker definition inheriting from [`LangPicker`][wpextract.parse.translations.LangPicker], and passing it to the `translation_pickers` argument of [`WPExtractor`][wpextract.WPExtractor].
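
A minimal sketch of registering a custom picker (using the `MyPicker` class developed in the sections below). Apart from `translation_pickers`, the constructor arguments shown here are assumptions for illustration:

```python
from wpextract import WPExtractor

# Only `translation_pickers` is documented above; the other argument names
# and values below are illustrative assumptions.
extractor = WPExtractor(
    json_root="out_json",
    scrape_root="out_scrape",
    translation_pickers=[MyPicker],  # assumed to take the picker class itself
)
```
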
@@ -69,29 +69,37 @@ More complicated pickers may need to override additional methods of the class, b

This section will show implementing a new picker with the following simplified markup:

```html
<ul class="translations">
<li><a href="/page/" class="lang current-lang" lang="en">English</a></li>
<li><a href="/de/seite/" class="lang" lang="de">Deutsch</a></li>
<li><a href="/page/" class="lang no-translation" lang="fr">Français</a></li>
</ul>
```
!!! example "Example picker markup"
```html
<ul class="translations">
<li><a href="/page/" class="lang current-lang" lang="en">English</a></li>
<li><a href="/de/seite/" class="lang" lang="de">Deutsch</a></li>
<li><a href="/page/" class="lang no-translation" lang="fr">Français</a></li>
</ul>
```

The correct parse of this picker should set the current language to English, add German as a translation, and ignore French.

### `get_root()`

!!! tip "Selector Support"

The `select()` and `select_one()` methods use [Soup Sieve](https://facelessuser.github.io/soupsieve/) internally.

This library supports many, but not all, CSS selectors. Supported selectors can be found [here](https://facelessuser.github.io/soupsieve/selectors/). Namespace selection is not supported as we use the `lxml` backend.

This method should use the `self.page_doc` attribute, a [`BeautifulSoup`][bs4.BeautifulSoup] object representing the page, to find and return the root element of the picker.

The `select_one` method is used to find the root element, and will return `None` if no element is found, which will be interpreted as the picker not being present on the page.

If a value is returned, the `self.root_el` attribute will be populated with the result of this method.

??? example "Example `get_root` implementation"
!!! example "Example `get_root` implementation"
```python
class MyPicker(LangPicker):
    ...
    def get_root(self) -> Tag:
    def get_root(self):
        return self.page_doc.select_one('ul.translations')
```

Expand All @@ -104,7 +112,7 @@ Be careful to avoid:
- Adding the current language as a translation
- Adding languages which are listed but don't have translations

??? example "Example `extract` implementation"
!!! example "Example `extract` implementation"
```python

class MyPicker(LangPicker):
@@ -118,6 +126,46 @@ Be careful to avoid:
self.add_translation(lang_a.get('href'), lang_a.get('lang'))
```

### Parsing Robustness

BeautifulSoup's `select` and `select_one` methods will silently fail if no matching elements are found (returning an empty list and `None` respectively). In some cases this may be desirable: if the picker contains no languages, for example, `select` returning an empty list is probably the correct behaviour.

In cases where this _isn't_ right, like when retrieving an element which should always be present, you can instead use the [`LangPicker._root_select()`][wpextract.parse.translations.LangPicker._root_select] or [`LangPicker._root_select_one()`][wpextract.parse.translations.LangPicker._root_select_one] methods. These will raise an error if no element(s) are found.

Currently, this error will be caught and the page in question will be skipped as if no translation picker could be found. In future, this may result in other pickers being tried instead.

!!! example "Example more robust `extract` implementation"
```python
class MyPicker(LangPicker):
    ...
    def extract(self):
        # This should *always* be present, if it isn't this can't be the correct picker.
        current_lang_a = self._root_select_one('li a.current-lang')
        self.set_current_lang(current_lang_a.get('lang'))

        # This could be empty
        for lang_a in self.root_el.select('li a.lang'):
            if (
                'current-lang' not in lang_a.get('class')
                and 'no-translation' not in lang_a.get('class')
            ):
                self.add_translation(lang_a.get('href'), lang_a.get('lang'))
```

??? note "Using other selector methods"

BeautifulSoup's CSS selector methods should cover most use cases. However, if you need to use more complex selection logic, you can generate the same error by using the [`LangPicker._build_extraction_fail_err`][wpextract.parse.translations.LangPicker._build_extraction_fail_err] method, as sketched below.
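
A sketch of how this might be used, assuming `_build_extraction_fail_err` accepts a description of the selector or element that failed to match (the exact signature should be checked against the API reference):

```python
class MyPicker(LangPicker):
    ...
    def extract(self):
        # Hypothetical: find the current-language link by inspecting classes in
        # Python rather than with a CSS selector.
        current_lang_a = None
        for a in self.root_el.find_all("a"):
            if "current-lang" in (a.get("class") or []):
                current_lang_a = a
                break

        if current_lang_a is None:
            # Signature assumed for illustration: a description of what failed to match.
            raise self._build_extraction_fail_err("a.current-lang")

        self.set_current_lang(current_lang_a.get("lang"))
```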


### Performance

For each post in the site, the `get_root()` method of every `LangPicker` instance selected will be run, so the efficiency of this method is important.

* Try to make the test as specific as possible
* For more complex tests, consider splitting them up and failing early
* Minimise usage of dynamic patterns. SoupSieve internally caches the compiled form of each search pattern to improve performance, but any change will invalidate this. Consider splitting the static and dynamic parts of patterns to work around this, as in the sketch below.
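
A minimal sketch of this split, using a hypothetical picker that matches a site-specific menu ID (`MenuPicker` and its `menu_id` attribute are purely illustrative, not part of WPextract):

```python
class MenuPicker(LangPicker):
    ...
    def get_root(self):
        # Dynamic pattern: the selector string changes with `menu_id`, so
        # SoupSieve must compile (and cache) a new pattern for each value:
        #   return self.page_doc.select_one(f"ul#{self.menu_id}.translations")

        # Static pattern: a single cacheable selector, with the dynamic part
        # checked in Python afterwards.
        for candidate in self.page_doc.select("ul.translations"):
            if candidate.get("id") == self.menu_id:
                return candidate
        return None
```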

### Contributing Pickers

We welcome contributions via a GitHub PR so long as the picker is not overly specific to a single site.
6 changes: 6 additions & 0 deletions docs/api/downloader.md
Expand Up @@ -3,6 +3,10 @@
## Downloading

::: wpextract.WPDownloader
options:
members:
- download
- download_media_files

## Configuring Request Behaviour

@@ -11,3 +15,5 @@
members: false

::: wpextract.download.AuthorizationType

::: wpextract.download.requestsession.DEFAULT_UA
14 changes: 14 additions & 0 deletions docs/api/extractor.md
Expand Up @@ -15,6 +15,20 @@
## Multilingual Extraction

::: wpextract.parse.translations.LangPicker
options:
members:
- current_language
- page_doc
- root_el
- translations
- add_translation
- extract
- get_root
- matches
- set_current_lang
- _root_select_one
- _root_select
- _build_extraction_fail_err

::: wpextract.parse.translations.PickerListType

27 changes: 27 additions & 0 deletions docs/changelog.md
@@ -1,5 +1,32 @@
# Changelog

## 1.1.0 (2024-08-13)

**Features & Improvements**

- WPextract is now completely type-annotated, and ships a `py.typed` file to indicate this
- Added `--user-agent` argument to `wpextract download` to allow customisation of the user agent string
- HTTP errors raised when downloading now all inherit from a common `HTTPError` class
- If an HTTP error is encountered while downloading, it will no longer end the whole scrape process. A warning will be logged and the scrape will continue, and if some data was obtained for that type, it will be saved as normal. HTTP transit errors (e.g. connection timeouts) will still end the scrape process.
- Improved the resiliency of HTML parsing and extraction by better checking for edge cases like missing attributes
- Translation picker extractors will now raise an exception if elements are missing during the extraction process.
- Simplified the WordPress API library by removing now-unused cache functionality. This will likely improve memory usage of the download process.
- Significantly more tests have been added, particularly for the download process


**Fixes**

- Fixed the scrape crawling step crashing if a page didn't have a canonical link or `og:url` meta tag
- Fixed the scrape crawling not correctly recognising when duplicate URLs were encountered. Previously duplicates would be included, but only one would be used. Now, they will be correctly logged. As a result of this change, the `SCRAPE_CRAWL_VERSION` has been incremented, meaning running extraction on a scrape will cause it to be re-crawled.
- Fixed the return type annotation of `LangPicker.get_root()`: it is now annotated as `bs4.Tag` or `None` instead of `bs4.PageElement`. This shouldn't be a breaking change, as the expected return type of this function was always a `Tag` object or `None`.
- Type of `TranslationLink.lang` changed to reflect that it can accept a string to resolve or an already resolved `Language` instance
- Fixed downloading incorrectly reporting that the WordPress v2 API was not supported when other types of error occurred
- Fixed the maximum redirects permitted not being set properly, meaning the effective value was always 30

**Documentation**

- Improved guide on translation parsing, correcting some errors and adding information on parse robustness and performance

## 1.0.3 (2024-08-06)

**Changes**
8 changes: 8 additions & 0 deletions docs/usage/download.md
Expand Up @@ -59,6 +59,9 @@ $ wpextract download TARGET OUT_JSON
`--max-redirects MAX_REDIRECTS`
: Maximum number of redirects before giving up (default: 20)

`--user-agent USER_AGENT`
: User agent to use for requests. Default is a recent version of Chrome on Linux (see [`requestsession.DEFAULT_UA`][wpextract.download.requestsession.DEFAULT_UA])

**logging**

`--log FILE`, `-l FILE`
@@ -109,6 +112,11 @@ We would also suggest enabling the following options, with consideration for how
- `--wait` to space out requests
- `--random-wait` to vary the time between requests to avoid patterns

You may also wish to consider the following (an example invocation is shown below):

- The reputation of the IP used to make requests. IPs in ranges belonging to common VPS providers, e.g. DigitalOcean or AWS, may be more likely to be rate limited.
- `--user-agent` to set a custom user agent. The default is a recent version of Chrome on Linux, but this may become outdated. If using authentication, this may need to match the user agent of the browser used to log in.
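
For example, a cautious download of a hypothetical site might combine these options as below. The values are purely illustrative, and the exact form of each option (e.g. whether `--wait` takes a delay in seconds) should be checked against the full option listing:

```shell-session
$ wpextract download https://example.org out_json \
    --wait 2 --random-wait \
    --user-agent "Mozilla/5.0 (X11; Linux x86_64) ..."
```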

### Error Handling

If an HTTP error occurs, the command will retry the request up to `--max-retries` times, with the backoff set by `--backoff-factor`. If the maximum number of retries is reached, the command will output the error, stop collecting the given data type, and start collecting the following data type. This is because it's presumed that if a given page is non-functional, the following one will be too.