Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Typing and testing improvements #46

Merged
merged 29 commits into from
Aug 13, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
29 commits
Select commit Hold shift + click to select a range
861e7e3
Added user agent argument (#43)
freddyheppell Aug 7, 2024
aef2f45
types
freddyheppell Aug 8, 2024
bd160ef
ruff
freddyheppell Aug 8, 2024
c411f3e
docs improvements
freddyheppell Aug 8, 2024
e169b5e
update docs
freddyheppell Aug 8, 2024
5c0fcf0
add py.typed
freddyheppell Aug 8, 2024
1acf74a
add type check workflow
freddyheppell Aug 8, 2024
a185ab3
fix paramspec in py3.9
freddyheppell Aug 8, 2024
a53b9d9
add tests for picker select helpers
freddyheppell Aug 8, 2024
0610aa7
more tests for download/parse/extract
freddyheppell Aug 8, 2024
ead3a1e
fix flaky test
freddyheppell Aug 8, 2024
277b5a3
ruff
freddyheppell Aug 9, 2024
db55c70
wpapi crawler tests
freddyheppell Aug 9, 2024
cea25be
remove wpapi caching
freddyheppell Aug 9, 2024
ac8d7ad
test get_obj_list and remove search
freddyheppell Aug 9, 2024
fa95874
more tests
freddyheppell Aug 9, 2024
e4de46a
add e2e tests
freddyheppell Aug 12, 2024
1e13177
add extra info to e2e test docs
freddyheppell Aug 12, 2024
438922f
Merge branch 'release/1.0.4' into typing
freddyheppell Aug 12, 2024
5e49e8e
remove unused UA argument on downloader
freddyheppell Aug 12, 2024
76e4329
ruff
freddyheppell Aug 12, 2024
7825c1d
fix ua null behaviour
freddyheppell Aug 12, 2024
8f10459
ruff
freddyheppell Aug 12, 2024
55cd131
Improve download error handling on first request
freddyheppell Aug 12, 2024
11b4ee7
Revert type change to set_current_lang and instead concat inside extr…
freddyheppell Aug 12, 2024
9b48a24
Update changelog
freddyheppell Aug 12, 2024
87df809
fix incorrect docstring
freddyheppell Aug 12, 2024
ccae085
remove incorrect raises of not v2 error
freddyheppell Aug 12, 2024
730b5d8
requestsession tests
freddyheppell Aug 13, 2024
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
18 changes: 18 additions & 0 deletions .github/CONTRIBUTING.md
Original file line number Diff line number Diff line change
Expand Up @@ -40,6 +40,24 @@ $ make lint

Both library code and tests are linted (although tests are slightly less restrictive, see `pyproject.toml`).

### Docstring Validation

Additional checks for missing `Returns` and `Raises` sections in the docstrings can be run with:

```shell-session
$ make doclint
```

This will raise false positives for returns in abstract methods, and should be considered experimental

## Type Checking

To check types with mypy, run:

```shell-session
$ make types
```

## Branch Management

Generally your contribution should be made to the `dev` branch. We will then merge it into `main` only when it's time to release.
Expand Down
2 changes: 1 addition & 1 deletion .github/workflows/lint.yml
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@ on:
push:
branches: ["main", "dev"]
pull_request:
branches: ["main", "dev"]
branches: ["main", "dev", "release/*"]

permissions:
contents: read
Expand Down
4 changes: 3 additions & 1 deletion .github/workflows/test.yml
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@ on:
push:
branches: ["main", "dev"]
pull_request:
branches: ["main", "dev"]
branches: [ "main", "dev", "release/*" ]

permissions:
contents: read
Expand Down Expand Up @@ -32,5 +32,7 @@ jobs:
run: poetry install --no-interaction
- name: Build Project
run: poetry build
- name: Type check
run: poetry run mypy src/wpextract
- name: Run tests
run: poetry run pytest
10 changes: 9 additions & 1 deletion Makefile
Original file line number Diff line number Diff line change
Expand Up @@ -6,17 +6,25 @@ format:
lint:
poetry run ruff check --fix

.PHONY: doclint
doclint:
poetry run ruff check --preview --select DOC

.PHONY: docdev
docdev:
poetry run mkdocs serve --watch src

.PHONY: types
types:
poetry run mypy src/wpextract

.PHONY: testonly
testonly:
poetry run pytest

.PHONY: testcov
testcov:
poetry run coverage run -m pytest
poetry run coverage run --branch -m pytest

.PHONY: covreport
covreport: testcov
Expand Down
77 changes: 63 additions & 14 deletions docs/advanced/multilingual.md
Original file line number Diff line number Diff line change
Expand Up @@ -55,7 +55,7 @@ Currently the following plugins are supported:

## Adding Support

!!! info "See also"
!!! info "Note"
To use additional pickers, you must [use WPextract as a library](library.md).

Support can be added by creating a new picker definition inheriting from [`LangPicker`][wpextract.parse.translations.LangPicker], and passing to the `translation_pickers` argument of [`WPExtractor`][wpextract.WPExtractor]
Expand All @@ -69,41 +69,50 @@ More complicated pickers may need to override additional methods of the class, b

This section will show implementing a new picker with the following simplified markup:

```html
<ul class="translations">
<li><a href="/page/" class="lang current-lang" lang="en">English</a></li>
<li><a href="/de/seite/" class="lang" lang="de">Deutsch</a></li>
<li><a href="/page/" class="lang no-translation" lang="fr">Français</a></li>
</ul>
```
!!! example "Example picker markup"
```html
<ul class="translations">
<li><a href="/page/" class="lang current-lang" lang="en">English</a></li>
<li><a href="/de/seite/" class="lang" lang="de">Deutsch</a></li>
<li><a href="/page/" class="lang no-translation" lang="fr">Français</a></li>
</ul>
```

The correct parse of this picker should set the current language to English, add German as a translation, and ignore French.

### `get_root()`

!!! tip "Selector Support"

The `select()` and `select_one()` methods use [Soup Sieve](https://facelessuser.github.io/soupsieve/) internally.

This library supports many, but not all, CSS selectors. Supported selectors can be found [here](https://facelessuser.github.io/soupsieve/selectors/). Namespace selection is not supported as we use the `lxml` backend.

Using the `self.page_doc` attribute, a [`BeautifulSoup`][bs4.BeautifulSoup] object representing the page, the root element of the picker should be found and returned.

The `select_one` method is used to find the root element, and will return `None` if no element is found, which will be intepreted as the picker not being present on the page.
The `select_one` method is used to find the root element, and will return `None` if no element is found, which will be interpreted as the picker not being present on the page.

If a value is returned, the `self.root_el` attribute will be populated with the result of this method.

??? example "Example `get_root` implementation"
!!! example "Example `get_root` implementation"
```python

class MyPicker(LangPicker):
...
def get_root(self) -> Tag:
return self.page_doc.select_one('ul', class_='translations')
def get_root(self):
return self.page_doc.select_one('ul.translations')
```

### `extract()`

Using the `self.root_el` attribute, the languages should be found and added to the dataset.

Be careful to avoid:

- Adding the current language
- Adding languages which are listed but don't have translations

??? example "Example `extract` implementation"
!!! example "Example `extract` implementation"
```python

class MyPicker(LangPicker):
Expand All @@ -112,11 +121,51 @@ Be careful to avoid:
for lang_el in self.root_el.select('li'):
lang_a = lang_el.select_one('a')
if 'current-lang' in lang_a.get('class'):
self.set_current_lang(lang)
self.set_current_lang(lang_a.get('lang'))
elif 'no-translation' not in lang_a.get('class'):
self.add_translation(lang_a.get('href'), lang_a.get('lang'))
```

### Parsing Robustness

BeautifulSoup's `select` and `select_one` methods will silently fail if no matching elements are found (returning `None` and an empty list respectively). In some cases this may be desirable, e.g. if the picker contains no languages, `select_one` returning an empty list will probably be the correct behaviour.

In cases where this _isn't_ right, like when retrieving an element which should always be present, you can instead use the [`LangPicker._root_select()`][wpextract.parse.translations.LangPicker._root_select] or [`LangPicker._root_select_one()`][wpextract.parse.translations.LangPicker._root_select_one] methods. These will raise an error if no element(s) are found.

Currently, this error will be caught and the page in question will be skipped as if no translation picker could be found. In future, this may instead result in other pickers being tried instead.

!!! example "Example more robust `extract` implementation"
```python

class MyPicker(LangPicker):
...
def extract(self):
# This should *always* be present, if it isn't this can't be the correct picker.
current_lang_a = self._root_select_one('li a.current-lang')
self.set_current_lang(current_lang_a.get('lang'))

# This could be empty
for lang_a in self.root_el.select('li a.lang'):
if (
'current-lang' not in lang_a.get('class')
and 'no-translation' not in lang_a.get('class')
):
self.add_translation(lang_a.get('href'), lang_a.get('lang'))
```

??? note "Using other selector methods"

BeautifulSoup's CSS selector methods should cover most use cases. However, if you need to use more complex selection logic, you can generate the same error by using the [`LangPicker._build_extraction_fail_err`][wpextract.parse.translations.LangPicker._build_extraction_fail_err] method.


### Performance

For each post in the site, the `get_root()` method of every `LangPicker` instance selected will be run, so the efficiency of this method is important.

* Try and make the test as specific as possible
* For more complex tests, consider splitting it up and failing early
* Minimise usage of dynamic patterns. SoupSieve internally caches the compiled form of each search pattern to improve performance, but any change will invalidate this. Consider splitting the static and dynamic parts of patterns to work around this.

### Contributing Pickers

We welcome contributions via a GitHub PR so long as the picker is not overly specific to a single site.
6 changes: 6 additions & 0 deletions docs/api/downloader.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,10 @@
## Downloading

::: wpextract.WPDownloader
options:
members:
- download
- download_media_files

## Configuring Request Behaviour

Expand All @@ -11,3 +15,5 @@
members: false

::: wpextract.download.AuthorizationType

::: wpextract.download.requestsession.DEFAULT_UA
14 changes: 14 additions & 0 deletions docs/api/extractor.md
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,20 @@
## Multilingual Extraction

::: wpextract.parse.translations.LangPicker
options:
members:
- current_language
- page_doc
- root_el
- translations
- add_translation
- extract
- get_root
- matches
- set_current_lang
- _root_select_one
- _root_select
- _build_extraction_fail_err

::: wpextract.parse.translations.PickerListType

Expand Down
27 changes: 27 additions & 0 deletions docs/changelog.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,32 @@
# Changelog

## 1.1.0 (upcoming)

**Features & Improvements**

- WPextract is now completely type-annotated, and ships a `py.typed` file to indicate this
- Added `--user-agent` argument to `wpextract download` to allow customisation of the user agent string
- HTTP errors raised when downloading now all inherit from a common `HTTPError` class
- If an HTTP error is encountered while downloading, it will no longer end the whole scrape process. A warning will be logged and the scrape will continue, and if some data was obtained for that type, it will be saved as normal. HTTP transit errors (e.g. connection timeouts) will still end the scrape process.
- Improved the resiliency of HTML parsing and extraction by better checking for edge cases like missing attributes
- Translation picker extractors will now raise an exception if elements are missing during the extraction process.
- Simplified the WordPress API library by removing now-unused cache functionality. This will likely improve memory usage of the download process.
- Significantly more tests have been added, particularly for the download process


**Fixes**

- Fixed the scrape crawling step crashing if a page didn't have a canonical link or `og:url` meta tag
- Fixed the scrape crawling not correctly recognising when duplicate URLs were encountered. Previously duplicates would be included, but only one would be used. Now, they will be correctly logged. As a result of this change, the `SCRAPE_CRAWL_VERSION` has been incremented, meaning running extraction on a scrape will cause it to be re-crawled.
- Fixed the return type annotation `LangPicker.get_root()`: it is now annotated as (`bs4.Tag` or `None`) instead of `bs4.PageElement`. This shouldn't be a breaking change, as the expected return type of this function was always a `Tag` object or `None`.
- Type of `TranslationLink.lang` changed to reflect that it can accept a string to resolve or an already resolved `Language` instance
- Fixed downloading throwing an error stating the WordPress v2 API was not supported in other error cases
- Fixed the maximum redirects permitted not being set properly, meaning the effective value was always 30

**Documentation**

- Improved guide on translation parsing, correcting some errors and adding information on parse robustness and performance

## 1.0.3 (2024-08-06)

**Changes**
Expand Down
8 changes: 8 additions & 0 deletions docs/usage/download.md
Original file line number Diff line number Diff line change
Expand Up @@ -59,6 +59,9 @@ $ wpextract download TARGET OUT_JSON
`--max-redirects MAX_REDIRECTS`
: Maximum number of redirects before giving up (default: 20)

`--user-agent USER_AGENT`
: User agent to use for requests. Default is a recent version of Chrome on Linux (see [`requestsession.DEFAULT_UA`][wpextract.download.requestsession.DEFAULT_UA])

**logging**

`--log FILE`, `-l FILE`
Expand Down Expand Up @@ -109,6 +112,11 @@ We would also suggest enabling the following options, with consideration for how
- `--wait` to space out requests
- `--random-wait` to vary the time between requests to avoid patterns

You may also wish to consider:

- The reputation of the IP used to make requests. IPs in ranges belonging to common VPS providers, e.g. DigitalOcean or AWS, may be more likely to be rate limited.
- `--user-agent` to set a custom user agent. The default is a recent version of Chrome on Linux, but this may become outdated. If using authentication, this may need to match the user agent of the browser used to log in.

### Error Handling

If an HTTP error occurs, the command will retry the request up to `--max-retries` times, with the backoff set by `--backoff-factor`. If the maximum number of retries is reached, the command will output the error, stop collecting the given data type, and start collecting the following data type. This is because it's presumed that if a given page is non-functional, the following one will be too.
Expand Down
Loading