GateNLP · freddyheppell · Aug 13, 2024 · Aug 7, 2024 · Aug 8, 2024 · Aug 8, 2024
diff --git a/.github/CONTRIBUTING.md b/.github/CONTRIBUTING.md
@@ -40,6 +40,24 @@ $ make lint
 
 Both library code and tests are linted (although tests are slightly less restrictive, see `pyproject.toml`).
 
+### Docstring Validation
+
+Additional checks for missing `Returns` and `Raises` sections in the docstrings can be run with:
+
+```shell-session
+$ make doclint
+```
+
+This will raise false positives for returns in abstract methods, and should be considered experimental
+
+## Type Checking
+
+To check types with mypy, run:
+
+```shell-session
+$ make types
+```
+
 ## Branch Management
 
 Generally your contribution should be made to the `dev` branch. We will then merge it into `main` only when it's time to release.

diff --git a/.github/workflows/lint.yml b/.github/workflows/lint.yml
@@ -4,7 +4,7 @@ on:
   push:
     branches: ["main", "dev"]
   pull_request:
-    branches: ["main", "dev"]
+    branches: ["main", "dev", "release/*"]
 
 permissions:
   contents: read

diff --git a/.github/workflows/test.yml b/.github/workflows/test.yml
@@ -4,7 +4,7 @@ on:
   push:
     branches: ["main", "dev"]
   pull_request:
-    branches: ["main", "dev"]
+    branches: [ "main", "dev", "release/*" ]
 
 permissions:
   contents: read
@@ -32,5 +32,7 @@ jobs:
         run: poetry install --no-interaction
       - name: Build Project
         run: poetry build
+      - name: Type check
+        run: poetry run mypy src/wpextract
       - name: Run tests
         run: poetry run pytest
diff --git a/Makefile b/Makefile
@@ -6,17 +6,25 @@ format:
 lint:
 	poetry run ruff check --fix
 
+.PHONY: doclint
+doclint:
+	poetry run ruff check --preview --select DOC
+
 .PHONY: docdev
 docdev:
 	poetry run mkdocs serve --watch src
 
+.PHONY: types
+types:
+	poetry run mypy src/wpextract
+
 .PHONY: testonly
 testonly:
 	poetry run pytest
 
 .PHONY: testcov
 testcov:
-	poetry run coverage run -m pytest
+	poetry run coverage run --branch -m pytest
 
 .PHONY: covreport
 covreport: testcov

diff --git a/docs/advanced/multilingual.md b/docs/advanced/multilingual.md
@@ -55,7 +55,7 @@ Currently the following plugins are supported:
 
 ## Adding Support
 
-!!! info "See also"
+!!! info "Note"
     To use additional pickers, you must [use WPextract as a library](library.md).
 
 Support can be added by creating a new picker definition inheriting from [`LangPicker`][wpextract.parse.translations.LangPicker], and passing to the `translation_pickers` argument of [`WPExtractor`][wpextract.WPExtractor]
@@ -69,41 +69,50 @@ More complicated pickers may need to override additional methods of the class, b
 
 This section will show implementing a new picker with the following simplified markup:
 
-```html
-<ul class="translations">
-  <li><a href="/page/" class="lang current-lang" lang="en">English</a></li>
-  <li><a href="/de/seite/" class="lang" lang="de">Deutsch</a></li>
-  <li><a href="/page/" class="lang no-translation" lang="fr">Français</a></li>
-</ul>
-```
+!!! example "Example picker markup"
+    ```html
+    <ul class="translations">
+      <li><a href="/page/" class="lang current-lang" lang="en">English</a></li>
+      <li><a href="/de/seite/" class="lang" lang="de">Deutsch</a></li>
+      <li><a href="/page/" class="lang no-translation" lang="fr">Français</a></li>
+    </ul>
+    ```
+
 The correct parse of this picker should set the current language to English, add German as a translation, and ignore French.
 
 ### `get_root()`
 
+!!! tip "Selector Support"
+
+    The `select()` and `select_one()` methods use [Soup Sieve](https://facelessuser.github.io/soupsieve/) internally.
+
+    This library supports many, but not all, CSS selectors. Supported selectors can be found [here](https://facelessuser.github.io/soupsieve/selectors/). Namespace selection is not supported as we use the `lxml` backend.
+
 Using the `self.page_doc` attribute, a [`BeautifulSoup`][bs4.BeautifulSoup] object representing the page, the root element of the picker should be found and returned.
 
-The `select_one` method is used to find the root element, and will return `None` if no element is found, which will be intepreted as the picker not being present on the page.
+The `select_one` method is used to find the root element, and will return `None` if no element is found, which will be interpreted as the picker not being present on the page.
 
 If a value is returned, the `self.root_el` attribute will be populated with the result of this method.
 
-??? example "Example `get_root` implementation"
+!!! example "Example `get_root` implementation"
     ```python
 
     class MyPicker(LangPicker):
         ...
-        def get_root(self) -> Tag:
-            return self.page_doc.select_one('ul', class_='translations')
+        def get_root(self):
+            return self.page_doc.select_one('ul.translations')
     ```
 
 ### `extract()`
 
 Using the `self.root_el` attribute, the languages should be found and added to the dataset.
 
 Be careful to avoid:
+
 - Adding the current language
 - Adding languages which are listed but don't have translations
 
-??? example "Example `extract` implementation"
+!!! example "Example `extract` implementation"
     ```python
 
     class MyPicker(LangPicker):
@@ -112,11 +121,51 @@ Be careful to avoid:
             for lang_el in self.root_el.select('li'):
                 lang_a = lang_el.select_one('a')
                 if 'current-lang' in lang_a.get('class'):
-                    self.set_current_lang(lang)
+                    self.set_current_lang(lang_a.get('lang'))
                 elif 'no-translation' not in lang_a.get('class'):
                     self.add_translation(lang_a.get('href'), lang_a.get('lang'))
     ```
 
+### Parsing Robustness
+
+BeautifulSoup's `select` and `select_one` methods will silently fail if no matching elements are found (returning `None` and an empty list respectively). In some cases this may be desirable, e.g. if the picker contains no languages, `select_one` returning an empty list will probably be the correct behaviour.
+
+In cases where this _isn't_ right, like when retrieving an element which should always be present, you can instead use the [`LangPicker._root_select()`][wpextract.parse.translations.LangPicker._root_select] or [`LangPicker._root_select_one()`][wpextract.parse.translations.LangPicker._root_select_one] methods. These will raise an error if no element(s) are found.
+
+Currently, this error will be caught and the page in question will be skipped as if no translation picker could be found. In future, this may instead result in other pickers being tried instead.
+
+!!! example "Example more robust `extract` implementation"
+    ```python
+
+    class MyPicker(LangPicker):
+        ...
+        def extract(self):
+            # This should *always* be present, if it isn't this can't be the correct picker.
+            current_lang_a = self._root_select_one('li a.current-lang')
+            self.set_current_lang(current_lang_a.get('lang'))
+
+            # This could be empty
+            for lang_a in self.root_el.select('li a.lang'):
+                if (
+                    'current-lang' not in lang_a.get('class')
+                    and 'no-translation' not in lang_a.get('class')
+                ):
+                    self.add_translation(lang_a.get('href'), lang_a.get('lang'))
+    ```
+
+??? note "Using other selector methods"
+
+    BeautifulSoup's CSS selector methods should cover most use cases. However, if you need to use more complex selection logic, you can generate the same error by using the [`LangPicker._build_extraction_fail_err`][wpextract.parse.translations.LangPicker._build_extraction_fail_err] method.
+
+
+### Performance
+
+For each post in the site, the `get_root()` method of every `LangPicker` instance selected will be run, so the efficiency of this method is important.
+
+* Try and make the test as specific as possible
+* For more complex tests, consider splitting it up and failing early
+* Minimise usage of dynamic patterns. SoupSieve internally caches the compiled form of each search pattern to improve performance, but any change will invalidate this. Consider splitting the static and dynamic parts of patterns to work around this.
+
 ### Contributing Pickers
 
 We welcome contributions via a GitHub PR so long as the picker is not overly specific to a single site. 
diff --git a/docs/api/downloader.md b/docs/api/downloader.md
@@ -3,6 +3,10 @@
 ## Downloading
 
 ::: wpextract.WPDownloader
+    options:
+        members:
+        - download
+        - download_media_files
 
 ## Configuring Request Behaviour
 
@@ -11,3 +15,5 @@
         members: false
 
 ::: wpextract.download.AuthorizationType
+
+::: wpextract.download.requestsession.DEFAULT_UA
diff --git a/docs/api/extractor.md b/docs/api/extractor.md
@@ -15,6 +15,20 @@
 ## Multilingual Extraction
 
 ::: wpextract.parse.translations.LangPicker
+    options:
+        members:
+            - current_language
+            - page_doc
+            - root_el
+            - translations
+            - add_translation
+            - extract
+            - get_root
+            - matches
+            - set_current_lang
+            - _root_select_one
+            - _root_select
+            - _build_extraction_fail_err
 
 ::: wpextract.parse.translations.PickerListType
 

diff --git a/docs/changelog.md b/docs/changelog.md
@@ -1,5 +1,32 @@
 # Changelog
 
+## 1.1.0 (upcoming)
+
+**Features & Improvements**
+
+- WPextract is now completely type-annotated, and ships a `py.typed` file to indicate this
+- Added `--user-agent` argument to `wpextract download` to allow customisation of the user agent string
+- HTTP errors raised when downloading now all inherit from a common `HTTPError` class
+- If an HTTP error is encountered while downloading, it will no longer end the whole scrape process. A warning will be logged and the scrape will continue, and if some data was obtained for that type, it will be saved as normal. HTTP transit errors (e.g. connection timeouts) will still end the scrape process.
+- Improved the resiliency of HTML parsing and extraction by better checking for edge cases like missing attributes
+  - Translation picker extractors will now raise an exception if elements are missing during the extraction process.
+- Simplified the WordPress API library by removing now-unused cache functionality. This will likely improve memory usage of the download process.
+- Significantly more tests have been added, particularly for the download process
+
+
+**Fixes**
+
+- Fixed the scrape crawling step crashing if a page didn't have a canonical link or `og:url` meta tag
+- Fixed the scrape crawling not correctly recognising when duplicate URLs were encountered. Previously duplicates would be included, but only one would be used. Now, they will be correctly logged. As a result of this change, the `SCRAPE_CRAWL_VERSION` has been incremented, meaning running extraction on a scrape will cause it to be re-crawled.
+- Fixed the return type annotation `LangPicker.get_root()`: it is now annotated as (`bs4.Tag` or `None`) instead of `bs4.PageElement`. This shouldn't be a breaking change, as the expected return type of this function was always a `Tag` object or `None`.
+- Type of `TranslationLink.lang` changed to reflect that it can accept a string to resolve or an already resolved `Language` instance
+- Fixed downloading throwing an error stating the WordPress v2 API was not supported in other error cases
+- Fixed the maximum redirects permitted not being set properly, meaning the effective value was always 30
+
+**Documentation**
+
+- Improved guide on translation parsing, correcting some errors and adding information on parse robustness and performance
+
 ## 1.0.3 (2024-08-06)
 
 **Changes**

diff --git a/docs/usage/download.md b/docs/usage/download.md
@@ -59,6 +59,9 @@ $ wpextract download TARGET OUT_JSON
 `--max-redirects MAX_REDIRECTS`
 : Maximum number of redirects before giving up (default: 20)
 
+`--user-agent USER_AGENT`
+: User agent to use for requests. Default is a recent version of Chrome on Linux (see [`requestsession.DEFAULT_UA`][wpextract.download.requestsession.DEFAULT_UA])
+
 **logging**
 
 `--log FILE`, `-l FILE`
@@ -109,6 +112,11 @@ We would also suggest enabling the following options, with consideration for how
 - `--wait` to space out requests
 - `--random-wait` to vary the time between requests to avoid patterns
 
+You may also wish to consider:
+
+- The reputation of the IP used to make requests. IPs in ranges belonging to common VPS providers, e.g. DigitalOcean or AWS, may be more likely to be rate limited.
+- `--user-agent` to set a custom user agent. The default is a recent version of Chrome on Linux, but this may become outdated. If using authentication, this may need to match the user agent of the browser used to log in.
+
 ### Error Handling
 
 If an HTTP error occurs, the command will retry the request up to `--max-retries` times, with the backoff set by `--backoff-factor`. If the maximum number of retries is reached, the command will output the error, stop collecting the given data type, and start collecting the following data type. This is because it's presumed that if a given page is non-functional, the following one will be too.