Update README.md #15

Open
wants to merge 1 commit into
base: main
32 changes: 32 additions & 0 deletions README.md
@@ -87,6 +87,38 @@ You can also fetch the content of the web capture as bytes:
There's a full example of iterating and selecting a subset of captures
to write into an extracted WARC file in [examples/iter-and-warc.py](examples/iter-and-warc.py)

### Caching

You can cache requests to the index with the
[requests_cache](https://github.com/reclosedev/requests-cache) library.
Caching reduces the load on the CC servers and speeds up your code
at the same time.

By default, requests_cache stores only responses with status 200, so the
cache must be initialized with additional allowable codes. The index server
returns "empty" search results as 404 or 400, and these need to be cached too.
The HTTP 206 code is needed for downloading the contents of the WARC archives.

Put the following code at the start of your script, before making any calls
to the index using cdx_toolkit:

```
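import requests_cache

# Install a global cache used by all subsequent requests calls.
requests_cache.install_cache(
    "/my/path/to/cache",
    include_get_headers=True,
    # 404 and 400: "empty" index results; 206: WARC content downloads
    allowable_codes=(200, 404, 400, 206)
)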
```

Additionally, caching only works if the request URL is static. The cdx_toolkit
default parameters include a dynamic timestamp, so the URL changes between runs.
Override it with a fixed date when fetching the index parts:

```
for obj in cdx.iter(url, limit=1, from_ts="20191207000000"):
    print(obj)
```
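
To see why the timestamp matters, here is a minimal sketch of a cache keyed on the full request URL. It is a simplified model, not the actual key function of requests_cache, and the endpoint and parameter names (`ENDPOINT`, `url`, `limit`, `from`) are hypothetical and chosen only for illustration:

```python
import hashlib
from datetime import datetime, timezone
from urllib.parse import urlencode

# Hypothetical endpoint and parameter names, for illustration only.
ENDPOINT = "https://index.example.org/cdx"


def cache_key(params):
    """Model a URL-keyed cache: hash the full request URL."""
    full_url = ENDPOINT + "?" + urlencode(sorted(params.items()))
    return hashlib.sha256(full_url.encode("utf-8")).hexdigest()


def dynamic_params(now):
    # A timestamp derived from the current time changes on every run.
    return {"url": "example.com/*", "limit": 1, "from": now.strftime("%Y%m%d%H%M%S")}


def static_params():
    # A fixed timestamp yields the same URL on every run.
    return {"url": "example.com/*", "limit": 1, "from": "20191207000000"}


# Two runs of the same script, five seconds apart.
first_run = datetime(2020, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
second_run = datetime(2020, 1, 1, 12, 0, 5, tzinfo=timezone.utc)

dynamic_hit = cache_key(dynamic_params(first_run)) == cache_key(dynamic_params(second_run))
static_hit = cache_key(static_params()) == cache_key(static_params())

print("dynamic timestamp reuses cache entry:", dynamic_hit)
print("static timestamp reuses cache entry:", static_hit)
```

With a time-dependent parameter the two runs produce different keys, so the second run would miss the cache. With a fixed `from_ts` the key is identical and the cached response can be reused.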

## Filter syntax

Filters can be used to limit captures to a subset of the results.