Rework docs
Gallaecio committed Jun 26, 2024
1 parent 4c56388 commit f6a1db6
Showing 5 changed files with 176 additions and 135 deletions.
9 changes: 3 additions & 6 deletions CHANGES.rst
Original file line number Diff line number Diff line change
Expand Up @@ -25,16 +25,13 @@ Unreleased

#. :reqmeta:`zyte_api_session_params`

#. :reqmeta:`zyte_api_session_location` (using any
:meth:`~scrapy_zyte_api.SessionConfig.params` override)
#. :reqmeta:`zyte_api_session_location`

#. :setting:`ZYTE_API_SESSION_PARAMS`

#. :setting:`ZYTE_API_SESSION_LOCATION` (using any
:meth:`~scrapy_zyte_api.SessionConfig.params` override)
#. :setting:`ZYTE_API_SESSION_LOCATION`

#. :meth:`~scrapy_zyte_api.SessionConfig.location` (using any
:meth:`~scrapy_zyte_api.SessionConfig.params` override)
#. :meth:`~scrapy_zyte_api.SessionConfig.location`

#. :meth:`~scrapy_zyte_api.SessionConfig.params`

Expand Down
29 changes: 13 additions & 16 deletions docs/reference/meta.rst
Expand Up @@ -108,15 +108,19 @@ zyte_api_session_location

Default: ``{}``

Address for ``setLocation``-based session initialization. See
:setting:`ZYTE_API_SESSION_LOCATION` for details.
See :ref:`session-init` for general information about location configuration
and parameter precedence.

This request metadata key, if not empty, takes precedence over the
:setting:`ZYTE_API_SESSION_LOCATION` setting, the
:setting:`ZYTE_API_SESSION_PARAMS` setting, and the
:reqmeta:`zyte_api_session_location` request metadata key.
Example:

.. seealso:: :meth:`scrapy_zyte_api.SessionConfig.location`
.. code-block:: python

    Request(
        "https://example.com",
        meta={
            "zyte_api_session_location": {"postalCode": "10001"},
        },
    )

.. reqmeta:: zyte_api_session_params
Expand All @@ -126,15 +130,8 @@ zyte_api_session_params

Default: ``{}``

Parameters to use for session initialization. See
:setting:`ZYTE_API_SESSION_PARAMS` for details.

This request metadata key, if not empty, takes precedence over the
:setting:`ZYTE_API_SESSION_PARAMS` setting, but it can be overridden
by the :setting:`ZYTE_API_SESSION_LOCATION` setting or the
:reqmeta:`zyte_api_session_location` request metadata key.

.. seealso:: :meth:`scrapy_zyte_api.SessionConfig.params`
See :ref:`session-init` for general information about defining session
initialization parameters and parameter precedence.


.. reqmeta:: zyte_api_session_pool
Expand Down
38 changes: 7 additions & 31 deletions docs/reference/settings.rst
Expand Up @@ -354,30 +354,16 @@ ZYTE_API_SESSION_LOCATION

Default: ``{}``

If defined, sessions are initialized using the ``setLocation``
:http:`action <request:actions>`, and the value of this setting must be the
target address :class:`dict`. For example:
See :ref:`session-init` for general information about location configuration
and parameter precedence.

Example:

.. code-block:: python
    :caption: settings.py

    ZYTE_API_SESSION_LOCATION = {"postalCode": "10001"}

If the :setting:`ZYTE_API_SESSION_PARAMS` setting or the
:reqmeta:`zyte_api_session_params` request metadata key set a ``"url"``, it
will be used for session initialization as well. Otherwise, the URL of the
request for which the session is being initialized will be used instead.

This setting, if not empty, takes precedence over the
:setting:`ZYTE_API_SESSION_PARAMS` setting and the
:reqmeta:`zyte_api_session_params` request metadata key, but it can be
overridden by the :reqmeta:`zyte_api_session_location` request metadata key.

To disable the :setting:`ZYTE_API_SESSION_LOCATION` setting on a specific
request, e.g. to use the :setting:`ZYTE_API_SESSION_PARAMS` setting or the
:reqmeta:`zyte_api_session_params` request metadata key instead, set
the :reqmeta:`zyte_api_session_location` request metadata key to ``{}``.

.. setting:: ZYTE_API_SESSION_MAX_BAD_INITS

Expand Down Expand Up @@ -430,20 +416,10 @@ to work even after an unsuccessful response. See :ref:`optimize-sessions`.
ZYTE_API_SESSION_PARAMS
=======================

Default: ``{"browserHtml": True}``

Parameters to use for session initialization.

It works similarly to :http:`request:sessionContextParams` from
:ref:`server-managed sessions <zyte-api-session-contexts>`, but it supports
arbitrary Zyte API parameters instead of a specific subset.

If it does not define a ``"url"``, the URL of the request for which the session
is being initialized will be used.
Default: ``{}``

This setting can be overridden by the :setting:`ZYTE_API_SESSION_LOCATION`
setting, the :reqmeta:`zyte_api_session_location` request metadata key, or the
:reqmeta:`zyte_api_session_params` request metadata key.
See :ref:`session-init` for general information about defining session
initialization parameters and parameter precedence.

Example:

Expand Down
156 changes: 102 additions & 54 deletions docs/usage/session.rst
Expand Up @@ -34,15 +34,16 @@ scrapy-zyte-api session management offers some advantages over
- You have granular control over the session pool size, max errors, etc. See
:ref:`optimize-sessions` and :ref:`session-configs`.

However, scrapy-zyte-api session manager is not a replacement for
However, scrapy-zyte-api session management is not a replacement for
:ref:`server-managed sessions <zyte-api-session-contexts>` or
:ref:`client-managed sessions <zyte-api-session-id>`:

- :ref:`Server-managed sessions <zyte-api-session-contexts>` offer a longer
lifetime than the :ref:`client-managed sessions <zyte-api-session-id>`
that scrapy-zyte-api session management uses, so as long as you do not need
one of the scrapy-zyte-api session management features, they can be
significantly more efficient (fewer total sessions needed per crawl).
one of the scrapy-zyte-api session management features, server-managed
sessions can be significantly more efficient (fewer total sessions needed
per crawl).

Zyte API can also optimize server-managed sessions based on the target
website. With scrapy-zyte-api session management, you need to :ref:`handle
Expand All @@ -64,10 +65,14 @@ management on or off for specific requests using the
:meth:`~scrapy_zyte_api.SessionConfig.enabled` method of a :ref:`session config
override <session-configs>`.

.. _session-init-default:

By default, scrapy-zyte-api will maintain up to 8 sessions per domain, each
initialized with a :ref:`browser request <zyte-api-browser>` targeting the URL
of the first request that will use the session. Sessions will be automatically
rotated among requests, and refreshed as they expire or get banned.
of the first request that will use the session. Sessions are automatically
rotated among requests, and refreshed as they expire or get banned. You can
customize most of this logic through request metadata, settings and
:ref:`session config overrides <session-configs>`.

For session management to work as expected, your
:setting:`ZYTE_API_RETRY_POLICY` should not retry 520 and 521 responses:
Expand All @@ -76,13 +81,14 @@ For session management to work as expected, your
(:data:`~zyte_api.zyte_api_retrying`) or
:data:`~zyte_api.aggressive_retrying`:

- If you are :ref:`using the add-on <config-addon>`, they are
automatically replaced with a matching session-specific retry policy,
either :data:`~scrapy_zyte_api.SESSION_DEFAULT_RETRY_POLICY` or
- If you are :ref:`using the scrapy-zyte-api add-on <config-addon>`,
these built-in retry policies are automatically replaced with a
matching session-specific retry policy, either
:data:`~scrapy_zyte_api.SESSION_DEFAULT_RETRY_POLICY` or
:data:`~scrapy_zyte_api.SESSION_AGGRESSIVE_RETRY_POLICY`.

- If you are not using the add-on, set :setting:`ZYTE_API_RETRY_POLICY`
manually to either
- If you are not using the scrapy-zyte-api add-on, set
:setting:`ZYTE_API_RETRY_POLICY` manually to either
:data:`~scrapy_zyte_api.SESSION_DEFAULT_RETRY_POLICY` or
:data:`~scrapy_zyte_api.SESSION_AGGRESSIVE_RETRY_POLICY`. For example:

Expand All @@ -99,44 +105,63 @@ For session management to work as expected, your
Initializing sessions
=====================

To change how sessions are initialized, you have the following options:
To change the :ref:`default session initialization parameters
<session-init-default>`, you have the following options:

- To run the ``setLocation`` :http:`action <request:actions>` for session
initialization, use the :setting:`ZYTE_API_SESSION_LOCATION` setting or the
- To initialize sessions with a given **location**, use the
:setting:`ZYTE_API_SESSION_LOCATION` setting or the
:reqmeta:`zyte_api_session_location` request metadata key.

- For session initialization with arbitrary Zyte API request fields, use the
:setting:`ZYTE_API_SESSION_PARAMS` setting or the
The value should be a dictionary with keys supported by the ``address``
field of the ``setLocation`` :http:`action <request:actions>`, e.g.

.. code-block:: python

    {
        "addressCountry": "US",
        "addressRegion": "NY",
        "postalCode": "10001",
        "streetAddress": "3 Penn Plz",
    }

By default, the location is set using the ``setLocation``
:http:`action <request:actions>`. A :ref:`session config override
<session-configs>` can change that through
:meth:`~scrapy_zyte_api.SessionConfig.params`.

- For session initialization with **arbitrary Zyte API request fields**, use
the :setting:`ZYTE_API_SESSION_PARAMS` setting or the
:reqmeta:`zyte_api_session_params` request metadata key.

- To customize session initialization per request, define
:meth:`~scrapy_zyte_api.SessionConfig.params` in a :ref:`session config
override <session-configs>`.
It works similarly to :http:`request:sessionContextParams` from
:ref:`server-managed sessions <zyte-api-session-contexts>`, but it supports
arbitrary Zyte API parameters instead of a specific subset.

If it does not define a ``"url"``, the URL of the request :ref:`triggering
a session initialization request <pool-size>` will be used.

- When defining a :ref:`session config override <session-configs>`, you can
customize the default and location-setting session initialization
parameters through :meth:`~scrapy_zyte_api.SessionConfig.params`.

:meth:`~scrapy_zyte_api.SessionConfig.location` can define a default
location for a given domain.
location for its :ref:`session config override <session-configs>` to use
when no location is specified otherwise.
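
As a rough illustration of the location option above, the following sketch
builds the kind of Zyte API parameters that a ``setLocation``-based session
initialization request might carry. The helper name
``build_location_init_params`` is hypothetical, and combining the action with
``browserHtml`` is an assumption based on the default browser request; the
parameters that scrapy-zyte-api actually sends may differ.

```python
def build_location_init_params(url, location):
    """Hypothetical helper for illustration; not part of scrapy-zyte-api."""
    return {
        "url": url,
        # Assumption: initialization uses a browser request, as in the default.
        "browserHtml": True,
        "actions": [
            # The location dictionary is passed as the "address" field.
            {"action": "setLocation", "address": dict(location)},
        ],
    }


params = build_location_init_params(
    "https://example.com",
    {"postalCode": "10001"},
)
```

Copying the location with ``dict()`` keeps the sketch from sharing a mutable
object between the request metadata and the generated parameters.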

Precedence, from higher to lower, is:

#. :reqmeta:`zyte_api_session_params`

#. :reqmeta:`zyte_api_session_location` (using any
:meth:`~scrapy_zyte_api.SessionConfig.params` override)
#. :reqmeta:`zyte_api_session_location`

#. :setting:`ZYTE_API_SESSION_PARAMS`

#. :setting:`ZYTE_API_SESSION_LOCATION` (using any
:meth:`~scrapy_zyte_api.SessionConfig.params` override)
#. :setting:`ZYTE_API_SESSION_LOCATION`

#. :meth:`~scrapy_zyte_api.SessionConfig.location` (using any
:meth:`~scrapy_zyte_api.SessionConfig.params` override)
#. :meth:`~scrapy_zyte_api.SessionConfig.location`

#. :meth:`~scrapy_zyte_api.SessionConfig.params`

.. note:: An implementation of :meth:`~scrapy_zyte_api.SessionConfig.location`
can technically override :reqmeta:`zyte_api_session_location` or
:setting:`ZYTE_API_SESSION_LOCATION`, but it is not recommended as it
breaks the precedence chain above that users may expect.
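
The precedence order above can be sketched as a simple first-match lookup.
This is a simplified model, not the scrapy-zyte-api implementation: the
function name ``pick_init_source`` and its arguments are hypothetical, and
treating empty values as unset is an assumption.

```python
def pick_init_source(meta, settings, config_location=None):
    """Return the name of the highest-precedence source that has a value.

    Hypothetical helper mirroring the documented order.
    """
    candidates = [
        ("meta:zyte_api_session_params", meta.get("zyte_api_session_params")),
        ("meta:zyte_api_session_location", meta.get("zyte_api_session_location")),
        ("setting:ZYTE_API_SESSION_PARAMS", settings.get("ZYTE_API_SESSION_PARAMS")),
        ("setting:ZYTE_API_SESSION_LOCATION", settings.get("ZYTE_API_SESSION_LOCATION")),
        ("SessionConfig.location", config_location),
    ]
    for name, value in candidates:
        if value:  # assumption: an empty dict counts as "not set"
            return name
    # Nothing else is defined, so the session config defaults apply.
    return "SessionConfig.params"


source = pick_init_source(
    meta={"zyte_api_session_location": {"postalCode": "10001"}},
    settings={"ZYTE_API_SESSION_PARAMS": {"browserHtml": True}},
)
```

In this example the request metadata location wins over the
``ZYTE_API_SESSION_PARAMS`` setting because it appears earlier in the list.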

.. _session-check:

Checking sessions
Expand All @@ -151,30 +176,34 @@ initialization fails, e.g. due to rendering issues, IP-geolocation mismatches,
A-B tests, etc. It can also help in cases where website sessions expire before
Zyte API sessions.

By default, for sessions that are initialized with a location, the outcome of
the ``setLocation`` action is checked. If the action fails, the session is
discarded. If the action is not even available for a given website, the spider
is closed with ``unsupported_set_location`` as the close reason, so that you
can set a proper :ref:`session initialization logic <session-init>` for
requests targeting that website.

For sessions initialized with arbitrary or no parameters, no session check is
By default, if a location is defined through
:reqmeta:`zyte_api_session_location`, :setting:`ZYTE_API_SESSION_LOCATION` or
:meth:`~scrapy_zyte_api.SessionConfig.location`, even if the parameters used
for session initialization actually come from
:reqmeta:`zyte_api_session_params` or :setting:`ZYTE_API_SESSION_PARAMS`, the
outcome of the first ``setLocation`` action used, if any, is checked. If the
action fails, the session is discarded. If the action is not even available for
a given website, the spider is closed with ``unsupported_set_location`` as the
close reason; in that case, you should define proper :ref:`session
initialization logic <session-init>` for requests targeting that website.

For sessions initialized without a configured location, no session check is
performed; sessions are assumed to be fine until they expire or are banned.
That is so even if those arbitrary parameters include a ``setLocation`` action.
That is so even if session initialization parameters include a ``setLocation``
action.

To implement your own code to check session responses and determine whether
their session should be kept or discarded, use the
:setting:`ZYTE_API_SESSION_CHECKER` setting.

If you need to check session validity for multiple websites, it is better to
define a separate :ref:`session config override <session-configs>` for each
website, each with its own implementation of
:meth:`~scrapy_zyte_api.SessionConfig.check`.
:setting:`ZYTE_API_SESSION_CHECKER` setting. If you need to check session
validity for multiple websites, it is better to define a separate :ref:`session
config override <session-configs>` for each website, each with its own
implementation of :meth:`~scrapy_zyte_api.SessionConfig.check`.
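
A custom checker could look roughly like the following sketch. It assumes
that the checker is a class exposing a ``check(response, request)`` method
that returns ``True`` to keep the session and ``False`` to discard it,
mirroring :meth:`~scrapy_zyte_api.SessionConfig.check`; confirm the exact
contract in the :setting:`ZYTE_API_SESSION_CHECKER` reference. The class name
and the module path in the comment are hypothetical.

```python
class PostalCodeChecker:
    """Hypothetical checker: keep a session only if the page shows the
    expected postal code."""

    def __init__(self, expected=b"10001"):
        self.expected = expected

    def check(self, response, request):
        # Assumed contract: True keeps the session, False discards it.
        return self.expected in response.body


# settings.py (the module path below is hypothetical):
# ZYTE_API_SESSION_CHECKER = "myproject.checkers.PostalCodeChecker"
```

The sketch only reads ``response.body``, so it does not depend on which
location or parameters were used to initialize the session.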

The :reqmeta:`zyte_api_session_location` and :reqmeta:`zyte_api_session_params`
request metadata keys, if present in a request that triggers a session
initialization request, will be copied into the session initialization request,
so that they are available when :setting:`ZYTE_API_SESSION_CHECKER` or
request metadata keys, if present in a request that :ref:`triggers a session
initialization request <pool-size>`, will be copied into the session
initialization request, so that they are available when
:setting:`ZYTE_API_SESSION_CHECKER` or
:meth:`~scrapy_zyte_api.SessionConfig.check` are called for a session
initialization request.

Expand Down Expand Up @@ -206,7 +235,8 @@ By default, scrapy-zyte-api maintains a separate pool of sessions per domain.
If you use the :reqmeta:`zyte_api_session_params` or
:reqmeta:`zyte_api_session_location` request metadata keys, scrapy-zyte-api
will automatically use separate session pools within the target domain for
those parameters or locations.
those parameters or locations. See :meth:`~scrapy_zyte_api.SessionConfig.pool`
for details.

If you want to customize further which pool is assigned to a given request,
e.g. to have the same pool for multiple domains or use different pools within
Expand All @@ -220,15 +250,32 @@ of concurrent, active, working sessions per pool. The
:setting:`ZYTE_API_SESSION_POOL_SIZES` setting allows defining different values
for specific pools.

.. _pool-size:

The actual number of sessions created for a session pool depends on the number
of requests that ask for a session from that pool, and the lifetime of those
sessions:

- When a request asks for a session from a given pool, if the session pool
has not yet reached its desired pool size, a :ref:`session initialization
request <session-init>` is triggered. If the session pool has been filled,
an existing session is used instead.

- When a response associated with a session pool indicates that the session
expired, exceeded the error limit (see
:setting:`ZYTE_API_SESSION_MAX_ERRORS`), or failed a :ref:`validity check
<session-check>`, a :ref:`session initialization request <session-init>` is
triggered to replace that session in the session pool.
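
The two rules above can be modelled with a toy pool. This is a minimal sketch
for intuition only, not scrapy-zyte-api code: the ``ToySessionPool`` class,
its round-robin rotation and the session names are all assumptions.

```python
import itertools


class ToySessionPool:
    """Toy model of pool filling and session replacement."""

    def __init__(self, size=8):
        self.size = size
        self.sessions = []
        self.init_requests = 0  # counts simulated initialization requests
        self._ids = itertools.count(1)
        self._rotation = 0

    def _init_session(self):
        # Stands in for sending a session initialization request.
        self.init_requests += 1
        session = f"session-{next(self._ids)}"
        self.sessions.append(session)
        return session

    def get(self):
        # Pool not full yet: trigger a new session initialization.
        if len(self.sessions) < self.size:
            return self._init_session()
        # Pool full: reuse an existing session (round-robin, by assumption).
        session = self.sessions[self._rotation % len(self.sessions)]
        self._rotation += 1
        return session

    def discard(self, session):
        # Expired, over the error limit or failed check: replace the session.
        self.sessions.remove(session)
        return self._init_session()


pool = ToySessionPool(size=2)
first, second, third = pool.get(), pool.get(), pool.get()
replacement = pool.discard(first)
```

With a pool size of 2, the third request reuses an existing session, and
discarding a session triggers one more initialization to refill the pool.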


.. _optimize-sessions:

Optimizing sessions
===================

For faster crawls and lower costs, especially where session initialization
requests are more expensive than session usage requests (e.g. because
initialization relies on ``browserHtml`` and usage relies on
requests are more expensive than session usage requests (e.g. scenarios where
initialization relies on ``browserHtml`` while usage relies on
``httpResponseBody``), you should try to make your sessions live as long as
possible before they are discarded.

Expand Down Expand Up @@ -273,9 +320,10 @@ Overriding session configs

For spiders that target a single website, using settings and request metadata
keys for :ref:`session initialization <session-init>` and :ref:`session
checking <session-check>` should do the job. However, for broad crawls or
:doc:`multi-website spiders <zyte-spider-templates:index>`, you might want to
define different session configs for different websites.
checking <session-check>` should do the job. However, for broad-crawl spiders,
:doc:`multi-website spiders <zyte-spider-templates:index>`, or for code
reusability purposes, you might want to define different session configs for
different websites.

The default session config is implemented by the
:class:`~scrapy_zyte_api.SessionConfig` class:
Expand Down