pypdu

This module provides basic read-only access to the data contained in Prometheus on-disk files from Python.

pypdu may be installed from pip (on Linux and macOS):

pip install pypdu

pypdu can optionally expose samples in a numpy array if numpy is installed. If you need this, you can either ensure numpy is installed, or have it pulled in by pypdu as a dependency with:

pip install pypdu[numpy]

Example usage:

#!/usr/bin/env python3

import pypdu

data = pypdu.load("/path/to/stats_data")

for series in data:
    print(series.name) # equivalent to series.labels["__name__"]
    print(series.labels)
    print(len(series.samples)) # number of samples can be computed
                               # without iterating all of them
    for sample in series.samples:
        print(f"{sample.timestamp} : {sample.value}")

Or the series and samples can be unpacked:

for name, labels, samples in data:
    print(name)
    print(labels)
    print(len(samples))
    for timestamp, value in samples:
        print(f"{timestamp} : {value}")

Conversion methods

Manipulating large time series as lists-of-lists is likely to perform poorly in Python. pypdu can instead expose samples as a thin Python wrapper around an underlying C++ type.

This wrapper exposes "list like" operations:

>>> series = data["foobar"]
>>> vector = series.samples.as_vector()
>>> vector[0]
{timestamp=1664592572000, value=0.000000}
>>> vector[0].timestamp
1664592572000

pypdu also provides a convenience method, to_list(), which has the same interface but returns pure Python types.

These conversions can also apply some common manipulations to the time series:

  • Scaling the timestamps to seconds
series.samples.as_vector(timestamp_units=pypdu.Seconds)
  • Filtering NaN values out of the time series
series.samples.as_vector(filter_nan_values=True)
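
For readers who want to see what these two options amount to, here is a minimal pure-Python sketch of the same manipulations applied to plain (timestamp, value) pairs. The helper scale_and_filter is hypothetical and not part of pypdu, and truncating timestamps to whole seconds is an assumption made for illustration.

```python
import math

def scale_and_filter(samples, ms_per_unit=1000, drop_nan=True):
    """Return (timestamp, value) pairs with scaled timestamps and NaNs removed.

    Hypothetical helper: it mimics the effect of timestamp_units and
    filter_nan_values on plain tuples, it is not pypdu's implementation.
    """
    result = []
    for timestamp_ms, value in samples:
        if drop_nan and math.isnan(value):
            continue  # skip samples whose value is NaN
        # integer division is an assumption; it truncates to whole units
        result.append((timestamp_ms // ms_per_unit, value))
    return result

samples = [
    (1664592572000, 0.0),
    (1664592582000, float("nan")),
    (1664592592000, 1.5),
]
converted = scale_and_filter(samples)
print(converted)  # [(1664592572, 0.0), (1664592592, 1.5)]
```

Doing this inside as_vector() avoids the intermediate Python list entirely, which is the point of the built-in options.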

numpy

If numpy is installed, samples can additionally be accessed as a numpy array. This may avoid copying the samples around if your code expects numpy arrays. E.g.,

for name, labels, samples in data:
    arr = samples.as_array()
    print(arr.dtype)
    print(arr[0])

prints:

dtype([('timestamp', '<i8'), ('value', '<f8')])
(1653556688725, 0.)

as_array() also accepts timestamp_units and filter_nan_values as above.

If numpy is not available at runtime, this will raise an exception:

RuntimeError: Accessing samples as a numpy array requires numpy to be installed

Filtering time series

If only a subset of the time series are desired, pypdu can filter them based on label values, and avoid parsing unneeded series at all:

for series in data.filter({"__name__":"sysproc_page_faults_raw"}):
    ...

This will usually perform better than filtering "manually" in python after the fact.

Multiple labels can be specified:

data.filter({"__name__":"sysproc_page_faults_raw", "proc":"memcached"})

ECMAScript regexes can also be used:

data.filter({"proc":pypdu.regex("^go.*")})

Or even arbitrary Python callbacks:

data.filter({"proc":lambda x: x.startswith("go")})

As shorthand, when filtering on __name__ alone, just a string may be provided.

data.filter("sysproc_page_faults_raw")
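
The matcher kinds above (exact string, regex, callback, bare-string shorthand) can be pictured with a small pure-Python sketch operating on plain label dicts. This is only an illustration of the matching rules, not how pypdu implements them: the functions normalise, label_matches and select are hypothetical, Python's re module stands in for pypdu.regex (which uses ECMAScript syntax), unanchored search semantics are an assumption, and the label values other than those quoted above are made up.

```python
import re

def normalise(query):
    # a bare string is shorthand for matching on __name__ alone
    return {"__name__": query} if isinstance(query, str) else query

def label_matches(value, matcher):
    if callable(matcher):
        return bool(matcher(value))               # arbitrary callback
    if isinstance(matcher, re.Pattern):
        return matcher.search(value) is not None  # regex (assumed: search)
    return value == matcher                       # exact string match

def select(label_sets, query):
    """Return the label dicts accepted by every matcher in the query."""
    matchers = normalise(query)
    return [
        labels for labels in label_sets
        if all(key in labels and label_matches(labels[key], matcher)
               for key, matcher in matchers.items())
    ]

# made-up label sets for illustration
label_sets = [
    {"__name__": "sysproc_page_faults_raw", "proc": "memcached"},
    {"__name__": "sysproc_page_faults_raw", "proc": "goxdcr"},
    {"__name__": "sysproc_cpu_utilization", "proc": "gosecrets"},
]

by_name = select(label_sets, "sysproc_page_faults_raw")
by_regex = select(label_sets, {"proc": re.compile("^go.*")})
by_callback = select(label_sets, {"proc": lambda x: x.startswith("go")})
by_both = select(label_sets, {"__name__": "sysproc_page_faults_raw",
                              "proc": "memcached"})
print(len(by_name), len(by_regex), len(by_callback), len(by_both))  # 2 2 2 1
```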

Single series lookup

If there is only one time series matching your filter, for convenience you can do:

foobar_series = data[{"__name__":"foobar"}]

This is roughly equivalent to:

foobar_series = next(iter(data.filter({"__name__":"foobar"})))

If there are multiple time series matching your filter, this will silently discard all but the lexicographically first (sorted by the key and value of all labels).

If none match, a KeyError is raised.

All types of filter demonstrated above with .filter(...) may be used in this manner also.
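
Because the extra matches are discarded silently, it can help to see which series "lexicographically first" would pick. The sketch below orders plain label dicts by their sorted (key, value) pairs; first_match is a hypothetical helper and the exact ordering pypdu uses is assumed here, not confirmed.

```python
def first_match(label_sets):
    """Return the label dict that sorts first by its (key, value) pairs.

    Raises KeyError when nothing matched, mirroring the behaviour
    described above. The ordering is an assumption for illustration.
    """
    if not label_sets:
        raise KeyError("no matching series")
    return min(label_sets, key=lambda labels: sorted(labels.items()))

# made-up label sets sharing a name but differing in one label
candidates = [
    {"__name__": "foobar", "instance": "b"},
    {"__name__": "foobar", "instance": "a"},
]
chosen = first_match(candidates)
print(chosen)  # {'__name__': 'foobar', 'instance': 'a'}
```

If more than one match is possible and the choice matters, iterating data.filter(...) and checking the labels explicitly is the safer route.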

Calculations

Simple operations (+ - / *) may be applied to Series objects, computing the result lazily.

a = data["foobar"]
b = data["bazqux"]
c = data["spam"]
expression = (a + b) * (c / 100)
for timestamp, value in expression:
    ...

Note: the resulting iterable will contain a sample at each timestamp seen in any of the constituent series. Even if all series are scraped with the same interval, if they are offset from each other this can lead to a lot of values. To avoid this, the expression can be resampled at a given interval:

for timestamp, value in expression.resample(10000): # 10s in ms
    ...

This will lead to one sample exactly every 10000 milliseconds. No interpolation is performed - if a given series did not have a sample at a chosen instant, the most recent value will be used.
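
The "most recent value, no interpolation" behaviour can be sketched in pure Python on a list of (timestamp, value) pairs. The function below is a hypothetical stand-in rather than pypdu's implementation, and it assumes the first output instant coincides with the first sample's timestamp; pypdu may align instants differently.

```python
def resample(samples, interval_ms):
    """Yield one (timestamp, value) pair every interval_ms.

    The value at each instant is the most recent sample at or before
    that instant; nothing is interpolated. samples must be sorted by
    timestamp.
    """
    if not samples:
        return
    start, end = samples[0][0], samples[-1][0]
    index = 0
    for instant in range(start, end + 1, interval_ms):
        # advance to the last sample at or before this instant
        while index + 1 < len(samples) and samples[index + 1][0] <= instant:
            index += 1
        yield instant, samples[index][1]

# made-up, irregularly spaced samples (timestamps in ms)
samples = [(0, 1.0), (4000, 2.0), (12000, 3.0), (31000, 4.0)]
resampled = list(resample(samples, 10000))
print(resampled)
# [(0, 1.0), (10000, 2.0), (20000, 3.0), (30000, 3.0)]
```

Note how the value 3.0 is repeated at 20000 and 30000 because no newer sample exists at those instants.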

IRate

pypdu.irate(expr)

Results in an Expression which computes the instantaneous rate of change based on the current and previous sample - roughly equivalent to Prometheus irate.
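
As a rough picture of what "instantaneous rate" means here, the following pure-Python sketch divides the change in value between consecutive samples by the elapsed time. It is not pypdu's implementation: the per-second unit is an assumption borrowed from Prometheus, and counter resets are deliberately not handled.

```python
def irate_sketch(samples):
    """Yield (timestamp, rate per second) for consecutive sample pairs.

    samples is an iterable of (timestamp_ms, value) sorted by timestamp.
    Hypothetical helper for illustration only.
    """
    previous = None
    for timestamp, value in samples:
        if previous is not None:
            prev_timestamp, prev_value = previous
            elapsed_seconds = (timestamp - prev_timestamp) / 1000
            if elapsed_seconds > 0:
                yield timestamp, (value - prev_value) / elapsed_seconds
        previous = (timestamp, value)

# made-up counter values scraped every 10 seconds
samples = [
    (1664592572000, 10.0),
    (1664592582000, 30.0),
    (1664592592000, 35.0),
]
rates = list(irate_sketch(samples))
print(rates)  # [(1664592582000, 2.0), (1664592592000, 0.5)]
```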

e.g.,

a = data["foobar"]
b = data["bazqux"]
rate = pypdu.irate(a+b/100)
for timestamp, rate_value in rate:
    ...

Sum

As Expression supports addition, the standard Python method sum can be used to add multiple series together.

However, if working with a very large number of series, pypdu.sum may more efficiently construct the Expression result (computation of the summed Samples is identical, however).

e.g.,

series_list = list(data)
py_sum_expr = sum(series_list)
pdu_sum_expr = pypdu.sum(series_list) # may be faster if len(series_list) is large

# but the resulting samples are identical
assert list(pdu_sum_expr) == list(py_sum_expr)

Histograms

PrometheusData(...).histograms allows iterating all histograms represented by the time series in a data directory.

The histograms are exposed as HistogramTimeSeries, grouping all the component ..._bucket time series together. Indexing into this series provides access to the histogram at a single point in time.

e.g.,

data = pypdu.load("<...>")

for histSeries in data.histograms:
    print("Labels: ", histSeries.labels)
    print("Number of samples: ", len(histSeries))
    for hist in histSeries:
        print("TS: ", hist.timestamp)
        print(hist.buckets())

This iterates over every histogram found in the Prometheus data, then over every sample contained in that histogram's time series.

Example output:

Labels:  {'__name__': 'cm_http_requests_seconds', 'instance': 'ns_server', 'job': 'ns_server_high_cardinality'}
Number of samples:  3826
TS:  1621268098827
[(0.001, 8.0), (0.01, 25.0), (0.1, 25.0), (1.0, 25.0), (10.0, 25.0), (inf, 25.0)]
TS:  1621268158827
[(0.001, 39.0), (0.01, 118.0), (0.1, 126.0), (1.0, 127.0), (10.0, 127.0), (inf, 127.0)]
TS:  1621268218827
[(0.001, 43.0), (0.01, 132.0), (0.1, 140.0), (1.0, 141.0), (10.0, 141.0), (inf, 141.0)]
TS:  1621268278827
[(0.001, 48.0), (0.01, 145.0), (0.1, 153.0), (1.0, 154.0), (10.0, 154.0), (inf, 154.0)]
TS:  1621268338827
[(0.001, 53.0), (0.01, 158.0), (0.1, 166.0), (1.0, 167.0), (10.0, 167.0), (inf, 167.0)]
TS:  1621268398827
[(0.001, 55.0), (0.01, 171.0), (0.1, 179.0), (1.0, 180.0), (10.0, 180.0), (inf, 180.0)]
TS:  1621268458827
[(0.001, 60.0), (0.01, 191.0), (0.1, 199.0), (1.0, 200.0), (10.0, 200.0), (inf, 200.0)]
TS:  1621268518827
[(0.001, 66.0), (0.01, 204.0), (0.1, 212.0), (1.0, 213.0), (10.0, 213.0), (inf, 213.0)]
TS:  1621268578827
[(0.001, 71.0), (0.01, 217.0), (0.1, 225.0), (1.0, 226.0), (10.0, 226.0), (inf, 226.0)]
TS:  1621268638827
[(0.001, 73.0), (0.01, 230.0), (0.1, 238.0), (1.0, 239.0), (10.0, 239.0), (inf, 239.0)]
...
Labels: ...

A HistogramTimeSeries (histSeries in the above example) can be indexed into. Currently this is only possible by sample index, but selecting the histogram closest to a given timestamp may be supported in the future.

E.g., the first and last point-in-time views available for a specific histogram can be found with:

first = histSeries[0]
last = histSeries[-1]

The timestamp and buckets can then be read from these:

>>> print(last.timestamp) # time since epoch in ms
1631007596974

>>> print(last.bucket_bounds())
[0.001, 0.01, 0.1, 1.0, 10.0, inf]

>>> print(last.bucket_values())
[4279.0, 4371.0, 4666.0, 5044.0, 5044.0, 5044.0]

>>> print(last.buckets()) # convenience zip of (bounds, values)
[(0.001, 4279.0), (0.01, 4371.0), (0.1, 4666.0), (1.0, 5044.0), (10.0, 5044.0), (inf, 5044.0)]
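
The values in the output above never decrease from one bound to the next, which is consistent with Prometheus's cumulative ..._bucket series. Assuming that is what buckets() returns, a small pure-Python sketch can turn the cumulative values into the number of observations that fell into each individual bucket. The helper per_bucket_counts is hypothetical and not part of pypdu.

```python
def per_bucket_counts(cumulative):
    """Convert cumulative (bound, value) pairs into per-bucket counts."""
    counts = []
    previous = 0.0
    for bound, value in cumulative:
        counts.append((bound, value - previous))
        previous = value
    return counts

# the buckets() output shown above
buckets = [(0.001, 4279.0), (0.01, 4371.0), (0.1, 4666.0),
           (1.0, 5044.0), (10.0, 5044.0), (float("inf"), 5044.0)]

counts = per_bucket_counts(buckets)
print(counts)
# [(0.001, 4279.0), (0.01, 92.0), (0.1, 295.0), (1.0, 378.0), (10.0, 0.0), (inf, 0.0)]
```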

The difference between histograms at two points in time can also be calculated:

>>> delta = last - first
>>> delta.time_delta
60000
>>> delta.buckets()
[(0.001, 653.0), (0.01, 653.0), (0.1, 653.0), (1.0, 653.0), (10.0, 653.0), (inf, 653.0)]

Or the summation of two histograms:

>>> total = histA + histB
>>> total.buckets()
[(0.001, 1985.0), (0.01, 1985.0), (0.1, 1985.0), (1.0, 1985.0), (10.0, 1985.0), (inf, 1985.0)]

For both addition and subtraction, the bucket boundaries of the two histograms must match exactly.
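
To make the boundary requirement concrete, here is a pure-Python sketch of subtracting two lists of (bound, value) pairs, using the first two histograms from the example output earlier in this section. subtract_buckets is a hypothetical helper, and raising ValueError on mismatched bounds is an assumption; the exception pypdu itself raises is not stated here.

```python
def subtract_buckets(later, earlier):
    """Return (bound, later - earlier) pairs; bounds must match exactly."""
    later_bounds = [bound for bound, _ in later]
    earlier_bounds = [bound for bound, _ in earlier]
    if later_bounds != earlier_bounds:
        raise ValueError("bucket boundaries differ")
    return [(bound, later_value - earlier_value)
            for (bound, later_value), (_, earlier_value) in zip(later, earlier)]

inf = float("inf")
earlier = [(0.001, 8.0), (0.01, 25.0), (0.1, 25.0),
           (1.0, 25.0), (10.0, 25.0), (inf, 25.0)]
later = [(0.001, 39.0), (0.01, 118.0), (0.1, 126.0),
         (1.0, 127.0), (10.0, 127.0), (inf, 127.0)]

difference = subtract_buckets(later, earlier)
print(difference)
# [(0.001, 31.0), (0.01, 93.0), (0.1, 101.0), (1.0, 102.0), (10.0, 102.0), (inf, 102.0)]
```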

Serialisation

Time series may be dumped individually to a file or bytes. This may be useful if you need to store some number of series (e.g., in a key-value store), but don't wish to retain the entire Prometheus data directory.

pypdu.dump/pypdu.load take an int file descriptor or, for convenience, a file-like object supporting fileLike.fileno() -> int.

These methods can be used to read/write data from/to a pipe or socket, not just a file on disk. Note that arbitrary file-like objects which are not backed by a file descriptor are not supported.
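
Whether a given Python object is backed by a file descriptor can be checked up front using only the standard library. The sketch below is a hypothetical pre-flight check, not something pypdu provides: a real file yields an int from fileno(), while an in-memory io.BytesIO raises io.UnsupportedOperation.

```python
import io
import tempfile

def file_descriptor(obj):
    """Return an int file descriptor for obj, or None if it has none."""
    if isinstance(obj, int):
        return obj  # already a raw descriptor
    try:
        return obj.fileno()
    except (AttributeError, io.UnsupportedOperation):
        return None  # not backed by a file descriptor

with tempfile.TemporaryFile() as real_file:
    real_fd = file_descriptor(real_file)

in_memory_fd = file_descriptor(io.BytesIO())

print(isinstance(real_fd, int))  # True
print(in_memory_fd)              # None
```

For in-memory data, the bytes-based functions described further down are the appropriate choice.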

If provided a file handle which actually refers to a file on disk, load will try to mmap the file. If this fails, it will fall back to reading it like a stream. If mmapping is not desired, it can be disabled with:

pypdu.load(fileDescriptor, allow_mmap=False)

When loading many series from a stream (socket, pipe, etc), the underlying data for all Series will be read into memory - this may be costly if there are many Series. pypdu.load_lazy can instead be used to consume Series from a stream, one at a time.

for series in pypdu.load_lazy(someSocket):
    # series are read and deserialised on demand while iterating
    ...

pypdu.dumps creates a bytes object, while pypdu.loads operates on a buffer. Anything supporting the buffer protocol exposing a contiguous buffer may be used. This includes bytes objects, but also numpy arrays and many other types.

A memoryview may be used to slice a buffer, allowing deserialisation from part of a buffer, without having to copy out the relevant bytes.
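
The zero-copy slicing mentioned above is standard Python behaviour and can be checked without pypdu. In this sketch the two byte strings are made-up placeholders standing in for serialised series stored back to back; the slice of the memoryview still refers to the original buffer rather than to a copy.

```python
# made-up placeholder payloads, not real pypdu output
first_blob = b"first-series-bytes"
second_blob = b"second-series-bytes"
buffer = first_blob + second_blob

view = memoryview(buffer)
second_view = view[len(first_blob):]  # slicing a memoryview copies nothing

print(second_view.obj is buffer)   # True
print(second_view.contiguous)      # True
print(bytes(second_view))          # b'second-series-bytes'

# second_view satisfies the contiguous-buffer requirement described above,
# so it is the kind of object that could be handed to pypdu.loads.
```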

# fd : int or file-like object with .fileno() method

pypdu.dump(fd, series)
pypdu.dump(fd, [series, series, ...])
pypdu.dump(fd, PrometheusData)

# note, dumps on a lot of series will consume a lot of memory building
# a big bytes object
pypdu.dumps(series) -> bytes
pypdu.dumps([series, series, ...]) -> bytes
pypdu.dumps(PrometheusData) -> bytes

# result of load{,s} depends on what was written
# Deserialised series are entirely in-memory, may consume a lot of
# memory.
pypdu.load(fd) -> Series
pypdu.load(fd) -> [Series, Series,...]

pypdu.loads(buffer) -> Series
pypdu.loads(buffer) -> [Series, Series, ...]

# when loading a lot of series, this is the advised way to avoid
# holding them all in memory at the same time
pypdu.load_lazy(fd) -> Iterable

Example dumping and loading multiple series to/from a file:

to_serialise = []
for series in pypdu.load("foobar/baz/stats_data"):
    if some_condition(series):
        to_serialise.append(series)

with open("somefile", "wb") as f:
    pypdu.dump(f, to_serialise)
...
with open("somefile", "rb") as f:
    for series in pypdu.load_lazy(f):
        # do something with the loaded series
        ...

Example dumping and loading a single series to/from stdin/out:

import sys

data = pypdu.load("foobar/baz/stats_data")
series = data["foobar_series_name"]
pypdu.dump(sys.stdout, series)

...

series = pypdu.load(sys.stdin)

pypdu.json

For performance, pypdu provides a JSON encoder capable of efficiently dumping pypdu types. It can also dump typical Python types (everything supported by the builtin json module), but is not a drop-in replacement in terms of arguments.

data = pypdu.load(...)
series = data["foobar"]
pypdu.json.dumps(series)

will produce:

{
    "metric": {
        "__name__": "some_metric_name",
        "label_foo": "label_foo_value"
    },
    "values": [
        [
            1664592572000,
            0.0
        ],
        [
            1664592582000,
            0.0
        ],
        [
            1664592592000,
            0.0
        ],
        ...
    ]
}

dumps also supports samples, sample vectors, and expressions:

>>> pypdu.json.dumps(series.samples)
"[[1664592572000, 0.0], [1664592582000, 0.0],...]"
>>> pypdu.json.dumps(series.samples.as_vector(timestamp_units=pypdu.Seconds))
"[[1664592572, 0.0], [1664592582, 0.0],...]"
>>> pypdu.json.dumps((series + 1) * 2)
"[[1664592572000, 2.0], [1664592582000, 2.0],...]"
>>> pypdu.json.dumps(((series + 1) * 2).as_vector(timestamp_units=pypdu.Seconds))
"[[1664592572, 2.0], [1664592582, 2.0],...]"

XOR Chunks

For specific use cases, access to the raw XOR-encoded chunk data may be required (see the chunk documentation for the encoding).

To find the chunk objects for a given series:

>>> data = pypdu.load("some_stats_dir")
>>> series = data["foobar_series_name"]
>>> series.chunks
[<pypdu.Chunk object at 0x11c29c270>, <pypdu.Chunk object at 0x11c29dbb0>, ...]

To access the XOR encoded sample data:

>>> chunk = series.chunks[0]
# without copying
>>> memoryview(chunk)
<memory at 0x11c227880>
# with a copy into a python bytes object
>>> chunk.as_bytes()
b'\x00y\xc8\xe0\x8e\...'

Most users will not need to do this, as samples can be read from a pypdu.Series, with the chunks handled transparently.

Runtime version checking

The pypdu version can be specified at install time (e.g., in requirements.txt), but you can also verify the correct version is available at runtime (maybe someone is building locally and forgot to update some dependencies!).

>>> import pypdu
>>> pypdu.__version__
'0.0.12a3'
>>> pypdu.__git_rev__
'a096f0d'
>>> pypdu.__git_tag__
''
>>> pypdu.require(0, 0, 0)
>>> pypdu.require(0, 0, 12)
>>> pypdu.require(0, 1, 0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: Current pypdu version 0.0.12a3 does not meet required 0.1.0
>>> pypdu.require(0, 0, 12, "a3")
>>> pypdu.require(0, 0, 12, "a4")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: Current pypdu version 0.0.12a3 does not meet required 0.0.12a4

If using a feature introduced in version X.Y.Z, pypdu.require(X, Y, Z) will raise an exception if an older version is in use. This exception can be caught, if you want to provide a more specific error message (e.g., "Remember to update dependencies by running ...").
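
One way to wrap this check, sketched under the assumption that pypdu.require raises RuntimeError as in the session above: the function check_pypdu and its messages are hypothetical, and the ImportError branch is added so that the sketch also gives a hint when pypdu is missing altogether.

```python
def check_pypdu(major, minor, patch):
    """Return None if pypdu is new enough, otherwise a hint for the user."""
    try:
        import pypdu
    except ImportError:
        return "pypdu is not installed; try: pip install pypdu"
    try:
        pypdu.require(major, minor, patch)
    except RuntimeError as error:
        # keep the original message and add a more specific suggestion
        return (f"{error}. Remember to update dependencies, "
                "e.g. pip install --upgrade pypdu")
    return None

problem = check_pypdu(0, 0, 12)
if problem is not None:
    print(problem)
```

Calling this once at start-up keeps the version requirement next to the code that depends on it.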

Alternative installation steps

pip install from source

If a wheel is not available for your platform or architecture, pypdu can be built and installed with:

pip install git+https://github.com/jameseh96/pdu.git

or for a specific version:

pip install git+https://github.com/jameseh96/pdu.git@<tag>

where <tag> is the git tag of the desired version.

Building pypdu will require the dependencies listed in the installation instructions.

pypdu is relatively platform independent, but has not been tested on platforms/architectures that don't have a wheel built (e.g., Windows, macOS on Apple Silicon) - be prepared for potential issues at build time and at runtime.

setup.py

pypdu may be installed without pip. To use, clone the repository as in the installation instructions.

Then run:

python setup.py install

manual .so

Alternatively, following the cmake steps in the installation instructions to build the project produces a module with a platform-dependent name - for example, on macOS this may be pypdu.cpython-39-darwin.so.

This can be found either in <build dir>/src/pypdu or in your chosen installation prefix. It can be used without installing via setup.py; simply ensure the containing directory is in your PYTHONPATH.