Added some functions to iter.py:
- first_i: an index version of first
- isin_sorted: a numba-compiled 1D version of np.isin optimised for sorted arrays (10x faster)
- isin_sorted_intervals: similar to the above, but returning the index intervals of subsequences of the first array which are also found in the second; allows non-strictness (i.e. ignoring duplicates on both sides)
- complement_intervals: inverts a sequence of intervals; mostly written for use with isin_sorted_intervals
The isin_* functions are the first use of numba in iter.py; they could move elsewhere in the future, but for these tasks it does make sense to use arrays rather than lists.
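
A minimal sketch of the intended workflow (illustrative values; assumes the signatures introduced in this commit):

```python
import numpy as np
from Generic_Util.iter import isin_sorted_intervals, complement_intervals

xs = np.array([0, 2, 3, 5, 7, 8], dtype = np.int64) # both arrays must be sorted
ys = np.array([2, 3, 7, 8], dtype = np.int64)

shared = isin_sorted_intervals(xs, ys, True) # index intervals of xs subsequences also in ys: [[1, 2], [4, 5]]
gaps = complement_intervals(shared, len(xs)) # the complementary intervals: [[0, 0], [3, 3]]
```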
T-Flet committed Jun 23, 2023
1 parent 9702612 commit b4a9355
Showing 7 changed files with 112 additions and 35 deletions.
14 changes: 7 additions & 7 deletions README.md
@@ -22,24 +22,24 @@ It also contains a variety of convenience functions for the
[numba](https://numba.pydata.org/) JIT compiler library.

See the [documentation][rtd-link] for details, but functions are grouped as follows:
- Generic_Util.benchmarking: functions covering typical code-timing scenarios,
- `Generic_Util.benchmarking`: functions covering typical code-timing scenarios,
such as a "with" statement context, an n-executions timer,
and a convenient function for comparing and summarising n execution times of different implementations
of the same function.
- Generic_Util.iter: iterable-focussed functions, covering multiple varieties of
- `Generic_Util.iter`: iterable-focussed functions, covering multiple varieties of
flattening, iterable combining,
grouping and predicate/property-based processing (including topological sorting),
element-comparison-based operations, value interspersal, and finally batching.
- Generic_Util.operator: functions regarding item retrieval, and syntactic sugar for patterns of function application.
- Generic_Util.misc: functions with less generic purpose than in the above; currently mostly to do with min/max-based operations.
- `Generic_Util.operator`: functions regarding item retrieval, and syntactic sugar for patterns of function application.
- `Generic_Util.misc`: functions with less generic purpose than in the above; currently mostly to do with min/max-based operations.

Then a sub-package is dedicated to utility functions for the [numba](https://numba.pydata.org/) JIT compiler library:
- Generic_Util.numba.benchmarking: functions comparing execution times of (semi-automatically-generated) varieties of
- `Generic_Util.numba.benchmarking`: functions comparing execution times of (semi-automatically-generated) varieties of
numba-compilations of a given function, including
lazy vs eager compilation, vectorisation, parallelisation, as well as varieties of rolling (see Generic_Util.numba.higher_order).
- Generic_Util.numba.higher_order: higher-order numba-compilation functions, currently only functions to "roll" simpler functions
- `Generic_Util.numba.higher_order`: higher-order numba-compilation functions, currently only functions to "roll" simpler functions
(1d-to-scalar or 2d-to-scalar/1d) over arrays, with a few combinations of input and output type signatures.
- Generic_Util.numba.types: convenient shorthands for frequently used numba (and respective numpy) types, with a focus on
- `Generic_Util.numba.types`: convenient shorthands for frequently used numba (and respective numpy) types, with a focus on
C-contiguity of arrays; these are useful in declaring eager-compilation function signatures.

Many functions which would have been included in this package were dropped in favour of using those in the wonderful
19 changes: 10 additions & 9 deletions docs/index.rst
@@ -17,16 +17,16 @@ It also contains a variety of convenience functions for the

The functions are grouped as follows:

- Generic_Util.benchmarking: functions covering typical code-timing scenarios,
- :py:mod:`Generic_Util.benchmarking`: functions covering typical code-timing scenarios,
such as a "with" statement context, an n-executions timer,
and a convenient function for comparing and summarising n execution times of different implementations
of the same function.
- Generic_Util.iter: iterable-focussed functions, covering multiple varieties of
- :py:mod:`Generic_Util.iter`: iterable-focussed functions, covering multiple varieties of
flattening, iterable combining,
grouping and predicate/property-based processing (including topological sorting),
element-comparison-based operations, value interspersal, and finally batching.
- Generic_Util.operator: functions regarding item retrieval, and syntactic sugar for patterns of function application.
- Generic_Util.misc: functions with less generic purpose than in the above; currently mostly to do with min/max-based operations.
- :py:mod:`Generic_Util.operator`: functions regarding item retrieval, and syntactic sugar for patterns of function application.
- :py:mod:`Generic_Util.misc`: functions with less generic purpose than in the above; currently mostly to do with min/max-based operations.

Many functions which would have been included here were dropped in favour of using those in the wonderful
`more-itertools <https://github.com/more-itertools/more-itertools>`_ package
@@ -37,13 +37,13 @@ source of algorithm-simplifying ingredients.

Then a sub-package is dedicated to utility functions for the `numba <https://numba.pydata.org/>`_ JIT compiler library:

- Generic_Util.numba.benchmarking: functions comparing execution times of (semi-automatically-generated) varieties of
- :py:mod:`Generic_Util.numba.benchmarking`: functions comparing execution times of (semi-automatically-generated) varieties of
numba-compilations of a given function, including
lazy vs eager compilation, vectorisation, parallelisation, as well as varieties of rolling (see Generic_Util.numba.higher_order).
- Generic_Util.numba.higher_order: higher-order numba-compilation functions, currently only functions to "roll" simpler functions
- :py:mod:`Generic_Util.numba.higher_order`: higher-order numba-compilation functions, currently only functions to "roll" simpler functions
(1d-to-scalar or 2d-to-scalar/1d) over arrays, with a few combinations of input and output type signatures.
- Generic_Util.numba.types: convenient shorthands for frequently used numba (and respective numpy) types, with a focus on
C-contiguity of arrays; these are useful in declaring eager-compilation function signatures.
- :py:mod:`Generic_Util.numba.types`: convenient shorthands for frequently used numba (and respective numpy) types, with a focus on
C-contiguity of arrays; these are useful in declaring eager-compilation function signatures.



@@ -62,4 +62,5 @@ Indices and tables

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
..
* :ref:`search`
1 change: 1 addition & 0 deletions pyproject.toml
@@ -26,6 +26,7 @@ dependencies = [
"typing_extensions >=3.7; python_version<'3.8'",
"pandas>=1.5.3",
"numpy>=1.23.5",
"numba>=0.56.4",
"sortedcontainers>=2.4.0",
]

4 changes: 2 additions & 2 deletions src/Generic_Util/benchmarking.py
@@ -12,7 +12,7 @@

@contextmanager
def time_context(name: str = None):
'''"with" statement context for timing the execution of the enclosed code block, i.e. `with time_context('Name of code block'): ...` '''
'''"with" statement context for timing the execution of the enclosed code block, i.e. ``with time_context('Name of code block'): ...`` '''
start = time.perf_counter()
yield # No need to yield anything or for the above to be in a "try" and the below in a customary "finally" since there is no dangling resource
end = time.perf_counter()
@@ -58,7 +58,7 @@ def time_n(f: Callable, n = 2, *args, **kwargs):
def compare_implementations(fs_with_shared_args: dict[str, Callable], n = 200, wait = 1, verbose = True,
fs_with_own_args: dict[str, tuple[Callable, list, dict]] = None, args: list = None, kwargs: dict = None):
'''Benchmark multiple implementations of the same function called n times (each with the same args and kwargs), with a break between functions.
Recommended later output view if verbose is False: `print(table.to_markdown(index = False))`.
Recommended later output view if verbose is False: ``print(table.to_markdown(index = False))``.
:param fs_with_own_args: alternative to fs_with_shared_args, args and kwargs arguments: meant for additional functions taking different *args and **kwargs.'''
assert n >= 3
table = []
99 changes: 87 additions & 12 deletions src/Generic_Util/iter.py
@@ -11,6 +11,8 @@
import numpy as np
from numpy.typing import NDArray

from Generic_Util.numba.types import njit, b1, b1A, b1_NP, i8, i8A, f8A, i8A2, i8_NP

from typing import TypeVar, Callable, Union, Sequence, Iterable, Iterator, Generator, Any, Generic, Mapping
_a = TypeVar('_a')
_b = TypeVar('_b')
@@ -32,7 +34,7 @@ def deep_flatten(xss_: Iterable) -> Generator:

def deep_extract(xss_: Iterable[Iterable], *key_path) -> Generator:
'''Given a nested combination of iterables and a path of keys in it, return a deep_flatten-ed list of entries under said path.
Note: `deep_extract(xss_, *key_path) == deep_flatten(Generic_Util.operator.get_nested(xss_, *key_path))`'''
Note: ``deep_extract(xss_, *key_path) == deep_flatten(Generic_Util.operator.get_nested(xss_, *key_path))``'''
level = xss_
for k in key_path: level = level[k]
return deep_flatten(level)
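
For instance (a hypothetical nesting; assumes deep_flatten descends arbitrary iterable nesting):

```python
data = {'a': {'b': [[1, 2], (3, [4])]}} # hypothetical nested structure
list(deep_extract(data, 'a', 'b'))      # -> [1, 2, 3, 4]
```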
@@ -64,16 +66,16 @@ def all_combinations(xs: Sequence, min_size = 1, max_size = None) -> list:
## Predicate/Property-Based Functions

def partition(p: Callable[[_a], bool], xs: Iterable[_a]) -> tuple[Iterable[_a], Iterable[_a]]:
'''Haskell's partition function, partitioning xs by some boolean predicate p: `partition p xs == (filter p xs, filter (not . p) xs)`.'''
'''Haskell's partition function, partitioning xs by some boolean predicate p: ``partition p xs == (filter p xs, filter (not . p) xs)``.'''
acc = ([],[])
for x in xs: acc[not p(x)].append(x)
return acc

def group_by(f: Callable[[_a], _b], xs: Iterable[_a]) -> dict[_b, list[_a]]:
'''Generalisation of partition to any-output key-function.
Notes:
- 'Retrieval' functions from the operator package are typical f values (`itemgetter(...)`, `attrgetter(...)` or `methodcaller(...)`)
- This is NOT Haskell's groupBy function'''
Notes:
- 'Retrieval' functions from the operator package are typical f values (``itemgetter(...)``, ``attrgetter(...)`` or ``methodcaller(...)``)
- This is NOT Haskell's groupBy function'''
acc = defaultdict(list)
for x in xs: acc[f(x)].append(x)
return acc
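
A quick illustration (the result is a defaultdict of lists):

```python
group_by(len, ['a', 'bb', 'cc', 'def']) # -> {1: ['a'], 2: ['bb', 'cc'], 3: ['def']}
```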
@@ -82,15 +84,19 @@ def first(c: Callable[[_a], bool], xs: Iterable[_a], default: _a = None) -> _a:
'''Return the first value in xs which satisfies condition c.'''
return next((x for x in xs if c(x)), default)

def first_i(c: Callable[[_a], bool], xs: Sequence[_a], default: int = None) -> int:
    '''Return the index of the first value in xs which satisfies condition c.'''
    return next((i for i in range(len(xs)) if c(xs[i])), default)
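
Contrasting first and first_i on the same inputs (illustrative values):

```python
first(lambda x: x > 2, [1, 2, 4, 8])   # -> 4 (the first satisfying value)
first_i(lambda x: x > 2, [1, 2, 4, 8]) # -> 2 (its index)
```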

def foldq(f: Callable[[_b, _a], _b], g: Callable[[_b, _a, list[_a]], list[_a]], c: Callable[[_a], bool], xs: Sequence[_a], acc: _b) -> tuple[_b, list[_a]]:
'''
Fold-like higher-order function where xs is traversed by consumption conditional on c, and remaining xs are updated by g
(therefore consumption order is not known a priori):
- the first/next item to be ingested is the first in the remaining xs to fulfil condition c
- at every x ingestion, the item is removed from (a copy of) xs, and all the remaining ones are potentially modified by function g
- this function always returns a tuple of `(acc, remaining_xs)`, unlike the stricter `foldq_`, which raises an exception for leftover xs
- this function always returns a tuple of ``(acc, remaining_xs)``, unlike the stricter ``foldq_``, which raises an exception for leftover xs
Note: `fold(f, xs, acc) == foldq(f, lambda acc, x, xs: xs, lambda x: True, xs, acc)`.
Note: ``fold(f, xs, acc) == foldq(f, lambda acc, x, xs: xs, lambda x: True, xs, acc)``.
Sequence of suitable names leading to the current one: consumption_fold, condition_update_fold, cu_fold, q_fold, qfold or foldq
:param f: 'Traditional' fold function :: acc -> x -> acc
@@ -119,7 +125,7 @@ def full_step(acc, xs): # Alternative implementation: move function content insi
return acc, xs

def foldq_(f: Callable[[_b, _a], _b], g: Callable[[_b, _a, list[_a]], list[_a]], c: Callable[[_a], bool], xs: list[_a], acc: _b) -> _b:
r'''Stricter version of foldq (see its description for details); only returns the accumulator and raises an exception on leftover xs.
'''Stricter version of foldq (see its description for details); only returns the accumulator and raises an exception on leftover xs.
:raises ValueError on leftover xs'''
acc, xs = foldq(f, g, c, xs, acc)
if xs: raise ValueError('No suitable next element found for given condition while elements remain')
@@ -157,7 +163,7 @@ def unique_by(f: Callable[[_a], Any], xs: Iterable[_a]) -> list[_a]:
return [x for x in xs if (fx := f(x), ) if fx not in seen and not seen.append(fx)] # Neat true-tuple assignment and neat short-circuit 'and' trick

def eq_elems(xs: Iterable[_a], ys: Iterable[_a]) -> bool:
'''Equality of iterables by their elements'''
'''Equality of iterables by their elements.'''
cys = list(ys) # make a mutable copy
try:
for x in xs: cys.remove(x)
@@ -166,15 +172,84 @@ def eq_elems(xs: Iterable[_a], ys: Iterable[_a]) -> bool:

def diff(xs: Iterable[_a], ys: Iterable[_a]) -> list[_a]:
'''Difference of iterables.
Notes:
- not a set difference, so strictly removing as many xs duplicate entries as there are in ys
- preserves order in xs'''
Notes:
- not a set difference, so strictly removing as many xs duplicate entries as there are in ys
- preserves order in xs'''
cxs = list(xs) # make a mutable copy
try:
for y in ys: cxs.remove(y)
except ValueError: pass
return cxs

@njit([b1A(i8A, i8A), b1A(f8A, f8A)])
def isin_sorted(xs: NDArray[_a], ys: NDArray[_a]) -> NDArray[b1_NP]:
'''Optimised (10x faster) 1D version of np.isin assuming BOTH xs and ys are (ascendingly) sorted arrays of the same type (both int or both float):
return a boolean array of the same length as xs indicating whether the corresponding element at that index is present in ys.'''
res = np.zeros_like(xs, dtype = b1_NP)
i = j = 0
while i < len(xs) and j < len(ys):
        if ys[j] < xs[i]: j += 1 # Let the ys catch up to the xs
elif ys[j] == xs[i]: res[i], i = True, i + 1 # Could increase j here as well, but it would imply assuming ys is strictly increasing
else: i += 1 # Let the xs catch up to the ys
return res
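
A minimal check of the behaviour (the eager signatures above cover int64 and float64 arrays, hence the explicit dtype):

```python
import numpy as np
from Generic_Util.iter import isin_sorted

xs = np.array([1, 3, 5, 7], dtype = np.int64)
ys = np.array([3, 4, 7], dtype = np.int64)
isin_sorted(xs, ys) # -> array([False, True, False, True])
np.isin(xs, ys)     # same result, without exploiting sortedness
```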

@njit([i8A2(i8A, i8A, b1), i8A2(f8A, f8A, b1)])
def isin_sorted_intervals(xs: NDArray[_a], ys: NDArray[_a], strict = True) -> NDArray[i8_NP]:
'''Assuming BOTH xs and ys are (ascendingly) sorted arrays of the same type (both int or both float):
return a 2D array of the intervals (in terms of indices of xs) of subsequences shared with ys.
Notes:
- The interval-end indices are of the last matching value, not of the next (non-matching) one; run ``res[:,1] += 1`` to switch the behaviour
- If the desired intervals are the opposite of the matching ones, call ``complement_intervals(intervals, len(xs), closed_interval_ends = True)``
    :param strict: whether the xs and ys subsequences need to match exactly (e.g. 1223 will not match 123, and vice versa, if strict).
        If strict is requested but the inputs do contain duplicates, an extra length-1 interval is produced for each duplicate.'''
    intervals = np.empty(((len(xs) + 1) // 2, 2), dtype = i8_NP) # There can be at most ceil(len(xs) / 2) intervals (alternating match / non-match)
    i = j = k = 0
    if strict: # xs and ys subsequences need to match exactly
        while i < len(xs) and j < len(ys):
            if ys[j] < xs[i]: j += 1 # Let the ys catch up to the xs
            elif ys[j] == xs[i]:
                intervals[k, 0] = i
                while i < len(xs) and j < len(ys) and xs[i] == ys[j]: i, j = i + 1, j + 1
                intervals[k, 1] = i - 1
                k += 1
            else: i += 1 # Let the xs catch up to the ys
    else: # xs and ys subsequences tolerate non-matching duplicates
        while i < len(xs) and j < len(ys):
            if ys[j] < xs[i]: j += 1 # Let the ys catch up to the xs
            elif ys[j] == xs[i]:
                intervals[k, 0] = i
                while i < len(xs) and j < len(ys):
                    if xs[i] == ys[j]: i, j = i + 1, j + 1 # Guaranteed outcome in the first iteration, hence the reasoning with -1s below
                    elif xs[i] == xs[i - 1]: i += 1
                    elif ys[j] == ys[j - 1]: j += 1
                    else: break
                intervals[k, 1] = i - 1
                k += 1
            else: i += 1 # Let the xs catch up to the ys
    return intervals[:k, ...] # k rather than k - 1 since k was already incremented past the last interval
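
A sketch of the strict/non-strict distinction (illustrative values; interval ends are closed, i.e. included):

```python
import numpy as np
from Generic_Util.iter import isin_sorted_intervals

xs = np.array([1, 2, 2, 3], dtype = np.int64)
ys = np.array([1, 2, 3], dtype = np.int64)
isin_sorted_intervals(xs, ys, True)  # strict: the duplicate breaks the run -> [[0, 1], [3, 3]]
isin_sorted_intervals(xs, ys, False) # non-strict: duplicates are ignored   -> [[0, 3]]
```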

def complement_intervals(intervals: NDArray[int], true_length: int, closed_interval_ends = True) -> NDArray[int]:
'''Invert a series of index intervals (in the form of an (n,2)-array),
i.e. return an (m,2)-array of intervals starting after the ends of the given ones and ending before their starts.
:param true_length: 1 more than the final index for the overall range these intervals are within (which might be ``intervals[-1,1]``); if coming from isin_sorted_intervals, then simply ``len(xs)``
:param closed_interval_ends: whether BOTH input and output intervals are (and will be) closed, i.e. their end-index is INCLUDED in the interval, rather than being the first index after it
'''
    intervals = intervals.copy() # Operate on a copy so that the caller's array is not mutated
    if closed_interval_ends: intervals[:, 1] += 1

# Drop to 1D and toggle the presence of initial and final indices
indices = np.reshape(intervals, intervals.size)
indices = indices[1:] if indices[0] == 0 else np.insert(indices, 0, 0)
indices = indices[:-1] if indices[-1] == true_length else np.append(indices, true_length)

# Return to 2D (cardinality is even again, as both previous steps changed it by 1)
flipped_intervals = np.reshape(indices, (len(indices) // 2, 2))

if closed_interval_ends: flipped_intervals[:, 1] -= 1
return flipped_intervals
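
For instance, with closed interval ends (the default):

```python
import numpy as np
from Generic_Util.iter import complement_intervals

intervals = np.array([[1, 2], [4, 5]]) # indices 1-2 and 4-5 are covered
complement_intervals(intervals, 8)     # -> [[0, 0], [3, 3], [6, 7]]
```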



## Interspersing Functions
2 changes: 1 addition & 1 deletion src/Generic_Util/misc.py
@@ -15,7 +15,7 @@ def interval_overlap(ab: tuple[float, float], cd: tuple[float, float]) -> float:
def min_max(xs: Sequence[_a]) -> tuple[_a, _a]:
'''Mathematically most efficient joint identification of min and max (minimum comparisons = 3n/2 - 2).
Note:
- This function is numba-compilable, e.g. as `njit(nTup(f8,f8)(f8[::1]))(min_max)` (see `Generic_Util.numba.types` for `nTup` shorthand),
- This function is numba-compilable, e.g. as ``njit(nTup(f8,f8)(f8[::1]))(min_max)`` (see ``Generic_Util.numba.types`` for ``nTup`` shorthand),
- If using numpy arrays, min and max are cached for O(1) lookup, and one would imagine this is the used algorithm'''
if xs[0] > xs[1]: min, max = xs[1], xs[0] # Initialise
else: min, max = xs[0], xs[1]
8 changes: 4 additions & 4 deletions src/Generic_Util/operator.py
@@ -27,21 +27,21 @@ def get_nested(xss_: Iterable[Iterable], *key_path) -> Generator:

def on(f: Callable, xs: Iterable[_a], g: Callable[[_a, ...], _b], *args, **kwargs):
'''Transform xs by element-wise application of g and call f with them as its arguments.
E.g. `on(operator.gt, (a, b), len)`
E.g. ``on(operator.gt, (a, b), len)``
Notes:
- *args, **kwargs are for g, not f
- 'Retrieval' functions from the operator package are reasonable g values (`itemgetter(...)`, `attrgetter(...)` or `methodcaller(...)`),
- 'Retrieval' functions from the operator package are reasonable g values (``itemgetter(...)``, ``attrgetter(...)`` or ``methodcaller(...)``),
BUT on_a and on_m are shorthands for the attribute and method cases'''
return f(*[g(x, *args, **kwargs) for x in xs])

def on_a(f: Callable, xs: Iterable, a: str):
'''Extract attribute a from xs elements and call f with them as its arguments.
E.g. `on_a(operator.eq, [a, b], '__class__')`'''
E.g. ``on_a(operator.eq, [a, b], '__class__')``'''
return f(*[getattr(x, a) for x in xs])

def on_m(f: Callable, xs: Iterable, m: str, *args, **kwargs):
'''Call method m on xs elements and call f with their results as its arguments.
E.g. `on_m(operator.gt, [a, b], 'count', 'hello')`
E.g. ``on_m(operator.gt, [a, b], 'count', 'hello')``
Notes:
- *args, **kwargs are for method m, not f'''
return f(*[getattr(x, m)(*args, **kwargs) for x in xs])
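
Small illustrations of on and on_m (values made up for the example):

```python
import operator
from Generic_Util.operator import on, on_m

on(operator.gt, ('hello', 'hi'), len)            # len('hello') > len('hi') -> True
on_m(operator.gt, ['hello', 'hi'], 'count', 'l') # 'hello'.count('l') > 'hi'.count('l') -> True
```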
