This release features official, out-of-the-box support for compiling dpctl for specified AMD GPU architectures, the new function tensor.top_k, a radix-sort-based implementation of sorting functions, and improved interoperability with DLPack through tensor.dldevice_to_sycl_device and tensor.sycl_device_to_dldevice.
A number of adjustments were also made to improve the performance of dpctl reductions (e.g., sum, min, max), accumulators (e.g., cumulative_sum, cumulative_logsumexp), and copy-and-cast operations.
Added
- Support for compiling dpctl for a specified AMD GPU architecture using the Codeplay oneAPI plug-in gh-1731
- Added tensor.top_k per the Python Array API specification gh-1921
- Added functions tensor.dldevice_to_sycl_device and tensor.sycl_device_to_dldevice for converting between DLPack and SYCL devices, and a method get_device_id to dpctl.SyclDevice, to improve interoperability with the DLPack protocol gh-1953
- Added DPCTL_OFFLOAD_COMPRESS cmake option (set to OFF by default) to toggle the --offload-compress linker option when building dpctl gh-1961
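To make the top-k entry concrete, here is a minimal pure-Python sketch of what a top-k selection over a 1-D sequence returns. It is a reference illustration only, not dpctl's implementation: the names top_k_reference and TopKResult are hypothetical, and the "largest"/"smallest" mode names and the (values, indices) result layout are assumptions about the interface.

```python
import heapq
from typing import List, NamedTuple, Sequence


class TopKResult(NamedTuple):
    """Hypothetical result container: selected values and their positions."""
    values: List[float]
    indices: List[int]


def top_k_reference(x: Sequence[float], k: int, mode: str = "largest") -> TopKResult:
    """Select the k largest (or smallest) elements of a 1-D sequence.

    Assumption: mode is either "largest" or "smallest".
    """
    if not 0 <= k <= len(x):
        raise ValueError("k must be between 0 and len(x)")
    if mode not in ("largest", "smallest"):
        raise ValueError("mode must be 'largest' or 'smallest'")
    pick = heapq.nlargest if mode == "largest" else heapq.nsmallest
    # Pair every element with its index so the positions survive the selection.
    pairs = pick(k, enumerate(x), key=lambda pair: pair[1])
    return TopKResult(
        values=[value for _, value in pairs],
        indices=[index for index, _ in pairs],
    )


result = top_k_reference([3, 1, 4, 1, 5, 9, 2, 6], k=3)
print(result.values, result.indices)
```

The sketch ignores the axis handling and device execution that an array-library function provides; it only pins down the selection semantics.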
Changed
- Improved performance of copy-and-cast operations from numpy.ndarray to tensor.usm_ndarray for contiguous inputs gh-1829
- py_sort and py_argsort now throw py::value_error if inputs are not C-contiguous gh-1838
- Improved performance of copying operation to C-/F-contiguous arrays, with optimization for batches of square matrices gh-1850
- Improved performance of tensor.argsort function for all types gh-1859
- Improved performance of tensor.sort and tensor.argsort for short arrays in the range [16, 64] elements gh-1866
- Implemented radix sort algorithm to be used in dpt.sort and dpt.argsort gh-1867, gh-1883
- Extended dpctl.SyclTimer with device_timer keyword, implementing different methods of collecting device times gh-1872
- dpctl now sees GPU devices out of the box in a virtual environment on Windows gh-1922
- Improved performance of tensor.cumulative_sum, tensor.cumulative_prod, tensor.cumulative_logsumexp as well as performance of boolean indexing gh-1923, gh-1942
- Improved performance of tensor.min, tensor.max, tensor.logsumexp, tensor.reduce_hypot for floating point type arrays by at least 2x gh-1932, gh-1937
- Updated Cython examples to use scikit-build gh-1935
- Reduced binary size of _tensor_accumulation_impl by 13 MB gh-1957
- Extended tensor.asarray to support objects that implement the __usm_ndarray__ property, which are interpreted as usm_ndarray objects gh-1959
- tensor.usm_ndarray object disallows implicit conversions to NumPy array gh-1964
- stream arguments in tensor.usm_ndarray methods now raise an error if stream is not a dpctl.SyclQueue gh-1969
- dpctl initialization sets subprocess to use SPAWN method on Linux to enable gdb-oneapi to debug kernels submitted from Python applications gh-1971
- Reduced binary size of _tensor_elementwise_impl gh-1976
- Allow dpctl.SyclQueue.memcpy to and from multi-dimensional buffers gh-1985
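For readers unfamiliar with the radix sort mentioned in the list above, the following is a minimal pure-Python sketch of a least-significant-digit radix sort that yields both the sorting permutation and the sorted keys. It only illustrates the general technique: the function names and the radix_bits parameter are hypothetical, it handles non-negative integers only, and it makes no claim about how dpctl's device kernels are organized.

```python
from typing import List, Sequence


def radix_argsort(keys: Sequence[int], radix_bits: int = 8) -> List[int]:
    """Return a stable sorting permutation of non-negative integer keys."""
    if any(key < 0 for key in keys):
        raise ValueError("this sketch supports non-negative integers only")
    n_buckets = 1 << radix_bits
    mask = n_buckets - 1
    order = list(range(len(keys)))
    max_key = max(keys, default=0)
    shift = 0
    # Process one digit of radix_bits bits per pass, least significant first.
    while (max_key >> shift) > 0:
        # Histogram of the current digit.
        counts = [0] * n_buckets
        for i in order:
            counts[(keys[i] >> shift) & mask] += 1
        # An exclusive prefix sum gives the first output slot of each bucket.
        offsets = [0] * n_buckets
        total = 0
        for digit in range(n_buckets):
            offsets[digit] = total
            total += counts[digit]
        # Stable scatter: equal digits keep their relative order.
        new_order = [0] * len(order)
        for i in order:
            digit = (keys[i] >> shift) & mask
            new_order[offsets[digit]] = i
            offsets[digit] += 1
        order = new_order
        shift += radix_bits
    return order


def radix_sort(keys: Sequence[int], radix_bits: int = 8) -> List[int]:
    """Return the keys in ascending order using radix_argsort."""
    return [keys[i] for i in radix_argsort(keys, radix_bits)]


data = [170, 45, 75, 90, 802, 24, 2, 66]
print(radix_sort(data))
print(radix_argsort(data))
```

Because every pass is a stable counting sort, the number of passes depends on the key width rather than on comparisons between keys, which is what makes the approach attractive for fixed-width types.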
Fixed
- Fixed a bug in tensor.roll for very large values of shift gh-1869
- Fix for tensor.result_type when all inputs are Python built-in scalars gh-1877
- Improved error in constructors tensor.full and tensor.full_like when provided a non-numeric fill value gh-1878
- Added a check for pointer alignment when copying to C-contiguous memory gh-1890, gh-1891
- Fixed dpctl installed into virtual environment not finding DPC++ runtime libraries by adding DPCTL_WITH_REDIST cmake option (set to OFF by default) gh-1893
- Fixed incorrect result (issue gh-1901) in tensor.cumulative_sum and in advanced indexing gh-1902
- Fixed __setitem__() for tensor.usm_ndarray when passed an empty boolean mask gh-1915
- tensor.from_dlpack docstring now shows that return type can be NumPy array and stipulates when this will be the case gh-1919
- Fixed docstring in helper class in DLPack tests gh-1920
- Fixed a bug in tensor.astype where copy=False would not be respected for 1d arrays when order keyword is specified gh-1928
- Replaced deprecated CL/sycl.hpp with recommended sycl/sycl.hpp in examples gh-1933
- Fixed tensor.take_along_axis and tensor.put_along_axis raising an error for tensor.uint64 indices when given an array of dimension greater than 1 gh-1934
- Fixed unexpected results of tensor.sum with a requested output type of bool gh-1958
- Use std::move to avoid unnecessary copying of temporary in triul_ctor.cpp gh-1960
- Make stream a keyword-only argument in tensor.usm_ndarray.to_device per requirement by array API specification gh-1966
- Improve efficiency of copy implementation and avoid an unnecessary kernel invocation in tensor.argsort for 1d input gh-1967
- Corrected uses of NumPy constructors with tensor.usm_ndarray inputs in test suite gh-1968
- Fixed array API namespace inspection utilities showing complex128 as a valid dtype on devices without double precision and device keywords not working with dpctl.SyclQueue or filter strings gh-1979
- Fixed a bug in test_sycl_device_interface.cpp which would cause compilation to fail with Clang version 20.0 gh-1989
- Fixed memory leaks in smart-pointer-managed USM temporaries in synchronizing kernel calls gh-2002
- UsmNDArray_MakeSimpleFromPtr and UsmNDArray_MakeFromPtr now raise an error when provided an invalid typenum before attempting to create the array gh-2003
- Fixed typos in tensor.from_numpy and tensor.astype gh-2006
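As a point of reference for the tensor.roll fix listed above, the sketch below shows the expected behavior of a 1-D roll for arbitrarily large shifts: the shift is reduced modulo the length before elements are moved. This is a pure-Python illustration with a hypothetical function name; it does not describe the cause of the original bug or how dpctl implements the operation.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def roll_reference(x: Sequence[T], shift: int) -> List[T]:
    """Roll a 1-D sequence: elements pushed past the end wrap to the front."""
    items = list(x)
    n = len(items)
    if n == 0:
        return items
    # Python's % returns a value in [0, n) for positive n, so very large
    # and negative shifts both reduce to an equivalent small shift.
    s = shift % n
    return items[n - s:] + items[:n - s]


print(roll_reference([0, 1, 2, 3, 4], 2))
print(roll_reference([0, 1, 2, 3, 4], 10**30 + 2))
```

Both calls produce the same result, since 10**30 is a multiple of the sequence length.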
Maintenance
- Revert pinning of cmake to 3.26 on Windows gh-1823
- Update black version used in Python code style workflow gh-1828
- Fixed CI/CD workflow for building conda packages on Windows gh-1831
- Revert work-around in test_sycl_kernel_submit.py for problem in MKL 2024.2.0 gh-1836
- Stop using the deprecated Mambaforge variant of miniforge gh-1844
- Use pybind11=2.13.6 gh-1845
- Remove unnecessary include in C++ header file gh-1846
- Build translation unit "simplify_iteration_space.cpp" compiled multiple times as a static library gh-1847
- Add instructions for installing dpctl from Intel PyPI channel gh-1860
- Fix warnings when generating docs gh-1855, gh-1861
- Align conda recipe with conda-forge's {{ stdlib("c") }} migration gh-1868
- Add missing include of SYCL header to "math_utils.hpp" gh-1899
- Add support of CV-qualifiers in is_complex<T> helper gh-1900
- Tuning work for elementwise functions with modest performance gains (under 10%) gh-1889
- Reduce binary size of accumulators by saving repeated expressions to a temporary gh-1896
- Added workflow to run nightly tests of dpctl gh-1903, gh-1905
- Support and testing for Python 3.13 for dpctl gh-1941, gh-1943
- Change libtensor to use std::size_t and dpctl::tensor::ssize_t throughout and fix missing includes for std::size_t and size_t gh-1950
- Fixed some unqualified size_t and fixed-width integral types in libtensor gh-1955
- Add versioneer as a build requirement in documentation on building dpctl from source gh-1972
- Remove const qualifiers for class and struct members gh-1974, gh-1975
- Various code quality improvements to test_sycl_queue_submit_local_accessor_arg.cpp gh-1990
- Added Python 3.12 to package metadata gh-2005
- Miscellaneous changes to continuous integration/delivery (CI/CD) supporting scripts:
gh-1837, gh-1839, gh-1848, gh-1853, gh-1854, gh-1856, gh-1858, gh-1863, gh-1864, gh-1865, gh-1881, gh-1882, gh-1884, gh-1886, gh-1888, gh-1897, gh-1898, gh-1909, gh-1916, gh-1927, gh-1940, gh-1948, gh-1949, gh-1952, gh-1962, gh-1963, gh-1973, gh-1980, gh-1981, gh-1983, gh-1988
New Contributors
- @sommerlukas made their first contribution in #1985