
v0.19.0

@ndgrigorian released this 28 Feb 19:25
1336b31

This release features official, out-of-the-box support for compiling dpctl for specified AMD GPU architectures, a new function tensor.top_k, a radix-sort-based implementation of the sorting functions, and improved interoperability with DLPack through tensor.dldevice_to_sycl_device and tensor.sycl_device_to_dldevice.

A number of adjustments were also made to improve the performance of dpctl reductions (e.g., sum, min, max), accumulators (e.g., cumulative_sum, cumulative_logsumexp), and copy-and-cast operations.
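
For reference, cumulative_logsumexp returns, at each position i, log(exp(x_0) + ... + exp(x_i)). Evaluating that sum directly overflows for large inputs, so a numerically stable formulation folds in one element at a time with log-add-exp. The pure-Python sketch below illustrates only those semantics; `cumulative_logsumexp_reference` is a hypothetical name, and this is not dpctl's kernel:

```python
import math


def cumulative_logsumexp_reference(xs):
    """Return [log(sum(exp(x_j) for j <= i)) for each i], computed stably (hypothetical helper)."""
    out = []
    acc = -math.inf  # log(0), the identity element of log-add-exp
    for x in xs:
        if acc == -math.inf:
            acc = x
        else:
            # log(exp(a) + exp(b)) = max(a, b) + log1p(exp(-|a - b|)); never exponentiates a large value
            m = max(acc, x)
            acc = m + math.log1p(math.exp(-abs(acc - x)))
        out.append(acc)
    return out
```

For example, `cumulative_logsumexp_reference([1000.0, 1000.0])` returns `1000.0` followed by approximately `1000.0 + log(2)`, whereas `math.exp(1000.0)` on its own raises `OverflowError`.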

Added

  • Support for compiling dpctl for specified AMD GPU architectures using the Codeplay oneAPI plug-in gh-1731
  • Added tensor.top_k per the Python array API specification gh-1921
  • Added functions tensor.dldevice_to_sycl_device and tensor.sycl_device_to_dldevice for converting between DLPack and SYCL devices, and a get_device_id method to dpctl.SyclDevice, to improve interoperability with the DLPack protocol gh-1953
  • Added the DPCTL_OFFLOAD_COMPRESS cmake option (OFF by default) to toggle the --offload-compress linker option when building dpctl gh-1961
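
To illustrate what a top-k operation returns, here is a hypothetical pure-Python reference for 1-D input; it is not dpctl's implementation. The `mode` parameter and the (values, indices) result pair are assumptions modeled on the array API top_k proposal, so consult the dpctl documentation for the exact signature, axis handling, and whether results are returned in sorted order:

```python
from typing import List, NamedTuple, Sequence


class TopKResult(NamedTuple):
    values: List[float]
    indices: List[int]


def top_k_reference(x: Sequence[float], k: int, mode: str = "largest") -> TopKResult:
    """Return the k largest (or smallest) elements of a 1-D sequence and their positions."""
    if not 0 <= k <= len(x):
        raise ValueError("k must satisfy 0 <= k <= len(x)")
    if mode not in ("largest", "smallest"):
        raise ValueError("mode must be 'largest' or 'smallest'")
    # Python's sort is stable even with reverse=True, so ties keep the lower index first.
    order = sorted(range(len(x)), key=lambda i: x[i], reverse=(mode == "largest"))
    idx = order[:k]
    return TopKResult([x[i] for i in idx], idx)
```

Calling `top_k_reference([3, 1, 4, 1, 5], 2)` gives values `[5, 4]` at indices `[4, 2]`.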

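DLPack identifies a device by a `(device_type, device_id)` pair, which is what an array's `__dlpack_device__()` method returns; SYCL devices use the `kDLOneAPI` device type (value 14 in `dlpack.h`). The new functions convert between such pairs and `dpctl.SyclDevice` objects. The pure-Python sketch below only decodes a pair; `describe_dldevice` and `is_oneapi_device` are hypothetical helpers, and it does not show how dpctl assigns the `device_id` that `get_device_id` reports:

```python
from enum import IntEnum
from typing import Tuple


class DLDeviceType(IntEnum):
    # A subset of the DLDeviceType enum defined in DLPack's dlpack.h
    kDLCPU = 1
    kDLCUDA = 2
    kDLROCM = 10
    kDLOneAPI = 14


def describe_dldevice(dl_device: Tuple[int, int]) -> str:
    """Render a (device_type, device_id) pair such as one returned by __dlpack_device__()."""
    device_type, device_id = dl_device
    try:
        name = DLDeviceType(device_type).name
    except ValueError:  # device type not in the subset above
        name = f"unknown({device_type})"
    return f"{name}:{device_id}"


def is_oneapi_device(dl_device: Tuple[int, int]) -> bool:
    """True when the pair refers to a oneAPI (SYCL) device."""
    return dl_device[0] == DLDeviceType.kDLOneAPI
```

For example, `describe_dldevice((14, 0))` returns `"kDLOneAPI:0"`.
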
Changed

  • Improved performance of copy-and-cast operations from numpy.ndarray to tensor.usm_ndarray for contiguous inputs gh-1829
  • py_sort and py_argsort now throw py::value_error if inputs are not C-contiguous gh-1838
  • Improved performance of copying to C-/F-contiguous arrays, with an optimization for batches of square matrices gh-1850
  • Improved performance of tensor.argsort function for all types gh-1859
  • Improved performance of tensor.sort and tensor.argsort for short arrays in the range [16, 64] elements gh-1866
  • Implemented radix sort algorithm to be used in dpt.sort and dpt.argsort gh-1867, gh-1883
  • Extended dpctl.SyclTimer with a device_timer keyword for choosing among different methods of collecting device times gh-1872
  • dpctl now sees GPU devices out of the box in a virtual environment on Windows gh-1922
  • Improved performance of tensor.cumulative_sum, tensor.cumulative_prod, tensor.cumulative_logsumexp as well as performance of boolean indexing gh-1923, gh-1942
  • Improved performance of tensor.min, tensor.max, tensor.logsumexp, and tensor.reduce_hypot for floating-point arrays by at least 2x gh-1932, gh-1937
  • Updated Cython examples to use scikit-build gh-1935
  • Reduced binary size of _tensor_accumulation_impl by 13 MB gh-1957
  • Extended tensor.asarray to interpret objects that implement the __usm_ndarray__ property as usm_ndarray objects gh-1959
  • tensor.usm_ndarray objects now disallow implicit conversion to NumPy arrays gh-1964
  • tensor.usm_ndarray methods now raise an error if the stream argument is not a tensor.SyclQueue gh-1969
  • dpctl initialization sets subprocess to use the SPAWN method on Linux, enabling gdb-oneapi to debug kernels submitted from Python applications gh-1971
  • Reduced binary size of _tensor_elementwise_impl gh-1976
  • dpctl.SyclQueue.memcpy now supports copying to and from multi-dimensional buffers gh-1985
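
dpctl's radix sort kernels are not reproduced here. To illustrate the technique, the following is a pure-Python least-significant-digit (LSD) radix argsort for signed 32-bit integers: each pass is a stable counting sort on one 8-bit digit, and flipping the sign bit maps signed order onto unsigned order. The function name and digit width are illustrative choices, not dpctl parameters:

```python
from typing import List, Sequence


def radix_argsort_int32(keys: Sequence[int], bits_per_pass: int = 8) -> List[int]:
    """Stable LSD radix argsort for integers in the signed 32-bit range (illustrative)."""
    radix = 1 << bits_per_pass
    mask = radix - 1
    # Flipping the sign bit makes unsigned order of the transformed keys match signed order.
    ukeys = [(k & 0xFFFFFFFF) ^ 0x80000000 for k in keys]
    order = list(range(len(keys)))
    for shift in range(0, 32, bits_per_pass):
        # Histogram of the current digit.
        counts = [0] * radix
        for i in order:
            counts[(ukeys[i] >> shift) & mask] += 1
        # Exclusive prefix sum gives each digit bucket's starting offset.
        offsets, total = [0] * radix, 0
        for d in range(radix):
            offsets[d], total = total, total + counts[d]
        # Scatter in the current order, which keeps each pass (and the whole sort) stable.
        new_order = [0] * len(order)
        for i in order:
            d = (ukeys[i] >> shift) & mask
            new_order[offsets[d]] = i
            offsets[d] += 1
        order = new_order
    return order
```

Because every pass is stable, equal keys keep their input order: `radix_argsort_int32([3, -1, 2, -1, 0])` returns `[1, 3, 4, 2, 0]`.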

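A producer library opts into the tensor.asarray behavior by exposing a `__usm_ndarray__` property that returns its underlying array. The sketch below runs without dpctl, so `FakeUsmNdarray` stands in for `dpctl.tensor.usm_ndarray`; `WrappedArray` and `unwrap_usm_ndarray` are hypothetical, and the unwrapping check only approximates what the release note describes:

```python
class FakeUsmNdarray:
    """Stand-in for dpctl.tensor.usm_ndarray so this sketch runs without dpctl."""

    def __init__(self, data):
        self.data = list(data)


class WrappedArray:
    """Hypothetical container library type that owns a usm_ndarray-like object."""

    def __init__(self, data):
        self._arr = FakeUsmNdarray(data)

    @property
    def __usm_ndarray__(self):
        # Expose the underlying array so consumers can use it directly.
        return self._arr


def unwrap_usm_ndarray(obj):
    """Hypothetical consumer-side check: accept the array itself or anything exposing it."""
    if isinstance(obj, FakeUsmNdarray):
        return obj
    arr = getattr(obj, "__usm_ndarray__", None)
    if isinstance(arr, FakeUsmNdarray):
        return arr
    raise TypeError("object does not provide a usm_ndarray")
```

Here `unwrap_usm_ndarray(WrappedArray([1, 2, 3]))` returns the wrapped `FakeUsmNdarray` itself, without copying its data.
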
Fixed

  • Fixed a bug in tensor.roll for very large values of shift gh-1869
  • Fixed tensor.result_type when all inputs are Python built-in scalars gh-1877
  • Improved the error raised by constructors tensor.full and tensor.full_like when given a non-numeric fill value gh-1878
  • Added a check for pointer alignment when copying to C-contiguous memory gh-1890, gh-1891
  • Fixed dpctl installed into a virtual environment not finding DPC++ runtime libraries, by adding the DPCTL_WITH_REDIST cmake option (OFF by default) gh-1893
  • Fixed incorrect results (issue gh-1901) in tensor.cumulative_sum and in advanced indexing gh-1902
  • Fixed __setitem__() for tensor.usm_ndarray when passed an empty boolean mask gh-1915
  • tensor.from_dlpack docstring now shows that return type can be NumPy array and stipulates when this will be the case gh-1919
  • Fixed docstring in helper class in DLPack tests gh-1920
  • Fixed a bug in tensor.astype where copy=False would not be respected for 1d arrays when the order keyword is specified gh-1928
  • Replaced deprecated CL/sycl.hpp with recommended sycl/sycl.hpp in examples gh-1933
  • Fixed tensor.take_along_axis and tensor.put_along_axis raising an error for tensor.uint64 indices when given an array of dimension greater than 1 gh-1934
  • Fixed unexpected results of tensor.sum with a requested output type of bool gh-1958
  • Used std::move to avoid unnecessary copying of a temporary in triul_ctor.cpp gh-1960
  • Made stream a keyword-only argument in tensor.usm_ndarray.to_device, as required by the array API specification gh-1966
  • Improved efficiency of the copy implementation and avoided an unnecessary kernel invocation in tensor.argsort for 1d input gh-1967
  • Corrected uses of NumPy constructors with tensor.usm_ndarray inputs in test suite gh-1968
  • Fixed array API namespace inspection utilities showing complex128 as a valid dtype on devices without double-precision support, and their device keyword not working with dpctl.SyclQueue objects or filter strings gh-1979
  • Fixed a bug in test_sycl_device_interface.cpp which would cause compilation to fail with Clang version 20.0 gh-1989
  • Fixed memory leaks in smart-pointer-managed USM temporaries in synchronizing kernel calls gh-2002
  • UsmNDArray_MakeSimpleFromPtr and UsmNDArray_MakeFromPtr now raise an error when provided an invalid typenum before attempting to create the array gh-2003
  • Fixed typos in tensor.from_numpy and tensor.astype gh-2006
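
Rolling is periodic in the axis length, so a shift of any size can be reduced modulo that length before any index arithmetic. A pure-Python 1-D reference sketch of those semantics follows; `roll_1d` is hypothetical and does not reproduce the dpctl fix itself:

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def roll_1d(x: Sequence[T], shift: int) -> List[T]:
    """Reference 1-D roll: the element at position i moves to position (i + shift) % n."""
    n = len(x)
    if n == 0:
        return []
    s = shift % n  # reduce first, so arbitrarily large or negative shifts are well-defined
    return list(x[-s:]) + list(x[:-s]) if s else list(x)
```

For example, `roll_1d([1, 2, 3, 4, 5], 2)` and a shift of `2 + 5 * 10**20` both give `[4, 5, 1, 2, 3]`.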

Maintenance

New Contributors