Releases: aws/aws-sdk-pandas
AWS SDK for pandas 3.1.0
Features/Enhancements 🚀
- Add neptune.bulk_load for bulk loading data into Neptune by @LeonLuttenberger in #2238 #2267
- Add s3.to_deltalake function by @LeonLuttenberger in #2228
- Add Timestream Batch Load support by @jaidisido in #2214
- Add Iceberg insert by @kukushking in #2233
- Support upsert mode for OracleDB by @LeonLuttenberger in #2265
- Add chunked parameter to DynamoDB read functions by @LeonLuttenberger in #2227
- Upgrade Modin to 0.20.1 & allow Ray 2.4 by @kukushking in #2234
- Support Glue Connection SSM credential type by @kukushking in #2232
- Add ability to pass schema to S3 Select by @kukushking in #2237
- Add dynamic classification EMR config by @LLejoly in #2250
- Add support for server-side cursors in PostgreSQL module by @kukushking in #2262
- Add time unit to Timestream write API by @jaidisido in #2263
Fixes 🛠️
- Set ignore_metadata to False by default by @jaidisido in #2206
- Fix conflicting types for path_ignore_suffix by @LeonLuttenberger in #2240
- Athena workgroup query engine v3 upgrade artifacts by @kukushking in #2243
- Fixing test_spectrum_decimal_cast test by @LeonLuttenberger in #2244
- emr.create_cluster was not passing security configuration to internal method by @malachi-constant in #2246
- Fix pagination in timestream.list_tables by @SukruHan #2275
Documentation 📚
- Include our ADRs in GitHub by @LeonLuttenberger in #2215 #2259
- Fixes in the Athena Cache tutorial by @patrick-muller in #2201
- Write ADR for the switching between PyArrow and Pandas I/O functions by @LeonLuttenberger in #2245
- Fix "about" URL in README by @CGarces in #2207
- Update layers.rst with Python 3.10 layers by @LeonLuttenberger in #2219
- Fix links to 'Who uses library' section by @LeonLuttenberger in #2241
- Declutter function overloads by extracting overloads to pyi files by @LeonLuttenberger in #2229 #2255 #2256
Full Changelog: 3.0.0...3.1.0
AWS SDK for pandas 3.0.0
Breaking changes 💥
- Move dependencies to optional by @jaidisido in #1992 🔓
  - Dependencies required by the following modules have been moved to optional: redshift, mysql, postgres, sqlserver, oracle, gremlin, sparql, deltalake
  - The required dependencies can be easily installed with pip install awswrangler[<MODULE_NAME>], for example pip install awswrangler[redshift]
- Change SQL formatters for Athena and LakeFormation so that they properly format types by @Taragolis and @LeonLuttenberger in #1416 #1543 #1684 💾
  - For example, a parameter of type dt.datetime is parsed into DATETIME xxxx-xx-xx xx:xx:xx, while a parameter of type str is formatted into "x"
- Refactor function signatures so that closely related parameters are grouped into a single parameter defined as a TypedDict by @LeonLuttenberger and @kukushking in #1855 #1996 #2016 #2055 #2081 💼
  - Glue catalog parameters are grouped together in to_parquet, to_csv and to_json
  - Athena UNLOAD and CTAS parameters are grouped together
- Deprecate wr.s3.merge_upsert_table by @kukushking in #2076 ⚠️
- Deprecate updated_name parameter in update_ruleset by @jaidisido in #2122 ⚠️
- Stop support for Python 3.7 ⚠️
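The signature refactoring listed in the breaking changes above (closely related parameters grouped into one dictionary-typed parameter) can be pictured with a minimal, standard-library-only sketch. The names CatalogSettings and describe_write are hypothetical and exist only for illustration. They are not the library's actual types or functions, and the stub returns a summary instead of writing anything.

```python
from typing import Dict, Optional, TypedDict


class CatalogSettings(TypedDict, total=False):
    """Hypothetical group of catalog-related options (illustrative names only)."""

    database: str
    table: str
    description: str


def describe_write(path: str, catalog: Optional[CatalogSettings] = None) -> Dict[str, str]:
    """Hypothetical writer stub: reports what it would do instead of touching S3."""
    settings: CatalogSettings = catalog or {}
    summary = {"path": path}
    # Only report a catalog registration when both keys were supplied.
    if "database" in settings and "table" in settings:
        summary["registered_as"] = f"{settings['database']}.{settings['table']}"
    return summary


# One grouped argument replaces several loosely related keyword arguments.
result = describe_write(
    "s3://example-bucket/data/",
    catalog={"database": "analytics", "table": "events"},
)
print(result)
```

Grouping the options this way keeps the function signature short while still letting a type checker validate the individual keys.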
New functionalities 🚀
AWS SDK for pandas can now run at scale 🚀💻🚀
Tutorials
- 034 - Distributing Calls Using Ray
- 035 - Distributing Calls on Ray Remote Cluster
- 036 - Distributing Calls with Glue Interactive Sessions on Ray
AWS Blogs
Features/Enhancements 🚀
- Thread-safety improvements by @kukushking in #2186
- Allow Python 3.11 by @kukushking in #2101 🐍
- Add use_threads parameter to dynamodb.read_items by @LeonLuttenberger in #2113 📈
- Distribute wr.dynamodb.put_df with executor task by @LeonLuttenberger in #2118 📈
- Add additional arg for glue database DatabaseInput by @malachi-constant in #2067 🔧
- Add overloads for functions which can have multiple return value types by @LeonLuttenberger #1855
- Add support for boto3 kwargs to timestream.create_table by @cnfait in #1819
- Upgrade Ray to 2.2.x and PyArrow to 7+ by @LeonLuttenberger in #1865
- Upgrade to Ray 2.0 by @kukushking in #1635
- Add partitioning on block level by @kukushking in #1653
- Use fast file metadata provider by @kukushking in #1997
- Distribute DynamoDB Parallel Scan by @jaidisido in #1981
- Add faster Pyarrow S3fs listing in distributed mode by @jaidisido in #2030
- Add distributed variant of the _read_parquet_metadata_file function based on the PyArrow file system by @LeonLuttenberger in #2050
- Validate distributed kwargs by @kukushking in #2051
- Add @Experimental and @Deprecated annotations by @kukushking in #2062
- Distribute S3 describe_objects by @jaidisido in #2069
- Distributed S3 copy/merge by @kukushking in #2070
- Add bulk_read option for reading large amounts of Parquet files quickly by @LeonLuttenberger in #2033
- Deprecate boto3 resources by @kukushking in #2097
- Add retries for s3 select by @kukushking in #1780
- Make tqdm progress reporting opt-in by @kukushking in #1741
- Distribute data types inference by @jaidisido in #1692
- Change to singledispatch, add repartitioning utility, fix distributed write text regression by @kukushking in #1611
- Optimize distributed CSV I/O by adding PyArrow-based datasource by @LeonLuttenberger in #1699
- Configure scheduling options, remove dependencies on internal ray impl by @kukushking in #1734
- Validate partitions along row axis, add warning by @kukushking in #1700
- Refactor executor module by @kukushking in #2120
- Distribute parquet datasource and add missing features, enable all tests by @kukushking in #1711
- Distribute Timestream write with executor by @jaidisido in #1715
- Distribute s3.to_json and s3.to_csv by @LeonLuttenberger in #1631
- Distribute s3.read_csv, s3.read_json and s3.read_fwf by @LeonLuttenberger in #1567 #1607
- Distribute s3.wait_objects by @LeonLuttenberger in #1539
- Distribute s3.to_parquet by @kukushking in #1526
- Distribute s3.delete_objects by @malachi-constant in #1474
- Distribute s3.read_parquet by @jaidisido in #1513
- Add ThreadPoolExecutor and RayExecutor; refactor threading/ray; add single-path distributed s3.select_query by @kukushking in #1446
- Add distributed Lake Formation read by @jaidisido in #1397
- Refactor ray datasources by @kukushking in #1687
- Distribute S3 select over multiple paths and scan ranges by @jaidisido in #1445
- Add Literal typing for mode and projection_types by @LeonLuttenberger in #2191
Fixes 🛠️
- Sanitize bucketing col names by @kukushking in #2155
- Allow writing files from an empty dataframe by @malachi-constant in #2045
- Athena out of bound dates by @kukushking in #2180
- Fix partition block overwriting by @kukushking in #1695
- Distrib S3 Select - check row count before creating the Ray dataset by @kukushking in #1808
- Allow to pass pandas dfs to Ray/Modin calls by @kukushking in #1812
- Add retries to read_parquet_metadata_distributed by @jaidisido in #2196
- Fix default utcnow argument in start_query by @LeonLuttenberger in #2193
Documentation 📚
- Athena Iceberg tutorial by @kukushking in #2117
- Add at scale section by @kukushking in #2119
- Documentation spell-checking improvements by @LeonLuttenberger in #2165
- Add AWS Glue on Ray docs by @jaidisido in #1810
- Update config tutorial to include new configuration values by @LeonLuttenberger in #1696
- Improve documentation on running SDK for pandas at scale by @jaidisido in #1697
- Add "Introduction to Ray" Tutorials by @LeonLuttenberger in #1661
- Add SDK for pandas job on ray cluster tutorial by @malachi-constant in #1616
- Add typeddicts to docs by @LeonLuttenberger in #2167
Tests 🧪
- Add PR linter Github action by @jaidisido in #2106
- Replace load tests bucket with SSM parameter by @jaidisido in #2121
- opensearch index cleanup / skip by @kukushking in #2149
- Add benchmark tests by @jaidisido in #2143
- Add tests for Glue Ray jobs by @LeonLuttenberger in #1832
- Remove awswrangler.distributed from coverage report by @LeonLuttenberger in #1884
- Consolidate unit and load tests by @jaidisido in #1525
- Distribute tests in tox config by @malachi-constant in #1469
Full Changelog: 2.20.1...3.0.0
AWS SDK for pandas 2.20.1
What's Changed
- (fix) Timestream - ignore None, NaN, and NaT measure values by @kukushking in #2072
- (docs) Minor - update opensearch api docs by @kukushking in #2085
- Correct documentation for chunksize=True by @LeonLuttenberger in #2087
- fix: timestream empty batches by @kukushking in #2098
- enhancement: Add timestream common attributes by @jaidisido in #2091
- deprecate: boto3 resources by @kukushking in #2097
- tests: Add PR linter Github action by @jaidisido in #2106
- fix: Schema evolution for to_csv and to_json by @LeonLuttenberger in #2104
- [skip ci] pip(deps): Bump deltalake from 0.7.0 to 0.8.0 by @dependabot in #2110
- tutorials: Athena Iceberg by @kukushking in #2117
- deprecate updated_name param in update_ruleset by @jaidisido in #2122
- fix: Config not loading environment variables for config by @LeonLuttenberger in #2136
Full Changelog: 2.20.0...2.20.1
AWS SDK for pandas 3.0.0rc3
What's Changed
Breaking changes:
- breaking change: Move dependencies to optional by @jaidisido in #1992
- breaking change: Use ExecuteStatement instead of Scan for DynamoDB read_partiql by @jaidisido in #1964
Features/Enhancements:
- enhancement: Refactor engine switching when Ray is installed by @LeonLuttenberger in #1792
- logging: Enable user to configure RayLogger by @jaidisido in #1801
- enhancement: Add support for boto3 kwargs to timestream.create_table by @cnfait in #1819
- enhancement: Upgrade Ray to 2.2.x and PyArrow to 7+ by @LeonLuttenberger in #1865
- enhancement: Unload ray default max file size by @kukushking in #1912
- enhancement: Remove session serialization/deserialization by @kukushking in #1957
- enhancement: Unify return values for write json by @LeonLuttenberger in #1960
- feature: Log data sizes in load test benchmarks by @LeonLuttenberger in #1949
- enhancement: Add write_table_args by @kukushking in #1978
- feature: Distribute DynamoDB Parallel Scan by @jaidisido in #1981
- enhancement: Use fast file metadata provider by @kukushking in #1997
- enhancement: Add names parameter support to PyArrow reading by @LeonLuttenberger in #2008
- enhancement: Add support for JSON PyArrow data source by @LeonLuttenberger in #2019
- enhancement: Set ray.data parallelisation to -1 by default by @jaidisido in #2022
- enhancement: Add distributed variant of the _read_parquet_metadata_file function based on the PyArrow file system by @LeonLuttenberger in #2050
- feature: Add faster Pyarrow S3fs listing in distributed mode by @jaidisido in #2030
- feature: Validate distributed kwargs by @kukushking in #2051
- enhancement: Distribute S3 describe_objects by @jaidisido in #2069
- feature: Distributed S3 copy/merge by @kukushking in #2070
- enhancement: Add bulk_read option for reading large amounts of Parquet files quickly by @LeonLuttenberger in #2033
- enhancement: Upgrade ray to 2.3 by @jaidisido in #2084
- enhancement: Extract parallelism and bulk_read into ray_modin_args by @LeonLuttenberger in #2081
- deprecate: boto3 resources by @kukushking in #2097
Fixes:
- fix: Check row count before creating the Ray dataset in S3 Select by @kukushking in #1808
- fix: Allow to pass pandas dfs to Ray/Modin calls by @kukushking in #1812
- fix: Fix empty arrow refs by @kukushking in #1816
- fix: Sanitize column names modifying the data frame in distributed mode by @LeonLuttenberger in #1926
Documentation:
- docs: Add AWS Glue on Ray docs by @jaidisido in #1810
- docs: Clarify datasource.on_write_complete docs by @kukushking in #2100
Tests:
- tests: Add tests for Glue Ray jobs by @LeonLuttenberger in #1832
- tests: Remove awswrangler.distributed from coverage report by @LeonLuttenberger in #1884
- tests: Create Load Testing Benchmark Analytics by @malachi-constant in #1905
- tests: Adjust load test benchmark values by @malachi-constant in #1910
- tests: Remove exports from glueray stack by @malachi-constant in #2020
- tests: Add test_modin_s3_read_parquet_many_files by @LeonLuttenberger in #2096
Full Changelog: 3.0.0rc2...3.0.0rc3
AWS SDK for pandas 2.20.0
Breaking changes
- dynamodb.read_partiql no longer performs a Scan operation under the hood. Instead, the ExecuteStatement API is used. This means that the PartiQL* IAM permission is required instead of Scan
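As a rough illustration of the permission change described above, the sketch below builds an IAM policy document that allows a PartiQL read on a single table. The table ARN, account ID and region are hypothetical placeholders. Choosing dynamodb:PartiQLSelect as the only action is an assumption about what a read-only workload needs, so it should be checked against the IAM documentation for DynamoDB before use.

```python
import json

# Hypothetical table ARN, used only for illustration.
TABLE_ARN = "arn:aws:dynamodb:us-east-1:123456789012:table/example-table"

# Assumption: a read-only PartiQL workload needs the PartiQLSelect action
# where it previously relied on the Scan action.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["dynamodb:PartiQLSelect"],
            "Resource": [TABLE_ARN],
        }
    ],
}

print(json.dumps(policy, indent=2))
```

Roles that previously granted only dynamodb:Scan for this call would need a statement along these lines added before upgrading.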
Noteworthy
- (feat): opensearch serverless by @kukushking in #1922. See the tutorial 🔥
- (breaking change): Use ExecuteStatement instead of Scan for DynamoDB read_partiql by @jaidisido in #1964
- (enhancement) Remove session serialization/deserialization by @kukushking in #1957
What's Changed
- (enhancement): Allow override ParquetWriter args by @kukushking in #1941
- (enhancement): Add EMR configurations arg by @kukushking in #1939
- (feat): Add index_name to DynamoDB read_items by @jaidisido in #1961
- (fix): Set Content Type in lowercase by @jaidisido in #1976
- (enhancement): Add write_table_args by @kukushking in #1978
- (enhancement): Extend arrow to include python build modules by @nkarpov in #1977
- (fix): Add uuid to athena2pyarrow mapping by @jaidisido in #1995
- (fix): Add missing TIME type to pyarrow2redshift conversion method by @jaidisido in #2040
- (enhancement): Add configurable query polling delay parameters by @LeonLuttenberger in #2056
- (enhancement): Add @experimental and @deprecated annotations by @kukushking in #2062
- (enhancement): Update EMR release version for tests and default by @malachi-constant in #2065
- (enhancement): Add loaded and default parameters to config args by @LeonLuttenberger in #2075
Documentation
- (docs): fix contributing guide by @jaidisido in #2054
- (docs): Document return value of timestream.write by @mdavis-xyz in #2025
Tests
- (tests): Add Glue DQ role name by @kukushking in #1936
- (tests): Fix mock call args error on py37 by @kukushking in #1937
- (tests): Fix any unnecessary xfail's in tests by @malachi-constant in #1930
- (tests): Move AOSS collection to infra by @kukushking in #1993
- (tests): Add missing LakeFormation permissions by @kukushking in #2001
- (test-infra): Replace cfn export with ssm parameters by @malachi-constant in #2009
- (tests): Fix SSE defaults by @kukushking in #2049
New Contributors
- @nkarpov made their first contribution in #1977
- @mdavis-xyz made their first contribution in #2025
Full Changelog: 2.19.0...2.20
AWS SDK for pandas 2.19.0
Noteworthy
- Glue Data Quality is now supported, check out the tutorial 🔥
- Delta lake support by @fvaleye
- New DynamoDB read_items method by @a-slice-of-py
Features & enhancements
- feat: add read_items to dynamodb module by @a-slice-of-py in #1877
- Add deltalake support in AWS S3 with Pandas by @fvaleye in #1834
- support for pagination for timestream.list_databases list_tables by @cnfait in #1846
- (feat) glue data quality by @kukushking in #1861
- Add unit test for evaluating two rulesets at once by @LeonLuttenberger in #1871
- (enhancement) Minor - wr.redshift.copy - pass through commit_transaction by @kukushking in #1878
- (enhancement): Extend get and update ruleset DQ methods by @jaidisido in #1882
- enhancement: Adding filter to quicksight delete_all methods by @malachi-constant in #1913
- enhancement: Support optional measure_name in wr.timestream.write() by @malachi-constant in #1925
Bug fixes
- (fix) Check if timezone is present in column metadata by @kukushking in #1840
- (fix) Include numpy==1.23.4 && poetry update by @kukushking in #1850
- Fix apply_configs decorator causing function signature to be lost by @LeonLuttenberger in #1858
- forward use_threads to _validate_schemas_from_files by @robert-schmidtke in #1869
- (fix) Minor - KeyError in wr.opensearch.search && cleanup tests by @kukushking in #1879
- (fix): missing timestamp data type in Timestream by @jaidisido in #1881
- Fix the Athena cache unit test errors by @LeonLuttenberger in #1883
- (fix): Handle None in databases data types by @jaidisido in #1892
Documentation
- Document the create_csv_table function's sensitivity to column order by @LeonLuttenberger in #1923
- (docs) Add extension for ipython console highlighting by @kukushking in #1841
- (feat) Minor - add sphinx copy button for code blocks by @kukushking in #1854
Tests
- Test infra: Add NAT gateway IP addresses to base stack SSM parameters by @LeonLuttenberger in #1847
- Testing: Update Opensearch test output and fixture by @malachi-constant in #1848
- (test-infra) Enable SSE, enforce HTTPS, enable node-to-node encryption by @kukushking in #1851
- (tests) add workaround to enable deltalake to use AWS profile creds by @kukushking in #1934
- Enable warn_unused_ignores for MyPy by @LeonLuttenberger in #1860
- Increase coverage for dynamodb write by @LeonLuttenberger in #1893
- Add tests for S3 wait functions by @LeonLuttenberger in #1896
- Increase coverage for s3.delete* by @LeonLuttenberger in #1897
- Increase S3 tests coverage by @jaidisido in #1909
- Add coverage report to tox by @LeonLuttenberger in #1874
- Add coverage section to pyproject by @jaidisido in #1911
- Deps: Update wheel 0.37.1 -> 0.38.1 by @malachi-constant in #1904
- Add minimum coverage by @LeonLuttenberger in #1927
- refactor: quicksight test resources as fixtures by @malachi-constant in #1928
New Contributors
- @fvaleye made their first contribution in #1834
- @robert-schmidtke made their first contribution in #1869
- @a-slice-of-py made their first contribution in #1877
Thanks
We thank the following contributors/users for their work on this release:
@jaidisido, @kukushking, @LeonLuttenberger, @cnfait, @malachi-constant, @mdavis-xyz, @dydc, @enricomarchesin
Full Changelog: 2.18.0...2.19.0
AWS SDK for pandas 2.18.0
Noteworthy
- Pyarrow 10 support 🔥 by @kukushking in #1731
- Lambda layers now available in af-south-1 (Cape Town) 🌍 by @malachi-constant
Features & enhancements
- Add unload_approach to athena.read_sql_table by @jaidisido in #1634
- Pass additional partition projection params to wr.s3.to_parquet & cat… by @kukushking in #1627
- Regenerate poetry.lock with no update by @cnfait in #1663
- Upgrading poetry installed in workflow by @cnfait in #1677
- Improve bucketing series generation by casting only the required columns by @kukushking in #1664
- Add get_query_executions generating DataFrames from Athena query executions detail by @KhueNgocDang in #1676
- Dependency: Set Pandas Version != 1.5.0 due to memory leak by @malachi-constant in #1688
- read_csv: read file as binary when encoding_errors is set to ignore by @cnfait in #1723
- Deps: Remove upper bound limit on 'python' version by @malachi-constant in #1720
- (enhancement) Redshift: Adding 'primary_keys' to parameter validation by @malachi-constant in #1728
- Add describe_log_streams and filter_log_events to the CloudWatch module by @KhueNgocDang in #1785
- Update lambda layers with pyarrow 10 by @kukushking in #1758
- Add ctas_write_compression argument to athena.read_sql_query by @LeonLuttenberger in #1795
- Add auto termination policy to EMR by @vikramsg in #1818
- timestream.query: add QueryId and NextToken to df attributes by @cnfait in #1821
- Add support for boto3 kwargs to timestream.create_table by @cnfait in #1819
- Adding args to submit spark step by @vikramsg in #1826
Bug fixes
- Fix athena.read_sql_query for empty table and chunk size not returning an empty frame generator by @LeonLuttenberger in #1685
- Fixing index column validation in s3.read_parquet() validate schema by @malachi-constant in #1735
- Bug: Replace extra_registries with extra_public_registries by @vikramsg in #1757
- Fix: map datatype issue of athena by @pal0064 in #1753
- Fix Redshift commands breaking with hyphenated table names by @LeonLuttenberger in #1762
- Add correct service names for timestream boto3 clients by @malachi-constant in #1716
- Allow read partitions with extra = in the value by @kukushking in #1779
Documentation
- Update install page in docs with screenshot of new managed layer name by @LeonLuttenberger in #1636
- Remove semicolon from python code eol in s3 tutorial by @cnfait in #1673
- Consistent kernel for jupyter notebooks by @cnfait in #1674
- Correct a few typos in our ipynb tutorials by @cnfait in #1694
- Fix broken links in readme by @lucasasmith in #1702
- Typos in comments and docs by @mycaule in #1761
Tests
- Support for test infrastructure in private subnets by @cnfait in #1698
- Upgrade engine versions to match defaults from aws console by @cnfait in #1709
- Set redshift and Neptune clusters removal policy to destroy by @cnfait in #1675
- Upgrade pytest-xdist by @LeonLuttenberger in #1760
- Fix timestream endpoint tests by @LeonLuttenberger in #1781
New Contributors
- @lucasasmith made their first contribution in #1702
- @vikramsg made their first contribution in #1757
- @mycaule made their first contribution in #1761
- @pal0064 made their first contribution in #1753
Thanks
We thank the following contributors/users for their work on this release:
@lucasasmith, @vikramsg, @mycaule, @pal0064, @LeonLuttenberger, @cnfait, @malachi-constant, @kukushking, @jaidisido
Full Changelog: 2.17.0...2.18.0
3.0.0rc2
What's Changed
- (enhancement): Enable missing unit tests and Redshift, Athena, LF load tests by @jaidisido in #1736
- (enhancement): configure scheduling options, remove dependencies on internal ray impl by @kukushking in #1734
- (testing): Enable Athena and Redshift tests, and address errors by @LeonLuttenberger in #1721
- (feat): Make tqdm progress reporting opt-in by @kukushking in #1741
Full Changelog: 3.0.0rc1...3.0.0rc2
3.0.0rc1
What's Changed
- (enhancement): Move RayLogger out of non-distributed modules by @jaidisido in #1686
- (perf): Distribute data types inference by @jaidisido in #1692
- (docs): Update config tutorial to include new configuration values by @LeonLuttenberger in #1696
- (fix): partition block overwriting by @kukushking in #1695
- (refactor): Optimize distributed CSV I/O by adding PyArrow-based datasource by @LeonLuttenberger in #1699
- (docs): Improve documentation on running SDK for pandas at scale by @jaidisido in #1697
- (enhancement): Apply modin repartitioning where required only by @jaidisido in #1701
- (enhancement): Remove local from ray.init call by @jaidisido in #1708
- (feat): Validate partitions along row axis, add warning by @kukushking in #1700
- (feat): Expand SQL formatter to LakeFormation by @LeonLuttenberger in #1684
- (feat): Distribute parquet datasource and add missing features, enable all tests by @kukushking in #1711
- (convention): Add Arrow prefix to parquet datasource for consistency by @jaidisido in #1724
- (perf): Distribute Timestream write with executor by @jaidisido in #1715
Full Changelog: 3.0.0b3...3.0.0rc1
3.0.0b3
What's Changed
- (feat): Add partitioning on block level by @kukushking in #1653
- (refactor): Make room for additional distributed engines by @jaidisido in #1646
- (feat): Distribute s3 write text by @LeonLuttenberger in #1631
- (docs): Add "Introduction to Ray" Tutorial by @LeonLuttenberger in #1661
- (fix): Return address config param by @kukushking in #1660
- (refactor): Enable new engines with custom dispatching and other constructs by @jaidisido in #1666
- (deps): Uptick modin to 0.16 by @jaidisido in #1659
Full Changelog: 3.0.0b2...3.0.0b3