Releases: aws/aws-sdk-pandas
3.0.0b2
What's Changed
- (feat) Update to Ray 2.0 by @kukushking in #1635
- (feat) Ray logging by @malachi-constant in #1623
- (enhancement): Reduce LOC in S3 write methods create_table by @jaidisido in #1626
- (docs) Tutorial: Run SDK for pandas job on ray cluster by @malachi-constant in #1616
Full Changelog: 3.0.0b1...3.0.0b2
3.0.0b1
What's Changed
- (test) Consolidate unit and load tests by @jaidisido in #1525
- (feat) Distribute S3 read text by @LeonLuttenberger in #1567
- (feat) Distribute s3 wait_objects by @LeonLuttenberger in #1539
- (test) Ray Load Tests CDK Stack and Instructions for Load Testing by @malachi-constant in #1583
- (fix) Fix S3 read text not working with version ID by @LeonLuttenberger in #1587
- (feat) Add distributed s3 write parquet by @kukushking in #1526
- (fix) Distribute write text regression, change to singledispatch, add repartitioning utility by @kukushking in #1611
- (enhancement) Optimise distributed s3.read_text to load data in chunks by @LeonLuttenberger in #1607
Full Changelog: 3.0.0a2...3.0.0b1
AWS SDK for pandas 2.17.0
New Functionalities
- RedshiftDataAPI serverless support 🔥 #1530
- Check out the tutorial
- Add `get_query_results` to the Athena module #1496
- Check out the function documentation
- Add `generate_create_query` to the Athena module #1514
- Check out the function documentation
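A minimal usage sketch of the new functionalities above; table, database, and workgroup names are hypothetical, and the `workgroup_name` parameter for the serverless connection is an assumption based on the linked tutorial:

```python
import awswrangler as wr

# Start a query and fetch its results later by execution ID (#1496)
res = wr.athena.start_query_execution(
    sql="SELECT * FROM my_table LIMIT 10",  # hypothetical table
    database="my_database",                 # hypothetical database
    wait=True,                              # returns the full query-execution response
)
df = wr.athena.get_query_results(query_execution_id=res["QueryExecutionId"])

# Reconstruct the CREATE TABLE/VIEW statement behind an existing Glue table (#1514)
create_sql = wr.athena.generate_create_query(table="my_table", database="my_database")

# Connect to a Redshift Serverless workgroup through the Data API (#1530)
con = wr.data_api.redshift.connect(workgroup_name="my-workgroup", database="dev")
df2 = wr.data_api.redshift.read_sql_query("SELECT 1", con=con)
```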
Enhancements
- Returning empty DataFrame for empty TimeStream query #1430
- Added support for `INSERT IGNORE` for `mysql.to_sql` #1429
- Added `use_column_names` to `redshift.copy`, akin to `redshift.to_sql` #1437
- Enable passing kwargs to `redshift.connect` #1467
- Add `timestream_endpoint_url` property to the config #1483
- Add support for upserting to an empty Glue table #1579
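A sketch combining several of the enhancements above; the Glue connection name, bucket, table, and endpoint values are hypothetical, and the extra connect kwarg assumes it is forwarded to `redshift_connector.connect()` as described in #1467:

```python
import awswrangler as wr
import pandas as pd

# Point the Timestream client at a custom endpoint via the global config (#1483)
wr.config.timestream_endpoint_url = "https://ingest-cell1.timestream.us-east-1.amazonaws.com"

df = pd.DataFrame({"id": [1, 2], "name": ["foo", "bar"]})

# Extra kwargs are now passed through to the underlying redshift_connector (#1467);
# tcp_keepalive assumes redshift_connector accepts it
con = wr.redshift.connect(connection="my-glue-connection", tcp_keepalive=True)

# use_column_names lists the DataFrame columns explicitly in the COPY statement,
# so a partial column set can be loaded, as redshift.to_sql already allowed (#1437)
wr.redshift.copy(
    df=df,
    path="s3://my-bucket/stage/",  # hypothetical staging prefix
    con=con,
    table="my_table",
    schema="public",
    use_column_names=True,
)
con.close()
```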
Documentation
- Fix typos in documentation #1434
Bug Fix
- `validate_schema=True` for `wr.s3.read_parquet` breaks with partition columns and `dataset=True` #1426
- `wr.neptune.to_property_graph` failing for Neptune version 1.1.1.0 #1407
- ValueError when using `opensearch.index_df` with documents with an array field #1444
- Missing `catalog_id` in `wr.catalog.create_database` #1480
- Check for pair of brackets in query preparation for Athena cache #1529
- Fix wrong type hint for `TagColumnOperation` in `quicksight.create_athena_dataset` #1570
- `s3.to_json` compression parameter is passed twice when `dataset=True` #1585
- Cast Athena array, map & struct types to pandas object #1581
- In the OpenSearch module, use SSL only for HTTPS (port 443) #1603
Noteworthy
AWS Lambda Managed Layers
Since the last release, the library has been accepted as an official SDK for AWS and rebranded as AWS SDK for pandas 🚀. The module names in Python remain the same. One noteworthy change, however, is that the AWS Lambda Managed Layer name has been renamed from `AWSDataWrangler` to `AWSSDKPandas`.
You can view the ARN values for the layers here.
PyArrow 7 Support
⚠️ For platforms without PyArrow 7 support (e.g. MWAA, EMR, Glue PySpark Job):
➡️pip install pyarrow==2 awswrangler
Thanks
We thank the following contributors/users for their work on this release:
@bechbd, @maxispeicher, @timgates42, @aeeladawy, @KhueNgocDang, @szemek, @malachi-constant, @cnfait, @jaidisido, @LeonLuttenberger, @kukushking
3.0.0a2
This is a pre-release for the Wrangler@Scale project
What's Changed
- (feat): Add directory for Distributed Wrangler Load Tests by @malachi-constant in #1464
- (CI): Distribute tests in tox config by @malachi-constant in #1469
- (feat): Distribute s3 delete objects by @malachi-constant in #1474
- (CI): Enable new CI pipeline for standard & distributed tests by @malachi-constant in #1481
- (feat): Refactor to distribute s3.read_parquet by @jaidisido in #1513
- (bug): s3 delete tests failing in distributed codebase by @malachi-constant in #1517
Full Changelog: 3.0.0a1...3.0.0a2
3.0.0a1
This is a pre-release for the Wrangler@Scale project
What's Changed
- (feat): Add distributed config flag and initialise method by @jaidisido in #1389
- (feat): Add distributed Lake Formation read by @jaidisido in #1397
- (feat): Distribute S3 select over multiple paths and scan ranges by @jaidisido in #1445
- (refactor): Refactor threading/ray; add single-path distributed s3 select impl by @kukushking in #1446
Full Changelog: 2.16.1...3.0.0a1
2.16.1
Noteworthy
🐛 Fixed issue introduced by `2.16.0` to method `s3.read_parquet()`
Patch
- Fix bug: `pq_file.schema.names()`: TypeError: 'list' object is not callable in `s3.read_parquet()` #1412
P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload them and run, or use them from our S3 public bucket!
Full Changelog: 2.16.0...2.16.1
AWS Data Wrangler 2.16.0
Noteworthy
⚠️ For platforms without PyArrow 7 support (e.g. MWAA, EMR, Glue PySpark Job):
➡️pip install pyarrow==2 awswrangler
New Functionalities
Enhancements
- add test infrastructure for oracle database #1274
- revisiting S3 Select performance #1287
- migrate test infra from cdk v1 to cdk v2 #1288
- to_sql() make column names quoted identifiers to allow sql keywords #1392
- throw NoFilesFound exception on 404 #1290
- fast executemany #1299
- add precombine key to upsert method for Redshift #1304
- pass precombine to redshift.copy() #1319
- use DataFrame column names in INSERT statement for UPSERT operation #1317
- add data_source param to athena.repair_table #1324
- modify athena2quicksight datatypes to allow startswith for varchar #1332
- add TagColumnOperation to quicksight.create_athena_dataset #1342
- enable list timestream databases and tables #1345
- enable s3.to_parquet to receive "zstd" compression type #1369
- create a way to perform PartiQL queries to a Dynamo DB table #1390
- s3 proxy support with data wrangler #1361
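For instance, the new DynamoDB PartiQL support (#1390) and the zstd compression option (#1369) from the list above can be exercised as below; the table, key, and bucket names are hypothetical:

```python
import awswrangler as wr

# Query a DynamoDB table with PartiQL and get a DataFrame back (#1390)
df = wr.dynamodb.read_partiql_query(
    query="SELECT * FROM my_table WHERE my_key = ?",
    parameters=["some-value"],
)

# Write Parquet with the newly supported zstd compression type (#1369)
wr.s3.to_parquet(df=df, path="s3://my-bucket/data/", dataset=True, compression="zstd")
```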
Documentation
- be more explicit about awswrangler.s3.to_parquet overwrite behavior #1300
- fix Python Version in Readme #1302
Bug Fix
- set encoding to utf-8 when no encoding is specified when reading/writing to s3 #1257
- fix Redshift Locking Behavior #1305
- specify cfn deletion policy for sqlserver and oracle instances #1378
- to_sql() make column names quoted identifiers to allow sql keywords #1392
- fix extension dtype index handling #1333
- fix issue with redshift.to_sql() method when mode set to "upsert" and schema contains a hyphen #1360
- timestream - array cols to str #1368
- read_parquet Does Not Throw Error for Missing Column #1370
Thanks
We thank the following contributors/users for their work on this release:
@bnimam, @IldarAlmakaev, @syokoysn, @thomasniebler, @maxdavidson91, @takeknock, @Sleekbobby1011, @snikolakis, @willsmith28, @malachi-constant, @cnfait, @jaidisido, @kukushking
P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload them and run, or use them from our S3 public bucket!
AWS Data Wrangler 2.15.1
Noteworthy
⚠️ Dropped Python 3.6 support
⚠️ For platforms without PyArrow 7 support (e.g. MWAA, EMR, Glue PySpark Job):
➡️pip install pyarrow==2 awswrangler
Patch
- Add `sparql` extra & make `SPARQLWrapper` dependency optional #1252
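With `SPARQLWrapper` now optional, SPARQL users can pull it back in through the new extra (a sketch of the usual pip extras syntax; quoting may be needed depending on the shell):
➡️pip install awswrangler[sparql]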
P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload them and run, or use them from our S3 public bucket!
AWS Data Wrangler 2.15.0
Noteworthy
⚠️ Dropped Python 3.6 support
⚠️ For platforms without PyArrow 7 support (e.g. MWAA, EMR, Glue PySpark Job):
➡️pip install pyarrow==2 awswrangler
New Functionalities
- Amazon Neptune module 🚀 #1084 Check out the tutorial. Thanks to @bechbd & @sakti-mishra!
- ARM64 Support for Python 3.8 and 3.9 layers 🔥 #1129 Many thanks @cnfait!
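A minimal sketch of the new Neptune module, following the linked tutorial; the cluster endpoint is hypothetical:

```python
import awswrangler as wr

# Connect to a Neptune cluster and run an openCypher traversal (#1084)
client = wr.neptune.connect(
    "my-neptune-cluster.cluster-abc123.us-east-1.neptune.amazonaws.com",  # hypothetical endpoint
    8182,
    iam_enabled=False,
)
df = wr.neptune.execute_opencypher(client, "MATCH (n) RETURN n LIMIT 5")
```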
Enhancements
- Timestream module - support multi-measure records #1214
- Warnings for implicit float conversion of nulls in to_parquet #1221
- Support additional sql params in Redshift COPY operation #1210
- Add create_ctas_table to Athena module #1207
- S3 Proxy support #1206
- Add Athena get_named_query_statement #1183
- Add manifest parameter to 'redshift.copy_from_files' method #1164
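A short sketch of `create_ctas_table` from the list above; the table and database names are hypothetical, and the returned dictionary keys are assumed to follow the function documentation:

```python
import awswrangler as wr

# Materialize a query into a new table via CTAS (#1207)
ctas = wr.athena.create_ctas_table(
    sql="SELECT * FROM my_table WHERE year = 2021",  # hypothetical source table
    database="my_database",
    ctas_table="my_table_2021",
    wait=True,
)
print(ctas["ctas_table"])
```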
Documentation
Bug Fix
- Give precedence to user path for Athena UNLOAD S3 Output Location #1216
- Honor User specified workgroup in athena.read_sql_query with unload_approach=True #1178
- Support map type in Redshift copy #1185
- data_api.rds.read_sql_query() does not preserve data type when column is all NULLS - switches to Boolean #1158
- Allow decimal values within struct when writing to parquet #1179
Thanks
We thank the following contributors/users for their work on this release:
@bechbd, @sakti-mishra, @mateogianolio, @jasadams, @malachi-constant, @cnfait, @jaidisido, @kukushking
P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload them and run, or use them from our S3 public bucket!
AWS Data Wrangler 2.14.0
Caveats
⚠️ For platforms without PyArrow 6 support (e.g. MWAA, EMR, Glue PySpark Job):
➡️pip install pyarrow==2 awswrangler
New Functionalities
- Support Athena Unload 🚀 #1038
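A minimal sketch of the UNLOAD-based read path; the database, table, and bucket names are hypothetical:

```python
import awswrangler as wr

# UNLOAD writes query results to S3 as Parquet and reads them back,
# avoiding the CTAS temporary table (#1038)
df = wr.athena.read_sql_query(
    sql="SELECT * FROM my_table",
    database="my_database",
    ctas_approach=False,
    unload_approach=True,
    s3_output="s3://my-bucket/unload/",  # staging location for UNLOAD
)
```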
Enhancements
- Add the `ExcludeColumnSchema=True` argument to the `glue.get_partitions` call to reduce response size #1094
- Add PyArrow flavor argument to `write_parquet` via `pyarrow_additional_kwargs` #1057
- Add `rename_duplicate_columns` and `handle_duplicate_columns` flag to `sanitize_dataframe_columns_names` method #1124
- Add `timestamp_as_object` argument to all database `read_sql_table` methods #1130
- Add `ignore_null` to `read_parquet_metadata` method #1125
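Two of the enhancements above in a combined sketch; the Glue connection, schema, and table names are hypothetical, and the "rename" option is an assumption based on the flag described in #1124:

```python
import awswrangler as wr
import pandas as pd

# Rename duplicated column names instead of raising (#1124)
df = pd.DataFrame([[1, 2]], columns=["a", "a"])
df = wr.catalog.sanitize_dataframe_columns_names(df, handle_duplicate_columns="rename")

# Keep timestamps as plain Python objects to avoid pandas out-of-range errors (#1130)
con = wr.mysql.connect("my-glue-connection")  # hypothetical Glue connection name
df2 = wr.mysql.read_sql_table(table="my_table", schema="my_db", con=con, timestamp_as_object=True)
con.close()
```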
Documentation
- Improve documentation on installing SAR Lambda layers with the CDK #1097
- Fix broken link to tutorial in `to_parquet` method #1058
Bug Fix
- Ensure that partition locations retrieved from AWS Glue always end in a "/" #1094
- Fix bucketing overflow issue in Athena #1086
Thanks
We thank the following contributors/users for their work on this release:
@dennyau, @kailukowiak, @lucasmo, @moykeen, @RigoIce, @vlieven, @kepler, @mdavis-xyz, @ConstantinoSchillebeeckx, @kukushking, @jaidisido
P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload them and run, or use them from our S3 public bucket!