fix(server/pypika): fix aliases generated for split step [TCTC-10029] [TCTC-10030] (#2339)

* chore: added test fixture

Signed-off-by: Luka Peschke <[email protected]>

* fix(server/pypika): fix aliases generated for split step [TCTC-10029]

Signed-off-by: Luka Peschke <[email protected]>

* test(server/pypika): added fixture for TCTC-10030 as well

Signed-off-by: Luka Peschke <[email protected]>

* docs: update changelog

Signed-off-by: Luka Peschke <[email protected]>

---------

Signed-off-by: Luka Peschke <[email protected]>
lukapeschke authored Feb 4, 2025
1 parent c228063 commit b1a611a
Showing 6 changed files with 119 additions and 29 deletions.
5 changes: 4 additions & 1 deletion server/CHANGELOG.md
@@ -1,8 +1,11 @@
# Changelog (weaverbird python package)


## Unreleased

### Fixed

- Pypika: The split step can now be followed by other steps for Google Big Query

## [0.50.0] - 2025-02-03

### Changed
@@ -142,6 +142,8 @@ def split(

         safe_offset = CustomFunction("SAFE_OFFSET", ["index"])

+        splitted_cols_aliases = []
+
         # Sub-optimal, could do that in two sub_queries, one for splitting to a temp array col, and
         # another one to select everything needed from the array col rather than splitting N times
         def gen_splitted_cols():
@@ -155,13 +157,13 @@ def gen_splitted_cols():
                 #
                 # The IfNull is required because other backends use SPLIT_PART, which will return an
                 # empty string rather than NULL
-                yield functions.IfNull(LiteralValue(f"{split_str}[{safe_offset_str}]"), "").as_(
-                    f"{step.column}_{i + 1}"
-                )
+                splitted_col_alias = f"{step.column}_{i + 1}"
+                splitted_cols_aliases.append(splitted_col_alias)
+                yield functions.IfNull(LiteralValue(f"{split_str}[{safe_offset_str}]"), "").as_(splitted_col_alias)

         splitted_cols = list(gen_splitted_cols())
         query: QueryBuilder = prev_step_table.select(*columns, *splitted_cols)
-        return StepContext(query, columns + splitted_cols)
+        return StepContext(query, columns + splitted_cols_aliases)

     @classmethod
     def _date_trunc(cls, date_part: str, target_column: Field) -> Term:
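
To illustrate why the returned context now carries the aliases rather than the aliased expressions, here is a minimal sketch in plain Python. `AliasedTerm` and `rename_columns` are hypothetical stand-ins invented for this example; they are not pypika or weaverbird APIs, and the actual failure in a given follow-up step may differ. The sketch only assumes that later steps look columns up by name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AliasedTerm:
    """Hypothetical stand-in for an aliased SQL expression; not a pypika class."""

    sql: str
    alias: str


def rename_columns(context_columns: list, to_rename: dict[str, str]) -> list:
    """Hypothetical follow-up step: rename columns by looking them up by name."""
    return [to_rename.get(col, col) for col in context_columns]


split_terms = [
    AliasedTerm("IFNULL(SPLIT(brewing_date, '-')[SAFE_OFFSET(0)], '')", "brewing_date_1"),
    AliasedTerm("IFNULL(SPLIT(brewing_date, '-')[SAFE_OFFSET(1)], '')", "brewing_date_2"),
]
to_rename = {"brewing_date_1": "renamed_column_1", "brewing_date_2": "renamed_column_2"}

# Before the fix: the context exposes expression objects, so lookups by name miss
# and the split columns come back unchanged.
before = rename_columns(["name"] + split_terms, to_rename)

# After the fix: the context exposes the aliases as plain strings, so they are found.
after = rename_columns(["name"] + [term.alias for term in split_terms], to_rename)

print(before[1:] == split_terms)
print(after)
```

Under these assumptions, only the second call produces the renamed columns, which is the situation the new fixtures exercise by chaining other steps after the split.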
@@ -0,0 +1,53 @@
+exclude:
+- mongo
+- pandas
+- snowflake
+step:
+  pipeline:
+  - name: convert
+    columns:
+    - brewing_date
+    dataType: text
+  - name: split
+    column: brewing_date
+    delimiter: '-'
+    numberColsToKeep: 2
+  - name: rename
+    toRename:
+    - - brewing_date_1
+      - renamed_column_1
+    - - brewing_date_2
+      - renamed_column_2
+  - name: select
+    columns:
+    - renamed_column_1
+    - renamed_column_2
+expected:
+  schema:
+    pandas_version: 1.5.0
+    fields:
+    - name: renamed_column_1
+      type: string
+    - name: renamed_column_2
+      type: string
+  data:
+  - renamed_column_1: '2022'
+    renamed_column_2: '01'
+  - renamed_column_1: '2022'
+    renamed_column_2: '01'
+  - renamed_column_1: '2022'
+    renamed_column_2: '01'
+  - renamed_column_1: '2022'
+    renamed_column_2: '01'
+  - renamed_column_1: '2022'
+    renamed_column_2: '01'
+  - renamed_column_1: '2022'
+    renamed_column_2: '01'
+  - renamed_column_1: '2022'
+    renamed_column_2: '01'
+  - renamed_column_1: '2022'
+    renamed_column_2: '01'
+  - renamed_column_1: '2022'
+    renamed_column_2: '01'
+  - renamed_column_1: '2022'
+    renamed_column_2: '01'
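
As a rough cross-check of the expected data in this fixture, the pipeline can be emulated in plain Python. `split_keep` and `run_pipeline` are hypothetical helpers written for this sketch, and the input dates are made up (the real rows come from the `beers.beers_tiny` table and are not shown in this commit). The sketch assumes the generated expression behaves like `IFNULL(SPLIT(col, delimiter)[SAFE_OFFSET(i)], '')`, so a missing part becomes an empty string.

```python
from datetime import date


def split_keep(value: str, delimiter: str, number_cols_to_keep: int) -> list[str]:
    """Split `value` and return exactly `number_cols_to_keep` parts.

    Out-of-range positions become "" to mirror IFNULL(arr[SAFE_OFFSET(i)], '').
    """
    parts = value.split(delimiter)
    return [parts[i] if i < len(parts) else "" for i in range(number_cols_to_keep)]


def run_pipeline(rows: list[dict]) -> list[dict]:
    """Emulate the convert -> split -> rename -> select steps of the fixture."""
    result = []
    for row in rows:
        text = str(row["brewing_date"])  # convert: DATE -> text, e.g. "2022-01-01"
        first, second = split_keep(text, "-", 2)  # split: keep the first two parts
        # rename + select: expose only the two renamed split columns
        result.append({"renamed_column_1": first, "renamed_column_2": second})
    return result


# Hypothetical input rows; the fixture's real data is not part of this commit.
rows = [{"brewing_date": date(2022, 1, 1)}, {"brewing_date": date(2022, 1, 2)}]
print(run_pipeline(rows))
```

With January 2022 dates as input, every row yields `'2022'` and `'01'`, matching the shape of the expected data above.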
@@ -0,0 +1,55 @@
+exclude:
+- mongo
+- pandas
+- snowflake
+step:
+  pipeline:
+  - name: convert
+    columns:
+    - brewing_date
+    dataType: text
+  - name: split
+    column: brewing_date
+    delimiter: '-'
+    numberColsToKeep: 2
+  - name: replace
+    searchColumn: brewing_date_2
+    toReplace:
+    - - '01'
+      - 'HELLO'
+  - name: text
+    newColumn: 'bye'
+    text: 'bye'
+  - name: select
+    columns:
+    - brewing_date_2
+    - bye
+expected:
+  schema:
+    pandas_version: 1.5.0
+    fields:
+    - name: brewing_date_2
+      type: string
+    - name: bye
+      type: string
+  data:
+  - brewing_date_2: 'HELLO'
+    bye: 'bye'
+  - brewing_date_2: 'HELLO'
+    bye: 'bye'
+  - brewing_date_2: 'HELLO'
+    bye: 'bye'
+  - brewing_date_2: 'HELLO'
+    bye: 'bye'
+  - brewing_date_2: 'HELLO'
+    bye: 'bye'
+  - brewing_date_2: 'HELLO'
+    bye: 'bye'
+  - brewing_date_2: 'HELLO'
+    bye: 'bye'
+  - brewing_date_2: 'HELLO'
+    bye: 'bye'
+  - brewing_date_2: 'HELLO'
+    bye: 'bye'
+  - brewing_date_2: 'HELLO'
+    bye: 'bye'
@@ -1,26 +1,3 @@
-"""
-BigQuery free DBs have tables that expire after 60 days.
-If the table "beers.beers_tiny" is expired, re-create it:
-- open the BigQuery console https://console.cloud.google.com/bigquery?project=biquery-integration-tests&ws=!1m4!1m3!3m2!1sbiquery-integration-tests!2sbeers
-- use "create table", choose "Upload" and use the `beers-bigquery.csv` file available [here](https://github.com/ToucanToco/weaverbird/pull/1835#issuecomment-1647810149)
-- name the table "beers" and check "Edit text" for the schema
-- fill the schema with:
-```
-price_per_l:FLOAT,
-alcohol_degree:FLOAT,
-name:STRING,
-cost:FLOAT,
-beer_kind:STRING,
-volume_ml:FLOAT,
-brewing_date:DATE,
-nullable_name:STRING
-```
-- run the query:
-```
-CREATE TABLE `beers.beers_tiny` AS SELECT * FROM `beers.beers` ORDER BY brewing_date LIMIT 10
-```
-"""
-
 import json
 from io import StringIO
 from os import environ
2 changes: 1 addition & 1 deletion server/uv.lock
