Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Chore/upgrade datafusion 45 #1010

Merged
merged 9 commits into from
Feb 7, 2025
Merged

Conversation

kevinjqliu
Copy link
Contributor

@kevinjqliu kevinjqliu commented Feb 2, 2025

Which issue does this PR close?

Closes apache/datafusion#14410
Upgrade datafusion dependencies to v45.0.0 https://crates.io/crates/datafusion/45.0.0

(Rebase on #1009 is merged)

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

Cargo.toml Outdated Show resolved Hide resolved
Cargo.toml Outdated Show resolved Hide resolved
@timsaucer
Copy link
Contributor

Thank you for taking this on! It looks like a smooth update!

@kevinjqliu kevinjqliu marked this pull request as draft February 3, 2025 03:00
@kevinjqliu
Copy link
Contributor Author

Converted to draft for now, blocked until datafusion v45 is published

@kevinjqliu
Copy link
Contributor Author

kevinjqliu commented Feb 7, 2025

Interesting... CI was previously passing when using the branch-45 commit hash...

test_string_functions errors on comparison

assert result.column(0) == expected_result

testing locally,

(Pdb) result.column(0)
<pyarrow.lib.ListArray object at 0x104723880>
[
  [
    "ell"
  ],
  [
    "orl"
  ],
  null
]
(Pdb) expected_result
<pyarrow.lib.ListArray object at 0x1041c8d00>
[
  [
    "ell"
  ],
  [
    "orl"
  ],
  null
]
(Pdb) type(expected_result)
<class 'pyarrow.lib.ListArray'>

It seems like the == is using comparing the pointers instead of ListArray values

(Pdb) result.column(0).equals(expected_result)
False
(Pdb) result.column(0).to_pylist() == expected_result.to_pylist()
True

Weird that .equals is False

cc @timsaucer before i dig deeper, do you have any pointers on this?

@timsaucer
Copy link
Contributor

@kevinjqliu I just pushed to your branch an update to the return type. As we've been updating we've seen a lot of these small changes from string -> string_view.

@kevinjqliu
Copy link
Contributor Author

Thank you @timsaucer i verified locally that that indeed resolved the issue for test_string_functions

I also pushed up a fix for test_relational_expr (which was originally changed here but now im reverting that change)

There's one last issue with test_execution_plan, stream = ctx.execute(plan, 0) produce no data.
I think it has to do with the plan itself. I've printed out the plans here.

Im not familiar with this, could you help take a look?

(Pdb) print(plan.display())
AggregateExec: mode=FinalPartitioned, gby=[c1@0 as c1], aggr=[sum(test.c2)]

(Pdb) print(plan.display_indent())
AggregateExec: mode=FinalPartitioned, gby=[c1@0 as c1], aggr=[sum(test.c2)]
  CoalesceBatchesExec: target_batch_size=8192
    RepartitionExec: partitioning=Hash([c1@0], 8), input_partitions=8
      AggregateExec: mode=Partial, gby=[c1@0 as c1], aggr=[sum(test.c2)]
        RepartitionExec: partitioning=RoundRobinBatch(8), input_partitions=1
          CsvExec: file_groups={1 group: [[Users/kevinliu/repos/datafusion-python/testing/data/csv/aggregate_test_100.csv]]}, projection=[c1, c2], has_header=true

@timsaucer
Copy link
Contributor

@kevinjqliu Are you getting an error on this unit test? I'm seeing it pass. Is it possible there's an issue with your submodule that contains the data? It looks like CI unit tests are passing also.

@timsaucer
Copy link
Contributor

Do you think this is ready? If so and you move it out of draft I can merge it.

@kevinjqliu kevinjqliu marked this pull request as ready for review February 7, 2025 19:32
@kevinjqliu
Copy link
Contributor Author

cool as long as CI works, i think we're ready

@kevinjqliu
Copy link
Contributor Author

I double checked by running that test with v44.1.0 and the plans are the same

@timsaucer timsaucer merged commit d635d56 into apache:main Feb 7, 2025
15 checks passed
@kevinjqliu
Copy link
Contributor Author

Is it possible there's an issue with your submodule that contains the data? It looks like CI unit tests are passing also.

I think thats the issue, i didnt see that testing/ is a submodule

@kevinjqliu kevinjqliu deleted the kevinjqliu/45 branch February 7, 2025 19:37
@kevinjqliu
Copy link
Contributor Author

kevinjqliu commented Feb 7, 2025

@timsaucer im still seeing this issue on my local setup.
Following the same steps as the CI test

- name: Run tests
env:
RUST_BACKTRACE: 1
run: |
git submodule update --init
uv sync --dev --no-install-package datafusion
uv run --no-project maturin develop --uv
uv run --no-project pytest -v .

Do you mind giving it a try on your local?

Ive verified that the submodule matches that of main

➜  datafusion-python git:(main) git submodule status
 e13af117de7c4f0a4d9908ae3827b3ab119868f3 parquet (e13af11)
 5bab2f264a23f5af68f69ea93d24ef1e8e77fc88 testing (5bab2f2)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Test DataFusion 45 with datafusion-python
2 participants