bug: basic Table operations fail if the empty string is a column name #10514

Open
1 task done
edschofield opened this issue Nov 21, 2024 · 2 comments
Labels
bug Incorrect behavior inside of ibis

Comments

@edschofield
edschofield commented Nov 21, 2024

What happened?

This code succeeds with the sqlite and polars backends but fails for me with the duckdb backend:

import ibis
import polars as pl

url = "https://raw.githubusercontent.com/PythonCharmers/PythonCharmersData/refs/heads/master/palmerpenguins.csv"
penguins_pl = pl.read_csv(url)
penguins = ibis.memtable(penguins_pl)
result = penguins.to_polars()  # fails

The final line raises a ValueError:

ValueError: Target schema's field names are not matching the table's field names: ['v0', 'species', 'island', 'bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g', 'sex', 'year'], ['', 'species', 'island', 'bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g', 'sex', 'year']

Note that a ValueError also occurs if the source or destination is a Pandas dataframe, as in this code:

penguins = ibis.memtable(penguins_pl.to_pandas())
result = penguins.to_pandas()  # fails
ValueError: schema names don't match input data columns

However, after renaming the column named '' (the empty string) to anything else, like ' ' (a space), displaying the Table works with the DuckDB backend too:

penguins = ibis.memtable(penguins_pl.rename({'': ' '}))
print(penguins)

It may seem perverse and weird to have a column name as the empty string, but note that this is the default CSV output format produced by Pandas:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df.to_csv('ohdear.csv')  # the index is written as an unnamed first column
!cat ohdear.csv
,A,B
0,1,4
1,2,5
2,3,6
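
To make the origin of the empty name concrete, the header row above can be parsed with Python's standard csv module. This is only an illustrative sketch: csv_text is a hand-copied version of the ohdear.csv contents shown above, and no pandas, Polars, or Ibis call is involved.

```python
import csv
import io

# Hand-copied contents of ohdear.csv from the output above (an assumption
# for illustration; nothing is read from disk).
csv_text = ",A,B\n0,1,4\n1,2,5\n2,3,6\n"

# The first row is the header; the unnamed index column parses as ''.
header = next(csv.reader(io.StringIO(csv_text)))
print(header)  # ['', 'A', 'B']

# Positions of columns whose name is the empty string.
empty_positions = [i for i, name in enumerate(header) if name == ""]
print(empty_positions)  # [0]
```

Any reader that takes the header literally will therefore produce a column named '' for the index.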

What version of ibis are you using?

Ibis 9.5.0

What backend(s) are you using, if any?

DuckDB 1.1.3

Relevant log output

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
@IndexSeek
Member

Thank you for bringing this to our attention and for providing the code to reproduce the issue. This behavior is quite interesting. I would like someone to examine this in more detail, but it appears that DuckDB does not support zero-length identifiers.

Using the DuckDB CLI:

$ duckdb
v1.1.2 f680b7d08f
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D SELECT 1 AS "";
Parser Error: zero-length delimited identifier at or near """"
LINE 1: SELECT 1 AS "";

In the meantime, you might consider switching the default backend to Polars, which should better support your memtable usage and allows a zero-length column name.

import ibis
import polars as pl

ibis.set_backend("polars")

url = "https://raw.githubusercontent.com/PythonCharmers/PythonCharmersData/refs/heads/master/palmerpenguins.csv"
penguins_pl = pl.read_csv(url)
penguins = ibis.memtable(penguins_pl)
result = penguins.to_polars()  # succeeds with the Polars backend

> It may seem perverse and weird to have a column name as the empty string, but note that this is the default CSV output format produced by Pandas:

It looks as if DuckDB (or perhaps Ibis internally) automatically assigns a column name in this case. The DuckDB documentation describes related behavior under Deduplicating Identifiers, but I think this case is a bit different. With the Polars backend, the index column keeps the empty string as its name.

In [1]: from ibis.interactive import *

In [2]: data = """,name,amount
   ...: 0,Alice,100
   ...: 1,Bob,200
   ...: 2,Charlie,300"""

In [3]: with open("/tmp/example.csv", "w") as f:
   ...:     f.write(data)
   ...: 

In [4]: ibis.read_csv("/tmp/example.csv")
Out[4]: 
┏━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┓
┃ column0 ┃ name    ┃ amount ┃
┡━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━┩
│ int64   │ string  │ int64  │
├─────────┼─────────┼────────┤
│       0 │ Alice   │    100 │
│       1 │ Bob     │    200 │
│       2 │ Charlie │    300 │
└─────────┴─────────┴────────┘

In [5]: ibis.set_backend("polars")

In [6]: ibis.read_csv("/tmp/example.csv")
Out[6]: 
┏━━━━━━━┳━━━━━━━━━┳━━━━━━━━┓
┃       ┃ name    ┃ amount ┃
┡━━━━━━━╇━━━━━━━━━╇━━━━━━━━┩
│ int64 │ string  │ int64  │
├───────┼─────────┼────────┤
│     0 │ Alice   │    100 │
│     1 │ Bob     │    200 │
│     2 │ Charlie │    300 │
└───────┴─────────┴────────┘

@gforsyth
Member

DuckDB doesn't support zero-length identifiers.

One option is for us to enforce adding identifiers to columns when we materialize a memtable. In the interim, if you use the backend's read_csv instead of going through the Polars reader as an intermediate, this won't be an issue.
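
As a rough illustration of the first option, here is a minimal sketch of assigning placeholder identifiers to empty column names. It is not Ibis's implementation: fill_empty_names and its prefix parameter are hypothetical names, and the column0-style placeholder is only chosen to resemble the name shown in the read_csv output earlier in this thread.

```python
def fill_empty_names(names, prefix="column"):
    """Return a copy of `names` with empty strings replaced by placeholders.

    Hypothetical helper for illustration only. The placeholder is built from
    `prefix` plus the column position, and underscores are appended until it
    does not collide with any name already in use.
    """
    taken = {name for name in names if name}
    result = []
    for position, name in enumerate(names):
        if name:
            # Non-empty names are kept unchanged.
            result.append(name)
            continue
        candidate = f"{prefix}{position}"
        while candidate in taken:
            candidate += "_"
        taken.add(candidate)
        result.append(candidate)
    return result


columns = ["", "species", "island"]
print(fill_empty_names(columns))  # ['column0', 'species', 'island']
```

On the user side, the same idea could be applied before calling ibis.memtable, for example by renaming the Polars columns with penguins_pl.rename(dict(zip(penguins_pl.columns, fill_empty_names(penguins_pl.columns)))).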

Projects
Status: backlog
Development

No branches or pull requests

3 participants