bug: basic `Table` operations fail if the empty string is a column name #10514

edschofield · 2024-11-21T00:00:10Z

What happened?

This code succeeds with the sqlite and polars backends but fails for me with the duckdb backend:

import ibis
import polars as pl

url = "https://raw.githubusercontent.com/PythonCharmers/PythonCharmersData/refs/heads/master/palmerpenguins.csv"
penguins_pl = pl.read_csv(url)
penguins = ibis.memtable(penguins_pl)
result = penguins.to_polars()  # fails

The final line raises a ValueError:

ValueError: Target schema's field names are not matching the table's field names: ['v0', 'species', 'island', 'bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g', 'sex', 'year'], ['', 'species', 'island', 'bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g', 'sex', 'year']

Note that a ValueError also occurs if the source or destination is a pandas DataFrame, as in this code:

penguins = ibis.memtable(penguins_pl.to_pandas())
result = penguins.to_pandas()  # fails

ValueError: schema names don't match input data columns

However, after renaming the column named '' (the empty string) to anything else, like ' ' (a space), displaying the Table works with the DuckDB backend too:

penguins = ibis.memtable(penguins_pl.rename({'': ' '}))
print(penguins)
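A minimal sketch of a more general form of this workaround, assuming you want a readable placeholder rather than a space. The helper name `empty_name_mapping` and the `unnamed` placeholder are hypothetical and are not part of ibis, polars, or DuckDB. It only builds the rename mapping, so it runs without any of those libraries installed:

```python
def empty_name_mapping(columns, placeholder="unnamed"):
    """Return a rename mapping for the empty-string column name, if present.

    Hypothetical helper for illustration only. A numeric suffix is added
    when the placeholder would clash with an existing column name.
    """
    if "" not in columns:
        return {}
    candidate = placeholder
    suffix = 0
    while candidate in columns:
        suffix += 1
        candidate = f"{placeholder}_{suffix}"
    return {"": candidate}


columns = ["", "species", "island"]
mapping = empty_name_mapping(columns)
print(mapping)
```

With the objects from this report, the mapping would be passed to the same `rename` call used above, e.g. `ibis.memtable(penguins_pl.rename(mapping))`.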

It may seem perverse and weird to have a column name as the empty string, but note that this is the default CSV output format produced by Pandas:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
# The index is written as an unnamed first column by default
df.to_csv('ohdear.csv')
!cat ohdear.csv

,A,B
0,1,4
1,2,5
2,3,6
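To make the source of the empty name explicit, here is a small sketch that parses the header row of the output above using only the standard library. The CSV text is copied from the output shown; nothing here depends on ibis or DuckDB:

```python
import csv
import io

# The CSV text produced by df.to_csv('ohdear.csv') above
text = ",A,B\n0,1,4\n1,2,5\n2,3,6\n"

# The first row is the header; its first field is the unnamed index column
header = next(csv.reader(io.StringIO(text)))
print(header)  # ['', 'A', 'B']
```

Writing the file with `df.to_csv('ohdear.csv', index=False)` omits the index column, so the empty name never appears, but that only helps when you control how the CSV is produced.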

What version of ibis are you using?

Ibis 9.5.0

What backend(s) are you using, if any?

DuckDB 1.1.3

Relevant log output

No response

Code of Conduct

I agree to follow this project's Code of Conduct


IndexSeek · 2024-11-21T03:53:33Z

Thank you for bringing this to our attention and for providing the code to reproduce the issue. This behavior is quite interesting. I would like someone to examine this in more detail, but it appears that DuckDB does not support zero-length identifiers.

Using the DuckDB CLI:

$ duckdb
v1.1.2 f680b7d08f
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D SELECT 1 AS "";
Parser Error: zero-length delimited identifier at or near """"
LINE 1: SELECT 1 AS "";

In the meantime, you might consider swapping the default backend, as it should better support your memtable usage and allow for a zero-length identifier column name.

import ibis
import polars as pl

ibis.set_backend("polars")

url = "https://raw.githubusercontent.com/PythonCharmers/PythonCharmersData/refs/heads/master/palmerpenguins.csv"
penguins_pl = pl.read_csv(url)
penguins = ibis.memtable(penguins_pl)
result = penguins.to_polars()  # works with the polars backend

> It may seem perverse and weird to have a column name as the empty string, but note that this is the default CSV output format produced by Pandas:

It looks as if DuckDB (or maybe this is something Ibis handles internally) automatically assigns a column name in this case. The DuckDB documentation clarifies something similar on Deduplicating Identifiers, but I think this may be a bit different. Using the Polars backend will still keep the index column as an empty string.

In [1]: from ibis.interactive import *

In [2]: data = """,name,amount
   ...: 0,Alice,100
   ...: 1,Bob,200
   ...: 2,Charlie,300"""

In [3]: with open("/tmp/example.csv", "w") as f:
   ...:     f.write(data)
   ...: 

In [4]: ibis.read_csv("/tmp/example.csv")
Out[4]: 
┏━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┓
┃ column0 ┃ name    ┃ amount ┃
┡━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━┩
│ int64   │ string  │ int64  │
├─────────┼─────────┼────────┤
│       0 │ Alice   │    100 │
│       1 │ Bob     │    200 │
│       2 │ Charlie │    300 │
└─────────┴─────────┴────────┘

In [5]: ibis.set_backend("polars")

In [6]: ibis.read_csv("/tmp/example.csv")
Out[6]: 
┏━━━━━━━┳━━━━━━━━━┳━━━━━━━━┓
┃       ┃ name    ┃ amount ┃
┡━━━━━━━╇━━━━━━━━━╇━━━━━━━━┩
│ int64 │ string  │ int64  │
├───────┼─────────┼────────┤
│     0 │ Alice   │    100 │
│     1 │ Bob     │    200 │
│     2 │ Charlie │    300 │
└───────┴─────────┴────────┘

gforsyth · 2024-11-21T21:12:19Z

DuckDB doesn't support zero-length identifiers.

One option is for us to enforce adding identifiers to columns when we materialize a memtable. In the interim, if you use the backend's read_csv instead of going through the Polars reader as an intermediate step, this won't be an issue.

edschofield added the bug Incorrect behavior inside of ibis label Nov 21, 2024

github-project-automation bot added this to Ibis planning and roadmap Nov 21, 2024

github-project-automation bot moved this to backlog in Ibis planning and roadmap Nov 21, 2024

edschofield changed the title ~~bug: failure displaying Table if a column name is the empty string~~ bug: basic Table operations fail the empty string is a column name Nov 21, 2024

edschofield changed the title ~~bug: basic Table operations fail the empty string is a column name~~ bug: basic Table operations fail if the empty string is a column name Nov 21, 2024

edschofield mentioned this issue Nov 21, 2024

possible bug: Table.__repr__ sometimes produces non-ASCII characters #10516

Open

1 task
