Upcoming (likely breaking) change: Deprecating/Removing TextColumn #1090

lwhite1 · 2022-04-01T19:11:06Z

lwhite1
Apr 1, 2022
Maintainer

TextColumn was designed to hold non-categorical text; a good example is the text of a tweet in a table of Twitter activity.

StringColumn, by contrast, works best with categorical data, as it uses a dictionary encoding scheme to reduce memory for repeated strings. A good example might be US states in a table with 1 million rows. If the average state name is 15 characters, storing the raw strings in a list would take more than 30 MB, but encoding each string as 1 byte reduces the memory requirement to about 1 MB. This can make a big difference in tables with lots of categorical strings.
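To make the dictionary encoding idea concrete, here is a minimal sketch in plain Java. It is not Tablesaw's actual implementation: the class `DictionaryEncodingSketch` and all of its methods are hypothetical names made up for illustration, and it assumes at most 256 distinct values so that every row fits in a single byte.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not Tablesaw code.
class DictionaryEncodingSketch {
    // Each distinct string is stored once and mapped to a small code.
    private final Map<String, Byte> codeByValue = new HashMap<>();
    private final List<String> valueByCode = new ArrayList<>();
    // One byte per row instead of one String reference per row.
    private byte[] codes = new byte[16];
    private int size = 0;

    public void append(String value) {
        Byte code = codeByValue.get(value);
        if (code == null) {
            if (valueByCode.size() >= 256) {
                throw new IllegalStateException("more than 256 distinct values");
            }
            code = (byte) valueByCode.size();
            codeByValue.put(value, code);
            valueByCode.add(value);
        }
        if (size == codes.length) {
            codes = Arrays.copyOf(codes, size * 2); // grow the row storage
        }
        codes[size++] = code;
    }

    public String get(int row) {
        if (row < 0 || row >= size) {
            throw new IndexOutOfBoundsException("row " + row);
        }
        // Mask to read the byte as an unsigned code in the range 0..255.
        return valueByCode.get(codes[row] & 0xFF);
    }

    public int size() {
        return size;
    }

    public int distinctCount() {
        return valueByCode.size();
    }

    public static void main(String[] args) {
        DictionaryEncodingSketch states = new DictionaryEncodingSketch();
        states.append("Ohio");
        states.append("Texas");
        states.append("Ohio");
        System.out.println(states.get(2) + " / distinct values: " + states.distinctCount());
    }
}
```

The per-row cost here is one byte regardless of string length, which is where the savings for highly repetitive data come from. A column of mostly unique text gains nothing from this scheme, since the dictionary ends up as large as the data.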

The problem with this arrangement is that you don't always know which type you will get. The decision is made by the column type detection process when you read a file. Often you may request a StringColumn (e.g. myTable.stringColumn("foo")), only to get an exception saying that TextColumn cannot be cast to StringColumn.
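The failure mode described above can be sketched with stand-in classes. Everything below is hypothetical: `Column`, `StringCol`, `TextCol`, and `Table` are simplified placeholders rather than Tablesaw's real types, and the sketch assumes the typed accessor is just a cast over an untyped lookup.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for illustration; not Tablesaw's real classes.
class ColumnCastSketch {
    interface Column {}

    static final class StringCol implements Column {}

    static final class TextCol implements Column {}

    static final class Table {
        private final Map<String, Column> columns = new HashMap<>();

        void add(String name, Column column) {
            columns.put(name, column);
        }

        Column column(String name) {
            return columns.get(name);
        }

        // Assumed shape of a typed accessor: a cast that fails at runtime
        // if type detection chose a different column type.
        StringCol stringColumn(String name) {
            return (StringCol) column(name);
        }
    }

    // Returns true when the typed accessor throws for the given column.
    static boolean castFails(Table table, String name) {
        try {
            table.stringColumn(name);
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Table table = new Table();
        table.add("state", new StringCol());
        table.add("tweet", new TextCol()); // as if detection picked the text type
        System.out.println("state cast fails: " + castFails(table, "state"));
        System.out.println("tweet cast fails: " + castFails(table, "tweet"));
    }
}
```

The caller's code is identical for both columns; only the outcome of type detection differs, which is why the error is hard to anticipate.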

(This is analogous to the short/int/long or float/double situation, but the user can ignore those if their tables are small. They will generally get (or can force) ints and doubles for their numerical column types.)

The encoding of Strings is an internal detail that should never have been visible to users, in the same way that the details of the dictionary encoding (whether a particular column uses bytes, shorts, or ints for the encoded values) are hidden from the user.

The plan is to go back to having a single column type (StringColumn) for strings, and to deprecate and eventually remove TextColumn. The work is currently on the revised-string-column branch: https://github.com/jtablesaw/tablesaw/tree/revised-string-column.

The issue is #1074.
