TextColumn was designed to hold non-categorical text; a good example is the text of a tweet in a table of Twitter activity.
StringColumn, by contrast, works best with categorical data, as it uses a dictionary encoding scheme to reduce memory for repeated strings. A good example might be US states in a table with 1 million rows. If the average state name is 15 characters, a list of the raw strings would take more than 30 MB, but encoding each string as 1 byte reduces the memory requirement to about 1 MB. This can make a big difference in tables with lots of categorical strings.
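To make the memory argument concrete, here is a minimal sketch of dictionary encoding with one-byte codes. It is an illustration only, not Tablesaw's actual implementation: the class `DictionaryEncodedStrings` and all of its methods are hypothetical names, and the 256-value limit is an assumption that follows from using a single byte per row.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a dictionary-encoded string column (not Tablesaw code). */
class DictionaryEncodedStrings {
    // Maps each distinct string to its one-byte code.
    private final Map<String, Byte> codeByValue = new HashMap<>();
    // Reverse lookup: the index in this list is the code.
    private final List<String> valueByCode = new ArrayList<>();
    // One byte per row instead of one String reference per row.
    private byte[] codes = new byte[16];
    private int size = 0;

    void append(String value) {
        Byte code = codeByValue.get(value);
        if (code == null) {
            if (valueByCode.size() >= 256) {
                // A real implementation would switch to a wider code type here.
                throw new IllegalStateException("more than 256 distinct values");
            }
            code = (byte) valueByCode.size();
            codeByValue.put(value, code);
            valueByCode.add(value);
        }
        if (size == codes.length) {
            codes = Arrays.copyOf(codes, size * 2);
        }
        codes[size++] = code;
    }

    String get(int row) {
        if (row < 0 || row >= size) {
            throw new IndexOutOfBoundsException("row " + row);
        }
        // Mask to treat the stored byte as an unsigned code in 0..255.
        return valueByCode.get(codes[row] & 0xFF);
    }

    int size() {
        return size;
    }

    int distinctCount() {
        return valueByCode.size();
    }

    /** Bytes used by the per-row codes, ignoring the small dictionary and array slack. */
    long encodedBytes() {
        return size;
    }

    public static void main(String[] args) {
        DictionaryEncodedStrings states = new DictionaryEncodedStrings();
        String[] sample = {"Ohio", "Texas", "Maine"};
        for (int i = 0; i < 1_000_000; i++) {
            states.append(sample[i % sample.length]);
        }
        System.out.println("rows: " + states.size());
        System.out.println("distinct values: " + states.distinctCount());
        System.out.println("bytes for codes: " + states.encodedBytes());
    }
}
```

With 1 million rows, the codes take 1 million bytes, which is where the "about 1 MB" figure comes from. The raw-string figure of more than 30 MB is consistent with 15 characters at 2 bytes each per row, before any per-object overhead.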
The problem with this arrangement is that you don't always know which type you will get. The decision is made by the column type detection process when you read a file. Often you request a StringColumn (e.g. myTable.stringColumn("foo")), only to get an exception saying that TextColumn cannot be cast to StringColumn.
(This is analogous to the short/int/long or float/double situation, but the user can ignore those if their tables are small. They will generally get (or can force) ints and doubles for their numerical column types.)
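The failure mode described above can be reproduced with a small self-contained sketch. The classes below are hypothetical stand-ins written for this example, not Tablesaw's real `Table`, `StringColumn`, or `TextColumn`; the assumption is only that a typed accessor casts whatever column the type detector happened to create.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-ins that mimic the cast failure (not Tablesaw code). */
class ColumnCastSketch {
    static abstract class Column {
        final String name;

        Column(String name) {
            this.name = name;
        }
    }

    static final class StringColumn extends Column {
        StringColumn(String name) {
            super(name);
        }
    }

    static final class TextColumn extends Column {
        TextColumn(String name) {
            super(name);
        }
    }

    static final class Table {
        private final Map<String, Column> columns = new HashMap<>();

        void add(Column column) {
            columns.put(column.name, column);
        }

        // Typed accessor: fails at runtime if detection produced a TextColumn.
        StringColumn stringColumn(String name) {
            return (StringColumn) columns.get(name);
        }
    }

    /** Returns "ok" if the typed accessor works, or the name of the failure. */
    static String tryStringColumn(Table table, String name) {
        try {
            table.stringColumn(name);
            return "ok";
        } catch (ClassCastException e) {
            return "ClassCastException";
        }
    }

    public static void main(String[] args) {
        Table table = new Table();
        // Pretend type detection chose TextColumn for "foo" and StringColumn for "bar".
        table.add(new TextColumn("foo"));
        table.add(new StringColumn("bar"));
        System.out.println("foo: " + tryStringColumn(table, "foo"));
        System.out.println("bar: " + tryStringColumn(table, "bar"));
    }
}
```

The caller wrote the same code for both columns; which one fails depends entirely on a decision made at file-read time, which is the usability problem a single string column type would remove.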
The encoding of Strings is an internal detail that should never have been visible to users, in the same way that the details of the dictionary encoding (whether a particular column uses bytes, shorts, or ints for the encoded values) are hidden from the user.
The plan is to go back to having a single column type (StringColumn) for strings, and to deprecate and eventually remove TextColumn. The work is currently on the revised-string-column branch: https://github.com/jtablesaw/tablesaw/tree/revised-string-column.
The issue is #1074