Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update CSV.read docs about WeakRefString conversions #118

Closed
wants to merge 3 commits into from
Closed
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion src/Source.jl
Original file line number Diff line number Diff line change
Expand Up @@ -244,7 +244,7 @@ Keyword Arguments:
* `categorical::Bool=true`: read string column as a `CategoricalArray` ([ref](https://github.com/JuliaData/CategoricalArrays.jl)), as long as the % of unique values seen during type detection is less than 67%. This will dramatically reduce memory use in cases where the number of unique values is small.

Note by default, "string" or text columns will be parsed as the [`WeakRefString`](https://github.com/quinnj/WeakRefStrings.jl) type. This is a custom type that only stores a pointer to the actual byte data + the number of bytes.
To convert a `String` to a standard Julia string type, just call `string(::WeakRefString)`, this also works on an entire column.
To convert a `String` to a standard Julia string type, just call `string(::WeakRefString)` for an individual observation, or `string.(::WeakRefString)` on an entire column.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! This looks like a nice small clarification, so I feel sorry for my lengthy comment ;) :

  • I think String has to be fixed too since it is a "standard Julia string type".
  • string.(::WeakRefString) will work, but ::WeakRefString is not a type of the column

An alternative:
"
To convert wstr::WeakRefString to a standard Julia string type, just call string(wstr), or string.(df[:wstr]) to convert an entire wstr column of df table.
"

However, the problem with this approach is that missing values in df[:wstr] would be converted into "missing" strings.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Woops, good catch. So I guess the only correct approach is convert(Union{String, Missing}, ::WeakRefString) and convert(Array{Union{String, Missing}}, ::AbstractArray{WeakRefString})?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, the problem with this approach is that missing values in df[:wstr] would be converted into "missing" strings.

Not sure I follow... is that a problem? Or do you mean they'd become "" instead of Missings.missing?

Copy link
Contributor

@alyst alyst Nov 27, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the only correct approach is convert(Union{String, Missing}, ::WeakRefString) and convert(Array{Union{String, Missing}}, ::AbstractArray{WeakRefString})?

I also think so. More precisely, the scalar form convert(Union{String, Missing}, wstr) would give the correct result both for wstr::WeakRefString and wstr === missing.
But convert(Array{Union{String, Missing}}, wstr) should only be called for wstr::AbstractArray{Union{WeakRefString, Missing}} unless we want to allow missingness.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nickeubank

julia> string(Missings.missing)
"missing"

string.(wstr::AbstractArray{Union{WeakRefString, Missing}}) doesn't preserve missingness: your Missings.missing missing values in the array would become non-missing strings "missing"::String.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately

julia> using CSV, Missings
julia> df = CSV.read(joinpath(Pkg.dir("CSV"), "test/test_files/attenu.csv"), null="NA", rows_for_type_detect=200);
julia> eltype(df[3])
Union{Missings.Missing, WeakRefString{UInt8}}
julia> eltype(convert.(Union{String, Missing}, df[3]))
Any

Maybe it's somehow related to this

julia> typeof(df[1,3])
String
julia> collect(skipmissing(df[3]))
ERROR: TypeError: _collect: in typeassert, expected WeakRefString{UInt8}, got String
Stacktrace:
 [1] next at /home/astukalov/.julia/v0.6/Missings/src/Missings.jl:265 [inlined]
 [2] _collect(::UnitRange{Int64}, ::Missings.EachSkipMissing{WeakRefStrings.WeakRefStringArray{Union{Missings.Missing, WeakRefString{UInt8}},1}}, ::Base.HasEltype, ::Base.SizeUnknown) at ./array.jl:442
 [3] collect(::Missings.EachSkipMissing{WeakRefStrings.WeakRefStringArray{Union{Missings.Missing, WeakRefString{UInt8}},1}}) at ./array.jl:431
 [4] macro expansion at ./REPL.jl:97 [inlined]
 [5] (::Base.REPL.##1#2{Base.REPL.REPLBackend})() at ./event.jl:73

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like WeakRefStringArray is lying about its eltype: it should be String.

Copy link
Contributor

@alyst alyst Nov 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If WeakRefString to String conversion is now supposed to be handled automatically, maybe we don't need this docstring section at all :)
I guess we need @quinnj assistance to clarify the situation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And regarding the original problem that the eltype is Any, see JuliaLang/julia#24332.

So I guess the only solution for now is to recommend using either convert(Array{String}, x) or convert(Array{Union{String, Missing}}, x) depending on whether you want to support missing values or not.

Oftentimes, however, it can be convenient to work with `WeakRefStrings` depending on the ultimate use, such as transfering the data directly to another system and avoiding all the intermediate copying.

Example usage:
Expand Down