ReadStatColumns internal vectors store empty elements #21

Closed
jkrumbiegel opened this issue Jan 4, 2023 · 2 comments

jkrumbiegel commented Jan 4, 2023

I'm not sure if I'm missing something in the implementation, but this seems odd to me. The fifth and sixth columns have type index 7 (double) but location indices 4 and 5 in that vector, even though they are the second and third double columns when counted upwards from 1. Why are slots 2, 3 and 6 empty? I guess the location index erroneously keeps counting upwards even when columns of other types are added after the first one.

julia> rs = readstat("data/sample.sav")
5×7 ReadStatTable:
 Row │ mychar    mynum               mydate                dtime            
     │ String  Float64            DateTime?            DateTime?  Labeled{F 
─────┼───────────────────────────────────────────────────────────────────────
   1 │      a      1.1  2018-05-06T00:00:00  2018-05-06T10:10:10            
   2 │      b      1.2  1880-05-06T00:00:00  1880-05-06T10:10:10            
   3 │      c  -1000.3  1960-01-01T00:00:00  1960-01-01T00:00:00            
   4 │      d     -1.4  1583-01-01T00:00:00  1583-01-01T00:00:00            
   5 │      e   1000.3              missing              missing            
                                                            3 columns omitted

julia> getfield(rs, :columns).index
7-element Vector{Tuple{Int64, Int64}}:
 (2, 1)
 (7, 1)
 (9, 1)
 (9, 2)
 (7, 4)
 (7, 5)
 (9, 3)

julia> getfield(rs, :columns).double
6-element Vector{SentinelArrays.SentinelVector{Float64, Float64, Missing, Vector{Float64}}}:
 [1.1, 1.2, -1000.3, -1.4, 1000.3]
 0-element SentinelArrays.SentinelVector{Float64, Float64, Missing, Vector{Float64}}
 0-element SentinelArrays.SentinelVector{Float64, Float64, Missing, Vector{Float64}}
 [1.0, 2.0, 1.0, 2.0, 1.0]
 [1.0, 2.0, 3.0, 1.0, 1.0]
 0-element SentinelArrays.SentinelVector{Float64, Float64, Missing, Vector{Float64}}

@junyuan-chen (Owner)

This results from the way the Date/DateTime values are obtained. The numerical values read from ReadStat are first stored in the numeric columns, and those columns are emptied once the conversion is finished, as shown below:

if ntasks == 1 && convert_datetime
    cols = _columns(tb)
    @inbounds for i in 1:ncol(tb)
        format = _colmeta(tb, i, :format)
        isdta && (format = first(format, 3))
        dtpara = get(dtformats, format, nothing)
        if dtpara !== nothing
            epoch, delta = dtpara
            col0 = cols[i]
            col = parse_datetime(col0, epoch, delta, _hasmissing(tb)[i])
            if epoch isa Date
                push!(cols.date, col)
                cols.index[i] = (8, length(cols.date))
                empty!(col0)
            elseif epoch isa DateTime
                push!(cols.time, col)
                cols.index[i] = (9, length(cols.time))
                empty!(col0)
            end
        end
    end
end
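
To make the effect on the index concrete, here is a minimal stand-alone sketch that mimics the pattern with plain vectors. The names `double_cols`, `time_cols`, the epoch, and the sample values are hypothetical stand-ins chosen for illustration; this is not the package's actual code.

```julia
using Dates

# Hypothetical epoch; the real value depends on the file format.
epoch = DateTime(1960, 1, 1)

# Stand-ins for the `double` and `time` fields of ReadStatColumns.
double_cols = [[1.1, 1.2], [0.0, 86400.0], [3600.0, 7200.0], [1.0, 2.0]]
time_cols = Vector{DateTime}[]

# (type index, location index) for each table column; 7 = double, 9 = DateTime.
index = [(7, 1), (7, 2), (7, 3), (7, 4)]

# Suppose table columns 2 and 3 carry a DateTime format.
for i in (2, 3)
    col0 = double_cols[index[i][2]]
    push!(time_cols, [epoch + Second(round(Int, x)) for x in col0])
    index[i] = (9, length(time_cols))  # repoint the column at the converted vector
    empty!(col0)                       # the numeric slot stays in place, now empty
end
```

After the loop, the fourth table column still points at `(7, 4)`, while slots 2 and 3 of `double_cols` remain as empty vectors. That is the same shape as the output reported above.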

I wasn't aware of any substantial side effect from doing this, as the indices are considered internal. However, for v0.3.0 I plan to remove the date and time fields in ReadStatColumns that are currently used to store the converted Date/DateTime values. Instead, I think it would be sufficient to use a MappedArray constructed inside getcolumn to convert the original numerical values to Date/DateTime values lazily. An inverse map would also help convert Date/DateTime values back to the numerical values for writestat.
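
A rough sketch of what such a lazy view could look like with MappedArrays.jl, using its `mappedarray(f, finv, A)` form. The epoch, the one-second unit, and the helper names `num2dt` and `dt2num` are assumptions made for illustration, and missing values are not handled here.

```julia
using Dates
using MappedArrays

# Hypothetical epoch; raw values are assumed to be seconds since this instant.
const EPOCH = DateTime(1960, 1, 1)

# Forward map (number -> DateTime) and its inverse (DateTime -> number).
num2dt(x) = EPOCH + Millisecond(round(Int, 1000x))
dt2num(t) = Dates.value(t - EPOCH) / 1000

raw = [0.0, 86400.0, 3600.0]

# Lazy view: nothing is copied, elements are converted on access.
col = mappedarray(num2dt, dt2num, raw)

second_value = col[2]             # converted on the fly from raw[2]
col[3] = DateTime(1960, 1, 1, 2)  # written back to raw[3] through dt2num
```

With this approach the numeric vector stays the single source of truth, so no slot needs to be emptied and the location indices stay contiguous.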

@jkrumbiegel (Contributor, Author)

This just came up when I thought about removing certain columns from a test. I'll close the issue if it's not a plain bug.
