[DISCUSSION] - Going to 1.0? #14
Comments
@ocramz @glutamate what do you guys think about storing columns the following way:
|
If you use Vector, I think that precludes any chunking or streaming-like solution for very large data sets. I'd see this type of structure as Frames.Strict, implying a Frames.Lazy that deals with chunking. Obviously the Text and ByteString APIs are inspiring the idea. Perhaps the container (Vector) could be polymorphic? |
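A rough sketch of what "polymorphic in the container" could look like, using only base. The names `Frame`, `mapCells`, `columnToList` and `rowCount` are made up for illustration and are not taken from the analyze package. The assumption is that most operations only need `Functor` or `Foldable` on the container, so a strict `Vector` or a lazy list of chunks could both be plugged in as `f`.

```haskell
import Data.Foldable (toList)

-- | A frame whose columns live in some container @f@.
-- Hypothetical type for illustration; columns are kept in an
-- association list to avoid dependencies outside base.
newtype Frame f a = Frame { frameColumns :: [(String, f a)] }

-- | Apply a function to every cell. Needs only Functor on the container.
mapCells :: Functor f => (a -> b) -> Frame f a -> Frame f b
mapCells g (Frame cols) = Frame [ (name, fmap g col) | (name, col) <- cols ]

-- | Look up a column by name and flatten it to a list.
-- Needs only Foldable on the container.
columnToList :: Foldable f => String -> Frame f a -> Maybe [a]
columnToList name (Frame cols) = toList <$> lookup name cols

-- | Number of rows, read off the first column (0 for a frame with no columns).
rowCount :: Foldable f => Frame f a -> Int
rowCount (Frame [])             = 0
rowCount (Frame ((_, col) : _)) = length col
```

Loaded in GHCi, `columnToList "x" (mapCells (+ 1) (Frame [("x", [1, 2, 3])]))` works with plain lists as the container. Operations that need more than `Functor`/`Foldable` (sorting, joins) would still have to be written per container, which is the reimplementation cost raised in the next comment.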
That is a good idea, @tonyday567. The catch is that one would then have to reimplement all of the operations for each container, right? For a first version I was thinking to use |
On the other hand, if we were to use a streaming package like Conduit or Streaming, how would one, for example, sort a dataframe without loading all of it? |
I started some experiment here: https://github.com/glutamate/analyze/blob/playground/src/Analyze/New.hs |
If your input doesn't fit into memory then you won't be able to do an in-memory sort. Map-reduce methods are what you do in practice - there's usually a sort between the map and the reduce. Looking at the @glutamate experiment, I think this will resolve naturally as what you can do with a FrameContainer and what you can do to a Frame. |
Yes, definitely. But I was rather thinking about when one wants to work with small data, like a 500 MB CSV file. |
Added an Arbitrary instance to the @glutamate experiment, mostly to build some intuition. Worked well. I figured a Show dependency was OK, since I couldn't imagine a CSV field that wasn't a Show. https://github.com/tonyday567/analyze/blob/arb/src/Analyze/New.hs |
Looks great, @tonyday567. Also, about sorting: it might be feasible to implement external sorting. |
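For reference, the shape of an external sort can be sketched in a few lines: cut the input into chunks that fit in memory, sort each chunk into a "run", then merge the sorted runs. The version below is only an in-memory stand-in using base. In a real implementation each run would be written to a temporary file and streamed back during the merge. All function names here are hypothetical.

```haskell
import Data.List (sort)

-- | Split a list into chunks of at most n elements.
chunksOf :: Int -> [a] -> [[a]]
chunksOf n xs
  | n < 1     = error "chunksOf: chunk size must be positive"
  | null xs   = []
  | otherwise = let (h, t) = splitAt n xs in h : chunksOf n t

-- | Merge two already-sorted lists into one sorted list.
merge2 :: Ord a => [a] -> [a] -> [a]
merge2 [] ys = ys
merge2 xs [] = xs
merge2 (x : xs) (y : ys)
  | x <= y    = x : merge2 xs (y : ys)
  | otherwise = y : merge2 (x : xs) ys

-- | Merge any number of sorted runs by repeated pairwise merging.
mergeRuns :: Ord a => [[a]] -> [a]
mergeRuns []  = []
mergeRuns [r] = r
mergeRuns rs  = mergeRuns (pairUp rs)
  where
    pairUp (a : b : rest) = merge2 a b : pairUp rest
    pairUp rest           = rest

-- | Sort by building sorted runs of bounded size and merging them.
-- Only one chunk is sorted at a time; the runs stand in for temp files.
externalSortSketch :: Ord a => Int -> [a] -> [a]
externalSortSketch chunkSize = mergeRuns . map sort . chunksOf chunkSize
```

The chunk size plays the role of the memory budget. For a dataframe, the elements would be rows compared on the sort key.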
To have it in a persistent format, so anyone can join and discuss, I'm making this issue with different ideas that have been discussed in Gitter. Copy/pasting: