You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I've tested an alternative version of this snippet which iterates but ignores rows and it seems to be working for simple cases like load -> dump_to_path. Hopefully collecting data is not vital for the framework and we can introduce pure stream mode option.
Flow().results() returns all the data processed so obviously it will take a lot of memory.
To avoid that you should use Flow().process() instead.
The documentation mentions that but is not very explicit I now see
What about large data files? In the above examples, the results are loaded into memory, which is not always preferable or acceptable. In many cases, we'd like to store the results directly onto a hard drive - without having the machine's RAM limit in any way the amount of data we can process.
I think this is an indication that documentation should be improved...
Overview
For now, the base processor collects all the data into the
results
return var:I've tested an alternative version of this snippet which iterates but ignores rows and it seems to be working for simple cases like
load -> dump_to_path
. Hopefully collecting data is not vital for the framework and we can introduce pure stream mode option.WDYT
@akariv
@cschloer
Initial discussion:
The text was updated successfully, but these errors were encountered: