
Pure streaming mode for handling very big files #139

Closed

roll opened this issue May 27, 2020 · 2 comments

roll (Contributor) commented May 27, 2020

Overview

Currently, the base processor collects all the data into the results return variable:

dataflows.base.datastream_processor

    def results(self, on_error=None):
        try:
            ds = self._process()
            # this list comprehension materializes every row of every
            # resource in memory
            results = [
                list(schema_validator(res.res, res, on_error=on_error))
                for res in ds.res_iter
            ]
        except Exception as exception:
            self.raise_exception(exception)
        return results, ds.dp, ds.merge_stats()

I've tested an alternative version of this snippet that iterates over the rows but discards them, and it seems to work for simple cases like load -> dump_to_path. Hopefully collecting the data is not vital to the framework and we can introduce a pure streaming mode option.
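A minimal sketch of what such a variant might look like (hypothetical code, not an actual patch; it mirrors the results() snippet above): iterate the resources to drive the pipeline, but discard rows instead of collecting them, returning only the datapackage and stats:

    def process(self, on_error=None):
        try:
            ds = self._process()
            for res in ds.res_iter:
                # consume each row to keep the stream flowing,
                # without accumulating anything in memory
                for row in schema_validator(res.res, res, on_error=on_error):
                    pass
        except Exception as exception:
            self.raise_exception(exception)
        return ds.dp, ds.merge_stats()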

WDYT?
@akariv
@cschloer

Initial discussion:

roll (Contributor, Author) commented May 27, 2020

BTW, I found flow.process in the code (I'm not sure it's documented in the README).

@cschloer What if you try using it instead of flow.results?

roll closed this as completed on May 27, 2020
akariv (Member) commented May 27, 2020

Flow().results() returns all the processed data, so naturally it will take a lot of memory.

To avoid that, you should use Flow().process() instead.

The documentation mentions this, but I now see it is not very explicit:

    What about large data files? In the above examples, the results are loaded into memory, which is not always preferable or acceptable. In many cases, we'd like to store the results directly onto a hard drive - without having the machine's RAM limit in any way the amount of data we can process.

I think this is an indication that the documentation should be improved...
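For example, a minimal usage sketch (the file names 'data.csv' and 'out' are placeholders):

    from dataflows import Flow, load, dump_to_path

    # process() drives the whole pipeline, streaming rows straight to disk;
    # it returns (datapackage, stats) instead of the collected data
    datapackage, stats = Flow(
        load('data.csv'),
        dump_to_path('out'),
    ).process()
    print(stats)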
