
Pure streaming mode for handling very big files #139

Closed

roll opened this issue May 27, 2020 · 2 comments

roll (Contributor) commented May 27, 2020

Overview

Currently, the base processor collects all the data into the results return variable:

dataflows.base.datastream_processor

    def results(self, on_error=None):
        try:
            ds = self._process()
            # this list comprehension materializes every row of every
            # resource in memory
            results = [
                list(schema_validator(res.res, res, on_error=on_error))
                for res in ds.res_iter
            ]
        except Exception as exception:
            self.raise_exception(exception)
        return results, ds.dp, ds.merge_stats()

I've tested an alternative version of this snippet that iterates over the rows but discards them, and it seems to work for simple cases like load -> dump_to_path. Hopefully collecting the data is not vital to the framework and we can introduce a pure streaming mode option.
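A minimal sketch of what such a variant might look like (hypothetical code, not an actual patch; it mirrors the results() snippet above): iterate the resources to drive the pipeline, but discard rows instead of collecting them, returning only the datapackage and stats:

    def process(self, on_error=None):
        try:
            ds = self._process()
            for res in ds.res_iter:
                # consume each row to keep the stream flowing,
                # without accumulating anything in memory
                for row in schema_validator(res.res, res, on_error=on_error):
                    pass
        except Exception as exception:
            self.raise_exception(exception)
        return ds.dp, ds.merge_stats()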

WDYT?
@akariv
@cschloer

Initial discussion:

roll (Contributor, Author) commented May 27, 2020

BTW, I found flow.process in the code (I'm not sure it's documented in the README).

@cschloer What if you try using it instead of flow.results?

roll closed this as completed on May 27, 2020
akariv (Member) commented May 27, 2020

Flow().results() returns all the processed data, so naturally it will take a lot of memory.

To avoid that, you should use Flow().process() instead.

The documentation mentions this, but I now see it is not very explicit:

    What about large data files? In the above examples, the results are loaded into memory, which is not always preferable or acceptable. In many cases, we'd like to store the results directly onto a hard drive - without having the machine's RAM limit in any way the amount of data we can process.

I think this is an indication that the documentation should be improved...
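For example, a minimal usage sketch (the file names 'data.csv' and 'out' are placeholders):

    from dataflows import Flow, load, dump_to_path

    # process() drives the whole pipeline, streaming rows straight to disk;
    # it returns (datapackage, stats) instead of the collected data
    datapackage, stats = Flow(
        load('data.csv'),
        dump_to_path('out'),
    ).process()
    print(stats)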
