Proposal for #267 - configurable option to disable lzo index lookup #431

Closed
EugenCepoi wants to merge 1 commit


EugenCepoi
Contributor

Making the reading of lzo indexes optional. This should reduce job start time when the input doesn't have any indexes.

@pkallos

pkallos commented Mar 6, 2015

👍

@EugenCepoi
Contributor Author

Could one of the committers please take a look at this PR? When users know in advance that there are no indexes for the lzo files, this greatly improves job start time.

This, combined with using list status in parallel (mapreduce.input.fileinputformat.list-status.num-threads) can be an alternative solution to issue #426.

Another improvement when we do want to read indexes would be to read the indexes in parallel like FileInputFormat does in listStatus.
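A rough sketch of what reading indexes in parallel could look like, assuming a fixed-size thread pool. This is plain Java rather than Hadoop code: `ParallelIndexReader`, `readAll`, and the `readIndex` callback are hypothetical names standing in for whatever actually loads a `.lzo.index` file.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class ParallelIndexReader {
    /**
     * Reads the index of every lzo file on a pool of numThreads workers.
     * readIndex is a hypothetical callback that receives the index path
     * (lzo path + ".index") and returns the loaded index.
     */
    static <T> Map<String, T> readAll(List<String> lzoFiles,
                                      Function<String, T> readIndex,
                                      int numThreads) {
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        try {
            // Submit all reads first so they run concurrently.
            Map<String, Future<T>> pending = new LinkedHashMap<>();
            for (String lzo : lzoFiles) {
                Callable<T> task = () -> readIndex.apply(lzo + ".index");
                pending.put(lzo, pool.submit(task));
            }
            // Collect results in the original input order.
            Map<String, T> indexes = new LinkedHashMap<>();
            for (Map.Entry<String, Future<T>> e : pending.entrySet()) {
                indexes.put(e.getKey(), e.getValue().get());
            }
            return indexes;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while reading indexes", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("index read failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

The pool size plays the same role here as the list-status thread count mentioned above; a real implementation would read it from the job configuration.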

@gerashegalov
Contributor

I think getting the index files and the lzo files themselves in a single fetch/listStatus RPC with an appropriate path filter would be the most elegant and effective improvement. Then you can just hash the index files into a map for quick lookups.
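To make the idea concrete, here is a minimal sketch in plain Java. It assumes the directory listing has already been fetched in one call and is represented as a list of path strings; `LzoIndexLookup` and `pairWithIndexes` are hypothetical names, not part of any existing API, and a set is used as the hash structure since only presence is checked.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class LzoIndexLookup {
    static final String LZO_SUFFIX = ".lzo";
    static final String INDEX_SUFFIX = ".lzo.index";

    /**
     * Takes the result of a single listing (paths as strings, an assumption
     * made to keep the sketch free of Hadoop types) and maps each lzo file
     * to whether an index file for it appeared in the same listing.
     */
    static Map<String, Boolean> pairWithIndexes(List<String> listing) {
        Set<String> indexFiles = new HashSet<>();
        List<String> lzoFiles = new ArrayList<>();
        for (String path : listing) {
            if (path.endsWith(INDEX_SUFFIX)) {
                indexFiles.add(path);        // hashed for O(1) lookups
            } else if (path.endsWith(LZO_SUFFIX)) {
                lzoFiles.add(path);
            }                                // anything else is filtered out
        }
        Map<String, Boolean> hasIndex = new LinkedHashMap<>();
        for (String lzo : lzoFiles) {
            // No extra filesystem call per file: just a set lookup.
            hasIndex.put(lzo, indexFiles.contains(lzo + ".index"));
        }
        return hasIndex;
    }

    public static void main(String[] args) {
        List<String> listing = Arrays.asList(
            "in/a.lzo", "in/a.lzo.index", "in/b.lzo", "in/_SUCCESS");
        System.out.println(pairWithIndexes(listing));
    }
}
```

With this shape, inputs without any index files pay nothing beyond the one listing, since no per-file existence check is issued.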

Disclaimer: I am not a committer

@pkallos

pkallos commented Mar 13, 2015

> I think getting the index files and the lzo files themselves in a single fetch/listStatus RPC with an appropriate path filter would be the most elegant and effective improvement. Then you can just hash the index files into a map for quick lookups.

that sounds like a changeset with a much higher blast radius :)

@EugenCepoi
Contributor Author

@gerashegalov Yes, it would be a more general solution that improves index lookup. But here the goal is really just to disable index lookup/reading when we know in advance that there are no indexes.

@gerashegalov
Contributor

@EugenCepoi If we list *.{lzo,lzo.index} in a single call, there will be no performance advantage to disabling index lookups, so we can avoid an unnecessary tuning key. The case where there are index files gets faster as well.

@gerashegalov
Contributor

I implemented my suggestion in #434

EugenCepoi closed this on May 29, 2015