Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ORC-1458: Reduce HDFS NameNode getFileInfo RPC #1767

Closed
wants to merge 5 commits into from

Conversation

cxzl25
Copy link
Contributor

@cxzl25 cxzl25 commented Jan 22, 2024

What changes were proposed in this pull request?

If the file in HDFS is in a completed state, avoid calling the HDFS getFileInfo RPC.

Provide orc.file.length.fast configuration to enable this behavior.

Why are the changes needed?

Now reading an ORC file in HDFS will generate at least one open and getFileInfo RPC. This optimization can remove getFileInfo RPC as much as possible, improve ORC reading efficiency, and reduce the number of HDFS RPCs.

How was this patch tested?

The production environment has been running stably for several months

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the JAVA label Jan 22, 2024

public class OrcInputStreamUtil {

private static final String DFS_CLASS = "org.apache.hadoop.hdfs.DFSInputStream";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we reflect this when Apache ORC can access this directly?

$ javap -cp hadoop-client-api-3.3.6.jar org.apache.hadoop.hdfs.DFSInputStream | head -n3
Compiled from "DFSInputStream.java"
public class org.apache.hadoop.hdfs.DFSInputStream extends org.apache.hadoop.fs.FSInputStream implements org.apache.hadoop.fs.ByteBufferReadable,org.apache.hadoop.fs.CanSetDropBehind,org.apache.hadoop.fs.CanSetReadahead,org.apache.hadoop.fs.HasEnhancedByteBufferAccess,org.apache.hadoop.fs.CanUnbuffer,org.apache.hadoop.fs.StreamCapabilities,org.apache.hadoop.fs.ByteBufferPositionedReadable {
  public static boolean tcpReadsDisabledForTesting;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

org.apache.hadoop.hdfs.DFSInputStream.shortCircuitForbidden Method is not public.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before ORC-1430, we could not access hdfs-related classes in the core module, so I used reflection.
Now I have done some refactoring, but the method of shortCircuitForbidden is not public, so I still need to use reflection.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a high-risky way which Apache Hadoop community is also not recommended.

May I ask if you have any reference in ASF communities which adopts this hack, @cxzl25 ?

@dongjoon-hyun
Copy link
Member

Gentle ping, @cxzl25 .

@cxzl25
Copy link
Contributor Author

cxzl25 commented Jan 24, 2024

This is a high-risky way which Apache Hadoop community is also not recommended.

May I ask if you have any reference in ASF communities which adopts this hack

@Hexiaoqiao Can you help review this PR?

@Hexiaoqiao
Copy link

Thanks involve me here. I have left my concerns here, I would like to reassert: I believe this could reduce load to NameNode when using ORC on HDFS but I am worried it involve any potential issues from HDFS view.

Back to this PR, IIUC this is used only for extract footer, right? Technically, for one immutable file, the result from DFSInputStream#getFileLength() and FileStatus#getLen() will be same, but it is not one recommended usage as @dongjoon-hyun mentioned above.

Thanks again.

@dongjoon-hyun dongjoon-hyun modified the milestone: 2.1.0 Jan 5, 2025
@dongjoon-hyun
Copy link
Member

Happy New Year. Do you still want to introduce this feature in Apache ORC for your use case, @cxzl25 ?

@cxzl25
Copy link
Contributor Author

cxzl25 commented Jan 6, 2025

Happy New Year. @dongjoon-hyun

Do you still want to introduce this feature in Apache ORC for your use case

Let me close this PR first, this is a bit tricky, BTW S3 has a similar problem.

@cxzl25 cxzl25 closed this Jan 6, 2025
@dongjoon-hyun
Copy link
Member

Thank you for closing, @cxzl25 .

For the following, could you file a JIRA Issue too? If we are aware of issues, the community can make alternative solution .

BTW S3 has a similar problem.

@cxzl25
Copy link
Contributor Author

cxzl25 commented Jan 21, 2025

For the following, could you file a JIRA Issue too? If we are aware of issues, the community can make alternative solution

Sorry for the late reply.

Created https://issues.apache.org/jira/browse/ORC-1845

@dongjoon-hyun
Copy link
Member

Thank you, @cxzl25 .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants