Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HIVE-28838: Remove sensitive jdbc properties from explain plan for JdbcStorageHandler tables #5709

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

InvisibleProgrammer
Copy link
Contributor

It removes the following sensitive properties from tables created with JdbcStorageHandler:

  • hive.sql.dbcp.username
  • hive.sql.dbcp.password
  • hive.sql.jdbc.url

Why are the changes needed?

Explain extended can cause a security issue if they query sys tables: Sys tables are created with JdbcStorageHandler and so that, the output of explain extended will contain sensitive jdbc connection info.

Does this PR introduce any user-facing change?

Yes, it removes the listed properties from explain extended output when a JdbcStorageHandler table is included in a query.

Is the change a dependency upgrade?

No

How was this patch tested?

QTest added: explain_systest_password.q

private static final String[] FILTER_OUT_FROM_EXPLAIN = {
TABLE_IS_CTAS,
HIVE_SQL_JDBC_USERNAME,
HIVE_SQL_JDBC_PASSWORD,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think removing password is enough.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would disagree with that. I excluded the username, password, and the JDBC URL.
With the username, we can share some extra information about who is the admin.
And about the JDBC URL, it provides the proper path to the metastore database.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with you to remove all three.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When users query MySQL tables through Hive jdbc storage handler, users sometimes need to know which MySQL database the table comes from through JDBC URL. Users may execute show create table mysql_jdbc_tbl to check the JDBC URL.

Therefore, I think it is necessary to retain the JDBC URL & username.

Copy link
Member

@deniskuzZ deniskuzZ Mar 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

describe formatted / show create table also displays the password in plain text. I don't think it's covered here.

Copy link
Member

@deniskuzZ deniskuzZ Mar 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Therefore, I think it is necessary to retain the JDBC URL & username.

agree, at least it should be available to the table owners/admins

PS: SYS schema is accessible only to admins, while the INFORMATION_SCHEMA is available to end users.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

describe formatted / show create table also displays the password in plain text. I don't think it's covered here.

Yes, This PR mainly focuses on removing sensitive information in explain. However, it does not handle the show create table for jdbc storage handler tables. Currently, sensitive information can still be seen in show create table.

For example, if you create a mysql jdbc handler table in Hive, and then show create table to check its TBLPROPERTIES, you will still can see its password "hive.sql.dbcp.password" = "passwd".

CREATE EXTERNAL TABLE jdbc_mysql_table
(
 id int
)
STORED BY 'org.apache.hive.storage.jdbc.JdbcStorageHandler'
TBLPROPERTIES (
                "hive.sql.database.type" = "MYSQL",
                "hive.sql.jdbc.driver" = "com.mysql.jdbc.Driver",
                "hive.sql.jdbc.url" = "jdbc:mysql://127.0.0.1:3306/testdb",
                "hive.sql.dbcp.username" = "root",
                "hive.sql.dbcp.password" = "passwd",
                "hive.sql.table" = "testmysql"
);

show create table is the main use case that I am focusing on. BTW, we will check which MySQL database this table belongs to through the JDBC URL (hive.sql.jdbc.url). This is also the reason why I think the JDBC URL should be retained.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi,

This ticket came up because of an interesting behavior of explain extended in case of sys tables:
System tables are special kind of JdbcStorageHandler tables: they are created with JdbcStorageHandler but they are not containing a jdbc connection info. Instead of that, they are using the existing connection info provided in HMS configuration.

The interesting thing is that explain extended is a query. And for queries, Hive adds the jdbc connection info to the task (I assume because it is required to be able to run the query).
The questionable behavior is that at sys tables, why we share the JDO connection info provided in HMS config?

This PR addresses this issue. It is not about show create table or describe formatted at all. And, for regular JdbcStorageHandler tables, I would expect to see that information in show create table - as table is created with providing this info. And this is not the HMS metastore db connection info at all.

So, I'm open to have a discussion or a follow up ticket to rethink what kind of information we want to share and for what kind of users. But with this PR, I want to address the case when a simple explain statement can share the HMS database connection info.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I modified the qtest file to demonstrate that the problem doesn't exists for show create table or describe table - as it expected as the connection info is not part of the sys table creation.

Copy link
Member

@deniskuzZ deniskuzZ Mar 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SYS tables could be queried only by privilege users/admins, so I don't see a security risk here.
However, it makes sense to exclude the connection info from the explain plan.

join sys.DBS d on t.DB_ID = d.DB_ID
limit 1;

explain extended
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we use a more simpler qtest instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would you prefer by a simpler qtest?

It only selects a constant. And the output is limited to one row.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, it is reasonable for the explain case.

If we also need to fix the "show create table" use case, then only a simple SQL command "show create table" is enough.

Copy link
Contributor

@zratkai zratkai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@InvisibleProgrammer InvisibleProgrammer force-pushed the HIVE-28838_Remove_jdbc_properties_from_jdbc_tables branch from ef3f64c to d3e057c Compare March 27, 2025 19:22
@InvisibleProgrammer InvisibleProgrammer force-pushed the HIVE-28838_Remove_jdbc_properties_from_jdbc_tables branch from d3e057c to 4e929fb Compare March 27, 2025 19:23
@@ -97,4 +97,10 @@

public static final java.lang.String EXPECTED_PARAMETER_VALUE = "expected_parameter_value";

public static final java.lang.String HIVE_SQL_JDBC_USERNAME = "hive.sql.dbcp.username";
Copy link
Member

@deniskuzZ deniskuzZ Mar 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to define separate var for every connection config?
Can't we pass hive.sql regex to a filter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we pass hive.sql regex to a filter?

It is a great question: if we add a regex pattern to filter this out, there is a chance that we can filter out something in the future that is not implemented yet. And the person who implements it, won't even know about it at all. But on the other hand, there are no huge improvements at Hive - I mean, it looks like a complete software - so I see only a little chance that it will happen.

And also, thank you for this comment. I just now realized it is a thrift generated code. So it is definitely a bad idea to put a new constant here - it will be removed as thrift is re-generated. Let me find a proper place to store those constants.

@InvisibleProgrammer
Copy link
Contributor Author

@zhangbutao , @zratkai , @deniskuzZ , thank you for the reviews so far. Do I have any question that I didn't answer or needs more to discuss about?

Thank you

Copy link
Contributor

@zhangbutao zhangbutao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 LGTM

I would suggest optimizing the title, and emphasizing that this PR is for the sensitive information repair related to the explain type tasks of the sys JdbcStorageHandler table.

@InvisibleProgrammer InvisibleProgrammer changed the title HIVE-28838: Remove sensitive jdbc properties from JdbcStorageHandler tables HIVE-28838: Remove sensitive jdbc properties from explain plan for JdbcStorageHandler tables Apr 1, 2025
private static final String[] FILTER_OUT_FROM_EXPLAIN = {TABLE_IS_CTAS};
private static final String[] FILTER_OUT_FROM_EXPLAIN = {
TABLE_IS_CTAS,
JDBC_USERNAME,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@InvisibleProgrammer, what about

  public Map<String, String> getTblPropsExplain() { // only for displaying plan
    return PlanUtils.getPropertiesForExplain(tblProps,
      hive_metastoreConstants.TABLE_IS_CTAS,
      hive_metastoreConstants.TABLE_BUCKETING_VERSION);
  }

it covers explain CTAS query

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the first read, it looks like this is in the class, CreateTableLikeDesc. So it has a different explain operator then everything else. It is good to know about it. Let me check how it behaves for sys tables.
Thank you for pointing it out.

@InvisibleProgrammer InvisibleProgrammer force-pushed the HIVE-28838_Remove_jdbc_properties_from_jdbc_tables branch from 9dd2bfc to 0e4a2de Compare April 3, 2025 07:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants