-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HIVE-28838: Remove sensitive jdbc properties from explain plan for JdbcStorageHandler tables #5709
base: master
Are you sure you want to change the base?
Conversation
616d043
to
d206e14
Compare
d206e14
to
8deff9e
Compare
private static final String[] FILTER_OUT_FROM_EXPLAIN = { | ||
TABLE_IS_CTAS, | ||
HIVE_SQL_JDBC_USERNAME, | ||
HIVE_SQL_JDBC_PASSWORD, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think removing password
is enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would disagree with that. I excluded the username, password, and the JDBC URL.
With the username, we can share some extra information about who is the admin.
And about the JDBC URL, it provides the proper path to the metastore database.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with you to remove all three.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When users query MySQL tables through Hive jdbc storage handler, users sometimes need to know which MySQL database the table comes from through JDBC URL. Users may execute show create table mysql_jdbc_tbl
to check the JDBC URL.
Therefore, I think it is necessary to retain the JDBC URL & username.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
describe formatted
/ show create table
also displays the password in plain text. I don't think it's covered here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Therefore, I think it is necessary to retain the JDBC URL & username.
agree, at least it should be available to the table owners/admins
PS: SYS
schema is accessible only to admins, while the INFORMATION_SCHEMA
is available to end users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
describe formatted
/show create table
also displays the password in plain text. I don't think it's covered here.
Yes, This PR mainly focuses on removing sensitive information in explain
. However, it does not handle the show create table
for jdbc storage handler tables. Currently, sensitive information can still be seen in show create table
.
For example, if you create a mysql jdbc handler table in Hive, and then show create table
to check its TBLPROPERTIES, you will still can see its password "hive.sql.dbcp.password" = "passwd"
.
CREATE EXTERNAL TABLE jdbc_mysql_table
(
id int
)
STORED BY 'org.apache.hive.storage.jdbc.JdbcStorageHandler'
TBLPROPERTIES (
"hive.sql.database.type" = "MYSQL",
"hive.sql.jdbc.driver" = "com.mysql.jdbc.Driver",
"hive.sql.jdbc.url" = "jdbc:mysql://127.0.0.1:3306/testdb",
"hive.sql.dbcp.username" = "root",
"hive.sql.dbcp.password" = "passwd",
"hive.sql.table" = "testmysql"
);
show create table
is the main use case that I am focusing on. BTW, we will check which MySQL database this table belongs to through the JDBC URL (hive.sql.jdbc.url)
. This is also the reason why I think the JDBC URL should be retained.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi,
This ticket came up because of an interesting behavior of explain extended in case of sys tables:
System tables are special kind of JdbcStorageHandler tables: they are created with JdbcStorageHandler but they are not containing a jdbc connection info. Instead of that, they are using the existing connection info provided in HMS configuration.
The interesting thing is that explain extended is a query. And for queries, Hive adds the jdbc connection info to the task (I assume because it is required to be able to run the query).
The questionable behavior is that at sys tables, why we share the JDO connection info provided in HMS config?
This PR addresses this issue. It is not about show create table or describe formatted at all. And, for regular JdbcStorageHandler tables, I would expect to see that information in show create table - as table is created with providing this info. And this is not the HMS metastore db connection info at all.
So, I'm open to have a discussion or a follow up ticket to rethink what kind of information we want to share and for what kind of users. But with this PR, I want to address the case when a simple explain statement can share the HMS database connection info.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I modified the qtest file to demonstrate that the problem doesn't exists for show create table or describe table - as it expected as the connection info is not part of the sys table creation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SYS tables could be queried only by privilege users/admins, so I don't see a security risk here.
However, it makes sense to exclude the connection info from the explain plan.
join sys.DBS d on t.DB_ID = d.DB_ID | ||
limit 1; | ||
|
||
explain extended |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we use a more simpler qtest instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would you prefer by a simpler qtest?
It only selects a constant. And the output is limited to one row.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, it is reasonable for the explain
case.
If we also need to fix the "show create table" use case, then only a simple SQL command "show create table" is enough.
8deff9e
to
ef3f64c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
ef3f64c
to
d3e057c
Compare
d3e057c
to
4e929fb
Compare
@@ -97,4 +97,10 @@ | |||
|
|||
public static final java.lang.String EXPECTED_PARAMETER_VALUE = "expected_parameter_value"; | |||
|
|||
public static final java.lang.String HIVE_SQL_JDBC_USERNAME = "hive.sql.dbcp.username"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to define separate var for every connection config?
Can't we pass hive.sql
regex to a filter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we pass hive.sql regex to a filter?
It is a great question: if we add a regex pattern to filter this out, there is a chance that we can filter out something in the future that is not implemented yet. And the person who implements it, won't even know about it at all. But on the other hand, there are no huge improvements at Hive - I mean, it looks like a complete software - so I see only a little chance that it will happen.
And also, thank you for this comment. I just now realized it is a thrift generated code. So it is definitely a bad idea to put a new constant here - it will be removed as thrift is re-generated. Let me find a proper place to store those constants.
4e929fb
to
9dd2bfc
Compare
@zhangbutao , @zratkai , @deniskuzZ , thank you for the reviews so far. Do I have any question that I didn't answer or needs more to discuss about? Thank you |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 LGTM
I would suggest optimizing the title, and emphasizing that this PR is for the sensitive information repair related to the explain type tasks of the sys JdbcStorageHandler
table.
private static final String[] FILTER_OUT_FROM_EXPLAIN = {TABLE_IS_CTAS}; | ||
private static final String[] FILTER_OUT_FROM_EXPLAIN = { | ||
TABLE_IS_CTAS, | ||
JDBC_USERNAME, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@InvisibleProgrammer, what about
public Map<String, String> getTblPropsExplain() { // only for displaying plan
return PlanUtils.getPropertiesForExplain(tblProps,
hive_metastoreConstants.TABLE_IS_CTAS,
hive_metastoreConstants.TABLE_BUCKETING_VERSION);
}
it covers explain CTAS query
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the first read, it looks like this is in the class, CreateTableLikeDesc
. So it has a different explain operator then everything else. It is good to know about it. Let me check how it behaves for sys tables.
Thank you for pointing it out.
9dd2bfc
to
0e4a2de
Compare
|
It removes the following sensitive properties from tables created with JdbcStorageHandler:
Why are the changes needed?
Explain extended can cause a security issue if they query sys tables: Sys tables are created with JdbcStorageHandler and so that, the output of explain extended will contain sensitive jdbc connection info.
Does this PR introduce any user-facing change?
Yes, it removes the listed properties from explain extended output when a JdbcStorageHandler table is included in a query.
Is the change a dependency upgrade?
No
How was this patch tested?
QTest added: explain_systest_password.q