Data Server: Database level user table isolation #224

Open
stuartlynn opened this issue Nov 25, 2022 · 11 comments

@stuartlynn (Collaborator)

Currently we have dataset assets protected at the API level through our permissions setup

#[derive(Debug, Queryable, Deserialize, Serialize, Insertable)]
pub struct Permission {
    pub id: Uuid,
    pub user_id: Uuid,
    pub resource_id: Uuid,
    pub permission: String,
    pub resource_type: String,
    pub created_at: NaiveDateTime,
    pub updated_at: NaiveDateTime,
}

This allows us to check a user against a resource at the route and model level to determine access to said resource.

This works well at that level, but we still have the problem of access at the database level.

For example, let's say a user, User1, uploads a dataset, Dataset1, and marks it as private. The dataset gets loaded into a table dataset_table. When User2 tries to query that table through the API, or requests tiles from it, the permissions system will reject them for not having the correct permissions.

This doesn't, however, stop User2 from running a query directly against dataset_table. For example

select * from dataset_table

will still execute when run by User2, even though User2 doesn't have permission at the API level, because we currently have no way of enforcing permissions at the database level.

Indeed, right now there is only a single user at the database level: the root db user, which has access to everything.

There are a couple of ways we can resolve this issue.

Essentially what we need to do is create a db user for each user on signup and then run any query as that user. Then, when permissions change at the API level, we make sure to reflect those permissions at the database level.
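The signup step could boil down to generating a role-creation statement for the data database. A minimal sketch, assuming PostgreSQL and assuming we derive the role name from the application username (the function names here are illustrative, not part of the codebase):

```rust
// Hypothetical sketch of generating the DDL for a per-user database role.
// Identifier and literal quoting guard against injection via the username
// or password, following PostgreSQL's quoting rules.
fn quote_ident(ident: &str) -> String {
    // Double any embedded double quotes, then wrap in double quotes.
    format!("\"{}\"", ident.replace('"', "\"\""))
}

fn quote_literal(lit: &str) -> String {
    // Double any embedded single quotes, then wrap in single quotes.
    format!("'{}'", lit.replace('\'', "''"))
}

fn create_user_role_sql(username: &str, password: &str) -> String {
    format!(
        "CREATE ROLE {} LOGIN PASSWORD {} NOSUPERUSER NOCREATEDB NOCREATEROLE;",
        quote_ident(username),
        quote_literal(password)
    )
}

fn main() {
    let sql = create_user_role_sql("user1", "s3cret");
    assert_eq!(
        sql,
        "CREATE ROLE \"user1\" LOGIN PASSWORD 's3cret' NOSUPERUSER NOCREATEDB NOCREATEROLE;"
    );
    println!("{sql}");
}
```

The generated credentials would be stored privately alongside the application user record so later connections (or role switches) can use them.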

Managing access

Focusing on the PostgreSQL Data Service for now (we will need to find a way to generalize this to other backends in the future), there are a couple of potential ways to do this.

Approach one

We retain the root user, but before we run any query we use PostgreSQL's SET ROLE command to change the user scope to the user who is running the query.

This could potentially be achieved at the QueryBuilder level where we insert the SET ROLE directive before the rest of the query is built.
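A minimal sketch of what that QueryBuilder change might produce (the function name is hypothetical; using SET LOCAL ROLE inside a transaction means the role reverts automatically at COMMIT):

```rust
// Hypothetical sketch of prepending a role switch to a built query.
// SET LOCAL ROLE only lasts for the current transaction, so wrapping the
// query in BEGIN/COMMIT reverts the role automatically afterwards.
fn scope_query_to_user(db_role: &str, inner_sql: &str) -> String {
    // Quote the role name per PostgreSQL identifier rules.
    let quoted = format!("\"{}\"", db_role.replace('"', "\"\""));
    format!("BEGIN; SET LOCAL ROLE {quoted}; {inner_sql}; COMMIT;")
}

fn main() {
    let sql = scope_query_to_user("user1", "select * from dataset_table");
    assert_eq!(
        sql,
        "BEGIN; SET LOCAL ROLE \"user1\"; select * from dataset_table; COMMIT;"
    );
    println!("{sql}");
}
```

Note this sketch does nothing about the con discussed below: the inner SQL could itself contain SET ROLE or RESET ROLE.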

Pros

  • This means we can retain a common thread pool to the database, as we don't need to reconnect with a new user-scoped connection for each query (which might overload the database's active connections for rapid multiple queries, like requesting MVT tiles).

  • It means we only have to modify the query builder and not the connection logic.

Cons

  • I am not sure how secure this actually is. For example, if a user runs a raw query against the database, can they include a second SET ROLE command to switch to another user before executing commands? If so, is there a good way to protect against this?

  • Building this at the QueryBuilder level would require implementations for other backends to also build the user permissions in at query execution time rather than at connection time (which is likely to be more common).

Approach two

Instead of trying to change the user at query time, we simply do so at connection time. In this case, each API request would need to establish a connection to the database with the user's credentials and use that to execute the query.

It seems to me the easiest way to do this would be through middleware. Currently the database connection is passed to our route handlers through the global state object. For example, in this route we get access to the metadata database connection (state.db) and the data table database connection (state.data_db) through the state object.

async fn get_column(
    state: web::Data<State>,
    Path(source): Path<Source>,
    Path(column_name): Path<ColName>,
    web::Query(query_params): web::Query<HashMap<String, serde_json::Value>>,
    web::Query(query_str): web::Query<QueryString>,
    logged_in_user: AuthService,
) -> Result<HttpResponse, ServiceError> {
    let user = User::from_token(&state.db, &logged_in_user.user);
    let mut query = query_for_source(&state.db, &source, query_str.q, query_params).await?;
    if let Some(user) = user {
        query.user(user);
    }
    let columns = query.get_columns(&state.data_db).await?;
    let column = columns.iter().find(|c| c.name == column_name.column_name);
    Ok(HttpResponse::Ok().json(column))
}

We could change this to instead get a connection through a middleware service:

async fn get_column(
    state: web::Data<State>,
    Path(source): Path<Source>,
    Path(column_name): Path<ColName>,
    web::Query(query_params): web::Query<HashMap<String, serde_json::Value>>,
    web::Query(query_str): web::Query<QueryString>,
    logged_in_user: AuthService,
    // New connection service: provides a connection scoped to the current user,
    // or, if no user is logged in (i.e. this is an anonymous query), scoped to
    // the public user
    user_scoped_connection: ConnectionService,
) -> Result<HttpResponse, ServiceError> {
    let user = User::from_token(&state.db, &logged_in_user.user);
    let mut query = query_for_source(&state.db, &source, query_str.q, query_params).await?;
    if let Some(user) = user {
        query.user(user);
    }
    let columns = query.get_columns(&user_scoped_connection).await?;
    let column = columns.iter().find(|c| c.name == column_name.column_name);
    Ok(HttpResponse::Ok().json(column))
}

This ConnectionService middleware would need to:

  1. Grab the current user from the session (if there is one)
  2. Look up that user's connection credentials from the users table (we would keep these private)
  3. Establish a connection to the DB as that user
  4. Return the connection
  5. Retire that connection once the route has completed. This could probably be done using the Drop trait, if it's not already implemented for the connection pool's connections
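The steps above, particularly the Drop-based retirement in step 5, can be sketched without any framework machinery. Everything here is a stand-in (a real implementation would wrap e.g. an sqlx connection and implement actix's extractor traits); the point is only the ownership pattern:

```rust
use std::sync::{Arc, Mutex};

// Stand-in for a real database connection, tagged with the user it is scoped to.
struct Conn {
    user: String,
}

// Shared registry of open connections, so we can observe retirement.
type Registry = Arc<Mutex<Vec<String>>>;

struct UserScopedConnection {
    conn: Option<Conn>,
    registry: Registry,
}

impl UserScopedConnection {
    // Steps 1-4: given the resolved user, establish and return a scoped connection.
    fn connect(user: &str, registry: Registry) -> Self {
        registry.lock().unwrap().push(user.to_string());
        UserScopedConnection {
            conn: Some(Conn { user: user.to_string() }),
            registry,
        }
    }

    fn user(&self) -> &str {
        &self.conn.as_ref().unwrap().user
    }
}

// Step 5: retire the connection when the route handler finishes, via Drop.
impl Drop for UserScopedConnection {
    fn drop(&mut self) {
        let user = self.conn.take().unwrap().user;
        self.registry.lock().unwrap().retain(|u| u != &user);
    }
}

fn main() {
    let registry: Registry = Arc::new(Mutex::new(Vec::new()));
    {
        let conn = UserScopedConnection::connect("user1", registry.clone());
        assert_eq!(conn.user(), "user1");
        assert_eq!(registry.lock().unwrap().len(), 1);
    } // conn dropped here: the connection is retired
    assert_eq!(registry.lock().unwrap().len(), 0);
}
```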

Pros

  • This feels like a more generic approach that could generalize over multiple backend stores more easily.
  • This feels more secure, as we are connecting to the database as a limited user and can rely on the backend's own permission system more generally.

Cons

  • We would need to find a way to manage the connection pool such that it doesn't overwhelm the database. Some thoughts here are that we could cache connections scoped to users and reuse connections at a user level. I am not sure how we would then "load balance" the connections across active users.
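One way the per-user caching idea might look, with a crude global cap standing in for real load balancing (all names and the eviction policy are assumptions for illustration):

```rust
use std::collections::HashMap;

// Stand-in for a real database connection.
struct Conn;

// Hypothetical sketch: cache idle connections per user, with a global cap
// so we don't overwhelm the database.
struct UserPoolCache {
    max_total: usize,
    pools: HashMap<String, Vec<Conn>>,
}

impl UserPoolCache {
    fn new(max_total: usize) -> Self {
        UserPoolCache { max_total, pools: HashMap::new() }
    }

    // Total idle connections currently cached across all users.
    fn total(&self) -> usize {
        self.pools.values().map(|v| v.len()).sum()
    }

    // Reuse an idle connection for this user if one exists; otherwise open a
    // new one, evicting another user's idle connection when at the cap.
    fn checkout(&mut self, user: &str) -> Conn {
        if let Some(pool) = self.pools.get_mut(user) {
            if let Some(conn) = pool.pop() {
                return conn;
            }
        }
        if self.total() >= self.max_total {
            if let Some(pool) = self.pools.values_mut().find(|p| !p.is_empty()) {
                pool.pop(); // close an idle connection belonging to another user
            }
        }
        Conn // in reality: establish a new connection as `user` here
    }

    fn checkin(&mut self, user: &str, conn: Conn) {
        self.pools.entry(user.to_string()).or_default().push(conn);
    }
}

fn main() {
    let mut cache = UserPoolCache::new(2);
    let c1 = cache.checkout("user1");
    cache.checkin("user1", c1);
    assert_eq!(cache.total(), 1);
    let c2 = cache.checkout("user1"); // reused: no new connection opened
    assert_eq!(cache.total(), 0);
    cache.checkin("user1", c2);
}
```

The open question from the bullet above remains: a fairer eviction policy (LRU across users, per-user quotas, etc.) would be needed in practice.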

Managing permissions

The other part of this is we need to manage changing permissions. So for example if User1 grants read access to User2 on a given table, we need to propagate that permission to the database.

This would require modifying the grant_permissions method

pub fn grant_permissions(
    db: &DbPool,
    user_id: Uuid,
    resource_id: Uuid,
    resource_type: ResourceType,
    permissions: Vec<PermissionType>,
) -> Result<(), ServiceError> {
    for permission in permissions {
        let new_permission = NewPermission {
            user_id,
            resource_id,
            resource_type: resource_type.clone(),
            permission,
        };
        Self::grant_permission(db, new_permission)?;
    }
    Ok(())
}

We would need to

  1. Pass a user-scoped connection to the dataset database through to the function
  2. Call a method on the DatasetProvider object for that backend to modify the database level permissions in line with the new permission object

This shouldn't be too hard but noting it here.
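For step 2, the backend's DatasetProvider would essentially translate each API-level permission into the equivalent database-level GRANT. A sketch of that mapping for PostgreSQL, where the enum variants and naming are assumptions rather than the project's actual types:

```rust
// Hypothetical sketch: map an API-level permission onto a database GRANT.
#[derive(Clone, Copy)]
enum PermissionType {
    Read,
    Write,
}

fn grant_sql(permission: PermissionType, table: &str, db_role: &str) -> String {
    let privilege = match permission {
        PermissionType::Read => "SELECT",
        PermissionType::Write => "INSERT, UPDATE, DELETE",
    };
    // Quote identifiers per PostgreSQL rules to guard against injection.
    format!(
        "GRANT {privilege} ON \"{}\" TO \"{}\";",
        table.replace('"', "\"\""),
        db_role.replace('"', "\"\"")
    )
}

fn main() {
    assert_eq!(
        grant_sql(PermissionType::Read, "dataset_table", "user2"),
        "GRANT SELECT ON \"dataset_table\" TO \"user2\";"
    );
    assert_eq!(
        grant_sql(PermissionType::Write, "dataset_table", "user2"),
        "GRANT INSERT, UPDATE, DELETE ON \"dataset_table\" TO \"user2\";"
    );
}
```

Revoking a permission would mirror this with REVOKE.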

Other considerations to note

  • Recapping the above, we will eventually need this to generalize to other backend connections. Currently we only have PostgreSQL but eventually want to expand to MongoDB, flat GeoParquet files, BigQuery, etc. So any solution should ideally keep that in mind.

  • We want to avoid performance hits with this approach as much as possible. The MVT tile endpoints especially need to be responsive.

  • Ideally this should be implemented in a way that allows the query builders, routes, etc. to be agnostic to the database permission layer.

@stuartlynn (Collaborator, Author)

For additional reference, the connection pool to the dataset database is set up at run time here

let data_db_connection_url = config.data_connection_string().unwrap();
let data_pool = PgPoolOptions::new()
    .max_connections(config.datadb.max_connections.unwrap_or(20))
    .connect(&data_db_connection_url)
    .await
    .expect("Failed to connect to DB");

We need to generalize this a bit to allow for different backend connections, but that's out of the scope of this ticket, I think.

@michplunkett (Contributor)

Is there a reason why we aren't taking a more fragmented approach to user roles and permissions? Perhaps I am missing something, but linking individual users to the database's roles feels like something that would limit us with regard to database flavors. I've generally worked with central role and permissions tables as opposed to relying on specially authorized connections. Relying on an authorized connection feels a little risky, in that if you could spoof a valid connection you could get A LOT more data than if you just did a simple check for role and permission. Perhaps this would be better to pitch in a meeting.

@michplunkett (Contributor)

michplunkett commented Nov 25, 2022

On a side note, I'd like to add some columns to our tables for future auditing and reporting purposes.

 #[derive(Debug, Queryable, Deserialize, Serialize, Insertable)] 
 pub struct Permission { 
     pub id: Uuid, 
     pub user_id: Uuid, 
     pub resource_id: Uuid, 
     pub permission: String, 
     pub resource_type: String, 
     pub created_at: NaiveDateTime, 
     pub updated_at: NaiveDateTime, 
 } 

Namely, I'd like to add created_by and updated_by columns. From an admin perspective, it's been important for me to know not only when things happened but who executed those commands. Any place we have the created_at, updated_at, and deleted_at columns, I'd like to have who executed the action listed as well. I don't believe those necessarily need to be returned to the front-end, but they should definitely be kept as records on the data layer.

@stuartlynn (Collaborator, Author)

Is there a reason why we aren't taking a more fragmented approach to user roles and permissions? Perhaps I am missing something, but linking individual users to the database's roles feels like something that would limit us with regard to database flavors. I've generally worked with central role and permissions tables as opposed to relying on specially authorized connections. Relying on an authorized connection feels a little risky, in that if you could spoof a valid connection you could get A LOT more data than if you just did a simple check for role and permission. Perhaps this would be better to pitch in a meeting.

I think it would be useful to have a session where we discuss the architecture here and talk through some options. I think we will need user-level permissions at the database level in some form, as we allow users to execute arbitrary SQL against the datasets they have permission to read. I think it might be hard to determine from that SQL exactly what tables are being accessed in order to do an out-of-database permissions check before the query is run.

Might be worth clarifying that we have two databases in the system currently. The first holds all the business logic for the application (Users, Permissions, etc.), while the second we use to actually store a user's uploaded datasets in tables and as an analytics engine using PostGIS. It's that second database that we need to somehow add permissions to. If there is a way we can do that without using database users and permissions at the database level, I am totally open to it. I'm just struggling to see how we could make that work.

@stuartlynn (Collaborator, Author)

On a side note, I'd like to add some columns to our tables for future auditing and reporting purposes.

 #[derive(Debug, Queryable, Deserialize, Serialize, Insertable)] 
 pub struct Permission { 
     pub id: Uuid, 
     pub user_id: Uuid, 
     pub resource_id: Uuid, 
     pub permission: String, 
     pub resource_type: String, 
     pub created_at: NaiveDateTime, 
     pub updated_at: NaiveDateTime, 
 } 

Namely, I'd like to add created_by and updated_by columns. From an admin perspective, it's been important for me to know not only when things happened but who executed those commands. Any place we have the created_at, updated_at, and deleted_at columns, I'd like to have who executed the action listed as well. I don't believe those necessarily need to be returned to the front-end, but they should definitely be kept as records on the data layer.

Yeah, not having updated_by and created_by in there is an oversight by me. Do you imagine the column tracking who updated / created / deleted the table would point to a given user in the system?

@michplunkett (Contributor)

Yeah, not having updated_by and created_by in there is an oversight by me. Do you imagine the column tracking who updated / created / deleted the table would point to a given user in the system?

Yup! Just as the user_id field should reference the users table, presuming we have one, so should the created_by, updated_by, and deleted_by fields. To do that in PostgreSQL would require statements like these:

created_by uuid NOT NULL REFERENCES users (id)
updated_by uuid REFERENCES users (id)
deleted_by uuid REFERENCES users (id)

@michplunkett (Contributor)

I think it would be useful to have a session where we discuss the architecture here and talk through some options.

Agreed! I feel like we could both type up 1,000 words easily on this topic and go back and forth, or just have a 15 minute meeting and come to an agreement relatively quickly. I should be available later on Tuesday, after 3 PM. Throw some time on my calendar and I'll make it happen!

@michplunkett (Contributor)

In regards to your original request, do we have a means of preventing certain commands (changing roles, deleting tables, etc.) other than just text parsing?

@michplunkett (Contributor)

In regards to your original request, do we have a means of preventing certain commands (changing roles, deleting tables, etc.) other than just text parsing?

I think we could do some text parsing of the injected SQL, looking for any instance of the keyword SET or any synonymous means of switching roles. The only thing I'd be scared of is that if there's another way of setting a role that we aren't aware of, we'd have opened ourselves up to a pretty big security gap.
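For concreteness, the naive version of that parsing might look like the sketch below, along with an example of exactly the kind of bypass being worried about here (the keyword list is an illustration, not exhaustive):

```rust
// Hypothetical sketch of a denylist scan for role-changing statements.
// As noted above, this is easy to bypass (SQL comments, quoting tricks,
// functions like set_config), so it should not be the only line of defence.
fn contains_role_change(sql: &str) -> bool {
    let lowered = sql.to_lowercase();
    ["set role", "set session authorization", "reset role"]
        .iter()
        .any(|kw| lowered.contains(kw))
}

fn main() {
    assert!(contains_role_change("SET ROLE admin; SELECT * FROM dataset_table"));
    assert!(!contains_role_change("select * from dataset_table"));
    // False negative: a SQL comment between the keywords defeats the denylist.
    assert!(!contains_role_change("SET/*x*/ROLE admin"));
}
```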

@michplunkett (Contributor)

The more I read over the two solutions, the more I like the second one. It feels generally more secure and far more contained. I don't like the concept of just leaving connections open once a user has initially connected to a DB.

We would need to find a way to manage the connection pool such that it doesn't overwhelm the database.

I could very well be off base here, but I am not sure this will be a significant problem. Given that this is an in-house solution, the application SHOULDN'T be getting hundreds of connection requests. We could also set the maximum threshold for connections per data source to a number that is well below the actual performance threshold for said source.

@stuartlynn (Collaborator, Author)

Great! I agree that the second solution here sounds best. I think it will make testing easier as well, as we can isolate the ConnectionService middleware and just test that.

@michplunkett a good next step might be to look at how the AuthService works as a template for building out this new service. That uses the FromRequest trait (https://docs.rs/actix-web/latest/actix_web/trait.FromRequest.html), but we might need to use the more advanced middleware setup here (https://actix.rs/docs/middleware) to be able to access the application state and get the connection settings for the DB.
