Data Server: Database level user table isolation #224

Open
stuartlynn opened this issue Nov 25, 2022 · 11 comments

@stuartlynn (Collaborator)

Currently we have dataset assets protected at the API level through our permissions setup

#[derive(Debug, Queryable, Deserialize, Serialize, Insertable)]
pub struct Permission {
    pub id: Uuid,
    pub user_id: Uuid,
    pub resource_id: Uuid,
    pub permission: String,
    pub resource_type: String,
    pub created_at: NaiveDateTime,
    pub updated_at: NaiveDateTime,
}

This allows us to check a user against a resource at the route and model level to determine access to said resource.

This works well at that level, but we still have the problem of access at the database level.

For example, let's say a user, User1, uploads a dataset, Dataset1, and marks it as private. The dataset gets loaded into a table dataset_table. When User2 tries to query that table through the API, or requests tiles from it, the permissions system will reject them for not having the correct permissions.

This doesn't, however, stop User2 from running a query directly against dataset_table. For example

select * from dataset_table

will still execute when run by User2, even though User2 doesn't have permission at the API level, because we currently have no way of enforcing permissions at the database level.

Indeed, right now there is only a single user at the database level: the root db user, which has access to everything.

There are a couple of ways we can resolve this issue.

Essentially what we need to do is create a db user for each user on signup and then run any query as that user. Then, when permissions change at the API level, we make sure to reflect those permissions at the database level.
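The signup step could boil down to generating a role-creation statement for the data database. A minimal sketch, assuming PostgreSQL and assuming we derive the role name from the application username (the function names here are illustrative, not part of the codebase):

```rust
// Hypothetical sketch of generating the DDL for a per-user database role.
// Identifier and literal quoting guard against injection via the username
// or password, following PostgreSQL's quoting rules.
fn quote_ident(ident: &str) -> String {
    // Double any embedded double quotes, then wrap in double quotes.
    format!("\"{}\"", ident.replace('"', "\"\""))
}

fn quote_literal(lit: &str) -> String {
    // Double any embedded single quotes, then wrap in single quotes.
    format!("'{}'", lit.replace('\'', "''"))
}

fn create_user_role_sql(username: &str, password: &str) -> String {
    format!(
        "CREATE ROLE {} LOGIN PASSWORD {} NOSUPERUSER NOCREATEDB NOCREATEROLE;",
        quote_ident(username),
        quote_literal(password)
    )
}

fn main() {
    let sql = create_user_role_sql("user1", "s3cret");
    assert_eq!(
        sql,
        "CREATE ROLE \"user1\" LOGIN PASSWORD 's3cret' NOSUPERUSER NOCREATEDB NOCREATEROLE;"
    );
    println!("{sql}");
}
```

The generated credentials would be stored privately alongside the application user record so later connections (or role switches) can use them.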

Managing access

Focusing on the PostgreSQL Data Service for now (we will need to find a way to generalize this to other backends in the future), there are a couple of potential ways to do this.

Approach one

We retain the root user, but before we run any query we use PostgreSQL's SET ROLE command to change the user scope to the user who is running the query.

This could potentially be achieved at the QueryBuilder level where we insert the SET ROLE directive before the rest of the query is built.
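A minimal sketch of what that QueryBuilder change might produce (the function name is hypothetical; using SET LOCAL ROLE inside a transaction means the role reverts automatically at COMMIT):

```rust
// Hypothetical sketch of prepending a role switch to a built query.
// SET LOCAL ROLE only lasts for the current transaction, so wrapping the
// query in BEGIN/COMMIT reverts the role automatically afterwards.
fn scope_query_to_user(db_role: &str, inner_sql: &str) -> String {
    // Quote the role name per PostgreSQL identifier rules.
    let quoted = format!("\"{}\"", db_role.replace('"', "\"\""));
    format!("BEGIN; SET LOCAL ROLE {quoted}; {inner_sql}; COMMIT;")
}

fn main() {
    let sql = scope_query_to_user("user1", "select * from dataset_table");
    assert_eq!(
        sql,
        "BEGIN; SET LOCAL ROLE \"user1\"; select * from dataset_table; COMMIT;"
    );
    println!("{sql}");
}
```

Note this sketch does nothing about the con discussed below: the inner SQL could itself contain SET ROLE or RESET ROLE.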

Pros

  • This means we can retain a common thread pool to the database, as we don't need to reconnect with a new user-scoped connection for each query (which might overload the database's active connections for rapid multiple queries, like requesting MVT tiles).

  • It means we only have to modify the query builder and not the connection logic.

Cons

  • I am not sure how secure this actually is. For example, if a user runs a raw query against the database, can they include a second SET ROLE command to switch to another user before executing commands? If so, is there a good way to protect against this?

  • Building this at the QueryBuilder level would require implementations for other backends to also build the user permissions in at query execution time rather than at connection time (which is likely to be more common).

Approach two

Instead of trying to change the user at query time, we simply do so at connection time. In this case, each API request would need to establish a connection to the database with the user's credentials and use that to execute the query.

It seems to me the easiest way to do this would be through middleware. Currently the database connection is passed to our route handlers through the global state object. For example, in this route we get access to the metadata database connection (state.db) and the data table database connection (state.data_db) through the state object.

async fn get_column(
    state: web::Data<State>,
    Path(source): Path<Source>,
    Path(column_name): Path<ColName>,
    web::Query(query_params): web::Query<HashMap<String, serde_json::Value>>,
    web::Query(query_str): web::Query<QueryString>,
    logged_in_user: AuthService,
) -> Result<HttpResponse, ServiceError> {
    let user = User::from_token(&state.db, &logged_in_user.user);
    let mut query = query_for_source(&state.db, &source, query_str.q, query_params).await?;
    if let Some(user) = user {
        query.user(user);
    }
    let columns = query.get_columns(&state.data_db).await?;
    let column = columns.iter().find(|c| c.name == column_name.column_name);
    Ok(HttpResponse::Ok().json(column))
}

We could change this to instead get a connection through a middleware service:

async fn get_column(
    state: web::Data<State>,
    Path(source): Path<Source>,
    Path(column_name): Path<ColName>,
    web::Query(query_params): web::Query<HashMap<String, serde_json::Value>>,
    web::Query(query_str): web::Query<QueryString>,
    logged_in_user: AuthService,
    // New connection service: provides a connection scoped to the current user,
    // or, if no user is logged in (i.e. this is an anonymous query), scoped to
    // the public user
    user_scoped_connection: ConnectionService,
) -> Result<HttpResponse, ServiceError> {
    let user = User::from_token(&state.db, &logged_in_user.user);
    let mut query = query_for_source(&state.db, &source, query_str.q, query_params).await?;
    if let Some(user) = user {
        query.user(user);
    }
    let columns = query.get_columns(&user_scoped_connection).await?;
    let column = columns.iter().find(|c| c.name == column_name.column_name);
    Ok(HttpResponse::Ok().json(column))
}

This ConnectionService middleware would need to:

  1. Grab the current user from the session (if there is one)
  2. Look up that user's connection credentials from the users table (we would keep these private)
  3. Establish a connection to the DB as that user
  4. Return the connection
  5. Retire that connection once the route has completed. This could probably be done using the Drop trait, if it's not already implemented for the connection pool's connections
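The steps above, particularly the Drop-based retirement in step 5, can be sketched without any framework machinery. Everything here is a stand-in (a real implementation would wrap e.g. an sqlx connection and implement actix's extractor traits); the point is only the ownership pattern:

```rust
use std::sync::{Arc, Mutex};

// Stand-in for a real database connection, tagged with the user it is scoped to.
struct Conn {
    user: String,
}

// Shared registry of open connections, so we can observe retirement.
type Registry = Arc<Mutex<Vec<String>>>;

struct UserScopedConnection {
    conn: Option<Conn>,
    registry: Registry,
}

impl UserScopedConnection {
    // Steps 1-4: given the resolved user, establish and return a scoped connection.
    fn connect(user: &str, registry: Registry) -> Self {
        registry.lock().unwrap().push(user.to_string());
        UserScopedConnection {
            conn: Some(Conn { user: user.to_string() }),
            registry,
        }
    }

    fn user(&self) -> &str {
        &self.conn.as_ref().unwrap().user
    }
}

// Step 5: retire the connection when the route handler finishes, via Drop.
impl Drop for UserScopedConnection {
    fn drop(&mut self) {
        let user = self.conn.take().unwrap().user;
        self.registry.lock().unwrap().retain(|u| u != &user);
    }
}

fn main() {
    let registry: Registry = Arc::new(Mutex::new(Vec::new()));
    {
        let conn = UserScopedConnection::connect("user1", registry.clone());
        assert_eq!(conn.user(), "user1");
        assert_eq!(registry.lock().unwrap().len(), 1);
    } // conn dropped here: the connection is retired
    assert_eq!(registry.lock().unwrap().len(), 0);
}
```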

Pros

  • This feels like a more generic approach that could generalize over multiple backend stores more easily.
  • This feels more secure, as we are connecting to the database as a limited user and can rely on the backend's own permission system more generally.

Cons

  • We would need to find a way to manage the connection pool such that it doesn't overwhelm the database. Some thoughts here are that we could cache connections scoped to users and reuse connections at a user level. I am not sure how we would then "load balance" the connections across active users.
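One way the per-user caching idea might look, with a crude global cap standing in for real load balancing (all names and the eviction policy are assumptions for illustration):

```rust
use std::collections::HashMap;

// Stand-in for a real database connection.
struct Conn;

// Hypothetical sketch: cache idle connections per user, with a global cap
// so we don't overwhelm the database.
struct UserPoolCache {
    max_total: usize,
    pools: HashMap<String, Vec<Conn>>,
}

impl UserPoolCache {
    fn new(max_total: usize) -> Self {
        UserPoolCache { max_total, pools: HashMap::new() }
    }

    // Total idle connections currently cached across all users.
    fn total(&self) -> usize {
        self.pools.values().map(|v| v.len()).sum()
    }

    // Reuse an idle connection for this user if one exists; otherwise open a
    // new one, evicting another user's idle connection when at the cap.
    fn checkout(&mut self, user: &str) -> Conn {
        if let Some(pool) = self.pools.get_mut(user) {
            if let Some(conn) = pool.pop() {
                return conn;
            }
        }
        if self.total() >= self.max_total {
            if let Some(pool) = self.pools.values_mut().find(|p| !p.is_empty()) {
                pool.pop(); // close an idle connection belonging to another user
            }
        }
        Conn // in reality: establish a new connection as `user` here
    }

    fn checkin(&mut self, user: &str, conn: Conn) {
        self.pools.entry(user.to_string()).or_default().push(conn);
    }
}

fn main() {
    let mut cache = UserPoolCache::new(2);
    let c1 = cache.checkout("user1");
    cache.checkin("user1", c1);
    assert_eq!(cache.total(), 1);
    let c2 = cache.checkout("user1"); // reused: no new connection opened
    assert_eq!(cache.total(), 0);
    cache.checkin("user1", c2);
}
```

The open question from the bullet above remains: a fairer eviction policy (LRU across users, per-user quotas, etc.) would be needed in practice.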

Managing permissions

The other part of this is we need to manage changing permissions. So for example if User1 grants read access to User2 on a given table, we need to propagate that permission to the database.

This would require modifying the grant_permissions method

pub fn grant_permissions(
    db: &DbPool,
    user_id: Uuid,
    resource_id: Uuid,
    resource_type: ResourceType,
    permissions: Vec<PermissionType>,
) -> Result<(), ServiceError> {
    for permission in permissions {
        let new_permission = NewPermission {
            user_id,
            resource_id,
            resource_type: resource_type.clone(),
            permission,
        };
        Self::grant_permission(db, new_permission)?;
    }
    Ok(())
}

We would need to

  1. Pass a user-scoped connection to the dataset database through to the function
  2. Call a method on the DatasetProvider object for that backend to modify the database level permissions in line with the new permission object

This shouldn't be too hard but noting it here.
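For step 2, the backend's DatasetProvider would essentially translate each API-level permission into the equivalent database-level GRANT. A sketch of that mapping for PostgreSQL, where the enum variants and naming are assumptions rather than the project's actual types:

```rust
// Hypothetical sketch: map an API-level permission onto a database GRANT.
#[derive(Clone, Copy)]
enum PermissionType {
    Read,
    Write,
}

fn grant_sql(permission: PermissionType, table: &str, db_role: &str) -> String {
    let privilege = match permission {
        PermissionType::Read => "SELECT",
        PermissionType::Write => "INSERT, UPDATE, DELETE",
    };
    // Quote identifiers per PostgreSQL rules to guard against injection.
    format!(
        "GRANT {privilege} ON \"{}\" TO \"{}\";",
        table.replace('"', "\"\""),
        db_role.replace('"', "\"\"")
    )
}

fn main() {
    assert_eq!(
        grant_sql(PermissionType::Read, "dataset_table", "user2"),
        "GRANT SELECT ON \"dataset_table\" TO \"user2\";"
    );
    assert_eq!(
        grant_sql(PermissionType::Write, "dataset_table", "user2"),
        "GRANT INSERT, UPDATE, DELETE ON \"dataset_table\" TO \"user2\";"
    );
}
```

Revoking a permission would mirror this with REVOKE.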

Other considerations to note

  • Recapping the above, we will eventually need this to generalize to other backend connections. Currently we only have PostgreSQL but eventually want to expand to MongoDB, flat GeoParquet files, BigQuery, etc. So any solution should ideally keep that in mind.

  • We want to avoid performance hits with this approach as much as possible. The MVT tile endpoints especially need to be responsive.

  • Ideally this should be implemented in a way that allows the query builders, routes, etc. to be agnostic to the database permission layer.

@stuartlynn (Collaborator, Author)

For additional reference, the connection pool to the dataset database is set up at run time here

let data_db_connection_url = config.data_connection_string().unwrap();
let data_pool = PgPoolOptions::new()
    .max_connections(config.datadb.max_connections.unwrap_or(20))
    .connect(&data_db_connection_url)
    .await
    .expect("Failed to connect to DB");

We need to generalize this a bit to allow for different backend connections, but that's out of the scope of this ticket, I think.

@michplunkett (Contributor)

Is there a reason why we aren't taking a more fragmented approach to user roles and permissions? Perhaps I am missing something, but linking individual users to the database's roles feels like something that would limit us with regard to database flavors. I've generally worked with central role and permissions tables as opposed to relying on specially authorized connections. Relying on an authorized connection feels a little risky, in that if you could spoof a valid connection you could get A LOT more data than if you just did a simple check for role and permission. Perhaps this would be better to pitch in a meeting.

@michplunkett (Contributor)

michplunkett commented Nov 25, 2022

On a side note, I'd like to add some columns to our tables for future auditing and reporting purposes.

 #[derive(Debug, Queryable, Deserialize, Serialize, Insertable)] 
 pub struct Permission { 
     pub id: Uuid, 
     pub user_id: Uuid, 
     pub resource_id: Uuid, 
     pub permission: String, 
     pub resource_type: String, 
     pub created_at: NaiveDateTime, 
     pub updated_at: NaiveDateTime, 
 } 

Namely, I'd like to add created_by and updated_by columns. From an admin perspective, it's been important for me to know not only when things happened but who executed those commands. Any place we have the created_at, updated_at, and deleted_at columns, I'd like to have who executed the action listed as well. I don't believe those necessarily need to be returned to the front-end, but they should definitely be kept as records on the data layer.

@stuartlynn (Collaborator, Author)

Is there a reason why we aren't taking a more fragmented approach to user roles and permissions? Perhaps I am missing something, but linking individual users to the database's roles feels like something that would limit us with regard to database flavors. I've generally worked with central role and permissions tables as opposed to relying on specially authorized connections. Relying on an authorized connection feels a little risky, in that if you could spoof a valid connection you could get A LOT more data than if you just did a simple check for role and permission. Perhaps this would be better to pitch in a meeting.

I think it would be useful to have a session where we discuss the architecture here and talk through some options. I think we will need user-level permissions at the database level in some form, as we allow users to execute arbitrary SQL against the datasets they have permission to read. I think it might be hard to determine from that SQL exactly what tables are being accessed in order to do an out-of-database permissions check before the query is run.

Might be worth clarifying that we have two databases in the system currently. The first holds all the business logic for the application (Users, Permissions, etc.), while the second we use to actually store a user's uploaded datasets in tables and as an analytics engine using PostGIS. It's that second database that we need to somehow add permissions to. If there is a way we can do that without using database users and permissions at the database level, I am totally open to it. I'm just struggling to see how we could make that work.

@stuartlynn (Collaborator, Author)

On a side note, I'd like to add some columns to our tables for future auditing and reporting purposes.

 #[derive(Debug, Queryable, Deserialize, Serialize, Insertable)] 
 pub struct Permission { 
     pub id: Uuid, 
     pub user_id: Uuid, 
     pub resource_id: Uuid, 
     pub permission: String, 
     pub resource_type: String, 
     pub created_at: NaiveDateTime, 
     pub updated_at: NaiveDateTime, 
 } 

Namely, I'd like to add created_by and updated_by columns. From an admin perspective, it's been important for me to know not only when things happened but who executed those commands. Any place we have the created_at, updated_at, and deleted_at columns, I'd like to have who executed the action listed as well. I don't believe those necessarily need to be returned to the front-end, but they should definitely be kept as records on the data layer.

Yeah, not having updated_by and created_by in there is an oversight by me. Do you imagine the column tracking who updated / created / deleted the table would point to a given user in the system?

@michplunkett (Contributor)

Yeah, not having updated_by and created_by in there is an oversight by me. Do you imagine the column tracking who updated / created / deleted the table would point to a given user in the system?

Yup! Just as the user_id field should reference the users table, presuming we have one, so should the created_by, updated_by, and deleted_by fields. To do that in PostgreSQL would require statements like these:

created_by uuid NOT NULL REFERENCES users (id)
updated_by uuid REFERENCES users (id)
deleted_by uuid REFERENCES users (id)

@michplunkett (Contributor)

I think it would be useful to have a session where we discuss the architecture here and talk through some options.

Agreed! I feel like we could both type up 1,000 words easily on this topic and go back and forth, or just have a 15 minute meeting and come to an agreement relatively quickly. I should be available later on Tuesday, after 3 PM. Throw some time on my calendar and I'll make it happen!

@michplunkett (Contributor)

In regards to your original request, do we have a means of preventing certain commands (changing roles, deleting tables, etc.) other than just text parsing?

@michplunkett (Contributor)

In regards to your original request, do we have a means of preventing certain commands (changing roles, deleting tables, etc.) other than just text parsing?

I think we could do some text parsing of the injected SQL, looking for any instance of the keyword SET or any synonymous means of switching roles. The only thing I'd be scared of is that if there's another way of setting a role that we aren't aware of, we'd have opened ourselves up to a pretty big security gap.
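For concreteness, the naive version of that parsing might look like the sketch below, along with an example of exactly the kind of bypass being worried about here (the keyword list is an illustration, not exhaustive):

```rust
// Hypothetical sketch of a denylist scan for role-changing statements.
// As noted above, this is easy to bypass (SQL comments, quoting tricks,
// functions like set_config), so it should not be the only line of defence.
fn contains_role_change(sql: &str) -> bool {
    let lowered = sql.to_lowercase();
    ["set role", "set session authorization", "reset role"]
        .iter()
        .any(|kw| lowered.contains(kw))
}

fn main() {
    assert!(contains_role_change("SET ROLE admin; SELECT * FROM dataset_table"));
    assert!(!contains_role_change("select * from dataset_table"));
    // False negative: a SQL comment between the keywords defeats the denylist.
    assert!(!contains_role_change("SET/*x*/ROLE admin"));
}
```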

@michplunkett (Contributor)

The more I read over the two solutions, the more I like the second one. It feels generally more secure and far more contained. I don't like the concept of just leaving connections open once a user has initially connected to a DB.

We would need to find a way to manage the connection pool such that it doesn't overwhelm the database.

I could very well be off base here, but I am not sure this will be a significant problem. Given that this is an in-house solution, the application SHOULDN'T be getting hundreds of connection requests. We could also set the maximum threshold for connections per data source to a number that is well below the actual performance threshold for said source.

@stuartlynn (Collaborator, Author)

Great! I agree that the second solution here sounds best. I think it will make testing easier as well, as we can isolate the ConnectionService middleware and just test that.

@michplunkett a good next step might be to look at how the AuthService works as a template for building out this new service. That uses the FromRequest trait (https://docs.rs/actix-web/latest/actix_web/trait.FromRequest.html), but we might need to use the more advanced middleware setup here (https://actix.rs/docs/middleware) to be able to access the application state and get the connection settings for the DB.
