Deduplication of media files #46

Open
jsangmeister opened this issue Jan 26, 2022 · 2 comments

@jsangmeister
Contributor

To reduce the database size, we could store a hash of each file alongside its content. When a new file is uploaded, its hash is calculated first and looked up in the database. If the hash is already present, the new id is linked to the existing content instead of storing the content a second time. This requires a new 1:n table which links ids to their content. It also requires a migration, which I'm not sure is possible with the current setup...
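A minimal sketch of the idea, not a proposal for the actual schema: the table names `content` and `file_link`, the choice of SHA-256, and the use of an in-memory SQLite database as a stand-in for Postgres are all assumptions made only to keep the example self-contained.

```python
import hashlib
import sqlite3

# Hypothetical schema: one row per distinct content, one row per uploaded id.
SCHEMA = """
CREATE TABLE content (
    hash TEXT PRIMARY KEY,
    data BLOB NOT NULL
);
CREATE TABLE file_link (
    id INTEGER PRIMARY KEY,
    hash TEXT NOT NULL REFERENCES content(hash)
);
"""


def store_file(conn: sqlite3.Connection, file_id: int, data: bytes) -> bool:
    """Store `data` under `file_id`; return True if new content was written."""
    digest = hashlib.sha256(data).hexdigest()
    with conn:  # commit both statements together, roll back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO content (hash, data) VALUES (?, ?)",
            (digest, data),
        )
        is_new = cur.rowcount == 1  # 0 means the hash already existed
        conn.execute(
            "INSERT INTO file_link (id, hash) VALUES (?, ?)",
            (file_id, digest),
        )
    return is_new


def load_file(conn: sqlite3.Connection, file_id: int):
    """Return the content linked to `file_id`, or None if the id is unknown."""
    row = conn.execute(
        "SELECT c.data FROM file_link l JOIN content c ON c.hash = l.hash "
        "WHERE l.id = ?",
        (file_id,),
    ).fetchone()
    return None if row is None else row[0]
```

Uploading the same bytes under two ids would then produce two rows in `file_link` but only one row in `content`.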

@jsangmeister jsangmeister added this to the 4.1 milestone Jan 26, 2022
@reiterl reiterl linked a pull request Feb 11, 2022 that will close this issue
@jsangmeister
Contributor Author

Regarding the migrations: If we implement this, we should make it at least a little future-proof. The simplest solution would be to just append statements to the schema.sql. That is fine for schema migrations, but it does not allow data migrations, and we need a data migration to initially create the "metadata"/"link" table. There are probably frameworks out there which do exactly this, but I'm no expert on this. An initial short search of course yielded results like SQLAlchemy or other full-blown SQL frameworks, which might be overkill, but could maybe be used for only this functionality. Regarding migration-only projects, the only thing I found was yoyo, which I'm not sure we should use, since it seems small and not very regularly maintained. The other alternative would be of course to write the migration framework from scratch, which is probably feasible since we do not need many features, but would probably still be overkill, especially since we currently only need it for this feature, which does not have high priority.

These are my thoughts so far. Maybe someone else has more experience regarding SQL/Postgres/Python migrations (@peb-adr @gsiv @r-peschke)?

@gsiv
Member

gsiv commented Mar 1, 2022

> The other alternative would be of course to write the migration framework from scratch, which is probably feasible since we do not need many features, but would probably still be overkill, especially since we currently only need it for this feature, which does not have high priority.

I agree that we shouldn't need a third-party framework for this, but I also don't think writing a simple mechanism ourselves would be overkill, even if we aim to enable (data) migrations from any given previous version, as we should. I expect that we'll eventually need this for the other services as well.

To achieve this, wouldn't it suffice to attach a version to the schema and apply the migrations, i.e., numbered SQL scripts, as necessary during the container's start-up routine?
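As a rough sketch of what such a mechanism could look like: the `schema_version` table, the `MIGRATIONS` mapping, and the two example statements are hypothetical, and SQLite stands in for Postgres only so that the example runs on its own. In a real setup the statements would presumably live in numbered `.sql` files read from disk.

```python
import sqlite3

# Hypothetical migrations, keyed by the schema version they produce.
MIGRATIONS = {
    1: "CREATE TABLE mediafile (id INTEGER PRIMARY KEY, data BLOB);",
    2: "ALTER TABLE mediafile ADD COLUMN hash TEXT;",
}


def current_version(conn: sqlite3.Connection) -> int:
    """Return the highest applied version, or 0 for a fresh database."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_version (version INTEGER NOT NULL)"
    )
    row = conn.execute("SELECT MAX(version) FROM schema_version").fetchone()
    return row[0] or 0


def migrate(conn: sqlite3.Connection) -> list:
    """Apply all pending migrations in order; return the versions applied."""
    applied = []
    version = current_version(conn)
    for target in sorted(MIGRATIONS):
        if target <= version:
            continue  # already applied by an earlier start-up
        with conn:  # commits the version record once the step succeeded
            conn.execute(MIGRATIONS[target])
            conn.execute(
                "INSERT INTO schema_version (version) VALUES (?)", (target,)
            )
        applied.append(target)
    return applied
```

Calling `migrate(conn)` during start-up would bring a database from any earlier version to the latest one, and a second call would be a no-op. Since DDL is transactional in Postgres, each script and its version record could share one transaction there.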

The only snag is going to be, as always, the coordination between scaled services. Unlike with the backend service, we should plan for a locking mechanism from the start this time, but that, too, should be attainable. Remember, though, that it needs to work with pgbouncer's transaction pooling mode.
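One possible approach, sketched under assumptions: Postgres offers transaction-level advisory locks via `pg_advisory_xact_lock`, which are released automatically at commit or rollback. Because they never outlive the transaction, they should be compatible with transaction pooling, whereas session-level advisory locks are not. The lock key, the function name `migrate_with_lock`, and the callback `run_pending_migrations` are hypothetical.

```python
# Hypothetical key; any fixed 64-bit integer shared by all instances works.
MIGRATION_LOCK_KEY = 46


def migrate_with_lock(cursor, run_pending_migrations):
    """Run migrations while holding a transaction-level advisory lock.

    `cursor` is a DB-API cursor on a Postgres connection that is not in
    autocommit mode, so the lock and the migrations share one transaction.
    `run_pending_migrations` is a callable that receives the same cursor.
    """
    # Blocks until no other transaction holds the key. The lock is released
    # automatically when the surrounding transaction commits or rolls back.
    cursor.execute("SELECT pg_advisory_xact_lock(%s)", (MIGRATION_LOCK_KEY,))
    # The callback should read the schema version only now, after the lock
    # is held, so that a second instance sees the first one's result.
    run_pending_migrations(cursor)
```

With this shape, a second instance starting at the same time would wait at the lock, then find the schema already up to date and apply nothing. The caller is responsible for committing the transaction afterwards.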

@jsangmeister jsangmeister modified the milestones: 4.1, 4.2 Dec 14, 2023
@Elblinator Elblinator modified the milestones: 4.2, 4.3 Dec 13, 2024
3 participants