Discussion: Considerations for database backend and broker #4
I am not familiar with ORMs in general. From quick internet searches I get the feeling that an ORM is a translator between classes and a relational database. Can an ORM create a database from scratch? Another thing to consider is the dependencies a Django package will bring in. What about using a standalone ORM like SQLAlchemy? A third thing we haven't discussed: is the choice replicable in a distributed environment? We have several containers running tern, and all of them have an instance of a local cache. How do we make sure the caches in all the containers stay synced? Do we switch to a distributed cache rather than a local cache? If so, what distributed data store do we use? And if we were to pick one, can we ask it questions like "give me all the layers that have this file in them"?
Perhaps we need to use both the distributed data store and the relational database so we can have both replicability and joins. Let's discuss our requirements and then look at our options.
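To make the cache-sync question above concrete, here is a minimal Python sketch (not tern code; all names are illustrative) of why per-container local caches drift apart, and why a single shared store avoids the problem:

```python
# Illustrative sketch: each "container" keeps its own dict-based cache.
# A write through one container is invisible to the others until synced.

class Container:
    def __init__(self):
        self.local_cache = {}

    def put(self, key, value):
        self.local_cache[key] = value

    def get(self, key):
        return self.local_cache.get(key)


containers = [Container() for _ in range(3)]
containers[0].put("layer:abc", {"files": ["/bin/sh"]})

# With local caches, the other containers miss:
misses = [c.get("layer:abc") for c in containers[1:]]
print(misses)  # [None, None] -- the caches are out of sync

# With one shared store (stand-in for a distributed cache such as Redis),
# every container sees the same write:
shared_store = {}
shared_store["layer:abc"] = {"files": ["/bin/sh"]}
print(all(shared_store.get("layer:abc") for _ in containers))  # True
```

This is the trade-off being discussed: either the containers run a sync protocol between local caches, or they all talk to one shared data store.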
Yes, a database can be created from scratch through an ORM, and Django's ORM in particular makes application code, maintenance, and testing easy. For CRUD queries Django is fast and efficient, but for complex queries SQLAlchemy is preferred. @nishakm, for your third point: from my understanding, we can run a distributed cache in its own container and expose its port so that the different instances can communicate with it simultaneously. Depending on the framework we choose (Django/Flask/Redis), this distributed data store can be a SQL database or a Redis-based API. One thing is true: Django will definitely bring in dependencies that slow things down. Redis is ultra-fast because it is basically a key-value store, but the problem with Redis is that as more data gets added, it becomes less queryable. Both Django and Redis have their pros and cons: if we prioritize speed, Redis is better; if we prioritize flexibility, Django is better.
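The "less queryable" point can be shown with a small Python sketch (a plain dict standing in for Redis; keys and file paths are made up): exact-key lookup is constant time, but a relational-style question forces a scan over every value:

```python
# Sketch of the key-value trade-off: a dict stands in for Redis.

cache = {
    "layer:aaa": {"files": ["/bin/sh", "/etc/passwd"]},
    "layer:bbb": {"files": ["/bin/ls"]},
    "layer:ccc": {"files": ["/bin/sh"]},
}

# Fast: direct lookup by key, O(1).
print(cache["layer:bbb"]["files"])  # ['/bin/ls']

# Slow as data grows: answering "which layers contain this file?"
# requires a full scan over every stored value.
layers_with_sh = sorted(k for k, v in cache.items()
                        if "/bin/sh" in v["files"])
print(layers_with_sh)  # ['layer:aaa', 'layer:ccc']
```

A relational database answers the second question with an indexed join instead of a scan, which is the flexibility being traded away for Redis's speed.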
I think we would use CRUD operations on the distributed data store and in whatever migration path we use to the database, but the database will still need to be available for complex queries. I am envisioning the distributed cache being used on a daily basis and the database serving as a centralized store (let's not think about how this will be replicated for now). I think I am sold on the Django ORM, but the immediate need is for a CRUD API and a distributed cache. CouchDB looks very promising, but I am open to other options.
Currently, Redis is the most popular choice for caching because of its incredible speed.
@nishakm So, in regards to which database to use, I suggest also looking at MongoDB (a NoSQL database), which can be used with Django. Redis is good and fast. Also, can you describe the type of queries that we would be making? Judging from the current JSON-file backend, I don't think it will be more than some simple joins, which are definitely possible in Django. If there are not many queries, then maybe we can continue using a schemaless database. Here's a comparison between MongoDB and CouchDB: https://www.xplenty.com/blog/couchdb-vs-mongodb/
@nishakm What I understood from the concept of a distributed cache is that, say there are 100 requests coming into tern; those requests get distributed over, say, 10 containers, each container having its own cache. Those containers are interconnected, so from there we can reach all of the cached data. Something similar to the image below. Also, Redis was primarily made for caching.
Redis usually commits data to permanent storage on a periodic basis, whereas a SQL database like Postgres will usually commit before each transaction is marked as complete. This means Postgres is slower because it commits more frequently (but provides ACID guarantees), while Redis usually has a time window where data loss may occur even after the client was told its update was handled successfully. From what I can tell, the data-store functionality here is such that we can tolerate losing some data, since it would simply be generated again. So, yes, Redis is really a good choice for the cache. The type and complexity of the queries may help us judge the best database (NoSQL/SQL) and the corresponding ORM, if a SQL database is used.
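The periodic-commit behavior described above corresponds to Redis's standard persistence settings; a hedged illustration of the relevant `redis.conf` directives (values here are the common defaults, not a recommendation):

```
# RDB snapshotting: write a snapshot to disk when the change
# thresholds are hit (e.g. at least 1 key changed in 900 seconds).
save 900 1
save 300 10
save 60 10000

# AOF mode narrows the possible data-loss window to roughly one
# second of writes, at some throughput cost.
appendonly yes
appendfsync everysec
```

With snapshotting alone, everything written since the last snapshot can be lost on a crash; with AOF and `appendfsync everysec`, the window shrinks to about a second, which fits the "regenerable data" argument above.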
Thanks for the discussion @jaindhairyahere and @m1-key :). I learned a lot from it! Honestly, the choice of distributed cache and database is less of a concern than implementing things in such a way that these choices can be swapped out easily. However, the API needs to be decided on first, so I would suggest trying out the current state of this project before moving on to this. Looks like we are in agreement on the Django ORM for the database and Redis for the distributed cache?
Here's how I think we should move towards making an API for tern, according to what we discussed in the last meeting:
Create a Django application and turn the tern project into an installable Django package. Django is a Python-based web-development framework known for its ORM. The Django ORM provides backends for interacting with different databases, whether hosted locally, in another container, or on a completely different server. This lets us skip issues #862 and #863 entirely, on top of the other advantages Django gives us.
Creating a Django package from pure Python will take some time and effort, but it will spare us many of the issues that could arise if we design our own backend, especially around optimization. I'm hoping to describe this in my GSoC proposal too. I've already started some work on this; presently I'm trying to design a suitable database schema so we can use a relational database. Please let me know what you think.
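As a starting point for that schema discussion, here is a hedged sketch using Python's built-in `sqlite3` (table and column names are assumptions, not tern's actual design): layers and files in a many-to-many relationship, with the join answering the "give me all the layers that have this file in them" question from earlier in the thread:

```python
# Relational-schema sketch: layers <-> files, many-to-many.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE layers (id INTEGER PRIMARY KEY, digest TEXT UNIQUE);
CREATE TABLE files  (id INTEGER PRIMARY KEY, path TEXT UNIQUE);
CREATE TABLE layer_files (
    layer_id INTEGER REFERENCES layers(id),
    file_id  INTEGER REFERENCES files(id),
    PRIMARY KEY (layer_id, file_id)
);
""")
cur.execute("INSERT INTO layers VALUES (1, 'sha256:aaa'), (2, 'sha256:bbb')")
cur.execute("INSERT INTO files VALUES (1, '/bin/sh'), (2, '/bin/ls')")
cur.execute("INSERT INTO layer_files VALUES (1, 1), (1, 2), (2, 1)")

# "Give me all the layers that have this file in them":
cur.execute("""
SELECT l.digest FROM layers l
JOIN layer_files lf ON lf.layer_id = l.id
JOIN files f        ON f.id = lf.file_id
WHERE f.path = ?
ORDER BY l.digest
""", ("/bin/sh",))
print([row[0] for row in cur.fetchall()])  # ['sha256:aaa', 'sha256:bbb']
```

The same three-table shape maps directly onto two Django models with a `ManyToManyField`, with the join table generated by the ORM.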