Improving the backend abstraction #5154
Replies: 4 comments 3 replies
-
cc @giovannipizzi, @sphuber, @ltalirz who may well have some initial discussion points 😄
-
Thanks @chrisjsewell; I agree with your observation and your overall thoughts. For each of the possible changes, let's just keep the cost-benefit balance in mind (cost = breaking user code). This is an abstraction that we are designing for N=2 (or N=3, if you count django and sqla as separate) without knowing whether there will ever be a ++N, so I think there's no need to strive for perfect purity.
Another way to look at this is to distinguish between "searchable data" and "non-searchable data" (that's what we tell plugin developers when they need to decide where to place their data).
Re 4.: Are you just talking about moving the code, or can we keep the method on the profile as well?
Re 5.: Hopefully this would also enable a more automatic integration with things like the REST API.
P.S. I took the liberty of transforming your list of concrete suggestions into a numbered one to make it easier to refer to them.
-
Hi @chrisjsewell - indeed, I think this is a good discussion to have. Currently I think the backend evolved incrementally to enable supporting multiple backends, but in practice all our implementations are SQL-specific, and they slowly converged (partly through attempts to avoid code duplication). So I think they are "DatabaseBackends" more than just "Backends", and, as you say, they are SQL-specific in practice.
As @ltalirz mentions, at this stage it would be important to understand how many backends we expect to implement in the mid term. We had the idea in the past to also implement a graph-database backend; but in practice, the resources we have are limited (and it's not even clear whether a graph DB would give us real speed benefits). So we might want to decide not to make things too general, if this simplifies the implementation.
Having said that: before discussing what can be done, I'd like to agree on what the Backend should (or should not) do, and what should be inside or outside it. E.g. is it a database backend, i.e. just an abstraction of how to access the DB part of a profile, as it is now, or should it be a backend interface to any storage for a profile made of nodes (and computers, groups, ...)?
I've no specific suggestion at this moment and I'm open to various options, as long as enough people in the team agree it's worth changing the current design and this does not impact performance. But I agree with your initial suggestions on what it should be.
Pinging @muhrin as I think he designed the current frontend/backend distinction.
-
With this, actually the other conceptual thing is to remove global variables from the backend abstractions. Because the repository is obtained with
I know why we have global variables, for usability, but where possible we should really try to eliminate their use in the "core" code. Other examples of this are

    with archive_format(path, "r") as reader:
        result = traverse_graph(..., querybuilder=reader.querybuilder)
        graph = Graph(..., querybuilder=reader.querybuilder)

This will allow for features like importing entity sub-sets from archives, and for graph visualizations to be created directly from archives.
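To illustrate the point about passing the query object explicitly rather than reaching for a global, here is a minimal, self-contained sketch. Everything in it (`QueryBuilderLike`, `DictQueryBuilder`, `traverse`) is a hypothetical stand-in and not AiiDA's actual API; it only shows that the same traversal code can run against a "profile" or an "archive" once the data source is injected.

```python
from typing import Iterable, Protocol


class QueryBuilderLike(Protocol):
    """Hypothetical minimal interface: anything that can list outgoing links."""

    def outgoing(self, node_id: int) -> Iterable[int]: ...


class DictQueryBuilder:
    """Toy stand-in backed by an in-memory adjacency mapping."""

    def __init__(self, links: dict[int, list[int]]) -> None:
        self._links = links

    def outgoing(self, node_id: int) -> Iterable[int]:
        return self._links.get(node_id, [])


def traverse(start: int, querybuilder: QueryBuilderLike) -> set[int]:
    """Collect every node reachable from ``start`` using only the injected query object."""
    seen: set[int] = set()
    stack = [start]
    while stack:
        node_id = stack.pop()
        if node_id in seen:
            continue
        seen.add(node_id)
        stack.extend(querybuilder.outgoing(node_id))
    return seen


# The same function works for two unrelated data sources, with no global state involved.
profile_qb = DictQueryBuilder({1: [2, 3], 2: [4]})
archive_qb = DictQueryBuilder({10: [11]})
reachable_in_profile = traverse(1, profile_qb)
reachable_in_archive = traverse(10, archive_qb)
```

The design choice is simply that the function never looks up "the current profile" itself, so the caller decides which storage it operates on.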
-
Ok this is basically me thinking through it (and I'll likely continue to edit this post), but comments are welcome 😄
This issue was already in my thinking, but has now popped up in a few places, so I wanted to centralise discussion: #5088 (comment), #5145 (comment)
What should the Backend be?
The backend should be an abstraction for how you get/add/update data in a profile; this includes both entity "field" (e.g. database table rows) and "binary" (e.g. repository files) data.
More than a distinction between fields and files, this is a distinction between fast-access data (available via both Entity instances and the QueryBuilder) and slow-access data (available only via Entity instances).
It could then be e.g., a database (postgresql, mongodb, etc) plus an object store (e.g. a file-system, disk-objectstore), or something more exotic
It needs to provide an implementation of the backend entities (user, node, ...), a QueryBuilder implementation, a migration mechanism, some additional things for archive imports (transaction context, bulk insert/update methods), and also a way to initialise the storage for a profile.
Ideally the backend would be "pluggable", i.e. via an entry point group.
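As a rough sketch of what such an interface could look like, the following defines an abstract class covering the responsibilities listed above (storage initialisation, field data, binary data, migration, transactions and bulk insert), plus a toy in-memory implementation. All class and method names here are assumptions made for illustration; they are not the existing AiiDA `Backend` API.

```python
import copy
import hashlib
from abc import ABC, abstractmethod
from contextlib import AbstractContextManager, contextmanager
from typing import Any, Iterator


class DataBackend(ABC):
    """Hypothetical interface; method names are illustrative only."""

    @abstractmethod
    def initialise(self) -> None:
        """Create the storage for a new profile."""

    @abstractmethod
    def migrate(self) -> None:
        """Bring existing storage up to the current schema."""

    @abstractmethod
    def transaction(self) -> AbstractContextManager[None]:
        """Context in which field-data changes are applied atomically."""

    @abstractmethod
    def bulk_insert(self, entity_type: str, rows: list[dict[str, Any]]) -> list[int]:
        """Insert many rows of "field" data and return their ids."""

    @abstractmethod
    def query(self, entity_type: str, **filters: Any) -> list[dict[str, Any]]:
        """Return rows whose fields equal the given filters."""

    @abstractmethod
    def put_object(self, content: bytes) -> str:
        """Store "binary" data and return its key."""

    @abstractmethod
    def get_object(self, key: str) -> bytes:
        """Retrieve "binary" data by key."""


class InMemoryBackend(DataBackend):
    """Toy implementation: dicts stand in for the database and object store."""

    def __init__(self) -> None:
        self._rows: dict[str, list[dict[str, Any]]] = {}
        self._objects: dict[str, bytes] = {}
        self.initialised = False

    def initialise(self) -> None:
        self.initialised = True

    def migrate(self) -> None:
        pass  # nothing to migrate for in-memory storage

    @contextmanager
    def transaction(self) -> Iterator[None]:
        snapshot = copy.deepcopy(self._rows)
        try:
            yield
        except Exception:
            self._rows = snapshot  # roll back field data on failure
            raise

    def bulk_insert(self, entity_type: str, rows: list[dict[str, Any]]) -> list[int]:
        table = self._rows.setdefault(entity_type, [])
        ids = []
        for row in rows:
            pk = len(table) + 1
            table.append({**row, "id": pk})
            ids.append(pk)
        return ids

    def query(self, entity_type: str, **filters: Any) -> list[dict[str, Any]]:
        return [
            row
            for row in self._rows.get(entity_type, [])
            if all(row.get(key) == value for key, value in filters.items())
        ]

    def put_object(self, content: bytes) -> str:
        key = hashlib.sha256(content).hexdigest()
        self._objects[key] = content
        return key

    def get_object(self, key: str) -> bytes:
        return self._objects[key]


backend = InMemoryBackend()
backend.initialise()
with backend.transaction():
    ids = backend.bulk_insert("node", [{"label": "a"}, {"label": "b"}])
key = backend.put_object(b"raw file content")
```

Note that only the field data is queryable here, while the binary data is reachable only by key, which mirrors the fast-access/slow-access distinction.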
What is it at present?
At present the backend is very much tied to PostgreSQL; you have the 'django' and 'sqlalchemy' backends hard-baked into AiiDA, but actually (at least after #5097) these are simply two routes to access the same database.
This manifests in a few areas: the config file has a bunch of top-level keys relating directly to postgresql, verdi quicksetup directly uses PGSU to initialize the database, ... (before #5093 the QueryBuilder was also very much tied to sqlalchemy).
Essentially, at present, it would not actually be possible to implement a new backend.
Of note, with #5145 I am essentially turning the archive into a backend, albeit one that has minimal concurrent write capabilities, so not one you would want to run computations with, just for accessing data.
Really, you could now conceptualize an archive export/import as just a "translation" between backends.
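Also to note, for both the QueryBuilder and archive functionality we currently assume two things:
- that the entities have a certain set of field names (id, uuid, attributes, etc)
- that each field serializes to/from a certain Python type (id -> int)
I think this makes sense, but should be more formalised (e.g. #5088)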
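One way such assumptions could be formalised is a backend-agnostic mapping from field name to Python type that any row can be checked against. The sketch below is purely illustrative: `NODE_FIELDS` and `validate_row` are hypothetical names, and the three fields shown are just the ones mentioned above, not a complete schema.

```python
from typing import Any

# Hypothetical, backend-agnostic description of node fields: name -> Python type.
NODE_FIELDS: dict[str, type] = {"id": int, "uuid": str, "attributes": dict}


def validate_row(row: dict[str, Any], fields: dict[str, type]) -> list[str]:
    """Return a list of problems; an empty list means the row matches the schema."""
    problems = []
    for name, expected in fields.items():
        if name not in row:
            problems.append(f"missing field: {name}")
        elif not isinstance(row[name], expected):
            actual = type(row[name]).__name__
            problems.append(f"{name}: expected {expected.__name__}, got {actual}")
    return problems


good_row = {"id": 1, "uuid": "a1b2", "attributes": {}}
bad_row = {"id": "1", "uuid": "a1b2"}  # id has the wrong type, attributes is missing

good_problems = validate_row(good_row, NODE_FIELDS)
bad_problems = validate_row(bad_row, NODE_FIELDS)
```

A check like this could run against any backend (or an archive), since it depends only on the field names and Python types and not on how the data is stored.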
Some things that can be done
1. Ideally I would rename class Backend to something like class DataBackend, but that may be difficult for back compatibility. At a minimum though I would re-write the incredibly vague docstring lol.
2. Restructure the AIIDADB_ keys in the config.json under a top-level "backend" key
   - config.json should also be backend specific
3. Move the PGSU code in verdi quicksetup to be accessed via the Backend class
4. Move the repository to be accessed via the Backend; it is very much baked into the backend, for example migrations also include changes to the repository.
   - Profile.get_repository to Backend.get_repository
5. For migrations, django/sqlalchemy currently have different database schema versions; these cover everything about the schema, including things like what indexes are on the database. But there should also be a "backend agnostic" schema version, which effectively guarantees the assumptions I noted above: a certain set of column names, and that they serialize to/from a certain Python object
   - EXPORT_VERSION; so that you can work out, on import, whether an archive DB is compatible with the main DB
6. Remove the get_session method from the Backend. This is very much tied to sqlalchemy, and is only used in a few places.
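As a sketch of the config restructuring idea (moving the AIIDADB_ keys under a top-level "backend" key), the function below nests every key with that prefix. The function name, the lower-casing of the nested keys, and the example profile contents are all assumptions for illustration, not the actual config migration.

```python
from typing import Any

PREFIX = "AIIDADB_"


def nest_backend_keys(profile: dict[str, Any]) -> dict[str, Any]:
    """Move every ``AIIDADB_*`` key under a single top-level "backend" key."""
    backend: dict[str, Any] = {}
    rest: dict[str, Any] = {}
    for key, value in profile.items():
        if key.startswith(PREFIX):
            # Strip the prefix; lower-casing is an arbitrary choice for this sketch.
            backend[key[len(PREFIX):].lower()] = value
        else:
            rest[key] = value
    if backend:
        rest["backend"] = backend
    return rest


# Illustrative profile contents only; real profiles may hold different keys.
old_profile = {
    "AIIDADB_ENGINE": "postgresql_psycopg2",
    "AIIDADB_NAME": "aiida_db",
    "AIIDADB_PORT": 5432,
    "default_user_email": "user@example.com",
}
new_profile = nest_backend_keys(old_profile)
```

With the storage settings grouped like this, a different backend could define its own set of nested keys without touching the rest of the profile.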