Improving the backend abstraction #5154

chrisjsewell · 2021-09-25T13:06:53Z

chrisjsewell
Sep 25, 2021

Ok this is basically me thinking through it (and I'll likely continue to edit this post), but comments are welcome 😄

This issue was already in my thinking, but has now popped up in a few places, so I wanted to centralise discussion: #5088 (comment), #5145 (comment)

What should the `Backend` be?

The backend should be an abstraction for how you get/add/update data to a profile; this includes both entity "field" (e.g. database table rows) and "binary" (e.g. repository files) data.
Here, rather than a distinction between fields and files, it is really a distinction between fast access data (available via both Entity instances and the QueryBuilder) and slow access data (available only via Entity instances).

It could then be e.g., a database (postgresql, mongodb, etc) plus an object store (e.g. a file-system, disk-objectstore), or something more exotic

It needs to provide an implementation of the backend entities (user, node, ...), a QueryBuilder implementation, a migration mechanism, some additional things for archive imports (transaction context, bulk insert/update methods), and also a way to initialise the storage for a profile.

Ideally the backend would be "pluggable", i.e. via an entry point group
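As a rough illustration of the responsibilities listed above, here is a minimal sketch of what such an interface could look like. Everything in it is an assumption made for illustration: `StorageBackend`, `InMemoryBackend`, `BACKEND_REGISTRY` and `load_backend` are hypothetical names, not AiiDA's actual API, and the registry dictionary only stands in for a real entry point group.

```python
from abc import ABC, abstractmethod
from contextlib import contextmanager
from typing import Any, ContextManager, Dict, Iterator, List, Type


class StorageBackend(ABC):
    """Hypothetical interface: storage setup, migrations, and bulk writes."""

    @classmethod
    @abstractmethod
    def initialise(cls, config: Dict[str, Any]) -> "StorageBackend":
        """Create the storage for a new profile and return a connected backend."""

    @abstractmethod
    def migrate(self) -> None:
        """Bring the storage (fields and files) to the latest schema version."""

    @abstractmethod
    def transaction(self) -> ContextManager[None]:
        """Return a context manager that groups writes, e.g. for archive imports."""

    @abstractmethod
    def bulk_insert(self, entity: str, rows: List[Dict[str, Any]]) -> List[int]:
        """Insert many rows for one entity type and return their new ids."""


class InMemoryBackend(StorageBackend):
    """Toy implementation that only exists to exercise the interface."""

    def __init__(self) -> None:
        self.rows: Dict[str, List[Dict[str, Any]]] = {}
        self.migrated = False

    @classmethod
    def initialise(cls, config: Dict[str, Any]) -> "InMemoryBackend":
        return cls()

    def migrate(self) -> None:
        self.migrated = True

    @contextmanager
    def transaction(self) -> Iterator[None]:
        # Copy every table so that a failed block can be rolled back.
        snapshot = {name: list(rows) for name, rows in self.rows.items()}
        try:
            yield
        except Exception:
            self.rows = snapshot
            raise

    def bulk_insert(self, entity: str, rows: List[Dict[str, Any]]) -> List[int]:
        table = self.rows.setdefault(entity, [])
        start = len(table) + 1
        table.extend(rows)
        return list(range(start, start + len(rows)))


# Stand-in for an entry point group: maps a backend name to its class.
BACKEND_REGISTRY: Dict[str, Type[StorageBackend]] = {"memory": InMemoryBackend}


def load_backend(name: str, config: Dict[str, Any]) -> StorageBackend:
    """Look up a backend class by name and initialise it for a profile."""
    return BACKEND_REGISTRY[name].initialise(config)
```

With something like this, core code would only ever talk to `StorageBackend`, and a new backend would be added by registering another class rather than by editing the core.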

What is it at present?

At present the backend is very much tied to PostgreSQL; you have the 'django' and 'sqlalchemy' backends hard-baked into AiiDA, but actually (at least after #5097) these are simply two routes to access the same database.

This manifests in a few areas; the config file has a bunch of top level keys relating directly to postgresql, verdi quicksetup directly uses PGSU to initialize the database, ... (before #5093 the QueryBuilder was also very much tied to sqlalchemy)
Essentially, at present, it would not actually be possible to implement a new backend.

Of note, with #5145 I am essentially turning the archive into a backend, albeit one that has minimal concurrent write capabilities, so not one you would want to run computations with, just for accessing data.
Really, you could now conceptualize an archive export/import as just a "translation" between backends.

Also to note, for both the QueryBuilder and archive functionality we currently assume three things:

- A certain set of backend entities (user, node, ...)
- A certain set of field (a.k.a. column) names on the backend entities (id, uuid, attributes, etc.)
- That these fields will serialize to / deserialize from a certain Python type (e.g. id -> int)

I think this makes sense, but should be more formalised (e.g. #5088)
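One way these assumptions could be written down is as a plain mapping from entity to field to Python type, plus a check against it. This is only a sketch: `ENTITY_SCHEMA` and `validate_row` are hypothetical names, and the schema is deliberately incomplete, listing only the entities and fields mentioned above.

```python
from typing import Any, Dict, List

# Hypothetical, incomplete schema: entity -> field name -> expected Python type.
ENTITY_SCHEMA: Dict[str, Dict[str, type]] = {
    "user": {"id": int},
    "node": {"id": int, "uuid": str, "attributes": dict},
}


def validate_row(entity: str, row: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the row matches the schema."""
    if entity not in ENTITY_SCHEMA:
        return [f"unknown entity: {entity}"]
    problems = []
    for field, expected in ENTITY_SCHEMA[entity].items():
        if field not in row:
            problems.append(f"missing field: {field}")
        elif not isinstance(row[field], expected):
            found = type(row[field]).__name__
            problems.append(f"{field}: expected {expected.__name__}, got {found}")
    return problems
```

A QueryBuilder or archive implementation could then be tested against the same mapping, whatever storage sits underneath.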

Some things that can be done

1. Ideally I would rename class Backend to something like class DataBackend, but that may be difficult for backwards compatibility. At a minimum, though, I would rewrite the incredibly vague docstring lol:
   - "The public interface that defines a backend factory that creates backend specific concrete objects."
2. Restructure the AIIDADB_ keys in the config.json under a top-level "backend" key
   - The schema under this key should be backend specific (i.e. what sub-keys are required)
   - The migration of this part of the config.json should also be backend specific
3. Move the PGSU code in verdi quicksetup to be accessed via the Backend class
4. Move the repository to be accessed via the Backend; it is very much baked into the backend, for example migrations also include changes to the repository.
   - So here, for example, we would move Profile.get_repository to Backend.get_repository
5. For migrations, django and sqlalchemy currently have different database schema versions; these cover everything about the schema, including things like which indexes are on the database. But there should also be a "backend agnostic" schema version, which effectively guarantees the assumptions I noted above: a certain set of column names, and that they serialize to/from a certain Python object
   - Here then, multiple database schema versions may map to a certain backend schema version
   - This would also replace the archive EXPORT_VERSION, so that you can work out on import whether an archive DB is compatible with the main DB
6. Remove the get_session method from the Backend. This is very much tied to sqlalchemy, and is only used in a few places:

aiida/orm/implementation/sqlalchemy/backend.py:
  77:         session = self.get_session()

aiida/orm/implementation/sqlalchemy/querybuilder/main.py:
   175:         return self._backend.get_session()
   207:             for resultrow in self.get_session().execute(stmt):
   220:             for row in self.get_session().execute(stmt):
   239:             self.get_session().close()
   303:         self._query = self.get_session().query(firstalias.id)
  1031:         rows = self.get_session().execute(text(f'EXPLAIN{options} {compiled.string}')).fetchall()
  1035:         session = self.get_session()

aiida/restapi/common/utils.py:
  835:         get_manager().get_backend().get_session().close()

tests/backends/aiida_sqlalchemy/test_migrations.py:
   455:         with self.get_session() as session:
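To make the suggestion above about restructuring the `AIIDADB_` keys more concrete, here is a minimal sketch of such a config migration. The function name `nest_backend_keys`, the lower-casing of sub-keys and the example profile are all assumptions for illustration; the real migration would presumably be backend specific, as noted in that suggestion.

```python
from typing import Any, Dict

PREFIX = "AIIDADB_"


def nest_backend_keys(profile: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a profile config with AIIDADB_* keys moved under "backend"."""
    migrated: Dict[str, Any] = {}
    backend: Dict[str, Any] = {}
    for key, value in profile.items():
        if key.startswith(PREFIX):
            # Drop the prefix and lower-case the rest (an arbitrary choice here).
            backend[key[len(PREFIX):].lower()] = value
        else:
            migrated[key] = value
    migrated["backend"] = backend
    return migrated


# Illustrative profile only; the keys and values are made up for this example.
old_profile = {"AIIDADB_NAME": "mydb", "AIIDADB_HOST": "localhost", "other_option": 1}
new_profile = nest_backend_keys(old_profile)
```

The input dictionary is left untouched, so the old layout can still be read while the migration is being validated.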

chrisjsewell · 2021-09-25T13:07:33Z

chrisjsewell
Sep 25, 2021
Author

cc @giovannipizzi, @sphuber, @ltalirz who may well have some initial discussion points 😄

0 replies

ltalirz · 2021-09-25T19:37:17Z

ltalirz
Sep 25, 2021

Thanks @chrisjsewell ; I agree with your observation and your overall thoughts.
If these changes can be done without much disruption, and if they can simplify the codebase / remove certain issues, I'm generally in favor.

For each of the possible changes, let's just keep the cost-benefit balance in mind (cost = breaking user code). This is an abstraction that we are designing for N=2 (or N=3, if you count django and sqla as separate) without knowing whether there will ever be a ++N, so I think there's no need to strive for perfect purity.

> between fast access data (available via both Entity instances and the QueryBuilder) and slow access data (available only via Entity instances)

Another way to look at this is to distinguish between "searchable data" and "non-searchable data" (that's what we tell plugin developers when they need to make the decision where to place their data)

Re 4.: Are you just talking about moving the code or can we keep the method on the profile as well?

Re 5.: Hopefully this would also enable a more automatic integration with things like the REST API.
It does sound like a bit of work, though - should this be done before dropping one of the backends or should this wait?

P.S. I took the liberty to transform your list of concrete suggestions into a numbered one to make it easier to refer to them.

2 replies

chrisjsewell Sep 28, 2021
Author

> Are you just talking about moving the code or can we keep the method on the profile as well?

Yeh, I think we should move the method from the profile.
This method, I believe, was only added with the new v2 repository, which obviously has not yet been "officially" released in a version.
So it would be better to make this change now (before the v2 release), before there are any potential issues with deprecations and breaking user code.

> It does sound like a bit of work, though - should this be done before dropping one of the backends or should this wait?

Hopefully it shouldn't be too much work since, as mentioned, it is literally what we already assume. I imagine, at its simplest, just adding a dictionary mapping between the sqla/django schema versions and the AiiDA schema versions.
I kind of already need this for #5145 anyway
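A minimal sketch of the dictionary mapping mentioned above, assuming made-up version strings: `AGNOSTIC_VERSION`, `agnostic_version` and `is_compatible` are hypothetical names, and none of the version labels correspond to real AiiDA schema revisions.

```python
from typing import Dict, Tuple

# Hypothetical labels: (backend name, database schema version) -> agnostic version.
AGNOSTIC_VERSION: Dict[Tuple[str, str], str] = {
    ("django", "db-rev-1"): "schema-1",
    ("django", "db-rev-2"): "schema-1",  # e.g. a change that only touches indexes
    ("sqlalchemy", "db-rev-a"): "schema-1",
    ("sqlalchemy", "db-rev-b"): "schema-2",
}


def agnostic_version(backend: str, db_version: str) -> str:
    """Translate a backend-specific schema version into the agnostic one."""
    try:
        return AGNOSTIC_VERSION[(backend, db_version)]
    except KeyError:
        message = f"unknown schema version {db_version!r} for backend {backend!r}"
        raise ValueError(message) from None


def is_compatible(source: Tuple[str, str], target: Tuple[str, str]) -> bool:
    """Two stores are compatible if they share the same agnostic version."""
    return agnostic_version(*source) == agnostic_version(*target)
```

An archive import could then compare the archive's agnostic version with that of the main DB, instead of relying on a separate export version.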

ltalirz Oct 6, 2021

> Yeh, I think we should move the method from the profile.
> This method, I believe, was only added with the new v2 repository, which obviously has not yet been "officially" released in a version.
> So it would be better to make this change now (before the v2 release), before there are any potential issues with deprecations and breaking user code.

In that case I'm fully with you.

> Hopefully it shouldn't be too much work since, as mentioned, it is literally what we already assume. I imagine, at its simplest, just adding a dictionary mapping between the sqla/django schema versions and the AiiDA schema versions.
> I kind of already need this for #5145 anyway

Cool.

I know I've been somewhat absent from the code over the last couple of weeks; I'm still going to be busy for ~1-2 more weeks, but then I should be more available again, including for reviews etc.

giovannipizzi · 2021-09-30T14:53:08Z

giovannipizzi
Sep 30, 2021
Maintainer

Hi @chrisjsewell - indeed, I think this is a good discussion to have. I think the backend evolved incrementally with the aim of supporting multiple backends. In practice, however, all our implementations are SQL-specific and have slowly converged (partly through attempts to avoid code duplication), so I think they are "DatabaseBackends" more than just "Backends" and, as you say, they are SQL-specific.

As @ltalirz mentions, at this stage it would be important to understand how many backends we expect to implement in the mid term. We had this idea in the past to implement also some graph-database backend; but in practice, the resources we have are limited (and it's not even clear if a graph DB would give us real speed benefits). So we might want to decide not to make things too general, if this simplifies the implementation.

Having said that: before discussing what can be done, I'd like to agree on what the Backend should (or should not) do, and what should be inside or outside (e.g. is it a database backend, i.e. just an abstraction of how to access the DB part of a profile as it is now, or should it be a backend interface to any storage for a profile made of nodes (and computers, groups, ...)?)

I've no specific suggestion at this moment and I'm open to various options as long as enough people in the team agree it's worth changing the current design and this does not impact performance. But I agree with your initial suggestions on what it should be.
(And I think that changing the name to something more specific would be OK and good - I don't know if this is a problem, shouldn't it all be "private" to AiiDA?)

Pinging @muhrin as I think he designed the current frontend/backend distinction

1 reply

chrisjsewell Sep 30, 2021
Author

Thanks @giovannipizzi

> I'd like to agree on what the Backend should (or should not) do, and what should be inside or outside (e.g. is it a database backend, i.e. just an abstraction of how to access the DB part of a profile as it is now, or should it be a backend interface to any storage for a profile made of nodes (and computers, groups, ...)?)

One thing I would note here is that, with the new repository implementation, it is not really the case any more that the DB is separate from the repository.
This is because (1) the repository_metadata, held in the DB, relies on the file keys being compatible with the repository (e.g. if you changed your container to use md5 hashkeys instead of sha256, you then need to change all the repository_metadata) and also (2) the repository is migrated as part of the Backend.migrate.
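Point (1) can be illustrated with a small sketch using content-addressed keys. The flat filename-to-key mapping below is a simplification assumed for illustration (the real repository_metadata structure may differ), and `content_key` and `rekey_metadata` are hypothetical helpers.

```python
import hashlib
from typing import Dict


def content_key(content: bytes, algorithm: str = "sha256") -> str:
    """Compute the key under which an object would be stored."""
    return hashlib.new(algorithm, content).hexdigest()


def rekey_metadata(
    metadata: Dict[str, str], objects: Dict[str, bytes], algorithm: str
) -> Dict[str, str]:
    """Rewrite every key held in the DB-side metadata for a new hash algorithm.

    metadata maps filename -> key; objects maps key -> file content.
    """
    return {name: content_key(objects[key], algorithm) for name, key in metadata.items()}


# Metadata (held in the DB) points at objects (held in the repository) by key.
sha_key = content_key(b"hello")
metadata = {"input.txt": sha_key}
objects = {sha_key: b"hello"}

# Switching the container to md5 invalidates every key stored in the metadata.
md5_metadata = rekey_metadata(metadata, objects, "md5")
```

Because every stored key changes when the algorithm does, the DB side cannot be migrated independently of the repository side.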

chrisjsewell · 2021-10-06T01:00:50Z

chrisjsewell
Oct 6, 2021
Author

> Move the repository to be accessed via the Backend; it is very much baked into the backend, for example migrations also include change to the repository.

With this, the other conceptual change is to remove global variables from the backend abstractions.

Because the repository is obtained with get_manager().get_profile().get_repository(), it basically means currently it's impossible for you to have multiple backends loaded at the same time, and e.g. for me to create a BackendNode for the archive (without horrible monkey patching of the profile).

I know why we have global variables, for usability, but where possible we should really try to eliminate their use in the "core" code.

Other examples of this are aiida.tools.graph.graph_traversers.traverse_graph and aiida.tools.visualization.graph.Graph, which both use `QueryBuilder()`, which implicitly loads the "global" backend. Instead, you should be able to supply them with a "QueryBuilder creator", so I can do e.g.

with archive_format(path, "r") as reader:
    result = traverse_graph(..., querybuilder=reader.querybuilder)
    graph = Graph(..., querybuilder=reader.querybuilder)

This will allow for features like importing entity sub-sets from archives, and for graph visualizations to be created directly from archives.
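The "QueryBuilder creator" idea is essentially dependency injection with a global default. A minimal, self-contained sketch of the pattern follows; `DictQueryBuilder`, `default_querybuilder` and `traverse` are toy stand-ins invented for this example, not the real `QueryBuilder` or `traverse_graph`.

```python
from typing import Callable, Dict, List, Optional, Set


class DictQueryBuilder:
    """Toy stand-in for a QueryBuilder: answers link queries from a dict."""

    def __init__(self, links: Dict[int, List[int]]) -> None:
        self._links = links

    def outgoing(self, pk: int) -> List[int]:
        return self._links.get(pk, [])


# Stand-in for the data behind the implicitly loaded "global" backend.
GLOBAL_LINKS: Dict[int, List[int]] = {1: [2]}


def default_querybuilder() -> DictQueryBuilder:
    """What the code falls back to when no creator is supplied."""
    return DictQueryBuilder(GLOBAL_LINKS)


def traverse(
    start: int, querybuilder: Optional[Callable[[], DictQueryBuilder]] = None
) -> Set[int]:
    """Collect all nodes reachable from start, using the supplied creator."""
    make_qb = querybuilder or default_querybuilder
    seen: Set[int] = set()
    stack = [start]
    while stack:
        pk = stack.pop()
        if pk in seen:
            continue
        seen.add(pk)
        stack.extend(make_qb().outgoing(pk))
    return seen


# The same function can now run against a different store, e.g. an archive.
archive_links = {1: [3], 3: [4]}
from_global = traverse(1)
from_archive = traverse(1, querybuilder=lambda: DictQueryBuilder(archive_links))
```

Existing callers keep working because the default still resolves to the global backend, while callers such as an archive reader can pass their own creator.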

0 replies
