In The Cloud
I will assert - but not argue in detail here - that today a large number of repositories would be better off running in the cloud than in a local deployment environment, and that within a few years that will be true for the vast majority. What constitutes better off? Cheaper, easier, more reliable, more scalable, more performant, some or all of the above? These are interesting and sometimes complex questions, but not the focus of this post. Rather, I want to explore the design implications of this shift: what does it mean for repository software architecture? How would DSpace/mds be different? I will make claims - some of which may be controversial - about the directions in which the cloud will push software design. This is not a purely theoretical discussion, in that mds has adopted or will adopt most of the recommended changes. What follows are a few general principles and deductions from them.
A central aspect of cloud computing is that the service provisions not only the hardware but some of the software environment. How encompassing that environment is varies with the type of cloud service: a PaaS-type offering will provide more than an IaaS offering, for example, and perhaps less than a SaaS. In any case, since the platform fixes the environment, the application need no longer support a variety of environments (beyond what would keep it cloud vendor-agnostic). For DSpace/mds, I think one of the first conclusions to draw is that OS-independence is a dispensable trait. Why support Windows when there is always an alternative? Linux-only will be fine, and easier to test and support. While this amounts to a fairly modest change - since DSpace is Java-based and therefore mostly abstracts the OS away anyway - one could go further and consider other multi-environment dimensions of DSpace.
Foremost among these is multi-database support (specifically PostgreSQL and Oracle, though there has always been a desire to support MySQL and others). The rationale for this characteristic had always been predicated on the idea that the software would be locally run: if your institution, central IT group, or sysadmin had a preference or requirement for a particular RDBMS, then DSpace could accommodate it. This flexibility has a cost, however: having to code, test, and maintain every database-touching aspect of the application for each supported database has been a challenge. Various attempts have been made (or contemplated) to mitigate this cost through better design. One early effort was the so-called 'DAO' work in an abandoned branch of DSpace, which factored the DB-specific code into DAOs, thereby more cleanly separating the peculiarities of each supported DB. Another approach that has been discussed and prototyped is using an ORM framework (e.g. Hibernate) to abstract away DB differences.
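To make the maintenance cost concrete, here is a minimal sketch of the DAO approach. The class and method names are illustrative, not actual DSpace APIs; the point is that even with DB-specific code cleanly isolated behind an interface, every such method must still be written and tested once per supported backend.

```java
// Hypothetical sketch of the DAO pattern for multi-DB support.
// Names are illustrative only, not taken from the DSpace codebase.

interface ItemDao {
    // Each backend supplies its own SQL dialect for paging through results
    String pagedQuery(String table, int limit, int offset);
}

class PostgresItemDao implements ItemDao {
    public String pagedQuery(String table, int limit, int offset) {
        return "SELECT * FROM " + table + " LIMIT " + limit + " OFFSET " + offset;
    }
}

class OracleItemDao implements ItemDao {
    public String pagedQuery(String table, int limit, int offset) {
        // Classic Oracle has no LIMIT/OFFSET; the ROWNUM pseudocolumn is used instead
        return "SELECT * FROM (SELECT t.*, ROWNUM rn FROM " + table + " t) "
             + "WHERE rn > " + offset + " AND rn <= " + (offset + limit);
    }
}
```

Every divergence of this kind doubles the implementation and testing surface - precisely the burden that dropping multi-DB support would eliminate.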
But if one eliminates the local deploy assumption, is this rationale still compelling? I would argue not, and further claim that there are substantial gains to be made by eliminating multi-DB support altogether. Even using an ORM, one has to deploy and test extensively for each DB backend supported, which adds to the complexity of the development environment.
DSpace contains a large number of what we can call locality constraints: hard-coded requirements that application resources co-reside in a given file system or environment. A simple concrete example: when one deploys the DSpace WAR to a Tomcat container, the Tomcat application context is configured with an initialization parameter ('dspace.dir') that refers to the directory where DSpace is installed. This directory is used to locate the configuration files and other resources - which means Tomcat is required to run on the same host as the base installation. As a result, very simple strategies like clustering application servers across different hosts become immediately problematic. One could imagine workarounds such as mounting an NFS volume on each cluster host that points back to a common file system on one host, but NFS configuration is not typically available on a cloud server, and would in fact be very difficult to manage even if it were.
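The constraint described above looks roughly like this in a Tomcat context descriptor. The parameter name 'dspace.dir' is from the text; the file location and path value are illustrative:

```xml
<!-- e.g. $CATALINA_BASE/conf/Catalina/localhost/dspace.xml (illustrative) -->
<Context docBase="/path/to/dspace.war">
  <!-- Hard-codes a filesystem path: the webapp only works on a host
       where /dspace actually exists, ruling out simple clustering -->
  <Parameter name="dspace.dir" value="/dspace" override="false"/>
</Context>
```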
What would it take to relax all the locality constraints in DSpace? An instructive example can be found in the evolution of the search interface. Originally, the Lucene index files were all configured to live in a directory relative to the above-mentioned 'dspace.dir'. With the advent of SOLR, however, the indexes are accessed via an HTTP URL. It is true that today the SOLR instance runs in the same Tomcat container as the web UI, so locality is preserved, but there is no fundamental reason SOLR could not move to a completely different host. Of course, the SOLR implementation smuggles in other locality constraints of its own, so this would not actually work today, but it points the way to how DSpace could evolve. MDS will attempt to eliminate all locality constraints, using a multi-pronged strategy. The main techniques are:
Many small resource files, such as email templates, are found by the system by constructing a path relative to 'dspace.dir' (thus enshrining co-locality). MDS has moved all these files, with the exception of config files, into the database. Since the database can be accessed from remote hosts, the constraint falls away.
For configuration files (kernel.cfg and modules), which might not lend themselves to database storage, it is possible to include a copy of them with each distinct application package (such as a WAR file), and rely on location-relative means (e.g. Java's #getResource() method) to access them.
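The classpath-relative technique can be sketched as follows. The file name kernel.cfg comes from the text; packaging it at the classpath root, and the class name here, are assumptions of this sketch. The key property is that the lookup works identically on any host, because the resource travels inside the application archive:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

public class PackagedConfig {
    // Loads a config file bundled inside the WAR/JAR via the classloader,
    // rather than from an absolute path like ${dspace.dir}/config.
    // Example call: PackagedConfig.load("kernel.cfg")
    public static Properties load(String resourceName) {
        Properties props = new Properties();
        try (InputStream in =
                 PackagedConfig.class.getResourceAsStream("/" + resourceName)) {
            if (in != null) {
                props.load(in);
            }
        } catch (IOException e) {
            // sketch only: a missing or unreadable resource yields empty defaults
        }
        return props;
    }
}
```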
For the remaining components, the technique is access over the network: this was the lesson of the SOLR example above, and the strategy will be expanded to include all the major components of DSpace: the asset store, email service, handle server, etc. In effect, this transforms MDS into a set of cooperating micro-services, along the lines of SOA.
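The overall direction - each component addressed by a configured URL rather than a co-resident path - might be sketched as a simple service lookup. The class, property keys, and hostnames below are hypothetical, not mds APIs:

```java
import java.net.URI;
import java.util.Properties;

// Hypothetical sketch: components (search, asset store, handle server)
// are located by configured URLs instead of paths under dspace.dir.
public class ServiceLocator {
    private final Properties config;

    public ServiceLocator(Properties config) {
        this.config = config;
    }

    // Each service can live on any host; relocating it is a config change,
    // not a code change - the locality constraint disappears.
    public URI locate(String serviceName) {
        String url = config.getProperty("service." + serviceName + ".url");
        if (url == null) {
            throw new IllegalStateException("No URL configured for " + serviceName);
        }
        return URI.create(url);
    }
}
```

Under this scheme, a setting such as `service.search.url = http://solr-host:8983/solr` (an illustrative key and value) would let SOLR move to a dedicated host with no change to the application.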