Skip to content

Commit

Permalink
improved the abstract
Browse files Browse the repository at this point in the history
  • Loading branch information
stefjoosten committed Nov 18, 2023
1 parent 62f56f3 commit bbae9d0
Show file tree
Hide file tree
Showing 2 changed files with 63 additions and 40 deletions.
97 changes: 60 additions & 37 deletions 2022Migration/articleMigrationFACS.tex
Original file line number Diff line number Diff line change
Expand Up @@ -177,22 +177,21 @@
\maketitle % typeset the header of the contribution
%
\begin{abstract}
Software generators compile a specification into a working information system.
This can help increase the frequency of releases and their reliability
because the generator saves on development time and prevents human-induced mistakes.
However, many generators do not support data migrations.
Data migration is necessary when an incremental deployment changes the system's schema.
As a result, developers tend to avoid migrations or migrate data ``by hand''.

To address this problem, this paper proposes a theory for data migrations aimed at automating the migration process.
The problem at large is how to preserve the semantics of that data under a changing schema.
This paper proposes a theory for deploying incremental change.
The theory is applicable in general but will be implemented in a software generator called Ampersand.

This paper defines a migration process that preserves the semantics of data by means of invariants (constraints)
with zero downtime for users.
The migration process is based on assumptions and requirements that aim to capture a large class of data migrations in practice.
This paper focuses on correctness of the migration. Efficiency of the migration is outside its scope.
Software generators that compile and deploy a specification into a functional information system
can help to increase the frequency of releases and their reliability.
They achieve this by reducing development time and minimizing human-induced errors.
However, a common drawback is the lack of support for data migration,
which becomes essential during incremental deployments that alter the system's schema.
Consequently, schema-changing data migrations often face challenges, leading developers to resort to manual migration or employ workarounds.


To address this issue, this paper proposes a foundational theory for data migration,
aiming to generate migration scripts for automating the migration process.
The overarching challenge is preserving the business semantics of data amidst schema changes.
Specifically, the paper tackles the task of generating a migration script based on the schemas of both the existing and the desired system with zero downtime.
The proposed solution was validated by a prototype demonstrating its efficacy.
Notably, the theory is technology-independent, articulating systems in terms of invariants, thereby ensuring applicability across various scenarios.
The migration script generator will be implemented in a software generator named Ampersand.
\keywords{Generative software \and Incremental software deployment \and Data migration \and Relation algebra \and Ampersand \and Schema change \and Invariants \and Zero downtime}
\end{abstract}

Expand Down Expand Up @@ -229,27 +228,32 @@ \section{Introduction}
where the semantic integrity of data must be preserved across schema changes.
Another use case is application integration for multiple dispersed data sources with explicit schemas.

Another practical problem is that of data quality.
Migrations typically suffer from a backlog of deteriorated data, incurring work to clean it up.
A practical complication in many data migration projects is the presence of deteriorated data.
To clean it up may incur much work.
Some of that work must be done before the migration; some can wait till after the migration.
In all cases, data pollution in an existing system requires careful analysis and planning.
We can capture the automatable part of the data quality problem by regarding it as a requirement to satisfy semantic constraints.
E.g. the constraint that the combination of street name, house number, postal code, and city occurs in
a registration of valid addresses can be checked automatically.
In a formalism like Ampersand, which allows us to express such constraints, we can add constraints for data quality to the schema.
In a formalism like Ampersand~\cite{JoostenRAMiCS2017,Joosten-JLAMP2018}, which allows us to express such constraints, we can add constraints for data quality to the schema.
This allows us to signal the data pollution at runtime.
Some forms of data pollution are not automatable, however.
An example is when a person has deliberately specified a false name without violating any constraint in the system.

The complexity of data migration has prompted us to develop a theory first,
which we present in this contribution.
We have validated the theory by prototyping because a formal proof of correctness is beyond our reach.
This theory perceives an information system as a data set with constraints, so
The existence of a software generator

for changing schemas in production systems before automating SCDMs.
The next section analyzes SCDMs with an eye on zero downtime and data quality.
It sketches the outline of a procedure for SCDMs.
Section~\ref{sct:Definitions} formalizes the concepts that we need to define the procedure.
Section~\ref{sct:Generating} defines the algoritm for generating a migration system, to automate SCDMs.
This algorithm has been proven correct in Isabelle/HOL.
The outline of this proof is discussed in section~\ref{sct:Proof} and the proof itself is published on the internet.

For experimental validation we have used the language Ampersand~\cite{JoostenRAMiCS2017,Joosten-JLAMP2018}
Section~\ref{sct:PoC} demonstrates the prototype of a migration system, which we used to validate our theory experimentally.
For this purpose we have used the language Ampersand
because its syntax and semantics correspond directly to the definitions in section~\ref{sct:Definitions}.
In this way, prototypes built in Ampersand have helped to refine the theory.

\section{Analysis}
\label{sct:Analysis}
Expand All @@ -270,7 +274,7 @@ \subsection{Information Systems}

Figure~\ref{fig:pre-migration} depicts the situation before migration.
An existing application service ingests traffic through an ingress and persists data in a data set, which is typically a database.
This research assumes that the structure and business semantics are represented in a schema, from which the system is generated.
Our research assumes that the structure and business semantics are represented in a schema, from which the system is generated.
Actors (both users and computers) are changing the data in a system continually.
The state of the system is represented by a data set, typically represented in some form of persistent store such as a database.
Events that the system detects may cause the state to change.
Expand All @@ -279,20 +283,19 @@ \subsection{Information Systems}
including SQL databases, object-oriented databases, graph databases, triple stores, and other no-SQL databases.

We assume that constraints implement the business semantics of the data.
Constraints represent business concerns formally, so they can be checked automatically and be used to generate software.
Figure~\ref{fig:pre-migration} depicts them as rules.
Some constraints must be satisfied through human intervention, while others are being satisfied by the system.
So, we distinguish different kinds of constraint:
Constraints represent business concerns formally, so they can be checked automatically and can be used to generate software.
Some of the constraints require human intervention, while others require a system to intervene.
In this paper, we distinguish three different kinds of constraint:
\begin{enumerate}
\item Blocking invariant\\
A \define{blocking invariant} is a constraint that is always true in a system.
It serves to constrain the data set at runtime.
When the data set changes in a way that violates a blocking invariant, the system produces an error message and refuses the change.
\item Transactional invariant\\
The classical database transaction can be understood as an invariant,
The classical database transaction can be understood as a \define{transactional invariant},
which the system keeps satisfied by adding or deleting triples to the dataset.
As soon as the data violates this constraint, the system restores it without human intervention.
So, the outside world experiences a constraint that is always true, i.e.~an invariant.
So, the outside world experiences this as a constraint that is always true, i.e.~an invariant.
\item Business constraint\\
A \define{business constraint} is a constraint that users can violate temporarily until someone restores it.
Example: ``An authorized manager has to sign every purchase order.''
Expand All @@ -304,19 +307,41 @@ \subsection{Information Systems}
Of the three types of constraint, only two are invariants.

\subsection{Ampersand}
We use Ampersand as a prototyping language to demonstrate our theory.
More importantly, we intend to enhance the Ampersand compiler with the theory presented in this paper,
to generate migration systems automatically.

Ampersand is a language that specifies information systems as a system of concepts, relations, and constraints.
It features the types of constraints that we need for this paper, so we have used it to try our theory in practice.
Ampersand has a compiler that generates the information system from a script.
A developer specifies constraints in heterogeneous relation algebra~\cite{Hattensperger1999,Alloy2006}.
It features the kinds of constraint we need for this paper, so we have used it to try our theory in practice.
Ampersand has a compiler that generates the information system from a script, so it satisfies our research requirement.
A developer specifies constraints in heterogeneous relation algebra~\cite{Hattensperger1999, Alloy2006}.
The generated system keeps the invariants satisfied and signals violations of business constraints to a user.
Having all invariants explicit in the Ampersand source code makes Ampersand a suitable platform on which to implement our theory.
The absence of imperative code in an Ampersand script makes it easier to reason about the system.
The type system of Ampersand~\cite{vdWoude2011} features static typing,
which has well established advantages for the software engineering process~\cite{HanenbergKRTS14,Petersen2014}.
which has well-established advantages for the software engineering process~\cite{HanenbergKRTS14, Petersen2014}.
Constraints carry the business semantics, so they allow us to be explicit about ``preserving the meaning as much as possible''.
An Ampersand script contains just enough information to generate a complete system,
which means that a classical database schema (i.e.\ data structure plus constraints) can be extracted from the Ampersand script.

Ampersand has been used in practice both in education (Open University of the Netherlands)
and in industry (Ordina and TNO-ICT).
For example, Ordina designed a proof-of-concept in 2007 of the INDIGO-system.
This design was based on Ampersand, to obtain correct, detailed results in the least amount of time.
Today INDIGO is in use as the core information system of the Dutch immigration authority, IND.
More recently, Ampersand was used to design an information system called DTV for the Dutch Food Authority, NVWA.
A prototype of DTV was built in Ampersand and was used as a model to build the actual system.
TNO-ICT, a major Dutch industrial research laboratory, is using Ampersand for research purposes.
For example, TNO-ICT did a study of international standardization efforts such as
RBAC (Role Based Access Control) in 2003 and architecture (IEEE 1471-2000)~\cite{IEEE1471} in 2004.
Several inconsistencies were found in the last (draft) RBAC standard~\cite{RBAC}.
TNO-ICT has also used the technique in conceiving several patents%
\footnote{e.g. patents DE60218042D, WO2006126875, EP1727327, WO2004046848, EP1563361, NL1023394C, EP1420323, WO03007571, and NL1013450C.}.
At the Open University of the Netherlands, Ampersand is being taught in a course called Rule-based Design~\cite{RBD}.
In this course, students use a platform called RAP, which has been built in Ampersand~\cite{Michels2015}.
RAP has been the first Ampersand application that has run in production.


\subsection{Zero downtime}
To make the case for zero downtime, consider this problem:
Suppose we have an invariant, $u$, in the {\em desired system}, which is not part of the {\em existing system}.
Expand Down Expand Up @@ -584,10 +609,8 @@ \section{Generating a Migration Script}
\label{sct:Generating}
The complexity of migrating data to date yields expensive and error-prone migration projects.
By generating this code we can eliminate human induced errors.
We enhance the quality of the generator further by proving its correctness in Isabelle/HOL.

This section starts with a presentation of the algorithm.
Then we discuss its proof of correctness.

\subsection{Algorithm for generating a migration script}
(\@Bas, ik heb de relevante delen van {\tt Kurk.adl} als \LaTeX\ commentaar bij elke equation in de \LaTeX\ broncode toegevoegd,
Expand Down
6 changes: 3 additions & 3 deletions 2022Migration/doc.bib
Original file line number Diff line number Diff line change
Expand Up @@ -1051,10 +1051,10 @@ @inproceedings{vdWoude2011
title = {{Relational Heterogeneity Relaxed by Subtyping}},
year = {2011}}

@article{RBD,
@book{RBD,
author = {Joosten, Stef and Wedemeijer, Lex and Michels, Gerard},
journal = {Open Universiteit, Heerlen},
title = {{Rule Based Design}},
publisher = {Open Universiteit, Heerlen},
title = {{Rule-Based Design}},
year = {2013}}

@article{BrinkBritzSchmidt94,
Expand Down

0 comments on commit bbae9d0

Please sign in to comment.