The following are the steps to deploy Multiomix in production. If you want to develop, follow the steps specified in the README.md.
- The entire deployment is configured to be straightforward using the Docker Compose tool.
- Create the needed volumes:

    ```bash
    docker volume create --name=multiomics_intermediate_mongo_data
    docker volume create --name=multiomics_intermediate_mongo_config
    docker volume create --name=multiomics_intermediate_postgres_data
    docker volume create --name=multiomics_intermediate_redis_data
    docker volume create --name=multiomics_intermediate_static_data
    docker volume create --name=multiomics_intermediate_media_data
    docker volume create --name=multiomics_intermediate_logs_data
    ```
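    As an optional sanity check (not part of the original steps), you can confirm that all seven volumes were created:

    ```bash
    # Lists only the Multiomix volumes created above
    docker volume ls --filter name=multiomics_intermediate
    ```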
- Copy the file `docker-compose_dist.yml` and paste it with the name `docker-compose.yml`.
- Set the environment variables that are empty with the connection data for the DB, MongoDB, among others (a combined example is sketched after the category list below). They are listed below by category:
    - Django:
        - `DJANGO_SETTINGS_MODULE`: indicates the `settings.py` file to read. For production, the value `multiomics_intermediate.settings_prod` is already set in `docker-compose_dist.yml`.
        - `ENABLE_SECURITY`: set the string `true` to enable Django's security mechanisms. In addition to this parameter, to have a secure site you must configure the HTTPS server; for more information on the latter, see the section Enable SSL/HTTPS. Default `false`.
        - `CSRF_TRUSTED_ORIGINS`: in Django >= 4.x, it's mandatory to define this in production when you are using Daphne through NGINX. The value is a single host or a comma-separated list of hosts. The 'http://' or 'https://' prefixes are mandatory. Examples of values: 'http://127.0.0.1', 'http://127.0.0.1,https://127.0.0.1:8000', etc. You can read more here.
        - `SECRET_KEY`: secret key used by Django. If not specified, one is generated automatically with the generate-secret-key application.
        - `ALLOWED_HOSTS`: list of allowed hosts. Default `multiomix`.
        - `COR_ANALYSIS_SOFT_TIME_LIMIT`: time limit in seconds for a correlation analysis to be computed. If the experiment does not finish within this time, it is marked as `TIMEOUT_EXCEEDED`. Default `10800` (3 hours).
        - `FS_SOFT_TIME_LIMIT`: time limit in seconds for a Feature Selection experiment to be computed. If the experiment does not finish within this time, it is marked as `TIMEOUT_EXCEEDED`. Default `10800` (3 hours).
        - `STAT_VALIDATION_SOFT_TIME_LIMIT`: time limit in seconds for a StatisticalValidation to be computed. If it does not finish within this time, it is marked as `TIMEOUT_EXCEEDED`. Default `10800` (3 hours).
        - `TRAINED_MODEL_SOFT_TIME_LIMIT`: time limit in seconds for a TrainedModel to be computed. If it does not finish within this time, it is marked as `TIMEOUT_EXCEEDED`. Default `10800` (3 hours).
        - `INFERENCE_SOFT_TIME_LIMIT`: time limit in seconds for an InferenceExperiment to be computed. If it does not finish within this time, it is marked as `TIMEOUT_EXCEEDED`. Default `10800` (3 hours).
        - `SYNC_STUDY_SOFT_TIME_LIMIT`: time limit in seconds for a CGDSStudy to be synchronized. If it does not finish within this time, it is marked as `TIMEOUT_EXCEEDED`. Default `3600` (1 hour).
        - `RESULT_DATAFRAME_LIMIT_ROWS`: maximum number of tuples of an experiment result to save in the DB. If the result is larger, it is truncated and the user is warned. The bigger the size, the longer it takes to save the resulting combinations of a correlation analysis in PostgreSQL. Set it to `0` to save all the resulting combinations. Default `300000`.
        - `EXPERIMENT_CHUNK_SIZE`: the size of the batches/chunks in which each dataset of an experiment is processed. Default `500`.
        - `SORT_BUFFER_SIZE`: number of elements kept in memory to perform external sorting (i.e. disk sorting) in the case of having to sort by fit. This impacts the final sorting performance during the computation of an experiment, at the cost of higher memory consumption. Default `2_000_000` elements.
        - `NUMBER_OF_LAST_EXPERIMENTS`: number of recent experiments shown to each user in the `Last experiments` panel on the `Pipeline` page. Default `4`.
        - `MAX_NUMBER_OF_OPEN_TABS`: maximum number of experiment result tabs that the user can open. When the limit is reached, a prompt asks the user to close some tabs before opening more. The more experiment tabs are open, the more memory is consumed. Default `8`.
        - `CGDS_CONNECTION_TIMEOUT`: timeout in seconds of the connection to the cBioPortal server when a study is synchronized. Default `5` seconds.
        - `CGDS_READ_TIMEOUT`: timeout in seconds of the waiting time until the server sends new information about a CGDS study being downloaded. Useful to avoid a synchronization getting stuck due to cBioPortal problems; if the download does not continue within that time, it is cut off. Default `60` seconds.
        - `CGDS_CHUNK_SIZE`: size in bytes of the chunks in which the files of a CGDS study are downloaded. The bigger it is, the faster the download, but the more server memory it consumes. Default `2097152` (i.e. 2MB).
        - `THRESHOLD_ORDINAL`: number of distinct values for the GEM (CNA) information to be considered ordinal. If the number of distinct values is <= this value, the data is considered categorical/ordinal and a boxplot is displayed; otherwise it is considered continuous and the common correlation graph is displayed. Default `5`.
        - `THRESHOLD_GEM_SIZE_TO_COLLECT`: GEM file size threshold (in MB) below which the GEM dataset is kept in memory. This has a HUGE impact on the performance of the analysis. If the size is less than or equal to this threshold, the dataset is allocated in memory; otherwise it is read lazily from disk. If `None`, GGCA automatically allocates in memory when the GEM dataset size is small (<= 100MB). Therefore, if you want to force always using RAM to improve performance, set a very high threshold; on the contrary, if you want minimum memory usage at the cost of poor performance, set it to `0`. Default `None`.
    - PostgreSQL:
        - `POSTGRES_USERNAME`: PostgreSQL connection username. Must be equal to `POSTGRES_USER` (the variable set on the DB service).
        - `POSTGRES_PASSWORD`: PostgreSQL connection password. Must be equal to `POSTGRES_PASSWORD` (the variable set on the DB service).
        - `POSTGRES_HOST`: PostgreSQL connection host.
        - `POSTGRES_PORT`: PostgreSQL connection port.
        - `POSTGRES_DB`: PostgreSQL database where the tables managed by the system will be created. Must be equal to `POSTGRES_DB` (the variable set on the DB service).
    - MongoDB:
        - `MONGO_USERNAME`: MongoDB connection username. Must be equal to `MONGO_INITDB_ROOT_USERNAME`.
        - `MONGO_PASSWORD`: MongoDB connection password. Must be equal to `MONGO_INITDB_ROOT_PASSWORD`.
        - `MONGO_HOST`: MongoDB connection host.
        - `MONGO_PORT`: MongoDB connection port.
        - `MONGO_DB`: MongoDB database where the collections managed by the system will be created. Must be equal to `MONGO_INITDB_DATABASE`.
        - `MONGO_TIMEOUT_MS`: maximum timeout in milliseconds for DB connections. Default `5000` ms.
    - Emailing:
        - `EMAIL_NEW_USER_CONFIRMATION_ENABLED`: set the string `true` to send an email with a confirmation token when a user is created from the Sign-Up panel. Default `false`.
        - `EMAIL_HOST`: only if `EMAIL_NEW_USER_CONFIRMATION_ENABLED` is set to `true`. SMTP host to use for sending email.
        - `EMAIL_PORT`: only if `EMAIL_NEW_USER_CONFIRMATION_ENABLED` is set to `true`. Port to use for the SMTP server defined in `EMAIL_HOST`.
        - `EMAIL_HOST_USER`: only if `EMAIL_NEW_USER_CONFIRMATION_ENABLED` is set to `true`. Username to use for the SMTP server defined in `EMAIL_HOST`. If empty, Django won't attempt authentication.
        - `EMAIL_HOST_PASSWORD`: only if `EMAIL_NEW_USER_CONFIRMATION_ENABLED` is set to `true`. Password to use for the SMTP server defined in `EMAIL_HOST`. This setting is used in conjunction with `EMAIL_HOST_USER` when authenticating to the SMTP server. If either of these settings is empty, Django won't attempt authentication.
        - `DEFAULT_FROM_EMAIL`: default email address to use for various automated correspondence from the site manager(s) (more here). Default `[email protected]`.
        - `EMAIL_MAIL_SUBJECT`: only if `EMAIL_NEW_USER_CONFIRMATION_ENABLED` is set to `true`. Subject of the new account verification email. Default `Confirm your email`.
    - Modulector:
        - `MODULECTOR_HOST`: Modulector connection host. Default `127.0.0.1`.
        - `MODULECTOR_PORT`: Modulector connection port. Default `8001`.
    - BioAPI:
        - `BIOAPI_HOST`: BioAPI connection host. Default `127.0.0.1`.
        - `BIOAPI_PORT`: BioAPI connection port. Default `8002`.
    - Experiment result table:
        - `TABLE_PAGE_SIZE`: number of rows to display per page in the result table. Default `10`.
    - Feature Selection:
        - `N_JOBS_RF`: number of cores used to run the survival RF model. Set it to `-1` to use all cores. Default `1`.
        - `N_JOBS_CV`: number of cores used to compute CrossValidation. Set it to `-1` to use all cores. Default `1`.
        - `COX_NET_GRID_SEARCH_N_JOBS`: number of cores used to compute the GridSearch for the CoxNetSurvivalAnalysis. Set it to `-1` to use all cores. Default `2`.
        - `MIN_ITERATIONS_METAHEURISTICS`: minimum number of iterations the user can select to run the BBHA/PSO algorithm. Default `1`.
        - `MAX_ITERATIONS_METAHEURISTICS`: maximum number of iterations the user can select to run the BBHA/PSO algorithm. Default `20`.
        - `MIN_STARS_BBHA`: minimum number of stars in the BBHA algorithm. Default `5`.
        - `MAX_STARS_BBHA`: maximum number of stars in the BBHA algorithm. Default `90`.
        - `MIN_POPULATION_SIZE_GA`: minimum population size in the GA algorithm. Default `5`.
        - `MAX_POPULATION_SIZE_GA`: maximum population size in the GA algorithm. Default `200`.
        - `MAX_FEATURES_COX_REGRESSION`: maximum number of features to select in the CoxRegression algorithm. Default `60`.
        - `MAX_FEATURES_BLIND_SEARCH`: maximum number of features allowed to run a Blind Search algorithm (the number of computed combinations is N!). If the number of features is greater than this value, the algorithm is disabled and only metaheuristic algorithms are allowed. Default `7`.
        - `MIN_FEATURES_METAHEURISTICS`: minimum number of features required to allow the user to run metaheuristic algorithms (>=). This prevents running metaheuristics on datasets with so few features that an experiment would have more metaheuristic agents than total feature combinations. We recommend setting this parameter to a value N such that N! > maxNumberOfAgents * maxNumberMetaheuristicsIterations. Also, this must be less than or equal to the `MAX_FEATURES_BLIND_SEARCH` value. Default `7`.
        - `MIN_COMBINATIONS_SPARK`: minimum number of combinations required to allow Spark execution (if fewer, the execution is done locally). This is computed as the number of agents in the metaheuristic multiplied by the number of iterations. This prevents running Spark jobs (which are slow to start) on small experiments, saving time and resources. Only considered if the Spark execution is enabled (`ENABLE_AWS_EMR_INTEGRATION` set to `true`). Default `60`.
    - Multiomix AWS EMR integration:
        - `ENABLE_AWS_EMR_INTEGRATION`: set the string `true` to enable the Multiomix-aws-emr integration service. Default `false`.
        - `EMR_DEBUG_IS_ENABLED`: set the string `true` to send the `debug` parameter to the Multiomix-aws-emr service to log the Spark execution. Default `false`.
        - `AWS_EMR_HOST`: AWS-EMR integration service connection host. Default `127.0.0.1`.
        - `AWS_EMR_PORT`: AWS-EMR integration service connection port. Default `8003`.
        - `AWS_EMR_SHARED_FOLDER_DATA`: folder shared with the AWS-EMR integration service, used to move the datasets to be consumed by the integration service. Default `/data-spark`.
        - `AWS_EMR_SHARED_FOLDER_RESULTS`: folder shared with the AWS-EMR integration service, used to retrieve the results generated by the integration service. Default `/results-spark`.
    - Redis server for WebSocket connections:
        - `REDIS_HOST`: IP of the Redis server. If Docker is used, it should be the name of the service, since Docker has its own DNS and can resolve it. Default `redis`, which is the name of the service.
        - `REDIS_PORT`: Redis server port. Default `6379`.
    - Celery workers:
        - `CONCURRENCY`: number of workers to run in parallel.
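    As a hedged illustration of how these categories fit together, the snippet below shows a few of the variables above in shell/env-file syntax. All values and service hostnames (`db`, `mongo`) are placeholders for this sketch, not recommendations; in practice the variables are set on the services in `docker-compose.yml`.

    ```bash
    # Hypothetical values -- replace with your own connection data
    # Django
    DJANGO_SETTINGS_MODULE=multiomics_intermediate.settings_prod
    ENABLE_SECURITY=true
    CSRF_TRUSTED_ORIGINS=https://multiomix.example.org  # placeholder host

    # PostgreSQL: POSTGRES_USERNAME/POSTGRES_PASSWORD must match the
    # POSTGRES_USER/POSTGRES_PASSWORD configured on the DB container
    POSTGRES_USERNAME=multiomics
    POSTGRES_PASSWORD=changeme
    POSTGRES_HOST=db       # assumed service name
    POSTGRES_PORT=5432
    POSTGRES_DB=multiomics

    # MongoDB: must match the MONGO_INITDB_* variables on the Mongo container
    MONGO_USERNAME=multiomics
    MONGO_PASSWORD=changeme
    MONGO_HOST=mongo       # assumed service name
    MONGO_PORT=27017
    MONGO_DB=multiomics
    ```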
- Return to the root folder of the project and start up all services with:
    - Docker Compose:
        - Start: `docker-compose up -d`. The application will be accessible from the address `127.0.0.1`.
        - Stop: `docker-compose down`.
    - Docker Swarm (for clusters; see the section below for how to configure one):
        - Start: `docker stack deploy --compose-file docker-compose.yml multiomix`.
        - Stop: `docker stack rm multiomix`.
- (Optional) Create a superuser to be able to access the admin panel (`<URL:port>/admin`):
    - Log in to the container: `docker container exec -it multiomix_backend bash`.
    - Run: `python3 manage.py createsuperuser`.
    - Exit the container with `exit`.
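    Alternatively, the superuser can be created non-interactively (a sketch; the `DJANGO_SUPERUSER_*` variables are standard Django behavior for `createsuperuser --noinput`, and the credentials here are placeholders). If Emailing is enabled, Django's built-in `sendtestemail` command can verify the SMTP settings:

    ```bash
    # Create the superuser without prompts (password read from the env var)
    docker container exec \
      -e DJANGO_SUPERUSER_USERNAME=admin \
      -e DJANGO_SUPERUSER_EMAIL=admin@example.org \
      -e DJANGO_SUPERUSER_PASSWORD=changeme \
      multiomix_backend python3 manage.py createsuperuser --noinput

    # Optional: send a test email to check the Emailing configuration
    docker container exec multiomix_backend python3 manage.py sendtestemail admin@example.org
    ```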
- (Optional) Optimize PostgreSQL by changing the settings in the `config/postgres/postgres.conf` file. A good place to calculate the parameters from the machine's specifications is PGTune. The `postgres_dist.conf` file is the template that comes in the container by default; it is kept so that you have the template with the official structure. Important: if you change the parameters, do not forget to keep the `listen_addresses = '*'` statement, otherwise the rest of the containers will not be able to access the database (more info on the official Docker image page). A quick verification command is sketched after this list.
- (Optional) Optimize MongoDB by changing the configuration in the `config/mongo/mongod.conf` file.
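To confirm that a tuned `postgres.conf` was actually picked up, as mentioned above, a hedged check (replace the container name and user with your own):

```bash
# Should print '*' if listen_addresses was kept as required
docker exec [name of DB container] psql -U postgres -c "SHOW listen_addresses;"
```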
To manage a cluster, Docker Swarm is used. The steps to follow are:
- Start the cluster: `docker swarm init`.
- Generate a token for the workers to join (official doc): `docker swarm join-token worker`. This prints a command containing a token, which must be executed on every worker to be added.
- Run the command generated in the previous step on each worker: `docker swarm join --token <token generated> <HOST>:<PORT>`.
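Once the workers have executed the join command, you can verify the cluster state from the manager node:

```bash
# Every node should be listed with STATUS 'Ready'; the manager shows
# 'Leader' in the MANAGER STATUS column
docker node ls
```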
To enable HTTPS, follow the steps below:
- Set the `ENABLE_SECURITY` parameter to `true` as explained in the Instructions section.
- Copy the file `config/nginx/multiomics_intermediate_safe_dist.conf` and paste it into `config/nginx/conf.d/` with the name `multiomics_intermediate.conf`.
- Get the `.crt` and `.pem` files for both the certificate and the private key and put them in the `config/nginx/certificates` folder.
- Edit the `multiomics_intermediate.conf` file that was pasted in point 2:
    - Uncomment one of the two `listen` statement lines. You have to choose either `http/1.1` or `http/2`; do not uncomment both.
    - Uncomment the lines where both the `.crt` and `.pem` files must be specified.
- Edit the `docker-compose.yml` file so that the `nginx` service exposes both port 8000 and 443. Also, you need to add the `certificates` folder to the `volumes` section. It should look something like this:

    ```yaml
    # ...
    nginx:
      image: nginx:1.27
      ports:
        - 80:8000
        - 443:443
      # ...
      volumes:
        # ...
        - ./config/nginx/certificates:/etc/nginx/certificates
        # ...
    ```
- Redo the deployment with Docker.
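After redeploying, a quick smoke test of the HTTPS endpoint can be done with curl (a sketch; `-k` skips certificate verification, which is handy with self-signed certificates, and `-I` requests only the headers):

```bash
# Expect an HTTP status line over TLS from the NGINX service
curl -kI https://127.0.0.1
```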
Django provides in its official documentation a configuration checklist that must be satisfied in the production `settings_prod.py` file. To verify that every security rule is complied with, you can run the following command once the server is already up (this is because several environment variables need to be set in the `docker-compose.yml`):

```bash
docker container exec multiomics_backend python3 manage.py check --deploy --settings multiomics_intermediate.settings_prod
```
Otherwise, you can set all the mandatory variables found in `settings_prod.py` (such as `REDIS_HOST`) and run the check directly, without the need to bring up any services:

```bash
python3 manage.py check --deploy --settings multiomics_intermediate.settings_prod
```
IMPORTANT: note that as Multiomix is deployed in Docker, some parameters like `SECURE_SSL_REDIRECT` are not required, as the solution is provided by the NGINX service. The above commands will probably throw warnings that are already covered and do not represent a problem.
To integrate with Modulector and/or BioAPI using `docker-compose.yml` files on the same host:
- Create a common network with the `overlay` driver: `docker network create --driver overlay --attachable multiomix-network`.
- Add the following block to the end of the Multiomix, Modulector and BioAPI `docker-compose.yml` files:

    ```yaml
    networks:
      default:
        external:
          name: 'multiomix-network'
    ```
- The new versions of BioAPI and Modulector already come with service names suitable for integration with Multiomix. But if you have an old version of those platforms, change the Modulector and BioAPI configuration so that it does not conflict with the Multiomix configuration:
    - Rename all the services in the Modulector and BioAPI `docker-compose.yml` files with the suffix `_modulector` and `_bioapi`, and rename the `web` service to `modulector` or `bioapi`, respectively. NOTE: do not forget to rename the `depends_on` parameters and the database connection parameters to point to the new service names.
    - Change the following block in the NGINX configuration files. In Modulector it's `config/nginx/conf.d/modulector.conf`, in BioAPI it's `/nginx/conf.d/default.conf`:

        ```
        # Old
        # upstream web {
        #     ip_hash;
        #     server web:8000;
        # }

        # New
        upstream web {
            ip_hash;
            server modulector:8000;  # Or bioapi, depending on which config file you're editing
        }
        ```
- Set the Multiomix parameters:
    - `MODULECTOR_HOST` to `nginx_modulector`
    - `MODULECTOR_PORT` to `8000`
    - `BIOAPI_HOST` to `nginx_bioapi`
    - `BIOAPI_PORT` to `8000`
- Redo the deployment with Docker.
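To check that the Multiomix backend can reach both services through the shared network, a hedged sketch (container and service names are taken from the steps above; the status code of the root path may vary between versions, and a non-2xx response will raise an `HTTPError`):

```bash
# Docker's internal DNS should resolve the service names on the overlay network
docker container exec multiomix_backend python3 -c \
  "import urllib.request; print(urllib.request.urlopen('http://nginx_modulector:8000').status)"
docker container exec multiomix_backend python3 -c \
  "import urllib.request; print(urllib.request.urlopen('http://nginx_bioapi:8000').status)"
```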
Multiomix offers metaheuristics and other algorithms that can take a long time to execute. In order to shorten these times, it distributes the processing across an Apache Spark cluster.
Integration with that cluster (whether it is mounted on an in-house server or a cloud provider like AWS) is done through a microservice called multiomix-aws-emr.
In order to activate the integration, it is necessary to follow the deployment steps found in the official multiomix-aws-emr repository.
Then the following environment variables must be configured:
- `ENABLE_AWS_EMR_INTEGRATION` to `"true"`
- `AWS_EMR_HOST`
- `AWS_EMR_PORT`
- `AWS_EMR_SHARED_FOLDER`
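A hedged sketch of these variables in env-file syntax (all values are placeholders; the shared folder path must match whatever your multiomix-aws-emr deployment exposes, see that repository's docs):

```bash
ENABLE_AWS_EMR_INTEGRATION=true
AWS_EMR_HOST=127.0.0.1             # host where multiomix-aws-emr is reachable
AWS_EMR_PORT=8003
AWS_EMR_SHARED_FOLDER=/data-spark  # placeholder path
```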
Multiomix uses Celery to distribute the computational load of its most expensive tasks (such as correlation analyses, Biomarkers Feature Selection, statistical validations, Machine Learning model training, etc.). This requires a message broker, such as RabbitMQ or Redis, to be installed and configured. In this project, Redis is used, and a worker is deployed for each of the execution queues, each serving a different type of task. The Docker configuration is ready to run on Docker Compose, Docker Swarm, and K8S.
As for the execution flow, the following error scenarios are considered:
- The task is submitted but the Celery worker is down: in this case the task remains queued in Redis until the worker comes up and runs it.
- The task is submitted and is being executed by Celery, but the Django server is down: in this case the task continues to run in Celery until it finishes, and the user can check the status of the task in the user interface when the Django instance becomes available again.
- The task is submitted and being executed by Celery, but the Celery worker crashes: in this case the task remains in the `PENDING` state and will be executed by Celery when the worker starts, thanks to the script implemented in the `celery.py` file.
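To inspect how many tasks are sitting in a queue while a worker is down (a sketch; Celery on Redis stores each queue as a Redis list named after the queue, `celery` being Celery's default queue name, and the container name is an assumption):

```bash
# Number of queued tasks not yet picked up by a worker
docker exec -it [name of Redis container] redis-cli LLEN celery
```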
The specification of all the Celery services is available both in the Docker Compose/Docker Swarm file (`docker-compose_dist.yml`) and in the K8S configuration files.
In those files, each of these services has a parameter called `CONCURRENCY` (default `2`, except for the `sync-datasets-worker` used to sync CGDS datasets, which is an infrequent task) that specifies how many computing instances can run on that Celery worker. Increasing this parameter allows more tasks to run in parallel.
You can also increase the number of replicas of the Docker service to create the required instances (by default, only one instance of each service is raised); see the example below.
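For instance, under Docker Swarm both knobs can be combined (a sketch; the service name is illustrative, list the real ones with `docker service ls`):

```bash
# Run three replicas of a worker service; each replica processes up to
# CONCURRENCY tasks in parallel
docker service scale multiomix_fs-worker=3
```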
In order to create a database dump, you can execute the following command:

```bash
docker exec -t [name of DB container] pg_dump [db name] --data-only | gzip > multiomix_postgres_backup.sql.gz
```

That command will create a compressed file with the database dump inside. Note that the `--data-only` flag is present because the DB structure is managed by Django migrations, so the schema statements are not necessary.
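Before archiving the dump, it can be worth validating it (a minimal sketch):

```bash
# gzip -t checks archive integrity; zcat | head previews the first statements
gzip -t multiomix_postgres_backup.sql.gz \
  && zcat multiomix_postgres_backup.sql.gz | head -n 20
```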
- Due to major changes, an import will probably throw several errors. To prevent that, do the following steps before importing:
    - Stop all the services using `docker compose down` or by removing the Docker Swarm stack.
    - Start only the DB service with `docker compose up -d db`.
    - Drop the database: `docker exec -i [name of the DB container] psql postgres -U postgres -c "DROP DATABASE multiomics;"`
    - Create an empty database: `docker exec -i [name of the DB container] psql postgres -U postgres -c "CREATE DATABASE multiomics;"`
- Restore the DB: `zcat multiomix_postgres_backup.sql.gz | docker exec -i [name of the DB container] psql multiomics -U multiomics`. This command will restore the database using a compressed dump as the source; keep in mind that the process can take several minutes to finish.
    - NOTE: in case you are working on Windows, the command must be executed from Git Bash or WSL.
In order to create a MongoDB dump, you can execute the following command inside the container:

```bash
mongodump --username [username] --password [password] --authenticationDatabase admin -d multiomics --gzip --archive='multiomix_mongo_backup.txt.gz'
```

- Restore the DB by running the following command inside the container:

```bash
mongorestore --username [username] --password [password] --authenticationDatabase admin --nsFrom 'multiomics.*' --nsTo '[target DB name].*' --archive='multiomix_mongo_backup.txt.gz' --gzip
```

That command will restore the database using a compressed dump as the source. You can use the flag `--numInsertionWorkersPerCollection [number of workers]` to increase importing speed, or `-vvvv` to check the importing status.
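If you prefer to drive the dump from the host instead of opening a shell inside the container, `mongodump` can stream the archive through stdout (a sketch; replace the container name and credentials with your own):

```bash
# --archive with no filename writes the gzipped archive to stdout
docker exec [name of Mongo container] mongodump --username [username] \
  --password [password] --authenticationDatabase admin -d multiomics \
  --gzip --archive > multiomix_mongo_backup.txt.gz
```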
To export the Docker media volume, run the following command: `./tools/export_media.sh`. This will create a tar.gz file, named with the current datetime, in the `media_backups` folder.
To import a `media` folder backup inside a new environment, you must (from the root project folder):
- Extract the `media` backup file inside the `src/media` folder: `tar -zxf backup.tar.gz --directory ./src/media`
- Run the script `./tools/import_media.sh`.