Skip to content

Latest commit

 

History

History
110 lines (92 loc) · 6.69 KB

parameter_optimization.md

File metadata and controls

110 lines (92 loc) · 6.69 KB

Parameter optimization via mlrMBO

Robin Browaeys 2018-02-20

This vignette shows how we optimized both hyperparameters and data source weights via model-based optimization (see manuscript for more information). Because the optimization requires intensive parallel computation, we performed optimization in parallel on a gridengine cluster via the qsub package (https://cran.r-project.org/web/packages/qsub/qsub.pdf). This script is merely illustrative and should be adapted by the user to work on its own system.

The input data used in this vignette can be found at: DOI.

First, we will load in the required packages and networks we will use to construct the models which we will evaluate during the optimization procedure.

library(nichenetr)
library(tidyverse)
library(qsub)
library(mlrMBO)

# in the NicheNet framework, ligand-target links are predicted based on collected biological knowledge on ligand-receptor, signaling and gene regulatory interactions
lr_network = readRDS(url("https://zenodo.org/record/3260758/files/lr_network.rds"))
sig_network = readRDS(url("https://zenodo.org/record/3260758/files/signaling_network.rds"))
gr_network = readRDS(url("https://zenodo.org/record/3260758/files/gr_network.rds"))

We will load in the ligand treatment validation datasets and try to optimize the parameters to maximize both target gene and ligand activity prediction. In this vignette, we do the optimization on all datasets. Alternatively, you can select a specific subset of datasets and evaluate the final performance on the left-out datasets.

# The ligand treatment expression datasets used for validation can be downloaded from Zenodo:
expression_settings_validation = readRDS(url("https://zenodo.org/record/3260758/files/expression_settings.rds"))

Define the optimization wrapper function and config information for the qsub package

mlrmbo_optimization_wrapper = function(...){
  library(nichenetr)
  library(mlrMBO)
  library(parallelMap)
  library(dplyr)
  output = mlrmbo_optimization(...)
  return(output)
}

qsub_config = create_qsub_config(
  remote = "[email protected]:1234",
  local_tmp_path = "/tmp/r2gridengine",
  remote_tmp_path = "/scratch/personal/myuser/r2gridengine",
  num_cores = 24,
  memory = "10G",
  wait = FALSE, 
  max_wall_time = "500:00:00"
)

Perform optimization:

additional_arguments_topology_correction = list(source_names = source_weights_df$source %>% unique(), 
                                                algorithm = "PPR", 
                                                correct_topology = FALSE,
                                                lr_network = lr_network, 
                                                sig_network = sig_network, 
                                                gr_network = gr_network, 
                                                settings = lapply(expression_settings_validation,convert_expression_settings_evaluation), 
                                                secondary_targets = FALSE, 
                                                remove_direct_links = "no", 
                                                cutoff_method = "quantile")
nr_datasources = additional_arguments_topology_correction$source_names %>% length()

obj_fun_multi_topology_correction = makeMultiObjectiveFunction(name = "nichenet_optimization",
                                                               description = "data source weight and hyperparameter optimization: expensive black-box function", 
                                                               fn = model_evaluation_optimization, 
                                                               par.set = makeParamSet(
                                                                 makeNumericVectorParam("source_weights", len = nr_datasources, lower = 0, upper = 1), 
                                                                 makeNumericVectorParam("lr_sig_hub", len = 1, lower = 0, upper = 1),  
                                                                 makeNumericVectorParam("gr_hub", len = 1, lower = 0, upper = 1),  
                                                                 makeNumericVectorParam("ltf_cutoff", len = 1, lower = 0.9, upper = 0.999),  
                                                                 makeNumericVectorParam("damping_factor", len = 1, lower = 0.01, upper = 0.99)), 
                                                               has.simple.signature = FALSE,
                                                               n.objectives = 4, 
                                                               noisy = FALSE,
                                                               minimize = c(FALSE,FALSE,FALSE,FALSE))
set.seed(1)

# Run with: 50 iterations, 24 desings evaluated in parallel and 240 start designs

job_mlrmbo = qsub_lapply(X = 1,
                               FUN = mlrmbo_optimization_wrapper,
                               object_envir = environment(mlrmbo_optimization_wrapper),
                               qsub_config = qsub_config,
                               qsub_environment = NULL, 
                               qsub_packages = NULL,
                               obj_fun_multi_topology_correction, 
                               50, 24, 240, 
                               additional_arguments_topology_correction)

Once the job is finised (which can take a few days - for shorter running time: reduce the number of iterations), run:

res_job_mlrmbo = qsub_retrieve(job_mlrmbo)

Get now the most optimal parameter setting as a result of this analysis

optimized_parameters = res_job_mlrmbo %>% process_mlrmbo_nichenet_optimization(source_names = source_weights_df$source %>% unique())

When you would be interested to generate a context-specific model, it could be possible that you would like to optimize the parameters specifically on your dataset of interest and not on the general ligand treatment datasets (be aware for overfitting, though!). Because for your own data, you don't know the true active ligands, you could only optimize target gene prediction performance and not ligand activity performance. In order to this, you would need to change the expression settings in the optimization functions such that they include your data, and use the function model_evaluation_optimization_application instead of model_evaluation_optimization (define this as the function parameter in makeMultiObjectiveFunction shown here above).