This is the implementation of ActiveDR, an Activeness-based Data Retention Solution for HPC Facilities.
Please note that our implementation will continue to improve as the project progresses. It is written in Python, and we use (or plan to use) the following packages in the program.
pyyaml
numpy
neo4j
pandas
scipy
networkx
mpi4py
sortedcontainers
The targeted working environment of our implementation is currently the Cori supercomputer, which is hosted by NERSC. The rest of this document describes how to install and run our implementation.
git clone "https://github.com/zhangwei217245/ActiveDR.git"
It is highly recommended that you use your own conda environment rather than the globally available one. By doing this, you will avoid conflicts between different versions of the packages.
If you are working on Cori, you can do the following to set up your own conda environment.
As we use mpi4py in our project, the package needs to be compiled from scratch when installed into your own conda environment. To ensure that the compilation succeeds, load the following series of modules to guarantee a working build environment.
module unload PrgEnv-intel
module load PrgEnv-gnu/6.0.5
module load cmake/3.14.4
module load gcc
module load openmpi/3.1.3
module load llvm/10.0.0
module load python
First, create a conda environment named ActiveDR_env
conda create -n ActiveDR_env python=3
Now initialize your conda environment
conda init
Then, activate your conda environment
conda activate ActiveDR_env
If you need to get back to the original default environment, do the following:
conda deactivate
First, make sure the conda environment is activated
conda activate ActiveDR_env
Then, install pip
conda install pip
Now, you can install required packages:
pip install -r requirements.txt
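After installing the requirements, you can quickly check which of the listed packages are importable in the active environment. The helper below is a hypothetical convenience snippet, not part of ActiveDR (note that pyyaml is imported as `yaml`):

```python
import importlib.util

# Packages from the list above (pyyaml's import name is "yaml").
REQUIRED = ["yaml", "numpy", "neo4j", "pandas", "scipy",
            "networkx", "mpi4py", "sortedcontainers"]

def check_packages(names):
    """Return a dict mapping each package name to True if it is importable."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

if __name__ == "__main__":
    for name, ok in check_packages(REQUIRED).items():
        print(f"{name}: {'OK' if ok else 'MISSING'}")
```

Any package reported as MISSING can be installed individually with `pip install <package>` inside the activated environment.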
Please click the DOI link below to request access to the demo dataset:
Once you download the data_min.tar.gz
file, put it in ${PROJECT_HOME}.
- Note: ${PROJECT_HOME} is where your local copy of this code repo is.
tar zxvf data_min.tar.gz
This should create the directory ${PROJECT_HOME}/data
and put all the necessary datasets in it.
cd ${PROJECT_HOME}/bin
nohup python -u user_activity_analyzer.py -m local -d 20160823 > nohup.out 2>&1 &
cd bin
nohup python -u user_activity_analyzer.py -d 20160823 > nohup.out 2>&1 &
cat ${PROJECT_HOME}/nohup.out
ls -al ${PROJECT_HOME}/data/purge_result_2
cd ${PROJECT_HOME}/bin
nohup python -u user_activity_analyzer.py -d 20160823 > nohup.out 2>&1 &
- Note: The default value of the argument
-m
is hpc
, which refers to the HPC environment.
sbatch run_active_eva_knl.sh
sbatch run_active_eva.sh
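The sbatch scripts above launch user_activity_analyzer_mpi.py across many MPI ranks. As a rough illustration only, a round-robin split of the user list across ranks with mpi4py could look like the sketch below; the actual partitioning logic in the repo may differ, and the fallback branch simply mimics a single-process (`-m local`) run:

```python
# Hypothetical sketch of rank-based work splitting; not the repo's actual code.
try:
    from mpi4py import MPI
    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
except ImportError:
    # mpi4py unavailable: behave like a single-process run.
    rank, size = 0, 1

def partition(items, rank, size):
    """Return the round-robin slice of items assigned to this rank."""
    return items[rank::size]

users = [f"user{i}" for i in range(10)]
print(f"rank {rank}/{size} handles: {partition(users, rank, size)}")
```

With a round-robin split, each rank analyzes a disjoint subset of users, and the per-rank results are merged afterwards (see bin/reduce_active_result.sh in the script list below).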
Here we list all relevant source code files for ActiveDR.
┣━┓ lib
┣━┓ data_source
┃ ┣━┓ csv
┃ ┃ ┣━━ CSVReader.py # a CSV reader class that can be reused for reading CSV files and generating pandas dataframes
┃ ┣━┓ ornl
┃ ┃ ┣━━ PurgeSimulator.py # a purge policy simulator that maintains counters of purged files for various types of users.
┃ ┃ ┣━━ UserActivityAnalyzer.py # a user activeness analyzer that evaluates users' activeness based on their job submissions and research publications.
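To give a feel for the reusable CSV reader described above, here is a minimal sketch in the spirit of lib/data_source/csv/CSVReader.py. The class name matches the file, but the constructor and method names are assumptions and may differ from the repo's actual API:

```python
import pandas as pd

class CSVReader:
    """Hypothetical sketch of a reusable CSV-to-DataFrame reader."""

    def __init__(self, path, delimiter=","):
        self.path = path
        self.delimiter = delimiter

    def to_dataframe(self, **kwargs):
        """Read the CSV file and return it as a pandas DataFrame.

        Extra keyword arguments are forwarded to pandas.read_csv,
        e.g. usecols or dtype.
        """
        return pd.read_csv(self.path, sep=self.delimiter, **kwargs)
```

Centralizing CSV parsing in one class lets the analyzer and the simulator share delimiter and parsing conventions instead of calling pandas directly in each module.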
Script Name | Description |
---|---|
bin/user_activity_analyzer.py | The python script for running ActiveDR as a single process. |
bin/user_activity_analyzer_mpi.py | The python script for running ActiveDR with MPI support. |
bin/run_active_eva.sh | A bash script for running user_activity_analyzer_mpi.py on Cori supercomputer Haswell nodes. |
bin/run_active_eva_knl.sh | A bash script for running user_activity_analyzer_mpi.py on Cori supercomputer KNL nodes. |
bin/run_0823_active_eva.sh | A bash script for running user_activity_analyzer_mpi.py on Cori Haswell nodes with a specific metadata snapshot (20160823). |
bin/reduce_active_result.sh | A bash script for merging the results generated by user_activity_analyzer_mpi.py onto a single node or a local computer. |
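The merge performed by bin/reduce_active_result.sh can be pictured as concatenating the per-rank result files into one. The sketch below is only illustrative: the function name, the per-rank file naming pattern, and the CSV result format are assumptions, not the repo's actual layout:

```python
import glob
import pandas as pd

def merge_rank_results(pattern, out_path):
    """Concatenate per-rank CSV result files matching `pattern`
    into a single CSV written to `out_path` (hypothetical helper)."""
    frames = [pd.read_csv(p) for p in sorted(glob.glob(pattern))]
    merged = pd.concat(frames, ignore_index=True)
    merged.to_csv(out_path, index=False)
    return merged
```

For example, results produced by N ranks as `result_rank*.csv` could be merged with `merge_rank_results("result_rank*.csv", "result_all.csv")` on a single node or a local computer.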