- There are a few steps for incoming lab members. If you are a full-time lab member, be sure to run through this checklist
- If you are a part-time member and/or student, follow this checklist
- Subscribe to the lab calendar. This is where all meetings and events are organized. To do so, select this link, then subscribe by either:
  - Selecting the + Google Calendar button at the bottom of the Im Lab Calendar, which will take you to your own Google Calendar and ask if you would like to add it. Or,
  - Logging into your Google Calendar. On the left side, select Add Calendar, and then From URL. Copy and paste the URL from the Im Lab Calendar.
- Go through the RStudio primers 1 to 6 (if they are too basic, skip all except for the reproducibility section)
  - After finishing each of the following tutorials, fill out this form.
  - The Basics
  - Work with Data
  - Visualize Data
  - Tidy Your Data
  - Iterate
  - Write Functions
  - Report Reproducibility
- GitHub intro: click here
  - When going through the tutorial, skip the section on setting up ssh
  - Fill out this form
- TODO: post your first note following the instructions here
- TODO: run your first GWAS, QC included, following these instructions
- TODO: run imputed transcriptome association, colocalization, and Mendelian Randomization following this lab
  - Begin in the optional items section and first set up your system for the lab
  - If working on a lab desktop, you may need to update or install miniconda. Install it from bash with the .sh file, and wherever the code in the lab calls for conda, enter the file path ./miniconda3/bin/conda instead
- TODO: read the following papers and write a short post on internal-notes.hakyimlab.org with a graphical summary of them
We work with many different tools on many different projects. The training resources are organized into functional groups below. You may want to skip reading the material in some groups, and it may be worthwhile to spend a longer time with other groups.
- GitHub
- Introduction to Data Science
- Blogdown
- Genomics
- Computational Resources
- Miscellaneous
- Hands on training
We use GitHub to store and organize our code. There is an introduction here. If you are curious about when one would use certain GitHub features, look at this link, which describes 'GitHub flow'.
The lab's main GitHub page can be found at https://github.com/hakyimlab. If you have been added to lab-members and you are logged in, you can see the lab's private repositories as well.
GitHub has stopped accepting passwords in the terminal and RStudio, so be sure to set up your token. Instructions on how to do so are here
- An introduction to machine learning problems and model metrics: link
- We work fairly heavily with the generalized linear model, so it may be good to brush up on it:
- This is a Python course for data science that also covers running commands in the shell: link
- SQLite in Python link
- Introduction to Data Analysis with R link
- Another data science course in R: link
- R Studio's cheatsheets: link
- Hadley's R Style link
- R tools for reporting data analyses in a reproducible manner link
- Some basics on tidyverse and ggplot2
- This course introduces ggplot2, plyr, dplyr, tidyr, and knitr for data analysis link
- Our lab does a lot of work with SQLite databases using the RSQLite package
- Data Manipulation in R with dplyr link
- Data Visualization in R with ggplot2 link1, link2
- A machine learning package for R, mlr link
- Docker is not really an R package, but this presentation gives a good overview of use cases for Docker, and how to integrate with R link
- Data Wrangling download pdf
- R Markdown download pdf
- Data visualization download pdf
CRI Gardner, RCC midway, and most of the Bionimbus virtual machines all run on Linux, so we use the command line a lot.
- If you haven't used a bash command line before, here is a good place to start: link
- This lesson covers more commands: link
- This is a great cheatsheet for using the command line and shell scripting, including flow control and function declaration: link
- Knowledge of some bash commands can go a long way. Comfort with `grep`, `awk`, `sed`, and `xargs` is especially useful.
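As a rough illustration of how these four commands fit together, here is a minimal sketch on a toy table. The file name, column names, and values are all made up for the example and do not come from any lab dataset.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Toy summary table with made-up values, for illustration only.
workdir="$(mktemp -d)"
printf '%s\t%s\t%s\n' \
    snp chr pvalue \
    rs1 chr1 0.20 \
    rs2 chr1 0.00001 \
    rs3 chr2 0.03 \
    rs4 chr2 0.000002 > "$workdir/sumstats.tsv"

# grep: keep only the rows for chromosome 2 (-w matches whole words).
chr2_rows="$(grep -w 'chr2' "$workdir/sumstats.tsv")"

# awk: skip the header and print the SNP ids with a p-value below 1e-4.
top_snps="$(awk -F'\t' 'NR > 1 && ($3 + 0) < 1e-4 { print $1 }' "$workdir/sumstats.tsv")"

# sed: strip the "chr" prefix wherever it is followed by a digit.
sed 's/chr\([0-9]\)/\1/' "$workdir/sumstats.tsv" > "$workdir/nochr.tsv"

# xargs: run one grep per selected SNP id to pull out its full row.
top_rows="$(printf '%s\n' "$top_snps" | xargs -I{} grep -w {} "$workdir/sumstats.tsv")"

printf 'chr2 rows:\n%s\n' "$chr2_rows"
printf 'rows below threshold:\n%s\n' "$top_rows"
```

The same pattern (filter with `grep` or `awk`, rewrite with `sed`, fan out with `xargs`) scales to much larger files, since each tool streams its input line by line.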
Some knowledge of SQLite will be useful. See the vignette here
- UCLA Big Bio: intro to genomics videos. These are very helpful to understand the field of genomics at a high level.
- The New Genetics is an NIH publication surveying what we know about the biological mechanisms of genetics.
For more background, the projects the lab is currently working on are similar to the ones in these papers.
- The UK Biobank resource with deep phenotyping and genomic data
- A gene-based association method for mapping traits using reference transcriptome data
- A brief history of human disease genetics
- The GTEx Consortium atlas of genetic regulatory effects across human tissues
- Enhancing GTEx by bridging the gaps between genotype, gene expression, and disease The eGTEx Project
- Widespread dose-dependent effects of RNA expression and splicing on complex diseases and traits
GTEx Consortium: The Genotype-Tissue Expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science 2015, 348:648–660.
The 1000 Genomes Consortium: A global reference for human genetic variation link
Kundaje A, Meuleman W, Ernst J, Bilenky M, Yen A, Heravi-Moussavi A, et al. Integrative analysis of 111 reference human epigenomes. Nature. 2015;518:317–30.
Li YI, van de Geijn B, Raj A, Knowles DA, Petti AA, Golan D, et al. RNA splicing is a primary link between genetic variation and disease. Science. 2016;352:600–4.
Albert FW, Kruglyak L: The role of regulatory variation in complex traits and disease. Nat Rev Genet 2015, 16:197–212.
Das S, Abecasis GR, Browning BL: Genotype Imputation from Large Reference Panels. Annu Rev Genomics Hum Genet 2018;19:73-96.
Im HK, Gamazon ER, Nicolae DL, Cox NJ: On sharing quantitative trait GWAS results in an era of multiple-omics data and the limits of genomic privacy. Am J Hum Genet 2012, 90:591–598.
Finucane HK, Bulik-Sullivan B, Gusev A, Trynka G, Reshef Y, Loh P.-R., et al. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nature Genetics 2015, 47:1228-1235.
Finucane HK, Reshef YA, Anttila V, Slowikowski K, Gusev A, Byrnes A, et al. Heritability enrichment of specifically expressed genes identifies disease-relevant tissues and cell types. Nature Genetics 2018, 50:621-629.
Visscher PM: Human Complex Trait Genetics in the 21st Century. Genetics 2016, 202:377–379.
Gardner is a large, high-performance computing cluster and data storage system. We use it to run computations and store data. The lab's group folder is located at /gpfs/data/im-lab/
- UChicago CRI Workshop Tutorials: CRI does a seminar series each academic year. You can find the schedule here: link
- Intro to Gardner: this is a good explanation of what Gardner does, and why a high-performance computing cluster is important to bioinformatics: link
- Gardner uses Torque as its job scheduler, which means that jobs are submitted as PBS files.
- A short, incomplete list of commands that may help when using PBS:
  - To submit a job, `qsub <path to whatever job file>`. It will print the `job_id` to the console, which is often useful for searching the queue and finding logs.
  - To view the status of your jobs, `qstat`
  - To delete a job, `qdel <job_id>`
  - Gardner has a few different queues to which you can submit jobs. Knowing the resources allotted to jobs in each queue can help. Jobs generally start sooner if you request fewer resources. You can use `qstat -q` to list all queues with current usage statistics, and you can use `qstat -Qf <queue name>` for details on the resources.
  - `qstat | grep Q` will list only queued jobs, and if you're submitting a bunch of them, `qstat | grep Q | wc -l` will count the jobs in the queue.
  - Hopefully this doesn't happen, but if you need to cancel all of your queued jobs, run `qselect -s Q | xargs qdel`.
- If you need to run a long file submission, like a python script that submits jobs for hours, you don't have to keep a terminal window open to continue the process if you use `screen`. Here are the steps I used:
  $ ssh gardner
  [cri-gardner-in001] $ screen
  [cri-gardner-in001] $ <the command you wanted to run>
  - The important thing is to exit the screen with `ctrl+a d`. Then you should see the message `[detached]`
- On macOS: from Finder, click 'Go', then 'Connect to Server', then connect to `smb://prfs.cri.uchicago.edu/im-lab`
- On Linux: mounting via `sftp://cri-syncmon.cri.uchicago.edu/gpfs/data/im-lab` has worked for us in the past.
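To make the PBS commands listed above more concrete, here is a minimal sketch of a job file and its submission. The job name, the resource requests, and the file location are hypothetical placeholders, not lab defaults; check the queue limits with `qstat -Qf <queue name>` before choosing real values. The script only calls `qsub` when it is available, so it can be tried safely on a machine without Torque.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical location and name for the example job file.
job_file="${TMPDIR:-/tmp}/example_job.pbs"

# Write a small PBS job file. The resource values below are placeholders.
cat > "$job_file" <<'EOF'
#!/bin/bash
#PBS -N example_job
#PBS -l walltime=00:10:00
#PBS -l nodes=1:ppn=1
#PBS -l mem=1gb
#PBS -j oe

# Torque starts the job in the home directory, so move to where qsub was run.
cd "$PBS_O_WORKDIR"
echo "Running on $(hostname)"
EOF

if command -v qsub >/dev/null 2>&1; then
    # qsub prints the job id, which qstat and qdel accept as an argument.
    job_id="$(qsub "$job_file")"
    echo "Submitted job: $job_id"
    qstat "$job_id" || true
else
    echo "qsub not found; wrote $job_file without submitting"
fi
```

Keeping the requested walltime and memory small is one way to follow the advice above about requesting fewer resources.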
Bionimbus Protected Data Cloud is a storage/computation resource where the lab is allotted a certain amount of processors and storage, and we store and compute on virtual machines. If you'll be working on Bionimbus, make sure to begin your application(s) quickly because the process has multiple steps.
For both Gardner and Bionimbus, you'll be working through ssh tunnels a good deal, so it will pay off to configure your ssh settings once and not have to fill in passwords all the time.
First, to avoid having to enter a password at each login, generate and forward an ssh key pair. To create an RSA key pair, open a terminal and type
$ ssh-keygen -t rsa
Press Enter when you are prompted with "Enter a file in which to save the key"
Type and enter a passphrase when prompted
Your private key will be generated using the default filename (for example, id_rsa) or the filename you specified, and stored on your computer in a .ssh directory off your home directory (for example, ~/.ssh/id_rsa ).
Your public key will be generated using the same filename (but with a .pub extension added) and stored in the same location (for example, ~/.ssh/id_rsa.pub). Do not share your private key. Only share your public one.
Once you have your RSA key pair, you will copy and paste your public key into `~/.ssh/authorized_keys` on the host you are trying to access.
If your account doesn't already contain a `~/.ssh/authorized_keys` file, create one
mkdir -p ~/.ssh
touch ~/.ssh/authorized_keys
Copy and paste your public key (for example, ~/id_rsa.pub), using
cat ~/id_rsa.pub >> ~/.ssh/authorized_keys
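The steps above can be wrapped into a small helper that also sets restrictive permissions and avoids adding the same key twice. This is only a sketch: `add_authorized_key` is a hypothetical function name, not part of OpenSSH, and the key used in the demo is a placeholder string rather than a real key. The demo writes into a temporary directory so that it does not touch your real `~/.ssh`.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper (not part of OpenSSH).
# Usage: add_authorized_key <public key file> [ssh directory]
add_authorized_key() {
    local pub_key_file="$1"
    local ssh_dir="${2:-$HOME/.ssh}"
    local auth_file="$ssh_dir/authorized_keys"

    mkdir -p "$ssh_dir"
    touch "$auth_file"
    # Restrict access to the owner; sshd may refuse keys kept in
    # directories or files that other users can write to.
    chmod 700 "$ssh_dir"
    chmod 600 "$auth_file"

    # Append the key only if the exact line is not already present
    # (-x whole line, -F fixed string, -f read the pattern from a file).
    if ! grep -qxFf "$pub_key_file" "$auth_file"; then
        cat "$pub_key_file" >> "$auth_file"
    fi
}

# Demo in a throwaway directory with a placeholder key.
demo_dir="$(mktemp -d)"
echo "ssh-rsa AAAAexamplekey user@laptop" > "$demo_dir/id_rsa.pub"
add_authorized_key "$demo_dir/id_rsa.pub" "$demo_dir/.ssh"
add_authorized_key "$demo_dir/id_rsa.pub" "$demo_dir/.ssh"   # second call adds nothing
cat "$demo_dir/.ssh/authorized_keys"
```

On the remote host you would call it with the path of the public key you copied over, for example `add_authorized_key ~/id_rsa.pub`.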
Create and configure your SSH config file
touch ~/.ssh/config
chmod 600 ~/.ssh/config
emacs ~/.ssh/config
Enter the following
Host gardner
  HostName gardner.cri.uchicago.edu
  IdentityFile ~/.ssh/username
  User yourusername
Host midway2
  HostName midway2.rcc.uchicago.edu
  IdentityFile ~/.ssh/username
  User yourusername
Host bionimbus
  HostName bionimbus-pdc.opensciencedatacloud.org
  IdentityFile ~/.ssh/username
  User yourusername
Host argonne
  HostName login.mcs.anl.gov
  User yourusername
  IdentityFile ~/.ssh/username
Host washington
  HostName washington.cels.anl.gov
  User yourusername
  IdentityFile ~/.ssh/username
  ProxyCommand ssh -q -A argonne -W %h:%p
Now you should be able to directly ssh into any of the above hosts.
If you want to be able to log in with your RSA key pair instead of a password, you need to add your public key to the authorized_keys file on the remote host. For example, if you want to log in directly to gardner, then on gardner go to
cd ~/.ssh
vi authorized_keys
and paste in your public key.
- BigQuery tutorials link
- Google Cloud training document link
- To do uploads from CRI to BigQuery, you will need to install the Google Cloud SDK link
- This is another great collection of tools / intros for genomics and computational biology. It's like this training page, but has even more resources.
- Read genomic data user code of conduct
- Reproducible Research link
- Get CITI training link
- Basics of Health Privacy
- Responsible Conduct of Research (RCR) Basic
- Human Subjects Research – Biomedical
- Basics of Information Security
- Conflict of Interest
- Enloc-coloc comparison
- Jeff Leek's Github page
install git (brew install git) or
upgrade git (brew upgrade git)
install git-lfs (brew install git-lfs)
https://www.notion.so/Heather-Wheeler-s-tutorials-f2e3a612d3d040a08db1becc139449b4