edit lab
gbdias committed Oct 28, 2024
1 parent 95295fe commit 966d4b8
Showing 1 changed file with 13 additions and 12 deletions.
25 changes: 13 additions & 12 deletions lab_loadingdata.Rmd
@@ -26,7 +26,7 @@ output:

# Introduction

-Up until now we have mostly created the object we worked with on the fly from within R. The most common use-case is however to read in different data sets that are stored as files, either somewhere on a server or locally on your computer. In this exercise we will test some common ways to import data in R and also show to save data from R. After this exercise you will know how to:
+Up until now we have mostly created the object we worked with on the fly from within R. The most common use-case is however to read in different data sets that are stored as files, either somewhere on a server or locally on your computer. In this exercise we will test some common ways to import data in R and also how to save data from R. After this exercise you will know how to:

- Read data from txt files and save the information as a vector, data frame or a list.
- Identify missing data and correctly encode this at import
@@ -66,17 +66,17 @@ shelley.vec[381]
2. Go back and fix the way you read in the text to make sure that you get a vector with all words in the chapter as individual entries. Also filter out any non-letter characters, and then identify the longest word.

```{r,accordion=TRUE}
-shelley.vec2 <- scan(file="https://raw.githubusercontent.com/NBISweden/workshop-r/master/data/lab_loadingdata/book_chapter.txt", what='character', sep=' ', quote=NULL)
+shelley.vec2 <- scan(file="https://raw.githubusercontent.com/NBISweden/workshop-r/master/data/lab_loadingdata/book_chapter.txt", what=character(), sep=' ', quote=NULL)
shelley.filt2 <- gsub(pattern='[^[:alnum:] ]', replacement="", x=shelley.vec2)
-which(nchar(shelley.filt2) == max(nchar(shelley.filt2)))
-shelley.filt2[301]
+longest <- which(nchar(shelley.filt2) == max(nchar(shelley.filt2)))
+shelley.filt2[longest]
```
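
The same steps can be tried without a network connection. The following is a minimal sketch, assuming a made-up one-sentence string `txt` in place of the book chapter; it passes the string through the `text` argument of `scan()` instead of `file`, and the object names are only illustrative.

```{r}
# Hypothetical stand-in for the chapter text (not the course data).
txt <- "It was on a dreary night of November, that I beheld the accomplishment of my toils."

# Split on single spaces and switch off quote handling, so every word is one entry.
words <- scan(text = txt, what = character(), sep = " ", quote = "", quiet = TRUE)

# Drop everything that is not a letter or a digit (e.g. the comma in "November,").
clean <- gsub(pattern = "[^[:alnum:] ]", replacement = "", x = words)

# Index by the result of which() rather than by a hard-coded position.
longest <- which(nchar(clean) == max(nchar(clean)))
clean[longest]
```

Indexing with `longest` keeps the answer correct even if the text, and therefore the position of the longest word, changes.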

# `read.table()`

This is by far the most common way to get data into R. As the function creates a data frame at import, it only works for data sets that fit that structure, meaning that the data need to have a set of columns of equal length that are separated by a common string, e.g. tab, comma or semicolon.

-In this code block with first import the data from [normalized.txt](https://raw.githubusercontent.com/NBISweden/workshop-r/master/data/lab_loadingdata/normalized.txt) and accept the defaults for all other arguments in the function. With this settings R will read it as a tab delimited file and will use the first row of the data as colnames (header) and the first column as rownames.
+In this code block we first import the data from [normalized.txt](https://raw.githubusercontent.com/NBISweden/workshop-r/master/data/lab_loadingdata/normalized.txt) and accept the defaults for all other arguments in the function. With this settings R will read it as a tab delimited file and will use the first row of the data as colnames (header) and the first column as rownames.

```{r,accordion=TRUE}
expr.At <- read.table("https://raw.githubusercontent.com/NBISweden/workshop-r/master/data/lab_loadingdata/normalized.txt")
@@ -85,7 +85,7 @@ head(expr.At)

One does not, however, need to have all data as files on the local disk; one can also read data from online resources. The following command reads a file from a web server.

-```{r,accordion=TRUE, error=T}
+```{r,accordion=TRUE}
url <- 'http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data'
abalone <- read.table(url, header=FALSE , sep=',')
head(abalone)
@@ -94,28 +94,28 @@ head(abalone)
1. Read this [example data](https://raw.githubusercontent.com/NBISweden/workshop-r/master/data/lab_loadingdata/example.data) into R using the `read.table()` function. This file consists of gene expression values. Once you have the object in R, validate that it looks okay and export it using the `write.table()` function.

```{r,accordion=TRUE}
-ed <- read.table("https://raw.githubusercontent.com/NBISweden/workshop-r/master/data/lab_loadingdata/example.data", sep=":")
+ed <- read.table("https://raw.githubusercontent.com/NBISweden/workshop-r/master/data/lab_loadingdata/example.data", sep=":", header = T)
head(ed)
str(ed)
```
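
To see why the `header` argument matters for a file like this, here is a small sketch, assuming a made-up colon-separated string `raw` (it is not the content of example.data). When the first line has as many fields as the data lines, `read.table()` does not treat it as a header unless told to.

```{r}
# Hypothetical colon-separated data with a header line and one missing value.
raw <- "gene:sampleA:sampleB\ng1:1.5:2.0\ng2:NA:3.1"

# Without header=TRUE the column names end up as the first data row,
# so the expression columns are no longer numeric.
no.header <- read.table(text = raw, sep = ":")

# With header=TRUE the first line supplies the column names.
with.header <- read.table(text = raw, sep = ":", header = TRUE)

str(no.header)
str(with.header)
```

Checking the result with `str()`, as in the exercise, is a quick way to spot this kind of problem: numeric columns that show up as text usually mean the header was read as data.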

Encode all NA values as "missing", at export.

```{r,eval=FALSE,accordion=TRUE}
-write.table(x=ed, na="missing", file="example_mis.data")
+write.table(x=ed, na="missing", file="example_write.txt")
```

2. Read in the file you just created and double-check that you have the same data as earlier.

```{r,eval=FALSE,accordion=TRUE}
-df.test <- read.table("example_mis.data", na.strings="missing")
+df.test <- read.table("example_write.txt", na.strings="missing")
```
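
The export and re-import can also be checked on a small scale. This is a minimal sketch, assuming a made-up data frame `toy` and a temporary file created with `tempfile()` instead of the exercise files; the point is only that the string given to `na` at export must match `na.strings` at import.

```{r}
# Hypothetical data frame with one missing value.
toy <- data.frame(gene = c("g1", "g2", "g3"), expr = c(1.5, NA, 3.2))

# Write to a temporary file, encoding NA as "missing".
tmp <- tempfile(fileext = ".txt")
write.table(x = toy, na = "missing", file = tmp)

# Read it back, telling R that "missing" means NA.
toy.back <- read.table(tmp, na.strings = "missing")

# Compare the columns of the original and the re-imported data.
all.equal(toy$expr, toy.back$expr)
identical(as.character(toy$gene), as.character(toy.back$gene))
```

If `na.strings` were left out at import, the `expr` column would contain the text "missing" and would no longer be read as numeric.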

3. Analysing genome annotation in R using read.table

For this exercise we will load a GTF file into R and calculate some basic summary statistics from the file. In the first part we will use basic manipulations of data frames to extract the information. In the second part you get to try out a library designed to work with annotation data, which stores the information in a more complex format that allows for easy manipulation and calculation of summaries from genome annotation files.

-For those not familiar with the gtf format it is a file format containing annotation information for a genome. It does not contain the actual DNA sequence of the organism, but instead refers to positions along the genome.
+For those not familiar with the GTF format it is a file format containing annotation information for a genome. It does not contain the actual DNA sequence of the organism, but instead refers to positions along the genome.

A valid GTF file should contain the following tab-delimited fields (taken from the Ensembl home page).

@@ -136,7 +136,7 @@ A valid GTF file should contain the following tab delimited fields (taken from t

The last column can contain a large number of attributes that are semicolon-separated.
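
As an illustration of what such an attribute column looks like once it is in R, the sketch below splits one attribute string into key/value pairs using base R only. The string `attr.string` and the identifiers in it are made up for this example, and the layout (`key "value";`) is assumed to follow the usual GTF convention.

```{r}
# Hypothetical attribute string in the style of the last GTF column.
attr.string <- 'gene_id "FBgn0000001"; gene_name "toy"; gene_biotype "protein_coding";'

# Split on semicolons, trim surrounding whitespace and drop empty pieces.
fields <- trimws(strsplit(attr.string, ";", fixed = TRUE)[[1]])
fields <- fields[fields != ""]

# The key is everything before the first space, the value is the rest without quotes.
keys <- sub(" .*$", "", fields)
values <- gsub('"', "", sub("^[^ ]+ ", "", fields))

# Named character vector: look up attributes by name.
attrs <- setNames(values, keys)
attrs[["gene_id"]]
```

This only handles a single string; for a whole annotation file a dedicated annotation package is usually the more convenient route.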

-As these files for many organisms are large we will in this exercise use the latest version of Drosophila melanogaster genome annotation available at `ftp://ftp.ensembl.org/pub/release-86/gtf/drosophila_melanogaster` that is small enough for analysis even on a laptop.
+As these files for many organisms are large we will in this exercise use the latest version of *Drosophila melanogaster* genome annotation available at `ftp://ftp.ensembl.org/pub/release-86/gtf/drosophila_melanogaster` that is small enough for analysis even on a laptop.

Download the file named **Drosophila_melanogaster.BDGP6.86.gtf.gz** to your computer. Unzip this file and keep track of where you store it.

@@ -166,13 +166,14 @@ str(d.gtf)
1. How many chromosome names can be found in the annotation file?

```{r,accordion=TRUE}
-levels(d.gtf$Chromosome)
+length(levels(as.factor(d.gtf$Chromosome)))
```

2. How many **exons** are there in total and per chromosome? (hint: first extract lines that have `Feature == 'exon'`)

```{r,accordion=TRUE}
d.gtf.exons <- d.gtf[(d.gtf$Feature == 'exon'),]
+nrow(d.gtf.exons)
aggregate(d.gtf.exons$Feature, by=list(d.gtf.exons$Chromosome), summary)
```
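
The counting steps in the two questions above can be sketched on a toy data frame. The object `toy.gtf` and its contents are made up for this example; only the column names `Chromosome` and `Feature` follow the exercise. `table()` is used here as one possible alternative for the per-chromosome count, since it works the same whether the columns were imported as character vectors or as factors.

```{r}
# Hypothetical miniature annotation table with character columns.
toy.gtf <- data.frame(
  Chromosome = c("2L", "2L", "2R", "X", "X", "X"),
  Feature    = c("exon", "gene", "exon", "exon", "exon", "CDS"),
  stringsAsFactors = FALSE
)

# Number of distinct chromosome names.
n.chrom <- length(unique(toy.gtf$Chromosome))

# Keep only the exon rows, then count them in total and per chromosome.
toy.exons <- toy.gtf[toy.gtf$Feature == "exon", ]
n.exons <- nrow(toy.exons)
exons.per.chrom <- table(toy.exons$Chromosome)

n.chrom
n.exons
exons.per.chrom
```

`length(unique(x))` gives the same number as `length(levels(as.factor(x)))` when `x` contains no missing values.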
