DESeq2: A Bioconductor Package
Overview
Teaching: 0 min
Exercises: 0 minQuestions
How do I install and use a Bioconductor package?
How can I find which genes are differentially expressed in RNA-Seq data?
Objectives
Install the Bioconductor package DESeq2 using biocLite.
Follow a workflow in a Bioconductor package vignette.
Use DESeq2 to identify differentially expressed genes in sample RNA-Seq data.
What is DESeq2?
DESeq2 is a popular Bioconductor package that is used to identify differentially expressed genes in RNA-Seq data. It is well-documented and widely used: the paper describing it has been cited thousands of times.
Installing DESeq2
We will use biocLite to install DESeq2 by using the following command in RStudio:
biocLite("DESeq2")
For this lesson, we will be following a modified version of the DESeq2 vignette. The original vignette and reference manual have much more information.
Finding Reference Manuals and Vignettes
As we saw earlier, the Reference Manual and Vignette for DESeq2 are available on the biocViews page.
You can also find the Vignette by typing the following command in RStudio:
browseVignettes("DESeq2")
This method is especially useful because it ensures that you are looking at the vignette that corresponds to the version of DESeq2 installed on your computer.
Input
To detect differentially expressed genes, you must first have a list of read counts per gene (or transcript). In an RNA-Seq experiment you might get a list of counts from a program like Salmon or htseq-count. In this example, we’ll start with a list of read counts that is included in pasilla, a Bioconductor experiment package.
Start by installing and loading the pasilla package:
biocLite("pasilla")
library("pasilla")
We’re interested in two files: the counts of RNA-Seq reads per gene and the description of the samples.
pasCts <- system.file("extdata",
"pasilla_gene_counts.tsv",
package="pasilla", mustWork=TRUE)
pasAnno <- system.file("extdata",
"pasilla_sample_annotation.csv",
package="pasilla", mustWork=TRUE)
Now we’ll read in these files:
cts <- as.matrix(read.csv(pasCts,sep="\t",row.names="gene_id"))
coldata <- read.csv(pasAnno, row.names=1)
Take a look at cts and coldata by clicking on them in the RStudio environment panel. We will only need two columns of coldata, so run the following command:
coldata <- coldata[,c("condition","type")]
Also, we need to make sure that the sample names are consistent between cts and coldata, and in the same order. First, get rid of the “fb” on the row names of coldata:
rownames(coldata) <- sub("fb", "", rownames(coldata))
Now, put the columns of the cts matrix in the same order as in coldata:
cts <- cts[, rownames(coldata)]
Now, the data is ready to be analyzed with DESeq2.
The DESeqDataSet
First, load the DESeq2 package:
library("DESeq2")
DESeq2 stores read counts and information from the statistical analysis in a DESeqDataSet object. Run the following command to make a DESeqDataSet from cts and coldata:
dds <- DESeqDataSetFromMatrix(countData = cts,
colData = coldata,
design = ~ condition)
Pre-filtering
To use less memory and increase speed, we will pre-filter our dataset, to remove genes with very low expression. Run the following commands to keep only the genes that have at least 10 reads total:
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]
Set your control
The following command will explicitly set the “untreated” group as the reference sample:
dds$condition <- relevel(dds$condition, ref = "untreated")
Differential Expression analysis
Now that we have set up the experiment, we can do the statistical analysis to determine which genes are differentially expressed:
dds <- DESeq(dds)
res <- results(dds)
Let’s get a summary of our results:
summary(res)
Exploring and Exporting results
DESeq2 also has some built-in plotting functions. For example:
plotMA(res, ylim=c(-2,2))
Exporting results to CSV format
In order to share your results, you might want to export them to a file that can be opened as a spreadsheet.
resOrdered <- res[order(res$pvalue),]
write.csv(as.data.frame(resOrdered),
file="condition_treated_results.csv")
Check the DESeq2 vignette for much more information about options for analysis, visualization and export.
Key Points
DESeq2 is one example of a well-documented Bioconductor package.
DESeq2 works as one step in a data analysis pipeline, detecting differentially expressed genes from gene counts.
In addition to performing statistical calculations, DESeq2 has functions for visualization and export of results.