DESeq2: A Bioconductor Package

Overview

Teaching: 0 min
Exercises: 0 min

Questions

How do I install and use a Bioconductor package?

How can I find which genes are differentially expressed in RNA-Seq data?

Objectives

Install the Bioconductor package DESeq2 using biocLite.

Follow a workflow in a Bioconductor package vignette.

Use DESeq2 to identify differentially expressed genes in sample RNA-Seq data.

What is DESeq2?

DESeq2 is a popular Bioconductor package that is used to identify differentially expressed genes in RNA-Seq data. It is well-documented and widely used: the paper describing it has been cited thousands of times.

Installing DESeq2

We will use biocLite to install DESeq2 by using the following command in RStudio:

biocLite("DESeq2")

For this lesson, we will be following a modified version of the DESeq2 vignette. The original vignette and reference manual have much more information.

Finding Reference Manuals and Vignettes

As we saw earlier, the Reference Manual and Vignette for DESeq2 are available on the biocViews page.

You can also find the Vignette by typing the following command in RStudio: browseVignettes("DESeq2") This method is especially useful because it ensures that you are looking at the vignette that corresponds to the version of DESeq2 installed on your computer.

Input

To detect differentially expressed genes, you must first have a list of read counts per gene (or transcript). In an RNA-Seq experiment you might get a list of counts from a program like Salmon or htseq-count. In this example, we’ll start with a list of read counts that is included in pasilla, a Bioconductor experiment package.

Start by installing and loading the pasilla package:

biocLite("pasilla")
library("pasilla")

We’re interested in two files: the counts of RNA-Seq reads per gene and the description of the samples.

pasCts <- system.file("extdata",
                      "pasilla_gene_counts.tsv",
                      package="pasilla", mustWork=TRUE)
pasAnno <- system.file("extdata",
                       "pasilla_sample_annotation.csv",
                       package="pasilla", mustWork=TRUE)

Now we’ll read in these files:

cts <- as.matrix(read.csv(pasCts,sep="\t",row.names="gene_id"))
coldata <- read.csv(pasAnno, row.names=1)

Take a look at cts and coldata by clicking on them in the RStudio environment panel. We will only need two columns of coldata, so run the following command:

coldata <- coldata[,c("condition","type")]

Also, we need to make sure that the sample names are consistent between cts and coldata, and in the same order. First, get rid of the “fb” on the row names of coldata:

rownames(coldata) <- sub("fb", "", rownames(coldata))

Now, put the columns of the cts matrix in the same order as in coldata:

cts <- cts[, rownames(coldata)]

Now, the data is ready to be analyzed with DESeq2.

The DESeqDataSet

First, load the DESeq2 package:

library("DESeq2")

DESeq2 stores read counts and information from the statistical analysis in a DESeqDataSet object. Run the following command to make a DESeqDataSet from cts and coldata:

dds <- DESeqDataSetFromMatrix(countData = cts,
                              colData = coldata,
                              design = ~ condition)

Pre-filtering

To use less memory and increase speed, we will pre-filter our dataset, to remove genes with very low expression. Run the following commands to keep only the genes that have at least 10 reads total:

keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]

Set your control

The following command will explicitly set the “untreated” group as the reference sample:

dds$condition <- relevel(dds$condition, ref = "untreated")

Differential Expression analysis

Now that we have set up the experiment, we can do the statistical analysis to determine which genes are differentially expressed:

dds <- DESeq(dds)
res <- results(dds)

Let’s get a summary of our results:

summary(res)

Exploring and Exporting results

DESeq2 also has some built-in plotting functions. For example:

plotMA(res, ylim=c(-2,2))

Exporting results to CSV format

In order to share your results, you might want to export them to a file that can be opened as a spreadsheet.

resOrdered <- res[order(res$pvalue),]
write.csv(as.data.frame(resOrdered),
          file="condition_treated_results.csv")

Check the DESeq2 vignette for much more information about options for analysis, visualization and export.

Key Points

DESeq2 is one example of a well-documented Bioconductor package.

DESeq2 works as one step in a data analysis pipeline, detecting differentially expressed genes from gene counts.

In addition to performing statistical calculations, DESeq2 has functions for visualization and export of results.

previous episode

Introduction to Bioconductor

lesson home