This lesson is being piloted (Beta version)

Introduction to data analysis with R and Bioconductor

Data organisation with Spreadsheets

Overview

Teaching: XX min
Exercises: XX min
Questions
  • How to organise tabular data?

Objectives
  • Learn about spreadsheets, their strenghts and weaknesses

  • How do we format data in spreadsheets for effective data use?

  • Learn about common spreadsheet errors and how to correct them.

  • Organize your data according to tidy data principles.

  • Learn about text-based spreadsheet formats such as the comma-separated (CSV) or tab-separated formats.

Spreadsheet programs

Question

Objective

Keypoint

Good data organization is the foundation of your research project. Most researchers have data or do data entry in spreadsheets. Spreadsheet programs are very useful graphical interfaces for designing data tables and handling very basic data quality control functions. See also @Broman:2018.

Spreadsheet outline

Spreadsheets are good for data entry. Therefore we have a lot of data in spreadsheets. Much of your time as a researcher will be spent in this ‘data wrangling’ stage. It’s not the most fun, but it’s necessary. We’ll teach you how to think about data organization and some practices for more effective data wrangling.

What this lesson will not teach you

If you’re looking to do this, a good reference is Head First Excel, published by O’Reilly.

Why aren’t we teaching data analysis in spreadsheets

Many spreadsheet programs are available. Since most participants utilise Excel as their primary spreadsheet program, this lesson will make use of Excel examples. A free spreadsheet program that can also be used is LibreOffice. Commands may differ a bit between programs, but the general idea is the same.

Spreadsheet programs encompass a lot of the things we need to be able to do as researchers. We can use them for:

Spreadsheet programs use tables to represent and display data. Data formatted as tables is also the main theme of this chapter, and we will see how to organise data into tables in a standardised way to ensure efficient downstream analysis.

Problems with Spreadsheets

Spreadsheets are good for data entry, but in reality we tend to use spreadsheet programs for much more than data entry. We use them to create data tables for publications, to generate summary statistics, and make figures.

Generating tables for publications in a spreadsheet is not optimal - often, when formatting a data table for publication, we’re reporting key summary statistics in a way that is not really meant to be read as data, and often involves special formatting (merging cells, creating borders, making it pretty). We advise you to do this sort of operation within your document editing software.

The latter two applications, generating statistics and figures, should be used with caution: because of the graphical, drag and drop nature of spreadsheet programs, it can be very difficult, if not impossible, to replicate your steps (much less retrace anyone else’s), particularly if your stats or figures require you to do more complex calculations. Furthermore, in doing calculations in a spreadsheet, it’s easy to accidentally apply a slightly different formula to multiple adjacent cells. When using a command-line based statistics program like R or SAS, it’s practically impossible to apply a calculation to one observation in your dataset but not another unless you’re doing it on purpose.

Using Spreadsheets for Data Entry and Cleaning

However, there are circumstances where you might want to use a spreadsheet program to produce “quick and dirty” calculations or figures, and data cleaning will help you use some of these features. Data cleaning also puts your data in a better format prior to importation into a statistical analysis program. We will show you how to use some features of spreadsheet programs to check your data quality along the way and produce preliminary summary statistics.

In this lesson, we will assume that you are most likely using Excel as your primary spreadsheet program - there are others (gnumeric, Calc from OpenOffice), and their functionality is similar, but Excel seems to be the program most used by biologists and biomedical researchers.

In this lesson we’re going to talk about:

  1. Formatting data tables in spreadsheets
  2. Formatting problems
  3. Dates as data
  4. Quality control
  5. Exporting data

Formatting data tables in spreadsheets

Questions

Objectives

Keypoints

The most common mistake made is treating spreadsheet programs like lab notebooks, that is, relying on context, notes in the margin, spatial layout of data and fields to convey information. As humans, we can (usually) interpret these things, but computers don’t view information the same way, and unless we explain to the computer what every single thing means (and that can be hard!), it will not be able to see how our data fits together.

Using the power of computers, we can manage and analyze data in much more effective and faster ways, but to use that power, we have to set up our data for the computer to be able to understand it (and computers are very literal).

This is why it’s extremely important to set up well-formatted tables from the outset - before you even start entering data from your very first preliminary experiment. Data organization is the foundation of your research project. It can make it easier or harder to work with your data throughout your analysis, so it’s worth thinking about when you’re doing your data entry or setting up your experiment. You can set things up in different ways in spreadsheets, but some of these choices can limit your ability to work with the data in other programs or have the you-of-6-months-from-now or your collaborator work with the data.

Note: the best layouts/formats (as well as software and interfaces) for data entry and data analysis might be different. It is important to take this into account, and ideally automate the conversion from one to another.

Keeping track of your analyses

When you’re working with spreadsheets, during data clean up or analyses, it’s very easy to end up with a spreadsheet that looks very different from the one you started with. In order to be able to reproduce your analyses or figure out what you did when a reviewer or instructor asks for a different analysis, you should

This might be an example of a spreadsheet setup:

Put these principles in to practice today during your exercises.

While versioning is out of scope for this course, you can look at the Carpentries lesson on ‘Git’ to learn how to maintain version control over your data. See also this blog post for a quick tutorial or @Perez-Riverol:2016 for a more research-oriented use-case.

Structuring data in spreadsheets

The cardinal rules of using spreadsheet programs for data:

  1. Put all your variables in columns - the thing you’re measuring, like ‘weight’ or ‘temperature’.
  2. Put each observation in its own row.
  3. Don’t combine multiple pieces of information in one cell. Sometimes it just seems like one thing, but think if that’s the only way you’ll want to be able to use or sort that data.
  4. Leave the raw data raw - don’t change it!
  5. Export the cleaned data to a text-based format like CSV (comma-separated values) format. This ensures that anyone can use the data, and is required by most data repositories.

For instance, we have data from patients that visited several hospitals from Brussels, Belgium. They recorded the date of the visit, the hospital, the patients’ gender, weight and blood group.

If we were to keep track of the data like this:

the problem is that the ABO and Rhesus groups are in the same Blood type column. So, if they wanted to look at all observations of the A group or look at weight distributions by ABO group, it would be tricky to do this using this data setup. If instead we put the ABO and Rhesus groups in different columns, you can see that it would be much easier.

An important rule when setting up a datasheet, is that columns are used for variables and rows are used for observations:

Challenge: We’re going to take a messy data and describe how we would clean it up.

  1. Download a messy data by clicking here.

  2. Open up the data in a spreadsheet program.

  3. You can see that there are two tabs. The data contains various clinical variables recorded in various hospitals in Brussels during the first and second COVID-19 waves in 2020. As you can see, the data have been recorded differently during the March and November waves. Now you’re the person in charge of this project and you want to be able to start analyzing the data.

  4. With the person next to you, identify what is wrong with this spreadsheet. Also discuss the steps you would need to take to clean up first and second wave tabs, and to put them all together in one spreadsheet.

Important: Do not forget our first piece of advice: to create a new file (or tab) for the cleaned data, never modify your original (raw) data. However in this case we will have taken care of this for you!

After you go through this exercise, we’ll discuss as a group what was wrong with this data and how you would fix it.

Challenge: Once you have tidied up the data, answer the following questions:

  • How many men and women took part in the study?
  • How many A, AB, and B types have been tested?
  • As above, but disregarding the contaminated samples?
  • How many Rhesus + and - have been tested?
  • How many universal donors (0-) have been tested?
  • What is the average weight of AB men?
  • How many samples have been tested in the different hospitals?

An excellent reference, in particular with regard to R scripting is the Tidy Data paper @Wickham:2014.

Common Spreadsheet Errors

Questions

Objectives

Keypoints

There are a few potential errors to be on the lookout for in your own data as well as data from collaborators or the Internet. If you are aware of the errors and the possible negative effect on downstream data analysis and result interpretation, it might motivate yourself and your project members to try and avoid them. Making small changes to the way you format your data in spreadsheets, can have a great impact on efficiency and reliability when it comes to data cleaning and analysis.

Using multiple tables

A common strategy is creating multiple data tables within one spreadsheet. This confuses the computer, so don’t do this! When you create multiple tables within one spreadsheet, you’re drawing false associations between things for the computer, which sees each row as an observation. You’re also potentially using the same field name in multiple places, which will make it harder to clean your data up into a usable form. The example below depicts the problem:

In the example above, the computer will see (for example) row 4 and assume that all columns A-AF refer to the same sample. This row actually represents four distinct samples (sample 1 for each of four different collection dates - May 29th, June 12th, June 19th, and June 26th), as well as some calculated summary statistics (an average (avr) and standard error of measurement (SEM)) for two of those samples. Other rows are similarly problematic.

Using multiple tabs

But what about workbook tabs? That seems like an easy way to organize data, right? Well, yes and no. When you create extra tabs, you fail to allow the computer to see connections in the data that are there (you have to introduce spreadsheet application-specific functions or scripting to ensure this connection). Say, for instance, you make a separate tab for each day you take a measurement.

This isn’t good practice for two reasons:

  1. you are more likely to accidentally add inconsistencies to your data if each time you take a measurement, you start recording data in a new tab, and

  2. even if you manage to prevent all inconsistencies from creeping in, you will add an extra step for yourself before you analyze the data because you will have to combine these data into a single datatable. You will have to explicitly tell the computer how to combine tabs - and if the tabs are inconsistently formatted, you might even have to do it manually.

The next time you’re entering data, and you go to create another tab or table, ask yourself if you could avoid adding this tab by adding another column to your original spreadsheet. We used multiple tabs in our example of a messy data file, but now you’ve seen how you can reorganize your data to consolidate across tabs.

Your data sheet might get very long over the course of the experiment. This makes it harder to enter data if you can’t see your headers at the top of the spreadsheet. But don’t repeat your header row. These can easily get mixed into the data, leading to problems down the road. Instead you can freeze the column headers so that they remain visible even when you have a spreadsheet with many rows.

Not filling in zeros

It might be that when you’re measuring something, it’s usually a zero, say the number of times a rabbit is observed in the survey. Why bother writing in the number zero in that column, when it’s mostly zeros?

However, there’s a difference between a zero and a blank cell in a spreadsheet. To the computer, a zero is actually data. You measured or counted it. A blank cell means that it wasn’t measured and the computer will interpret it as an unknown value (also known as a null or missing value).

The spreadsheets or statistical programs will likely mis-interpret blank cells that you intend to be zeros. By not entering the value of your observation, you are telling your computer to represent that data as unknown or missing (null). This can cause problems with subsequent calculations or analyses. For example, the average of a set of numbers which includes a single null value is always null (because the computer can’t guess the value of the missing observations). Because of this, it’s very important to record zeros as zeros and truly missing data as nulls.

Using problematic null values

Example: using -999 or other numerical values (or zero) to represent missing data.

Solutions:

There are a few reasons why null values get represented differently within a dataset. Sometimes confusing null values are automatically recorded from the measuring device. If that’s the case, there’s not much you can do, but it can be addressed in data cleaning with a tool like OpenRefine before analysis. Other times different null values are used to convey different reasons why the data isn’t there. This is important information to capture, but is in effect using one column to capture two pieces of information. Like for [using formatting to convey information]((#formatting) it would be good here to create a new column like ‘data_missing’ and use that column to capture the different reasons.

Whatever the reason, it’s a problem if unknown or missing data is recorded as -999, 999, or 0.

Many statistical programs will not recognize that these are intended to represent missing (null) values. How these values are interpreted will depend on the software you use to analyze your data. It is essential to use a clearly defined and consistent null indicator.

Blanks (most applications) and NA (for R) are good choices. @White:2013 explain good choices for indicating null values for different software applications in their article:

Using formatting to convey information

Example: highlighting cells, rows or columns that should be excluded from an analysis, leaving blank rows to indicate separations in data.

Solution: create a new field to encode which data should be excluded.

Using formatting to make the data sheet look pretty

Example: merging cells.

Solution: If you’re not careful, formatting a worksheet to be more aesthetically pleasing can compromise your computer’s ability to see associations in the data. Merged cells will make your data unreadable by statistics software. Consider restructuring your data in such a way that you will not need to merge cells to organize your data.

Placing comments or units in cells

Most analysis software can’t see Excel or LibreOffice comments, and would be confused by comments placed within your data cells. As described above for formatting, create another field if you need to add notes to cells. Similarly, don’t include units in cells: ideally, all the measurements you place in one column should be in the same unit, but if for some reason they aren’t, create another field and specify the units the cell is in.

Entering more than one piece of information in a cell

Example: Recording ABO and Rhesus groups in one cell, such as A+, B+, A-, …

Solution: Don’t include more than one piece of information in a cell. This will limit the ways in which you can analyze your data. If you need both these measurements, design your data sheet to include this information. For example, include one column the ABO group and one for the Rhesus group.

Using problematic field names

Choose descriptive field names, but be careful not to include spaces, numbers, or special characters of any kind. Spaces can be misinterpreted by parsers that use whitespace as delimiters and some programs don’t like field names that are text strings that start with numbers.

Underscores (_) are a good alternative to spaces. Consider writing names in camel case (like this: ExampleFileName) to improve readability. Remember that abbreviations that make sense at the moment may not be so obvious in 6 months, but don’t overdo it with names that are excessively long. Including the units in the field names avoids confusion and enables others to readily interpret your fields.

Examples

Good Name Good Alternative Avoid
Max_temp_C MaxTemp Maximum Temp (°C)
Precipitation_mm Precipitation precmm
Mean_year_growth MeanYearGrowth Mean growth/year
sex sex M/F
weight weight w.
cell_type CellType Cell Type
Observation_01 first_observation 1st Obs

Using special characters in data

Example: You treat your spreadsheet program as a word processor when writing notes, for example copying data directly from Word or other applications.

Solution: This is a common strategy. For example, when writing longer text in a cell, people often include line breaks, em-dashes, etc in their spreadsheet. Also, when copying data in from applications such as Word, formatting and fancy non-standard characters (such as left- and right-aligned quotation marks) are included. When exporting this data into a coding/statistical environment or into a relational database, dangerous things may occur, such as lines being cut in half and encoding errors being thrown.

General best practice is to avoid adding characters such as newlines, tabs, and vertical tabs. In other words, treat a text cell as if it were a simple web form that can only contain text and spaces.

Inclusion of metadata in data table

Example: You add a legend at the top or bottom of your data table explaining column meaning, units, exceptions, etc.

Solution: Recording data about your data (“metadata”) is essential. You may be on intimate terms with your dataset while you are collecting and analysing it, but the chances that you will still remember that the variable “sglmemgp” means single member of group, for example, or the exact algorithm you used to transform a variable or create a derived one, after a few months, a year, or more are slim.

As well, there are many reasons other people may want to examine or use your data - to understand your findings, to verify your findings, to review your submitted publication, to replicate your results, to design a similar study, or even to archive your data for access and re-use by others. While digital data by definition are machine-readable, understanding their meaning is a job for human beings. The importance of documenting your data during the collection and analysis phase of your research cannot be overestimated, especially if your research is going to be part of the scholarly record.

However, metadata should not be contained in the data file itself. Unlike a table in a paper or a supplemental file, metadata (in the form of legends) should not be included in a data file since this information is not data, and including it can disrupt how computer programs interpret your data file. Rather, metadata should be stored as a separate file in the same directory as your data file, preferably in plain text format with a name that clearly associates it with your data file. Because metadata files are free text format, they also allow you to encode comments, units, information about how null values are encoded, etc. that are important to document but can disrupt the formatting of your data file.

Additionally, file or database level metadata describes how files that make up the dataset relate to each other; what format they are in; and whether they supercede or are superceded by previous files. A folder-level readme.txt file is the classic way of accounting for all the files and folders in a project.

(Text on metadata adapted from the online course Research Data MANTRA by EDINA and Data Library, University of Edinburgh. MANTRA is licensed under a Creative Commons Attribution 4.0 International License.)

Exporting data

Question

Objectives

Keypoints

Storing the data you’re going to work with for your analyses in Excel default file format (*.xls or *.xlsx - depending on the Excel version) isn’t a good idea. Why?

Storing data in a universal, open, and static format will help deal with this problem. Try tab-delimited (tab separated values or TSV) or comma-delimited (comma separated values or CSV). CSV files are plain text files where the columns are separated by commas, hence ‘comma separated values’ or CSV. The advantage of a CSV file over an Excel/SPSS/etc. file is that we can open and read a CSV file using just about any software, including plain text editors like TextEdit or NotePad. Data in a CSV file can also be easily imported into other formats and environments, such as SQLite and R. We’re not tied to a certain version of a certain expensive program when we work with CSV files, so it’s a good format to work with for maximum portability and endurance. Most spreadsheet programs can save to delimited text formats like CSV easily, although they may give you a warning during the file export.

To save a file you have opened in Excel in CSV format:

  1. From the top menu select ‘File’ and ‘Save as’.
  2. In the ‘Format’ field, from the list, select ‘Comma Separated Values’ (*.csv).
  3. Double check the file name and the location where you want to save it and hit ‘Save’.

An important note for backwards compatibility: you can open CSV files in Excel!

Saving an Excel file to CSV.

A note on R and xls: There are R packages that can read xls files (as well as Google spreadsheets). It is even possible to access different worksheets in the xls documents.

But

Caveats on commas

In some datasets, the data values themselves may include commas (,). In that case, the software which you use (including Excel) will most likely incorrectly display the data in columns. This is because the commas which are a part of the data values will be interpreted as delimiters.

For example, our data might look like this:

species_id,genus,species,taxa
AB,Amphispiza,bilineata,Bird
AH,Ammospermophilus,harrisi,Rodent, not censused
AS,Ammodramus,savannarum,Bird
BA,Baiomys,taylori,Rodent

In the record AH,Ammospermophilus,harrisi,Rodent, not censused the value for taxa includes a comma (Rodent, not censused). If we try to read the above into Excel (or other spreadsheet program), we will get something like this:

The risks of having commas inside comma-separated data.

The value for taxa was split into two columns (instead of being put in one column D). This can propagate to a number of further errors. For example, the extra column will be interpreted as a column with many missing values (and without a proper header). In addition to that, the value in column D for the record in row 3 (so the one where the value for ‘taxa’ contained the comma) is now incorrect.

If you want to store your data in csv format and expect that your data values may contain commas, you can avoid the problem discussed above by putting the values in quotes (“”). Applying this rule, our data might look like this:

species_id,genus,species,taxa
"AB","Amphispiza","bilineata","Bird"
"AH","Ammospermophilus","harrisi","Rodent, not censused"
"AS","Ammodramus","savannarum","Bird"
"BA","Baiomys","taylori","Rodent"

Now opening this file as a csv in Excel will not lead to an extra column, because Excel will only use commas that fall outside of quotation marks as delimiting characters.

Alternatively, if you are working with data that contains commas, you likely will need to use another delimiter when working in a spreadsheet1. In this case, consider using tabs as your delimiter and working with TSV files. TSV files can be exported from spreadsheet programs in the same way as CSV files.

If you are working with an already existing dataset in which the data values are not included in “” but which have commas as both delimiters and parts of data values, you are potentially facing a major problem with data cleaning. If the dataset you’re dealing with contains hundreds or thousands of records, cleaning them up manually (by either removing commas from the data values or putting the values into quotes - “”) is not only going to take hours and hours but may potentially end up with you accidentally introducing many errors.

Cleaning up datasets is one of the major problems in many scientific disciplines. The approach almost always depends on the particular context. However, it is a good practice to clean the data in an automated fashion, for example by writing and running a script. The Python and R lessons will give you the basis for developing skills to build relevant scripts.

Summary

A typical data analysis workflow.

A typical data analysis worflow is illustrated in the figure above, where data is repeatedly tranformed, visualised, modelled. This iteration is repeated multiple times until the data is understood. In many real-life cases, however, most time is spent cleaning up and preparing the data, rather than actually analysing and understanding it.

An agile data analysis workflow, with several fast iterations of the transform/visualise/model cycle is only feasible is the data is formatted in a predictable way and one can reason about the data without having to look at it and/or fix it.

  1. This is particularly relevant in European countries where the comma is used as a decimal separator. In such cases, the default value separator in a csv file will be the semi-colon (;), or values will be systematically quoted. 

Key Points

  • Good data organization is the foundation of any research project.


R and RStudio

Overview

Teaching: XX min
Exercises: XX min
Questions
  • What are R and RStudio?

Objectives
  • Describe the purpose of the RStudio Script, Console, Environment, and Plots panes.

  • Organize files and directories for a set of analyses as an R project, and understand the purpose of the working directory.

  • Use the built-in RStudio help interface to search for more information on R functions.

  • Demonstrate how to provide sufficient information for troubleshooting with the R user community.

What is R? What is RStudio?

The term R is used to refer to both the programming language, the environment for statistical computing and the software that interprets the scripts written using it.

RStudio is currently a very popular way to not only write your R scripts but also to interact with the R software1. To function correctly, RStudio needs R and therefore both need to be installed on your computer.

The RStudio IDE Cheat Sheet provides much more information that will be covered here, but can be useful to learn keyboard shortcuts and discover new features.

Why learn R?

R does not involve lots of pointing and clicking, and that’s a good thing

The learning curve might be steeper than with other software, but with R, the results of your analysis do not rely on remembering a succession of pointing and clicking, but instead on a series of written commands, and that’s a good thing! So, if you want to redo your analysis because you collected more data, you don’t have to remember which button you clicked in which order to obtain your results; you just have to run your script again.

Working with scripts makes the steps you used in your analysis clear, and the code you write can be inspected by someone else who can give you feedback and spot mistakes.

Working with scripts forces you to have a deeper understanding of what you are doing, and facilitates your learning and comprehension of the methods you use.

R code is great for reproducibility

Reproducibility means that someone else (including your future self) can obtain the same results from the same dataset when using the same analysis code.

R integrates with other tools to generate manuscripts or reports from your code. If you collect more data, or fix a mistake in your dataset, the figures and the statistical tests in your manuscript or report are updated automatically.

An increasing number of journals and funding agencies expect analyses to be reproducible, so knowing R will give you an edge with these requirements.

R is interdisciplinary and extensible

With 10000+ packages2 that can be installed to extend its capabilities, R provides a framework that allows you to combine statistical approaches from many scientific disciplines to best suit the analytical framework you need to analyze your data. For instance, R has packages for image analysis, GIS, time series, population genetics, and a lot more.

Exponential increase of the number of packages available on [CRAN](https://cran.r-project.org/), the Comprehensive R Archive Network. From the R Journal, Volume 10/2, December 2018.

R works on data of all shapes and sizes

The skills you learn with R scale easily with the size of your dataset. Whether your dataset has hundreds or millions of lines, it won’t make much difference to you.

R is designed for data analysis. It comes with special data structures and data types that make handling of missing data and statistical factors convenient.

R can connect to spreadsheets, databases, and many other data formats, on your computer or on the web.

R produces high-quality graphics

The plotting functionalities in R are extensive, and allow you to adjust any aspect of your graph to convey most effectively the message from your data.

R has a large and welcoming community

Thousands of people use R daily. Many of them are willing to help you through mailing lists and websites such as Stack Overflow, or on the RStudio community. These broad user community extends to specialised areas such as bioinformatics.

Not only is R free, but it is also open-source and cross-platform

Anyone can inspect the source code to see how R works. Because of this transparency, there is less chance for mistakes, and if you (or someone else) find some, you can report and fix bugs.

Knowing your way around RStudio

Let’s start by learning about RStudio, which is an Integrated Development Environment (IDE) for working with R.

The RStudio IDE open-source product is free under the Affero General Public License (AGPL) v3. The RStudio IDE is also available with a commercial license and priority email support from RStudio, Inc.

We will use the RStudio IDE to write code, navigate the files on our computer, inspect the variables we are going to create, and visualize the plots we will generate. RStudio can also be used for other things (e.g., version control, developing packages, writing Shiny apps) that we will not cover during the workshop.

RStudio interface screenshot. Clockwise from top left: Source, Environment/History, Files/Plots/Packages/Help/Viewer, Console.

The RStudio window is divided into 4 “Panes”:

The placement of these panes and their content can be customized (see menu, Tools -> Global Options -> Pane Layout).

One of the advantages of using RStudio is that all the information you need to write code is available in a single window. Additionally, with many shortcuts, autocompletion, and highlighting for the major file types you use while developing in R, RStudio will make typing easier and less error-prone.

Getting set up

It is good practice to keep a set of related data, analyses, and text self-contained in a single folder, called the working directory. All of the scripts within this folder can then use relative paths to files that indicate where inside the project a file is located (as opposed to absolute paths, which point to where a file is on a specific computer). Working this way makes it a lot easier to move your project around on your computer and share it with others without worrying about whether or not the underlying scripts will still work.

RStudio provides a helpful set of tools to do this through its “Projects” interface, which not only creates a working directory for you, but also remembers its location (allowing you to quickly navigate to it) and optionally preserves custom settings and open files to make it easier to resume work after a break. Go through the steps for creating an “R Project” for this tutorial below.

  1. Start RStudio.
  2. Under the File menu, click on New project. Choose New directory, then New project.
  3. Enter a name for this new folder (or “directory”), and choose a convenient location for it. This will be your working directory for this session (or whole course) (e.g., bioc-intro).
  4. Click on Create project.
  5. (Optional) Set Preferences to ‘Never’ save workspace in RStudio.

RStudio’s default preferences generally work well, but saving a workspace to .RData can be cumbersome, especially if you are working with larger datasets. To turn that off, go to Tools –> ‘Global Options’ and select the ‘Never’ option for ‘Save workspace to .RData’ on exit.’

Set 'Save workspace to .RData on exit' to 'Never'

To avoid character encoding issue between Windows and other operating systems, we are going to set UTF-8 by default:

Set the default text encoding to UTF-8 to save us headache in the coming future. (Figure from the link above).

Organizing your working directory

Using a consistent folder structure across your projects will help keep things organized, and will also make it easy to find/file things in the future. This can be especially helpful when you have multiple projects. In general, you may create directories (folders) for scripts, data, and documents.

You may want additional directories or subdirectories depending on your project needs, but these should form the backbone of your working directory.

Example of a working directory structure.

For this course, we will need a data/ folder to store our raw data, and we will use data_output/ for when we learn how to export data as CSV files, and fig_output/ folder for the figures that we will save.

Challenge: create your project directory structure

Create an RStudio project for this workshop called e.g. bioc-intro, as described above under Getting set up, if you haven’t already done that. Under the Files tab on the right of the screen, click on New Folder and create a folder named data within your newly created working directory (e.g., ~/bioc-intro/data). (Alternatively, type dir.create("data") at your R console.) Repeat these operations to create a data_output/ and a fig_output folders.

Your working directory should now look like this:

How it should look like at the beginning of this lesson

The working directory

The working directory is an important concept to understand. It is the place from where R will be looking for and saving the files. When you write code for your project, it should refer to files in relation to the root of your working directory and only need files within this structure.

Using RStudio projects makes this easy and ensures that your working directory is set properly. If you need to check it, you can use getwd(). If for some reason your working directory is not what it should be, you can change it in the RStudio interface by navigating in the file browser where your working directory should be, and clicking on the blue gear icon More, and select Set As Working Directory. Alternatively you can use setwd("/path/to/working/directory") to reset your working directory. However, your scripts should not include this line because it will fail on someone else’s computer.

Example

The schema below represents the working directory bioc-intro with the data and fig_output sub-directories, and 2 files in the latter:

bioc-intro/data/
          /fig_output/fig1.pdf
          /fig_output/fig2.png

If we were in the working directory, we could refer to the fig1.pdf file using the relative path fig_output/fig1.pdf or the absolute path /home/user/bioc-intro/fig_output/fig1.pdf.

If we were in the data directory, we would use the relative path ../fig_output/fig1.pdf or the same absolute path /home/user/bioc-intro/fig_output/fig1.pdf.

Interacting with R

The basis of programming is that we write down instructions for the computer to follow, and then we tell the computer to follow those instructions. We write, or code, instructions in R because it is a common language that both the computer and we can understand. We call the instructions commands and we tell the computer to follow the instructions by executing (also called running) those commands.

There are two main ways of interacting with R: by using the console or by using scripts (plain text files that contain your code). The console pane (in RStudio, the bottom left panel) is the place where commands written in the R language can be typed and executed immediately by the computer. It is also where the results will be shown for commands that have been executed. You can type commands directly into the console and press Enter to execute those commands, but they will be forgotten when you close the session.

Because we want our code and workflow to be reproducible, it is better to type the commands we want in the script editor, and save the script. This way, there is a complete record of what we did, and anyone (including our future selves!) can easily replicate the results on their computer. Note, however, that merely typing the commands in the script does not automatically run them - they still need to be sent to the console for execution.

RStudio allows you to execute commands directly from the script editor by using the Ctrl + Enter shortcut (on Macs, Cmd + Return will work, too). The command on the current line in the script (indicated by the cursor) or all of the commands in the currently selected text will be sent to the console and executed when you press Ctrl + Enter. You can find other keyboard shortcuts in this RStudio cheatsheet about the RStudio IDE.

At some point in your analysis you may want to check the content of a variable or the structure of an object, without necessarily keeping a record of it in your script. You can type these commands and execute them directly in the console. RStudio provides the Ctrl + 1 and Ctrl + 2 shortcuts allow you to jump between the script and the console panes.

If R is ready to accept commands, the R console shows a > prompt. If it receives a command (by typing, copy-pasting or sent from the script editor using Ctrl + Enter), R will try to execute it, and when ready, will show the results and come back with a new > prompt to wait for new commands.

If R is still waiting for you to enter more data because it isn’t complete yet, the console will show a + prompt. It means that you haven’t finished entering a complete command. This is because you have not ‘closed’ a parenthesis or quotation, i.e. you don’t have the same number of left-parentheses as right-parentheses, or the same number of opening and closing quotation marks. When this happens, and you thought you finished typing your command, click inside the console window and press Esc; this will cancel the incomplete command and return you to the > prompt.

How to learn more during and after the course?

The material we cover during this course will give you an initial taste of how you can use R to analyse data for your own research. However, you will need to learn more to do advanced operations such as cleaning your dataset, using statistical methods, or creating beautiful graphics3. The best way to become proficient and efficient at R, as with any other tool, is to use it to address your actual research questions. As a beginner, it can feel daunting to have to write a script from scratch, and given that many people make their code available online, modifying existing code to suit your purpose might make it easier for you to get started.

plot of chunk kitten

Seeking help

Use the built-in RStudio help interface to search for more information on R functions

RStudio help interface.

One of the fastest ways to get help, is to use the RStudio help interface. This panel by default can be found at the lower right hand panel of RStudio. As seen in the screenshot, by typing the word “Mean”, RStudio tries to also give a number of suggestions that you might be interested in. The description is then shown in the display window.

I know the name of the function I want to use, but I’m not sure how to use it

If you need help with a specific function, let’s say barplot(), you can type:

?barplot

If you just need to remind yourself of the names of the arguments, you can use:

args(lm)

I want to use a function that does X, there must be a function for it but I don’t know which one…

If you are looking for a function to do a particular task, you can use the help.search() function, which is called by the double question mark ??. However, this only looks through the installed packages for help pages with a match to your search request

??kruskal

If you can’t find what you are looking for, you can use the rdocumentation.org website that searches through the help files across all packages available.

Finally, a generic Google or internet search “R <task>” will often either send you to the appropriate package documentation or a helpful forum where someone else has already asked your question.

I am stuck… I get an error message that I don’t understand

Start by googling the error message. However, this doesn’t always work very well because often, package developers rely on the error catching provided by R. You end up with general error messages that might not be very helpful to diagnose a problem (e.g. “subscript out of bounds”). If the message is very generic, you might also include the name of the function or package you’re using in your query.

However, you should check Stack Overflow. Search using the [r] tag. Most questions have already been answered, but the challenge is to use the right words in the search to find the answers:

http://stackoverflow.com/questions/tagged/r

The Introduction to R can also be dense for people with little programming experience but it is a good place to understand the underpinnings of the R language.

The R FAQ is dense and technical but it is full of useful information.

Asking for help

The key to receiving help from someone is for them to rapidly grasp your problem. You should make it as easy as possible to pinpoint where the issue might be.

Try to use the correct words to describe your problem. For instance, a package is not the same thing as a library. Most people will understand what you meant, but others have really strong feelings about the difference in meaning. The key point is that it can make things confusing for people trying to help you. Be as precise as possible when describing your problem.

If possible, try to reduce what doesn’t work to a simple reproducible example. If you can reproduce the problem using a very small data frame instead of your 50000 rows and 10000 columns one, provide the small one with the description of your problem. When appropriate, try to generalize what you are doing so even people who are not in your field can understand the question. For instance instead of using a subset of your real dataset, create a small (3 columns, 5 rows) generic one. For more information on how to write a reproducible example see this article by Hadley Wickham.

To share an object with someone else, if it’s relatively small, you can use the function dput(). It will output R code that can be used to recreate the exact same object as the one in memory:

## iris is an example data frame that comes with R and head() is a
## function that returns the first part of the data frame
dput(head(iris))
structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6, 5, 5.4), 
    Sepal.Width = c(3.5, 3, 3.2, 3.1, 3.6, 3.9), Petal.Length = c(1.4, 
    1.4, 1.3, 1.5, 1.4, 1.7), Petal.Width = c(0.2, 0.2, 0.2, 
    0.2, 0.2, 0.4), Species = structure(c(1L, 1L, 1L, 1L, 1L, 
    1L), .Label = c("setosa", "versicolor", "virginica"), class = "factor")), row.names = c(NA, 
6L), class = "data.frame")

If the object is larger, provide either the raw file (i.e., your CSV file) with your script up to the point of the error (and after removing everything that is not relevant to your issue). Alternatively, in particular if your question is not related to a data frame, you can save any R object to a file[^export]:

saveRDS(iris, file="/tmp/iris.rds")

The content of this file is however not human readable and cannot be posted directly on Stack Overflow. Instead, it can be sent to someone by email who can read it with the readRDS() command (here it is assumed that the downloaded file is in a Downloads folder in the user’s home directory):

some_data <- readRDS(file="~/Downloads/iris.rds")

Last, but certainly not least, always include the output of sessionInfo() as it provides critical information about your platform, the versions of R and the packages that you are using, and other information that can be very helpful to understand your problem.

sessionInfo()
R version 4.1.3 (2022-03-10)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.2.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/lib/libRlapack.dylib

locale:
[1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] knitr_1.39

loaded via a namespace (and not attached):
[1] compiler_4.1.3 magrittr_2.0.3 tools_4.1.3    stringi_1.7.6  highr_0.9     
[6] stringr_1.4.0  xfun_0.31      evaluate_0.15 

Where to ask for help?

More resources

R packages

Loading packages

As we have seen above, R packages play a fundamental role in R. The make use of a package’s functionality, assuming it is installed, we first need to load it to be able to use it. This is done with the library() function. Below, we load ggplot2.

library("ggplot2")

Installing packages

The default package repository is The Comprehensive R Archive Network (CRAN), and any package that is available on CRAN can be installed with the install.packages() function. Below, for example, we install the dplyr package that we will learn about late.

install.packages("dplyr")

This command will install the dplyr package as well as all its dependencies, i.e. all the packages that it relies on to function.

Bioconductor packages are managed installed using a dedicated package, namely BiocManager, that can be installed from CRAN with

install.packages("BiocManager")

Individual packages such as SummarizedExperiment (we will use it later), DESeq2 (for RNA-Seq analysis), and many more can then be installed with BiocManager::install.

BiocManager::install("SummarizedExperiment")
BiocManager::install("DESeq2")
  1. As opposed to using R directly from the command line console. There exist other software that interface and integrate with R, but RStudio is particularly well suited for beginners and while providing numerous very advanced features. 

  2. i.e. add-ons that confer R with new functionality, such as bioinformatics data analysis. 

  3. We will introduce most of these (except statistics) here, but will only manage to scratch the surface of the wealth of what is possible to do with R. 

Key Points

  • Start using R and RStudio


Introduction to R

Overview

Teaching: XX min
Exercises: XX min
Questions
  • First commands in R

Objectives
  • Define the following terms as they relate to R: object, assign, call, function, arguments, options.

  • Assign values to objects in R.

  • Learn how to name objects

  • Use comments to inform script.

  • Solve simple arithmetic operations in R.

  • Call functions and use arguments to change their default options.

  • Inspect the content of vectors and manipulate their content.

  • Subset and extract values from vectors.

  • Analyze vectors with missing data.

Creating objects in R

You can get output from R simply by typing math in the console:

3 + 5
[1] 8
12 / 7
[1] 1.714286

However, to do useful and interesting things, we need to assign values to objects. To create an object, we need to give it a name followed by the assignment operator <-, and the value we want to give it:

weight_kg <- 55

<- is the assignment operator. It assigns values on the right to objects on the left. So, after executing x <- 3, the value of x is 3. The arrow can be read as 3 goes into x. For historical reasons, you can also use = for assignments, but not in every context. Because of the slight differences in syntax, it is good practice to always use <- for assignments.

In RStudio, typing Alt + - (push Alt at the same time as the - key) will write <- in a single keystroke in a PC, while typing Option + - (push Option at the same time as the - key) does the same in a Mac.

Naming variables {-}

Objects can be given any name such as x, current_temperature, or subject_id. You want your object names to be explicit and not too long. They cannot start with a number (2x is not valid, but x2 is). R is case sensitive (e.g., weight_kg is different from Weight_kg). There are some names that cannot be used because they are the names of fundamental functions in R (e.g., if, else, for, see here for a complete list). In general, even if it’s allowed, it’s best to not use other function names (e.g., c, T, mean, data, df, weights). If in doubt, check the help to see if the name is already in use. It’s also best to avoid dots (.) within an object name as in my.dataset. There are many functions in R with dots in their names for historical reasons, but because dots have a special meaning in R (for methods) and other programming languages, it’s best to avoid them. It is also recommended to use nouns for object names, and verbs for function names. It’s important to be consistent in the styling of your code (where you put spaces, how you name objects, etc.). Using a consistent coding style makes your code clearer to read for your future self and your collaborators. In R, some popular style guides are Google’s, the tidyverse’s style and the Bioconductor style guide. The tidyverse’s is very comprehensive and may seem overwhelming at first. You can install the lintr package to automatically check for issues in the styling of your code.

Objects vs. variables What are known as objects in R are known as variables in many other programming languages. Depending on the context, object and variable can have drastically different meanings. However, in this lesson, the two words are used synonymously. For more information see: https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Objects

When assigning a value to an object, R does not print anything. You can force R to print the value by using parentheses or by typing the object name:

weight_kg <- 55    # doesn't print anything
(weight_kg <- 55)  # but putting parenthesis around the call prints the value of `weight_kg`
[1] 55
weight_kg          # and so does typing the name of the object
[1] 55

Now that R has weight_kg in memory, we can do arithmetic with it. For instance, we may want to convert this weight into pounds (weight in pounds is 2.2 times the weight in kg):

2.2 * weight_kg
[1] 121

We can also change an object’s value by assigning it a new one:

weight_kg <- 57.5
2.2 * weight_kg
[1] 126.5

This means that assigning a value to one object does not change the values of other objects For example, let’s store the animal’s weight in pounds in a new object, weight_lb:

weight_lb <- 2.2 * weight_kg

and then change weight_kg to 100.

weight_kg <- 100

Challenge:

What do you think is the current content of the object weight_lb? 126.5 or 220?

Comments

The comment character in R is #, anything to the right of a # in a script will be ignored by R. It is useful to leave notes, and explanations in your scripts.

RStudio makes it easy to comment or uncomment a paragraph: after selecting the lines you want to comment, press at the same time on your keyboard Ctrl + Shift + C. If you only want to comment out one line, you can put the cursor at any location of that line (i.e. no need to select the whole line), then press Ctrl + Shift + C.

Challenge

What are the values after each statement in the following?

mass <- 47.5            # mass?
age  <- 122             # age?
mass <- mass * 2.0      # mass?
age  <- age - 20        # age?
mass_index <- mass/age  # mass_index?

Functions and their arguments

Functions are “canned scripts” that automate more complicated sets of commands including operations assignments, etc. Many functions are predefined, or can be made available by importing R packages (more on that later). A function usually gets one or more inputs called arguments. Functions often (but not always) return a value. A typical example would be the function sqrt(). The input (the argument) must be a number, and the return value (in fact, the output) is the square root of that number. Executing a function (‘running it’) is called calling the function. An example of a function call is:

b <- sqrt(a)

Here, the value of a is given to the sqrt() function, the sqrt() function calculates the square root, and returns the value which is then assigned to the object b. This function is very simple, because it takes just one argument.

The return ‘value’ of a function need not be numerical (like that of sqrt()), and it also does not need to be a single item: it can be a set of things, or even a dataset. We’ll see that when we read data files into R.

Arguments can be anything, not only numbers or filenames, but also other objects. Exactly what each argument means differs per function, and must be looked up in the documentation (see below). Some functions take arguments which may either be specified by the user, or, if left out, take on a default value: these are called options. Options are typically used to alter the way the function operates, such as whether it ignores ‘bad values’, or what symbol to use in a plot. However, if you want something specific, you can specify a value of your choice which will be used instead of the default.

Let’s try a function that can take multiple arguments: round().

round(3.14159)
[1] 3

Here, we’ve called round() with just one argument, 3.14159, and it has returned the value 3. That’s because the default is to round to the nearest whole number. If we want more digits we can see how to do that by getting information about the round function. We can use args(round) or look at the help for this function using ?round.

args(round)
function (x, digits = 0) 
NULL
?round

We see that if we want a different number of digits, we can type digits=2 or however many we want.

round(3.14159, digits = 2)
[1] 3.14

If you provide the arguments in the exact same order as they are defined you don’t have to name them:

round(3.14159, 2)
[1] 3.14

And if you do name the arguments, you can switch their order:

round(digits = 2, x = 3.14159)
[1] 3.14

It’s good practice to put the non-optional arguments (like the number you’re rounding) first in your function call, and to specify the names of all optional arguments. If you don’t, someone reading your code might have to look up the definition of a function with unfamiliar arguments to understand what you’re doing. By specifying the name of the arguments you are also safeguarding against possible future changes in the function interface, which may potentially add new arguments in between the existing ones.

Vectors and data types

A vector is the most common and basic data type in R, and is pretty much the workhorse of R. A vector is composed by a series of values, such as numbers or characters. We can assign a series of values to a vector using the c() function. For example we can create a vector of animal weights and assign it to a new object weight_g:

weight_g <- c(50, 60, 65, 82)
weight_g
[1] 50 60 65 82

A vector can also contain characters:

molecules <- c("dna", "rna", "protein")
molecules
[1] "dna"     "rna"     "protein"

The quotes around “dna”, “rna”, etc. are essential here. Without the quotes R will assume there are objects called dna, rna and protein. As these objects don’t exist in R’s memory, there will be an error message.

There are many functions that allow you to inspect the content of a vector. length() tells you how many elements are in a particular vector:

length(weight_g)
[1] 4
length(molecules)
[1] 3

An important feature of a vector, is that all of the elements are the same type of data. The function class() indicates the class (the type of element) of an object:

class(weight_g)
[1] "numeric"
class(molecules)
[1] "character"

The function str() provides an overview of the structure of an object and its elements. It is a useful function when working with large and complex objects:

str(weight_g)
 num [1:4] 50 60 65 82
str(molecules)
 chr [1:3] "dna" "rna" "protein"

You can use the c() function to add other elements to your vector:

weight_g <- c(weight_g, 90) # add to the end of the vector
weight_g <- c(30, weight_g) # add to the beginning of the vector
weight_g
[1] 30 50 60 65 82 90

In the first line, we take the original vector weight_g, add the value 90 to the end of it, and save the result back into weight_g. Then we add the value 30 to the beginning, again saving the result back into weight_g.

We can do this over and over again to grow a vector, or assemble a dataset. As we program, this may be useful to add results that we are collecting or calculating.

An atomic vector is the simplest R data type and is a linear vector of a single type. Above, we saw 2 of the 6 main atomic vector types that R uses: "character" and "numeric" (or "double"). These are the basic building blocks that all R objects are built from. The other 4 atomic vector types are:

You can check the type of your vector using the typeof() function and inputting your vector as the argument.

Vectors are one of the many data structures that R uses. Other important ones are lists (list), matrices (matrix), data frames (data.frame), factors (factor) and arrays (array).

Challenge:

We’ve seen that atomic vectors can be of type character, numeric (or double), integer, and logical. But what happens if we try to mix these types in a single vector?

Solution

R implicitly converts them to all be the same type

Challenge:

What will happen in each of these examples? (hint: use class() to check the data type of your objects):

num_char <- c(1, 2, 3, "a")
num_logical <- c(1, 2, 3, TRUE)
char_logical <- c("a", "b", "c", TRUE)
tricky <- c(1, 2, 3, "4")

Solution

class(num_char)
[1] "character"
class(num_logical)
[1] "numeric"
class(char_logical)
[1] "character"
class(tricky)
[1] "character"

Challenge:

Why do you think it happens?

Solution

Vectors can be of only one data type. R tries to convert (coerce) the content of this vector to find a common denominator that doesn’t lose any information.

Homework Challenge:

How many values in combined_logical are "TRUE" (as a character) in the following example:

num_logical <- c(1, 2, 3, TRUE)
char_logical <- c("a", "b", "c", TRUE)
combined_logical <- c(num_logical, char_logical)

Homework Challenge:

In R, we call converting objects from one class into another class coercion. These conversions happen according to a hierarchy, whereby some types get preferentially coerced into other types. Can you draw a diagram that represents the hierarchy of how these data types are coerced?

Solution

logical → numeric → character ← logical

Subsetting vectors

If we want to extract one or several values from a vector, we must provide one or several indices in square brackets. For instance:

molecules <- c("dna", "rna", "peptide", "protein")
molecules[2]
[1] "rna"
molecules[c(3, 2)]
[1] "peptide" "rna"    

We can also repeat the indices to create an object with more elements than the original one:

more_molecules <- molecules[c(1, 2, 3, 2, 1, 4)]
more_molecules
[1] "dna"     "rna"     "peptide" "rna"     "dna"     "protein"

R indices start at 1. Programming languages like Fortran, MATLAB, Julia, and R start counting at 1, because that’s what human beings typically do. Languages in the C family (including C++, Java, Perl, and Python) count from 0 because that’s simpler for computers to do.

Finally, it is also possible to get all the elements of a vector except some specified elements using negative indices:

molecules ## all molecules
[1] "dna"     "rna"     "peptide" "protein"
molecules[-1] ## all but the first one
[1] "rna"     "peptide" "protein"
molecules[-c(1, 3)] ## all but 1st/3rd ones
[1] "rna"     "protein"
molecules[c(-1, -3)] ## all but 1st/3rd ones
[1] "rna"     "protein"

Conditional subsetting

Another common way of subsetting is by using a logical vector. TRUE will select the element with the same index, while FALSE will not:

weight_g <- c(21, 34, 39, 54, 55)
weight_g[c(TRUE, FALSE, TRUE, TRUE, FALSE)]
[1] 21 39 54

Typically, these logical vectors are not typed by hand, but are the output of other functions or logical tests. For instance, if you wanted to select only the values above 50:

## will return logicals with TRUE for the indices that meet
## the condition
weight_g > 50
[1] FALSE FALSE FALSE  TRUE  TRUE
## so we can use this to select only the values above 50
weight_g[weight_g > 50]
[1] 54 55

You can combine multiple tests using & (both conditions are true, AND) or | (at least one of the conditions is true, OR):

weight_g[weight_g < 30 | weight_g > 50]
[1] 21 54 55
weight_g[weight_g >= 30 & weight_g == 21]
numeric(0)

Here, < stands for “less than”, > for “greater than”, >= for “greater than or equal to”, and == for “equal to”. The double equal sign == is a test for numerical equality between the left and right hand sides, and should not be confused with the single = sign, which performs variable assignment (similar to <-).

A common task is to search for certain strings in a vector. One could use the “or” operator | to test for equality to multiple values, but this can quickly become tedious. The function %in% allows you to test if any of the elements of a search vector are found:

molecules <- c("dna", "rna", "protein", "peptide")
molecules[molecules == "rna" | molecules == "dna"] # returns both rna and dna
[1] "dna" "rna"
molecules %in% c("rna", "dna", "metabolite", "peptide", "glycerol")
[1]  TRUE  TRUE FALSE  TRUE
molecules[molecules %in% c("rna", "dna", "metabolite", "peptide", "glycerol")]
[1] "dna"     "rna"     "peptide"

Homework Challenge:

Can you figure out why "four" > "five" returns TRUE?

Solution

"four" > "five"
[1] TRUE

When using > or < on strings, R compares their alphabetical order. Here "four" comes after "five", and therefore is greater than it.

Names

It is possible to name each element of a vector. The code chunk below show a initial vector without any names, how names are set, and retrieved.

x <- c(1, 5, 3, 5, 10)
names(x) ## no names
NULL
names(x) <- c("A", "B", "C", "D", "E")
names(x) ## now we have names
[1] "A" "B" "C" "D" "E"

When a vector has names, it is possible to access elements by their name, in addition to their index.

x[c(1, 3)]
A C 
1 3 
x[c("A", "C")]
A C 
1 3 

Missing data

As R was designed to analyze datasets, it includes the concept of missing data (which is uncommon in other programming languages). Missing data are represented in vectors as NA.

When doing operations on numbers, most functions will return NA if the data you are working with include missing values. This feature makes it harder to overlook the cases where you are dealing with missing data. You can add the argument na.rm = TRUE to calculate the result while ignoring the missing values.

heights <- c(2, 4, 4, NA, 6)
mean(heights)
[1] NA
max(heights)
[1] NA
mean(heights, na.rm = TRUE)
[1] 4
max(heights, na.rm = TRUE)
[1] 6

If your data include missing values, you may want to become familiar with the functions is.na(), na.omit(), and complete.cases(). See below for examples.

## Extract those elements which are not missing values.
heights[!is.na(heights)]
[1] 2 4 4 6
## Returns the object with incomplete cases removed.
## The returned object is an atomic vector of type `"numeric"`
## (or `"double"`).
na.omit(heights)
[1] 2 4 4 6
attr(,"na.action")
[1] 4
attr(,"class")
[1] "omit"
## Extract those elements which are complete cases.
## The returned object is an atomic vector of type `"numeric"`
## (or `"double"`).
heights[complete.cases(heights)]
[1] 2 4 4 6

Homework Challenge:

  1. Using this vector of heights in inches, create a new vector with the NAs removed.
heights <- c(63, 69, 60, 65, NA, 68, 61, 70, 61, 59, 64, 69, 63, 63, NA, 72, 65, 64, 70, 63, 65)
  1. Use the function median() to calculate the median of the heights vector.
  2. Use R to figure out how many people in the set are taller than 67 inches.

Solution

heights_no_na <- heights[!is.na(heights)]
## or
heights_no_na <- na.omit(heights)
median(heights, na.rm = TRUE)
[1] 64
heights_above_67 <- heights_no_na[heights_no_na > 67]
length(heights_above_67)
[1] 6

Generating vectors

Constructors {-}

There exists some functions to generate vectors of different type. To generate a vector of numerics, one can use the numeric() constructor, providing the length of the output vector as parameter. The values will be initialised with 0.

numeric(3)
[1] 0 0 0
numeric(10)
 [1] 0 0 0 0 0 0 0 0 0 0

Note that if we ask for a vector of numerics of length 0, we obtain exactly that:

numeric(0)
numeric(0)

There are similar constructors for characters and logicals, named character() and logical() respectively.

Homework Challenge:

What are the defaults for character and logical vectors?

Solution

character(2) ## the empty charater
[1] "" ""
logical(2)   ## FALSE
[1] FALSE FALSE

Replicate elements

The rep function allow to repeat a value a certain number of times. If we want to initiate a vector of numerics of length 5 with the value -1, for example, we could do the following:

rep(-1, 5)
[1] -1 -1 -1 -1 -1

Similarly, to generate a vector populated with missing values, which is often a good way to start, without setting assumptions on the data to be collected:

rep(NA, 5)
[1] NA NA NA NA NA

rep can take vectors of any length as input (above, we used vectors of length 1) and any type. For example, if we want to repeat the values 1, 2 and 3 five times, we would do the following:

rep(c(1, 2, 3), 5)
 [1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3

Homework Challenge:

What if we wanted to repeat the values 1, 2 and 3 five times, but obtain five 1s, five 2s and five 3s in that order? There are two possibilities - see ?rep or ?sort for help.

Solution

rep(c(1, 2, 3), each = 5)
 [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
sort(rep(c(1, 2, 3), 5))
 [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3

Sequence generation

Another very useful function is seq, to generate a sequence of numbers. For example, to generate a sequence of integers from 1 to 20 by steps of 2, one would use:

seq(from = 1, to = 20, by = 2)
 [1]  1  3  5  7  9 11 13 15 17 19

The default value of by is 1 and, given that the generate of a sequence of one value to another with steps of 1 is frequently used, there’s a shortcut:

seq(1, 5, 1)
[1] 1 2 3 4 5
seq(1, 5) ## default by
[1] 1 2 3 4 5
1:5
[1] 1 2 3 4 5

To generate a sequence of numbers from 1 to 20 of final length of 3, one would use:

seq(from = 1, to = 20, length.out = 3)
[1]  1.0 10.5 20.0

Random samples and permutations

A last group of useful functions are those that generate random data. The first one, sample, generates a random permutation of another vector. For example, to draw a random order to 10 students oral exam, I first assign each student a number from 1 to then (for instance based on the alphabetic order of their name) and then:

sample(1:10)
 [1]  9  4  7  1  2  5  3 10  6  8

Without further arguments, sample will return a permutation of all elements of the vector. If I want a random sample of a certain size, I would set this value as second argument. Below, I sample 5 random letters from the alphabet contained in the pre-defined letters vector:

sample(letters, 5)
[1] "s" "a" "u" "x" "j"

If I wanted an output larger than the input vector, or being able to draw some elements multiple times, I would need to set the replace argument to TRUE:

sample(1:5, 10, replace = TRUE)
 [1] 2 1 5 5 1 1 5 5 2 2

Homework Challenge:

When trying the functions above out, you will have realised that the samples are indeed random and that one doesn’t get the same permutation twice. To be able to reproduce these random draws, one can set the random number generation seed manually with set.seed() before drawing the random sample.

  • Test this feature with your neighbour. First draw two random permutations of 1:10 independently and observe that you get different results.

  • Now set the seed with, for example, set.seed(123) and repeat the random draw. Observe that you now get the same random draws.

  • Repeat by setting a different seed.

Solution

Different permutations

sample(1:10)
 [1]  9  1  4  3  6  2  5  8 10  7
sample(1:10)
 [1]  4  9  7  6  1 10  8  3  2  5

Same permutations with seed 123

set.seed(123)
sample(1:10)
 [1]  3 10  2  8  6  9  1  7  5  4
set.seed(123)
sample(1:10)
 [1]  3 10  2  8  6  9  1  7  5  4

A different seed

set.seed(1)
sample(1:10)
 [1]  9  4  7  1  2  5  3 10  6  8
set.seed(1)
sample(1:10)
 [1]  9  4  7  1  2  5  3 10  6  8

Homework reading: Drawing samples from a normal distribution

The last function we are going to see is rnorm, that draws a random sample from a normal distribution. Two normal distributions of means 0 and 100 and standard deviations 1 and 5, noted noted N(0, 1) and N(100, 5), are shown below

Two normal distributions: *N(0, 1)* on the left and *N(100, 5)* on the right.

The three arguments, n, mean and sd, define the size of the sample, and the parameters of the normal distribution, i.e the mean and its standard deviation. The defaults of the latter are 0 and 1.

rnorm(5)
[1]  0.69641761  0.05351568 -1.31028350 -2.12306606 -0.20807859
rnorm(5, 2, 2)
[1]  1.3744268 -0.1164714  2.8344472  1.3690969  3.6510983
rnorm(5, 100, 5)
[1] 106.45636  96.87448  95.62427 100.71678 107.12595

Now that we have learned how to write scripts, and the basics of R’s data structures, we are ready to start working with larger data, and learn about data frames.

Key Points

  • How to interact with R


Starting with data

Overview

Teaching: XX min
Exercises: XX min
Questions
  • First data analysis in R

Objectives
  • Describe what a data frame is.

  • Load external data from a .csv file into a data frame.

  • Summarize the contents of a data frame.

  • Describe what a factor is.

  • Convert between strings and factors.

  • Reorder and rename factors.

  • Format dates.

  • Export and save data.

Presentation of the gene expression data

We are going to use part of the data published by Blackmore et al. (2017), The effect of upper-respiratory infection on transcriptomic changes in the CNS. The goal of the study was to determine the effect of an upper-respiratory infection on changes in RNA transcription occuring in the cerebellum and spinal cord post infection. Gender matched eight week old C57BL/6 mice were inoculated with saline or with Influenza A by intranasal route and transcriptomic changes in the cerebellum and spinal cord tissues were evaluated by RNA-seq at days 0 (non-infected), 4 and 8.

The dataset is stored as a comma separated value (CSV) file. Each row holds information for a single RNA expression measurement, and the columns represent:

Column Description
gene The name of the gene that was measured
sample The name of the sample the gene expression was measured in
expression The value of the gene expression
organism The organism/species - here all data stem from mice
age The age of the mouse (all mice were 8 weeks here)
sex The sex of the mouse
infection The infection state of the mouse, i.e. infected with Influenza A or not infected.
strain The Influenza A strain; C57BL/6 in all cases.
time The duration of the infection (in days).
tissue The tissue that was used for the gene expression experiment, i.e. cerebellum or spinal cord.
mouse The mouse unique identifier.

We have already downloaded the data and it can be found in the directory course-data/data/GSE96870/.

You are now ready to load the data:

rna <- read.csv("course-data/data/GSE96870/rnaseq.csv")

This statement doesn’t produce any output because, as you might recall, assignments don’t display anything. If we want to check that our data has been loaded, we can see the contents of the data frame by typing its name

rna

Wow… that was a lot of output. At least it means the data loaded properly. Let’s check the top (the first 6 lines) of this data frame using the function head():

head(rna)
     gene     sample expression     organism age    sex  infection  strain time
1     Asl GSM2545336       1170 Mus musculus   8 Female InfluenzaA C57BL/6    8
2    Apod GSM2545336      36194 Mus musculus   8 Female InfluenzaA C57BL/6    8
3 Cyp2d22 GSM2545336       4060 Mus musculus   8 Female InfluenzaA C57BL/6    8
4    Klk6 GSM2545336        287 Mus musculus   8 Female InfluenzaA C57BL/6    8
5   Fcrls GSM2545336         85 Mus musculus   8 Female InfluenzaA C57BL/6    8
6  Slc2a4 GSM2545336        782 Mus musculus   8 Female InfluenzaA C57BL/6    8
      tissue mouse ENTREZID
1 Cerebellum    14   109900
2 Cerebellum    14    11815
3 Cerebellum    14    56448
4 Cerebellum    14    19144
5 Cerebellum    14    80891
6 Cerebellum    14    20528
                                                                       product
1                               argininosuccinate lyase, transcript variant X1
2                                       apolipoprotein D, transcript variant 3
3 cytochrome P450, family 2, subfamily d, polypeptide 22, transcript variant 2
4                         kallikrein related-peptidase 6, transcript variant 2
5                Fc receptor-like S, scavenger receptor, transcript variant X1
6          solute carrier family 2 (facilitated glucose transporter), member 4
     ensembl_gene_id external_synonym chromosome_name   gene_biotype
1 ENSMUSG00000025533    2510006M18Rik               5 protein_coding
2 ENSMUSG00000022548             <NA>              16 protein_coding
3 ENSMUSG00000061740             2D22              15 protein_coding
4 ENSMUSG00000050063             Bssp               7 protein_coding
5 ENSMUSG00000015852    2810439C17Rik               3 protein_coding
6 ENSMUSG00000018566           Glut-4              11 protein_coding
                            phenotype_description
1           abnormal circulating amino acid level
2                      abnormal lipid homeostasis
3                        abnormal skin morphology
4                         abnormal cytokine level
5 decreased CD8-positive alpha-beta T cell number
6              abnormal circulating glucose level
  hsapiens_homolog_associated_gene_name
1                                   ASL
2                                  APOD
3                                CYP2D6
4                                  KLK6
5                                 FCRL2
6                                SLC2A4
## Try also
## View(rna)

Note

read.csv() assumes that fields are delineated by commas, however, in several countries, the comma is used as a decimal separator and the semicolon (;) is used as a field delineator. If you want to read in this type of files in R, you can use the read.csv2() function. It behaves exactly like read.csv() but uses different parameters for the decimal and the field separators. If you are working with another format, they can be both specified by the user. Check out the help for read.csv() by typing ?read.csv to learn more. There is also the read.delim() function for reading tab separated data files. It is important to note that all of these functions are actually wrapper functions for the main read.table() function with different arguments. As such, the data above could have also been loaded by using read.table() with the separation argument as ,. The code is as follows:

rna <- read.table(file = "course-data/data/GSE96870/rnaseq.csv",
                  sep = ",",
                  header = TRUE)

The header argument has to be set to TRUE to be able to read the headers as by default read.table() has the header argument set to FALSE.

What are data frames?

Data frames are the de facto data structure for most tabular data, and what we use for statistics and plotting.

A data frame can be created by hand, but most commonly they are generated by the functions read.csv() or read.table(); in other words, when importing spreadsheets from your hard drive (or the web).

A data frame is the representation of data in the format of a table where the columns are vectors that all have the same length. Because columns are vectors, each column must contain a single type of data (e.g., characters, integers, factors). For example, here is a figure depicting a data frame comprising a numeric, a character, and a logical vector.

We can see this when inspecting the structure of a data frame with the function str():

str(rna)
'data.frame':	32428 obs. of  19 variables:
 $ gene                                 : chr  "Asl" "Apod" "Cyp2d22" "Klk6" ...
 $ sample                               : chr  "GSM2545336" "GSM2545336" "GSM2545336" "GSM2545336" ...
 $ expression                           : int  1170 36194 4060 287 85 782 1619 288 43217 1071 ...
 $ organism                             : chr  "Mus musculus" "Mus musculus" "Mus musculus" "Mus musculus" ...
 $ age                                  : int  8 8 8 8 8 8 8 8 8 8 ...
 $ sex                                  : chr  "Female" "Female" "Female" "Female" ...
 $ infection                            : chr  "InfluenzaA" "InfluenzaA" "InfluenzaA" "InfluenzaA" ...
 $ strain                               : chr  "C57BL/6" "C57BL/6" "C57BL/6" "C57BL/6" ...
 $ time                                 : int  8 8 8 8 8 8 8 8 8 8 ...
 $ tissue                               : chr  "Cerebellum" "Cerebellum" "Cerebellum" "Cerebellum" ...
 $ mouse                                : int  14 14 14 14 14 14 14 14 14 14 ...
 $ ENTREZID                             : int  109900 11815 56448 19144 80891 20528 97827 118454 18823 14696 ...
 $ product                              : chr  "argininosuccinate lyase, transcript variant X1" "apolipoprotein D, transcript variant 3" "cytochrome P450, family 2, subfamily d, polypeptide 22, transcript variant 2" "kallikrein related-peptidase 6, transcript variant 2" ...
 $ ensembl_gene_id                      : chr  "ENSMUSG00000025533" "ENSMUSG00000022548" "ENSMUSG00000061740" "ENSMUSG00000050063" ...
 $ external_synonym                     : chr  "2510006M18Rik" NA "2D22" "Bssp" ...
 $ chromosome_name                      : chr  "5" "16" "15" "7" ...
 $ gene_biotype                         : chr  "protein_coding" "protein_coding" "protein_coding" "protein_coding" ...
 $ phenotype_description                : chr  "abnormal circulating amino acid level" "abnormal lipid homeostasis" "abnormal skin morphology" "abnormal cytokine level" ...
 $ hsapiens_homolog_associated_gene_name: chr  "ASL" "APOD" "CYP2D6" "KLK6" ...

Inspecting data.frame Objects

We already saw how the functions head() and str() can be useful to check the content and the structure of a data frame. Here is a non-exhaustive list of functions to get a sense of the content/structure of the data. Let’s try them out!

Size:

Content:

Names:

Summary:

Note: most of these functions are “generic”, they can be used on other types of objects besides data.frame.

Challenge:

Based on the output of str(rna), can you answer the following questions?

  • What is the class of the object rna?
  • How many rows and how many columns are in this object?
  • How many genes (as defined by the gene variable) have been measured in this experiment?

Solution

  • class: data frame
  • how many rows: 32428, how many columns: 19
  • how many genes: 1477

Indexing and subsetting data frames

Our rna data frame has rows and columns (it has 2 dimensions), if we want to extract some specific data from it, we need to specify the “coordinates” we want from it. Row numbers come first, followed by column numbers. However, note that different ways of specifying these coordinates lead to results with different classes.

# first element in the first column of the data frame (as a vector)
rna[1, 1]
# first element in the 6th column (as a vector)
rna[1, 6]
# first column of the data frame (as a vector)
rna[, 1]
# first column of the data frame (as a data.frame)
rna[1]
# first three elements in the 7th column (as a vector)
rna[1:3, 7]
# the 3rd row of the data frame (as a data.frame)
rna[3, ]
# equivalent to head_rna <- head(rna)
head_rna <- rna[1:6, ]
head_rna

: is a special function that creates numeric vectors of integers in increasing or decreasing order, test 1:10 and 10:1 for instance. See section \@ref(sec:genvec) for details.

You can also exclude certain indices of a data frame using the “-” sign:

rna[, -1]          ## The whole data frame, except the first column
rna[-c(7:32428), ] ## Equivalent to head(rna)

Data frames can be subset by calling indices (as shown previously), but also by calling their column names directly:

rna["gene"]       # Result is a data.frame
rna[, "gene"]     # Result is a vector
rna[["gene"]]     # Result is a vector
rna$gene          # Result is a vector

In RStudio, you can use the autocompletion feature to get the full and correct names of the columns.

Challenge

  1. Create a data.frame (rna_200) containing only the data in row 200 of the rna dataset.

  2. Notice how nrow() gave you the number of rows in a data.frame?

  • Use that number to pull out just that last row in the inital rna data frame.

  • Compare that with what you see as the last row using tail() to make sure it’s meeting expectations.

  • Pull out that last row using nrow() instead of the row number.

  • Create a new data frame (rna_last) from that last row.

  1. Use nrow() to extract the row that is in the middle of the rna dataframe. Store the content of this row in an object named rna_middle.

  2. Combine nrow() with the - notation above to reproduce the behavior of head(rna), keeping just the first through 6th rows of the rna dataset.

Solution

## 1.
rna_200 <- rna[200, ]
## 2.
## Saving `n_rows` to improve readability and reduce duplication
n_rows <- nrow(rna)
rna_last <- rna[n_rows, ]
## 3.
rna_middle <- rna[n_rows / 2, ]
## 4.
rna_head <- rna[-(7:n_rows), ]

Factors

Factors represent categorical data. They are stored as integers associated with labels and they can be ordered or unordered. While factors look (and often behave) like character vectors, they are actually treated as integer vectors by R. So you need to be very careful when treating them as strings.

Once created, factors can only contain a pre-defined set of values, known as levels. By default, R always sorts levels in alphabetical order. For instance, if you have a factor with 2 levels:

sex <- factor(c("male", "female", "female", "male", "female"))

R will assign 1 to the level "female" and 2 to the level "male" (because f comes before m, even though the first element in this vector is "male"). You can see this by using the function levels() and you can find the number of levels using nlevels():

levels(sex)
[1] "female" "male"  
nlevels(sex)
[1] 2

Sometimes, the order of the factors does not matter, other times you might want to specify the order because it is meaningful (e.g., “low”, “medium”, “high”), it improves your visualization, or it is required by a particular type of analysis. Here, one way to reorder our levels in the sex vector would be:

sex ## current order
[1] male   female female male   female
Levels: female male
sex <- factor(sex, levels = c("male", "female"))
sex ## after re-ordering
[1] male   female female male   female
Levels: male female

In R’s memory, these factors are represented by integers (1, 2, 3), but are more informative than integers because factors are self describing: "female", "male" is more descriptive than 1, 2. Which one is “male”? You wouldn’t be able to tell just from the integer data. Factors, on the other hand, have this information built in. It is particularly helpful when there are many levels (like the species names in our example dataset).

Converting to factors {-}

If you need to convert a factor to a character vector, you use as.character(x).

as.character(sex)
[1] "male"   "female" "female" "male"   "female"

Renaming factors {-}

When your data is stored as a factor, you can use the plot() function to get a quick glance at the number of observations represented by each factor level. Let’s look at the number of males and females in our data.

plot(sex)

Bar plot of the number of females and males.

If we want to rename these factor, it is sufficient to change its levels:

levels(sex)
[1] "male"   "female"
levels(sex) <- c("M", "F")
sex
[1] M F F M F
Levels: M F
plot(sex)

plot of chunk unnamed-chunk-17

Homework Challenge:

  • Rename “female” and “male” to “Female” and “Male” respectively.

Challenge:

We have seen how data frames are created when using read.csv(), but they can also be created by hand with the data.frame() function. There are a few mistakes in this hand-crafted data.frame. Can you spot and fix them? Don’t hesitate to experiment!

animal_data <- data.frame(
       animal = c(dog, cat, sea cucumber, sea urchin),
       feel = c("furry", "squishy", "spiny"),
       weight = c(45, 8 1.1, 0.8))

Solution

  • missing quotations around the names of the animals
  • missing one entry in the “feel” column (probably for one of the furry animals)
  • missing one comma in the weight column

Homework Challenge:

Can you predict the class for each of the columns in the following example?

Check your guesses using str(country_climate):

  • Are they what you expected? Why? Why not?

  • Try again by adding stringsAsFactors = TRUE after the last variable when creating the data frame? What is happening now? stringsAsFactors can also be set when reading text-based spreadsheets into R using read.csv().

country_climate <- data.frame(
       country = c("Canada", "Panama", "South Africa", "Australia"),
       climate = c("cold", "hot", "temperate", "hot/temperate"),
       temperature = c(10, 30, 18, "15"),
       northern_hemisphere = c(TRUE, TRUE, FALSE, "FALSE"),
       has_kangaroo = c(FALSE, FALSE, FALSE, 1)
       )

The automatic conversion of data type is sometimes a blessing, sometimes an annoyance. Be aware that it exists, learn the rules, and double check that data you import in R are of the correct type within your data frame. If not, use it to your advantage to detect mistakes that might have been introduced during data entry (a letter in a column that should only contain numbers for instance).

Learn more in this RStudio tutorial

Matrices

Before proceeding, now that we have learnt about dataframes, let’s recap package installation and learn about a new data type, namely the matrix. Like a data.frame, a matrix has two dimensions, rows and columns. But the major difference is that all cells in a matrix must be of the same type: numeric, character, logical, … In that respect, matrices are closer to a vector than a data.frame.

The default constructor for a matrix is matrix. It takes a vector of values to populate the matrix and the number of row and/or columns1. The values are sorted along the columns, as illustrated below.

m <- matrix(1:9, ncol = 3, nrow = 3)
m
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

Homework Challenge:

Using the function installed.packages(), create a character matrix containing the information about all packages currently installed on your computer. Explore it.

Solution:

## create the matrix
ip <- installed.packages()
head(ip)
## try also View(ip)
## number of package
nrow(ip)
## names of all installed packages
rownames(ip)
## type of information we have about each package
colnames(ip)

It is often useful to create large random data matrices as test data. The exercise below asks you to create such a matrix with random data drawn from a normal distribution of mean 0 and standard deviation 1, which can be done with the rnorm() function.

Homework Challenge:

Construct a matrix of dimension 1000 by 3 of normally distributed data (mean 0, standard deviation 1)

Solution

set.seed(123)
m <- matrix(rnorm(3000), ncol = 3)
dim(m)
[1] 1000    3
head(m)
            [,1]        [,2]       [,3]
[1,] -0.56047565 -0.99579872 -0.5116037
[2,] -0.23017749 -1.03995504  0.2369379
[3,]  1.55870831 -0.01798024 -0.5415892
[4,]  0.07050839 -0.13217513  1.2192276
[5,]  0.12928774 -2.54934277  0.1741359
[6,]  1.71506499  1.04057346 -0.6152683

Summary of R objects

So far, we have seen several types of R object varying in the number of dimensions and whether they could store a single of multiple data types:

Lists

A data type that we haven’t seen yet, but that is useful to know, and follows from the summary that we have just seen are lists:

Below, let’s create a list containing a vector of numbers, characters, a matrix, a dataframe and another list:

l <- list(1:10, ## numeric
          letters, ## character
          installed.packages(), ## a matrix
          cars, ## a data.frame
          list(1, 2, 3)) ## a list
length(l)
[1] 5
str(l)
List of 5
 $ : int [1:10] 1 2 3 4 5 6 7 8 9 10
 $ : chr [1:26] "a" "b" "c" "d" ...
 $ : chr [1:399, 1:16] "abind" "annotate" "AnnotationDbi" "AnnotationHub" ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:399] "abind" "annotate" "AnnotationDbi" "AnnotationHub" ...
  .. ..$ : chr [1:16] "Package" "LibPath" "Version" "Priority" ...
 $ :'data.frame':	50 obs. of  2 variables:
  ..$ speed: num [1:50] 4 4 7 7 8 9 10 10 10 11 ...
  ..$ dist : num [1:50] 2 10 4 22 16 10 18 26 34 17 ...
 $ :List of 3
  ..$ : num 1
  ..$ : num 2
  ..$ : num 3

List subsetting is done using [] to subset a new sub-list or [[]] to extract a single element of that list (using indices or names, of the list is named).

l[[1]] ## first element
 [1]  1  2  3  4  5  6  7  8  9 10
l[1:2] ## a list of length 2
[[1]]
 [1]  1  2  3  4  5  6  7  8  9 10

[[2]]
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
[20] "t" "u" "v" "w" "x" "y" "z"
l[1]   ## a list of length 1
[[1]]
 [1]  1  2  3  4  5  6  7  8  9 10

Exporting and saving data

Exporting tabular data {-}

We have seen how to read a text-based spreadsheet into R using the read.table family of functions. To export a data.frame to a text-based spreadsheet, we can use the write.table set of functions (write.csv, write.delim, …). They all take the variable to be exported and the file to be exported to. For example, to export the rna data to an rna.csv file in the data_output directory, we would execute:

write.csv(rna, file = "data_output/rna.csv")

This new csv file can now be shared with other collaborators who aren’t familiar with R.

Saving data {-}

Exporting data to a spreadsheet has several limitations, such as those described in the first chapter such as possible inconsistencies with , and . for decimal separators and lack of variable type definitions. Furthermore, exporting data to a spreadsheet is only relevant for rectangular data such as dataframes and matrices.

A more general way to save data, that is specific to R and is guaranteed to work on any operating system, is to use the save function. Saving objects will generate a binary representation of the object on disk, a R Data file (rda extension) that guarantees to produce the same object once loaded back into R using the load function.

save(rna, file = "data_output/rna.rda")
rm(rna)
load("data_output/rna.rda")
head(rna)

Note about how the function load loads the object in the file directly in the global environment.

There is also the saveRDS and readRDS functions that save R objects to binary files (using the rds extension here) and read these back into R. From a user’s perspective, main difference is that, load loads an object in the global environment while readRDS reads the data from disk and returns it. It is thus necessary to store the output of readRDS in a variable:

saveRDS(rna, file = "data_output/rna.rds")
rm(rna)
rna <- readRDS("data_output/rna.rds")
head(rna)

To conclude, when it comes to saving data from R that will be loaded again in R, saving and loading is the preferred approach. If tabular data need to be shared with somebody that is not using R, then exporting to a text-based spreadsheet is a good alternative.

  1. Either the number of rows or columns are enough, as the other one can be deduced from the length of the values. Try out what happens if the values and number of rows/columns don’t add up. 

Key Points

  • Tabular data in R


Manipulating and analyzing data with dplyr

Overview

Teaching: XX min
Exercises: XX min
Questions
  • Data analysis in R using the tidyverse meta-package

Objectives
  • Describe the purpose of the dplyr and tidyr packages.

  • Describe several of their functions that are extremely useful to manipulate data.

  • Describe the concept of a wide and a long table format, and see how to reshape a data frame from one format to the other one.

Data Manipulation using dplyr and tidyr

Bracket subsetting is handy, but it can be cumbersome and difficult to read, especially for complicated operations.

Some packages can greatly facilitate our task when we manipulate data. Packages in R are basically sets of additional functions that let you do more stuff. The functions we’ve been using so far, like str() or data.frame(), come built into R; Loading packages can give you access to other specific functions. Before you use a package for the first time you need to install it on your machine, and then you should import it in every subsequent R session when you need it.

To learn more about dplyr and tidyr after the workshop, you may want to check out this handy data transformation with dplyr cheatsheet and this one about tidyr.

To install and load the tidyverse package type:

# no need to run this command as it's already been installed for you
# BiocManager::install("tidyverse")
## load the tidyverse packages, incl. dplyr
library("tidyverse")

Loading data with tidyverse

Instead of read.csv(), we will read in our data using the read_csv() function, from the tidyverse package readr, .

rna <- read_csv("course-data/data/GSE96870/rnaseq.csv")
## view the data
rna
# A tibble: 32,428 × 19
   gene    sample  expression organism   age sex   infection strain  time tissue
   <chr>   <chr>        <dbl> <chr>    <dbl> <chr> <chr>     <chr>  <dbl> <chr> 
 1 Asl     GSM254…       1170 Mus mus…     8 Fema… Influenz… C57BL…     8 Cereb…
 2 Apod    GSM254…      36194 Mus mus…     8 Fema… Influenz… C57BL…     8 Cereb…
 3 Cyp2d22 GSM254…       4060 Mus mus…     8 Fema… Influenz… C57BL…     8 Cereb…
 4 Klk6    GSM254…        287 Mus mus…     8 Fema… Influenz… C57BL…     8 Cereb…
 5 Fcrls   GSM254…         85 Mus mus…     8 Fema… Influenz… C57BL…     8 Cereb…
 6 Slc2a4  GSM254…        782 Mus mus…     8 Fema… Influenz… C57BL…     8 Cereb…
 7 Exd2    GSM254…       1619 Mus mus…     8 Fema… Influenz… C57BL…     8 Cereb…
 8 Gjc2    GSM254…        288 Mus mus…     8 Fema… Influenz… C57BL…     8 Cereb…
 9 Plp1    GSM254…      43217 Mus mus…     8 Fema… Influenz… C57BL…     8 Cereb…
10 Gnb4    GSM254…       1071 Mus mus…     8 Fema… Influenz… C57BL…     8 Cereb…
# … with 32,418 more rows, and 9 more variables: mouse <dbl>, ENTREZID <dbl>,
#   product <chr>, ensembl_gene_id <chr>, external_synonym <chr>,
#   chromosome_name <chr>, gene_biotype <chr>, phenotype_description <chr>,
#   hsapiens_homolog_associated_gene_name <chr>

Notice that the class of the data is now referred to as a “tibble”.

Tibbles tweak some of the behaviors of the data frame objects we introduced in the previously. The data structure is very similar to a data frame. For our purposes the only differences are that:

  1. It displays the data type of each column under its name. Note that <dbl> is a data type defined to hold numeric values with decimal points.

  2. It only prints the first few rows of data and only as many columns as fit on one screen.

We are now going to learn some of the most common dplyr functions:

Selecting columns and filtering rows

To select columns of a data frame, use select(). The first argument to this function is the data frame (rna), and the subsequent arguments are the columns to keep.

select(rna, gene, sample, tissue, expression)
# A tibble: 32,428 × 4
   gene    sample     tissue     expression
   <chr>   <chr>      <chr>           <dbl>
 1 Asl     GSM2545336 Cerebellum       1170
 2 Apod    GSM2545336 Cerebellum      36194
 3 Cyp2d22 GSM2545336 Cerebellum       4060
 4 Klk6    GSM2545336 Cerebellum        287
 5 Fcrls   GSM2545336 Cerebellum         85
 6 Slc2a4  GSM2545336 Cerebellum        782
 7 Exd2    GSM2545336 Cerebellum       1619
 8 Gjc2    GSM2545336 Cerebellum        288
 9 Plp1    GSM2545336 Cerebellum      43217
10 Gnb4    GSM2545336 Cerebellum       1071
# … with 32,418 more rows

To select all columns except certain ones, put a “-“ in front of the variable to exclude it.

select(rna, -tissue, -organism)
# A tibble: 32,428 × 17
   gene    sample   expression   age sex   infection strain  time mouse ENTREZID
   <chr>   <chr>         <dbl> <dbl> <chr> <chr>     <chr>  <dbl> <dbl>    <dbl>
 1 Asl     GSM2545…       1170     8 Fema… Influenz… C57BL…     8    14   109900
 2 Apod    GSM2545…      36194     8 Fema… Influenz… C57BL…     8    14    11815
 3 Cyp2d22 GSM2545…       4060     8 Fema… Influenz… C57BL…     8    14    56448
 4 Klk6    GSM2545…        287     8 Fema… Influenz… C57BL…     8    14    19144
 5 Fcrls   GSM2545…         85     8 Fema… Influenz… C57BL…     8    14    80891
 6 Slc2a4  GSM2545…        782     8 Fema… Influenz… C57BL…     8    14    20528
 7 Exd2    GSM2545…       1619     8 Fema… Influenz… C57BL…     8    14    97827
 8 Gjc2    GSM2545…        288     8 Fema… Influenz… C57BL…     8    14   118454
 9 Plp1    GSM2545…      43217     8 Fema… Influenz… C57BL…     8    14    18823
10 Gnb4    GSM2545…       1071     8 Fema… Influenz… C57BL…     8    14    14696
# … with 32,418 more rows, and 7 more variables: product <chr>,
#   ensembl_gene_id <chr>, external_synonym <chr>, chromosome_name <chr>,
#   gene_biotype <chr>, phenotype_description <chr>,
#   hsapiens_homolog_associated_gene_name <chr>

This will select all the variables in rna except tissue and organism.

To choose rows based on a specific criteria, use filter():

filter(rna, sex == "Male")
# A tibble: 14,740 × 19
   gene    sample  expression organism   age sex   infection strain  time tissue
   <chr>   <chr>        <dbl> <chr>    <dbl> <chr> <chr>     <chr>  <dbl> <chr> 
 1 Asl     GSM254…        626 Mus mus…     8 Male  Influenz… C57BL…     4 Cereb…
 2 Apod    GSM254…      13021 Mus mus…     8 Male  Influenz… C57BL…     4 Cereb…
 3 Cyp2d22 GSM254…       2171 Mus mus…     8 Male  Influenz… C57BL…     4 Cereb…
 4 Klk6    GSM254…        448 Mus mus…     8 Male  Influenz… C57BL…     4 Cereb…
 5 Fcrls   GSM254…        180 Mus mus…     8 Male  Influenz… C57BL…     4 Cereb…
 6 Slc2a4  GSM254…        313 Mus mus…     8 Male  Influenz… C57BL…     4 Cereb…
 7 Exd2    GSM254…       2366 Mus mus…     8 Male  Influenz… C57BL…     4 Cereb…
 8 Gjc2    GSM254…        310 Mus mus…     8 Male  Influenz… C57BL…     4 Cereb…
 9 Plp1    GSM254…      53126 Mus mus…     8 Male  Influenz… C57BL…     4 Cereb…
10 Gnb4    GSM254…       1355 Mus mus…     8 Male  Influenz… C57BL…     4 Cereb…
# … with 14,730 more rows, and 9 more variables: mouse <dbl>, ENTREZID <dbl>,
#   product <chr>, ensembl_gene_id <chr>, external_synonym <chr>,
#   chromosome_name <chr>, gene_biotype <chr>, phenotype_description <chr>,
#   hsapiens_homolog_associated_gene_name <chr>
filter(rna, sex == "Male" & infection == "NonInfected")
# A tibble: 4,422 × 19
   gene    sample  expression organism   age sex   infection strain  time tissue
   <chr>   <chr>        <dbl> <chr>    <dbl> <chr> <chr>     <chr>  <dbl> <chr> 
 1 Asl     GSM254…        535 Mus mus…     8 Male  NonInfec… C57BL…     0 Cereb…
 2 Apod    GSM254…      13668 Mus mus…     8 Male  NonInfec… C57BL…     0 Cereb…
 3 Cyp2d22 GSM254…       2008 Mus mus…     8 Male  NonInfec… C57BL…     0 Cereb…
 4 Klk6    GSM254…       1101 Mus mus…     8 Male  NonInfec… C57BL…     0 Cereb…
 5 Fcrls   GSM254…        375 Mus mus…     8 Male  NonInfec… C57BL…     0 Cereb…
 6 Slc2a4  GSM254…        249 Mus mus…     8 Male  NonInfec… C57BL…     0 Cereb…
 7 Exd2    GSM254…       3126 Mus mus…     8 Male  NonInfec… C57BL…     0 Cereb…
 8 Gjc2    GSM254…        791 Mus mus…     8 Male  NonInfec… C57BL…     0 Cereb…
 9 Plp1    GSM254…      98658 Mus mus…     8 Male  NonInfec… C57BL…     0 Cereb…
10 Gnb4    GSM254…       2437 Mus mus…     8 Male  NonInfec… C57BL…     0 Cereb…
# … with 4,412 more rows, and 9 more variables: mouse <dbl>, ENTREZID <dbl>,
#   product <chr>, ensembl_gene_id <chr>, external_synonym <chr>,
#   chromosome_name <chr>, gene_biotype <chr>, phenotype_description <chr>,
#   hsapiens_homolog_associated_gene_name <chr>

Now let’s imagine we are interested in the human homologs of the mouse genes analysed in this dataset. This information can be found in the last column of the rna tibble, named hsapiens_homolog_associated_gene_name. Some mouse gene have no human homologs. These can be retrieved using a filter() in the chain, and the is.na() function that determines whether something is an NA.

rna_NA <- filter(rna, is.na(hsapiens_homolog_associated_gene_name))
select(rna_NA, gene, hsapiens_homolog_associated_gene_name)
# A tibble: 4,290 × 2
   gene     hsapiens_homolog_associated_gene_name
   <chr>    <chr>                                
 1 Prodh    <NA>                                 
 2 Tssk5    <NA>                                 
 3 Vmn2r1   <NA>                                 
 4 Gm10654  <NA>                                 
 5 Hexa     <NA>                                 
 6 Sult1a1  <NA>                                 
 7 Gm6277   <NA>                                 
 8 Tmem198b <NA>                                 
 9 Adam1a   <NA>                                 
10 Ebp      <NA>                                 
# … with 4,280 more rows

If we want to keep only mouse gene that have a human homolog, we can insert a “!” symbol that negates the result, so we’re asking for every row where hsapiens_homolog_associated_gene_name is not an NA.

rna_no_NA <- filter(rna, !is.na(hsapiens_homolog_associated_gene_name))
select(rna_no_NA, gene, hsapiens_homolog_associated_gene_name)
# A tibble: 28,138 × 2
   gene    hsapiens_homolog_associated_gene_name
   <chr>   <chr>                                
 1 Asl     ASL                                  
 2 Apod    APOD                                 
 3 Cyp2d22 CYP2D6                               
 4 Klk6    KLK6                                 
 5 Fcrls   FCRL2                                
 6 Slc2a4  SLC2A4                               
 7 Exd2    EXD2                                 
 8 Gjc2    GJC2                                 
 9 Plp1    PLP1                                 
10 Gnb4    GNB4                                 
# … with 28,128 more rows

Pipes

What if you want to select and filter at the same time? There are three ways to do this: use intermediate steps, nested functions, or pipes.

With intermediate steps, you create a temporary data frame and use that as input to the next function, like this:

rna2 <- filter(rna, sex == "Male")
rna3 <- select(rna2, gene, sample, tissue, expression)
rna3
# A tibble: 14,740 × 4
   gene    sample     tissue     expression
   <chr>   <chr>      <chr>           <dbl>
 1 Asl     GSM2545340 Cerebellum        626
 2 Apod    GSM2545340 Cerebellum      13021
 3 Cyp2d22 GSM2545340 Cerebellum       2171
 4 Klk6    GSM2545340 Cerebellum        448
 5 Fcrls   GSM2545340 Cerebellum        180
 6 Slc2a4  GSM2545340 Cerebellum        313
 7 Exd2    GSM2545340 Cerebellum       2366
 8 Gjc2    GSM2545340 Cerebellum        310
 9 Plp1    GSM2545340 Cerebellum      53126
10 Gnb4    GSM2545340 Cerebellum       1355
# … with 14,730 more rows

This is readable, but can clutter up your workspace with lots of intermediate objects that you have to name individually. With multiple steps, that can be hard to keep track of.

You can also nest functions (i.e. one function inside of another), like this:

rna3 <- select(filter(rna, sex == "Male"), gene, sample, tissue, expression)
rna3
# A tibble: 14,740 × 4
   gene    sample     tissue     expression
   <chr>   <chr>      <chr>           <dbl>
 1 Asl     GSM2545340 Cerebellum        626
 2 Apod    GSM2545340 Cerebellum      13021
 3 Cyp2d22 GSM2545340 Cerebellum       2171
 4 Klk6    GSM2545340 Cerebellum        448
 5 Fcrls   GSM2545340 Cerebellum        180
 6 Slc2a4  GSM2545340 Cerebellum        313
 7 Exd2    GSM2545340 Cerebellum       2366
 8 Gjc2    GSM2545340 Cerebellum        310
 9 Plp1    GSM2545340 Cerebellum      53126
10 Gnb4    GSM2545340 Cerebellum       1355
# … with 14,730 more rows

This is handy, but can be difficult to read if too many functions are nested, as R evaluates the expression from the inside out (in this case, filtering, then selecting).

The last option, pipes, are a recent addition to R. Pipes let you take the output of one function and send it directly to the next, which is useful when you need to do many things to the same dataset.

Pipes in R look like %>% and are made available via the magrittr package, installed automatically with dplyr. If you use RStudio, you can type the pipe with Ctrl

In the above code, we use the pipe to send the rna dataset first through filter() to keep rows where sex is Male, then through select() to keep only the gene, sample, tissue, and expressioncolumns.

The pipe %>% takes the object on its left and passes it directly as the first argument to the function on its right, we don’t need to explicitly include the data frame as an argument to the filter() and select() functions any more.

rna %>%
  filter(sex == "Male") %>%
  select(gene, sample, tissue, expression)
# A tibble: 14,740 × 4
   gene    sample     tissue     expression
   <chr>   <chr>      <chr>           <dbl>
 1 Asl     GSM2545340 Cerebellum        626
 2 Apod    GSM2545340 Cerebellum      13021
 3 Cyp2d22 GSM2545340 Cerebellum       2171
 4 Klk6    GSM2545340 Cerebellum        448
 5 Fcrls   GSM2545340 Cerebellum        180
 6 Slc2a4  GSM2545340 Cerebellum        313
 7 Exd2    GSM2545340 Cerebellum       2366
 8 Gjc2    GSM2545340 Cerebellum        310
 9 Plp1    GSM2545340 Cerebellum      53126
10 Gnb4    GSM2545340 Cerebellum       1355
# … with 14,730 more rows

Some may find it helpful to read the pipe like the word “then”. For instance, in the above example, we took the data frame rna, then we filtered for rows with sex == "Male", then we selected columns gene, sample, tissue, and expression.

The dplyr functions by themselves are somewhat simple, but by combining them into linear workflows with the pipe, we can accomplish more complex manipulations of data frames.

If we want to create a new object with this smaller version of the data, we can assign it a new name:

rna3 <- rna %>%
  filter(sex == "Male") %>%
  select(gene, sample, tissue, expression)

rna3
# A tibble: 14,740 × 4
   gene    sample     tissue     expression
   <chr>   <chr>      <chr>           <dbl>
 1 Asl     GSM2545340 Cerebellum        626
 2 Apod    GSM2545340 Cerebellum      13021
 3 Cyp2d22 GSM2545340 Cerebellum       2171
 4 Klk6    GSM2545340 Cerebellum        448
 5 Fcrls   GSM2545340 Cerebellum        180
 6 Slc2a4  GSM2545340 Cerebellum        313
 7 Exd2    GSM2545340 Cerebellum       2366
 8 Gjc2    GSM2545340 Cerebellum        310
 9 Plp1    GSM2545340 Cerebellum      53126
10 Gnb4    GSM2545340 Cerebellum       1355
# … with 14,730 more rows

Homework Challenge:

Using pipes, subset the rna data to keep genes with an expression higher than 50000 in female mice at time 0, and retain only the columns gene, sample, time, expression and age.

Solution

rna %>%
  filter(expression > 50000,
         sex == "Female",
         time == 0 ) %>%
  select(gene, sample, time, expression, age)
# A tibble: 9 × 5
  gene   sample      time expression   age
  <chr>  <chr>      <dbl>      <dbl> <dbl>
1 Plp1   GSM2545337     0     101241     8
2 Atp1b1 GSM2545337     0      53260     8
3 Plp1   GSM2545338     0      96534     8
4 Atp1b1 GSM2545338     0      50614     8
5 Plp1   GSM2545348     0     102790     8
6 Atp1b1 GSM2545348     0      59544     8
7 Plp1   GSM2545353     0      71237     8
8 Glul   GSM2545353     0      52451     8
9 Atp1b1 GSM2545353     0      61451     8

Mutate

Frequently you’ll want to create new columns based on the values of existing columns, for example to do unit conversions, or to find the ratio of values in two columns. For this we’ll use mutate().

To create a new column of time in hours:

rna %>%
  mutate(time_hours = time * 24) %>%
  select(time, time_hours)
# A tibble: 32,428 × 2
    time time_hours
   <dbl>      <dbl>
 1     8        192
 2     8        192
 3     8        192
 4     8        192
 5     8        192
 6     8        192
 7     8        192
 8     8        192
 9     8        192
10     8        192
# … with 32,418 more rows

You can also create a second new column based on the first new column within the same call of mutate():

rna %>%
  mutate(time_hours = time * 24,
         time_mn = time_hours * 60) %>%
  select(time, time_hours, time_mn)
# A tibble: 32,428 × 3
    time time_hours time_mn
   <dbl>      <dbl>   <dbl>
 1     8        192   11520
 2     8        192   11520
 3     8        192   11520
 4     8        192   11520
 5     8        192   11520
 6     8        192   11520
 7     8        192   11520
 8     8        192   11520
 9     8        192   11520
10     8        192   11520
# … with 32,418 more rows

Homework Challenge

Create a new data frame from the rna data that meets the following criteria: contains only the gene, chromosome_name, phenotype_description, sample, and expression columns and a new column giving the log expression the gene. This data frame must only contain genes located on autosomes and associated with a phenotype_description.

Hint: think about how the commands should be ordered to produce this data frame!

Solution

rna %>%
  filter(chromosome_name != "X", chromosome_name != "Y") %>%
  mutate(log_expression = log(expression)) %>%
  select(gene, chromosome_name, phenotype_description, sample, log_expression) %>%
  filter(!is.na(phenotype_description))
# A tibble: 21,054 × 5
   gene    chromosome_name phenotype_description           sample log_expression
   <chr>   <chr>           <chr>                           <chr>           <dbl>
 1 Asl     5               abnormal circulating amino aci… GSM25…           7.06
 2 Apod    16              abnormal lipid homeostasis      GSM25…          10.5 
 3 Cyp2d22 15              abnormal skin morphology        GSM25…           8.31
 4 Klk6    7               abnormal cytokine level         GSM25…           5.66
 5 Fcrls   3               decreased CD8-positive alpha-b… GSM25…           4.44
 6 Slc2a4  11              abnormal circulating glucose l… GSM25…           6.66
 7 Gjc2    11              Purkinje cell degeneration      GSM25…           5.66
 8 Gnb4    3               decreased anxiety-related resp… GSM25…           6.98
 9 Tnc     4               abnormal CNS synaptic transmis… GSM25…           5.39
10 Trf     9               abnormal circulating phosphate… GSM25…           9.18
# … with 21,044 more rows

Split-apply-combine data analysis

Many data analysis tasks can be approached using the split-apply-combine paradigm: split the data into groups, apply some analysis to each group, and then combine the results. dplyr makes this very easy through the use of the group_by() function.

rna %>%
  group_by(gene)
# A tibble: 32,428 × 19
# Groups:   gene [1,474]
   gene    sample  expression organism   age sex   infection strain  time tissue
   <chr>   <chr>        <dbl> <chr>    <dbl> <chr> <chr>     <chr>  <dbl> <chr> 
 1 Asl     GSM254…       1170 Mus mus…     8 Fema… Influenz… C57BL…     8 Cereb…
 2 Apod    GSM254…      36194 Mus mus…     8 Fema… Influenz… C57BL…     8 Cereb…
 3 Cyp2d22 GSM254…       4060 Mus mus…     8 Fema… Influenz… C57BL…     8 Cereb…
 4 Klk6    GSM254…        287 Mus mus…     8 Fema… Influenz… C57BL…     8 Cereb…
 5 Fcrls   GSM254…         85 Mus mus…     8 Fema… Influenz… C57BL…     8 Cereb…
 6 Slc2a4  GSM254…        782 Mus mus…     8 Fema… Influenz… C57BL…     8 Cereb…
 7 Exd2    GSM254…       1619 Mus mus…     8 Fema… Influenz… C57BL…     8 Cereb…
 8 Gjc2    GSM254…        288 Mus mus…     8 Fema… Influenz… C57BL…     8 Cereb…
 9 Plp1    GSM254…      43217 Mus mus…     8 Fema… Influenz… C57BL…     8 Cereb…
10 Gnb4    GSM254…       1071 Mus mus…     8 Fema… Influenz… C57BL…     8 Cereb…
# … with 32,418 more rows, and 9 more variables: mouse <dbl>, ENTREZID <dbl>,
#   product <chr>, ensembl_gene_id <chr>, external_synonym <chr>,
#   chromosome_name <chr>, gene_biotype <chr>, phenotype_description <chr>,
#   hsapiens_homolog_associated_gene_name <chr>

The group_by() function doesn’t perform any data processing, it groups the data into subsets: in the example above, our initial tibble of 32428 observations is split into 1474 groups based on the gene variable.

We could similarly decide to group the tibble by the samples:

rna %>%
  group_by(sample)
# A tibble: 32,428 × 19
# Groups:   sample [22]
   gene    sample  expression organism   age sex   infection strain  time tissue
   <chr>   <chr>        <dbl> <chr>    <dbl> <chr> <chr>     <chr>  <dbl> <chr> 
 1 Asl     GSM254…       1170 Mus mus…     8 Fema… Influenz… C57BL…     8 Cereb…
 2 Apod    GSM254…      36194 Mus mus…     8 Fema… Influenz… C57BL…     8 Cereb…
 3 Cyp2d22 GSM254…       4060 Mus mus…     8 Fema… Influenz… C57BL…     8 Cereb…
 4 Klk6    GSM254…        287 Mus mus…     8 Fema… Influenz… C57BL…     8 Cereb…
 5 Fcrls   GSM254…         85 Mus mus…     8 Fema… Influenz… C57BL…     8 Cereb…
 6 Slc2a4  GSM254…        782 Mus mus…     8 Fema… Influenz… C57BL…     8 Cereb…
 7 Exd2    GSM254…       1619 Mus mus…     8 Fema… Influenz… C57BL…     8 Cereb…
 8 Gjc2    GSM254…        288 Mus mus…     8 Fema… Influenz… C57BL…     8 Cereb…
 9 Plp1    GSM254…      43217 Mus mus…     8 Fema… Influenz… C57BL…     8 Cereb…
10 Gnb4    GSM254…       1071 Mus mus…     8 Fema… Influenz… C57BL…     8 Cereb…
# … with 32,418 more rows, and 9 more variables: mouse <dbl>, ENTREZID <dbl>,
#   product <chr>, ensembl_gene_id <chr>, external_synonym <chr>,
#   chromosome_name <chr>, gene_biotype <chr>, phenotype_description <chr>,
#   hsapiens_homolog_associated_gene_name <chr>

Here our initial tibble of 32428 observations is split into 22 groups based on the sample variable.

Once the data have been combined, subsequent operations will be applied on each group independently.

The summarize() function

group_by() is often used together with summarize(), which collapses each group into a single-row summary of that group.

group_by() takes as arguments the column names that contain the categorical variables for which you want to calculate the summary statistics. So to compute the mean expression by gene:

rna %>%
  group_by(gene) %>%
  summarize(mean_expression = mean(expression))
# A tibble: 1,474 × 2
   gene    mean_expression
   <chr>             <dbl>
 1 Aamp            4751.  
 2 Abca12             4.55
 3 Abcc8           2498.  
 4 Abhd14a          525.  
 5 Abi2            4909.  
 6 Abi3bp          1002.  
 7 Abl2            2124.  
 8 Acadl           2053.  
 9 Acap3           3536.  
10 Acbd4           1431.  
# … with 1,464 more rows

We could also want to calculate the mean expression levels of all genes in each sample:

rna %>%
  group_by(sample) %>%
  summarize(mean_expression = mean(expression))
# A tibble: 22 × 2
   sample     mean_expression
   <chr>                <dbl>
 1 GSM2545336           2062.
 2 GSM2545337           1766.
 3 GSM2545338           1668.
 4 GSM2545339           1696.
 5 GSM2545340           1682.
 6 GSM2545341           1638.
 7 GSM2545342           1594.
 8 GSM2545343           2107.
 9 GSM2545344           1712.
10 GSM2545345           1700.
# … with 12 more rows

But we can can also group by multiple columns:

rna %>%
  group_by(gene, infection, time) %>%
  summarize(mean_expression = mean(expression))
`summarise()` has grouped output by 'gene', 'infection'. You can override using
the `.groups` argument.
# A tibble: 4,422 × 4
# Groups:   gene, infection [2,948]
   gene    infection    time mean_expression
   <chr>   <chr>       <dbl>           <dbl>
 1 Aamp    InfluenzaA      4         4870   
 2 Aamp    InfluenzaA      8         4763.  
 3 Aamp    NonInfected     0         4603.  
 4 Abca12  InfluenzaA      4            4.25
 5 Abca12  InfluenzaA      8            4.14
 6 Abca12  NonInfected     0            5.29
 7 Abcc8   InfluenzaA      4         2609.  
 8 Abcc8   InfluenzaA      8         2292.  
 9 Abcc8   NonInfected     0         2576.  
10 Abhd14a InfluenzaA      4          547.  
# … with 4,412 more rows

Once the data is grouped, you can also summarize multiple variables at the same time (and not necessarily on the same variable). For instance, we could add a column indicating the median expression by gene and by condition:

rna %>%
  group_by(gene, infection, time) %>%
  summarize(mean_expression = mean(expression),
            median_expression = median(expression))
`summarise()` has grouped output by 'gene', 'infection'. You can override using
the `.groups` argument.
# A tibble: 4,422 × 5
# Groups:   gene, infection [2,948]
   gene    infection    time mean_expression median_expression
   <chr>   <chr>       <dbl>           <dbl>             <dbl>
 1 Aamp    InfluenzaA      4         4870               4708  
 2 Aamp    InfluenzaA      8         4763.              4813  
 3 Aamp    NonInfected     0         4603.              4717  
 4 Abca12  InfluenzaA      4            4.25               4.5
 5 Abca12  InfluenzaA      8            4.14               4  
 6 Abca12  NonInfected     0            5.29               5  
 7 Abcc8   InfluenzaA      4         2609.              2424. 
 8 Abcc8   InfluenzaA      8         2292.              2224  
 9 Abcc8   NonInfected     0         2576.              2578  
10 Abhd14a InfluenzaA      4          547.               523  
# … with 4,412 more rows

Challenge

Calculate the mean expression level of gene “Dok3” by timepoints.

Solution

rna %>%
  filter(gene == "Dok3") %>%
  group_by(time) %>%
  summarize(mean = mean(expression))
# A tibble: 3 × 2
   time  mean
  <dbl> <dbl>
1     0  169 
2     4  156.
3     8   61 

Counting

When working with data, we often want to know the number of observations found for each factor or combination of factors. For this task, dplyr provides count(). For example, if we wanted to count the number of rows of data for each infected and non infected samples, we would do:

rna %>%
    count(infection)
# A tibble: 2 × 2
  infection       n
  <chr>       <int>
1 InfluenzaA  22110
2 NonInfected 10318

The count() function is shorthand for something we’ve already seen: grouping by a variable, and summarizing it by counting the number of observations in that group. In other words, rna %>% count() is equivalent to:

rna %>%
    group_by(infection) %>%
    summarise(n = n())
# A tibble: 2 × 2
  infection       n
  <chr>       <int>
1 InfluenzaA  22110
2 NonInfected 10318

Previous example shows the use of count() to count the number of rows/observations for one factor (i.e., infection). If we wanted to count combination of factors, such as infection and time, we would specify the first and the second factor as the arguments of count():

rna %>%
    count(infection, time)
# A tibble: 3 × 3
  infection    time     n
  <chr>       <dbl> <int>
1 InfluenzaA      4 11792
2 InfluenzaA      8 10318
3 NonInfected     0 10318

which is equivalent to this:

rna %>%
  group_by(infection, time) %>%
  summarize(n = n())
`summarise()` has grouped output by 'infection'. You can override using the
`.groups` argument.
# A tibble: 3 × 3
# Groups:   infection [2]
  infection    time     n
  <chr>       <dbl> <int>
1 InfluenzaA      4 11792
2 InfluenzaA      8 10318
3 NonInfected     0 10318

It is sometimes useful to sort the result to facilitate the comparisons. We can use arrange() to sort the table. For instance, we might want to arrange the table above by time:

rna %>%
  count(infection, time) %>%
  arrange(time)
# A tibble: 3 × 3
  infection    time     n
  <chr>       <dbl> <int>
1 NonInfected     0 10318
2 InfluenzaA      4 11792
3 InfluenzaA      8 10318

or by counts:

rna %>%
  count(infection, time) %>%
  arrange(n)
# A tibble: 3 × 3
  infection    time     n
  <chr>       <dbl> <int>
1 InfluenzaA      8 10318
2 NonInfected     0 10318
3 InfluenzaA      4 11792

To sort in descending order, we need to add the desc() function:

rna %>%
  count(infection, time) %>%
  arrange(desc(n))
# A tibble: 3 × 3
  infection    time     n
  <chr>       <dbl> <int>
1 InfluenzaA      4 11792
2 InfluenzaA      8 10318
3 NonInfected     0 10318

Challenge

  1. How many genes were analysed in each sample?
  2. Use group_by() and summarize() to evaluate the sequencing depth (the sum of all counts) in each sample. Which sample has the highest sequencing depth?
  3. Pick one sample and evaluate the number of genes by biotype
  4. Identify genes associated with “abnormal DNA methylation” phenotype description, and calculate their mean expression (in log) at time 0, time 4 and time 8.

Solution

## 1.
rna %>%
  count(sample)
# A tibble: 22 × 2
   sample         n
   <chr>      <int>
 1 GSM2545336  1474
 2 GSM2545337  1474
 3 GSM2545338  1474
 4 GSM2545339  1474
 5 GSM2545340  1474
 6 GSM2545341  1474
 7 GSM2545342  1474
 8 GSM2545343  1474
 9 GSM2545344  1474
10 GSM2545345  1474
# … with 12 more rows
## 2.
rna %>%
  group_by(sample) %>%
  summarize(seq_depth = sum(expression)) %>%
  arrange(desc(seq_depth))
# A tibble: 22 × 2
   sample     seq_depth
   <chr>          <dbl>
 1 GSM2545350   3255566
 2 GSM2545352   3216163
 3 GSM2545343   3105652
 4 GSM2545336   3039671
 5 GSM2545380   3036098
 6 GSM2545353   2953249
 7 GSM2545348   2913678
 8 GSM2545362   2913517
 9 GSM2545351   2782464
10 GSM2545349   2758006
# … with 12 more rows
## 3.
rna %>%
  filter(sample == "GSM2545336") %>%
  count(gene_biotype) %>%
  arrange(desc(n))
# A tibble: 13 × 2
   gene_biotype                           n
   <chr>                              <int>
 1 protein_coding                      1321
 2 lncRNA                                69
 3 processed_pseudogene                  59
 4 miRNA                                  7
 5 snoRNA                                 5
 6 TEC                                    4
 7 polymorphic_pseudogene                 2
 8 unprocessed_pseudogene                 2
 9 IG_C_gene                              1
10 scaRNA                                 1
11 transcribed_processed_pseudogene       1
12 transcribed_unitary_pseudogene         1
13 transcribed_unprocessed_pseudogene     1
## 4.
rna %>%
  filter(phenotype_description == "abnormal DNA methylation") %>%
  group_by(gene, time) %>%
  summarize(mean_expression = mean(log(expression))) %>%
  arrange()
`summarise()` has grouped output by 'gene'. You can override using the
`.groups` argument.
# A tibble: 6 × 3
# Groups:   gene [2]
  gene   time mean_expression
  <chr> <dbl>           <dbl>
1 Xist      0            6.95
2 Xist      4            6.34
3 Xist      8            7.13
4 Zdbf2     0            6.27
5 Zdbf2     4            6.27
6 Zdbf2     8            6.19

Reshaping data

In the rna tibble, the rows contain expression values (the unit) that are associated with a combination of 2 other variables: gene and sample.

All the other columns correspond to variables describing either the sample (organism, age, sex,…) or the gene (gene_biotype, ENTREZ_ID, product…). The variables that don’t change with genes or with samples will have the same value in all the rows.

rna %>%
  arrange(gene)
# A tibble: 32,428 × 19
   gene  sample    expression organism   age sex   infection strain  time tissue
   <chr> <chr>          <dbl> <chr>    <dbl> <chr> <chr>     <chr>  <dbl> <chr> 
 1 Aamp  GSM25453…       5621 Mus mus…     8 Fema… Influenz… C57BL…     8 Cereb…
 2 Aamp  GSM25453…       4049 Mus mus…     8 Fema… NonInfec… C57BL…     0 Cereb…
 3 Aamp  GSM25453…       3797 Mus mus…     8 Fema… NonInfec… C57BL…     0 Cereb…
 4 Aamp  GSM25453…       4375 Mus mus…     8 Fema… Influenz… C57BL…     4 Cereb…
 5 Aamp  GSM25453…       4095 Mus mus…     8 Male  Influenz… C57BL…     4 Cereb…
 6 Aamp  GSM25453…       3867 Mus mus…     8 Male  Influenz… C57BL…     8 Cereb…
 7 Aamp  GSM25453…       3578 Mus mus…     8 Fema… Influenz… C57BL…     8 Cereb…
 8 Aamp  GSM25453…       5097 Mus mus…     8 Male  NonInfec… C57BL…     0 Cereb…
 9 Aamp  GSM25453…       4202 Mus mus…     8 Fema… Influenz… C57BL…     4 Cereb…
10 Aamp  GSM25453…       4701 Mus mus…     8 Male  Influenz… C57BL…     4 Cereb…
# … with 32,418 more rows, and 9 more variables: mouse <dbl>, ENTREZID <dbl>,
#   product <chr>, ensembl_gene_id <chr>, external_synonym <chr>,
#   chromosome_name <chr>, gene_biotype <chr>, phenotype_description <chr>,
#   hsapiens_homolog_associated_gene_name <chr>

This structure is called a long-format, as one column contains all the values, and other column(s) list(s) the context of the value.

In certain cases, the long-format is not really “human-readable”, and another format, a wide-format is preferred, as a more compact way of representing the data. This is typically the case with gene expression values that scientists are used to look as matrices, were rows represent genes and columns represent samples.

In this format, it would become therefore straightforward to explore the relationship between the gene expression levels within, and between, the samples.

# A tibble: 1,474 × 23
   gene    GSM2545336 GSM2545337 GSM2545338 GSM2545339 GSM2545340 GSM2545341
   <chr>        <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
 1 Asl           1170        361        400        586        626        988
 2 Apod         36194      10347       9173      10620      13021      29594
 3 Cyp2d22       4060       1616       1603       1901       2171       3349
 4 Klk6           287        629        641        578        448        195
 5 Fcrls           85        233        244        237        180         38
 6 Slc2a4         782        231        248        265        313        786
 7 Exd2          1619       2288       2235       2513       2366       1359
 8 Gjc2           288        595        568        551        310        146
 9 Plp1         43217     101241      96534      58354      53126      27173
10 Gnb4          1071       1791       1867       1430       1355        798
# … with 1,464 more rows, and 16 more variables: GSM2545342 <dbl>,
#   GSM2545343 <dbl>, GSM2545344 <dbl>, GSM2545345 <dbl>, GSM2545346 <dbl>,
#   GSM2545347 <dbl>, GSM2545348 <dbl>, GSM2545349 <dbl>, GSM2545350 <dbl>,
#   GSM2545351 <dbl>, GSM2545352 <dbl>, GSM2545353 <dbl>, GSM2545354 <dbl>,
#   GSM2545362 <dbl>, GSM2545363 <dbl>, GSM2545380 <dbl>

To convert the gene expression values from rna into a wide-format, we need to create a new table where the values of the sample column would become the names of column variables.

The key point here is that we are still following a tidy data structure, but we have reshaped the data according to the observations of interest: expression levels per gene instead of recording them per gene and per sample.

The opposite transformation would be to transform column names into values of a new variable.

We can do both these of transformations with two tidyr functions, pivot_longer() and pivot_wider() (see here for details).

Pivoting the data into a wider format

Let’s first select the 3 first columns of rna and use pivot_wider() to transform data in a wide-format.

rna_exp <- rna %>%
  select(gene, sample, expression)
rna_exp
# A tibble: 32,428 × 3
   gene    sample     expression
   <chr>   <chr>           <dbl>
 1 Asl     GSM2545336       1170
 2 Apod    GSM2545336      36194
 3 Cyp2d22 GSM2545336       4060
 4 Klk6    GSM2545336        287
 5 Fcrls   GSM2545336         85
 6 Slc2a4  GSM2545336        782
 7 Exd2    GSM2545336       1619
 8 Gjc2    GSM2545336        288
 9 Plp1    GSM2545336      43217
10 Gnb4    GSM2545336       1071
# … with 32,418 more rows

pivot_wider takes three main arguments:

  1. the data to be transformed;
  2. the names_from : the column whose values will become new column names;
  3. the values_from: the column whose values will fill the new columns.
rna_wide <- rna_exp %>%
  pivot_wider(names_from = sample,
              values_from = expression)
rna_wide
# A tibble: 1,474 × 23
   gene    GSM2545336 GSM2545337 GSM2545338 GSM2545339 GSM2545340 GSM2545341
   <chr>        <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
 1 Asl           1170        361        400        586        626        988
 2 Apod         36194      10347       9173      10620      13021      29594
 3 Cyp2d22       4060       1616       1603       1901       2171       3349
 4 Klk6           287        629        641        578        448        195
 5 Fcrls           85        233        244        237        180         38
 6 Slc2a4         782        231        248        265        313        786
 7 Exd2          1619       2288       2235       2513       2366       1359
 8 Gjc2           288        595        568        551        310        146
 9 Plp1         43217     101241      96534      58354      53126      27173
10 Gnb4          1071       1791       1867       1430       1355        798
# … with 1,464 more rows, and 16 more variables: GSM2545342 <dbl>,
#   GSM2545343 <dbl>, GSM2545344 <dbl>, GSM2545345 <dbl>, GSM2545346 <dbl>,
#   GSM2545347 <dbl>, GSM2545348 <dbl>, GSM2545349 <dbl>, GSM2545350 <dbl>,
#   GSM2545351 <dbl>, GSM2545352 <dbl>, GSM2545353 <dbl>, GSM2545354 <dbl>,
#   GSM2545362 <dbl>, GSM2545363 <dbl>, GSM2545380 <dbl>

Pivoting data into a longer format

In the opposite situation we are using the column names and turning them into a pair of new variables. One variable represents the column names as values, and the other variable contains the values previously associated with the column names.

pivot_longer() takes four main arguments:

  1. the data to be transformed;
  2. the names_to: the new column name we wish to create and populate with the current column names;
  3. the values_to: the new column name we wish to create and populate with current values;
  4. the names of the columns to be used to populate the names_to and values_to variables (or to drop).

To recreate rna_long from rna_long we would create a key called sample and value called expression and use all columns except gene for the key variable. Here we drop gene column with a minus sign.

Notice how the new variable names are to be quoted here.

rna_long <- rna_wide %>%
    pivot_longer(names_to = "sample",
                 values_to = "expression",
                 -gene)
rna_long
# A tibble: 32,428 × 3
   gene  sample     expression
   <chr> <chr>           <dbl>
 1 Asl   GSM2545336       1170
 2 Asl   GSM2545337        361
 3 Asl   GSM2545338        400
 4 Asl   GSM2545339        586
 5 Asl   GSM2545340        626
 6 Asl   GSM2545341        988
 7 Asl   GSM2545342        836
 8 Asl   GSM2545343        535
 9 Asl   GSM2545344        586
10 Asl   GSM2545345        597
# … with 32,418 more rows

We could also have used a specification for what columns to include. This can be useful if you have a large number of identifying columns, and it’s easier to specify what to gather than what to leave alone. Here the starts_with() function can help to retrieve sample names without having to list them all! Another possibility would be to use the : operator!

rna_wide %>%
    pivot_longer(names_to = "sample",
                 values_to = "expression",
                 cols = starts_with("GSM"))
# A tibble: 32,428 × 3
   gene  sample     expression
   <chr> <chr>           <dbl>
 1 Asl   GSM2545336       1170
 2 Asl   GSM2545337        361
 3 Asl   GSM2545338        400
 4 Asl   GSM2545339        586
 5 Asl   GSM2545340        626
 6 Asl   GSM2545341        988
 7 Asl   GSM2545342        836
 8 Asl   GSM2545343        535
 9 Asl   GSM2545344        586
10 Asl   GSM2545345        597
# … with 32,418 more rows
rna_wide %>%
    pivot_longer(names_to = "sample",
                 values_to = "expression",
                 GSM2545336:GSM2545380)
# A tibble: 32,428 × 3
   gene  sample     expression
   <chr> <chr>           <dbl>
 1 Asl   GSM2545336       1170
 2 Asl   GSM2545337        361
 3 Asl   GSM2545338        400
 4 Asl   GSM2545339        586
 5 Asl   GSM2545340        626
 6 Asl   GSM2545341        988
 7 Asl   GSM2545342        836
 8 Asl   GSM2545343        535
 9 Asl   GSM2545344        586
10 Asl   GSM2545345        597
# … with 32,418 more rows

Challenge

Subset genes located on X and Y chromosomes from the rna data frame and spread the data frame with sex as columns, chromosome_name as rows, and the mean expression of genes located in each chromosome as the values, as in the following tibble:

You will need to summarize before reshaping!

Let’s first calculate the mean expression level of X and Y linked genes from male and female samples…

 rna %>%
  filter(chromosome_name == "Y" | chromosome_name == "X") %>%
  group_by(sex, chromosome_name) %>%
  summarize(mean = mean(expression))
`summarise()` has grouped output by 'sex'. You can override using the `.groups`
argument.
# A tibble: 4 × 3
# Groups:   sex [2]
  sex    chromosome_name  mean
  <chr>  <chr>           <dbl>
1 Female X               3504.
2 Female Y                  3 
3 Male   X               2497.
4 Male   Y               2117.

And pivot the table to wide format

rna_1 <- rna %>%
  filter(chromosome_name == "Y" | chromosome_name == "X") %>%
  group_by(sex, chromosome_name) %>%
  summarize(mean = mean(expression)) %>%
  pivot_wider(names_from = sex,
              values_from = mean)
`summarise()` has grouped output by 'sex'. You can override using the `.groups`
argument.
rna_1
# A tibble: 2 × 3
  chromosome_name Female  Male
  <chr>            <dbl> <dbl>
1 X                3504. 2497.
2 Y                   3  2117.

Now take that data frame and transform it with pivot_longer() so each row is a unique chromosome_name by gender combination.

Solution

rna_1 %>%
  pivot_longer(names_to = "gender",
               values_to = "mean",
               - chromosome_name)
# A tibble: 4 × 3
  chromosome_name gender  mean
  <chr>           <chr>  <dbl>
1 X               Female 3504.
2 X               Male   2497.
3 Y               Female    3 
4 Y               Male   2117.

Homework Challenge

  1. Use the rna dataset to create an expression table where each row represents the mean expression levels of genes and columns represent the different timepoints.
  2. Use the previous table containing mean expression levels per timepoint and create a new column containing fold-changes between timepoint 8 and timepoint 0, and fold-changes between timepoint 8 and timepoint 4. Convert this table into a long-format table gathering the foldchanges calculated.

Solution

Let’s first calculate the mean expression by gene and by time

 rna %>%
   group_by(gene, time) %>%
 summarize(mean_exp = mean(expression))
`summarise()` has grouped output by 'gene'. You can override using the
`.groups` argument.
# A tibble: 4,422 × 3
# Groups:   gene [1,474]
   gene     time mean_exp
   <chr>   <dbl>    <dbl>
 1 Aamp        0  4603.  
 2 Aamp        4  4870   
 3 Aamp        8  4763.  
 4 Abca12      0     5.29
 5 Abca12      4     4.25
 6 Abca12      8     4.14
 7 Abcc8       0  2576.  
 8 Abcc8       4  2609.  
 9 Abcc8       8  2292.  
10 Abhd14a     0   591.  
# … with 4,412 more rows

before using the pivot_wider() function

rna_time <- rna %>%
 group_by(gene, time) %>%
 summarize(mean_exp = mean(expression)) %>%
 pivot_wider(names_from = time,
          values_from = mean_exp)
`summarise()` has grouped output by 'gene'. You can override using the
`.groups` argument.
rna_time
# A tibble: 1,474 × 4
# Groups:   gene [1,474]
   gene        `0`     `4`     `8`
   <chr>     <dbl>   <dbl>   <dbl>
 1 Aamp    4603.   4870    4763.  
 2 Abca12     5.29    4.25    4.14
 3 Abcc8   2576.   2609.   2292.  
 4 Abhd14a  591.    547.    432.  
 5 Abi2    4881.   4903.   4945.  
 6 Abi3bp  1175.   1061.    762.  
 7 Abl2    2170.   2078.   2131.  
 8 Acadl   2059.   2099    1995.  
 9 Acap3   3745    3446.   3431.  
10 Acbd4   1219.   1410.   1668.  
# … with 1,464 more rows

Notice that this generates a tibble with some column names starting by a number. If we wanted to select the column corresponding to the timepoints, we could not use the column names directly… What happens when we select the colum 4?

 rna %>%
  group_by(gene, time) %>%
  summarize(mean_exp = mean(expression)) %>%
  pivot_wider(names_from = time,
              values_from = mean_exp) %>%
   select(gene, 4)
`summarise()` has grouped output by 'gene'. You can override using the
`.groups` argument.
# A tibble: 1,474 × 2
# Groups:   gene [1,474]
   gene        `8`
   <chr>     <dbl>
 1 Aamp    4763.  
 2 Abca12     4.14
 3 Abcc8   2292.  
 4 Abhd14a  432.  
 5 Abi2    4945.  
 6 Abi3bp   762.  
 7 Abl2    2131.  
 8 Acadl   1995.  
 9 Acap3   3431.  
10 Acbd4   1668.  
# … with 1,464 more rows

To select the timepoint 4, we would have to quote the column name, with backticks “`”

 rna %>%
  group_by(gene, time) %>%
  summarize(mean_exp = mean(expression)) %>%
  pivot_wider(names_from = time,
               values_from = mean_exp) %>%
  select(gene, `4`)
`summarise()` has grouped output by 'gene'. You can override using the
`.groups` argument.
# A tibble: 1,474 × 2
# Groups:   gene [1,474]
   gene        `4`
   <chr>     <dbl>
 1 Aamp    4870   
 2 Abca12     4.25
 3 Abcc8   2609.  
 4 Abhd14a  547.  
 5 Abi2    4903.  
 6 Abi3bp  1061.  
 7 Abl2    2078.  
 8 Acadl   2099   
 9 Acap3   3446.  
10 Acbd4   1410.  
# … with 1,464 more rows

Another possibility would be to rename the column, choosing a name that doesn’t start by a number :

rna_time <- rna %>%
  group_by(gene, time) %>%
  summarize(mean_exp = mean(expression)) %>%
  pivot_wider(names_from = time,
               values_from = mean_exp) %>%
  rename("time0" = `0`, "time4" = `4`, "time8" = `8`)
`summarise()` has grouped output by 'gene'. You can override using the
`.groups` argument.

Calculate FoldChanges:

 rna_time %>%
  mutate(time_8_vs_0 = time8 / time0, time_8_vs_4 = time8 / time4)
# A tibble: 1,474 × 6
# Groups:   gene [1,474]
   gene      time0   time4   time8 time_8_vs_0 time_8_vs_4
   <chr>     <dbl>   <dbl>   <dbl>       <dbl>       <dbl>
 1 Aamp    4603.   4870    4763.         1.03        0.978
 2 Abca12     5.29    4.25    4.14       0.784       0.975
 3 Abcc8   2576.   2609.   2292.         0.889       0.878
 4 Abhd14a  591.    547.    432.         0.731       0.791
 5 Abi2    4881.   4903.   4945.         1.01        1.01 
 6 Abi3bp  1175.   1061.    762.         0.649       0.719
 7 Abl2    2170.   2078.   2131.         0.982       1.03 
 8 Acadl   2059.   2099    1995.         0.969       0.950
 9 Acap3   3745    3446.   3431.         0.916       0.996
10 Acbd4   1219.   1410.   1668.         1.37        1.18 
# … with 1,464 more rows

And use the pivot_longer() function:

 rna_time %>%
   mutate(time_8_vs_0 = time8 / time0, time_8_vs_4 = time8 / time4) %>%
  pivot_longer(names_to = "comparisons",
                values_to = "Fold_changes",
               time_8_vs_0:time_8_vs_4)
# A tibble: 2,948 × 6
# Groups:   gene [1,474]
   gene      time0   time4   time8 comparisons Fold_changes
   <chr>     <dbl>   <dbl>   <dbl> <chr>              <dbl>
 1 Aamp    4603.   4870    4763.   time_8_vs_0        1.03 
 2 Aamp    4603.   4870    4763.   time_8_vs_4        0.978
 3 Abca12     5.29    4.25    4.14 time_8_vs_0        0.784
 4 Abca12     5.29    4.25    4.14 time_8_vs_4        0.975
 5 Abcc8   2576.   2609.   2292.   time_8_vs_0        0.889
 6 Abcc8   2576.   2609.   2292.   time_8_vs_4        0.878
 7 Abhd14a  591.    547.    432.   time_8_vs_0        0.731
 8 Abhd14a  591.    547.    432.   time_8_vs_4        0.791
 9 Abi2    4881.   4903.   4945.   time_8_vs_0        1.01 
10 Abi2    4881.   4903.   4945.   time_8_vs_4        1.01 
# … with 2,938 more rows

Exporting data

Now that you have learned how to use dplyr to extract information from or summarize your raw data, you may want to export these new data sets to share them with your collaborators or for archival.

Similar to the read_csv() function used for reading CSV files into R, there is a write_csv() function that generates CSV files from data frames.

Before using write_csv(), we are going to create a new folder, data_output, in our working directory (if we haven’t already created it) that will store this generated dataset. We don’t want to write generated datasets in the same directory as our raw data. It’s good practice to keep them separate. The data folder should only contain the raw, unaltered data, and should be left alone to make sure we don’t delete or modify it. In contrast, our script will generate the contents of the data_output directory, so even if the files it contains are deleted, we can always re-generate them.

Let’s use write_csv() to save the rna_wide table that we have created previously.

write_csv(rna_wide, file = "data_output/rna_wide.csv")

Key Points

  • Tabular data in R using the tidyverse meta-package


Data visualization

Overview

Teaching: XX min
Exercises: XX min
Questions
  • Visualization in R

Objectives
  • Produce scatter plots, boxplots, line plots, etc. using ggplot.

  • Set universal plot settings.

  • Describe what faceting is and apply faceting in ggplot.

  • Modify the aesthetics of an existing ggplot plot (including axis labels and color).

  • Build complex and customized plots from data in a data frame.

Data Visualization

We start by loading the required packages. ggplot2 is included in the tidyverse package.

library("tidyverse")

If not still in the workspace, load the data we saved in the previous lesson.

rna <- read.csv("course-data/data/GSE96870/rnaseq.csv")

The Data Visualization Cheat Sheet will cover the basics and more advanced features of ggplot2 and will help, in addition to serve as a reminder, getting an overview of the many data representations available in the package. The following video tutorials (part 1 and 2) by Thomas Lin Pedersen are also very instructive.

Plotting with ggplot2

ggplot2 is a plotting package that makes it simple to create complex plots from data in a data frame. It provides a more programmatic interface for specifying what variables to plot, how they are displayed, and general visual properties. The theoretical foundation that supports the ggplot2 is the Grammar of Graphics (@Wilkinson:2005). Using this approach, we only need minimal changes if the underlying data change or if we decide to change from a bar plot to a scatterplot. This helps in creating publication quality plots with minimal amounts of adjustments and tweaking.

There is a book about ggplot2 (@ggplot2book) that provides a good overview, but it is outdated. The 3rd edition is in preparation and will be freely available online. The ggplot2 webpage (https://ggplot2.tidyverse.org) provides ample documentation.

ggplot2 functions like data in the ‘long’ format, i.e., a column for every dimension, and a row for every observation. Well-structured data will save you lots of time when making figures with ggplot2.

ggplot graphics are built step by step by adding new elements. Adding layers in this fashion allows for extensive flexibility and customization of plots.

The idea behind the Grammar of Graphics it is that you can build every graph from the same 3 components: (1) a data set, (2) a coordinate system, and (3) geoms — i.e. visual marks that represent data points 1.

To build a ggplot, we will use the following basic template that can be used for different types of plots:

ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) +  <GEOM_FUNCTION>()
ggplot(data = rna)
ggplot(data = rna, mapping = aes(x = expression))

To add a geom(etry) to the plot use the + operator. Let’s use geom_histogram() first:

ggplot(data = rna, mapping = aes(x = expression)) +
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

plot of chunk first-ggplot

The + in the ggplot2 package is particularly useful because it allows you to modify existing ggplot objects. This means you can easily set up plot templates and conveniently explore different types of plots, so the above plot can also be generated with code like this:

# Assign plot to a variable
rna_plot <- ggplot(data = rna,
                   mapping = aes(x = expression))

# Draw the plot
rna_plot + geom_histogram()

Homework Challenge

You have probably noticed an automatic message that appears when drawing the histogram:

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Change the arguments bins or binwidth of geom_histogram() to change the number or width of the bins.

Solution

# change bins
ggplot(rna, aes(x = expression)) +
    geom_histogram(bins = 15)

plot of chunk unnamed-chunk-5

# change binwidth
ggplot(rna, aes(x = expression)) +
    geom_histogram(binwidth = 2000)

plot of chunk unnamed-chunk-5

We can observe here that the data are skewed to the right. We can apply log2 transformation to have a more symmetric distribution. Note that we add here a small constant value (+1) to avoid having -Inf values returned for expression values equal to 0.

rna <- rna %>%
  mutate(expression_log = log2(expression + 1))

If we now draw the histogram of the log2-transformed expressions, the distribution is indeed closer to a normal distribution.

ggplot(rna, aes(x = expression_log)) + geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

plot of chunk second-ggplot

From now on we will work on the log-transformed expression values.

Homework Challenge

Another way to visualize this transformation is to consider the scale of the observations. For example, it may be worth changing the scale of the axis to better distribute the observations in the space of the plot. Changing the scale of the axes is done similarly to adding/modifying other components (i.e., by incrementally adding commands). Try making this modification:

  • Represent the un-transformed expression on the log10 scale; see scale_x_log10(). Compare it with the previous graph. Why do you now have warning messages appearing?

Solution

ggplot(data = rna,mapping = aes(x = expression))+
  geom_histogram() +
  scale_x_log10()
Warning: Transformation introduced infinite values in continuous x-axis
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 507 rows containing non-finite values (stat_bin).

plot of chunk unnamed-chunk-6

Notes

# This is the correct syntax for adding layers
rna_plot +
  geom_histogram()

# This will not add the new layer and will return an error message
rna_plot
  + geom_histogram()

Building your plots iteratively

We will now draw a scatter plot with two continuous variables and the geom_point() function. This graph will represent the log2 fold changes of expression comparing time 8 versus time 0 and time 4 versus time 0. To this end, we first need to compute the means of the log-transformed expression values by gene and time then the log fold changes by subtracting the mean log expressions between time 8 and time 0 and between time 4 and time 0. Note that we also include here the gene biotype that we will use later on to represent the genes. We will save the fold changes in a new data frame called rna_fc.

rna_fc <- rna %>% select(gene, time,
                         gene_biotype, expression_log) %>%
  group_by(gene, time, gene_biotype) %>%
  summarize(mean_exp = mean(expression_log)) %>%
  pivot_wider(names_from = time,
              values_from = mean_exp) %>%
  mutate(time_8_vs_0 = `8` - `0`, time_4_vs_0 = `4` - `0`)
`summarise()` has grouped output by 'gene', 'time'. You can override using the
`.groups` argument.

We can then build a ggplot with the newly created dataset rna_fc. Building plots with ggplot2 is typically an iterative process. We start by defining the dataset we’ll use, lay out the axes, and choose a geom:

ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0)) +
  geom_point()

plot of chunk create-ggplot-object

Then, we start modifying this plot to extract more information from it. For instance, we can add transparency (alpha) to avoid overplotting:

ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0)) +
  geom_point(alpha = 0.3)

plot of chunk adding-transparency

We can also add colors for all the points:

ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0)) +
  geom_point(alpha = 0.3, color = "blue")

plot of chunk adding-colors

Or to color each gene in the plot differently, you could use a vector as an input to the argument color. ggplot2 will provide a different color corresponding to different values in the vector. Here is an example where we color with gene_biotype:

ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0)) +
  geom_point(alpha = 0.3, aes(color = gene_biotype))

plot of chunk color-by-gene_biotype1

We can also specify the colors directly inside the mapping provided in the ggplot() function. This will be seen by any geom layers and the mapping will be determined by the x- and y-axis set up in aes().

ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0,
                                color = gene_biotype)) +
  geom_point(alpha = 0.3)

plot of chunk color-by-gene_biotype2

Finally, we could also add a diagonal line with the geom_abline() function:

ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0,
                                color = gene_biotype)) +
  geom_point(alpha = 0.3) +
  geom_abline(intercept = 0)

plot of chunk adding-diag

Notice that we can change the geom layer from geom_point to geom_jitter and colors will be still determined by gene_biotype.

ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0,
                                color = gene_biotype)) +
  geom_jitter(alpha = 0.3) +
  geom_abline(intercept = 0)

plot of chunk color-by-gene_biotype3

Challenge

Use what you just learned to create a scatter plot of expression_log over sample from the rna dataset with the time showing in different colors. Is this a good way to show this type of data?

Solution

ggplot(data = rna, mapping = aes(y = expression_log, x = sample)) +
    geom_point(aes(color = time))

plot of chunk unnamed-chunk-7

Boxplot

We can use boxplots to visualize the distribution of gene expressions within each sample:

ggplot(data = rna,
         mapping = aes(y = expression_log, x = sample)) +
  geom_boxplot()

plot of chunk boxplot

By adding points to boxplot, we can have a better idea of the number of measurements and of their distribution:

ggplot(data = rna,
         mapping = aes(y = expression_log, x = sample)) +
  geom_jitter(alpha = 0.2, color = "tomato") +
  geom_boxplot(alpha = 0)

plot of chunk boxplot-with-points

Challenge

Note how the boxplot layer is in front of the jitter layer? What do you need to change in the code to put the boxplot below the points?

Solution

We should switch the order of these two geoms:

ggplot(data = rna,
         mapping = aes(y = expression_log, x = sample)) +
  geom_boxplot(alpha = 0) +
  geom_jitter(alpha = 0.2, color = "tomato")

plot of chunk boxplot-with-points2

You may notice that the values on the x-axis are still not properly readable. Let’s change the orientation of the labels and adjust them vertically and horizontally so they don’t overlap. You can use a 90-degree angle, or experiment to find the appropriate angle for diagonally oriented labels:

ggplot(data = rna,
         mapping = aes(y = expression_log, x = sample)) +
  geom_jitter(alpha = 0.2, color = "tomato") +
  geom_boxplot(alpha = 0) +
  theme(axis.text.x = element_text(angle = 90,  hjust = 0.5, vjust = 0.5))

plot of chunk boxplot-xaxis-rotated

Challenge

Add color to the data points on your boxplot according to the duration of the infection (time).

Hint: Check the class for time. Consider changing the class of time from integer to factor directly in the ggplot mapping. Why does this change how R makes the graph?

Solution

# time as integer
ggplot(data = rna,
         mapping = aes(y = expression_log,
                       x = sample)) +
  geom_jitter(alpha = 0.2, aes(color = time)) +
  geom_boxplot(alpha = 0) +
  theme(axis.text.x = element_text(angle = 90,  hjust = 0.5, vjust = 0.5))

plot of chunk boxplot-color-time

# time as factor
ggplot(data = rna,
         mapping = aes(y = expression_log,
                       x = sample)) +
  geom_jitter(alpha = 0.2, aes(color = as.factor(time))) +
  geom_boxplot(alpha = 0) +
  theme(axis.text.x = element_text(angle = 90,  hjust = 0.5, vjust = 0.5))

plot of chunk boxplot-color-time

Homework Challenge

Boxplots are useful summaries, but hide the shape of the distribution. For example, if the distribution is bimodal, we would not see it in a boxplot. An alternative to the boxplot is the violin plot, where the shape (of the density of points) is drawn.

  1. Replace the box plot with a violin plot; see geom_violin().
  2. Fill in the violins according to the time with the argument fill.
  3. Modify the violin plot to fill in the violins by sex.

Solution

ggplot(data = rna,
         mapping = aes(y = expression_log, x = sample)) +
  geom_violin(aes(fill = as.factor(time))) +
  theme(axis.text.x = element_text(angle = 90,  hjust = 0.5, vjust = 0.5))

plot of chunk unnamed-chunk-8

ggplot(data = rna,
         mapping = aes(y = expression_log, x = sample)) +
  geom_violin(aes(fill = sex)) +
  theme(axis.text.x = element_text(angle = 90,  hjust = 0.5, vjust = 0.5))

plot of chunk unnamed-chunk-9

Line plots

Let’s calculate the mean expression per duration of the infection for the 10 genes having the highest log fold changes comparing time 8 versus time 0. First, we need to select the genes and create a subset of rna called sub_rna containing the 10 selected genes, then we need to group the data and calculate the mean gene expression within each group:

rna_fc <- rna_fc %>% arrange(desc(time_8_vs_0))

genes_selected <- rna_fc$gene[1:10]

sub_rna <- rna %>%
    filter(gene %in% genes_selected)

mean_exp_by_time <- sub_rna %>%
  group_by(gene,time) %>%
  summarize(mean_exp = mean(expression_log))
`summarise()` has grouped output by 'gene'. You can override using the
`.groups` argument.

We can build the line plot with duration of the infection on the x-axis and the mean expression on the y-axis:

ggplot(data = mean_exp_by_time, mapping = aes(x = time, y = mean_exp)) +
  geom_line()

plot of chunk first-time-series

Unfortunately, this does not work because we plotted data for all the genes together. We need to tell ggplot to draw a line for each gene by modifying the aesthetic function to include group = gene:

ggplot(data = mean_exp_by_time,
       mapping = aes(x = time, y = mean_exp, group = gene)) +
  geom_line()

plot of chunk time-series-by-gene

We will be able to distinguish genes in the plot if we add colors (using color also automatically groups the data):

ggplot(data = mean_exp_by_time,
       mapping = aes(x = time, y = mean_exp, color = gene)) +
  geom_line()

plot of chunk time-series-with-colors

Faceting

ggplot2 has a special technique called faceting that allows the user to split one plot into multiple (sub) plots based on a factor included in the dataset. These different subplots inherit the same properties (axes limits, ticks, …) to facilitate their direct comparison. We will use it to make a line plot across time for each gene:

ggplot(data = mean_exp_by_time,
       mapping = aes(x = time, y = mean_exp)) + geom_line() +
  facet_wrap(~ gene)

plot of chunk first-facet Here both x- and y-axis have the same scale for all the sub plots. You can change this default behavior by modifying scales in order to allow a free scale for the y-axis:

ggplot(data = mean_exp_by_time,
       mapping = aes(x = time, y = mean_exp)) +
  geom_line() +
  facet_wrap(~ gene, scales = "free_y")

plot of chunk first-facet-scales

Now we would like to split the line in each plot by the sex of the mice. To do that we need to calculate the mean expression in the data frame grouped by gene, time, and sex:

mean_exp_by_time_sex <- sub_rna %>%
  group_by(gene, time, sex) %>%
  summarize(mean_exp = mean(expression_log))
`summarise()` has grouped output by 'gene', 'time'. You can override using the
`.groups` argument.

We can now make the faceted plot by splitting further by sex using color (within a single plot):

ggplot(data = mean_exp_by_time_sex,
       mapping = aes(x = time, y = mean_exp, color = sex)) +
  geom_line() +
  facet_wrap(~ gene, scales = "free_y")

plot of chunk facet-by-gene-and-sex

Usually plots with white background look more readable when printed. We can set the background to white using the function theme_bw(). Additionally, we can remove the grid:

ggplot(data = mean_exp_by_time_sex,
       mapping = aes(x = time, y = mean_exp, color = sex)) +
  geom_line() +
  facet_wrap(~ gene, scales = "free_y") +
  theme_bw() +
  theme(panel.grid = element_blank())

plot of chunk facet-by-gene-and-sex-white-bg

Homework Challenge

Use what you just learned to create a plot that depicts how the average expression of each chromosome changes through the duration of infection.

Solution

mean_exp_by_chromosome <- rna %>%
  group_by(chromosome_name, time) %>%
  summarize(mean_exp = mean(expression_log))
`summarise()` has grouped output by 'chromosome_name'. You can override using
the `.groups` argument.
ggplot(data = mean_exp_by_chromosome, mapping = aes(x = time,
                                y = mean_exp)) +
  geom_line() +
  facet_wrap(~ chromosome_name, scales = "free_y")

plot of chunk mean-exp-chromosome-time-series

The facet_wrap geometry extracts plots into an arbitrary number of dimensions to allow them to cleanly fit on one page. On the other hand, the facet_grid geometry allows you to explicitly specify how you want your plots to be arranged via formula notation (rows ~ columns; a . can be used as a placeholder that indicates only one row or column).

Let’s modify the previous plot to compare how the mean gene expression of males and females has changed through time:

# One column, facet by rows
ggplot(data = mean_exp_by_time_sex,
       mapping = aes(x = time, y = mean_exp, color = gene)) +
  geom_line() +
  facet_grid(sex ~ .)

plot of chunk mean-exp-time-facet-sex-rows

# One row, facet by column
ggplot(data = mean_exp_by_time_sex,
       mapping = aes(x = time, y = mean_exp, color = gene)) +
  geom_line() +
  facet_grid(. ~ sex)

plot of chunk mean-exp-time-facet-sex-columns

ggplot2 themes

In addition to theme_bw(), which changes the plot background to white, ggplot2 comes with several other themes which can be useful to quickly change the look of your visualization. The complete list of themes is available at http://docs.ggplot2.org/current/ggtheme.html. theme_minimal() and theme_light() are popular, and theme_void() can be useful as a starting point to create a new hand-crafted theme.

The ggthemes package provides a wide variety of options (including an Excel 2003 theme). The ggplot2 extensions website provides a list of packages that extend the capabilities of ggplot2, including additional themes.

Customisation

Let’s come back to the faceted plot of mean expression by time and gene, colored by sex.

Take a look at the ggplot2 cheat sheet, and think of ways you could improve the plot.

Now, we can change names of axes to something more informative than ‘time’ and ‘mean_exp’, and add a title to the figure:

ggplot(data = mean_exp_by_time_sex,
       mapping = aes(x = time, y = mean_exp, color = sex)) +
  geom_line() +
  facet_wrap(~ gene, scales = "free_y") +
  theme_bw() +
  theme(panel.grid = element_blank()) +
  labs(title = "Mean gene expression by duration of the infection",
       x = "Duration of the infection (in days)",
       y = "Mean gene expression")

plot of chunk mean_exp-time-with-right-labels

The axes have more informative names, but their readability can be improved by increasing the font size:

ggplot(data = mean_exp_by_time_sex,
       mapping = aes(x = time, y = mean_exp, color = sex)) +
  geom_line() +
  facet_wrap(~ gene, scales = "free_y") +
  theme_bw() +
  theme(panel.grid = element_blank()) +
  labs(title = "Mean gene expression by duration of the infection",
       x = "Duration of the infection (in days)",
       y = "Mean gene expression")  +
  theme(text = element_text(size = 16))

plot of chunk mean_exp-time-with-right-labels-xfont-size

Note that it is also possible to change the fonts of your plots. If you are on Windows, you may have to install the extrafont package, and follow the instructions included in the README for this package.

We can further customize the color of x- and y-axis text, the color of the grid, etc. We can also for example move the legend to the top by setting legend.position to "top".

ggplot(data = mean_exp_by_time_sex,
       mapping = aes(x = time, y = mean_exp, color = sex)) +
  geom_line() +
  facet_wrap(~ gene, scales = "free_y") +
  theme_bw() +
  theme(panel.grid = element_blank()) +
  labs(title = "Mean gene expression by duration of the infection",
       x = "Duration of the infection (in days)",
       y = "Mean gene expression")  +
  theme(text = element_text(size = 16),
        axis.text.x = element_text(colour = "royalblue4", size = 12),
        axis.text.y = element_text(colour = "royalblue4", size = 12),
        panel.grid = element_line(colour="lightsteelblue1"),
        legend.position = "top")

plot of chunk mean_exp-time-with-theme

If you like the changes you created better than the default theme, you can save them as an object to be able to easily apply them to other plots you may create. Here is an example with the histogram we have previously created.

blue_theme <- theme(axis.text.x = element_text(colour = "royalblue4",
                                               size = 12),
                    axis.text.y = element_text(colour = "royalblue4",
                                               size = 12),
                    text = element_text(size = 16),
                    panel.grid = element_line(colour="lightsteelblue1"))

ggplot(rna, aes(x = expression_log)) +
  geom_histogram(bins = 20) +
    blue_theme

plot of chunk mean_exp-time-with-right-labels-xfont

Challenge

With all of this information in hand, try to create the most ugly plot that you can think about. You can use one of the plots generated in this exercise as a starting point. Use the RStudio ggplot2 cheatsheet for inspiration. Here are some ideas:

  • See if you can change the thickness of the lines.
  • Can you find a way to change the name of the legend? What about its labels? (hint: look for a ggplot function starting with scale_)
  • Try using a different color palette or manually specifying the colors for the lines (see http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/).

Composing plots

Faceting is a great tool for splitting one plot into multiple subplots, but sometimes you may want to produce a single figure that contains multiple independent plots, i.e. plots that are based on different variables or even different data frames.

Let’s start by creating the two plots that we want to arrange next to each other:

The first graph counts the number of unique genes per chromosome. We first need to reorder the levels of chromosome_name and filter the unique genes per chromosome. We also change the scale of the y-axis to a log10 scale for better readability.

rna$chromosome_name <- factor(rna$chromosome_name,
                               levels = c(1:19,"X","Y"))

count_gene_chromosome <- rna %>% select(chromosome_name, gene) %>%
  distinct() %>% ggplot() +
  geom_bar(aes(x = chromosome_name), fill = "seagreen",
           position = "dodge", stat = "count") +
  labs(y = "log10(n genes)", x = "chromosome") +
  scale_y_log10()

count_gene_chromosome

plot of chunk sub1

Below, we also remove the legend altogether by setting the legend.position to "none".

exp_boxplot_sex <- ggplot(rna, aes(y=expression_log, x = as.factor(time),
                 color=sex)) +
   geom_boxplot(alpha = 0) +
  labs(y = "Mean gene exp",
       x = "time") + theme(legend.position = "none")

exp_boxplot_sex

plot of chunk sub2

The patchwork package provides an elegant approach to combining figures using the + to arrange figures (typically side by side). More specifically the | explicitly arranges them side by side and / stacks them on top of each other.

install.packages("patchwork")
library("patchwork")
count_gene_chromosome + exp_boxplot_sex

plot of chunk unnamed-chunk-11

## or count_gene_chromosome | exp_boxplot_sex
count_gene_chromosome / exp_boxplot_sex

plot of chunk unnamed-chunk-12

We can combine further control the layout of the final composition with plot_layout to create more complex layouts:

count_gene_chromosome + exp_boxplot_sex + plot_layout(ncol = 1)

plot of chunk unnamed-chunk-13

count_gene_chromosome +
 (count_gene_chromosome + exp_boxplot_sex) +
 exp_boxplot_sex +
 plot_layout(ncol = 1)

plot of chunk unnamed-chunk-14

The last plot can also be created using the | and / composers:

count_gene_chromosome /
 (count_gene_chromosome | exp_boxplot_sex) /
 exp_boxplot_sex

plot of chunk unnamed-chunk-15

Learn more about patchwork on its webpage or in this video.

Another option is the gridExtra package that allows to combine separate ggplots into a single figure using grid.arrange():

install.packages("gridExtra")
library(gridExtra)
grid.arrange(count_gene_chromosome, exp_boxplot_sex, ncol = 2)

plot of chunk gridarrange-example

In addition to the ncol and nrow arguments, used to make simple arrangements, there are tools for constructing more complex layouts.

Exporting plots

After creating your plot, you can save it to a file in your favorite format. The Export tab in the Plot pane in RStudio will save your plots at low resolution, which will not be accepted by many journals and will not scale well for posters.

Instead, use the ggsave() function, which allows you easily change the dimension and resolution of your plot by adjusting the appropriate arguments (width, height and dpi).

Make sure you have the fig_output/ folder in your working directory.

my_plot <- ggplot(data = mean_exp_by_time_sex,
       mapping = aes(x = time, y = mean_exp, color = sex)) +
  geom_line() +
  facet_wrap(~ gene, scales = "free_y") +
  labs(title = "Mean gene expression by duration of the infection",
         x = "Duration of the infection (in days)",
         y = "Mean gene expression") +
  guides(color=guide_legend(title="Gender")) +
  theme_bw() +
  theme(axis.text.x = element_text(colour = "royalblue4", size = 12),
        axis.text.y = element_text(colour = "royalblue4", size = 12),
        text = element_text(size = 16),
        panel.grid = element_line(colour="lightsteelblue1"),
        legend.position = "top")
ggsave("fig_output/mean_exp_by_time_sex.png", my_plot, width = 15,
       height = 10)

# This also works for grid.arrange() plots
combo_plot <- grid.arrange(count_gene_chromosome, exp_boxplot_sex,
                           ncol = 2, widths = c(4, 6))
ggsave("fig_output/combo_plot_chromosome_sex.png", combo_plot,
       width = 10, dpi = 300)

Note: The parameters width and height also determine the font size in the saved plot.

Other packages for visualisation

ggplot2 is a very powerful package that fits very nicely in our tidy data and tidy tools pipeline. There are other visualization packages in R that shouldn’t be ignored.

Base graphics {-}

The default graphics system that comes with R, often called base R graphics is simple and fast. It is based on the painter’s or canvas model, where different output are directly overlaid on top of each other (see figure \@ref(fig:paintermodel)). This is a fundamental difference with ggplot2 (and with lattice, described below), that returns dedicated objects, that are rendered on screen or in a file, and that can even be updated.

par(mfrow = c(1, 3))
plot(1:20, main = "First layer, produced with plot(1:20)")

plot(1:20, main = "A horizontal red line, added with abline(h = 10)")
abline(h = 10, col = "red")

plot(1:20, main = "A rectangle, added with rect(5, 5, 15, 15)")
abline(h = 10, col = "red")
rect(5, 5, 15, 15, lwd = 3)

Successive layers added on top of each other.

Another main difference is that base graphics’ plotting function try to do the right thing based on their input type, i.e. they will adapt their behaviour based on the class of their input. This is again very different from what we have in ggplot2, that only accepts dataframes as input, and that requires plots to be constructed bit by bit.

par(mfrow = c(2, 2))
boxplot(rnorm(100),
        main = "Boxplot of rnorm(100)")
boxplot(matrix(rnorm(100), ncol = 10),
        main = "Boxplot of matrix(rnorm(100), ncol = 10)")
hist(rnorm(100))
hist(matrix(rnorm(100), ncol = 10))

Plotting boxplots (top) and histograms (bottom) vectors (left) or a matrices (right).

The out-of-the-box approach in base graphics can be very efficient for simple, standard figures, that can be produced very quickly with a single line of code and a single function such as plot, or hist, or boxplot, … The defaults are however not always the most appealing and tuning of figures, especially when they become more complex (for example to produce facets), can become lengthy and cumbersome.

The lattice package {-}

The lattice package is similar to ggplot2 in that is uses dataframes as input, returns graphical objects and supports faceting. lattice however isn’t based on the grammar of graphics and has a more convoluted interface.

A good reference for the lattice package is @latticebook.

Key Points

  • Visualization in R


Joining tables

Overview

Teaching: XX min
Exercises: XX min
Questions
  • Join tables in R

Objectives
  • Understand the need and concept of table joins

  • Understand the different types of joins

  • Understand the importance of keys in joins and the implications of using non-unique keys

Joining tables

In many real life situations, data are spread across multiple tables. Usually this occurs because different types of information about a subject, e.g. a patient, are collected from different sources.

It may be desirable for some analyses to combine data from two or more tables into a single data frame based on a column that would be common to all the tables, for example, an attribute that uniquely identifies the subjects.

The dplyr package provides a set of join functions for combining two data frames based on matches within specified columns.

For further reading, please refer to the chapter about table joins in R for Data Science.

The Data Transformation Cheat Sheet also provides a short overview on table joins.

Combining tables

We are going to illustrate join using a common example from the bioinformatics world, where annotations about genes are scattered in different tables that have one or more shared columns.

The data we are going to use are available in the following package.

if (!require("rWSBIM1207"))
    BiocManager::install("UCLouvain-CBIO/rWSBIM1207")
library("rWSBIM1207")
data(jdf)

The data is composed of several tables.

The first table, jdf1, contains protein UniProt1 unique accession number (uniprot variable), the most likely sub-cellular localisation of these respective proteins (organelle variable) as well as the proteins identifier (entry).

jdf1
# A tibble: 25 × 3
   uniprot  organelle                             entry      
   <chr>    <chr>                                 <chr>      
 1 P26039   Actin cytoskeleton                    TLN1_MOUSE 
 2 Q99PL5   Endoplasmic reticulum/Golgi apparatus RRBP1_MOUSE
 3 Q6PB66   Mitochondrion                         LPPRC_MOUSE
 4 P11276   Extracellular matrix                  FINC_MOUSE 
 5 Q6PR54   Nucleus - Chromatin                   RIF1_MOUSE 
 6 Q05793   Extracellular matrix                  PGBM_MOUSE 
 7 P19096   Cytosol                               FAS_MOUSE  
 8 Q9JKF1   Plasma membrane                       IQGA1_MOUSE
 9 Q9QZQ1-2 Plasma membrane                       AFAD_MOUSE 
10 Q6NS46   Nucleus - Non-chromatin               RRP5_MOUSE 
# … with 15 more rows

The second table, jdf2, contains the name of the gene that codes for the protein (gene_name variable), a description of the gene (description variable), the uniprot accession number (this is the common variable that can be used to join tables) and the species the protein information comes from (organism variable).

jdf2
# A tibble: 25 × 4
   gene_name description                                      uniprot organism
   <chr>     <chr>                                            <chr>   <chr>   
 1 Iqgap1    Ras GTPase-activating-like protein IQGAP1        Q9JKF1  Mmus    
 2 Hspa5     78 kDa glucose-regulated protein                 P20029  Mmus    
 3 Pdcd11    Protein RRP5 homolog                             Q6NS46  Mmus    
 4 Tfrc      Transferrin receptor protein 1                   Q62351  Mmus    
 5 Hspd1     60 kDa heat shock protein, mitochondrial         P63038  Mmus    
 6 Tln1      Talin-1                                          P26039  Mmus    
 7 Smc1a     Structural maintenance of chromosomes protein 1A Q9CU62  Mmus    
 8 Lamc1     Laminin subunit gamma-1                          P02468  Mmus    
 9 Hsp90b1   Endoplasmin                                      P08113  Mmus    
10 Mia3      Melanoma inhibitory activity protein 3           Q8BI84  Mmus    
# … with 15 more rows

We now want to join these two tables into a single one containing all variables.

We are going to use the full_join function of dplyr to do so,

Th function will automatically find the common variable (in this case uniprot) to match observations from the first and second table.

library("dplyr")

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
full_join(jdf1, jdf2)
Joining, by = "uniprot"
# A tibble: 25 × 6
   uniprot  organelle                       entry gene_name description organism
   <chr>    <chr>                           <chr> <chr>     <chr>       <chr>   
 1 P26039   Actin cytoskeleton              TLN1… Tln1      Talin-1     Mmus    
 2 Q99PL5   Endoplasmic reticulum/Golgi ap… RRBP… Rrbp1     Ribosome-b… Mmus    
 3 Q6PB66   Mitochondrion                   LPPR… Lrpprc    Leucine-ri… Mmus    
 4 P11276   Extracellular matrix            FINC… Fn1       Fibronectin Mmus    
 5 Q6PR54   Nucleus - Chromatin             RIF1… Rif1      Telomere-a… Mmus    
 6 Q05793   Extracellular matrix            PGBM… Hspg2     Basement m… Mmus    
 7 P19096   Cytosol                         FAS_… Fasn      Fatty acid… Mmus    
 8 Q9JKF1   Plasma membrane                 IQGA… Iqgap1    Ras GTPase… Mmus    
 9 Q9QZQ1-2 Plasma membrane                 AFAD… Mllt4     Isoform 1 … Mmus    
10 Q6NS46   Nucleus - Non-chromatin         RRP5… Pdcd11    Protein RR… Mmus    
# … with 15 more rows

In these examples, each observation of the jdf1 and jdf2 tables are uniquely identified by their UniProt accession number. Such variables are called keys. Keys are used to match observations across different tables.

Now let’s look at a third table, jdf3. It also contains the column uniProt, but it is written differently!

jdf3
# A tibble: 25 × 4
   gene_name description                                      UniProt organism
   <chr>     <chr>                                            <chr>   <chr>   
 1 Iqgap1    Ras GTPase-activating-like protein IQGAP1        Q9JKF1  Mmus    
 2 Hspa5     78 kDa glucose-regulated protein                 P20029  Mmus    
 3 Pdcd11    Protein RRP5 homolog                             Q6NS46  Mmus    
 4 Tfrc      Transferrin receptor protein 1                   Q62351  Mmus    
 5 Hspd1     60 kDa heat shock protein, mitochondrial         P63038  Mmus    
 6 Tln1      Talin-1                                          P26039  Mmus    
 7 Smc1a     Structural maintenance of chromosomes protein 1A Q9CU62  Mmus    
 8 Lamc1     Laminin subunit gamma-1                          P02468  Mmus    
 9 Hsp90b1   Endoplasmin                                      P08113  Mmus    
10 Mia3      Melanoma inhibitory activity protein 3           Q8BI84  Mmus    
# … with 15 more rows

In case none of the variable names match, we can set manually the variables to use for the matching. These variables can be set using the by argument, as shown below with the jdf1 (as above) and jdf3 tables, where the UniProt accession number is encoded using a different capitalisation.

names(jdf3)
[1] "gene_name"   "description" "UniProt"     "organism"   
full_join(jdf1, jdf3, by = c("uniprot" = "UniProt"))
# A tibble: 25 × 6
   uniprot  organelle                       entry gene_name description organism
   <chr>    <chr>                           <chr> <chr>     <chr>       <chr>   
 1 P26039   Actin cytoskeleton              TLN1… Tln1      Talin-1     Mmus    
 2 Q99PL5   Endoplasmic reticulum/Golgi ap… RRBP… Rrbp1     Ribosome-b… Mmus    
 3 Q6PB66   Mitochondrion                   LPPR… Lrpprc    Leucine-ri… Mmus    
 4 P11276   Extracellular matrix            FINC… Fn1       Fibronectin Mmus    
 5 Q6PR54   Nucleus - Chromatin             RIF1… Rif1      Telomere-a… Mmus    
 6 Q05793   Extracellular matrix            PGBM… Hspg2     Basement m… Mmus    
 7 P19096   Cytosol                         FAS_… Fasn      Fatty acid… Mmus    
 8 Q9JKF1   Plasma membrane                 IQGA… Iqgap1    Ras GTPase… Mmus    
 9 Q9QZQ1-2 Plasma membrane                 AFAD… Mllt4     Isoform 1 … Mmus    
10 Q6NS46   Nucleus - Non-chromatin         RRP5… Pdcd11    Protein RR… Mmus    
# … with 15 more rows

As can be seen above, the variable name of the first table is retained in the joined one.

Challenge

Using the full_join function, join tables jdf4 and jdf5. What has happened for observations P26039 and P02468?

Solution

full_join(jdf4, jdf5)
Joining, by = "uniprot"
# A tibble: 14 × 6
   uniprot  organelle                       entry gene_name description organism
   <chr>    <chr>                           <chr> <chr>     <chr>       <chr>   
 1 P26039   Actin cytoskeleton              TLN1… <NA>      <NA>        <NA>    
 2 Q99PL5   Endoplasmic reticulum/Golgi ap… RRBP… <NA>      <NA>        <NA>    
 3 Q6PB66   Mitochondrion                   LPPR… <NA>      <NA>        <NA>    
 4 P11276   Extracellular matrix            FINC… <NA>      <NA>        <NA>    
 5 Q6PR54   Nucleus - Chromatin             RIF1… <NA>      <NA>        <NA>    
 6 Q05793   Extracellular matrix            PGBM… <NA>      <NA>        <NA>    
 7 P19096   Cytosol                         FAS_… Fasn      Fatty acid… Mmus    
 8 Q9JKF1   Plasma membrane                 IQGA… <NA>      <NA>        <NA>    
 9 Q9QZQ1-2 Plasma membrane                 AFAD… <NA>      <NA>        <NA>    
10 Q6NS46   Nucleus - Non-chromatin         RRP5… <NA>      <NA>        <NA>    
11 P02468   <NA>                            <NA>  Lamc1     Laminin su… Mmus    
12 P08113   <NA>                            <NA>  Hsp90b1   Endoplasmin Mmus    
13 Q8BI84   <NA>                            <NA>  Mia3      Melanoma i… Mmus    
14 Q6P5D8   <NA>                            <NA>  Smchd1    Structural… Mmus    

P26039 and P02468 are only present in jdf4 and jdf5 respectively, and their respective values for the variables of the table have been encoded as missing.

Different types of joins

Above, we have used the full_join function, that fully joins two tables and keeps all observations, adding missing values if necessary. Sometimes, we want to be selective, and keep observations that are present in only one or both tables.

An inner join matches pairs of observation matching in both tables, this dropping those that are unique to one table. Figure taken from *R for Data Science*.

Outer joins match observations that appear in at least on table, filling up missing values with `NA` values. Figure taken from *R for Data Science*.

Challenge

Join tables jdf4 and jdf5, keeping only observations in jdf4.

Solution

left_join(jdf4, jdf5)
Joining, by = "uniprot"
# A tibble: 10 × 6
   uniprot  organelle                       entry gene_name description organism
   <chr>    <chr>                           <chr> <chr>     <chr>       <chr>   
 1 P26039   Actin cytoskeleton              TLN1… <NA>      <NA>        <NA>    
 2 Q99PL5   Endoplasmic reticulum/Golgi ap… RRBP… <NA>      <NA>        <NA>    
 3 Q6PB66   Mitochondrion                   LPPR… <NA>      <NA>        <NA>    
 4 P11276   Extracellular matrix            FINC… <NA>      <NA>        <NA>    
 5 Q6PR54   Nucleus - Chromatin             RIF1… <NA>      <NA>        <NA>    
 6 Q05793   Extracellular matrix            PGBM… <NA>      <NA>        <NA>    
 7 P19096   Cytosol                         FAS_… Fasn      Fatty acid… Mmus    
 8 Q9JKF1   Plasma membrane                 IQGA… <NA>      <NA>        <NA>    
 9 Q9QZQ1-2 Plasma membrane                 AFAD… <NA>      <NA>        <NA>    
10 Q6NS46   Nucleus - Non-chromatin         RRP5… <NA>      <NA>        <NA>    

Challenge

Join tables jdf4 and jdf5, keeping only observations in jdf5.

Solution

right_join(jdf4, jdf5)
Joining, by = "uniprot"
# A tibble: 5 × 6
  uniprot organelle entry     gene_name description                     organism
  <chr>   <chr>     <chr>     <chr>     <chr>                           <chr>   
1 P19096  Cytosol   FAS_MOUSE Fasn      Fatty acid synthase             Mmus    
2 P02468  <NA>      <NA>      Lamc1     Laminin subunit gamma-1         Mmus    
3 P08113  <NA>      <NA>      Hsp90b1   Endoplasmin                     Mmus    
4 Q8BI84  <NA>      <NA>      Mia3      Melanoma inhibitory activity p… Mmus    
5 Q6P5D8  <NA>      <NA>      Smchd1    Structural maintenance of chro… Mmus    

Challenge

Join tables jdf4 and jdf5, keeping observations observed in both tables.

Solution

inner_join(jdf4, jdf5)
Joining, by = "uniprot"
# A tibble: 1 × 6
  uniprot organelle entry     gene_name description         organism
  <chr>   <chr>     <chr>     <chr>     <chr>               <chr>   
1 P19096  Cytosol   FAS_MOUSE Fasn      Fatty acid synthase Mmus    

Multiple matches

Sometimes, keys aren’t unique.

In the jdf6 table below, we see that the accession number Q99PL5 is repeated twice. According to this table, the ribosomial protein binding protein 1 localises in the endoplasmic reticulum and in the Golgi apparatus.

jdf6
# A tibble: 5 × 4
  uniprot organelle             entry       isoform
  <chr>   <chr>                 <chr>         <dbl>
1 P26039  Actin cytoskeleton    TLN1_MOUSE        1
2 Q99PL5  Endoplasmic reticulum RRBP1_MOUSE       1
3 Q99PL5  Golgi apparatus       RRBP1_MOUSE       2
4 Q6PB66  Mitochondrion         LPPRC_MOUSE       1
5 P11276  Extracellular matrix  FINC_MOUSE        1

If we now want to join jdf6 and jdf2, the variables of the latter will be duplicated.

inner_join(jdf6, jdf2)
Joining, by = "uniprot"
# A tibble: 5 × 7
  uniprot organelle             entry     isoform gene_name description organism
  <chr>   <chr>                 <chr>       <dbl> <chr>     <chr>       <chr>   
1 P26039  Actin cytoskeleton    TLN1_MOU…       1 Tln1      Talin-1     Mmus    
2 Q99PL5  Endoplasmic reticulum RRBP1_MO…       1 Rrbp1     Ribosome-b… Mmus    
3 Q99PL5  Golgi apparatus       RRBP1_MO…       2 Rrbp1     Ribosome-b… Mmus    
4 Q6PB66  Mitochondrion         LPPRC_MO…       1 Lrpprc    Leucine-ri… Mmus    
5 P11276  Extracellular matrix  FINC_MOU…       1 Fn1       Fibronectin Mmus    

In the case above, repeating is useful, as it completes jdf6 with correct information from jdf2.

But one needs however to be careful when duplicated keys exist in both tables.

Let’s now use jdf7 for the join. It also has 2 entries for the uniprot Q99PL5

jdf7
# A tibble: 5 × 6
  gene_name description                     uniprot organism isoform_num measure
  <chr>     <chr>                           <chr>   <chr>          <dbl>   <dbl>
1 Rrbp1     Ribosome-binding protein 1      Q99PL5  Mmus               1     102
2 Rrbp1     Ribosome-binding protein 1      Q99PL5  Mmus               2       3
3 Iqgap1    Ras GTPase-activating-like pro… Q9JKF1  Mmus               1      13
4 Hspa5     78 kDa glucose-regulated prote… P20029  Mmus               1      54
5 Pdcd11    Protein RRP5 homolog            Q6NS46  Mmus               1      28

Let’s we create an inner join between jdf6 and jdf7 (both having duplicated Q99PL5 entries)

inner_join(jdf6, jdf7)
Joining, by = "uniprot"
# A tibble: 4 × 9
  uniprot organelle     entry isoform gene_name description organism isoform_num
  <chr>   <chr>         <chr>   <dbl> <chr>     <chr>       <chr>          <dbl>
1 Q99PL5  Endoplasmic … RRBP…       1 Rrbp1     Ribosome-b… Mmus               1
2 Q99PL5  Endoplasmic … RRBP…       1 Rrbp1     Ribosome-b… Mmus               2
3 Q99PL5  Golgi appara… RRBP…       2 Rrbp1     Ribosome-b… Mmus               1
4 Q99PL5  Golgi appara… RRBP…       2 Rrbp1     Ribosome-b… Mmus               2
# … with 1 more variable: measure <dbl>

Challenge

Interpret the result of the inner join above, where both tables have duplicated keys.

Solution

jdf6 has two entries, one for each possible sub-cellular localisation of the protein. jdf7 has also two entries, referring to two different quantitative measurements (variable measure). When joining the duplicated keys, you get all possible combinations.

Joins with duplicated keys in both tables, producing all possible combinations. Figure taken from *R for Data Science*.

In this case, we obtain wrong information: both proteins in the ER and in the GA both have value 102 and 3.

inner_join(jdf6, jdf7)
Joining, by = "uniprot"
# A tibble: 4 × 9
  uniprot organelle     entry isoform gene_name description organism isoform_num
  <chr>   <chr>         <chr>   <dbl> <chr>     <chr>       <chr>          <dbl>
1 Q99PL5  Endoplasmic … RRBP…       1 Rrbp1     Ribosome-b… Mmus               1
2 Q99PL5  Endoplasmic … RRBP…       1 Rrbp1     Ribosome-b… Mmus               2
3 Q99PL5  Golgi appara… RRBP…       2 Rrbp1     Ribosome-b… Mmus               1
4 Q99PL5  Golgi appara… RRBP…       2 Rrbp1     Ribosome-b… Mmus               2
# … with 1 more variable: measure <dbl>

Matching across multiple keys

So far, we have matched tables using a single key (possibly with different names in the two tables). Sometimes, it is necessary to match tables using multiple keys. A typical example is when multiple variables are needed to discriminate different rows in a tables.

Following up from the last example, we see that the duplicated UniProt accession numbers in the jdf6 and jdf7 tables refer to different isoforms of the same RRBP1 gene.

jdf6
# A tibble: 5 × 4
  uniprot organelle             entry       isoform
  <chr>   <chr>                 <chr>         <dbl>
1 P26039  Actin cytoskeleton    TLN1_MOUSE        1
2 Q99PL5  Endoplasmic reticulum RRBP1_MOUSE       1
3 Q99PL5  Golgi apparatus       RRBP1_MOUSE       2
4 Q6PB66  Mitochondrion         LPPRC_MOUSE       1
5 P11276  Extracellular matrix  FINC_MOUSE        1
jdf7
# A tibble: 5 × 6
  gene_name description                     uniprot organism isoform_num measure
  <chr>     <chr>                           <chr>   <chr>          <dbl>   <dbl>
1 Rrbp1     Ribosome-binding protein 1      Q99PL5  Mmus               1     102
2 Rrbp1     Ribosome-binding protein 1      Q99PL5  Mmus               2       3
3 Iqgap1    Ras GTPase-activating-like pro… Q9JKF1  Mmus               1      13
4 Hspa5     78 kDa glucose-regulated prote… P20029  Mmus               1      54
5 Pdcd11    Protein RRP5 homolog            Q6NS46  Mmus               1      28

To uniquely identify isoforms, we should consider two keys:

Because the isoform status was encoded using different names (which is, of course a source of confusion), jdf6 and jdf7 are only automatically joined based on the shared uniprot key.

If the isoform status was encoded the same way in both tables, the join would have been automatically done on both keys!

Here, we need to join using both keys and need to explicitly name the variables used for the join.

inner_join(jdf6, jdf7, by = c("uniprot" = "uniprot", "isoform" = "isoform_num"))
# A tibble: 2 × 8
  uniprot organelle         entry isoform gene_name description organism measure
  <chr>   <chr>             <chr>   <dbl> <chr>     <chr>       <chr>      <dbl>
1 Q99PL5  Endoplasmic reti… RRBP…       1 Rrbp1     Ribosome-b… Mmus         102
2 Q99PL5  Golgi apparatus   RRBP…       2 Rrbp1     Ribosome-b… Mmus           3

We now see that isoform 1 localised to the ER and has a measured value of 102, while isoform 2, that localised to the GA, has a measured value of 3.

Ideally, the isoform variables should be named identically in the two tables to enable an automatic join with the two keys.

An alternative could be to rename the isoform_num from jdf7 in order to have the both keys names present in both tables, enabling an automatic join. This can be done easily using the rename function from dplyr package.

jdf7 %>% rename(isoform = isoform_num)
# A tibble: 5 × 6
  gene_name description                         uniprot organism isoform measure
  <chr>     <chr>                               <chr>   <chr>      <dbl>   <dbl>
1 Rrbp1     Ribosome-binding protein 1          Q99PL5  Mmus           1     102
2 Rrbp1     Ribosome-binding protein 1          Q99PL5  Mmus           2       3
3 Iqgap1    Ras GTPase-activating-like protein… Q9JKF1  Mmus           1      13
4 Hspa5     78 kDa glucose-regulated protein    P20029  Mmus           1      54
5 Pdcd11    Protein RRP5 homolog                Q6NS46  Mmus           1      28
inner_join(jdf6,
           jdf7 %>%
             rename(isoform = isoform_num))
Joining, by = c("uniprot", "isoform")
# A tibble: 2 × 8
  uniprot organelle         entry isoform gene_name description organism measure
  <chr>   <chr>             <chr>   <dbl> <chr>     <chr>       <chr>      <dbl>
1 Q99PL5  Endoplasmic reti… RRBP…       1 Rrbp1     Ribosome-b… Mmus         102
2 Q99PL5  Golgi apparatus   RRBP…       2 Rrbp1     Ribosome-b… Mmus           3

Row and column binding

There are two other important functions in R, rbind and cbind, that can be used to combine two dataframes.

d2
  a b
1 4 4
2 5 5
d3
  v1 v2 v3
1  1  3  5
2  2  4  6
cbind(d2, d3)
  a b v1 v2 v3
1 4 4  1  3  5
2 5 5  2  4  6
d1
  x y
1 1 1
2 2 2
3 3 3
d2
  a b
1 4 4
2 5 5

using rbind(d1, d2) would produce an error because both data frames do not have the same column names (even if they have the same number of columns)

If we change the names of d2, it works!

names(d2) <- names(d1)
d1
  x y
1 1 1
2 2 2
3 3 3
d2
  x y
1 4 4
2 5 5
rbind(d1, d2)
  x y
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5

Note that beyond the dimensions and column names that are required to match, the real meaning of rbind is to bind data frames that contain observations for the same set of variables - there is more than only the column names!

Note: rbind and cbind are base R functions. The tidyverse alternatives from the dplyr package are bind_rows and bind_cols and work similarly.

  1. UniProt is the protein information database. Its mission is to provide the scientific community with a comprehensive, high-quality and freely accessible resource of protein sequence and functional information

Key Points

  • Join tables of data in R


Next steps

Overview

Teaching: XX min
Exercises: XX min
Questions
  • SummarizedExperiment

Objectives
  • Introduce the notion of data containers

  • Give an overview of the SummarizedExperiment, extensively used in omics analyses

Next steps

Data in bioinformatics is often complex. For example, an experiment can have multiple types of files e.g. counts, sample information and gene information. To deal with this, developers define specialised data containers (termed classes) that match the properties of the data they need to handle.

This aspect is central to the Bioconductor1 project which uses the same core data infrastructure across packages. This certainly contributed to Bioconductor’s success. Bioconductor package developers are advised to make use of existing infrastructure to provide coherence, interoperability and stability to the project as a whole. You will learn more about Bioconductor in the next episode.

SummarizedExperiment

To illustrate a Bioconductor omics data container, we’ll present the SummarizedExperiment class. This is a core structure for omics data such as genomics, transcriptomics, proteomics, methylation and cytometry. The image below from this website shows how a large number of Bioconductor packages make use of SummarizedExperiment.

plot of chunk SE-core

What is it

The figure below represents the anatomy of SummarizedExperiment.

plot of chunk SE

Benefits of SummarizedExperiment format

plot of chunk SE-packages

When would you use it

You may encounter a SummarizedExperiment object

Exploring a SummarizedExperiment

Let’s explore a SummarizedExperiment object in R. We’ll load in an example. This is the same RNA data we’ve been working with but also contains genomic coordinates/ranges. Ranges are an optional part of a SummarizedExperiment. When they’re attached, the object is called a RangedSummarizedExperiment.

se <- readRDS("course-data/data/GSE96870/se2.rds")
se
class: RangedSummarizedExperiment 
dim: 1474 22 
metadata(0):
assays(1): counts
rownames(1474): Asl Apod ... Lmx1a Pbx1
rowData names(11): gene ENTREZID ... phenotype_description
  hsapiens_homolog_associated_gene_name
colnames(22): GSM2545336 GSM2545337 ... GSM2545363 GSM2545380
colData names(10): title geo_accession ... tissue mouse

We can access the expression matrix with the assay function:

head(assay(se))
        GSM2545336 GSM2545337 GSM2545338 GSM2545339 GSM2545340 GSM2545341
Asl           1170        361        400        586        626        988
Apod         36194      10347       9173      10620      13021      29594
Cyp2d22       4060       1616       1603       1901       2171       3349
Klk6           287        629        641        578        448        195
Fcrls           85        233        244        237        180         38
Slc2a4         782        231        248        265        313        786
        GSM2545342 GSM2545343 GSM2545344 GSM2545345 GSM2545346 GSM2545347
Asl            836        535        586        597        938       1035
Apod         24959      13668      13230      15868      27769      34301
Cyp2d22       3122       2008       2254       2277       2985       3452
Klk6           186       1101        537        567        327        233
Fcrls           68        375        199        177         89         67
Slc2a4         528        249        266        357        654        693
        GSM2545348 GSM2545349 GSM2545350 GSM2545351 GSM2545352 GSM2545353
Asl            494        481        666        937        803        541
Apod         11258      11812      15816      29242      20415      13682
Cyp2d22       1883       2014       2417       3678       2920       2216
Klk6           742        881        828        250        798        710
Fcrls          300        233        231         81        303        285
Slc2a4         271        304        349        715        513        320
        GSM2545354 GSM2545362 GSM2545363 GSM2545380
Asl            473        748        576       1192
Apod         11088      15916      11166      38148
Cyp2d22       1821       2842       2011       4019
Klk6           894        501        598        259
Fcrls          248        179        184         68
Slc2a4         248        350        317        796
dim(assay(se))
[1] 1474   22

We can access the sample metadata using the colData function:

colData(se)
DataFrame with 22 rows and 10 columns
                     title geo_accession     organism         age      sex
               <character>   <character>  <character> <character> <factor>
GSM2545336 CNS_RNA-seq_10C    GSM2545336 Mus musculus     8 weeks   Female
GSM2545337 CNS_RNA-seq_11C    GSM2545337 Mus musculus     8 weeks   Female
GSM2545338 CNS_RNA-seq_12C    GSM2545338 Mus musculus     8 weeks   Female
GSM2545339 CNS_RNA-seq_13C    GSM2545339 Mus musculus     8 weeks   Female
GSM2545340 CNS_RNA-seq_14C    GSM2545340 Mus musculus     8 weeks   Male  
...                    ...           ...          ...         ...      ...
GSM2545353  CNS_RNA-seq_3C    GSM2545353 Mus musculus     8 weeks   Female
GSM2545354  CNS_RNA-seq_4C    GSM2545354 Mus musculus     8 weeks   Male  
GSM2545362  CNS_RNA-seq_5C    GSM2545362 Mus musculus     8 weeks   Female
GSM2545363  CNS_RNA-seq_6C    GSM2545363 Mus musculus     8 weeks   Male  
GSM2545380  CNS_RNA-seq_9C    GSM2545380 Mus musculus     8 weeks   Female
             infection      strain     time     tissue    mouse
              <factor> <character> <factor>   <factor> <factor>
GSM2545336 InfluenzaA      C57BL/6     Day8 Cerebellum       14
GSM2545337 NonInfected     C57BL/6     Day0 Cerebellum       9 
GSM2545338 NonInfected     C57BL/6     Day0 Cerebellum       10
GSM2545339 InfluenzaA      C57BL/6     Day4 Cerebellum       15
GSM2545340 InfluenzaA      C57BL/6     Day4 Cerebellum       18
...                ...         ...      ...        ...      ...
GSM2545353 NonInfected     C57BL/6     Day0 Cerebellum       4 
GSM2545354 NonInfected     C57BL/6     Day0 Cerebellum       2 
GSM2545362 InfluenzaA      C57BL/6     Day4 Cerebellum       20
GSM2545363 InfluenzaA      C57BL/6     Day4 Cerebellum       12
GSM2545380 InfluenzaA      C57BL/6     Day8 Cerebellum       19
dim(colData(se))
[1] 22 10

We can also access the feature metadata using the rowData function:

head(rowData(se))
DataFrame with 6 rows and 11 columns
               gene    ENTREZID                product       gbkey
        <character> <character>            <character> <character>
Asl             Asl      109900 argininosuccinate ly..        mRNA
Apod           Apod       11815 apolipoprotein D, tr..        mRNA
Cyp2d22     Cyp2d22       56448 cytochrome P450, fam..        mRNA
Klk6           Klk6       19144 kallikrein related-p..        mRNA
Fcrls         Fcrls       80891 Fc receptor-like S, ..        mRNA
Slc2a4       Slc2a4       20528 solute carrier famil..        mRNA
        external_gene_name    ensembl_gene_id external_synonym chromosome_name
               <character>        <character>      <character>     <character>
Asl                    Asl ENSMUSG00000025533    2510006M18Rik               5
Apod                  Apod ENSMUSG00000022548               NA              16
Cyp2d22            Cyp2d22 ENSMUSG00000061740             2D22              15
Klk6                  Klk6 ENSMUSG00000050063             Bssp               7
Fcrls                Fcrls ENSMUSG00000015852    2810439C17Rik               3
Slc2a4              Slc2a4 ENSMUSG00000018566           Glut-4              11
          gene_biotype  phenotype_description
           <character>            <character>
Asl     protein_coding abnormal circulating..
Apod    protein_coding abnormal lipid homeo..
Cyp2d22 protein_coding abnormal skin morpho..
Klk6    protein_coding abnormal cytokine le..
Fcrls   protein_coding decreased CD8-positi..
Slc2a4  protein_coding abnormal circulating..
        hsapiens_homolog_associated_gene_name
                                  <character>
Asl                                       ASL
Apod                                     APOD
Cyp2d22                                CYP2D6
Klk6                                     KLK6
Fcrls                                   FCRL2
Slc2a4                                 SLC2A4
dim(rowData(se))
[1] 1474   11

If genomic coordinates are present, we can access using the rowRanges function:

rowRanges(se)
GRanges object with 1474 ranges and 11 metadata columns:
          seqnames              ranges strand |        gene    ENTREZID
             <Rle>           <IRanges>  <Rle> | <character> <character>
      Asl        5 130024208-130024330      - |         Asl      109900
     Apod       16   31314552-31314808      - |        Apod       11815
  Cyp2d22       15   82380078-82380260      - |     Cyp2d22       56448
     Klk6        7   43824544-43824595      + |        Klk6       19144
    Fcrls        3   87263678-87263765      - |       Fcrls       80891
      ...      ...                 ...    ... .         ...         ...
    Mgst3        1 167393751-167393797      - |       Mgst3       66447
   Lrrc52        1 167466093-167466780      - |      Lrrc52      240899
     Rxrg        1 167598362-167598800      + |        Rxrg       20183
    Lmx1a        1 167688300-167688991      + |       Lmx1a      110648
     Pbx1        1 168431314-168432169      - |        Pbx1       18514
                         product       gbkey external_gene_name
                     <character> <character>        <character>
      Asl argininosuccinate ly..        mRNA                Asl
     Apod apolipoprotein D, tr..        mRNA               Apod
  Cyp2d22 cytochrome P450, fam..        mRNA            Cyp2d22
     Klk6 kallikrein related-p..        mRNA               Klk6
    Fcrls Fc receptor-like S, ..        mRNA              Fcrls
      ...                    ...         ...                ...
    Mgst3 microsomal glutathio..        mRNA              Mgst3
   Lrrc52 leucine rich repeat ..        mRNA             Lrrc52
     Rxrg retinoid X receptor ..        mRNA               Rxrg
    Lmx1a LIM homeobox transcr..        mRNA              Lmx1a
     Pbx1 pre B cell leukemia ..        mRNA               Pbx1
             ensembl_gene_id external_synonym chromosome_name   gene_biotype
                 <character>      <character>     <character>    <character>
      Asl ENSMUSG00000025533    2510006M18Rik               5 protein_coding
     Apod ENSMUSG00000022548             <NA>              16 protein_coding
  Cyp2d22 ENSMUSG00000061740             2D22              15 protein_coding
     Klk6 ENSMUSG00000050063             Bssp               7 protein_coding
    Fcrls ENSMUSG00000015852    2810439C17Rik               3 protein_coding
      ...                ...              ...             ...            ...
    Mgst3 ENSMUSG00000026688    2010012L10Rik               1 protein_coding
   Lrrc52 ENSMUSG00000040485    4930413P14Rik               1 protein_coding
     Rxrg ENSMUSG00000015843            Nr2b3               1 protein_coding
    Lmx1a ENSMUSG00000026686           Lmx1.1               1 protein_coding
     Pbx1 ENSMUSG00000052534    2310056B04Rik               1 protein_coding
           phenotype_description hsapiens_homolog_associated_gene_name
                     <character>                           <character>
      Asl abnormal circulating..                                   ASL
     Apod abnormal lipid homeo..                                  APOD
  Cyp2d22 abnormal skin morpho..                                CYP2D6
     Klk6 abnormal cytokine le..                                  KLK6
    Fcrls decreased CD8-positi..                                 FCRL2
      ...                    ...                                   ...
    Mgst3 decreased mean corpu..                                 MGST3
   Lrrc52 abnormal sperm physi..                                LRRC52
     Rxrg abnormal bone minera..                                  RXRG
    Lmx1a abnormal bony labyri..                                 LMX1A
     Pbx1 abnormal adrenal gla..                                  PBX1
  -------
  seqinfo: 182 sequences from an unspecified genome; no seqlengths

rowRanges gives us the same information as rowData but also includes the coordinates.

Creating a SummarizedExperiment

Often a package will create a SummarizedExperiment for you but we will demonstrate how a SummarizedExperiment can be created. The 3 different tables that make up a SummarizedExperiment object are:

You would import these tables into R or generate them in R. Here we will create them from the se object that we have.

count_matrix <- assay(se)
sample_metadata <- colData(se)
gene_metadata <- rowRanges(se) # or rowData(se)

We will create a SummarizedExperiment from these tables using the SummarizedExperiment constructor. We need to provide inputs for the arguments assays (count matrix), colData (sample metadata) and rowData (gene metadata). Because assays can accept multiple assays (e.g. raw counts, log normalized) we put them in a list (SimpleList) and give each assay a name (we’ll use counts). We can see we need to do this in the help for the SummarizedExperiment function ?SummarizedExperiment .

# BiocManager::install("SummarizedExperiment")
library("SummarizedExperiment")

se_created <- SummarizedExperiment(assays = SimpleList(counts=count_matrix),
                           colData = sample_metadata,
                           rowRanges = gene_metadata) # or rowData =
se_created
class: RangedSummarizedExperiment 
dim: 1474 22 
metadata(0):
assays(1): counts
rownames(1474): Asl Apod ... Lmx1a Pbx1
rowData names(11): gene ENTREZID ... phenotype_description
  hsapiens_homolog_associated_gene_name
colnames(22): GSM2545336 GSM2545337 ... GSM2545363 GSM2545380
colData names(10): title geo_accession ... tissue mouse

If we want to save a SummarizedExperiment object we can use the saveRDS function that we saw in a previous episode.

saveRDS(se_created, file = "data_output/se_created.rds")

Subsetting a SummarizedExperiment

SummarizedExperiment can be subset just like with data frames, with numerics, characters or logicals. There are two dimensions, the first dimension is features (genes) and the second dimension is samples.

To subset to the first 5 features we could do:

se[1:5, ]
class: RangedSummarizedExperiment 
dim: 5 22 
metadata(0):
assays(1): counts
rownames(5): Asl Apod Cyp2d22 Klk6 Fcrls
rowData names(11): gene ENTREZID ... phenotype_description
  hsapiens_homolog_associated_gene_name
colnames(22): GSM2545336 GSM2545337 ... GSM2545363 GSM2545380
colData names(10): title geo_accession ... tissue mouse

To subset to the first 3 samples we could do:

se[, 1:3]
class: RangedSummarizedExperiment 
dim: 1474 3 
metadata(0):
assays(1): counts
rownames(1474): Asl Apod ... Lmx1a Pbx1
rowData names(11): gene ENTREZID ... phenotype_description
  hsapiens_homolog_associated_gene_name
colnames(3): GSM2545336 GSM2545337 GSM2545338
colData names(10): title geo_accession ... tissue mouse

Below, we create a new instance of class SummarizedExperiment that contains only the first 5 features for the first 3 samples.

se1 <- se[1:5, 1:3]
se1
class: RangedSummarizedExperiment 
dim: 5 3 
metadata(0):
assays(1): counts
rownames(5): Asl Apod Cyp2d22 Klk6 Fcrls
rowData names(11): gene ENTREZID ... phenotype_description
  hsapiens_homolog_associated_gene_name
colnames(3): GSM2545336 GSM2545337 GSM2545338
colData names(10): title geo_accession ... tissue mouse
assay(se1)
        GSM2545336 GSM2545337 GSM2545338
Asl           1170        361        400
Apod         36194      10347       9173
Cyp2d22       4060       1616       1603
Klk6           287        629        641
Fcrls           85        233        244
colData(se1)
DataFrame with 3 rows and 10 columns
                     title geo_accession     organism         age      sex
               <character>   <character>  <character> <character> <factor>
GSM2545336 CNS_RNA-seq_10C    GSM2545336 Mus musculus     8 weeks   Female
GSM2545337 CNS_RNA-seq_11C    GSM2545337 Mus musculus     8 weeks   Female
GSM2545338 CNS_RNA-seq_12C    GSM2545338 Mus musculus     8 weeks   Female
             infection      strain     time     tissue    mouse
              <factor> <character> <factor>   <factor> <factor>
GSM2545336 InfluenzaA      C57BL/6     Day8 Cerebellum       14
GSM2545337 NonInfected     C57BL/6     Day0 Cerebellum       9 
GSM2545338 NonInfected     C57BL/6     Day0 Cerebellum       10
rowData(se1)
DataFrame with 5 rows and 11 columns
               gene    ENTREZID                product       gbkey
        <character> <character>            <character> <character>
Asl             Asl      109900 argininosuccinate ly..        mRNA
Apod           Apod       11815 apolipoprotein D, tr..        mRNA
Cyp2d22     Cyp2d22       56448 cytochrome P450, fam..        mRNA
Klk6           Klk6       19144 kallikrein related-p..        mRNA
Fcrls         Fcrls       80891 Fc receptor-like S, ..        mRNA
        external_gene_name    ensembl_gene_id external_synonym chromosome_name
               <character>        <character>      <character>     <character>
Asl                    Asl ENSMUSG00000025533    2510006M18Rik               5
Apod                  Apod ENSMUSG00000022548               NA              16
Cyp2d22            Cyp2d22 ENSMUSG00000061740             2D22              15
Klk6                  Klk6 ENSMUSG00000050063             Bssp               7
Fcrls                Fcrls ENSMUSG00000015852    2810439C17Rik               3
          gene_biotype  phenotype_description
           <character>            <character>
Asl     protein_coding abnormal circulating..
Apod    protein_coding abnormal lipid homeo..
Cyp2d22 protein_coding abnormal skin morpho..
Klk6    protein_coding abnormal cytokine le..
Fcrls   protein_coding decreased CD8-positi..
        hsapiens_homolog_associated_gene_name
                                  <character>
Asl                                       ASL
Apod                                     APOD
Cyp2d22                                CYP2D6
Klk6                                     KLK6
Fcrls                                   FCRL2

We can also use the colData() function to subset on something from the sample metadata, or the rowData() to subset on something from the feature metadata. For example, here we keep only miRNAs and the non infected samples:

se1 <- se[rowData(se)$gene_biotype == "miRNA",
          colData(se)$infection == "NonInfected"]
se1
class: RangedSummarizedExperiment 
dim: 7 7 
metadata(0):
assays(1): counts
rownames(7): Mir1901 Mir378a ... Mir128-1 Mir7682
rowData names(11): gene ENTREZID ... phenotype_description
  hsapiens_homolog_associated_gene_name
colnames(7): GSM2545337 GSM2545338 ... GSM2545353 GSM2545354
colData names(10): title geo_accession ... tissue mouse
assay(se1)
         GSM2545337 GSM2545338 GSM2545343 GSM2545348 GSM2545349 GSM2545353
Mir1901          45         44         74         55         68         33
Mir378a          11          7          9          4         12          4
Mir133b           4          6          5          4          6          7
Mir30c-2         10          6         16         12          8         17
Mir149            1          2          0          0          0          0
Mir128-1          4          1          2          2          1          2
Mir7682           2          0          4          1          3          5
         GSM2545354
Mir1901          60
Mir378a           8
Mir133b           3
Mir30c-2         15
Mir149            2
Mir128-1          1
Mir7682           5
colData(se1)
DataFrame with 7 rows and 10 columns
                     title geo_accession     organism         age      sex
               <character>   <character>  <character> <character> <factor>
GSM2545337 CNS_RNA-seq_11C    GSM2545337 Mus musculus     8 weeks   Female
GSM2545338 CNS_RNA-seq_12C    GSM2545338 Mus musculus     8 weeks   Female
GSM2545343 CNS_RNA-seq_20C    GSM2545343 Mus musculus     8 weeks   Male  
GSM2545348 CNS_RNA-seq_27C    GSM2545348 Mus musculus     8 weeks   Female
GSM2545349 CNS_RNA-seq_28C    GSM2545349 Mus musculus     8 weeks   Male  
GSM2545353  CNS_RNA-seq_3C    GSM2545353 Mus musculus     8 weeks   Female
GSM2545354  CNS_RNA-seq_4C    GSM2545354 Mus musculus     8 weeks   Male  
             infection      strain     time     tissue    mouse
              <factor> <character> <factor>   <factor> <factor>
GSM2545337 NonInfected     C57BL/6     Day0 Cerebellum       9 
GSM2545338 NonInfected     C57BL/6     Day0 Cerebellum       10
GSM2545343 NonInfected     C57BL/6     Day0 Cerebellum       11
GSM2545348 NonInfected     C57BL/6     Day0 Cerebellum       8 
GSM2545349 NonInfected     C57BL/6     Day0 Cerebellum       7 
GSM2545353 NonInfected     C57BL/6     Day0 Cerebellum       4 
GSM2545354 NonInfected     C57BL/6     Day0 Cerebellum       2 
rowData(se1)
DataFrame with 7 rows and 11 columns
                gene    ENTREZID        product         gbkey
         <character> <character>    <character>   <character>
Mir1901      Mir1901   100316686  microRNA 1901 precursor_RNA
Mir378a      Mir378a      723889  microRNA 378a precursor_RNA
Mir133b      Mir133b      723817  microRNA 133b precursor_RNA
Mir30c-2    Mir30c-2      723964 microRNA 30c-2 precursor_RNA
Mir149        Mir149      387167   microRNA 149 precursor_RNA
Mir128-1    Mir128-1      387147 microRNA 128-1 precursor_RNA
Mir7682      Mir7682   102466847  microRNA 7682 precursor_RNA
         external_gene_name    ensembl_gene_id external_synonym chromosome_name
                <character>        <character>      <character>     <character>
Mir1901             Mir1901 ENSMUSG00000084565         Mirn1901              18
Mir378a             Mir378a ENSMUSG00000105200          Mirn378              18
Mir133b             Mir133b ENSMUSG00000065480         mir 133b               1
Mir30c-2           Mir30c-2 ENSMUSG00000065567        mir 30c-2               1
Mir149               Mir149 ENSMUSG00000065470          Mirn149               1
Mir128-1           Mir128-1 ENSMUSG00000065520          Mirn128               1
Mir7682             Mir7682 ENSMUSG00000106406     mmu-mir-7682               1
         gene_biotype  phenotype_description
          <character>            <character>
Mir1901         miRNA                     NA
Mir378a         miRNA abnormal mitochondri..
Mir133b         miRNA no abnormal phenotyp..
Mir30c-2        miRNA                     NA
Mir149          miRNA increased circulatin..
Mir128-1        miRNA no abnormal phenotyp..
Mir7682         miRNA                     NA
         hsapiens_homolog_associated_gene_name
                                   <character>
Mir1901                                     NA
Mir378a                                MIR378A
Mir133b                                MIR133B
Mir30c-2                               MIR30C2
Mir149                                      NA
Mir128-1                              MIR128-1
Mir7682                                     NA

We can subset to a genomic region of interest using the subsetByOverlaps function.

# Define a region of interest e.g. interval 1 to 10,000,000 of chromosome 1 
roi <- GRanges(seqnames="1", ranges=1:10000000)
# Subset se object for only rows in the region of interest
subsetByOverlaps(se, roi)
class: RangedSummarizedExperiment 
dim: 20 22 
metadata(0):
assays(1): counts
rownames(20): Sox17 Rp1 ... Snord87 Tcf24
rowData names(11): gene ENTREZID ... phenotype_description
  hsapiens_homolog_associated_gene_name
colnames(22): GSM2545336 GSM2545337 ... GSM2545363 GSM2545380
colData names(10): title geo_accession ... tissue mouse

Challenge

  1. Extract the gene expression levels of the first 3 genes in samples at time 0 and at time 8.
  2. Extract all genes on chromosome Y between 1 and 10,000,000 on the negative strand.

Solution

assay(se)[1:3, colData(se)$time != "Day4"]
        GSM2545336 GSM2545337 GSM2545338 GSM2545341 GSM2545342 GSM2545343
Asl           1170        361        400        988        836        535
Apod         36194      10347       9173      29594      24959      13668
Cyp2d22       4060       1616       1603       3349       3122       2008
        GSM2545346 GSM2545347 GSM2545348 GSM2545349 GSM2545351 GSM2545353
Asl            938       1035        494        481        937        541
Apod         27769      34301      11258      11812      29242      13682
Cyp2d22       2985       3452       1883       2014       3678       2216
        GSM2545354 GSM2545380
Asl            473       1192
Apod         11088      38148
Cyp2d22       1821       4019
# Equivalent to
assay(se)[1:3, colData(se)$time == "Day0" | colData(se)$time == "Day8"]
        GSM2545336 GSM2545337 GSM2545338 GSM2545341 GSM2545342 GSM2545343
Asl           1170        361        400        988        836        535
Apod         36194      10347       9173      29594      24959      13668
Cyp2d22       4060       1616       1603       3349       3122       2008
        GSM2545346 GSM2545347 GSM2545348 GSM2545349 GSM2545351 GSM2545353
Asl            938       1035        494        481        937        541
Apod         27769      34301      11258      11812      29242      13682
Cyp2d22       2985       3452       1883       2014       3678       2216
        GSM2545354 GSM2545380
Asl            473       1192
Apod         11088      38148
Cyp2d22       1821       4019
roi <- GRanges(seqnames="Y", ranges = 1:10000000, strand = "-")
subsetByOverlaps(se, roi)
class: RangedSummarizedExperiment 
dim: 3 22 
metadata(0):
assays(1): counts
rownames(3): Ddx3y Uty Gm4017
rowData names(11): gene ENTREZID ... phenotype_description
  hsapiens_homolog_associated_gene_name
colnames(22): GSM2545336 GSM2545337 ... GSM2545363 GSM2545380
colData names(10): title geo_accession ... tissue mouse

Adding variables to metadata

We can also add information to the sample metadata. Suppose that you want to add the center where the samples were collected…

colData(se)$center <- rep("University of Illinois", nrow(colData(se)))
colData(se)
DataFrame with 22 rows and 11 columns
                     title geo_accession     organism         age      sex
               <character>   <character>  <character> <character> <factor>
GSM2545336 CNS_RNA-seq_10C    GSM2545336 Mus musculus     8 weeks   Female
GSM2545337 CNS_RNA-seq_11C    GSM2545337 Mus musculus     8 weeks   Female
GSM2545338 CNS_RNA-seq_12C    GSM2545338 Mus musculus     8 weeks   Female
GSM2545339 CNS_RNA-seq_13C    GSM2545339 Mus musculus     8 weeks   Female
GSM2545340 CNS_RNA-seq_14C    GSM2545340 Mus musculus     8 weeks   Male  
...                    ...           ...          ...         ...      ...
GSM2545353  CNS_RNA-seq_3C    GSM2545353 Mus musculus     8 weeks   Female
GSM2545354  CNS_RNA-seq_4C    GSM2545354 Mus musculus     8 weeks   Male  
GSM2545362  CNS_RNA-seq_5C    GSM2545362 Mus musculus     8 weeks   Female
GSM2545363  CNS_RNA-seq_6C    GSM2545363 Mus musculus     8 weeks   Male  
GSM2545380  CNS_RNA-seq_9C    GSM2545380 Mus musculus     8 weeks   Female
             infection      strain     time     tissue    mouse
              <factor> <character> <factor>   <factor> <factor>
GSM2545336 InfluenzaA      C57BL/6     Day8 Cerebellum       14
GSM2545337 NonInfected     C57BL/6     Day0 Cerebellum       9 
GSM2545338 NonInfected     C57BL/6     Day0 Cerebellum       10
GSM2545339 InfluenzaA      C57BL/6     Day4 Cerebellum       15
GSM2545340 InfluenzaA      C57BL/6     Day4 Cerebellum       18
...                ...         ...      ...        ...      ...
GSM2545353 NonInfected     C57BL/6     Day0 Cerebellum       4 
GSM2545354 NonInfected     C57BL/6     Day0 Cerebellum       2 
GSM2545362 InfluenzaA      C57BL/6     Day4 Cerebellum       20
GSM2545363 InfluenzaA      C57BL/6     Day4 Cerebellum       12
GSM2545380 InfluenzaA      C57BL/6     Day8 Cerebellum       19
                           center
                      <character>
GSM2545336 University of Illinois
GSM2545337 University of Illinois
GSM2545338 University of Illinois
GSM2545339 University of Illinois
GSM2545340 University of Illinois
...                           ...
GSM2545353 University of Illinois
GSM2545354 University of Illinois
GSM2545362 University of Illinois
GSM2545363 University of Illinois
GSM2545380 University of Illinois

This illustrates that the metadata slots can grow indefinitely without affecting the other structures!

tidySummarizedExperiment

You may be wondering, can we use tidyverse commands to interact with SummarizedExperiment objects. The answer is yes, we can with the tidySummarizedExperiment package.

Remember what our SummarizedExperiment object looks like.

se
class: RangedSummarizedExperiment 
dim: 1474 22 
metadata(0):
assays(1): counts
rownames(1474): Asl Apod ... Lmx1a Pbx1
rowData names(11): gene ENTREZID ... phenotype_description
  hsapiens_homolog_associated_gene_name
colnames(22): GSM2545336 GSM2545337 ... GSM2545363 GSM2545380
colData names(11): title geo_accession ... mouse center

Load tidyverse and tidySummarizedExperiment and then take a look at the se object again.

# BiocManager::install("tidySummarizedExperiment")
library("tidyverse")
library("tidySummarizedExperiment")

se
# A SummarizedExperiment-tibble abstraction: 32,428 × 22
# Features=1474 | Samples=22 | Assays=counts
   .feature .sample    counts title geo_accession organism age   sex   infection
   <chr>    <chr>       <int> <chr> <chr>         <chr>    <chr> <fct> <fct>    
 1 Asl      GSM2545336   1170 CNS_… GSM2545336    Mus mus… 8 we… Fema… Influenz…
 2 Apod     GSM2545336  36194 CNS_… GSM2545336    Mus mus… 8 we… Fema… Influenz…
 3 Cyp2d22  GSM2545336   4060 CNS_… GSM2545336    Mus mus… 8 we… Fema… Influenz…
 4 Klk6     GSM2545336    287 CNS_… GSM2545336    Mus mus… 8 we… Fema… Influenz…
 5 Fcrls    GSM2545336     85 CNS_… GSM2545336    Mus mus… 8 we… Fema… Influenz…
 6 Slc2a4   GSM2545336    782 CNS_… GSM2545336    Mus mus… 8 we… Fema… Influenz…
 7 Exd2     GSM2545336   1619 CNS_… GSM2545336    Mus mus… 8 we… Fema… Influenz…
 8 Gjc2     GSM2545336    288 CNS_… GSM2545336    Mus mus… 8 we… Fema… Influenz…
 9 Plp1     GSM2545336  43217 CNS_… GSM2545336    Mus mus… 8 we… Fema… Influenz…
10 Gnb4     GSM2545336   1071 CNS_… GSM2545336    Mus mus… 8 we… Fema… Influenz…
# … with 40 more rows, and 21 more variables: strain <chr>, time <fct>,
#   tissue <fct>, mouse <fct>, center <chr>, gene <chr>, ENTREZID <chr>,
#   product <chr>, gbkey <chr>, external_gene_name <chr>,
#   ensembl_gene_id <chr>, external_synonym <chr>, chromosome_name <chr>,
#   gene_biotype <chr>, phenotype_description <chr>,
#   hsapiens_homolog_associated_gene_name <chr>, seqnames <fct>, start <int>,
#   end <int>, width <int>, strand <fct>

It’s still a SummarizedExperiment object but now we can view it as a tibble. Note the first line of the output says this, it’s a SummarizedExperiment-tibble abstraction. We can also see in the second line of the output the number of transcripts and samples.

Reverting to standard SummarizedExperiment view

If we want to revert to the standard SummarizedExperiment view we can do that.

options("restore_SummarizedExperiment_show" = TRUE)
se
class: RangedSummarizedExperiment 
dim: 1474 22 
metadata(0):
assays(1): counts
rownames(1474): Asl Apod ... Lmx1a Pbx1
rowData names(11): gene ENTREZID ... phenotype_description
  hsapiens_homolog_associated_gene_name
colnames(22): GSM2545336 GSM2545337 ... GSM2545363 GSM2545380
colData names(11): title geo_accession ... mouse center

If we want to revert back to tidy SummarizedExperiment view we can.

options("restore_SummarizedExperiment_show" = FALSE)
se
# A SummarizedExperiment-tibble abstraction: 32,428 × 22
# Features=1474 | Samples=22 | Assays=counts
   .feature .sample    counts title geo_accession organism age   sex   infection
   <chr>    <chr>       <int> <chr> <chr>         <chr>    <chr> <fct> <fct>    
 1 Asl      GSM2545336   1170 CNS_… GSM2545336    Mus mus… 8 we… Fema… Influenz…
 2 Apod     GSM2545336  36194 CNS_… GSM2545336    Mus mus… 8 we… Fema… Influenz…
 3 Cyp2d22  GSM2545336   4060 CNS_… GSM2545336    Mus mus… 8 we… Fema… Influenz…
 4 Klk6     GSM2545336    287 CNS_… GSM2545336    Mus mus… 8 we… Fema… Influenz…
 5 Fcrls    GSM2545336     85 CNS_… GSM2545336    Mus mus… 8 we… Fema… Influenz…
 6 Slc2a4   GSM2545336    782 CNS_… GSM2545336    Mus mus… 8 we… Fema… Influenz…
 7 Exd2     GSM2545336   1619 CNS_… GSM2545336    Mus mus… 8 we… Fema… Influenz…
 8 Gjc2     GSM2545336    288 CNS_… GSM2545336    Mus mus… 8 we… Fema… Influenz…
 9 Plp1     GSM2545336  43217 CNS_… GSM2545336    Mus mus… 8 we… Fema… Influenz…
10 Gnb4     GSM2545336   1071 CNS_… GSM2545336    Mus mus… 8 we… Fema… Influenz…
# … with 40 more rows, and 21 more variables: strain <chr>, time <fct>,
#   tissue <fct>, mouse <fct>, center <chr>, gene <chr>, ENTREZID <chr>,
#   product <chr>, gbkey <chr>, external_gene_name <chr>,
#   ensembl_gene_id <chr>, external_synonym <chr>, chromosome_name <chr>,
#   gene_biotype <chr>, phenotype_description <chr>,
#   hsapiens_homolog_associated_gene_name <chr>, seqnames <fct>, start <int>,
#   end <int>, width <int>, strand <fct>

We can now use tidyverse commands to interact with the SummarizedExperiment object.

We can use filter to filter for rows using a condition e.g. to remove a failed sample.

se %>% 
    filter(.sample != "GSM2545336")
# A SummarizedExperiment-tibble abstraction: 30,954 × 21
# Features=1474 | Samples=21 | Assays=counts
   .feature .sample    counts title geo_accession organism age   sex   infection
   <chr>    <chr>       <int> <chr> <chr>         <chr>    <chr> <fct> <fct>    
 1 Asl      GSM2545337    361 CNS_… GSM2545337    Mus mus… 8 we… Fema… NonInfec…
 2 Apod     GSM2545337  10347 CNS_… GSM2545337    Mus mus… 8 we… Fema… NonInfec…
 3 Cyp2d22  GSM2545337   1616 CNS_… GSM2545337    Mus mus… 8 we… Fema… NonInfec…
 4 Klk6     GSM2545337    629 CNS_… GSM2545337    Mus mus… 8 we… Fema… NonInfec…
 5 Fcrls    GSM2545337    233 CNS_… GSM2545337    Mus mus… 8 we… Fema… NonInfec…
 6 Slc2a4   GSM2545337    231 CNS_… GSM2545337    Mus mus… 8 we… Fema… NonInfec…
 7 Exd2     GSM2545337   2288 CNS_… GSM2545337    Mus mus… 8 we… Fema… NonInfec…
 8 Gjc2     GSM2545337    595 CNS_… GSM2545337    Mus mus… 8 we… Fema… NonInfec…
 9 Plp1     GSM2545337 101241 CNS_… GSM2545337    Mus mus… 8 we… Fema… NonInfec…
10 Gnb4     GSM2545337   1791 CNS_… GSM2545337    Mus mus… 8 we… Fema… NonInfec…
# … with 40 more rows, and 21 more variables: strain <chr>, time <fct>,
#   tissue <fct>, mouse <fct>, center <chr>, gene <chr>, ENTREZID <chr>,
#   product <chr>, gbkey <chr>, external_gene_name <chr>,
#   ensembl_gene_id <chr>, external_synonym <chr>, chromosome_name <chr>,
#   gene_biotype <chr>, phenotype_description <chr>,
#   hsapiens_homolog_associated_gene_name <chr>, seqnames <fct>, start <int>,
#   end <int>, width <int>, strand <fct>

We can use select to specify columns we want to view.

se %>% 
    select(.sample)
tidySummarizedExperiment says: Key columns are missing. A data frame is returned for independent data analysis.
# A tibble: 32,428 × 1
   .sample   
   <chr>     
 1 GSM2545336
 2 GSM2545336
 3 GSM2545336
 4 GSM2545336
 5 GSM2545336
 6 GSM2545336
 7 GSM2545336
 8 GSM2545336
 9 GSM2545336
10 GSM2545336
# … with 32,418 more rows
se %>%
    count(.sample)
tidySummarizedExperiment says: A data frame is returned for independent data analysis.
# A tibble: 22 × 2
   .sample        n
   <chr>      <int>
 1 GSM2545336  1474
 2 GSM2545337  1474
 3 GSM2545338  1474
 4 GSM2545339  1474
 5 GSM2545340  1474
 6 GSM2545341  1474
 7 GSM2545342  1474
 8 GSM2545343  1474
 9 GSM2545344  1474
10 GSM2545345  1474
# … with 12 more rows

We can use mutate to add metadata info.

se %>%
    mutate(center = "University of Melbourne")
# A SummarizedExperiment-tibble abstraction: 32,428 × 22
# Features=1474 | Samples=22 | Assays=counts
   .feature .sample    counts title geo_accession organism age   sex   infection
   <chr>    <chr>       <int> <chr> <chr>         <chr>    <chr> <fct> <fct>    
 1 Asl      GSM2545336   1170 CNS_… GSM2545336    Mus mus… 8 we… Fema… Influenz…
 2 Apod     GSM2545336  36194 CNS_… GSM2545336    Mus mus… 8 we… Fema… Influenz…
 3 Cyp2d22  GSM2545336   4060 CNS_… GSM2545336    Mus mus… 8 we… Fema… Influenz…
 4 Klk6     GSM2545336    287 CNS_… GSM2545336    Mus mus… 8 we… Fema… Influenz…
 5 Fcrls    GSM2545336     85 CNS_… GSM2545336    Mus mus… 8 we… Fema… Influenz…
 6 Slc2a4   GSM2545336    782 CNS_… GSM2545336    Mus mus… 8 we… Fema… Influenz…
 7 Exd2     GSM2545336   1619 CNS_… GSM2545336    Mus mus… 8 we… Fema… Influenz…
 8 Gjc2     GSM2545336    288 CNS_… GSM2545336    Mus mus… 8 we… Fema… Influenz…
 9 Plp1     GSM2545336  43217 CNS_… GSM2545336    Mus mus… 8 we… Fema… Influenz…
10 Gnb4     GSM2545336   1071 CNS_… GSM2545336    Mus mus… 8 we… Fema… Influenz…
# … with 40 more rows, and 21 more variables: strain <chr>, time <fct>,
#   tissue <fct>, mouse <fct>, center <chr>, gene <chr>, ENTREZID <chr>,
#   product <chr>, gbkey <chr>, external_gene_name <chr>,
#   ensembl_gene_id <chr>, external_synonym <chr>, chromosome_name <chr>,
#   gene_biotype <chr>, phenotype_description <chr>,
#   hsapiens_homolog_associated_gene_name <chr>, seqnames <fct>, start <int>,
#   end <int>, width <int>, strand <fct>

We can also combine commands with the tidyverse pipe %>%.

For example, we could combine group_by and summarise to get the total counts for each sample.

se %>%
    group_by(.sample) %>%
    summarise(total_counts=sum(counts))
tidySummarizedExperiment says: A data frame is returned for independent data analysis.
# A tibble: 22 × 2
   .sample    total_counts
   <chr>             <int>
 1 GSM2545336      3039671
 2 GSM2545337      2602360
 3 GSM2545338      2458618
 4 GSM2545339      2500082
 5 GSM2545340      2479024
 6 GSM2545341      2413723
 7 GSM2545342      2349728
 8 GSM2545343      3105652
 9 GSM2545344      2524137
10 GSM2545345      2506038
# … with 12 more rows

Some other tidyverse commands we haven’t seen before but that are useful are slice, distinct and unite.

We can use slice to choose rows by number e.g. to choose the first row.

se %>% 
    slice(1)
# A SummarizedExperiment-tibble abstraction: 1 × 1
# Features=1 | Samples=1 | Assays=counts
  .feature .sample    counts title  geo_accession organism age   sex   infection
  <chr>    <chr>       <int> <chr>  <chr>         <chr>    <chr> <fct> <fct>    
1 Asl      GSM2545336   1170 CNS_R… GSM2545336    Mus mus… 8 we… Fema… Influenz…
# … with 21 more variables: strain <chr>, time <fct>, tissue <fct>,
#   mouse <fct>, center <chr>, gene <chr>, ENTREZID <chr>, product <chr>,
#   gbkey <chr>, external_gene_name <chr>, ensembl_gene_id <chr>,
#   external_synonym <chr>, chromosome_name <chr>, gene_biotype <chr>,
#   phenotype_description <chr>, hsapiens_homolog_associated_gene_name <chr>,
#   seqnames <fct>, start <int>, end <int>, width <int>, strand <fct>

We can use distinct to see what unique information we have.

se %>%
    distinct(.sample, infection, sex)
tidySummarizedExperiment says: A data frame is returned for independent data analysis.
# A tibble: 22 × 3
   .sample    sex    infection  
   <chr>      <fct>  <fct>      
 1 GSM2545336 Female InfluenzaA 
 2 GSM2545337 Female NonInfected
 3 GSM2545338 Female NonInfected
 4 GSM2545339 Female InfluenzaA 
 5 GSM2545340 Male   InfluenzaA 
 6 GSM2545341 Male   InfluenzaA 
 7 GSM2545342 Female InfluenzaA 
 8 GSM2545343 Male   NonInfected
 9 GSM2545344 Female InfluenzaA 
10 GSM2545345 Male   InfluenzaA 
# … with 12 more rows

We can use unite to combine multiple columns into a single column.

se %>%
    unite("group", c(infection, time))
tidySummarizedExperiment says: Key columns are missing. A data frame is returned for independent data analysis.
# A SummarizedExperiment-tibble abstraction: 32,428 × 22
# Features=1474 | Samples=22 | Assays=counts
   .feature .sample counts title geo_accession organism age   sex   group strain
   <chr>    <chr>    <int> <chr> <chr>         <chr>    <chr> <fct> <chr> <chr> 
 1 Asl      GSM254…   1170 CNS_… GSM2545336    Mus mus… 8 we… Fema… Infl… C57BL…
 2 Apod     GSM254…  36194 CNS_… GSM2545336    Mus mus… 8 we… Fema… Infl… C57BL…
 3 Cyp2d22  GSM254…   4060 CNS_… GSM2545336    Mus mus… 8 we… Fema… Infl… C57BL…
 4 Klk6     GSM254…    287 CNS_… GSM2545336    Mus mus… 8 we… Fema… Infl… C57BL…
 5 Fcrls    GSM254…     85 CNS_… GSM2545336    Mus mus… 8 we… Fema… Infl… C57BL…
 6 Slc2a4   GSM254…    782 CNS_… GSM2545336    Mus mus… 8 we… Fema… Infl… C57BL…
 7 Exd2     GSM254…   1619 CNS_… GSM2545336    Mus mus… 8 we… Fema… Infl… C57BL…
 8 Gjc2     GSM254…    288 CNS_… GSM2545336    Mus mus… 8 we… Fema… Infl… C57BL…
 9 Plp1     GSM254…  43217 CNS_… GSM2545336    Mus mus… 8 we… Fema… Infl… C57BL…
10 Gnb4     GSM254…   1071 CNS_… GSM2545336    Mus mus… 8 we… Fema… Infl… C57BL…
# … with 40 more rows, and 19 more variables: tissue <fct>, mouse <fct>,
#   center <chr>, gene <chr>, ENTREZID <chr>, product <chr>, gbkey <chr>,
#   external_gene_name <chr>, ensembl_gene_id <chr>, external_synonym <chr>,
#   chromosome_name <chr>, gene_biotype <chr>, phenotype_description <chr>,
#   hsapiens_homolog_associated_gene_name <chr>, seqnames <fct>, start <int>,
#   end <int>, width <int>, strand <fct>

We can treat se as a normal tibble for plotting.

Here we plot the distribution of counts per sample.

se %>%
    ggplot(aes(counts + 1, group=.sample, color=infection)) +
    geom_density() +
    scale_x_log10() +
    theme_bw()

plot of chunk tidySE-plot

To work with genomic coordinates (ranges) with tidy methods, performing subsetting etc, we can use the plyranges package.

Challenge

  1. Extract the miRNA and NonInfected this time using tidyverse commands.
  2. Extract the gene expression levels of the 3 first genes in samples at time 0 and at time 8 this time using tidyverse commands.
  3. Create a bar plot of the sample counts using the se object. Hint use geom_col to plot the summarised counts.

Solution

se %>% 
filter(gene_biotype == "miRNA" & infection == "NonInfected")
# A SummarizedExperiment-tibble abstraction: 49 × 7
# Features=7 | Samples=7 | Assays=counts
   .feature .sample    counts title geo_accession organism age   sex   infection
   <chr>    <chr>       <int> <chr> <chr>         <chr>    <chr> <fct> <fct>    
 1 Mir1901  GSM2545337     45 CNS_… GSM2545337    Mus mus… 8 we… Fema… NonInfec…
 2 Mir378a  GSM2545337     11 CNS_… GSM2545337    Mus mus… 8 we… Fema… NonInfec…
 3 Mir133b  GSM2545337      4 CNS_… GSM2545337    Mus mus… 8 we… Fema… NonInfec…
 4 Mir30c-2 GSM2545337     10 CNS_… GSM2545337    Mus mus… 8 we… Fema… NonInfec…
 5 Mir149   GSM2545337      1 CNS_… GSM2545337    Mus mus… 8 we… Fema… NonInfec…
 6 Mir128-1 GSM2545337      4 CNS_… GSM2545337    Mus mus… 8 we… Fema… NonInfec…
 7 Mir7682  GSM2545337      2 CNS_… GSM2545337    Mus mus… 8 we… Fema… NonInfec…
 8 Mir1901  GSM2545338     44 CNS_… GSM2545338    Mus mus… 8 we… Fema… NonInfec…
 9 Mir378a  GSM2545338      7 CNS_… GSM2545338    Mus mus… 8 we… Fema… NonInfec…
10 Mir133b  GSM2545338      6 CNS_… GSM2545338    Mus mus… 8 we… Fema… NonInfec…
# … with 39 more rows, and 21 more variables: strain <chr>, time <fct>,
#   tissue <fct>, mouse <fct>, center <chr>, gene <chr>, ENTREZID <chr>,
#   product <chr>, gbkey <chr>, external_gene_name <chr>,
#   ensembl_gene_id <chr>, external_synonym <chr>, chromosome_name <chr>,
#   gene_biotype <chr>, phenotype_description <chr>,
#   hsapiens_homolog_associated_gene_name <chr>, seqnames <fct>, start <int>,
#   end <int>, width <int>, strand <fct>
se %>% 
  filter(time == "Day0" | time == "Day8") %>% 
  group_by(.sample) %>% 
  slice(1:3)
tidySummarizedExperiment says: A data frame is returned for independent data analysis.
# A tibble: 42 × 30
# Groups:   .sample [14]
   .feature .sample    counts title geo_accession organism age   sex   infection
   <chr>    <chr>       <int> <chr> <chr>         <chr>    <chr> <fct> <fct>    
 1 Asl      GSM2545336   1170 CNS_… GSM2545336    Mus mus… 8 we… Fema… Influenz…
 2 Apod     GSM2545336  36194 CNS_… GSM2545336    Mus mus… 8 we… Fema… Influenz…
 3 Cyp2d22  GSM2545336   4060 CNS_… GSM2545336    Mus mus… 8 we… Fema… Influenz…
 4 Asl      GSM2545337    361 CNS_… GSM2545337    Mus mus… 8 we… Fema… NonInfec…
 5 Apod     GSM2545337  10347 CNS_… GSM2545337    Mus mus… 8 we… Fema… NonInfec…
 6 Cyp2d22  GSM2545337   1616 CNS_… GSM2545337    Mus mus… 8 we… Fema… NonInfec…
 7 Asl      GSM2545338    400 CNS_… GSM2545338    Mus mus… 8 we… Fema… NonInfec…
 8 Apod     GSM2545338   9173 CNS_… GSM2545338    Mus mus… 8 we… Fema… NonInfec…
 9 Cyp2d22  GSM2545338   1603 CNS_… GSM2545338    Mus mus… 8 we… Fema… NonInfec…
10 Asl      GSM2545341    988 CNS_… GSM2545341    Mus mus… 8 we… Male  Influenz…
# … with 32 more rows, and 21 more variables: strain <chr>, time <fct>,
#   tissue <fct>, mouse <fct>, center <chr>, gene <chr>, ENTREZID <chr>,
#   product <chr>, gbkey <chr>, external_gene_name <chr>,
#   ensembl_gene_id <chr>, external_synonym <chr>, chromosome_name <chr>,
#   gene_biotype <chr>, phenotype_description <chr>,
#   hsapiens_homolog_associated_gene_name <chr>, seqnames <fct>, start <int>,
#   end <int>, width <int>, strand <fct>
se %>%
  group_by(.sample) %>%
  summarise(total_counts=sum(counts)) %>%
  ggplot(aes(x = .sample, y = total_counts)) +
  geom_col() +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90,  hjust = 0.5, vjust = 0.5))
tidySummarizedExperiment says: A data frame is returned for independent data analysis.

plot of chunk unnamed-chunk-35

Interactive exploration

iSEE is a Bioconductor package that provides an interactive graphical user interface for exploring data stored in SummarizedExperiment objects.

# We won't use here but to install and use 
# BiocManager::install('iSEE')
# library(iSEE)
# iSEE(se)

Extensions to SummarizedExperiment

plot of chunk SCE

The Orchestrating Single-Cell Analysis with Bioconductor book is a great resource for single cell analysis that describes using SingleCellExperiment.

As there is tidySummarizedExperiment for interacting with SummarizedExperiment objects using tidyverse commands, there is equivalently the tidySingleCellExperiment package for SingleCellExperiment objects.

There are also extensions to SummarizedExperiment for other data types:

  1. The Bioconductor was initiated by Robert Gentleman, one of the two creators of the R language. Bioconductor provides tools dedicated to omics data analysis. Bioconductor uses the R statistical programming language, and is open source and open development. 

Key Points

  • SummarizedExperiment represents an efficient way to store and to handle omics data

  • SummarizedExperiment is used in many Bioconductor packages

  • We can use tidySummarizedExperiment to interact with SummarizedExperiment objects using tidyverse commands


Bioconductor

Overview

Teaching: XX min
Exercises: XX min
Questions
  • What is Bioconductor?

  • How can I use Bioconductor effectively for my analysis?

Objectives
  • Give an overview of the Bioconductor project including its website

  • Introduce concepts of reproducibility, coherence, interoperability and stability

Bioconductor

In the previous lesson we have already learned a little bit about the Bioconductor[^Bioconductor] project. In this class we will formalize our understanding of the Bioconductor[^Bioconductor] project.

Important: Remember that Bioconductor packages are downloaded via the function BiocManager::install().

Bioconductor is a repository that collects open-source software that facilitates rigorous and reproducible analysis of data from current and emerging biological assays in R. In addition, Bioconductor supports development, education and a thriving community. The The broad goals of the Bioconductor project are:

One of the best ways to explore the Bioconductor project is its website.

The Bioconductor website

Bioconductor website

The website tells us that there are over 2,000 packages. This is obviously way too many packages to explore individually. So how would you find potentially useful packages for the analysis of your dataset?

Exploring different topics with biocViews

In Bioconductor, each package is classified as belonging to different categories. These different categories are called biocViews and they are structured as follows:

Each category is further divided into additional sub-categories, which are divided into sub-sub-categories that refer to specific assays, techniques, and research fields.

BiocViews can help tremendously with identifying packages that could be of use to you in the analysis of your dataset. These are easily searchable once you have navigated to the page that lists all packages.

Challenge

In your groups, identify a research topic of interest and then use the search function to find packages related to this topic. Think about the following:

  • What keywords define your topic of interest?
  • What type of packages are good for beginners and why?
  • What other information on this website may be useful?

Exploiting the detailed documentation of the Bioconductor project

Bioconductor help page

The other part of the Bioconductor website crucial for anyone’s learning journey is the help page. This page collects some outstanding Bioconductor learning resources such as

Most importantly it introduces the concept of vignettes, which are part of Bioconductor’s mission to enhance reproducibility through rigorous documentation. In Bioconductor almost every package (certainly every software package) has to include a vignette. Vignettes are small tutorials that explain common use cases of a package. For example, let’s explore the vignettes available for SummarizedExperiment:

browseVignettes(package = "SummarizedExperiment")

This should open a separate browser tab that will list availabe vignettes. By clicking on the html link you will open a nicely formatted tutorial that you should be easily able to follow. Vignettes are a great place to start when trying to get familar with a new package.

Bioconductor website

Core Bioconductor principles

Bioconductor is organized around some core principles:

While it is not absolutely necessary for the end-user to remember these, the core principles explain some of the idiosyncrasies of Bioconductor that you may come across.

S4 classes

Challenge

Use the function str on the se SummarizedExperiment object you worked with during the last lession. What oddity do you notice? Hint: Compare the output to output of str function applied to rna.

Solution

We can see certain elements of the object starting with an @.

S4 programming allows object-oriented programming in R and thus ensure coherence, stability, and interoperability. Object-oriented programming is a computer programming model that organizes software design around data, or objects, rather than functions and logic. An object can be defined as a data field that has unique attributes and behavior.In S4, certain classes are defined with specific accessibility functions (called methods).

# These two statements will result in the same output.

se@colData
DataFrame with 22 rows and 10 columns
                     title geo_accession     organism         age      sex
               <character>   <character>  <character> <character> <factor>
GSM2545336 CNS_RNA-seq_10C    GSM2545336 Mus musculus     8 weeks   Female
GSM2545337 CNS_RNA-seq_11C    GSM2545337 Mus musculus     8 weeks   Female
GSM2545338 CNS_RNA-seq_12C    GSM2545338 Mus musculus     8 weeks   Female
GSM2545339 CNS_RNA-seq_13C    GSM2545339 Mus musculus     8 weeks   Female
GSM2545340 CNS_RNA-seq_14C    GSM2545340 Mus musculus     8 weeks   Male  
...                    ...           ...          ...         ...      ...
GSM2545353  CNS_RNA-seq_3C    GSM2545353 Mus musculus     8 weeks   Female
GSM2545354  CNS_RNA-seq_4C    GSM2545354 Mus musculus     8 weeks   Male  
GSM2545362  CNS_RNA-seq_5C    GSM2545362 Mus musculus     8 weeks   Female
GSM2545363  CNS_RNA-seq_6C    GSM2545363 Mus musculus     8 weeks   Male  
GSM2545380  CNS_RNA-seq_9C    GSM2545380 Mus musculus     8 weeks   Female
             infection      strain     time     tissue    mouse
              <factor> <character> <factor>   <factor> <factor>
GSM2545336 InfluenzaA      C57BL/6     Day8 Cerebellum       14
GSM2545337 NonInfected     C57BL/6     Day0 Cerebellum       9 
GSM2545338 NonInfected     C57BL/6     Day0 Cerebellum       10
GSM2545339 InfluenzaA      C57BL/6     Day4 Cerebellum       15
GSM2545340 InfluenzaA      C57BL/6     Day4 Cerebellum       18
...                ...         ...      ...        ...      ...
GSM2545353 NonInfected     C57BL/6     Day0 Cerebellum       4 
GSM2545354 NonInfected     C57BL/6     Day0 Cerebellum       2 
GSM2545362 InfluenzaA      C57BL/6     Day4 Cerebellum       20
GSM2545363 InfluenzaA      C57BL/6     Day4 Cerebellum       12
GSM2545380 InfluenzaA      C57BL/6     Day8 Cerebellum       19
colData(se) #colData is a method that allows access to the column meta data
DataFrame with 22 rows and 10 columns
                     title geo_accession     organism         age      sex
               <character>   <character>  <character> <character> <factor>
GSM2545336 CNS_RNA-seq_10C    GSM2545336 Mus musculus     8 weeks   Female
GSM2545337 CNS_RNA-seq_11C    GSM2545337 Mus musculus     8 weeks   Female
GSM2545338 CNS_RNA-seq_12C    GSM2545338 Mus musculus     8 weeks   Female
GSM2545339 CNS_RNA-seq_13C    GSM2545339 Mus musculus     8 weeks   Female
GSM2545340 CNS_RNA-seq_14C    GSM2545340 Mus musculus     8 weeks   Male  
...                    ...           ...          ...         ...      ...
GSM2545353  CNS_RNA-seq_3C    GSM2545353 Mus musculus     8 weeks   Female
GSM2545354  CNS_RNA-seq_4C    GSM2545354 Mus musculus     8 weeks   Male  
GSM2545362  CNS_RNA-seq_5C    GSM2545362 Mus musculus     8 weeks   Female
GSM2545363  CNS_RNA-seq_6C    GSM2545363 Mus musculus     8 weeks   Male  
GSM2545380  CNS_RNA-seq_9C    GSM2545380 Mus musculus     8 weeks   Female
             infection      strain     time     tissue    mouse
              <factor> <character> <factor>   <factor> <factor>
GSM2545336 InfluenzaA      C57BL/6     Day8 Cerebellum       14
GSM2545337 NonInfected     C57BL/6     Day0 Cerebellum       9 
GSM2545338 NonInfected     C57BL/6     Day0 Cerebellum       10
GSM2545339 InfluenzaA      C57BL/6     Day4 Cerebellum       15
GSM2545340 InfluenzaA      C57BL/6     Day4 Cerebellum       18
...                ...         ...      ...        ...      ...
GSM2545353 NonInfected     C57BL/6     Day0 Cerebellum       4 
GSM2545354 NonInfected     C57BL/6     Day0 Cerebellum       2 
GSM2545362 InfluenzaA      C57BL/6     Day4 Cerebellum       20
GSM2545363 InfluenzaA      C57BL/6     Day4 Cerebellum       12
GSM2545380 InfluenzaA      C57BL/6     Day8 Cerebellum       19

Challenge

Why is it bad practice to use the @ to access parts of an object?

Solution

Using methods functions to access certain parts of the object enhances readibility of your code. If the underlying class structure changes (i.e. the name of the element) the method will still work.

If you want to know more about S4 classes you can find more information in these slides on their implementation in the Bioconductor package.

The release cycle

Bioconductor has two releases each year, typically in April and October, where all packages are updated to their next version in a way that they are interoperable (i.e. they do not clash when you load more than one). The releases conincide with the releases of new R versions, which also happen twice a year. This has two significant implications:

1) To ensure that packages on Bioconductor work flawlessly always use BiocManager::install to install a package, even when it is not a technically Bioconductor package. Biconductor mirrors most other R package repositories like CRAN and so the package will be most likely available. This will avoid clashed with packages being ahead of the Bioconductor release. 2) You will need to update your Bioconductor packages twice a year (after updating R) to have all the latest versions. Refer to this guide for updating R qnd R-Studio and this guide for updating Bioconductor.

Working with annotations

Bioconductor provides extensive annotation resources. These can be gene centric, or genome centric. Annotations can be provided in packages curated by Bioconductor, or obtained from web-based resources.

Gene centric AnnotationDbi packages include:

The most popular annotation packages have been modified so that they can make use of a new set of methods to more easily access their contents. These four methods are named: columns, keytypes, keys and select. They can currently be used with all chip, organism, and TxDb packages along with the popular GO.db package.

An extremely common kind of Annotation package is the so called platform based or chip based package type. This package is intended to make the manufacturer labels for a series of probes or probesets to a wide range of gene-based features. A package of this kind will load an ChipDb object. Below is a set of examples to show how you might use the standard 4 methods to interact with an object of this type. First we need to load the package:

suppressPackageStartupMessages({
library(hgu95av2.db)
})

If we list the contents of this package, we can see that one of the many things loaded is an object named after the package “hgu95av2.db”:

ls("package:hgu95av2.db")
 [1] "hgu95av2"              "hgu95av2_dbconn"       "hgu95av2_dbfile"      
 [4] "hgu95av2_dbInfo"       "hgu95av2_dbschema"     "hgu95av2.db"          
 [7] "hgu95av2ACCNUM"        "hgu95av2ALIAS2PROBE"   "hgu95av2CHR"          
[10] "hgu95av2CHRLENGTHS"    "hgu95av2CHRLOC"        "hgu95av2CHRLOCEND"    
[13] "hgu95av2ENSEMBL"       "hgu95av2ENSEMBL2PROBE" "hgu95av2ENTREZID"     
[16] "hgu95av2ENZYME"        "hgu95av2ENZYME2PROBE"  "hgu95av2GENENAME"     
[19] "hgu95av2GO"            "hgu95av2GO2ALLPROBES"  "hgu95av2GO2PROBE"     
[22] "hgu95av2MAP"           "hgu95av2MAPCOUNTS"     "hgu95av2OMIM"         
[25] "hgu95av2ORGANISM"      "hgu95av2ORGPKG"        "hgu95av2PATH"         
[28] "hgu95av2PATH2PROBE"    "hgu95av2PFAM"          "hgu95av2PMID"         
[31] "hgu95av2PMID2PROBE"    "hgu95av2PROSITE"       "hgu95av2REFSEQ"       
[34] "hgu95av2SYMBOL"        "hgu95av2UNIPROT"      

We can look at this object to learn more about it:

hgu95av2.db
ChipDb object:
| DBSCHEMAVERSION: 2.1
| Db type: ChipDb
| Supporting package: AnnotationDbi
| DBSCHEMA: HUMANCHIP_DB
| ORGANISM: Homo sapiens
| SPECIES: Human
| MANUFACTURER: Affymetrix
| CHIPNAME: Affymetrix HG_U95Av2 Array
| MANUFACTURERURL: http://www.affymetrix.com
| EGSOURCEDATE: 2021-Apr14
| EGSOURCENAME: Entrez Gene
| EGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
| CENTRALID: ENTREZID
| TAXID: 9606
| GOSOURCENAME: Gene Ontology
| GOSOURCEURL: http://current.geneontology.org/ontology/go-basic.obo
| GOSOURCEDATE: 2021-02-01
| GOEGSOURCEDATE: 2021-Apr14
| GOEGSOURCENAME: Entrez Gene
| GOEGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
| KEGGSOURCENAME: KEGG GENOME
| KEGGSOURCEURL: ftp://ftp.genome.jp/pub/kegg/genomes
| KEGGSOURCEDATE: 2011-Mar15
| GPSOURCENAME: UCSC Genome Bioinformatics (Homo sapiens)
| GPSOURCEURL: 
| GPSOURCEDATE: 2021-Feb16
| ENSOURCEDATE: 2021-Feb16
| ENSOURCENAME: Ensembl
| ENSOURCEURL: ftp://ftp.ensembl.org/pub/current_fasta
| UPSOURCENAME: Uniprot
| UPSOURCEURL: http://www.UniProt.org/
| UPSOURCEDATE: Mon Apr 26 21:53:12 2021

Please see: help('select') for usage information

If we want to know what kinds of data are retriveable via select, then we should use the columns method like this:

columns(hgu95av2.db)
 [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
 [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
[11] "GENETYPE"     "GO"           "GOALL"        "IPI"          "MAP"         
[16] "OMIM"         "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"        
[21] "PMID"         "PROBEID"      "PROSITE"      "REFSEQ"       "SYMBOL"      
[26] "UCSCKG"       "UNIPROT"     

If we are further curious to know more about those values for columns, we can consult the help pages. Asking about any of these values will pull up a manual page describing the different fields and what they mean.

help("SYMBOL")

If we are curious about what kinds of fields we could potential use as keys to query the database, we can use the keytypes method. In a perfect world, this method will return values very similar to what was returned by columns, but in reality, some kinds of values make poor keys and so this list is often shorter.

keytypes(hgu95av2.db)
 [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
 [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
[11] "GENETYPE"     "GO"           "GOALL"        "IPI"          "MAP"         
[16] "OMIM"         "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"        
[21] "PMID"         "PROBEID"      "PROSITE"      "REFSEQ"       "SYMBOL"      
[26] "UCSCKG"       "UNIPROT"     

If we want to extract some sample keys of a particular type, we can use the keys method.

head(keys(hgu95av2.db, keytype="SYMBOL"))
[1] "A1BG"  "A2M"   "A2MP1" "NAT1"  "NAT2"  "NATP" 

And finally, if we have some keys, we can use select to extract them. By simply using appropriate argument values with select we can specify what keys we want to look up values for (keys), what we want returned back (columns) and the type of keys that we are passing in (keytype).

#1st get some example keys
k <- head(keys(hgu95av2.db,keytype="PROBEID"))
# then call select
select(hgu95av2.db, keys=k, columns=c("SYMBOL","GENENAME"), keytype="PROBEID")
'select()' returned 1:1 mapping between keys and columns
    PROBEID  SYMBOL
1   1000_at   MAPK3
2   1001_at    TIE1
3 1002_f_at CYP2C19
4 1003_s_at   CXCR5
5   1004_at   CXCR5
6   1005_at   DUSP1
                                                         GENENAME
1                              mitogen-activated protein kinase 3
2 tyrosine kinase with immunoglobulin like and EGF like domains 1
3                  cytochrome P450 family 2 subfamily C member 19
4                                C-X-C motif chemokine receptor 5
5                                C-X-C motif chemokine receptor 5
6                                  dual specificity phosphatase 1

An organism level package (an ‘org’ package) uses a central gene identifier (e.g. Entrez Gene id) and contains mappings between this identifier and other kinds of identifiers (e.g. GenBank or Uniprot accession number, RefSeq id, etc.). The name of an org package is always of the form org...db (e.g. org.Sc.sgd.db) where is a 2-letter abbreviation of the organism (e.g. Sc for Saccharomyces cerevisiae) and is an abbreviation (in lower-case) describing the type of central identifier (e.g. sgd for gene identifiers assigned by the Saccharomyces Genome Database, or eg for Entrez Gene ids). Just as the chip packages load a **ChipDb** object, the org packages will load a **OrgDb** object. The following exercise should acquaint you with the use of these methods in the context of an organism package.

Challenge

Display the OrgDb object for the org.Hs.eg.db package. Use the columns method to discover which sorts of annotations can be extracted from it. Finally, use the keys method to extract UNIPROT identifiers and then pass those keys in to the select method in such a way that you extract the gene symbol and KEGG pathway information for each. Use the help system as needed to learn which values to pass in to columns in order to achieve this.

Solution

library(org.Hs.eg.db)
uniKeys <- head(keys(org.Hs.eg.db, keytype="UNIPROT"))
cols <- c("SYMBOL", "PATH")
select(org.Hs.eg.db, keys=uniKeys, columns=cols, keytype="UNIPROT")

A TxDb package (a ’TxDb’ package) connects a set of genomic coordinates to various transcript oriented features. The package can also contain Identifiers to features such as genes and transcripts, and the internal schema describes the relationships between these different elements. All TxDb containing packages follow a specific naming scheme that tells where the data came from as well as which build of the genome it comes from.

Challenge

Display the TxDb object for the TxDb.Hsapiens.UCSC.hg19.knownGene package. As before, use the columns and keytypes methods to discover which sorts of annotations can be extracted from it. Use the keys method to extract just a few gene identifiers and then pass those keys in to the select method in such a way that you extract the transcript ids and transcript starts for each.

Solution

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
keys <- head(keys(txdb, keytype="GENEID"))
cols <- c("TXID", "TXSTART")
select(txdb, keys=keys, columns=cols, keytype="GENEID")

Key Points

  • Bioconductor collates interoperable packages for analyses of biological data.

  • Using biocView can help you to find packages for a specific topic.