Data organisation with Spreadsheets
Overview
Teaching: XX min
Exercises: XX min
Questions
How to organise tabular data?
Objectives
Learn about spreadsheets, their strengths and weaknesses
How do we format data in spreadsheets for effective data use?
Learn about common spreadsheet errors and how to correct them.
Organize your data according to tidy data principles.
Learn about text-based spreadsheet formats such as the comma-separated (CSV) or tab-separated formats.
Spreadsheet programs
Question
- What are basic principles for using spreadsheets for good data organization?
Objective
- Describe best practices for organizing data so computers can make the best use of data sets.
Keypoint
- Good data organization is the foundation of any research project.
Good data organization is the foundation of your research project. Most researchers have data or do data entry in spreadsheets. Spreadsheet programs are very useful graphical interfaces for designing data tables and handling very basic data quality control functions. See also @Broman:2018.
Spreadsheet outline
Spreadsheets are good for data entry. Therefore we have a lot of data in spreadsheets. Much of your time as a researcher will be spent in this ‘data wrangling’ stage. It’s not the most fun, but it’s necessary. We’ll teach you how to think about data organization and some practices for more effective data wrangling.
What this lesson will not teach you
- How to do statistics in a spreadsheet
- How to do plotting in a spreadsheet
- How to write code in spreadsheet programs
If you’re looking to do this, a good reference is Head First Excel, published by O’Reilly.
Why aren’t we teaching data analysis in spreadsheets?
- Data analysis in spreadsheets usually requires a lot of manual work. If you want to change a parameter or run an analysis with a new dataset, you usually have to redo everything by hand. (We do know that you can create macros, but see the next point.)
- It is also difficult to track or reproduce statistical or plotting analyses done in spreadsheet programs when you want to go back to your work or someone asks for details of your analysis.
Many spreadsheet programs are available. Since most participants utilise Excel as their primary spreadsheet program, this lesson will make use of Excel examples. A free spreadsheet program that can also be used is LibreOffice. Commands may differ a bit between programs, but the general idea is the same.
Spreadsheet programs encompass a lot of the things we need to be able to do as researchers. We can use them for:
- Data entry
- Organizing data
- Subsetting and sorting data
- Statistics
- Plotting
Spreadsheet programs use tables to represent and display data. Data formatted as tables is also the main theme of this chapter, and we will see how to organise data into tables in a standardised way to ensure efficient downstream analysis.
Problems with Spreadsheets
Spreadsheets are good for data entry, but in reality we tend to use spreadsheet programs for much more than data entry. We use them to create data tables for publications, to generate summary statistics, and make figures.
Generating tables for publications in a spreadsheet is not optimal - often, when formatting a data table for publication, we’re reporting key summary statistics in a way that is not really meant to be read as data, and often involves special formatting (merging cells, creating borders, making it pretty). We advise you to do this sort of operation within your document editing software.
The latter two applications, generating statistics and figures, should be used with caution: because of the graphical, drag and drop nature of spreadsheet programs, it can be very difficult, if not impossible, to replicate your steps (much less retrace anyone else’s), particularly if your stats or figures require you to do more complex calculations. Furthermore, in doing calculations in a spreadsheet, it’s easy to accidentally apply a slightly different formula to multiple adjacent cells. When using a command-line based statistics program like R or SAS, it’s practically impossible to apply a calculation to one observation in your dataset but not another unless you’re doing it on purpose.
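To see why scripted calculations are safer, here is a minimal sketch in Python with pandas (the table and column names are illustrative, not from the lesson's dataset): a single expression applies the same formula to every row, so there is no per-cell formula that can silently drift.

```python
import pandas as pd

# Toy patient table; column names are illustrative
patients = pd.DataFrame({"weight_kg": [61.2, 72.5, 80.1]})

# One expression converts every row with the same formula -
# unlike a spreadsheet, no individual cell can hold a slightly
# different version of the calculation
patients["weight_lb"] = patients["weight_kg"] * 2.20462
```

Rerunning the script on a corrected or extended dataset reproduces the calculation exactly, which is the point made above about R or SAS.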
Using Spreadsheets for Data Entry and Cleaning
However, there are circumstances where you might want to use a spreadsheet program to produce “quick and dirty” calculations or figures, and data cleaning will help you use some of these features. Data cleaning also puts your data in a better format prior to importation into a statistical analysis program. We will show you how to use some features of spreadsheet programs to check your data quality along the way and produce preliminary summary statistics.
In this lesson, we will assume that you are most likely using Excel as your primary spreadsheet program - there are others (Gnumeric, Calc from OpenOffice), and their functionality is similar, but Excel seems to be the program most used by biologists and biomedical researchers.
In this lesson we’re going to talk about:
- Formatting data tables in spreadsheets
- Formatting problems
- Dates as data
- Quality control
- Exporting data
Formatting data tables in spreadsheets
Questions
- How do we format data in spreadsheets for effective data use?
Objectives
- Describe best practices for data entry and formatting in spreadsheets.
- Apply best practices to arrange variables and observations in a spreadsheet.
Keypoints
- Never modify your raw data. Always make a copy before making any changes.
- Keep track of all of the steps you take to clean your data in a plain text file.
- Organize your data according to tidy data principles.
The most common mistake made is treating spreadsheet programs like lab notebooks, that is, relying on context, notes in the margin, spatial layout of data and fields to convey information. As humans, we can (usually) interpret these things, but computers don’t view information the same way, and unless we explain to the computer what every single thing means (and that can be hard!), it will not be able to see how our data fits together.
Using the power of computers, we can manage and analyze data in much more effective and faster ways, but to use that power, we have to set up our data for the computer to be able to understand it (and computers are very literal).
This is why it’s extremely important to set up well-formatted tables from the outset - before you even start entering data from your very first preliminary experiment. Data organization is the foundation of your research project. It can make it easier or harder to work with your data throughout your analysis, so it’s worth thinking about when you’re doing your data entry or setting up your experiment. You can set things up in different ways in spreadsheets, but some of these choices can limit your ability to work with the data in other programs, or make it harder for the you-of-6-months-from-now or for your collaborators to work with the data.
Note: the best layouts/formats (as well as software and interfaces) for data entry and data analysis might be different. It is important to take this into account, and ideally automate the conversion from one to another.
Keeping track of your analyses
When you’re working with spreadsheets, during data clean up or analyses, it’s very easy to end up with a spreadsheet that looks very different from the one you started with. In order to be able to reproduce your analyses or figure out what you did when a reviewer or instructor asks for a different analysis, you should
- create a new file with your cleaned or analyzed data. Don’t modify the original dataset, or you will never know where you started!
- keep track of the steps you took in your clean up or analysis. You should track these steps as you would any step in an experiment. We recommend that you do this in a plain text file stored in the same folder as the data file.
This might be an example of a spreadsheet setup:
Put these principles into practice today during your exercises.
While versioning is out of scope for this course, you can look at the Carpentries lesson on ‘Git’ to learn how to maintain version control over your data. See also this blog post for a quick tutorial or @Perez-Riverol:2016 for a more research-oriented use-case.
Structuring data in spreadsheets
The cardinal rules of using spreadsheet programs for data:
- Put all your variables in columns - the thing you’re measuring, like ‘weight’ or ‘temperature’.
- Put each observation in its own row.
- Don’t combine multiple pieces of information in one cell. Sometimes it just seems like one thing, but think if that’s the only way you’ll want to be able to use or sort that data.
- Leave the raw data raw - don’t change it!
- Export the cleaned data to a text-based format like CSV (comma-separated values) format. This ensures that anyone can use the data, and is required by most data repositories.
For instance, we have data from patients that visited several hospitals from Brussels, Belgium. They recorded the date of the visit, the hospital, the patients’ gender, weight and blood group.
If we were to keep track of the data like this:
the problem is that the ABO and Rhesus groups are in the same Blood type column. So, if we wanted to look at all observations of the A group or look at weight distributions by ABO group, it would be tricky to do this using this data setup. If instead we put the ABO and Rhesus groups in different columns, you can see that it would be much easier.
An important rule when setting up a datasheet is that columns are used for variables and rows are used for observations:
- columns are variables
- rows are observations
- cells are individual values
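The blood-type example above can be sketched in a few lines of Python with pandas (the column names here are illustrative): once ABO and Rhesus live in separate columns, subsetting by either one is trivial.

```python
import pandas as pd

# Hypothetical records with ABO and Rhesus combined in one cell
records = pd.DataFrame({"blood_type": ["A+", "B+", "AB-", "O+"]})

# Split into one variable per column: ABO group and Rhesus factor
records["abo_group"] = records["blood_type"].str[:-1]
records["rhesus"] = records["blood_type"].str[-1]

# Now selecting all observations of the A group is a one-liner
a_group = records[records["abo_group"] == "A"]
```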
Challenge: We’re going to take a messy dataset and describe how we would clean it up.
Download the messy dataset by clicking here.
Open up the data in a spreadsheet program.
You can see that there are two tabs. The data contains various clinical variables recorded in various hospitals in Brussels during the first and second COVID-19 waves in 2020. As you can see, the data have been recorded differently during the March and November waves. Now you’re the person in charge of this project and you want to be able to start analyzing the data.
With the person next to you, identify what is wrong with this spreadsheet. Also discuss the steps you would need to take to clean up first and second wave tabs, and to put them all together in one spreadsheet.
Important: Do not forget our first piece of advice: create a new file (or tab) for the cleaned data, and never modify your original (raw) data. However, in this case we will have taken care of this for you!
After you go through this exercise, we’ll discuss as a group what was wrong with this data and how you would fix it.
- Take about 7 minutes to work on this exercise.
- All the mistakes in the common mistakes section below are present in the messy dataset. If the exercise is done during a workshop, ask people what they saw as wrong with the data. As they bring up different points, you can refer to the common mistakes or expand a bit on the point they brought up.
- If you get a response where they’ve fixed the date, you can pause and go to the dates lesson. Or you can say you’ll come back to dates at the end.
Challenge: Once you have tidied up the data, answer the following questions:
- How many men and women took part in the study?
- How many A, AB, and B types have been tested?
- As above, but disregarding the contaminated samples?
- How many Rhesus + and - have been tested?
- How many universal donors (0-) have been tested?
- What is the average weight of AB men?
- How many samples have been tested in the different hospitals?
An excellent reference, in particular with regard to R scripting is the Tidy Data paper @Wickham:2014.
Common Spreadsheet Errors
Questions
- What are some common challenges with formatting data in spreadsheets and how can we avoid them?
Objectives
- Recognize and resolve common spreadsheet formatting problems.
Keypoints
- Avoid using multiple tables within one spreadsheet.
- Avoid spreading data across multiple tabs.
- Record zeros as zeros.
- Use an appropriate null value to record missing data.
- Don’t use formatting to convey information or to make your spreadsheet look pretty.
- Place comments in a separate column.
- Record units in column headers.
- Include only one piece of information in a cell.
- Avoid spaces, numbers and special characters in column headers.
- Avoid special characters in your data.
- Record metadata in a separate plain text file.
There are a few potential errors to be on the lookout for in your own data as well as data from collaborators or the Internet. If you are aware of the errors and their possible negative effects on downstream data analysis and result interpretation, it might motivate you and your project members to try and avoid them. Making small changes to the way you format your data in spreadsheets can have a great impact on efficiency and reliability when it comes to data cleaning and analysis.
- Using multiple tables
- Using multiple tabs
- Not filling in zeros
- Using problematic null values
- Using formatting to convey information
- Using formatting to make the data sheet look pretty
- Placing comments or units in cells
- Entering more than one piece of information in a cell
- Using problematic field names
- Using special characters in data
- Inclusion of metadata in data table
Using multiple tables
A common strategy is creating multiple data tables within one spreadsheet. This confuses the computer, so don’t do this! When you create multiple tables within one spreadsheet, you’re drawing false associations between things for the computer, which sees each row as an observation. You’re also potentially using the same field name in multiple places, which will make it harder to clean your data up into a usable form. The example below depicts the problem:
In the example above, the computer will see (for example) row 4 and assume that all columns A-AF refer to the same sample. This row actually represents four distinct samples (sample 1 for each of four different collection dates - May 29th, June 12th, June 19th, and June 26th), as well as some calculated summary statistics (an average (avr) and standard error of measurement (SEM)) for two of those samples. Other rows are similarly problematic.
Using multiple tabs
But what about workbook tabs? That seems like an easy way to organize data, right? Well, yes and no. When you create extra tabs, you fail to allow the computer to see connections in the data that are there (you have to introduce spreadsheet application-specific functions or scripting to ensure this connection). Say, for instance, you make a separate tab for each day you take a measurement.
This isn’t good practice for two reasons:
-
you are more likely to accidentally add inconsistencies to your data if each time you take a measurement, you start recording data in a new tab, and
-
even if you manage to prevent all inconsistencies from creeping in, you will add an extra step for yourself before you analyze the data because you will have to combine these data into a single datatable. You will have to explicitly tell the computer how to combine tabs - and if the tabs are inconsistently formatted, you might even have to do it manually.
The next time you’re entering data, and you go to create another tab or table, ask yourself if you could avoid adding this tab by adding another column to your original spreadsheet. We used multiple tabs in our example of a messy data file, but now you’ve seen how you can reorganize your data to consolidate across tabs.
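Consolidating per-tab tables is also easy to script. A minimal sketch in Python with pandas (the hospital names and columns are made up for illustration): each tab becomes one table, and a new column records which wave each row came from, so no information is lost.

```python
import pandas as pd

# Hypothetical per-wave tables, as if each had come from its own tab
wave1 = pd.DataFrame({"hospital": ["St Luc", "Erasme"], "weight_kg": [70.0, 65.5]})
wave2 = pd.DataFrame({"hospital": ["Brugmann"], "weight_kg": [82.3]})

# Keep the tab's identity as data: add a 'wave' column, then stack
wave1["wave"] = 1
wave2["wave"] = 2
combined = pd.concat([wave1, wave2], ignore_index=True)
```

This only works smoothly when the tabs share consistent headers, which is exactly the consistency argument made above.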
Your data sheet might get very long over the course of the experiment. This makes it harder to enter data if you can’t see your headers at the top of the spreadsheet. But don’t repeat your header row. These can easily get mixed into the data, leading to problems down the road. Instead you can freeze the column headers so that they remain visible even when you have a spreadsheet with many rows.
Not filling in zeros
It might be that when you’re measuring something, it’s usually a zero, say the number of times a rabbit is observed in the survey. Why bother writing in the number zero in that column, when it’s mostly zeros?
However, there’s a difference between a zero and a blank cell in a spreadsheet. To the computer, a zero is actually data. You measured or counted it. A blank cell means that it wasn’t measured and the computer will interpret it as an unknown value (also known as a null or missing value).
The spreadsheets or statistical programs will likely mis-interpret blank cells that you intend to be zeros. By not entering the value of your observation, you are telling your computer to represent that data as unknown or missing (null). This can cause problems with subsequent calculations or analyses. For example, the average of a set of numbers which includes a single null value is always null (because the computer can’t guess the value of the missing observations). Because of this, it’s very important to record zeros as zeros and truly missing data as nulls.
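The difference shows up immediately in a calculation. A minimal sketch with NumPy (the counts are invented for illustration):

```python
import numpy as np

observed = [2, 0, 1, 3]           # zeros recorded as zeros
with_blank = [2, np.nan, 1, 3]    # a blank cell read in as NaN (missing)

np.mean(observed)       # 1.5 - the zero counts as data
np.mean(with_blank)     # nan - the missing value propagates
np.nanmean(with_blank)  # 2.0 - missing value dropped: a different answer
```

Whether the second value was a zero or a blank changes the average, which is why the distinction must be recorded faithfully.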
Using problematic null values
Example: using -999 or other numerical values (or zero) to represent missing data.
Solutions:
There are a few reasons why null values get represented differently within a dataset. Sometimes confusing null values are automatically recorded from the measuring device. If that’s the case, there’s not much you can do, but it can be addressed in data cleaning with a tool like OpenRefine before analysis. Other times different null values are used to convey different reasons why the data isn’t there. This is important information to capture, but it is in effect using one column to capture two pieces of information. Like for [using formatting to convey information](#formatting), it would be good here to create a new column like ‘data_missing’ and use that column to capture the different reasons.
Whatever the reason, it’s a problem if unknown or missing data is recorded as -999, 999, or 0.
Many statistical programs will not recognize that these are intended to represent missing (null) values. How these values are interpreted will depend on the software you use to analyze your data. It is essential to use a clearly defined and consistent null indicator.
Blanks (most applications) and NA (for R) are good choices. @White:2013 explain good choices for indicating null values for different software applications in their article.
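If a sentinel like -999 is already in your data, most analysis tools can be told what it means at import time. A minimal sketch with pandas (the file contents are invented for illustration):

```python
import pandas as pd
from io import StringIO

# Hypothetical file using -999 as a missing-data sentinel
raw = StringIO("id,count\n1,5\n2,-999\n3,0\n")

# Declare which token means "missing"; real zeros stay zeros
df = pd.read_csv(raw, na_values=["-999"])
```

Without the `na_values` argument, -999 would silently enter any averages or sums as a real number.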
Using formatting to convey information
Example: highlighting cells, rows or columns that should be excluded from an analysis, leaving blank rows to indicate separations in data.
Solution: create a new field to encode which data should be excluded.
Using formatting to make the data sheet look pretty
Example: merging cells.
Solution: If you’re not careful, formatting a worksheet to be more aesthetically pleasing can compromise your computer’s ability to see associations in the data. Merged cells will make your data unreadable by statistics software. Consider restructuring your data in such a way that you will not need to merge cells to organize your data.
Placing comments or units in cells
Most analysis software can’t see Excel or LibreOffice comments, and would be confused by comments placed within your data cells. As described above for formatting, create another field if you need to add notes to cells. Similarly, don’t include units in cells: ideally, all the measurements you place in one column should be in the same unit, but if for some reason they aren’t, create another field and specify the units the cell is in.
Entering more than one piece of information in a cell
Example: Recording ABO and Rhesus groups in one cell, such as A+, B+, A-, …
Solution: Don’t include more than one piece of information in a cell. This will limit the ways in which you can analyze your data. If you need both these measurements, design your data sheet to include this information. For example, include one column for the ABO group and one for the Rhesus group.
Using problematic field names
Choose descriptive field names, but be careful not to include spaces, numbers, or special characters of any kind. Spaces can be misinterpreted by parsers that use whitespace as delimiters and some programs don’t like field names that are text strings that start with numbers.
Underscores (`_`) are a good alternative to spaces. Consider writing names in camel case (like this: ExampleFileName) to improve readability. Remember that abbreviations that make sense at the moment may not be so obvious in 6 months, but don’t overdo it with names that are excessively long. Including the units in the field names avoids confusion and enables others to readily interpret your fields.
Examples
| Good Name | Good Alternative | Avoid |
|---|---|---|
| Max_temp_C | MaxTemp | Maximum Temp (°C) |
| Precipitation_mm | Precipitation | precmm |
| Mean_year_growth | MeanYearGrowth | Mean growth/year |
| sex | sex | M/F |
| weight | weight | w. |
| cell_type | CellType | Cell Type |
| Observation_01 | first_observation | 1st Obs |
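Headers can also be checked by script before analysis. A minimal sketch (the regular expression encodes one reasonable convention: start with a letter, then letters, digits, or underscores):

```python
import re

# One reasonable convention: a letter, then letters/digits/underscores
VALID_NAME = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")

def check_headers(names):
    """Return the headers likely to cause trouble downstream."""
    return [name for name in names if not VALID_NAME.match(name)]

bad = check_headers(["Max_temp_C", "Maximum Temp (°C)", "1st Obs", "sex"])
```

Running this on the table above flags exactly the names in the "Avoid" column.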
Using special characters in data
Example: You treat your spreadsheet program as a word processor when writing notes, for example copying data directly from Word or other applications.
Solution: This is a common strategy. For example, when writing longer text in a cell, people often include line breaks, em-dashes, etc in their spreadsheet. Also, when copying data in from applications such as Word, formatting and fancy non-standard characters (such as left- and right-aligned quotation marks) are included. When exporting this data into a coding/statistical environment or into a relational database, dangerous things may occur, such as lines being cut in half and encoding errors being thrown.
General best practice is to avoid adding characters such as newlines, tabs, and vertical tabs. In other words, treat a text cell as if it were a simple web form that can only contain text and spaces.
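If such characters have already crept into your data, they can be scrubbed in an automated way. A minimal sketch (the replacement table covers only the common offenders mentioned above and is not exhaustive):

```python
def clean_cell(text):
    """Replace curly quotes and em-dashes, and collapse embedded line breaks."""
    replacements = {
        "\u201c": '"', "\u201d": '"',        # left/right double quotes
        "\u2018": "'", "\u2019": "'",        # left/right single quotes
        "\u2014": "-",                       # em-dash
        "\n": " ", "\t": " ", "\x0b": " ",   # newlines, tabs, vertical tabs
    }
    for bad, good in replacements.items():
        text = text.replace(bad, good)
    return " ".join(text.split())  # collapse repeated whitespace
```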
Inclusion of metadata in data table
Example: You add a legend at the top or bottom of your data table explaining column meaning, units, exceptions, etc.
Solution: Recording data about your data (“metadata”) is essential. You may be on intimate terms with your dataset while you are collecting and analysing it, but the chances that you will still remember that the variable “sglmemgp” means single member of group, for example, or the exact algorithm you used to transform a variable or create a derived one, after a few months, a year, or more are slim.
As well, there are many reasons other people may want to examine or use your data - to understand your findings, to verify your findings, to review your submitted publication, to replicate your results, to design a similar study, or even to archive your data for access and re-use by others. While digital data by definition are machine-readable, understanding their meaning is a job for human beings. The importance of documenting your data during the collection and analysis phase of your research cannot be overestimated, especially if your research is going to be part of the scholarly record.
However, metadata should not be contained in the data file itself. Unlike a table in a paper or a supplemental file, metadata (in the form of legends) should not be included in a data file since this information is not data, and including it can disrupt how computer programs interpret your data file. Rather, metadata should be stored as a separate file in the same directory as your data file, preferably in plain text format with a name that clearly associates it with your data file. Because metadata files are free text format, they also allow you to encode comments, units, information about how null values are encoded, etc. that are important to document but can disrupt the formatting of your data file.
Additionally, file or database level metadata describes how the files that make up the dataset relate to each other; what format they are in; and whether they supersede or are superseded by previous files. A folder-level readme.txt file is the classic way of accounting for all the files and folders in a project.
(Text on metadata adapted from the online course Research Data MANTRA by EDINA and Data Library, University of Edinburgh. MANTRA is licensed under a Creative Commons Attribution 4.0 International License.)
Exporting data
Question
- How can we export data from spreadsheets in a way that is useful for downstream applications?
Objectives
- Store spreadsheet data in universal file formats.
- Export data from a spreadsheet to a CSV file.
Keypoints
- Data stored in common spreadsheet formats will often not be read correctly into data analysis software, introducing errors into your data.
- Exporting data from spreadsheets to formats like CSV or TSV puts it in a format that can be used consistently by most programs.
Storing the data you’re going to work with for your analyses in Excel default file format (`*.xls` or `*.xlsx` - depending on the Excel version) isn’t a good idea. Why?
- Because it is a proprietary format, and it is possible that, in the future, the technology needed to open the file will no longer exist (or will become sufficiently rare), making it inconvenient, if not impossible, to open the file.
- Other spreadsheet software may not be able to open files saved in a proprietary Excel format.
- Different versions of Excel may handle data differently, leading to inconsistencies. Dates are a well-documented example of inconsistencies in data storage.
- Finally, more journals and grant agencies are requiring you to deposit your data in a data repository, and most of them don’t accept Excel format. It needs to be in one of the formats discussed below.
- The above points also apply to other formats, such as the open data formats used by LibreOffice / Open Office. These formats are not static and do not get parsed the same way by different software packages.
Storing data in a universal, open, and static format will help deal with this problem. Try tab-delimited (tab separated values or TSV) or comma-delimited (comma separated values or CSV). CSV files are plain text files where the columns are separated by commas, hence ‘comma separated values’ or CSV. The advantage of a CSV file over an Excel/SPSS/etc. file is that we can open and read a CSV file using just about any software, including plain text editors like TextEdit or NotePad. Data in a CSV file can also be easily imported into other formats and environments, such as SQLite and R. We’re not tied to a certain version of a certain expensive program when we work with CSV files, so it’s a good format to work with for maximum portability and endurance. Most spreadsheet programs can save to delimited text formats like CSV easily, although they may give you a warning during the file export.
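The portability claim is easy to demonstrate: any language's standard library can write and read this format without Excel installed. A minimal sketch in Python (the species rows reuse the example data shown later in this lesson):

```python
import csv
import io

rows = [
    ["species_id", "genus", "species", "taxa"],
    ["AB", "Amphispiza", "bilineata", "Bird"],
]

# Write CSV into an in-memory buffer using only the standard library
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
text = buffer.getvalue()
```

The same `csv` module reads the text back, and a plain text editor can open it too.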
To save a file you have opened in Excel in CSV format:
- From the top menu select ‘File’ and ‘Save as’.
- In the ‘Format’ field, from the list, select ‘Comma Separated Values’ (`*.csv`).
- Double check the file name and the location where you want to save it and hit ‘Save’.
An important note for backwards compatibility: you can open CSV files in Excel!
A note on R and `xls`: There are R packages that can read `xls` files (as well as Google spreadsheets). It is even possible to access different worksheets in the `xls` documents.
But
- some of these only work on Windows
- this equates to replacing a (simple but manual) export to `csv` with additional complexity/dependencies in the data analysis R code
- data formatting best practices still apply
- Is there really a good reason why `csv` (or similar) is not adequate?
Caveats on commas
In some datasets, the data values themselves may include commas (,). In that case, the software which you use (including Excel) will most likely incorrectly display the data in columns. This is because the commas which are a part of the data values will be interpreted as delimiters.
For example, our data might look like this:
species_id,genus,species,taxa
AB,Amphispiza,bilineata,Bird
AH,Ammospermophilus,harrisi,Rodent, not censused
AS,Ammodramus,savannarum,Bird
BA,Baiomys,taylori,Rodent
In the record `AH,Ammospermophilus,harrisi,Rodent, not censused` the value for `taxa` includes a comma (`Rodent, not censused`). If we try to read the above into Excel (or another spreadsheet program), we will get something like this:
The value for `taxa` was split into two columns (instead of being put in one column `D`). This can propagate to a number of further errors. For example, the extra column will be interpreted as a column with many missing values (and without a proper header). In addition to that, the value in column `D` for the record in row 3 (so the one where the value for ‘taxa’ contained the comma) is now incorrect.
If you want to store your data in `csv` format and expect that your data values may contain commas, you can avoid the problem discussed above by putting the values in quotes (“”). Applying this rule, our data might look like this:
species_id,genus,species,taxa
"AB","Amphispiza","bilineata","Bird"
"AH","Ammospermophilus","harrisi","Rodent, not censused"
"AS","Ammodramus","savannarum","Bird"
"BA","Baiomys","taylori","Rodent"
Now opening this file as a `csv` in Excel will not lead to an extra column, because Excel will only use commas that fall outside of quotation marks as delimiting characters.
Alternatively, if you are working with data that contains commas, you likely will need to use another delimiter when working in a spreadsheet[^1]. In this case, consider using tabs as your delimiter and working with TSV files. TSV files can be exported from spreadsheet programs in the same way as CSV files.
If you are working with an already existing dataset in which the data values are not included in “” but which have commas as both delimiters and parts of data values, you are potentially facing a major problem with data cleaning. If the dataset you’re dealing with contains hundreds or thousands of records, cleaning them up manually (by either removing commas from the data values or putting the values into quotes - “”) is not only going to take hours and hours but may potentially end up with you accidentally introducing many errors.
Cleaning up datasets is one of the major problems in many scientific disciplines. The approach almost always depends on the particular context. However, it is a good practice to clean the data in an automated fashion, for example by writing and running a script. The Python and R lessons will give you the basis for developing skills to build relevant scripts.
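As a taste of what such a script looks like, here is a minimal sketch in Python with pandas (the column names, the `clean` helper, and the -999 sentinel are all illustrative assumptions, not part of the lesson's dataset):

```python
import pandas as pd

def clean(df):
    """Normalise headers and replace the -999 sentinel with a proper missing value."""
    df = df.copy()  # leave the raw data raw
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    return df.replace({-999: pd.NA})

messy = pd.DataFrame({"Blood Type": ["A+", "O-"], "Weight": [70, -999]})
tidy = clean(messy)
```

Because the steps live in a function, the same cleanup can be re-run identically on hundreds of files, which manual editing cannot guarantee.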
Summary
A typical data analysis workflow is illustrated in the figure above, where data is repeatedly transformed, visualised, and modelled. This iteration is repeated multiple times until the data is understood. In many real-life cases, however, most time is spent cleaning up and preparing the data, rather than actually analysing and understanding it.
An agile data analysis workflow, with several fast iterations of the transform/visualise/model cycle, is only feasible if the data is formatted in a predictable way and one can reason about the data without having to look at it and/or fix it.
[^1]: This is particularly relevant in European countries where the comma is used as a decimal separator. In such cases, the default value separator in a csv file will be the semi-colon (;), or values will be systematically quoted.
Key Points
Good data organization is the foundation of any research project.
R and RStudio
Overview
Teaching: XX min
Exercises: XX min
Questions
What are R and RStudio?
Objectives
Describe the purpose of the RStudio Script, Console, Environment, and Plots panes.
Organize files and directories for a set of analyses as an R project, and understand the purpose of the working directory.
Use the built-in RStudio help interface to search for more information on R functions.
Demonstrate how to provide sufficient information for troubleshooting with the R user community.
What is R? What is RStudio?
The term R is used to refer to both the programming language, the environment for statistical computing and the software that interprets the scripts written using it.
RStudio is currently a very popular way to not only write your R scripts but also to interact with the R software1. To function correctly, RStudio needs R and therefore both need to be installed on your computer.
The RStudio IDE Cheat Sheet provides much more information than will be covered here, but can be useful for learning keyboard shortcuts and discovering new features.
Why learn R?
R does not involve lots of pointing and clicking, and that’s a good thing
The learning curve might be steeper than with other software, but with R, the results of your analysis do not rely on remembering a succession of pointing and clicking, but instead on a series of written commands, and that’s a good thing! So, if you want to redo your analysis because you collected more data, you don’t have to remember which button you clicked in which order to obtain your results; you just have to run your script again.
Working with scripts makes the steps you used in your analysis clear, and the code you write can be inspected by someone else who can give you feedback and spot mistakes.
Working with scripts forces you to have a deeper understanding of what you are doing, and facilitates your learning and comprehension of the methods you use.
R code is great for reproducibility
Reproducibility means that someone else (including your future self) can obtain the same results from the same dataset when using the same analysis code.
R integrates with other tools to generate manuscripts or reports from your code. If you collect more data, or fix a mistake in your dataset, the figures and the statistical tests in your manuscript or report are updated automatically.
An increasing number of journals and funding agencies expect analyses to be reproducible, so knowing R will give you an edge with these requirements.
R is interdisciplinary and extensible
With 10000+ packages2 that can be installed to extend its capabilities, R provides a framework that allows you to combine statistical approaches from many scientific disciplines to best suit the analyses you need for your data. For instance, R has packages for image analysis, GIS, time series, population genetics, and a lot more.
R works on data of all shapes and sizes
The skills you learn with R scale easily with the size of your dataset. Whether your dataset has hundreds or millions of lines, it won’t make much difference to you.
R is designed for data analysis. It comes with special data structures and data types that make handling of missing data and statistical factors convenient.
R can connect to spreadsheets, databases, and many other data formats, on your computer or on the web.
R produces high-quality graphics
The plotting functionalities in R are extensive, and allow you to adjust any aspect of your graph to convey most effectively the message from your data.
R has a large and welcoming community
Thousands of people use R daily. Many of them are willing to help you through mailing lists and websites such as Stack Overflow, or on the RStudio community. This broad user community extends to specialised areas such as bioinformatics.
Not only is R free, but it is also open-source and cross-platform
Anyone can inspect the source code to see how R works. Because of this transparency, there is less chance for mistakes, and if you (or someone else) find some, you can report and fix bugs.
Knowing your way around RStudio
Let’s start by learning about RStudio, which is an Integrated Development Environment (IDE) for working with R.
The RStudio IDE open-source product is free under the Affero General Public License (AGPL) v3. The RStudio IDE is also available with a commercial license and priority email support from RStudio, Inc.
We will use the RStudio IDE to write code, navigate the files on our computer, inspect the variables we are going to create, and visualize the plots we will generate. RStudio can also be used for other things (e.g., version control, developing packages, writing Shiny apps) that we will not cover during the workshop.
The RStudio window is divided into 4 “Panes”:
- the Source for your scripts and documents (top-left, in the default layout)
- your Environment/History (top-right),
- your Files/Plots/Packages/Help/Viewer (bottom-right), and
- the R Console (bottom-left).
The placement of these panes and their content can be customized (see menu, Tools -> Global Options -> Pane Layout).
One of the advantages of using RStudio is that all the information you need to write code is available in a single window. Additionally, with many shortcuts, autocompletion, and highlighting for the major file types you use while developing in R, RStudio will make typing easier and less error-prone.
Getting set up
It is good practice to keep a set of related data, analyses, and text self-contained in a single folder, called the working directory. All of the scripts within this folder can then use relative paths to files that indicate where inside the project a file is located (as opposed to absolute paths, which point to where a file is on a specific computer). Working this way makes it a lot easier to move your project around on your computer and share it with others without worrying about whether or not the underlying scripts will still work.
RStudio provides a helpful set of tools to do this through its “Projects” interface, which not only creates a working directory for you, but also remembers its location (allowing you to quickly navigate to it) and optionally preserves custom settings and open files to make it easier to resume work after a break. Go through the steps for creating an “R Project” for this tutorial below.
- Start RStudio.
- Under the File menu, click on New project. Choose New directory, then New project.
- Enter a name for this new folder (or “directory”), and choose a convenient location for it. This will be your working directory for this session (or whole course) (e.g., bioc-intro).
- Click on Create project.
- (Optional) Set Preferences to ‘Never’ save workspace in RStudio.
RStudio’s default preferences generally work well, but saving a workspace to .RData can be cumbersome, especially if you are working with larger datasets. To turn that off, go to Tools -> Global Options and select the ‘Never’ option for ‘Save workspace to .RData on exit’.
To avoid character encoding issues between Windows and other operating systems, we are going to set UTF-8 as the default encoding.
Organizing your working directory
Using a consistent folder structure across your projects will help keep things organized, and will also make it easy to find/file things in the future. This can be especially helpful when you have multiple projects. In general, you may create directories (folders) for scripts, data, and documents.
data/ Use this folder to store your raw data and any intermediate datasets you may create for a particular analysis. For the sake of transparency and provenance, you should always keep a copy of your raw data accessible, and do as much of your data cleanup and preprocessing programmatically (i.e., with scripts rather than manually) as possible. Separating raw data from processed data is also a good idea. For example, you could have files data/raw/tree_survey.plot1.txt and ...plot2.txt kept separate from a data/processed/tree.survey.csv file generated by the scripts/01.preprocess.tree_survey.R script.
documents/ This would be a place to keep outlines, drafts, and other text.
scripts/ (or src/) This would be the location to keep your R scripts for different analyses or plotting, and potentially a separate folder for your functions (more on that later).
You may want additional directories or subdirectories depending on your project needs, but these should form the backbone of your working directory.
For this course, we will need a data/ folder to store our raw data, a data_output/ folder for when we learn how to export data as CSV files, and a fig_output/ folder for the figures that we will save.
Challenge: create your project directory structure
Create an RStudio project for this workshop called e.g. bioc-intro, as described above under Getting set up, if you haven’t already done so. Under the Files tab on the right of the screen, click on New Folder and create a folder named data within your newly created working directory (e.g., ~/bioc-intro/data). (Alternatively, type dir.create("data") at your R console.) Repeat these operations to create data_output/ and fig_output/ folders.
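Equivalently, the whole structure can be created from the R console with dir.create():

```r
## Create the course sub-directories inside the current working
## directory; showWarnings = FALSE silences the warning raised if a
## folder already exists
dir.create("data", showWarnings = FALSE)
dir.create("data_output", showWarnings = FALSE)
dir.create("fig_output", showWarnings = FALSE)
```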
Your working directory should now look like this:
The working directory
The working directory is an important concept to understand. It is the place where R looks for and saves files. When you write code for your project, it should refer to files relative to the root of your working directory, and only need files within this structure.
Using RStudio projects makes this easy and ensures that your working directory is set properly. If you need to check it, you can use getwd(). If for some reason your working directory is not what it should be, you can change it in the RStudio interface by navigating in the file browser to where your working directory should be, clicking on the blue gear icon More, and selecting Set As Working Directory.
Alternatively you can use setwd("/path/to/working/directory")
to
reset your working directory. However, your scripts should not include
this line because it will fail on someone else’s computer.
Example
The schema below represents the working directory bioc-intro
with the
data
and fig_output
sub-directories, and 2 files in the latter:
bioc-intro/
├── data/
└── fig_output/
    ├── fig1.pdf
    └── fig2.png
If we were in the working directory, we could refer to the fig1.pdf
file using the relative path fig_output/fig1.pdf
or the
absolute path /home/user/bioc-intro/fig_output/fig1.pdf
.
If we were in the data
directory, we would use the relative path
../fig_output/fig1.pdf
or the same absolute path
/home/user/bioc-intro/fig_output/fig1.pdf
.
Interacting with R
The basis of programming is that we write down instructions for the computer to follow, and then we tell the computer to follow those instructions. We write, or code, instructions in R because it is a common language that both the computer and we can understand. We call the instructions commands and we tell the computer to follow the instructions by executing (also called running) those commands.
There are two main ways of interacting with R: by using the
console or by using scripts (plain text files that contain
your code). The console pane (in RStudio, the bottom left panel) is
the place where commands written in the R language can be typed and
executed immediately by the computer. It is also where the results
will be shown for commands that have been executed. You can type
commands directly into the console and press Enter
to execute those
commands, but they will be forgotten when you close the session.
Because we want our code and workflow to be reproducible, it is better to type the commands we want in the script editor, and save the script. This way, there is a complete record of what we did, and anyone (including our future selves!) can easily replicate the results on their computer. Note, however, that merely typing the commands in the script does not automatically run them - they still need to be sent to the console for execution.
RStudio allows you to execute commands directly from the script editor
by using the Ctrl
+ Enter
shortcut (on Macs, Cmd
+ Return
will
work, too). The command on the current line in the script (indicated
by the cursor) or all of the commands in the currently selected text
will be sent to the console and executed when you press Ctrl
+
Enter
. You can find other keyboard shortcuts in this RStudio
cheatsheet about the RStudio
IDE.
At some point in your analysis you may want to check the content of a variable or the structure of an object, without necessarily keeping a record of it in your script. You can type these commands and execute them directly in the console. RStudio provides the Ctrl + 1 and Ctrl + 2 shortcuts to jump between the script and the console panes.
If R is ready to accept commands, the R console shows a >
prompt. If
it receives a command (by typing, copy-pasting or sent from the script
editor using Ctrl
+ Enter
), R will try to execute it, and when
ready, will show the results and come back with a new >
prompt to
wait for new commands.
If R is still waiting for you to enter more input because the command isn’t complete yet, the console will show a + prompt. It means that you haven’t finished entering a complete command. This is because you have
not ‘closed’ a parenthesis or quotation, i.e. you don’t have the same
number of left-parentheses as right-parentheses, or the same number of
opening and closing quotation marks. When this happens, and you
thought you finished typing your command, click inside the console
window and press Esc
; this will cancel the incomplete command and
return you to the >
prompt.
How to learn more during and after the course?
The material we cover during this course will give you an initial taste of how you can use R to analyse data for your own research. However, you will need to learn more to do advanced operations such as cleaning your dataset, using statistical methods, or creating beautiful graphics3. The best way to become proficient and efficient at R, as with any other tool, is to use it to address your actual research questions. As a beginner, it can feel daunting to have to write a script from scratch, and given that many people make their code available online, modifying existing code to suit your purpose might make it easier for you to get started.
Seeking help
Use the built-in RStudio help interface to search for more information on R functions
One of the fastest ways to get help is to use the RStudio help interface. This panel can by default be found in the lower right-hand corner of RStudio. As seen in the screenshot, by typing the word “Mean”, RStudio also gives a number of suggestions that you might be interested in. The description is then shown in the display window.
I know the name of the function I want to use, but I’m not sure how to use it
If you need help with a specific function, let’s say barplot()
, you
can type:
?barplot
If you just need to remind yourself of the names of the arguments, you can use:
args(lm)
I want to use a function that does X, there must be a function for it but I don’t know which one…
If you are looking for a function to do a particular task, you can use the
help.search()
function, which is called by the double question mark ??
.
However, this only looks through the installed packages for help pages with a
match to your search request.
??kruskal
If you can’t find what you are looking for, you can use the rdocumentation.org website that searches through the help files across all packages available.
Finally, a generic Google or internet search “R <task>” will often either send you to the appropriate package documentation or a helpful forum where someone else has already asked your question.
I am stuck… I get an error message that I don’t understand
Start by googling the error message. However, this doesn’t always work very well, because package developers often rely on the error catching provided by R. You then end up with general error messages that might not be very helpful for diagnosing a problem (e.g. “subscript out of bounds”). If the message is very generic, you might also include the name of the function or package you’re using in your query.
However, you should check Stack Overflow. Search using the [r]
tag. Most
questions have already been answered, but the challenge is to use the right
words in the search to find the
answers:
http://stackoverflow.com/questions/tagged/r
The Introduction to R can also be dense for people with little programming experience but it is a good place to understand the underpinnings of the R language.
The R FAQ is dense and technical but it is full of useful information.
Asking for help
The key to receiving help from someone is for them to rapidly grasp your problem. You should make it as easy as possible to pinpoint where the issue might be.
Try to use the correct words to describe your problem. For instance, a package is not the same thing as a library. Most people will understand what you meant, but others have really strong feelings about the difference in meaning. The key point is that it can make things confusing for people trying to help you. Be as precise as possible when describing your problem.
If possible, try to reduce what doesn’t work to a simple reproducible example. If you can reproduce the problem using a very small data frame instead of your 50000-row, 10000-column one, provide the small one with the description of your problem. When appropriate, try to generalize what you are doing so even people who are not in your field can understand the question. For instance, instead of using a subset of your real dataset, create a small (3 columns, 5 rows) generic one. For more information on how to write a reproducible example see this article by Hadley Wickham.
To share an object with someone else, if it’s relatively small, you
can use the function dput()
. It will output R code that can be used
to recreate the exact same object as the one in memory:
## iris is an example data frame that comes with R and head() is a
## function that returns the first part of the data frame
dput(head(iris))
structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6, 5, 5.4),
Sepal.Width = c(3.5, 3, 3.2, 3.1, 3.6, 3.9), Petal.Length = c(1.4,
1.4, 1.3, 1.5, 1.4, 1.7), Petal.Width = c(0.2, 0.2, 0.2,
0.2, 0.2, 0.4), Species = structure(c(1L, 1L, 1L, 1L, 1L,
1L), .Label = c("setosa", "versicolor", "virginica"), class = "factor")), row.names = c(NA,
6L), class = "data.frame")
If the object is larger, provide the raw file (i.e., your CSV file) with your script up to the point of the error (and after removing everything that is not relevant to your issue). Alternatively, in particular if your question is not related to a data frame, you can save any R object to a file[^export]:
saveRDS(iris, file="/tmp/iris.rds")
The content of this file is however not human readable and cannot be
posted directly on Stack Overflow. Instead, it can be sent to someone
by email who can read it with the readRDS()
command (here it is
assumed that the downloaded file is in a Downloads
folder in the
user’s home directory):
some_data <- readRDS(file="~/Downloads/iris.rds")
Last, but certainly not least, always include the output of sessionInfo()
as it provides critical information about your platform, the versions of R and
the packages that you are using, and other information that can be very helpful
to understand your problem.
sessionInfo()
R version 4.1.3 (2022-03-10)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.2.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/lib/libRlapack.dylib
locale:
[1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] knitr_1.39
loaded via a namespace (and not attached):
[1] compiler_4.1.3 magrittr_2.0.3 tools_4.1.3 stringi_1.7.6 highr_0.9
[6] stringr_1.4.0 xfun_0.31 evaluate_0.15
Where to ask for help?
- The person sitting next to you during the course. Don’t hesitate to talk to your neighbour during the workshop, compare your answers, and ask for help.
- Your friendly colleagues: if you know someone with more experience than you, they might be able and willing to help you.
- Stack Overflow: if your question hasn’t been answered before and is well crafted, chances are you will get an answer in less than 5 min. Remember to follow their guidelines on how to ask a good question.
- The R-help mailing list: it is read by a lot of people (including most of the R core team), a lot of people post to it, but the tone can be pretty dry, and it is not always very welcoming to new users. If your question is valid, you are likely to get an answer very fast but don’t expect that it will come with smiley faces. Also, here more than anywhere else, be sure to use correct vocabulary (otherwise you might get an answer pointing to the misuse of your words rather than answering your question). You will also have more success if your question is about a base function rather than a specific package.
- If your question is about a specific package, see if there is a mailing list for it. Usually it’s included in the DESCRIPTION file of the package, which can be accessed using packageDescription("name-of-package"). You may also want to try to email the author of the package directly, or open an issue on the code repository (e.g., GitHub).
- There are also some topic-specific mailing lists (GIS, phylogenetics, etc…); the complete list is here.
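For example, using the base stats package as a stand-in (any installed package name works):

```r
## Full DESCRIPTION metadata for an installed package
packageDescription("stats")

## A single field, such as the maintainer contact, can be requested
packageDescription("stats", fields = "Maintainer")
```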
More resources
-
The Posting Guide for the R mailing lists.
-
How to ask for R help useful guidelines
-
This blog post by Jon Skeet has quite comprehensive advice on how to ask programming questions.
-
The reprex package is very helpful to create reproducible examples when asking for help. The rOpenSci community call “How to ask questions so they get answered” (Github link and video recording) includes a presentation of the reprex package and of its philosophy.
R packages
Loading packages
As we have seen above, R packages play a fundamental role in R. To make use of a package’s functionality, assuming it is installed, we first need to load it. This is done with the library() function. Below, we load ggplot2.
library("ggplot2")
Installing packages
The default package repository is The Comprehensive R Archive
Network (CRAN), and any package that is available on CRAN can be
installed with the install.packages()
function. Below, for example,
we install the dplyr
package that we will learn about later.
install.packages("dplyr")
This command will install the dplyr
package as well as all its
dependencies, i.e. all the packages that it relies on to function.
Bioconductor packages are managed and installed using a dedicated package,
namely BiocManager
, that can be installed from CRAN with
install.packages("BiocManager")
Individual packages such as SummarizedExperiment
(we will use it
later), DESeq2
(for RNA-Seq analysis), and many more can then be
installed with BiocManager::install
.
BiocManager::install("SummarizedExperiment")
BiocManager::install("DESeq2")
-
As opposed to using R directly from the command line console. Other software exists that interfaces and integrates with R, but RStudio is particularly well suited for beginners while also providing numerous advanced features. ↩
-
i.e. add-ons that provide R with new functionality, such as bioinformatics data analysis. ↩
-
We will introduce most of these (except statistics) here, but will only manage to scratch the surface of the wealth of what is possible to do with R. ↩
Key Points
Start using R and RStudio
Introduction to R
Overview
Teaching: XX min
Exercises: XX min
Questions
First commands in R
Objectives
Define the following terms as they relate to R: object, assign, call, function, arguments, options.
Assign values to objects in R.
Learn how to name objects
Use comments to annotate your scripts.
Solve simple arithmetic operations in R.
Call functions and use arguments to change their default options.
Inspect the content of vectors and manipulate their content.
Subset and extract values from vectors.
Analyze vectors with missing data.
Creating objects in R
You can get output from R simply by typing math in the console:
3 + 5
[1] 8
12 / 7
[1] 1.714286
However, to do useful and interesting things, we need to assign values to
objects. To create an object, we need to give it a name followed by the
assignment operator <-
, and the value we want to give it:
weight_kg <- 55
<-
is the assignment operator. It assigns values on the right to
objects on the left. So, after executing x <- 3
, the value of x
is
3
. The arrow can be read as 3 goes into x
. For historical
reasons, you can also use =
for assignments, but not in every
context. Because of the
slight
differences
in syntax, it is good practice to always use <-
for assignments.
In RStudio, typing Alt + - (push Alt at the same time as the - key) will write <- in a single keystroke on a PC, while typing Option + - (push Option at the same time as the - key) does the same on a Mac.
Naming variables
Objects can be given any name such as x
, current_temperature
, or
subject_id
. You want your object names to be explicit and not too
long. They cannot start with a number (2x
is not valid, but x2
is). R is case sensitive (e.g., weight_kg
is different from
Weight_kg
). There are some names that cannot be used because they
are the names of fundamental functions in R (e.g., if
, else
,
for
, see
here
for a complete list). In general, even if it’s allowed, it’s best to
not use other function names (e.g., c
, T
, mean
, data
, df
,
weights
). If in doubt, check the help to see if the name is already
in use. It’s also best to avoid dots (.
) within an object name as in
my.dataset
. There are many functions in R with dots in their names
for historical reasons, but because dots have a special meaning in R
(for methods) and other programming languages, it’s best to avoid
them. It is also recommended to use nouns for object names, and verbs
for function names. It’s important to be consistent in the styling of
your code (where you put spaces, how you name objects, etc.). Using a
consistent coding style makes your code clearer to read for your
future self and your collaborators. In R, some popular style guides
are Google’s, the
tidyverse’s style and the Bioconductor
style
guide. The
tidyverse’s is very comprehensive and may seem overwhelming at
first. You can install the
lintr
package to
automatically check for issues in the styling of your code.
Objects vs. variables: What are known as objects in R are known as variables in many other programming languages. Depending on the context, object and variable can have drastically different meanings. However, in this lesson, the two words are used synonymously. For more information see: https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Objects
When assigning a value to an object, R does not print anything. You can force R to print the value by using parentheses or by typing the object name:
weight_kg <- 55 # doesn't print anything
(weight_kg <- 55) # but putting parentheses around the call prints the value of `weight_kg`
[1] 55
weight_kg # and so does typing the name of the object
[1] 55
Now that R has weight_kg
in memory, we can do arithmetic with it. For
instance, we may want to convert this weight into pounds (weight in pounds is 2.2 times the weight in kg):
2.2 * weight_kg
[1] 121
We can also change an object’s value by assigning it a new one:
weight_kg <- 57.5
2.2 * weight_kg
[1] 126.5
This means that assigning a value to one object does not change the values of
other objects. For example, let’s store the animal’s weight in pounds in a new
object, weight_lb
:
weight_lb <- 2.2 * weight_kg
and then change weight_kg
to 100.
weight_kg <- 100
Challenge:
What do you think is the current content of the object
weight_lb
? 126.5 or 220?
Comments
The comment character in R is #
, anything to the right of a #
in a
script will be ignored by R. It is useful to leave notes, and
explanations in your scripts.
RStudio makes it easy to comment or uncomment a paragraph: after selecting the lines you want to comment, press Ctrl + Shift + C on your keyboard. If you only want to comment out one line, you can put the cursor at any location on that line (i.e. no need to select the whole line), then press Ctrl + Shift + C.
Challenge
What are the values after each statement in the following?
mass <- 47.5            # mass?
age  <- 122             # age?
mass <- mass * 2.0      # mass?
age  <- age - 20        # age?
mass_index <- mass/age  # mass_index?
Functions and their arguments
Functions are “canned scripts” that automate more complicated sets of commands,
including operations, assignments, etc. Many functions are predefined, or can be
made available by importing R packages (more on that later). A function
usually gets one or more inputs called arguments. Functions often (but not
always) return a value. A typical example would be the function sqrt()
. The
input (the argument) must be a number, and the return value (in fact, the
output) is the square root of that number. Executing a function (‘running it’)
is called calling the function. An example of a function call is:
b <- sqrt(a)
Here, the value of a
is given to the sqrt()
function, the sqrt()
function
calculates the square root, and returns the value which is then assigned to
the object b
. This function is very simple, because it takes just one argument.
The return ‘value’ of a function need not be numerical (like that of sqrt()
),
and it also does not need to be a single item: it can be a set of things, or
even a dataset. We’ll see that when we read data files into R.
Arguments can be anything, not only numbers or filenames, but also other objects. Exactly what each argument means differs per function, and must be looked up in the documentation (see below). Some functions take arguments which may either be specified by the user, or, if left out, take on a default value: these are called options. Options are typically used to alter the way the function operates, such as whether it ignores ‘bad values’, or what symbol to use in a plot. However, if you want something specific, you can specify a value of your choice which will be used instead of the default.
Let’s try a function that can take multiple arguments: round()
.
round(3.14159)
[1] 3
Here, we’ve called round()
with just one argument, 3.14159
, and it has
returned the value 3
. That’s because the default is to round to the nearest
whole number. If we want more digits we can see how to do that by getting
information about the round
function. We can use args(round)
or look at the
help for this function using ?round
.
args(round)
function (x, digits = 0)
NULL
?round
We see that if we want a different number of digits, we can
type digits=2
or however many we want.
round(3.14159, digits = 2)
[1] 3.14
If you provide the arguments in the exact same order as they are defined you don’t have to name them:
round(3.14159, 2)
[1] 3.14
And if you do name the arguments, you can switch their order:
round(digits = 2, x = 3.14159)
[1] 3.14
It’s good practice to put the non-optional arguments (like the number you’re rounding) first in your function call, and to specify the names of all optional arguments. If you don’t, someone reading your code might have to look up the definition of a function with unfamiliar arguments to understand what you’re doing. By specifying the name of the arguments you are also safeguarding against possible future changes in the function interface, which may potentially add new arguments in between the existing ones.
Vectors and data types
A vector is the most common and basic data type in R, and is pretty much
the workhorse of R. A vector is composed by a series of values, such as
numbers or characters. We can assign a series of values to a vector using
the c()
function. For example we can create a vector of animal weights and assign
it to a new object weight_g
:
weight_g <- c(50, 60, 65, 82)
weight_g
[1] 50 60 65 82
A vector can also contain characters:
molecules <- c("dna", "rna", "protein")
molecules
[1] "dna" "rna" "protein"
The quotes around “dna”, “rna”, etc. are essential here. Without the
quotes R will assume there are objects called dna
, rna
and
protein
. As these objects don’t exist in R’s memory, there will be
an error message.
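A quick sketch of the difference (the error line is shown commented out, since running it would stop the script):

```r
## With quotes: a character vector of three strings
molecules <- c("dna", "rna", "protein")

## Without quotes, R looks for objects named dna, rna and protein,
## which don't exist in memory:
## molecules <- c(dna, rna, protein)
## Error: object 'dna' not found
```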
There are many functions that allow you to inspect the content of a
vector. length()
tells you how many elements are in a particular vector:
length(weight_g)
[1] 4
length(molecules)
[1] 3
An important feature of a vector is that all of the elements are the
same type of data. The function class()
indicates the class (the
type of element) of an object:
class(weight_g)
[1] "numeric"
class(molecules)
[1] "character"
The function str()
provides an overview of the structure of an
object and its elements. It is a useful function when working with
large and complex objects:
str(weight_g)
num [1:4] 50 60 65 82
str(molecules)
chr [1:3] "dna" "rna" "protein"
You can use the c()
function to add other elements to your vector:
weight_g <- c(weight_g, 90) # add to the end of the vector
weight_g <- c(30, weight_g) # add to the beginning of the vector
weight_g
[1] 30 50 60 65 82 90
In the first line, we take the original vector weight_g
, add the
value 90
to the end of it, and save the result back into
weight_g
. Then we add the value 30
to the beginning, again saving
the result back into weight_g
.
We can do this over and over again to grow a vector, or to assemble a dataset. As we program, this can be useful for adding results that we are collecting or calculating.
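As a small sketch of this pattern, results can be accumulated one at a time (for long vectors, pre-allocating with a constructor is more efficient, but this works well for small cases):

```r
results <- c()                 ## start with an empty vector
for (i in 1:3) {
  results <- c(results, i^2)   ## append each new result to the end
}
results
#> [1] 1 4 9
```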
An atomic vector is the simplest R data type and is a linear
vector of a single type. Above, we saw 2 of the 6 main atomic
vector types that R uses: "character"
and "numeric"
(or
"double"
). These are the basic building blocks that all R objects
are built from. The other 4 atomic vector types are:
- "logical" for TRUE and FALSE (the boolean data type)
- "integer" for integer numbers (e.g., 2L; the L indicates to R that it’s an integer)
- "complex" to represent complex numbers with real and imaginary parts (e.g., 1 + 4i) and that’s all we’re going to say about them
- "raw" for bitstreams that we won’t discuss further
You can check the type of your vector using the typeof()
function
and inputting your vector as the argument.
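For instance (re-creating the two vectors from above so the chunk is self-contained):

```r
weight_g <- c(30, 50, 60, 65, 82, 90)
molecules <- c("dna", "rna", "protein")

typeof(weight_g)   ## numeric values are stored as "double" by default
#> [1] "double"
typeof(molecules)
#> [1] "character"
typeof(2L)         ## the L suffix makes it an integer
#> [1] "integer"
typeof(TRUE)
#> [1] "logical"
```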
Vectors are one of the many data structures that R uses. Other
important ones are lists (list
), matrices (matrix
), data frames
(data.frame
), factors (factor
) and arrays (array
).
Challenge:
We’ve seen that atomic vectors can be of type character, numeric (or double), integer, and logical. But what happens if we try to mix these types in a single vector?
Solution
R implicitly converts them to all be the same type
Challenge:
What will happen in each of these examples? (hint: use
class()
to check the data type of your objects):

num_char <- c(1, 2, 3, "a")
num_logical <- c(1, 2, 3, TRUE)
char_logical <- c("a", "b", "c", TRUE)
tricky <- c(1, 2, 3, "4")
Solution
class(num_char)
[1] "character"
class(num_logical)
[1] "numeric"
class(char_logical)
[1] "character"
class(tricky)
[1] "character"
Challenge:
Why do you think it happens?
Solution
Vectors can be of only one data type. R tries to convert (coerce) the content of this vector to find a common denominator that doesn’t lose any information.
Homework Challenge:
How many values in
combined_logical
are "TRUE"
(as a character) in the following example:

num_logical <- c(1, 2, 3, TRUE)
char_logical <- c("a", "b", "c", TRUE)
combined_logical <- c(num_logical, char_logical)
Homework Challenge:
In R, we call converting objects from one class into another class coercion. These conversions happen according to a hierarchy, whereby some types get preferentially coerced into other types. Can you draw a diagram that represents the hierarchy of how these data types are coerced?
Solution
logical → numeric → character ← logical
Subsetting vectors
If we want to extract one or several values from a vector, we must provide one or several indices in square brackets. For instance:
molecules <- c("dna", "rna", "peptide", "protein")
molecules[2]
[1] "rna"
molecules[c(3, 2)]
[1] "peptide" "rna"
We can also repeat the indices to create an object with more elements than the original one:
more_molecules <- molecules[c(1, 2, 3, 2, 1, 4)]
more_molecules
[1] "dna" "rna" "peptide" "rna" "dna" "protein"
R indices start at 1. Programming languages like Fortran, MATLAB, Julia, and R start counting at 1, because that’s what human beings typically do. Languages in the C family (including C++, Java, Perl, and Python) count from 0 because that’s simpler for computers to do.
Finally, it is also possible to get all the elements of a vector except some specified elements using negative indices:
molecules ## all molecules
[1] "dna" "rna" "peptide" "protein"
molecules[-1] ## all but the first one
[1] "rna" "peptide" "protein"
molecules[-c(1, 3)] ## all but 1st/3rd ones
[1] "rna" "protein"
molecules[c(-1, -3)] ## all but 1st/3rd ones
[1] "rna" "protein"
Conditional subsetting
Another common way of subsetting is by using a logical vector. TRUE
will
select the element with the same index, while FALSE
will not:
weight_g <- c(21, 34, 39, 54, 55)
weight_g[c(TRUE, FALSE, TRUE, TRUE, FALSE)]
[1] 21 39 54
Typically, these logical vectors are not typed by hand, but are the output of other functions or logical tests. For instance, if you wanted to select only the values above 50:
## will return logicals with TRUE for the indices that meet
## the condition
weight_g > 50
[1] FALSE FALSE FALSE TRUE TRUE
## so we can use this to select only the values above 50
weight_g[weight_g > 50]
[1] 54 55
You can combine multiple tests using &
(both conditions are true,
AND) or |
(at least one of the conditions is true, OR):
weight_g[weight_g < 30 | weight_g > 50]
[1] 21 54 55
weight_g[weight_g >= 30 & weight_g == 21]
numeric(0)
Here, <
stands for “less than”, >
for “greater than”, >=
for
“greater than or equal to”, and ==
for “equal to”. The double equal
sign ==
is a test for numerical equality between the left and right
hand sides, and should not be confused with the single =
sign, which
performs variable assignment (similar to <-
).
A common task is to search for certain strings in a vector. One could
use the “or” operator |
to test for equality to multiple values, but
this can quickly become tedious. The function %in%
allows you to
test whether each element of a vector is found among a set of search values:
molecules <- c("dna", "rna", "protein", "peptide")
molecules[molecules == "rna" | molecules == "dna"] # returns both rna and dna
[1] "dna" "rna"
molecules %in% c("rna", "dna", "metabolite", "peptide", "glycerol")
[1] TRUE TRUE FALSE TRUE
molecules[molecules %in% c("rna", "dna", "metabolite", "peptide", "glycerol")]
[1] "dna" "rna" "peptide"
Homework Challenge:
Can you figure out why
"four" > "five"
returns TRUE
?
Solution
"four" > "five"
[1] TRUE
When using
>
or <
on strings, R compares their alphabetical order. Here "four"
comes after "five"
, and therefore is greater than it.
Names
It is possible to name each element of a vector. The code chunk below shows an initial vector without any names, how names are set, and how they are retrieved.
x <- c(1, 5, 3, 5, 10)
names(x) ## no names
NULL
names(x) <- c("A", "B", "C", "D", "E")
names(x) ## now we have names
[1] "A" "B" "C" "D" "E"
When a vector has names, it is possible to access elements by their name, in addition to their index.
x[c(1, 3)]
A C
1 3
x[c("A", "C")]
A C
1 3
Missing data
As R was designed to analyze datasets, it includes the concept of
missing data (which is uncommon in other programming
languages). Missing data are represented in vectors as NA
.
When doing operations on numbers, most functions will return NA
if
the data you are working with include missing values. This feature
makes it harder to overlook the cases where you are dealing with
missing data. You can add the argument na.rm = TRUE
to calculate
the result while ignoring the missing values.
heights <- c(2, 4, 4, NA, 6)
mean(heights)
[1] NA
max(heights)
[1] NA
mean(heights, na.rm = TRUE)
[1] 4
max(heights, na.rm = TRUE)
[1] 6
If your data include missing values, you may want to become familiar
with the functions is.na()
, na.omit()
, and complete.cases()
. See
below for examples.
## Extract those elements which are not missing values.
heights[!is.na(heights)]
[1] 2 4 4 6
## Returns the object with incomplete cases removed.
## The returned object is an atomic vector of type `"numeric"`
## (or `"double"`).
na.omit(heights)
[1] 2 4 4 6
attr(,"na.action")
[1] 4
attr(,"class")
[1] "omit"
## Extract those elements which are complete cases.
## The returned object is an atomic vector of type `"numeric"`
## (or `"double"`).
heights[complete.cases(heights)]
[1] 2 4 4 6
Homework Challenge:
- Using this vector of heights in inches, create a new vector with the NAs removed.
heights <- c(63, 69, 60, 65, NA, 68, 61, 70, 61, 59, 64, 69, 63, 63, NA, 72, 65, 64, 70, 63, 65)
- Use the function
median()
to calculate the median of the heights
vector.
- Use R to figure out how many people in the set are taller than 67 inches.
Solution
heights_no_na <- heights[!is.na(heights)] ## or heights_no_na <- na.omit(heights)
median(heights, na.rm = TRUE)
[1] 64
heights_above_67 <- heights_no_na[heights_no_na > 67]
length(heights_above_67)
[1] 6
Generating vectors
Constructors {-}
There exist functions to generate vectors of different types. To
generate a vector of numerics, one can use the numeric()
constructor, providing the length of the output vector as a
parameter. The values will be initialised with 0.
numeric(3)
[1] 0 0 0
numeric(10)
[1] 0 0 0 0 0 0 0 0 0 0
Note that if we ask for a vector of numerics of length 0, we obtain exactly that:
numeric(0)
numeric(0)
There are similar constructors for characters and logicals, named
character()
and logical()
respectively.
Homework Challenge:
What are the defaults for character and logical vectors?
Solution
character(2) ## the empty character
[1] "" ""
logical(2) ## FALSE
[1] FALSE FALSE
Replicate elements
The rep
function allows repeating a value a certain number of
times. If we want to initialise a vector of numerics of length 5 with
the value -1, for example, we could do the following:
rep(-1, 5)
[1] -1 -1 -1 -1 -1
Similarly, to generate a vector populated with missing values, which is often a good way to start without making assumptions about the data to be collected:
rep(NA, 5)
[1] NA NA NA NA NA
rep
can take vectors of any length as input (above, we used vectors
of length 1) and any type. For example, if we want to repeat the
values 1, 2 and 3 five times, we would do the following:
rep(c(1, 2, 3), 5)
[1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3
Homework Challenge:
What if we wanted to repeat the values 1, 2 and 3 five times, but obtain five 1s, five 2s and five 3s in that order? There are two possibilities - see
?rep
or ?sort
for help.
Solution
rep(c(1, 2, 3), each = 5)
[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
sort(rep(c(1, 2, 3), 5))
[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
Sequence generation
Another very useful function is seq
, to generate a sequence of
numbers. For example, to generate a sequence of integers from 1 to 20
by steps of 2, one would use:
seq(from = 1, to = 20, by = 2)
[1] 1 3 5 7 9 11 13 15 17 19
The default value of by
is 1 and, given that generating a
sequence from one value to another with steps of 1 is frequently done,
there’s a shortcut:
seq(1, 5, 1)
[1] 1 2 3 4 5
seq(1, 5) ## default by
[1] 1 2 3 4 5
1:5
[1] 1 2 3 4 5
To generate a sequence of numbers from 1 to 20 of final length of 3, one would use:
seq(from = 1, to = 20, length.out = 3)
[1] 1.0 10.5 20.0
Random samples and permutations
A last group of useful functions are those that generate random
data. The first one, sample
, generates a random permutation of
another vector. For example, to draw a random order for a 10-student
oral exam, I first assign each student a number from 1 to ten (for
instance based on the alphabetical order of their names) and then:
sample(1:10)
[1] 9 4 7 1 2 5 3 10 6 8
Without further arguments, sample
will return a permutation of all
elements of the vector. If I want a random sample of a certain size, I
would set this value as second argument. Below, I sample 5 random
letters from the alphabet contained in the pre-defined letters
vector:
sample(letters, 5)
[1] "s" "a" "u" "x" "j"
If I wanted an output larger than the input vector, or to be able to
draw some elements multiple times, I would need to set the replace
argument to TRUE
:
sample(1:5, 10, replace = TRUE)
[1] 2 1 5 5 1 1 5 5 2 2
Homework Challenge:
When trying the functions above out, you will have realised that the samples are indeed random and that one doesn’t get the same permutation twice. To be able to reproduce these random draws, one can set the random number generation seed manually with
set.seed()
before drawing the random sample.
Test this feature with your neighbour. First draw two random permutations of
1:10
independently and observe that you get different results.Now set the seed with, for example,
set.seed(123)
and repeat the random draw. Observe that you now get the same random draws.Repeat by setting a different seed.
Solution
Different permutations
sample(1:10)
[1] 9 1 4 3 6 2 5 8 10 7
sample(1:10)
[1] 4 9 7 6 1 10 8 3 2 5
Same permutations with seed 123
set.seed(123)
sample(1:10)
[1] 3 10 2 8 6 9 1 7 5 4
set.seed(123)
sample(1:10)
[1] 3 10 2 8 6 9 1 7 5 4
A different seed
set.seed(1)
sample(1:10)
[1] 9 4 7 1 2 5 3 10 6 8
set.seed(1)
sample(1:10)
[1] 9 4 7 1 2 5 3 10 6 8
Homework reading: Drawing samples from a normal distribution
The last function we are going to see is rnorm
, which draws a random
sample from a normal distribution. Two normal distributions of means 0
and 100 and standard deviations 1 and 5, noted N(0, 1) and
N(100, 5), are shown below.
The three arguments, n
, mean
and sd
, define the size of the
sample and the parameters of the normal distribution, i.e. the mean
and the standard deviation. The defaults of the latter two are 0 and 1.
rnorm(5)
[1] 0.69641761 0.05351568 -1.31028350 -2.12306606 -0.20807859
rnorm(5, 2, 2)
[1] 1.3744268 -0.1164714 2.8344472 1.3690969 3.6510983
rnorm(5, 100, 5)
[1] 106.45636 96.87448 95.62427 100.71678 107.12595
Now that we have learned how to write scripts, and the basics of R’s data structures, we are ready to start working with larger data, and learn about data frames.
Key Points
How to interact with R
Starting with data
Overview
Teaching: XX min
Exercises: XX minQuestions
First data analysis in R
Objectives
Describe what a data frame is.
Load external data from a .csv file into a data frame.
Summarize the contents of a data frame.
Describe what a factor is.
Convert between strings and factors.
Reorder and rename factors.
Format dates.
Export and save data.
Presentation of the gene expression data
We are going to use part of the data published by Blackmore et al. (2017), The effect of upper-respiratory infection on transcriptomic changes in the CNS. The goal of the study was to determine the effect of an upper-respiratory infection on changes in RNA transcription occurring in the cerebellum and spinal cord post infection. Gender-matched eight-week-old C57BL/6 mice were inoculated with saline or with Influenza A by intranasal route and transcriptomic changes in the cerebellum and spinal cord tissues were evaluated by RNA-seq at days 0 (non-infected), 4 and 8.
The dataset is stored as a comma separated value (CSV) file. Each row holds information for a single RNA expression measurement, and the columns represent:
Column | Description |
---|---|
gene | The name of the gene that was measured |
sample | The name of the sample the gene expression was measured in |
expression | The value of the gene expression |
organism | The organism/species - here all data stem from mice |
age | The age of the mouse (all mice were 8 weeks here) |
sex | The sex of the mouse |
infection | The infection state of the mouse, i.e. infected with Influenza A or not infected. |
strain | The Influenza A strain; C57BL/6 in all cases. |
time | The duration of the infection (in days). |
tissue | The tissue that was used for the gene expression experiment, i.e. cerebellum or spinal cord. |
mouse | The mouse unique identifier. |
We have already downloaded the data and it can be found in the directory
course-data/data/GSE96870/
.
You are now ready to load the data:
rna <- read.csv("course-data/data/GSE96870/rnaseq.csv")
This statement doesn’t produce any output because, as you might recall, assignments don’t display anything. If we want to check that our data has been loaded, we can see the contents of the data frame by typing its name
rna
Wow… that was a lot of output. At least it means the data loaded
properly. Let’s check the top (the first 6 lines) of this data frame
using the function head()
:
head(rna)
gene sample expression organism age sex infection strain time
1 Asl GSM2545336 1170 Mus musculus 8 Female InfluenzaA C57BL/6 8
2 Apod GSM2545336 36194 Mus musculus 8 Female InfluenzaA C57BL/6 8
3 Cyp2d22 GSM2545336 4060 Mus musculus 8 Female InfluenzaA C57BL/6 8
4 Klk6 GSM2545336 287 Mus musculus 8 Female InfluenzaA C57BL/6 8
5 Fcrls GSM2545336 85 Mus musculus 8 Female InfluenzaA C57BL/6 8
6 Slc2a4 GSM2545336 782 Mus musculus 8 Female InfluenzaA C57BL/6 8
tissue mouse ENTREZID
1 Cerebellum 14 109900
2 Cerebellum 14 11815
3 Cerebellum 14 56448
4 Cerebellum 14 19144
5 Cerebellum 14 80891
6 Cerebellum 14 20528
product
1 argininosuccinate lyase, transcript variant X1
2 apolipoprotein D, transcript variant 3
3 cytochrome P450, family 2, subfamily d, polypeptide 22, transcript variant 2
4 kallikrein related-peptidase 6, transcript variant 2
5 Fc receptor-like S, scavenger receptor, transcript variant X1
6 solute carrier family 2 (facilitated glucose transporter), member 4
ensembl_gene_id external_synonym chromosome_name gene_biotype
1 ENSMUSG00000025533 2510006M18Rik 5 protein_coding
2 ENSMUSG00000022548 <NA> 16 protein_coding
3 ENSMUSG00000061740 2D22 15 protein_coding
4 ENSMUSG00000050063 Bssp 7 protein_coding
5 ENSMUSG00000015852 2810439C17Rik 3 protein_coding
6 ENSMUSG00000018566 Glut-4 11 protein_coding
phenotype_description
1 abnormal circulating amino acid level
2 abnormal lipid homeostasis
3 abnormal skin morphology
4 abnormal cytokine level
5 decreased CD8-positive alpha-beta T cell number
6 abnormal circulating glucose level
hsapiens_homolog_associated_gene_name
1 ASL
2 APOD
3 CYP2D6
4 KLK6
5 FCRL2
6 SLC2A4
## Try also
## View(rna)
Note
read.csv()
assumes that fields are delineated by commas; however, in
several countries, the comma is used as a decimal separator and the
semicolon (;) is used as a field delineator. If you want to read in
these types of files in R, you can use the read.csv2()
function. It
behaves exactly like read.csv()
but uses different parameters for
the decimal and the field separators. If you are working with another
format, both can be specified by the user. Check out the help for
read.csv()
by typing ?read.csv
to learn more. There is also the
read.delim()
function for reading tab separated data files. It is important to
note that all of these functions are actually wrapper functions for
the main read.table()
function with different arguments. As such,
the data above could have also been loaded by using read.table()
with the separation argument as ,
. The code is as follows:
rna <- read.table(file = "course-data/data/GSE96870/rnaseq.csv",
sep = ",",
header = TRUE)
The header argument has to be set to TRUE to read the column
headers; by default, read.table() has the header argument set to
FALSE.
What are data frames?
Data frames are the de facto data structure for most tabular data, and what we use for statistics and plotting.
A data frame can be created by hand, but most commonly they are
generated by the functions read.csv()
or read.table()
; in other
words, when importing spreadsheets from your hard drive (or the web).
A data frame is the representation of data in the format of a table where the columns are vectors that all have the same length. Because columns are vectors, each column must contain a single type of data (e.g., characters, integers, factors). For example, here is a figure depicting a data frame comprising a numeric, a character, and a logical vector.
We can see this when inspecting the structure of a data frame
with the function str()
:
str(rna)
'data.frame': 32428 obs. of 19 variables:
$ gene : chr "Asl" "Apod" "Cyp2d22" "Klk6" ...
$ sample : chr "GSM2545336" "GSM2545336" "GSM2545336" "GSM2545336" ...
$ expression : int 1170 36194 4060 287 85 782 1619 288 43217 1071 ...
$ organism : chr "Mus musculus" "Mus musculus" "Mus musculus" "Mus musculus" ...
$ age : int 8 8 8 8 8 8 8 8 8 8 ...
$ sex : chr "Female" "Female" "Female" "Female" ...
$ infection : chr "InfluenzaA" "InfluenzaA" "InfluenzaA" "InfluenzaA" ...
$ strain : chr "C57BL/6" "C57BL/6" "C57BL/6" "C57BL/6" ...
$ time : int 8 8 8 8 8 8 8 8 8 8 ...
$ tissue : chr "Cerebellum" "Cerebellum" "Cerebellum" "Cerebellum" ...
$ mouse : int 14 14 14 14 14 14 14 14 14 14 ...
$ ENTREZID : int 109900 11815 56448 19144 80891 20528 97827 118454 18823 14696 ...
$ product : chr "argininosuccinate lyase, transcript variant X1" "apolipoprotein D, transcript variant 3" "cytochrome P450, family 2, subfamily d, polypeptide 22, transcript variant 2" "kallikrein related-peptidase 6, transcript variant 2" ...
$ ensembl_gene_id : chr "ENSMUSG00000025533" "ENSMUSG00000022548" "ENSMUSG00000061740" "ENSMUSG00000050063" ...
$ external_synonym : chr "2510006M18Rik" NA "2D22" "Bssp" ...
$ chromosome_name : chr "5" "16" "15" "7" ...
$ gene_biotype : chr "protein_coding" "protein_coding" "protein_coding" "protein_coding" ...
$ phenotype_description : chr "abnormal circulating amino acid level" "abnormal lipid homeostasis" "abnormal skin morphology" "abnormal cytokine level" ...
$ hsapiens_homolog_associated_gene_name: chr "ASL" "APOD" "CYP2D6" "KLK6" ...
Inspecting data.frame
Objects
We already saw how the functions head()
and str()
can be useful to
check the content and the structure of a data frame. Here is a
non-exhaustive list of functions to get a sense of the
content/structure of the data. Let’s try them out!
Size:
dim(rna)
- returns a vector with the number of rows in the first element, and the number of columns as the second element (the dimensions of the object)
nrow(rna)
- returns the number of rows
ncol(rna)
- returns the number of columns
Content:
head(rna)
- shows the first 6 rows
tail(rna)
- shows the last 6 rows
Names:
names(rna)
- returns the column names (synonym of colnames()
for data.frame
objects)
rownames(rna)
- returns the row names
Summary:
str(rna)
- structure of the object and information about the class, length and content of each column
summary(rna)
- summary statistics for each column
Note: most of these functions are “generic”; they can be used on other types of
objects besides data.frame
.
Challenge:
Based on the output of
str(rna)
, can you answer the following questions?
- What is the class of the object
rna
?
- How many rows and how many columns are in this object?
- How many genes (as defined by the
gene
variable) have been measured in this experiment?
Solution
- class: data frame
- how many rows: 32428, how many columns: 19
- how many genes: 1477
Indexing and subsetting data frames
Our rna
data frame has rows and columns (it has 2 dimensions), if we
want to extract some specific data from it, we need to specify the
“coordinates” we want from it. Row numbers come first, followed by
column numbers. However, note that different ways of specifying these
coordinates lead to results with different classes.
# first element in the first column of the data frame (as a vector)
rna[1, 1]
# first element in the 6th column (as a vector)
rna[1, 6]
# first column of the data frame (as a vector)
rna[, 1]
# first column of the data frame (as a data.frame)
rna[1]
# first three elements in the 7th column (as a vector)
rna[1:3, 7]
# the 3rd row of the data frame (as a data.frame)
rna[3, ]
# equivalent to head_rna <- head(rna)
head_rna <- rna[1:6, ]
head_rna
:
is a special function that creates numeric vectors of integers in
increasing or decreasing order, test 1:10
and 10:1
for
instance. See section \@ref(sec:genvec) for details.
You can also exclude certain indices of a data frame using the “-
” sign:
rna[, -1] ## The whole data frame, except the first column
rna[-c(7:32428), ] ## Equivalent to head(rna)
Data frames can be subset by calling indices (as shown previously), but also by calling their column names directly:
rna["gene"] # Result is a data.frame
rna[, "gene"] # Result is a vector
rna[["gene"]] # Result is a vector
rna$gene # Result is a vector
In RStudio, you can use the autocompletion feature to get the full and correct names of the columns.
Challenge
Create a
data.frame
(rna_200
) containing only the data in row 200 of therna
dataset.Notice how
nrow()
gave you the number of rows in a data.frame
?
Use that number to pull out just that last row in the initial
rna
data frame.Compare that with what you see as the last row using
tail()
to make sure it’s meeting expectations.
Pull out that last row using
nrow()
instead of the row number.
Create a new data frame (
rna_last
) from that last row.
Use
nrow()
to extract the row that is in the middle of the rna
dataframe. Store the content of this row in an object named rna_middle
.Combine
nrow()
with the-
notation above to reproduce the behavior of head(rna)
, keeping just the first through 6th rows of the rna dataset.Solution
## 1.
rna_200 <- rna[200, ]
## 2.
## Saving `n_rows` to improve readability and reduce duplication
n_rows <- nrow(rna)
rna_last <- rna[n_rows, ]
## 3.
rna_middle <- rna[n_rows / 2, ]
## 4.
rna_head <- rna[-(7:n_rows), ]
Factors
Factors represent categorical data. They are stored as integers associated with labels and they can be ordered or unordered. While factors look (and often behave) like character vectors, they are actually treated as integer vectors by R. So you need to be very careful when treating them as strings.
Once created, factors can only contain a pre-defined set of values, known as levels. By default, R always sorts levels in alphabetical order. For instance, if you have a factor with 2 levels:
sex <- factor(c("male", "female", "female", "male", "female"))
R will assign 1
to the level "female"
and 2
to the level
"male"
(because f
comes before m
, even though the first element
in this vector is "male"
). You can see this by using the function
levels()
and you can find the number of levels using nlevels()
:
levels(sex)
[1] "female" "male"
nlevels(sex)
[1] 2
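The warning above, that factors are stored as integers, can be made visible with as.integer(); it also explains a classic pitfall when a factor contains numbers:

```r
sex <- factor(c("male", "female", "female", "male", "female"))
as.integer(sex)   ## the underlying codes: "female" is 1, "male" is 2
#> [1] 2 1 1 2 1

## Pitfall: converting a factor of numbers returns the codes, not the values
f <- factor(c(10, 5, 10))
as.integer(f)
#> [1] 2 1 2
as.numeric(as.character(f))   ## convert via character to recover the values
#> [1] 10  5 10
```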
Sometimes, the order of the factors does not matter, other times you
might want to specify the order because it is meaningful (e.g., “low”,
“medium”, “high”), it improves your visualization, or it is required
by a particular type of analysis. Here, one way to reorder our levels
in the sex
vector would be:
sex ## current order
[1] male female female male female
Levels: female male
sex <- factor(sex, levels = c("male", "female"))
sex ## after re-ordering
[1] male female female male female
Levels: male female
In R’s memory, these factors are represented by integers (1, 2, 3),
but are more informative than integers because factors are self
describing: "female"
, "male"
is more descriptive than 1
,
2
. Which one is “male”? You wouldn’t be able to tell just from the
integer data. Factors, on the other hand, have this information built
in. It is particularly helpful when there are many levels (like the
species names in our example dataset).
Converting to factors {-}
If you need to convert a factor to a character vector, you use
as.character(x)
.
as.character(sex)
[1] "male" "female" "female" "male" "female"
Renaming factors {-}
When your data is stored as a factor, you can use the plot()
function to get a quick glance at the number of observations
represented by each factor level. Let’s look at the number of males
and females in our data.
plot(sex)
If we want to rename these factor levels, it is sufficient to change them:
levels(sex)
[1] "male" "female"
levels(sex) <- c("M", "F")
sex
[1] M F F M F
Levels: M F
plot(sex)
Homework Challenge:
- Rename “female” and “male” to “Female” and “Male” respectively.
Challenge:
We have seen how data frames are created when using
read.csv()
, but they can also be created by hand with the data.frame()
function. There are a few mistakes in this hand-crafted data.frame
. Can you spot and fix them? Don’t hesitate to experiment!

animal_data <- data.frame(
  animal = c(dog, cat, sea cucumber, sea urchin),
  feel = c("furry", "squishy", "spiny"),
  weight = c(45, 8 1.1, 0.8))
Solution
- missing quotations around the names of the animals
- missing one entry in the “feel” column (probably for one of the furry animals)
- missing one comma in the weight column
Homework Challenge:
Can you predict the class for each of the columns in the following example?
Check your guesses using
str(country_climate)
:
Are they what you expected? Why? Why not?
Try again by adding
stringsAsFactors = TRUE
after the last variable when creating the data frame. What is happening now? stringsAsFactors
can also be set when reading text-based spreadsheets into R using read.csv()
.

country_climate <- data.frame(
  country = c("Canada", "Panama", "South Africa", "Australia"),
  climate = c("cold", "hot", "temperate", "hot/temperate"),
  temperature = c(10, 30, 18, "15"),
  northern_hemisphere = c(TRUE, TRUE, FALSE, "FALSE"),
  has_kangaroo = c(FALSE, FALSE, FALSE, 1)
)
The automatic conversion of data type is sometimes a blessing, sometimes an annoyance. Be aware that it exists, learn the rules, and double check that data you import in R are of the correct type within your data frame. If not, use it to your advantage to detect mistakes that might have been introduced during data entry (a letter in a column that should only contain numbers for instance).
Learn more in this RStudio tutorial
Matrices
Before proceeding, now that we have learnt about dataframes, let’s
recap package installation and learn about a new data type, namely the
matrix
. Like a data.frame
, a matrix has two dimensions, rows and
columns. But the major difference is that all cells in a matrix
must
be of the same type: numeric
, character
, logical
, … In that
respect, matrices are closer to a vector
than a data.frame
.
The default constructor for a matrix is matrix
. It takes a vector of
values to populate the matrix and the number of rows and/or
columns. The values are arranged along the columns, as illustrated
below.
m <- matrix(1:9, ncol = 3, nrow = 3)
m
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
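If the values should instead be arranged along the rows, the matrix() function has a byrow argument that can be set to TRUE:

```r
m2 <- matrix(1:9, ncol = 3, nrow = 3, byrow = TRUE)  ## fill row by row
m2
#>      [,1] [,2] [,3]
#> [1,]    1    2    3
#> [2,]    4    5    6
#> [3,]    7    8    9
```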
Homework Challenge:
Using the function
installed.packages()
, create a character
matrix containing the information about all packages currently installed on your computer. Explore it.
Solution:
## create the matrix
ip <- installed.packages()
head(ip)
## try also View(ip)
## number of packages
nrow(ip)
## names of all installed packages
rownames(ip)
## type of information we have about each package
colnames(ip)
It is often useful to create large random data matrices as test
data. The exercise below asks you to create such a matrix with random
data drawn from a normal distribution of mean 0 and standard deviation
1, which can be done with the rnorm()
function.
Homework Challenge:
Construct a matrix of dimension 1000 by 3 of normally distributed data (mean 0, standard deviation 1)
Solution
set.seed(123)
m <- matrix(rnorm(3000), ncol = 3)
dim(m)
[1] 1000 3
head(m)
            [,1]        [,2]       [,3]
[1,] -0.56047565 -0.99579872 -0.5116037
[2,] -0.23017749 -1.03995504  0.2369379
[3,]  1.55870831 -0.01798024 -0.5415892
[4,]  0.07050839 -0.13217513  1.2192276
[5,]  0.12928774 -2.54934277  0.1741359
[6,]  1.71506499  1.04057346 -0.6152683
Summary of R objects
So far, we have seen several types of R object varying in the number of dimensions and whether they could store a single or multiple data types:
- vector: one dimension (they have a length), single type of data.
- matrix: two dimensions, single type of data.
- data.frame: two dimensions, one type per column.
Lists
A data type that we haven’t seen yet, but that is useful to know, and follows from the summary that we have just seen are lists:
list
: one dimension, every item can be of a different data type.
Below, let’s create a list containing a vector of numbers, characters, a matrix, a dataframe and another list:
l <- list(1:10, ## numeric
letters, ## character
installed.packages(), ## a matrix
cars, ## a data.frame
list(1, 2, 3)) ## a list
length(l)
[1] 5
str(l)
List of 5
$ : int [1:10] 1 2 3 4 5 6 7 8 9 10
$ : chr [1:26] "a" "b" "c" "d" ...
$ : chr [1:399, 1:16] "abind" "annotate" "AnnotationDbi" "AnnotationHub" ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:399] "abind" "annotate" "AnnotationDbi" "AnnotationHub" ...
.. ..$ : chr [1:16] "Package" "LibPath" "Version" "Priority" ...
$ :'data.frame': 50 obs. of 2 variables:
..$ speed: num [1:50] 4 4 7 7 8 9 10 10 10 11 ...
..$ dist : num [1:50] 2 10 4 22 16 10 18 26 34 17 ...
$ :List of 3
..$ : num 1
..$ : num 2
..$ : num 3
List subsetting is done using `[]` to subset a new sub-list, or `[[]]`
to extract a single element of that list (using indices or names, if
the list is named).
l[[1]] ## first element
[1] 1 2 3 4 5 6 7 8 9 10
l[1:2] ## a list of length 2
[[1]]
[1] 1 2 3 4 5 6 7 8 9 10
[[2]]
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
[20] "t" "u" "v" "w" "x" "y" "z"
l[1] ## a list of length 1
[[1]]
[1] 1 2 3 4 5 6 7 8 9 10
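Lists can also be named, in which case elements can be extracted with the `$` operator, as for data frames. A minimal sketch (the names `counts` and `label` are chosen purely for illustration):

```r
## a named list mixing a numeric vector and a character string
nl <- list(counts = 1:3, label = "experiment 1")
nl$counts      ## extract by name, equivalent to nl[["counts"]]
nl[["label"]]
```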
Exporting and saving data
Exporting tabular data {-}
We have seen how to read a text-based spreadsheet into R using the
`read.table` family of functions. To export a `data.frame` to a
text-based spreadsheet, we can use the `write.table` set of functions
(`write.csv`, `write.csv2`, …). They all take the variable to be
exported and the file to export to. For example, to export the
`rna` data to an `rna.csv` file in the `data_output` directory, we
would execute:
write.csv(rna, file = "data_output/rna.csv")
This new csv file can now be shared with other collaborators who aren’t familiar with R.
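By default, `write.csv` also writes the row names as an unnamed first column, which collaborators rarely expect. This can be turned off with the `row.names` argument:

```r
## omit row names to get a clean, purely rectangular CSV file
write.csv(rna, file = "data_output/rna.csv", row.names = FALSE)
```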
Saving data {-}
Exporting data to a spreadsheet has several limitations, such as those
described in the first chapter: possible inconsistencies between `,`
and `.` as decimal separators, and the lack of variable type
definitions. Furthermore, exporting data to a spreadsheet is only
relevant for rectangular data such as data frames and matrices.
A more general way to save data, that is specific to R and is
guaranteed to work on any operating system, is to use the `save`
function. Saving an object will generate a binary representation of
the object on disk, an R Data file (`rda` extension), that is
guaranteed to produce the same object once loaded back into R using
the `load` function.
save(rna, file = "data_output/rna.rda")
rm(rna)
load("data_output/rna.rda")
head(rna)
Note how the `load` function loads the object in the file directly
into the global environment.
There are also the `saveRDS` and `readRDS` functions that save R
objects to binary files (using the `rds` extension here) and read
these back into R. From a user's perspective, the main difference is
that `load` loads an object into the global environment, while
`readRDS` reads the data from disk and returns it. It is thus
necessary to store the output of `readRDS` in a variable:
saveRDS(rna, file = "data_output/rna.rds")
rm(rna)
rna <- readRDS("data_output/rna.rds")
head(rna)
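Because `saveRDS` works for any R object, it is also a convenient way to store non-rectangular data that couldn't be exported to a spreadsheet, such as a list. A minimal sketch (the object and file names are chosen for illustration):

```r
## a list cannot be written as a CSV, but serialises without loss
results <- list(coefficients = c(a = 1.2, b = -0.5),
                genes = c("Asl", "Apod"))
saveRDS(results, file = "data_output/results.rds")
## the round-trip returns an identical object
identical(readRDS("data_output/results.rds"), results)
```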
To conclude, when it comes to saving data from R that will be loaded again in R, saving and loading is the preferred approach. If tabular data need to be shared with somebody that is not using R, then exporting to a text-based spreadsheet is a good alternative.
¹ Either the number of rows or columns is enough, as the other one can be deduced from the length of the values. Try out what happens if the values and the number of rows/columns don't add up.
Key Points
Tabular data in R
Manipulating and analyzing data with dplyr
Overview
Teaching: XX min
Exercises: XX min
Questions
Data analysis in R using the tidyverse meta-package
Objectives
Describe the purpose of the `dplyr` and `tidyr` packages.
Describe several of their functions that are extremely useful to manipulate data.
Describe the concept of a wide and a long table format, and see how to reshape a data frame from one format to the other one.
Data Manipulation using dplyr
and tidyr
Bracket subsetting is handy, but it can be cumbersome and difficult to read, especially for complicated operations.
Some packages can greatly facilitate our task when we manipulate data.
Packages in R are basically sets of additional functions that let you
do more stuff. The functions we've been using so far, like `str()` or
`data.frame()`, come built into R; loading packages can give you access to other
specific functions. Before you use a package for the first time you need to install
it on your machine, and then you should load it in every subsequent
R session when you need it.
- The package `dplyr` provides powerful tools for data manipulation tasks. It is built to work directly with data frames, with many manipulation tasks optimized.
- As we will see later on, sometimes we want a data frame to be reshaped to be able to do some specific analyses or for visualization. The package `tidyr` addresses this common problem of reshaping data and provides tools for manipulating data in a tidy way.
To learn more about `dplyr` and `tidyr` after the workshop, you may
want to check out this handy data transformation with `dplyr`
cheatsheet and this one about `tidyr`.
- The `tidyverse` package is an "umbrella-package" that installs several useful packages for data analysis which work well together, such as `tidyr`, `dplyr`, `ggplot2`, `tibble`, etc. These packages help us to work and interact with the data. They allow us to do many things with our data, such as subsetting, transforming, visualizing, etc.
To install and load the tidyverse
package type:
# no need to run this command as it's already been installed for you
# BiocManager::install("tidyverse")
## load the tidyverse packages, incl. dplyr
library("tidyverse")
Loading data with tidyverse
Instead of `read.csv()`, we will read in our data using the `read_csv()`
function from the tidyverse package `readr`.
rna <- read_csv("course-data/data/GSE96870/rnaseq.csv")
## view the data
rna
# A tibble: 32,428 × 19
gene sample expression organism age sex infection strain time tissue
<chr> <chr> <dbl> <chr> <dbl> <chr> <chr> <chr> <dbl> <chr>
1 Asl GSM254… 1170 Mus mus… 8 Fema… Influenz… C57BL… 8 Cereb…
2 Apod GSM254… 36194 Mus mus… 8 Fema… Influenz… C57BL… 8 Cereb…
3 Cyp2d22 GSM254… 4060 Mus mus… 8 Fema… Influenz… C57BL… 8 Cereb…
4 Klk6 GSM254… 287 Mus mus… 8 Fema… Influenz… C57BL… 8 Cereb…
5 Fcrls GSM254… 85 Mus mus… 8 Fema… Influenz… C57BL… 8 Cereb…
6 Slc2a4 GSM254… 782 Mus mus… 8 Fema… Influenz… C57BL… 8 Cereb…
7 Exd2 GSM254… 1619 Mus mus… 8 Fema… Influenz… C57BL… 8 Cereb…
8 Gjc2 GSM254… 288 Mus mus… 8 Fema… Influenz… C57BL… 8 Cereb…
9 Plp1 GSM254… 43217 Mus mus… 8 Fema… Influenz… C57BL… 8 Cereb…
10 Gnb4 GSM254… 1071 Mus mus… 8 Fema… Influenz… C57BL… 8 Cereb…
# … with 32,418 more rows, and 9 more variables: mouse <dbl>, ENTREZID <dbl>,
# product <chr>, ensembl_gene_id <chr>, external_synonym <chr>,
# chromosome_name <chr>, gene_biotype <chr>, phenotype_description <chr>,
# hsapiens_homolog_associated_gene_name <chr>
Notice that the class of the data is now referred to as a "tibble".
Tibbles tweak some of the behaviors of the data frame objects we introduced previously. The data structure is very similar to a data frame. For our purposes the only differences are that:
- It displays the data type of each column under its name. Note that `<dbl>` is a data type defined to hold numeric values with decimal points.
- It only prints the first few rows of data and only as many columns as fit on one screen.
We are now going to learn some of the most common `dplyr` functions:
- `select()`: subset columns
- `filter()`: subset rows on conditions
- `mutate()`: create new columns by using information from other columns
- `group_by()` and `summarize()`: create summary statistics on grouped data
- `arrange()`: sort results
- `count()`: count discrete values
Selecting columns and filtering rows
To select columns of a data frame, use `select()`. The first argument
to this function is the data frame (`rna`), and the subsequent
arguments are the columns to keep.
select(rna, gene, sample, tissue, expression)
# A tibble: 32,428 × 4
gene sample tissue expression
<chr> <chr> <chr> <dbl>
1 Asl GSM2545336 Cerebellum 1170
2 Apod GSM2545336 Cerebellum 36194
3 Cyp2d22 GSM2545336 Cerebellum 4060
4 Klk6 GSM2545336 Cerebellum 287
5 Fcrls GSM2545336 Cerebellum 85
6 Slc2a4 GSM2545336 Cerebellum 782
7 Exd2 GSM2545336 Cerebellum 1619
8 Gjc2 GSM2545336 Cerebellum 288
9 Plp1 GSM2545336 Cerebellum 43217
10 Gnb4 GSM2545336 Cerebellum 1071
# … with 32,418 more rows
To select all columns except certain ones, put a `-` in front of the variable to exclude it.
select(rna, -tissue, -organism)
# A tibble: 32,428 × 17
gene sample expression age sex infection strain time mouse ENTREZID
<chr> <chr> <dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Asl GSM2545… 1170 8 Fema… Influenz… C57BL… 8 14 109900
2 Apod GSM2545… 36194 8 Fema… Influenz… C57BL… 8 14 11815
3 Cyp2d22 GSM2545… 4060 8 Fema… Influenz… C57BL… 8 14 56448
4 Klk6 GSM2545… 287 8 Fema… Influenz… C57BL… 8 14 19144
5 Fcrls GSM2545… 85 8 Fema… Influenz… C57BL… 8 14 80891
6 Slc2a4 GSM2545… 782 8 Fema… Influenz… C57BL… 8 14 20528
7 Exd2 GSM2545… 1619 8 Fema… Influenz… C57BL… 8 14 97827
8 Gjc2 GSM2545… 288 8 Fema… Influenz… C57BL… 8 14 118454
9 Plp1 GSM2545… 43217 8 Fema… Influenz… C57BL… 8 14 18823
10 Gnb4 GSM2545… 1071 8 Fema… Influenz… C57BL… 8 14 14696
# … with 32,418 more rows, and 7 more variables: product <chr>,
# ensembl_gene_id <chr>, external_synonym <chr>, chromosome_name <chr>,
# gene_biotype <chr>, phenotype_description <chr>,
# hsapiens_homolog_associated_gene_name <chr>
This will select all the variables in `rna` except `tissue` and `organism`.
To choose rows based on specific criteria, use `filter()`:
filter(rna, sex == "Male")
# A tibble: 14,740 × 19
gene sample expression organism age sex infection strain time tissue
<chr> <chr> <dbl> <chr> <dbl> <chr> <chr> <chr> <dbl> <chr>
1 Asl GSM254… 626 Mus mus… 8 Male Influenz… C57BL… 4 Cereb…
2 Apod GSM254… 13021 Mus mus… 8 Male Influenz… C57BL… 4 Cereb…
3 Cyp2d22 GSM254… 2171 Mus mus… 8 Male Influenz… C57BL… 4 Cereb…
4 Klk6 GSM254… 448 Mus mus… 8 Male Influenz… C57BL… 4 Cereb…
5 Fcrls GSM254… 180 Mus mus… 8 Male Influenz… C57BL… 4 Cereb…
6 Slc2a4 GSM254… 313 Mus mus… 8 Male Influenz… C57BL… 4 Cereb…
7 Exd2 GSM254… 2366 Mus mus… 8 Male Influenz… C57BL… 4 Cereb…
8 Gjc2 GSM254… 310 Mus mus… 8 Male Influenz… C57BL… 4 Cereb…
9 Plp1 GSM254… 53126 Mus mus… 8 Male Influenz… C57BL… 4 Cereb…
10 Gnb4 GSM254… 1355 Mus mus… 8 Male Influenz… C57BL… 4 Cereb…
# … with 14,730 more rows, and 9 more variables: mouse <dbl>, ENTREZID <dbl>,
# product <chr>, ensembl_gene_id <chr>, external_synonym <chr>,
# chromosome_name <chr>, gene_biotype <chr>, phenotype_description <chr>,
# hsapiens_homolog_associated_gene_name <chr>
filter(rna, sex == "Male" & infection == "NonInfected")
# A tibble: 4,422 × 19
gene sample expression organism age sex infection strain time tissue
<chr> <chr> <dbl> <chr> <dbl> <chr> <chr> <chr> <dbl> <chr>
1 Asl GSM254… 535 Mus mus… 8 Male NonInfec… C57BL… 0 Cereb…
2 Apod GSM254… 13668 Mus mus… 8 Male NonInfec… C57BL… 0 Cereb…
3 Cyp2d22 GSM254… 2008 Mus mus… 8 Male NonInfec… C57BL… 0 Cereb…
4 Klk6 GSM254… 1101 Mus mus… 8 Male NonInfec… C57BL… 0 Cereb…
5 Fcrls GSM254… 375 Mus mus… 8 Male NonInfec… C57BL… 0 Cereb…
6 Slc2a4 GSM254… 249 Mus mus… 8 Male NonInfec… C57BL… 0 Cereb…
7 Exd2 GSM254… 3126 Mus mus… 8 Male NonInfec… C57BL… 0 Cereb…
8 Gjc2 GSM254… 791 Mus mus… 8 Male NonInfec… C57BL… 0 Cereb…
9 Plp1 GSM254… 98658 Mus mus… 8 Male NonInfec… C57BL… 0 Cereb…
10 Gnb4 GSM254… 2437 Mus mus… 8 Male NonInfec… C57BL… 0 Cereb…
# … with 4,412 more rows, and 9 more variables: mouse <dbl>, ENTREZID <dbl>,
# product <chr>, ensembl_gene_id <chr>, external_synonym <chr>,
# chromosome_name <chr>, gene_biotype <chr>, phenotype_description <chr>,
# hsapiens_homolog_associated_gene_name <chr>
Now let's imagine we are interested in the human homologs of the mouse
genes analysed in this dataset. This information can be found in the
last column of the `rna` tibble, named `hsapiens_homolog_associated_gene_name`.
Some mouse genes have no human homologs. These can be retrieved using a `filter()`
in the chain, and the `is.na()` function, which determines whether something is an `NA`.
rna_NA <- filter(rna, is.na(hsapiens_homolog_associated_gene_name))
select(rna_NA, gene, hsapiens_homolog_associated_gene_name)
# A tibble: 4,290 × 2
gene hsapiens_homolog_associated_gene_name
<chr> <chr>
1 Prodh <NA>
2 Tssk5 <NA>
3 Vmn2r1 <NA>
4 Gm10654 <NA>
5 Hexa <NA>
6 Sult1a1 <NA>
7 Gm6277 <NA>
8 Tmem198b <NA>
9 Adam1a <NA>
10 Ebp <NA>
# … with 4,280 more rows
If we want to keep only mouse genes that have a human homolog, we can insert a `!`
symbol that negates the result, so we're asking for every row where
`hsapiens_homolog_associated_gene_name` is not an `NA`.
rna_no_NA <- filter(rna, !is.na(hsapiens_homolog_associated_gene_name))
select(rna_no_NA, gene, hsapiens_homolog_associated_gene_name)
# A tibble: 28,138 × 2
gene hsapiens_homolog_associated_gene_name
<chr> <chr>
1 Asl ASL
2 Apod APOD
3 Cyp2d22 CYP2D6
4 Klk6 KLK6
5 Fcrls FCRL2
6 Slc2a4 SLC2A4
7 Exd2 EXD2
8 Gjc2 GJC2
9 Plp1 PLP1
10 Gnb4 GNB4
# … with 28,128 more rows
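Conditions in `filter()` can also be combined with the logical OR operator `|`, or tested against a set of values with `%in%`; the sketch below assumes the `rna` tibble loaded above:

```r
## keep rows measured at time 4 OR time 8 (the infected timepoints)
rna_infected_times <- filter(rna, time == 4 | time == 8)
## equivalently, test membership in a set of values
rna_infected_times <- filter(rna, time %in% c(4, 8))
```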
Pipes
What if you want to select and filter at the same time? There are three ways to do this: use intermediate steps, nested functions, or pipes.
With intermediate steps, you create a temporary data frame and use that as input to the next function, like this:
rna2 <- filter(rna, sex == "Male")
rna3 <- select(rna2, gene, sample, tissue, expression)
rna3
# A tibble: 14,740 × 4
gene sample tissue expression
<chr> <chr> <chr> <dbl>
1 Asl GSM2545340 Cerebellum 626
2 Apod GSM2545340 Cerebellum 13021
3 Cyp2d22 GSM2545340 Cerebellum 2171
4 Klk6 GSM2545340 Cerebellum 448
5 Fcrls GSM2545340 Cerebellum 180
6 Slc2a4 GSM2545340 Cerebellum 313
7 Exd2 GSM2545340 Cerebellum 2366
8 Gjc2 GSM2545340 Cerebellum 310
9 Plp1 GSM2545340 Cerebellum 53126
10 Gnb4 GSM2545340 Cerebellum 1355
# … with 14,730 more rows
This is readable, but can clutter up your workspace with lots of intermediate objects that you have to name individually. With multiple steps, that can be hard to keep track of.
You can also nest functions (i.e. one function inside of another), like this:
rna3 <- select(filter(rna, sex == "Male"), gene, sample, tissue, expression)
rna3
# A tibble: 14,740 × 4
gene sample tissue expression
<chr> <chr> <chr> <dbl>
1 Asl GSM2545340 Cerebellum 626
2 Apod GSM2545340 Cerebellum 13021
3 Cyp2d22 GSM2545340 Cerebellum 2171
4 Klk6 GSM2545340 Cerebellum 448
5 Fcrls GSM2545340 Cerebellum 180
6 Slc2a4 GSM2545340 Cerebellum 313
7 Exd2 GSM2545340 Cerebellum 2366
8 Gjc2 GSM2545340 Cerebellum 310
9 Plp1 GSM2545340 Cerebellum 53126
10 Gnb4 GSM2545340 Cerebellum 1355
# … with 14,730 more rows
This is handy, but can be difficult to read if too many functions are nested, as R evaluates the expression from the inside out (in this case, filtering, then selecting).
The last option, pipes, is a more recent addition to R. Pipes let you take the output of one function and send it directly to the next, which is useful when you need to do many things to the same dataset.
Pipes in R look like `%>%` and are made available via the `magrittr` package,
installed automatically with `dplyr`. If you use RStudio, you can type the pipe with
Ctrl + Shift + M if you have a PC or Cmd + Shift + M if you have a Mac.
In the code below, we use the pipe to send the `rna` dataset first through
`filter()` to keep rows where `sex` is Male, then through `select()`
to keep only the `gene`, `sample`, `tissue`, and `expression` columns.
The pipe `%>%` takes the object on its left and passes it directly as the first
argument to the function on its right, so we don't need to explicitly include the data frame
as an argument to the `filter()` and `select()` functions any more.
rna %>%
filter(sex == "Male") %>%
select(gene, sample, tissue, expression)
# A tibble: 14,740 × 4
gene sample tissue expression
<chr> <chr> <chr> <dbl>
1 Asl GSM2545340 Cerebellum 626
2 Apod GSM2545340 Cerebellum 13021
3 Cyp2d22 GSM2545340 Cerebellum 2171
4 Klk6 GSM2545340 Cerebellum 448
5 Fcrls GSM2545340 Cerebellum 180
6 Slc2a4 GSM2545340 Cerebellum 313
7 Exd2 GSM2545340 Cerebellum 2366
8 Gjc2 GSM2545340 Cerebellum 310
9 Plp1 GSM2545340 Cerebellum 53126
10 Gnb4 GSM2545340 Cerebellum 1355
# … with 14,730 more rows
Some may find it helpful to read the pipe like the word "then". For instance,
in the above example, we took the data frame `rna`, then we `filter`ed
for rows with `sex == "Male"`, then we `select`ed columns `gene`, `sample`,
`tissue`, and `expression`.
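As an aside relative to the text above, recent versions of R (4.1 and later) also provide a native pipe operator, `|>`, which behaves very similarly for simple chains like this one:

```r
## same result using R's native pipe (requires R >= 4.1)
rna |>
    filter(sex == "Male") |>
    select(gene, sample, tissue, expression)
```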
The dplyr
functions by themselves are somewhat
simple, but by combining them into linear workflows with the pipe, we can accomplish
more complex manipulations of data frames.
If we want to create a new object with this smaller version of the data, we can assign it a new name:
rna3 <- rna %>%
filter(sex == "Male") %>%
select(gene, sample, tissue, expression)
rna3
# A tibble: 14,740 × 4
gene sample tissue expression
<chr> <chr> <chr> <dbl>
1 Asl GSM2545340 Cerebellum 626
2 Apod GSM2545340 Cerebellum 13021
3 Cyp2d22 GSM2545340 Cerebellum 2171
4 Klk6 GSM2545340 Cerebellum 448
5 Fcrls GSM2545340 Cerebellum 180
6 Slc2a4 GSM2545340 Cerebellum 313
7 Exd2 GSM2545340 Cerebellum 2366
8 Gjc2 GSM2545340 Cerebellum 310
9 Plp1 GSM2545340 Cerebellum 53126
10 Gnb4 GSM2545340 Cerebellum 1355
# … with 14,730 more rows
Homework Challenge:
Using pipes, subset the `rna` data to keep genes with an expression higher than 50000 in female mice at time 0, and retain only the columns `gene`, `sample`, `time`, `expression` and `age`.
Solution
rna %>%
      filter(expression > 50000,
             sex == "Female",
             time == 0) %>%
      select(gene, sample, time, expression, age)
# A tibble: 9 × 5
  gene   sample      time expression   age
  <chr>  <chr>      <dbl>      <dbl> <dbl>
1 Plp1   GSM2545337     0     101241     8
2 Atp1b1 GSM2545337     0      53260     8
3 Plp1   GSM2545338     0      96534     8
4 Atp1b1 GSM2545338     0      50614     8
5 Plp1   GSM2545348     0     102790     8
6 Atp1b1 GSM2545348     0      59544     8
7 Plp1   GSM2545353     0      71237     8
8 Glul   GSM2545353     0      52451     8
9 Atp1b1 GSM2545353     0      61451     8
Mutate
Frequently you’ll want to create new columns based on the values of existing
columns, for example to do unit conversions, or to find the ratio of values in two
columns. For this we’ll use mutate()
.
To create a new column of time in hours:
rna %>%
mutate(time_hours = time * 24) %>%
select(time, time_hours)
# A tibble: 32,428 × 2
time time_hours
<dbl> <dbl>
1 8 192
2 8 192
3 8 192
4 8 192
5 8 192
6 8 192
7 8 192
8 8 192
9 8 192
10 8 192
# … with 32,418 more rows
You can also create a second new column based on the first new column within the same call of mutate()
:
rna %>%
mutate(time_hours = time * 24,
time_mn = time_hours * 60) %>%
select(time, time_hours, time_mn)
# A tibble: 32,428 × 3
time time_hours time_mn
<dbl> <dbl> <dbl>
1 8 192 11520
2 8 192 11520
3 8 192 11520
4 8 192 11520
5 8 192 11520
6 8 192 11520
7 8 192 11520
8 8 192 11520
9 8 192 11520
10 8 192 11520
# … with 32,418 more rows
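`mutate()` can also be combined with a condition, for example using `ifelse()`, to create a categorical column; the cutoff of 1000 below is purely illustrative:

```r
## flag genes as highly or lowly expressed (threshold chosen for illustration)
rna %>%
    mutate(expression_level = ifelse(expression > 1000, "high", "low")) %>%
    select(gene, expression, expression_level)
```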
Homework Challenge
Create a new data frame from the `rna` data that meets the following criteria: contains only the `gene`, `chromosome_name`, `phenotype_description`, `sample`, and `expression` columns and a new column giving the log expression of the gene. This data frame must only contain genes located on autosomes and associated with a phenotype_description.
Hint: think about how the commands should be ordered to produce this data frame!
Solution
rna %>%
      filter(chromosome_name != "X",
             chromosome_name != "Y") %>%
      mutate(log_expression = log(expression)) %>%
      select(gene, chromosome_name, phenotype_description, sample, log_expression) %>%
      filter(!is.na(phenotype_description))
# A tibble: 21,054 × 5
   gene    chromosome_name phenotype_description           sample log_expression
   <chr>   <chr>           <chr>                           <chr>           <dbl>
 1 Asl     5               abnormal circulating amino aci… GSM25…           7.06
 2 Apod    16              abnormal lipid homeostasis      GSM25…          10.5
 3 Cyp2d22 15              abnormal skin morphology        GSM25…           8.31
 4 Klk6    7               abnormal cytokine level         GSM25…           5.66
 5 Fcrls   3               decreased CD8-positive alpha-b… GSM25…           4.44
 6 Slc2a4  11              abnormal circulating glucose l… GSM25…           6.66
 7 Gjc2    11              Purkinje cell degeneration      GSM25…           5.66
 8 Gnb4    3               decreased anxiety-related resp… GSM25…           6.98
 9 Tnc     4               abnormal CNS synaptic transmis… GSM25…           5.39
10 Trf     9               abnormal circulating phosphate… GSM25…           9.18
# … with 21,044 more rows
Split-apply-combine data analysis
Many data analysis tasks can be approached using the
split-apply-combine paradigm: split the data into groups, apply some
analysis to each group, and then combine the results. dplyr
makes this very easy through the use of the group_by()
function.
rna %>%
group_by(gene)
# A tibble: 32,428 × 19
# Groups: gene [1,474]
gene sample expression organism age sex infection strain time tissue
<chr> <chr> <dbl> <chr> <dbl> <chr> <chr> <chr> <dbl> <chr>
1 Asl GSM254… 1170 Mus mus… 8 Fema… Influenz… C57BL… 8 Cereb…
2 Apod GSM254… 36194 Mus mus… 8 Fema… Influenz… C57BL… 8 Cereb…
3 Cyp2d22 GSM254… 4060 Mus mus… 8 Fema… Influenz… C57BL… 8 Cereb…
4 Klk6 GSM254… 287 Mus mus… 8 Fema… Influenz… C57BL… 8 Cereb…
5 Fcrls GSM254… 85 Mus mus… 8 Fema… Influenz… C57BL… 8 Cereb…
6 Slc2a4 GSM254… 782 Mus mus… 8 Fema… Influenz… C57BL… 8 Cereb…
7 Exd2 GSM254… 1619 Mus mus… 8 Fema… Influenz… C57BL… 8 Cereb…
8 Gjc2 GSM254… 288 Mus mus… 8 Fema… Influenz… C57BL… 8 Cereb…
9 Plp1 GSM254… 43217 Mus mus… 8 Fema… Influenz… C57BL… 8 Cereb…
10 Gnb4 GSM254… 1071 Mus mus… 8 Fema… Influenz… C57BL… 8 Cereb…
# … with 32,418 more rows, and 9 more variables: mouse <dbl>, ENTREZID <dbl>,
# product <chr>, ensembl_gene_id <chr>, external_synonym <chr>,
# chromosome_name <chr>, gene_biotype <chr>, phenotype_description <chr>,
# hsapiens_homolog_associated_gene_name <chr>
The `group_by()` function doesn't perform any data processing; it
groups the data into subsets: in the example above, our initial
tibble of 32,428 observations is split into
1,474 groups based on the `gene` variable.
We could similarly decide to group the tibble by the samples:
rna %>%
group_by(sample)
# A tibble: 32,428 × 19
# Groups: sample [22]
gene sample expression organism age sex infection strain time tissue
<chr> <chr> <dbl> <chr> <dbl> <chr> <chr> <chr> <dbl> <chr>
1 Asl GSM254… 1170 Mus mus… 8 Fema… Influenz… C57BL… 8 Cereb…
2 Apod GSM254… 36194 Mus mus… 8 Fema… Influenz… C57BL… 8 Cereb…
3 Cyp2d22 GSM254… 4060 Mus mus… 8 Fema… Influenz… C57BL… 8 Cereb…
4 Klk6 GSM254… 287 Mus mus… 8 Fema… Influenz… C57BL… 8 Cereb…
5 Fcrls GSM254… 85 Mus mus… 8 Fema… Influenz… C57BL… 8 Cereb…
6 Slc2a4 GSM254… 782 Mus mus… 8 Fema… Influenz… C57BL… 8 Cereb…
7 Exd2 GSM254… 1619 Mus mus… 8 Fema… Influenz… C57BL… 8 Cereb…
8 Gjc2 GSM254… 288 Mus mus… 8 Fema… Influenz… C57BL… 8 Cereb…
9 Plp1 GSM254… 43217 Mus mus… 8 Fema… Influenz… C57BL… 8 Cereb…
10 Gnb4 GSM254… 1071 Mus mus… 8 Fema… Influenz… C57BL… 8 Cereb…
# … with 32,418 more rows, and 9 more variables: mouse <dbl>, ENTREZID <dbl>,
# product <chr>, ensembl_gene_id <chr>, external_synonym <chr>,
# chromosome_name <chr>, gene_biotype <chr>, phenotype_description <chr>,
# hsapiens_homolog_associated_gene_name <chr>
Here our initial tibble of 32,428 observations is split into
22 groups based on the `sample` variable.
Once the data have been grouped, subsequent operations will be applied on each group independently.
The summarize()
function
group_by()
is often used together with summarize()
, which
collapses each group into a single-row summary of that group.
group_by()
takes as arguments the column names that contain the
categorical variables for which you want to calculate the summary
statistics. So to compute the mean expression
by gene:
rna %>%
group_by(gene) %>%
summarize(mean_expression = mean(expression))
# A tibble: 1,474 × 2
gene mean_expression
<chr> <dbl>
1 Aamp 4751.
2 Abca12 4.55
3 Abcc8 2498.
4 Abhd14a 525.
5 Abi2 4909.
6 Abi3bp 1002.
7 Abl2 2124.
8 Acadl 2053.
9 Acap3 3536.
10 Acbd4 1431.
# … with 1,464 more rows
We could also want to calculate the mean expression levels of all genes in each sample:
rna %>%
group_by(sample) %>%
summarize(mean_expression = mean(expression))
# A tibble: 22 × 2
sample mean_expression
<chr> <dbl>
1 GSM2545336 2062.
2 GSM2545337 1766.
3 GSM2545338 1668.
4 GSM2545339 1696.
5 GSM2545340 1682.
6 GSM2545341 1638.
7 GSM2545342 1594.
8 GSM2545343 2107.
9 GSM2545344 1712.
10 GSM2545345 1700.
# … with 12 more rows
But we can also group by multiple columns:
rna %>%
group_by(gene, infection, time) %>%
summarize(mean_expression = mean(expression))
`summarise()` has grouped output by 'gene', 'infection'. You can override using
the `.groups` argument.
# A tibble: 4,422 × 4
# Groups: gene, infection [2,948]
gene infection time mean_expression
<chr> <chr> <dbl> <dbl>
1 Aamp InfluenzaA 4 4870
2 Aamp InfluenzaA 8 4763.
3 Aamp NonInfected 0 4603.
4 Abca12 InfluenzaA 4 4.25
5 Abca12 InfluenzaA 8 4.14
6 Abca12 NonInfected 0 5.29
7 Abcc8 InfluenzaA 4 2609.
8 Abcc8 InfluenzaA 8 2292.
9 Abcc8 NonInfected 0 2576.
10 Abhd14a InfluenzaA 4 547.
# … with 4,412 more rows
Once the data is grouped, you can also summarize multiple variables at the same
time (and not necessarily on the same variable). For instance, we could add a
column indicating the median expression
by gene and by condition:
rna %>%
group_by(gene, infection, time) %>%
summarize(mean_expression = mean(expression),
median_expression = median(expression))
`summarise()` has grouped output by 'gene', 'infection'. You can override using
the `.groups` argument.
# A tibble: 4,422 × 5
# Groups: gene, infection [2,948]
gene infection time mean_expression median_expression
<chr> <chr> <dbl> <dbl> <dbl>
1 Aamp InfluenzaA 4 4870 4708
2 Aamp InfluenzaA 8 4763. 4813
3 Aamp NonInfected 0 4603. 4717
4 Abca12 InfluenzaA 4 4.25 4.5
5 Abca12 InfluenzaA 8 4.14 4
6 Abca12 NonInfected 0 5.29 5
7 Abcc8 InfluenzaA 4 2609. 2424.
8 Abcc8 InfluenzaA 8 2292. 2224
9 Abcc8 NonInfected 0 2576. 2578
10 Abhd14a InfluenzaA 4 547. 523
# … with 4,412 more rows
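The `summarise()` message shown above can be silenced by setting the `.groups` argument of `summarize()` explicitly, for instance to drop all grouping from the result:

```r
## return an ungrouped tibble and suppress the grouping message
rna %>%
    group_by(gene, infection, time) %>%
    summarize(mean_expression = mean(expression),
              median_expression = median(expression),
              .groups = "drop")
```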
Challenge
Calculate the mean expression level of gene “Dok3” by timepoints.
Solution
rna %>% filter(gene == "Dok3") %>% group_by(time) %>% summarize(mean = mean(expression))
# A tibble: 3 × 2 time mean <dbl> <dbl> 1 0 169 2 4 156. 3 8 61
Counting
When working with data, we often want to know the number of observations found
for each factor or combination of factors. For this task, `dplyr` provides
`count()`. For example, if we wanted to count the number of rows of data for
infected and non-infected samples, we would do:
rna %>%
count(infection)
# A tibble: 2 × 2
infection n
<chr> <int>
1 InfluenzaA 22110
2 NonInfected 10318
The `count()` function is shorthand for something we've already seen: grouping by a variable, and summarizing it by counting the number of observations in that group. In other words, `rna %>% count(infection)` is equivalent to:
rna %>%
group_by(infection) %>%
summarise(n = n())
# A tibble: 2 × 2
infection n
<chr> <int>
1 InfluenzaA 22110
2 NonInfected 10318
The previous example shows the use of `count()` to count the number of rows/observations
for one factor (i.e., `infection`).
If we wanted to count combinations of factors, such as `infection` and `time`,
we would specify the first and the second factor as the arguments of `count()`:
rna %>%
count(infection, time)
# A tibble: 3 × 3
infection time n
<chr> <dbl> <int>
1 InfluenzaA 4 11792
2 InfluenzaA 8 10318
3 NonInfected 0 10318
which is equivalent to this:
rna %>%
group_by(infection, time) %>%
summarize(n = n())
`summarise()` has grouped output by 'infection'. You can override using the
`.groups` argument.
# A tibble: 3 × 3
# Groups: infection [2]
infection time n
<chr> <dbl> <int>
1 InfluenzaA 4 11792
2 InfluenzaA 8 10318
3 NonInfected 0 10318
It is sometimes useful to sort the result to facilitate the comparisons.
We can use arrange()
to sort the table.
For instance, we might want to arrange the table above by time:
rna %>%
count(infection, time) %>%
arrange(time)
# A tibble: 3 × 3
infection time n
<chr> <dbl> <int>
1 NonInfected 0 10318
2 InfluenzaA 4 11792
3 InfluenzaA 8 10318
or by counts:
rna %>%
count(infection, time) %>%
arrange(n)
# A tibble: 3 × 3
infection time n
<chr> <dbl> <int>
1 InfluenzaA 8 10318
2 NonInfected 0 10318
3 InfluenzaA 4 11792
To sort in descending order, we need to add the desc()
function:
rna %>%
count(infection, time) %>%
arrange(desc(n))
# A tibble: 3 × 3
infection time n
<chr> <dbl> <int>
1 InfluenzaA 4 11792
2 InfluenzaA 8 10318
3 NonInfected 0 10318
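`arrange()` also accepts several columns, sorting by the first one and breaking ties with the next; for example, sorting by infection status and then by descending counts within each status:

```r
## sort by infection first, then by descending count within each status
rna %>%
    count(infection, time) %>%
    arrange(infection, desc(n))
```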
Challenge
- How many genes were analysed in each sample?
- Use `group_by()` and `summarize()` to evaluate the sequencing depth (the sum of all counts) in each sample. Which sample has the highest sequencing depth?
- Pick one sample and evaluate the number of genes by biotype.
- Identify genes associated with the "abnormal DNA methylation" phenotype description, and calculate their mean expression (in log) at time 0, time 4 and time 8.
Solution
## 1.
rna %>%
      count(sample)
# A tibble: 22 × 2
   sample          n
   <chr>       <int>
 1 GSM2545336   1474
 2 GSM2545337   1474
 3 GSM2545338   1474
 4 GSM2545339   1474
 5 GSM2545340   1474
 6 GSM2545341   1474
 7 GSM2545342   1474
 8 GSM2545343   1474
 9 GSM2545344   1474
10 GSM2545345   1474
# … with 12 more rows
## 2.
rna %>%
      group_by(sample) %>%
      summarize(seq_depth = sum(expression)) %>%
      arrange(desc(seq_depth))
# A tibble: 22 × 2
   sample     seq_depth
   <chr>          <dbl>
 1 GSM2545350   3255566
 2 GSM2545352   3216163
 3 GSM2545343   3105652
 4 GSM2545336   3039671
 5 GSM2545380   3036098
 6 GSM2545353   2953249
 7 GSM2545348   2913678
 8 GSM2545362   2913517
 9 GSM2545351   2782464
10 GSM2545349   2758006
# … with 12 more rows
## 3.
rna %>%
      filter(sample == "GSM2545336") %>%
      count(gene_biotype) %>%
      arrange(desc(n))
# A tibble: 13 × 2
   gene_biotype                           n
   <chr>                              <int>
 1 protein_coding                      1321
 2 lncRNA                                69
 3 processed_pseudogene                  59
 4 miRNA                                  7
 5 snoRNA                                 5
 6 TEC                                    4
 7 polymorphic_pseudogene                 2
 8 unprocessed_pseudogene                 2
 9 IG_C_gene                              1
10 scaRNA                                 1
11 transcribed_processed_pseudogene       1
12 transcribed_unitary_pseudogene         1
13 transcribed_unprocessed_pseudogene     1
## 4.
rna %>%
      filter(phenotype_description == "abnormal DNA methylation") %>%
      group_by(gene, time) %>%
      summarize(mean_expression = mean(log(expression)))
`summarise()` has grouped output by 'gene'. You can override using the `.groups` argument.
# A tibble: 6 × 3
# Groups:   gene [2]
  gene   time mean_expression
  <chr> <dbl>           <dbl>
1 Xist      0            6.95
2 Xist      4            6.34
3 Xist      8            7.13
4 Zdbf2     0            6.27
5 Zdbf2     4            6.27
6 Zdbf2     8            6.19
Reshaping data
In the rna
tibble, the rows contain expression values (the unit) that are
associated with a combination of 2 other variables: gene
and sample
.
All the other columns correspond to variables describing either the sample (organism, age, sex,…) or the gene (gene_biotype, ENTREZ_ID, product…). The variables that don’t change with genes or with samples will have the same value in all the rows.
rna %>%
arrange(gene)
# A tibble: 32,428 × 19
gene sample expression organism age sex infection strain time tissue
<chr> <chr> <dbl> <chr> <dbl> <chr> <chr> <chr> <dbl> <chr>
1 Aamp GSM25453… 5621 Mus mus… 8 Fema… Influenz… C57BL… 8 Cereb…
2 Aamp GSM25453… 4049 Mus mus… 8 Fema… NonInfec… C57BL… 0 Cereb…
3 Aamp GSM25453… 3797 Mus mus… 8 Fema… NonInfec… C57BL… 0 Cereb…
4 Aamp GSM25453… 4375 Mus mus… 8 Fema… Influenz… C57BL… 4 Cereb…
5 Aamp GSM25453… 4095 Mus mus… 8 Male Influenz… C57BL… 4 Cereb…
6 Aamp GSM25453… 3867 Mus mus… 8 Male Influenz… C57BL… 8 Cereb…
7 Aamp GSM25453… 3578 Mus mus… 8 Fema… Influenz… C57BL… 8 Cereb…
8 Aamp GSM25453… 5097 Mus mus… 8 Male NonInfec… C57BL… 0 Cereb…
9 Aamp GSM25453… 4202 Mus mus… 8 Fema… Influenz… C57BL… 4 Cereb…
10 Aamp GSM25453… 4701 Mus mus… 8 Male Influenz… C57BL… 4 Cereb…
# … with 32,418 more rows, and 9 more variables: mouse <dbl>, ENTREZID <dbl>,
# product <chr>, ensembl_gene_id <chr>, external_synonym <chr>,
# chromosome_name <chr>, gene_biotype <chr>, phenotype_description <chr>,
# hsapiens_homolog_associated_gene_name <chr>
This structure is called a long-format
, as one column contains all the values,
and other column(s) list(s) the context of the value.
In certain cases, the long-format
is not really “human-readable”, and another format,
a wide-format
is preferred, as a more compact way of representing the data.
This is typically the case with gene expression values that scientists are used to
viewing as matrices, where rows represent genes and columns represent samples.
In this format, it therefore becomes straightforward to explore the relationship between the gene expression levels within, and between, the samples.
# A tibble: 1,474 × 23
gene GSM2545336 GSM2545337 GSM2545338 GSM2545339 GSM2545340 GSM2545341
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Asl 1170 361 400 586 626 988
2 Apod 36194 10347 9173 10620 13021 29594
3 Cyp2d22 4060 1616 1603 1901 2171 3349
4 Klk6 287 629 641 578 448 195
5 Fcrls 85 233 244 237 180 38
6 Slc2a4 782 231 248 265 313 786
7 Exd2 1619 2288 2235 2513 2366 1359
8 Gjc2 288 595 568 551 310 146
9 Plp1 43217 101241 96534 58354 53126 27173
10 Gnb4 1071 1791 1867 1430 1355 798
# … with 1,464 more rows, and 16 more variables: GSM2545342 <dbl>,
# GSM2545343 <dbl>, GSM2545344 <dbl>, GSM2545345 <dbl>, GSM2545346 <dbl>,
# GSM2545347 <dbl>, GSM2545348 <dbl>, GSM2545349 <dbl>, GSM2545350 <dbl>,
# GSM2545351 <dbl>, GSM2545352 <dbl>, GSM2545353 <dbl>, GSM2545354 <dbl>,
# GSM2545362 <dbl>, GSM2545363 <dbl>, GSM2545380 <dbl>
To convert the gene expression values from rna
into a wide-format,
we need to create a new table where the values of the sample
column would
become the names of column variables.
The key point here is that we are still following a tidy data structure, but we have reshaped the data according to the observations of interest: expression levels per gene instead of recording them per gene and per sample.
The opposite transformation would be to transform column names into values of a new variable.
We can do both of these transformations with two tidyr
functions,
pivot_longer()
and pivot_wider()
(see
here for
details).
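Before applying these functions to the full `rna` table, here is a minimal sketch of the round trip on a small made-up tibble (the `toy`, `toy_wide` and `toy_long` names are ours, for illustration only):

```r
library(tidyverse)

# A tiny long-format tibble: one row per gene/sample combination
toy <- tribble(
  ~gene, ~sample, ~expression,
  "A",   "s1",    1,
  "A",   "s2",    2,
  "B",   "s1",    3,
  "B",   "s2",    4
)

# Long to wide: the values of `sample` become column names,
# producing one row per gene with columns s1 and s2
toy_wide <- toy %>%
  pivot_wider(names_from = sample, values_from = expression)

# Wide back to long: column names become values of a new `sample` variable
toy_long <- toy_wide %>%
  pivot_longer(names_to = "sample", values_to = "expression", -gene)
```

The same pattern, with `sample` and `expression` replaced by the relevant columns, is what we apply to `rna` below.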
Pivoting the data into a wider format
Let’s first select the 3 first columns of rna
and use pivot_wider()
to transform data in a wide-format.
rna_exp <- rna %>%
select(gene, sample, expression)
rna_exp
# A tibble: 32,428 × 3
gene sample expression
<chr> <chr> <dbl>
1 Asl GSM2545336 1170
2 Apod GSM2545336 36194
3 Cyp2d22 GSM2545336 4060
4 Klk6 GSM2545336 287
5 Fcrls GSM2545336 85
6 Slc2a4 GSM2545336 782
7 Exd2 GSM2545336 1619
8 Gjc2 GSM2545336 288
9 Plp1 GSM2545336 43217
10 Gnb4 GSM2545336 1071
# … with 32,418 more rows
`pivot_wider` takes three main arguments:

- the data to be transformed;
- `names_from`: the column whose values will become new column names;
- `values_from`: the column whose values will fill the new columns.
rna_wide <- rna_exp %>%
pivot_wider(names_from = sample,
values_from = expression)
rna_wide
# A tibble: 1,474 × 23
gene GSM2545336 GSM2545337 GSM2545338 GSM2545339 GSM2545340 GSM2545341
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Asl 1170 361 400 586 626 988
2 Apod 36194 10347 9173 10620 13021 29594
3 Cyp2d22 4060 1616 1603 1901 2171 3349
4 Klk6 287 629 641 578 448 195
5 Fcrls 85 233 244 237 180 38
6 Slc2a4 782 231 248 265 313 786
7 Exd2 1619 2288 2235 2513 2366 1359
8 Gjc2 288 595 568 551 310 146
9 Plp1 43217 101241 96534 58354 53126 27173
10 Gnb4 1071 1791 1867 1430 1355 798
# … with 1,464 more rows, and 16 more variables: GSM2545342 <dbl>,
# GSM2545343 <dbl>, GSM2545344 <dbl>, GSM2545345 <dbl>, GSM2545346 <dbl>,
# GSM2545347 <dbl>, GSM2545348 <dbl>, GSM2545349 <dbl>, GSM2545350 <dbl>,
# GSM2545351 <dbl>, GSM2545352 <dbl>, GSM2545353 <dbl>, GSM2545354 <dbl>,
# GSM2545362 <dbl>, GSM2545363 <dbl>, GSM2545380 <dbl>
Pivoting data into a longer format
In the opposite situation we are using the column names and turning them into a pair of new variables. One variable represents the column names as values, and the other variable contains the values previously associated with the column names.
`pivot_longer()` takes four main arguments:

- the data to be transformed;
- `names_to`: the new column name we wish to create and populate with the current column names;
- `values_to`: the new column name we wish to create and populate with current values;
- the names of the columns to be used to populate the `names_to` and `values_to` variables (or to drop).
To recreate `rna_long` from `rna_wide` we would create a key called `sample` and a value called `expression`, and use all columns except `gene` for the key variable. Here we drop the `gene` column with a minus sign.
Notice that the new variable names need to be quoted here.
rna_long <- rna_wide %>%
pivot_longer(names_to = "sample",
values_to = "expression",
-gene)
rna_long
# A tibble: 32,428 × 3
gene sample expression
<chr> <chr> <dbl>
1 Asl GSM2545336 1170
2 Asl GSM2545337 361
3 Asl GSM2545338 400
4 Asl GSM2545339 586
5 Asl GSM2545340 626
6 Asl GSM2545341 988
7 Asl GSM2545342 836
8 Asl GSM2545343 535
9 Asl GSM2545344 586
10 Asl GSM2545345 597
# … with 32,418 more rows
We could also have used a specification for what columns to
include. This can be useful if you have a large number of identifying
columns, and it’s easier to specify what to gather than what to leave
alone. Here the starts_with()
function can help to retrieve sample
names without having to list them all!
Another possibility would be to use the :
operator!
rna_wide %>%
pivot_longer(names_to = "sample",
values_to = "expression",
cols = starts_with("GSM"))
# A tibble: 32,428 × 3
gene sample expression
<chr> <chr> <dbl>
1 Asl GSM2545336 1170
2 Asl GSM2545337 361
3 Asl GSM2545338 400
4 Asl GSM2545339 586
5 Asl GSM2545340 626
6 Asl GSM2545341 988
7 Asl GSM2545342 836
8 Asl GSM2545343 535
9 Asl GSM2545344 586
10 Asl GSM2545345 597
# … with 32,418 more rows
rna_wide %>%
pivot_longer(names_to = "sample",
values_to = "expression",
GSM2545336:GSM2545380)
# A tibble: 32,428 × 3
gene sample expression
<chr> <chr> <dbl>
1 Asl GSM2545336 1170
2 Asl GSM2545337 361
3 Asl GSM2545338 400
4 Asl GSM2545339 586
5 Asl GSM2545340 626
6 Asl GSM2545341 988
7 Asl GSM2545342 836
8 Asl GSM2545343 535
9 Asl GSM2545344 586
10 Asl GSM2545345 597
# … with 32,418 more rows
Challenge
Subset genes located on X and Y chromosomes from the `rna` data frame and spread the data frame with `sex` as columns, `chromosome_name` as rows, and the mean expression of genes located in each chromosome as the values, as in the following tibble:

You will need to summarize before reshaping!
Let’s first calculate the mean expression level of X and Y linked genes from male and female samples…
rna %>%
    filter(chromosome_name == "Y" | chromosome_name == "X") %>%
    group_by(sex, chromosome_name) %>%
    summarize(mean = mean(expression))
`summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
# A tibble: 4 × 3
# Groups:   sex [2]
  sex    chromosome_name  mean
  <chr>  <chr>           <dbl>
1 Female X               3504.
2 Female Y                  3
3 Male   X               2497.
4 Male   Y               2117.
And pivot the table to wide format
rna_1 <- rna %>%
    filter(chromosome_name == "Y" | chromosome_name == "X") %>%
    group_by(sex, chromosome_name) %>%
    summarize(mean = mean(expression)) %>%
    pivot_wider(names_from = sex, values_from = mean)
`summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
rna_1
# A tibble: 2 × 3
  chromosome_name Female  Male
  <chr>            <dbl> <dbl>
1 X                3504. 2497.
2 Y                   3  2117.
Now take that data frame and transform it with `pivot_longer()` so each row is a unique `chromosome_name` by `gender` combination.

Solution
rna_1 %>%
    pivot_longer(names_to = "gender",
                 values_to = "mean",
                 -chromosome_name)
# A tibble: 4 × 3
  chromosome_name gender  mean
  <chr>           <chr>  <dbl>
1 X               Female 3504.
2 X               Male   2497.
3 Y               Female    3
4 Y               Male   2117.
Homework Challenge
- Use the `rna` dataset to create an expression table where each row represents the mean expression levels of genes and columns represent the different timepoints.
- Use the previous table containing mean expression levels per timepoint and create a new column containing fold-changes between timepoint 8 and timepoint 0, and fold-changes between timepoint 8 and timepoint 4. Convert this table into a long-format table gathering the fold-changes calculated.
Solution
Let’s first calculate the mean expression by gene and by time
rna %>%
group_by(gene, time) %>%
summarize(mean_exp = mean(expression))
`summarise()` has grouped output by 'gene'. You can override using the
`.groups` argument.
# A tibble: 4,422 × 3
# Groups: gene [1,474]
gene time mean_exp
<chr> <dbl> <dbl>
1 Aamp 0 4603.
2 Aamp 4 4870
3 Aamp 8 4763.
4 Abca12 0 5.29
5 Abca12 4 4.25
6 Abca12 8 4.14
7 Abcc8 0 2576.
8 Abcc8 4 2609.
9 Abcc8 8 2292.
10 Abhd14a 0 591.
# … with 4,412 more rows
before using the pivot_wider() function
rna_time <- rna %>%
group_by(gene, time) %>%
summarize(mean_exp = mean(expression)) %>%
pivot_wider(names_from = time,
values_from = mean_exp)
`summarise()` has grouped output by 'gene'. You can override using the
`.groups` argument.
rna_time
# A tibble: 1,474 × 4
# Groups: gene [1,474]
gene `0` `4` `8`
<chr> <dbl> <dbl> <dbl>
1 Aamp 4603. 4870 4763.
2 Abca12 5.29 4.25 4.14
3 Abcc8 2576. 2609. 2292.
4 Abhd14a 591. 547. 432.
5 Abi2 4881. 4903. 4945.
6 Abi3bp 1175. 1061. 762.
7 Abl2 2170. 2078. 2131.
8 Acadl 2059. 2099 1995.
9 Acap3 3745 3446. 3431.
10 Acbd4 1219. 1410. 1668.
# … with 1,464 more rows
Notice that this generates a tibble with some column names starting with a number. If we wanted to select the column corresponding to a timepoint, we could not use the column names directly… What happens when we select column 4?
rna %>%
group_by(gene, time) %>%
summarize(mean_exp = mean(expression)) %>%
pivot_wider(names_from = time,
values_from = mean_exp) %>%
select(gene, 4)
`summarise()` has grouped output by 'gene'. You can override using the
`.groups` argument.
# A tibble: 1,474 × 2
# Groups: gene [1,474]
gene `8`
<chr> <dbl>
1 Aamp 4763.
2 Abca12 4.14
3 Abcc8 2292.
4 Abhd14a 432.
5 Abi2 4945.
6 Abi3bp 762.
7 Abl2 2131.
8 Acadl 1995.
9 Acap3 3431.
10 Acbd4 1668.
# … with 1,464 more rows
To select the timepoint 4, we would have to quote the column name with backticks:
rna %>%
group_by(gene, time) %>%
summarize(mean_exp = mean(expression)) %>%
pivot_wider(names_from = time,
values_from = mean_exp) %>%
select(gene, `4`)
`summarise()` has grouped output by 'gene'. You can override using the
`.groups` argument.
# A tibble: 1,474 × 2
# Groups: gene [1,474]
gene `4`
<chr> <dbl>
1 Aamp 4870
2 Abca12 4.25
3 Abcc8 2609.
4 Abhd14a 547.
5 Abi2 4903.
6 Abi3bp 1061.
7 Abl2 2078.
8 Acadl 2099
9 Acap3 3446.
10 Acbd4 1410.
# … with 1,464 more rows
Another possibility would be to rename the columns, choosing names that don’t start with a number:
rna_time <- rna %>%
group_by(gene, time) %>%
summarize(mean_exp = mean(expression)) %>%
pivot_wider(names_from = time,
values_from = mean_exp) %>%
rename("time0" = `0`, "time4" = `4`, "time8" = `8`)
`summarise()` has grouped output by 'gene'. You can override using the
`.groups` argument.
Calculate the fold changes:
rna_time %>%
mutate(time_8_vs_0 = time8 / time0, time_8_vs_4 = time8 / time4)
# A tibble: 1,474 × 6
# Groups: gene [1,474]
gene time0 time4 time8 time_8_vs_0 time_8_vs_4
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Aamp 4603. 4870 4763. 1.03 0.978
2 Abca12 5.29 4.25 4.14 0.784 0.975
3 Abcc8 2576. 2609. 2292. 0.889 0.878
4 Abhd14a 591. 547. 432. 0.731 0.791
5 Abi2 4881. 4903. 4945. 1.01 1.01
6 Abi3bp 1175. 1061. 762. 0.649 0.719
7 Abl2 2170. 2078. 2131. 0.982 1.03
8 Acadl 2059. 2099 1995. 0.969 0.950
9 Acap3 3745 3446. 3431. 0.916 0.996
10 Acbd4 1219. 1410. 1668. 1.37 1.18
# … with 1,464 more rows
And use the pivot_longer() function:
rna_time %>%
mutate(time_8_vs_0 = time8 / time0, time_8_vs_4 = time8 / time4) %>%
pivot_longer(names_to = "comparisons",
values_to = "Fold_changes",
time_8_vs_0:time_8_vs_4)
# A tibble: 2,948 × 6
# Groups: gene [1,474]
gene time0 time4 time8 comparisons Fold_changes
<chr> <dbl> <dbl> <dbl> <chr> <dbl>
1 Aamp 4603. 4870 4763. time_8_vs_0 1.03
2 Aamp 4603. 4870 4763. time_8_vs_4 0.978
3 Abca12 5.29 4.25 4.14 time_8_vs_0 0.784
4 Abca12 5.29 4.25 4.14 time_8_vs_4 0.975
5 Abcc8 2576. 2609. 2292. time_8_vs_0 0.889
6 Abcc8 2576. 2609. 2292. time_8_vs_4 0.878
7 Abhd14a 591. 547. 432. time_8_vs_0 0.731
8 Abhd14a 591. 547. 432. time_8_vs_4 0.791
9 Abi2 4881. 4903. 4945. time_8_vs_0 1.01
10 Abi2 4881. 4903. 4945. time_8_vs_4 1.01
# … with 2,938 more rows
Exporting data
Now that you have learned how to use dplyr
to extract information from
or summarize your raw data, you may want to export these new data sets to share
them with your collaborators or for archival purposes.
Similar to the read_csv()
function used for reading CSV files into R, there is
a write_csv()
function that generates CSV files from data frames.
Before using write_csv()
, we are going to create a new folder, data_output
,
in our working directory (if we haven’t already created it) that will store this generated dataset. We don’t want
to write generated datasets in the same directory as our raw data.
It’s good practice to keep them separate. The data
folder should only contain
the raw, unaltered data, and should be left alone to make sure we don’t delete
or modify it. In contrast, our script will generate the contents of the data_output
directory, so even if the files it contains are deleted, we can always
re-generate them.
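The folder can also be created from within R itself; this is a small convenience sketch (assuming the working directory is the project root), and you can equally create the folder in your file browser:

```r
# Create the data_output folder if it does not already exist;
# dir.exists() avoids a warning on repeated runs
if (!dir.exists("data_output")) {
    dir.create("data_output")
}
```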
Let’s use write_csv()
to save the rna_wide table that we have created previously.
write_csv(rna_wide, file = "data_output/rna_wide.csv")
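As an optional sanity check, an exported file can be read back with `read_csv()` to confirm the round trip. Here is a minimal, self-contained sketch using a temporary file and a made-up two-row table (the `tmp` and `check` names are ours, for illustration only):

```r
library(readr)

# Write a small made-up table to a temporary CSV file, then read it back
tmp <- tempfile(fileext = ".csv")
write_csv(data.frame(gene = c("A", "B"), GSM1 = c(1, 2)), file = tmp)
check <- read_csv(tmp, show_col_types = FALSE)

# The round trip preserves the number of rows and columns
nrow(check)
ncol(check)
```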
Key Points
Tabular data in R using the tidyverse meta-package
Data visualization
Overview
Teaching: XX min
Exercises: XX minQuestions
Visualization in R
Objectives
Produce scatter plots, boxplots, line plots, etc. using ggplot.
Set universal plot settings.
Describe what faceting is and apply faceting in ggplot.
Modify the aesthetics of an existing ggplot plot (including axis labels and color).
Build complex and customized plots from data in a data frame.
Data Visualization
We start by loading the required packages. ggplot2
is included
in the tidyverse
package.
library("tidyverse")
If not still in the workspace, load the data we saved in the previous lesson.
rna <- read_csv("course-data/data/GSE96870/rnaseq.csv")
The Data Visualization Cheat Sheet will cover the basics and more
advanced features of ggplot2
and, in addition to serving as a reminder, will help you get an
overview of the many data representations available in the
package. The following
video tutorials (part 1
and 2) by Thomas Lin
Pedersen are also very instructive.
Plotting with ggplot2
ggplot2
is a plotting package that makes it simple to create
complex plots from data in a data frame. It provides a more
programmatic interface for specifying what variables to plot, how they
are displayed, and general visual properties. The theoretical
foundation that supports ggplot2
is the Grammar of Graphics
(@Wilkinson:2005). Using this approach, we only need minimal changes
if the underlying data change or if we decide to change from a bar
plot to a scatterplot. This helps in creating publication quality
plots with minimal amounts of adjustments and tweaking.
There is a book about ggplot2
(@ggplot2book) that provides a good
overview, but it is outdated. The 3rd edition is in preparation and
will be freely available online. The
ggplot2
webpage
(https://ggplot2.tidyverse.org)
provides ample documentation.
ggplot2
functions like data in the ‘long’ format, i.e., a column for
every dimension, and a row for every observation. Well-structured data
will save you lots of time when making figures with ggplot2
.
ggplot graphics are built step by step by adding new elements. Adding layers in this fashion allows for extensive flexibility and customization of plots.
The idea behind the Grammar of Graphics is that you can build every graph from the same 3 components: (1) a data set, (2) a coordinate system, and (3) geoms, i.e. visual marks that represent data points 1.
To build a ggplot, we will use the following basic template that can be used for different types of plots:
ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) + <GEOM_FUNCTION>()
- use the
ggplot()
function and bind the plot to a specific data frame using thedata
argument
ggplot(data = rna)
- define a mapping (using the aesthetic (
aes
) function), by selecting the variables to be plotted and specifying how to present them in the graph, e.g. as x/y positions or characteristics such as size, shape, color, etc.
ggplot(data = rna, mapping = aes(x = expression))
-
add ‘geoms’ – geometries, or graphical representations of the data in the plot (points, lines, bars).
ggplot2
offers many different geoms; we will use some common ones today, including:

- `geom_point()` for scatter plots, dot plots, etc.
- `geom_histogram()` for histograms
- `geom_boxplot()` for, well, boxplots!
- `geom_line()` for trend lines, time series, etc.
To add a geom(etry) to the plot use the +
operator. Let’s use geom_histogram()
first:
ggplot(data = rna, mapping = aes(x = expression)) +
geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The +
in the ggplot2
package is particularly useful because it
allows you to modify existing ggplot
objects. This means you can
easily set up plot templates and conveniently explore different types
of plots, so the above plot can also be generated with code like this:
# Assign plot to a variable
rna_plot <- ggplot(data = rna,
mapping = aes(x = expression))
# Draw the plot
rna_plot + geom_histogram()
Homework Challenge
You have probably noticed an automatic message that appears when drawing the histogram:
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Change the arguments
bins
orbinwidth
ofgeom_histogram()
to change the number or width of the bins.Solution
# change bins
ggplot(rna, aes(x = expression)) +
    geom_histogram(bins = 15)
# change binwidth
ggplot(rna, aes(x = expression)) +
    geom_histogram(binwidth = 2000)
We can observe here that the data are skewed to the right. We can
apply a log2 transformation to obtain a more symmetric distribution. Note
that we add here a small constant value (+1
) to avoid having -Inf
values returned for expression values equal to 0.
rna <- rna %>%
mutate(expression_log = log2(expression + 1))
If we now draw the histogram of the log2-transformed expressions, the distribution is indeed closer to a normal distribution.
ggplot(rna, aes(x = expression_log)) + geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
From now on we will work on the log-transformed expression values.
Homework Challenge
Another way to visualize this transformation is to consider the scale of the observations. For example, it may be worth changing the scale of the axis to better distribute the observations in the space of the plot. Changing the scale of the axes is done similarly to adding/modifying other components (i.e., by incrementally adding commands). Try making this modification:
- Represent the un-transformed expression on the log10 scale; see
scale_x_log10()
. Compare it with the previous graph. Why do you now have warning messages appearing?Solution
ggplot(data = rna, mapping = aes(x = expression)) +
    geom_histogram() +
    scale_x_log10()
Warning: Transformation introduced infinite values in continuous x-axis
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 507 rows containing non-finite values (stat_bin).
Notes
- Anything you put in the
ggplot()
function can be seen by any geom layers that you add (i.e., these are global plot settings). This includes the x- and y-axis mapping you set up inaes()
. - You can also specify mappings for a given geom independently of the
mappings defined globally in the
ggplot()
function. - The
+
sign used to add new layers must be placed at the end of the line containing the previous layer. If, instead, the+
sign is added at the beginning of the line containing the new layer,ggplot2
will not add the new layer and will return an error message.
# This is the correct syntax for adding layers
rna_plot +
geom_histogram()
# This will not add the new layer and will return an error message
rna_plot
+ geom_histogram()
Building your plots iteratively
We will now draw a scatter plot with two continuous variables and the geom_point()
function.
This graph will represent the log2 fold changes of expression comparing time 8 versus time 0, and time 4 versus time 0. To this end, we first need to compute the means of the log-transformed expression values by gene and time, and then the log fold changes by subtracting the mean log expressions between time 8 and time 0 and between time 4 and time 0. Note that we also include here the gene biotype that we will use later on to represent the genes. We will save the fold changes in a new data frame called rna_fc.
rna_fc <- rna %>% select(gene, time,
gene_biotype, expression_log) %>%
group_by(gene, time, gene_biotype) %>%
summarize(mean_exp = mean(expression_log)) %>%
pivot_wider(names_from = time,
values_from = mean_exp) %>%
mutate(time_8_vs_0 = `8` - `0`, time_4_vs_0 = `4` - `0`)
`summarise()` has grouped output by 'gene', 'time'. You can override using the
`.groups` argument.
We can then build a ggplot with the newly created dataset rna_fc
. Building plots with ggplot2
is typically an iterative process. We start by
defining the dataset we’ll use, lay out the axes, and choose a geom:
ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0)) +
geom_point()
Then, we start modifying this plot to extract more information from it. For
instance, we can add transparency (alpha
) to avoid overplotting:
ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0)) +
geom_point(alpha = 0.3)
We can also add colors for all the points:
ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0)) +
geom_point(alpha = 0.3, color = "blue")
Or to color each gene in the plot differently, you could use a
vector as an input to the argument color. ggplot2
will
provide a different color corresponding to different values in the
vector. Here is an example where we color with gene_biotype
:
ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0)) +
geom_point(alpha = 0.3, aes(color = gene_biotype))
We can also specify the colors directly inside the mapping provided in
the ggplot()
function. This will be seen by any geom layers and the
mapping will be determined by the x- and y-axis set up in aes()
.
ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0,
color = gene_biotype)) +
geom_point(alpha = 0.3)
Finally, we could also add a diagonal line with the geom_abline()
function:
ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0,
color = gene_biotype)) +
geom_point(alpha = 0.3) +
geom_abline(intercept = 0)
Notice that we can change the geom layer from geom_point
to geom_jitter
and colors will be still determined by gene_biotype
.
ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0,
color = gene_biotype)) +
geom_jitter(alpha = 0.3) +
geom_abline(intercept = 0)
Challenge
Use what you just learned to create a scatter plot of
expression_log
oversample
from therna
dataset with the time showing in different colors. Is this a good way to show this type of data?Solution
ggplot(data = rna,
       mapping = aes(y = expression_log, x = sample)) +
    geom_point(aes(color = time))
Boxplot
We can use boxplots to visualize the distribution of gene expressions within each sample:
ggplot(data = rna,
mapping = aes(y = expression_log, x = sample)) +
geom_boxplot()
By adding points to the boxplot, we can get a better idea of the number of measurements and of their distribution:
ggplot(data = rna,
mapping = aes(y = expression_log, x = sample)) +
geom_jitter(alpha = 0.2, color = "tomato") +
geom_boxplot(alpha = 0)
Challenge
Note how the boxplot layer is in front of the jitter layer? What do you need to change in the code to put the boxplot below the points?
Solution
We should switch the order of these two geoms:
ggplot(data = rna,
       mapping = aes(y = expression_log, x = sample)) +
    geom_boxplot(alpha = 0) +
    geom_jitter(alpha = 0.2, color = "tomato")
You may notice that the values on the x-axis are still not properly readable. Let’s change the orientation of the labels and adjust them vertically and horizontally so they don’t overlap. You can use a 90-degree angle, or experiment to find the appropriate angle for diagonally oriented labels:
ggplot(data = rna,
mapping = aes(y = expression_log, x = sample)) +
geom_jitter(alpha = 0.2, color = "tomato") +
geom_boxplot(alpha = 0) +
theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5))
Challenge
Add color to the data points on your boxplot according to the duration of the infection (
time
).Hint: Check the class for
time
. Consider changing the class oftime
from integer to factor directly in the ggplot mapping. Why does this change how R makes the graph?Solution
# time as integer
ggplot(data = rna,
       mapping = aes(y = expression_log, x = sample)) +
    geom_jitter(alpha = 0.2, aes(color = time)) +
    geom_boxplot(alpha = 0) +
    theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5))
# time as factor
ggplot(data = rna,
       mapping = aes(y = expression_log, x = sample)) +
    geom_jitter(alpha = 0.2, aes(color = as.factor(time))) +
    geom_boxplot(alpha = 0) +
    theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5))
Homework Challenge
Boxplots are useful summaries, but hide the shape of the distribution. For example, if the distribution is bimodal, we would not see it in a boxplot. An alternative to the boxplot is the violin plot, where the shape (of the density of points) is drawn.
- Replace the box plot with a violin plot; see
geom_violin()
.- Fill in the violins according to the time with the argument
fill
.- Modify the violin plot to fill in the violins by
sex
.Solution
ggplot(data = rna,
       mapping = aes(y = expression_log, x = sample)) +
    geom_violin(aes(fill = as.factor(time))) +
    theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5))

ggplot(data = rna,
       mapping = aes(y = expression_log, x = sample)) +
    geom_violin(aes(fill = sex)) +
    theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5))
Line plots
Let’s calculate the mean expression per duration of the infection for the 10 genes having the highest log fold changes comparing time 8 versus time 0.
First, we need to select the genes and create a subset of rna
called sub_rna
containing the 10 selected genes, then we need to group the data and calculate the mean gene expression within each group:
rna_fc <- rna_fc %>% arrange(desc(time_8_vs_0))
genes_selected <- rna_fc$gene[1:10]
sub_rna <- rna %>%
filter(gene %in% genes_selected)
mean_exp_by_time <- sub_rna %>%
group_by(gene,time) %>%
summarize(mean_exp = mean(expression_log))
`summarise()` has grouped output by 'gene'. You can override using the
`.groups` argument.
We can build the line plot with duration of the infection on the x-axis and the mean expression on the y-axis:
ggplot(data = mean_exp_by_time, mapping = aes(x = time, y = mean_exp)) +
geom_line()
Unfortunately, this does not work because we plotted data for all the genes
together. We need to tell ggplot to draw a line for each gene by modifying
the aesthetic function to include group = gene
:
ggplot(data = mean_exp_by_time,
mapping = aes(x = time, y = mean_exp, group = gene)) +
geom_line()
We will be able to distinguish genes in the plot if we add colors
(using color
also automatically groups the data):
ggplot(data = mean_exp_by_time,
mapping = aes(x = time, y = mean_exp, color = gene)) +
geom_line()
Faceting
ggplot2
has a special technique called faceting that allows
the user to split one plot into multiple (sub) plots based on a factor
included in the dataset. These different subplots inherit the same
properties (axes limits, ticks, …) to facilitate their direct
comparison. We will use it to make a line plot across time for each
gene:
ggplot(data = mean_exp_by_time,
mapping = aes(x = time, y = mean_exp)) + geom_line() +
facet_wrap(~ gene)
Here both the x- and y-axis have the same scale for all the subplots. You can change this default behavior by modifying scales
in order to allow a free scale for the y-axis:
ggplot(data = mean_exp_by_time,
mapping = aes(x = time, y = mean_exp)) +
geom_line() +
facet_wrap(~ gene, scales = "free_y")
Now we would like to split the line in each plot by the sex of the mice. To do that we need to calculate the mean expression in the data frame grouped by gene
, time
, and sex
:
mean_exp_by_time_sex <- sub_rna %>%
group_by(gene, time, sex) %>%
summarize(mean_exp = mean(expression_log))
`summarise()` has grouped output by 'gene', 'time'. You can override using the
`.groups` argument.
We can now make the faceted plot by splitting further by sex using color
(within a single plot):
ggplot(data = mean_exp_by_time_sex,
mapping = aes(x = time, y = mean_exp, color = sex)) +
geom_line() +
facet_wrap(~ gene, scales = "free_y")
Usually plots with a white background look more readable when printed. We can set
the background to white using the function theme_bw()
. Additionally, we can remove
the grid:
ggplot(data = mean_exp_by_time_sex,
mapping = aes(x = time, y = mean_exp, color = sex)) +
geom_line() +
facet_wrap(~ gene, scales = "free_y") +
theme_bw() +
theme(panel.grid = element_blank())
Homework Challenge
Use what you just learned to create a plot that depicts how the average expression of each chromosome changes through the duration of infection.
Solution
mean_exp_by_chromosome <- rna %>%
    group_by(chromosome_name, time) %>%
    summarize(mean_exp = mean(expression_log))

`summarise()` has grouped output by 'chromosome_name'. You can override using
the `.groups` argument.

ggplot(data = mean_exp_by_chromosome,
       mapping = aes(x = time, y = mean_exp)) +
    geom_line() +
    facet_wrap(~ chromosome_name, scales = "free_y")
The facet_wrap
geometry extracts plots into an arbitrary number of dimensions
to allow them to cleanly fit on one page. On the other hand, the facet_grid
geometry allows you to explicitly specify how you want your plots to be
arranged via formula notation (rows ~ columns
; a .
can be used as
a placeholder that indicates only one row or column).
Let’s modify the previous plot to compare how the mean gene expression of males and females has changed through time:
# One column, facet by rows
ggplot(data = mean_exp_by_time_sex,
mapping = aes(x = time, y = mean_exp, color = gene)) +
geom_line() +
facet_grid(sex ~ .)
# One row, facet by column
ggplot(data = mean_exp_by_time_sex,
mapping = aes(x = time, y = mean_exp, color = gene)) +
geom_line() +
facet_grid(. ~ sex)
ggplot2
themes
In addition to theme_bw()
, which changes the plot background to white, ggplot2
comes with several other themes which can be useful to quickly change the look
of your visualization. The complete list of themes is available
at http://docs.ggplot2.org/current/ggtheme.html. theme_minimal()
and
theme_light()
are popular, and theme_void()
can be useful as a starting
point to create a new hand-crafted theme.
The
ggthemes package
provides a wide variety of options (including an Excel 2003 theme).
The ggplot2
extensions website provides a list
of packages that extend the capabilities of ggplot2
, including additional
themes.
Customisation
Let’s come back to the faceted plot of mean expression by time and gene, colored by sex.
Take a look at the ggplot2
cheat
sheet,
and think of ways you could improve the plot.
Now, we can change names of axes to something more informative than ‘time’ and ‘mean_exp’, and add a title to the figure:
ggplot(data = mean_exp_by_time_sex,
mapping = aes(x = time, y = mean_exp, color = sex)) +
geom_line() +
facet_wrap(~ gene, scales = "free_y") +
theme_bw() +
theme(panel.grid = element_blank()) +
labs(title = "Mean gene expression by duration of the infection",
x = "Duration of the infection (in days)",
y = "Mean gene expression")
The axes have more informative names, but their readability can be improved by increasing the font size:
ggplot(data = mean_exp_by_time_sex,
mapping = aes(x = time, y = mean_exp, color = sex)) +
geom_line() +
facet_wrap(~ gene, scales = "free_y") +
theme_bw() +
theme(panel.grid = element_blank()) +
labs(title = "Mean gene expression by duration of the infection",
x = "Duration of the infection (in days)",
y = "Mean gene expression") +
theme(text = element_text(size = 16))
Note that it is also possible to change the fonts of your plots. If you are on Windows, you may have to install the extrafont package, and follow the instructions included in its README.
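For the font families that R already knows about ("serif", "sans", "mono"), no extra package is needed. As a minimal sketch, reusing the plot built above:

```r
## Switch the whole plot to a serif font family
## (no extrafont needed for the built-in families)
ggplot(data = mean_exp_by_time_sex,
       mapping = aes(x = time, y = mean_exp, color = sex)) +
  geom_line() +
  facet_wrap(~ gene, scales = "free_y") +
  theme_bw() +
  theme(text = element_text(size = 16, family = "serif"))
```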
We can further customize the color of the x- and y-axis text, the color of the grid, and so on. We can also, for example, move the legend to the top by setting legend.position to "top".
ggplot(data = mean_exp_by_time_sex,
mapping = aes(x = time, y = mean_exp, color = sex)) +
geom_line() +
facet_wrap(~ gene, scales = "free_y") +
theme_bw() +
theme(panel.grid = element_blank()) +
labs(title = "Mean gene expression by duration of the infection",
x = "Duration of the infection (in days)",
y = "Mean gene expression") +
theme(text = element_text(size = 16),
axis.text.x = element_text(colour = "royalblue4", size = 12),
axis.text.y = element_text(colour = "royalblue4", size = 12),
panel.grid = element_line(colour="lightsteelblue1"),
legend.position = "top")
If you like the changes you created better than the default theme, you can save them as an object to be able to easily apply them to other plots you may create. Here is an example with the histogram we have previously created.
blue_theme <- theme(axis.text.x = element_text(colour = "royalblue4",
size = 12),
axis.text.y = element_text(colour = "royalblue4",
size = 12),
text = element_text(size = 16),
panel.grid = element_line(colour="lightsteelblue1"))
ggplot(rna, aes(x = expression_log)) +
geom_histogram(bins = 20) +
blue_theme
Challenge
With all of this information in hand, try to create the ugliest plot you can think of. You can use one of the plots generated in this exercise as a starting point. Use the RStudio ggplot2 cheatsheet for inspiration. Here are some ideas:
- See if you can change the thickness of the lines.
- Can you find a way to change the name of the legend? What about its labels? (hint: look for a ggplot function starting with scale_)
- Try using a different color palette or manually specifying the colors for the lines (see http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/).
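A sketch for the first two ideas (note that linewidth replaced size for line thickness in recent ggplot2 versions):

```r
## Thicker lines, and a legend renamed and recolored by hand
ggplot(data = mean_exp_by_time_sex,
       mapping = aes(x = time, y = mean_exp, color = sex)) +
  geom_line(linewidth = 2) +
  facet_wrap(~ gene, scales = "free_y") +
  scale_color_manual(name = "Sex of the mice",
                     values = c("hotpink", "chartreuse"))
```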
Composing plots
Faceting is a great tool for splitting one plot into multiple subplots, but sometimes you may want to produce a single figure that contains multiple independent plots, i.e. plots that are based on different variables or even different data frames.
Let’s start by creating the two plots that we want to arrange next to each other:
The first graph counts the number of unique genes per chromosome. We first need to reorder the levels of chromosome_name and filter the unique genes per chromosome. We also change the scale of the y-axis to a log10 scale for better readability.
rna$chromosome_name <- factor(rna$chromosome_name,
levels = c(1:19,"X","Y"))
count_gene_chromosome <- rna %>% select(chromosome_name, gene) %>%
distinct() %>% ggplot() +
geom_bar(aes(x = chromosome_name), fill = "seagreen",
position = "dodge", stat = "count") +
labs(y = "log10(n genes)", x = "chromosome") +
scale_y_log10()
count_gene_chromosome
Below, we also remove the legend altogether by setting legend.position to "none".
exp_boxplot_sex <- ggplot(rna, aes(y=expression_log, x = as.factor(time),
color=sex)) +
geom_boxplot(alpha = 0) +
labs(y = "Mean gene exp",
x = "time") + theme(legend.position = "none")
exp_boxplot_sex
The patchwork package provides an elegant approach to combining figures: the + operator arranges figures (typically side by side), | explicitly arranges them side by side, and / stacks them on top of each other.
install.packages("patchwork")
library("patchwork")
count_gene_chromosome + exp_boxplot_sex
## or count_gene_chromosome | exp_boxplot_sex
count_gene_chromosome / exp_boxplot_sex
We can further control the layout of the final composition with plot_layout to create more complex layouts:
count_gene_chromosome + exp_boxplot_sex + plot_layout(ncol = 1)
count_gene_chromosome +
(count_gene_chromosome + exp_boxplot_sex) +
exp_boxplot_sex +
plot_layout(ncol = 1)
The last plot can also be created using the | and / composers:
count_gene_chromosome /
(count_gene_chromosome | exp_boxplot_sex) /
exp_boxplot_sex
Learn more about patchwork on its webpage or in this video.
Another option is the gridExtra package, which allows combining separate ggplots into a single figure using grid.arrange():
install.packages("gridExtra")
library(gridExtra)
grid.arrange(count_gene_chromosome, exp_boxplot_sex, ncol = 2)
In addition to the ncol and nrow arguments, used to make simple arrangements, there are tools for constructing more complex layouts.
Exporting plots
After creating your plot, you can save it to a file in your favorite format. The Export tab in the Plot pane in RStudio will save your plots at low resolution, which will not be accepted by many journals and will not scale well for posters.
Instead, use the ggsave() function, which allows you to easily change the dimensions and resolution of your plot by adjusting the appropriate arguments (width, height and dpi).
Make sure you have the fig_output/ folder in your working directory.
my_plot <- ggplot(data = mean_exp_by_time_sex,
mapping = aes(x = time, y = mean_exp, color = sex)) +
geom_line() +
facet_wrap(~ gene, scales = "free_y") +
labs(title = "Mean gene expression by duration of the infection",
x = "Duration of the infection (in days)",
y = "Mean gene expression") +
guides(color=guide_legend(title="Gender")) +
theme_bw() +
theme(axis.text.x = element_text(colour = "royalblue4", size = 12),
axis.text.y = element_text(colour = "royalblue4", size = 12),
text = element_text(size = 16),
panel.grid = element_line(colour="lightsteelblue1"),
legend.position = "top")
ggsave("fig_output/mean_exp_by_time_sex.png", my_plot, width = 15,
height = 10)
# This also works for grid.arrange() plots
combo_plot <- grid.arrange(count_gene_chromosome, exp_boxplot_sex,
ncol = 2, widths = c(4, 6))
ggsave("fig_output/combo_plot_chromosome_sex.png", combo_plot,
width = 10, dpi = 300)
Note: the width and height parameters also determine the font size in the saved plot.
Other packages for visualisation
ggplot2
is a very powerful package that fits very nicely in our
tidy data and tidy tools pipeline. There are other visualization
packages in R that shouldn’t be ignored.
Base graphics {-}
The default graphics system that comes with R, often called base R graphics, is simple and fast. It is based on the painter's or canvas model, where different outputs are directly overlaid on top of each other (see figure \@ref(fig:paintermodel)). This is a fundamental difference with ggplot2 (and with lattice, described below), which return dedicated objects that are rendered on screen or in a file, and that can even be updated.
par(mfrow = c(1, 3))
plot(1:20, main = "First layer, produced with plot(1:20)")
plot(1:20, main = "A horizontal red line, added with abline(h = 10)")
abline(h = 10, col = "red")
plot(1:20, main = "A rectangle, added with rect(5, 5, 15, 15)")
abline(h = 10, col = "red")
rect(5, 5, 15, 15, lwd = 3)
Another main difference is that base graphics' plotting functions try to do the right thing based on their input type, i.e. they adapt their behaviour to the class of their input. This is again very different from ggplot2, which only accepts data frames as input and requires plots to be constructed bit by bit.
par(mfrow = c(2, 2))
boxplot(rnorm(100),
main = "Boxplot of rnorm(100)")
boxplot(matrix(rnorm(100), ncol = 10),
main = "Boxplot of matrix(rnorm(100), ncol = 10)")
hist(rnorm(100))
hist(matrix(rnorm(100), ncol = 10))
The out-of-the-box approach of base graphics can be very efficient for simple, standard figures that can be produced quickly with a single function call such as plot, hist or boxplot. The defaults are however not always the most appealing, and tuning figures, especially when they become more complex (for example to produce facets), can become lengthy and cumbersome.
The lattice package {-}
The lattice package is similar to ggplot2 in that it uses data frames as input, returns graphical objects and supports faceting. lattice however isn't based on the grammar of graphics and has a more convoluted interface. A good reference for the lattice package is @latticebook.
-
Source: Data Visualization Cheat Sheet. ↩
Key Points
Visualization in R
Joining tables
Overview
Teaching: XX min
Exercises: XX minQuestions
Join tables in R
Objectives
Understand the need and concept of table joins
Understand the different types of joins
Understand the importance of keys in joins and the implications of using non-unique keys
Joining tables
In many real life situations, data are spread across multiple tables. Usually this occurs because different types of information about a subject, e.g. a patient, are collected from different sources.
It may be desirable for some analyses to combine data from two or more tables into a single data frame based on a column that would be common to all the tables, for example, an attribute that uniquely identifies the subjects.
The dplyr package provides a set of join functions for combining two data frames based on matches within specified columns.
For further reading, please refer to the chapter about table joins in R for Data Science.
The Data Transformation Cheat Sheet also provides a short overview on table joins.
Combining tables
We are going to illustrate joins using a common example from the bioinformatics world, where annotations about genes are scattered across different tables that share one or more columns.
The data we are going to use are available in the following package.
if (!require("rWSBIM1207"))
BiocManager::install("UCLouvain-CBIO/rWSBIM1207")
library("rWSBIM1207")
data(jdf)
The data is composed of several tables.
The first table, jdf1, contains the proteins' unique UniProt1 accession numbers (uniprot variable), the most likely sub-cellular localisation of these proteins (organelle variable), as well as the protein identifiers (entry).
jdf1
# A tibble: 25 × 3
uniprot organelle entry
<chr> <chr> <chr>
1 P26039 Actin cytoskeleton TLN1_MOUSE
2 Q99PL5 Endoplasmic reticulum/Golgi apparatus RRBP1_MOUSE
3 Q6PB66 Mitochondrion LPPRC_MOUSE
4 P11276 Extracellular matrix FINC_MOUSE
5 Q6PR54 Nucleus - Chromatin RIF1_MOUSE
6 Q05793 Extracellular matrix PGBM_MOUSE
7 P19096 Cytosol FAS_MOUSE
8 Q9JKF1 Plasma membrane IQGA1_MOUSE
9 Q9QZQ1-2 Plasma membrane AFAD_MOUSE
10 Q6NS46 Nucleus - Non-chromatin RRP5_MOUSE
# … with 15 more rows
The second table, jdf2, contains the name of the gene that codes for the protein (gene_name variable), a description of the gene (description variable), the UniProt accession number (this is the common variable that can be used to join the tables) and the species the protein information comes from (organism variable).
jdf2
# A tibble: 25 × 4
gene_name description uniprot organism
<chr> <chr> <chr> <chr>
1 Iqgap1 Ras GTPase-activating-like protein IQGAP1 Q9JKF1 Mmus
2 Hspa5 78 kDa glucose-regulated protein P20029 Mmus
3 Pdcd11 Protein RRP5 homolog Q6NS46 Mmus
4 Tfrc Transferrin receptor protein 1 Q62351 Mmus
5 Hspd1 60 kDa heat shock protein, mitochondrial P63038 Mmus
6 Tln1 Talin-1 P26039 Mmus
7 Smc1a Structural maintenance of chromosomes protein 1A Q9CU62 Mmus
8 Lamc1 Laminin subunit gamma-1 P02468 Mmus
9 Hsp90b1 Endoplasmin P08113 Mmus
10 Mia3 Melanoma inhibitory activity protein 3 Q8BI84 Mmus
# … with 15 more rows
We now want to join these two tables into a single one containing all variables.
We are going to use the full_join function of dplyr to do so. The function will automatically find the common variable (in this case uniprot) to match observations from the first and second table.
library("dplyr")
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
full_join(jdf1, jdf2)
Joining, by = "uniprot"
# A tibble: 25 × 6
uniprot organelle entry gene_name description organism
<chr> <chr> <chr> <chr> <chr> <chr>
1 P26039 Actin cytoskeleton TLN1… Tln1 Talin-1 Mmus
2 Q99PL5 Endoplasmic reticulum/Golgi ap… RRBP… Rrbp1 Ribosome-b… Mmus
3 Q6PB66 Mitochondrion LPPR… Lrpprc Leucine-ri… Mmus
4 P11276 Extracellular matrix FINC… Fn1 Fibronectin Mmus
5 Q6PR54 Nucleus - Chromatin RIF1… Rif1 Telomere-a… Mmus
6 Q05793 Extracellular matrix PGBM… Hspg2 Basement m… Mmus
7 P19096 Cytosol FAS_… Fasn Fatty acid… Mmus
8 Q9JKF1 Plasma membrane IQGA… Iqgap1 Ras GTPase… Mmus
9 Q9QZQ1-2 Plasma membrane AFAD… Mllt4 Isoform 1 … Mmus
10 Q6NS46 Nucleus - Non-chromatin RRP5… Pdcd11 Protein RR… Mmus
# … with 15 more rows
In these examples, each observation of the jdf1 and jdf2 tables is uniquely identified by its UniProt accession number. Such variables are called keys. Keys are used to match observations across different tables.
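Before joining, it can be worth checking that a candidate key really is unique. One way to do so, sketched here with dplyr verbs:

```r
## Count occurrences of each accession number; zero remaining rows
## means uniprot uniquely identifies the observations of jdf1
jdf1 %>%
  count(uniprot) %>%
  filter(n > 1)
```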
Now let's look at a third table, jdf3. It also contains a UniProt column, but its name is capitalised differently!
jdf3
# A tibble: 25 × 4
gene_name description UniProt organism
<chr> <chr> <chr> <chr>
1 Iqgap1 Ras GTPase-activating-like protein IQGAP1 Q9JKF1 Mmus
2 Hspa5 78 kDa glucose-regulated protein P20029 Mmus
3 Pdcd11 Protein RRP5 homolog Q6NS46 Mmus
4 Tfrc Transferrin receptor protein 1 Q62351 Mmus
5 Hspd1 60 kDa heat shock protein, mitochondrial P63038 Mmus
6 Tln1 Talin-1 P26039 Mmus
7 Smc1a Structural maintenance of chromosomes protein 1A Q9CU62 Mmus
8 Lamc1 Laminin subunit gamma-1 P02468 Mmus
9 Hsp90b1 Endoplasmin P08113 Mmus
10 Mia3 Melanoma inhibitory activity protein 3 Q8BI84 Mmus
# … with 15 more rows
In case the variable names don't match, we can manually set the variables to use for the matching. These variables can be set using the by argument, as shown below with the jdf1 (as above) and jdf3 tables, where the UniProt accession number is encoded using a different capitalisation.
names(jdf3)
[1] "gene_name" "description" "UniProt" "organism"
full_join(jdf1, jdf3, by = c("uniprot" = "UniProt"))
# A tibble: 25 × 6
uniprot organelle entry gene_name description organism
<chr> <chr> <chr> <chr> <chr> <chr>
1 P26039 Actin cytoskeleton TLN1… Tln1 Talin-1 Mmus
2 Q99PL5 Endoplasmic reticulum/Golgi ap… RRBP… Rrbp1 Ribosome-b… Mmus
3 Q6PB66 Mitochondrion LPPR… Lrpprc Leucine-ri… Mmus
4 P11276 Extracellular matrix FINC… Fn1 Fibronectin Mmus
5 Q6PR54 Nucleus - Chromatin RIF1… Rif1 Telomere-a… Mmus
6 Q05793 Extracellular matrix PGBM… Hspg2 Basement m… Mmus
7 P19096 Cytosol FAS_… Fasn Fatty acid… Mmus
8 Q9JKF1 Plasma membrane IQGA… Iqgap1 Ras GTPase… Mmus
9 Q9QZQ1-2 Plasma membrane AFAD… Mllt4 Isoform 1 … Mmus
10 Q6NS46 Nucleus - Non-chromatin RRP5… Pdcd11 Protein RR… Mmus
# … with 15 more rows
As can be seen above, the variable name of the first table is retained in the joined one.
Challenge
Using the
full_join
function, join tablesjdf4
andjdf5
. What has happened for observationsP26039
andP02468
?Solution
full_join(jdf4, jdf5)
Joining, by = "uniprot"
# A tibble: 14 × 6
   uniprot  organelle                       entry gene_name description organism
   <chr>    <chr>                           <chr> <chr>     <chr>       <chr>
 1 P26039   Actin cytoskeleton              TLN1… <NA>      <NA>        <NA>
 2 Q99PL5   Endoplasmic reticulum/Golgi ap… RRBP… <NA>      <NA>        <NA>
 3 Q6PB66   Mitochondrion                   LPPR… <NA>      <NA>        <NA>
 4 P11276   Extracellular matrix            FINC… <NA>      <NA>        <NA>
 5 Q6PR54   Nucleus - Chromatin             RIF1… <NA>      <NA>        <NA>
 6 Q05793   Extracellular matrix            PGBM… <NA>      <NA>        <NA>
 7 P19096   Cytosol                         FAS_… Fasn      Fatty acid… Mmus
 8 Q9JKF1   Plasma membrane                 IQGA… <NA>      <NA>        <NA>
 9 Q9QZQ1-2 Plasma membrane                 AFAD… <NA>      <NA>        <NA>
10 Q6NS46   Nucleus - Non-chromatin         RRP5… <NA>      <NA>        <NA>
11 P02468   <NA>                            <NA>  Lamc1     Laminin su… Mmus
12 P08113   <NA>                            <NA>  Hsp90b1   Endoplasmin Mmus
13 Q8BI84   <NA>                            <NA>  Mia3      Melanoma i… Mmus
14 Q6P5D8   <NA>                            <NA>  Smchd1    Structural… Mmus
P26039 and P02468 are only present in jdf4 and jdf5 respectively, and their values for the variables of the other table have been encoded as missing.
Different types of joins
Above, we have used the full_join function, which fully joins two tables and keeps all observations, adding missing values if necessary. Sometimes, we want to be selective, and keep only the observations that are present in one or both tables.
- An inner join keeps observations that are present in both tables.
- A left join keeps observations that are present in the left (first) table, dropping those that are only present in the other.
- A right join keeps observations that are present in the right (second) table, dropping those that are only present in the other.
- A full join keeps all observations.
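These four behaviours can be illustrated on two minimal tables (x and y are hypothetical, not part of the lesson data):

```r
library("dplyr")
x <- tibble(key = c("a", "b", "c"), val_x = 1:3)
y <- tibble(key = c("b", "c", "d"), val_y = 4:6)
nrow(inner_join(x, y, by = "key")) ## 2: keys "b" and "c" are in both tables
nrow(left_join(x, y, by = "key"))  ## 3: all keys of x are kept
nrow(right_join(x, y, by = "key")) ## 3: all keys of y are kept
nrow(full_join(x, y, by = "key"))  ## 4: the union of all keys
```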
Challenge
Join tables
jdf4
andjdf5
, keeping only observations injdf4
.Solution
left_join(jdf4, jdf5)
Joining, by = "uniprot"
# A tibble: 10 × 6
   uniprot  organelle                       entry gene_name description organism
   <chr>    <chr>                           <chr> <chr>     <chr>       <chr>
 1 P26039   Actin cytoskeleton              TLN1… <NA>      <NA>        <NA>
 2 Q99PL5   Endoplasmic reticulum/Golgi ap… RRBP… <NA>      <NA>        <NA>
 3 Q6PB66   Mitochondrion                   LPPR… <NA>      <NA>        <NA>
 4 P11276   Extracellular matrix            FINC… <NA>      <NA>        <NA>
 5 Q6PR54   Nucleus - Chromatin             RIF1… <NA>      <NA>        <NA>
 6 Q05793   Extracellular matrix            PGBM… <NA>      <NA>        <NA>
 7 P19096   Cytosol                         FAS_… Fasn      Fatty acid… Mmus
 8 Q9JKF1   Plasma membrane                 IQGA… <NA>      <NA>        <NA>
 9 Q9QZQ1-2 Plasma membrane                 AFAD… <NA>      <NA>        <NA>
10 Q6NS46   Nucleus - Non-chromatin         RRP5… <NA>      <NA>        <NA>
Challenge
Join tables
jdf4
andjdf5
, keeping only observations injdf5
.Solution
right_join(jdf4, jdf5)
Joining, by = "uniprot"
# A tibble: 5 × 6
  uniprot organelle entry     gene_name description                     organism
  <chr>   <chr>     <chr>     <chr>     <chr>                           <chr>
1 P19096  Cytosol   FAS_MOUSE Fasn      Fatty acid synthase             Mmus
2 P02468  <NA>      <NA>      Lamc1     Laminin subunit gamma-1         Mmus
3 P08113  <NA>      <NA>      Hsp90b1   Endoplasmin                     Mmus
4 Q8BI84  <NA>      <NA>      Mia3      Melanoma inhibitory activity p… Mmus
5 Q6P5D8  <NA>      <NA>      Smchd1    Structural maintenance of chro… Mmus
Challenge
Join tables
jdf4
andjdf5
, keeping observations observed in both tables.Solution
inner_join(jdf4, jdf5)
Joining, by = "uniprot"
# A tibble: 1 × 6
  uniprot organelle entry     gene_name description         organism
  <chr>   <chr>     <chr>     <chr>     <chr>               <chr>
1 P19096  Cytosol   FAS_MOUSE Fasn      Fatty acid synthase Mmus
Multiple matches
Sometimes, keys aren't unique. In the jdf6 table below, we see that the accession number Q99PL5 is repeated twice. According to this table, the ribosome-binding protein 1 localises in the endoplasmic reticulum and in the Golgi apparatus.
jdf6
# A tibble: 5 × 4
uniprot organelle entry isoform
<chr> <chr> <chr> <dbl>
1 P26039 Actin cytoskeleton TLN1_MOUSE 1
2 Q99PL5 Endoplasmic reticulum RRBP1_MOUSE 1
3 Q99PL5 Golgi apparatus RRBP1_MOUSE 2
4 Q6PB66 Mitochondrion LPPRC_MOUSE 1
5 P11276 Extracellular matrix FINC_MOUSE 1
If we now want to join jdf6 and jdf2, the variables of the latter will be duplicated.
inner_join(jdf6, jdf2)
Joining, by = "uniprot"
# A tibble: 5 × 7
uniprot organelle entry isoform gene_name description organism
<chr> <chr> <chr> <dbl> <chr> <chr> <chr>
1 P26039 Actin cytoskeleton TLN1_MOU… 1 Tln1 Talin-1 Mmus
2 Q99PL5 Endoplasmic reticulum RRBP1_MO… 1 Rrbp1 Ribosome-b… Mmus
3 Q99PL5 Golgi apparatus RRBP1_MO… 2 Rrbp1 Ribosome-b… Mmus
4 Q6PB66 Mitochondrion LPPRC_MO… 1 Lrpprc Leucine-ri… Mmus
5 P11276 Extracellular matrix FINC_MOU… 1 Fn1 Fibronectin Mmus
In the case above, the repetition is useful, as it completes jdf6 with correct information from jdf2. One needs however to be careful when duplicated keys exist in both tables. Let's now use jdf7 for the join. It also has two entries for the UniProt accession number Q99PL5.
jdf7
# A tibble: 5 × 6
gene_name description uniprot organism isoform_num measure
<chr> <chr> <chr> <chr> <dbl> <dbl>
1 Rrbp1 Ribosome-binding protein 1 Q99PL5 Mmus 1 102
2 Rrbp1 Ribosome-binding protein 1 Q99PL5 Mmus 2 3
3 Iqgap1 Ras GTPase-activating-like pro… Q9JKF1 Mmus 1 13
4 Hspa5 78 kDa glucose-regulated prote… P20029 Mmus 1 54
5 Pdcd11 Protein RRP5 homolog Q6NS46 Mmus 1 28
Let's create an inner join between jdf6 and jdf7 (both having duplicated Q99PL5 entries):
inner_join(jdf6, jdf7)
Joining, by = "uniprot"
# A tibble: 4 × 9
uniprot organelle entry isoform gene_name description organism isoform_num
<chr> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl>
1 Q99PL5 Endoplasmic … RRBP… 1 Rrbp1 Ribosome-b… Mmus 1
2 Q99PL5 Endoplasmic … RRBP… 1 Rrbp1 Ribosome-b… Mmus 2
3 Q99PL5 Golgi appara… RRBP… 2 Rrbp1 Ribosome-b… Mmus 1
4 Q99PL5 Golgi appara… RRBP… 2 Rrbp1 Ribosome-b… Mmus 2
# … with 1 more variable: measure <dbl>
Challenge
Interpret the result of the inner join above, where both tables have duplicated keys.
Solution
jdf6 has two entries, one for each possible sub-cellular localisation of the protein. jdf7 also has two entries, referring to two different quantitative measurements (variable measure). When joining on a duplicated key, you get all possible combinations. In this case, we obtain wrong information: the proteins in the ER and in the GA both appear with the values 102 and 3.
inner_join(jdf6, jdf7)
Joining, by = "uniprot"
# A tibble: 4 × 9
  uniprot organelle     entry isoform gene_name description organism isoform_num
  <chr>   <chr>         <chr>   <dbl> <chr>     <chr>       <chr>          <dbl>
1 Q99PL5  Endoplasmic … RRBP…       1 Rrbp1     Ribosome-b… Mmus               1
2 Q99PL5  Endoplasmic … RRBP…       1 Rrbp1     Ribosome-b… Mmus               2
3 Q99PL5  Golgi appara… RRBP…       2 Rrbp1     Ribosome-b… Mmus               1
4 Q99PL5  Golgi appara… RRBP…       2 Rrbp1     Ribosome-b… Mmus               2
# … with 1 more variable: measure <dbl>
Matching across multiple keys
So far, we have matched tables using a single key (possibly with different names in the two tables). Sometimes, it is necessary to match tables using multiple keys. A typical example is when multiple variables are needed to discriminate different rows in a table.
Following up from the last example, we see that the duplicated UniProt accession numbers in the jdf6 and jdf7 tables refer to different isoforms of the same RRBP1 gene.
jdf6
# A tibble: 5 × 4
uniprot organelle entry isoform
<chr> <chr> <chr> <dbl>
1 P26039 Actin cytoskeleton TLN1_MOUSE 1
2 Q99PL5 Endoplasmic reticulum RRBP1_MOUSE 1
3 Q99PL5 Golgi apparatus RRBP1_MOUSE 2
4 Q6PB66 Mitochondrion LPPRC_MOUSE 1
5 P11276 Extracellular matrix FINC_MOUSE 1
jdf7
# A tibble: 5 × 6
gene_name description uniprot organism isoform_num measure
<chr> <chr> <chr> <chr> <dbl> <dbl>
1 Rrbp1 Ribosome-binding protein 1 Q99PL5 Mmus 1 102
2 Rrbp1 Ribosome-binding protein 1 Q99PL5 Mmus 2 3
3 Iqgap1 Ras GTPase-activating-like pro… Q9JKF1 Mmus 1 13
4 Hspa5 78 kDa glucose-regulated prote… P20029 Mmus 1 54
5 Pdcd11 Protein RRP5 homolog Q6NS46 Mmus 1 28
To uniquely identify isoforms, we should consider two keys:
- the UniProt accession number (named uniprot in both tables)
- the isoform number (called isoform and isoform_num respectively)
Because the isoform status was encoded using different names (which is, of course, a source of confusion), jdf6 and jdf7 are automatically joined based on the shared uniprot key only. If the isoform status had been encoded the same way in both tables, the join would have been done automatically on both keys. Here, we need to join using both keys and must explicitly name the variables used for the join.
inner_join(jdf6, jdf7, by = c("uniprot" = "uniprot", "isoform" = "isoform_num"))
# A tibble: 2 × 8
uniprot organelle entry isoform gene_name description organism measure
<chr> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl>
1 Q99PL5 Endoplasmic reti… RRBP… 1 Rrbp1 Ribosome-b… Mmus 102
2 Q99PL5 Golgi apparatus RRBP… 2 Rrbp1 Ribosome-b… Mmus 3
We now see that isoform 1 localises to the ER and has a measured value of 102, while isoform 2, which localises to the GA, has a measured value of 3. Ideally, the isoform variables would be named identically in the two tables to enable an automatic join on the two keys. An alternative is to rename isoform_num from jdf7 so that both key names are present in both tables, enabling an automatic join. This can be done easily using the rename function from the dplyr package.
jdf7 %>% rename(isoform = isoform_num)
# A tibble: 5 × 6
gene_name description uniprot organism isoform measure
<chr> <chr> <chr> <chr> <dbl> <dbl>
1 Rrbp1 Ribosome-binding protein 1 Q99PL5 Mmus 1 102
2 Rrbp1 Ribosome-binding protein 1 Q99PL5 Mmus 2 3
3 Iqgap1 Ras GTPase-activating-like protein… Q9JKF1 Mmus 1 13
4 Hspa5 78 kDa glucose-regulated protein P20029 Mmus 1 54
5 Pdcd11 Protein RRP5 homolog Q6NS46 Mmus 1 28
inner_join(jdf6,
jdf7 %>%
rename(isoform = isoform_num))
Joining, by = c("uniprot", "isoform")
# A tibble: 2 × 8
uniprot organelle entry isoform gene_name description organism measure
<chr> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl>
1 Q99PL5 Endoplasmic reti… RRBP… 1 Rrbp1 Ribosome-b… Mmus 102
2 Q99PL5 Golgi apparatus RRBP… 2 Rrbp1 Ribosome-b… Mmus 3
Row and column binding
There are two other important functions in R, rbind and cbind, that can be used to combine two data frames. cbind binds two data frames by columns; both must have the same number of rows.
d2
a b
1 4 4
2 5 5
d3
v1 v2 v3
1 1 3 5
2 2 4 6
cbind(d2, d3)
a b v1 v2 v3
1 4 4 1 3 5
2 5 5 2 4 6
rbind binds two data frames by rows; both must have the same number of columns, and the same column names!
d1
x y
1 1 1
2 2 2
3 3 3
d2
a b
1 4 4
2 5 5
Using rbind(d1, d2) would produce an error because the two data frames do not have the same column names (even though they have the same number of columns). If we change the names of d2, it works!
names(d2) <- names(d1)
d1
x y
1 1 1
2 2 2
3 3 3
d2
x y
1 4 4
2 5 5
rbind(d1, d2)
x y
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
Note that beyond the dimensions and column names that are required to match, the real meaning of rbind is to bind data frames that contain observations for the same set of variables: there is more to it than matching column names!
Note: rbind and cbind are base R functions. The tidyverse alternatives from the dplyr package are bind_rows and bind_cols, and they work similarly.
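One practical difference worth knowing: unlike rbind, bind_rows matches columns by name and fills the missing ones with NA instead of failing. A minimal sketch:

```r
library("dplyr")
d1 <- data.frame(x = 1:2, y = 1:2)
d2 <- data.frame(x = 3, z = 9)
## rbind(d1, d2) would fail; bind_rows() returns columns x, y and z,
## with NA where a data frame did not provide a value
bind_rows(d1, d2)
```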
-
UniProt is the protein information database. Its mission is to provide the scientific community with a comprehensive, high-quality and freely accessible resource of protein sequence and functional information. ↩
Key Points
Join tables of data in R
Next steps
Overview
Teaching: XX min
Exercises: XX minQuestions
SummarizedExperiment
Objectives
Introduce the notion of data containers
Give an overview of the
SummarizedExperiment
, extensively used in omics analyses
Next steps
Data in bioinformatics is often complex. For example, an experiment can have multiple types of files e.g. counts, sample information and gene information. To deal with this, developers define specialised data containers (termed classes) that match the properties of the data they need to handle.
This aspect is central to the Bioconductor1 project which uses the same core data infrastructure across packages. This certainly contributed to Bioconductor’s success. Bioconductor package developers are advised to make use of existing infrastructure to provide coherence, interoperability and stability to the project as a whole. You will learn more about Bioconductor in the next episode.
SummarizedExperiment
To illustrate a Bioconductor omics data container, we’ll present the SummarizedExperiment
class. This is a core structure for omics data such as genomics, transcriptomics, proteomics, methylation and cytometry. The image below from this website shows how a large number of Bioconductor packages make use of SummarizedExperiment.
What is it
The figure below represents the anatomy of SummarizedExperiment.
- assay(), assays(): A matrix-like object, or a list of matrix-like objects, of identical dimension
  - rows: genes, genomic coordinates, etc.
  - columns: samples, cells, etc.
- colData(): Annotations on each column, as a DataFrame, e.g. a description of each sample
- rowData() and / or rowRanges(): Annotations on each row
  - e.g. rowRanges(): coordinates of genes / exons in transcripts / etc.
  - e.g. rowData(): p-values and log-fold changes of each gene after differential expression analysis
- metadata(): A list of unstructured metadata describing the overall content of the object
Benefits of SummarizedExperiment format
-
Coordination of samples and features. As soon as the data for a project are distributed in multiple tables or files, the alignment of data records or the consistency of identifiers is precarious. The coordinated nature of the SummarizedExperiment container overcomes this by guaranteeing that the dimensions of the different slots always match during data manipulation (i.e. the columns in the expression data match the rows in the sample metadata, and the rows in the expression data match the rows in the feature metadata). For example, if we had to exclude one sample from the assay, it would be automatically removed from the sample metadata in the same operation. The metadata slots can grow additional co-variates (columns) without affecting the other structures.
-
Interoperability between packages. Many R packages make use of the SummarizedExperiment format. For the user, this makes analysis easier, as less data wrangling is required when the output of one package can work as the input for another. And with many packages using the SummarizedExperiment format, there are fewer formats to learn. If you follow a training focused on RNA sequencing analysis, you may learn to use the Bioconductor DESeq2 package to do some differential expression analyses. DESeq2's whole analysis is handled in a SummarizedExperiment. Or, if you perform RNA sequencing analysis following tidy principles with the tidybulk package, you can input your data in SummarizedExperiment format. You can see packages making use of SummarizedExperiment (Depends/Imports) on the SummarizedExperiment homepage, some shown in the screenshot below.
When would you use it
You may encounter a SummarizedExperiment object
- using a Bioconductor package function that imports your omics experiment files and produces a SummarizedExperiment object. The tximeta package, for transcript-level quantification, is one example of this.
- using processed data provided in SummarizedExperiment format. The Recount3 project provides RNA-sequencing gene, exon, and exon-exon junction counts for 8,679 and 10,088 different studies for human and mouse respectively in SummarizedExperiment format.
- through creating one yourself to store your data or share with a collaborator
Exploring a SummarizedExperiment
Let's explore a SummarizedExperiment object in R. We'll load in an example. This is the same RNA data we've been working with, but it also contains genomic coordinates/ranges. Ranges are an optional part of a SummarizedExperiment; when they are attached, the object is called a RangedSummarizedExperiment.
se <- readRDS("course-data/data/GSE96870/se2.rds")
se
class: RangedSummarizedExperiment
dim: 1474 22
metadata(0):
assays(1): counts
rownames(1474): Asl Apod ... Lmx1a Pbx1
rowData names(11): gene ENTREZID ... phenotype_description
hsapiens_homolog_associated_gene_name
colnames(22): GSM2545336 GSM2545337 ... GSM2545363 GSM2545380
colData names(10): title geo_accession ... tissue mouse
We can access the expression matrix with the assay function:
head(assay(se))
GSM2545336 GSM2545337 GSM2545338 GSM2545339 GSM2545340 GSM2545341
Asl 1170 361 400 586 626 988
Apod 36194 10347 9173 10620 13021 29594
Cyp2d22 4060 1616 1603 1901 2171 3349
Klk6 287 629 641 578 448 195
Fcrls 85 233 244 237 180 38
Slc2a4 782 231 248 265 313 786
GSM2545342 GSM2545343 GSM2545344 GSM2545345 GSM2545346 GSM2545347
Asl 836 535 586 597 938 1035
Apod 24959 13668 13230 15868 27769 34301
Cyp2d22 3122 2008 2254 2277 2985 3452
Klk6 186 1101 537 567 327 233
Fcrls 68 375 199 177 89 67
Slc2a4 528 249 266 357 654 693
GSM2545348 GSM2545349 GSM2545350 GSM2545351 GSM2545352 GSM2545353
Asl 494 481 666 937 803 541
Apod 11258 11812 15816 29242 20415 13682
Cyp2d22 1883 2014 2417 3678 2920 2216
Klk6 742 881 828 250 798 710
Fcrls 300 233 231 81 303 285
Slc2a4 271 304 349 715 513 320
GSM2545354 GSM2545362 GSM2545363 GSM2545380
Asl 473 748 576 1192
Apod 11088 15916 11166 38148
Cyp2d22 1821 2842 2011 4019
Klk6 894 501 598 259
Fcrls 248 179 184 68
Slc2a4 248 350 317 796
dim(assay(se))
[1] 1474 22
We can access the sample metadata using the colData function:
colData(se)
DataFrame with 22 rows and 10 columns
title geo_accession organism age sex
<character> <character> <character> <character> <factor>
GSM2545336 CNS_RNA-seq_10C GSM2545336 Mus musculus 8 weeks Female
GSM2545337 CNS_RNA-seq_11C GSM2545337 Mus musculus 8 weeks Female
GSM2545338 CNS_RNA-seq_12C GSM2545338 Mus musculus 8 weeks Female
GSM2545339 CNS_RNA-seq_13C GSM2545339 Mus musculus 8 weeks Female
GSM2545340 CNS_RNA-seq_14C GSM2545340 Mus musculus 8 weeks Male
... ... ... ... ... ...
GSM2545353 CNS_RNA-seq_3C GSM2545353 Mus musculus 8 weeks Female
GSM2545354 CNS_RNA-seq_4C GSM2545354 Mus musculus 8 weeks Male
GSM2545362 CNS_RNA-seq_5C GSM2545362 Mus musculus 8 weeks Female
GSM2545363 CNS_RNA-seq_6C GSM2545363 Mus musculus 8 weeks Male
GSM2545380 CNS_RNA-seq_9C GSM2545380 Mus musculus 8 weeks Female
infection strain time tissue mouse
<factor> <character> <factor> <factor> <factor>
GSM2545336 InfluenzaA C57BL/6 Day8 Cerebellum 14
GSM2545337 NonInfected C57BL/6 Day0 Cerebellum 9
GSM2545338 NonInfected C57BL/6 Day0 Cerebellum 10
GSM2545339 InfluenzaA C57BL/6 Day4 Cerebellum 15
GSM2545340 InfluenzaA C57BL/6 Day4 Cerebellum 18
... ... ... ... ... ...
GSM2545353 NonInfected C57BL/6 Day0 Cerebellum 4
GSM2545354 NonInfected C57BL/6 Day0 Cerebellum 2
GSM2545362 InfluenzaA C57BL/6 Day4 Cerebellum 20
GSM2545363 InfluenzaA C57BL/6 Day4 Cerebellum 12
GSM2545380 InfluenzaA C57BL/6 Day8 Cerebellum 19
dim(colData(se))
[1] 22 10
We can also access the feature metadata using the rowData function:
head(rowData(se))
DataFrame with 6 rows and 11 columns
gene ENTREZID product gbkey
<character> <character> <character> <character>
Asl Asl 109900 argininosuccinate ly.. mRNA
Apod Apod 11815 apolipoprotein D, tr.. mRNA
Cyp2d22 Cyp2d22 56448 cytochrome P450, fam.. mRNA
Klk6 Klk6 19144 kallikrein related-p.. mRNA
Fcrls Fcrls 80891 Fc receptor-like S, .. mRNA
Slc2a4 Slc2a4 20528 solute carrier famil.. mRNA
external_gene_name ensembl_gene_id external_synonym chromosome_name
<character> <character> <character> <character>
Asl Asl ENSMUSG00000025533 2510006M18Rik 5
Apod Apod ENSMUSG00000022548 NA 16
Cyp2d22 Cyp2d22 ENSMUSG00000061740 2D22 15
Klk6 Klk6 ENSMUSG00000050063 Bssp 7
Fcrls Fcrls ENSMUSG00000015852 2810439C17Rik 3
Slc2a4 Slc2a4 ENSMUSG00000018566 Glut-4 11
gene_biotype phenotype_description
<character> <character>
Asl protein_coding abnormal circulating..
Apod protein_coding abnormal lipid homeo..
Cyp2d22 protein_coding abnormal skin morpho..
Klk6 protein_coding abnormal cytokine le..
Fcrls protein_coding decreased CD8-positi..
Slc2a4 protein_coding abnormal circulating..
hsapiens_homolog_associated_gene_name
<character>
Asl ASL
Apod APOD
Cyp2d22 CYP2D6
Klk6 KLK6
Fcrls FCRL2
Slc2a4 SLC2A4
dim(rowData(se))
[1] 1474 11
If genomic coordinates are present, we can access them using the rowRanges function:
rowRanges(se)
GRanges object with 1474 ranges and 11 metadata columns:
seqnames ranges strand | gene ENTREZID
<Rle> <IRanges> <Rle> | <character> <character>
Asl 5 130024208-130024330 - | Asl 109900
Apod 16 31314552-31314808 - | Apod 11815
Cyp2d22 15 82380078-82380260 - | Cyp2d22 56448
Klk6 7 43824544-43824595 + | Klk6 19144
Fcrls 3 87263678-87263765 - | Fcrls 80891
... ... ... ... . ... ...
Mgst3 1 167393751-167393797 - | Mgst3 66447
Lrrc52 1 167466093-167466780 - | Lrrc52 240899
Rxrg 1 167598362-167598800 + | Rxrg 20183
Lmx1a 1 167688300-167688991 + | Lmx1a 110648
Pbx1 1 168431314-168432169 - | Pbx1 18514
product gbkey external_gene_name
<character> <character> <character>
Asl argininosuccinate ly.. mRNA Asl
Apod apolipoprotein D, tr.. mRNA Apod
Cyp2d22 cytochrome P450, fam.. mRNA Cyp2d22
Klk6 kallikrein related-p.. mRNA Klk6
Fcrls Fc receptor-like S, .. mRNA Fcrls
... ... ... ...
Mgst3 microsomal glutathio.. mRNA Mgst3
Lrrc52 leucine rich repeat .. mRNA Lrrc52
Rxrg retinoid X receptor .. mRNA Rxrg
Lmx1a LIM homeobox transcr.. mRNA Lmx1a
Pbx1 pre B cell leukemia .. mRNA Pbx1
ensembl_gene_id external_synonym chromosome_name gene_biotype
<character> <character> <character> <character>
Asl ENSMUSG00000025533 2510006M18Rik 5 protein_coding
Apod ENSMUSG00000022548 <NA> 16 protein_coding
Cyp2d22 ENSMUSG00000061740 2D22 15 protein_coding
Klk6 ENSMUSG00000050063 Bssp 7 protein_coding
Fcrls ENSMUSG00000015852 2810439C17Rik 3 protein_coding
... ... ... ... ...
Mgst3 ENSMUSG00000026688 2010012L10Rik 1 protein_coding
Lrrc52 ENSMUSG00000040485 4930413P14Rik 1 protein_coding
Rxrg ENSMUSG00000015843 Nr2b3 1 protein_coding
Lmx1a ENSMUSG00000026686 Lmx1.1 1 protein_coding
Pbx1 ENSMUSG00000052534 2310056B04Rik 1 protein_coding
phenotype_description hsapiens_homolog_associated_gene_name
<character> <character>
Asl abnormal circulating.. ASL
Apod abnormal lipid homeo.. APOD
Cyp2d22 abnormal skin morpho.. CYP2D6
Klk6 abnormal cytokine le.. KLK6
Fcrls decreased CD8-positi.. FCRL2
... ... ...
Mgst3 decreased mean corpu.. MGST3
Lrrc52 abnormal sperm physi.. LRRC52
Rxrg abnormal bone minera.. RXRG
Lmx1a abnormal bony labyri.. LMX1A
Pbx1 abnormal adrenal gla.. PBX1
-------
seqinfo: 182 sequences from an unspecified genome; no seqlengths
rowRanges gives us the same information as rowData but also includes the coordinates.
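Since rowRanges returns a GRanges object, the usual GenomicRanges accessors apply to it. As a small sketch (assuming the se object loaded above), we can pull out the start coordinates and the strand of the first few genes:

```r
## rowRanges(se) is a GRanges object, so GenomicRanges accessors work on it
head(start(rowRanges(se)))   # start coordinates of the first genes
head(strand(rowRanges(se)))  # strand of the first genes
```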
Creating a SummarizedExperiment
Often a package will create a SummarizedExperiment for you, but here we will demonstrate how one can be created. The three tables that make up a SummarizedExperiment object are:
- An expression matrix
- A table describing the samples
- A table describing the genes
You would import these tables into R or generate them in R. Here we will create them from the se object that we have.
count_matrix <- assay(se)
sample_metadata <- colData(se)
gene_metadata <- rowRanges(se) # or rowData(se)
We will create a SummarizedExperiment from these tables using the SummarizedExperiment constructor. We need to provide inputs for the arguments assays (count matrix), colData (sample metadata) and rowRanges or rowData (gene metadata). Because assays can accept multiple assays (e.g. raw counts, log-normalised counts) we put them in a list (SimpleList) and give each assay a name (we’ll use counts). We can see that we need to do this in the help page for the constructor, ?SummarizedExperiment.
# BiocManager::install("SummarizedExperiment")
library("SummarizedExperiment")
se_created <- SummarizedExperiment(assays = SimpleList(counts=count_matrix),
colData = sample_metadata,
rowRanges = gene_metadata) # or rowData =
se_created
class: RangedSummarizedExperiment
dim: 1474 22
metadata(0):
assays(1): counts
rownames(1474): Asl Apod ... Lmx1a Pbx1
rowData names(11): gene ENTREZID ... phenotype_description
hsapiens_homolog_associated_gene_name
colnames(22): GSM2545336 GSM2545337 ... GSM2545363 GSM2545380
colData names(10): title geo_accession ... tissue mouse
If we want to save a SummarizedExperiment object we can use the saveRDS function that we saw in a previous episode.
saveRDS(se_created, file = "data_output/se_created.rds")
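The saved object can be loaded back into a (new) R session with readRDS, just as we loaded se at the start of this episode. The path below assumes the file saved above:

```r
## Read the saved SummarizedExperiment back into R
se_reloaded <- readRDS("data_output/se_created.rds")
```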
Subsetting a SummarizedExperiment
A SummarizedExperiment can be subset just like a data frame, with numeric, character or logical indices. There are two dimensions: the first dimension is features (genes) and the second dimension is samples.
To subset to the first 5 features we could do:
se[1:5, ]
class: RangedSummarizedExperiment
dim: 5 22
metadata(0):
assays(1): counts
rownames(5): Asl Apod Cyp2d22 Klk6 Fcrls
rowData names(11): gene ENTREZID ... phenotype_description
hsapiens_homolog_associated_gene_name
colnames(22): GSM2545336 GSM2545337 ... GSM2545363 GSM2545380
colData names(10): title geo_accession ... tissue mouse
To subset to the first 3 samples we could do:
se[, 1:3]
class: RangedSummarizedExperiment
dim: 1474 3
metadata(0):
assays(1): counts
rownames(1474): Asl Apod ... Lmx1a Pbx1
rowData names(11): gene ENTREZID ... phenotype_description
hsapiens_homolog_associated_gene_name
colnames(3): GSM2545336 GSM2545337 GSM2545338
colData names(10): title geo_accession ... tissue mouse
Below, we create a new instance of class SummarizedExperiment that contains only the first 5 features for the first 3 samples.
se1 <- se[1:5, 1:3]
se1
class: RangedSummarizedExperiment
dim: 5 3
metadata(0):
assays(1): counts
rownames(5): Asl Apod Cyp2d22 Klk6 Fcrls
rowData names(11): gene ENTREZID ... phenotype_description
hsapiens_homolog_associated_gene_name
colnames(3): GSM2545336 GSM2545337 GSM2545338
colData names(10): title geo_accession ... tissue mouse
assay(se1)
GSM2545336 GSM2545337 GSM2545338
Asl 1170 361 400
Apod 36194 10347 9173
Cyp2d22 4060 1616 1603
Klk6 287 629 641
Fcrls 85 233 244
colData(se1)
DataFrame with 3 rows and 10 columns
title geo_accession organism age sex
<character> <character> <character> <character> <factor>
GSM2545336 CNS_RNA-seq_10C GSM2545336 Mus musculus 8 weeks Female
GSM2545337 CNS_RNA-seq_11C GSM2545337 Mus musculus 8 weeks Female
GSM2545338 CNS_RNA-seq_12C GSM2545338 Mus musculus 8 weeks Female
infection strain time tissue mouse
<factor> <character> <factor> <factor> <factor>
GSM2545336 InfluenzaA C57BL/6 Day8 Cerebellum 14
GSM2545337 NonInfected C57BL/6 Day0 Cerebellum 9
GSM2545338 NonInfected C57BL/6 Day0 Cerebellum 10
rowData(se1)
DataFrame with 5 rows and 11 columns
gene ENTREZID product gbkey
<character> <character> <character> <character>
Asl Asl 109900 argininosuccinate ly.. mRNA
Apod Apod 11815 apolipoprotein D, tr.. mRNA
Cyp2d22 Cyp2d22 56448 cytochrome P450, fam.. mRNA
Klk6 Klk6 19144 kallikrein related-p.. mRNA
Fcrls Fcrls 80891 Fc receptor-like S, .. mRNA
external_gene_name ensembl_gene_id external_synonym chromosome_name
<character> <character> <character> <character>
Asl Asl ENSMUSG00000025533 2510006M18Rik 5
Apod Apod ENSMUSG00000022548 NA 16
Cyp2d22 Cyp2d22 ENSMUSG00000061740 2D22 15
Klk6 Klk6 ENSMUSG00000050063 Bssp 7
Fcrls Fcrls ENSMUSG00000015852 2810439C17Rik 3
gene_biotype phenotype_description
<character> <character>
Asl protein_coding abnormal circulating..
Apod protein_coding abnormal lipid homeo..
Cyp2d22 protein_coding abnormal skin morpho..
Klk6 protein_coding abnormal cytokine le..
Fcrls protein_coding decreased CD8-positi..
hsapiens_homolog_associated_gene_name
<character>
Asl ASL
Apod APOD
Cyp2d22 CYP2D6
Klk6 KLK6
Fcrls FCRL2
We can also use the colData() function to subset on something from the sample metadata, or the rowData() function to subset on something from the feature metadata.
For example, here we keep only miRNAs and the non-infected samples:
se1 <- se[rowData(se)$gene_biotype == "miRNA",
colData(se)$infection == "NonInfected"]
se1
class: RangedSummarizedExperiment
dim: 7 7
metadata(0):
assays(1): counts
rownames(7): Mir1901 Mir378a ... Mir128-1 Mir7682
rowData names(11): gene ENTREZID ... phenotype_description
hsapiens_homolog_associated_gene_name
colnames(7): GSM2545337 GSM2545338 ... GSM2545353 GSM2545354
colData names(10): title geo_accession ... tissue mouse
assay(se1)
GSM2545337 GSM2545338 GSM2545343 GSM2545348 GSM2545349 GSM2545353
Mir1901 45 44 74 55 68 33
Mir378a 11 7 9 4 12 4
Mir133b 4 6 5 4 6 7
Mir30c-2 10 6 16 12 8 17
Mir149 1 2 0 0 0 0
Mir128-1 4 1 2 2 1 2
Mir7682 2 0 4 1 3 5
GSM2545354
Mir1901 60
Mir378a 8
Mir133b 3
Mir30c-2 15
Mir149 2
Mir128-1 1
Mir7682 5
colData(se1)
DataFrame with 7 rows and 10 columns
title geo_accession organism age sex
<character> <character> <character> <character> <factor>
GSM2545337 CNS_RNA-seq_11C GSM2545337 Mus musculus 8 weeks Female
GSM2545338 CNS_RNA-seq_12C GSM2545338 Mus musculus 8 weeks Female
GSM2545343 CNS_RNA-seq_20C GSM2545343 Mus musculus 8 weeks Male
GSM2545348 CNS_RNA-seq_27C GSM2545348 Mus musculus 8 weeks Female
GSM2545349 CNS_RNA-seq_28C GSM2545349 Mus musculus 8 weeks Male
GSM2545353 CNS_RNA-seq_3C GSM2545353 Mus musculus 8 weeks Female
GSM2545354 CNS_RNA-seq_4C GSM2545354 Mus musculus 8 weeks Male
infection strain time tissue mouse
<factor> <character> <factor> <factor> <factor>
GSM2545337 NonInfected C57BL/6 Day0 Cerebellum 9
GSM2545338 NonInfected C57BL/6 Day0 Cerebellum 10
GSM2545343 NonInfected C57BL/6 Day0 Cerebellum 11
GSM2545348 NonInfected C57BL/6 Day0 Cerebellum 8
GSM2545349 NonInfected C57BL/6 Day0 Cerebellum 7
GSM2545353 NonInfected C57BL/6 Day0 Cerebellum 4
GSM2545354 NonInfected C57BL/6 Day0 Cerebellum 2
rowData(se1)
DataFrame with 7 rows and 11 columns
gene ENTREZID product gbkey
<character> <character> <character> <character>
Mir1901 Mir1901 100316686 microRNA 1901 precursor_RNA
Mir378a Mir378a 723889 microRNA 378a precursor_RNA
Mir133b Mir133b 723817 microRNA 133b precursor_RNA
Mir30c-2 Mir30c-2 723964 microRNA 30c-2 precursor_RNA
Mir149 Mir149 387167 microRNA 149 precursor_RNA
Mir128-1 Mir128-1 387147 microRNA 128-1 precursor_RNA
Mir7682 Mir7682 102466847 microRNA 7682 precursor_RNA
external_gene_name ensembl_gene_id external_synonym chromosome_name
<character> <character> <character> <character>
Mir1901 Mir1901 ENSMUSG00000084565 Mirn1901 18
Mir378a Mir378a ENSMUSG00000105200 Mirn378 18
Mir133b Mir133b ENSMUSG00000065480 mir 133b 1
Mir30c-2 Mir30c-2 ENSMUSG00000065567 mir 30c-2 1
Mir149 Mir149 ENSMUSG00000065470 Mirn149 1
Mir128-1 Mir128-1 ENSMUSG00000065520 Mirn128 1
Mir7682 Mir7682 ENSMUSG00000106406 mmu-mir-7682 1
gene_biotype phenotype_description
<character> <character>
Mir1901 miRNA NA
Mir378a miRNA abnormal mitochondri..
Mir133b miRNA no abnormal phenotyp..
Mir30c-2 miRNA NA
Mir149 miRNA increased circulatin..
Mir128-1 miRNA no abnormal phenotyp..
Mir7682 miRNA NA
hsapiens_homolog_associated_gene_name
<character>
Mir1901 NA
Mir378a MIR378A
Mir133b MIR133B
Mir30c-2 MIR30C2
Mir149 NA
Mir128-1 MIR128-1
Mir7682 NA
We can subset to a genomic region of interest using the subsetByOverlaps function.
# Define a region of interest, e.g. positions 1 to 10,000,000 of chromosome 1
roi <- GRanges(seqnames = "1", ranges = IRanges(start = 1, end = 10000000))
# Subset se object for only rows in the region of interest
subsetByOverlaps(se, roi)
class: RangedSummarizedExperiment
dim: 20 22
metadata(0):
assays(1): counts
rownames(20): Sox17 Rp1 ... Snord87 Tcf24
rowData names(11): gene ENTREZID ... phenotype_description
hsapiens_homolog_associated_gene_name
colnames(22): GSM2545336 GSM2545337 ... GSM2545363 GSM2545380
colData names(10): title geo_accession ... tissue mouse
Challenge
- Extract the gene expression levels of the first 3 genes in samples at time 0 and at time 8.
- Extract all genes on chromosome Y between 1 and 10,000,000 on the negative strand.
Solution
assay(se)[1:3, colData(se)$time != "Day4"]
           GSM2545336 GSM2545337 GSM2545338 GSM2545341 GSM2545342 GSM2545343
Asl              1170        361        400        988        836        535
Apod            36194      10347       9173      29594      24959      13668
Cyp2d22          4060       1616       1603       3349       3122       2008
           GSM2545346 GSM2545347 GSM2545348 GSM2545349 GSM2545351 GSM2545353
Asl               938       1035        494        481        937        541
Apod            27769      34301      11258      11812      29242      13682
Cyp2d22          2985       3452       1883       2014       3678       2216
           GSM2545354 GSM2545380
Asl               473       1192
Apod            11088      38148
Cyp2d22          1821       4019
# Equivalent to assay(se)[1:3, colData(se)$time == "Day0" | colData(se)$time == "Day8"]
roi <- GRanges(seqnames = "Y", ranges = IRanges(start = 1, end = 10000000), strand = "-")
subsetByOverlaps(se, roi)
class: RangedSummarizedExperiment
dim: 3 22
metadata(0):
assays(1): counts
rownames(3): Ddx3y Uty Gm4017
rowData names(11): gene ENTREZID ... phenotype_description
  hsapiens_homolog_associated_gene_name
colnames(22): GSM2545336 GSM2545337 ... GSM2545363 GSM2545380
colData names(10): title geo_accession ... tissue mouse
Adding variables to metadata
We can also add information to the sample metadata. Suppose that you want to add the center where the samples were collected:
colData(se)$center <- rep("University of Illinois", nrow(colData(se)))
colData(se)
DataFrame with 22 rows and 11 columns
title geo_accession organism age sex
<character> <character> <character> <character> <factor>
GSM2545336 CNS_RNA-seq_10C GSM2545336 Mus musculus 8 weeks Female
GSM2545337 CNS_RNA-seq_11C GSM2545337 Mus musculus 8 weeks Female
GSM2545338 CNS_RNA-seq_12C GSM2545338 Mus musculus 8 weeks Female
GSM2545339 CNS_RNA-seq_13C GSM2545339 Mus musculus 8 weeks Female
GSM2545340 CNS_RNA-seq_14C GSM2545340 Mus musculus 8 weeks Male
... ... ... ... ... ...
GSM2545353 CNS_RNA-seq_3C GSM2545353 Mus musculus 8 weeks Female
GSM2545354 CNS_RNA-seq_4C GSM2545354 Mus musculus 8 weeks Male
GSM2545362 CNS_RNA-seq_5C GSM2545362 Mus musculus 8 weeks Female
GSM2545363 CNS_RNA-seq_6C GSM2545363 Mus musculus 8 weeks Male
GSM2545380 CNS_RNA-seq_9C GSM2545380 Mus musculus 8 weeks Female
infection strain time tissue mouse
<factor> <character> <factor> <factor> <factor>
GSM2545336 InfluenzaA C57BL/6 Day8 Cerebellum 14
GSM2545337 NonInfected C57BL/6 Day0 Cerebellum 9
GSM2545338 NonInfected C57BL/6 Day0 Cerebellum 10
GSM2545339 InfluenzaA C57BL/6 Day4 Cerebellum 15
GSM2545340 InfluenzaA C57BL/6 Day4 Cerebellum 18
... ... ... ... ... ...
GSM2545353 NonInfected C57BL/6 Day0 Cerebellum 4
GSM2545354 NonInfected C57BL/6 Day0 Cerebellum 2
GSM2545362 InfluenzaA C57BL/6 Day4 Cerebellum 20
GSM2545363 InfluenzaA C57BL/6 Day4 Cerebellum 12
GSM2545380 InfluenzaA C57BL/6 Day8 Cerebellum 19
center
<character>
GSM2545336 University of Illinois
GSM2545337 University of Illinois
GSM2545338 University of Illinois
GSM2545339 University of Illinois
GSM2545340 University of Illinois
... ...
GSM2545353 University of Illinois
GSM2545354 University of Illinois
GSM2545362 University of Illinois
GSM2545363 University of Illinois
GSM2545380 University of Illinois
This illustrates that the metadata slots can grow indefinitely without affecting the other structures!
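rowData can be extended in exactly the same way. As a small sketch, we could store the mean count of each gene alongside the existing feature metadata (mean_count is a name we choose here):

```r
## Add a feature-level variable computed from the assay
rowData(se)$mean_count <- rowMeans(assay(se))
head(rowData(se)$mean_count)
```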
tidySummarizedExperiment
You may be wondering whether we can use tidyverse commands to interact with SummarizedExperiment objects. We can, with the tidySummarizedExperiment package.
Remember what our SummarizedExperiment object looks like.
se
class: RangedSummarizedExperiment
dim: 1474 22
metadata(0):
assays(1): counts
rownames(1474): Asl Apod ... Lmx1a Pbx1
rowData names(11): gene ENTREZID ... phenotype_description
hsapiens_homolog_associated_gene_name
colnames(22): GSM2545336 GSM2545337 ... GSM2545363 GSM2545380
colData names(11): title geo_accession ... mouse center
Load tidyverse and tidySummarizedExperiment and then take a look at the se object again.
# BiocManager::install("tidySummarizedExperiment")
library("tidyverse")
library("tidySummarizedExperiment")
se
# A SummarizedExperiment-tibble abstraction: 32,428 × 22
# Features=1474 | Samples=22 | Assays=counts
.feature .sample counts title geo_accession organism age sex infection
<chr> <chr> <int> <chr> <chr> <chr> <chr> <fct> <fct>
1 Asl GSM2545336 1170 CNS_… GSM2545336 Mus mus… 8 we… Fema… Influenz…
2 Apod GSM2545336 36194 CNS_… GSM2545336 Mus mus… 8 we… Fema… Influenz…
3 Cyp2d22 GSM2545336 4060 CNS_… GSM2545336 Mus mus… 8 we… Fema… Influenz…
4 Klk6 GSM2545336 287 CNS_… GSM2545336 Mus mus… 8 we… Fema… Influenz…
5 Fcrls GSM2545336 85 CNS_… GSM2545336 Mus mus… 8 we… Fema… Influenz…
6 Slc2a4 GSM2545336 782 CNS_… GSM2545336 Mus mus… 8 we… Fema… Influenz…
7 Exd2 GSM2545336 1619 CNS_… GSM2545336 Mus mus… 8 we… Fema… Influenz…
8 Gjc2 GSM2545336 288 CNS_… GSM2545336 Mus mus… 8 we… Fema… Influenz…
9 Plp1 GSM2545336 43217 CNS_… GSM2545336 Mus mus… 8 we… Fema… Influenz…
10 Gnb4 GSM2545336 1071 CNS_… GSM2545336 Mus mus… 8 we… Fema… Influenz…
# … with 40 more rows, and 21 more variables: strain <chr>, time <fct>,
# tissue <fct>, mouse <fct>, center <chr>, gene <chr>, ENTREZID <chr>,
# product <chr>, gbkey <chr>, external_gene_name <chr>,
# ensembl_gene_id <chr>, external_synonym <chr>, chromosome_name <chr>,
# gene_biotype <chr>, phenotype_description <chr>,
# hsapiens_homolog_associated_gene_name <chr>, seqnames <fct>, start <int>,
# end <int>, width <int>, strand <fct>
It’s still a SummarizedExperiment object, but now we can view it as a tibble. Note that the first line of the output says exactly this: it’s a SummarizedExperiment-tibble abstraction. The second line of the output shows the number of features and samples.
Reverting to standard SummarizedExperiment view
If we want to revert to the standard SummarizedExperiment view we can do that.
options("restore_SummarizedExperiment_show" = TRUE)
se
class: RangedSummarizedExperiment
dim: 1474 22
metadata(0):
assays(1): counts
rownames(1474): Asl Apod ... Lmx1a Pbx1
rowData names(11): gene ENTREZID ... phenotype_description
  hsapiens_homolog_associated_gene_name
colnames(22): GSM2545336 GSM2545337 ... GSM2545363 GSM2545380
colData names(11): title geo_accession ... mouse center
If we want to return to the tidy SummarizedExperiment view we can.
options("restore_SummarizedExperiment_show" = FALSE)
se
# A SummarizedExperiment-tibble abstraction: 32,428 × 22
# Features=1474 | Samples=22 | Assays=counts
   .feature .sample    counts title geo_accession organism age   sex   infection
   <chr>    <chr>       <int> <chr> <chr>         <chr>    <chr> <fct> <fct>
 1 Asl      GSM2545336   1170 CNS_… GSM2545336    Mus mus… 8 we… Fema… Influenz…
 2 Apod     GSM2545336  36194 CNS_… GSM2545336    Mus mus… 8 we… Fema… Influenz…
 3 Cyp2d22  GSM2545336   4060 CNS_… GSM2545336    Mus mus… 8 we… Fema… Influenz…
 4 Klk6     GSM2545336    287 CNS_… GSM2545336    Mus mus… 8 we… Fema… Influenz…
 5 Fcrls    GSM2545336     85 CNS_… GSM2545336    Mus mus… 8 we… Fema… Influenz…
 6 Slc2a4   GSM2545336    782 CNS_… GSM2545336    Mus mus… 8 we… Fema… Influenz…
 7 Exd2     GSM2545336   1619 CNS_… GSM2545336    Mus mus… 8 we… Fema… Influenz…
 8 Gjc2     GSM2545336    288 CNS_… GSM2545336    Mus mus… 8 we… Fema… Influenz…
 9 Plp1     GSM2545336  43217 CNS_… GSM2545336    Mus mus… 8 we… Fema… Influenz…
10 Gnb4     GSM2545336   1071 CNS_… GSM2545336    Mus mus… 8 we… Fema… Influenz…
# … with 40 more rows, and 21 more variables: strain <chr>, time <fct>,
#   tissue <fct>, mouse <fct>, center <chr>, gene <chr>, ENTREZID <chr>,
#   product <chr>, gbkey <chr>, external_gene_name <chr>,
#   ensembl_gene_id <chr>, external_synonym <chr>, chromosome_name <chr>,
#   gene_biotype <chr>, phenotype_description <chr>,
#   hsapiens_homolog_associated_gene_name <chr>, seqnames <fct>, start <int>,
#   end <int>, width <int>, strand <fct>
We can now use tidyverse commands to interact with the SummarizedExperiment object.
We can use filter to filter rows using a condition, e.g. to remove a failed sample.
se %>%
filter(.sample != "GSM2545336")
# A SummarizedExperiment-tibble abstraction: 30,954 × 21
# Features=1474 | Samples=21 | Assays=counts
.feature .sample counts title geo_accession organism age sex infection
<chr> <chr> <int> <chr> <chr> <chr> <chr> <fct> <fct>
1 Asl GSM2545337 361 CNS_… GSM2545337 Mus mus… 8 we… Fema… NonInfec…
2 Apod GSM2545337 10347 CNS_… GSM2545337 Mus mus… 8 we… Fema… NonInfec…
3 Cyp2d22 GSM2545337 1616 CNS_… GSM2545337 Mus mus… 8 we… Fema… NonInfec…
4 Klk6 GSM2545337 629 CNS_… GSM2545337 Mus mus… 8 we… Fema… NonInfec…
5 Fcrls GSM2545337 233 CNS_… GSM2545337 Mus mus… 8 we… Fema… NonInfec…
6 Slc2a4 GSM2545337 231 CNS_… GSM2545337 Mus mus… 8 we… Fema… NonInfec…
7 Exd2 GSM2545337 2288 CNS_… GSM2545337 Mus mus… 8 we… Fema… NonInfec…
8 Gjc2 GSM2545337 595 CNS_… GSM2545337 Mus mus… 8 we… Fema… NonInfec…
9 Plp1 GSM2545337 101241 CNS_… GSM2545337 Mus mus… 8 we… Fema… NonInfec…
10 Gnb4 GSM2545337 1791 CNS_… GSM2545337 Mus mus… 8 we… Fema… NonInfec…
# … with 40 more rows, and 21 more variables: strain <chr>, time <fct>,
# tissue <fct>, mouse <fct>, center <chr>, gene <chr>, ENTREZID <chr>,
# product <chr>, gbkey <chr>, external_gene_name <chr>,
# ensembl_gene_id <chr>, external_synonym <chr>, chromosome_name <chr>,
# gene_biotype <chr>, phenotype_description <chr>,
# hsapiens_homolog_associated_gene_name <chr>, seqnames <fct>, start <int>,
# end <int>, width <int>, strand <fct>
We can use select to specify the columns we want to view.
se %>%
select(.sample)
tidySummarizedExperiment says: Key columns are missing. A data frame is returned for independent data analysis.
# A tibble: 32,428 × 1
.sample
<chr>
1 GSM2545336
2 GSM2545336
3 GSM2545336
4 GSM2545336
5 GSM2545336
6 GSM2545336
7 GSM2545336
8 GSM2545336
9 GSM2545336
10 GSM2545336
# … with 32,418 more rows
se %>%
count(.sample)
tidySummarizedExperiment says: A data frame is returned for independent data analysis.
# A tibble: 22 × 2
.sample n
<chr> <int>
1 GSM2545336 1474
2 GSM2545337 1474
3 GSM2545338 1474
4 GSM2545339 1474
5 GSM2545340 1474
6 GSM2545341 1474
7 GSM2545342 1474
8 GSM2545343 1474
9 GSM2545344 1474
10 GSM2545345 1474
# … with 12 more rows
We can use mutate to add metadata info.
se %>%
mutate(center = "University of Melbourne")
# A SummarizedExperiment-tibble abstraction: 32,428 × 22
# Features=1474 | Samples=22 | Assays=counts
.feature .sample counts title geo_accession organism age sex infection
<chr> <chr> <int> <chr> <chr> <chr> <chr> <fct> <fct>
1 Asl GSM2545336 1170 CNS_… GSM2545336 Mus mus… 8 we… Fema… Influenz…
2 Apod GSM2545336 36194 CNS_… GSM2545336 Mus mus… 8 we… Fema… Influenz…
3 Cyp2d22 GSM2545336 4060 CNS_… GSM2545336 Mus mus… 8 we… Fema… Influenz…
4 Klk6 GSM2545336 287 CNS_… GSM2545336 Mus mus… 8 we… Fema… Influenz…
5 Fcrls GSM2545336 85 CNS_… GSM2545336 Mus mus… 8 we… Fema… Influenz…
6 Slc2a4 GSM2545336 782 CNS_… GSM2545336 Mus mus… 8 we… Fema… Influenz…
7 Exd2 GSM2545336 1619 CNS_… GSM2545336 Mus mus… 8 we… Fema… Influenz…
8 Gjc2 GSM2545336 288 CNS_… GSM2545336 Mus mus… 8 we… Fema… Influenz…
9 Plp1 GSM2545336 43217 CNS_… GSM2545336 Mus mus… 8 we… Fema… Influenz…
10 Gnb4 GSM2545336 1071 CNS_… GSM2545336 Mus mus… 8 we… Fema… Influenz…
# … with 40 more rows, and 21 more variables: strain <chr>, time <fct>,
# tissue <fct>, mouse <fct>, center <chr>, gene <chr>, ENTREZID <chr>,
# product <chr>, gbkey <chr>, external_gene_name <chr>,
# ensembl_gene_id <chr>, external_synonym <chr>, chromosome_name <chr>,
# gene_biotype <chr>, phenotype_description <chr>,
# hsapiens_homolog_associated_gene_name <chr>, seqnames <fct>, start <int>,
# end <int>, width <int>, strand <fct>
We can also combine commands with the tidyverse pipe %>%. For example, we could combine group_by and summarise to get the total counts for each sample.
se %>%
group_by(.sample) %>%
summarise(total_counts=sum(counts))
tidySummarizedExperiment says: A data frame is returned for independent data analysis.
# A tibble: 22 × 2
.sample total_counts
<chr> <int>
1 GSM2545336 3039671
2 GSM2545337 2602360
3 GSM2545338 2458618
4 GSM2545339 2500082
5 GSM2545340 2479024
6 GSM2545341 2413723
7 GSM2545342 2349728
8 GSM2545343 3105652
9 GSM2545344 2524137
10 GSM2545345 2506038
# … with 12 more rows
Some other useful tidyverse commands we haven’t seen before are slice, distinct and unite.
We can use slice to choose rows by number, e.g. to choose the first row.
se %>%
slice(1)
# A SummarizedExperiment-tibble abstraction: 1 × 1
# Features=1 | Samples=1 | Assays=counts
.feature .sample counts title geo_accession organism age sex infection
<chr> <chr> <int> <chr> <chr> <chr> <chr> <fct> <fct>
1 Asl GSM2545336 1170 CNS_R… GSM2545336 Mus mus… 8 we… Fema… Influenz…
# … with 21 more variables: strain <chr>, time <fct>, tissue <fct>,
# mouse <fct>, center <chr>, gene <chr>, ENTREZID <chr>, product <chr>,
# gbkey <chr>, external_gene_name <chr>, ensembl_gene_id <chr>,
# external_synonym <chr>, chromosome_name <chr>, gene_biotype <chr>,
# phenotype_description <chr>, hsapiens_homolog_associated_gene_name <chr>,
# seqnames <fct>, start <int>, end <int>, width <int>, strand <fct>
We can use distinct to see what unique information we have.
se %>%
distinct(.sample, infection, sex)
tidySummarizedExperiment says: A data frame is returned for independent data analysis.
# A tibble: 22 × 3
.sample sex infection
<chr> <fct> <fct>
1 GSM2545336 Female InfluenzaA
2 GSM2545337 Female NonInfected
3 GSM2545338 Female NonInfected
4 GSM2545339 Female InfluenzaA
5 GSM2545340 Male InfluenzaA
6 GSM2545341 Male InfluenzaA
7 GSM2545342 Female InfluenzaA
8 GSM2545343 Male NonInfected
9 GSM2545344 Female InfluenzaA
10 GSM2545345 Male InfluenzaA
# … with 12 more rows
We can use unite to combine multiple columns into a single column.
se %>%
unite("group", c(infection, time))
tidySummarizedExperiment says: Key columns are missing. A data frame is returned for independent data analysis.
# A SummarizedExperiment-tibble abstraction: 32,428 × 22
# Features=1474 | Samples=22 | Assays=counts
.feature .sample counts title geo_accession organism age sex group strain
<chr> <chr> <int> <chr> <chr> <chr> <chr> <fct> <chr> <chr>
1 Asl GSM254… 1170 CNS_… GSM2545336 Mus mus… 8 we… Fema… Infl… C57BL…
2 Apod GSM254… 36194 CNS_… GSM2545336 Mus mus… 8 we… Fema… Infl… C57BL…
3 Cyp2d22 GSM254… 4060 CNS_… GSM2545336 Mus mus… 8 we… Fema… Infl… C57BL…
4 Klk6 GSM254… 287 CNS_… GSM2545336 Mus mus… 8 we… Fema… Infl… C57BL…
5 Fcrls GSM254… 85 CNS_… GSM2545336 Mus mus… 8 we… Fema… Infl… C57BL…
6 Slc2a4 GSM254… 782 CNS_… GSM2545336 Mus mus… 8 we… Fema… Infl… C57BL…
7 Exd2 GSM254… 1619 CNS_… GSM2545336 Mus mus… 8 we… Fema… Infl… C57BL…
8 Gjc2 GSM254… 288 CNS_… GSM2545336 Mus mus… 8 we… Fema… Infl… C57BL…
9 Plp1 GSM254… 43217 CNS_… GSM2545336 Mus mus… 8 we… Fema… Infl… C57BL…
10 Gnb4 GSM254… 1071 CNS_… GSM2545336 Mus mus… 8 we… Fema… Infl… C57BL…
# … with 40 more rows, and 19 more variables: tissue <fct>, mouse <fct>,
# center <chr>, gene <chr>, ENTREZID <chr>, product <chr>, gbkey <chr>,
# external_gene_name <chr>, ensembl_gene_id <chr>, external_synonym <chr>,
# chromosome_name <chr>, gene_biotype <chr>, phenotype_description <chr>,
# hsapiens_homolog_associated_gene_name <chr>, seqnames <fct>, start <int>,
# end <int>, width <int>, strand <fct>
We can treat se as a normal tibble for plotting. Here we plot the distribution of counts per sample.
se %>%
ggplot(aes(counts + 1, group=.sample, color=infection)) +
geom_density() +
scale_x_log10() +
theme_bw()
To work with genomic coordinates (ranges) with tidy methods, performing subsetting etc, we can use the plyranges package.
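For example, a minimal plyranges sketch (assuming the package is installed) that keeps only the genes on the negative strand could look like:

```r
# BiocManager::install("plyranges")
library("plyranges")

## Filter the ranges of se with a dplyr-style verb from plyranges
rowRanges(se) %>%
    filter(strand == "-")
```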
Challenge
- Extract the miRNA genes from the NonInfected samples, this time using tidyverse commands.
- Extract the gene expression levels of the first 3 genes in samples at time 0 and at time 8, this time using tidyverse commands.
- Create a bar plot of the total counts per sample using the se object. Hint: use geom_col to plot the summarised counts.
Solution
se %>%
    filter(gene_biotype == "miRNA" & infection == "NonInfected")
# A SummarizedExperiment-tibble abstraction: 49 × 7
# Features=7 | Samples=7 | Assays=counts
   .feature .sample    counts title geo_accession organism age   sex   infection
   <chr>    <chr>       <int> <chr> <chr>         <chr>    <chr> <fct> <fct>
 1 Mir1901  GSM2545337     45 CNS_… GSM2545337    Mus mus… 8 we… Fema… NonInfec…
 2 Mir378a  GSM2545337     11 CNS_… GSM2545337    Mus mus… 8 we… Fema… NonInfec…
 3 Mir133b  GSM2545337      4 CNS_… GSM2545337    Mus mus… 8 we… Fema… NonInfec…
 4 Mir30c-2 GSM2545337     10 CNS_… GSM2545337    Mus mus… 8 we… Fema… NonInfec…
 5 Mir149   GSM2545337      1 CNS_… GSM2545337    Mus mus… 8 we… Fema… NonInfec…
 6 Mir128-1 GSM2545337      4 CNS_… GSM2545337    Mus mus… 8 we… Fema… NonInfec…
 7 Mir7682  GSM2545337      2 CNS_… GSM2545337    Mus mus… 8 we… Fema… NonInfec…
 8 Mir1901  GSM2545338     44 CNS_… GSM2545338    Mus mus… 8 we… Fema… NonInfec…
 9 Mir378a  GSM2545338      7 CNS_… GSM2545338    Mus mus… 8 we… Fema… NonInfec…
10 Mir133b  GSM2545338      6 CNS_… GSM2545338    Mus mus… 8 we… Fema… NonInfec…
# … with 39 more rows, and 21 more variables: strain <chr>, time <fct>,
#   tissue <fct>, mouse <fct>, center <chr>, gene <chr>, ENTREZID <chr>,
#   product <chr>, gbkey <chr>, external_gene_name <chr>,
#   ensembl_gene_id <chr>, external_synonym <chr>, chromosome_name <chr>,
#   gene_biotype <chr>, phenotype_description <chr>,
#   hsapiens_homolog_associated_gene_name <chr>, seqnames <fct>, start <int>,
#   end <int>, width <int>, strand <fct>
se %>%
    filter(time == "Day0" | time == "Day8") %>%
    group_by(.sample) %>%
    slice(1:3)
tidySummarizedExperiment says: A data frame is returned for independent data analysis.
# A tibble: 42 × 30
# Groups:   .sample [14]
   .feature .sample    counts title geo_accession organism age   sex   infection
   <chr>    <chr>       <int> <chr> <chr>         <chr>    <chr> <fct> <fct>
 1 Asl      GSM2545336   1170 CNS_… GSM2545336    Mus mus… 8 we… Fema… Influenz…
 2 Apod     GSM2545336  36194 CNS_… GSM2545336    Mus mus… 8 we… Fema… Influenz…
 3 Cyp2d22  GSM2545336   4060 CNS_… GSM2545336    Mus mus… 8 we… Fema… Influenz…
 4 Asl      GSM2545337    361 CNS_… GSM2545337    Mus mus… 8 we… Fema… NonInfec…
 5 Apod     GSM2545337  10347 CNS_… GSM2545337    Mus mus… 8 we… Fema… NonInfec…
 6 Cyp2d22  GSM2545337   1616 CNS_… GSM2545337    Mus mus… 8 we… Fema… NonInfec…
 7 Asl      GSM2545338    400 CNS_… GSM2545338    Mus mus… 8 we… Fema… NonInfec…
 8 Apod     GSM2545338   9173 CNS_… GSM2545338    Mus mus… 8 we… Fema… NonInfec…
 9 Cyp2d22  GSM2545338   1603 CNS_… GSM2545338    Mus mus… 8 we… Fema… NonInfec…
10 Asl      GSM2545341    988 CNS_… GSM2545341    Mus mus… 8 we… Male  Influenz…
# … with 32 more rows, and 21 more variables: strain <chr>, time <fct>,
#   tissue <fct>, mouse <fct>, center <chr>, gene <chr>, ENTREZID <chr>,
#   product <chr>, gbkey <chr>, external_gene_name <chr>,
#   ensembl_gene_id <chr>, external_synonym <chr>, chromosome_name <chr>,
#   gene_biotype <chr>, phenotype_description <chr>,
#   hsapiens_homolog_associated_gene_name <chr>, seqnames <fct>, start <int>,
#   end <int>, width <int>, strand <fct>
se %>%
    group_by(.sample) %>%
    summarise(total_counts = sum(counts)) %>%
    ggplot(aes(x = .sample, y = total_counts)) +
    geom_col() +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5))
tidySummarizedExperiment says: A data frame is returned for independent data analysis.
Interactive exploration
iSEE is a Bioconductor package that provides an interactive graphical user interface for exploring data stored in SummarizedExperiment objects.
# We won't use it here, but to install and use it:
# BiocManager::install('iSEE')
# library(iSEE)
# iSEE(se)
Extensions to SummarizedExperiment
- SingleCellExperiment container extends SummarizedExperiment for single-cell data. It follows similar conventions: rows represent features (genes, transcripts, genomic regions) and columns represent cells. It provides methods for storing dimensionality reduction results and data for alternative feature sets (e.g., synthetic spike-in transcripts, antibody-derived tags). It is the central data structure for Bioconductor single-cell packages.
The Orchestrating Single-Cell Analysis with Bioconductor book is a great resource for single cell analysis that describes using SingleCellExperiment.
Just as tidySummarizedExperiment allows interacting with SummarizedExperiment objects using tidyverse commands, the tidySingleCellExperiment package does the same for SingleCellExperiment objects.
There are also extensions to SummarizedExperiment for other data types:
- SpatialExperiment for spatial data
- TreeSummarizedExperiment for microbiome and other data with hierarchical structures
- QFeatures for proteomics and metabolomics mass spectrometry data
- MultiAssayExperiment for multi-omics data
The Bioconductor project was initiated by Robert Gentleman, one of the two creators of the R language. Bioconductor provides tools dedicated to omics data analysis. Bioconductor uses the R statistical programming language, and is open source and open development. ↩
Key Points
SummarizedExperiment represents an efficient way to store and handle omics data.
SummarizedExperiment is used in many Bioconductor packages.
We can use tidySummarizedExperiment to interact with SummarizedExperiment objects using tidyverse commands.
Bioconductor
Overview
Teaching: XX min
Exercises: XX min
Questions
What is Bioconductor?
How can I use Bioconductor effectively for my analysis?
Objectives
Give an overview of the Bioconductor project including its website
Introduce concepts of reproducibility, coherence, interoperability and stability
Bioconductor
In the previous lesson we already learned a little bit about the Bioconductor[^Bioconductor] project. In this class we will formalize our understanding of it.
Important: Remember that Bioconductor packages are installed via the function BiocManager::install().
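As a sketch of this in practice, the snippet below installs BiocManager itself from CRAN once, then uses it for every subsequent package (the package names are just examples):

```r
# Install BiocManager once from CRAN, if it is not already available
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

# Then use BiocManager::install() for all packages,
# whether they live on Bioconductor...
BiocManager::install("SummarizedExperiment")
# ...or on CRAN, which Bioconductor mirrors
BiocManager::install("ggplot2")
```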
Bioconductor is a repository that collects open-source software facilitating rigorous and reproducible analysis of data from current and emerging biological assays in R. In addition, Bioconductor supports development, education and a thriving community. The broad goals of the Bioconductor project are:
- To provide widespread access to a broad range of powerful statistical and graphical methods for the analysis of genomic data.
- To facilitate the inclusion of biological metadata in the analysis of genomic data, e.g. literature data from PubMed, annotation data from Entrez genes.
- To provide a common software platform that enables the rapid development and deployment of extensible, scalable, and interoperable software.
- To further scientific understanding by producing high-quality documentation and reproducible research.
- To train researchers on computational and statistical methods for the analysis of genomic data.
One of the best ways to explore the Bioconductor project is its website.
The Bioconductor website
The website tells us that there are over 2,000 packages. This is obviously way too many packages to explore individually. So how would you find potentially useful packages for the analysis of your dataset?
Exploring different topics with biocViews
In Bioconductor, each package is classified as belonging to different categories. These different categories are called biocViews and they are structured as follows:
- Software: Packages that provide functions for statistical or graphical methods.
- AnnotationData: Packages that store annotations (e.g. genome annotation) and their respective access functions.
- ExperimentData: Packages that store example datasets.
- Workflow: Packages that assemble html tutorials using multiple packages for an analysis.
Each category is further divided into additional sub-categories, which are divided into sub-sub-categories that refer to specific assays, techniques, and research fields.
BiocViews can help tremendously with identifying packages that could be of use to you in the analysis of your dataset. These are easily searchable once you have navigated to the page that lists all packages.
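If you prefer to search from within R, BiocManager also offers a simple pattern-based lookup (the pattern below is just an example):

```r
# List packages in the current Bioconductor release whose
# name matches a regular expression
BiocManager::available("^TxDb.Hsapiens")
```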
Challenge
In your groups, identify a research topic of interest and then use the search function to find packages related to this topic. Think about the following:
- What keywords define your topic of interest?
- What type of packages are good for beginners and why?
- What other information on this website may be useful?
Exploiting the detailed documentation of the Bioconductor project
The other part of the Bioconductor website crucial for anyone’s learning journey is the help page. This page collects some outstanding Bioconductor learning resources such as
- Comprehensive books introducing coverage of a research field
- Courses and conference materials
- Videos
- Community resources and tutorials.
- Support site
Most importantly it introduces the concept of vignettes, which are part of Bioconductor’s mission to enhance reproducibility through rigorous documentation. In Bioconductor almost every package (certainly every software package) has to include a vignette. Vignettes are small tutorials that explain common use cases of a package. For example, let’s explore the vignettes available for SummarizedExperiment:
browseVignettes(package = "SummarizedExperiment")
This should open a separate browser tab listing the available vignettes. By clicking on the html link you will open a nicely formatted tutorial that you should easily be able to follow. Vignettes are a great place to start when trying to get familiar with a new package.
Core Bioconductor principles
Bioconductor is organized around some core principles:
- interoperability with existing infrastructure to facilitate reuse and avoid replication
- coherence in coding, documentation and use of existing infrastructure
- rigorous documentation
- reproducibility
- stability ensuring that there are limited clashes due to versions
While it is not absolutely necessary for the end-user to remember these, the core principles explain some of the idiosyncrasies of Bioconductor that you may come across.
S4 classes
Challenge
Use the function str on the SummarizedExperiment object you worked with during the last lesson. What oddity do you notice? Hint: compare the output to the output of the str function applied to rna.
Solution
We can see certain elements of the object starting with an @.
S4 programming enables object-oriented programming in R and thus ensures coherence, stability, and interoperability. Object-oriented programming is a programming model that organizes software design around data, or objects, rather than functions and logic. An object can be defined as a data field that has unique attributes and behavior. In S4, classes are defined with specific accessor functions (called methods).
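To make the idea concrete, here is a minimal, hypothetical S4 class (not part of Bioconductor) with one slot and one accessor method:

```r
# A toy S4 class with a single slot holding a count matrix
setClass("MiniAssay", slots = c(counts = "matrix"))

# An accessor generic and method: the recommended way to reach a slot
setGeneric("assayCounts", function(object) standardGeneric("assayCounts"))
setMethod("assayCounts", "MiniAssay", function(object) object@counts)

m <- new("MiniAssay", counts = matrix(1:4, nrow = 2))
assayCounts(m)  # preferred: uses the accessor method
m@counts        # works, but relies on the internal class structure
```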
# These two statements will result in the same output.
se@colData
DataFrame with 22 rows and 10 columns
title geo_accession organism age sex
<character> <character> <character> <character> <factor>
GSM2545336 CNS_RNA-seq_10C GSM2545336 Mus musculus 8 weeks Female
GSM2545337 CNS_RNA-seq_11C GSM2545337 Mus musculus 8 weeks Female
GSM2545338 CNS_RNA-seq_12C GSM2545338 Mus musculus 8 weeks Female
GSM2545339 CNS_RNA-seq_13C GSM2545339 Mus musculus 8 weeks Female
GSM2545340 CNS_RNA-seq_14C GSM2545340 Mus musculus 8 weeks Male
... ... ... ... ... ...
GSM2545353 CNS_RNA-seq_3C GSM2545353 Mus musculus 8 weeks Female
GSM2545354 CNS_RNA-seq_4C GSM2545354 Mus musculus 8 weeks Male
GSM2545362 CNS_RNA-seq_5C GSM2545362 Mus musculus 8 weeks Female
GSM2545363 CNS_RNA-seq_6C GSM2545363 Mus musculus 8 weeks Male
GSM2545380 CNS_RNA-seq_9C GSM2545380 Mus musculus 8 weeks Female
infection strain time tissue mouse
<factor> <character> <factor> <factor> <factor>
GSM2545336 InfluenzaA C57BL/6 Day8 Cerebellum 14
GSM2545337 NonInfected C57BL/6 Day0 Cerebellum 9
GSM2545338 NonInfected C57BL/6 Day0 Cerebellum 10
GSM2545339 InfluenzaA C57BL/6 Day4 Cerebellum 15
GSM2545340 InfluenzaA C57BL/6 Day4 Cerebellum 18
... ... ... ... ... ...
GSM2545353 NonInfected C57BL/6 Day0 Cerebellum 4
GSM2545354 NonInfected C57BL/6 Day0 Cerebellum 2
GSM2545362 InfluenzaA C57BL/6 Day4 Cerebellum 20
GSM2545363 InfluenzaA C57BL/6 Day4 Cerebellum 12
GSM2545380 InfluenzaA C57BL/6 Day8 Cerebellum 19
colData(se) #colData is a method that allows access to the column meta data
DataFrame with 22 rows and 10 columns
title geo_accession organism age sex
<character> <character> <character> <character> <factor>
GSM2545336 CNS_RNA-seq_10C GSM2545336 Mus musculus 8 weeks Female
GSM2545337 CNS_RNA-seq_11C GSM2545337 Mus musculus 8 weeks Female
GSM2545338 CNS_RNA-seq_12C GSM2545338 Mus musculus 8 weeks Female
GSM2545339 CNS_RNA-seq_13C GSM2545339 Mus musculus 8 weeks Female
GSM2545340 CNS_RNA-seq_14C GSM2545340 Mus musculus 8 weeks Male
... ... ... ... ... ...
GSM2545353 CNS_RNA-seq_3C GSM2545353 Mus musculus 8 weeks Female
GSM2545354 CNS_RNA-seq_4C GSM2545354 Mus musculus 8 weeks Male
GSM2545362 CNS_RNA-seq_5C GSM2545362 Mus musculus 8 weeks Female
GSM2545363 CNS_RNA-seq_6C GSM2545363 Mus musculus 8 weeks Male
GSM2545380 CNS_RNA-seq_9C GSM2545380 Mus musculus 8 weeks Female
infection strain time tissue mouse
<factor> <character> <factor> <factor> <factor>
GSM2545336 InfluenzaA C57BL/6 Day8 Cerebellum 14
GSM2545337 NonInfected C57BL/6 Day0 Cerebellum 9
GSM2545338 NonInfected C57BL/6 Day0 Cerebellum 10
GSM2545339 InfluenzaA C57BL/6 Day4 Cerebellum 15
GSM2545340 InfluenzaA C57BL/6 Day4 Cerebellum 18
... ... ... ... ... ...
GSM2545353 NonInfected C57BL/6 Day0 Cerebellum 4
GSM2545354 NonInfected C57BL/6 Day0 Cerebellum 2
GSM2545362 InfluenzaA C57BL/6 Day4 Cerebellum 20
GSM2545363 InfluenzaA C57BL/6 Day4 Cerebellum 12
GSM2545380 InfluenzaA C57BL/6 Day8 Cerebellum 19
Challenge
Why is it bad practice to use the @ to access parts of an object?
Solution
Using accessor methods to reach certain parts of an object enhances the readability of your code. Moreover, if the underlying class structure changes (e.g. the name of a slot), the method will still work.
If you want to know more about S4 classes, you can find more information in these slides on their implementation in the Bioconductor project.
The release cycle
Bioconductor has two releases each year, typically in April and October, in which all packages are updated to their next version in a way that keeps them interoperable (i.e. they do not clash when you load more than one). The releases coincide with the releases of new R versions, which also happen twice a year. This has two significant implications:
1) To ensure that packages on Bioconductor work flawlessly, always use BiocManager::install to install a package, even when it is technically not a Bioconductor package. Bioconductor mirrors most other R package repositories such as CRAN, so the package will most likely be available. This avoids clashes with packages being ahead of the Bioconductor release.
2) You will need to update your Bioconductor packages twice a year (after updating R) to have all the latest versions. Refer to this guide for updating R and RStudio
and this guide for updating Bioconductor.
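Two BiocManager helpers make this easy to check; a quick sketch:

```r
# Which Bioconductor release is this R installation using?
BiocManager::version()

# Are all installed packages consistent with that release?
# (flags packages that are out-of-date or too new)
BiocManager::valid()
```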
Working with annotations
Bioconductor provides extensive annotation resources. These can be gene centric, or genome centric. Annotations can be provided in packages curated by Bioconductor, or obtained from web-based resources.
Gene centric AnnotationDbi packages include:
- Organism level: e.g. org.Mm.eg.db.
- Platform level: e.g. hgu133plus2.db, hgu133plus2.probes, hgu133plus2.cdf.
- System-biology level: e.g. GO.db

Genome centric GenomicFeatures packages include:

- Transcriptome level: e.g. TxDb.Hsapiens.UCSC.hg19.knownGene, EnsDb.Hsapiens.v75.
- Generic genome features: can be generated via GenomicFeatures.

One web-based resource accesses biomart, via the biomaRt package:

- Query the web-based ‘biomart’ resource for genes, sequences, SNPs, etc.
The most popular annotation packages have been modified so that they can make use of a new set of methods to more easily access their contents. These four methods are named columns, keytypes, keys and select. They can currently be used with all chip, organism, and TxDb packages, along with the popular GO.db package.
An extremely common kind of annotation package is the so-called platform-based or chip-based package type. A package of this type maps the manufacturer’s labels for a series of probes or probesets to a wide range of gene-based features, and loads a ChipDb object. Below is a set of examples showing how you might use the four standard methods to interact with an object of this type.
First we need to load the package:
suppressPackageStartupMessages({
library(hgu95av2.db)
})
If we list the contents of this package, we can see that one of the many things loaded is an object named after the package “hgu95av2.db”:
ls("package:hgu95av2.db")
[1] "hgu95av2" "hgu95av2_dbconn" "hgu95av2_dbfile"
[4] "hgu95av2_dbInfo" "hgu95av2_dbschema" "hgu95av2.db"
[7] "hgu95av2ACCNUM" "hgu95av2ALIAS2PROBE" "hgu95av2CHR"
[10] "hgu95av2CHRLENGTHS" "hgu95av2CHRLOC" "hgu95av2CHRLOCEND"
[13] "hgu95av2ENSEMBL" "hgu95av2ENSEMBL2PROBE" "hgu95av2ENTREZID"
[16] "hgu95av2ENZYME" "hgu95av2ENZYME2PROBE" "hgu95av2GENENAME"
[19] "hgu95av2GO" "hgu95av2GO2ALLPROBES" "hgu95av2GO2PROBE"
[22] "hgu95av2MAP" "hgu95av2MAPCOUNTS" "hgu95av2OMIM"
[25] "hgu95av2ORGANISM" "hgu95av2ORGPKG" "hgu95av2PATH"
[28] "hgu95av2PATH2PROBE" "hgu95av2PFAM" "hgu95av2PMID"
[31] "hgu95av2PMID2PROBE" "hgu95av2PROSITE" "hgu95av2REFSEQ"
[34] "hgu95av2SYMBOL" "hgu95av2UNIPROT"
We can look at this object to learn more about it:
hgu95av2.db
ChipDb object:
| DBSCHEMAVERSION: 2.1
| Db type: ChipDb
| Supporting package: AnnotationDbi
| DBSCHEMA: HUMANCHIP_DB
| ORGANISM: Homo sapiens
| SPECIES: Human
| MANUFACTURER: Affymetrix
| CHIPNAME: Affymetrix HG_U95Av2 Array
| MANUFACTURERURL: http://www.affymetrix.com
| EGSOURCEDATE: 2021-Apr14
| EGSOURCENAME: Entrez Gene
| EGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
| CENTRALID: ENTREZID
| TAXID: 9606
| GOSOURCENAME: Gene Ontology
| GOSOURCEURL: http://current.geneontology.org/ontology/go-basic.obo
| GOSOURCEDATE: 2021-02-01
| GOEGSOURCEDATE: 2021-Apr14
| GOEGSOURCENAME: Entrez Gene
| GOEGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
| KEGGSOURCENAME: KEGG GENOME
| KEGGSOURCEURL: ftp://ftp.genome.jp/pub/kegg/genomes
| KEGGSOURCEDATE: 2011-Mar15
| GPSOURCENAME: UCSC Genome Bioinformatics (Homo sapiens)
| GPSOURCEURL:
| GPSOURCEDATE: 2021-Feb16
| ENSOURCEDATE: 2021-Feb16
| ENSOURCENAME: Ensembl
| ENSOURCEURL: ftp://ftp.ensembl.org/pub/current_fasta
| UPSOURCENAME: Uniprot
| UPSOURCEURL: http://www.UniProt.org/
| UPSOURCEDATE: Mon Apr 26 21:53:12 2021
Please see: help('select') for usage information
If we want to know what kinds of data are retrievable via select, we should use the columns method like this:
columns(hgu95av2.db)
[1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"
[6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME"
[11] "GENETYPE" "GO" "GOALL" "IPI" "MAP"
[16] "OMIM" "ONTOLOGY" "ONTOLOGYALL" "PATH" "PFAM"
[21] "PMID" "PROBEID" "PROSITE" "REFSEQ" "SYMBOL"
[26] "UCSCKG" "UNIPROT"
If we are further curious to know more about those values for columns, we can consult the help pages. Asking about any of these values will pull up a manual page describing the different fields and what they mean.
help("SYMBOL")
If we are curious about what kinds of fields we could potentially use as keys to query the database, we can use the keytypes method. In a perfect world, this method would return values very similar to those returned by columns, but in reality some kinds of values make poor keys, so this list is often shorter.
keytypes(hgu95av2.db)
[1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"
[6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME"
[11] "GENETYPE" "GO" "GOALL" "IPI" "MAP"
[16] "OMIM" "ONTOLOGY" "ONTOLOGYALL" "PATH" "PFAM"
[21] "PMID" "PROBEID" "PROSITE" "REFSEQ" "SYMBOL"
[26] "UCSCKG" "UNIPROT"
If we want to extract some sample keys of a particular type, we can use the keys method.
head(keys(hgu95av2.db, keytype="SYMBOL"))
[1] "A1BG" "A2M" "A2MP1" "NAT1" "NAT2" "NATP"
And finally, if we have some keys, we can use select to extract the corresponding annotations. By using appropriate argument values with select we can specify which keys we want to look up values for (keys), what we want returned (columns) and the type of keys that we are passing in (keytype).
#1st get some example keys
k <- head(keys(hgu95av2.db,keytype="PROBEID"))
# then call select
select(hgu95av2.db, keys=k, columns=c("SYMBOL","GENENAME"), keytype="PROBEID")
'select()' returned 1:1 mapping between keys and columns
PROBEID SYMBOL
1 1000_at MAPK3
2 1001_at TIE1
3 1002_f_at CYP2C19
4 1003_s_at CXCR5
5 1004_at CXCR5
6 1005_at DUSP1
GENENAME
1 mitogen-activated protein kinase 3
2 tyrosine kinase with immunoglobulin like and EGF like domains 1
3 cytochrome P450 family 2 subfamily C member 19
4 C-X-C motif chemokine receptor 5
5 C-X-C motif chemokine receptor 5
6 dual specificity phosphatase 1
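When you only need a single column back, the related mapIds method (from the same AnnotationDbi interface) is a convenient shortcut; a sketch reusing the keys from above:

```r
# mapIds() returns a named vector rather than a data frame;
# 'k' is the vector of probe ids selected earlier
mapIds(hgu95av2.db, keys = k, column = "SYMBOL", keytype = "PROBEID")
```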
An organism level package (an ‘org’ package) uses a central gene identifier (e.g. Entrez Gene id) and contains mappings between this identifier and other kinds of identifiers (e.g. GenBank or Uniprot accession numbers, RefSeq ids, etc.). The name of an org package is always of the form org.<Ab>.<id>.db (e.g. org.Hs.eg.db).
Challenge
Display the OrgDb object for the org.Hs.eg.db package. Use the columns method to discover which sorts of annotations can be extracted from it. Finally, use the keys method to extract UNIPROT identifiers and then pass those keys to the select method in such a way that you extract the gene symbol and KEGG pathway information for each. Use the help system as needed to learn which values to pass to columns in order to achieve this.
Solution
library(org.Hs.eg.db)
uniKeys <- head(keys(org.Hs.eg.db, keytype="UNIPROT"))
cols <- c("SYMBOL", "PATH")
select(org.Hs.eg.db, keys=uniKeys, columns=cols, keytype="UNIPROT")
A TxDb package connects a set of genomic coordinates to various transcript-oriented features. The package can also contain identifiers for features such as genes and transcripts, and the internal schema describes the relationships between these different elements. All TxDb-containing packages follow a specific naming scheme that tells where the data came from as well as which build of the genome it uses.
Challenge
Display the TxDb object for the TxDb.Hsapiens.UCSC.hg19.knownGene package. As before, use the columns and keytypes methods to discover which sorts of annotations can be extracted from it. Use the keys method to extract just a few gene identifiers and then pass those keys to the select method in such a way that you extract the transcript ids and transcript starts for each.
Solution
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
keys <- head(keys(txdb, keytype="GENEID"))
cols <- c("TXID", "TXSTART")
select(txdb, keys=keys, columns=cols, keytype="GENEID")
Key Points
Bioconductor collates interoperable packages for analyses of biological data.
Using biocViews can help you to find packages for a specific topic.