Code
year | Africa | Americas | Asia | Europe | Oceania |
---|---|---|---|---|---|
2002 | 53.33 | 72.42 | 69.23 | 76.70 | 79.74 |
2007 | 54.81 | 73.61 | 70.73 | 77.65 | 80.72 |
Converting “wide” to “long” format
Vishal Katti
July 8, 2022
pivot_longer()
to convert your wide data to long data. This is part 1 of the Pivoting your tables with Tidyr series.
One of the primary data manipulation operations is pivoting your tabular data from “wide” format to “long” format and vice-versa.
The idea is to make your tabular data “tidy” i.e.
In other words, every column contains just one type of information, every row in the table is a snapshot or a version of the information your table captures and every cell contains just one piece of information.1
While the wide format is more human-readable, the long format is preferred and is desirable for data and plotting operations using R, Python or other data processing programming languages. The {tidyr} R package has functions that allow you to transform your tabular data between the two formats.
In this post, we will see how to convert a wide dataframe to long format using the pivot_longer() function from {tidyr} package.
Consider the following data table. It has been created from the famous Gapminder dataset. This table shows the average life expectancy in each continent for 2 years. While some of you may say that Gapminder data contains records for a lot more number of years, here we consider just the latest 2 years for ease of explanation and visual purposes.
my_data
is in the wide format as we have continent names in column headers and average life expectancy values in each of those columns. To convert this tibble to the long format, we need to pull together the continent names in one column and their corresponding values into another column. See Figure 2
The long format of this table would ideally have only year
, continent
and average_life_expectancy
columns and look something like the table below.
The long format has repeated values of the column that are not gathered/collected. In this case, the year
column gets its values repeated for each row.
Let’s recreate the above transformation in R. First, we create the my_data
table.
year | Africa | Americas | Asia | Europe | Oceania |
---|---|---|---|---|---|
2002 | 53.33 | 72.42 | 69.23 | 76.70 | 79.74 |
2007 | 54.81 | 73.61 | 70.73 | 77.65 | 80.72 |
To convert this table into long format, we use the pivot_longer() function from {tidyr} R package. Let us see how to use this function.
Use `formals` to view all the formal arguments of a function and their default values. `formals` returns a named list.
$data
$cols
$names_to
[1] "name"
$names_prefix
NULL
$names_sep
NULL
$names_pattern
NULL
$names_ptypes
NULL
$names_transform
NULL
$names_repair
[1] "check_unique"
$values_to
[1] "value"
$values_drop_na
[1] FALSE
$values_ptypes
NULL
$values_transform
NULL
$...
The result of formals(pivot_longer)
tells us that the minimum information needed to use this function is to provide values to the data
and cols
arguments as all other arguments have default values and hence, are optional.
Using only the minimum arguments with pivot_longer()
, we get a long formatted tibble with the columns year
, name
and value
.
year | name | value |
---|---|---|
2002 | Africa | 53.33 |
2002 | Americas | 72.42 |
2002 | Asia | 69.23 |
2002 | Europe | 76.70 |
2002 | Oceania | 79.74 |
2007 | Africa | 54.81 |
2007 | Americas | 73.61 |
2007 | Asia | 70.73 |
2007 | Europe | 77.65 |
2007 | Oceania | 80.72 |
Notice that the continent names and their corresponding average life expectancy values appear in columns named name
and value
. These are the default column names. We can change these column names by providing our own names to the arguments names_to
and values_to
.
Since the year
column is the only one that remains as is, we can rewrite the above pivot_longer
statement as below
year | continent | average_life_expectancy |
---|---|---|
2002 | Africa | 53.33 |
2002 | Americas | 72.42 |
2002 | Asia | 69.23 |
2002 | Europe | 76.70 |
2002 | Oceania | 79.74 |
2007 | Africa | 54.81 |
2007 | Americas | 73.61 |
2007 | Asia | 70.73 |
2007 | Europe | 77.65 |
2007 | Oceania | 80.72 |
If you are a visual person like me and wish to see this transformation with explanations, check out this GIF I made using good ol’ Powerpoint.
pivot_longer()
is the successor for the great gather()
function and has many advantages over the latter. pivot_longer()
repeats all the values in the columns that are not included in the cols
argument. Therefore, if your dataframe/tibble had a primary key prior to the transformation, the primary key of your transformed “longer” dataframe is your old primary key + the new column created by names_to
. This function has many other arguments that allow some truly great transformations. Mastering this function (and its wide counterpart) is a great skill upgrade while massaging your data to make it “tidy”.
Happy Gathering!
Long vs. Wide Data: What’s the Difference? https://www.statology.org/long-vs-wide-data/↩︎
@online{katti2022,
author = {Katti, Vishal},
title = {Pivoting Your Tables with {Tidyr:} {Part} {I}},
date = {2022-07-08},
url = {https://vishalkatti.com/posts/tidyr-pivot-longer/},
langid = {en},
abstract = {This post demonstrates how to use `pivot\_longer()` to
convert your wide data to long data. This is part 1 of the Pivoting
your tables with Tidyr series.}
}