RBBS - 4 Tidy Data
2022-03-07 Karishma Srikanth
corps rbbs r
Part 4 of the coRps R Building Blocks Series (RBBS). All content can be found on this blog under the rbbs
category as well as on the USAID-OHA-SI/coRps GitHub repo.
RBBS 4 - Tidy Data
For our R Building Blocks session today, we will focus on understanding what tidy data is, why it matters, and how to use `tidyr` in R to wrangle data on your own. This session is modeled after Chapter 12 of R for Data Science.
Learning Objectives
- Learn what tidy data is and the importance of tidy data
- Learn how to recognize tidy vs. untidy data
- Use `tidyr` to wrangle data in R
Recording
USAID staff can use this link to access today’s recording (not available to external users).
Materials
Setup
For these sessions, we’ll be using RStudio, an IDE (“integrated development environment”) that makes it easier to work with R. For help getting set up and installing packages, please reference this guide.
Load Packages
Let’s get started by loading some important packages. When we load the `tidyverse`, you’ll notice that a couple of other packages are loaded as well, including `tidyr`, the main set of tools for tidying data. We’ll focus more in depth on how to use the `tidyverse` toolkit in next week’s session.
library(tidyverse) #install.packages("tidyverse")
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.8
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 2.1.2 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(scales) #install.packages("scales")
library(glitr) #remotes::install_github("USAID-OHA-SI/glitr", build_vignettes = TRUE)
library(glamr)
library(here)
library(readxl)
Basics of Tidy Data
In this session, we’ll talk about organizing data in an efficient, reproducible, and collaborative way, known as Tidy Data. While this process takes some upfront work and deliberate thought about data structure, that work pays off both today and in the long run. By keeping data in a tidy format and utilizing the tools in the `tidyverse`, you’ll spend less time munging messy data and have more time to focus on the analytic questions.
So what is tidy data?
Tidy data is a standard way of mapping the meaning of a dataset to its structure. There are 3 central components to tidy data.
- Each variable forms a column
- Each observation forms a row
- Each cell contains a single value
Let’s take a look at this visually using sample data from the `glitr` package with `glimpse()`, `head()`, and `View()`.
#load the dataset in your Global Environment
data(hts)
#prints out a preview of your data; indicators run vertically (with type) and data runs horizontally
glimpse(hts)
## Rows: 1,695
## Columns: 7
## $ operatingunit <chr> "Saturn", "Saturn", "Saturn", "Saturn", "Saturn", "Satur~
## $ primepartner <chr> "Auriga", "Auriga", "Auriga", "Auriga", "Auriga", "Aurig~
## $ indicator <chr> "HTS_TST", "HTS_TST", "HTS_TST", "HTS_TST", "HTS_TST", "~
## $ modality <chr> "Index", "Index", "IndexMod", "IndexMod", "Inpat", "Inpa~
## $ period <chr> "FY49", "FY50", "FY49", "FY50", "FY49", "FY50", "FY49", ~
## $ period_type <chr> "targets", "targets", "targets", "targets", "targets", "~
## $ value <dbl> 3080, 6880, 890, 780, 6550, 2170, 870, 2900, 1200, 1690,~
head(hts)
## # A tibble: 6 x 7
## operatingunit primepartner indicator modality period period_type value
## <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 Saturn Auriga HTS_TST Index FY49 targets 3080
## 2 Saturn Auriga HTS_TST Index FY50 targets 6880
## 3 Saturn Auriga HTS_TST IndexMod FY49 targets 890
## 4 Saturn Auriga HTS_TST IndexMod FY50 targets 780
## 5 Saturn Auriga HTS_TST Inpat FY49 targets 6550
## 6 Saturn Auriga HTS_TST Inpat FY50 targets 2170
#your traditional tabular view
View(hts)
This is an example of a tidy dataset. Each row is a distinct observation: a reporting period (contained in the variable `period`) with information on where the data were reported (`operatingunit`) and by whom (`primepartner`). Each column is a variable, showing the testing indicator and its disaggregation (here, `modality`), with the reported figure stored in the `value` column. For more information on the dataset, we can run `?hts` to see its documentation in a “help” file. The `?` operator is extremely useful for pulling up help files for any function or dataset.
Data Structure
Once we have the value, observation, and variable defined, we can start to talk about how these elements are physically structured in the dataset. Dataset structure refers to the physical layout of the data: how the rows and columns are arranged, and how the data appear when you open the sheet.
Data can be structured in many ways, but ideally it is stored as either long or wide.
Long (stacked/panel) data stacks observations vertically. For example, the measurement period in this dataset is stored in the cells of the `period` column (FY49 and FY50) rather than as column names. `value` holds the number of individuals tested.
#prints out a preview of the first six rows in tabular form
head(hts)
## # A tibble: 6 x 7
## operatingunit primepartner indicator modality period period_type value
## <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 Saturn Auriga HTS_TST Index FY49 targets 3080
## 2 Saturn Auriga HTS_TST Index FY50 targets 6880
## 3 Saturn Auriga HTS_TST IndexMod FY49 targets 890
## 4 Saturn Auriga HTS_TST IndexMod FY50 targets 780
## 5 Saturn Auriga HTS_TST Inpat FY49 targets 6550
## 6 Saturn Auriga HTS_TST Inpat FY50 targets 2170
Wide data spreads out to the right with many columns. Here, each observation is defined by an OU + primepartner + indicator + modality + period_type, and the unit of measurement is the number of individuals tested in each period. We can use `tidyr::pivot_wider()` to pivot this HTS data wide.
#Pivot data wide
hts_wide <- hts %>%
  pivot_wider(names_from = period, values_from = value)
head(hts_wide)
## # A tibble: 6 x 12
## operatingunit primepartner indicator modality period_type FY49 FY50 FY49Q1
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 Saturn Auriga HTS_TST Index targets 3080 6880 NA
## 2 Saturn Auriga HTS_TST IndexMod targets 890 780 NA
## 3 Saturn Auriga HTS_TST Inpat targets 6550 2170 NA
## 4 Saturn Auriga HTS_TST Malnutrit~ targets 870 NA NA
## 5 Saturn Auriga HTS_TST MobileMod targets 2900 1200 NA
## 6 Saturn Auriga HTS_TST OtherMod targets 1690 780 NA
## # ... with 4 more variables: FY50Q1 <dbl>, FY49Q2 <dbl>, FY49Q3 <dbl>,
## # FY49Q4 <dbl>
What are the advantages of working with long data? First, most visualization and statistical software, as well as Excel, is designed to work with long data. Long data can easily be filtered, pivoted, or analyzed; this row orientation facilitates analysis.
As a rule of thumb, if a column name contains data on which you wish to sort, then it should likely be stored in a cell instead of a column name.
Sometimes, you’ll see a mixture of these two styles, which is not ideal.
Pivoting Data
Pivoting is the process of physically altering the layout of a dataset: stacking columns into new rows (or vice versa) and creating new columns to collect the values that have been transposed. When pivoting long, the column headers are stacked into a single column and the values are gathered into another, making the data tidy and easier to work with.
NOTICE: once the data are pivoted to a long format, you can easily filter by primepartner + period. In the WIDE data, “filtering” on a quarter means turning columns on and off, which is cumbersome if you have many columns and not always transparent.
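To make the contrast concrete, here is a minimal sketch with hypothetical toy data mirroring the hts columns: on long data, the question “what was reported in FY50?” is a single `filter()` call.

```r
library(dplyr)
library(tibble)

# toy long data mirroring the hts structure (hypothetical values)
df_long <- tribble(
  ~primepartner, ~period, ~value,
  "Auriga",      "FY49",  3080,
  "Auriga",      "FY50",  6880,
  "Vela",        "FY49",   890
)

# one filter() call; in the wide layout this would mean
# selecting among the FY49/FY50 columns instead
df_long %>%
  filter(period == "FY50")
```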
To perform pivots in R, there are two primary functions:
- `pivot_longer()` - pivots your data long (stacked)
- `pivot_wider()` - pivots your data wide (columns)
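A quick round trip on hypothetical toy data shows the two functions are inverses of one another:

```r
library(dplyr)
library(tidyr)
library(tibble)

# toy long data (hypothetical values)
df_long <- tribble(
  ~modality, ~period, ~value,
  "Index",   "FY49",  3080,
  "Index",   "FY50",  6880
)

# long -> wide: the period values become column names
df_wide <- df_long %>%
  pivot_wider(names_from = period, values_from = value)

# wide -> long: stack the FY columns back into period/value pairs
df_back <- df_wide %>%
  pivot_longer(cols = starts_with("FY"),
               names_to = "period", values_to = "value")
```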
What about the MSDs?
The MSDs are semi-tidy datasets rather than fully tidy. This was a conscious decision to strike a balance between row count and a fully long dataset. To play around with it, load the most recent OUxIM dataset. You can manually download it from PEPFAR Panorama or follow the steps in the `glamr` vignette called “Data Extraction from Panorama”.
Two features are not tidy: (1) periods, which each have their own column (qtr1, qtr2, qtr3, qtr4), and (2) categoryoptioncombos, which combine all of the useful information like disaggregation, age, sex, status, etc. into a single field.
As such, if you are performing analytics on the MSD, you may find yourself having to pivot and reshape the data accordingly. Another helpful tool built by SI is `glamr::reshape_msd()`, a function in the `glamr` package that transforms the MSD into a tidy format.
Working with Messy Data in R
Now that we know what a tidy dataset looks like, let’s dive deeper into some messy data to see if we can identify some key threats to tidy data. The goal of this exercise is not to learn all the ways to tidy data in R just yet, but rather to build an intuition about what the barriers to tidiness are and how we might think through fixing them.
UNAIDS Data
Let’s start with the UNAIDS 2020 Estimates data, which we often use within our office to answer analytic questions about epidemic control and progress toward the 90s/95s goals. Our team munged and tidied this data this year; a detailed understanding of the dataset and the data-cleaning process can be found on the MindTheGap website.
We’ll read in the data in the same format as it exists on our Google Drive (feel free to download it as an .xlsx file and use `readxl::read_xlsx()`, or pull directly from Google Drive with `googlesheets4::read_sheet()`), and use the `View()` function to see how the data are imported in R.
file_name <- "HIV_estimates_from_1990-to-present.xlsx"
df <- readxl::read_xlsx(
  path = here("2022-03-07", file_name),
  sheet = 1)
view(df)
head(df, n = 15L)
## # A tibble: 15 x 51
## `HIV estimates ~` ...2 ...3 ...4 ...5 ...6 ...7 ...8 ...9 ...10 ...11
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1990-2020 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 2 Source: UNAIDS 2~ <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 3 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 4 <NA> <NA> <NA> Adul~ <NA> <NA> Youn~ <NA> <NA> Youn~ <NA>
## 5 <NA> <NA> <NA> Esti~ Low High Esti~ Low High Esti~ Low
## 6 1990.0 <NA> Glob~ 0.3 0.2 0.3 0.2 0.1 0.4 0.1 <0.1
## 7 1990.0 <NA> Asia~ <0.1 <0.1 <0.1 <0.1 <0.1 <0.1 <0.1 <0.1
## 8 1990.0 AFG Afgh~ <0.1 <0.1 <0.1 <0.1 <0.1 <0.1 <0.1 <0.1
## 9 1990.0 AUS Aust~ <0.1 <0.1 0.1 <0.1 <0.1 <0.1 <0.1 <0.1
## 10 1990.0 BGD Bang~ ... ... ... ... ... ... ... ...
## 11 1990.0 BTN Bhut~ <0.1 <0.1 <0.1 <0.1 <0.1 <0.1 <0.1 <0.1
## 12 1990.0 BRN Brun~ ... ... ... ... ... ... ... ...
## 13 1990.0 KHM Camb~ <0.1 <0.1 <0.1 <0.1 <0.1 <0.1 <0.1 <0.1
## 14 1990.0 CHN China ... ... ... ... ... ... ... ...
## 15 1990.0 PRK Demo~ ... ... ... ... ... ... ... ...
## # ... with 40 more variables: ...12 <chr>, ...13 <chr>, ...14 <chr>,
## # ...15 <chr>, ...16 <chr>, ...17 <chr>, ...18 <chr>, ...19 <chr>,
## # ...20 <chr>, ...21 <chr>, ...22 <chr>, ...23 <chr>, ...24 <chr>,
## # ...25 <chr>, ...26 <chr>, ...27 <chr>, ...28 <chr>, ...29 <chr>,
## # ...30 <chr>, ...31 <chr>, ...32 <chr>, ...33 <chr>, ...34 <chr>,
## # ...35 <chr>, ...36 <chr>, ...37 <chr>, ...38 <chr>, ...39 <chr>,
## # ...40 <chr>, ...41 <chr>, ...42 <chr>, ...43 <chr>, ...44 <chr>, ...
We need to skip quite a few rows, 5 to be exact, since our table header starts on row 6. However, we have another issue here: the indicator is encoded in row 5 (Adults (15-49) prevalence (%)) but the breakdown into estimate and confidence interval is on row 6. As such, if we skip 5 rows with `readxl::read_xlsx(path = here("2022-03-07", file_name), sheet = 1, skip = 5)`, we will lose the meaning of the indicators.
To circumvent this, our team created an indicator name crosswalk when we began tidying this data. For the purposes of this session, the detailed process of how we renamed and joined these elements is less important; for now, let’s just take a look at the dataset after addressing the indicator name issue and identify what is left to make this data tidy.
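The core idea behind such a crosswalk can be sketched in a few lines of base R (the header values below are hypothetical stand-ins; the team’s actual process was more involved): carry each indicator name forward across its blank cells, then paste the two header rows together into unambiguous column names.

```r
# row 5 holds the indicator (blank under merged cells);
# row 6 holds the statistic for each column
row_indicator <- c("prev_15-49_all", NA, NA, "prev_15-24_female", NA, NA)
row_stat      <- c("est", "low", "high", "est", "low", "high")

# fill each NA with the last non-missing indicator to its left
for (i in seq_along(row_indicator)) {
  if (is.na(row_indicator[i])) row_indicator[i] <- row_indicator[i - 1]
}

# combine the two rows into single column names
new_names <- paste(row_indicator, row_stat, sep = "_")
```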
file_name <- "unaids-semi-clean.csv"
df_semi <- read_csv(
  file = here("2022-03-07", file_name))
view(df_semi)
head(df_semi, n = 15L)
## # A tibble: 15 x 51
## year iso country `prev_15-49_al~` `prev_15-49_al~` `prev_15-49_al~`
## <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 1990 <NA> Global 0.3 0.2 0.3
## 2 1990 <NA> Asia and the ~ <0.1 <0.1 <0.1
## 3 1990 AFG Afghanistan <0.1 <0.1 <0.1
## 4 1990 AUS Australia <0.1 <0.1 0.1
## 5 1990 BGD Bangladesh ... ... ...
## 6 1990 BTN Bhutan <0.1 <0.1 <0.1
## 7 1990 BRN Brunei Daruss~ ... ... ...
## 8 1990 KHM Cambodia <0.1 <0.1 <0.1
## 9 1990 CHN China ... ... ...
## 10 1990 PRK Democratic Pe~ ... ... ...
## 11 1990 FJI Fiji <0.1 <0.1 <0.1
## 12 1990 IND India ... ... ...
## 13 1990 IDN Indonesia <0.1 <0.1 <0.1
## 14 1990 JPN Japan <0.1 <0.1 <0.1
## 15 1990 LAO Lao People De~ <0.1 <0.1 <0.1
## # ... with 45 more variables: `prev_15-24_female_est` <chr>,
## # `prev_15-24_female_low` <chr>, `prev_15-24_female_high` <chr>,
## # `prev_15-24_male_est` <chr>, `prev_15-24_male_low` <chr>,
## # `prev_15-24_male_high` <chr>, deaths_all_all_est <chr>,
## # deaths_all_all_low <chr>, deaths_all_all_high <chr>,
## # `deaths_0-14_all_est` <chr>, `deaths_0-14_all_low` <chr>,
## # `deaths_0-14_all_high` <chr>, `deaths_15+_all_est` <chr>, ...
Exercises
- Using `View()`, what are some of the elements of this dataset that make it un-tidy?
- How would you begin to think through addressing some of these elements to make the data tidy?
Another useful function in the `tidyr` package is `separate()`. As you can see, our UNAIDS data has the age, sex, and statistic type encoded into the indicator name. We can first pivot the data long to get one stacked column for the indicator using `pivot_longer()`, and then use `separate()` to split that column on the underscores into multiple columns (indicator, age, sex, and stat).
df_semi %>%
tidyr::pivot_longer(-c(year, iso, country),
names_to = c("indicator")) %>%
tidyr::separate(indicator, sep = "_", into = c("indicator", "age", "sex", "stat")) %>%
head()
## # A tibble: 6 x 8
## year iso country indicator age sex stat value
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1990 <NA> Global prev 15-49 all est 0.3
## 2 1990 <NA> Global prev 15-49 all low 0.2
## 3 1990 <NA> Global prev 15-49 all high 0.3
## 4 1990 <NA> Global prev 15-24 female est 0.2
## 5 1990 <NA> Global prev 15-24 female low 0.1
## 6 1990 <NA> Global prev 15-24 female high 0.4
While this is not perfectly tidy, we are getting closer. There are still a couple other elements to address to tidy this up in its entirety, including:
- Encoding the regions as their own variable rather than a row
- Addressing special characters (< or >)
- Addressing missing values
- Breaking down nested columns for stat type
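As a sketch of two of those bullets (the toy values below mirror what appears in the UNAIDS output; note that treating "<0.1" as 0.1 discards the bound, so keep a flag column if that distinction matters):

```r
library(dplyr)
library(readr)
library(tibble)

# toy column with the special values seen in the UNAIDS data
df_toy <- tibble(value = c("0.3", "<0.1", "...", "1.2"))

df_clean <- df_toy %>%
  mutate(
    value = na_if(value, "..."),  # "..." marks a missing value
    value = parse_number(value)   # strips "<"/">" and converts to numeric
  )
```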
For a more in-depth dive into how to tidy the UNAIDS data, check out our coRps session about `munge_unaids`, the cleaning and tidying function that we built to tidy this data.
Additional Resources
For more practice with Tidy Datasets, check out our CoRps session on Working with Messy Datasets.
For a good guide on how to use the `tidyr` functions, see this cheatsheet.
For some additional reading about tidy data, check out Hadley Wickham’s journal article about tidy data and “Data Organization in Spreadsheets” by Karl W. Broman & Kara H. Woo (2018).
Next Up
This session, we covered the foundational elements of tidy data from Chapter 12 of R for Data Science. While this session offers a glimpse of how to tidy data in R, we’ll dive deeper next week into the ins and outs of base R and the `tidyverse`.