RBBS - 8 Strings & Dates
2022-04-25 Karishma Srikanth
corps rbbs r
Part 8 of the coRps R Building Blocks Series (RBBS). All content can be found on this blog under the rbbs
category as well as on the USAID-OHA-SI/coRps GitHub repo.
Over the last few sessions, we’ve explored how to bring data into R for manipulation and covered the principles of tidy data and transformations using base R and the Tidyverse. Another key component of transformations that we will explore is how to manipulate strings and dates using tools in the tidyverse and regular expressions (regexps).
For our R Building Blocks session today, we will focus on understanding what regular expressions are, how to use them and the `stringr` package in R to manipulate strings, and finally how to wrangle dates in R using the `lubridate` package.
This session is modeled after Chapter 14 and Chapter 16 of R for Data Science.
Learning Objectives
- Understand what regular expressions are and how to use them to manipulate strings
- Gain the ability to use the `stringr` package to manipulate strings
- Learn how to work with Date formatting in R
Recording
USAID staff can use this link to access today’s recording (not available to external users).
Setup
For these sessions, we’ll be using RStudio, an integrated development environment (IDE) that makes it easier to work with R. For help on getting set up and installing packages, please reference this guide.
Materials
Load Packages
Let’s get started by loading some important packages. When we load the `tidyverse`, you’ll notice that several other packages are loaded as well, including `stringr`, the main set of tools for manipulating strings in the Tidyverse. Note that the `lubridate` package is not part of the core tidyverse packages in the release used here (1.3.1), so if you are working with dates/times, you’ll want to load this package as well.
library(tidyverse) #install.packages("tidyverse")
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.8
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 2.1.2 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(lubridate)
String Theory 101
Let’s start with a brief review of what strings are and how we can create them in R. Most of the time, we’ll be working with strings that are pre-defined in our dataset, but the intuition of how to create strings can help us better understand how to manipulate them.
Any value written within a pair of single or double quotes in R is treated as a string. R does not distinguish between single and double quotes when creating a string, but it always prints strings with double quotes. As such, it is generally best practice to use double quotes when creating strings.
a <- "Starting and ending string with double quote"
print(a)
## [1] "Starting and ending string with double quote"
b <- 'Starting and ending with a single quote'
print(b)
## [1] "Starting and ending with a single quote"
Another rule to keep in mind is that the quote at the beginning of the string has to match the quote at the end of the string. If they are mixed, it will throw an error.
e <- "Mixed quotes'
print(e)
Error: unexpected symbol in: "print(e)
If you forget to close a quote, you’ll see `+`, the continuation character:
> "This is a string without a closing quote
+
+
+ HELP I'M STUCK
Lastly, you can store multiple strings in a character vector using `c()`:
c("storing", "multiple", "strings")
## [1] "storing" "multiple" "strings"
stringr Functions
While base R has numerous functions that are useful for working with strings, they can be inconsistent and difficult to remember. For this session, we’ll focus on functions in the `stringr` package in the Tidyverse. All of these functions begin with the `str_` prefix, making them more intuitive to use and remember.
String Length
To count the number of characters in a string (including spaces), use `str_length()`:
str_length(c("OHA", "r building block", NA))
## [1] 3 16 NA
Concatenating Strings
To combine two or more strings, use str_c()
str_c("a", "b", "c")
## [1] "abc"
We can also use the `sep` argument to control how the strings are separated when they are combined.
str_c("x", "y", sep = ",")
## [1] "x,y"
`str_c()` is also vectorized: it recycles shorter vectors to the length of the longest one and returns one combined string per element. Let’s take a look at an example that we might see with PEPFAR data.
Let’s say that we are working with some country-level data and we want each output file to follow a consistent naming convention.
We’d first want to store the countries that we are working with as a character vector. We can then use `str_c()` to generate a unique file name for each country.
country <- c("Mozambique", "Nigeria", "Malawi")
str_c("output/", country, "-data-report.pdf")
## [1] "output/Mozambique-data-report.pdf" "output/Nigeria-data-report.pdf"
## [3] "output/Malawi-data-report.pdf"
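A related option that the example above does not use is the `collapse` argument of `str_c()`, which joins the elements of a vector into a single string. The sketch below reuses the country vector; the `"OU: "` prefix is just an illustrative label.

```r
library(stringr)

country <- c("Mozambique", "Nigeria", "Malawi")

# collapse joins the elements of the result into one string
country_list <- str_c(country, collapse = ", ")
country_list
#> [1] "Mozambique, Nigeria, Malawi"

# A length-1 prefix is recycled across all three countries
labels <- str_c("OU: ", country)
labels
#> [1] "OU: Mozambique" "OU: Nigeria"    "OU: Malawi"
```

In short, `sep` goes between the arguments of a single call, while `collapse` goes between the elements of the resulting vector.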
Another key rule to note is that elements of length 0 are dropped in the concatenation. Let’s take this a step further - let’s say we were working with semi-annual data that is only released in Q2 and Q4, like OVC data.
If we include an `if()` statement in this `str_c()` function for OVC (where OVC is TRUE in only Q2 and Q4), what will the output look like? What would it look like if we changed the period to Q1? Try changing the code for the period and see what happens!
ou <- "Tanzania"
period <- "Q2"
ovc <- ifelse(period == "Q2" | period == "Q4", TRUE, FALSE)
str_c(
  "output/", ou, "-", period, "-data-report",
  if (ovc) "-with-ovc",
  ".csv"
)
## [1] "output/Tanzania-Q2-data-report-with-ovc.csv"
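If you would like to check your answer for Q1, one way is to wrap the same logic in a small function. `report_name()` is a hypothetical helper written for this illustration, not part of any package.

```r
library(stringr)

# Hypothetical helper wrapping the file-name logic shown above
report_name <- function(ou, period) {
  ovc <- period %in% c("Q2", "Q4")
  str_c(
    "output/", ou, "-", period, "-data-report",
    if (ovc) "-with-ovc",  # evaluates to NULL when ovc is FALSE
    ".csv"
  )
}

report_name("Tanzania", "Q2")
#> [1] "output/Tanzania-Q2-data-report-with-ovc.csv"
report_name("Tanzania", "Q1")
#> [1] "output/Tanzania-Q1-data-report.csv"
```

When `ovc` is FALSE, the `if` expression returns `NULL`, which `str_c()` silently drops, so the suffix simply disappears.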
Subsetting Strings
To extract parts of a string, use `str_sub()`. The `start` and `end` arguments give the positions of the first and last characters to keep.
Here’s an example of how we’d use this in our work, by slicing just the fiscal year from the period variable.
period <- c("FY19Q4", "FY20Q4", "FY21Q4")
str_sub(period, start = 1, end = 4)
## [1] "FY19" "FY20" "FY21"
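`str_sub()` also accepts negative positions, which count backwards from the end of the string. As a small sketch on the same period vector, this pulls out the quarter and the two-digit year:

```r
library(stringr)

period <- c("FY19Q4", "FY20Q4", "FY21Q4")

# Negative positions count from the end: -2 is the second-to-last character
quarter <- str_sub(period, start = -2, end = -1)
quarter
#> [1] "Q4" "Q4" "Q4"

# Characters 3 to 4 hold the two-digit fiscal year
fy_digits <- str_sub(period, start = 3, end = 4)
fy_digits
#> [1] "19" "20" "21"
```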
Changing the case
To manipulate the case of the strings, we can use:
- `str_to_lower()` - changes all characters to lowercase
- `str_to_upper()` - changes all characters to uppercase
- `str_to_title()` - changes the first letter of every word to uppercase
- `str_to_sentence()` - changes the first letter of the first word to uppercase
str_to_upper("change lower to upper")
#> [1] "CHANGE LOWER TO UPPER"
str_to_lower("CHANGE UPPER tO LOwER")
#> [1] "change upper to lower"
str_to_title("mAke strinGs tITLe-like")
#> [1] "Make Strings Title-Like"
str_to_sentence("tuRN string INTO sENTence")
#> [1] "Turn string into sentence"
Regular Expressions
You can think of regular expressions as the language that describes the patterns in strings. They are constructed in a similar manner to arithmetic expressions, by using special characters and various operators to combine smaller expressions. While these can be daunting to learn at first, they can ultimately be very useful to your ability to wrangle and manipulate strings.
In R, regular expressions are written as strings. Some characters can be represented directly in an R string and simply match themselves, while others are special characters that carry a specific meaning in a pattern.
Let’s start with some basic matching, using `str_view()`. In its simplest form, we can use regular expressions to match exact strings. You’ll notice here that the “io” pattern in each string is highlighted using `str_view()`.
x <- c("WH Region", "Ethiopia", "Asia Region")
str_view(x, "io")
Period (`.`) - matches any character
Time to introduce our first special character! In regular expressions, the period matches any character. In the example below, we are asking R to match the characters one to the right and one to the left of the “o”.
x <- c("WH Region", "Ethiopia", "Asia Region.")
str_view(x, ".o.")
Escape (`\`) - backslash used to escape special behavior
You might be wondering: if the period matches any character, how do I match the character “.” specifically? In these cases, we can use an “escape” to tell the regular expression to match the character literally, rather than apply its special behavior.
As such, we’ll need the regexp `\.`. Let’s see what happens when we try to match the explicit pattern “a.”
x <- c("WH.Region", "Ethiopia", "Asia.Region.")
str_view(x, "a\.")
## Error: '\.' is an unrecognized escape in character string starting ""a\."
We see here that R is unable to recognize the escape character. Since we use strings to represent regular expressions, and the backslash is also used as an escape in strings, we need to write the string `"\\."` to produce the regexp `\.`.
x <- c("WH.Region", "Ethiopia", "Asia.Region")
str_view(x, "a\\.")
And it worked! From now on, we’ll write the regular expression as `\.` and the string that represents it as `"\\."`.
Bonus question - We can use similar logic to match a literal `\`. Using what you know about regular expressions so far, how many escapes do you need to match one `\`?
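If you want to check your answer, the sketch below works through it on a made-up Windows-style path (the path itself is only an illustration). The regexp that matches one backslash is `\\`, and each of those backslashes must itself be escaped inside an R string.

```r
library(stringr)

# An R string containing two literal backslashes
path <- "C:\\Users\\data"
writeLines(path)
#> C:\Users\data

# Regexp \\ matches one backslash; written as an R string it is "\\\\"
n_regex <- str_count(path, "\\\\")
n_regex
#> [1] 2

# fixed() skips regexp interpretation, so only the string escape is needed
n_fixed <- str_count(path, fixed("\\"))
n_fixed
#> [1] 2
```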
Anchors - anchoring at the start or end of the string
- Use `^` to match the start of the string
- Use `$` to match the end of the string
indicator <- c("HTS_TST", "TX_CURR", "HTS_TST_POS")
str_view(indicator, "^HTS")
ou <- c("WH.Region", "Ethiopia", "Asia.Region")
str_view(ou, "Region$")
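Because `str_view()` renders its highlights in the viewer, it can help to see the same anchors return plain text. A minimal sketch using `str_subset()`, which keeps only the elements that match:

```r
library(stringr)

indicator <- c("HTS_TST", "TX_CURR", "HTS_TST_POS")

# Starts with "HTS"
starts_hts <- str_subset(indicator, "^HTS")
starts_hts
#> [1] "HTS_TST"     "HTS_TST_POS"

# Ends with "POS"
ends_pos <- str_subset(indicator, "POS$")
ends_pos
#> [1] "HTS_TST_POS"

# Both anchors together force the whole string to match
exact <- str_subset(indicator, "^HTS_TST$")
exact
#> [1] "HTS_TST"
```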
Additional Characters
There are many more special characters and constructs, some of which match more than one character. Here’s an abbreviated list of some helpful ones to know:
Character Classes
- Use `\d` to match any digit
- Use `\s` to match any whitespace
- Use `[xyz]` to match one of x, y, or z
- Use `[^xyz]` to match anything but x, y, or z
Pattern Repetition
- Use `?` to match 0 or 1 times
- Use `+` to match 1 or more times
- Use `*` to match 0 or more times
Specify number of matches
- Use `{n}` for exactly n matches
- Use `{n,}` for n or more matches
- Use `{0,m}` for at most m matches
- Use `{n,m}` for between n and m matches
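To see a few of these pieces working together, here is a short sketch on some made-up period labels (the vector is illustrative, not real data). Remember that each backslash in the regexp is doubled inside the R string.

```r
library(stringr)

period <- c("FY19Q4", "FY2020Q1", "no period")

# \d+ : one or more digits (first run found in each string)
first_digits <- str_extract(period, "\\d+")
first_digits   # "19" "2020" NA

# \d{2} : exactly two digits
two_digits <- str_extract(period, "\\d{2}")
two_digits     # "19" "20" NA

# Anchors, a quantifier range, and a character class combined
is_period <- str_detect(period, "^FY\\d{2,4}Q[1-4]$")
is_period      # TRUE TRUE FALSE

# \s : any whitespace
has_space <- str_detect(period, "\\s")
has_space      # FALSE FALSE TRUE
```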
At this point, your head may be spinning with regular expressions. Let’s do a bonus example to practice our regular expression skills once more.
Bonus Example
"C:/Users/ksrikanth/Documents/Data/MER_Structured_Datasets_PSNU_IM_FY19-22_20211217_v2_1_Zambia.zip"
Let’s say I wanted to match one specific pattern in the above filepath exactly - how would I go about constructing a regular expression to match it? (see slides for a breakdown of the answer)
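The slides mark the exact target pattern. Purely as an illustration, and assuming the pieces of interest are the 8-digit release date and the fiscal year range, a sketch with `str_extract()` might look like this:

```r
library(stringr)

path <- "C:/Users/ksrikanth/Documents/Data/MER_Structured_Datasets_PSNU_IM_FY19-22_20211217_v2_1_Zambia.zip"

# Assumption: the target is the 8-digit date stamp
release_date <- str_extract(path, "\\d{8}")
release_date
#> [1] "20211217"

# Assumption: the target is the fiscal year range
fy_range <- str_extract(path, "FY\\d{2}-\\d{2}")
fy_range
#> [1] "FY19-22"
```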
Useful Stringr Tools
Now that we have a solid foundation for understanding regular expressions, let’s see if we can apply this logic to some real-life situations that we would run into with our own data using some helpful stringr tools.
Detecting Matches
We can use `str_detect()` to determine if a character vector matches a certain pattern or not. `str_detect()` will return a logical vector of the same length as the original input.
Let’s take a look at the example we used earlier - if we wanted to identify the strings that match with just the HTS (testing) indicators, we could use `str_detect()` and the anchoring regular expression `^` to match every string that starts with the pattern “HTS”.
indicator <- c("HTS_TST", "TX_CURR", "HTS_TST_POS")
str_detect(indicator, "^HTS")
## [1] TRUE FALSE TRUE
The nice thing about `str_detect()` is that we can easily combine it with our `dplyr` functions to filter and tidy our dataset. For instance, we could add this function to a `filter()` command to filter a dataframe to just HTS indicators.
df %>%
  filter(str_detect(indicator, "^HTS"))
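To make the snippet above runnable, here is a minimal sketch with a small made-up data frame. The `value` column and its numbers are invented for illustration only.

```r
library(dplyr)
library(stringr)

# Hypothetical data: indicator names with made-up values
df <- tibble(
  indicator = c("HTS_TST", "TX_CURR", "HTS_TST_POS"),
  value     = c(120, 85, 14)
)

# Keep only the rows whose indicator starts with "HTS"
hts_only <- df %>%
  filter(str_detect(indicator, "^HTS"))

hts_only$indicator
#> [1] "HTS_TST"     "HTS_TST_POS"
```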
Replacing Matches
To replace matches with new strings, we can use `str_replace()`. The syntax for `str_replace()` is as follows:
str_replace(vector, "pattern", "replacement")
Based on this logic, what do you think the following output will look like?
fy <- c("FY19", "FY20", "FY21")
str_replace(fy, "FY", "20")
## [1] "2019" "2020" "2021"
Using `str_replace()` allows us to quickly convert our abbreviated fiscal year variable into a four-digit year.
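One detail worth knowing is that `str_replace()` only replaces the first match in each string, while `str_replace_all()` replaces every match. A short sketch using an indicator name:

```r
library(stringr)

x <- "HTS_TST_POS"

# Only the first underscore is replaced
first_only <- str_replace(x, "_", " ")
first_only
#> [1] "HTS TST_POS"

# Every underscore is replaced
all_matches <- str_replace_all(x, "_", " ")
all_matches
#> [1] "HTS TST POS"
```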
Splitting Matches
We can also use `str_split()` to split a string into different pieces - this is especially useful for tidying your data after a pivot and is similar to the `tidyr::separate()` function.
Let’s take an example of data being encoded into indicator names, like below (with age, sex, and the statistic type written out in the string name). We can use `str_split()` to split on the “_” pattern to isolate each string into distinct pieces.
x <- c("prev_15-49_female_estimate", "deaths_all_all_high", "plhiv_15+_male_low")
x %>% str_split("_")
## [[1]]
## [1] "prev" "15-49" "female" "estimate"
##
## [[2]]
## [1] "deaths" "all" "all" "high"
##
## [[3]]
## [1] "plhiv" "15+" "male" "low"
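For comparison, here is a sketch of the `tidyr::separate()` approach on the same strings. The column names passed to `into` are assumptions chosen for this illustration. Newer tidyr releases also offer `separate_wider_delim()`, but `separate()` still works.

```r
library(tibble)
library(tidyr)

# Hypothetical one-column data frame holding the encoded names
df <- tibble(
  name = c("prev_15-49_female_estimate", "deaths_all_all_high", "plhiv_15+_male_low")
)

# Split the name column into four new columns on the underscore
df_sep <- separate(df, name,
                   into = c("indicator", "age", "sex", "stat"),
                   sep = "_")

df_sep$indicator   # "prev" "deaths" "plhiv"
df_sep$age         # "15-49" "all" "15+"
```

Unlike `str_split()`, which returns a list, `separate()` returns a data frame with one column per piece, which is usually easier to keep working with in a pipeline.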
Phew, we made it! Regular expressions can be incredibly daunting but deeply useful for analytics. While this may not be a comprehensive overview of everything that regular expressions and `stringr` have to offer, hopefully, you feel a little more equipped to tackle strings head on.
Nevertheless, as you start working through regexp, function documentation (e.g. `?str_detect`), the OHA/SI Github site, and Google are your best friends!
Dates and Times
Now that we are experts in working with strings, let’s apply some of that logic to working with dates and times in R. In day-to-day data use, we may come across multiple different formats for date/time data that can complicate our analyses. While we can use base R to do some of these manipulations, the `lubridate` package is a helpful and user-friendly set of functions that make date/time manipulation much easier. As a reminder, `lubridate` is not a part of the core tidyverse, so you will need to load it separately whenever you need it.
library(lubridate)
Creating dates/times
There are 3 main types of date and time data:
- Date: printed in tibbles as `<date>`
- Time: printed in tibbles as `<time>`
- Date-time: date plus a time, printed in tibbles as `<dttm>`
Let’s illustrate this using two functions from the `lubridate` package: `today()`, which tells you the current day, and `now()`, which tells you the current day and time. These functions can be tremendously useful when you are trying to perform date/time calculations as a part of your analysis.
today()
#> [1] "2022-04-26"
now()
#> [1] "2022-04-26 19:17:05 EDT"
There are 3 main ways to create a date/time variable:
From a string
Let’s focus first on how to parse a string into a date format, since that is often the format that we get date/time data in. We can use some helpful functions from `lubridate` to automatically populate the date format once we specify the order of the string.
For example, let’s say we had the string “April 25, 2022”. This string is in “Month Day, Year” format - as such, we can use `mdy()` to parse this string into a proper date-format.
Similarly, we can use `ymd()` for strings with “Year Month Day” format or `dmy()` for strings with “Day Month Year”.
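As a quick sketch of the three parsers side by side, the calls below all describe the same day. The exact input strings other than “April 25, 2022” are made up for illustration, and parsing month names assumes an English locale.

```r
library(lubridate)

# Month Day, Year
d1 <- mdy("April 25, 2022")
# Year Month Day
d2 <- ymd("2022-04-25")
# Day Month Year
d3 <- dmy("25/04/2022")

d1
#> [1] "2022-04-25"

# All three parse to the same Date value
d1 == d2 & d2 == d3
#> [1] TRUE
```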
These functions also work with unquoted numbers if you have numeric data instead:
ymd(20220426)
#> [1] "2022-04-26"
To add the date-time element, we can add the suffix “_hms” for “hours, minutes, seconds” to these parsing functions (e.g. `ymd_hms()`). To specify the time zone, use the `tz` argument. To return a list of valid time zones, use `OlsonNames()`.
mdy_hms("04/26/22 18:36:59", tz = "EST")
#> [1] "2022-04-26 18:36:59 EST"
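Abbreviations such as "EST" describe a fixed offset, so a region-based name from `OlsonNames()` is often a safer choice because it follows daylight saving time. A minimal sketch, using "America/New_York" as an assumed example zone:

```r
library(lubridate)

# Region-based zone name taken from OlsonNames()
x <- mdy_hms("04/26/22 18:36:59", tz = "America/New_York")
x
#> [1] "2022-04-26 18:36:59 EDT"

# with_tz() shows the same instant on another zone's clock
x_utc <- with_tz(x, "UTC")
x_utc
#> [1] "2022-04-26 22:36:59 UTC"

# Confirm the zone name is valid
"America/New_York" %in% OlsonNames()
#> [1] TRUE
```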
From individual columns in the dataset
Sometimes, we’ll see individual components of a date spread across multiple columns - we can combine these into a date format using `make_date()` or `make_datetime()`.
Let’s create an example data frame of 3 dates spread out across multiple columns.
date_df <- data.frame(year = c(2003, 2022, 1998),
                      month = c(1, 4, 10),
                      day = c(19, 25, 26),
                      hour = c(10, 6, 16),
                      minute = c(50, 35, 22))
date_df
## year month day hour minute
## 1 2003 1 19 10 50
## 2 2022 4 25 6 35
## 3 1998 10 26 16 22
Since we have columns for year, month, day, hour, and minute, we’ll use `make_datetime()` to include the time component.
date_df2 <- date_df %>%
mutate(date = make_datetime(year, month, day, hour, minute))
date_df2
## year month day hour minute date
## 1 2003 1 19 10 50 2003-01-19 10:50:00
## 2 2022 4 25 6 35 2022-04-25 06:35:00
## 3 1998 10 26 16 22 1998-10-26 16:22:00
And now we have a date-time variable!
Other types
To switch between a date-time format and a date format, you can use `as_datetime()` and `as_date()`:
as_datetime(today())
#> [1] "2022-04-26 UTC"
as_date(now())
#> [1] "2022-04-26"
Sometimes, R will read date/time variables as a numeric offset from January 1, 1970 (the Unix epoch). If the offset is in seconds, as with POSIX time, use `as_datetime()`; if it is in days, use `as_date()`.
As an example, let’s use `unclass()` to return the raw value underlying the date from `today()`. From this, we can see that the number of days from January 1, 1970 to April 26, 2022 is 19108. Then, let’s call the `as_date()` function on that number and see what happens.
unclass(today())
#> [1] 19108
as_date(19108)
#> [1] "2022-04-26"
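Note that `as_date()` reads a number as days since January 1, 1970, while `as_datetime()` reads it as seconds. A small sketch converting the same day both ways:

```r
library(lubridate)

# Days since 1970-01-01
from_days <- as_date(19108)
from_days
#> [1] "2022-04-26"

# Seconds since 1970-01-01 (19108 days * 86400 seconds per day)
from_secs <- as_datetime(19108 * 86400)
from_secs
#> [1] "2022-04-26 UTC"

# One hour later, to show the time component
as_datetime(19108 * 86400 + 3600)
#> [1] "2022-04-26 01:00:00 UTC"
```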
Identifying components of dates
Using `lubridate`, you can also extract parts of the date with the following functions:
- `year()`
- `month()`
- `mday()` - day of the month
- `yday()` - day of the year
- `wday()` - day of the week
- `hour()`
- `minute()`
- `second()`
datetime <- ymd_hms("2019-01-12 23:34:56")
year(datetime)
#> [1] 2019
month(datetime)
#> [1] 1
mday(datetime)
#> [1] 12
yday(datetime)
#> [1] 12
wday(datetime)
#> [1] 7
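`month()` and `wday()` can also return names instead of numbers through the `label` argument, and `wday()` accepts `week_start` to change which day counts as 1. A brief sketch (the printed labels assume an English locale):

```r
library(lubridate)

datetime <- ymd_hms("2019-01-12 23:34:56")

# Month as an abbreviated name (returned as an ordered factor)
month_name <- month(datetime, label = TRUE)
as.character(month_name)   # "Jan" in an English locale

# Weekday as a full name
day_name <- wday(datetime, label = TRUE, abbr = FALSE)
as.character(day_name)     # "Saturday" in an English locale

# By default Sunday is day 1; with week_start = 1, Monday is day 1
day_monday_first <- wday(datetime, week_start = 1)
day_monday_first
#> [1] 6

hour(datetime)
#> [1] 23
```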
Calculations with Dates
Once you have your date variables formatted correctly, it becomes really intuitive to start performing calculations with them.
When we subtract two dates, we get a “difftime” object. Let’s see what happens when we have an individual’s birthday and want to calculate their age.
When we simply subtract, we get an age in units of days, which is not very intuitive. However, we can use the function `as.duration()` from the `lubridate` package, which reports the time in seconds, as well as a more intuitive rounded figure.
age <- today() - ymd(20120210)
age
as.duration(age)
#> Time difference of 3728 days
#> [1] "322099200s (~10.21 years)"
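If what you need is an age in completed years, one option is to build an interval and measure it with `time_length()`. The sketch below fixes the "as of" date to April 26, 2022 (instead of calling `today()`) so the result is reproducible; the variable names are illustrative.

```r
library(lubridate)

birth <- ymd(20120210)
as_of <- ymd(20220426)   # fixed date in place of today()

# Plain subtraction gives a difftime in days
age_days <- as_of - birth
age_days
#> Time difference of 3728 days

# An interval keeps the calendar context, so years are counted exactly
age_years <- time_length(interval(birth, as_of), "years")
age_years          # about 10.2

# Completed years
floor(age_years)
#> [1] 10
```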
Additional Resources
For more reading on working with strings and dates, check out Chapter 14 and 16 of R for Data Science.
For a good guide on how to use the `stringr` functions, see this cheatsheet.
For a good guide on how to use the `lubridate` functions, see this cheatsheet.