RBBS 5 - Base

In this R Building Blocks session, we will focus on understanding base R and how it connects to the larger tidyverse. This session draws on Advanced R and Hands-On Programming with R.

Learning Objectives

  • Grounding/introduction to base R
  • Learn how base R defines and conceptualizes different types of data objects
  • Learn some basic tools for manipulating these objects
  • Connect to the larger tidyverse

Recording

USAID staff can use this link to access today’s recording (not available to external users).

Material

Setup

For these sessions, we’ll be using RStudio, an IDE (integrated development environment) that makes it easier to work with R. For help on getting set up and installing packages, please reference this guide.

Load Packages

The beauty of base R is that you don’t need to load any extra packages. All of the functions are already contained in R itself. However, we will also be going over how some of the basic R functions connect to the functions available in the tidyverse, so we recommend loading this set of packages as well.

library(tidyverse) #install.packages("tidyverse")

Introduction to base R

Base R is a package that contains the basic functions R requires to function as a programming language: arithmetic, input/output, basic programming support, etc. The contents of the base R package are available through inheritance from any environment and don’t need to be separately loaded. For a complete list of functions, use library(help = "base") and see the base R cheat sheet for a great summary of useful commands.

So why learn base R given the plethora of fancier functions that are now available? Because it provides the underlying structures for all of these more sophisticated packages and functions. Without R there would be no tidyverse! Additionally, the tidyverse can be a little overwhelming – there are so many different functions and packages to remember! Going back to basics in base R can illuminate the organizing principles that underlie these packages and can make it easier to put it all together.

We can think of numbers and characters as the atoms of base R, vectors and strings as molecules, lists and data frames as stars, and functions and packages as galaxies. In this session we build up to the tidyverse from base R step by step.

Objects in R

Let’s start with the basics by defining what an object is in R. It turns out that in R everything is an object: vectors, lists, data frames and even functions.

An object is defined as a data structure having some attributes and methods which act on its attributes.

This makes R very flexible and able to handle many situations and data types. Base R also includes some pre-defined objects that are particularly useful for working with mixed data types.

Assigning names to objects

The assignment operator <- binds together a name and a value, or any type of object. Names must use letters, digits, . and _ but can’t begin with _ or a digit, nor use reserved words.

Below are two examples, one assigning a number and another a letter to object x.

    x <- 3    # Value 3 is assigned to object x
    x <- "a"  # Character "a" is assigned to object x

As a note, there are multiple ways to assign values in R. For our purposes, <- is used for assignment in most situations and = is used to supply arguments to functions (e.g. function(x=3, y=5)).
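The short sketch below illustrates the two roles described above. The object names (x, y, same_value, avg) are made up for illustration and are not part of the original session.

```r
# Assignment with <- (the convention used in this session)
x <- 3

# Assignment with = also works at the top level, but is less conventional
y = 3
same_value <- identical(x, y)
print(same_value)

# Inside a function call, = matches a value to an argument name;
# it does not change the object x defined above
avg <- mean(x = c(1, 2, 3))
print(avg)
print(x)
```

Sticking to <- for assignment and = for arguments makes it easier to see at a glance which of the two is happening.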

Learning more about objects in R

A few useful commands can help us understand the objects encountered in R. The command typeof() returns the data type for vectors and the object type for more complex structures, while str() returns the structure of an object. Finally, the print() command prints the contents of an object. See below for some example output:

vec<-c(1,2,3)
typeof(vec) 
## [1] "double"
str(vec) ## Note: str=structure not string!
##  num [1:3] 1 2 3
print(vec) ## values contained in an object
## [1] 1 2 3

Another useful base R function is list.files(), which lists the names of all the files in a particular directory whose names include a given pattern.
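As a minimal sketch of list.files() with a pattern, the example below creates a throwaway folder so that it can run anywhere. The folder and file names are hypothetical and exist only for this illustration.

```r
# Create a temporary folder with three empty files (names are made up)
demo_dir <- tempfile("rbbs_demo_")
dir.create(demo_dir)
invisible(file.create(file.path(demo_dir, c("a.csv", "b.csv", "notes.txt"))))

# The pattern is a regular expression: "\\.csv$" keeps names ending in .csv
csv_files <- list.files(demo_dir, pattern = "\\.csv$")
print(csv_files)

# Without a pattern, every file in the folder is listed
all_files <- list.files(demo_dir)
print(length(all_files))
```

Setting full.names = TRUE returns the full paths instead of just the file names, which is handy when the files are read in afterwards.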

Data types in R

So let’s dive into the atoms of our R universe — the data types that the more complicated objects consist of. Four common data types used in R are:

Data types    Examples
doubles       2.5, 3.7, 8.1
integers      1L, 5L, 8L
logicals      TRUE, FALSE
characters    "a", "b", "c", "abc"

Base R also includes a set of commands that allows us to learn what type of variables we are dealing with, which can be particularly useful if these variables are being read in from an existing data set, rather than defined by the user.

is.character("hi")
## [1] TRUE
is.numeric("hi")
## [1] FALSE
typeof("hi")
## [1] "character"

Values can also be coerced from one type to another. In this case, the number 5 can be treated as a number or as the character “5”.

as.character(5)
## [1] "5"
as.numeric("5")
## [1] 5
as.logical(0)
## [1] FALSE
as.logical(1)  # Note that this will also be TRUE for any other non-zero value
## [1] TRUE
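Coercion does not always succeed. The sketch below, with illustrative object names, shows what happens when a value cannot be converted: R returns NA (and usually a warning, suppressed here to keep the output clean).

```r
# Converting a double to an integer drops the decimal part
int_val <- as.integer(3.9)
print(int_val)

# Text that does not look like a number becomes NA
not_a_number <- suppressWarnings(as.numeric("five"))
print(is.na(not_a_number))

# Only a few spellings of true/false are recognized; anything else is NA
lgl_from_text <- as.logical(c("TRUE", "false", "yes"))
print(lgl_from_text)
```

Checking for NA values after a coercion step is a quick way to spot entries that did not convert as expected.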

Data types in R: A detour into strings

Before moving on to vectors, let’s zoom in on strings, which are defined as collections of characters. A string is enclosed inside a set of quotes (e.g. x<-"this is a string"). We often need to work with this data type, especially in the process of data cleaning. We may have data that contains important information that needs to be extracted from a longer string. Or we may want to combine strings together to create a new variable or a plot title or label.

A few very useful commands for combining, manipulating and searching strings can be found in base R. The paste() command combines components into longer expressions, while gsub() replaces a chunk of text within a string, and grep() locates smaller chunks of text within a longer string or a vector of strings. These functions will never go out of style, but the tidyverse package stringr includes many more sophisticated functions that can be used to work with strings (see cheatsheet).

paste("happy","birthday",sep='-') # The separation is a single space by default, but can be re-defined.
## [1] "happy-birthday"
vec<-c("happy","birthday") 
paste(vec, collapse=' ') # The collapse parameter turns a vector into a string.
## [1] "happy birthday"

gsub("sad","happy","sad birthday") # Replaces text within a string
## [1] "happy birthday"

grep("a",vec) # Returns the positions of the vector elements that contain the letter "a"
## [1] 1 2
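A few close relatives of these commands are worth knowing. The sketch below uses a made-up vector of indicator-style names purely for illustration.

```r
# Hypothetical indicator names used only for this example
indicators <- c("HTS_TST", "TX_CURR", "TX_NEW")

# grepl() returns TRUE/FALSE for each element instead of positions
has_tx <- grepl("^TX", indicators)
print(has_tx)

# grep(..., value = TRUE) returns the matching elements themselves
tx_only <- grep("^TX", indicators, value = TRUE)
print(tx_only)

# sub() replaces only the first match, gsub() replaces every match
first_only <- sub("_", " ", "a_b_c")
all_matches <- gsub("_", " ", "a_b_c")
print(c(first_only, all_matches))

# paste0() is paste() with no separator
label <- paste0("FY", 2022)
print(label)
```

The logical output of grepl() is especially convenient for filtering, since it can be used directly inside square brackets.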

Data objects in R

Data objects: one-dimensional vectors

Now that we’ve covered the data elements or atoms of the R universe, let’s shift to vectors, the simplest object type and the molecules of the base R universe.

Vectors are 1-dimensional objects consisting of multiple data elements, all of which are the same type. Vectors of this kind are called atomic vectors, and the four common types are logical, character and two numeric types (integer and double).

Vectors are generated using the c() command, which binds together multiple elements.

#Numeric vectors
int_vec <- c(1L, 6L, 10L) # The `L` ensures numbers are saved as integers rather than doubles.
dbl_vec <- c(1, 2.5, 4.5)

#Logical and character vectors
lgl_vec <- c(TRUE, FALSE)
chr_vec <- c("some", "strings")

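Note that when the c() command receives inputs of different data types, everything is coerced to the most flexible type present (logical, then integer, then double, then character). In the example below, the string forces everything to be cast as a character. But don’t worry, R has a structure called a list that can help with this conundrum and will be introduced later.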

c(6,FALSE,"string") 
## [1] "6"      "FALSE"  "string"
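To see which type wins when different types are combined, a minimal sketch (object names are illustrative) is:

```r
# Logical + integer: the result is integer (TRUE becomes 1, FALSE becomes 0)
lgl_int <- c(TRUE, FALSE, 2L)
print(typeof(lgl_int))
print(lgl_int)

# Integer + double: the result is double
int_dbl <- c(1L, 2.5)
print(typeof(int_dbl))

# Anything + character: the result is character
dbl_chr <- c(1.5, "a")
print(typeof(dbl_chr))
```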

Commands for working with vectors

A few base R commands are very helpful when working with vectors. For example, the rep() and seq() commands can easily generate a vector that repeats values or a vector that contains a sequence of values defined by its boundaries and intervals. Two integers separated by a colon (e.g. 1:5) will also generate a sequence of integers.

## Shortcuts for generating vectors
1:5
## [1] 1 2 3 4 5
rep(1:2,times=3); rep(1:2,each=3) # Parameters times/each define how the vector is constructed
## [1] 1 2 1 2 1 2
## [1] 1 1 1 2 2 2
seq(2,3,by=0.5)
## [1] 2.0 2.5 3.0
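A few more variations on these shortcuts are sketched below; the object names are made up for this illustration.

```r
# seq() can take the number of values instead of the step size
evenly_spaced <- seq(0, 1, length.out = 5)
print(evenly_spaced)

# seq_len() counts from 1 up to n
first_three <- seq_len(3)
print(first_three)

# seq_along() gives one position per element of another vector
positions <- seq_along(c("a", "b", "c", "d"))
print(positions)

# rep() accepts a different repeat count for each element
uneven <- rep(c("a", "b"), times = c(2, 1))
print(uneven)
```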

All of the arithmetic operations in R are vectorized: they work element by element and don’t require loops. Additionally, sort() and rev() are helpful functions with intuitive names. The unique() function returns all of the unique values in a vector and can be used to quickly assess the contents of a long vector of data. The function table() provides even more information by counting the instances of these unique values.

## Math with vectors
vec<-c(1,2,3)
vec+vec; vec*vec; vec/vec; vec-vec
## [1] 2 4 6
## [1] 1 4 9
## [1] 1 1 1
## [1] 0 0 0

## Vector functions
sort(c(1,3,2)) 
## [1] 1 2 3
rev(c(1,2,3))
## [1] 3 2 1
unique(c(1,1,2)) 
## [1] 1 2
table(c(6,6,7))
## 
## 6 7 
## 2 1
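One consequence of element-by-element arithmetic is recycling: when two vectors have different lengths, R repeats the shorter one. The sketch below, with illustrative names, shows this alongside a few summary functions.

```r
vec <- c(1, 2, 3, 4)

# A single number is recycled across every element
doubled <- vec * 2
print(doubled)

# A length-2 vector is recycled as 10, 100, 10, 100
recycled <- vec * c(10, 100)
print(recycled)

# Comparisons are vectorized too and return a logical vector
above_two <- vec > 2
print(above_two)

# Summaries collapse the vector to a single value
total <- sum(vec)
print(total)

# sort() can also sort from largest to smallest
sorted_desc <- sort(vec, decreasing = TRUE)
print(sorted_desc)
```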

Data Objects: multi-dimensional vectors

While vectors are one-dimensional, R can also handle two-dimensional matrices as well as multi-dimensional arrays. Like vectors, these objects can only contain one data type. However, some of these commands will translate to data frames, a more flexible two-dimensional object that will be introduced later. The commands below include the two-dimensional and multi-dimensional versions of names(), length() and c(), which provide information about the various dimensions of the objects and bind together rows and columns.

Vectors            Matrices                  Arrays
names()            rownames(), colnames()    dimnames()
length()           nrow(), ncol()            dim()
c()                rbind(), cbind()          abind::abind()
-                  t()                       aperm()
is.null(dim(x))    is.matrix()               is.array()
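A minimal sketch of several of these commands in action is shown below. The objects m and arr are made up for this illustration.

```r
# Build a 2 x 3 matrix; values are filled column by column
m <- matrix(1:6, nrow = 2)
colnames(m) <- c("a", "b", "c")
print(m)
print(dim(m))

# Pull a single cell by row number and column name
cell <- m[2, "b"]
print(cell)

# t() swaps rows and columns
m_t <- t(m)
print(dim(m_t))

# rbind() adds a row
m_more <- rbind(m, 7:9)
print(nrow(m_more))

# A 3-dimensional array; aperm() reverses the dimension order by default
arr <- array(1:24, dim = c(2, 3, 4))
arr_dims <- dim(aperm(arr))
print(arr_dims)
```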

Data Objects: Factors

A factor is an object that can contain only predefined values and is typically used to store categorical data. These pre-defined values are called levels.

Factors are built on top of an integer vector with two attributes: the “factor” class, which makes it behave differently from regular integer vectors, and the “levels”, which define the set of allowed values. Factors are useful when the universe of possible values is well-understood, even if the values are not all represented in a specific data set. Factors can be generated in base R via the factor() command and other objects can be turned into factors with as.factor(). Remember that in this case there may be some missing levels not contained in your data sample.

# Generating a factor
x <-factor(c("a", "b", "a", "b"),levels=c("a", "b", "c"))

# Transforming a vector into a factor
y<-c("a","b","a"); x <- as.factor(y)
levels(x)
## [1] "a" "b"

# An example that illustrates the difference between a vector and a factor when 
# some data values are not represented in a data set
sex_char<-c("m", "m", "m"); table(sex_char)
## sex_char
## m 
## 3
sex_fac<-factor(sex_char, levels = c("m", "f")); table(sex_fac)
## sex_fac
## m f 
## 3 0

A few factor commands are also listed below. The levels() and factor() commands allow for reassigning the names of levels throughout the entire data set (e.g. switching from m to male) as well as re-ordering levels. The reordering can be useful when a different order than the default is preferable (e.g. temporal rather than alphabetical order for months). While factors are a great data structure, some of the base R commands for more sophisticated factor operations can be confusing and the tidyverse package forcats has a lot to offer to fill these gaps (see cheat sheet).

# Useful commands for reassigning level names and reordering
levels(sex_fac)<-c("male", "female"); print(sex_fac) # forcats analogs fct_recode(), fct_relabel()
## [1] male male male
## Levels: male female
sex_fac<-factor(sex_fac, levels = c("female", "male")); print(sex_fac) # forcats analogs fct_relevel(), fct_reorder()
## [1] male male male
## Levels: female male
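The month example mentioned above can be sketched as follows. The vector months_chr is invented for illustration.

```r
months_chr <- c("Mar", "Jan", "Feb", "Jan")

# A character vector sorts alphabetically
alpha_order <- sort(months_chr)
print(alpha_order)

# A factor sorts by the order of its levels, here calendar order
months_fac <- factor(months_chr, levels = c("Jan", "Feb", "Mar"))
calendar_order <- as.character(sort(months_fac))
print(calendar_order)

# Underneath, the factor stores integer codes that point to the levels
codes <- as.integer(months_fac)
print(codes)
```

The integer codes are why factors behave differently from character vectors in sorting, tabulating and plotting.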

Data Objects: Lists

Next I want to talk about lists, the stars in our universe, which are a very flexible type of object.

A list is an object in which each element can be any type – this includes objects that are vectors in themselves, including vectors of different lengths! This works because each element is actually a reference to another object.

Why are lists so useful? Because they allow for working with mixed data types and even object types. This finally solves our problem from the section on vectors! The list() function preserves data types, unlike c().

lst <- list(x=1:3, y="a", z=c(TRUE, FALSE, TRUE), q=c(2.3, 5.9))
str(lst)
## List of 4
##  $ x: int [1:3] 1 2 3
##  $ y: chr "a"
##  $ z: logi [1:3] TRUE FALSE TRUE
##  $ q: num [1:2] 2.3 5.9

Another useful feature of lists is that the same function can be applied efficiently to each list element, without having to write many nested loops. In base R, lapply() and several related functions provide these capabilities, but the tidyverse purrr package (see cheat sheet) has many more options so I would suggest looking there. This includes functions such as detect() and keep(), which can accomplish things that take more effort in base R.

lst<-list(
     a=c(1,2,3),
     b=c(4,5))
lapply(lst, FUN=mean) # purrr analogs include map(lst,mean) and others
## $a
## [1] 2
## 
## $b
## [1] 4.5
unlist(lst) # a rough purrr analog is flatten_dbl()
## a1 a2 a3 b1 b2 
##  1  2  3  4  5
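Two related points are sketched below using the same list as above: the lapply() relatives that return a vector instead of a list, and the difference between single and double square brackets. The result names are illustrative.

```r
lst <- list(a = c(1, 2, 3), b = c(4, 5))

# sapply() simplifies the result to a named vector when it can
means <- sapply(lst, mean)
print(means)

# vapply() does the same but requires the expected type to be stated up front
lengths_int <- vapply(lst, length, integer(1))
print(lengths_int)

# Double brackets extract the element itself (here a numeric vector)
element <- lst[["a"]]
print(is.numeric(element))

# Single brackets return a smaller list that still wraps the element
sub_list <- lst["a"]
print(is.list(sub_list))
```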

Data Objects: Data Frames

A data frame is a special case of a list where all elements are the same length. Unlike matrices and arrays, the columns can contain different types of data (numeric, strings, factors etc.). Since each row corresponds to an observation, the length of the columns is always the same.

Data frames can be generated using the data.frame() function and defining each column.

# Generating a data frame consisting of two columns, one numeric and one character
df <- data.frame(col1=1:3,col2=c("a","b","c")) 

Additionally, when R reads in .txt and .csv files with read.table() and read.csv(), the results are returned as data frames, for example df <- read.table('file.txt') or df <- read.csv('file.csv').
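To try this without an existing file, the sketch below first writes a small data frame to a temporary .csv file and then reads it back. The file path and object names are created only for this illustration.

```r
# Write a small data frame to a temporary file
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(col1 = 1:3, col2 = c("a", "b", "c")),
          csv_path, row.names = FALSE)

# Read it back in; the result is a data frame
df_in <- read.csv(csv_path)
print(is.data.frame(df_in))
str(df_in)
```

Running str() right after reading a file is a quick check that each column arrived with the type you expected.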

There are many ways to call the columns and rows of a data frame in base R. For example, using the $ sign between the data frame name and column name will do it (e.g. df$col_name). You can also dynamically add a column to an existing data frame using the $, as seen below. Finally, data frames can be subset by manually filtering the rows using logical statements. The dplyr package in the tidyverse has many more options for working with data frames (see the cheat sheet: https://github.com/rstudio/cheatsheets/blob/main/data-transformation.pdf).

# Different ways to call a column in base R
df$col1; df[,1]; df[[1]] #dplyr analog pull(df,col1) 
## [1] 1 2 3
## [1] 1 2 3
## [1] 1 2 3

# Calling a row in base R
df[1,] #dplyr analog is slice(df,1)
##   col1 col2
## 1    1    a

# Calling a cell
df[1,1]; df[[1]][1]; df$col1[1] 
## [1] 1
## [1] 1
## [1] 1

## Column names
names(df)
## [1] "col1" "col2"

## Create column
df$col3<-4:6 #dplyr analog is mutate(df,col3=4:6)

#Filter rows
df[df$col1==2,] #dplyr analog is filter(df,col1==2)
##   col1 col2 col3
## 2    2    b    5
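Row filters can be combined with column selection and sorting. The sketch below rebuilds the same three-column data frame so that it runs on its own; the result names are illustrative.

```r
df <- data.frame(col1 = 1:3, col2 = c("a", "b", "c"), col3 = 4:6)

# Combine two conditions with & and keep only two of the columns
both <- df[df$col1 >= 2 & df$col3 < 6, c("col1", "col2")]
print(both)

# subset() is a base R shortcut that avoids repeating df$
via_subset <- subset(df, col1 >= 2, select = c(col1, col3))
print(via_subset)

# order() returns row positions, which can be used to sort the rows
sorted <- df[order(df$col1, decreasing = TRUE), ]
print(sorted)
```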

Manipulating data frames

The commands for subsetting data frames build off commands for subsetting vectors. A nice summary of both sets of commands is in the base R cheat sheet. A few of the useful commands for vectors include:

# Examples for subsetting vectors
x<-1:5
x[-4] # Note that minus means not that element. 
## [1] 1 2 3 5
x[x<3] # Elements can be subset using logical operators
## [1] 1 2
x[x %in% c(1,3,5)] #The %in% command pulls a discrete subset of elements 
## [1] 1 3 5

The cheat sheet also includes a nice summary of commands for data frames. Data frames inherit many matrix commands like nrow(), ncol() as well as cbind() and rbind(). However, data frames are lists at their core and can therefore also be manipulated using commands that map more closely onto lists such as df$column and df[[2]], which extract an element of a list, in this case a column.
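The dual nature of data frames can be checked directly. The sketch below uses a small made-up data frame; stringsAsFactors = FALSE is set explicitly so that the character column stays a character column regardless of R version.

```r
df <- data.frame(x = 1:3, y = c("t", "c", "a"), stringsAsFactors = FALSE)

# List-like behaviour: a data frame is a list with one element per column
print(is.list(df))
n_elements <- length(df)
print(n_elements)
col_types <- sapply(df, class)
print(col_types)

# Matrix-like behaviour: rows and columns can be counted and bound
df_dims <- dim(df)
print(df_dims)
df_wide <- cbind(df, z = 0)
print(names(df_wide))
```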

Summary

In summary, the logic and structures of base R commands provide us with an architecture to understand the principles of the language. Base R can help us perform many basic tasks easily and also understand how different elements of the tidyverse relate to each other. Below is a visual summary of various elements of base R and how they map onto the tidyverse.

[20220406_petrovic_rbbs_5_baseR_tidyverse_diagram.png]

Exercises for practice

The answer key is located at the bottom.

1) How can indicator PrEP_CURR be modified to PrEP_CT using the gsub() command?

2) What is the simplest way to create the vector: 1.0 1.0 1.0 1.5 1.5 1.5 2.0 2.0 2.0 Bonus: What if I am interested in only the unique elements of this vector, sorted in descending order and squared?

3) Turn the vector into a character vector and add “cm” to the values and then confirm that this is now a character vector with length 3.

4) Create a vector that includes 1 number, 1 character & 1 integer

5) a) Create a list that consists of: – the number 5, – the string “abcd”, – a vector of integers from 1:3, – a factor with levels “a”, “b” and “c”, but no observations of “c”

   b) Use a function to list attributes of the components

6) a) Generate a data frame with 3 rows and 2 columns such that: – column 1 is named x and contains numbers 1,2,3 – column 2 is named y and contains letters t,c,a b) Add column z, which is filled with zeros

7) Write out “1 cat” using only base R calls on the data frame from exercise 6 and paste() Bonus: call each y component using a different command

8) a) Pull out column x and then subset only the elements that are less than 3 b) Print only the row where x=1. Bonus: Print only the rows where y contains consonants

Answer Key

# 1) How can indicator PrEP_CURR be modified to PrEP_CT using the gsub() command?

gsub("CURR","CT","PrEP_CURR")
## [1] "PrEP_CT"


# 2) What is the simplest way to create the vector: 1.0 1.0 1.0 1.5 1.5 1.5 2.0 2.0 2.0 
# Bonus: What if I am interested in only the unique elements of this vector, 
# sorted in descending order and squared? 

vec<-rep(seq(1,2,by=0.5),each=3); print(vec)
## [1] 1.0 1.0 1.0 1.5 1.5 1.5 2.0 2.0 2.0

# Bonus:
vec<-rev(unique(vec))
vec*vec
## [1] 4.00 2.25 1.00


# 3) Turn the vector into a character vector and add "cm" to the values and then
#    confirm that this is now a character vector with length 3.

vec<-paste(as.character(vec),"cm")
str(vec)
##  chr [1:3] "2 cm" "1.5 cm" "1 cm"


# 4) Create a vector that includes 1 number, 1 character & 1 integer

# TRICK QUESTION!!! A vector cannot include mixed types so the c() command will
# cast everything as a character. A list can have mixed types and is created with
# the list() command

c(3.5,1L,"a")
## [1] "3.5" "1"   "a"
list(3.5,1L,"a")
## [[1]]
## [1] 3.5
## 
## [[2]]
## [1] 1
## 
## [[3]]
## [1] "a"


# 5) a) Create a list that consists of:
#          --the number 5, 
#          --the string "abcd", 
#          --a vector of integers from 1:3 
#          --a factor with levels "a","b" and "c", but no observations of "c" 
# 
# 5) b) Use a function to list attributes of the components 

ls1<-list(5,"abcd", 1:3, factor(c("a","a","b"),levels=c("a","b","c")))
str(ls1)
## List of 4
##  $ : num 5
##  $ : chr "abcd"
##  $ : int [1:3] 1 2 3
##  $ : Factor w/ 3 levels "a","b","c": 1 1 2


# 6) a) Generate a data frame with 3 rows and 2 columns such that:
#          -- column 1 is named x and contains numbers 1,2,3 
#          -- column 2 is named y and contains letters t,c,a 
# 6) b) Add column z, which is filled with zeros

df <- data.frame(x=1:3,y=c("t","c","a")); print(df)
##   x y
## 1 1 t
## 2 2 c
## 3 3 a
df$z<-0; print(df)
##   x y z
## 1 1 t 0
## 2 2 c 0
## 3 3 a 0


# 7) Write out "1 cat" using only base R calls on the data frame above and paste()
# Bonus: call each y component using a different command

paste(df$x[1], paste(df$y[2], df[[2]][3], df[1,2], sep='')) 
## [1] "1 cat"


# 8) a) Pull out column x and then subset only the elements that are less than 3
#    b) Print only the row where x=1.
# Bonus: Print only the rows where y contains consonants

vec<-df$x; vec<-vec[vec<3]; print(vec)
## [1] 1 2

df[df$x==1,]
##   x y z
## 1 1 t 0

# Bonus
df[df$y %in% c("c","t"),]
##   x y z
## 1 1 t 0
## 2 2 c 0