Google API Workflow
2021-04-12 Aaron Chafetz
corps munging wrangling processing api
coRps Session on Google API Workflow
Overview
We work with the GSuite for a lot of our work at USAID. We can take advantage of the Google API improve the automation and collaboration of our workflows, from pushing ggplot outputs to a Google Drive folder or reading directly in a Google Sheet.
The two main packages we use are part of the Tidyverse - googledrive
and googlesheets4
- and maintain by the RStudio team. The underlying
package that allows for authentication is gargle
and only the older
version of the package is whitelisted for use by USAID at the momement.
You will need to install an older version of both gargle
and
googlesheets4
using the code below.
install.packages("devtools")
install.packages("googledrive")
devtools::install_version("gargle", version = "0.5.0", repos = "http://cran.us.r-project.org")
devtools::install_version("googlesheets4", version = "0.2.0", repos = "http://cran.us.r-project.org")
Let’s check that we have the older versions of both.
packageVersion("gargle") == "0.5.0"
## [1] TRUE
packageVersion("googlesheets4") == "0.2.0"
## [1] TRUE
Providing authentication
To start connecting with the Google API, you need to provide your USAID email. The very first time it will launch your internet browser and ask you to confirm approval. You can enter your email or have it prompt you to write in your email if you don’t write it (safer if your code is public). To use each package you will have to authenticate with each one (and have to do this every time to start a new session).
library(googledrive)
library(googlesheets4)
drive_auth()
gs4_auth()
We have a nice function to make loading your email (and other
credentials into a session) as part of the glamr
package. You can view
the instructions from the vignette, credential-management
.
devtools::install_github("USAID-OHA-SI/glamr", build_vignettes = TRUE)
vignette("credential-management", package = "glamr") #launch the vignette
Basically, though you have set your email and it stored is in your Options so its ready to use.
library(glamr)
set_email("rshah@usaid.gov") #just need to set this one time
load_secrets()
Problem to solve
Now we’re ready to rock and roll. For the demo we’re going to try to solve a problem I had to deal with the other day. For HFR, we import any sheet from the Excel submission that has “HFR” in the sheet name. This includes hidden sheets and we realized we were importing and duplicating values for Eswatini since they had a sheet called “HFR v1” that was hidden. We needed to check to see if any of their previous submissions had these extra tabs.
Workflow
Okay, let’s load the packages we need and load our email for use with
googledrive
and googlesheets4
.
library(tidyverse)
library(glamr)
library(googledrive)
library(googlesheets4)
library(fs)
library(readxl)
library(lubridate)
library(glue)
load_secrets()
For HFR, country teams submit their submissions each period via a Google Form. So we have a Google Sheet that stores all the information entered as well as the hyperlink (and id) of the submission.
With googledrive
and googlesheets4
we want to specify the file or
folder id which is found at the end of the url. If your provide the file
name, the packages search through the thousands of files on your GDrive
which takes a ton of time and expends a lot of resources.
Always,always use the Google file id!
Let’s start by looking at the Google Sheet itself that contains all the
information about the submissions. We have the ID we can store as an ID,
as_id()
(or as_sheets_id()
)
hfr_submissions <- as_id("1gQvY1KnjreRO3jl2wzuVCKmKjUUgZDwByVK1c-bzpYI")
And we can open this file in our browser with
googledrive::drive_browse()
.
drive_browse(hfr_submissions)
Now that we have a sense of the sheet structure, let’s read it in
directly to our R session using googlesheets4::read_sheet()
.
df_form <- read_sheet(hfr_submissions)
names(df_form)
## [1] "Timestamp"
## [2] "Email Address"
## [3] "Operating Unit/Country"
## [4] "HFR FY and Period"
## [5] "What type of submission is this?"
## [6] "Is the data being submitted monthly totals or weekly disaggregates?"
## [7] "Which HFR Template(s) are you submitting?"
## [8] "Have the following HFR submission requirements been met? [Are you using the v1.04 FY21 HFR template?]"
## [9] "Have the following HFR submission requirements been met? [Is the meta tab still included in your submission(s)]"
## [10] "Have the following HFR submission requirements been met? [Are the \"Operating Unit/Country\" & \"HFR FY and Period\" fields on the meta tab complete?]"
## [11] "Have the following HFR submission requirements been met? [Do the tabs with your HFR data include \"HFR\" in the tab name?]"
## [12] "Have the following HFR submission requirements been met? [Do the dates in your submission follow the YYYY-MM-DD format]"
## [13] "Have the following HFR submission requirements been met? [Have columns A-G of the HFR tab been filled out?]"
## [14] "Have the following HFR submission requirements been met? [Does your submission files adhere to the following name convention \"HFR_FY[YY]_[Period]_[OU]_[PARTNER IF NEEDED]_[DATE SUBMITTED as YYYYMMDD]\" ?]"
## [15] "Is there any additional information you would like included with your submission?"
## [16] "Who should the HFRG email concerning this submission (names & emails)"
## [17] "Upload your HFR file(s) here"
## [18] "File(s) Rejected or Accepted"
## [19] "HFRG Notes"
## [20] "Feedback sent back to Mission & CC's"
## [21] "HFRG Assigned"
## [22] "Sent to DDC for Processing?"
## [23] "Processed by HFRG (Wavelength)?"
We only need the country and hyperlink to the file, so let’s filter and limit our varaibles.
df_swz <- df_form %>%
select(country = `Operating Unit/Country`,
period = `HFR FY and Period`,
type = `What type of submission is this?`,
file = `Upload your HFR file(s) here`) %>%
filter(country == "Eswatini")
df_swz
## # A tibble: 6 x 4
## country period type file
## <chr> <chr> <chr> <chr>
## 1 Eswatini FY21 Nov Initial https://drive.google.com/open?id=1eo04wabgAC9KQtM~
## 2 Eswatini FY21 Dec Initial https://drive.google.com/open?id=1zi1gHwYWTPZHkzV~
## 3 Eswatini FY21 Jan Initial https://drive.google.com/open?id=1wvfAzuOAc0Wl3eV~
## 4 Eswatini FY21 Feb Initial https://drive.google.com/open?id=1MNn_W1xHgA8Ko3J~
## 5 Eswatini FY21 Feb Re-submi~ https://drive.google.com/open?id=1GN-Q3ioc6yY_PYU~
## 6 Eswatini FY21 Mar Initial https://drive.google.com/open?id=1_FrwNlc4yeSrkdf~
For most countries, they are only submitting one file per submission,
but Eswatini uploades multiple files per submission, one for each
partner (you can’t see it in the preview above because it gets cut off,
but there are three urls in each cell). In order to use the id from the
files’ hyperlinks, we have to breakout the list of hyperlinks in each
cell into a data frame that has one hyperlink per row. The tidyr
package has just the tool we need for this separate_rows()
.
df_swz <- separate_rows(df_swz, file, sep = ", ")
df_swz
## # A tibble: 18 x 4
## country period type file
## <chr> <chr> <chr> <chr>
## 1 Eswatini FY21 Nov Initial https://drive.google.com/open?id=1eo04wabgAC9KQ~
## 2 Eswatini FY21 Nov Initial https://drive.google.com/open?id=1RqECc5RNjE9UU~
## 3 Eswatini FY21 Nov Initial https://drive.google.com/open?id=111WhmmY8WK3RZ~
## 4 Eswatini FY21 Dec Initial https://drive.google.com/open?id=1zi1gHwYWTPZHk~
## 5 Eswatini FY21 Dec Initial https://drive.google.com/open?id=1Uz11s7DyrnrFC~
## 6 Eswatini FY21 Dec Initial https://drive.google.com/open?id=1XJSB7hO8A3vtO~
## 7 Eswatini FY21 Jan Initial https://drive.google.com/open?id=1wvfAzuOAc0Wl3~
## 8 Eswatini FY21 Jan Initial https://drive.google.com/open?id=1fFWqm9HkD71RT~
## 9 Eswatini FY21 Jan Initial https://drive.google.com/open?id=1_NLTRUz4rKSFb~
## 10 Eswatini FY21 Feb Initial https://drive.google.com/open?id=1MNn_W1xHgA8Ko~
## 11 Eswatini FY21 Feb Initial https://drive.google.com/open?id=17S_YtplSvt4Kn~
## 12 Eswatini FY21 Feb Initial https://drive.google.com/open?id=1XzxW2qXzklGja~
## 13 Eswatini FY21 Feb Re-submis~ https://drive.google.com/open?id=1GN-Q3ioc6yY_P~
## 14 Eswatini FY21 Feb Re-submis~ https://drive.google.com/open?id=16V4C0SAblrfPb~
## 15 Eswatini FY21 Feb Re-submis~ https://drive.google.com/open?id=18mcGHW5LhB0nh~
## 16 Eswatini FY21 Mar Initial https://drive.google.com/open?id=1_FrwNlc4yeSrk~
## 17 Eswatini FY21 Mar Initial https://drive.google.com/open?id=1ZgTKfvZA7bkMc~
## 18 Eswatini FY21 Mar Initial https://drive.google.com/open?id=1wDm3k582XzlVM~
Great. Now we have each of the files in their own row. We need to extract the id from the url so that we can use it for downloading. We can use a regular expression to extract this, since all follow the same pattern of coming after “id=” in the url.
ids <- df_swz %>%
mutate(id = str_extract(file, "(?<=id=).*")) %>%
pull()
ids
## [1] "1eo04wabgAC9KQtMnVUafqUexfX2cyE5S" "1RqECc5RNjE9UUKa36vW2tR_yXHB0Q6Pl"
## [3] "111WhmmY8WK3RZQtj-TD2_jc67X2yzl0_" "1zi1gHwYWTPZHkzVwvLKJ-628VvqcFFB4"
## [5] "1Uz11s7DyrnrFCDCHgBOctLHc_oxmXaw0" "1XJSB7hO8A3vtOKFHKUmuBquDYWRgd8r4"
## [7] "1wvfAzuOAc0Wl3eVXGANsh0b0z55c87op" "1fFWqm9HkD71RTutWzUaInt2Fw2s0yTOZ"
## [9] "1_NLTRUz4rKSFb_t-Dgr4-lXwiNzeno7l" "1MNn_W1xHgA8Ko3JsErWI0k_gfRAeZoSj"
## [11] "17S_YtplSvt4KnR6jMK6BJLVEWPio0le3" "1XzxW2qXzklGjaMfyoRICs8sBNTbT0weB"
## [13] "1GN-Q3ioc6yY_PYUwzzk-QidTnZALbpy2" "16V4C0SAblrfPb20pmvX5yd6KFtJFnVmq"
## [15] "18mcGHW5LhB0nhmS7clpiX9YqBG8ZVN_i" "1_FrwNlc4yeSrkdf2TLptfBUcw0UsZ3ZS"
## [17] "1ZgTKfvZA7bkMcAlTTuEXJhLQxwnPf0yh" "1wDm3k582XzlVMpSZyW_XE_LLcqCuDk4m"
The other piece of information we need that we didn’t have from the
original form, is the file name. If we saved the file to the working
directory, this wouldn’t be an issue; but to use the
googldrive::drive_download
and save to a particular folder, you have
to provide the full path, so we need the filename. We can use the
googledrive::drive_get()
function to provide additional information
about the file. We’ll use a function from purrr()
to map over each id
and get the information we need and combine it back into a data frame.
files <- map_dfr(.x = ids,
.f = ~drive_get(as_id(.x))) %>%
mutate(name = str_remove(name, " -.*(?=\\.xlsx)")) #removing the submitter's name
files
## # A tibble: 18 x 3
## name id drive_resource
## * <chr> <chr> <list>
## 1 HFR_FY21_Nov_Eswatini_EGPAF_2020~ 1eo04wabgAC9KQtMnVUafqUex~ <named list [39~
## 2 HFR_FY21_Nov_FHI360_2020-12-14.x~ 1RqECc5RNjE9UUKa36vW2tR_y~ <named list [39~
## 3 HFR_FY21_PD2_Eswatini_TLC_202012~ 111WhmmY8WK3RZQtj-TD2_jc6~ <named list [39~
## 4 HFR_FY21_Dec_Eswatini_EGPAF_2021~ 1zi1gHwYWTPZHkzVwvLKJ-628~ <named list [39~
## 5 HFR_FY21_Dec_FHI360_2021-01-12.x~ 1Uz11s7DyrnrFCDCHgBOctLHc~ <named list [39~
## 6 HFR_FY21_PD3_Eswatini_TLC_202001~ 1XJSB7hO8A3vtOKFHKUmuBquD~ <named list [39~
## 7 HFR_FY21 PD4_Eswatini_TLC_202102~ 1wvfAzuOAc0Wl3eVXGANsh0b0~ <named list [40~
## 8 HFR_FY21_Jan_Eswatini_EGPAF_2021~ 1fFWqm9HkD71RTutWzUaInt2F~ <named list [39~
## 9 HFR_FY21_Jan_FHI360_2021-02-11.x~ 1_NLTRUz4rKSFb_t-Dgr4-lXw~ <named list [39~
## 10 HFR_FY21_Feb_Eswatini_EGPAF_2021~ 1MNn_W1xHgA8Ko3JsErWI0k_g~ <named list [41~
## 11 HFR_FY21_Feb_Eswatini_FHI360_202~ 17S_YtplSvt4KnR6jMK6BJLVE~ <named list [40~
## 12 HFR_FY21_Feb_Eswatini_TLC_202103~ 1XzxW2qXzklGjaMfyoRICs8sB~ <named list [40~
## 13 HFR_FY21_Feb_Eswatini_EGPAF_2021~ 1GN-Q3ioc6yY_PYUwzzk-QidT~ <named list [40~
## 14 HFR_FY21_Feb_Eswatini_FHI360_202~ 16V4C0SAblrfPb20pmvX5yd6K~ <named list [39~
## 15 HFR_FY21_Feb_Eswatini_TLC_202103~ 18mcGHW5LhB0nhmS7clpiX9Yq~ <named list [40~
## 16 HFR_FY21_Mar_Eswatini_EGPAF_2021~ 1_FrwNlc4yeSrkdf2TLptfBUc~ <named list [39~
## 17 HFR_FY21_Mar_Eswatini_FHI360_202~ 1ZgTKfvZA7bkMcAlTTuEXJhLQ~ <named list [39~
## 18 HFR_FY21_Mar_Eswatini_TLC_202104~ 1wDm3k582XzlVMpSZyW_XE_LL~ <named list [39~
If you look at these file names, you’ll see there are a few duplicates from resubmissions with the same names being uploaded.
duplicated(files$name)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] TRUE TRUE TRUE FALSE FALSE FALSE
To resolve this issue, we can add the creation time, which we can extract from nested list of drive_resources to the file name.
files <- files %>%
mutate(created_time = purrr::map_chr(drive_resource, "createdTime") %>%
ymd_hms(tz = "EST"),
created_time = created_time %>%
str_remove(":[:digit:]+\\.[:digit:]+Z") %>%
str_remove_all("-|:") %>%
str_replace(" ", "-"),
name_new = str_replace(name, "\\.xlsx", glue("_{created_time}\\.xlsx")))
files$name_new
## [1] "HFR_FY21_Nov_Eswatini_EGPAF_20201214_20201215-071811.xlsx"
## [2] "HFR_FY21_Nov_FHI360_2020-12-14_20201215-071813.xlsx"
## [3] "HFR_FY21_PD2_Eswatini_TLC_20201214_20201215-071814.xlsx"
## [4] "HFR_FY21_Dec_Eswatini_EGPAF_20210115 _20210115-094622.xlsx"
## [5] "HFR_FY21_Dec_FHI360_2021-01-12_20210115-094625.xlsx"
## [6] "HFR_FY21_PD3_Eswatini_TLC_20200114_20210115-094627.xlsx"
## [7] "HFR_FY21 PD4_Eswatini_TLC_20210212_20210215-025822.xlsx"
## [8] "HFR_FY21_Jan_Eswatini_EGPAF_20210212_20210215-025826.xlsx"
## [9] "HFR_FY21_Jan_FHI360_2021-02-11_20210215-025828.xlsx"
## [10] "HFR_FY21_Feb_Eswatini_EGPAF_20210312_20210315-054234.xlsx"
## [11] "HFR_FY21_Feb_Eswatini_FHI360_20210310_20210315-054238.xlsx"
## [12] "HFR_FY21_Feb_Eswatini_TLC_20210312_20210315-054240.xlsx"
## [13] "HFR_FY21_Feb_Eswatini_EGPAF_20210312_20210412-031253.xlsx"
## [14] "HFR_FY21_Feb_Eswatini_FHI360_20210310_20210412-031257.xlsx"
## [15] "HFR_FY21_Feb_Eswatini_TLC_20210312_20210412-031259.xlsx"
## [16] "HFR_FY21_Mar_Eswatini_EGPAF_20210415_20210415-080403.xlsx"
## [17] "HFR_FY21_Mar_Eswatini_FHI360_20210413_20210415-080405.xlsx"
## [18] "HFR_FY21_Mar_Eswatini_TLC_20210415_20210415-080408.xlsx"
duplicated(files$name_new)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE
With that, we have all the information we need to download the files - we have the id and the file name. We’ll create a temporary folder to store these in and then download them to that folder (one at a time).
folderpath <- dir_create(file_temp())
walk2(.x = files$id,
.y = files$name_new,
.f = ~drive_download(as_id(.x),
file.path(folderpath, .y)))
Wonderful so we have all the Eswatini Excel submissions on our local
computer. We can now check if any additional submissions had more than
one HFR labeled tab. We’ll use readxl::excel_sheets()
to provide us
with the list of tab names and filter out anything that won’t be
imported (ie anything without HFR in the name.)
local_files <- list.files(folderpath, full.names = TRUE)
local_files %>%
set_names() %>% #names the vector (1,2,3...) to be the name of the file
map_df(~ excel_sheets(.x) %>% as_tibble(),
.id = "file") %>%
mutate(file = basename(file),
value = str_trim(value)) %>%
filter(str_detect(value, "HFR")) %>%
group_by(file) %>%
mutate(file_id = cur_group_id(), .before = 1) %>%
mutate(extra = value != "HFR") %>%
ungroup() %>%
rename(sheet = value) %>%
prinf()
## # A tibble: 19 x 4
## file_id file sheet extra
## <int> <chr> <chr> <lgl>
## 1 1 HFR_FY21 PD4_Eswatini_TLC_20210212_20210215-025822.xlsx HFR FALSE
## 2 2 HFR_FY21_Dec_Eswatini_EGPAF_20210115 _20210115-094622.xl~ HFR FALSE
## 3 3 HFR_FY21_Dec_FHI360_2021-01-12_20210115-094625.xlsx HFR FALSE
## 4 4 HFR_FY21_Feb_Eswatini_EGPAF_20210312_20210315-054234.xlsx HFR FALSE
## 5 5 HFR_FY21_Feb_Eswatini_EGPAF_20210312_20210412-031253.xlsx HFR FALSE
## 6 6 HFR_FY21_Feb_Eswatini_FHI360_20210310_20210315-054238.xl~ HFR FALSE
## 7 6 HFR_FY21_Feb_Eswatini_FHI360_20210310_20210315-054238.xl~ HFR ~ TRUE
## 8 7 HFR_FY21_Feb_Eswatini_FHI360_20210310_20210412-031257.xl~ HFR FALSE
## 9 8 HFR_FY21_Feb_Eswatini_TLC_20210312_20210315-054240.xlsx HFR FALSE
## 10 9 HFR_FY21_Feb_Eswatini_TLC_20210312_20210412-031259.xlsx HFR FALSE
## 11 10 HFR_FY21_Jan_Eswatini_EGPAF_20210212_20210215-025826.xlsx HFR FALSE
## 12 11 HFR_FY21_Jan_FHI360_2021-02-11_20210215-025828.xlsx HFR FALSE
## 13 12 HFR_FY21_Mar_Eswatini_EGPAF_20210415_20210415-080403.xlsx HFR FALSE
## 14 13 HFR_FY21_Mar_Eswatini_FHI360_20210413_20210415-080405.xl~ HFR FALSE
## 15 14 HFR_FY21_Mar_Eswatini_TLC_20210415_20210415-080408.xlsx HFR FALSE
## 16 15 HFR_FY21_Nov_Eswatini_EGPAF_20201214_20201215-071811.xlsx HFR FALSE
## 17 16 HFR_FY21_Nov_FHI360_2020-12-14_20201215-071813.xlsx HFR FALSE
## 18 17 HFR_FY21_PD2_Eswatini_TLC_20201214_20201215-071814.xlsx HFR FALSE
## 19 18 HFR_FY21_PD3_Eswatini_TLC_20200114_20210115-094627.xlsx HFR FALSE
We can now see that there was only 1 file (file_id
= 6) that has two
tabs being read in. Phew.
We could done all this manually - copying and pasting each of the 18 urls into our browser from the form, downloading them, opening each, and check each one for hidden tabs. But this workflow takes a bit to think through the first time you do it, but you have the script if you need to repeat this agian.