Introduction

This vignette provides best practices for project workflow at USAID/Office of HIV/AIDS, making it easier to get you going and to align exceptions across the team.

Initiate Project

The first thing we want to do when starting up new work is determining if it fits into an existing bucket or this is completely new. All of our R work exists on GitHub to make our work transparent, improve collaboration, and makee it easy access. You find our repositories for different projects and package under our GitHub organization, USAID-OHA-SI. If the work exists, you can clone the repo (make a local copy) and start working from there. If however its new work, you’ll want to start up a new project and repo. We work with RStudio Projects and git/GitHub

RStudio projects make it straightforward to divide your work into multiple contexts, each with their own working directory, workspace, history, and source documents.

To create a RStudio project, you can navigate there from File > New Project or establish through usethis

library(usethis)

create_project("~/Documents/project-transform")

This will get you set you up in RStudio with a self contained project. In order to collaborate we also use git and GitHub. For more on integrating git with RStudio, check out Jenny Bryan’s Happy Git and GitHub for the useR. You an also use a UI tool like GitHub Desktop or GitKraken. If you have git installed and integreated with R, you can again use usethis to set everything up. And usethis even has some help info on getting started with git if this your first time.

use_git() #this prompts a restart of your session
use_github(organisation = "USAID-OHA-SI")

The last thing we’ll use usethis for is to set up a license for the project. We use the MIT license which calls for attribution for downstream uses of your work and specifies that the software is provided as is.

usethis::use_mit_license("Dr. Raj Shah") #add your name

SI Workflow, Enter glamr

Okay, so we go through the basics of setting up a project, now let’s turn to some workflow specifics that glamr is going to help us with.

The primary function we’re going to run to get us going is si_setup. If we look under the hood at si_setup, we can see it actually contains three sub-function.

si_setup
#> function () 
#> {
#>     folder_setup()
#>     setup_gitignore()
#>     setup_readme()
#> }
#> <bytecode: 0x55d37df458e8>
#> <environment: namespace:glamr>

The first one, folder_setup() initiates the main folders we use for our work:

  • Data - where any raw/input data (xlsx/csv/rds) specific to the project are stored
  • Dataout - where any intermediary or final data (xlsx/csv/rds) are output as a product of your code
  • Data_public - where any public data should be stored. This can be posted to github.
  • Scripts - where all the code (R/py) are stored (if there is a local order, make sure to add prefixes to each script, e.g. 00_init.R, 01_data-access.R, 02_data-munging.R, …)
  • Images - any png/jpeg visual outputs from your code
  • Graphics - any svg/pdf visual outputs that will be edited in vector graphics editor, eg Adobe Illustrator or Inkscape
  • AI - any ai files or other files from a graphics editor (exported pngs products will be stored in Images)
  • GIS - any shp files or other GIS releated inputs
  • Documents - any docx/xlsx/pptx/pdf documents that relate to the process or are final outputs
  • markdown - exported md files from a knitr report

If you clone from GitHub not all the folders may exist there, so we recommend running folder_setup() on its own to establish the same organization on your local machine. The reason we have created this into a function is to ensure uniformity across our projects and know what to expect when we pick up someone else’s work (or our own work from the distant past).

The next function that is run is setup_gitignore(), which creates a .gitignore file in your project. The primary purpose of this file is to ensure that our data and outputs are not being published to the web. By default after running si_setup() most data formats/folders will be kept from being published. Below is what the .gitignore file will look like.

  #R basics
  .Rproj.user
  .Rhistory
  .RData
  .Ruserdata
  
  #no data
  *.csv
  *.gz
  *.txt
  *.rds
  *.xlsx
  *.xls
  *.zip
  *.png
  *.pptx
  *.tfl
  *.twb
  *.twbx
  *.tbs
  *.tbm
  *.hyper
  *.sql
  *.parquet
  *.svg
  *.dfx
  *.json
  *.dta
  *.shp
  *.dbf
  
  #nothing from these folders
  AI/*
  GIS/*
  Images/*
  Graphics/*
  Data/*
  Dataout/*

And the last component from si_setup() is to add a standard USAID disclaimer to your project’s README.md file (and creates this file if its missing).

From a project standpoint, now you’re good to go.

Accessing Data

For the most part, our SI work revolves around using the MER Structured Datasets, which are large, cumbersome files. A best practice when working on projects is to store the data you use within that project so its all self-contained. However, since most of our work revolve around a couple of massive datasets (OUxIM, PSNU, PSNUxIM, NAT_SUBNAT, FSD), it makes more sense for us to store these dataset in a central location on our machines rather than in each project.

The problem now is that where I store my MSD files is going to be different than the path to yours. To solve this dilemma, we use a function called si_paths() which access the paths we have stored locally to where our MSDs or Downloads are for instance. This way, when you pick up a coworkers code, you don’t have to change any of the file paths, it just works.

Those local paths will be set once and stored in your .Rprofile. To do so, you will run set_paths to store all the relevant paths (you can ignore any that aren’t relevant to you).

set_paths(folderpath_msd = "~/Documents/Data",
  folderpath_datim =  "~/Documents/DATIM",
  folderpath_downloads =  "~/Downloads")

Running this will open your .Rprofile and you’ll be prompted to paste the pre-copied code into the .Rprofile save and restart your session.

With that stored, you can use si_path() to return the path to your MSD folder (as the default) and then use another glamr function, return_latest(), which looks in the provide folder path for the lastest version of a file that matches the pattern you provided. In this case, we can pass in that we want the last OUxIM MSD and it will return the file path which we can pass into readr::read_rds()

df <- si_path() %>%
  return_latest("OU_IM_FY19") %>% 
  readr::read_rds()