accessing-data.Rmd
In OHA/SI, we our days are filled with pouring over MER Structure Datasets (MSD). These data help inform programming and to make policy decisions, largely through visualization to communicate our analyses. MER data are not publicly available in the forms we have access to.work with, which makes it challenging to share work products, develop training material, or test database schemas. To alleviate these concerns while keeping true to our dataset structure, we have developed a masked data that can easily be supplemented or stand in place of the MSD.
The training dataset is masked with geographic and team names from Minor
League Baseball to stand in for snu1
,
psnu
, mech_name
and
prime_partner_name
. Other unique identifies, e.g. “uids”,
have been generated (msk_gen_uid()
) or in more limited
cases have been striped out entirely. The masked operating unit/country
is the Kingdom of Minoria, which consists of four regions,
snu1
, each with four districts, psnu
(minoria_geo
).
The data consist of only indicators from the PEPFAR HIV clinical
cascade (see gophr::cascade_ind
).
Since these data have been masked, we are able to host them
publicly on our GitHub repository via releases. Each quarter, we
will do our best to update here for your use. To view the PEPFAR MSD
schedule, see glamr::pepfar_data_calendar
. There are a few
different ways for you at access the data to start using them.
The easiest way to access the data are directly from R using this
package. We have built a wrapper around the piggyback
package to work with releases. You can use msk_available()
to check the latest masked MSD version available and then
msk_download()
to download the data. The
msk_available()
will give you a warning if there is a newer
MSD available if you want to create your own version.
#Install OHA packages from rOpenSci
oha_pkgs <- c("gophr", "glamr", "glitr")
install.packages(oha_pkgs, repos = c('https://usaid-oha-si.r-universe.dev', 'https://cloud.r-project.org'))
#Load Dependencies
library(themask)
library(dplyr)
library(gophr)
library(glamr)
library(ggplot2)
library(glitr)
#check if you have the latest version available
msk_available()
#> The latest available masked dataset is PSNU_IM_FY59-62_20241115_v1_1
#> All available masked dataset for download:
#> ✔ 2024.11.15.i [latest]
#>
#> • 2024.08.16.i
#>
#> • 2024.06.14.c
#>
#> • 2024.05.17.i
#>
#> • 2024.03.15.c
#>
#> • 2024.02.15.i
#>
#> • 2023.11.14.i
#>
#> • 2023.09.15.c
#>
#> • 2023.06.16.c
#>
#> By default, the latest file is downloaded but you can specify the version from
#> above list in the tag param of `msk_download()`
From the file path, we can see the date and version in the file name which will match the MSD releases. From this file, we can see that the latest masked version available is from 2023-09-15, which is from Q3, and the v2 indicates this is the quarter’s clean version (as opposed to the initial which would be v1).
To download, we can just run msk_download()
.
#download folder path
download_folder <- temp_folder()
#> ℹ A temporary folder is now available here: '/tmp/RtmpXY9rcV/file195362a56c2f/'
#> ℹ The folder path is stored as the object `folderpath_tmp`
#download
msk_download(download_folder)
#> ✔ The masked MSD has been downloaded to /tmp/RtmpXY9rcV/file195362a56c2f
In the above example, we could have specified the param
launch = TRUE
which opens up the Window’s folder specified
automatically. The other parameter not used is the tag
,
which you can use to specify an
older, available version to use,
e.g. tag = 2023.05.31.i
.
To load the data into R, you can use your normal
gophr::read_psd()
to get going.
df_msd <- return_latest(download_folder, "PSNU_IM") %>%
read_psd()
#> ℹ Latest file in 'file195362a56c2f' matching 'PSNU_IM':
#> 'MER_Structured_TRAINING_Datasets_PSNU_IM_FY59-62_20241115_v1_1.zip'
glimpse(df_msd)
#> Rows: 58,681
#> Columns: 48
#> $ operatingunit <chr> "Minoria", "Minoria", "Minoria", "Minoria", "…
#> $ operatingunituid <chr> "kAYJjRwuXG6", "kAYJjRwuXG6", "kAYJjRwuXG6", …
#> $ country <chr> "Minoria", "Minoria", "Minoria", "Minoria", "…
#> $ snu1 <chr> "Midwest", "Midwest", "Midwest", "Midwest", "…
#> $ snu1uid <chr> "Dh3xiXWw0Z7", "Dh3xiXWw0Z7", "Dh3xiXWw0Z7", …
#> $ cop22_psnu <chr> "Great Lakes", "Peoria", "Wisconsin", "Great …
#> $ cop22_psnuuid <chr> "gwmjspIcLT4", "icPkb3tE61B", "XwREtn25Uj1", …
#> $ cop22_snuprioritization <chr> "1 - Scale-Up: Saturation", "1 - Scale-Up: Sa…
#> $ psnu <chr> "Great Lakes", "Peoria", "Wisconsin", "Great …
#> $ psnuuid <chr> "gwmjspIcLT4", "icPkb3tE61B", "XwREtn25Uj1", …
#> $ snuprioritization <chr> "1 - Scale-Up: Saturation", "1 - Scale-Up: Sa…
#> $ typemilitary <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N", …
#> $ dreams <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N", …
#> $ prime_partner_name <chr> "Bats", "Crawdads", "Cubs", "Captains", "Chie…
#> $ funding_agency <chr> "HHS/CDC", "USAID", "USAID", "HHS/CDC", "HHS/…
#> $ mech_code <chr> "01549", "01568", "01441", "01498", "01470", …
#> $ mech_name <chr> "Bats", "Crawdads", "Cubs", "Captains", "Chie…
#> $ local_prime_partner <chr> "Local", "Local", "Local", "Local", "Local", …
#> $ indicator <chr> "TX_NEW", "TX_PVLS", "TX_CURR", "HTS_TST_POS"…
#> $ numeratordenom <chr> "N", "D", "N", "N", "D", "N", "N", "N", "N", …
#> $ indicatortype <chr> "DSD", "DSD", "DSD", "DSD", "DSD", "DSD", "DS…
#> $ standardizeddisaggregate <chr> "Age/Sex/HIVStatus", "Age/Sex/Indication/HIVS…
#> $ categoryoptioncomboname <chr> "25-29, Male, Positive", "10-14, Male, Routin…
#> $ use_for_age <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", …
#> $ ageasentered <chr> "25-29", "10-14", "25-34", "15-24", "35-39", …
#> $ age_2018 <chr> "25-29", "10-14", "Newer Age Band", "Newer Ag…
#> $ age_2019 <chr> "25-29", "10-14", "Newer Age Band", "Newer Ag…
#> $ target_age_2024 <chr> "25-34", "10-14", "25-34", "15-24", "35-49", …
#> $ trendscoarse <chr> "15+", "<15", "15+", "15+", "15+", "<15", "<1…
#> $ sex <chr> "Male", "Male", "Female", "Female", "Male", "…
#> $ statushiv <chr> "Positive", "Positive", "Positive", "Positive…
#> $ statustb <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ statuscx <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ hiv_treatment_status <chr> "New", NA, "Already", NA, NA, NA, NA, NA, NA,…
#> $ otherdisaggregate <chr> NA, "Routine", NA, NA, "Routine", NA, NA, NA,…
#> $ otherdisaggregate_sub <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ modality <chr> NA, NA, NA, "Other Facility", NA, "Other PITC…
#> $ target_modality_2024 <chr> "N/A", "N/A", "N/A", "Other Facility", "N/A",…
#> $ fiscal_year <int> 2059, 2059, 2061, 2061, 2059, 2061, 2061, 206…
#> $ targets <dbl> NA, NA, 2272, 4, NA, NA, NA, NA, NA, NA, NA, …
#> $ qtr1 <dbl> 1, 59, NA, NA, NA, 2, 0, NA, NA, NA, NA, 402,…
#> $ qtr2 <dbl> NA, 55, NA, NA, NA, NA, 0, 2, NA, 6, NA, 431,…
#> $ qtr3 <dbl> NA, 68, NA, NA, NA, 1, 0, 2, 1, NA, 7, 450, N…
#> $ qtr4 <dbl> NA, 66, NA, NA, 1, 1, -1, 8, 1, NA, 5, 428, N…
#> $ cumulative <dbl> 1, 66, NA, NA, 1, 4, -1, 12, 2, 6, 12, 428, N…
#> $ safe_for_net_new <chr> "Y", "Y", "N", "N", "Y", "Y", "Y", "Y", "Y", …
#> $ safe_for_vlc <chr> "Y", "Y", "N", "N", "N", "Y", "Y", "Y", "Y", …
#> $ source_name <chr> "DATIM", "DATIM", "DATIM", "Derived", "DATIM"…
Also note that if you use the masked dataset, you can use
gophr::get_metadata()
, specifying the path to the file and
it will provide information about the masked dataset you can use in the
caption for your plot or other documentation.
meta <- get_metadata(download_folder, "PSNU_IM")
#> ℹ Latest file in 'file195362a56c2f' matching 'PSNU_IM':
#> 'MER_Structured_TRAINING_Datasets_PSNU_IM_FY59-62_20241115_v1_1.zip'
#> Warning: metadata is NO LONGER (v3.2.3) exported by default as a global
#> object.
#> ℹ You must store the output as an object to use, e.g. `meta <-
#> get_metadata()`
meta
#> $curr_pd
#> FY61Q4
#>
#> $curr_fy
#> [1] 2061
#>
#> $curr_fy_lab
#> FY61
#>
#> $curr_qtr
#> [1] 4
#>
#> $source
#> FY61Q4i Faux Training MSD
#>
#> $caption
#> Source: FY61Q4i Faux Training MSD
For those not using R, fear not! You can use these data as well. Instead of having a direct function to do these things, you can manually navigate to the Releases page on GitHub. There, you will want to click on the release tagged “latest”. You can then select the MSD to download it and use it as you normally would a MSD.
If you run msk_available()
and it is out of date, or you
want to create a masked NAT_SUBNAT dataset, you can create your own if
you have the full MSD or NAT_SUBNAT
dataset.
#path to local NAT_SUBNAT dataset
subnat_path <- si_path() %>% return_latest("NAT_SUBNAT")
#store a masked NAT_SUBNAT dataset
msk_create(subnat_path, "../Documents/Data")
#generate a masked NAT_SUBNAT dataset to load to your session
df_masked <- msk_create(subnat_path)
In addition to the masked MSD files produced, we have also created shapefiles that can be used for mapping.
Within themask
package itself, we have a few
sf
dataframes that can be interacted with and plotted -
minoria_shp_ou
, minoria_shp_snu1
,
minoria_shp_psnu
.
ggplot() +
geom_sf(data = minoria_shp_psnu, aes(geometry = geometry, fill = snu1), color = "white", linewidth = 0.25, alpha = 0.5) +
geom_sf(data = minoria_shp_ou, aes(geometry = geometry), fill = NA, color = "white", linewidth = 1) +
geom_sf_text(data = minoria_shp_psnu, aes(geometry = geometry, label = psnu), size = 7/.pt) +
scale_fill_si(palette = "siei", discrete = TRUE) +
scale_color_si(palette = "siei", discrete = TRUE) +
si_style_map() +
labs(x = "", y = "",
title = stringr::str_wrap("INTRODUCING MINORIA, THE NEWEST OU TO PEPFAR", width = 32),
subtitle = "PSNUs labeled on map",
caption = "Source: Minoria faux shapefiles") +
theme(panel.background = element_rect(fill = "#edf5fc", color = "white"),
legend.position = "none")
The shapefile can also be downloaded manually for use in your GIS software of choice, e.g. ArcGIS, QGIS, Tableau. You can download these files from the repository’s Releases page, clicking on the release tagged “shapefiles”. Download the zip file(s) you need and unzip them to start using them with your mapping software.