Reproducible Analysis and Reporting for QBA

Author

Koichiro Shiba

Published

January 2, 2025

Overview

This document demonstrates principles of reproducible and efficient data analysis: writing functions, automating repeated analyses with purrr, generating tables and figures from scripts, and using Quarto to produce reproducible reports. We will use summary-level exposure misclassification adjustment as a motivating example.

Performing PBA with a single set of bias parameters

PBA is done via Monte Carlo simulation, where estimates are corrected using bias parameters drawn from probability distributions. If you know the underlying theory, you could implement this from scratch; however, it would require hundreds of lines of code every time you perform a PBA.

Fortunately, you don’t have to do this - you will use the pba.summary.exp.rr function that has already been created and made available for this course. The function takes the following inputs (a short sketch after the list illustrates what the distributional parameters imply).

  • Observed data
    • \(a\) = number of exposed cases
    • \(b\) = number of unexposed cases
    • \(c\) = number of exposed controls
    • \(d\) = number of unexposed controls
  • Bias parameters for sensitivity (beta distribution)
    • \(se1.a\) = alpha parameter for sens among cases
    • \(se1.b\) = beta parameter for sens among cases
    • \(se0.a\) = alpha parameter for sens among controls
    • \(se0.b\) = beta parameter for sens among controls
  • Bias parameters for specificity (trapezoidal)
    • Case Specificity
      • \(sp1.min\) = lower bound of trapezoidal dist’n
      • \(sp1.mod1\) = first mode of trapezoidal
      • \(sp1.mod2\) = second mode of trapezoidal
      • \(sp1.max\) = upper bound of trapezoidal dist’n
    • Control Specificity
      • \(sp0.min\) = lower bound of trapezoidal dist’n
      • \(sp0.mod1\) = first mode of trapezoidal
      • \(sp0.mod2\) = second mode of trapezoidal
      • \(sp0.max\) = upper bound of trapezoidal dist’n
  • \(SIMS\) = number of simulations
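Before using the function, it helps to see what these distributional bias parameters imply. The short sketch below (not part of the distributed code) samples from the same beta and trapezoidal families the function draws from, so you can check that your parameter choices encode plausible sensitivities and specificities.

Code
# Sanity check (not part of the course function): what do the example bias
# parameters imply? rbeta() and rtrapezoid() are the random-sampling
# counterparts of the quantile functions used inside pba.summary.exp.rr.
library(trapezoid)
set.seed(1)
summary(rbeta(1e5, shape1 = 25, shape2 = 3)) # sensitivity among cases; mean ~ 25/28 = 0.89
summary(rtrapezoid(1e5, min = 0.9, mode1 = 0.93, mode2 = 0.97, max = 1)) # specificity among cases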

The source code for the pba.summary.exp.rr function is embedded below.

Code
# Load packages
library(MASS)
library(trapezoid)
library(tidyverse)
library(sandwich)
select <- dplyr::select # ensure select() refers to dplyr's version rather than the one from MASS

#Macro for summary level exposure misclass bias adjustment
#Inputs for this macro are as follows:
#Data
  #a=number of exposed cases
  #b=number of unexposed cases
  #c=number of exposed controls
  #d=number of unexposed controls

#bias parameters for sensitivity (beta distribution)
  #se1.a=alpha parameter for sens among cases
  #se1.b=beta parameter for sens among cases
  #se0.a=alpha parameter for sens among controls
  #se0.b=beta parameter for sens among controls

#Bias parameters for specificity (trapezoidal)
  #Case Specificity
  #sp1.min = lower bound of trapezoidal dist'n
  #sp1.mod1 = first mode of trapezoidal
  #sp1.mod2 = second mode of trapezoidal
  #sp1.max = upper bound of trapezoidal dist'n

  #Control Specificity
  #sp0.min = lower bound of trapezoidal dist'n
  #sp0.mod1 = first mode of trapezoidal
  #sp0.mod2 = second mode of trapezoidal
  #sp0.max = upper bound of trapezoidal dist'n
#Sims=number of simulations

pba.summary.exp.rr<-function(a,b,c,d,se1.a,se1.b,se0.a,se0.b,sp1.min,sp1.mod1,sp1.mod2,sp1.max,sp0.min,sp0.mod1,sp0.mod2,sp0.max,SIMS){
  
  
  n_case <- a + b
  n_ctrl <- c + d
  n_exp  <- a + c
  n_unexp<- b + d
  
  # set some important parameters
  niter <- SIMS #number of iterations
  
  # draw correlated sensitivities and specificities
  # correlation =rho1
  rho1=.80
  V=matrix(c(1,rho1,rho1,1),ncol=2)
  Z=mvrnorm(niter,c(0,0),V)
  U=pnorm(Z)
  #NOTE: if you have non-differential misclassification, you would sample se1~beta(niter,se1.a,se1.b) and then set se0<-se1
  se1=qbeta(U[,1],se1.a,se1.b)
  se0=qbeta(U[,2],se0.a,se0.b)
  
  rho2=.80
  V=matrix(c(1,rho2,rho2,1),ncol=2)
  Z=mvrnorm(niter,c(0,0),V)
  U=pnorm(Z)
  #NOTE: similar to above, if you have non-differential misclassification, sample sp1 and then set sp0=sp1
  sp1=qtrapezoid(U[,1],sp1.min,sp1.mod1,sp1.mod2,sp1.max)
  sp0=qtrapezoid(U[,2],sp0.min,sp0.mod1,sp0.mod2,sp0.max)
  
  # calculate bias-adjusted cell frequencies
  ac <- (a - n_case*(1-sp1))/(se1 - (1-sp1)) #bias-adjusted cases, exposed 
  bc <- n_case - ac #bias-adjusted cases, unexposed
  cc <- (c - n_ctrl*(1-sp0))/(se0 - (1-sp0)) #bias-adjusted controls, exposed
  dc <- n_ctrl - cc #bias-adjusted controls, unexposed
  flag<-(ac<=0)|(bc<=0)|(cc<=0)|(dc<=0)
  
  #calculate prevalence of exposure in cases and controls, accounting for sampling error
  PrevE_cases<-rbeta(niter,ac,bc)
  PrevE_controls <- rbeta(niter,cc,dc)
  
  #calculate PPV and NPV of exposure classification in cases and controls
  PPV_case <- (se1*PrevE_cases)/((se1*PrevE_cases)+(1-sp1)*(1-PrevE_cases))
  PPV_control <- (se0*PrevE_controls)/((se0*PrevE_controls)+(1-sp0)*(1-PrevE_controls))
  NPV_case <- (sp1*(1-PrevE_cases))/((1-se1)*PrevE_cases+sp1*(1-PrevE_cases))
  NPV_control <- (sp0*(1-PrevE_controls))/((1-se0)*PrevE_controls+sp0*(1-PrevE_controls))
  
  #calculate the expected number of exposed cases and controls
  ab <- rbinom(niter,a,PPV_case)+rbinom(niter,b,1-NPV_case)
  bb <- n_case-ab
  cb <- rbinom(niter,c,PPV_control)+rbinom(niter,d,1-NPV_control)
  db <- n_ctrl-cb
  
  #calculate bias adjusted RR with second source of uncertainty
  rr_bb <- (ab/(ab+cb))/(bb/(bb+db))
  
  # calculate bias-adjusted risk ratios, third source of uncertainty, bias-adjusted standard error
  se_bb <- sqrt(1/ab+1/bb-1/(ab+cb)-1/(bb+db))
  z <- rnorm(niter)
  rr_bb_cb <- exp(log(rr_bb) - (z*se_bb))
  
  # bind vectors as dataframe
  bound <- data.frame(ac, bc, cc, dc, ab, bb, cb, db, rr_bb_cb)
  
  # housekeeping - retain rows with positive corrected cell freqs
  summary_pba <- bound %>%
    filter(ac > 0) %>%
    filter(bc > 0) %>%
    filter(cc > 0) %>%
    filter(dc > 0) %>%
    filter(ab > 0) %>%
    filter(bb > 0) %>%
    filter(cb > 0) %>%
    filter(db > 0)
  
  q3<-c(quantile(summary_pba$rr_bb_cb,c(0.025,0.5,0.975)),niter-length(summary_pba$rr_bb_cb))
  names(q3)<-c("2.5th %tile","50th %tile", "97.5th %tile","Impossible values")
  return(q3)
}
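At the heart of the function is the standard misclassification back-calculation, implemented in the line ac <- (a - n_case*(1-sp1))/(se1 - (1-sp1)). The observed exposed-case count \(a\) mixes true positives and false positives,

\[a = se_1 \cdot a_{adj} + (1 - sp_1)(N_{case} - a_{adj}),\]

and solving for the bias-adjusted count gives

\[a_{adj} = \frac{a - N_{case}(1 - sp_1)}{se_1 - (1 - sp_1)}.\]

The same algebra applied to the controls yields the cc line, and draws for which any adjusted cell count is non-positive are flagged and later discarded as impossible values.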

The function outputs the 2.5th, 50th (median), and 97.5th percentiles of the bias-adjusted risk ratios, along with the number of impossible values (simulations yielding non-positive adjusted cell counts). The sample scenario in the distributed code used the following parameter values:

Specified parameter values in the example
Parameter Value
\(a\) 40
\(b\) 20
\(c\) 60
\(d\) 80
\(se1.a\) 25
\(se1.b\) 3
\(se0.a\) 45
\(se0.b\) 7
\(sp1.min\) 0.9
\(sp1.mod1\) 0.93
\(sp1.mod2\) 0.97
\(sp1.max\) 1
\(sp0.min\) 0.8
\(sp0.mod1\) 0.83
\(sp0.mod2\) 0.87
\(sp0.max\) 0.9
\(SIMS\) 10^6

The function should produce output similar to what is shown below. Due to the probabilistic nature of Monte Carlo simulation, you may get slightly different numbers each time you run the code.

Code
pba.summary.exp.rr(a = 40,b = 20,c = 60,d = 80,
                   se1.a = 25, se1.b = 3, se0.a = 45, se0.b = 7,
                   sp1.min = .9, sp1.mod1 = .93, sp1.mod2 = .97, sp1.max = 1,
                   sp0.min = .8, sp0.mod1 = .83, sp0.mod2 = .87, sp0.max = .9,
                   SIMS = 10^6) |> 
  round(2)
      2.5th %tile        50th %tile      97.5th %tile Impossible values 
             1.54              2.84              7.77           3166.00 

The output is a plain numeric vector in R, so it does not look polished when rendered in an HTML Quarto report (like this document). Many packages can make tables prettier in a Quarto document; here is an example using knitr::kable() and the kableExtra package.

Code
library(kableExtra)
library(knitr)
output.single <- pba.summary.exp.rr(a = 40,b = 20,c = 60,d = 80,
                                    se1.a = 25, se1.b = 3, se0.a = 45, se0.b = 7,
                                    sp1.min = .9, sp1.mod1 = .93, sp1.mod2 = .97, sp1.max = 1,
                                    sp0.min = .8, sp0.mod1 = .83, sp0.mod2 = .87, sp0.max = .9,
                                    SIMS = 10^6) |> 
  round(2)

output.single |> 
  broom::tidy() |> # transform the output into a tibble
  pivot_wider(names_from = "names", values_from = "x") |> # wider format 
  kable() |> # transform the tibble into an HTML table
  kable_paper("hover",full_width = F) # use functions from the kableExtra package to format the table
2.5th %tile 50th %tile 97.5th %tile Impossible values
1.54 2.84 7.8 3307
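If you prefer a different table engine, the same tidy-then-render pattern works. Below is a minimal sketch using the gt package as one alternative (gt is an assumption here; it is not used elsewhere in this course).

Code
# One alternative to kable/kableExtra (assumes the gt package is installed):
# render the same tibble as an HTML table
library(gt)
output.single |> 
  broom::tidy() |> 
  pivot_wider(names_from = "names", values_from = "x") |> 
  gt()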

Repeating PBA with changing parameter values

You now know how to use the pba.summary.exp.rr function to perform the PBA with a single set of pre-specified parameter values (e.g., alphas and betas for sensitivity). These values are determined based on validation-study results and subject-matter knowledge. But what if you want to repeat the PBA with varying parameter values?

You may be tempted to copy and paste the pba.summary.exp.rr call, modify the parameter values, and save the outputs; to repeat this process for every desired combination of parameter values; and then to manually create a summary table (entering each QBA result and its corresponding parameter values into a spreadsheet).

DO NOT DO THIS. This approach is very time consuming and introduces inevitable human errors (and hence a lack of reproducibility), for reasons such as:

  • Failing to change parameter values or incorrectly typing parameter values
  • Failing to rename output objects
  • Using incorrect objects or entering wrong numbers when creating the summary table

Instead, do the following:

  1. Write a generic function that does the following:

    1. Uses a list of parameter value combinations as an input
    2. Applies each combination of parameter values to the pba.summary.exp.rr function
    3. Automatically repeats this process for all combinations
  2. Create a list of desired parameter value combinations

  3. Apply the function to the list you created

  4. Use scripts to generate tables and figures summarizing the sensitivity analysis

Example: Changing beta parameters

Below is an example in which the PBA is repeated with varying beta parameters for sensitivity among cases and non-cases. That is, you will change only \(se1.b\) and \(se0.b\) while fixing all other parameters at the values used in the original example above.

1. Write a function that uses \(se1.b\) and \(se0.b\) as inputs

The code below creates an original function named perform.PBA.with.changingBetas. This function takes user-specified values of \(se1.b\) and \(se0.b\) as inputs and applies them to the pba.summary.exp.rr function while holding all other parameter values constant.

Note that the function calls set.seed() to fix the seed value used in the PBA. This ensures full reproducibility of the results.
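A quick way to convince yourself of this: re-setting the seed reproduces the random draws exactly, as this small sketch shows.

Code
# Identical seeds give identical draws, so the PBA percentiles are stable across runs
set.seed(123); x1 <- rbeta(3, 25, 3)
set.seed(123); x2 <- rbeta(3, 25, 3)
identical(x1, x2) # TRUE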

Code
library(purrr)
library(flextable)
library(officer)

# Function to perform the pba with specified betas
perform.PBA.with.changingBetas <- function(se1b, se0b){
  set.seed(123)
  rr.out <- pba.summary.exp.rr(a = 40,b = 20,c = 60,d = 80,
                               se1.a = 25, se1.b = se1b, se0.a = 45, se0.b = se0b,
                               sp1.min = .9, sp1.mod1 = .93, sp1.mod2 = .97, sp1.max = 1,
                               sp0.min = .8, sp0.mod1 = .83, sp0.mod2 = .87, sp0.max = .9,
                               SIMS = 10^6)
  
  scenario.specific.output <- round(rr.out,2) |> 
    broom::tidy() |> 
    pivot_wider(names_from = "names", values_from = "x") 
  
  return(scenario.specific.output)
}
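For example, a single call with the original beta values (\(se1.b\) = 3 and \(se0.b\) = 7) reproduces the earlier one-off analysis, now with a fixed seed:

Code
# One call = one scenario; these inputs match the original example
perform.PBA.with.changingBetas(se1b = 3, se0b = 7)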

2. Create a list of desired parameter values

We will specify a range of beta parameter values by fixing the alpha values while varying the mean sensitivity. In the original example, the mean sensitivities among cases and non-cases under the beta distributions were approximately:

\[E[se1] = \frac{se1.a}{(se1.a + se1.b)} = \frac{25}{(25+3)}=0.89\]

\[E[se0] = \frac{se0.a}{(se0.a + se0.b)} = \frac{45}{(45+7)}=0.87\]

Let’s consider mean sensitivities among cases and non-cases each ranging from 0.85 to 1.0 in increments of 0.05. These values can be specified via seq(0.85, 1, 0.05) in R. You can back-calculate the beta values from the mean sensitivity and the fixed alpha value:

\[se1.b = \frac{se1.a}{E[se1]}-se1.a = \frac{25}{E[se1]}-25\]

\[se0.b = \frac{se0.a}{E[se0]}-se0.a = \frac{45}{E[se0]}-45\]

This gives us the following sixteen scenarios with their corresponding beta values.

Code
# Back calculate betas based on the sensitivity values for cases and non-cases
parms.betas <- expand_grid(
  mean.se1s = seq(0.85, 1, 0.05),
  mean.se0s = seq(0.85, 1, 0.05)
) |> 
  as_tibble(rownames = "scenario") |> 
  mutate(
    se1.b = 25/mean.se1s-25,
    se0.b = 45/mean.se0s-45,  
  ) 

parms.betas |> 
  kable() |> 
  kable_paper("hover",full_width = F)
scenario mean.se1s mean.se0s se1.b se0.b
1 0.85 0.85 4.411765 7.941177
2 0.85 0.90 4.411765 5.000000
3 0.85 0.95 4.411765 2.368421
4 0.85 1.00 4.411765 0.000000
5 0.90 0.85 2.777778 7.941177
6 0.90 0.90 2.777778 5.000000
7 0.90 0.95 2.777778 2.368421
8 0.90 1.00 2.777778 0.000000
9 0.95 0.85 1.315789 7.941177
10 0.95 0.90 1.315789 5.000000
11 0.95 0.95 1.315789 2.368421
12 0.95 1.00 1.315789 0.000000
13 1.00 0.85 0.000000 7.941177
14 1.00 0.90 0.000000 5.000000
15 1.00 0.95 0.000000 2.368421
16 1.00 1.00 0.000000 0.000000

These beta values are now stored as a tibble object (i.e., a data table), so you do not need to type them manually. The tibble looks like this:

Code
parms.betas |> 
  head()
# A tibble: 6 × 5
  scenario mean.se1s mean.se0s se1.b se0.b
  <chr>        <dbl>     <dbl> <dbl> <dbl>
1 1             0.85      0.85  4.41  7.94
2 2             0.85      0.9   4.41  5   
3 3             0.85      0.95  4.41  2.37
4 4             0.85      1     4.41  0   
5 5             0.9       0.85  2.78  7.94
6 6             0.9       0.9   2.78  5   

3. Apply the function to the specified parameter values

Instead of copying and pasting perform.PBA.with.changingBetas and manually updating the beta inputs, we will automate this repetition using the purrr package. Its map function calls a specified function once for each element of a vector. With purrr::map2, you can supply two vectors as inputs - in this case, the columns parms.betas$se1.b and parms.betas$se0.b. The code below adds a new column, rr.outputs, containing the output of perform.PBA.with.changingBetas for each row’s beta values; unnest() then expands that column so you can see what is inside.
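If map2 is new to you, here is a minimal toy illustration (unrelated to the PBA) of how it pairs up elements from two vectors:

Code
# map2() calls the supplied function once per pair of elements and returns a list
purrr::map2(c(1, 2), c(10, 20), \(x, y) x + y)
# -> list(11, 22); map2_dbl() would return a plain numeric vector instead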

Applied to our grid of betas, the output looks like this:

Code
# Apply the new function to the grid of betas
outputs.sensitivity <- parms.betas |> 
  mutate(rr.outputs = map2(se1.b,se0.b, perform.PBA.with.changingBetas)) |> 
  unnest(rr.outputs) |> 
  mutate(across(c(se1.b, se0.b), ~ round(.x, 2)))

outputs.sensitivity |> 
  head() 
# A tibble: 6 × 9
  scenario mean.se1s mean.se0s se1.b se0.b `2.5th %tile` `50th %tile`
  <chr>        <dbl>     <dbl> <dbl> <dbl>         <dbl>        <dbl>
1 1             0.85      0.85  4.41  7.94          1.65         3.26
2 2             0.85      0.9   4.41  5             1.77         3.51
3 3             0.85      0.95  4.41  2.37          1.89         3.74
4 4             0.85      1     4.41  0             1.97         3.95
5 5             0.9       0.85  2.78  7.94          1.48         2.71
6 6             0.9       0.9   2.78  5             1.6          2.92
# ℹ 2 more variables: `97.5th %tile` <dbl>, `Impossible values` <dbl>

4. Use scripts to generate tables and figures

You can now use scripts to generate a table from the output tibble. You can group some rows (see what I did for the second column), rename columns for better readability, and so on. Then use the flextable package to transform the tibble into a table object, where you can specify fonts, layout, etc.

Code
# Helper function: blank out repeated values of a grouping variable so that
# consecutive rows with the same value display as a single group
collapse_rows_df <- function(df, variable){
  group_var <- enquo(variable)
  df %>%
    group_by(!! group_var) %>%
    mutate(groupRow = 1:n()) %>%
    ungroup() %>%
    mutate(!!quo_name(group_var) := ifelse(groupRow == 1, as.character(!! group_var), "")) %>%
    select(-c(groupRow))
}

# Create a publishable table
tbl.output.betas <- outputs.sensitivity |> 
  collapse_rows_df(mean.se1s) |> # apply the function above to group some rows
  rename( # rename variables for readability
    Scenario = scenario,
    `E[se1]` = mean.se1s,
    `E[se0]` = mean.se0s
  ) |> 
  flextable() |>  # Transform a tibble into a flextable
  set_caption(
    "Table. QBA with different betas", # add a title
    align_with_table = F,
    style = fp_text( # specify formatting rules
      font.family = "Times", 
      font.size = 12,
      bold = FALSE,
      italic = FALSE
    )
  ) |> 
  fontsize(size = 12, part = "all") |> 
  font(fontname = "Times", part = "all") |> 
  set_table_properties(layout = "autofit")

tbl.output.betas

Table. QBA with different betas

Scenario  E[se1]  E[se0]  se1.b  se0.b  2.5th %tile  50th %tile  97.5th %tile  Impossible values
1         0.85    0.85    4.41   7.94   1.65         3.26        13.42         15,292
2                 0.90    4.41   5.00   1.77         3.51        14.80         15,334
3                 0.95    4.41   2.37   1.89         3.74        16.38         15,320
4                 1.00    4.41   0.00   1.97         3.95        18.31         15,183
5         0.9     0.85    2.78   7.94   1.48         2.71        6.97          2,382
6                 0.90    2.78   5.00   1.60         2.92        7.68          2,439
7                 0.95    2.78   2.37   1.71         3.11        8.35          2,383
8                 1.00    2.78   0.00   1.79         3.28        9.29          2,387
9         0.95    0.85    1.32   7.94   1.34         2.33        4.48          178
10                0.90    1.32   5.00   1.46         2.52        4.86          170
11                0.95    1.32   2.37   1.57         2.69        5.25          175
12                1.00    1.32   0.00   1.66         2.84        5.75          174
13        1       0.85    0.00   7.94   1.19         2.06        3.50          0
14                0.90    0.00   5.00   1.31         2.22        3.72          0
15                0.95    0.00   2.37   1.43         2.37        3.93          0
16                1.00    0.00   0.00   1.55         2.52        4.14          0

(Blank E[se1] cells indicate the value carried over from the row above, per collapse_rows_df.)

The flextable object can be exported as an MS Word document using the officer package.

Code
my.fontprops <- fp_text(bold = F, italic = F)

read_docx() |> 
  body_add_flextable(tbl.output.betas) |> # insert the table into a new Word document
  body_add_fpar(fpar(ftext("", prop = my.fontprops))) |> # add an (empty) paragraph with the desired text format
  print(target = "reproducible_table_QBA.docx") # write the .docx file

This code saves the flextable above as an MS Word file named reproducible_table_QBA.docx, containing the same formatted table.
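Alternatively, flextable ships its own one-line wrapper, save_as_docx(), which wraps the same officer machinery:

Code
# Equivalent shortcut: export the flextable directly to a .docx file
save_as_docx(tbl.output.betas, path = "reproducible_table_QBA.docx")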

One of the benefits of storing an output as a tibble object is that you can use it to flexibly create figures with the ggplot2 package. Here is an example point-range plot of the median bias-adjusted estimates, with bars spanning the 2.5th and 97.5th percentiles, according to the specified mean sensitivities among cases and non-cases.

Code
## as a figure
pd <- position_dodge(width = -0.5) # dodge overlapping points/error bars; negative width reverses group order

outputs.sensitivity |> 
  mutate(mean.se0s = factor(mean.se0s),
         mean.se1s = factor(mean.se1s),
         mean.se1s = fct_rev(mean.se1s)) |> 
  ggplot(aes(x = mean.se1s, y = `50th %tile`, ymin = `2.5th %tile`, ymax = `97.5th %tile`, colour = mean.se0s)) +
  geom_point(position = pd) +
  geom_line(position = pd) +
  geom_errorbar(
  width = 0.01,
  #alpha = 0.5,
  linewidth = 0.5,
  position = pd
  ) +
  theme_minimal() +
  coord_flip() +
  scale_y_continuous(
    limits = c(1,max(outputs.sensitivity$`97.5th %tile`)),
    breaks = c(seq(1,max(outputs.sensitivity$`97.5th %tile`),1))
    ) +
  ggsci::scale_color_npg(name = str_wrap("Sensitivity Among Non-cases (Se0)", width = 25)) +
  xlab("Sensitivity Among Cases (Se1)") +
  ylab("50th Percentile Bias-adjusted RRs") +
  theme(
    text        = element_text(size = 12),
    axis.text   = element_text(size = 12),
    legend.text = element_text(size = 12),
    #legend.position = "bottom",
    #legend.direction = "vertical",
    legend.box.background = element_rect(fill = "white", color = "black"), 
  ) 
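For full reproducibility, save the figure from the script as well rather than exporting it by hand; the file name below is only illustrative.

Code
# Save the most recently printed ggplot to disk (file name is illustrative)
ggsave("pba_sensitivity_betas.png", width = 7, height = 5, dpi = 300)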

This is just one example of how you might do reproducible data analysis and reporting. You could extend this example to:

  1. Use other, more creative visualization approaches (modify Step 4)
  2. Allow a wider range of beta parameter values (modify Step 2)
  3. Create a new function that lets you vary other parameter values (e.g., the specificity parameters; modify Step 1)

The big-picture idea is that you should always write your own functions to automate repetition and scripts to generate tables and figures. Avoid copying, pasting, and manually typing numbers as much as possible.