Skip to contents

Complex simulation levels

Often, simulation levels will be simple, such as a vector of sample sizes:

sim <- new_sim()
sim %<>% set_levels(
  n = c(200,400,800)
)

However, there will be many instances in which more complex objects are needed. For these cases, instead of a vector of numbers or character strings, use a named list of lists. The toy example below illustrates this. Note that the list names ("Beta 1", "Beta 2", and "Normal") become the entries in the sim$results dataframe.

sim <- new_sim()
sim %<>% set_levels(
  n = c(10,100),
  distribution = list(
    "Beta 1" = list(type="Beta", params=c(0.3, 0.7)),
    "Beta 2" = list(type="Beta", params=c(1.5, 0.4)),
    "Normal" = list(type="Normal", params=c(3.0, 0.2))
  )
)
create_data <- function(n, type, params) {
  if (type=="Beta") {
    return(rbeta(n, shape1=params[1], shape2=params[2]))
  } else if (type=="Normal") {
    return(rnorm(n, mean=params[1], sd=params[2]))
  }
}
sim %<>% set_script(function() {
  x <- create_data(L$n, L$distribution$type, L$distribution$params)
  return(list("y"=mean(x)))
})
sim %<>% run()
#> Done. No errors or warnings detected.
sim$results
#>   sim_uid level_id rep_id   n distribution      runtime         y
#> 1       1        1      1  10       Beta 1 0.0006489754 0.2878745
#> 2       2        2      1 100       Beta 1 0.0041592121 0.3316068
#> 3       3        3      1  10       Beta 2 0.0003321171 0.8220943
#> 4       4        4      1 100       Beta 2 0.0003123283 0.7662802
#> 5       5        5      1  10       Normal 0.0003025532 3.0007964
#> 6       6        6      1 100       Normal 0.0002920628 2.9895281

Complex return data

In most situations, the results of simulations will be numeric. However, we may want to return more complex data, such as matrices, lists, or model objects. To do this, we include our complex return data in the list with the special key ".complex". This is illustrated in the toy example below, in which we estimate the parameters of a linear regression and returns these as numeric, but also return the estimated covariance matrix and the entire model object.

sim <- new_sim()
sim %<>% set_levels(n=c(10, 100, 1000))
create_data <- function(n) {
  x <- runif(n)
  y <- 3 + 2*x + rnorm(n)
  return(data.frame("x"=x, "y"=y))
}
sim %<>% set_config(num_sim=2)
sim %<>% set_script(function() {
  dat <- create_data(L$n)
  model <- lm(y~x, data=dat)
  return (list(
    "beta0_hat" = model$coefficients[[1]],
    "beta1_hat" = model$coefficients[[2]],
    ".complex" = list(
      "model" = model,
      "cov_mtx" = vcov(model)
    )
  ))
})
sim %<>% run()
#> Done. No errors or warnings detected.

After running this simulation, we can examine the numeric results directly by accessing sim$results or using the summarize function, as usual:

sim$results
#>   sim_uid level_id rep_id    n     runtime beta0_hat beta1_hat
#> 1       1        1      1   10 0.003430128  1.870858  3.571281
#> 2       4        1      2   10 0.001072884  3.201885  1.956946
#> 3       2        2      1  100 0.004134655  2.977784  2.117937
#> 4       5        2      2  100 0.001064777  3.048721  1.856485
#> 5       3        3      1 1000 0.001358509  2.983234  2.039734
#> 6       6        3      2 1000 0.001242876  2.964263  2.068172

However, we may also want to look at the complex return data. To do so, we use the special function get_complex, as illustrated below:

c5 <- get_complex(sim, sim_uid=5)
summary(c5$model)
#> 
#> Call:
#> lm(formula = y ~ x, data = dat)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -2.35517 -0.43454  0.09288  0.57668  2.16728 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   3.0487     0.2178  13.999  < 2e-16 ***
#> x             1.8565     0.3513   5.284 7.63e-07 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.9712 on 98 degrees of freedom
#> Multiple R-squared:  0.2218, Adjusted R-squared:  0.2138 
#> F-statistic: 27.93 on 1 and 98 DF,  p-value: 7.632e-07
c5$cov_mtx
#>             (Intercept)           x
#> (Intercept)  0.04742899 -0.06847956
#> x           -0.06847956  0.12341769

Using the batch function

The batch function is useful if you want to share data or objects between simulation replicates. Essentially, it allows you to take your simulation replicates and divide them into “batches”; all replicates in a given batch will then share a single set of objects. The most common use case for this is if you have a simulation that involves generating one dataset, analyzing it using multiple methods, and then repeating this a number of times. To illustrate the use of batch using this example, first consider the following simulation:

sim <- new_sim()
create_data <- function(n) { rnorm(n=n, mean=3) }
est_mean <- function(dat, type) {
  if (type=="est_mean") { return(mean(dat)) }
  if (type=="est_median") { return(median(dat)) }
}
sim %<>% set_levels(est=c("est_mean","est_median"))
sim %<>% set_config(num_sim=3)
sim %<>% set_script(function() {
  dat <- create_data(n=100)
  mu_hat <- est_mean(dat=dat, type=L$est)
  return(list(
    "mu_hat" = round(mu_hat,2),
    "dat_1" = round(dat[1],2)
  ))
})
sim %<>% run()
#> Done. No errors or warnings detected.

If we look at the “dat_1” column of the results object (equal to the first element of the dat vector created in the simulation script), we see that a unique dataset was created for each simulation replicate:

sim$results[order(sim$results$rep_id),]
#>   sim_uid level_id rep_id        est      runtime mu_hat dat_1
#> 1       1        1      1   est_mean 0.0003786087   3.00  3.37
#> 4       2        2      1 est_median 0.0040125847   3.08  3.79
#> 2       3        1      2   est_mean 0.0003399849   2.88  2.84
#> 5       5        2      2 est_median 0.0003337860   2.86  2.08
#> 3       4        1      3   est_mean 0.0003075600   2.90  4.79
#> 6       6        2      3 est_median 0.0003430843   3.02  1.83

Suppose that instead, we want to run a simulation where we generate one dataset, analyzing it using multiple methods (in this case corresponding to “est_mean” and “est_median”), and then repeating this twice. We can do this using the batch function, as follows:

sim <- new_sim()
create_data <- function(n) { rnorm(n=n, mean=3) }
est_mean <- function(dat, type) {
  if (type=="est_mean") { return(mean(dat)) }
  if (type=="est_median") { return(median(dat)) }
}
sim %<>% set_levels(est=c("est_mean","est_median"))
sim %<>% set_config(num_sim=3, batch_levels=NULL)
sim %<>% set_script(function() {
  batch({
    dat <- create_data(n=100)
  })
  mu_hat <- est_mean(dat=dat, type=L$est)
  return(list(
    "mu_hat" = round(mu_hat,2),
    "dat_1" = round(dat[1],2)
  ))
})
sim %<>% run()
#> Done. No errors or warnings detected.

In the code above, we changed two things. One, we added batch_levels=NULL to the set_config call; don’t worry about this for now. Second, we wrapped the code line dat <- create_data(n=100) inside the batch function. Whatever code goes inside the batch function will produce the same output for all simulations in a batch; in this case, if we look at the “dat_1” column of the results object, we can see that one dataset was created and shared by the batch corresponding to sim_uids 1 and 2:

sim$results[order(sim$results$rep_id),]
#>   sim_uid level_id rep_id        est      runtime mu_hat dat_1
#> 1       1        1      1   est_mean 0.0006170273   2.80  3.73
#> 4       2        2      1 est_median 0.0032284260   2.80  3.73
#> 2       3        1      2   est_mean 0.0039565563   3.19  3.18
#> 5       5        2      2 est_median 0.0004200935   3.14  3.18
#> 3       4        1      3   est_mean 0.0004267693   2.98  2.34
#> 6       6        2      3 est_median 0.0004649162   2.98  2.34

However, the situation is often more complicated. What if we have a simulation with multiple level variables, some that correspond to creating data and some that correspond to analyzing the data? This is where the batch_levels config option comes in. It is easy to use; simply specify the names of the level variables that are used within the batch function. Here is an example:

sim <- new_sim()
create_data <- function(n, mu) { rnorm(n=n, mean=mu) }
est_mean <- function(dat, type) {
  if (type=="est_mean") { return(mean(dat)) }
  if (type=="est_median") { return(median(dat)) }
}
sim %<>% set_levels(n=c(10,100), mu=c(3,5), est=c("est_mean","est_median"))
sim %<>% set_config(num_sim=2, batch_levels=c("n", "mu"), return_batch_id=T)
sim %<>% set_script(function() {
  batch({
    dat <- create_data(n=L$n, mu=L$mu)
  })
  mu_hat <- est_mean(dat=dat, type=L$est)
  return(list(
    "mu_hat" = round(mu_hat,2),
    "dat_1" = round(dat[1],2)
  ))
})
sim %<>% run()
#> Done. No errors or warnings detected.

sim$results[order(sim$results$batch_id),]
#>    sim_uid level_id rep_id batch_id   n mu        est      runtime mu_hat dat_1
#> 1        1        1      1        1  10  3   est_mean 0.0005247593   2.53  1.90
#> 9        5        5      1        1  10  3 est_median 0.0004286766   2.78  1.90
#> 2        9        1      2        2  10  3   est_mean 0.0004208088   2.75  4.71
#> 10      13        5      2        2  10  3 est_median 0.0003991127   3.00  4.71
#> 3        2        2      1        3 100  3   est_mean 0.0040965080   3.24  2.36
#> 11       6        6      1        3 100  3 est_median 0.0004251003   3.27  2.36
#> 4       10        2      2        4 100  3   est_mean 0.0004086494   3.10  4.64
#> 12      14        6      2        4 100  3 est_median 0.0003995895   3.22  4.64
#> 5        3        3      1        5  10  5   est_mean 0.0004565716   4.65  4.87
#> 13       7        7      1        5  10  5 est_median 0.0004093647   4.79  4.87
#> 6       11        3      2        6  10  5   est_mean 0.0004103184   4.96  4.65
#> 14      15        7      2        6  10  5 est_median 0.0004045963   4.62  4.65
#> 7        4        4      1        7 100  5   est_mean 0.0004327297   4.94  4.07
#> 15       8        8      1        7 100  5 est_median 0.0004153252   4.98  4.07
#> 8       12        4      2        8 100  5   est_mean 0.0004026890   5.00  4.92
#> 16      16        8      2        8 100  5 est_median 0.0004107952   5.03  4.92

We see that we have achieved what we wanted - the batches were created such that each batch contained two replicates, one for each estimator. We also specified the return_batch_id=T option in set_config so that the results object would return the batch_id. The batch_id is the variable that defines the batches; all simulations that share the same batch_id. It can be useful to use this option to ensure that you are using the batch function correctly.

Here are a few tips to keep in mind when using the batch function:

  1. The code within the batch code block should only create objects; do not attempt to change or delete existing objects, as these changes may be ignored.
  2. In the majority of cases, the batch function will be called just once, at the top of the simulation script. However, it can be used anywhere in the script and can be called multiple times. Never use batch outside of the simulation script.
  3. Although we illustrated the use of the batch function to create a dataset to share between multiple replicates, it can be used for much more, such as taking a sample from an existing dataset, computing shared nuisance function estimators, performing computationally-intense tasks, and so on.
  4. Currently, if your simulation script uses the batch function, you cannot update the simulation using the update_sim (or update_sim_on_cluster), function unless all you are doing is removing replicates. This may be changed in the future.

Setting seeds

In statistical research, it is often desirable to have the ability to reproduce the exact results of a simulation. Since R code often involves stochastic (random) functions like rnorm or sample that return different values when called multiple times, reproducibility is not guaranteed. In a simple R script, calling the set.seed function at the top of your script ensures that the code that follows will produce the same results whenever the script is run. However, a more nuanced strategy is needed when running simulations. If we are running 100 replicates of the same simulation, we typically don’t want each replicate to return identical results; rather, we would like for each replicate to be different from one another, but for the entire set of replicates to be the same when we run the entire simulation twice in a row. Luckily, SimEngine manages this process for you, even when simulations are being run in parallel. SimEngine uses a single “global seed” that changes the individual seeds for each simulation replicate; use set_config to set or change this global seed:

sim %<>% set_config(seed=123)

If you did not set a seed with set_config, SimEngine will set a random seed automatically for you so that you can reproduce the results later if desired. To view this seed, use the vars function:

sim <- new_sim()
vars(sim, "seed")
#> [1] 287577520

Handling errors and warnings

As with any type of programming, debugging is a necessary part of the coding workflow. With simulations, sometimes errors occur that will affect all simulation replicates and sometimes errors occur that only affect some replicates. By default, when a simulation is run, SimEngine will not stop if an error occurs; instead, errors are logged and stored in a dataframe along with information about the simulation replicates that resulted in those errors. Examining this dataframe by typing print(sim$errors) can sometimes help to quickly pinpoint the issue. This is demonstrated below:

sim <- new_sim()
sim %<>% set_config(num_sim=2)
sim %<>% set_levels(
  Sigma = list(
    s1 = list(mtx=matrix(c(3,1,1,2), nrow=2)),
    s3 = list(mtx=matrix(c(4,3,3,9), nrow=2)),
    s2 = list(mtx=matrix(c(1,2,2,1), nrow=2)),
    s4 = list(mtx=matrix(c(8,2,2,6), nrow=2))
  )
)
sim %<>% set_script(function() {
  x <- MASS::mvrnorm(n=1, mu=c(0,0), Sigma=L$Sigma$mtx)
  return(list(x1=x[1], x2=x[2]))
})

sim %<>% run()
#> Done. Errors detected in 25% of simulation replicates. Warnings detected in 0% of simulation replicates.

sim$errors
#>   sim_uid level_id rep_id Sigma      runtime                          message
#> 1       5        3      1    s2 0.0003933907 'Sigma' is not positive definite
#> 2       6        3      2    s2 0.0003585815 'Sigma' is not positive definite
#>                                                      call
#> 1 MASS::mvrnorm(n = 1, mu = c(0, 0), Sigma = L$Sigma$mtx)
#> 2 MASS::mvrnorm(n = 1, mu = c(0, 0), Sigma = L$Sigma$mtx)

From the output above, we see that our code fails for the simulation replicates that use the level with Sigma="s2" because it uses an invalid covariance matrix. Similarly, if a simulation involves replicates that throw warnings, all warnings are logged and stored in the dataframe sim$warnings.

The workflow above can be useful to quickly spot errors, but it has two main drawbacks. First, it can be frustrating to run a time-consuming simulation involving hundreds or thousands of replicates, only to find out at the very end that every replicate failed because of a typo. It is often useful to stop an entire simulation after a single error has occurred. Second, it can sometimes be difficult to determine exactly what caused an error without making use of more advanced debugging tools. For both of these situations, use the following configuration option:

sim %<>% set_config(stop_at_error=TRUE)

Setting stop_at_error=TRUE will stop the simulation when it encounters any error. Furthermore, the error will be thrown by R in the usual way, and so if you are running the simulation in RStudio, you can make use of the built-in debugging tools to find and fix the bug. For example, you can click “Show Traceback” to view the entire call stack and see the function calls that led to the error, or you can click “Rerun with debug” to active the interactive debugger, which reruns the code and pauses execution where the error occurred so that you can examine objects in the function’s environment. Try running the code above in RStudio but with the stop_at_error=TRUE configuration option, clicking “Rerun with debug”, and then typing print(Sigma) in the console.