Complex simulation levels
Often, simulation levels will be simple, such as a vector of sample sizes:
sim <- new_sim()
sim %<>% set_levels(
n = c(200,400,800)
)
However, there will be many instances in which more complex objects
are needed. For these cases, instead of a vector of numbers or character
strings, use a named list of lists. The toy example below
illustrates this. Note that the list names ("Beta 1"
,
"Beta 2"
, and "Normal"
) become the entries in
the sim$results
dataframe.
sim <- new_sim()
sim %<>% set_levels(
n = c(10,100),
distribution = list(
"Beta 1" = list(type="Beta", params=c(0.3, 0.7)),
"Beta 2" = list(type="Beta", params=c(1.5, 0.4)),
"Normal" = list(type="Normal", params=c(3.0, 0.2))
)
)
create_data <- function(n, type, params) {
if (type=="Beta") {
return(rbeta(n, shape1=params[1], shape2=params[2]))
} else if (type=="Normal") {
return(rnorm(n, mean=params[1], sd=params[2]))
}
}
sim %<>% set_script(function() {
x <- create_data(L$n, L$distribution$type, L$distribution$params)
return(list("y"=mean(x)))
})
sim %<>% run()
#> Done. No errors or warnings detected.
sim$results
#> sim_uid level_id rep_id n distribution runtime y
#> 1 1 1 1 10 Beta 1 0.0006489754 0.2878745
#> 2 2 2 1 100 Beta 1 0.0041592121 0.3316068
#> 3 3 3 1 10 Beta 2 0.0003321171 0.8220943
#> 4 4 4 1 100 Beta 2 0.0003123283 0.7662802
#> 5 5 5 1 10 Normal 0.0003025532 3.0007964
#> 6 6 6 1 100 Normal 0.0002920628 2.9895281
Complex return data
In most situations, the results of simulations will be numeric.
However, we may want to return more complex data, such as matrices,
lists, or model objects. To do this, we include our complex return data
in the list with the special key ".complex"
. This is
illustrated in the toy example below, in which we estimate the
parameters of a linear regression and returns these as numeric, but also
return the estimated covariance matrix and the entire model object.
sim <- new_sim()
sim %<>% set_levels(n=c(10, 100, 1000))
create_data <- function(n) {
x <- runif(n)
y <- 3 + 2*x + rnorm(n)
return(data.frame("x"=x, "y"=y))
}
sim %<>% set_config(num_sim=2)
sim %<>% set_script(function() {
dat <- create_data(L$n)
model <- lm(y~x, data=dat)
return (list(
"beta0_hat" = model$coefficients[[1]],
"beta1_hat" = model$coefficients[[2]],
".complex" = list(
"model" = model,
"cov_mtx" = vcov(model)
)
))
})
sim %<>% run()
#> Done. No errors or warnings detected.
After running this simulation, we can examine the numeric results
directly by accessing sim$results
or using the
summarize
function, as usual:
sim$results
#> sim_uid level_id rep_id n runtime beta0_hat beta1_hat
#> 1 1 1 1 10 0.003430128 1.870858 3.571281
#> 2 4 1 2 10 0.001072884 3.201885 1.956946
#> 3 2 2 1 100 0.004134655 2.977784 2.117937
#> 4 5 2 2 100 0.001064777 3.048721 1.856485
#> 5 3 3 1 1000 0.001358509 2.983234 2.039734
#> 6 6 3 2 1000 0.001242876 2.964263 2.068172
However, we may also want to look at the complex return data. To do
so, we use the special function get_complex
, as illustrated
below:
c5 <- get_complex(sim, sim_uid=5)
summary(c5$model)
#>
#> Call:
#> lm(formula = y ~ x, data = dat)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -2.35517 -0.43454 0.09288 0.57668 2.16728
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 3.0487 0.2178 13.999 < 2e-16 ***
#> x 1.8565 0.3513 5.284 7.63e-07 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.9712 on 98 degrees of freedom
#> Multiple R-squared: 0.2218, Adjusted R-squared: 0.2138
#> F-statistic: 27.93 on 1 and 98 DF, p-value: 7.632e-07
c5$cov_mtx
#> (Intercept) x
#> (Intercept) 0.04742899 -0.06847956
#> x -0.06847956 0.12341769
Using the batch
function
The batch
function is useful if you want to share data
or objects between simulation replicates. Essentially, it allows you to
take your simulation replicates and divide them into “batches”; all
replicates in a given batch will then share a single set of objects. The
most common use case for this is if you have a simulation that involves
generating one dataset, analyzing it using multiple methods, and then
repeating this a number of times. To illustrate the use of
batch
using this example, first consider the following
simulation:
sim <- new_sim()
create_data <- function(n) { rnorm(n=n, mean=3) }
est_mean <- function(dat, type) {
if (type=="est_mean") { return(mean(dat)) }
if (type=="est_median") { return(median(dat)) }
}
sim %<>% set_levels(est=c("est_mean","est_median"))
sim %<>% set_config(num_sim=3)
sim %<>% set_script(function() {
dat <- create_data(n=100)
mu_hat <- est_mean(dat=dat, type=L$est)
return(list(
"mu_hat" = round(mu_hat,2),
"dat_1" = round(dat[1],2)
))
})
sim %<>% run()
#> Done. No errors or warnings detected.
If we look at the “dat_1” column of the results object (equal to the
first element of the dat
vector created in the simulation
script), we see that a unique dataset was created for each simulation
replicate:
sim$results[order(sim$results$rep_id),]
#> sim_uid level_id rep_id est runtime mu_hat dat_1
#> 1 1 1 1 est_mean 0.0003786087 3.00 3.37
#> 4 2 2 1 est_median 0.0040125847 3.08 3.79
#> 2 3 1 2 est_mean 0.0003399849 2.88 2.84
#> 5 5 2 2 est_median 0.0003337860 2.86 2.08
#> 3 4 1 3 est_mean 0.0003075600 2.90 4.79
#> 6 6 2 3 est_median 0.0003430843 3.02 1.83
Suppose that instead, we want to run a simulation where we generate
one dataset, analyzing it using multiple methods (in this case
corresponding to “est_mean” and “est_median”), and then repeating this
twice. We can do this using the batch
function, as
follows:
sim <- new_sim()
create_data <- function(n) { rnorm(n=n, mean=3) }
est_mean <- function(dat, type) {
if (type=="est_mean") { return(mean(dat)) }
if (type=="est_median") { return(median(dat)) }
}
sim %<>% set_levels(est=c("est_mean","est_median"))
sim %<>% set_config(num_sim=3, batch_levels=NULL)
sim %<>% set_script(function() {
batch({
dat <- create_data(n=100)
})
mu_hat <- est_mean(dat=dat, type=L$est)
return(list(
"mu_hat" = round(mu_hat,2),
"dat_1" = round(dat[1],2)
))
})
sim %<>% run()
#> Done. No errors or warnings detected.
In the code above, we changed two things. One, we added
batch_levels=NULL
to the set_config
call;
don’t worry about this for now. Second, we wrapped the code line
dat <- create_data(n=100)
inside the batch
function. Whatever code goes inside the batch
function will
produce the same output for all simulations in a batch; in this case, if
we look at the “dat_1” column of the results object, we can see that one
dataset was created and shared by the batch corresponding to sim_uids 1
and 2:
sim$results[order(sim$results$rep_id),]
#> sim_uid level_id rep_id est runtime mu_hat dat_1
#> 1 1 1 1 est_mean 0.0006170273 2.80 3.73
#> 4 2 2 1 est_median 0.0032284260 2.80 3.73
#> 2 3 1 2 est_mean 0.0039565563 3.19 3.18
#> 5 5 2 2 est_median 0.0004200935 3.14 3.18
#> 3 4 1 3 est_mean 0.0004267693 2.98 2.34
#> 6 6 2 3 est_median 0.0004649162 2.98 2.34
However, the situation is often more complicated. What if we have a
simulation with multiple level variables, some that correspond to
creating data and some that correspond to analyzing the data? This is
where the batch_levels
config option comes in. It is easy
to use; simply specify the names of the level variables that are used
within the batch function. Here is an example:
sim <- new_sim()
create_data <- function(n, mu) { rnorm(n=n, mean=mu) }
est_mean <- function(dat, type) {
if (type=="est_mean") { return(mean(dat)) }
if (type=="est_median") { return(median(dat)) }
}
sim %<>% set_levels(n=c(10,100), mu=c(3,5), est=c("est_mean","est_median"))
sim %<>% set_config(num_sim=2, batch_levels=c("n", "mu"), return_batch_id=T)
sim %<>% set_script(function() {
batch({
dat <- create_data(n=L$n, mu=L$mu)
})
mu_hat <- est_mean(dat=dat, type=L$est)
return(list(
"mu_hat" = round(mu_hat,2),
"dat_1" = round(dat[1],2)
))
})
sim %<>% run()
#> Done. No errors or warnings detected.
sim$results[order(sim$results$batch_id),]
#> sim_uid level_id rep_id batch_id n mu est runtime mu_hat dat_1
#> 1 1 1 1 1 10 3 est_mean 0.0005247593 2.53 1.90
#> 9 5 5 1 1 10 3 est_median 0.0004286766 2.78 1.90
#> 2 9 1 2 2 10 3 est_mean 0.0004208088 2.75 4.71
#> 10 13 5 2 2 10 3 est_median 0.0003991127 3.00 4.71
#> 3 2 2 1 3 100 3 est_mean 0.0040965080 3.24 2.36
#> 11 6 6 1 3 100 3 est_median 0.0004251003 3.27 2.36
#> 4 10 2 2 4 100 3 est_mean 0.0004086494 3.10 4.64
#> 12 14 6 2 4 100 3 est_median 0.0003995895 3.22 4.64
#> 5 3 3 1 5 10 5 est_mean 0.0004565716 4.65 4.87
#> 13 7 7 1 5 10 5 est_median 0.0004093647 4.79 4.87
#> 6 11 3 2 6 10 5 est_mean 0.0004103184 4.96 4.65
#> 14 15 7 2 6 10 5 est_median 0.0004045963 4.62 4.65
#> 7 4 4 1 7 100 5 est_mean 0.0004327297 4.94 4.07
#> 15 8 8 1 7 100 5 est_median 0.0004153252 4.98 4.07
#> 8 12 4 2 8 100 5 est_mean 0.0004026890 5.00 4.92
#> 16 16 8 2 8 100 5 est_median 0.0004107952 5.03 4.92
We see that we have achieved what we wanted - the batches were
created such that each batch contained two replicates, one for each
estimator. We also specified the return_batch_id=T
option
in set_config
so that the results object would return the
batch_id
. The batch_id
is the variable that
defines the batches; all simulations that share the same batch_id. It
can be useful to use this option to ensure that you are using the
batch
function correctly.
Here are a few tips to keep in mind when using the batch function:
- The code within the
batch
code block should only create objects; do not attempt to change or delete existing objects, as these changes may be ignored. - In the majority of cases, the
batch
function will be called just once, at the top of the simulation script. However, it can be used anywhere in the script and can be called multiple times. Never usebatch
outside of the simulation script. - Although we illustrated the use of the
batch
function to create a dataset to share between multiple replicates, it can be used for much more, such as taking a sample from an existing dataset, computing shared nuisance function estimators, performing computationally-intense tasks, and so on. - Currently, if your simulation script uses the
batch
function, you cannot update the simulation using theupdate_sim
(orupdate_sim_on_cluster
), function unless all you are doing is removing replicates. This may be changed in the future.
Setting seeds
In statistical research, it is often desirable to have the ability to
reproduce the exact results of a simulation. Since R code often involves
stochastic (random) functions like rnorm
or
sample
that return different values when called multiple
times, reproducibility is not guaranteed. In a simple R script, calling
the set.seed
function at the top of your script ensures
that the code that follows will produce the same results whenever the
script is run. However, a more nuanced strategy is needed when running
simulations. If we are running 100 replicates of the same simulation, we
typically don’t want each replicate to return identical results; rather,
we would like for each replicate to be different from one another, but
for the entire set of replicates to be the same when we run the
entire simulation twice in a row. Luckily, SimEngine
manages this process for you, even when simulations are being run in
parallel. SimEngine uses a single “global seed” that
changes the individual seeds for each simulation replicate; use
set_config
to set or change this global seed:
sim %<>% set_config(seed=123)
If you did not set a seed with set_config
,
SimEngine will set a random seed automatically for you
so that you can reproduce the results later if desired. To view this
seed, use the vars
function:
Handling errors and warnings
As with any type of programming, debugging is a necessary part of the
coding workflow. With simulations, sometimes errors occur that will
affect all simulation replicates and sometimes errors occur that only
affect some replicates. By default, when a simulation is run,
SimEngine will not stop if an error occurs; instead,
errors are logged and stored in a dataframe along with information about
the simulation replicates that resulted in those errors. Examining this
dataframe by typing print(sim$errors)
can sometimes help to
quickly pinpoint the issue. This is demonstrated below:
sim <- new_sim()
sim %<>% set_config(num_sim=2)
sim %<>% set_levels(
Sigma = list(
s1 = list(mtx=matrix(c(3,1,1,2), nrow=2)),
s3 = list(mtx=matrix(c(4,3,3,9), nrow=2)),
s2 = list(mtx=matrix(c(1,2,2,1), nrow=2)),
s4 = list(mtx=matrix(c(8,2,2,6), nrow=2))
)
)
sim %<>% set_script(function() {
x <- MASS::mvrnorm(n=1, mu=c(0,0), Sigma=L$Sigma$mtx)
return(list(x1=x[1], x2=x[2]))
})
sim %<>% run()
#> Done. Errors detected in 25% of simulation replicates. Warnings detected in 0% of simulation replicates.
sim$errors
#> sim_uid level_id rep_id Sigma runtime message
#> 1 5 3 1 s2 0.0003933907 'Sigma' is not positive definite
#> 2 6 3 2 s2 0.0003585815 'Sigma' is not positive definite
#> call
#> 1 MASS::mvrnorm(n = 1, mu = c(0, 0), Sigma = L$Sigma$mtx)
#> 2 MASS::mvrnorm(n = 1, mu = c(0, 0), Sigma = L$Sigma$mtx)
From the output above, we see that our code fails for the simulation
replicates that use the level with Sigma="s2"
because it
uses an invalid covariance matrix. Similarly, if a simulation involves
replicates that throw warnings, all warnings are logged and stored in
the dataframe sim$warnings
.
The workflow above can be useful to quickly spot errors, but it has two main drawbacks. First, it can be frustrating to run a time-consuming simulation involving hundreds or thousands of replicates, only to find out at the very end that every replicate failed because of a typo. It is often useful to stop an entire simulation after a single error has occurred. Second, it can sometimes be difficult to determine exactly what caused an error without making use of more advanced debugging tools. For both of these situations, use the following configuration option:
sim %<>% set_config(stop_at_error=TRUE)
Setting stop_at_error=TRUE
will stop the simulation when
it encounters any error. Furthermore, the error will be thrown by R in
the usual way, and so if you are running the simulation in RStudio, you
can make use of the built-in debugging tools to find and fix the bug.
For example, you can click “Show Traceback” to view the entire call
stack and see the function calls that led to the error, or you can click
“Rerun with debug” to active the interactive debugger, which reruns the
code and pauses execution where the error occurred so that you can
examine objects in the function’s environment. Try running the code
above in RStudio but with the stop_at_error=TRUE
configuration option, clicking “Rerun with debug”, and then typing
print(Sigma)
in the console.