Blocked weighted bootstrap estimation
Ernest Guevarra
8 January 2025
Source:vignettes/bbw-estimation.Rmd
bbw-estimation.Rmd
The bbw package was developed primarily as a tool for analysing complex sample survey data. It was developed specifically for use with the Rapid Assessment Method (RAM) and the Simple Spatial Survey Method (S3M).
The indicatorsHH
is a survey dataset collected from a RAM survey in Bakool, Bay, and
Middle Shabelle regions of Somalia. The villageData
contains the list of villages/clusters that were sampled in the survey
that collected the indicatorsHH
dataset. These is a good
set of data to demonstrate the use of the bbw package to
perform blocked weighted bootstrap estimation.
Original bootstrapping workflow
Bootstrap resampling with bootBW()
The bootBW()
function is the original bootstrap
resampling function of the package. It can be used as follows:
boot_df <- bootBW(
x = indicatorsHH, w = villageData, statistic = bootClassic,
params = c("anc1", "anc2")
)
#> ✔ x has the appropriate/expected data structure
This call to bootBW()
takes in the survey dataset
indicatorsHH
as its first argument (x
). This
dataset is expected to have a variable labelled as psu
which identifies the primary sampling unit from which data was collected
during the survey and then additional variables for the indicators to be
estimated. The second argument (w
) is for the dataset of
the list of primary sampling units that were sampled in the survey to
collect the survey data specified in x
. This dataset, which
in this case is villageData
, should have at least a
variable labelled psu
which identified the primary sampling
unit that matches the same variable in the survey dataset and a variable
labelled pop
for the population size of the primary
sampling unit. The statistic
argument specified the type of
statistic to apply to the bootstrap replicates. There are two of these
functions available from the bbw package -
bootClassic()
and the bootPROBIT()
. For this
example, the bootClassic()
function is used to get the mean
value of the bootstrap replicates. This is generally useful for binomial
type of indicators and for continuous variables of which to get the mean
of. The params
argument takes in values of the indicator
names in x
to be estimated. In this example, two indicator
names for antenatal care are specified. Finally, the argument for
replicates
specify the number of replicate bootstraps to be
performed. The default of 400 replicates is used here. This results in
the following (showing first 10 rows):
head(boot_df, 10)
#> anc1 anc2
#> 1 0.1864175 0.01874714
#> 2 0.2290978 0.02035985
#> 3 0.2343529 0.02641509
#> 4 0.2548555 0.03084955
#> 5 0.2698864 0.02662863
#> 6 0.2151356 0.01819052
#> 7 0.1937834 0.02677702
#> 8 0.2148349 0.01678766
#> 9 0.2593480 0.02891566
#> 10 0.2151414 0.01922153
The result is a data.frame()
of bootstrap replicates
with number of rows equal to the number or replicates and number of
columns equal to the number of params
specified. Hence,
boot_df
has 400 rows and 2 columns.
Bootstrap estimation
Using boot_df
containing bootstrap replicates of the
indicators anc1
and anc2
, estimating each
indicator with a 95% confidence interval using the percentile
bootstrap method. This can be simply done using the
quantile()
function from the stats
package as
follows:
est_df <- lapply(
X = boot_df,
FUN = quantile,
probs = c(0.5, 0.025, 0.975)
) |>
do.call(rbind, args = _)
The quantile()
function is used to get the 50th
percentile (for the estimate) and the 2.5th and the 97.5th percentile of
the bootstrap replicates to get the lower confidence limit and the upper
confidence limits (respectively) of the indicator estimate. This gives
the following results:
est_df
#> 50% 2.5% 97.5%
#> anc1 0.2316597 0.17709920 0.28849265
#> anc2 0.0218962 0.01347537 0.03484835
Stratified bootstrap resampling
Note that the indicatorsHH
dataset has geographical
stratification. Specifically, the survey from which this data was
collected was designed to be representative of three regions in Somalia
with the regions identified through the region
variable in
indicatorsHH
. Because of this the more appropriate
bootstrap resampling approach would be to resample within each region.
To do this using the original bootBW()
function would
require restructuring the survey dataset by region and then passing the
region-stratified datasets individually to the bootBW()
function. This may look something like this:
## Split indicators by region ----
indicators_by_region <- split(indicatorsHH, f = indicatorsHH$region)
## Split psus by region ----
psus_by_region <- split(villageData, f = villageData$region)
## Bootstrap
boot_df <- Map(
f = bootBW,
x = indicators_by_region,
w = psus_by_region,
statistic = rep(list(get("bootClassic")), length(indicators_by_region)),
params = rep(list(c("anc1", "anc2")), length(indicators_by_region))
)
#> ✔ x has the appropriate/expected data structure
#> ✔ x has the appropriate/expected data structure
#> ✔ x has the appropriate/expected data structure
The bootBW()
function only accepts single
data.frame
inputs for x
and w
arguments. Hence, to resample data from within region, the datasets will
have to be split into separate data.frame
inputs per region
and then bootBW()
applied to each separately. In the
example above, this is done by concatenating each of the inputs to
bootBW()
into a list and then using the Map()
function is sent to bootBW()
sequentially. This produces a
list of the data.frame
bootstrap resample for each region
(shown below):
class(boot_df)
#> [1] "list"
head(boot_df$Bay, 10)
#> anc1 anc2
#> 1 0.4043419 0.013568521
#> 2 0.3907104 0.020491803
#> 3 0.3224044 0.023224044
#> 4 0.2645862 0.016282225
#> 5 0.2708618 0.008207934
#> 6 0.3297151 0.024423338
#> 7 0.3627717 0.004076087
#> 8 0.3662551 0.016460905
#> 9 0.3410641 0.016371078
#> 10 0.2277628 0.014824798
head(boot_df$Bakool, 10)
#> anc1 anc2
#> 1 0.2916667 0.17415730
#> 2 0.2928177 0.09497207
#> 3 0.3260274 0.14804469
#> 4 0.2747253 0.11864407
#> 5 0.2900552 0.11797753
#> 6 0.1823204 0.05849582
#> 7 0.4065934 0.16343490
#> 8 0.2727273 0.11731844
#> 9 0.2821918 0.06944444
#> 10 0.2939560 0.09749304
head(boot_df$`Middle Shabelle`, 10)
#> anc1 anc2
#> 1 0.1723447 0.011055276
#> 2 0.2550607 0.018367347
#> 3 0.1330724 0.010816126
#> 4 0.2830189 0.024551464
#> 5 0.1921569 0.014792899
#> 6 0.2217782 0.010989011
#> 7 0.2117647 0.007881773
#> 8 0.2165156 0.019172553
#> 9 0.2195122 0.015625000
#> 10 0.2274549 0.015075377
To estimate the per region results from this bootstrap resampling, the following can be implemented:
est_df <- lapply(
X = boot_df,
FUN = function(x) lapply(
x, FUN = quantile, probs = c(0.5, 0.025, 0.975)
) |>
do.call(rbind, args = _)
)
est_df <- data.frame(
region = names(est_df),
indicators = lapply(est_df, FUN = row.names) |> unlist(),
do.call(rbind, args = est_df)
)
row.names(est_df) <- NULL
which results in the following output:
est_df
#> region indicators X50. X2.5. X97.5.
#> 1 Bakool anc1 0.30261405 0.188862799 0.41167127
#> 2 Bay anc2 0.11251780 0.050903865 0.19504237
#> 3 Middle Shabelle anc1 0.32391543 0.217411669 0.43781930
#> 4 Bakool anc2 0.01893172 0.002766156 0.03663750
#> 5 Bay anc1 0.20220114 0.134820317 0.27676772
#> 6 Middle Shabelle anc2 0.01724140 0.007237952 0.03006251
Alternative blocked weighted bootstrap function set
From this demonstration, the bootBW()
function proves to
be straightforward to implement and can be easily incorporated into a
user’s workflow based on their dataset and their analytic needs.
However, as shown above, this flexibility requires a lot more extra
coding from the user to get from resampling to indicator estimates.
Starting from v0.3.0
, an alternative set of functions is
available to perform blocked weighted bootstrap resampling that
facilitates all the steps from resampling to estimation. Below is an
example of how to use this alternative set of functions for the same
tasks shown above.
This set of functions attempts to make the blocked weighted bootstrap
algorithm more efficient through vectorisation and use of
parallelisation techniques. The function syntax has been kept consistent
with bootBW()
for ease of transition.
Bootstrap resampling with boot_bw()
The boot_bw()
function is the alternative bootstrap
resampling function of the package. It can be used as follows:
boot_df <- boot_bw(
x = indicatorsHH, w = villageData, statistic = bootClassic,
params = c("anc1", "anc2")
)
This call to boot_bw()
takes in the survey dataset
indicatorsHH
as its first argument (x
). This
dataset is expected to have a variable labelled as psu
which identifies the primary sampling unit from which data was collected
during the survey and then additional variables for the indicators to be
estimated. The second argument (w
) is for the dataset of
the list of primary sampling units that were sampled in the survey to
collect the survey data specified in x
. This dataset, which
in this case is villageData
, should have at least a
variable labelled psu
which identified the primary sampling
unit that matches the same variable in the survey dataset and a variable
labelled pop
for the population size of the primary
sampling unit. The statistic
argument specified the type of
statistic to apply to the bootstrap replicates. There are two of these
functions available from the bbw package -
bootClassic()
and the bootPROBIT()
. For this
example, the bootClassic()
function is used to get the mean
value of the bootstrap replicates. This is generally useful for binomial
type of indicators and for continuous variables of which to get the mean
of. The params
argument takes in values of the indicator
names in x
to be estimated. In this example, two indicator
names for antenatal care are specified. Finally, the argument for
replicates
specify the number of replicate bootstraps to be
performed. The default of 400 replicates is used here. As can be noted,
the boot_bw()
takes on the same type of arguments as
bootBW()
and the syntax is exactly the same. Hence, using
this alternative function will be familiar to those who have had
experience using the original function.
However, the output of the boot_bw()
function is
structured differently from the bootBW()
function. The
boot_bw()
function produces and object of class
boot_bw
.
class(boot_df)
#> [1] "boot_bw"
The object boot_bw
is a list with 4 named components:
params
for the values specified for the params
argument, replicates
for the number of bootstrap replicates
performed, strata
for the values specified for
stratification, and boot_data
which is the bootstrap
results.
names(boot_df)
#> [1] "params" "replicates" "strata" "boot_data"
The boot_data
component of the boot_bw
object corresponds to the output of the bootBW()
function.
Other than the difference in the structure of the output, this alternative function also has three additional arguments for the new features it provides.
strata
- the variable name inx
that provides information on the stratification in the survey data. This is by default set toNULL
signifying no stratification. This argument allows the user to perform stratified bootstrap resampling conveniently through theboot_bw()
function.parallel
- whether or not to use parallel computation for the bootstrap resampling. This is by default set to FALSE in which case bootstrap resampling is done sequentially as is with thebootBW()
function. If set to TRUE, the function sets up parallel computing and utilises the machines available cores (seecores
argument below).cores
- the number of cores to use for parallel computation. This is only evaluated ifparallel = TRUE
. By default, this is set to 1 less the total available number of cores of the current machine.
To use these new features and functionality, the call to
boot_bw()
would look something like this:
boot_df <- boot_bw(
x = indicatorsHH, w = villageData, statistic = bootClassic,
params = c("anc1", "anc2"), strata = "region", parallel = TRUE
)
This produces a boot_bw
class list
object
with the same components as above. The only different is that the
boot_data
component is a list
(instead of a
data.frame
) with each component being the
data.frame
bootstrap resampling output for each of the
strata in the dataset.
Bootstrap estimation
The boot_bw_estimate()
function can then be applied to
the output of the boot_bw()
function to get the indicator
estimates with 95% confidence interval.
boot_bw_estimate(boot_df)
#> region indicator est lcl ucl
#> 1 Bakool anc1 0.43888889 0.38881944 0.48888889
#> 2 Bakool anc2 0.38055556 0.32497749 0.43062500
#> 3 Bay anc1 0.71619066 0.63887512 0.77849135
#> 4 Bay anc2 0.00254615 0.00000000 0.01294677
#> 5 Middle Shabelle anc1 0.20757542 0.14514451 0.28293531
#> 6 Middle Shabelle anc2 0.05065259 0.03133757 0.07453108
#> se
#> 1 0.027718319
#> 2 0.027983726
#> 3 0.036466569
#> 4 0.003743969
#> 5 0.036375151
#> 6 0.011463590
These two functions can be piped to each other for a single workflow from bootstrap resampling to estimation.
boot_bw(
x = indicatorsHH, w = villageData, statistic = bootClassic,
params = c("anc1", "anc2"), strata = "region", parallel = TRUE
) |>
boot_bw_estimate()
#> region indicator est lcl ucl
#> 1 Bakool anc1 0.438888889 0.3805556 0.49444444
#> 2 Bakool anc2 0.376731302 0.3138889 0.43888889
#> 3 Bay anc1 0.719130072 0.6487833 0.78255787
#> 4 Bay anc2 0.002534854 0.0000000 0.01262706
#> 5 Middle Shabelle anc1 0.203423968 0.1428536 0.27819673
#> 6 Middle Shabelle anc2 0.051256281 0.0339071 0.07622767
#> se
#> 1 0.030425611
#> 2 0.030033802
#> 3 0.034273078
#> 4 0.003372086
#> 5 0.033966913
#> 6 0.010573679