Simulated data sets to illustrate the package functionality
Both clustertraindata and clustertestdata are synthetically generated data frames that illustrate the functionality of the package.
clustertraindata has 1000 observations and clustertestdata has 500 observations. The same settings were used to generate both data sets.
Format
y: the binary outcome variable
cluster: the cluster
x1: covariate 1
x2: covariate 2
x3: covariate 3
x4: covariate 4
x5: covariate 5
Examples
# The data sets were generated as follows
# magrittr provides %<>% and dplyr provides filter(), group_by() and sample_n()
library(magrittr)
library(dplyr)
set.seed(1234)
# Simulate training data
nClusters = 10
p = 5
Uj = scale(rnorm(nClusters))
nPop = 1e6
nSample = 1e3
nTest = 1e3
X = replicate(p, rnorm(nPop))
Beta = rnorm(p)
cluster = sample(seq_len(nClusters), nPop, TRUE)
table(cluster)
#> cluster
#> 1 2 3 4 5 6 7 8 9 10
#> 100093 100615 100108 100225 100030 99580 99870 99959 99813 99707
eta = X %*% Beta + Uj[match(cluster, seq_len(nClusters))]
y = rbinom(nPop, 1, binomial()$linkinv(eta))
Dt = data.frame(y, X, cluster)
colnames(Dt) %<>% tolower
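# Training data: sample nSample rows from each of clusters 1 to 5
clustertraindata = Dt %>%
  filter(cluster %in% 1:5) %>%
  group_by(cluster) %>%
  sample_n(size = nSample) %>%
  as.data.frame
# Test data: sample nTest rows from each of clusters 6 to 10
clustertestdata = Dt %>%
  filter(cluster %in% 6:10) %>%
  group_by(cluster) %>%
  sample_n(size = nTest) %>%
  as.data.frame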
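
The code above draws the outcome from a logistic model with a cluster-specific random intercept: the linear predictor is the covariate effect `X %*% Beta` plus the standardized intercept `Uj` of the observation's cluster, and the training and test sets are sampled from disjoint sets of clusters (1 to 5 and 6 to 10). The following is a minimal base-R sketch of the same mechanism on a much smaller population, which may be handy for quick experiments. It is not the code used to build the shipped data sets: the seed, the reduced sizes, and the names `nPopSmall`, `nPerCluster`, `sim`, `drawPerCluster`, `trainSmall` and `testSmall` are assumptions made for illustration only.

```r
# Small-scale sketch of the generating mechanism, base R only.
# All sizes and object names here are illustrative assumptions.
set.seed(1)
nClusters   <- 10
p           <- 5
nPopSmall   <- 2e4   # far smaller than the 1e6 population used above
nPerCluster <- 100   # rows sampled from each selected cluster

# Standardized random intercepts, one per cluster
Uj      <- as.numeric(scale(rnorm(nClusters)))
# Covariate matrix (nPopSmall x p) and fixed-effect coefficients
X       <- replicate(p, rnorm(nPopSmall))
Beta    <- rnorm(p)
cluster <- sample(seq_len(nClusters), nPopSmall, replace = TRUE)

# Linear predictor: covariate effect plus the intercept of the row's cluster
eta <- as.numeric(X %*% Beta) + Uj[cluster]
# Binary outcome through the inverse logit link
y   <- rbinom(nPopSmall, 1, plogis(eta))

sim <- data.frame(y, X, cluster)
colnames(sim) <- tolower(colnames(sim))

# Sample n rows without replacement from each cluster listed in ids
drawPerCluster <- function(d, ids, n) {
  rows <- unlist(lapply(ids, function(j) {
    idx <- which(d$cluster == j)
    idx[sample.int(length(idx), n)]
  }))
  d[rows, , drop = FALSE]
}

trainSmall <- drawPerCluster(sim, 1:5, nPerCluster)
testSmall  <- drawPerCluster(sim, 6:10, nPerCluster)

table(trainSmall$cluster)
table(testSmall$cluster)
```

Because the test rows come from clusters that never appear in the training rows, a model fitted on `trainSmall` is evaluated on previously unseen clusters, mirroring the split between clustertraindata and clustertestdata.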