Simulated data sets to illustrate the package functionality — simulatedclustereddata • CalibrationCurves

Both clustertraindata and clustertestdata are synthetically generated data frames that illustrate the functionality of the package. clustertraindata has 1000 observations and clustertestdata has 500 observations. The same settings were used to generate both data sets.

Usage

data(clustertraindata)
data(clustertestdata)

Format

y: the binary outcome variable
cluster: identifier of the cluster to which the observation belongs
x1: covariate 1
x2: covariate 2
x3: covariate 3
x4: covariate 4
x5: covariate 5

Details

See the examples for how the data sets were generated.
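
As a compact summary of what the example code does, the data-generating model can be written as a logistic model with a cluster-specific intercept. The symbols i (observation), k (covariate), j(i) (cluster of observation i) and u_j (cluster intercept) are notation introduced here for illustration; they do not appear in the package documentation.

```latex
\begin{aligned}
x_{ik} &\sim N(0, 1), \qquad k = 1, \dots, 5, \\
\beta_k &\sim N(0, 1), \\
j(i) &\sim \text{Uniform}\{1, \dots, 10\}, \\
\operatorname{logit} \Pr(y_i = 1) &= \sum_{k=1}^{5} \beta_k x_{ik} + u_{j(i)}, \\
y_i &\sim \text{Bernoulli}\bigl(\Pr(y_i = 1)\bigr),
\end{aligned}
```

where the ten intercepts u_1, ..., u_10 are standard normal draws that are centered and scaled (via scale()) to have sample mean 0 and sample standard deviation 1.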

Examples

  # The data sets were generated as follows
  library(magrittr)  # provides %<>% and %>%
  library(dplyr)     # provides filter(), group_by() and sample_n()

  set.seed(1234)

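  # Simulate a clustered population from which both data sets are sampled
  nClusters = 10                                       # number of clusters
  p         = 5                                        # number of covariates
  Uj        = scale(rnorm(nClusters))                  # standardized cluster-specific intercepts
  nPop      = 1e6                                      # population size
  nSample   = 1e3                                      # training observations drawn per cluster
  nTest     = 1e3                                      # test observations drawn per cluster
  X         = replicate(p, rnorm(nPop))                # nPop x p matrix of standard normal covariates
  Beta      = rnorm(p)                                 # regression coefficients
  cluster   = sample(seq_len(nClusters), nPop, TRUE)   # cluster membership of each observation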
  table(cluster)
#> cluster
#>      1      2      3      4      5      6      7      8      9     10 
#> 100093 100615 100108 100225 100030  99580  99870  99959  99813  99707 
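  # Linear predictor: covariate effects plus the intercept of each observation's cluster
  eta       = X %*% Beta + Uj[match(cluster, seq_len(nClusters))]
  # Draw the binary outcome from the inverse logit of the linear predictor
  y         = rbinom(nPop, 1, binomial()$linkinv(eta))
  Dt        = data.frame(y, X, cluster)
  # Lower-case the column names: y, x1, ..., x5, cluster
  colnames(Dt) %<>% tolower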

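  # Training data: nSample observations from each of clusters 1 to 5
  clustertraindata = Dt %>%
    filter(cluster %in% 1:5) %>%
    group_by(cluster) %>%
    sample_n(size = nSample) %>%
    as.data.frame
  # Test data: nTest observations from each of clusters 6 to 10
  clustertestdata = Dt %>%
    filter(cluster %in% 6:10) %>%
    group_by(cluster) %>%
    sample_n(size = nTest) %>%
    as.data.frame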
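
The same procedure can be tried at a smaller scale without any add-on packages. The following is a minimal base R sketch, not the code used to build the shipped data sets: the reduced population size, the per-cluster sample size nPerClust, the helper sample_clusters() and the object names trainSketch and testSketch are all assumptions made for illustration, and the simulated values will differ from those in the package.

```r
set.seed(1234)

nClusters <- 10
p         <- 5
nPop      <- 2e4   # reduced from 1e6 for speed (assumption of this sketch)
nPerClust <- 100   # reduced per-cluster sample size (assumption of this sketch)

Uj      <- as.numeric(scale(rnorm(nClusters)))      # standardized cluster intercepts
X       <- replicate(p, rnorm(nPop))                # nPop x p covariate matrix
Beta    <- rnorm(p)                                 # regression coefficients
cluster <- sample(seq_len(nClusters), nPop, TRUE)   # cluster membership
eta     <- as.numeric(X %*% Beta) + Uj[cluster]     # linear predictor
y       <- rbinom(nPop, 1, plogis(eta))             # binary outcome via inverse logit
Dt      <- data.frame(y, X, cluster)
names(Dt) <- tolower(names(Dt))                     # y, x1, ..., x5, cluster

# Hypothetical helper: draw n rows without replacement from each cluster in ids
sample_clusters <- function(data, ids, n) {
  rows <- unlist(lapply(ids, function(j) {
    idx <- which(data$cluster == j)
    idx[sample.int(length(idx), n)]
  }))
  data[rows, , drop = FALSE]
}

trainSketch <- sample_clusters(Dt, 1:5, nPerClust)
testSketch  <- sample_clusters(Dt, 6:10, nPerClust)

# Each selected cluster contributes nPerClust rows
table(trainSketch$cluster)
table(testSketch$cluster)
```

Because the training and test samples come from disjoint sets of clusters, a model fitted on one is evaluated on clusters it has never seen, which is the setting the example code above sets up.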