The training dataset contains real-life survival data from patients who underwent primary surgery for breast cancer between 1978 and 1993 in Rotterdam. The patients were followed until 2007, resulting in a model development cohort of 2982 patients after exclusions. The primary outcome measured was recurrence-free survival, defined as the time from primary surgery to recurrence or death.

The validation dataset consists of 686 patients with primary node-positive breast cancer from the German Breast Cancer Study Group. In this cohort, 285 patients suffered a recurrence or died within 5 years of follow-up, while 280 were censored before 5 years. Five-year predictions were chosen as that was the lowest median survival from the two cohorts (Rotterdam cohort, 6.7 years; German cohort, 4.9 years).

Usage

data(trainDataSurvival)
data(testDataSurvival)

Format

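A data frame with observations on the following variables. The list describes trainDataSurvival; testDataSurvival has 21 variables and stores the raw outcome in rfstime and status instead of rtime, recur, dtime, and death (see the str() output under Examples).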

pid

patient identifier

year

year of surgery

age

age at surgery

meno

menopausal status (0 = premenopausal, 1 = postmenopausal)

size

tumor size, a factor with levels <=20, 20-50, >50

grade

differentiation grade

nodes

number of positive lymph nodes

pgr

progesterone receptors (fmol/l)

er

estrogen receptors (fmol/l)

hormon

hormonal treatment (0 = no, 1 = yes)

chemo

chemotherapy

rtime

days to relapse or last follow-up

recur

0 = no relapse, 1 = relapse

dtime

days to death or last follow-up

death

0 = alive, 1 = dead

ryear

Follow-up time for RFS, in years (numeric)

rfs

Recurrence-free survival status (0 = no event, 1 = event) (numeric)

pgr2

Winsorized progesterone receptor level (numeric)

nodes2

Winsorized node count (numeric)

csize

Categorized tumor size, copied from size (factor)

cnode

Categorized node involvement (factor: "0", "1-3", ">3")

grade3

Recoded grade factor (levels: "1-2", "3")

nodes3

Restricted cubic spline basis for nodes2 (numeric)

pgr3

Restricted cubic spline basis for original pgr (numeric)

epoch

Follow-up epoch indicator after splitting at 5 years (numeric)

Details

The data sets are based on the publicly available code and data used in the repository Prediction_performance_survival by Giardiello et al. (2023), which accompanies the Annals of Internal Medicine article "Assessing Performance and Clinical Usefulness in Prediction Models With Survival Outcomes: Practical Guidance for Cox Proportional Hazards Models".

All preprocessing steps were performed exactly as in the Giardiello code: converting survival time to years, defining recurrence-free survival status via `rfs = pmax(recur, death)`, correcting 43 discordant cases using death time, 99th-percentile winsorization of `pgr` and `nodes`, spline transformations (`nodes3`, `pgr3`), splitting follow-up at 5 years (`epoch`), and recoding categorical variables (`csize`, `cnode`, `grade3`).
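
As a rough illustration of several of these steps, the sketch below applies them to a small toy data frame in base R. The values are made up, the `cut()` breakpoints are an assumption chosen to reproduce the `cnode` levels listed under Format, and the spline and `epoch` steps are left out. It is not the Giardiello code itself.

```r
# Toy data with hypothetical values; column names follow the Format section
toy <- data.frame(
  rtime = c(1000, 400, 2500, 300),
  recur = c(0, 1, 0, 0),
  dtime = c(1000, 900, 2500, 800),
  death = c(0, 1, 0, 1),
  nodes = c(0, 2, 5, 30),
  pgr   = c(0, 50, 200, 3000),
  grade = c(2, 3, 3, 2)
)

# Follow-up time in years and the combined recurrence-or-death indicator
toy$ryear <- toy$rtime / 365.25
toy$rfs   <- pmax(toy$recur, toy$death)

# Discordant cases: death recorded after censoring for recurrence,
# so the death time replaces the recurrence follow-up time
fix <- toy$recur == 0 & toy$death == 1 & toy$rtime < toy$dtime
toy$ryear[fix] <- toy$dtime[fix] / 365.25

# Winsorize at the 99th percentile
toy$pgr2   <- pmin(toy$pgr,   quantile(toy$pgr,   0.99, names = FALSE))
toy$nodes2 <- pmin(toy$nodes, quantile(toy$nodes, 0.99, names = FALSE))

# Categorical recodes (breakpoints are an assumption, see above)
toy$cnode  <- cut(toy$nodes, breaks = c(-1, 0, 3, Inf),
                  labels = c("0", "1-3", ">3"))
toy$grade3 <- factor(ifelse(toy$grade == 3, "3", "1-2"),
                     levels = c("1-2", "3"))
```

Only the fourth toy row is discordant, so only its `ryear` is replaced by the death time.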

The training dataset, trainDataSurvival, consists of 2982 patients, with 1713 events occurring over a maximum follow-up time of 19.3 years. The estimated median potential follow-up time, calculated using the reverse Kaplan-Meier method, was 9.3 years. Out of these patients, 1275 suffered a recurrence or death within the follow-up time of interest (5 years), and 126 were censored before 5 years.
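
The reverse Kaplan-Meier method estimates potential follow-up by swapping the roles of events and censoring. A minimal sketch with the survival package, on made-up toy values rather than trainDataSurvival (the names `toy`, `fit`, and `median_fu` are illustrative):

```r
library(survival)

# Hypothetical follow-up times (years) and event indicators (1 = event)
toy <- data.frame(
  ryear = c(2, 3, 4, 5, 6, 8),
  rfs   = c(1, 0, 1, 0, 0, 0)
)

# Reverse Kaplan-Meier: treat censoring as the event of interest
fit <- survfit(Surv(ryear, 1 - rfs) ~ 1, data = toy)

# Median potential follow-up time
median_fu <- unname(summary(fit)$table["median"])
median_fu
```

The same call with `data = trainDataSurvival` would presumably give the figure quoted above, provided the stored `ryear` and `rfs` columns hold the full, unsplit follow-up.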

The validation dataset, testDataSurvival, consists of 686 patients with primary node-positive breast cancer from the German Breast Cancer Study Group. In this cohort, 285 patients suffered a recurrence or died within 5 years of follow-up, while 280 were censored before 5 years. Five-year predictions were chosen as that was the lowest median survival from the two cohorts (Rotterdam cohort, 6.7 years; German cohort, 4.9 years).
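
Counts like these can be tallied from the follow-up time and status columns. The sketch below uses the first five `ryear` and `rfs` values shown in the `str()` output under Examples; `horizon`, `n_event`, and `n_censored` are illustrative names, and the exact inequalities used by the authors are an assumption.

```r
# First five rows of ryear and rfs as printed by str(testDataSurvival)
toy <- data.frame(
  ryear = c(5, 1.103, 4.389, 0.485, 5),
  rfs   = c(0, 1, 0, 0, 0)
)

horizon <- 5  # prediction horizon in years

# Events observed up to the horizon
n_event <- sum(toy$rfs == 1 & toy$ryear <= horizon)

# Patients censored strictly before the horizon
n_censored <- sum(toy$rfs == 0 & toy$ryear < horizon)

c(events = n_event, censored_before_horizon = n_censored)
```

Rows with `ryear` equal to 5 and `rfs` equal to 0 are still event-free at the horizon, so they fall into neither count.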

References

David J. McLernon, Daniele Giardiello, Ben Van Calster, et al. (2023). Assessing Performance and Clinical Usefulness in Prediction Models With Survival Outcomes: Practical Guidance for Cox Proportional Hazards Models. Annals of Internal Medicine, 176(1), pp. 105-114, doi:10.7326/M22-0844

Examples

data(testDataSurvival)
## Explore the structure of the dataset
str(testDataSurvival)
#> 'data.frame':	686 obs. of  21 variables:
#>  $ pid    : int  132 1575 1140 769 130 1642 475 973 569 1180 ...
#>  $ age    : int  49 55 56 45 65 48 48 37 67 45 ...
#>  $ meno   : int  0 1 1 0 1 0 0 0 1 0 ...
#>  $ size   : int  18 20 40 25 30 52 21 20 20 30 ...
#>  $ grade  : int  2 3 3 3 2 2 3 2 2 2 ...
#>  $ nodes  : int  2 16 3 1 5 11 8 9 1 1 ...
#>  $ pgr    : int  0 0 0 0 0 0 0 0 0 0 ...
#>  $ er     : int  0 0 0 4 36 0 0 0 0 0 ...
#>  $ hormon : int  0 0 0 0 1 0 0 1 1 0 ...
#>  $ rfstime: int  1838 403 1603 177 1855 842 293 42 564 1093 ...
#>  $ status : int  0 1 0 0 0 1 1 0 1 1 ...
#>  $ cnode  : Factor w/ 3 levels "0","1-3",">3": 2 3 2 2 3 3 3 3 2 2 ...
#>  $ csize  : Factor w/ 3 levels "<=20","20-50",..: 1 1 2 2 2 3 2 1 1 2 ...
#>  $ pgr2   : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ nodes2 : num  2 16 3 1 5 11 8 9 1 1 ...
#>  $ grade3 : Factor w/ 2 levels "1-2","3": 1 2 2 2 1 1 2 1 1 1 ...
#>  $ nodes3 : num  0.0849 4.2222 0.2222 0.0123 0.6543 ...
#>  $ pgr3   : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ ryear  : num  5 1.103 4.389 0.485 5 ...
#>  $ rfs    : num  0 1 0 0 0 1 1 0 1 1 ...
#>  $ epoch  : num  1 1 1 1 1 1 1 1 1 1 ...