Breast Cancer Survival Data from Rotterdam and Germany
The training dataset contains real-life survival data from patients who underwent primary surgery for breast cancer between 1978 and 1993 in Rotterdam. The patients were followed until 2007, resulting in a model development cohort of 2982 patients after exclusions. The primary outcome measured was recurrence-free survival, defined as the time from primary surgery to recurrence or death.
The validation dataset consists of 686 patients with primary node-positive breast cancer from the German Breast Cancer Study Group. In this cohort, 285 patients suffered a recurrence or died within 5 years of follow-up, while 280 were censored before 5 years. Five-year predictions were chosen as that was the lowest median survival from the two cohorts (Rotterdam cohort, 6.7 years; German cohort, 4.9 years).
Format
A data frame with observations on the following 26 variables.
- pid
patient identifier
- year
year of surgery
- age
age at surgery
- meno
menopausal status (0 = premenopausal, 1 = postmenopausal)
- size
tumor size, a factor with levels <= 20, 20-50, >50
- grade
differentiation grade
- nodes
number of positive lymph nodes
- pgr
progesterone receptors (fmol/l)
- er
estrogen receptors (fmol/l)
- hormon
hormonal treatment (0 = no, 1 = yes)
- chemo
chemotherapy
- rtime
days to relapse or last follow-up
- recur
0 = no relapse, 1 = relapse
- dtime
days to death or last follow-up
- death
0 = alive, 1 = dead
- ryear
Follow-up time for RFS, in years (numeric)
- rfs
Recurrence-free survival status (0 = no event, 1 = event) (numeric)
- pgr2
Winsorized progesterone receptor level (numeric)
- nodes2
Winsorized node count (numeric)
- csize
Categorized tumor size, copied from `size` (factor)
- cnode
Categorized node involvement (factor: "0", "1-3", ">3")
- grade3
Recoded grade factor (levels: "1-2", "3")
- nodes3
Restricted cubic spline basis for `nodes2` (numeric)
- pgr3
Restricted cubic spline basis for original `pgr` (numeric)
- epoch
Follow-up epoch indicator after splitting at 5 years (numeric)
Details
The data sets are based on the publicly available code and data used in the repository Prediction_performance_survival by Giardiello et al. (2023), which accompanies the Annals of Internal Medicine article "Assessing Performance and Clinical Usefulness in Prediction Models With Survival Outcomes: Practical Guidance for Cox Proportional Hazards Models".
All preprocessing steps were performed exactly as in the Giardiello code: converting survival time to years, defining recurrence-free survival status via `rfs = pmax(recur, death)`, correcting 43 discordant cases using death time, 99th-percentile winsorization of `pgr` and `nodes`, spline transformations (`nodes3`, `pgr3`), splitting follow-up at 5 years (`epoch`), and recoding categorical variables (`csize`, `cnode`, `grade3`).
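As a rough illustration of several of these steps, the following base R sketch applies them to a small made-up data frame. The data frame `toy`, its values, the exact rule used to flag discordant cases, the default `quantile()` type, and the `cut()` break points are all assumptions made for this example; the Giardiello repository remains the authoritative source. The spline and `epoch` steps are left out.

```r
# Made-up rows using the raw column names from the Format section.
toy <- data.frame(
  rtime = c(1000, 400, 2500, 300),  # days to relapse or last follow-up
  recur = c(0, 1, 0, 0),
  dtime = c(1000, 900, 2500, 800),  # days to death or last follow-up
  death = c(0, 1, 0, 1),
  pgr   = c(10, 0, 5000, 35),
  nodes = c(0, 2, 12, 5),
  grade = c(2, 3, 3, 2)
)

# Follow-up time in years and combined recurrence-free survival status.
toy$ryear <- toy$rtime / 365.25
toy$rfs   <- pmax(toy$recur, toy$death)

# Assumed definition of a discordant case: died without a recorded
# recurrence, but relapse follow-up stopped before the death time.
# For these rows the death time replaces the earlier follow-up time.
discordant <- toy$recur == 0 & toy$death == 1 & toy$rtime < toy$dtime
toy$ryear[discordant] <- toy$dtime[discordant] / 365.25

# Winsorize at the 99th percentile (default quantile type; an assumption).
toy$pgr2   <- pmin(toy$pgr,   quantile(toy$pgr,   0.99, names = FALSE))
toy$nodes2 <- pmin(toy$nodes, quantile(toy$nodes, 0.99, names = FALSE))

# Recode categorical variables with the levels listed in the Format section.
toy$cnode  <- cut(toy$nodes, breaks = c(-Inf, 0, 3, Inf),
                  labels = c("0", "1-3", ">3"))
toy$grade3 <- factor(ifelse(toy$grade <= 2, "1-2", "3"),
                     levels = c("1-2", "3"))

toy[, c("ryear", "rfs", "pgr2", "nodes2", "cnode", "grade3")]
```

In this toy data only the fourth row is flagged as discordant, so its `ryear` is taken from `dtime` rather than `rtime`.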
The training dataset, `trainDataSurvival`, consists of 2982 patients, with 1713 events occurring over a maximum follow-up time of 19.3 years. The estimated median potential follow-up time, calculated using the reverse Kaplan-Meier method, was 9.3 years. Of these patients, 1275 suffered a recurrence or died within the follow-up time of interest (5 years), and 126 were censored before 5 years.
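The reverse Kaplan-Meier method estimates potential follow-up by treating censoring as the event of interest and the actual events as censored. A minimal sketch, assuming the `survival` package is installed and using a made-up data frame `toy` (not the packaged data), might look like this:

```r
library(survival)

# Made-up follow-up times (years) and event indicators (1 = event, 0 = censored).
toy <- data.frame(
  ryear = c(1, 2, 3, 4, 5, 6),
  rfs   = c(1, 0, 1, 0, 0, 1)
)

# Reverse Kaplan-Meier: flip the status so that censoring counts as the event.
rev_km <- survfit(Surv(ryear, rfs == 0) ~ 1, data = toy)

# Median potential follow-up time, read from the summary table.
median_fu <- as.numeric(summary(rev_km)$table["median"])
median_fu
```

Applying the same call to the full follow-up times of the Rotterdam cohort, before any truncation at 5 years, is presumably how the reported figure was obtained, although this page does not show that computation.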
The validation dataset, `testDataSurvival`, consists of 686 patients with primary node-positive breast cancer from the German Breast Cancer Study Group. In this cohort, 285 patients suffered a recurrence or died within 5 years of follow-up, while 280 were censored before 5 years. Five-year predictions were chosen as that was the lowest median survival from the two cohorts (Rotterdam cohort, 6.7 years; German cohort, 4.9 years).
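Counts like these can be tabulated directly from the `ryear` and `rfs` columns. The sketch below uses the first five `ryear` and `rfs` values shown in the Examples output as a stand-in data frame `toy`; the variable names `horizon`, `n_event`, `n_censored_before`, and `n_at_horizon` are introduced here for illustration only.

```r
# First five rows of ryear and rfs as printed by str(testDataSurvival).
toy <- data.frame(
  ryear = c(5, 1.103, 4.389, 0.485, 5),
  rfs   = c(0, 1, 0, 0, 0)
)

horizon <- 5  # prediction horizon in years

# Events (recurrence or death) observed within the horizon.
n_event <- sum(toy$rfs == 1 & toy$ryear <= horizon)

# Patients censored before reaching the horizon.
n_censored_before <- sum(toy$rfs == 0 & toy$ryear < horizon)

# Patients still event-free when follow-up reaches the horizon.
n_at_horizon <- nrow(toy) - n_event - n_censored_before

c(events = n_event, censored_before = n_censored_before,
  at_horizon = n_at_horizon)
```

Because follow-up is split at 5 years, patients who remain event-free at the horizon appear with `ryear` equal to 5 and `rfs` equal to 0, which is why they are counted separately from those censored earlier.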
References
David J. McLernon, Daniele Giardiello, Ben Van Calster, et al. (2023). Assessing Performance and Clinical Usefulness in Prediction Models With Survival Outcomes: Practical Guidance for Cox Proportional Hazards Models. Annals of Internal Medicine, 176(1), pp. 105-114, doi:10.7326/M22-0844
Examples
data(testDataSurvival)
## Explore the structure of the dataset
str(testDataSurvival)
#> 'data.frame': 686 obs. of 21 variables:
#> $ pid : int 132 1575 1140 769 130 1642 475 973 569 1180 ...
#> $ age : int 49 55 56 45 65 48 48 37 67 45 ...
#> $ meno : int 0 1 1 0 1 0 0 0 1 0 ...
#> $ size : int 18 20 40 25 30 52 21 20 20 30 ...
#> $ grade : int 2 3 3 3 2 2 3 2 2 2 ...
#> $ nodes : int 2 16 3 1 5 11 8 9 1 1 ...
#> $ pgr : int 0 0 0 0 0 0 0 0 0 0 ...
#> $ er : int 0 0 0 4 36 0 0 0 0 0 ...
#> $ hormon : int 0 0 0 0 1 0 0 1 1 0 ...
#> $ rfstime: int 1838 403 1603 177 1855 842 293 42 564 1093 ...
#> $ status : int 0 1 0 0 0 1 1 0 1 1 ...
#> $ cnode : Factor w/ 3 levels "0","1-3",">3": 2 3 2 2 3 3 3 3 2 2 ...
#> $ csize : Factor w/ 3 levels "<=20","20-50",..: 1 1 2 2 2 3 2 1 1 2 ...
#> $ pgr2 : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ nodes2 : num 2 16 3 1 5 11 8 9 1 1 ...
#> $ grade3 : Factor w/ 2 levels "1-2","3": 1 2 2 2 1 1 2 1 1 1 ...
#> $ nodes3 : num 0.0849 4.2222 0.2222 0.0123 0.6543 ...
#> $ pgr3 : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ ryear : num 5 1.103 4.389 0.485 5 ...
#> $ rfs : num 0 1 0 0 0 1 1 0 1 1 ...
#> $ epoch : num 1 1 1 1 1 1 1 1 1 1 ...