About this vignette: This tutorial teaches survival data generation for cohort studies. You’ll learn how to create time-to-event data with competing risks (death, disease incidence, loss-to-follow-up), apply temporal ordering constraints, and generate survival indicators for analysis. All code examples run during vignette build to ensure accuracy.
Overview
Survival analysis requires careful coordination of multiple date variables with strict temporal ordering:
- Cohort entry date (baseline, index date)
- Event dates (disease incidence, outcomes of interest)
- Competing risks (death prevents observation of primary event)
- Censoring events (loss to follow-up, administrative censoring)
MockData’s create_wide_survival_data() generates these dates with proper temporal constraints and realistic distributions.
About this tutorial: Clean survival data with correct temporal ordering
This tutorial teaches how to create meaningful, analysis-ready survival data with correct temporal ordering. All dates follow proper survival analysis constraints (entry ≤ event, death as competing risk, etc.).
For data quality testing: If you need raw data with temporal violations (e.g., death before entry, impossible dates) to test data cleaning pipelines, see the Garbage data tutorial, which covers survival-specific garbage patterns.
Basic survival data generation
Minimal example: entry + event
The simplest survival data has two dates: cohort entry and a single event.
# Load minimal-example metadata
variables <- read.csv(
  system.file("extdata/minimal-example/variables.csv", package = "MockData"),
  stringsAsFactors = FALSE,
  check.names = FALSE
)
variable_details <- read.csv(
  system.file("extdata/minimal-example/variable_details.csv", package = "MockData"),
  stringsAsFactors = FALSE,
  check.names = FALSE
)
# Generate entry + event dates
surv_basic <- create_wide_survival_data(
  var_entry_date = "interview_date",
  var_event_date = "primary_event_date", # disease incidence or similar
  var_death_date = NULL,
  var_ltfu = NULL, # Loss to follow-up
  var_admin_censor = NULL, # i.e. End of study follow-up
  databaseStart = "minimal-example",
  variables = variables,
  variable_details = variable_details,
  n = 1000,
  seed = 123
)
# View first few rows
head(surv_basic)
interview_date primary_event_date
1 2002-02-19 <NA>
2 2002-04-08 2002-05-23
3 2001-06-28 <NA>
4 2002-06-10 <NA>
5 2001-07-14 <NA>
6 2003-07-27 <NA>
Result: Each row has an interview date (cohort entry) and primary event date. Some event dates are NA (censored - event did not occur during follow-up).
Event proportions
Not all individuals experience the primary event. The event_prop parameter in variables.csv controls event occurrence rate:
Configuration: The primary_event_date variable has event_prop = 0.3 in variables.csv, meaning 30% of individuals experience the event.
Observed: 300 out of 1000 individuals (30%) experienced the primary event.
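The observed proportion can be checked by counting non-missing event dates. The following is a minimal sketch that uses a small hand-made data frame (`toy_surv`, with hypothetical values) as a stand-in for generated data, so it runs without MockData installed:

```r
# Hypothetical stand-in for generated data: NA means the event did not occur
toy_surv <- data.frame(
  interview_date = as.Date(c("2002-02-19", "2002-04-08", "2001-06-28", "2002-06-10")),
  primary_event_date = as.Date(c(NA, "2002-05-23", NA, NA))
)

# Count and proportion of individuals with an observed event date
n_events <- sum(!is.na(toy_surv$primary_event_date))
prop_events <- n_events / nrow(toy_surv)

cat(n_events, "of", nrow(toy_surv), "individuals had an event\n")
```

Applied to generated data, the same two lines should give a proportion close to the configured event_prop.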
Competing risks: adding death
Death is a competing risk - individuals who die cannot experience the primary event. MockData handles this temporal logic automatically.
# Generate entry + event + death
surv_compete <- create_wide_survival_data(
  var_entry_date = "interview_date",
  var_event_date = "primary_event_date",
  var_death_date = "death_date",
  var_ltfu = NULL,
  var_admin_censor = NULL,
  databaseStart = "minimal-example",
  variables = variables,
  variable_details = variable_details,
  n = 1000,
  seed = 456
)
# Check temporal ordering
head(surv_compete[, c("interview_date", "primary_event_date", "death_date")])
interview_date primary_event_date death_date
1 2005-11-08 2006-01-17 <NA>
2 2005-02-17 <NA> <NA>
3 2003-07-20 <NA> <NA>
4 2002-07-04 <NA> <NA>
5 2002-04-14 2002-06-07 <NA>
6 2004-06-29 2004-09-13 <NA>
Competing risk logic
MockData applies these rules:
- Entry date is baseline: All other dates occur on or after entry
- Death prevents events: If death < event, set event to NA (cannot observe event after death)
- Temporal ordering: interview_date ≤ event_date, interview_date ≤ death_date
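To make the "death prevents events" rule concrete, here is a minimal base R sketch applied to a hand-made data frame. It illustrates the rule as stated above and is not MockData's internal implementation; the data frame `toy` and its values are hypothetical.

```r
# Hypothetical data: row 1 has a death date earlier than its event date
toy <- data.frame(
  interview_date     = as.Date(c("2002-01-10", "2002-03-05", "2003-07-20")),
  primary_event_date = as.Date(c("2004-06-01", "2003-02-11", NA)),
  death_date         = as.Date(c("2003-09-15", NA, "2005-01-02"))
)

# Rows where both dates exist and death comes strictly before the event
died_first <- !is.na(toy$death_date) &
  !is.na(toy$primary_event_date) &
  toy$death_date < toy$primary_event_date

# The event cannot be observed after death, so blank it out
toy$primary_event_date[died_first] <- NA

toy
```

Only row 1 is changed: its event date becomes NA while the death date is kept.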
# Verify temporal ordering
# Note: Dates are already R Date objects (sourceFormat = "analysis" in variables.csv)
interview_dates <- surv_compete$interview_date
event_dates <- surv_compete$primary_event_date
death_dates <- surv_compete$death_date
# Check: All events occur on or after entry
all_events_after_entry <- all(
  event_dates[!is.na(event_dates)] >= interview_dates[!is.na(event_dates)],
  na.rm = TRUE
)
# Check: All deaths occur on or after entry
all_deaths_after_entry <- all(
  death_dates[!is.na(death_dates)] >= interview_dates[!is.na(death_dates)],
  na.rm = TRUE
)
Temporal ordering validation:
- All events occur after entry: TRUE
- All deaths occur after entry: TRUE
This confirms MockData correctly enforces temporal constraints.
Complete survival data: entry + event + death + censoring
Real cohort studies have multiple censoring mechanisms:
- Loss to follow-up: Participants drop out
- Administrative censoring: Study ends on specific date
# Generate complete survival data (all 5 date variables)
surv_complete <- create_wide_survival_data(
  var_entry_date = "interview_date",
  var_event_date = "primary_event_date",
  var_death_date = "death_date",
  var_ltfu = "ltfu_date",
  var_admin_censor = "admin_censor_date",
  databaseStart = "minimal-example",
  variables = variables,
  variable_details = variable_details,
  n = 2000,
  seed = 789
)
# View structure
head(surv_complete)
interview_date primary_event_date death_date ltfu_date admin_censor_date
1 2003-03-24 <NA> <NA> <NA> 2020-04-12
2 2003-02-19 <NA> <NA> 2020-02-15 2010-05-26
3 2005-03-14 2005-05-16 2006-03-14 <NA> 2024-11-15
4 2004-08-14 <NA> <NA> <NA> 2018-07-07
5 2002-10-28 2003-01-16 <NA> <NA> 2017-10-07
6 2002-09-16 <NA> <NA> <NA> 2017-10-29
Creating survival indicators
Survival analysis requires deriving time-to-event and event indicators from the date variables.
Calculate observation end date
Observation ends at the earliest of: primary event, death, loss to follow-up, or administrative censoring.
# Dates are already R Date objects (sourceFormat = "analysis")
# Calculate end date (earliest of all outcomes)
# Build list of existing date columns
date_cols <- c("primary_event_date", "death_date", "ltfu_date", "admin_censor_date")
existing_cols <- date_cols[date_cols %in% names(surv_complete)]
# Calculate minimum date across existing columns
if (length(existing_cols) > 0) {
  # Row-wise minimum that returns NA for all-NA rows.
  # apply() coerces the Date columns to ISO-format strings (YYYY-MM-DD),
  # which sort chronologically, so min() still picks the earliest date.
  surv_complete$t_end <- as.Date(
    apply(surv_complete[, existing_cols, drop = FALSE], 1, function(row_dates) {
      valid_dates <- row_dates[!is.na(row_dates)]
      if (length(valid_dates) == 0) return(NA_real_)
      min(valid_dates)
    }),
    origin = "1970-01-01"
  )
} else {
  surv_complete$t_end <- as.Date(NA)
}
# Calculate follow-up time in days
# Date subtraction returns difftime object; convert to numeric days
surv_complete$followup_days <- as.numeric(difftime(surv_complete$t_end, surv_complete$interview_date, units = "days"))
head(surv_complete[, c("interview_date", "t_end", "followup_days")])
interview_date t_end followup_days
1 2003-03-24 2020-04-12 6229
2 2003-02-19 2010-05-26 2653
3 2005-03-14 2005-05-16 63
4 2004-08-14 2018-07-07 5075
5 2002-10-28 2003-01-16 80
6 2002-09-16 2017-10-29 5522
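An alternative way to take the row-wise minimum is base R's vectorised `pmin()` with `na.rm = TRUE`, which works directly on Date columns and returns NA when every date in a row is missing. The sketch below uses a hypothetical three-row data frame (`toy`) rather than the generated data.

```r
# Hypothetical stand-in for surv_complete (values are illustrative only)
toy <- data.frame(
  primary_event_date = as.Date(c(NA, "2005-05-16", NA)),
  death_date         = as.Date(c(NA, "2006-03-14", NA)),
  ltfu_date          = as.Date(c("2020-02-15", NA, NA)),
  admin_censor_date  = as.Date(c("2010-05-26", "2024-11-15", NA))
)

date_cols <- c("primary_event_date", "death_date", "ltfu_date", "admin_censor_date")

# Pass each date column to pmin() as a separate argument; NAs are ignored
toy$t_end <- do.call(pmin, c(unname(as.list(toy[date_cols])), na.rm = TRUE))

toy[, c(date_cols, "t_end")]
```

Row 3 has no dates at all, so its `t_end` stays NA and the row can be excluded or investigated before analysis.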
Create event indicator
Event indicator identifies why observation ended:
- 0: Censored (loss to follow-up or administrative censoring)
- 1: Primary event occurred
- 2: Death occurred (competing risk)
# Create event indicator
surv_complete$event_indicator <- ifelse(
  !is.na(surv_complete$primary_event_date) & surv_complete$primary_event_date == surv_complete$t_end,
  1, # Event
  ifelse(
    !is.na(surv_complete$death_date) & surv_complete$death_date == surv_complete$t_end,
    2, # Death
    0  # Censored
  )
)
# Tabulate outcomes
table(surv_complete$event_indicator)
Result:
- Censored (0): 1132 (56.6%) - lost to follow-up or administratively censored
- Primary event (1): 582 (29.1%) - experienced primary event
- Death (2): 286 (14.3%) - died before primary event
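Downstream analyses often need the 0/1/2 code in a different shape: a labelled factor with censoring as the first level for competing-risks methods, or separate 0/1 indicators for cause-specific models. A minimal sketch, using a short hypothetical vector in place of `surv_complete$event_indicator`:

```r
# Hypothetical indicator values: 0 = censored, 1 = primary event, 2 = death
event_indicator <- c(0, 1, 2, 0, 1)

# Labelled factor; censoring is deliberately the first level
status <- factor(
  event_indicator,
  levels = 0:2,
  labels = c("censored", "event", "death")
)

# Cause-specific 0/1 indicators (the other outcome is treated as censored)
event_cs <- as.integer(event_indicator == 1)
death_cs <- as.integer(event_indicator == 2)

table(status)
```

Which form is appropriate depends on the modelling approach; the names `status`, `event_cs`, and `death_cs` are illustrative and not part of MockData.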
Distributions for survival data
Survival dates use realistic distributions to match real-world patterns:
Uniform: Dates spread evenly across the follow-up window (administrative censoring, loss to follow-up)
Gompertz: Age-dependent hazard (death, chronic disease)
distribution = "gompertz"
rate = 0.0001
shape = 0.1
Exponential: Constant hazard with early concentration
distribution = "exponential"
rate = 0.001
Distribution parameters are specified in variables.csv and used automatically by create_wide_survival_data().
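To build intuition for how these distributions shape event times, the sketch below draws times by inverse-transform sampling in base R. It is an illustration only: the Gompertz parameterisation (hazard h(t) = rate * exp(shape * t)), the time unit, and the 3650-day uniform window are assumptions, and MockData's internal implementation may differ.

```r
set.seed(42)
n <- 5
u <- runif(n)  # uniform(0, 1) draws, interpreted as survival probabilities S(t)

# Exponential: constant hazard h(t) = rate, so S(t) = exp(-rate * t)
rate_exp <- 0.001
t_exp <- -log(u) / rate_exp

# Gompertz (assumed form): h(t) = rate * exp(shape * t),
# so S(t) = exp(-(rate / shape) * (exp(shape * t) - 1))
rate_g <- 0.0001
shape_g <- 0.1
t_gomp <- log(1 - (shape_g / rate_g) * log(u)) / shape_g

# Uniform: times spread evenly over a hypothetical 0 to 3650 unit window
t_unif <- runif(n, min = 0, max = 3650)

round(cbind(t_exp, t_gomp, t_unif), 1)
```

Solving S(t) = u for t gives the two closed-form expressions used above, which is why no specialised package is needed for this sketch.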
By default, survival dates are generated as R Date objects (sourceFormat = "analysis"). However, you can simulate different raw data formats using the sourceFormat column in variables.csv:
Available sourceFormat values:
- analysis (default): R Date objects ready for analysis
- csv: Character strings in ISO format (YYYY-MM-DD), simulating CSV file imports
- sas: Numeric values (days since 1960-01-01), simulating SAS date format
Why this matters:
Real cohort data doesn’t arrive as clean R Date objects:
- CSV files from survey instruments contain character dates requiring parsing
- SAS files may have numeric dates that need conversion
- Different data sources require different harmonization approaches
Example: Testing different source formats
The sourceFormat value in variables.csv controls the output format. Let’s generate interview_date in SAS numeric format while keeping other dates in analysis format:
# Modify only interview_date to use SAS format
vars_sas <- variables
vars_sas$sourceFormat[vars_sas$variable == "interview_date"] <- "sas"
surv_sas <- create_wide_survival_data(
  var_entry_date = "interview_date",
  var_event_date = "primary_event_date",
  var_death_date = NULL,
  var_ltfu = NULL,
  var_admin_censor = NULL,
  databaseStart = "minimal-example",
  variables = vars_sas, # Modified to use SAS format for interview_date
  variable_details = variable_details,
  n = 100,
  seed = 123
)
# Check the format: interview_date is numeric (SAS), others are Date
head(surv_sas)
interview_date primary_event_date
1 15390 2012-05-06
2 15438 <NA>
3 15154 2011-09-15
4 15501 <NA>
5 15170 <NA>
6 15913 <NA>
Result: The interview_date column is numeric (days since 1960-01-01), while primary_event_date remains a Date object. This simulates mixed-format raw data that requires harmonization.
To convert SAS dates to R Date format:
# Convert SAS numeric dates to R Date
interview_converted <- as.Date(surv_sas$interview_date, origin = "1960-01-01")
head(interview_converted)
[1] "2002-02-19" "2002-04-08" "2001-06-28" "2002-06-10" "2001-07-14"
[6] "2003-07-27"
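The csv format can be harmonised in a similar way, and the SAS conversion can be reversed when a test needs numeric dates. A minimal base R sketch, using two hand-typed values taken from the output above (the object names are illustrative):

```r
# Character dates as they would arrive from a CSV import
csv_dates <- c("2002-02-19", "2002-04-08")
from_csv <- as.Date(csv_dates, format = "%Y-%m-%d")

# SAS numeric dates count days since 1960-01-01
sas_dates <- c(15390, 15438)
from_sas <- as.Date(sas_dates, origin = "1960-01-01")

# Reverse direction: R counts days since 1970-01-01, so add the gap between origins
sas_offset <- as.numeric(as.Date("1970-01-01") - as.Date("1960-01-01"))
to_sas <- as.numeric(from_csv) + sas_offset

data.frame(csv_dates, from_csv, from_sas, to_sas)
```

Both conversions land on the same Date values, which is the property a harmonisation step should preserve.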
Temporal violations for QA testing
The minimal-example metadata includes temporal violations through the garbage_high_prop and garbage_high_range parameters. These generate future dates that violate temporal constraints for testing validation pipelines:
# Generate survival data with configured garbage dates
surv_qa <- create_wide_survival_data(
  var_entry_date = "interview_date",
  var_event_date = "primary_event_date",
  var_death_date = "death_date",
  var_ltfu = NULL,
  var_admin_censor = NULL,
  databaseStart = "minimal-example",
  variables = variables,
  variable_details = variable_details,
  n = 1000,
  seed = 999
)
# Check for temporal violations
# Look for dates impossibly far in the future (after 2025-01-01)
future_threshold <- as.Date("2025-01-01")
# Count future dates in primary_event_date
n_future_events <- sum(surv_qa$primary_event_date > future_threshold, na.rm = TRUE)
# Count future dates in death_date
n_future_deaths <- sum(surv_qa$death_date > future_threshold, na.rm = TRUE)
# Total violations
n_violations <- n_future_events + n_future_deaths
prop_violations <- round(100 * n_violations / nrow(surv_qa), 1)
Validation detects: 14 temporal violations out of 1000 observations (1.4%). These future dates (> 2025-01-01) represent data quality issues that your validation pipeline should flag.
Breakdown:
- Future primary events: 8
- Future deaths: 6
This demonstrates how MockData’s garbage parameters help test validation logic by generating realistic data quality issues.
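A validation pipeline usually needs to know which rows are affected, not just how many. The helper below is a hypothetical sketch (`flag_date_violations()` is not a MockData function) that flags rows where any date falls before entry or after a cutoff, demonstrated on hand-made data.

```r
# Hypothetical helper: TRUE for rows with any date before entry or after max_date.
# Missing dates are never flagged; entry dates are assumed to be non-missing.
flag_date_violations <- function(data, entry_col, date_cols, max_date) {
  flags <- lapply(date_cols, function(col) {
    d <- data[[col]]
    !is.na(d) & (d < data[[entry_col]] | d > max_date)
  })
  Reduce(`|`, flags)
}

# Illustrative data: row 2 has a future event, row 3 a death before entry
toy <- data.frame(
  interview_date     = as.Date(c("2002-01-10", "2002-03-05", "2003-07-20")),
  primary_event_date = as.Date(c("2004-06-01", "2031-02-11", NA)),
  death_date         = as.Date(c(NA, NA, "2001-01-02"))
)

toy$violation <- flag_date_violations(
  toy,
  entry_col = "interview_date",
  date_cols = c("primary_event_date", "death_date"),
  max_date  = as.Date("2025-01-01")
)

toy
```

Flagged rows can then be reviewed, corrected, or excluded, depending on the cleaning rules of the study.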
Key concepts summary
| Concept | Implementation | Details |
|---|---|---|
| Competing risks | Death prevents primary event | If death < event, set event to NA |
| Event proportions | event_prop in variables.csv | Controls % experiencing each outcome |
| Temporal ordering | Automatic constraint enforcement | All dates ≥ entry date |
| Distributions | Gompertz, uniform, exponential | Specified in variables.csv |
| End date | pmin(event, death, ltfu, admin, na.rm = TRUE) | Earliest outcome defines observation end |
| Event indicator | Derived from date comparison | 0=censored, 1=event, 2=death |
| QA testing | garbage_high_prop and garbage_high_range parameters | Generates temporal violations for validation testing |
What you learned
In this tutorial, you learned:
- Basic survival generation: Entry date + event date with event proportions
- Competing risks: Death as a competing event that prevents primary event observation
- Temporal constraints: How MockData enforces proper date ordering
- Complete cohort data: Entry + event + death + loss-to-follow-up + administrative censoring
- Derived variables: Creating follow-up time and event indicators from raw dates
- Distributions: Using Gompertz, uniform, and exponential for realistic temporal patterns
- QA testing: Generating temporal violations to validate data quality pipelines