About this vignette: This tutorial teaches you how to generate date variables with realistic temporal patterns. You’ll learn interval notation, distribution options, source format simulation, and data quality testing. All code examples run during vignette build to ensure accuracy.
Overview
This tutorial teaches temporal data generation in MockData. You’ll learn how to create date variables for:
-
Cohort entry dates (index dates, baseline dates)
-
Event dates (death, diagnosis, hospital admission)
-
Administrative dates (censoring dates, data freeze dates)
-
Temporal data quality testing (out-of-period dates for validation)
Date variables use interval notation to specify ranges and support three distributions (uniform, Gompertz, exponential) for realistic temporal patterns.
Basic date variable setup
Configuration structure
Date variables use the same two-file structure as other variables, with interval notation for date ranges:
variables.csv:
| ices_v01 |
interview_date |
enabled |
Date |
Interview date |
1 |
variable_details.csv:
| ices_v01 |
ices_d001 |
interview_date |
[2001-01-01,2005-12-31] |
Interview date range |
Key points:
- Use interval notation in
recStart: [start_date,end_date] (square brackets, comma-separated, ISO format)
- Dates use ISO format: YYYY-MM-DD
- Single row per date variable (not separate rows for start/end)
- All date variables from minimal-example use
ices_v* namespace
Generating date variables
# Load minimal-example metadata
variables <- read.csv(
system.file("extdata/minimal-example/variables.csv", package = "MockData"),
stringsAsFactors = FALSE
)
variable_details <- read.csv(
system.file("extdata/minimal-example/variable_details.csv", package = "MockData"),
stringsAsFactors = FALSE
)
# Generate interview dates
interview_data <- create_mock_data(
databaseStart = "minimal-example",
variables = variables,
variable_details = variable_details,
n = 100,
seed = 123
)
# View interview date distribution
head(interview_data$interview_date)
[1] "2001-03-08" "2003-05-28" "2005-10-22" "2005-06-17" "2002-01-18"
[6] "2005-04-24"
Min. 1st Qu. Median Mean 3rd Qu. Max.
"2001-01-17" "2002-01-02" "2002-11-14" "2003-03-06" "2004-05-07" "2005-12-12"
Result: 100 interview dates uniformly distributed between 2001-01-01 and 2005-12-31 (matching the minimal-example metadata).
Note: The interval notation in recStart specifies the complete date range in a single row.
Generating single date variables
For interactive development or testing, use create_date_var() to generate a single date variable without creating an entire dataset:
# Generate just the interview_date variable
interview_dates_only <- create_date_var(
var = "interview_date",
databaseStart = "minimal-example",
variables = variables,
variable_details = variable_details,
n = 100,
seed = 456
)
# View the result (single-column data frame)
head(interview_dates_only)
interview_date
1 2005-11-08
2 2005-02-17
3 2003-07-20
4 2002-07-04
5 2002-04-14
6 2004-06-29
Min. 1st Qu. Median Mean 3rd Qu. Max.
"2001-01-31" "2002-02-25" "2003-06-25" "2003-06-28" "2004-07-30" "2005-12-31"
Use cases for create_date_var():
- Testing date specifications interactively
- Generating single variables for specific tests
- Exploring distribution patterns
- Quick prototyping before batch generation
For complete datasets: Use create_mock_data() to generate all variables at once (as shown in the previous section).
Date ranges and distributions
Distribution options
MockData supports three distributions for date generation:
Uniform (default): All dates equally likely across the range
- Use for: Cohort accrual with constant enrollment, administrative dates
- Example:
distribution = "uniform"
Gompertz: Useful for survival/event times with higher events near end of range
- Use for: Typical survival patterns
- Example:
distribution = "gompertz"
Exponential: More events near start of range
- Use for: Time-to-event with early concentration
- Example:
distribution = "exponential"
Note: Distribution types are specified in the distribution column of variables.csv. For interactive generation (single variable functions), you can override the metadata distribution.
Comparing distributions
Different distributions create different temporal patterns. Here’s how they compare using the same date range:
# Generate dates with uniform distribution (default)
uniform_dates <- create_date_var(
var = "interview_date",
databaseStart = "minimal-example",
variables = variables,
variable_details = variable_details,
n = 1000,
seed = 100
)
# Calculate median date
median_uniform <- median(as.Date(uniform_dates$interview_date))
# View median
median_uniform
The uniform distribution produces a median date near the middle of the range (2001-01-01 to 2005-12-31), as expected for evenly distributed dates.
Distribution patterns:
| Uniform |
Middle of range (~2003) |
Calendar dates, recruitment periods, random events |
| Gompertz |
Shifted toward end (~2004-2005) |
Mortality data, age-related events |
| Exponential |
Shifted toward start (~2001-2002) |
First diagnosis, early treatment, rapid events |
Choosing a distribution:
-
Start with uniform for calendar dates without known temporal patterns
-
Use Gompertz for mortality and age-related outcomes
-
Use exponential for first-event or early-diagnosis scenarios
The distribution is read from the distribution column in variables.csv, making it easy to specify different patterns for different date variables.
Event proportions for survival data
When generating event dates (death, diagnosis, hospitalization), not all individuals experience the event during follow-up. The event_prop parameter controls what proportion of individuals experience the event versus being censored (no event observed).
Basic event_prop example
Let’s generate death dates where only 30% of individuals die during the study period:
# Create simple death date configuration
death_vars <- data.frame(
uid = "death_v1",
variable = "death_date",
role = "enabled",
variableType = "Date",
rType = "date",
position = 1,
distribution = "gompertz",
rate = 1e-04,
shape = 0.1,
followup_min = 0,
followup_max = 3650,
event_prop = 0.3, # 30% experience death
sourceFormat = "analysis",
stringsAsFactors = FALSE
)
death_details <- data.frame(
uid = "death_v1",
uid_detail = "death_v1_d1",
variable = "death_date",
recStart = "[2001-01-01,2005-12-31]",
recEnd = "copy",
catLabel = "Death dates",
stringsAsFactors = FALSE
)
# Generate death dates (requires anchor_date for survival variables)
anchor_dates <- as.Date("2001-01-01") + sample(0:1826, 100, replace = TRUE)
df_mock <- data.frame(anchor_date = anchor_dates)
death_data <- create_date_var(
var = "death_date",
databaseStart = "tutorial",
variables = death_vars,
variable_details = death_details,
df_mock = df_mock,
n = 100,
seed = 456
)
# Check event proportion
n_deaths <- sum(!is.na(death_data$death_date))
n_censored <- sum(is.na(death_data$death_date))
observed_prop <- n_deaths / nrow(death_data)
Results:
- Deaths observed: 30 out of 100 (30%)
- Censored (no death): 70 (70%)
- Expected: 30% deaths (from
event_prop = 0.3)
Sample of first 10 observations (showing NA for censored cases):
# Display first 10 rows to show mixture of dates and NA values
head(death_data, 10)
death_date
1 <NA>
2 <NA>
3 <NA>
4 2003-12-30
5 2004-04-19
6 <NA>
7 <NA>
8 <NA>
9 <NA>
10 <NA>
Key concepts:
-
event_prop = 0.3: 30% of individuals experience the event (get a date)
-
Censored observations: The remaining 70% have
NA for death_date (event not observed)
-
Survival analysis use:
event_prop simulates realistic cohort data where not everyone experiences the outcome
When to use event_prop:
- Death dates in cohort studies
- Disease incidence dates
- Hospital admission dates
- Any time-to-event outcome where censoring occurs
For complete survival data examples with competing risks and multiple event types, see the Survival data tutorial.
By default, MockData generates dates as R Date objects (analysis-ready format). However, real survey data comes in different formats depending on the source. The sourceFormat column in variables.csv controls how dates are generated to simulate different raw data formats.
MockData supports three sourceFormat formats specified in the variables.csv metadata:
analysis (default): R Date objects ready for analysis
csv: Character strings in ISO format (YYYY-MM-DD), simulating dates from CSV files
sas: Numeric values (days since 1960-01-01), simulating SAS date format
The sourceFormat column in variables.csv controls the output format. Let’s generate the same date variable in all three formats by modifying the metadata:
# Format 1: analysis (R Date objects) - DEFAULT
# Variables already has sourceFormat = "analysis" by default
dates_analysis <- create_date_var(
var = "interview_date",
databaseStart = "minimal-example",
variables = variables,
variable_details = variable_details,
n = 5,
seed = 100
)
# Check the class and values
class(dates_analysis$interview_date)
head(dates_analysis$interview_date, 3)
[1] "2005-10-04" "2002-05-18" "2004-10-13"
# Format 2: csv (character strings)
vars_csv <- variables
vars_csv$sourceFormat <- "csv"
dates_csv <- create_date_var(
var = "interview_date",
databaseStart = "minimal-example",
variables = vars_csv,
variable_details = variable_details,
n = 5,
seed = 100 # Same seed = same underlying dates
)
# Check the class and values
class(dates_csv$interview_date)
head(dates_csv$interview_date, 3)
[1] "2005-10-04" "2002-05-18" "2004-10-13"
# Format 3: sas (numeric days since 1960-01-01)
vars_sas <- variables
vars_sas$sourceFormat <- "sas"
dates_sas <- create_date_var(
var = "interview_date",
databaseStart = "minimal-example",
variables = vars_sas,
variable_details = variable_details,
n = 5,
seed = 100 # Same seed = same underlying dates
)
# Check the class and values
class(dates_sas$interview_date)
head(dates_sas$interview_date, 3)
Key insight: The same underlying dates are represented in three different formats. Using the same seed (100) ensures all three formats contain the same dates, just stored differently.
All three formats represent identical dates. You can convert between them:
# Convert CSV format (character) to Date
csv_to_date <- as.Date(dates_csv$interview_date)
# Convert SAS format (numeric) to Date
sas_to_date <- as.Date(dates_sas$interview_date, origin = "1960-01-01")
# Verify all represent the same dates
all(csv_to_date == dates_analysis$interview_date)
all(sas_to_date == dates_analysis$interview_date)
Use case: testing harmonization pipelines
The sourceFormat metadata column is particularly useful for testing date parsing and harmonization code. Generate mock data in the format matching your raw source files:
# Example: Testing harmonization from CSV source
vars_csv_source <- variables
vars_csv_source$sourceFormat <- "csv"
# Generate mock data in CSV format
mock_csv <- create_mock_data(
databaseStart = "minimal-example",
variables = vars_csv_source,
variable_details = variable_details,
n = 1000
)
# Test your harmonization logic
library(dplyr)
harmonized <- mock_csv %>%
mutate(
# Parse character dates (your harmonization code)
interview_date = as.Date(interview_date, format = "%Y-%m-%d"),
# Calculate derived variables
days_from_start = as.numeric(interview_date - as.Date("2001-01-01"))
)
# Verify harmonization succeeded
stopifnot(inherits(harmonized$interview_date, "Date"))
stopifnot(all(harmonized$days_from_start >= 0, na.rm = TRUE))
Why this matters
Real survey data doesn’t arrive as clean R Date objects:
-
CSV files: Dates are character strings that need parsing
-
SAS files: Dates may be numeric (if haven doesn’t auto-convert)
-
SPSS/Stata files: Various numeric formats with different epochs
Setting sourceFormat in variables.csv lets you generate mock data that matches your actual raw data format, allowing you to test your entire harmonization pipeline.
Data quality for dates
Future dates
Simulate data entry errors where dates are in the future using garbage_high parameters in variables.csv:
# Create configuration with future date garbage
birth_config <- data.frame(
uid = "birth_date_v1",
variable = "birth_date",
role = "enabled",
variableType = "Date",
rType = "date",
position = 1,
garbage_high_prop = 0.02,
garbage_high_range = "[2026-01-01,2030-12-31]",
stringsAsFactors = FALSE
)
birth_details <- data.frame(
uid = "birth_date_v1",
uid_detail = "birth_date_v1_d1",
variable = "birth_date",
recStart = "[1950-01-01,2010-12-31]",
catLabel = "Valid birth dates",
stringsAsFactors = FALSE
)
# Write to temporary files
temp_dir <- tempdir()
birth_config_path <- file.path(temp_dir, "birth_config.csv")
birth_details_path <- file.path(temp_dir, "birth_details.csv")
write.csv(birth_config, birth_config_path, row.names = FALSE)
write.csv(birth_details, birth_details_path, row.names = FALSE)
# Generate birth dates with future date garbage
birth_data <- create_mock_data(
databaseStart = "tutorial",
variables = birth_config_path,
variable_details = birth_details_path,
n = 1000,
seed = 789
)
# Check for future dates (after 2025)
future_threshold <- as.Date("2025-12-31")
n_future <- sum(birth_data$birth_date > future_threshold, na.rm = TRUE)
prop_future <- n_future / nrow(birth_data)
future_dates_sample <- head(sort(birth_data$birth_date[birth_data$birth_date > future_threshold]), 5)
Result: 20 birth dates (2%) are in the future (after 2025-12-31), which is impossible for current data. Sample of future dates: 2026-01-18, 2026-05-28, 2026-08-02, 2026-11-24, 2027-02-09.
Key insight: Garbage data is now specified at the variable level in variables.csv, not in variable_details.csv.
Past dates (temporal violations)
Simulate impossibly old dates using garbage_low parameters:
# Create configuration with old date garbage
diag_config <- data.frame(
uid = "diagnosis_date_v1",
variable = "diagnosis_date",
role = "enabled",
variableType = "Date",
rType = "date",
position = 1,
garbage_low_prop = 0.03,
garbage_low_range = "[1850-01-01,1900-12-31]",
stringsAsFactors = FALSE
)
diag_details <- data.frame(
uid = "diagnosis_date_v1",
uid_detail = "diagnosis_date_v1_d1",
variable = "diagnosis_date",
recStart = "[2000-01-01,2020-12-31]",
catLabel = "Valid diagnosis dates",
stringsAsFactors = FALSE
)
# Write to temporary files
temp_dir <- tempdir()
diag_config_path <- file.path(temp_dir, "diag_config.csv")
diag_details_path <- file.path(temp_dir, "diag_details.csv")
write.csv(diag_config, diag_config_path, row.names = FALSE)
write.csv(diag_details, diag_details_path, row.names = FALSE)
# Generate diagnosis dates with old date garbage
diag_data <- create_mock_data(
databaseStart = "tutorial",
variables = diag_config_path,
variable_details = diag_details_path,
n = 1000,
seed = 456
)
# Check for impossibly old dates (before 1950)
old_threshold <- as.Date("1950-01-01")
n_old <- sum(diag_data$diagnosis_date < old_threshold, na.rm = TRUE)
prop_old <- n_old / nrow(diag_data)
old_dates_sample <- head(sort(diag_data$diagnosis_date[diag_data$diagnosis_date < old_threshold]), 5)
Result: 30 diagnosis dates (3%) are from before 1950-01-01 (1850-1900 range), which is unrealistic for modern medical data. Sample of old dates: 1851-10-23, 1851-12-19, 1853-09-30, 1854-07-24, 1854-07-27.
Use cases for garbage dates
-
Testing validation pipelines: Ensure your code catches impossible dates
-
Training analysts: Show examples of real-world data quality issues
-
Data cleaning scripts: Test date range checks and filtering logic
Note: For complete examples of garbage date generation integrated into metadata files, see the minimal-example configuration files.
Checking generated dates
After generating dates, validate the results:
# Check interview date range
interview_dates <- as.Date(interview_data$interview_date)
# Calculate summary statistics
date_min <- min(interview_dates, na.rm = TRUE)
date_max <- max(interview_dates, na.rm = TRUE)
n_missing <- sum(is.na(interview_dates))
# Validate dates are within expected range (from metadata)
expected_min <- as.Date("2001-01-01")
expected_max <- as.Date("2005-12-31")
all_valid <- all(interview_dates >= expected_min &
interview_dates <= expected_max,
na.rm = TRUE)
Validation results:
- Date range: 2001-01-17 to 2005-12-12
- Missing dates: 0
- All dates within expected range (2001-01-01 to 2005-12-31): TRUE
This validation confirms the generated dates match the metadata specifications.
Key concepts summary
| Date ranges |
Interval notation in recStart
|
[2001-01-01,2005-12-31] format (ISO dates) |
| Distributions |
Uniform, Gompertz, Exponential |
Specified in distribution column of variables.csv |
| Event proportions |
event_prop column in variables.csv |
Controls % experiencing event vs. censored (NA) |
| Garbage dates |
garbage_low_prop/range, garbage_high_prop/range
|
Specified in variables.csv for data quality testing |
| Validation |
Check ranges and distributions |
Use summary(), min(), max() after generation |
| Source formats |
sourceFormat column in variables.csv |
Values: “csv” (character), “sas” (numeric), “analysis” (Date) |
What you learned
In this tutorial, you learned:
-
Date configuration basics: How to specify date ranges using interval notation in variable_details.csv
-
Single variable generation: Using
create_date_var() for interactive development and testing
-
Distribution options: Uniform, Gompertz, and exponential distributions for realistic temporal patterns
-
Distribution selection: When to use each distribution type for different use cases
-
Event proportions: Using
event_prop to control the proportion of individuals experiencing events versus being censored
-
Source data formats: Using the
sourceFormat metadata column to simulate different raw data formats (csv, sas, analysis)
-
Format conversion: How the same dates can be represented in different formats for harmonization testing
-
Data quality testing: Generating corrupt future/past dates using
garbage_low and garbage_high parameters for validation pipeline testing
-
Validation: Checking generated dates are within expected ranges and match metadata specifications
Next steps
Tutorials:
Reference: