Name-blind variable-role detection with rolescry

The problem: names lie, data does not

Real datasets arrive with column names that are missing, misleading, in the wrong language, or simply wrong: a column called category_code holding continuous lab values, a gender column that is actually a free numeric measurement, an outcome buried under an opaque v7. Any tool that decides a column’s statistical role from its name inherits every one of those lies.

rolescry decides roles from the data signature instead – the guiding principle is Data inspice, non nomen (“inspect the data, not the name”). Renaming every column to col_1, col_2, ... does not change a single role assignment. This is the turnusol (litmus) invariant, and it is the package’s keystone test.

A worked example

d <- data.frame(
  arm  = rep(c(0, 1), each = 50),  # a balanced 2-level grouping
  pre  = rnorm(100, 10, 2),        # measured before ...
  post = rnorm(100, 11, 2),        # ... and after (paired)
  resp = rbinom(100, 1, 0.4)       # a binary response
)

res <- detect_roles(d)
res
#> <role_detection> 100 observations x 4 variables
#>   paired_pairs       pre, post                    pct=64.5
#>   agreement_pairs    pre, post                    pct=59.1
#>   time_variable      pre                          pct=90.0
#>   event_variable     arm                          pct=90.0
#>   outcome_continuous pre                          pct=60.0
#>   outcome_binary     arm                          pct=60.0
#>   covariate          pre, post, arm, resp         pct=50.0
summary(res)
#>                  role found           columns  pct
#> 1           group_var FALSE                    0.0
#> 2        paired_pairs  TRUE          pre,post 64.5
#> 3     agreement_pairs  TRUE          pre,post 59.1
#> 4       time_variable  TRUE               pre 90.0
#> 5      event_variable  TRUE               arm 90.0
#> 6          subject_id FALSE                    0.0
#> 7  outcome_continuous  TRUE               pre 60.0
#> 8      outcome_binary  TRUE               arm 60.0
#> 9   repeated_measures FALSE                    0.0
#> 10        scale_items FALSE                    0.0
#> 11          covariate  TRUE pre,post,arm,resp 50.0

The same call on the name-stripped twin yields the same roles by position; the check below compares the paired_pairs columns:

d_blind <- setNames(d, paste0("col_", seq_along(d)))
pos <- function(r, dat) match(r$roles$paired_pairs$columns, names(dat))
identical(pos(detect_roles(d), d), pos(detect_roles(d_blind), d_blind))
#> [1] TRUE
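
The same litmus check can be wrapped for any detector function. The sketch below is illustrative and not part of rolescry: toy_binary_detector(), name_detector() and litmus_holds() are hypothetical helpers written in base R. The toy detector picks binary columns purely by value and so passes; the name-based detector fails as soon as the names are stripped.

```r
# Hypothetical stand-in for a value-based detector: returns the names of
# columns whose non-missing values take exactly two distinct values.
toy_binary_detector <- function(dat) {
  is_binary <- vapply(dat, function(x) length(unique(x[!is.na(x)])) == 2L, logical(1))
  names(dat)[is_binary]
}

# Hypothetical name-based detector: trusts the column name alone.
name_detector <- function(dat) grep("arm", names(dat), value = TRUE)

# Litmus check: the detector must select the same column positions
# after every name is replaced by col_1, col_2, ...
litmus_holds <- function(detector, dat) {
  blind <- setNames(dat, paste0("col_", seq_along(dat)))
  identical(match(detector(dat), names(dat)),
            match(detector(blind), names(blind)))
}

toy <- data.frame(
  arm  = rep(c(0, 1), each = 50),
  pre  = rnorm(100, 10, 2),
  resp = rbinom(100, 1, 0.4)
)
litmus_holds(toy_binary_detector, toy)
#> [1] TRUE
litmus_holds(name_detector, toy)
#> [1] FALSE
```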

How a role is scored

Each column is first typed by value (continuous, binary, categorical, ID), never by name. Candidate roles are then scored by signatures that capture the statistical shape a role implies – correlation and distributional overlap for paired measurements, Bland-Altman bias and intraclass correlation for agreement, event-rate and right-skew for survival, inter-item correlation and a Cronbach-alpha proxy for scale items, and so on. Every score is a transparent sum of named components you can inspect:

res$roles$paired_pairs$components[[1]]
#> $name
#> [1] "Correlation"
#> 
#> $score
#> [1] 0
#> 
#> $max
#> [1] 20
#> 
#> $detail
#> [1] "r=-0.00"
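
Here the correlation component earns 0 of 20 points because pre and post were drawn independently, so r is essentially zero. To read all components of a role at once, the list can be flattened into a table. The sketch below uses a hand-built list so that it runs without the package: the first entry copies the printed output, the second is a hypothetical placeholder, and it is an assumption that every component carries the same four fields.

```r
# Mock list mirroring the printed structure. Only the first entry comes from
# the output above; the second is a made-up placeholder for illustration.
components <- list(
  list(name = "Correlation", score = 0, max = 20, detail = "r=-0.00"),
  list(name = "Hypothetical component", score = 12, max = 15, detail = "illustrative only")
)

# Flatten the list into one row per component.
component_table <- do.call(rbind, lapply(components, function(cmp) {
  data.frame(name = cmp$name, score = cmp$score, max = cmp$max,
             detail = cmp$detail, stringsAsFactors = FALSE)
}))

component_table
sum(component_table$score)  # points earned across the listed components
```

With a real result, the same lapply() call could be pointed at res$roles$paired_pairs$components, provided the assumption about the fields holds.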

Shannon entropy

For a categorical column with level proportions \(p_1, \dots, p_k\), the normalized Shannon entropy

\[ H_{\text{norm}} = \frac{-\sum_i p_i \log_2 p_i}{\log_2 k} \in [0, 1] \]

measures how balanced the levels are. A grouping variable (treatment vs control) has high entropy (near-balanced); a near-constant flag has entropy near zero. Entropy drives both the value classifier and the group-balance signal.
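
A minimal base-R sketch of the formula follows. normalized_entropy() is a hypothetical helper, not the package's own implementation. Returning 0 for a column with fewer than two observed levels is an assumption made here, because the ratio is undefined when k = 1.

```r
# Normalized Shannon entropy of a categorical vector (base R only).
# Assumption: fewer than two observed levels gives 0, since log2(1) = 0
# would make the ratio undefined.
normalized_entropy <- function(x) {
  counts <- as.numeric(table(x))   # table() drops NA by default
  counts <- counts[counts > 0]     # ignore unused factor levels
  k <- length(counts)
  if (k < 2L) return(0)
  p <- counts / sum(counts)        # level proportions p_1, ..., p_k
  -sum(p * log2(p)) / log2(k)
}

normalized_entropy(rep(c("treatment", "control"), each = 50))  # balanced groups
#> [1] 1
normalized_entropy(c(rep("no", 99), "yes"))                    # near-constant flag, about 0.08
```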

Normalized mutual information

To ask – name-blind – whether a candidate grouping actually carries information about an outcome, rolescry uses normalized mutual information:

\[ \text{NMI}(X, Y) = \frac{I(X; Y)}{\min\{H(X),\, H(Y)\}} \in [0, 1], \]

which is 0 for independent variables, reaches 1 when the lower-entropy variable is fully determined by the other, and is comparable across variables with different numbers of levels. It is exposed directly:

g <- sample(c("A", "B", "C"), 300, replace = TRUE)
y <- ifelse(g == "A", "event", sample(c("event", "none"), 300, replace = TRUE))
compute_nmi(g, y)          # > 0: g informs y
#> [1] 0.2903596
compute_nmi(g, sample(g))  # ~ 0: shuffled -> independent
#> [1] 0.007514902
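
For readers who want to see the formula at work, here is a base-R sketch that computes the same quantity from a contingency table. nmi_sketch() and entropy_bits() are hypothetical helpers, not the internals of compute_nmi(). Returning 0 when either variable is constant is an assumption made here to avoid dividing by zero.

```r
# Shannon entropy (bits) of a vector of proportions; zero cells are skipped.
entropy_bits <- function(p) {
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Sketch of NMI(X, Y) = I(X; Y) / min(H(X), H(Y)).
# Assumption: returns 0 when either variable is constant (min entropy is 0).
nmi_sketch <- function(x, y) {
  joint <- table(x, y)
  joint <- joint / sum(joint)                     # joint proportions p(x, y)
  hx <- entropy_bits(rowSums(joint))              # H(X)
  hy <- entropy_bits(colSums(joint))              # H(Y)
  mi <- hx + hy - entropy_bits(as.vector(joint))  # I(X;Y) = H(X) + H(Y) - H(X,Y)
  denom <- min(hx, hy)
  if (denom == 0) return(0)
  mi / denom
}

x <- rep(c("A", "B", "C"), times = 100)
y_indep <- rep(c("u", "v"), each = 150)  # every (x, y) cell holds exactly 50 rows

nmi_sketch(x, x)        # deterministic association: 1 up to floating point
nmi_sketch(x, y_indep)  # exactly independent design: 0 up to floating point
```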

The optional, capped name bonus

Names are not useless – they are just untrustworthy. When you do trust them, pass a keyword dictionary via name_bonus. Names then act only as a small, capped tie-breaker (at most a +10 point nudge, i.e. <= 10% of the selection score), so the mathematical signature still dominates (>= 90%); score_gap_ok() enforces this relationship.

clin <- data.frame(
  male  = rbinom(120, 1, 0.5),      # a demographic binary (first)
  death = rbinom(120, 1, 0.3)       # the intended outcome
)
detect_roles(clin)$roles$outcome_binary$columns                                  # positional default
#> [1] "male"
detect_roles(clin, name_bonus = rolescry_default_name_bonus())$roles$outcome_binary$columns  # "death"
#> [1] "death"
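
The effect of the cap can be illustrated with a toy scoring rule. This is a sketch under stated assumptions, not rolescry's internal formula: selection_score() is a hypothetical helper, the additive combination is assumed, and the signature scores are made-up numbers. It shows the two properties described above: a name bonus can break a tie, but it cannot close a gap larger than the cap.

```r
# Hypothetical rule: signature score plus a name bonus clamped to [0, cap].
# Illustrates the cap only; the additive form is an assumption.
selection_score <- function(signature, name_bonus, cap = 10) {
  signature + pmin(pmax(name_bonus, 0), cap)
}

# Two candidates tied on signature: the bonus breaks the tie.
tied <- selection_score(signature = c(male = 60, death = 60), name_bonus = c(0, 10))
names(which.max(tied))
#> [1] "death"

# A candidate trailing by more than the cap is not promoted by its name alone,
# even when the raw bonus (25) exceeds the cap.
gap <- selection_score(signature = c(male = 75, death = 60), name_bonus = c(0, 25))
names(which.max(gap))
#> [1] "male"
```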

Header-aware loading

read_data() reads a file with the header row found by the same information-theoretic scorer (detect_header()), so messy exports with title rows or merged cells still load with sensible column names. Delimited text works with base R; spreadsheet and statistical formats use optional packages and degrade gracefully if they are not installed.

Attribution

rolescry is derived from Boynukara, C. (2026). MDStatR (v2.1.0 Veritas). Zenodo. https://doi.org/10.5281/zenodo.20707791. Run citation("rolescry") to cite the package and its parent engine.