---
title: "Name-blind variable-role detection with rolescry"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Name-blind variable-role detection with rolescry}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
library(rolescry)
set.seed(1)
```

## The problem: names lie, data does not

Real datasets arrive with column names that are missing, misleading, in the wrong language, or simply wrong: a column called `category_code` holding continuous lab values, a `gender` column that is actually a free numeric measurement, an outcome buried under an opaque `v7`. Any tool that decides a column's statistical *role* from its **name** inherits every one of those lies.

`rolescry` decides roles from the **data signature** instead -- the guiding principle is *Data inspice, non nomen* ("inspect the data, not the name"). Renaming every column to `col_1, col_2, ...` does not change a single role assignment. This is the **turnusol** (litmus) invariant, and it is the package's keystone test.

## A worked example

```{r example}
d <- data.frame(
  arm  = rep(c(0, 1), each = 50),  # a balanced 2-level grouping
  pre  = rnorm(100, 10, 2),        # measured before ...
  post = rnorm(100, 11, 2),        # ... and after (paired)
  resp = rbinom(100, 1, 0.4)       # a binary response
)
res <- detect_roles(d)
res
summary(res)
```

The same call on the name-stripped twin yields the same roles by position:

```{r blind}
d_blind <- setNames(d, paste0("col_", seq_along(d)))
pos <- function(r, dat) match(r$roles$paired_pairs$columns, names(dat))
identical(pos(detect_roles(d), d), pos(detect_roles(d_blind), d_blind))
```

## How a role is scored

Each column is first **typed by value** (`continuous`, `binary`, `categorical`, `ID`), never by name.
Candidate roles are then scored by signatures that capture the statistical shape a role implies -- correlation and distributional overlap for paired measurements, Bland-Altman bias and intraclass correlation for agreement, event-rate and right-skew for survival, inter-item correlation and a Cronbach-alpha proxy for scale items, and so on. Every score is a transparent sum of named components you can inspect:

```{r breakdown}
res$roles$paired_pairs$components[[1]]
```

### Shannon entropy

For a categorical column with level proportions \(p_1, \dots, p_k\), the normalized Shannon entropy

\[
H_{\text{norm}} = \frac{-\sum_i p_i \log_2 p_i}{\log_2 k} \in [0, 1]
\]

measures how balanced the levels are. A grouping variable (treatment vs control) has high entropy (near-balanced); a near-constant flag has entropy near zero. Entropy drives both the value classifier and the group-balance signal.

### Normalized mutual information

To ask -- name-blind -- whether a candidate grouping actually *carries information about* an outcome, `rolescry` uses normalized mutual information:

\[
\text{NMI}(X, Y) = \frac{I(X; Y)}{\min\{H(X),\, H(Y)\}} \in [0, 1],
\]

which is 0 for independent variables and 1 for a deterministic association, and is comparable across variables with different numbers of levels. It is exposed directly:

```{r nmi}
g <- sample(c("A", "B", "C"), 300, replace = TRUE)
y <- ifelse(g == "A", "event", sample(c("event", "none"), 300, replace = TRUE))
compute_nmi(g, y)          # > 0: g informs y
compute_nmi(g, sample(g))  # ~ 0: shuffled -> independent
```

## The optional, capped name bonus

Names are not *useless* -- they are just untrustworthy. When you do trust them, pass a keyword dictionary via `name_bonus`. Names then act only as a small, **capped** tie-breaker (at most a +10 point nudge, i.e. <= 10% of the selection score); the mathematical signature still dominates (>= 90%), a relationship enforced by `score_gap_ok()`.
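
One way to picture the cap is the minimal base-R sketch below. It is not the package's internal code: `cap_bonus()` and `select_with_bonus()` are hypothetical helpers, the scores are made-up numbers, and the additive form (signature score plus clamped bonus) is an assumption. Only the 10-point ceiling is taken from the description above.

```{r cap-sketch}
# Hypothetical helper: clamp a raw name-match bonus to the range [0, cap].
cap_bonus <- function(raw_bonus, cap = 10) {
  pmin(pmax(raw_bonus, 0), cap)
}

# Hypothetical helper: add the capped bonus to the data-driven signature
# score and return the name of the best-scoring candidate column.
select_with_bonus <- function(signature, raw_bonus, cap = 10) {
  total <- signature + cap_bonus(raw_bonus, cap)
  names(total)[which.max(total)]
}

# Illustrative numbers only: x2 has a strong name match (raw bonus 25).
raw_bonus <- c(x1 = 0, x2 = 25)

# Close race on the data signature: the capped bonus breaks the near-tie.
close_call <- select_with_bonus(c(x1 = 62, x2 = 60), raw_bonus)

# Clear lead on the data signature: the capped bonus cannot overturn it.
clear_gap <- select_with_bonus(c(x1 = 80, x2 = 60), raw_bonus)

c(close_call = close_call, clear_gap = clear_gap)
```

Under these assumptions a 25-point name match is clamped to 10, so it can decide a 2-point race but cannot overturn a 20-point lead. The next chunk runs the package itself on a small two-column frame.
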
```{r namebonus}
clin <- data.frame(
  male  = rbinom(120, 1, 0.5),  # a demographic binary (first)
  death = rbinom(120, 1, 0.3)   # the intended outcome
)
detect_roles(clin)$roles$outcome_binary$columns  # positional default
detect_roles(clin, name_bonus = rolescry_default_name_bonus())$roles$outcome_binary$columns  # "death"
```

## Header-aware loading

`read_data()` reads a file with the header row found by the same information-theoretic scorer (`detect_header()`), so messy exports with title rows or merged cells still load with sensible column names. Delimited text works with base R; spreadsheet and statistical formats use optional packages and degrade gracefully if they are not installed.

## Attribution

`rolescry` is derived from Boynukara, C. (2026). *MDStatR* (v2.1.0 Veritas). Zenodo.

Run `citation("rolescry")` to cite the package and its parent engine.