BioUnfold #14 — Similarity Is Not Mechanism: Limits of Representation in Biology

Measurements - Knowledge - Hypothesis

Self-supervised learning has become the default strategy for training foundation models in biology. It offers structure without labels, scale without annotation, and a way to align heterogeneous assays within a single representation. Yet self-supervision is not a single idea. It is a collection of architectural decisions that encode assumptions about how variation should be organized.

Most discussions focus on data scale or modality. Fewer acknowledge that the architecture itself determines which biological signals are preserved, which are suppressed, and which appear more meaningful than they are. Biological images carry multiple sources of variation at once — morphology, texture, dose response, artifacts, and context — so architectural choices inevitably shape what the model represents.

Architecture is not neutral. It makes a bet about the biology.

The Two Levers of Self-Supervision

Self-supervised models vary along two fundamental axes.

1. The task

This defines what the model predicts without labels: reconstructing masked patches in MAE, for instance, or matching a teacher network's output across augmented views in DINO.

2. The regularization

This governs the geometry of the latent space: for instance, the centering and sharpening that prevent representational collapse in DINO.

These distinctions also appear in natural language processing. Masked language modeling (BERT, RoBERTa) resembles reconstruction, emphasizing local structure. Contrastive and distillation-based approaches (SimCSE, Sentence-BERT, MiniLM) resemble alignment tasks, emphasizing global invariances. Across fields, the task largely determines the structure of the representation.
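The contrast between the two task families can be made concrete with a toy NumPy sketch: a reconstruction-style loss is scored locally on masked entries, while an alignment-style loss is scored globally on whole-vector agreement between two views. The encoder, decoder output, and augmentations below are schematic stand-ins, not real MAE or DINO components.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy input: a flat vector of 64 patch features standing in for an image.
x = rng.normal(size=64)

# Reconstruction-style task (MAE-like, schematic): mask most entries and
# score how well a predictor recovers them. The loss is local, per patch.
mask = rng.random(64) < 0.75               # hide ~75% of the patches
x_hat = x + 0.1 * rng.normal(size=64)      # stand-in for a decoder output
recon_loss = np.mean((x_hat[mask] - x[mask]) ** 2)

# Alignment-style task (DINO-like, schematic): embed two augmented views
# and score their agreement. The loss is global, over the whole embedding.
W = np.linalg.qr(rng.normal(size=(64, 16)))[0]  # stand-in encoder

def embed(v):
    z = v @ W
    return z / np.linalg.norm(z)

view_a = x + 0.1 * rng.normal(size=64)     # two stochastic augmentations
view_b = x + 0.1 * rng.normal(size=64)
align_loss = 1.0 - embed(view_a) @ embed(view_b)

print(f"reconstruction loss: {recon_loss:.3f}")
print(f"alignment loss:      {align_loss:.3f}")
```

The reconstruction loss only sees the masked entries; the alignment loss only sees what survives both augmentations. That asymmetry is the architectural bet.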

MAE and DINO Emphasize Different Biology

This divergence is especially visible in imaging.

Reconstruction-based models (MAE, iBOT): rewarded for pixel- and patch-level fidelity, they emphasize local texture and fine-grained structure.

Alignment and distillation models (DINO): rewarded for agreement across augmented views, they emphasize global, augmentation-invariant structure.

Increasing model size does not guarantee better biological generalization. Often, a larger model simply captures the most visually dominant structure more faithfully, not the underlying biology.

How Embeddings Are Used in Practice

A common pattern has emerged across industry:

  1. Start with a general-purpose vision backbone such as DINOv2.
  2. Apply it to biological images without fine-tuning.
  3. Evaluate similarity or clustering on top of the embeddings.
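The steps above can be sketched minimally as follows. The embeddings here are randomly generated stand-ins with two planted visual clusters; in practice they would come from a frozen backbone such as DINOv2.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for steps 1-2: mock frozen-backbone embeddings with two
# planted visual clusters (in practice, DINOv2 outputs per image).
n_per_cluster, dim = 5, 32
centers = rng.normal(size=(2, dim))
embeddings = np.concatenate([
    centers[0] + 0.1 * rng.normal(size=(n_per_cluster, dim)),
    centers[1] + 0.1 * rng.normal(size=(n_per_cluster, dim)),
])

# Step 3: similarity evaluated on top of the frozen embeddings.
def cosine_matrix(E):
    Z = E / np.linalg.norm(E, axis=1, keepdims=True)
    return Z @ Z.T

S = cosine_matrix(embeddings)

# Nearest neighbor (excluding self) recovers the planted clusters:
# the evaluation rewards visual coherence, not mechanism.
np.fill_diagonal(S, -np.inf)
nearest = S.argmax(axis=1)
same_cluster = (nearest // n_per_cluster) == (np.arange(10) // n_per_cluster)
print("neighbors in same cluster:", same_cluster.mean())
```

Nothing in this evaluation distinguishes "visually similar" from "mechanistically equivalent"; the clusters it finds are whatever the backbone's priors make salient.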

This frequently works because these evaluations reward visual coherence, not mechanism. The results often look intuitive and therefore seem reliable, but unsupervised success mostly reflects architectural priors rather than biological equivalence.

Why Latent Space Arithmetic Breaks in Biology

Some NLP and vision models exhibit vector arithmetic in embedding space. Biology rarely does. Perturbations follow nonlinear dose–response curves, trigger stress pathways at high concentrations, or produce weak signal at low concentrations. Cell-line and environmental contexts reshape these relationships further.

To analyze this rigorously, let \(P_a\) and \(P_b\) denote two perturbations, \(c\) a biological context (cell line, assay, environment), and \(f(P \mid c)\) the embedding of perturbation \(P\) observed in context \(c\).

This notation clarifies how similarity is quantified.

What Biological Workflows Actually Require: Equivalence

The central question in many discovery workflows is:

Do two perturbations produce the same mechanism of action in this context?

Similarity is often assessed using cosine similarity, which measures the angle between embedding vectors rather than their distance. It evaluates whether two perturbations point in the same direction in representation space.

\[\cos(f(P_a \mid c), f(P_b \mid c)) \approx 1 \quad \Rightarrow \quad \text{similar mechanism}\]

Cosine similarity assumes that:

  1. Mechanism is encoded as a direction in embedding space.
  2. Dose changes move embeddings along that direction rather than rotating them.
  3. Context shifts preserve the angles between perturbation embeddings.

These assumptions often fail. Dose–response curves are nonlinear or biphasic. High doses introduce stress phenotypes unrelated to mechanism. Low doses may show little signal despite true mechanistic similarity. Context shifts alter phenotypes in ways cosine geometry cannot express.

As a result, two perturbations may appear similar at one dose and dissimilar at another. The issue is not model failure but the mismatch between cosine geometry and biological reality.
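This dose dependence can be made concrete with a deliberately constructed toy example: two perturbations share a mechanism axis, but one picks up a stress-phenotype component at high concentration. The axes `u` and `v` and the response curves are illustrative assumptions, not fitted to data.

```python
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

u = np.array([1.0, 0.0])   # shared mechanism direction (assumed)
v = np.array([0.0, 1.0])   # stress-phenotype direction (assumed)

def f_a(dose):
    # Perturbation A: clean, saturating response along the mechanism axis.
    return (dose / (dose + 1.0)) * u

def f_b(dose):
    # Perturbation B: same mechanism, but a stress phenotype grows
    # with concentration and dominates at high dose.
    return (dose / (dose + 1.0)) * u + 0.1 * dose * v

low, high = 0.1, 10.0
print(f"cosine at low dose:  {cos(f_a(low),  f_b(low)):.3f}")
print(f"cosine at high dose: {cos(f_a(high), f_b(high)):.3f}")
```

The same pair of mechanistically identical perturbations scores as near-identical at low dose and clearly dissimilar at high dose, exactly the failure mode described above.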

A more realistic framing introduces a learned equivalence function:

\[g(P_a, P_b \mid c) \approx 0\]

This function evaluates whether two perturbations are functionally similar within the same context, without relying on linearity in embedding space.

Embeddings represent perturbations. They do not inherently represent relationships between perturbations. Equivalence must be learned.
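One minimal sketch of such a learned equivalence function, under strong simplifying assumptions: pairs of embeddings are labeled by an independent readout of mechanism, and g is a tiny logistic model on the symmetric pair feature |e_a - e_b|. The synthetic data, the feature choice, and the model are all hypothetical illustrations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training data: pairs of perturbation embeddings in one
# context, labeled equivalent (same mechanism) or not by an orthogonal
# readout such as a genetic screen.
dim, n = 8, 400
mech_a = rng.normal(size=(n, dim))
same = rng.random(n) < 0.5
mech_b = np.where(same[:, None], mech_a, rng.normal(size=(n, dim)))
e_a = mech_a + 0.3 * rng.normal(size=(n, dim))
e_b = mech_b + 0.3 * rng.normal(size=(n, dim))
y = (~same).astype(float)         # target: g ~ 0 for equivalent pairs

# g as a tiny logistic model on the symmetric pair feature |e_a - e_b|.
X = np.abs(e_a - e_b)
w, b = np.zeros(dim), 0.0
for _ in range(2000):             # plain gradient descent
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    grad = p - y
    w -= 0.1 * X.T @ grad / n
    b -= 0.1 * grad.mean()

def g(ea, eb):
    """Learned equivalence score: near 0 for same mechanism, near 1 otherwise."""
    z = np.abs(ea - eb) @ w + b
    return 1.0 / (1.0 + np.exp(-z))

scores = np.array([g(a_, b_) for a_, b_ in zip(e_a, e_b)])
acc = ((scores > 0.5) == ~same).mean()
print(f"training accuracy: {acc:.2f}")
```

The point is not the model class but the supervision: g is trained on pairs labeled for mechanism, so it can learn a relation that no fixed geometry on single-perturbation embeddings guarantees.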

The Core Argument

Self-supervised architectures embed assumptions about how biological variation should be organized. They determine which biological signals are preserved, which are suppressed, and which appear more meaningful than they are.

Biological foundation models will not succeed through scale alone. They will succeed when architectural choices align with the biological questions that matter most:

Are two perturbations equivalent in this context?