BioUnfold #12 — Lead Optimization: Learning the Chemistry

Lead optimization is where discovery becomes engineering. It is the careful process of turning an active molecule — often unstable, imperfect, and biologically complex — into something that can survive the full environment of the body.

In BioUnfold #7, I wrote about how AI can help chemists think beyond binding and optimize the whole molecule. Here, the focus is not on the property space itself but on the process: how learning and chemistry interact inside the DMTA loop.

From Hit Discovery to Lead Optimization

At the end of Hit Discovery, teams typically emerge with several dozen active molecules across a range of chemotypes. Some chemotypes produce a small constellation of variants, while others appear as singletons — molecules that show activity but offer no immediate structure–activity relationship. Once singletons, artefacts, and unstable actives are removed, most programs begin lead optimization with a few dozen viable starting points.

These molecules are rarely “drug-like.” They show potency, but little else is tuned. Some bind cleanly; some behave inconsistently across assays; others depend on context that is not yet understood. But they represent the first concrete evidence that the biology can be modulated. Lead optimization begins by taking these early signals and asking a more ambitious question:

How do we turn a biologically active molecule into a viable therapeutic candidate?

During hit discovery, success meant “find signal.”
During lead optimization, success means “shape that signal into something that works in a living system.”

DMTA: A Learning System Hidden in Plain Sight

DMTA — Design, Make, Test, Analyze — is both a workflow and a feedback system. Each iteration teaches the team what improves the molecule and what the system will not tolerate.

Even without AI, DMTA is a closed-loop learning process. With AI, it becomes explicit: the model proposes hypotheses, experiments validate them, and the system updates. But this only works if the process is structured to enable learning at the same pace as the chemistry.

Pricing and Speed: The Real Constraints of DMTA

A reality rarely discussed publicly is that DMTA is constrained less by algorithms and more by synthesis cost and assay throughput.

A good DMTA pipeline treats model output not as “the molecules to make,” but as ranked hypotheses competing for scarce experimental capital.
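
One way to picture "ranked hypotheses competing for scarce experimental capital" is a small budgeted selection. The sketch below is illustrative only: the `Hypothesis` fields, the cost units, and the greedy value-per-cost rule are assumptions of mine, not a description of any particular pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    name: str
    predicted_value: float  # hypothetical model score, higher is better
    synthesis_cost: float   # hypothetical cost to make and test, must be > 0


def select_within_budget(hypotheses, budget):
    """Greedily pick hypotheses by predicted value per unit cost."""
    ranked = sorted(
        hypotheses,
        key=lambda h: h.predicted_value / h.synthesis_cost,
        reverse=True,
    )
    chosen, spent = [], 0.0
    for h in ranked:
        # Skip anything that would exceed the remaining budget.
        if spent + h.synthesis_cost <= budget:
            chosen.append(h)
            spent += h.synthesis_cost
    return chosen, spent


# Made-up candidates for illustration.
candidates = [
    Hypothesis("cmpd-A", 0.90, 8.0),
    Hypothesis("cmpd-B", 0.60, 2.0),
    Hypothesis("cmpd-C", 0.50, 2.0),
    Hypothesis("cmpd-D", 0.30, 4.0),
]
chosen, spent = select_within_budget(candidates, budget=8.0)
```

In this toy case the top-scoring molecule (cmpd-A) is not made, because three cheaper hypotheses fit in the same budget. A real program would weigh far more than a single score, but the framing is the same: the model ranks, the budget decides.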

Thin Data, Practical Constraints, and the Need for Discipline

At the start of lead optimization, teams often have fewer than twenty analogs per chemotype with measured activity. Structure–activity relationships are faint. Data are noisy. Models cannot yet distinguish signal from artefact — not because they are weak, but because the biology has not expressed its shape.

Assay strategy widens this gap:

Most programs therefore combine both: the binding assay provides reliable signal and a calibration target for models, while the cell assay provides biological truth. Together, they define the early DMTA cadence as much as any chemistry decision.

Models Must Move at the Pace of Chemistry

One of the quietest but most damaging failure modes in AI-driven chemistry is model staleness. If a model is trained on data from two cycles ago, it proposes molecules aligned with old priorities. Chemistry and assay realities move forward; the model points backward.

To avoid this, the model must be:

When models update at the same cadence as experiments, DMTA becomes a coherent learning loop rather than a parallel track.
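
A staleness check can be as simple as comparing the last cycle a model was trained on with the last cycle that has delivered data. This is a minimal sketch under my own assumptions: `ModelSnapshot`, `trained_through_cycle`, and `max_lag` are hypothetical names, and cycles are assumed to be numbered with integers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSnapshot:
    # Last DMTA cycle whose assay data was included in training (assumed integer index).
    trained_through_cycle: int


def cycles_behind(model, latest_completed_cycle):
    """Number of completed cycles the model has not yet seen."""
    return max(0, latest_completed_cycle - model.trained_through_cycle)


def is_stale(model, latest_completed_cycle, max_lag=0):
    """A model is stale if it lags the experiments by more than max_lag cycles."""
    return cycles_behind(model, latest_completed_cycle) > max_lag


old_model = ModelSnapshot(trained_through_cycle=3)
fresh_model = ModelSnapshot(trained_through_cycle=5)
```

With `max_lag=0`, a model trained through cycle 3 is flagged as stale once cycle 5 has reported, which is the "two cycles ago" situation described above. Running this kind of gate before each design meeting is one way to keep stale proposals out of the queue.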

Exploration vs. Exploitation Across the DMTA Lifecycle

Not all cycles are equal:

Early cycles

In the earliest cycles, very little data is available, so the priority is to apply project-specific static filters (Ro5 heuristics, aromatic ring limits, solubility thresholds, toxicophore removal).
As cycles progress, these filters become more sophisticated as the team begins to understand what the biology will accept.
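
As a rough illustration of a static filter, the sketch below applies the standard Rule-of-Five cut-offs (molecular weight at most 500, logP at most 5, at most 5 hydrogen-bond donors, at most 10 acceptors) plus an aromatic ring limit. It assumes descriptors have already been computed elsewhere. The `Descriptors` class, the one-violation tolerance, and the ring limit of 3 are illustrative choices, not project recommendations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptors:
    # Precomputed values; how they are calculated is outside this sketch.
    name: str
    mol_weight: float
    logp: float
    h_donors: int
    h_acceptors: int
    aromatic_rings: int


def ro5_violations(d):
    """Count how many Rule-of-Five thresholds the molecule exceeds."""
    return sum([
        d.mol_weight > 500,
        d.logp > 5,
        d.h_donors > 5,
        d.h_acceptors > 10,
    ])


def passes_static_filters(d, max_ro5_violations=1, max_aromatic_rings=3):
    """Project-specific gate: both limits are assumptions a team would tune."""
    return (
        ro5_violations(d) <= max_ro5_violations
        and d.aromatic_rings <= max_aromatic_rings
    )


# Made-up descriptor values for illustration.
library = [
    Descriptors("cmpd-1", 320.4, 2.1, 2, 5, 2),
    Descriptors("cmpd-2", 612.7, 6.3, 3, 9, 3),  # two Ro5 violations
    Descriptors("cmpd-3", 455.0, 4.2, 1, 7, 5),  # too many aromatic rings
]
kept = [d.name for d in library if passes_static_filters(d)]
```

Keeping the thresholds as function arguments means the filter can tighten or relax from cycle to cycle without changing the code.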

Middle cycles

Once enough data accumulates, a generative model can be trained.
The model should be steerable, so that team priorities become explicit inputs that guide the search.

Late cycles

A healthy DMTA system implicitly has a temperature parameter — high in the early phase, cooling as decisions become more constrained.
Most AI pipelines ignore this progression, generating excessive novelty late in programs or over-exploiting too early, leading to misalignment between model behaviour and chemical feasibility.
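
The temperature idea can be made concrete with a cooling schedule and a softmax over candidate scores. This is a sketch under stated assumptions: the geometric schedule, the start and end temperatures, and the example scores are all invented for illustration.

```python
import math


def temperature(cycle, total_cycles, t_start=2.0, t_end=0.1):
    """Geometric cooling from t_start at cycle 0 to t_end at the final cycle."""
    if total_cycles < 2:
        return t_end
    frac = cycle / (total_cycles - 1)
    return t_start * (t_end / t_start) ** frac


def selection_probabilities(scores, temp):
    """Softmax over scores: high temp spreads picks, low temp favours the top score."""
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp((s - top) / temp) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical model scores for three candidate designs.
scores = [1.0, 0.8, 0.2]
early = selection_probabilities(scores, temperature(0, 10))
late = selection_probabilities(scores, temperature(9, 10))
```

At the first cycle the three designs are picked with fairly similar probability, which is exploration. By the last cycle almost all of the probability sits on the best-scoring design, which is exploitation. The exact schedule matters less than having one that the team agrees on and can adjust.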

Bridging the Tempo Gap Between Chemistry and Data Science

A recurring challenge in real programs is that computational and chemical workflows operate on different tempos. Chemistry advances according to synthesis queues and assay turnaround times. Data science advances based on clean inputs, feature stability, and model retraining cycles.

When these rhythms diverge, insights often land after decisions have been made — not because chemists resist AI, and not because data scientists lag, but because the process does not define when the model should influence the decision.

The solution is to align roles explicitly:

When this alignment is present, AI becomes a directional tool. When it is absent, AI becomes commentary.

From Iteration to Direction

Lead optimization can feel incremental — cycle after cycle, small adjustments, gradual improvements. Yet it is also the stage where the feedback structure between design, experiment, and analysis becomes visible.

The model proposes.
The lab tests.
The system learns.
And slowly, the molecule becomes a candidate.

What elevates a program is not the sophistication of the algorithm, but the architecture of the loop. When computation and experiment move together, DMTA becomes more than a workflow: it becomes a directional engine capable of guiding a molecule toward the clinic with speed, coherence, and scientific realism.