The Context Window, Episode 3 — with Toffer Winslow, CEO of Diffblue
One of our co-founders stepped in as host for episode three after a scheduling collision left him with very little advance notice and, by his own admission, very little context going into the conversation. Which turned out to be appropriate: this episode is about what happens when you remove the safety net and still need to ship quality code.
The guest was Toffer Winslow, CEO of Diffblue, a company spun out of Oxford University that has spent eight years applying reinforcement learning to a problem most engineers find somewhere between tedious and existentially demoralizing: unit tests. By the end, the conversation had moved well past testing tools into questions about expertise, the iron triangle of modernization, and whether the AI boom is quietly undermining the pipeline of engineers who will be needed to run all of it.
“You rob banks because that’s where the money is. Java is where the money is.”
— Toffer Winslow, CEO of Diffblue
Reinforcement learning vs. LLMs — and why it matters for code
Most people in the AI space have collapsed “AI for code” into a single category dominated by large language models — GitHub Copilot, Claude Code, and their peers. Diffblue started from a different place entirely: reinforcement learning, the technique behind AlphaGo and increasingly behind the fine-tuning of frontier LLMs themselves.
The distinction matters in practice. When Diffblue’s agent generates a unit test, it proposes a candidate, executes it against the actual code, evaluates the result, learns from it, and iterates — sometimes thousands of times — until it produces something that is guaranteed to compile and guaranteed to pass. No hallucinated library imports. No tests that look right but silently fail. A general-purpose LLM, asked to do the same thing, gets you 60–70% of the way there on a good day. Diffblue gets you to 100%, consistently, at scale.
For a single developer writing a handful of tests, that gap might feel manageable. At enterprise scale — 700 Java applications, 3 million lines of legacy code, 3% existing test coverage — the gap is the difference between a project that’s feasible and one that would take 20 years.
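To make that loop concrete, here is a minimal sketch in Java of the general propose-execute-evaluate pattern. The interfaces and names are hypothetical illustrations of the technique, not Diffblue’s actual implementation.

```java
import java.util.Optional;

/**
 * Minimal sketch of a propose-execute-evaluate loop for test generation.
 * All names and interfaces here are hypothetical, not Diffblue's API.
 */
public class TestGenerationLoop {

    /** Result of compiling and running one candidate test against the real code. */
    record Outcome(boolean compiles, boolean passes, String feedback) {}

    /** Proposes a candidate test, informed by feedback from earlier attempts. */
    interface CandidateGenerator {
        String propose(String methodUnderTest, String feedback);
    }

    /** Compiles and executes a candidate test against the actual codebase. */
    interface Sandbox {
        Outcome compileAndRun(String candidateTestSource);
    }

    static Optional<String> generateTest(String methodUnderTest,
                                         CandidateGenerator generator,
                                         Sandbox sandbox,
                                         int maxIterations) {
        String feedback = "";
        for (int i = 0; i < maxIterations; i++) {
            // 1. Propose a candidate test for the method under test.
            String candidate = generator.propose(methodUnderTest, feedback);
            // 2. Execute it against the real code, not a guess about the code.
            Outcome outcome = sandbox.compileAndRun(candidate);
            // 3. Accept only candidates that compile AND pass; anything else
            //    becomes feedback for the next proposal.
            if (outcome.compiles() && outcome.passes()) {
                return Optional.of(candidate);
            }
            feedback = outcome.feedback();
        }
        // Budget exhausted: emit nothing rather than a test that merely looks right.
        return Optional.empty();
    }
}
```

The whole point of the structure is the exit condition: the loop only ever emits a test that has actually compiled and passed against the real code, which is what separates an execution-grounded approach from one that predicts plausible-looking source text.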
The iron triangle of modernization
Toffer introduced a framing that resonated throughout the episode: the iron triangle of application modernization. Cost, time, and risk. The weight of all three landing simultaneously is why enterprises have historically let technical debt pile up for decades. The math never works. The teams who built the original systems are long gone. The documentation, if it ever existed, is gone too. Changing anything feels like defusing a bomb with no diagram.
This is where AI — applied intelligently, not just enthusiastically — changes the equation. The episode discussed a project that brought Sirrus7 and Diffblue together: a client with somewhere between 700 and 800 legacy Java applications, a modernization timeline estimated at 15–20 years, and a codebase with 3% unit test coverage running billions of dollars of business logic. The kind of problem that sounds impossible until you break it into components.
The solution was a pipeline: Diffblue to generate a regression suite and lock in the existing behavior, OpenRewrite to deterministically refactor the code, and LLMs to handle the parts that benefit from generative flexibility. Three months. All the repositories modernized. A project that was previously “never” became “Q1.”
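What “lock in the existing behavior” means in practice is a suite of characterization tests: assertions that pin down what the code currently does, not what a spec says it should do. Here is a minimal hand-written illustration with a hypothetical class and values; a generated suite does the same thing across thousands of methods.

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Test;

class InterestCalculatorTest {

    // Stand-in for an undocumented legacy class. In a real project this is
    // existing production code, not something written for the test.
    static class InterestCalculator {
        double compound(double principal, double rate, int years) {
            return principal * Math.pow(1 + rate, years);
        }
    }

    @Test
    void compoundInterestMatchesCurrentBehavior() {
        InterestCalculator calc = new InterestCalculator();

        // The expected value is whatever the legacy code returns today:
        // 1000 * 1.05^2 = 1102.50. The test documents current behavior, so
        // if a refactor changes the result, the suite fails before the
        // change ships.
        assertEquals(1102.50, calc.compound(1000.00, 0.05, 2), 0.001);
    }
}
```

With that regression net in place, the deterministic refactoring step can run aggressively: any change that alters observable behavior turns the suite red immediately.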
Key ideas from episode 3
1. Reinforcement learning and LLMs are not the same thing. For deterministic, high-stakes outputs like unit tests, RL-based tools produce guaranteed results. LLMs produce probable results — useful, but not the same.
2. Unit tests are the documentation. In legacy codebases where the original developers are gone and docs never existed, a generated regression suite is the only reliable record of what the code is supposed to do.
3. The iron triangle of modernization — cost, time, risk — is being broken by AI. Projects that were effectively impossible (15–20 year timelines) are now achievable in months when the right tools are combined in the right sequence.
4. AI is most powerful in the hands of experts — and that’s the problem. The people who get the most from these tools are those who’ve spent years developing the judgment to evaluate their output. We may be quietly eliminating the path to building that judgment.
5. The right tool for the right job still applies. LLMs, RL agents, static analysis tools, and open-source refactoring frameworks each have a lane. The engineers and services teams who know how to combine them are where the real value lives.
6. Junior engineers aren’t disappearing — but their job description is changing. The path forward is T-shaped breadth: infrastructure, architecture, quality, operations. Deep specialization in a single stack is no longer a viable career strategy.
The expertise crisis nobody’s talking about
The most uncomfortable thread of the episode wasn’t about tools. It was about people.
Toffer put it plainly: AI in its current form is most powerful in the hands of experts — people who can look at the output and say “this is great” or “this is garbage, ask differently.” But those experts built their judgment through years of doing the less glamorous work. The testing. The debugging. The reading of someone else’s undocumented legacy code at 11pm. The exact work that AI tools are now absorbing.
If we automate away the early-career work that produces expert engineers, where does the next generation of experts come from? Who will be able to evaluate AI output in five years if we spend those years removing the experiences that build that capability?
A counterpoint worth sitting with: junior engineers today have an opportunity that didn’t exist a year ago. Fields like model interpretability, AI-native testing frameworks, and LLM orchestration are so new that the usual seniority advantage barely exists. A motivated junior engineer who commits six months to becoming genuinely expert in one of these areas can be as capable as anyone in the world, because everyone is new to it. The window to build that kind of career-defining expertise is open right now, and it won’t stay open forever.
The services layer that makes it all work
One observation from Toffer that’s worth pulling out explicitly: the AI tech stack, by itself, doesn’t modernize anything. The value comes from the people who know how to operate it, the processes they’ve developed for getting reliable output, and the judgment to know when a tool is working and when it’s leading you somewhere you don’t want to go.
That’s not a comfortable thing for the “AI replaces everything” narrative, but it’s an accurate one. Diffblue generates the tests. OpenRewrite does the deterministic refactoring. The LLM handles the generative work. Sirrus7 figures out how to sequence all of it, manages the client relationship, validates the output, and ensures the business logic survives the migration. None of those pieces work without the others. The tool stack is the easy part to replicate. The operational knowledge is not.
Which is, as it happens, a pretty good argument for why both companies think the next several years are going to be interesting.