Why the Hard Part Isn't the LLM: Harness Engineering in FinCatch

There's a growing argument in AI research that what makes an agent reliable has surprisingly little to do with which model is powering it. What matters more is everything built around the model: the system that decides how it runs, what tools it can use, how it remembers things, and what keeps it from going off the rails. A recent survey from CMU, Yale, and Amazon researchers gave this idea a name, agent harness engineering, and after looking at more than 170 real projects the authors concluded that reliability doesn't live in the model's "brain." It lives in the scaffolding around it.

We read the paper with an uncomfortable jolt of recognition, because it described most of the hard decisions we've made building FinCatch.

FinCatch is an investment research platform built around AI agents. You ask a question — about a company, a financial filing, a market trend, how two businesses are connected — and the agents go off to gather data, reason across sources, and hand back a clear analysis. That's easy to describe and genuinely hard to make reliable once real users are leaning on it every day.

The pattern we kept running into was this: every time something broke, the fix was never "use a smarter model." It was always somewhere in the surrounding machinery. Three examples stuck with us.

Memory that follows the agent

Fig 2: Each user's preferences and history get bundled into the cloud, then unpacked by whichever machine catches the next request.

A research agent needs to remember you. If you've told it your preferred data sources, the companies on your watchlist, or how you like your earnings summaries framed, none of that should vanish the moment you start a new session. The obvious fix is to save your settings in a database and feed them back to the agent each time. That works until you realise the agent often runs on a brand-new, throwaway machine that has no memory of anything you did before. You end up with a polite conversation sitting on top of total amnesia.

So we started treating the agent's entire working setup as something portable. Each user's information (preferences, settings, recent context) gets bundled up and stored in the cloud. When a new machine starts handling your request, the first thing it does is unpack that bundle and restore everything exactly as you left it. Changes get saved continuously while you work, and the final state is tucked away when you're done. Getting the timing right was fiddly: early on we lost people's changes when their session ended faster than we could save them. Now that it works, the agent reliably picks up where it left off, no matter which machine happens to catch the request.
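To make the bundle-and-restore cycle concrete, here is a rough sketch in Python. It is an illustration under stated assumptions, not FinCatch's actual code: InMemoryStore stands in for whatever cloud storage is really used, and the names UserBundle, Session, and set_preference are hypothetical. The point it shows is the timing rule described above: restore on start, save on every change, and always flush once more when the session ends.

```python
import json
from dataclasses import asdict, dataclass, field


class InMemoryStore:
    """Hypothetical stand-in for cloud object storage."""

    def __init__(self):
        self._blobs = {}

    def put(self, key, data):
        self._blobs[key] = data

    def get(self, key):
        return self._blobs.get(key)


@dataclass
class UserBundle:
    """Everything that should follow the user from machine to machine."""

    preferences: dict = field(default_factory=dict)
    watchlist: list = field(default_factory=list)
    recent_context: list = field(default_factory=list)

    def to_bytes(self):
        return json.dumps(asdict(self)).encode("utf-8")

    @classmethod
    def from_bytes(cls, data):
        return cls(**json.loads(data.decode("utf-8")))


class Session:
    """Restores the bundle on entry and flushes it on exit, even on error."""

    def __init__(self, store, user_id):
        self.store = store
        self.key = f"bundles/{user_id}.json"  # assumed key layout
        self.bundle = None

    def __enter__(self):
        data = self.store.get(self.key)
        # A brand-new user has no bundle yet, so start from an empty one.
        self.bundle = UserBundle.from_bytes(data) if data else UserBundle()
        return self

    def save(self):
        self.store.put(self.key, self.bundle.to_bytes())

    def set_preference(self, name, value):
        self.bundle.preferences[name] = value
        self.save()  # save on every change, not only at the end

    def __exit__(self, exc_type, exc, tb):
        self.save()  # final flush so late changes are not lost
        return False  # do not swallow exceptions
```

In this sketch a second Session opened for the same user on a different "machine" sees whatever the first one left behind, including edits that were made directly on the bundle and only written by the final flush.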

Machines that already know you

For safety and tidiness, we run each user's work in its own isolated environment so nobody's session bleeds into anyone else's. The downside is that a fresh environment starts every request cold, and cold is expensive in two ways. There's the obvious delay of starting everything up, and a subtler cost too: an environment that knows nothing about you tends to produce slightly worse answers than one that's already adjusted to your patterns.

We solved this by routing each request to the warmest environment that already belongs to you. The system first looks for one that's already working on your session. If that one's busy wrapping up, the request waits for it. If one is winding down from your recent activity, the system wakes it back up. Only when nothing of yours is available does it reach for a completely fresh one. We sold this internally as a speed improvement, which it is, but the real reason we built it was to keep your context intact between requests.
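The selection order above can be sketched as a small routing function. This is an illustrative reading, not the production scheduler: the three states (READY, FINISHING, COOLING), the Environment record, and the action names are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class EnvState(Enum):
    READY = "ready"          # already serving this user's session
    FINISHING = "finishing"  # busy wrapping up a previous request
    COOLING = "cooling"      # winding down after recent activity


@dataclass
class Environment:
    env_id: str
    user_id: str
    state: EnvState


# Lower number means "warmer", so it is preferred.
_PRIORITY = {EnvState.READY: 0, EnvState.FINISHING: 1, EnvState.COOLING: 2}
_ACTION = {
    EnvState.READY: "reuse",
    EnvState.FINISHING: "wait",
    EnvState.COOLING: "wake",
}


def choose_environment(user_id, environments):
    """Return (action, environment) for the next request from user_id.

    Only environments owned by the same user are considered, which keeps
    the isolation guarantee: one user's request never lands in another
    user's environment.
    """
    mine = [env for env in environments if env.user_id == user_id]
    if not mine:
        return "create", None  # nothing of yours exists: start cold
    best = min(mine, key=lambda env: _PRIORITY[env.state])
    return _ACTION[best.state], best
```

The fallback to "create" is deliberately last, since a fresh environment is the only option that loses the user's warm context.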

Treating know-how as part of you, not the system

Fig 3: Research skills travel with the user alongside their preferences and history, owned by the person, not the platform.

This last decision is the one we're proudest of, and it looks like a content choice rather than an engineering one until you really sit with it.

FinCatch agents work from a set of research "skills" — basically structured playbooks for tackling different kinds of financial questions. How to use a particular government filings database. How to map out the web of relationships between companies. When to dig into raw numbers versus trust a quick pre-calculated signal. We didn't plan these in advance; they built up over time, because agents given clear instructions consistently beat agents left to figure everything out from scratch.

The unusual call was to treat these playbooks as travelling with the user rather than living only on the platform. Every user starts from a shared default set — a curated library of playbooks we maintain and update centrally. That default layer gets merged into your personal bundle the first time you connect, and from then on your copy travels with you: bundled up and restored alongside your preferences and history, and open to being tailored over time. A kind of master playbook sits on top to help the agent pick the right one for the question at hand. That blurs the line between what the agent knows how to do and what you specifically have set up — and keeping the two together, evolving with you rather than frozen on our servers, made the whole system far easier to adapt to different people.
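A small sketch may help show how a shared default library and a personal copy can coexist. Everything here is hypothetical: the skill names, the dictionary layout, and especially pick_skill, which uses plain keyword matching as a stand-in for the master playbook (the article does not say how that selection actually works).

```python
import copy

# Hypothetical shared default library, maintained centrally.
DEFAULT_SKILLS = {
    "filings_lookup": {
        "keywords": ["filing", "annual report"],
        "steps": ["query the filings database", "extract the relevant sections"],
    },
    "relationship_map": {
        "keywords": ["supplier", "customer", "connected"],
        "steps": ["list known counterparties", "rank links by disclosed exposure"],
    },
}


def merge_defaults(personal, defaults):
    """Copy in any default skill the user lacks; never overwrite their own.

    deepcopy keeps the user's later edits from leaking back into the
    shared library.
    """
    for name, skill in defaults.items():
        if name not in personal:
            personal[name] = copy.deepcopy(skill)
    return personal


def pick_skill(question, skills):
    """Stand-in router: return the skill with the most keyword hits, or None."""
    text = question.lower()
    best_name, best_hits = None, 0
    for name, skill in skills.items():
        hits = sum(1 for keyword in skill["keywords"] if keyword in text)
        if hits > best_hits:
            best_name, best_hits = name, hits
    return best_name
```

Because the merge only fills gaps, running it again after a central update adds new default skills without undoing anything the user has tailored. In this sketch the personal skills dictionary would be stored inside the same per-user bundle as preferences and history.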

The pattern underneath

Across all three stories the shape is the same. None of the improvements came from making the model itself cleverer. They came from changing what the model found waiting for it when it woke up: a richer memory, a warmer starting point, a better-stocked toolkit.

This is still a young field, and the main gift of a paper like this is vocabulary — words that let teams argue clearly about trade-offs they used to make on instinct. For us, it mostly confirmed bets we'd already placed, which is reassuring to read about your own work, even if it's a little deflating to learn you were doing textbook engineering all along.

FinCatch is an AI-native investment research platform. If you are building something in this space and want to compare notes, we would like to hear from you.