For the longest time we’ve gone to our CRM for customer data, to our HR portal to book a holiday, and we’ve pulled product and pricing data into a spreadsheet for our own analysis.

And with the likes of Zapier, Paragon, MuleSoft, etc., we’ve choreographed a dance between today’s systems. This is about to give way to something far more profound. What we currently call “agents” – LLMs wrapped in API calls following predetermined paths – represent merely the opening act in a transformation that could rival the internet and cloud computing in its impact on software economics. As systems evolve from stateless token generators to reasoning frameworks with persistent memory and autonomous goal-setting capabilities, SaaS businesses face an uncomfortable question: what happens when your carefully designed user experience becomes an unnecessary middleman between data and decision?

The current state

Where are we now?

  • Highly choreographed workflows, with minimal agency or adaptability on the agent’s part.
  • Heavy reliance on prompt engineering, which often leads to overfitting to a known problem rather than open exploration or the ability to handle variation.
  • “Reasoning” that in reality is just more text to interpret, heavily influenced by the prompts (system and user), which in turn play on the attention mechanism.
  • Brittleness (try upgrading the model, or tweaking the API they interact with).
  • Amnesia: as the prompts/interactions get long, they start to suffer from gaps in the attention matrix, focusing on different facts and forgetting the initial, bigger goal.
  • Error handling, in line with the brittleness argument, but made worse by the non-deterministic nature of the execution. Debugging starts with being able to reproduce a problem!
  • Considering all the above, the true decisions and important verification are still done by a human, which at heart makes these tools automators rather than autonomous agents.

There is value in all this, mainly in areas where these systems provide an input to us rather than independently making decisions with significant impact.

But it’s also often really just chaining an LLM with a wrapper over a bunch of APIs and tools, in a mostly pre-determined decision tree.
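To make that concrete, here is a minimal sketch of what most “agents” look like today: an LLM call wrapped in a hand-written, pre-determined decision tree. The `llm()`, `fetch_crm()` and `send_email()` helpers are hypothetical stand-ins for real integrations, and the placeholder bodies exist only so the sketch runs.

```python
# A minimal sketch of a typical "agent" today: LLM calls wrapped in a
# pre-determined decision tree. All helpers are hypothetical stand-ins.

def llm(prompt: str) -> str:
    """Placeholder for a hosted language model call."""
    return "question"  # a real implementation would hit a model API

def fetch_crm(customer_id: str) -> dict:
    """Placeholder for a CRM API wrapper."""
    return {"name": "Jane", "email": "jane@example.com"}

def send_email(to: str, body: str) -> None:
    """Placeholder for an email API wrapper."""
    print(f"to {to}: {body[:40]}...")

def handle_ticket(ticket: dict) -> None:
    # Step 1: the LLM classifies the ticket into one of a few known buckets.
    intent = llm(f"Classify this ticket as 'refund' or 'question': {ticket['text']}")

    # Step 2..n: every branch below was chosen by a developer, not the agent.
    if "refund" in intent.lower():
        customer = fetch_crm(ticket["customer_id"])
        draft = llm(f"Draft a refund confirmation for {customer['name']}")
        send_email(customer["email"], draft)
    else:
        draft = llm(f"Draft a polite answer to: {ticket['text']}")
        send_email(ticket["email"], draft)
    # No memory, no goal tracking, no recovery: if the classification drifts,
    # the whole flow silently takes the wrong branch.
```

The LLM fills in the slots; the choreography is fixed in code.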

The primary innovation, from my perspective, is that we moved the control plane closer to natural language. The flip side is that most of the tools that came with our old software/code control plane no longer work.

Prompt Theater

  • System prompts can encode certain characteristics such as “work through step by step” or “consider alternatives”, creating a veneer of reasoning while in reality setting up rails for token generation. Much like writing the outline of a book before filling in the details, it forces a structure that constrains.
  • So while agents may appear to choose a path/solution/direction, they’re often more influenced by prompts than we credit. And because the prompt comes before the output, the prompt causes (or at least severely tunes) the result, rather than independent reasoning.
  • “Act as an expert in X” or “be a helpful, friendly customer support agent” will create convincing-sounding language that we as humans interpret as skill. In reality, of course, it’s just tokens generated by a complex mathematical function.
  • The second issue with those prompts is that they do not create any permanence of identity. Underneath sits a contextless, stateless system, so you need to load this behavior in every time, which brings us back to the attention/amnesia problems.
  • Even if prompts are identical, the output will not be deterministic, which means the executed plan (first look at X, then Y, then do Z) is not guaranteed to be consistent between runs.

Determinism

The entire world (of software) is built on deterministic execution. Code does what code says it does. Every time. If it fails, it fails all the time in the same way. (Note: we may struggle to reproduce the exact scenario, but at heart, given the exact same input you get the exact same output.)

GenAI-based systems are non-deterministic. This does not make the answers they generate necessarily wrong; there is often more than one correct answer to a question, after all. And equally, humans aren’t deterministic either. But I would posit that the boundaries within which humans give non-deterministic answers to a question are far narrower than those of current genAI models.
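To see where that non-determinism comes from, here is a toy sketch of temperature sampling over next-token scores. The logits and candidate tokens are invented for illustration; in a real model they come out of the forward pass. The point is that the same input produces a distribution, and every run draws from it.

```python
import numpy as np

# Toy next-token logits for one prompt; values invented for illustration.
tokens = ["policy", "police", "politics", "polity"]
logits = np.array([2.0, 1.5, 0.3, -1.0])

def sample_token(logits: np.ndarray, temperature: float = 0.8) -> str:
    probs = np.exp(logits / temperature)
    probs /= probs.sum()                 # softmax: scores -> probabilities
    return np.random.choice(tokens, p=probs)

# Identical input, repeated runs, potentially different outputs:
print([sample_token(logits) for _ in range(5)])
# e.g. ['policy', 'police', 'policy', 'policy', 'police']
```

Lower the temperature and the spread narrows, but short of greedy decoding (and even then, floating-point and batching quirks can intrude) you don’t get code-style determinism.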

But it does raise a number of operational questions:

  • How can we test? Our entire world is based on “given input X, look for output Y”. (One pragmatic direction, asserting invariants rather than exact outputs, is sketched after this list.)
  • How do we debug when it goes wrong if we can’t replay?
  • What does version control mean if you can’t determine how changes from one version to the next affect the output?
  • Can we limit the “spread” of the answers?
  • How do we generate some amount of consistency between runs?
  • How do we think about authority? And how do we enable multiple goals and trade-offs? Can we avoid “our sales-oriented algo cracked the puzzle by selling $30k cars for $1…”
  • And I haven’t touched on the entire world of highly regulated/compliance-heavy industries and the potential liabilities this behavior creates; enterprises generally want accountability and its siblings, transparency and auditability.
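On the testing question, a hedged sketch of one pragmatic direction: since we can’t assert “output equals Y”, we assert properties that every acceptable answer must satisfy, across repeated runs. The `run_agent` function below is a hypothetical stand-in for invoking an agent.

```python
# Sketch of property-based testing for a non-deterministic system: assert
# invariants every acceptable answer must satisfy, not exact outputs.

def run_agent(task: str) -> dict:
    """Hypothetical agent invocation returning a structured result."""
    return {"price": 29_500.0, "discount_approved": False}  # placeholder

def test_pricing_invariants():
    for _ in range(20):  # sample the output distribution, not a single run
        result = run_agent("quote a price for the $30k car")
        # Guardrails, not exact values: the $1 car must be impossible.
        assert 25_000 <= result["price"] <= 35_000
        assert result["discount_approved"] or result["price"] >= 28_000
```

This limits the damage, but note it tests the envelope of behavior, not the behavior itself.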

These aren’t problems of the category “we need a bigger LLM”. They are fundamental problems at the heart of the agentic paradigm.

If we wanted something that does exactly what it says all the time … that’s called code. And the idea that these systems will always be “correct” goes against their inherent statistical nature.

So the slow step in this won’t be developing agents, but rather establishing a new set of tools for this new control plane, which is inevitably a much slower process.

The future Agent

  • Autonomous goal decomposition. Break down high-level goals into executable sub-goals without human guidance (or over-reliance on system prompts), while maintaining goal consistency across execution paths.
  • Implicit context recognition. Understand unspoken requirements and context, or be aware of the context under which a request is received (or even who makes the request and what that implies).
  • Evolving context models. Not just raw memory storage, but the ability to evolve a conceptual understanding of domain models and stakeholders.
  • Self-calibrated boundaries. A system that knows how to operate inside boundaries, with judgement about when to proceed autonomously vs. seek clarification.
  • Multi-context reasoning. The ability to articulate (and make) decisions based on trade-offs between technical, operational, business, user, and even ethical dimensions.

In other words: an Agent isn’t just persistent software that completes a task, but rather a reasoning framework that can navigate ambiguity while keeping a coherent goal in mind. And in practice you’ll probably have many of these, trained on particular domain knowledge; potentially even a full agent model trained on your proprietary knowledge that has never been in the public domain.

A customer perspective

I’ll come back later to what needs to be true for those agents to exist. For now, let’s assume they’re real and consider what that means for the current crop of SaaS businesses.

We’ve all been told that a good SaaS business is one that uniquely solves a (narrow) problem for a user. And the economic efficiencies sit in the software model, where marginal costs for new users are low. The value exchange is a user renting this solution for a fraction of the cost of an in-house built and maintained system. And it all gets hosted centrally in a multi-tenant setup, with a vendor-controlled update/improvement cycle, accessed over a public internet connection, taking away the operational hassle of running live systems.

SaaS comes in many flavors that fill different functions. So I’m going to put a somewhat arbitrary list down to help slice through the variety.

  1. Systems of record, whose main job is create/read/update/delete actions on relatively simple data (CRM, ERP, HRIS, document storage)
  2. Workflow applications. Those that help in automating/tracking specific workflows (Jira, DocuSign, Asana, …)
  3. Collab tools. Enablers of communication, both synchronous and async (Zoom, Teams, Slack, Notion, Basecamp)
  4. Specialist functions. Tools that are often rooted in systems of record but have focused deeply on a particular function in the org (Zendesk, Greenhouse/Bullhorn, Shopify, Marketo)
  5. Analytics tools. Spanning the range from ETL to storage and visualization (Tableau, Snowflake, Looker, …)
  6. Integration plays. Tools that focus on automating activity across APIs/connections between existing systems (Zapier, MuleSoft, …)

And in a typical company, that means I am paying a lot of companies to hold fragments and “views” of my data, with probably a few tools in each category. And I’m paying for my right to manage that data (through seat licenses), and/or for the amount of data they hold. And no matter how much “AI” each of them adds to their solution, they’re not making my life much better. Sure, Databricks’ AI dashboard builder lets me ask for a pie chart rather than drag the widget around and configure it. Big whoop. Gemini now generates a document with a summary of what we talked about and tries to figure out any action points. Did I really need that?

None of these AI features address the fundamental problem that Agents pose for them. I want the agent to know everything all at once. I want the Agent to understand that when George said “prepare me for today’s meeting” it had a lot to do. In fact, it had to understand that George works in Sales (HR data), and that “today’s meeting” (a calendar event) is a Teams call (make sure the meeting recorder joins) with Jane from Coca Cola (customer data), and that he needs the “first sales call” script and matching slides (document storage). Oh, and when Sally asks “prepare me for today’s meeting”, it’s actually 3 engineers chatting about a problem they discussed on Slack a few days back. That’s of course the stereotypical “Siri on steroids”. But you could take that same lens to business processes. How often does a business process live start-to-finish entirely in one SaaS tool?
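To make that concrete, here is a minimal sketch of what resolving George’s request involves. Every client below (`hr`, `calendar`, `crm`, `docs`, `chat`) is a hypothetical stand-in for a system that today sits behind a separate SaaS login:

```python
from types import SimpleNamespace

# Hypothetical clients: each stands in for a separate SaaS silo today.
hr = SimpleNamespace(get_role=lambda who: "sales")
calendar = SimpleNamespace(next_meeting=lambda who: SimpleNamespace(
    title="first sales call", attendees=[who],
    external_attendees=["jane@cocacola.example"]))
crm = SimpleNamespace(get_account=lambda emails: {"name": "Coca Cola"})
docs = SimpleNamespace(find=lambda q: f"<doc: {q}>")
chat = SimpleNamespace(search=lambda attendees, topic: "<slack thread>")

def prepare_for_meeting(requester: str) -> dict:
    """One request, many silos: the agent's job is joining them."""
    role = hr.get_role(requester)               # HR data: George works in Sales
    meeting = calendar.next_meeting(requester)  # calendar event: a Teams call
    context = {"meeting": meeting.title, "role": role}
    if role == "sales":
        context["account"] = crm.get_account(meeting.external_attendees)  # customer data
        context["script"] = docs.find("first sales call script")          # document storage
        context["slides"] = docs.find("first sales call slides")
    else:
        # Sally's identical request resolves to something else entirely.
        context["thread"] = chat.search(meeting.attendees, topic=meeting.title)
    return context

print(prepare_for_meeting("george@acme.example"))
```

The join logic is trivial; what’s valuable is that no single SaaS vendor sits in the middle of it.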

If you want to spell out the collapse of SaaS: it’ll be slow at first, and then sudden. The first step is humans bypassing the UI. It might look like OpenAI’s Operator using screen recognition to complete an action or pull some data. But the vendor has lost the human interaction. And why would anyone pay a SaaS vendor for 100 seat licenses across the entire company if the vast majority of data input & retrieval can be agentized under a single license? And while that UI might be a pleasant place for humans to get something done, and a differentiator, an agent doesn’t really care much. Needing to click around because I can never remember where exactly that information lives in a system, or which system exactly it is? Agent got it. Need a full deep dive on key customer activity? Have fun copying and pasting data from Zendesk, Salesforce, QuickBooks, product usage dashboards, call notes (if they exist at all…). A breeze for an agent; it’ll be on your desk in 10 minutes.

I’m sure you get the point by now. The crux is that these future Agents hurt most SaaS players at their core. That’s a world where SaaS vendors no longer own their UI/UX; instead, UI/UX creates friction compared to, say, an API (or any interface optimized for data exchange between software systems rather than between software and humans). A world where fixed automated processes and pre-defined dashboards are irrelevant, as agents can dynamically push, pull and summarize data. All of this will lead to standardization, and the winners will initially be those who interact best with agents. And rather than the current “best of breed” approach of selecting the best vendor in a category, we’ll look for consolidated solutions that reduce friction for the agents. And Agents will push for open data formats, to remove more friction and erroneous pathways in their execution. And with agents in place and sensible data formats, switching costs will drop, so SaaS lock-in will be severely reduced. All of which will hurt the pricing power of SaaS vendors.

Because ultimately the flip side for most companies will be using genAI to create a number of “mini-SaaS” tools in-house, and focusing on making sure their agents have a clean, well-maintained dataset to work with. Which, incidentally, will also make them truly owners of their data again, unencumbered by SaaS vendor opinions on what that data should look like. Anyone who’s ever worked with data and data analysis will tell you: your system is only ever as good as the quality of your data. Whether companies are ready for the operational fallout of running all this tech in-house and building crucial business processes on top of it is another matter, but that likely won’t stop some of them trying.

Are we there yet?

Not yet.

There are a number of critical pieces missing to enable the agents of the future. Critical pieces that aren’t just evolution of today, but represent a significant step change in technology.

Let’s start by addressing the issues of long-term memory, persistence, and the ability to react to the knowledge inherent in those memories. At heart, an agent powered by a language model can only focus on what’s in the context window. And while current models allow close to 200,000 words in that window (about an 800-page novel), there is fundamentally a limit. LLMs don’t need to prioritize, because they process the entire context window through attention all the time. Windows are growing with every new model, but growing windows bring their own problems, in that models have a harder time figuring out what’s important in all that input.

To get past this we need processes to effectively synthesize information: figuring out what is and isn’t important, resolving contradictions between old and new information as we collapse it, and ultimately, bluntly, deciding what is worth remembering to begin with. And once we have all that raw consolidated information, we need a way to make sure each of those concrete experiences can be rolled up into a broader conceptual framework that applies across unseen scenarios.

This isn’t prompt engineering by loading up the context with more “memories”, but rather fundamentally altering the paths through the model as a consequence of those memories. And that is very different from any technical track we’re on today. It’s not a bigger context window, a bigger model, or more RAG, but a fundamentally new class of storage that impacts the processing architecture. It’s a subtle nuance in language between “pathway from given information” and “pathway from experience”, but it is a fundamentally different architecture: one that turns a static system with dynamic input into a dynamic system that adjusts itself. These are the boundaries of our current thinking, as we aren’t just changing the parameters of the model, but the actual model’s layers and code. That would form the basis for augmenting models beyond the basic limitations inherent in the fixed parameter space and stateless world they operate in today. The neural networks of today have some emergent properties, but they are not fundamentally able to rewire their own compute ability. And that is currently an area of (limited) fundamental research.
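For contrast, here is roughly what today’s engineering workaround looks like: a consolidation loop bolted on outside the model (the `llm` helper is a hypothetical stand-in). Note that everything still re-enters through the prompt; the model’s own pathways never change, which is exactly the limitation described above.

```python
# Roughly today's workaround: memory consolidation bolted on *outside* the
# model. The model's weights never change; everything re-enters as text.

def llm(prompt: str) -> str:
    return "..."  # placeholder for a hosted model call

def consolidate(memories: list[str], new_event: str) -> list[str]:
    # 1. Bluntly decide what is worth remembering to begin with.
    verdict = llm(f"Answer yes/no: is this worth remembering? {new_event}")
    if "yes" not in verdict.lower():
        return memories
    # 2. Resolve contradictions between old and new information as we collapse it.
    merged = llm("Merge these, resolving contradictions:\n"
                 + "\n".join(memories) + f"\nNew: {new_event}")
    # 3. Keep the store bounded so it still fits in a future context window.
    return [llm(f"Condense into at most 20 durable facts:\n{merged}")]
```

Useful, but it is still a static system being fed dynamic input, not a system that adjusts itself.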

Next up, the challenge of goal setting, or more precisely, goal decomposition. Everything today is rooted in text generation. Even the “reasoning” models ultimately simply generate more text, and that text gets fed back into the attention mechanism of the transformer architecture. And it’s a known problem that ever-larger input leads to several side effects of shifting attention, for instance the well-documented lost-in-the-middle effect, which means that text at the beginning or end of a prompt is more likely to get attention. So long-term goal permanence as a feature is not inherent in our current designs. Nor do systems have the capability to apply temporal reasoning: the idea that a goal at time T1 is more or less relevant than a goal at T2 has no meaning to our current models. It’s all just input. Nor are our current systems capable of working out whether a change in goal is a bona fide adjustment or a corruption. All because there is no true mechanism for representing goals (the final outcome), subgoals (the parts that make up the final outcome) or meta-goals (goals about goals, such as resource usage or safety constraints). And while any LLM will happily write out a set of subgoals, there is nothing that makes it inherently verify that these constitute a solution, or know when a set is complete. And the idea that an agent will check an agent is bonkers: there is computationally no difference between Agent A checking Agent B and prompting Agent A “are you sure?”. It is simply processing a slab of text.

The goal/subgoal problem goes a little deeper. There is no real hierarchy other than computed attention between words, and there’s plenty of research showing that adding irrelevant information to a prompt often throws off the answer. Part of this ties in with these systems lacking ground truths: they don’t really know if they’re making progress towards a real-world impact. It’s all based on exchanged words. We need ways of adding ground truth and linking specific inference to real-world outcomes. Granted, some of this we can work around by prompt hacking, but then we’re back to brittle systems and prompt theater. What is fundamentally needed is a means to reason about goals beyond, and in fact outside of, the transformer’s attention system.
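To make the gap concrete, here is a speculative sketch (not an existing API) of the kind of explicit goal representation that would live outside the attention mechanism: hierarchy, time, verification against the real world, and meta-goals as first-class structure.

```python
from dataclasses import dataclass, field
from typing import Callable

# Speculative sketch of explicit goal structure: something that exists
# outside the attention mechanism, with hierarchy and verification that
# "just more generated text" does not give you.

@dataclass
class Goal:
    description: str
    created_at: float                           # enables temporal reasoning: T1 vs T2
    verify: Callable[[], bool]                  # ground truth: a real-world check
    subgoals: list["Goal"] = field(default_factory=list)
    meta: dict = field(default_factory=dict)    # meta-goals: budgets, safety constraints

    def complete(self) -> bool:
        # A goal is only done when its real-world check passes AND every
        # subgoal is verified; an LLM will happily write out subgoals, but
        # nothing forces them to actually constitute a solution.
        return all(g.complete() for g in self.subgoals) and self.verify()
```

The hard part, of course, is not the data structure but wiring it into the model’s reasoning rather than pasting it back into a prompt.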

We touched briefly on the emergent compute properties of a truly learning system when discussing the need for memory. Let’s turn to where those memories might come from: integrating feedback from the environment. One of the core components of any agent would be detecting signal from noise and linking long-delayed real-world results to specific inference results. When is a change in the environment meaningful? What is cause and what is effect? Which sets of decisions, and when, influenced this? As a trivial example: when another system doesn’t react, is that because the system is gone, or a temporary issue? And if you receive new information, how does a system handle the reliability of a source it might never have seen or heard of before? And while software has a very standardized set of exceptions and status representations, we have no means of representing those in models. In fact we lack a causal world model. There is very early scientific research (late 2024) in this area, augmenting standard LLMs with Causal Representation Learning, which is showing promise for causality across longer planning horizons. The systems of today can take feedback, but again, it all ultimately has to be fed in through the prompt. Until we find mechanisms to navigate around this, and can classify our different inputs accordingly, we’re going to keep building systems that feed the same single-channel bottleneck.
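For comparison, conventional software has had crisp machinery for exactly that trivial example for decades. A sketch of the standard pattern, typed failure classes plus bounded retries, none of which has a native equivalent inside a model:

```python
import time

class TransientError(Exception): ...
class PermanentError(Exception): ...

def call_with_backoff(fn, retries: int = 3):
    # Deterministic software encodes "gone vs temporarily down" explicitly:
    # typed exceptions, bounded retries, exponential backoff. A language
    # model receives all of this as undifferentiated text in a prompt.
    for attempt in range(retries):
        try:
            return fn()
        except TransientError:
            time.sleep(2 ** attempt)   # back off and try again
        except PermanentError:
            raise                      # no point retrying: the system is gone
    raise PermanentError("still failing after retries; treat as gone")
```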

And finally, while I’ve written a lot here about Agents being at heart excellent at pulling data together from different sources, they are also their own bottleneck. Different systems inevitably use different conceptual and data representations for the same real-world thing. And deduplicating complex entities is challenging, particularly if they represent different “views” on the same object and therefore capture different properties. This is in part an object mapping and standardization question, but also a much deeper one; not to mention that differences in standards usually exist for a reason and a real use case. It’s deeper because, for instance: what does “customer” really mean? It’s easy to see that in one system that might be a collection of “users”, whereas in another it might be a primary entity with lots of metadata attached. Our entire world of data is built from representations that mattered to the solution in question, not to a universal truth. Agents can overcome some of this, since after all they’re good at pulling data, but they can’t overcome the implicit data design choices and the world models those represent. And those different ontologies create downstream issues with propagating cause and effect. This is a deeply, fundamentally different way of reasoning about our data infrastructure. Not to mention that we also need to plan for the fact that these models change over time as we add and remove elements.

Which brings us to context sharing. Since these systems are at heart stateless, we need a means of sharing context between them. It doesn’t naturally transfer, and there is no agreed exchange format. There is no “Agent API”, so to speak, that would allow one agent to pass appropriate context to the next without anything being lost in translation (sending and receiving human language really is a very imprecise exchange format). If you combine the lack of shared context with the ontology problems, it’s clear we need some fundamental shifts in this space to truly unlock Agent capabilities. Some of these challenges will be solved outside of “AI”, and border in large part on our ability to represent knowledge. And they will inevitably require us to rethink how we build and design our agents as well.
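There is no agreed format today, but to illustrate the shape of the gap, here is a speculative sketch of a structured handoff, where entities, goals and assumptions travel as typed data rather than prose. All field names are invented:

```python
from dataclasses import dataclass, field

# Speculative sketch of an "Agent API" handoff: context as typed, structured
# data instead of prose. Nothing like this is standardized today; the point
# is what gets lost when agents exchange paragraphs of English instead.

@dataclass
class Entity:
    system_of_record: str    # where this fact is authoritative, e.g. "crm"
    kind: str                # ontology slot: "customer", "user", "account"
    ids: dict[str, str]      # the same real-world thing, per-system IDs

@dataclass
class Handoff:
    goal: str                                  # what the receiver must achieve
    entities: list[Entity] = field(default_factory=list)
    assumptions: list[str] = field(default_factory=list)  # believed, unverified
    confidence: float = 1.0                    # how sure the sender is overall

handoff = Handoff(
    goal="refresh the Coca Cola account summary",
    entities=[Entity("crm", "customer", {"crm": "acc_42", "billing": "cust_9"})],
    assumptions=["billing customer cust_9 is the same legal entity as acc_42"],
)
```

Even this toy version surfaces the ontology problem immediately: someone still has to decide what “customer” means across systems.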

I’ve alluded to this earlier, but we do have the problem of non-determinism. An agent will produce output when told to do so; whether that’s hallucination, correctness or creative genius is really your interpretation. The words produced are ultimately merely the result of a statistical system. And there are use cases where “hallucination” is a very desirable feature, for instance in DeepMind’s work on protein folding, or when drafting marketing copy. It’s a less desirable feature when a transcript accidentally picked up “police” instead of “policy”. The engineering meeting summary became “discussing a recent event involving police action, possibly a stop or arrest, that the team refers to as a ‘dot’ in the discussion. It is implied that the police action was deemed successful in some way. …. The conversation lacks formal details but indicates a positive outcome following a police-related incident”. The actual discussion was a 2-minute conversation about a new policy that was successfully implemented for one of our notetaker bots. Imagine the action points an Agent would take from that… (That’s a real example, by the way.)

So, SaaS is dead?

I’m aware that a large number of resources are aimed at this, and it’s a field evolving at a rapid pace. Some of the areas we need are actively being worked on, and we’re coming at them both with foundational scientific research and practical “good enough” engineering solutions. At the same time, a large number of big voices in the market have their own, usually commercial, reasons for making statements, which the media happily strips into soundbites without any nuance. From Sam Altman, who needs a story that fits the kind of money he hopes to raise, to Nadella, who released Magentic-One and has a substantial stake in OpenAI, to every VC trying to get an LP to allocate money to their new AI fund, or every founder who hopes to tap into some VC capital. There’s truth in all of those statements; there’s also a lot of valuable nuance outside the hype train.

I strongly suspect that a number of the problems we face can be addressed with engineering solutions, before we need some of the foundational breakthroughs. And with that, more importantly, we will eventually develop a tool stack with the scaffolding to reliably build and operate these things.

The big problem that will hold us back from truly broad autonomous agents ultimately boils down to alignment. Can we trust these systems to consistently do the right thing? Can we hold them accountable for their actions? Can we create transparency in their training and operation? Those are fundamental questions we will eventually need to address, because as agents’ autonomy increases, the scope for misalignment grows with it. I believe that is a key reason we won’t go to fully autonomous, broad agents for a long time, if ever.

We will start to see more limited use cases: places where a wrong decision perhaps doesn’t matter that much, as long as we can spot it and recover; or where the chain of activity is small enough that a single misstep doesn’t spiral out of control. And as we’ve done before with RPC, SOAP, XML and REST, we’ll develop new ways for software to communicate with other software. Personally, I’m not convinced human language is a good API at all.

In short, situations where the human in the loop is augmented rather than replaced, with a strong focus remaining on automation rather than true autonomy. All of which will lead to efficiencies and productivity gains for the individual. And while I said earlier that everyone could build their own mini-SaaS in house, I believe just as many will not do so. The complexity of running business-critical software (agents are, after all, software) is not to be underestimated, even assuming we can all get our hands on adequate hardware at a reasonable cost. And ramping up the domain knowledge to train and maintain, say, a legal agent for your specific purpose is probably a step too far for most companies. Instead we’ll see marketplaces appear where people sell pre-built, pre-trained agents. These will of course tie in to your existing SaaS systems and ideally operate in a shared-knowledge system so they can work with each other. And that will have to bridge the generic training we have today and something truly “just for you”, which is where we’ll need progress on memory, as I outlined earlier.

So I don’t think that if you’re in SaaS you need to be an “AI company” (whatever that means) or “add AI” to your product tomorrow. While great for marketing and storytelling, those moves don’t go to the heart of the challenge. I do think you need to carefully consider your relationship with your customer’s data, your place in your customer’s ecosystem and the added value you bring to that data. The core mantra of solving a (narrow) problem well for your users won’t go away. However, solving it will look different, and there will always be money in integrating with others. If anything, over the longer run, horizontal M&A to bring different data sources into one house and disintermediate some of the complexity is where I’d turn. If your new users are going to be automaton-styled agents, make their life easy, just as you’ve always tried to for your users. This kind of M&A with a focus on tight integration could resolve a number of integration problems around authentication/authorization and data model inconsistencies.

Which leaves us with the economic mismatch between seat licenses and Agentic operations. The traditional logic for seat licensing is that a user does some activity from which they derive value, and therefore pays a fee as long as they use the software. And for a SaaS-consuming organization, more users = more value = more fees, with all the usual commercial variations and discounts for minimum commits, volume, pre-pay, bolt-ons, and so on. On the COGS side, each user adds a small marginal cost. So in theory enormous unit margins, but a huge cost base between engineers and sales activity, resulting in a land-grab / winner-takes-all market dynamic, because theoretically, at the limit, your net margin converges to gross margin. Agents of course have the exact opposite dynamic if you look at them as a “user”: one agent is likely to cause a comparatively big increase in COGS against only a single license. Ask yourself this, though: is one agent really providing one user’s worth of value to your customer? Maybe it’s time to rethink how we price SaaS products and home in on the actual value delivered, rather than taking a “user” as a proxy for value. Fundamentally, cost structures will no longer scale with seats, but with consumption and outcomes. We already have consumption pricing models (telco minutes, Dropbox storage, cloud-style metered compute, …) which are detached from seats and aligned with underlying COGS. What we don’t understand yet is the consumption-value link when an Agent is in the mix, and whether the underlying COGS for an agent (both running the agent and accessing the data) are worth it.
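As a toy illustration of that inversion (all numbers invented): seat pricing scales revenue with users while marginal cost stays tiny; an agent concentrates the usage of many seats into a single license.

```python
# Toy numbers, purely illustrative, to show the seat-vs-agent inversion.
seat_price, marginal_cogs = 50.0, 2.0   # per "user" per month (hypothetical)
seats = 100

# Classic SaaS: revenue scales with seats, COGS barely moves per seat.
human_revenue = seats * seat_price      # 5000.0
human_cogs = seats * marginal_cogs      # 200.0  -> ~96% gross margin

# One agent doing the data input & retrieval of ~100 humans: it drives the
# consumption of the whole company but pays for a single license.
agent_revenue = 1 * seat_price          # 50.0
agent_cogs = seats * marginal_cogs      # 200.0  -> negative gross margin

print(human_revenue - human_cogs, agent_revenue - agent_cogs)  # 4800.0 -150.0
```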

Conclusion

As so often in the real world, it won’t be a binary “deterministic or not”, “automator vs true agency”, or “mini-SaaS in house vs third-party SaaS”.

We will operate on a spectrum between the extremes. We will have things we call Agents, and we’ll eventually find a good use for them once the novelty wears off.

Disruption has been a constant in technology and every so often something truly novel comes along that makes us rethink our existing positions.

I do believe AI (or more precisely, the transformer architecture powering today’s language models) is one of those. It’s probably up there with the internet in the late 90s and the cloud in the early 2010s. None of them wiped out the existing systems and players, but they did redefine our relationships with them.

Ultimately, after the hype faded, and with it the economic shake-out of winners and losers in these new ecosystems, we settled on a world that was bigger and better.

In short SaaS as a business is far from dead, even though it will need to evolve and find its place.