Introducing Iris: Our AI Tax Development Agent

Introduction – we’ve built Iris: an AI tax development agent

We’ve built Iris: an AI tax development agent that is able to understand tax law, turn it into software code, and test itself to achieve correctness. It’s speeding up our development considerably, and has helped us become one of the first companies in the past two decades to build a personal income tax engine.

Tax development requires 100% accuracy, so we’ve built a one-of-a-kind agent that writes the code for our tax engine calculations, which lets us verify that those calculations are correct.

Iris is backed by a custom orchestration system that:

Provides relevant context from tax law, IRS coding requirements, and our existing graph computation system.
Runs a “core loop” that writes formulas & evaluates them for correctness against synthetic & anonymized real-world datasets.
Connects to our existing systems to write, run, and securely test code.

Building Iris has only been possible thanks to our world-class tax experts, unique data, infrastructure, and AI Engineering team. For the first time, we're exposing the details of how we built & deployed this agent.

The problem: building a tax engine is hard

Tax filing is an exercise in once-a-year¹ personal financial data collection: you collect² all of your financial data & facts for the year (e.g. W-2s, 1099s, your dependents' DoBs, your spouse's data if you're married), so that you or your accountant can enter it into tax software. Tax software developers like us call these your “inputs”.

Underlying the tax software that you (or your accountant) use³ to file taxes is a piece of software called a “tax engine”.

A tax engine takes those “inputs” and runs them through calculations and transformations defined by the IRS & State agencies. A tax engine then outputs the completed tax return in the XML format specified by the IRS to be e-filed and a PDF for users to view their return.

The tricky part is that those calculations & transformations are only defined in English. The core challenge of building a tax engine is turning those English calculations into programmatic code that can compute a user’s tax return, 100% accurately⁴, for every possible permutation of user inputs.

Here’s one small example of those calculations, defined in English on Form 1040, for computing a return’s total W-2 income. Line 1a reads “Total amount from Form(s) W-2, box 1 (see instructions)”. If the user has two W-2s, one with $30k in box 1 and the other with $20k in box 1, Form 1040 Line 1a will be the sum, $50k.

Contrary to popular belief, the IRS doesn't "know the answer" ahead of time – this is something that each tax software provider⁵ must implement.

A tax engine’s scale

For a sense of scale, the US income tax code, across Federal & ~40 states, is more than 75k pages and over a million lines of text total. The arithmetic (addition, subtraction, multiplication, division) is not particularly difficult. What makes this problem hard is the number of rules, the number of interconnections between those rules, and the fact that there is no given “answer key”⁶ from the IRS or States. A single input like Filing Status is a dependency with downstream impacts on thousands of additional calculations on the Form 1040 alone.

What we built is a graph computation system where every node is a single tax formula. Our computation graph has now grown to over 100k nodes, each representing a single tax transformation: e.g. Line 8 is the sum of Line 6 and Line 7. Even for a single user with a basic W-2 return, the cascading dependencies in the tax code mean that, without caching, we would perform over a billion calculations to transform their inputs through the computation graph into the final XML return output.
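
To make the caching point concrete, here is a minimal sketch of a formula graph with memoized evaluation. It is not our production implementation: the `Graph` class, the node names, and the formulas are hypothetical and chosen only to show that a node shared by several downstream formulas is computed once.

```python
from typing import Callable, Dict

Formula = Callable[["Graph"], float]


class Graph:
    """Toy formula graph: each node is one formula over other nodes (hypothetical design)."""

    def __init__(self, formulas: Dict[str, Formula]) -> None:
        self.formulas = formulas
        self.cache: Dict[str, float] = {}
        self.evaluations = 0  # counts how many formulas actually ran

    def get(self, name: str) -> float:
        # Return the cached value if this node was already computed.
        if name in self.cache:
            return self.cache[name]
        self.evaluations += 1
        value = self.formulas[name](self)
        self.cache[name] = value
        return value


# Illustrative nodes only; the downstream formulas are not real Form 1040 rules.
formulas: Dict[str, Formula] = {
    "w2_box_1_sum": lambda g: 30_000.0 + 20_000.0,
    "line_1a": lambda g: g.get("w2_box_1_sum"),
    "downstream_a": lambda g: g.get("line_1a") * 2,
    "downstream_b": lambda g: g.get("line_1a") + g.get("downstream_a"),
}

graph = Graph(formulas)
total = graph.get("downstream_b")
print(total, graph.evaluations)
```

In this toy graph, `line_1a` is requested twice but evaluated once. With over 100k nodes and deep dependency chains, that kind of reuse is what keeps the number of calculations manageable.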

In short: Tax Forms → Tax Engine → XML → IRS. Here’s a very simplified diagram to illustrate:

But building this graph computation system solely by hand takes significant time. Because of the complexity of the problem, almost no companies have built a US personal income tax calculation engine in the past two decades. Very few companies in the US have one: about a dozen have ever been built. And it’s what gave existing tax companies their moat. Until now.

Thanks to our world-class Tax team, unique infrastructure, and our product’s commercial scale, we’ve been able to build Iris, which has significantly sped up our development. Column Tax is proud to have built one of the first feature-complete tax engines from scratch in the past two decades. And our product has now been battle-tested across tax seasons with over 1 million returns filed & over $1 billion in refunds processed.

Introducing Iris: Column Tax’s AI Tax Development Agent

Why models can’t do taxes on their own

Today’s LLMs cannot “do taxes” on their own, because tax calculations require 100% correctness and today’s models hallucinate. While they’re getting better (thanks to reasoning & techniques like chain-of-thought), they are still not great at arithmetic or counting on their own, both of which are critical for tax calculation.

Additionally, Reinforcement Learning (RL), the technique used to improve LLMs’ math skills, wouldn’t work for a tax engine. That’s because each year, Congress passes bills that update our tax laws, and the IRS & States update (or “rollover”) the tax code to account for these changes. RL requires a known correct answer to train against, but from Oct to Dec of each year, there is no software that can give you the correct result for a tax calculation for that tax year (the one people will start filing for in Jan).

What we’ve built: a tax development agent, backed by a custom orchestration system

Given the 100% accuracy requirements of our domain, we’ve built Iris: an AI tax development agent that writes software code to build our tax engine calculations so that we can verify that those calculations are correct. It allows us to build our tax engine much faster than traditional development methods.

Instead of trying to use AI to directly calculate taxes (which wouldn’t work for the reasons above), we’ve built an agent and requisite scaffolding that outputs code we can verify to be fully correct.

Iris is backed by a custom orchestration system that:

Provides relevant context from IRS & State Forms & Instructions⁷, IRS & State coding requirements, and our existing graph computation system.
Runs a “core loop” that writes formulas & evaluates them for correctness against synthetic & anonymized real-world datasets.
Connects to our existing systems to write, run, and securely test code.

Now, instead of writing our tax engine by hand, we use Iris, which significantly speeds up our ability to build our tax engine to cover all States, Forms, and tax law situations. Here’s how it works:

Orchestrator

Our agent is made up of a set of composable parts, each tasked with doing a specific part of the tax development workflow. The orchestrator is the overall conductor of those components.

The orchestrator is responsible for creating the development plan, including deciding on the order of lines & forms to build, and directing the other subsystems.

As we discussed above, the programmatic representation of tax law is a graph. For example, 2024 Form 1040 Line 8 asks for “Additional income from Schedule 1, line 10”, and in order to calculate Schedule 1 Line 10, you must also calculate (if relevant) Schedule C, Form 4797, and about a dozen other Forms/Schedules, which may also reference other forms.

You can treat the resulting structure as a directed (and mostly ;) acyclic) graph.

When coding up a Form/Schedule (or State), our Orchestrator combines traditional graph processing algorithms (topological sort) with LLM processing to decide on the order of forms (and lines within those forms) to implement, such that the right prerequisites are always available. The Orchestrator requires deep context about the forms and instructions as well as expected outputs, which we provide via our context retrieval system.
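
As a rough illustration of the graph-processing half of that step, the sketch below orders the forms from the Form 1040 Line 8 example with Python's standard-library `graphlib`. Only the dependencies named above are included, and the node labels are simplified for illustration; the LLM-assisted part of the planning is not shown.

```python
from graphlib import TopologicalSorter

# Map each form/line to the forms/lines it depends on.
# Only dependencies mentioned in the text are listed; the real graph is far larger.
dependencies = {
    "Form 1040 Line 8": {"Schedule 1 Line 10"},
    "Schedule 1 Line 10": {"Schedule C", "Form 4797"},
    "Schedule C": set(),
    "Form 4797": set(),
}

# static_order() yields each node only after all of its prerequisites.
build_order = list(TopologicalSorter(dependencies).static_order())
position = {name: i for i, name in enumerate(build_order)}

for name in build_order:
    print(name)
```

The relative order of Schedule C and Form 4797 is not fixed, since neither depends on the other. If the dependencies contained a cycle, `static_order()` would raise `graphlib.CycleError`, so any genuinely circular references would need separate handling.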

Once it’s created the development plan, the Orchestrator then instructs the other subsystems to begin retrieving context, developing, and evaluating in order to produce valid tax engine code.

Context retrieval system

The agent attempts to output tax engine code that is mergeable into our overall graph computation engine. In order to write this code, the agent’s formula development subsystem needs context about:

The IRS & State Forms & Instructions
The specific output format expected
The existing codebase it’s going to be integrating with (which has over 100k nodes already in its graph computation system)

The context retrieval subsystem uses a combination of embeddings, a vector database, semantic search, and LLM techniques to gather the right context for each line that the core development loop will implement. It's important that only the relevant context is given to the development loop so that it can correctly integrate with the existing system. Without the right context, the quality of formula outputs is much worse, and we cannot simply provide everything: the overall tax code and our codebase are much too large for current context windows. Given the number of interconnections in the tax code, each line is very likely to reference other lines in order to calculate correctly.
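
The sketch below shows the general shape of similarity-based retrieval, under heavy simplification. The `embed` function is a toy word-count stand-in for a real embedding model, the in-memory dictionary stands in for a vector database, and the snippet ids and the `style_guide` entry are hypothetical. The two form snippets use the line text quoted in this post.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def embed(text: str) -> Counter:
    """Toy 'embedding': word counts. A stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, snippets: Dict[str, str], k: int = 2) -> List[str]:
    """Return the ids of the k snippets most similar to the query."""
    q = embed(query)
    ranked = sorted(snippets, key=lambda sid: cosine(q, embed(snippets[sid])), reverse=True)
    return ranked[:k]


# Hypothetical snippet store; ids are illustrative only.
snippets = {
    "f1040_line_1a": "Total amount from Form(s) W-2, box 1 (see instructions)",
    "f1040_line_8": "Additional income from Schedule 1, line 10",
    "style_guide": "Formulas are written as graph nodes that reference other nodes",
}

top = retrieve("sum W-2 box 1 wages for line 1a", snippets, k=1)
print(top)
```

Only the top-ranked snippets would be passed to the development loop, which is the point of the step: the relevant slice of the forms, instructions, and codebase rather than all of it.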

Core Development & Evaluation Loop

The core loop of the system is a combination of two steps:

Formula Writer: the subsystem that actually implements the formula code given the relevant context
Evaluator: the evaluation system that tests that formula against the correct output

We call this development and evaluation process a “loop” because the system can go through multiple iterations, similar to gradient descent, to “self-heal” until the formula it outputs is correct across all possible input permutations.

The Formula Writer takes the relevant context from the initial Context retrieval system and from the Evaluator to write code that implements the specifications of the IRS & States and produces output that can run inside the existing graph computation system.

For example, for Form 1040, Line 1a (“Total amount from Form(s) W-2, box 1 (see instructions)”), the Formula Writer might output:

line_1a = (sum(w2.box_1 for w2 in w2s)
           - sum(sch_c.temporary_statutory_employee for sch_c in schedules_c)
           - schedule_1.nonqualified_deferred_compensation)

The “(see instructions)” parenthetical in the text of Line 1a is crucial: it is what makes this formula more complex than simply summing box 1 across all W-2s.
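
For readers who want to run the example formula, here is a self-contained sketch. The field names come from the formula above, but the dataclasses, the function wrapper, and the adjustment amounts are assumptions made for illustration and do not reflect our actual data model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class W2:
    box_1: float


@dataclass
class ScheduleC:
    temporary_statutory_employee: float = 0.0


@dataclass
class Schedule1:
    nonqualified_deferred_compensation: float = 0.0


def line_1a(w2s: List[W2], schedules_c: List[ScheduleC], schedule_1: Schedule1) -> float:
    """Same arithmetic as the example formula, wrapped in a function."""
    return (
        sum(w2.box_1 for w2 in w2s)
        - sum(sch_c.temporary_statutory_employee for sch_c in schedules_c)
        - schedule_1.nonqualified_deferred_compensation
    )


# The two-W-2 case from earlier: $30k + $20k with no adjustments.
simple = line_1a([W2(30_000), W2(20_000)], [], Schedule1())

# Same W-2s with made-up adjustment amounts to exercise the other two terms.
adjusted = line_1a([W2(30_000), W2(20_000)], [ScheduleC(5_000)], Schedule1(1_000))
print(simple, adjusted)
```

The first call reproduces the $50k result from the two-W-2 example. The second shows how the subtracted terms change the answer once the instructions' carveouts apply.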

The Evaluator has access to the code that the Formula Writer has just written and tests that code against synthetic, human expert-curated tests as well as real-world data that we “replay” in a secure sandbox environment. The Evaluator then gives the Formula Writer feedback on whether the formula it has created is correct.

The test set used in this evaluation step is critical to ensuring that the code the Formula Writer produces is 100% correct in all cases. It's taken years of work and very rare human expertise to curate this synthetic test set to be comprehensive. Additionally, we are able to replay code against real-world data that we only have because of the scale of our commercial product, which allows us to test against even the rarest edge cases.

If the Formula Writer’s outputted code is incorrect for any test cases, the Evaluator sends that feedback to the Formula Writer, which has another chance to fix the code and continue. If the code is correct, the orchestrator moves on to implement the next line or Form.
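
The write-and-evaluate cycle described above can be sketched as follows. This is a simplified illustration, not our actual system: `fake_writer` stands in for the LLM-backed Formula Writer, the test cases are invented, and the `max_iterations` budget with a fallback to a human operator is an assumption.

```python
from typing import Callable, Dict, List, Optional, Tuple

Inputs = Dict[str, float]
Formula = Callable[[Inputs], float]
TestCase = Tuple[Inputs, float]  # (inputs, expected output)


def evaluate(formula: Formula, tests: List[TestCase]) -> List[str]:
    """Run the formula against every test case and describe each failure."""
    failures = []
    for inputs, expected in tests:
        actual = formula(inputs)
        if actual != expected:
            failures.append(f"inputs={inputs} expected={expected} got={actual}")
    return failures


def core_loop(
    write_formula: Callable[[List[str]], Formula],
    tests: List[TestCase],
    max_iterations: int = 5,
) -> Optional[Formula]:
    """Alternate writing and evaluating until all tests pass or the budget runs out."""
    feedback: List[str] = []
    for _ in range(max_iterations):
        formula = write_formula(feedback)
        feedback = evaluate(formula, tests)
        if not feedback:
            return formula  # correct on every test; move on to the next line
    return None  # assumed fallback: escalate to a human operator


# Stand-in writer: the first attempt ignores an adjustment,
# the second attempt (once it has feedback) includes it.
def fake_writer(feedback: List[str]) -> Formula:
    if not feedback:
        return lambda x: x["w2_box_1"]
    return lambda x: x["w2_box_1"] - x["adjustment"]


tests: List[TestCase] = [
    ({"w2_box_1": 50_000.0, "adjustment": 0.0}, 50_000.0),
    ({"w2_box_1": 50_000.0, "adjustment": 1_000.0}, 49_000.0),
]
result = core_loop(fake_writer, tests)
```

Here the first attempt fails the second test case, the failure description is fed back to the writer, and the second attempt passes both. The quality of the loop's output is bounded by the coverage of `tests`, which is why the curated and replayed datasets matter so much.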

How we use Iris: human feedback & operations

Similar to Claude Code, Cursor’s Agent Mode, Devin, Charlie, or Codex, Iris is a system that is operated by a human expert. In our case, instead of a software engineer operating the system, we have tax software subject matter experts who use our agent to accelerate their work. The human operator oversees the agent and its orchestration system, and can steer the system and provide feedback when necessary.

Additionally, the agent’s output is inspectable at every step of the process, so the human operator can verify the derivations that led to Iris’s calculations.

Similar to other agentic coding systems, our human operators are responsible for the output of the system and ensuring it is merged into the codebase only after being thoroughly tested and QA’d.

Conclusion

Iris has been a superpower for our tax development workflow. It's allowed us to become one of the first companies in two decades to build a feature-complete tax engine. The engine has been battle-tested across tax seasons with over 1 million returns filed & over $1 billion in refunds processed. Iris is giving our human subject matter experts incredible leverage. And it allows us as a company to focus on improving the user and developer experiences of our product because we're able to keep our core tax engine up to date year-over-year with much less effort.

This is just the beginning. We have lots of ideas for improving Iris. If this type of work sounds interesting to you, we are hiring Software Engineers and Tax Analysts to work on the next generation of our agentic tax development workflows.

P.S. Can you guess why we chose the name “Iris”? Share your theory in the job applications above :).

--------

¹ For now. We’re actively working on making taxes a year-round, proactive, and automatic thing.

² “Collect” is a bit of a misnomer since with software like Column Tax, much of your return data is likely pre-filled.

³ Unless you’re part of the less than 1% of American taxpayers who still file by hand with pen & paper.

⁴ Column Tax is IRS Authorized and offers 100% Accuracy and Maximum Refund Guarantees.

⁵ It's also worth noting that there is no magic in tax software where one provider can provide a greater refund than another.

⁶ For example, a reference implementation or API to hit.

⁷ For example, Line 1a of Form 1040 states “Total amount from Form(s) W-2, box 1 (see instructions)” and the Instructions add info about how to handle joint returns, a carveout for earned wages if you were an inmate in a penal institution, and instructions for pensions & annuities from nonqualified deferred compensation or nongovernmental section 457(b) plans. There are also other caveats for statutory employees listed elsewhere.
