Building a Sovereign AI Financial Assistant
What happens when you try to build a production-quality, privacy-first financial assistant using open-weight LLMs and AI-powered development tools? Six months of lessons from a recent client prototype.
Earlier this year, Yellow Radio began building a prototype financial assistant for a client, designed to help users, particularly international students, manage their money with the help of AI. The premise was straightforward. The execution was anything but.
The project had two intertwined goals. First, to build a genuinely useful financial management application with an intelligent AI advisor that could reason about a user’s transactions, accounts, and financial goals. Second, to prove that this could be done with true data sovereignty: open-weight LLMs running locally, with no user financial data leaving the infrastructure.
Six months in, the prototype is functional, the architecture is sound, and the lessons learned are worth sharing.
The technology stack
Before getting into the lessons, it is worth setting out the architectural choices made at the outset, and why. These decisions shaped everything that followed, and for anyone considering a similar project, the choices matter more than the code.
Backend: FastAPI (Python). Async by default, lightweight, and excellent for building API-first applications. Python was the natural choice given the AI/ML ecosystem, and FastAPI’s type hints and automatic documentation made it well-suited to rapid, AI-assisted development.
Frontend: React with Material-UI. A mobile-first responsive design, with Capacitor for native Android and iOS packaging. The priority was a working cross-platform interface, not a bespoke design system.
Database: PostgreSQL. ACID compliance is non-negotiable for financial data. SQLAlchemy 2.0 as the ORM, with Pydantic for data validation and settings management.
Local LLMs: Ollama. The core architectural decision. All financial reasoning runs on locally hosted open-weight models. The application is model-agnostic by design, supporting Llama 3.1 8B, Mistral 7B, DeepSeek-R1 32B, and Qwen3 30B, with models swappable without code changes. GPU acceleration via NVIDIA for inference performance.
Event-driven infrastructure: Apache Pulsar, Redis, and Celery. Pulsar handles reliable message delivery with topic-based routing by domain (financial, user, system, LLM, research). Redis and Celery manage the task queue with specialised workers for LLM queries, data processing, analytics, and general tasks. This architecture made the system observable and debuggable in ways that synchronous designs are not.
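The worker specialisation can be sketched as Celery-style glob routing from task names to queues. The task and queue names below are illustrative assumptions, not the project's actual ones; `queue_for` mimics how such patterns resolve, using the standard library's `fnmatch`.

```python
from fnmatch import fnmatch

# Illustrative routing table: task name patterns map to dedicated queues,
# so slow LLM calls never block lightweight analytics or housekeeping work.
task_routes = {
    "app.tasks.llm.*":       {"queue": "llm"},
    "app.tasks.data.*":      {"queue": "data_processing"},
    "app.tasks.analytics.*": {"queue": "analytics"},
    "app.tasks.*":           {"queue": "default"},
}

def queue_for(task_name: str) -> str:
    """Resolve a task name to its queue, first matching pattern wins."""
    for pattern, route in task_routes.items():
        if fnmatch(task_name, pattern):
            return route["queue"]
    return "default"
```

Each queue is then served by its own worker pool, so an LLM query that takes thirty seconds cannot starve the tasks that keep the rest of the application responsive.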
Market research: Anthropic Claude (optional). Used only for aggregate market data (product comparisons, interest rates, provider analysis). Explicitly opt-in, with no personal financial data sent externally. A scheduled daily research job caches results so that user queries are answered from local data, not live API calls.
Deployment: Docker Compose with NGINX. Containerised everything, including Ollama with GPU passthrough. NGINX handles SSL termination, rate limiting, and static file serving.
Authentication and security: JWT with bcrypt. Role-based access control, rate limiting (10 requests/second for the API, 5 per minute for login), security headers, and parameterised queries throughout.
The principle behind these choices was to start with production-quality architectural foundations, not to build a quick demo and harden it later. This proved to be one of the most important decisions of the entire project.
Why sovereign AI matters for financial data
The case for data sovereignty in financial services is not abstract. As we explored in The Battle for Sovereign AI, the UK’s dependency on US-headquartered cloud and AI providers creates structural risks that contractual protections cannot fully address. The US CLOUD Act alone makes this a real concern for any application handling sensitive personal financial data.
For this project, that meant a hard constraint from the outset: the core financial reasoning had to run on locally hosted open-weight models, not proprietary APIs. The user’s transactions, balances, and goals never leave the local infrastructure.
The models used evolved over the project’s lifetime. Each brought different strengths and weaknesses. The critical finding is that for structured financial analysis, where the task is to reason over a known dataset rather than generate creative content, open-weight models are genuinely capable. Not equivalent to frontier models, but good enough for the job, at a fraction of the cost and energy consumption, and with full transparency over what the model is doing and why.
The reality of AI-powered development
The application was built primarily using Cursor with a range of foundation models (including GPT-4o and Claude Sonnet), supplemented by Claude Code for architecture design and increasingly for code generation. The productivity gains were extraordinary. Cursor and the underlying models were able to build in minutes what would have taken a team weeks, and demonstrated expertise across a wide range of frameworks, platforms, and languages.
The numbers tell the story. Sprint velocity accelerated from 1.4 story points per day in the initial infrastructure phase to nearly 10 story points per day at peak, during the most intensive feature development in August 2025. The entire foundation of the application (the FastAPI backend, React frontend, PostgreSQL database, Docker containerisation, Ollama integration, authentication, and a working chatbot) was built in a matter of days.
But the experience was far from seamless, and experimentation with different models in both Cursor and Claude Code produced highly variable results. Some models excelled at architectural reasoning but struggled with implementation detail. Others were fast and fluent but introduced subtle bugs. The choice of model mattered enormously depending on the task at hand, and there was no single model that was consistently best across all types of work.
Several patterns emerged that anyone building with these tools should be aware of.
Context drift. Within longer sessions, models would occasionally “forget” earlier decisions or constraints. Fixes applied to one part of the codebase would sometimes undo improvements made elsewhere. This was particularly problematic during debugging, where a model might re-introduce a bug it had already fixed in a previous iteration.
Goal-seeking over architectural integrity. The models consistently prioritised getting the immediate task done over preserving system design. Left unchecked, this meant bug-specific hacks instead of generic solutions, and rapid technical debt accumulation. Rigorous code review was essential, not optional.
Deterministic outcomes. For a financial application, deterministic behaviour is non-negotiable. Achieving consistent, repeatable outputs from LLM-powered features required significant engineering effort: strict temperature settings, explicit data-only instructions, and circuit breakers to catch runaway behaviour.
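As an illustration of the first two measures, a request options block for Ollama might look like the following. The parameter names follow Ollama's documented options object; the specific values and the system prompt wording are assumptions, not the project's actual configuration.

```python
# Illustrative Ollama request options for repeatable output.
DETERMINISTIC_OPTIONS = {
    "temperature": 0.0,   # greedy decoding: no sampling randomness
    "top_p": 1.0,
    "seed": 42,           # fixed seed so repeated runs match exactly
    "num_predict": 512,   # hard cap on output length, a simple circuit breaker
}

# Illustrative data-only instruction; the production prompt was stricter.
SYSTEM_PROMPT = (
    "Answer ONLY from the transaction data provided. "
    "If the data does not contain the answer, say so. "
    "Never estimate, infer, or invent figures."
)
```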
The mitigation that proved most valuable was starting with a clear, well-documented architecture before writing any code. The models performed dramatically better when given precise constraints about frameworks, patterns, and boundaries than when given open-ended instructions.
Building an agentic pipeline in late 2025
The most ambitious and challenging aspect of the project was the agentic pipeline powering the AI financial advisor.
Rather than a simple question-and-answer interface, the advisor uses an iterative reasoning loop: plan, execute, evaluate, refine. When a user asks a question, the system first triages the query to understand intent and complexity. It then plans which tools to use (transaction lookups, market research, financial health calculations), executes them, evaluates whether it has enough context to answer well, and iterates if needed.
This is genuinely powerful when it works. A user can ask “how can I save money?” and the system will agentically extract their recent transactions, identify categories where there may be overspend, compare their spending against the financial goals set in their profile, and cross-reference their existing providers against the daily market research cache to identify opportunities, such as when a better credit card rate is available. The response is contextual, data-driven, and specific to that user’s actual financial position.
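Stripped of the LLM calls, the loop's control flow can be sketched as follows. The tool names and the sufficiency check are placeholders; in the real pipeline, planning and evaluation are themselves model calls.

```python
from dataclasses import dataclass, field

@dataclass
class AgentState:
    question: str
    context: dict = field(default_factory=dict)
    iterations: int = 0

def run_agent(question, plan, execute, evaluate, answer, max_iters=3):
    """Iterate until the evaluator judges the context sufficient, or bail out."""
    state = AgentState(question)
    while state.iterations < max_iters:
        state.iterations += 1
        for tool in plan(state):                       # 1. plan: choose tools
            state.context[tool] = execute(tool, state)  # 2. execute them
        if evaluate(state):                            # 3. evaluate: enough context?
            break                                      # 4. refine: otherwise loop
    return answer(state)
```

The `max_iters` cap is doing real work here, as the next section explains.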
Getting it to work reliably was another matter entirely.
The first major challenge was hallucination. Small open-weight models, when faced with financial data, have a strong tendency to invent transactions, fabricate totals, or make “educated guesses” that sound plausible but are factually wrong. Fighting this was a persistent effort from the very first day of development, requiring increasingly strict prompting, data-only response constraints, and validation layers.
The second challenge was infinite loops. The iterative reasoning approach, by design, allows the model to decide it needs more information and loop back for another iteration. In practice, models would sometimes enter loops where they repeatedly requested the same data, or got stuck in circular reasoning patterns. Implementing circuit breakers, maximum iteration limits, and repetitive-pattern detection was essential, though the pattern detection itself had to be carefully tuned to avoid interrupting legitimate multi-step reasoning.
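A minimal version of the repetitive-pattern guard might look like this. The window size and repeat threshold are illustrative tuning values; in practice the check has to be loose enough not to interrupt legitimate multi-step reasoning.

```python
from collections import deque

class LoopBreaker:
    """Trip if the same tool call (name + arguments) recurs within a window."""

    def __init__(self, window: int = 6, max_repeats: int = 2):
        self.recent = deque(maxlen=window)  # sliding window of recent calls
        self.max_repeats = max_repeats

    def check(self, tool: str, args: dict) -> None:
        """Record a tool call; raise if it has repeated too often recently."""
        call = (tool, tuple(sorted(args.items())))
        if self.recent.count(call) >= self.max_repeats:
            raise RuntimeError(f"circuit breaker tripped: repeated call {call}")
        self.recent.append(call)
```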
The third, and most memorable, challenge was when the model went comprehensively off the rails. During testing with sample transaction data, the local LLM decided that the best financial advice for the user was to switch their bank account to Monzo. Having made this recommendation, it then hallucinated that it was Monzo, adopted the persona of a Monzo customer service agent, and entered an infinite loop attempting to draft a letter inviting the user to open an account. This was simultaneously the funniest and most instructive failure of the project: a vivid demonstration of why guardrails, validation, and the ability to kill a runaway process are not optional features.
What worked well
Architecture-first development. Defining the tech stack, event-driven architecture, database schema, and API contracts before letting AI tools generate code produced dramatically better results than iterating without constraints. The models are excellent implementers when given clear specifications.
Privacy by default. Making local LLM usage the default, with cloud services as an explicit opt-in, simplified the architecture and avoided the temptation to reach for proprietary APIs whenever the local model struggled. It also meant the data sovereignty story was clean from the start, not retrofitted.
Open-weight model flexibility. Being able to swap between models without changing application code meant the project could take advantage of rapid improvements in the open-weight ecosystem. A model that was state-of-the-art in July was outperformed by a new release in October. This flexibility is a genuine advantage over being locked into a single provider’s API.
Conversational memory. The system maintains session continuity through automatic conversation summarisation, entity tracking across messages, and preference persistence. This means users do not have to repeat context across interactions, and the advisor builds a progressively richer understanding of each user’s financial situation.
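A minimal sketch of that session memory: a rolling window of turns plus an entity map, with older turns collapsed into a summary. In the real system the summary is LLM-generated; here a stand-in string marks where it would go, and the field names are assumptions.

```python
class SessionMemory:
    """Rolling conversation window plus entities tracked across messages."""

    def __init__(self, max_turns: int = 10):
        self.turns: list[str] = []
        self.entities: dict[str, str] = {}
        self.max_turns = max_turns

    def add_turn(self, message: str, found_entities: dict[str, str]) -> None:
        self.turns.append(message)
        self.entities.update(found_entities)  # e.g. {"bank": "Monzo"}
        if len(self.turns) > self.max_turns:
            # Stand-in for LLM summarisation: collapse the oldest turns
            # into a single summary slot at the front of the window.
            self.turns = (["[summary of earlier turns]"]
                          + self.turns[-(self.max_turns - 1):])
```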
What was less successful
Scope creep. The project’s velocity dropped sharply after the initial intensive development period. A sprint originally focused on guardrails and service quality expanded to include AI video avatars and other features that were interesting but not essential. Maintaining discipline on scope is harder when the tools make it so easy to start new features.
NLP dependency churn. An early decision to integrate spaCy for natural language processing proved more trouble than it was worth in containerised environments. It was eventually replaced with simpler keyword extraction using fuzzy matching (rapidfuzz), which was more reliable and far lighter. The lesson: reach for the simplest tool that solves the problem, especially when running in constrained environments.
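The replacement approach is easy to sketch with the standard library alone (the project used rapidfuzz, which applies the same idea with much faster scoring): match free-text tokens against a known vocabulary with a fuzzy similarity score. The vocabulary and threshold here are illustrative.

```python
from difflib import SequenceMatcher

# Illustrative vocabulary; the real application matched against its own
# transaction categories and financial terms.
CATEGORIES = ["groceries", "transport", "subscriptions", "savings"]

def extract_keywords(text: str, vocab=CATEGORIES, threshold=0.8):
    """Return vocabulary terms that fuzzily match any word in the text."""
    found = []
    for word in text.lower().split():
        for term in vocab:
            if SequenceMatcher(None, word, term).ratio() >= threshold:
                found.append(term)
    return found
```

Misspellings like "grocerys" still resolve to the right category, which covers most of what the spaCy pipeline was actually being used for, at a fraction of the image size.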
The four-month gap. After an extraordinarily productive July and August, the project went quiet from September through November. Partly this reflected competing priorities, but it also reflected a degree of fatigue from the intensity of AI-assisted development. The constant cycle of generating, reviewing, debugging, and regenerating code is cognitively demanding in ways that are different from, but no less real than, traditional development.
Looking forward
The prototype has a clear backlog of work ahead: automated testing, semantic search via vector databases, fine-tuning local models on domain-specific financial data, and a proper deployment pipeline. The fine-tuning work is particularly promising. The agentic pipeline already exports successful reasoning paths to JSONL format, creating a growing dataset of high-quality financial reasoning examples that can be used to train smaller, faster, more accurate domain-specific models.
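The JSONL export itself is simple: one JSON object per line, one reasoning path per record. The record schema shown here is an assumption; the point is that the format is directly consumable by common fine-tuning tooling.

```python
import json

def export_reasoning_paths(paths: list[dict], out_path: str) -> int:
    """Write one JSON object per line; return the number of records written."""
    with open(out_path, "w", encoding="utf-8") as f:
        for record in paths:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(paths)
```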
The broader lesson from this project is that building production-quality AI applications with open-weight models and AI-powered development tools is genuinely viable, but requires more discipline, not less, than traditional development. The tools are powerful, fast, and occasionally brilliant. They are also inconsistent, context-limited, and prone to subtle errors that compound quickly if not caught. The human in the loop is not a nice-to-have; it is the difference between a working system and an impressive demo that falls apart under real use.
For anyone considering a similar path: start with architecture, be specific about constraints, review everything, and keep a close eye on the Monzo letters.