XAUUSD Trading System

An LLM-driven trading system for gold (XAUUSD) on MetaTrader 5. Three analytical tiers, three memory layers, no LLM in the hot path.

2026 · Backend · LLM systems · Trading · Architect & sole engineer · in development

I trade XAUUSD — the gold-against-USD pair — on MetaTrader 5 during the London session. After about six months of doing it manually, two patterns were obvious. The first: my best decisions were the ones I'd made the same way the day before, against the same market structure, with the same reasoning. The second: my worst decisions were the ones where I'd convinced myself a setup was different from what it actually was, usually under time pressure.

The system in this case study is an attempt to externalize the first pattern and remove the conditions that produce the second. It's a long-running Python service that watches the gold market continuously, analyzes setups against a persistent memory of past setups, and places pending orders rather than reacting to live price. The LLM is the analyst. The execution layer is deterministic. There is no LLM call in the hot path between price arriving and an order being placed.

This is not a “Claude trades for me” system. The LLM doesn't see live price ticks, doesn't have authority to size positions, doesn't have authority to override risk rules. What it does have is access to the chart context, a memory of every previous setup it has analyzed, and the job of deciding whether the current setup looks like the kind of setup that has historically worked.

The system is built in three analytical tiers and three memory layers. The tiers determine how much expensive analysis a given moment of market data warrants. The memory layers determine what context is available at each tier.

The three tiers. A local Python pre-filter watches the 5-minute chart and applies cheap, deterministic checks: market structure direction, EMA alignment, ATR, RSI bounds, session volatility profile. About 95% of price snapshots fail the pre-filter and never reach the LLM. The ones that pass go to a Sonnet-class analysis call which produces a confidence score and a written rationale. Setups above a confidence threshold are escalated to an Opus-class confirmation call which adds news and macro context — Fed commentary, USD strength, geopolitical risk — before deciding whether to place a pending order. The cost shape this produces is roughly $1–3/day in API spend across an active trading day, not the $50+/day a naive “ask the LLM about every candle” approach would burn through.
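A minimal sketch of what a tier-one check could look like, assuming each snapshot already carries its computed indicators. The `Snapshot` shape, the field names, and the thresholds (RSI between 30 and 70, a minimum ATR of 1.0) are illustrative assumptions, not the system's actual values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Snapshot:
    """One 5-minute bar plus precomputed indicators (hypothetical shape)."""
    close: float
    ema_fast: float
    ema_slow: float
    atr: float
    rsi: float
    structure: str      # "up", "down", or "range"
    in_session: bool    # True while the traded session is open


def prefilter(s: Snapshot, min_atr: float = 1.0,
              rsi_low: float = 30.0, rsi_high: float = 70.0) -> Optional[str]:
    """Return "long" or "short" if every cheap check passes, else None."""
    # Session and volatility gates: skip quiet or out-of-session bars.
    if not s.in_session or s.atr < min_atr:
        return None
    # RSI bounds: skip stretched conditions.
    if not (rsi_low <= s.rsi <= rsi_high):
        return None
    # Structure and EMA alignment must agree on direction.
    if s.structure == "up" and s.close > s.ema_fast > s.ema_slow:
        return "long"
    if s.structure == "down" and s.close < s.ema_fast < s.ema_slow:
        return "short"
    return None
```

Only snapshots for which `prefilter` returns a direction would go on to the paid analysis call; everything else is dropped at zero API cost.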

The three memory layers. A frozen system prompt — versioned in the repo as prompts/analysis_v1.0.md — defines the trading framework, style, and risk rules; it does not change without an explicit version bump. An active-lessons file maintains a short list of recent corrections distilled from postmortems on losing trades. A ChromaDB vector store holds every past observation and every past trade with its outcome, retrievable by similarity at analysis time. The combination means the LLM at tier two doesn't analyze from scratch — it analyzes against retrieved context from setups that looked like this one before.
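To make the layering concrete, here is a rough sketch of how the three layers could be combined into one analysis context. It uses a plain cosine-similarity ranking over an in-memory list as a stand-in for the ChromaDB query, and the function names, section labels, and embedding format are all assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query: Sequence[float],
             store: List[Tuple[Sequence[float], str]], k: int = 2) -> List[str]:
    """Return the k stored notes most similar to the query embedding.

    Stand-in for a vector-store query; `store` holds (embedding, note) pairs.
    """
    ranked = sorted(store, key=lambda item: cosine(query, item[0]), reverse=True)
    return [note for _, note in ranked[:k]]


def build_context(system_prompt: str, lessons: List[str],
                  similar: List[str], snapshot_text: str) -> str:
    """Stack the three layers ahead of the current setup description."""
    parts = [system_prompt, "ACTIVE LESSONS:"]
    parts += ["- " + lesson for lesson in lessons]
    parts.append("SIMILAR PAST SETUPS:")
    parts += ["- " + note for note in similar]
    parts += ["CURRENT SETUP:", snapshot_text]
    return "\n".join(parts)
```

The frozen prompt comes first and never varies between calls, the lessons change slowly, and only the retrieved setups and the snapshot differ from one analysis to the next.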

                ┌──────────────────────────────────────┐
                │  MT5 → 5-min OHLC + indicators       │
                └──────────────────────────────────────┘
                                  │
                                  ▼
                ┌──────────────────────────────────────┐
                │  TIER 1  Local pre-filter            │
                │  structure · EMA · ATR · RSI · session│
                │  ~95% rejected here, no LLM cost     │
                └──────────────────────────────────────┘
                                  │ pass
                                  ▼
       ┌──────────────────────────────────────────────────┐
       │  TIER 2  Sonnet analysis                         │
       │  inputs: snapshot + retrieved similar past setups │
       │  output: { confidence, rationale, action }        │
       └──────────────────────────────────────────────────┘
                                  │ confidence ≥ threshold
                                  ▼
       ┌──────────────────────────────────────────────────┐
       │  TIER 3  Opus confirmation + news                 │
       │  output: CONFIRM | REJECT | MODIFY                │
       └──────────────────────────────────────────────────┘
                                  │ confirm
                                  ▼
       ┌──────────────────────────────────────────────────┐
       │  EXECUTION  pure Python, no LLM                   │
       │  pending order · hard-coded risk rules · expiry   │
       └──────────────────────────────────────────────────┘
                                  │
                                  ▼
       ┌──────────────────────────────────────────────────┐
       │  Vector DB write: observation + trade + outcome   │
       └──────────────────────────────────────────────────┘
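
The flow in the diagram can be sketched as a single routing function. The tiers are passed in as plain callables so the sketch stays independent of any model API; the 0.7 threshold, the dictionary shape, and the return labels are assumptions, and treating MODIFY as "do not place this order as proposed" is a simplification rather than the system's documented behavior.

```python
from typing import Callable, Dict


def route(snapshot: Dict,
          prefilter: Callable[[Dict], bool],
          analyze: Callable[[Dict], Dict],
          confirm: Callable[[Dict, Dict], str],
          threshold: float = 0.7) -> str:
    """Walk a snapshot through the tiers and report where it stopped."""
    # Tier 1: cheap local checks, no LLM cost.
    if not prefilter(snapshot):
        return "rejected_tier1"
    # Tier 2: analysis call returning at least {"confidence": float}.
    analysis = analyze(snapshot)
    if analysis["confidence"] < threshold:
        return "rejected_tier2"
    # Tier 3: confirmation returning "CONFIRM", "REJECT", or "MODIFY".
    verdict = confirm(snapshot, analysis)
    if verdict != "CONFIRM":
        return "tier3_" + verdict.lower()
    # Only a confirmed setup reaches the deterministic execution layer.
    return "order"
```

Because each tier is an injected function, the same routing can run against stubbed model calls in a test or replay setting.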

Risk rules are not LLM-controlled. Position size is computed deterministically from the stop-loss distance and a fixed per-trade risk percentage. Minimum risk-reward ratio is enforced in code. Daily loss caps and maximum-drawdown shutoffs run as guards around every order. The LLM cannot override these — it can only propose entries that the deterministic layer accepts or rejects. This was a deliberate design choice. An analyst that occasionally hallucinates is acceptable. An analyst that occasionally hallucinates a position size is not.
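A sketch of how deterministic sizing and guards of this kind might look. The names are hypothetical, and `value_per_point_per_lot` (the account-currency value of a 1.00 price move on one lot) is broker-dependent; the value of 100 in the usage note assumes a 100-ounce contract.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposal:
    """An entry proposed by the analyst: prices only, never a size."""
    entry: float
    stop: float
    target: float


def position_size(equity: float, risk_pct: float, entry: float, stop: float,
                  value_per_point_per_lot: float, lot_step: float = 0.01) -> float:
    """Lots such that hitting the stop loses at most equity * risk_pct."""
    stop_distance = abs(entry - stop)
    if stop_distance <= 0:
        raise ValueError("stop must differ from entry")
    risk_amount = equity * risk_pct
    raw_lots = risk_amount / (stop_distance * value_per_point_per_lot)
    # Round down to the lot step so the realized risk never exceeds the budget.
    steps = math.floor(raw_lots / lot_step + 1e-9)
    return round(steps * lot_step, 8)


def accept(p: Proposal, min_rr: float, day_pnl: float, daily_loss_cap: float) -> bool:
    """Deterministic gate: geometry, minimum risk-reward, daily loss cap."""
    # Stop and target must sit on opposite sides of the entry.
    if (p.target - p.entry) * (p.entry - p.stop) <= 0:
        return False
    if abs(p.target - p.entry) / abs(p.entry - p.stop) < min_rr:
        return False
    # Once the day's loss reaches the cap, nothing new is accepted.
    if day_pnl <= -daily_loss_cap:
        return False
    return True
```

For example, `position_size(10_000, 0.01, 2000.0, 1996.0, 100.0)` returns `0.25`: a 100.00 risk budget divided by a 4.00 stop distance at 100 per point per lot. The analyst's proposal carries no size field at all, so there is nothing for it to hallucinate.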

Operability is via Telegram. A FastAPI server runs alongside the trading loop, exposing webhook endpoints for high-impact news events (so the system can pause around scheduled releases) and a Telegram bot interface with topic-separated channels: alerts, trades, results, weekly report. Commands like /status, /pause, /resume, /stats work without restarting the process. The operator's interface to the system is a chat, not a dashboard.

Deployment is a Windows VPS in the same datacenter as the broker. MT5's Python package is Windows-only. The broker's MT5 server is in Amsterdam, so the system runs on a Contabo Windows VPS in Amsterdam, putting the network latency to the broker in single-digit milliseconds. Development happens on macOS; deployment is a Windows binary. The codebase is structured so that the development environment uses simulated MT5 data and the production environment uses the live package, gated by an environment variable.
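One way the environment gate could be structured is sketched below. The variable name `TRADING_ENV`, the class names, and the canned prices are assumptions for illustration; the live branch relies on the `MetaTrader5` package's `initialize()` and `symbol_info_tick()` calls and imports the package lazily so that a macOS machine never tries to load it.

```python
import os
from typing import Mapping, Optional, Sequence


class SimulatedFeed:
    """Replays a fixed list of prices; safe on any operating system."""

    def __init__(self, prices: Sequence[float]) -> None:
        if not prices:
            raise ValueError("prices must not be empty")
        self._prices = list(prices)
        self._i = 0

    def latest_price(self, symbol: str) -> float:
        price = self._prices[self._i]
        # Hold the last price once the replay is exhausted.
        self._i = min(self._i + 1, len(self._prices) - 1)
        return price


class LiveFeed:
    """Thin wrapper over the Windows-only MetaTrader5 package."""

    def __init__(self) -> None:
        import MetaTrader5 as mt5  # lazy import: only reached in production
        if not mt5.initialize():
            raise RuntimeError("could not connect to the MT5 terminal")
        self._mt5 = mt5

    def latest_price(self, symbol: str) -> float:
        tick = self._mt5.symbol_info_tick(symbol)
        if tick is None:
            raise RuntimeError("no tick available for " + symbol)
        return tick.bid


def make_feed(env: Optional[Mapping[str, str]] = None):
    """Pick the data source from the environment; defaults to simulation."""
    env = os.environ if env is None else env
    if env.get("TRADING_ENV", "dev") == "live":
        return LiveFeed()
    return SimulatedFeed([2000.0, 2000.5, 1999.8])
```

Defaulting to the simulated feed means a missing or misspelled variable fails safe: the worst case is a run against canned data, never an accidental live connection.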

The honest version of this writeup includes the things that aren't true yet.

The system is currently running on a paper-traded $200 account against live market data, not against real capital. The vector database is small — a few months of observations and a few dozen trades — which means the similarity retrieval is closer to a sanity check than a meaningful prior. The active-lessons file is short. The confidence-vs-actual-win-rate calibration that would let me trust the system at scale doesn't have enough samples yet to be statistically meaningful.
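The calibration check itself is simple to state: group closed trades by the confidence the analyst assigned and compare each group's win rate with that confidence. A minimal sketch, assuming confidence is a float in [0, 1] and each record is a hypothetical `(confidence, won)` pair:

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def calibration(records: Iterable[Tuple[float, bool]],
                n_bins: int = 10) -> Dict[int, Tuple[int, float]]:
    """Map bin index -> (trade count, observed win rate).

    Bin i covers confidences from i/n_bins up to (i+1)/n_bins.
    """
    counts = defaultdict(lambda: [0, 0])  # bin -> [trades, wins]
    for confidence, won in records:
        # Clamp so a confidence of exactly 1.0 lands in the top bin.
        idx = min(int(confidence * n_bins), n_bins - 1)
        counts[idx][0] += 1
        counts[idx][1] += int(won)
    return {idx: (n, wins / n) for idx, (n, wins) in sorted(counts.items())}
```

A well-calibrated analyst would show win rates near 0.75 in bin 7, near 0.95 in bin 9, and so on. With only a few dozen trades most bins hold a handful of samples, which is exactly why the numbers are not yet meaningful.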

There's a deeper tradeoff in the design itself. A pure-LLM system gives up too much determinism for a financial product; a pure-rules system gives up the pattern recognition that's the actual reason to use an LLM. The three-tier architecture is an attempt to put the LLM where its judgment matters and keep it out of where its inconsistency would hurt. Whether that's the right line is going to be answered by months of paper trading, not by the design document.

A smaller decision worth noting: I chose pending orders over market orders explicitly to remove latency from the system's edge. The LLM can take 10–30 seconds to analyze a setup. By the time it decides, the live market has moved. A pending order placed at a logical retest level either fills at the price the system intended or it doesn't fill at all — which is a much cleaner failure mode than filling 5 pips worse than the analysis assumed.
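The fill-or-nothing behavior can be illustrated with a small replay sketch. It assumes a limit-style pending order, bars given as `(low, high)` pairs, and an expiry measured in bars; it ignores spread and gaps, and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class PendingOrder:
    price: float        # the retest level chosen during analysis
    expiry_bars: int    # bars the order stays live after placement


def resolve(order: PendingOrder,
            bars: List[Tuple[float, float]]) -> Tuple[str, Optional[float]]:
    """Return ("filled", price) if a live bar trades through the level,
    otherwise ("expired", None)."""
    for low, high in bars[:order.expiry_bars]:
        if low <= order.price <= high:
            # Filled at the intended level, however long the analysis took.
            return ("filled", order.price)
    return ("expired", None)
```

There are only two outcomes, and in both of them the assumptions behind the analysis still hold: either the entry is the analyzed price, or there is no position.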

The system runs. Pre-filter, tiered analysis, vector retrieval, pending-order execution, postmortem write-back to the vector DB — the full loop is implemented. The Telegram interface works. The risk guards work. The Amsterdam VPS deployment works.

What it doesn't have yet is the track record that would justify trusting it with real capital, and that's the next chapter rather than this one.

The next milestone is a meaningful sample of paper-traded setups: enough to compare confidence scores against actual outcomes and decide whether the calibration is real or whether the LLM is just confident. Beyond that, two architectural items are queued: a proper backtest harness that replays historical price data through the same three-tier pipeline (a harness exists today, but its Sonnet/Opus calls are stubbed), and an event-based “Powell straddle” mode that places opposing pending orders ahead of high-impact news releases and cancels the loser once direction is established. Both wait for the calibration data.