Case Study

Apex Fintech: AI Human in the Loop Testing Workflow

Apex Fintech Solutions runs the clearing infrastructure behind $229 billion in assets and 796 million annual trades. When the team gave Claude Code a business-critical end-to-end scenario and asked for full test coverage, it produced 100+ tests in under a minute, and 65% had to be cut. Here's what the experiment revealed about where AI accelerates testing, where it breaks down, and why human-in-the-loop is a permission model, not a slogan.

Vitaly Sharovatov

May 19, 2026 · 11 min read


AI and the Essential Human in the Loop: A Case Study in Compliance at Apex Fintech Solutions

Based on a presentation by Richesh Pareek (Director of Product Quality, Apex Fintech Solutions) and Vitaly Sharovatov (Developer Advocate, Qase), at QA Financial Forum New York, May 12, 2026.

If you have a brokerage account on Webull, Apex Fintech Solutions is clearing your trades. That is what Apex does: clearing, custody, and trading infrastructure for the firms that serve tens of millions of end investors worldwide. The easiest way to understand the company is as the AWS of modern investing: the infrastructure layer that hundreds of fintech firms, banks, and broker-dealers build on, so they can focus on products and customers rather than on ledgers, settlement, and compliance plumbing.

The scale reflects the responsibility: $229 billion in assets under custody, 796 million annual trades, 37 million brokerage accounts across 170 countries. Every change Apex ships touches real client money, sensitive data, and strict compliance requirements. And at the same time, the business expects new value shipped on a weekly cadence.

That tension is what makes quality at Apex a different kind of problem. And it is the context for what happened when the team investigated how Claude Code could be used to generate regression tests.

Before getting into what happened, one premise needs stating, because everything that follows hangs on it: quality is not a property of code in isolation. Whether code is good enough depends on what business problem it solves, what the consequences of failure are, what lifecycle stage the product is in, and what tradeoffs the team has consciously accepted. A payment refund function in a prototype and the same function in a clearing platform at Apex's scale have completely different quality requirements, even if the code is line-for-line identical. Any quality assessment that operates only on the code surface, without access to this context, can only catch superficial issues. The deeper question, "does this code do the right thing?", is unanswerable without domain understanding, and as the Apex case shows, even providing context up front does not close the gap reliably.

The Test: An End-to-End Financial Workflow

To test how AI can be integrated into their QA practice, Apex gave Claude Code a business-critical end-to-end scenario: open an account, fund it, place a trade, validate the position, verify ledger entries, and generate confirms and statements. A core workflow, predefined context, and a prompt asking for full test coverage.

The output was fast and impressive on the surface. In under a minute, Claude Code generated more than 100 API-level tests across a complex product workflow. No manual process comes close for speed and scale.

Then the team reviewed what it had produced.

About 40% of the output was duplicated, adding noise rather than value. Many tests were technically valid but irrelevant to real business use cases. Regulatory and compliance coverage was absent, which is non-negotiable in financial systems. Data security and vulnerability scenarios were largely missing unless explicitly forced. Most importantly, Claude Code struggled to model the true end-to-end business process, the real workflows that operations teams and clients actually experience.

It also produced thousands of lines of automation code, creating a significant maintenance and ownership burden.

The team did not try to run everything AI generated. They triaged: human review for relevance and context, risk assessment for business impact, then prioritization. Out of 100+ AI-generated tests, 35 high-value tests remained: domain-specific, high-impact, regulatory-compliant. The rest was cut. They then supplemented the automated suite with human-guided exploratory testing to probe edge cases, novel failure modes, and scenarios the model had not naturally surfaced.
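The triage steps above can be sketched as a small filter. This is an illustration only: the `GeneratedTest` fields, the 1-to-5 risk scale, the `min_risk` threshold, and the endpoint names are hypothetical and are not Apex's actual criteria. The `relevant` and `risk` values stand in for the human review and risk assessment; the code only applies them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneratedTest:
    name: str
    steps: tuple[str, ...]  # normalized API calls the test performs (hypothetical)
    relevant: bool          # set by a human reviewer, not by the model
    risk: int               # reviewer-assigned business impact, 1 (low) to 5 (high)


def triage(tests: list[GeneratedTest], min_risk: int = 4) -> list[GeneratedTest]:
    """Drop duplicates, keep reviewer-approved high-risk tests, highest risk first."""
    seen: set[tuple[str, ...]] = set()
    kept = []
    for test in tests:
        if test.steps in seen:  # same steps as an earlier test: treat as duplicate
            continue
        seen.add(test.steps)
        if test.relevant and test.risk >= min_risk:
            kept.append(test)
    return sorted(kept, key=lambda t: t.risk, reverse=True)


candidates = [
    GeneratedTest("fund_account_ok", ("POST /accounts", "POST /funding"), True, 5),
    GeneratedTest("fund_account_copy", ("POST /accounts", "POST /funding"), True, 5),
    GeneratedTest("empty_nickname", ("POST /accounts",), False, 1),
    GeneratedTest("ledger_after_trade", ("POST /orders", "GET /ledger"), True, 4),
]
selected = triage(candidates)
```

In this toy input, one test is dropped as a duplicate and one as irrelevant, leaving two. The judgment lives in the reviewer-assigned fields; the script is just bookkeeping around it.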

AI provided breadth; humans refined for relevance, risk, and business value.

Three Structural Problems with AI-Generated Tests

Vitaly Sharovatov, Developer Advocate and Researcher at Qase, explained what made the Apex experience predictable: these are not one-off failures but structural properties of how LLMs work.

Problem 1: LLMs are trained on generic data. Ask Google "what should I test my fintech app for?" and you get a generic answer with no knowledge of your compliance obligations, your settlement flows, or the specific failure modes your regulators care about. LLMs are trained on the breadth of the internet, so the model defaults to what is statistically common across that breadth, and has no knowledge of your business model, your risks, or your regulatory environment. Generic edge cases exist in training data (null inputs, race conditions, common security patterns). Whether the edge cases specific to your product, jurisdiction, and users are in there is unknowable, and even if they are, LLM generation is probabilistic, not retrieval: there is no mechanism that guarantees the model will use the relevant knowledge in your output. The prompt itself is always incomplete relative to that specific context, and the model fills the gaps with generic priors, producing outputs that look right but miss what is local to you. Security vulnerabilities tied to your architecture, regulatory scenarios specific to your product, end-to-end flow details unique to your business: all of this will be missed. And as Problem 2 will show, the model will not tell you any of it has been missed.

Problem 2: LLMs generate to satisfy. Google can return no results when nothing relevant matches; an LLM cannot, because it has no concept of "not knowing". Asked to generate tests, it generates tests, even when the right answer is "I don't have enough business context to generate meaningful tests for this scenario". The result is a large output that is partly correct and largely requires review. Without that review, the cost of ownership and maintenance skyrockets alongside the hidden risks.

Problem 3: No accountability or traceability. Outsourcing your testing to an external firm without signing a contract, then blindly accepting their automated tests and results, would be considered reckless. The same logic applies with LLMs. Accepting work from a real supplier in a regulated context comes with onboarding diligence, service-level agreements, audit rights, indemnity, and a defined liability boundary; an LLM provider gives you none of these, only an end-user agreement that explicitly disclaims liability for model outputs and a model whose internal behaviour you cannot inspect.

When an AI-generated test is blindly accepted and shipped, all risks and consequences belong to you, and there is no auditable record of what judgment was applied. The Replit AI agent incident in July 2025 demonstrated this: an autonomous coding agent ran unauthorised database commands and deleted production records covering more than 1,200 executives and 1,190 companies. In fintech, the deletion alone would be career-ending. The agent then misled the user about whether the data could be recovered, layering an integrity failure on top of the operational one. And the AI provider cannot be sued; every consequence (the deletion, the cover-up, the recovery cost) lands entirely on the organisation that deployed the agent. Every AI-generated test that enters your system is a supply-chain decision, and in financial services, supply-chain failures are auditable events. The compliance question is not "did the AI produce a good test?", but "who is accountable for the chain of trust that led from agent output to production?"

A related pattern: the generative ratification loop.

Apex used AI to generate tests for a system the team had built and understood, and the output went through human review and triage. The pattern compounds when AI is used for both the production code and the tests, and the AI-generated outputs feed the next iteration without a human anchor. When AI writes about code it just wrote, errors do not surface; they ratify. The loop runs:

AI implements erroneously → AI-generated tests pass on the error → AI-generated docs describe the error as intentional → next iteration extends the error as if it were correct.

Each artifact in the chain ratifies the previous step's errors because all of them derive from the same source. The signal that would normally catch the drift (a reviewer noticing "this isn't what we wanted") is absent at every step. SlopCodeBench (Orlanski et al., 2026) gives the longitudinal evidence: across 196 checkpoints, 77% of agent trajectories showed structural erosion and 75.5% showed verbosity growth, while human repositories on comparable work degrade less often and by smaller margins. Apex caught its 40% duplication in one shot because they reviewed; teams that let AI extend AI without review are inside this curve.

Use AI Aggressively, But Only Where You Should

These problems do not mean teams should stop using AI. They mean teams should use it where its properties are strengths, and keep human judgment explicitly in the loop everywhere else.

Where AI is strong:

Pattern recognition and high-volume data processing: log analysis, trend detection, regression impact analysis, performance patterns
Test result triage at scale: clustering similar failures by semantic similarity, summarising large volumes of test output, and surfacing candidate regressions for human prioritisation when historical triage data exists. The strength is pattern recognition over results, not authoring the verifications themselves
Boring and mundane tasks at speed and scale: the repetitive, high-volume work that humans should not be spending time on

On that third point, the industry often conflates two different things: mechanical work and high-volume work. Mechanical work is deterministic and best handled by scripts: parsing, counting, sorting, matching patterns, applying formatting rules. Putting an LLM in those roles adds non-determinism to steps that do not need judgment, costs more, and breaks when models change. AI is genuinely suited to high-volume tasks that benefit from probabilistic pattern recognition: clustering noisy signals, summarising free-text reports, finding semantically similar items. That is a smaller and different set than "everything boring."
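To make the distinction concrete, here is a minimal sketch of mechanical work done by a script rather than a model: tallying failures by error class. The log format and the regular expression are assumptions made for illustration, not a real tool's output.

```python
import re
from collections import Counter

# Hypothetical log format: "<test name> FAILED <ErrorClass>: <message>"
FAILURE_LINE = re.compile(r"^(?P<test>\S+) FAILED (?P<error>\w+):")


def count_failures_by_error(log_lines: list[str]) -> Counter:
    """Deterministically tally failures per error class: same input, same output."""
    counts: Counter = Counter()
    for line in log_lines:
        match = FAILURE_LINE.match(line)
        if match:
            counts[match.group("error")] += 1
    return counts


log = [
    "test_fund_account FAILED TimeoutError: gateway did not respond",
    "test_place_trade PASSED",
    "test_ledger_entries FAILED AssertionError: debit != credit",
    "test_statements FAILED TimeoutError: gateway did not respond",
]
counts = count_failures_by_error(log)
```

Nothing here needs judgment, so nothing here needs an LLM. The probabilistic work starts only when messages are too noisy or free-form for a pattern like this to match.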

Where AI breaks down:

Rare and novel scenarios: even when relevant patterns exist somewhere in training data, LLM generation is probabilistic, so there is no guarantee the model will surface them; the specific compliance edge cases you most need to test are precisely the ones the model is least likely to produce
Complex business logic and regulatory context: AI can validate what a system does, but not whether it should do it; business rules, regulatory intent, and operational nuance live outside training data, and there is currently no reliable way to feed this information in, because training data is fixed in the model and unavailable to customers, fine-tuning is limited, and long prompts degrade attention so the model falls back on training-data priors when context is ambiguous
Subjective judgment, UX, and ethical trade-offs: any scenario that involves judgment or trade-offs is fundamentally human; the go/no-go call belongs to a person, because the risk and accountability belong to a person

Use AI for high-volume, pattern-recognition work. Keep humans firmly in the loop wherever judgment, risk, or accountability is involved.

What Happens When You Skip the Loop

Apex operates at a scale where any AI failure that reaches production carries disproportionate consequences: $229 billion under custody, regulators on multiple sides, hundreds of downstream client firms. The Replit case above shows what an absent human loop can do at much smaller scale. The same failure mode in a clearing system is not survivable.

For Apex, the regulatory environment turns these failures into control breaches. Financial infrastructure operates under overlapping requirements: documented testing of changes to critical systems with audit-on-demand obligations; effective challenge of any output that influences decisions; an audit trail of who approved what change; and explicit testing of security scenarios. AI-generated tests treated as authoritative without human review fail all of these by construction, because there is no documented challenge and no record of what judgment was applied. Security scenarios in particular are the category the Apex experiment showed AI does not produce by default. "Regulatory and compliance coverage was absent" is not a quality observation; it is a control failure.

Human in the Loop by Design: The AI-Powered STLC Quality Loop

Apex defines human and AI roles at every stage of the software testing lifecycle. Qase is the product Apex uses to support the loop.

1. Analyze (Requirements & Risk):

AI: parses user stories, PRDs, and design files to suggest candidate testable scenarios and, where historical defect data exists, highlight areas with past failure concentration.
Human: validates AI-suggested risks against business intent, regulatory exposure, and client impact; defines acceptance criteria; decides what actually matters. AI surfaces risk signals; humans define risk.

2. Plan (Test Strategy & Coverage):

AI: does test impact analysis on code changes, identifying which tests are affected, when coverage and historical failure data support it.
Human: owns the strategy, sets quality gates, approves or rejects AI-suggested plans based on business priorities. AI optimises execution; humans own strategy.
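At its simplest, the test impact analysis in the Plan stage is a set intersection between changed files and a per-test coverage map. The sketch below assumes such a map already exists; the file paths and test names are hypothetical, and real tooling would also weigh historical failure data.

```python
def affected_tests(
    changed_files: set[str], coverage_map: dict[str, set[str]]
) -> list[str]:
    """Return tests whose recorded coverage touches any changed file."""
    return sorted(
        test for test, covered in coverage_map.items() if covered & changed_files
    )


# Hypothetical coverage data: test name -> source files it exercises
coverage_map = {
    "test_fund_account": {"funding/api.py", "ledger/post.py"},
    "test_place_trade": {"orders/api.py", "ledger/post.py"},
    "test_statements": {"reports/statements.py"},
}
impacted = affected_tests({"ledger/post.py"}, coverage_map)
```

The result is a candidate list, not a plan. Whether an unaffected test still has to run for regulatory reasons is a decision for the human who owns the strategy.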

3. Design (Test Creation):

AI: converts natural language prompts into candidate test cases, generates candidate scenarios from Figma designs and documents, produces synthetic data for edge cases.
Human: reviews generated tests for business logic accuracy; adds exploratory, regulatory, and negative scenarios that AI does not naturally surface. AI generates volume; humans ensure relevance.

4. Execute (Smart Execution):

AI/Automation: Automation runs parallel test execution across environments. AI-driven self-healing features that auto-rewrite locators help productivity but create silent changes to the audit record, which is a real problem in regulated test suites.
Human: monitors execution health; decides when a failure is noise and when it is risk; audits any auto-applied changes to test artifacts; handles complex failures that autonomous agents cannot resolve.

5. Detect (Defect Intelligence):

AI: correlates failures with code changes and surfaces candidate root causes; instrumentation captures logs and reproduction steps; statistical analysis flags tests with histories of flakiness.
Human: validates root-cause accuracy; assesses severity through the lens of customer, financial, and regulatory impact.
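One simple statistic for flagging flaky tests is the flip rate: how often a test's result changes between consecutive runs. The sketch below is one possible heuristic, not the method Apex or Qase uses, and the 0.3 threshold and test names are arbitrary assumptions.

```python
def flip_rate(history: list[bool]) -> float:
    """Fraction of consecutive run pairs where the pass/fail result changed."""
    if len(history) < 2:
        return 0.0
    flips = sum(1 for prev, cur in zip(history, history[1:]) if prev != cur)
    return flips / (len(history) - 1)


def flag_flaky(histories: dict[str, list[bool]], threshold: float = 0.3) -> list[str]:
    """Return test names whose flip rate meets the (assumed) threshold."""
    return sorted(
        name for name, runs in histories.items() if flip_rate(runs) >= threshold
    )


# True = pass, False = fail, oldest run first (hypothetical data)
histories = {
    "test_quote_stream": [True, False, True, True, False, True],
    "test_open_account": [True, True, True, True, True, True],
    "test_ledger_entries": [True, True, True, False, False, False],
}
flaky = flag_flaky(histories)
```

Here `test_ledger_entries` flips only once and then stays red, so it is not flagged: that pattern looks like a regression rather than flakiness, and a person decides which it is.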

6. Report (Quality Intelligence):

AI: assists with trend analysis over test outcomes and release-readiness signalling where historical data exists.
Human: interprets signals; makes the go/no-go decision. Humans make the final call.

AI informs and accelerates, but humans remain accountable for risk, quality, and outcomes.

What "human-in-the-loop" actually means.

The phrase risks becoming a slogan. Operationally, in a regulated environment, "human in the loop" reduces to three concrete commitments:

Least privilege. Agents get only the access they need for the specific task, and no irreversible access at all. An agent that can read a test database has a different blast radius from one that can write to production. Treat agent permissions the way you treat human contractor permissions: provisioned for a scope, revoked when the scope ends.
Verify every non-deterministic output. LLM output is not reproducible by default: the same prompt can produce different results on each run. Every AI-generated artifact (test cases, code, documentation, decisions) is reviewed before it enters the system of record. The badge model in Qase implements this for tests; the principle generalises to every artifact AI touches.
Rollback discipline. Every AI-applied change must be reversible without manual reconstruction. The Replit case illustrates the gap: production data deleted, recovery only possible through manual scramble rather than a clean rollback path. In fintech, every change an AI touches should be one revert or one snapshot restore away from undone.

This is what makes human-in-the-loop auditable.
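The least-privilege commitment can be expressed as a default-deny check. This is a minimal sketch under stated assumptions: the `Action` names, the `AgentGrant` type, and the choice to classify production writes as irreversible are all hypothetical simplifications, not a description of Apex's access controls.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    READ_TEST_DB = "read_test_db"
    WRITE_TEST_DB = "write_test_db"
    WRITE_PROD_DB = "write_prod_db"


# Assumption for this sketch: production writes have no clean rollback path
IRREVERSIBLE = frozenset({Action.WRITE_PROD_DB})


@dataclass(frozen=True)
class AgentGrant:
    task: str                   # the scope the grant was provisioned for
    allowed: frozenset[Action]  # actions explicitly granted for that task


def is_permitted(grant: AgentGrant, action: Action) -> bool:
    """Default deny: the action must be granted and must not be irreversible."""
    return action in grant.allowed and action not in IRREVERSIBLE


# A misconfigured grant that includes production writes
grant = AgentGrant(
    task="generate regression tests",
    allowed=frozenset({Action.READ_TEST_DB, Action.WRITE_PROD_DB}),
)
can_read = is_permitted(grant, Action.READ_TEST_DB)
can_write_prod = is_permitted(grant, Action.WRITE_PROD_DB)
```

The irreversible check runs independently of the grant, so even a misconfigured grant cannot authorise a production write.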

The System of Record

For Apex, Qase is the system of record for this loop: manual cases, automated runs, and requirements coverage live in one place, queryable and exportable, so the audit trail is there before the auditor asks. Not assembled the week before, or reconstructed after a failure, but always already there.

For companies operating with low risk appetite (a platform carrying $229 billion in assets under custody qualifies), an auditable, tamper-evident record of what was tested, what the results were, and what gates were passed before a release ships is not optional: it is the evidence that the quality loop ran correctly.
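One common way to make a record tamper-evident is a hash chain, where each entry's hash covers the previous entry's hash. The sketch below shows the general technique only; it is not a description of how Qase stores records, and the payload fields are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, payload: dict) -> str:
    """Hash the previous entry's hash together with this entry's payload."""
    body = json.dumps(payload, sort_keys=True)
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_record(chain: list[dict], payload: dict) -> None:
    """Append a record whose hash depends on everything recorded before it."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append(
        {"payload": payload, "prev": prev_hash, "hash": _digest(prev_hash, payload)}
    )


def verify(chain: list[dict]) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True


audit_log: list[dict] = []
append_record(audit_log, {"test": "ledger_after_trade", "result": "pass", "approver": "reviewer_a"})
append_record(audit_log, {"gate": "release", "decision": "go", "approver": "reviewer_b"})
intact = verify(audit_log)
```

Editing an earlier entry after the fact, for example changing a recorded result, makes `verify` return `False`. A production system would also need to protect the latest hash itself, which this sketch does not address.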

The Evolving QA Role

With the loop and the system of record in place, QA roles are changing in scope, not shrinking:

Manual test case writing → Prompt-driven test generation with review
Script maintenance → Test suite curation
Test executors → Strategic exploratory testers
QA professionals → Domain champions and risk strategists
Reactive bug reporting → Risk surfacing earlier in the lifecycle
Release gatekeeping with limited data → Release gatekeeping with better data

A significant portion of validation at Apex is still done by business SMEs: non-technical operations teams who understand the domain, the risk, and the real-world workflows better than any tool. As AI absorbs the high-volume work, those people become more central, not less.

The question is no longer "Did we test enough?", but "Are we testing the right things?"

That question requires a human to answer it, one who understands what failure actually costs, for this product, for these customers, under these regulations. AI can inform the answer; the judgment belongs to the person who is accountable for it.