Machine Learning for Fraud Detection That Actually Cuts Chargebacks

9 min read · 14 December 2025

The chargeback rate creeps up quarter over quarter and the operator blames the card networks. The Shopify risk score is set to "block high risk." The Stripe Radar default rules are running. The brand has a fraud strategy in name only, because the fraud strategy is "trust the vendor's defaults," and the defaults are calibrated to the platform's average customer, not to the brand's actual buying patterns.

By the time the chargeback rate clears 0.7 percent and Visa starts threatening monitoring program enrolment, the operator goes looking for a fix. The fix is almost never "tighten the vendor score." The fix is "stop treating the vendor score as the verdict and start treating it as one feature in a model trained on the brand's own labelled history."

The 0.71 Percent Industry Average Is Not A Floor

The benchmark numbers are clear and they should have stopped the conversation about vendor-default fraud detection a decade ago. Signifyd state of fraud reports the ecommerce industry average chargeback rate at 0.71 percent and the cost-to-merchant multiplier at $3.75 for every $1 in chargebacks received. The multiplier is the part most operators miss. A "small" 1 percent chargeback rate against $5M in revenue is not a $50,000 problem. It is a $187,500 problem after the fees, the lost merchandise, the fulfilment cost, the dispute response time, and the issuing-bank penalties get layered in.
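The multiplier arithmetic above is worth putting into a few lines so it can be rerun against your own revenue. This is a minimal sketch: the function name and its parameters are illustrative, and the 3.75 default is simply the multiplier quoted above.

```python
def chargeback_cost(revenue, chargeback_rate, cost_multiplier=3.75):
    """Return (face value of chargebacks, fully loaded cost to the merchant)."""
    face_value = revenue * chargeback_rate
    return face_value, face_value * cost_multiplier


# The example from the text: 1 percent chargebacks against $5M in revenue.
face, loaded = chargeback_cost(5_000_000, 0.01)
print(f"face value: ${face:,.0f}, loaded cost: ${loaded:,.0f}")
# face value: $50,000, loaded cost: $187,500
```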

The card networks are tightening, not loosening. Visa VAMP changes lays out the Visa Acquirer Monitoring Program changes effective from April 2026. The "excessive" threshold drops to 0.9 percent. Merchants over the threshold face higher dispute fees, mandatory monitoring program fees, and acquirer escalation. Signifyd VAMP impact frames why the new rules force a strategy rebuild now, not next year. The brands that stayed under 0.9 percent on vendor defaults are about to discover that the defaults were always carrying invisible fraud they did not detect, and the new threshold will surface every gap.

The default vendor score is the trap. Shopify fraud analysis documents the built-in low-medium-high risk score, and the documentation is honest about what the score is: a generic, cross-merchant signal calibrated to broad ecommerce patterns. It is not calibrated to the brand. The brand has a unique customer base, a unique average order value, a unique geographic distribution, and a unique repeat-buyer pattern. The vendor score does not know any of that. The vendor score sees a transaction and outputs a probability based on what is normal across all Shopify merchants, and "normal across all Shopify merchants" is not the same thing as "normal for this brand."

Stripe Radar docs is similarly transparent. The Radar default ruleset is a calibration baseline, not a finished system. The documentation explicitly invites merchants to tune rules, add custom features, and run their own model on top of the Radar score. Most merchants never do. They install Radar, accept the defaults, and treat the verdict as final. Stripe Radar guide is Stripe's own primer on ML fraud detection, and the primer's central message is the one most operators ignore: the vendor score is a feature, not a verdict.

The other half of the story is friendly fraud. Signifyd friendly fraud reports first-party fraud now accounting for 40 to 60 percent of all disputes. Friendly fraud is structurally different from authorisation fraud. The card was authorised. The transaction was legitimate. The customer received the product and then disputed the charge. The features that predict friendly fraud are nothing like the features that predict authorisation fraud, which is why a vendor score that conflates them under-performs at both. The brand needs the two streams separated, with different feature sets and different intervention paths.

The Chargeback Triage Protocol

The replacement is The Chargeback Triage Protocol. The principle is single-sentence simple: the vendor's risk score is one input feature, not the verdict, and the brand trains a supervised model on its own labelled refund and dispute history with friendly fraud and authorisation fraud kept on separate paths.

The Protocol has three components. The labelling layer tags every refund and dispute in the brand's history with one of three categories: true authorisation fraud, friendly fraud, or operational error. The feature layer combines the vendor score with brand-specific signals (basket value relative to the brand's median, repeat-buyer flag, geo distance between billing and shipping, time-of-day buckets, device fingerprint stability, account age). The model layer scores each transaction against the labelled history and routes to one of three paths: auto-approve, manual review, or auto-block, with the path threshold tuned to the brand's actual chargeback economics.

I have walked through the Protocol with enough physical product brands to know that the labelling step is the one most teams under-budget. The labelling is the entire foundation. Skip it, and the model trains on the vendor's labels, which means the model learns to reproduce the vendor's mistakes. Spend the time on labelling, and the model learns the brand's actual fraud distribution, which is what makes the chargeback rate move.

Stripe Radar AI is a useful technical reference for how Stripe runs the Radar score in roughly 100 milliseconds at transaction time. The engineering deep-dive is worth reading because it illustrates what the brand cannot replicate (the cross-merchant baseline data) and what the brand can do that Stripe cannot (calibrate to its own labelled history). The two complement each other. The vendor score plus the brand's own model is consistently more predictive than either alone, and the Protocol is the discipline that combines them.

Phase 1: Build the Labelled Training Set (Days 1-30)

The first 30 days are about labelling. No model yet. Just a clean training set the model can be built against later.

Week 1: pull every refund and every dispute from the prior 12 months. Get them into a single table with one row per event, capturing transaction ID, customer ID, order value, refund or dispute amount, refund or dispute date, and any free-text notes the agent left. Most brands have this data scattered across Shopify, Stripe, and the support platform. Consolidate it. The consolidation work is real and is the only way to start.

Week 2: define the three labels. True authorisation fraud means the cardholder did not initiate the transaction (stolen card, account takeover, synthetic identity). Friendly fraud means the cardholder did initiate the transaction and is now disputing it (item-not-received claims that contradict tracking data, "unauthorised" claims from a returning buyer, family-member disputes). Operational error means the brand caused the dispute (wrong item shipped, damaged in transit, unfulfilled order). The three categories are mutually exclusive and the boundary cases need a written rule.
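The one-row-per-event table from Week 1 and the three labels from Week 2 can be sketched as a small schema. The class and field names here are assumptions chosen for illustration, not a required layout; the point is that every event carries exactly one of three labels and that the category counts fall out of the table directly.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class Label(Enum):
    AUTHORISATION_FRAUD = "authorisation_fraud"  # cardholder did not initiate
    FRIENDLY_FRAUD = "friendly_fraud"            # cardholder initiated, then disputed
    OPERATIONAL_ERROR = "operational_error"      # the brand caused the dispute


@dataclass
class DisputeEvent:
    transaction_id: str
    customer_id: str
    order_value: float
    event_amount: float          # refund or dispute amount
    event_date: date             # refund or dispute date
    event_type: str              # "refund" or "dispute"
    agent_notes: str = ""        # free-text notes left by the agent
    label: Optional[Label] = None  # None until a reviewer has labelled it


def label_counts(events):
    """Count labelled events per category, ignoring unlabelled rows."""
    return Counter(e.label for e in events if e.label is not None)
```

Keeping `label` as a single enum field, rather than three boolean columns, is one way to enforce that the categories stay mutually exclusive.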

Week 3: hand-label 500 historical events. Two reviewers label independently and reconcile the disagreements. The inter-reviewer agreement target is 90 percent. Below 90 percent, the category definitions are wrong, not the reviewers. Rewrite the definitions until the agreement clears 90 percent on a fresh sample of 100 events.
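The 90 percent target above is a raw agreement rate, which takes a few lines to compute once both reviewers' labels sit side by side. The sketch below also computes Cohen's kappa, which is an addition of mine rather than part of the Protocol as described: it corrects for the agreement two reviewers would reach by chance when one category dominates. Function names are illustrative.

```python
from collections import Counter


def percent_agreement(labels_a, labels_b):
    """Share of events on which the two reviewers chose the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Agreement corrected for chance (1.0 is perfect, 0.0 is chance level)."""
    n = len(labels_a)
    observed = percent_agreement(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    categories = set(labels_a) | set(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both reviewers used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

If raw agreement clears 90 percent but kappa is low, the sample is probably dominated by one category and the definitions have not really been tested on the boundary cases.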

Week 4: scale the labelling. Once the categories hold up on a 500-event sample, label the remaining 12 months. The labelling can be agent-assisted at this stage: the agent suggests a label based on the rules, the reviewer confirms or corrects. Target a labelled training set of at least 2,000 events, each at least 90 days old so that late-arriving disputes have had time to surface. Below 2,000, the training data is too thin to support a brand-specific model and you need to extend the lookback window.

The deliverable at end of Phase 1 is a labelled table with three category counts visible to leadership. The leadership team will be surprised by the friendly-fraud share. The friendly-fraud share is what unlocks Phase 2.

Phase 2: Train, Deploy, And Tune The Model (Months 2-6)

Phase 2 is where the labelled history becomes a working model and the chargeback rate starts moving.

Month 2: feature engineering. The vendor score is one feature. Brand-specific features include basket value relative to median, repeat-buyer flag with cohort, billing-shipping geo distance, time-of-day bucket, device fingerprint match across sessions, email domain risk, and account age. Build the feature pipeline against the labelled training set first. Validate that each feature has predictive power above random against the friendly-fraud and authorisation-fraud labels separately.
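A feature pipeline for a subset of the signals above might start like this. Everything about the order record is an assumption for illustration: the dictionary keys, the six-hour time buckets, and the use of latitude and longitude pairs for billing and shipping are placeholders for whatever your own data carries. The `auc` helper is one simple way to check "predictive power above random": a value near 0.5 means the feature carries no signal for that label.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat = p2 - p1
    dlon = math.radians(lon2 - lon1)
    a = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def build_features(order, median_basket):
    """Turn one order record (hypothetical keys) into model features."""
    return {
        "vendor_score": order["vendor_score"],  # one feature, not the verdict
        "basket_ratio": order["order_value"] / median_basket,
        "repeat_buyer": 1.0 if order["prior_orders"] > 0 else 0.0,
        "geo_km": haversine_km(*order["billing_latlon"], *order["shipping_latlon"]),
        "hour_bucket": order["hour"] // 6,  # four six-hour buckets per day
        "account_age_days": order["account_age_days"],
    }


def auc(scores, labels):
    """Chance that a random positive scores above a random negative (ties 0.5).

    Pairwise and O(n^2), which is fine for a few thousand labelled events.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Run `auc` once per feature against the friendly-fraud labels and once against the authorisation-fraud labels; a feature that scores well on one and near 0.5 on the other is a sign the two streams really do need separate models.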

Month 3: train and validate two models. One model scores authorisation-fraud probability using the authorisation-fraud labels. A second, separate model scores friendly-fraud probability using the friendly-fraud labels. The feature sets diverge: authorisation fraud is heavily about device, geo, and account features; friendly fraud is heavily about customer history, basket composition, and previous dispute patterns. Validate both models on a 20 percent holdout slice of the labelled data. Target precision-at-recall-90 of 0.6 or better on each.
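"Precision-at-recall-90" means: set the threshold low enough to catch at least 90 percent of the true fraud in the holdout slice, then ask what share of the flagged orders were actually fraud. A minimal sketch of that metric, assuming holdout scores and 0/1 labels are already in two lists (the function name is illustrative):

```python
def precision_at_recall(scores, labels, target_recall=0.9):
    """Best precision among score thresholds whose recall meets the target."""
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("holdout slice contains no positive labels")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    best = 0.0
    true_pos = 0
    for flagged, (score, y) in enumerate(ranked, start=1):
        true_pos += y
        # Equal scores cannot be separated by a threshold, so only
        # evaluate once the whole tie group has been counted.
        if flagged < len(ranked) and ranked[flagged][0] == score:
            continue
        if true_pos / total_pos >= target_recall:
            best = max(best, true_pos / flagged)
    return best
```

Compute it separately for the authorisation-fraud model and the friendly-fraud model, each against its own labels, and compare both results to the 0.6 target.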

Month 4: deploy the models behind a triage layer. Every transaction now gets two scores plus the vendor score. The triage layer routes by combined risk: low combined risk auto-approves, medium goes to manual review, high blocks. Calibrate the manual-review threshold against your manual-review capacity. Most brands can absorb 1 to 3 percent of orders into manual review without blowing the queue. Above that, the cost of review exceeds the cost of fraud and the threshold needs to rise.
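The triage layer can be sketched in a few functions. Two things here are my assumptions, not part of the Protocol as stated: combining the three scores by taking the maximum, and deriving the review threshold from a sample of recent combined scores. A weighted blend or a small meta-model would be equally valid ways to combine.

```python
import math


def combined_risk(auth_p, friendly_p, vendor_p):
    # Assumption: the riskiest of the three signals drives the route.
    return max(auth_p, friendly_p, vendor_p)


def route(risk, review_at, block_at):
    """Map a combined risk score to one of the three paths."""
    if risk >= block_at:
        return "block"
    if risk >= review_at:
        return "manual_review"
    return "auto_approve"


def review_threshold_for_capacity(risks, block_at, capacity=0.02):
    """Lowest review threshold that keeps the review queue within capacity.

    `capacity` is the share of all orders the team can review by hand.
    Tied scores at the boundary can push the queue slightly over.
    """
    reviewable = sorted((r for r in risks if r < block_at), reverse=True)
    max_reviews = math.floor(capacity * len(risks) + 1e-9)
    if max_reviews == 0:
        return block_at  # no review capacity: approve or block only
    if max_reviews >= len(reviewable):
        return 0.0       # capacity covers every unblocked order
    return reviewable[max_reviews - 1]
```

Recompute the review threshold whenever order volume or review staffing changes, since the same score cut-off produces a different queue length at a different volume.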

Month 5: tune the operational-error stream. Operational-error labels are not a fraud problem. They are an ops problem, and they belong in the ticket-driven prevention loop, not the fraud model. The Protocol pulls operational-error events out of the fraud pipeline entirely and routes them to merchandising, ops, and logistics for prevention.

Month 6: re-audit the model on a fresh quarter of labelled data. Fraud patterns drift. The model trained on Q1 will not be calibrated on Q4 if the brand has changed its product mix, expanded geographically, or run a campaign that drew a new customer profile. The Protocol re-trains quarterly on the rolling 12-month labelled set. Skip the re-train and the model decays silently, which puts the brand right back in the position the vendor defaults left it in.
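The quarterly re-train on a rolling 12-month window is easy to let slip, so it helps to make both the window and the cadence explicit in code. A minimal sketch, assuming each labelled event is a dictionary with an `event_date` key and treating a quarter as 91 days (both are illustrative choices):

```python
from datetime import date, timedelta


def training_window(events, as_of, lookback_days=365):
    """Labelled events from the rolling window that ends at `as_of`."""
    start = as_of - timedelta(days=lookback_days)
    return [e for e in events if start <= e["event_date"] < as_of]


def retrain_due(last_trained, today, cadence_days=91):
    """True once roughly a quarter has passed since the last re-train."""
    return (today - last_trained).days >= cadence_days
```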

The team is small. One ops or risk lead owns the labelled table. One analyst or contractor builds the feature pipeline and the two models. The CX lead owns the manual-review queue. Three named roles. Two models. One labelled table. That is the entire build.

The North Star: Predicted Versus Actual Dispute Rate At 30 Days

The standard fraud KPI is absolute chargeback rate, and the absolute rate is the wrong metric to drive against because it lags by 30 to 90 days and tells you nothing about whether the model is improving.

The Chargeback Triage Protocol replaces absolute chargeback rate with predicted-versus-actual dispute rate at 30 days. Defined cleanly, it is the difference between the mean dispute probability the model predicted for a 30-day cohort of transactions and the share of that cohort that has actually been disputed 30 days later. A well-calibrated model lands the predicted-actual delta inside a tight band. A miscalibrated model drifts, and the drift is the early signal that the model needs retraining or that the fraud distribution has shifted.
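As a sketch, the metric is one subtraction once the cohort's predicted probabilities and its observed dispute flags are lined up. The width of the "tight band" is not specified above, so the 0.2 percentage point default below is purely a placeholder to tune against your own history.

```python
def calibration_delta(predicted_probs, disputed_flags):
    """Mean predicted dispute probability minus the observed dispute rate.

    Both lists describe the same cohort of transactions, in the same order;
    `disputed_flags` holds 1 for a disputed order and 0 otherwise.
    """
    if len(predicted_probs) != len(disputed_flags) or not predicted_probs:
        raise ValueError("need two non-empty lists of equal length")
    n = len(predicted_probs)
    return sum(predicted_probs) / n - sum(disputed_flags) / n


def outside_band(delta, band=0.002):
    """True when the delta has drifted past the tolerated band (assumed width)."""
    return abs(delta) > band
```

A positive delta means the model is over-predicting disputes and probably blocking or reviewing too much; a negative delta means disputes are arriving that the model did not see coming.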

The brands that run the Protocol for two quarters typically see their chargeback rate drop below 0.4 percent, well under the new VAMP threshold. They also see the friendly-fraud share of disputes shift, because the friendly-fraud model is now blocking a meaningful slice of the disputes the vendor defaults would have approved. The composition of the remaining chargebacks tells the operator where to invest next: more on identity verification if authorisation fraud dominates, more on delivery-confirmation evidence if friendly fraud dominates, more on product detail page (PDP) rewrites if operational errors dominate the events the Protocol routes out to ops.

You do not need a new fraud vendor. You need a labelled history, two models, and a triage layer that treats the vendor score as one input among many. The vendor score is doing its job. The Protocol gives you the brand-specific layer the vendor cannot build, and the chargeback rate finally moves because the model is finally trained on the brand's own data instead of an industry average.

The brands I have watched run the Chargeback Triage Protocol against the new VAMP rules share a pattern: the labelling phase is harder than they expected, the model phase is easier than they expected, and the re-train discipline is the hardest part to keep up over four quarters. The brands that keep the discipline up keep their chargeback rate below the threshold. The brands that drop the discipline drift back into vendor-default territory inside two quarters and their chargeback rate climbs to meet the industry average all over again.
