Visa Research · arXiv: 2511.08939

TransactionGPT

A Foundation Model for Consumer Transaction Data

Yingtong Dou, Zhimeng Jiang, Tianyi Zhang, Mingzhi Hu, Zhichao Xu, Shubham Jain, Uday Singh Saini,
Xiran Fan, Jiarui Sun, Menghai Pan, Junpeng Wang, Xin Dai, Liang Wang, Chin-Chia Michael Yeh,
Yujie Fan, Vineeth Rakesh, Huiyuan Chen, Mangesh Bendre, Zhongfang Zhuang, Xiaoting Li,
Prince Aboagye, Vivian Lai, Minghua Xu, Hao Yang, Yiwei Cai, Mahashweta Das, Yuzhong Chen

Project Lead: Yuzhong Chen · yuzchen@visa.com

+22.5% over the production model
300× faster than LLMs
56M parameters (vs. 7B)
3D Transformer architecture
Motivation
Why a Foundation Model for Payments?
Foundation models have transformed NLP and vision, but payment transaction data — one of the richest behavioral signals at scale — has remained largely unexplored. TransactionGPT (TGPT) bridges that gap.

Massive Scale

Visa processes billions of transactions — an ideal foundation for self-supervised pretraining at unprecedented scale. TGPT is trained on up to 200M sequences.

Multi-Task Capability

A single pretrained model serves multiple downstream tasks: fraud detection, merchant prediction, trajectory generation, and anomaly classification.

Specialized Architecture

TransactionGPT introduces a purpose-built 3D-Transformer architecture that hierarchically encodes features, metadata, and temporal patterns with a novel Virtual Token Layer.

Data Structure
Multi-Modal-Temporal-Tabular (MMTT) Data
Each transaction is represented as tr = [M ⊕ E ⊕ F], where M denotes metadata, E entities, and F features; sequences of such transactions form rich trajectories over time.

Multi-Modal

10–20 metadata fields per transaction: merchant ID, MCC, amount, timestamps. Two modalities with conflicting embedding requirements: metadata (few fields that need large embeddings) and features (many fields that need small ones).

Temporal

Sequences of transactions over time form payment trajectories. Irregular intervals between events, seasonal patterns, and weekday effects are captured via five time-encoding fields.

Tabular

Stored as database tables with categorical and numerical columns. High-cardinality entities (e.g., 10M+ merchants) and task-specific derived features require specialized handling.
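As a concrete illustration, here is a minimal Python sketch of the tr = [M ⊕ E ⊕ F] layout; the field names, types, and values are hypothetical and not the paper's schema.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Transaction:
    """One MMTT record, tr = [M ⊕ E ⊕ F]; field names here are illustrative."""
    metadata: Dict[str, object]   # M: a handful of fields (MCC, amount, timestamp, ...)
    entities: Dict[str, int]      # E: high-cardinality IDs (merchant, ...)
    features: List[float]         # F: many task-specific derived features

# A payment trajectory is a time-ordered list of such transactions.
trajectory: List[Transaction] = [
    Transaction(
        metadata={"mcc": "5812", "amount": 23.40, "timestamp": "2024-06-01T12:31:00"},
        entities={"merchant_id": 1_048_576},
        features=[0.12, 0.0, 1.0],
    ),
]
```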

Architecture
3D-Transformer Evolution
TGPT evolved through three architectural stages, each adding a Transformer dimension to better model the complexities of MMTT data.
TGPT-1D (Temporal Only)

[Figure: TGPT-1D architecture]

A GPT-style decoder-only Temporal Transformer with local attention (window w) and a tabular MLP. Processes each transaction embedding as a token for auto-regressive generation.
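The local-attention constraint limits each transaction to its w most recent predecessors. A minimal PyTorch sketch of such a banded causal mask follows; the window size, shapes, and the additive-mask conversion are assumptions about implementation details, not the paper's code.

```python
import torch

def local_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask where True marks positions a query may attend to:
    causal (no future) and local (at most `window` past steps)."""
    i = torch.arange(seq_len).unsqueeze(1)   # query positions
    j = torch.arange(seq_len).unsqueeze(0)   # key positions
    return (j <= i) & (i - j < window)

mask = local_causal_mask(seq_len=6, window=3)
# Convert to an additive bias usable with torch.nn.functional.scaled_dot_product_attention.
attn_bias = torch.zeros(6, 6).masked_fill(~mask, float("-inf"))
```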

TGPT-2D (+ Metadata Transformer)

[Figure: TGPT-2D architecture]

Adds a bidirectional encoder-only Metadata Transformer that replaces the MLP, modeling cross-field interactions across merchant category, amount, and location, with compositional embeddings for high-cardinality entities.
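Compositional embeddings for high-cardinality entities are commonly realized with hashing tricks such as quotient-remainder decomposition; the sketch below shows that general idea in PyTorch. The bucket count and the element-wise combination rule are assumptions, not necessarily the paper's exact scheme.

```python
import torch
import torch.nn as nn

class CompositionalEmbedding(nn.Module):
    """Approximate a huge ID embedding table with two small hashed tables
    (quotient-remainder trick); the paper's exact composition may differ."""
    def __init__(self, num_ids: int, dim: int, num_buckets: int = 100_000):
        super().__init__()
        self.num_buckets = num_buckets
        self.quotient = nn.Embedding(num_ids // num_buckets + 1, dim)
        self.remainder = nn.Embedding(num_buckets, dim)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        q = self.quotient(ids // self.num_buckets)
        r = self.remainder(ids % self.num_buckets)
        return q * r  # element-wise composition

merchant_emb = CompositionalEmbedding(num_ids=10_000_000, dim=64)
vecs = merchant_emb(torch.tensor([1_048_576, 9_999_999]))  # shape: (2, 64)
```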

TGPT-3D (+ Feature Transformer)

[Figure: TGPT-3D-FMT architecture]

A separate Feature Transformer encodes hundreds of downstream features with optimally sized small embeddings. Two variants — MTF (sequential) and FMT (with Virtual Token Layer) — enable effective cross-modality fusion.

Virtual Token Layer (VTL)

The key innovation enabling effective modality fusion across the 3D architecture. Inspired by ResNet, VTL uses dual channels: a linear path to preserve gradient flow and a nonlinear path for expressiveness.

[Figure: Virtual Token Layer (VTL) architecture]

Linear Channel

Softmax(W) × Embeddings

Weighted combination preserves gradient flow and ensures stable training through direct information pathways.

Nonlinear Channel

MLP(Embeddings) → Rescale

MLP with activation rescales to any target dimension, enhancing expressiveness while decoupling bandwidth from embedding size.

Step 1: Feature → Transaction

Virtual Feature Tokens compress feature embeddings to match the metadata dimension (d_M)

Step 2: Transaction → Temporal

Virtual Transaction Tokens convert each transaction into v_t tokens for the Temporal Transformer

Results
Experimental Performance
Evaluated across three major tasks — joint generation & classification (TJGC), dining trajectory generation (TRES), and MCC prediction vs. LLMs (TMCC).

Headline Results on Business-Critical Task (TJGC)

+22.5% improvement on Metric A (FMVTL + LLM embeddings)
+17.9% improvement on Metric B (FMVTL-2 variant)
50.12% MCC Recall@1 (best TGPT-3D variant)
32.73% merchant Recall@1 (FMVTL + LLM)

Relative Improvement over Production Baseline (TJGC)

Model               | Metric A (Δ%) | Metric B (Δ%) | MCC Rec@1 | Mrch Rec@1
TGPT-2D w/ Features | -87.0         | -34.3         | 49.20     | 32.30
TGPT-3D-MTF (Seg)   | +14.6         | +8.0          | 48.23     | 32.41
TGPT-3D-MTF (MLP)   | +9.5          | +8.1          | 48.17     | 32.29
3D-FMT FMVTL        | +19.2         | +11.2         | 49.94     | 32.08
3D-FMT FMVTL-2      | +15.5         | +17.9         | 50.12     | 32.70
3D-FMT FMVTL + LLM  | +22.5         | +12.0         | 50.01     | 32.73
TGPT vs. Fine-Tuned LLMs on MCC Prediction

MCC Prediction Recall@1 (%)

Phi-2 (2.7B params): 31.1
Mistral-v0.1 (7B params): 38.8
Llama-2 (7B params): 41.2
TGPT-2D (56M params): 42.6

300× Faster

Inference speed: 0.27ms vs 84.9ms per sample on a single NVIDIA A100 GPU (80GB)

99% Fewer Parameters

56M parameters vs 7B — orders of magnitude more efficient while achieving superior accuracy

+1.4% Accuracy

TGPT-2D outperforms Llama-2 (7B) in MCC Recall@1 despite being 125× smaller

Dining Trajectory Prediction (TRES)
TGPT predicts future dining locations and restaurants without ingesting any location data — purely from transaction patterns.

Restaurant Prediction Performance

Trained on 200M sequences, tested on 20M sequences, predicting among 500K unique merchants.

SASRec Rec@1: 11.9%
TGPT-1D Rec@1: 12.8%
TGPT-2D Rec@1: 14.2%

TGPT-2D ranks the exact future restaurant in the top 50 out of 500K candidates in 45.6% of test cases.
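For reference, Recall@k over a large merchant vocabulary can be computed as in the sketch below; tensor shapes and variable names are assumptions, not the paper's evaluation code.

```python
import torch

def recall_at_k(scores: torch.Tensor, targets: torch.Tensor, k: int) -> float:
    """Fraction of test cases whose true merchant appears in the model's top-k scores.
    scores: (num_cases, num_merchants); targets: (num_cases,) ground-truth merchant indices."""
    topk = scores.topk(k, dim=-1).indices                 # (num_cases, k)
    hits = (topk == targets.unsqueeze(-1)).any(dim=-1)    # (num_cases,)
    return hits.float().mean().item()

# e.g. recall_at_k(model_scores, true_merchant_ids, k=50) yields a top-50 hit rate like the 45.6% above.
```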

Location Prediction Accuracy

Inferred purely from transaction patterns — no explicit location input.

State (Top-1): 84%
City (Top-10): 69%
ZIP (Top-1): 28%

Merchant embeddings encode rich geographic and behavioral similarities — city clusters (LA, SF, NYC) are clearly separated in UMAP space, and airport restaurants form distinct clusters.
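One way to reproduce this kind of inspection is to project the learned merchant embedding table with UMAP; the sketch below uses random data as a stand-in for embeddings extracted from a trained model, and the UMAP hyperparameters are arbitrary assumptions.

```python
import numpy as np
import umap                      # pip install umap-learn
import matplotlib.pyplot as plt

# Stand-in for a (num_merchants, dim) embedding table pulled from a trained model.
merchant_embeddings = np.random.randn(5000, 64).astype(np.float32)

coords = umap.UMAP(n_neighbors=15, min_dist=0.1, metric="cosine").fit_transform(merchant_embeddings)
plt.scatter(coords[:, 0], coords[:, 1], s=2)
plt.title("UMAP projection of merchant embeddings")
plt.show()
```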

Insights
Scalability & Design Insights
Compositional Embedding: -22.6% model size (hashing for high-cardinality entities), +1.9% performance
Local Attention: -24.9% (window-based attention reduces O(|S|²) to O(w²)), +2.8% performance
Weight Tying: -73.8% (reuses the entity embedding table as the classifier, saving 1.3B params), -0.6% performance
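The weight-tying entry above refers to reusing the entity embedding table as the output classifier. A minimal PyTorch sketch with illustrative sizes; the comment's parameter count follows from the paper's ~10M-entity scale, but the exact vocabulary and dimension here are assumptions.

```python
import torch.nn as nn

# Illustrative sizes; at roughly 10M entities x 128 dims, an untied classifier
# would add about 1.3B extra parameters.
vocab_size, dim = 100_000, 128
entity_embedding = nn.Embedding(vocab_size, dim)
classifier = nn.Linear(dim, vocab_size, bias=False)

# Weight tying: the output classifier reuses the entity embedding table
# instead of learning a second (vocab_size x dim) matrix.
classifier.weight = entity_embedding.weight
```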

Training Lessons Learned

BatchNorm > LayerNorm for MMTT data with diverse field cardinalities

Linear activations outperform nonlinear ones in the MLPs between Transformer modules

One-hot positional encoding preferred over sinusoidal for non-sequential metadata fields

FP32 preserves scaling gains better than BF16 on larger datasets

3D architecture scales more efficiently than 2D — advantages emerge with more data

LLM-based embedding initialization — MCC descriptions encoded by LLMs warm-start entity embeddings
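The last lesson can be sketched as encoding MCC description strings with an off-the-shelf text encoder and projecting the results into the entity embedding table; the specific encoder, projection, and MCC texts below are assumptions, not the paper's pipeline.

```python
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer  # encoder choice is an assumption

# Hypothetical MCC code -> description mapping.
mcc_descriptions = {5812: "Eating places and restaurants", 5411: "Grocery stores and supermarkets"}

encoder = SentenceTransformer("all-MiniLM-L6-v2")          # 384-dim text embeddings
text_vecs = torch.tensor(encoder.encode(list(mcc_descriptions.values())))  # (num_mcc, 384)

# Project the text embeddings into the model's entity embedding space and copy them in.
mcc_embedding = nn.Embedding(num_embeddings=10_000, embedding_dim=128)
proj = nn.Linear(text_vecs.shape[1], 128, bias=False)
with torch.no_grad():
    for row, code in enumerate(mcc_descriptions):
        mcc_embedding.weight[code] = proj(text_vecs[row])
```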

VTL Ablation Study

Component-level analysis of the Virtual Token Layer on TJGC.

Variant            | Metric A (Δ%) | Metric B (Δ%) | Key Takeaway
FMVTL (full)       | +19.2         | +11.2         | Best with dual VTL (linear + nonlinear)
FMVTL-nonlin       | +6.7          | +10.7         | Removing the linear path hurts gradient flow
FMVTL-lin-map      | +0.2          | +6.2          | Simple linear mapping lacks expressiveness
FVTL (no Meta VTL) | -12.9         | -13.0         | Both VTL stages are essential
FMVTL + LLM        | +22.5         | +12.0         | LLM-derived embeddings add semantic gains
Conclusion
Key Takeaways & Future Work

TransactionGPT is the first foundation model purpose-built for consumer transaction data, deployed within one of the world's largest payment networks. The novel 3D-Transformer with Virtual Token Layer achieves state-of-the-art performance across generation, classification, and representation learning tasks.

Merchant embeddings learned by TGPT encode rich geographic and behavioral similarities — without any explicit location input. The model achieves +22.5% improvement over the production model on a business-critical metric, with 300× faster inference than LLM-based alternatives.

Future directions include continuing to improve model performance at even larger data scales, developing superior multi-modal encoders for MMTT data, and exploring joint optimization of MMTT foundation models with LLMs.

Read on arXiv: https://arxiv.org/abs/2511.08939