A Foundation Model for Consumer Transaction Data
Project Lead: Yuzhong Chen · yuzchen@visa.com
Visa processes billions of transactions — an ideal foundation for self-supervised pretraining at unprecedented scale. TGPT is trained on up to 200M sequences.
A single pretrained model serves multiple downstream tasks: fraud detection, merchant prediction, trajectory generation, and anomaly classification.
TransactionGPT introduces a purpose-built 3D-Transformer architecture that hierarchically encodes features, metadata, and temporal patterns with a novel Virtual Token Layer.
10–20 metadata fields per transaction: merchant ID, MCC, amount, timestamps. Two modalities with conflicting embedding requirements: metadata (few fields, each needing large high-dimensional embeddings) and features (many fields served by small low-dimensional embeddings).
Sequences of transactions over time forming payment trajectories. Irregular intervals between events, seasonal patterns, and weekday effects captured via five time encoding fields.
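The five time-encoding fields are not enumerated here; the minimal sketch below assumes hour-of-day, day-of-week, day-of-month, month, and log time-since-previous-transaction as illustrative choices.

```python
import numpy as np
import pandas as pd

def encode_time_fields(timestamps: pd.Series) -> pd.DataFrame:
    """Derive five illustrative time-encoding fields for one trajectory.
    The concrete fields used by TGPT are not listed here; these are assumptions."""
    ts = pd.to_datetime(timestamps).sort_values()
    gap = ts.diff().dt.total_seconds().fillna(0.0)
    return pd.DataFrame({
        "hour_of_day": ts.dt.hour,        # intraday patterns
        "day_of_week": ts.dt.dayofweek,   # weekday effects
        "day_of_month": ts.dt.day,        # pay-cycle effects
        "month": ts.dt.month,             # seasonal patterns
        "log_gap_sec": np.log1p(gap),     # irregular inter-event intervals
    })
```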
Stored as database tables with categorical and numerical columns. High-cardinality entities (e.g., 10M+ merchants) and task-specific derived features require specialized handling.
A GPT-style decoder-only Temporal Transformer with local attention (window w) and a tabular MLP. Processes each transaction embedding as a token for auto-regressive generation.
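A minimal sketch of the windowed causal mask implied by "local attention (window w)"; the exact masking scheme used in TGPT is an assumption here.

```python
import torch

def local_causal_mask(seq_len: int, w: int) -> torch.Tensor:
    """Boolean mask: transaction i may attend to transactions i-w+1 .. i.
    Window size w and this masking scheme are illustrative assumptions."""
    i = torch.arange(seq_len).unsqueeze(1)   # query positions
    j = torch.arange(seq_len).unsqueeze(0)   # key positions
    causal = j <= i                          # no peeking at future transactions
    local = (i - j) < w                      # restrict to the last w transactions
    return causal & local                    # True = attention allowed

# Usage note: torch.nn.MultiheadAttention expects True = *disallowed*,
# so pass attn_mask=~local_causal_mask(T, w).
```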
Adds a bidirectional encoder-only Metadata Transformer that replaces the MLP, modeling cross-field interactions across merchant category, amount, and location with compositional embeddings for high-cardinality entities.
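The exact composition scheme is not specified; the sketch below uses a quotient-remainder split, one common way to build compositional embeddings for 10M+ merchant IDs without a 10M-row table.

```python
import torch
import torch.nn as nn

class CompositionalEmbedding(nn.Module):
    """Quotient-remainder compositional embedding for high-cardinality IDs.
    The specific composition used by TGPT is not stated; this is one common
    scheme, shown for illustration only."""
    def __init__(self, num_ids: int, num_buckets: int, dim: int):
        super().__init__()
        self.num_buckets = num_buckets
        self.quotient = nn.Embedding((num_ids + num_buckets - 1) // num_buckets + 1, dim)
        self.remainder = nn.Embedding(num_buckets, dim)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        q = torch.div(ids, self.num_buckets, rounding_mode="floor")
        r = ids % self.num_buckets
        # Two small tables whose combination uniquely identifies each entity.
        return self.quotient(q) + self.remainder(r)

# ~10M merchants with 4,096 buckets -> roughly 2.4K + 4K embedding rows instead of 10M.
```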
A separate Feature Transformer encodes hundreds of downstream features with optimally sized small embeddings. Two variants, MTF (sequential) and FMT (with Virtual Token Layer), enable effective cross-modality fusion.
The key innovation enabling effective modality fusion across the 3D architecture. Inspired by ResNet, VTL uses dual channels: a linear path to preserve gradient flow and a nonlinear path for expressiveness.
Weighted combination preserves gradient flow and ensures stable training through direct information pathways.
MLP with activation rescales to any target dimension, enhancing expressiveness while decoupling bandwidth from embedding size.
Virtual Feature Tokens compress feature embeddings to match the metadata dimension (dM).
Virtual Transaction Tokens convert each transaction into vt tokens for the Temporal Transformer.
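A minimal sketch of the dual-channel idea: a linear projection plus a nonlinear MLP, mixed and reshaped into virtual tokens. The learned mixing weight, GELU activation, and hidden size are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class VirtualTokenLayer(nn.Module):
    """Dual-channel rescaling of an embedding into n_tokens virtual tokens of size d_out.
    Linear path preserves gradient flow (ResNet-style); nonlinear MLP path adds
    expressiveness. Mixing rule and MLP shape are assumptions for illustration."""
    def __init__(self, d_in: int, d_out: int, n_tokens: int, d_hidden: int = 256):
        super().__init__()
        self.n_tokens, self.d_out = n_tokens, d_out
        self.linear = nn.Linear(d_in, n_tokens * d_out)      # direct information pathway
        self.mlp = nn.Sequential(                            # expressive pathway
            nn.Linear(d_in, d_hidden), nn.GELU(),
            nn.Linear(d_hidden, n_tokens * d_out),
        )
        self.alpha = nn.Parameter(torch.tensor(0.5))         # learned mixing weight

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, d_in) -> (batch, n_tokens, d_out)
        out = self.alpha * self.linear(x) + (1 - self.alpha) * self.mlp(x)
        return out.view(x.shape[0], self.n_tokens, self.d_out)

# Virtual Feature Tokens: n_tokens=1, d_out=dM (compress features to the metadata width).
# Virtual Transaction Tokens: n_tokens=vt, d_out = Temporal Transformer width.
```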
| Model | Metric A (Δ% vs. production) | Metric B (Δ% vs. production) | MCC Rec@1 (%) | Merchant Rec@1 (%) |
|---|---|---|---|---|
| TGPT-2D w/ Features | -87.0 | -34.3 | 49.20 | 32.30 |
| TGPT-3D-MTF (Seg) | +14.6 | +8.0 | 48.23 | 32.41 |
| TGPT-3D-MTF (MLP) | +9.5 | +8.1 | 48.17 | 32.29 |
| 3D-FMT FMVTL | +19.2 | +11.2 | 49.94 | 32.08 |
| 3D-FMT FMVTL-2 | +15.5 | +17.9 | 50.12 | 32.70 |
| 3D-FMT FMVTL + LLM | +22.5 | +12.0 | 50.01 | 32.73 |
Inference latency: 0.27 ms per sample for TGPT vs. 84.9 ms for the 7B LLM baseline, measured on a single NVIDIA A100 GPU (80GB)
56M parameters vs 7B — orders of magnitude more efficient while achieving superior accuracy
TGPT-2D outperforms Llama-2 (7B) in MCC Recall@1 despite being 125× smaller
Trained on 200M sequences, tested on 20M sequences, predicting among 500K unique merchants.
TGPT-2D ranks the exact future restaurant in the top 50 out of 500K candidates in 45.6% of test cases.
Inferred purely from transaction patterns — no explicit location input.
Merchant embeddings encode rich geographic and behavioral similarities — city clusters (LA, SF, NYC) are clearly separated in UMAP space, and airport restaurants form distinct clusters.
BatchNorm > LayerNorm for MMTT data with diverse field cardinalities
Linear activations outperform nonlinear ones in the MLPs between Transformer modules
One-hot positional encoding preferred over sinusoidal for non-sequential metadata fields
FP32 preserves scaling gains better than BF16 on larger datasets
3D architecture scales more efficiently than 2D — advantages emerge with more data
LLM-based embedding initialization — MCC descriptions encoded by LLMs warm-start entity embeddings
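A hedged sketch of warm-starting MCC embeddings from text descriptions; the sentence-transformers encoder and the random linear projection below are stand-ins for whatever LLM encoder is used in practice.

```python
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer

def llm_warm_start(mcc_descriptions: list[str], d_model: int) -> nn.Embedding:
    """Warm-start MCC entity embeddings from text descriptions.
    Encoder choice and the projection layer are illustrative assumptions."""
    encoder = SentenceTransformer("all-MiniLM-L6-v2")        # stand-in text encoder
    with torch.no_grad():
        text_vecs = encoder.encode(mcc_descriptions, convert_to_tensor=True)
        proj = nn.Linear(text_vecs.shape[1], d_model, bias=False)
        init = proj(text_vecs.float())                       # map to model width
    table = nn.Embedding(len(mcc_descriptions), d_model)
    table.weight.data.copy_(init)                            # still trainable after init
    return table
```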
Component-level analysis of the Virtual Token Layer on TJGC.
| Variant | Metric A (Δ% vs. production) | Metric B (Δ% vs. production) | Key Takeaway |
|---|---|---|---|
| FMVTL (full) | +19.2 | +11.2 | Best with dual VTL linear + nonlinear |
| FMVTL-nonlin | +6.7 | +10.7 | Removing linear path hurts gradient flow |
| FMVTL-lin-map | +0.2 | +6.2 | Simple linear mapping lacks expressiveness |
| FVTL (no Meta VTL) | -12.9 | -13.0 | Both VTL stages are essential |
| FMVTL + LLM | +22.5 | +12.0 | LLM-derived embeddings add semantic gains |
TransactionGPT is the first foundation model purpose-built for consumer transaction data, deployed within one of the world's largest payment networks. The novel 3D-Transformer with Virtual Token Layer achieves state-of-the-art performance across generation, classification, and representation learning tasks.
Merchant embeddings learned by TGPT encode rich geographic and behavioral similarities — without any explicit location input. The model achieves +22.5% improvement over the production model on a business-critical metric, with 300× faster inference than LLM-based alternatives.
Future directions include improving model performance at even larger data scales, developing superior multi-modal encoders for MMTT data, and exploring joint optimization of MMTT foundation models with LLMs.