Visa Research · arXiv: 2511.08939

TransactionGPT

A Foundation Model for Consumer Transaction Data

Yingtong Dou, Zhimeng Jiang, Tianyi Zhang, Mingzhi Hu, Zhichao Xu, Shubham Jain, Uday Singh Saini,
Xiran Fan, Jiarui Sun, Menghai Pan, Junpeng Wang, Xin Dai, Liang Wang, Chin-Chia Michael Yeh,
Yujie Fan, Vineeth Rakesh, Huiyuan Chen, Mangesh Bendre, Zhongfang Zhuang, Xiaoting Li,
Prince Aboagye, Vivian Lai, Minghua Xu, Hao Yang, Yiwei Cai, Mahashweta Das, Yuzhong Chen

Project Lead: Yuzhong Chen · yuzchen@visa.com

+22.5% over the production model
300× faster than LLMs
56M parameters (vs. 7B)
3D Transformer architecture
Motivation
Why a Foundation Model for Payments?
Foundation models have transformed NLP and vision, but payment transaction data — one of the richest behavioral signals at scale — has remained largely unexplored. TransactionGPT (TGPT) bridges that gap.

Massive Scale

Visa processes billions of transactions — an ideal foundation for self-supervised pretraining at unprecedented scale. TGPT is trained on up to 200M sequences.

Multi-Task Capability

A single pretrained model serves multiple downstream tasks: fraud detection, merchant prediction, trajectory generation, and anomaly classification.

Specialized Architecture

TransactionGPT introduces a purpose-built 3D-Transformer architecture that hierarchically encodes features, metadata, and temporal patterns with a novel Virtual Token Layer.

Data Structure
Multi-Modal-Temporal-Tabular (MMTT) Data
Each transaction is represented as tr = [M ⊕ E ⊕ F], where M denotes metadata, E entities, and F features; sequences of such transactions form rich trajectories over time.

Multi-Modal

10–20 metadata fields per transaction: merchant ID, MCC, amount, timestamps. Two modalities with conflicting embedding requirements: metadata (few fields that need large embeddings) and features (many fields that need small ones).

Temporal

Sequences of transactions over time form payment trajectories. Irregular intervals between events, seasonal patterns, and weekday effects are captured via five time-encoding fields.

Tabular

Stored as database tables with categorical and numerical columns. High-cardinality entities (e.g., 10M+ merchants) and task-specific derived features require specialized handling.
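As a concrete illustration, here is a minimal Python sketch of the tr = [M ⊕ E ⊕ F] layout; the field names, types, and values are hypothetical and not the paper's schema.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Transaction:
    """One MMTT record, tr = [M ⊕ E ⊕ F]; field names here are illustrative."""
    metadata: Dict[str, object]   # M: a handful of fields (MCC, amount, timestamp, ...)
    entities: Dict[str, int]      # E: high-cardinality IDs (merchant, ...)
    features: List[float]         # F: many task-specific derived features

# A payment trajectory is a time-ordered list of such transactions.
trajectory: List[Transaction] = [
    Transaction(
        metadata={"mcc": "5812", "amount": 23.40, "timestamp": "2024-06-01T12:31:00"},
        entities={"merchant_id": 1_048_576},
        features=[0.12, 0.0, 1.0],
    ),
]
```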

Architecture
3D-Transformer Evolution
TGPT evolved through three architectural stages, each adding a Transformer dimension to better model the complexities of MMTT data.
TGPT-1D (Temporal Only)

[Figure: TGPT-1D architecture]

A GPT-style decoder-only Temporal Transformer with local attention (window w) and a tabular MLP. Processes each transaction embedding as a token for auto-regressive generation.
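The local-attention constraint limits each transaction to its w most recent predecessors. A minimal PyTorch sketch of such a banded causal mask follows; the window size, shapes, and the additive-mask conversion are assumptions about implementation details, not the paper's code.

```python
import torch

def local_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask where True marks positions a query may attend to:
    causal (no future) and local (at most `window` past steps)."""
    i = torch.arange(seq_len).unsqueeze(1)   # query positions
    j = torch.arange(seq_len).unsqueeze(0)   # key positions
    return (j <= i) & (i - j < window)

mask = local_causal_mask(seq_len=6, window=3)
# Convert to an additive bias usable with torch.nn.functional.scaled_dot_product_attention.
attn_bias = torch.zeros(6, 6).masked_fill(~mask, float("-inf"))
```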

TGPT-2D (+ Metadata Transformer)

[Figure: TGPT-2D architecture]

Adds a bidirectional encoder-only Metadata Transformer that replaces the MLP, modeling cross-field interactions across merchant category, amount, and location, with compositional embeddings for high-cardinality entities.
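Compositional embeddings for high-cardinality entities are commonly realized with hashing tricks such as quotient-remainder decomposition; the sketch below shows that general idea in PyTorch. The bucket count and the element-wise combination rule are assumptions, not necessarily the paper's exact scheme.

```python
import torch
import torch.nn as nn

class CompositionalEmbedding(nn.Module):
    """Approximate a huge ID embedding table with two small hashed tables
    (quotient-remainder trick); the paper's exact composition may differ."""
    def __init__(self, num_ids: int, dim: int, num_buckets: int = 100_000):
        super().__init__()
        self.num_buckets = num_buckets
        self.quotient = nn.Embedding(num_ids // num_buckets + 1, dim)
        self.remainder = nn.Embedding(num_buckets, dim)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        q = self.quotient(ids // self.num_buckets)
        r = self.remainder(ids % self.num_buckets)
        return q * r  # element-wise composition

merchant_emb = CompositionalEmbedding(num_ids=10_000_000, dim=64)
vecs = merchant_emb(torch.tensor([1_048_576, 9_999_999]))  # shape: (2, 64)
```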

TGPT-3D (+ Feature Transformer)

[Figure: TGPT-3D-FMT architecture]

A separate Feature Transformer encodes hundreds of downstream features with optimally sized small embeddings. Two variants — MTF (sequential) and FMT (with Virtual Token Layer) — enable effective cross-modality fusion.

Virtual Token Layer (VTL)

The key innovation enabling effective modality fusion across the 3D architecture. Inspired by ResNet, VTL uses dual channels: a linear path to preserve gradient flow and a nonlinear path for expressiveness.

[Figure: Virtual Token Layer (VTL) architecture]

Linear Channel

Softmax(W) × Embeddings

Weighted combination preserves gradient flow and ensures stable training through direct information pathways.

Nonlinear Channel

MLP(Embeddings) → Rescale

MLP with activation rescales to any target dimension, enhancing expressiveness while decoupling bandwidth from embedding size.

Step 1: Feature → Transaction

Virtual Feature Tokens compress feature embeddings to match the metadata dimension (d_M)

Step 2: Transaction → Temporal

Virtual Transaction Tokens convert each transaction into v_t tokens for the Temporal Transformer

Results
Experimental Performance
Evaluated across three major tasks — joint generation & classification (TJGC), dining trajectory generation (TRES), and MCC prediction vs. LLMs (TMCC).

Headline Results on Business-Critical Task (TJGC)

+22.5% improvement on Metric A (FMVTL + LLM embeddings)
+17.9% improvement on Metric B (FMVTL-2 variant)
50.12% MCC Recall@1 (best TGPT-3D variant)
32.73% merchant Recall@1 (FMVTL + LLM)

Relative Improvement over Production Baseline (TJGC)

Model               | Metric A (Δ%) | Metric B (Δ%) | MCC Rec@1 | Mrch Rec@1
TGPT-2D w/ Features | -87.0         | -34.3         | 49.20     | 32.30
TGPT-3D-MTF (Seg)   | +14.6         | +8.0          | 48.23     | 32.41
TGPT-3D-MTF (MLP)   | +9.5          | +8.1          | 48.17     | 32.29
3D-FMT FMVTL        | +19.2         | +11.2         | 49.94     | 32.08
3D-FMT FMVTL-2      | +15.5         | +17.9         | 50.12     | 32.70
3D-FMT FMVTL + LLM  | +22.5         | +12.0         | 50.01     | 32.73
TGPT vs. Fine-Tuned LLMs on MCC Prediction

MCC Prediction Recall@1 (%)

Phi-2 (2.7B params): 31.1
Mistral-v0.1 (7B params): 38.8
Llama-2 (7B params): 41.2
TGPT-2D (56M params): 42.6

300× Faster

Inference speed: 0.27ms vs 84.9ms per sample on a single NVIDIA A100 GPU (80GB)

99% Fewer Parameters

56M parameters vs 7B — orders of magnitude more efficient while achieving superior accuracy

+1.4% Accuracy

TGPT-2D outperforms Llama-2 (7B) in MCC Recall@1 despite being 125× smaller

Dining Trajectory Prediction (TRES)
TGPT predicts future dining locations and restaurants without ingesting any location data — purely from transaction patterns.

Restaurant Prediction Performance

Trained on 200M sequences, tested on 20M sequences, predicting among 500K unique merchants.

SASRec Rec@1: 11.9%
TGPT-1D Rec@1: 12.8%
TGPT-2D Rec@1: 14.2%

TGPT-2D ranks the exact future restaurant in the top 50 out of 500K candidates in 45.6% of test cases.
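For reference, Recall@k over a large merchant vocabulary can be computed as in the sketch below; tensor shapes and variable names are assumptions, not the paper's evaluation code.

```python
import torch

def recall_at_k(scores: torch.Tensor, targets: torch.Tensor, k: int) -> float:
    """Fraction of test cases whose true merchant appears in the model's top-k scores.
    scores: (num_cases, num_merchants); targets: (num_cases,) ground-truth merchant indices."""
    topk = scores.topk(k, dim=-1).indices                 # (num_cases, k)
    hits = (topk == targets.unsqueeze(-1)).any(dim=-1)    # (num_cases,)
    return hits.float().mean().item()

# e.g. recall_at_k(model_scores, true_merchant_ids, k=50) yields a top-50 hit rate like the 45.6% above.
```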

Location Prediction Accuracy

Inferred purely from transaction patterns — no explicit location input.

State (Top-1): 84%
City (Top-10): 69%
ZIP (Top-1): 28%

Merchant embeddings encode rich geographic and behavioral similarities — city clusters (LA, SF, NYC) are clearly separated in UMAP space, and airport restaurants form distinct clusters.
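One way to reproduce this kind of inspection is to project the learned merchant embedding table with UMAP; the sketch below uses random data as a stand-in for embeddings extracted from a trained model, and the UMAP hyperparameters are arbitrary assumptions.

```python
import numpy as np
import umap                      # pip install umap-learn
import matplotlib.pyplot as plt

# Stand-in for a (num_merchants, dim) embedding table pulled from a trained model.
merchant_embeddings = np.random.randn(5000, 64).astype(np.float32)

coords = umap.UMAP(n_neighbors=15, min_dist=0.1, metric="cosine").fit_transform(merchant_embeddings)
plt.scatter(coords[:, 0], coords[:, 1], s=2)
plt.title("UMAP projection of merchant embeddings")
plt.show()
```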

Insights
Scalability & Design Insights
Compositional Embedding: -22.6% model size (hashing for high-cardinality entities), +1.9% performance
Local Attention: -24.9% (window-based attention reduces O(|S|²) to O(w²)), +2.8% performance
Weight Tying: -73.8% (reuses the entity embedding table as the classifier, saving 1.3B params), -0.6% performance
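The weight-tying entry above refers to reusing the entity embedding table as the output classifier. A minimal PyTorch sketch with illustrative sizes; the comment's parameter count follows from the paper's ~10M-entity scale, but the exact vocabulary and dimension here are assumptions.

```python
import torch.nn as nn

# Illustrative sizes; at roughly 10M entities x 128 dims, an untied classifier
# would add about 1.3B extra parameters.
vocab_size, dim = 100_000, 128
entity_embedding = nn.Embedding(vocab_size, dim)
classifier = nn.Linear(dim, vocab_size, bias=False)

# Weight tying: the output classifier reuses the entity embedding table
# instead of learning a second (vocab_size x dim) matrix.
classifier.weight = entity_embedding.weight
```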

Training Lessons Learned

BatchNorm > LayerNorm for MMTT data with diverse field cardinalities

Linear activations outperform nonlinear ones in the MLPs between Transformer modules

One-hot positional encoding preferred over sinusoidal for non-sequential metadata fields

FP32 preserves scaling gains better than BF16 on larger datasets

3D architecture scales more efficiently than 2D — advantages emerge with more data

LLM-based embedding initialization — MCC descriptions encoded by LLMs warm-start entity embeddings
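The last lesson can be sketched as encoding MCC description strings with an off-the-shelf text encoder and projecting the results into the entity embedding table; the specific encoder, projection, and MCC texts below are assumptions, not the paper's pipeline.

```python
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer  # encoder choice is an assumption

# Hypothetical MCC code -> description mapping.
mcc_descriptions = {5812: "Eating places and restaurants", 5411: "Grocery stores and supermarkets"}

encoder = SentenceTransformer("all-MiniLM-L6-v2")          # 384-dim text embeddings
text_vecs = torch.tensor(encoder.encode(list(mcc_descriptions.values())))  # (num_mcc, 384)

# Project the text embeddings into the model's entity embedding space and copy them in.
mcc_embedding = nn.Embedding(num_embeddings=10_000, embedding_dim=128)
proj = nn.Linear(text_vecs.shape[1], 128, bias=False)
with torch.no_grad():
    for row, code in enumerate(mcc_descriptions):
        mcc_embedding.weight[code] = proj(text_vecs[row])
```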

VTL Ablation Study

Component-level analysis of the Virtual Token Layer on TJGC.

Variant            | Metric A (Δ%) | Metric B (Δ%) | Key Takeaway
FMVTL (full)       | +19.2         | +11.2         | Best with dual VTL (linear + nonlinear)
FMVTL-nonlin       | +6.7          | +10.7         | Removing the linear path hurts gradient flow
FMVTL-lin-map      | +0.2          | +6.2          | Simple linear mapping lacks expressiveness
FVTL (no Meta VTL) | -12.9         | -13.0         | Both VTL stages are essential
FMVTL + LLM        | +22.5         | +12.0         | LLM-derived embeddings add semantic gains
Conclusion
Key Takeaways & Future Work

TransactionGPT is the first foundation model purpose-built for consumer transaction data, deployed within one of the world's largest payment networks. The novel 3D-Transformer with Virtual Token Layer achieves state-of-the-art performance across generation, classification, and representation learning tasks.

Merchant embeddings learned by TGPT encode rich geographic and behavioral similarities — without any explicit location input. The model achieves +22.5% improvement over the production model on a business-critical metric, with 300× faster inference than LLM-based alternatives.

Future directions include continuing to improve model performance at even larger data scales, developing superior multi-modal encoders for MMTT data, and exploring joint optimization of MMTT foundation models with LLMs.

Read on arXiv: https://arxiv.org/abs/2511.08939