This document addresses concerns regarding potential data leakage and "unrealistic" performance metrics (0.999 AUC) on the PaySim dataset.
We analyzed feature correlations and trained a Random Forest classifier on a subset of the data to extract feature importances. To ensure ongoing quality, we have implemented automated tests (tests/test_leakage.py) that verify no single feature acts as a perfect predictor (leak).
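A minimal sketch of what such a single-feature leakage check can look like. The function names, the rank-based AUC formulation, and the 0.999 threshold are illustrative assumptions, not the actual contents of tests/test_leakage.py:

```python
import numpy as np

def single_feature_auc(feature: np.ndarray, target: np.ndarray) -> float:
    """Rank-based (Mann-Whitney) AUC of a single feature vs. a binary target."""
    order = feature.argsort(kind="mergesort")
    ranks = np.empty(len(feature))
    ranks[order] = np.arange(1, len(feature) + 1)
    pos = target == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    auc = (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
    # A feature that separates classes in either direction is a leak candidate.
    return max(auc, 1 - auc)

def assert_no_perfect_predictor(X, y, names, threshold=0.999):
    """Fail if any single column alone is a near-perfect predictor."""
    for i, name in enumerate(names):
        auc = single_feature_auc(X[:, i], y)
        assert auc < threshold, f"{name} looks like a leak (AUC={auc:.4f})"
```

In a pytest suite, `assert_no_perfect_predictor` would run over the full feature matrix on every CI build, so a leaked column fails loudly rather than silently inflating metrics.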
The top predictors for fraud in this model are:
| Feature | Importance | Description |
|---|---|---|
| newBalanceOrig | 20.02% | The balance remaining in the origin account after the transaction. |
| amount | 16.61% | The transaction amount. |
| oldBalanceDest | 15.82% | Initial balance of the recipient. |
| newBalanceDest | 13.65% | Final balance of the recipient. |
| errorBalanceOrig | 13.11% | Discrepancy in origin balance change (new - old + amount). |
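The importances above can be extracted as sketched below. This assumes scikit-learn and the PaySim column names; the hyperparameters (`n_estimators`, seed) are illustrative, not the values actually used:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

FEATURES = ["newBalanceOrig", "amount", "oldBalanceDest",
            "errorBalanceOrig", "newBalanceDest"]

def rank_importances(df: pd.DataFrame, seed: int = 0) -> pd.Series:
    """Fit a Random Forest and return impurity-based importances, sorted."""
    clf = RandomForestClassifier(n_estimators=100, random_state=seed, n_jobs=-1)
    clf.fit(df[FEATURES], df["isFraud"])
    return (pd.Series(clf.feature_importances_, index=FEATURES)
              .sort_values(ascending=False))
```

Impurity-based importances sum to 1.0 across features, which is why the table reads as percentages.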
Concern was raised that errorBalanceOrig might be a "leaked" feature.
- Correlation with Target: -0.0166 (Very low linear correlation).
- Distribution:
  - Fraud: 110 cases have 0.0 error, but many have massive errors (e.g., > 1,000,000).
  - Legit: The vast majority (22,000+) have 0.0 error.
- Conclusion: This feature is not a perfect predictor (leak). It is a strong signal because legitimate transactions strictly follow mathematical rules, whereas fraudulent transactions in this dataset often do not (or show specific patterns of "account emptying").
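The derivation and correlation check above can be reproduced with a short sketch, assuming the PaySim column names (oldBalanceOrig, newBalanceOrig, amount, isFraud); the helper names are hypothetical:

```python
import pandas as pd

def add_error_balance_orig(df: pd.DataFrame) -> pd.DataFrame:
    """Append the origin-balance discrepancy: zero when the books balance exactly."""
    out = df.copy()
    # For an outgoing transaction, new = old - amount, so this is 0 for clean rows.
    out["errorBalanceOrig"] = (out["newBalanceOrig"]
                               - out["oldBalanceOrig"]
                               + out["amount"])
    return out

def target_correlation(df: pd.DataFrame) -> float:
    """Pearson correlation of the discrepancy with the fraud label."""
    return df["errorBalanceOrig"].corr(df["isFraud"])
```

A near-zero linear correlation alongside high model importance is consistent with the conclusion above: the signal is in the shape of the distribution (exact zero vs. large nonzero), not in a linear relationship.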
The PaySim dataset is synthetic. The fraud patterns are generated using specific rules (e.g., "transfer entire balance").
- High AUC (0.999): This is expected on this specific dataset because the fraud patterns are deterministic and low-noise.
- Real-world Applicability: In a production environment with noisier human behavior, AUC would likely be lower (~0.95). The current high score validates that the model successfully learned the underlying generation rules of the dataset, not that it is "leaking" future data.
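To make the headline AUC reproducible, a held-out evaluation can be sketched as follows. This assumes scikit-learn; the split ratio and model settings are illustrative, not the project's actual configuration:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def holdout_auc(X, y, seed: int = 0) -> float:
    """ROC AUC on a stratified hold-out split (no training rows reused)."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed)
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    clf.fit(X_tr, y_tr)
    # Score on probabilities, not hard labels, so AUC reflects ranking quality.
    return roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
```

Stratifying the split matters here: fraud is rare in PaySim, and an unstratified split can leave too few positives in the test set for a stable AUC estimate.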