mrlizhaozhi/bank-account-fraud-detection

Bank Account Fraud Detection

Date: 9 February 2026 - 15 February 2026
Author: Li Zhaozhi (李兆智)


Background: Fraud is a financial crime risk that threatens both customers and banks. It spans multiple typologies, including authorised and unauthorised digital fraud, payment fraud, credit card fraud, application fraud, and scams.

Financial institutions deploy data science capabilities to analyse fraud patterns and to detect and mitigate fraud risk. Banking data includes transaction data, customer profile data, credit data, behavioural data, and device metadata.

Application Fraud: Application fraud in the form of bank account opening refers to the deliberate submission of false, forged, or stolen information during the account opening process with the intent to:

  • Obtain financial products or services under false pretences
  • Facilitate money laundering or other financial crimes
  • Circumvent regulatory controls and due diligence requirements
  • Create vehicles for future illicit activities

(Reference: Association of Certified Anti-Money Laundering Specialists (ACAMS))

Objective: Discern fraud patterns and detect bank account fraud by applying statistical analysis, data visualisation, hypothesis testing, and machine learning techniques.

Challenges: Financial institutions face the following challenges in fraud detection:

  • False positives: Genuine customers or transactions are flagged as fraudulent, which increases investigation expenses.
  • False negatives: Fraudulent customers or transactions go undetected, leading to financial loss, reputational damage, and sometimes regulatory fines.
  • Class imbalance: Fraud data is typically imbalanced, with fraud cases far rarer than genuine ones, so it requires processing before modelling.
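
The three challenges above can be made concrete with a few lines of Python. The sketch below is illustrative only and is not taken from the project notebooks: the labels are toy values, and the function names (`confusion_counts`, `false_positive_rate`, `recall`, `balanced_class_weights`) are hypothetical helpers. The weighting rule is the common "balanced" heuristic, n_samples / (n_classes * class_count), which is also what scikit-learn applies for `class_weight="balanced"`.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 = fraud, 0 = genuine."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def false_positive_rate(y_true, y_pred):
    """Share of genuine cases that were wrongly flagged as fraud."""
    _, fp, tn, _ = confusion_counts(y_true, y_pred)
    return fp / (fp + tn) if (fp + tn) else 0.0


def recall(y_true, y_pred):
    """Share of fraud cases that were detected (1 - false negative rate)."""
    tp, _, _, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Toy labels (assumption): 8 genuine applications and 2 fraudulent ones.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0]

fpr = false_positive_rate(y_true, y_pred)
rec = recall(y_true, y_pred)
weights = balanced_class_weights(y_true)
print(fpr, rec, weights)
```

With weights like these, each fraud case counts more heavily during training, which is one simple way to process imbalanced data before modelling; resampling is another.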

Methodologies: Model governance prioritises explainability in highly regulated industries such as financial services. This solution uses Logistic Regression and XGBoost to achieve the above objective.
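
To show why Logistic Regression is regarded as explainable, here is a minimal sketch, assuming a single toy feature and pure-Python gradient descent on a class-weighted log-loss. It is not the project's model code: the function `fit_weighted_logistic`, the feature values, and the class weights are all hypothetical, and XGBoost is not covered here.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real-valued score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_weighted_logistic(xs, ys, class_weight, lr=0.1, epochs=2000):
    """Fit p(y=1 | x) = sigmoid(w * x + b) by gradient descent on weighted log-loss."""
    w, b = 0.0, 0.0
    total_weight = sum(class_weight[y] for y in ys)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(w * x + b) - y        # derivative of log-loss w.r.t. the score
            grad_w += class_weight[y] * error * x
            grad_b += class_weight[y] * error
        w -= lr * grad_w / total_weight
        b -= lr * grad_b / total_weight
    return w, b


# Toy data (assumption): one risk signal in [0, 1]; fraud occurs at the high end.
xs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
ys = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
class_weight = {0: 0.625, 1: 2.5}  # up-weights the rare fraud class

w, b = fit_weighted_logistic(xs, ys, class_weight)
odds_ratio = math.exp(w)  # multiplicative change in fraud odds per unit increase in x
print(w, b, odds_ratio)
```

The fitted coefficient can be read directly as an odds ratio, which is the kind of interpretation that model governance reviews typically ask for.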

Table of Contents:

| Document | Purpose | Link | Last Update | Status | Remarks |
|---|---|---|---|---|---|
| BFA-Fraud-Detection-Models | Complete Python code for model development. | Link | 22 Feb 2026 | In Progress | Feature engineering completed. |
| BAF-Fraud-Detection-Pipeline | Production-ready, automated data cleaning and feature engineering workflows. | Link | 22 Feb 2026 | In Progress | |
| BFA-Fraud-Detection-Documentation | Model Risk Management and Model Governance documentation to meet regulatory reporting requirements. | Link | 22 Feb 2026 | In Progress | |
| BFA-Fraud-Detection-Dashboard | Interactive dashboards and visual stories to present analytical findings from the dataset. | Link | 9 Feb 2026 | Not Started | |
| BFA-Fraud_Detection-Presentation | Present analytical findings and the modelling process. | Link | 9 Feb 2026 | Not Started | |

Tags: Fraud Detection, Descriptive Statistics, Data Visualisation, Chi-Square Hypothesis Testing, Logistic Regression, XGBoost


Data Sets: Feedzai is a fraud detection platform that uses machine learning. Feedzai Research released the anonymised Bank Account Fraud (BAF) data sets at NeurIPS 2022. The data sets are available as downloadable CSV files. This analysis uses the base data set.

Data definitions:

| Num | Variable | Definition | Data Type | Unit | Example |
|---|---|---|---|---|---|
| 1 | fraud_bool | Fraud label (1: fraud, 0: genuine). | Numerical | N/A | 1 |
| 2 | income | Annual income in quantiles. | Numerical | N/A | 0.3 |
| 3 | name_email_similarity | Metric of similarity between email and applicant’s name. Higher values represent higher similarity. Ranges between [0, 1]. | Numerical | N/A | 1 |
| 4 | prev_address_months_count | Number of months in previous registered address of the applicant, i.e. the applicant’s previous residence, if applicable. Ranges between [−1, 380] months (-1 is a missing value). | Numerical | Month | 2 |
| 5 | current_address_months_count | Months in currently registered address of the applicant. Ranges between [−1, 406] months (-1 is a missing value). | Numerical | Month | 100 |
| 6 | customer_age | Applicant’s age in bins per decade (e.g., 20-29 is represented as 20). | Numerical | N/A | 30 |
| 7 | days_since_request | Number of days passed since the application was made. Ranges between [0, 78] days. | Numerical | Day | 12 |
| 8 | intended_balcon_amount | Initial transferred amount for application. Ranges between [−1, 108]. | Numerical | USD | 100 |
| 9 | payment_type | Credit payment plan type. 5 possible (anonymised) values. | Categorical | N/A | AD |
| 10 | zip_count_4w | Number of applications within same zip code in last 4 weeks. Ranges between [1, 5767]. | Numerical | App | 21 |
| 11 | velocity_6h | Velocity of total applications made in last 6 hours, i.e., average number of applications per hour in the last 6 hours. Ranges between [−211, 24763]. | Numerical | App | 12 |
| 12 | velocity_24h | Velocity of total applications made in last 24 hours, i.e., average number of applications per hour in the last 24 hours. Ranges between [1329, 9527]. | Numerical | App | 1400 |
| 13 | velocity_4w | Velocity of total applications made in last 4 weeks, i.e., average number of applications per hour in the last 4 weeks. Ranges between [2779, 7043]. | Numerical | App | 2779 |
| 14 | bank_branch_count_8w | Number of total applications in the selected bank branch in last 8 weeks. Ranges between [0, 2521]. | Numerical | App | 12 |
| 15 | date_of_birth_distinct_emails_4w | Number of emails for applicants with same date of birth in last 4 weeks. Ranges between [0, 42]. | Numerical | Emails | 12 |
| 16 | employment_status | Employment status of the applicant. 7 possible (anonymised) values. | Categorical | N/A | CA |
| 17 | credit_risk_score | Internal score of application risk. Ranges between [−176, 387]. | Numerical | N/A | -100 |
| 18 | email_is_free | Domain of application email (either free or paid). | Numerical | N/A | 1 |
| 19 | housing_status | Current residential status for applicant. 7 possible (anonymised) values. | Categorical | N/A | BC |
| 20 | phone_home_valid | Validity of provided home phone. | Numerical | N/A | 1 |
| 21 | phone_mobile_valid | Validity of provided mobile phone. | Numerical | N/A | 1 |
| 22 | bank_months_count | Age of previous account (if held) in months. Ranges between [−1, 31] months (-1 is a missing value). | Numerical | Month | 1 |
| 23 | has_other_cards | Whether the applicant has other cards from the same banking company. | Numerical | N/A | 1 |
| 24 | proposed_credit_limit | Applicant’s proposed credit limit. Ranges between [200, 2000]. | Numerical | USD | 200 |
| 25 | foreign_request | Whether the origin country of the request differs from the bank’s country. | Numerical | N/A | |
| 26 | source | Online source of application. Either browser (INTERNET) or mobile app (APP). | Categorical | N/A | INTERNET |
| 27 | session_length_in_minutes | Length of user session in banking website in minutes. Ranges between [−1, 107] minutes. | Numerical | Minutes | 12 |
| 28 | device_os | Operating system of the device that made the request. Possible values are: Windows, Macintosh, Linux, X11, or other. | Categorical | N/A | Windows |
| 29 | keep_alive_session | User option on session logout. | Numerical | N/A | 1 |
| 30 | device_distinct_emails_8w | Number of distinct emails in banking website from the used device in last 8 weeks. Ranges between [0, 3]. | Numerical | Emails | 2 |
| 31 | device_fraud_count | Number of fraudulent applications with used device. Ranges between [0, 1]. | Numerical | N/A | 0 |
| 32 | month | Month in which the application was made. Ranges between [0, 7]. | Numerical | Month | 2 |
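
Several definitions above state that -1 encodes a missing value, so those columns should not be treated as ordinary numbers. The sketch below shows one possible way to handle the sentinel. It is an assumption rather than the project's pipeline code: it works on a plain dictionary per application, and the `mark_missing` helper and the `_missing` flag suffix are hypothetical names.

```python
# Columns whose definitions above explicitly say that -1 marks a missing value.
SENTINEL_COLUMNS = (
    "prev_address_months_count",
    "current_address_months_count",
    "bank_months_count",
)


def mark_missing(row):
    """Return a copy of `row` with -1 sentinels replaced by None plus a 0/1 flag column."""
    cleaned = dict(row)
    for column in SENTINEL_COLUMNS:
        if column in cleaned:
            is_missing = cleaned[column] == -1
            cleaned[column] = None if is_missing else cleaned[column]
            cleaned[column + "_missing"] = int(is_missing)
    return cleaned


# Toy application record (assumption), using example values from the table.
row = {
    "fraud_bool": 0,
    "prev_address_months_count": -1,
    "current_address_months_count": 100,
    "bank_months_count": 1,
}
cleaned = mark_missing(row)
print(cleaned)
```

Keeping a separate flag preserves the information that a value was absent, which may itself be predictive, for example when an applicant reports no previous address.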