Erika-Russi/accounting_fraud_detect

Building an Accounting Fraud Classification Model

Abstract

Slides

I built a machine learning model to detect accounting fraud. The model achieved its best metrics with a Gradient Boosting Classifier: 0.99 AUC, 0.85 F1 score, 0.91 cross-validation score, and 0.96 accuracy.

Motivation

At both the federal and the international level, we spend a lot of money on compliance, including adherence to anti-fraud regulation such as the Sarbanes-Oxley Act of 2002. Despite all the regulation put in place to disincentivize fraud, companies and executives continue to commit accounting fraud. I wanted to apply machine learning to this persistent problem as a first-pass measure to catch potential fraud and, possibly, to deter it.

Research

First, I needed to find companies that had actually committed fraud. I did this by combing through the Securities and Exchange Commission's (SEC) press releases from the previous five years to gather a list of 25 companies that the SEC had accused of fraud. The associated SEC orders specified which financial statements were deemed fraudulent, when they had been filed, and the companies' CIK (Central Index Key) numbers, a unique identifier assigned to each corporation.

Dataset

Then, I downloaded the Financial Statement Data Sets containing the above-mentioned fraudulent filings as well as some non-fraudulent filings. There were 739 rows associated with fraudulent filings, compared with 5,069 rows associated with non-fraudulent filings.

dataset

Feature Engineering

  1. Once I had my dataset, I added Benford's Law as my first feature. Benford's Law is the "Mathematical theory of leading digits. Specifically, in data sets, the leading digits are distributed in a specific, nonuniform way." If a company had committed fraud, the frequency of leading digits in its financial statements should stray from the theoretical frequencies, and graphing the numbers suggested that this was the case:

Benford's Law

I used the Kullback–Leibler divergence as a single measurement of how much a company's filing actually strayed from Benford's Law.
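The Benford feature can be sketched roughly as follows. This is a minimal illustration and not the repository's actual code: the function names are hypothetical, the divergence is taken in the direction KL(observed || Benford), and all input values are assumed to be finite numbers. Neither the direction nor the handling of zeros is stated in the write-up.

```python
import math
from collections import Counter

# Benford's Law: P(d) = log10(1 + 1/d) for a leading digit d in 1..9.
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit(value):
    """Return the first non-zero digit of a finite number, or None for zero."""
    value = abs(value)
    if value == 0:
        return None
    # Scientific notation puts the leading digit first, e.g. '7.39...e+02'.
    return int(f"{value:.15e}"[0])


def benford_kl_divergence(values):
    """KL(observed || Benford) over the leading digits of `values` (assumed direction)."""
    digits = [d for d in map(leading_digit, values) if d is not None]
    if not digits:
        raise ValueError("no non-zero values to analyze")
    counts = Counter(digits)
    total = len(digits)
    kl = 0.0
    for digit, expected in BENFORD.items():
        observed = counts[digit] / total
        if observed > 0:  # terms with zero observed frequency contribute nothing
            kl += observed * math.log(observed / expected)
    return kl
```

A filing whose leading digits follow Benford's Law closely yields a value near zero, and larger values indicate a stronger departure.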

  2. The next features I wanted to include were the tags used within the financial statements. I bucketed common tags and account names listed on financial statements using repeating patterns. For instance, if the phrase "Accounts Receivable" occurred in a Balance Sheet line item, the account was tagged as such even if it read "Trade accounts receivable, net". I found common tags across three financial statements: Balance Sheet, Income Statement, and Statement of Cash Flows.

Fin Statement
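One way to bucket line items by repeating patterns is a small table of regular expressions. The sketch below is illustrative only: the bucket names and patterns are hypothetical, and only the "Accounts Receivable" example comes from the text above.

```python
import re

# Hypothetical bucket table: (bucket name, pattern matched against the line-item label).
TAG_PATTERNS = [
    ("AccountsReceivable", re.compile(r"accounts?\s+receivable", re.IGNORECASE)),
    ("AccountsPayable", re.compile(r"accounts?\s+payable", re.IGNORECASE)),
    ("Inventory", re.compile(r"\binventor(?:y|ies)\b", re.IGNORECASE)),
]


def bucket_tag(label):
    """Return the first bucket whose pattern occurs in `label`, or None if none match."""
    for bucket, pattern in TAG_PATTERNS:
        if pattern.search(label):
            return bucket
    return None


print(bucket_tag("Trade accounts receivable, net"))
```

Because the first matching pattern wins, more specific patterns would need to appear earlier in the table than general ones.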

  3. Finally, I added financial ratios typically used by forensic accountants to detect fraud. I also included the Total Accruals to Total Assets (TATA), Asset Quality Index (AQI), and Depreciation Index (DEPI) ratios.

Finratios
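For reference, the three named ratios can be computed as sketched below. The formulas follow commonly cited Beneish M-Score definitions, which is an assumption, since the write-up does not state which formulations were used. The `YearFigures` container and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class YearFigures:
    """Hypothetical container for one fiscal year's figures."""
    current_assets: float
    ppe: float  # net property, plant and equipment
    total_assets: float
    depreciation: float
    income_continuing_ops: float
    cash_from_operations: float


def tata(cur):
    """Total Accruals to Total Assets: (income - operating cash flow) / total assets."""
    return (cur.income_continuing_ops - cur.cash_from_operations) / cur.total_assets


def _other_asset_share(year):
    # Share of assets that are neither current assets nor PP&E.
    return 1 - (year.current_assets + year.ppe) / year.total_assets


def aqi(cur, prev):
    """Asset Quality Index: current-year share of other assets over prior-year share."""
    return _other_asset_share(cur) / _other_asset_share(prev)


def _depreciation_rate(year):
    return year.depreciation / (year.depreciation + year.ppe)


def depi(cur, prev):
    """Depreciation Index: prior-year depreciation rate over current-year rate."""
    return _depreciation_rate(prev) / _depreciation_rate(cur)
```

Under these definitions, an AQI or DEPI above 1 indicates, respectively, a growing share of harder-to-verify assets or a slowing depreciation rate relative to the prior year.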

Feature Selection

Once feature engineering was complete, I had added 28 features to my dataset. However, because the financial ratios were potentially highly correlated with related features, such as the financial statement tags, I needed to remove overly correlated variables to avoid overfitting the model. I dropped features with a correlation greater than 0.7, which left 21 features.

correlation
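The correlation filter can be sketched as a greedy pass over the features. This is an assumption-laden illustration: it uses the absolute Pearson correlation and keeps whichever feature of a correlated pair appears first, neither of which is specified in the write-up. It is written in plain Python to stay self-contained, and the function names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(features, threshold=0.7):
    """Return the feature names to keep.

    `features` maps a name to its column of values. Features are visited in
    insertion order, and a feature is dropped when its absolute correlation
    with any already-kept feature exceeds `threshold`.
    """
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


columns = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10.5],  # nearly a multiple of "a"
    "c": [5, 1, 4, 2, 3],
}
print(drop_correlated(columns))
```

The visiting order matters here: placing a preferred feature earlier ensures it is the one retained from a correlated pair.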

Modeling and Results

To find the best model and its parameters, I used TPOT, an automated machine learning tool. After several runs with different numbers of generations, population sizes, and numbers of cross-validation folds, TPOT selected a Gradient Boosting Classifier as the best model.

Compared with the baseline Dummy Classifier model, the Gradient Boosting model performed very well, reaching 0.99 AUC and 0.85 F1 score.

results

auc
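The comparison against a dummy baseline can be sketched with scikit-learn as shown below. The data here is synthetic and only loosely mimics the class imbalance and the 21 features described above. The hyperparameters are scikit-learn defaults, not the ones TPOT found, and the baseline strategy is an assumption, so the scores will not match the reported results.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset: 21 features, roughly 13% positive class.
X, y = make_classification(
    n_samples=1000, n_features=21, n_informative=5,
    weights=[0.87, 0.13], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Baseline that always predicts the majority class (assumed strategy).
dummy = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
# Gradient boosting with default hyperparameters, not the TPOT-tuned ones.
gb = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

dummy_auc = roc_auc_score(y_test, dummy.predict_proba(X_test)[:, 1])
gb_auc = roc_auc_score(y_test, gb.predict_proba(X_test)[:, 1])
dummy_f1 = f1_score(y_test, dummy.predict(X_test), zero_division=0)
gb_f1 = f1_score(y_test, gb.predict(X_test), zero_division=0)

print(f"Dummy:             AUC={dummy_auc:.2f}  F1={dummy_f1:.2f}")
print(f"Gradient Boosting: AUC={gb_auc:.2f}  F1={gb_f1:.2f}")
```

With an imbalanced target, AUC and F1 are more informative than accuracy, since a majority-class baseline already scores high accuracy while detecting no fraud.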

Most Important Features and Graphs

The most important features used to distinguish between fraudulent and non-fraudulent filings were the Length of the Financial Statement and the KL Divergence from Benford's Law.

importance

Here are important features graphed against each other:

top3

And lastly, the three most important financial ratios graphed:

3finratios
