I built a machine learning model to detect accounting fraud. The best-performing model was a Gradient Boosting Classifier, which achieved 0.99 AUC, a 0.85 F1 score, a 0.91 mean cross-validation score, and 0.96 accuracy.
At both the federal and international levels, we spend heavily on compliance, including adherence to anti-fraud regulation such as the Sarbanes-Oxley Act of 2002. Despite all the regulation put in place to disincentivize fraud, companies and executives continue to commit accounting fraud. I wanted to bring machine learning to this persistent problem, both as a first-pass measure to catch potential fraud and, possibly, as a way to deter it.
First, I needed to find companies that had actually committed fraud. I did this by combing through the Securities and Exchange Commission's (SEC) press releases from the last five years to gather a list of 25 companies the SEC had accused of fraud. The accompanying SEC orders specified which financial statements were deemed fraudulent, when those statements had been filed, and each company's CIK number (a unique identifier assigned to each corporation).
Then I downloaded the SEC's Financial Statement Data Sets containing the above-mentioned fraudulent filings as well as a set of non-fraudulent filings. The number of rows associated with fraudulent filings was 739, compared with 5,069 associated with non-fraudulent filings, a roughly 1:7 class imbalance.
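For anyone reproducing this step, here is a minimal sketch of how the labeling might look, assuming the standard layout of the quarterly data sets (tab-delimited sub.txt and num.txt files); the CIK values and the quarter shown are placeholders, not the real list:

```python
import pandas as pd

# Hypothetical CIKs -- the real list came from the SEC orders described above.
FRAUD_CIKS = {1234567, 7654321}

# Each quarterly data set ships as a ZIP of tab-delimited files:
# sub.txt has one row per filing; num.txt has one row per reported number.
subs = pd.read_csv("2019q4/sub.txt", sep="\t", low_memory=False)
nums = pd.read_csv("2019q4/num.txt", sep="\t", low_memory=False)

# Join the numeric facts to their filing metadata via the accession number
# (adsh), then label each row by whether the filer is on the fraud list.
df = nums.merge(subs[["adsh", "cik", "name", "form", "period"]], on="adsh")
df["fraudulent"] = df["cik"].isin(FRAUD_CIKS).astype(int)
print(df["fraudulent"].value_counts())
```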
- Once I had my dataset, I added Benford's Law as my first feature. Benford's Law is the "mathematical theory of leading digits. Specifically, in data sets, the leading digits are distributed in a specific, nonuniform way." If a company had committed fraud, the frequency of leading digits in its financial statements should stray from those theoretical proportions, and when I graphed the observed frequencies, that appeared to be the case:
I used the Kullback–Leibler (KL) divergence as a single measurement of how far a company's filing actually strayed from Benford's Law (a code sketch follows this list).
- The next features I wanted to include were tags used within the financial statements. I bucketed common tags/account names by matching repeating patterns: for instance, if the phrase "Accounts Receivable" occurred on the balance sheet, the account should be bucketed under that tag even when the line read "Trade accounts receivable, net". I found common tags across three financial statements: the balance sheet, the income statement, and the statement of cash flows (the bucketing is sketched after this list).
- Finally, I added financial ratios typically used by forensic accountants to detect fraud. Among these I included the Total Accruals to Total Assets (TATA), Asset Quality Index (AQI), and Depreciation Index (DEPI) ratios (also sketched below).
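Here is a minimal sketch of the Benford's Law feature, assuming one row per reported value (as in the num.txt layout above); the small smoothing constant just keeps the divergence finite when a digit never appears:

```python
import numpy as np
from scipy.stats import entropy

# Benford's Law: P(leading digit = d) = log10(1 + 1/d) for d in 1..9.
BENFORD = np.log10(1 + 1 / np.arange(1, 10))

def leading_digit(x: float) -> int:
    """First significant digit of a nonzero number."""
    x = abs(x)
    while x >= 10:
        x /= 10
    while x < 1:
        x *= 10
    return int(x)

def benford_kl(values) -> float:
    """KL divergence of a filing's leading-digit frequencies from Benford's Law."""
    digits = [leading_digit(v) for v in values if np.isfinite(v) and v != 0]
    counts = np.bincount(digits, minlength=10)[1:].astype(float)
    observed = counts + 1e-9  # smooth so entropy() never sees an empty bin
    return entropy(observed, BENFORD)  # sum(p * log(p / q)); normalized internally

# One score per filing, e.g.: df.groupby("adsh")["value"].apply(benford_kl)
```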
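And a sketch of the tag bucketing, with a few hypothetical patterns standing in for the full rule set I built:

```python
import re

# Each pattern maps a family of statement line items to one canonical bucket,
# so "Trade accounts receivable, net" lands with "Accounts Receivable".
TAG_BUCKETS = [
    (re.compile(r"accounts?\s+receivable", re.I), "AccountsReceivable"),
    (re.compile(r"accounts?\s+payable", re.I), "AccountsPayable"),
    (re.compile(r"cost\s+of\s+(goods\s+sold|revenue|sales)", re.I), "CostOfGoodsSold"),
    (re.compile(r"property,?\s+plant\s+and\s+equipment", re.I), "PPE"),
]

def bucket_tag(label: str) -> str:
    """Return the canonical bucket for a raw account name, or 'Other'."""
    for pattern, bucket in TAG_BUCKETS:
        if pattern.search(label):
            return bucket
    return "Other"

print(bucket_tag("Trade accounts receivable, net"))  # -> AccountsReceivable
```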
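Lastly, the three ratios, written out using the standard (simplified) Beneish M-Score definitions; the variable names are mine, and mapping them onto the data set's tags is left out:

```python
def tata(income_cont_ops, cash_from_ops, total_assets):
    """Total Accruals to Total Assets: accruals not backed by cash flow."""
    return (income_cont_ops - cash_from_ops) / total_assets

def aqi(curr_assets, ppe, total_assets,
        curr_assets_prev, ppe_prev, total_assets_prev):
    """Asset Quality Index: year-over-year change in the share of 'soft' assets."""
    soft_now = 1 - (curr_assets + ppe) / total_assets
    soft_prev = 1 - (curr_assets_prev + ppe_prev) / total_assets_prev
    return soft_now / soft_prev

def depi(depreciation, ppe, depreciation_prev, ppe_prev):
    """Depreciation Index: prior-year depreciation rate over current-year rate."""
    rate_now = depreciation / (depreciation + ppe)
    rate_prev = depreciation_prev / (depreciation_prev + ppe_prev)
    return rate_prev / rate_now
```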
Once completed, I had added 28 features to my dataset. However, because financial ratios can be highly correlated with related features, such as the account statement tags, I needed to remove highly correlated variables to avoid an overfit model. I dropped features with an absolute pairwise correlation above 0.7, leaving 21 features.
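A sketch of that pruning step, assuming the engineered features live in a DataFrame called features:

```python
import numpy as np

corr = features.corr().abs()
# Keep only the strictly upper triangle so each pair is inspected once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.7).any()]
features = features.drop(columns=to_drop)  # 28 features -> 21 in my case
```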
To find the best model and its parameters, I used TPOT, an automated machine learning library. After running several iterations with different numbers of generations, population sizes, and k-fold cross-validation settings, TPOT selected a Gradient Boosting Classifier as the best model.
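For reference, here is a minimal sketch of a single run using the classic TPOT API; the generation, population, and fold counts shown are placeholders, since I varied them across runs:

```python
from tpot import TPOTClassifier
from sklearn.model_selection import train_test_split

# Stratified split so the rare fraud class appears in both halves.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

tpot = TPOTClassifier(generations=10, population_size=50, cv=5,
                      scoring="f1", random_state=42, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export("best_pipeline.py")  # the winning pipeline -- a Gradient Boosting Classifier here
```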
Compared to the baseline Dummy Classifier, the Gradient Boosting model was very successful.
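The baseline matters here because of the class imbalance: a classifier that never flags fraud already looks accurate. A quick sketch of that comparison:

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Majority-class baseline: always predicts "non-fraudulent".
dummy = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
pred = dummy.predict(X_test)
print(accuracy_score(y_test, pred))  # high, since ~87% of rows are non-fraudulent
print(f1_score(y_test, pred))        # 0.0 -- the baseline never predicts fraud
```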
The most important features for distinguishing fraudulent from non-fraudulent filings were the length of the financial statement and the KL divergence from Benford's Law.
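Those rankings come straight from the fitted model's impurity-based importances; assuming gbc is the exported Gradient Boosting Classifier and X_train is a DataFrame:

```python
import pandas as pd

importances = pd.Series(gbc.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))
```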
Here are the most important features plotted against each other:
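A sketch of how such a plot can be reproduced; the column names for the two top features and the label are assumptions:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
for label, name in [(0, "non-fraudulent"), (1, "fraudulent")]:
    subset = features[features["fraudulent"] == label]
    ax.scatter(subset["statement_length"], subset["benford_kl"],
               label=name, alpha=0.5)
ax.set_xlabel("Length of financial statement")
ax.set_ylabel("KL divergence from Benford's Law")
ax.legend()
plt.show()
```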
And lastly, the three most important financial ratios graphed: