I built a machine learning model to detect accounting fraud. The best-performing model was a Gradient Boosting Classifier, which achieved 0.99 AUC, a 0.85 F1 score, a 0.91 mean cross-validation score, and 0.96 accuracy.
At both the federal and international levels, we spend heavily on compliance, including adherence to anti-fraud regulation such as the Sarbanes-Oxley Act of 2002. Despite all the regulation put in place to disincentivize fraud, companies and executives continue to commit accounting fraud. I wanted to bring machine learning to this persistent problem, both as a first-pass measure to catch potential fraud and, possibly, as a way to deter it.
First, I needed to find companies that had actually committed fraud. I did this by combing through the Securities and Exchange Commission's (SEC) press releases from the last five years to gather a list of 25 companies the SEC had accused of fraud. The accompanying SEC orders specified which financial statements were deemed fraudulent, when those statements had been filed, and each company's CIK number (a unique identifier assigned to each corporation).
Then I downloaded the SEC's Financial Statement Data Sets containing the above-mentioned fraudulent filings as well as a set of non-fraudulent filings. The number of rows associated with fraudulent filings was 739, compared with 5,069 associated with non-fraudulent filings, a roughly 1:7 class imbalance.
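For anyone reproducing this step, here is a minimal sketch of how the labeling might look, assuming the standard layout of the quarterly data sets (tab-delimited sub.txt and num.txt files); the CIK values and the quarter shown are placeholders, not the real list:

```python
import pandas as pd

# Hypothetical CIKs -- the real list came from the SEC orders described above.
FRAUD_CIKS = {1234567, 7654321}

# Each quarterly data set ships as a ZIP of tab-delimited files:
# sub.txt has one row per filing; num.txt has one row per reported number.
subs = pd.read_csv("2019q4/sub.txt", sep="\t", low_memory=False)
nums = pd.read_csv("2019q4/num.txt", sep="\t", low_memory=False)

# Join the numeric facts to their filing metadata via the accession number
# (adsh), then label each row by whether the filer is on the fraud list.
df = nums.merge(subs[["adsh", "cik", "name", "form", "period"]], on="adsh")
df["fraudulent"] = df["cik"].isin(FRAUD_CIKS).astype(int)
print(df["fraudulent"].value_counts())
```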
- Once I had my dataset, I added Benford's Law as my first feature. Benford's Law is the "mathematical theory of leading digits. Specifically, in data sets, the leading digits are distributed in a specific, nonuniform way." If a company had committed fraud, the frequency of leading digits in its financial statements should stray from those theoretical proportions, and when I graphed the observed frequencies, that appeared to be the case:
I used the Kullback–Leibler (KL) divergence as a single measurement of how far a company's filing actually strayed from Benford's Law (a code sketch follows this list).
- The next features I wanted to include were tags used within the financial statements. I bucketed common tags/account names by matching repeating patterns: for instance, if the phrase "Accounts Receivable" occurred on the balance sheet, the account should be bucketed under that tag even when the line read "Trade accounts receivable, net". I found common tags across three financial statements: the balance sheet, the income statement, and the statement of cash flows (the bucketing is sketched after this list).
- Finally, I added financial ratios typically used by forensic accountants to detect fraud. Among these I included the Total Accruals to Total Assets (TATA), Asset Quality Index (AQI), and Depreciation Index (DEPI) ratios (also sketched below).
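Here is a minimal sketch of the Benford's Law feature, assuming one row per reported value (as in the num.txt layout above); the small smoothing constant just keeps the divergence finite when a digit never appears:

```python
import numpy as np
from scipy.stats import entropy

# Benford's Law: P(leading digit = d) = log10(1 + 1/d) for d in 1..9.
BENFORD = np.log10(1 + 1 / np.arange(1, 10))

def leading_digit(x: float) -> int:
    """First significant digit of a nonzero number."""
    x = abs(x)
    while x >= 10:
        x /= 10
    while x < 1:
        x *= 10
    return int(x)

def benford_kl(values) -> float:
    """KL divergence of a filing's leading-digit frequencies from Benford's Law."""
    digits = [leading_digit(v) for v in values if np.isfinite(v) and v != 0]
    counts = np.bincount(digits, minlength=10)[1:].astype(float)
    observed = counts + 1e-9  # smooth so entropy() never sees an empty bin
    return entropy(observed, BENFORD)  # sum(p * log(p / q)); normalized internally

# One score per filing, e.g.: df.groupby("adsh")["value"].apply(benford_kl)
```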
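And a sketch of the tag bucketing, with a few hypothetical patterns standing in for the full rule set I built:

```python
import re

# Each pattern maps a family of statement line items to one canonical bucket,
# so "Trade accounts receivable, net" lands with "Accounts Receivable".
TAG_BUCKETS = [
    (re.compile(r"accounts?\s+receivable", re.I), "AccountsReceivable"),
    (re.compile(r"accounts?\s+payable", re.I), "AccountsPayable"),
    (re.compile(r"cost\s+of\s+(goods\s+sold|revenue|sales)", re.I), "CostOfGoodsSold"),
    (re.compile(r"property,?\s+plant\s+and\s+equipment", re.I), "PPE"),
]

def bucket_tag(label: str) -> str:
    """Return the canonical bucket for a raw account name, or 'Other'."""
    for pattern, bucket in TAG_BUCKETS:
        if pattern.search(label):
            return bucket
    return "Other"

print(bucket_tag("Trade accounts receivable, net"))  # -> AccountsReceivable
```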
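Lastly, the three ratios, written out using the standard (simplified) Beneish M-Score definitions; the variable names are mine, and mapping them onto the data set's tags is left out:

```python
def tata(income_cont_ops, cash_from_ops, total_assets):
    """Total Accruals to Total Assets: accruals not backed by cash flow."""
    return (income_cont_ops - cash_from_ops) / total_assets

def aqi(curr_assets, ppe, total_assets,
        curr_assets_prev, ppe_prev, total_assets_prev):
    """Asset Quality Index: year-over-year change in the share of 'soft' assets."""
    soft_now = 1 - (curr_assets + ppe) / total_assets
    soft_prev = 1 - (curr_assets_prev + ppe_prev) / total_assets_prev
    return soft_now / soft_prev

def depi(depreciation, ppe, depreciation_prev, ppe_prev):
    """Depreciation Index: prior-year depreciation rate over current-year rate."""
    rate_now = depreciation / (depreciation + ppe)
    rate_prev = depreciation_prev / (depreciation_prev + ppe_prev)
    return rate_prev / rate_now
```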
Once completed, I had added 28 features to my dataset. However, because financial ratios can be highly correlated with related features, such as the account statement tags, I needed to remove highly correlated variables to avoid an overfit model. I dropped features with an absolute pairwise correlation above 0.7, leaving 21 features.
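A sketch of that pruning step, assuming the engineered features live in a DataFrame called features:

```python
import numpy as np

corr = features.corr().abs()
# Keep only the strictly upper triangle so each pair is inspected once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.7).any()]
features = features.drop(columns=to_drop)  # 28 features -> 21 in my case
```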
To find the best model and its parameters, I used TPOT, an automated machine learning library. After running several iterations with different numbers of generations, population sizes, and k-fold cross-validation settings, TPOT selected a Gradient Boosting Classifier as the best model.
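For reference, here is a minimal sketch of a single run using the classic TPOT API; the generation, population, and fold counts shown are placeholders, since I varied them across runs:

```python
from tpot import TPOTClassifier
from sklearn.model_selection import train_test_split

# Stratified split so the rare fraud class appears in both halves.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

tpot = TPOTClassifier(generations=10, population_size=50, cv=5,
                      scoring="f1", random_state=42, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export("best_pipeline.py")  # the winning pipeline -- a Gradient Boosting Classifier here
```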
Compared to the baseline Dummy Classifier, the Gradient Boosting model was very successful.
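The baseline matters here because of the class imbalance: a classifier that never flags fraud already looks accurate. A quick sketch of that comparison:

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Majority-class baseline: always predicts "non-fraudulent".
dummy = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
pred = dummy.predict(X_test)
print(accuracy_score(y_test, pred))  # high, since ~87% of rows are non-fraudulent
print(f1_score(y_test, pred))        # 0.0 -- the baseline never predicts fraud
```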
The most important features for distinguishing fraudulent from non-fraudulent filings were the length of the financial statement and the KL divergence from Benford's Law.
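Those rankings come straight from the fitted model's impurity-based importances; assuming gbc is the exported Gradient Boosting Classifier and X_train is a DataFrame:

```python
import pandas as pd

importances = pd.Series(gbc.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))
```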
Here are the most important features plotted against each other:
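A sketch of how such a plot can be reproduced; the column names for the two top features and the label are assumptions:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
for label, name in [(0, "non-fraudulent"), (1, "fraudulent")]:
    subset = features[features["fraudulent"] == label]
    ax.scatter(subset["statement_length"], subset["benford_kl"],
               label=name, alpha=0.5)
ax.set_xlabel("Length of financial statement")
ax.set_ylabel("KL divergence from Benford's Law")
ax.legend()
plt.show()
```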
And lastly, the three most important financial ratios graphed: