9.2.3 Sample Project

One type of data analysis project is building a fraud detection model. The thought and effort that would go into this type of project would also be required for most statistical modeling projects.

Data

What data are required for a fraud detection model? To answer this, we must first know what fraud is. Fraud refers to a “... deception deliberately practiced in order to secure unfair or unlawful gain” (Dictionary.com). There are different types of fraud committed within the credit industry, and the type of fraud we want to model determines the data needed. This sample project covers typical credit card application fraud, defined here as someone pretending to be someone else in order to obtain a credit card illegally and with no intent to pay. We will not cover other types of fraud. Thus the first step is to get a deeper understanding of what we wish to study. Do not assume it is obvious. Many people may feel they already know the meaning of fraud, but a high-level definition of fraud is not enough for a data analysis project.

How do we identify fraud? How do we know the nonpayment is a result of fraud and not a result of a bad risk? Credit risk is “The risk that a borrower will be unable to make payment of interest or principal in a timely manner” (Dictionary.com). Sometimes an individual will obtain credit legally but will be unable to make even the first payment. This is known as first payment default (FPD): the individual does not make the first payment but did not commit fraud. How do we distinguish between fraud and first payment default? Honestly, they are often difficult to distinguish.

Fraud and risk are two very different concerns for an issuer of credit, and the data required to predict them differ as well. For fraud, we desire identification data and credit information. Even the age of the data desired differs between investigating fraud and investigating risk: for fraud, we desire the most recent data; for risk, we desire prior credit information.

A good statistical model starts with good data. Again, the old saying, garbage in garbage out (G.I.G.O.), is pertinent. What is my point with this statement? The message is that if the data collected are not good then the model is not expected to be good. Why is this of concern in fraud detection models? We need to distinguish between fraud and first payment default (FPD). If we combine fraud with FPD, it is like trying to create a single model to determine both risk and fraud. This can be done, but it will create a model that does not work well on either risk or fraud. It is better to create two separate models instead. We also need our independent data to be accurate. If the information in our database differs from the application, it is important that the difference is a possible indication of fraud and not an error within the database.

As stated earlier, it is important to differentiate between FPD and fraud when building a fraud detection model. In truth, some FPD accounts are probably unidentified fraud. If we treated FPD as non-fraud when building the model, some frauds would be labeled as fraud and others as non-fraud. For this reason we remove the FPD data from the model-building process.
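The removal of FPD accounts can be sketched in a few lines. This is only an illustration: the record layout, the field names and the status labels ("fraud", "fpd", "good") are assumptions made for the example, not taken from any real system.

```python
# Hypothetical account records; field names and status labels are
# illustrative assumptions.
accounts = [
    {"id": 1, "status": "fraud"},
    {"id": 2, "status": "fpd"},
    {"id": 3, "status": "good"},
    {"id": 4, "status": "fpd"},
    {"id": 5, "status": "good"},
]


def build_modeling_set(records):
    """Drop FPD accounts and attach a binary fraud label to the rest."""
    modeling = []
    for rec in records:
        if rec["status"] == "fpd":
            # Ambiguous outcome: the account may be unidentified fraud,
            # so it is labeled neither fraud nor non-fraud.
            continue
        modeling.append({"id": rec["id"], "fraud": int(rec["status"] == "fraud")})
    return modeling


modeling_set = build_modeling_set(accounts)
print(modeling_set)
```

Only the accounts with a clear outcome remain, so the dependent variable is not contaminated by accounts that might belong to either group.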

Some of the data needed for fraud detection are different from those pertaining to risk. Important data for fraud detection tend to be identification information. Identification information is collected in the application for credit and then compared to a large database containing people’s identification information. A difference between the identification information on the application and that in the database is a sign of potential fraud. When building a risk model, identification information is not needed. With a risk model, it is believed that the person is who he says he is, but the concern is that he will not repay the money borrowed.
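As a minimal sketch of this comparison, the following turns differences between an application and a reference record into 0/1 mismatch indicators that could serve as independent variables. The field names, the sample values and the normalization rule are assumptions for illustration only.

```python
def normalize(value):
    """Lowercase and collapse whitespace so trivial formatting
    differences are not counted as mismatches."""
    return " ".join(str(value).lower().split())


def mismatch_flags(application, reference, fields):
    """Return a 0/1 indicator per field: 1 if the application
    disagrees with the reference database record."""
    flags = {}
    for field in fields:
        app_value = normalize(application.get(field, ""))
        ref_value = normalize(reference.get(field, ""))
        flags[field + "_mismatch"] = int(app_value != ref_value)
    return flags


# Hypothetical records for one applicant.
application = {"name": "Pat Smith", "zip": "60614", "phone": "555-0100"}
reference = {"name": "pat  smith", "zip": "60610", "phone": "555-0100"}

flags = mismatch_flags(application, reference, ["name", "zip", "phone"])
print(flags)
```

Here only the zip code disagrees; the name differs in case and spacing only, which the normalization step deliberately ignores.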

Know your data.

  1. What is the percentage of success in the dependent (fraud) variable?
  2. What are the values of the independent data?
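Both questions can be answered with simple summaries before any modeling starts. The sketch below uses a small made-up dataset; the column names are assumptions for illustration.

```python
from collections import Counter


def fraud_rate(rows, target="fraud"):
    """Share of records where the dependent variable equals 1."""
    return sum(row[target] for row in rows) / len(rows)


def value_counts(rows, column):
    """Frequency of each distinct value, including missing (None)."""
    return Counter(row[column] for row in rows)


# Hypothetical modeling data.
rows = [
    {"fraud": 1, "zip_mismatch": 1},
    {"fraud": 0, "zip_mismatch": 0},
    {"fraud": 0, "zip_mismatch": 0},
    {"fraud": 0, "zip_mismatch": 1},
    {"fraud": 0, "zip_mismatch": None},
]

rate = fraud_rate(rows)
counts = value_counts(rows, "zip_mismatch")
print(f"fraud rate: {rate:.1%}")
print(dict(counts))
```

Counting values this way also exposes missing or unexpected codes in the independent data, which is where data quality problems tend to show up first.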

Technique

Honestly, there is more than one way to build a model to detect fraud. The standard at the company where I worked was logistic regression. Logistic regression is used when there is a binary response variable, such as fraud. Binary response means that there are two possible outcomes; in our case, fraud and not fraud. Other possible techniques include general linear models, decision trees and neural networks. Some investigation showed little, if any, advantage to the other techniques. In business, time is money. In data analysis projects we often do not have enough time to try many techniques and compare results.
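To make the binary response concrete: a fitted logistic regression converts a weighted sum of the independent variables into a probability between 0 and 1 through the logistic function. The sketch below shows only that scoring step; the intercept and coefficients are invented for illustration and are not estimates from any real model.

```python
import math


def fraud_probability(intercept, coefficients, features):
    """Logistic model: p = 1 / (1 + exp(-(b0 + sum(b_i * x_i))))."""
    linear = intercept + sum(
        coefficients[name] * features[name] for name in coefficients
    )
    return 1.0 / (1.0 + math.exp(-linear))


# Hypothetical fitted values, for illustration only.
intercept = -3.0
coefficients = {"zip_mismatch": 1.2, "phone_mismatch": 0.8}

p_clean = fraud_probability(
    intercept, coefficients, {"zip_mismatch": 0, "phone_mismatch": 0}
)
p_flagged = fraud_probability(
    intercept, coefficients, {"zip_mismatch": 1, "phone_mismatch": 1}
)
print(round(p_clean, 4), round(p_flagged, 4))
```

With positive coefficients on the mismatch indicators, an application with both mismatches receives a higher estimated fraud probability than one with none.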

Create an estimation sample and a validation sample. This step is very important when creating a model to be used in the future. Validity is “The extent to which a measurement is measuring what was intended” (Everitt 1999). In other words, does the model truly differentiate between success and failure? What are estimation and validation samples? An estimation sample is the sample used to determine the parameters in the model. A validation sample is another set of data, used to determine whether the model produces consistent results: will the model perform well in practice, on data it has not seen? A validation sample is necessary when building a model to be used in practice. Unfortunately, validation is discussed much more in fields that apply statistics than within statistics itself. Note: If the data are biased in general, then the validation sample will not help in detecting this. For example, if there are no women in your database, then it is not possible to know what will happen when the model is applied to women. The validation sample has the same limitation as the estimation sample; thus the validation sample is not informative in this case.

Now that we know what estimation and validation samples are, how do we create them? The easiest sampling method is simple random sampling. A more commonly used sampling design in this case is stratified sampling. Stratify your population into two groups, successes and failures. When ample data are available, sample 10,000 successes and 10,000 failures for the estimation sample and another 10,000 successes and 10,000 failures for the validation sample. Keep track of the proportion of successes in your population relative to the number of successes sampled. Do the same for failures. Often there are not enough data, and often you will not have 10,000 frauds. Sometimes one group will have many records and the other very few. For the group with many, you need not sample beyond 10,000. This is an opinion, and it depends in part on how many variables you plan to use in your model. In my opinion, when data are scarce it is better to make the estimation sample larger than the validation sample. This may be merely personal preference on my part, as I have not read anything proving this is best.
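The sampling scheme above might be sketched as follows. The function name, the 70 percent estimation share used when a group is too small, and the toy data are all assumptions made for illustration; the cap per group plays the role of the 10,000 mentioned in the text.

```python
import random


def stratified_split(rows, target, per_group, est_percent=70, seed=0):
    """Stratify on the binary target and draw estimation and validation
    samples from each group.

    A group with at least 2 * per_group records contributes per_group
    records to each sample. A smaller group is used in full, with
    est_percent of it going to estimation (an assumed split).
    Also returns, per group, population count / estimation count.
    """
    rng = random.Random(seed)
    estimation, validation, weights = [], [], {}
    for label in (0, 1):
        group = [row for row in rows if row[target] == label]
        rng.shuffle(group)
        if len(group) >= 2 * per_group:
            n_est = n_val = per_group
        else:
            n_est = len(group) * est_percent // 100
            n_val = len(group) - n_est
        estimation += group[:n_est]
        validation += group[n_est:n_est + n_val]
        weights[label] = len(group) / n_est if n_est else None
    return estimation, validation, weights


# Toy population: 10 frauds among 1,000 records.
rows = [{"id": i, "fraud": int(i < 10)} for i in range(1000)]
est, val, w = stratified_split(rows, "fraud", per_group=100)
print(len(est), len(val), w)
```

The returned weights record how many population records each sampled record stands for, which is one way of keeping track of the population proportions the text asks for.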

Many people feel that with computers sampling is not needed, but this is not the case. Without sampling you would only have an estimation sample and no validation sample. When dealing with millions of records, sampling can greatly aid in the speed of the analysis. Note: Many companies do not even have the latest in computer technology. Many consultants, for example, work on their laptops, although ultimately, when the model is finished it is run on the entire dataset.

The next step is variable selection. Often in practice “brute force” is used to eliminate most of the variables. For example, when we have 500 variables or more, we try them all in the model and use stepwise logistic regression to eliminate most of them and determine the 15-20 or so most important variables. Stepwise logistic regression is offered in most common software packages. In SAS, in order to increase speed, I first use a procedure called StepDisc to come up with the top 60 variables, and then I do stepwise logistic regression. Next, within variable selection, we investigate the variables selected and determine whether they make sense. For example, a mismatch between the zip code on the application and the zip code in the database should have a positive relationship with the presence of fraud. A negative relationship between fraud and a mismatch on the zip code would make us question the results.
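The "does it make sense" check can be partly automated by comparing the sign of each fitted coefficient with the sign we expect from subject knowledge. The sketch below assumes coefficients are already available as a dictionary; the variable names and values are hypothetical.

```python
def unexpected_signs(coefficients, expected_signs):
    """Return the variables whose fitted sign contradicts expectation.

    expected_signs maps a variable name to +1 or -1. Variables without
    a stated expectation, or with a zero coefficient, are skipped.
    """
    flagged = []
    for name, coef in coefficients.items():
        expected = expected_signs.get(name)
        if expected is None or coef == 0:
            continue
        actual = 1 if coef > 0 else -1
        if actual != expected:
            flagged.append(name)
    return flagged


# Hypothetical fitted coefficients from a stepwise run.
coefficients = {
    "zip_mismatch": -0.4,
    "phone_mismatch": 0.9,
    "months_at_address": -0.02,
}
# Mismatches are expected to increase the chance of fraud.
expected_signs = {"zip_mismatch": 1, "phone_mismatch": 1}

to_review = unexpected_signs(coefficients, expected_signs)
print(to_review)
```

A flagged variable is not necessarily wrong, but it is a prompt to look at the data and at relationships among the independent variables before accepting the model.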

How do we know if the model is a good predictive model, or if it requires more work? First, what is good? The following questions need to be answered in order to determine if the model is good:

  1. Does the fraud detection model distinguish/separate between the two groups, fraud and non-fraud?
  2. Does the model validate well?
  3. What happens if we remove or change some of the less important variables in the model? Does this affect the model performance?

The most important question is whether the model distinguishes/separates the non-fraudulent applications for credit from the fraudulent ones. To answer this question many people use the Kolmogorov-Smirnov two-sample statistic, which is defined as “A distribution free method that tests for any difference between population probability distributions. The test is based on the maximum absolute difference between the cumulative distribution functions of the samples from each population” (Everitt 1999).

From logistic regression we estimate, for each record, the probability of success and the probability of failure. Treating the estimated probability of failure as the “random variable”, we can create two cumulative distribution functions, one from the successes and one from the failures.
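A minimal sketch of the statistic, computed directly from its definition as the largest gap between the two empirical cumulative distribution functions. The scores below are made up; any model score can be used as the "random variable", provided the same score is used for both groups.

```python
def ks_statistic(scores_a, scores_b):
    """Maximum absolute difference between the empirical CDFs of two
    samples, evaluated at every observed score."""
    a = sorted(scores_a)
    b = sorted(scores_b)
    i = j = 0
    largest_gap = 0.0
    for x in sorted(set(a) | set(b)):
        # Advance each pointer past all values <= x, so that
        # i / len(a) and j / len(b) are the two CDF values at x.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        largest_gap = max(largest_gap, abs(i / len(a) - j / len(b)))
    return largest_gap


# Hypothetical model scores for the two groups.
fraud_scores = [0.9, 0.8, 0.7, 0.4]
nonfraud_scores = [0.1, 0.2, 0.3, 0.6]

ks = ks_statistic(fraud_scores, nonfraud_scores)
print(ks)
```

A value near 0 means the two groups receive similar scores, and a value near 1 means the model separates them almost completely. Computing the statistic on both the estimation and the validation sample gives a simple check on whether the separation holds up on new data.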

The following pictures suggest ways in which the validation of a fraud model, and of a logistic regression model in general, can be understood.

PIC

PIC

PIC

PIC

PIC

PIC

PIC

Presentation

A key to understanding the results from the data analysis is the presentation. How do we view our results? Visualization and presentation are very important. It is important to know your audience: your audience determines how you will present what you learned from the logistic regression model. Senior management in a business is not interested in a theoretical discussion; they are interested in how your fraud detection model will help the company. A statistician would need less visualization, as he or she already understands statistical modeling, but in my opinion a good presentation of results can only help. This is very important for gaining trust in your work.

It is important to present each independent variable in the model against the dependent variable on its own graph. Often a variable, when viewed in the model, has the opposite relationship with the dependent variable from the one it shows when looked at separately. This can result from multicollinearity, which will not be covered in this discussion. When creating a model, it is good to think about the variables that enter into the model and why they are entered. You may be asked to explain why you chose to keep a certain variable and use it in the model. Usually simple graphs, such as bar charts, are used for understanding the relationship between the independent variables and fraud. The following is a partial sample presentation for a fraud detection model.
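Before turning to the sample presentation, here is a rough sketch of the numbers that sit behind such a bar chart: the fraud rate at each level of a single independent variable, looked at on its own. The data and the variable name are invented for illustration, and the text bars merely stand in for a real chart.

```python
from collections import defaultdict


def fraud_rate_by_level(rows, variable, target="fraud"):
    """Fraud rate within each distinct value of one independent variable."""
    totals = defaultdict(int)
    frauds = defaultdict(int)
    for row in rows:
        totals[row[variable]] += 1
        frauds[row[variable]] += row[target]
    return {level: frauds[level] / totals[level] for level in totals}


def text_bars(rates, width=40):
    """Render one text bar per level, scaled so that 100% fills `width`."""
    return [
        f"{level!s:>5} | {'#' * round(rate * width)} {rate:.1%}"
        for level, rate in sorted(rates.items())
    ]


# Hypothetical data: 8 applications without a zip mismatch (1 fraud)
# and 4 with a mismatch (2 frauds).
rows = (
    [{"zip_mismatch": 0, "fraud": 0}] * 7
    + [{"zip_mismatch": 0, "fraud": 1}]
    + [{"zip_mismatch": 1, "fraud": 0}] * 2
    + [{"zip_mismatch": 1, "fraud": 1}] * 2
)

rates = fraud_rate_by_level(rows, "zip_mismatch")
for line in text_bars(rates):
    print(line)
```

Comparing these one-variable rates with the sign of the same variable inside the model is a quick way to spot the reversals described above.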

Partial Sample Presentation

PIC

PIC

PIC

PIC

PIC

PIC

PIC

PIC

PIC