9.2.2 The Data

1: Collect the Data

_ _

Data analysis projects in business are often performed on an existing databases, especially data mining projects. They are often carried out using the company’s database. In other instances data will be purchased. For example, banks within America will most often hire consultants to create risk models using data that are purchased from credit bureaus. Under the latter situation the phrase “data collection” means talking to the appropriate people to obtain the pertinent subset of data from the available database(s) at hand. The data you desire are obviously determined by the questions you desire to answer and the data you have access to will in part determine the business issues you can address. In order to answer certain questions, you may need to collect additional data and this collection of additional data can be costly. Sometimes for a project it may be necessary to modify the business question according to the information at hand. For example, revenue may be used as a proxy for profit. Revenue is clearly not profit, but if profit is not feasible to obtain, then revenue my be the next best measure. Further, profit may be available but too costly or the effort too extensive given the project at hand. In the data collection step it is important to talk with people closest to the questions at hand in order to obtain the best insight into the data required to answer those questions.

2: Use Exploratory Data Analysis to Familiarize Yourself with the Data and Discover Initial Insights

_ _

In the data collection step it is very important to recall what the people closest to the questions at hand are recommended to investigate. Especially when time is a concern, this will help you focus on the most important variables and to be certain that you dont miss any obviously important variables from rushing. From personal experience, if there are certain variables that people involved in the project believe are relevant then at the end of the project questions will arise pertaining to those variables. It is very important to be able to at least comment on why a variable is not used in the final solution if another person believed the variable was relevant. This could cause doubt in the solution if a certain variable perceived to be important was not investigated, possibly causing extensive additional work and even resulting in a different solution.

Often I have worked with databases on a short time line and with several files, with hundreds of thousands of records for each file, and hundreds of variables. In these circumstances the variables mentioned by the client will be focused on. The descriptive statistics minimum, maximum, mode, mean, median, 25th and 75th percentiles will be of most interest in order to obtain a quick understanding. Again, additional time and thought will be spent on the key variables mentioned by the client. When dealing with such a large number of variables, the modeling phase will often be used for the elimination of variables.

3: Evaluate the Quality of the Data

_ _

There are many aspects to cover when thinking about the quality of the data. We will cover a few of the most important ones.

  1. Do you have the data needed to answer the questions you have?
  2. Are the data accurate?
  3. Is a large percentage of data missing within the database?
  4. Can you make the most use of your data?
  5. Can you combine your databases to make the most use of all your data?
  6. Do you have consistent labeling and naming conventions within the database?

4: Data Preparation Phase

_ _
  1. Variable creation:
  2. Handling of outliers:
  3. Data transformations:
    1. When handling a database with a lot of variables, say in the hundreds, sometimes multiple transformations will be tried and a brute force method will be used.
      • In a statistical modeling project with a continuous dependent variable the researcher might transform every independent variable. Many transformations could be tried, such as reciprocal, square, square root, log. Then you would select the “version” of each independent variable that is most highly correlated with the dependent variable for the modeling phase.