Data analysis projects in business are often performed on an existing databases, especially
data mining projects. They are often carried out using the company’s database. In other
instances data will be purchased. For example, banks within America will most often hire
consultants to create risk models using data that are purchased from credit bureaus.
Under the latter situation the phrase “data collection” means talking to the
appropriate people to obtain the pertinent subset of data from the available
database(s) at hand. The data you desire are obviously determined by the questions
you desire to answer and the data you have access to will in part determine the
business issues you can address. In order to answer certain questions, you may need
to collect additional data and this collection of additional data can be costly.
Sometimes for a project it may be necessary to modify the business question
according to the information at hand. For example, revenue may be used as a proxy
for profit. Revenue is clearly not profit, but if profit is not feasible to obtain,
then revenue my be the next best measure. Further, profit may be available
but too costly or the effort too extensive given the project at hand. In the data
collection step it is important to talk with people closest to the questions at
hand in order to obtain the best insight into the data required to answer those
questions.
2: Use Exploratory Data Analysis to
Familiarize Yourself with the Data
and Discover Initial Insights |
_
_
In the data collection step it is very important to recall what the people closest to the
questions at hand are recommended to investigate. Especially when time is a concern, this
will help you focus on the most important variables and to be certain that you dont miss
any obviously important variables from rushing. From personal experience, if there are
certain variables that people involved in the project believe are relevant then at the
end of the project questions will arise pertaining to those variables. It is very
important to be able to at least comment on why a variable is not used in the final
solution if another person believed the variable was relevant. This could cause
doubt in the solution if a certain variable perceived to be important was not
investigated, possibly causing extensive additional work and even resulting in a different
solution.
Often I have worked with databases on a short time line and with several files, with
hundreds of thousands of records for each file, and hundreds of variables. In these
circumstances the variables mentioned by the client will be focused on. The descriptive
statistics minimum, maximum, mode, mean, median, 25th and 75th percentiles will be of
most interest in order to obtain a quick understanding. Again, additional time and
thought will be spent on the key variables mentioned by the client. When dealing with
such a large number of variables, the modeling phase will often be used for the elimination
of variables.
|
3: Evaluate the Quality of the Data |
_
_
There are many aspects to cover when thinking about the quality of the data. We will cover a
few of the most important ones.
- Do you have the data needed to answer the questions you have?
- A database is of little use if it is missing the key elements required to
answer the questions at hand, thus making the database in essence low
quality. Take for example a company keeping a database for marketing
purposes. The company wishes to use the database to learn about
the success of their marketing campaigns in order to improve future
campaigns. If their database does not contain information about the items
purchased as a result of their marketing campaigns, this would make it
almost impossible to learn from previous campaigns.
- Are the data accurate?
- Believe it or not, many databases keep a great deal of inaccurate information.
Do not spend a lot of money on advanced statistics if you know that your data
are inaccurate: this creates a politically driven solution.
- If the company likes the solution, they ignore that fact the data are
inaccurate, and if they do not like the solution they ignore it and
mention the fact the data are inaccurate.
- Often we look at the minimum, maximum, least frequent, and most frequent
observations to get a quick feel for reliability of the data.
- Is a large percentage of data missing within the database?
- Is there a lot of missing data?
- Are any essential fields missing?
- Can you make the most use of your data?
- Many companies have a customer on their database listed multiple times and
cannot determine when it is the same person.
- I have done a lot of work on databases and have often been told that
each record in the database refers to a unique person, only to later
find the same person listed multiple times.
- Can you combine your databases to make the most use of all your data?
- For example, can the company accurately combine all of its databases. Do
the databases have a consistent unique identifier for the customers?
- Do you have consistent labeling and naming conventions within the database?
- Many databases do not have a consistent labeling convention. Imagine a
database with province/city:
- Bangkok
- BKK (the abbreviation for Bangkok)
- bkk (some software packages are case sensitive)
- Krung Thep (another word for Bangkok)
- This is terrible for data analysis and can lead to difficulties in learning about
Bangkok. What if you do not realize that there are multiple labels for Bangkok?
This can add hours to your work and/or result in inaccurate statistics if you
want to summarize by province/city.
|
4: Data Preparation Phase |
_
_
- Variable creation:
- Often we have to create new variables from the data we are working with.
For example, a company may have on record which months a customer is
late and this may be turned into many other variables. One possible new
variable could be the maximum length of time that customer has been
late.
- Handling of outliers:
- Outliers, extreme observations, can greatly effect the data analysis results.
These outliers can cause a model to predict very poorly. You may wish
to simply delete outliers in order to get a better understanding of the
majority of your clients.
- Data transformations:
- When handling a database with a lot of variables, say in the hundreds,
sometimes multiple transformations will be tried and a brute force method will
be used.
- In a statistical modeling project with a continuous dependent variable
the researcher might transform every independent variable. Many
transformations could be tried, such as reciprocal, square, square root,
log. Then you would select the “version” of each independent variable
that is most highly correlated with the dependent variable for the
modeling phase.