This section provides a high-level summary of data analysis projects from the point of
view of a consultant. Research questions are the most important driver of a data analysis
project. Consultants are often hired for a data analysis project because the company has
an existing research question that it wants answered. Some questions are openly stated
and some are not. Political goals, and questions pertaining to internal politics, are often
not stated, but it is important to be aware of their existence. Data, technique, and
presentation are key to forming readily understandable and actionable answers to the
questions at hand.
- What data do you have and can the data answer the questions you have?
- Do you have the data necessary to answer your questions?
- Note: The data available to you partially determine the technique
you will use for the data analysis.
- Garbage In Garbage Out (G.I.G.O.)
- This is very important. It means: you cannot expect to get good, reliable
results with bad data.
- In other words: if the data are not good, not accurate, not reliable, etc.,
you cannot trust the results.
- There are many data analysis techniques.
- There is often more than one technique that can be used to answer the
same question.
- The results from the different techniques often do not differ as much as one
might think.
- This is often true when investigating statistical models, such as multiple
linear regression, logistic regression, decision trees, ...
- The technique is partially determined by the data you have.
- Many techniques within the data mining literature can also be found
within standard statistics textbooks.
- Data mining and statistics are both used to analyze data in order to gain
useful information.
- The presentation is a very important part of a data analysis project. Sometimes
it is the most important part.
- A good presentation should support the findings and not just mention
them.
- The supporting statistics and graphs within the presentation can either
be an aid to understanding or create confusion.
- Management often relies on the presentation in order to understand the findings
from data analysis projects.
- Management needs to trust the findings; findings that are presented
poorly are difficult to trust.
- A poor presentation can even cause projects to fail. Management will
not implement what they do not trust or understand.
- Also, a poor presentation or explanation often leaves management
unsure how to interpret the findings from the project and how to
proceed with them.
- Unfortunately, many statisticians and computer scientists are lacking in this critical
area.
- They tend to merely look at the results and the numbers in the computer
output.
- This causes many data analysis projects to fail or to be less successful than they
could be.
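The point made in the list above, that more than one technique can answer the same
question and that the results often do not differ much, can be illustrated with a small
sketch. It is not taken from any particular project: the data are synthetic, and the two
techniques (an ordinary least-squares slope and a Theil-Sen slope, i.e. the median of all
pairwise slopes) were chosen only because both fit in a few lines of standard-library
Python. The function names are hypothetical.

```python
import random
import statistics
from itertools import combinations


def ols_slope(xs, ys):
    """Least-squares slope: co-variation of x and y divided by variation of x."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def theil_sen_slope(xs, ys):
    """Theil-Sen slope: median of the slopes between all pairs of points."""
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2)
        if x2 != x1
    ]
    return statistics.median(slopes)


# Synthetic data, made up for illustration: y = 2x + 1 plus Gaussian noise.
rng = random.Random(0)
xs = [float(i) for i in range(50)]
ys = [2.0 * x + 1.0 + rng.gauss(0.0, 3.0) for x in xs]

slope_ols = ols_slope(xs, ys)
slope_ts = theil_sen_slope(xs, ys)
print(f"least-squares slope: {slope_ols:.3f}")
print(f"Theil-Sen slope:     {slope_ts:.3f}")
```

On clean data such as these, both estimates should land close to the true slope of 2 and
close to each other. The two techniques tend to diverge mainly when the data contain gross
errors, which is one more reason to look at the data before choosing a technique.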
In the opinion of the author, the most important part of a data analysis project is the
data collection. Always think about “Garbage in, garbage out” (G.I.G.O.) before collecting
the data. For this reason, the next subsection is devoted solely to understanding the
data collection and data investigation in a data analysis project. The following subsection
consists of an example covering the three main parts of a data analysis project: the data,
technique, and presentation.
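As a minimal sketch of what keeping G.I.G.O. in mind can look like in practice, the
following counts three common kinds of bad data before any analysis is run: missing
values, values outside a plausible range, and duplicate records. The function name
`audit_records`, the field names, the ranges, and the records themselves are all
hypothetical and exist only for illustration; a real project would define its own checks.

```python
def audit_records(records, required_fields, valid_ranges):
    """Count basic data problems: missing values, out-of-range values, duplicates."""
    report = {"missing": 0, "out_of_range": 0, "duplicates": 0}
    seen = set()
    for record in records:
        # A record counts as a duplicate if all required fields match an earlier one.
        key = tuple(record.get(field) for field in required_fields)
        if key in seen:
            report["duplicates"] += 1
        seen.add(key)
        for field in required_fields:
            value = record.get(field)
            if value is None:
                report["missing"] += 1
            elif field in valid_ranges:
                low, high = valid_ranges[field]
                if not low <= value <= high:
                    report["out_of_range"] += 1
    return report


# Hypothetical records, made up for illustration.
records = [
    {"id": 1, "age": 34, "income": 52000},
    {"id": 2, "age": None, "income": 61000},  # missing age
    {"id": 3, "age": 212, "income": 48000},   # implausible age
    {"id": 1, "age": 34, "income": 52000},    # exact duplicate of the first record
]
required_fields = ["id", "age", "income"]
valid_ranges = {"age": (0, 120), "income": (0, 10_000_000)}  # assumed plausible limits

report = audit_records(records, required_fields, valid_ranges)
print(report)  # {'missing': 1, 'out_of_range': 1, 'duplicates': 1}
```

A report with non-zero counts does not say how to fix the data, only that the results of
any analysis run on them should not yet be trusted.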