
Section 7 - Exploratory Data Analysis
Five number summary
Box plots
Outliers
Objectives: after completing this section you should be able to:
- Explain what is meant by Exploratory Data Analysis
- Construct a box plot
- Calculate the five number summary from a set of data
- Identify mild and extreme outliers in a set of data
EXPLORATORY DATA ANALYSIS
Often when working with statistics we wish to answer a specific question,
such as: does smoking cigars lead to an increased risk of lung cancer? Or: does
the number of keys carried by men exceed the number carried by women? We will
learn much in this class about how to do this kind of statistics. Sometimes,
however, we just wish to explore a data set to see what it might tell us. When
we do this we are doing Exploratory Data Analysis. Many of the techniques we
have studied so far are used in exploratory data analysis. This section
introduces two more methods: the five number summary and the box plot.
The five number summary
The minimum score, the first quartile (Q1), the median, the third quartile (Q3)
and the maximum score constitute the five number summary of a set of data.
Box plots
A box plot (sometimes called a box and whisker diagram) is a graph that
consists of a line extending from the lowest score to the highest, and a box
with lines drawn at the first quartile, the median, and the third quartile.
An example should make this clear.
Consider the data given below of CEO ages (notice the data is sorted for
convenience):
36  37  37  39  39  41
41  43  45  46  48  48
48  50  50  52  52  52
53  55  55  59  61  62
Using the methods of previous sections, you get the following five number
summary:
Min: 36
Q1: 41
Median: 48
Q3: 52.5
Max: 62
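The calculation can be checked with a short Python sketch. It assumes the quartiles are found as the medians of the lower and upper halves of the sorted data (leaving the median itself out of both halves when the number of scores is odd). This convention is an assumption here; it reproduces the values above, but other quartile conventions exist and can give slightly different results. The function names are illustrative only.

```python
def median(sorted_values):
    """Median of a list that is already sorted."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


def five_number_summary(data):
    """Return (min, Q1, median, Q3, max).

    Assumption: Q1 and Q3 are the medians of the lower and upper halves
    of the sorted data. Needs at least two scores.
    """
    values = sorted(data)
    n = len(values)
    half = n // 2
    lower = values[:half]
    # Skip the middle score when n is odd so it belongs to neither half.
    upper = values[half:] if n % 2 == 0 else values[half + 1:]
    return (values[0], median(lower), median(values), median(upper), values[-1])


ceo_ages = [36, 37, 37, 39, 39, 41, 41, 43, 45, 46, 48, 48,
            48, 50, 50, 52, 52, 52, 53, 55, 55, 59, 61, 62]
summary = five_number_summary(ceo_ages)
print(summary)  # (36, 41.0, 48.0, 52.5, 62)
```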
To construct a box plot, draw a number line with labels appropriate for your
problem. Mark the five number summary above the number line, then draw a box
from quartile 1 to quartile 3 with a vertical line at the median, and two
straight lines extending from the ends of the box to the min and the max.
Here is a completed box plot for the data above:
[Figure: box plot of the CEO ages]
The end of the first line is at 36 for the min, the first end of the box is
at 41 for quartile 1, the second vertical line in the box is at the median of
48, and the final vertical line of the box is at 52.5 for quartile 3. The end
of the second line is at the max of 62.
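As a rough illustration of the construction steps, the following Python sketch draws a one-line text version of a box plot from a five number summary. The function name `ascii_boxplot` and the `width` parameter are made up for this example, and it assumes the max is larger than the min. It is only a sketch of the idea, not a replacement for a proper plotting tool.

```python
def ascii_boxplot(minimum, q1, median, q3, maximum, width=53):
    """Return a one-line text box plot that is `width` characters wide.

    Assumption: maximum > minimum, so the scale is well defined.
    """
    span = maximum - minimum

    def col(x):
        # Map a data value to a column index between 0 and width - 1.
        return round((x - minimum) / span * (width - 1))

    line = [" "] * width
    # Straight line from the min to the max.
    for i in range(col(minimum), col(maximum) + 1):
        line[i] = "-"
    # Box from quartile 1 to quartile 3.
    for i in range(col(q1), col(q3) + 1):
        line[i] = "="
    # Vertical marks at the five summary values.
    for x in (minimum, q1, median, q3, maximum):
        line[col(x)] = "|"
    return "".join(line)


# Five number summary of the CEO ages from the text.
plot = ascii_boxplot(36, 41, 48, 52.5, 62)
print(plot)
```

With the default width of 53, each year of age takes two columns, so the marks fall at columns 0, 10, 24, 33 and 52.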
OUTLIERS
An outlier is a very unusual data point: one that is far from the rest of the
data. For example, consider the following data, which represents the ages of
stat 2005 students:
17  19  21  21  22
22  23  26  26  29
33  35  35  35  37
39  40  41  42  67
A value like 67 seems very far from the rest of the data. To decide whether it
really is an outlier, we use the following rule:
- If a data point is more than 1.5 times the Interquartile Range (this is
Q3 - Q1) above Q3 or below Q1, but not more than 3 times, then it is a mild
outlier
- If a data point is more than 3 times the Interquartile Range above Q3 or
below Q1, then it is an extreme outlier.
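The rule can be written as a small Python sketch. The function name `classify` is illustrative only. The quartiles used in the example calls are the ones reported for the student ages (Q1 = 22, Q3 = 38); the value 90 is a made-up score added only to show the extreme case.

```python
def classify(value, q1, q3):
    """Label a score as 'mild outlier', 'extreme outlier' or 'not an outlier'."""
    iqr = q3 - q1  # Interquartile Range
    if value > q3:
        distance = value - q3   # how far above Q3
    elif value < q1:
        distance = q1 - value   # how far below Q1
    else:
        return "not an outlier"  # inside the box
    if distance > 3 * iqr:
        return "extreme outlier"
    if distance > 1.5 * iqr:
        return "mild outlier"
    return "not an outlier"


print(classify(67, 22, 38))  # 67 is 29 above Q3; 24 < 29 <= 48
print(classify(42, 22, 38))  # 42 is only 4 above Q3
print(classify(90, 22, 38))  # hypothetical score, 52 above Q3
```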
To modify a box plot so that it includes outliers, do the following:
- Calculate the Interquartile Range and denote it as D
- Draw a box with the median and quartiles as usual, but when extending the
lines that branch out from the box, go only as far as the scores that are
within 1.5D of the box
- Mild outliers are scores above Q3 by an amount of 1.5D to 3D, or scores
below Q1 by an amount of 1.5D to 3D. Plot mild outliers as solid dots
- Extreme outliers are scores that exceed Q3 by more than 3D or are below Q1
by more than 3D. Plot extreme outliers as hollow circles
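In the data above Q3 is 38 and Q1 is 22, so the Interquartile Range is
38 - 22 = 16 and 1.5 * 16 = 24. Since Q3 + 24 = 62 and 67 is above 62, 67 is an
outlier. Because 3 * 16 = 48 and 67 is below Q3 + 48 = 86, it is a mild outlier
rather than an extreme one.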
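To see what the modified box plot for the student ages would contain, here is a minimal Python sketch. It takes Q1 = 22 and Q3 = 38 from the text rather than recomputing them, and the variable names are illustrative only. It finds where the lines from the box should stop and which scores would be plotted as dots or circles.

```python
ages = [17, 19, 21, 21, 22, 22, 23, 26, 26, 29,
        33, 35, 35, 35, 37, 39, 40, 41, 42, 67]
q1, q3 = 22, 38   # quartiles as reported in the text
d = q3 - q1       # Interquartile Range D = 16

# Limits at 1.5D and 3D away from the box.
inner_low, inner_high = q1 - 1.5 * d, q3 + 1.5 * d   # -2.0 and 62.0
outer_low, outer_high = q1 - 3 * d, q3 + 3 * d       # -26 and 86

# The lines from the box stop at the most extreme scores within 1.5D.
inside = [x for x in ages if inner_low <= x <= inner_high]
line_low, line_high = min(inside), max(inside)

# Mild outliers (solid dots) and extreme outliers (hollow circles).
mild = [x for x in ages
        if outer_low <= x < inner_low or inner_high < x <= outer_high]
extreme = [x for x in ages if x < outer_low or x > outer_high]

print(line_low, line_high)  # 17 42
print(mild)                 # [67]
print(extreme)              # []
```

So the lines run from 17 to 42, and 67 is drawn on its own as a solid dot.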