CHAPTER 2

Section 7

Lesson material

 

Five number summary

Box plots

Outliers

Objectives (what you should know how to do): after completing this section you should be able to:

  1. Understand what is meant by Exploratory Data Analysis
  2. Know how to construct a Boxplot
  3. Know how to calculate the five number summary from a set of data
  4. Know how to identify mild and extreme outliers in a set of data

EXPLORATORY DATA ANALYSIS

Often when working with statistics we wish to answer a specific question, such as: does smoking cigars lead to an increased risk of lung cancer? Or does the number of keys carried by men exceed the number carried by women? We will learn much in this class about how to do this kind of statistics. However, sometimes we just wish to explore a data set to see what it might tell us. When we do this we are doing Exploratory Data Analysis. Many of the techniques we have studied so far are used in exploratory data analysis. This section introduces two more methods: the five number summary and the boxplot.

The five number summary

 

The minimum score, the first quartile (Q1), the median, the third quartile (Q3) and the maximum score constitute the five number summary of a set of data.

 

Box plots

 

A box plot (sometimes called a box and whisker diagram) is a graph that consists of a line extending from the lowest score to the highest, and a box with lines drawn at the first quartile, the median, and the third quartile.

 

An example should make this clear:

Consider the CEO ages given below (notice the data are sorted for convenience):

36  37  37  39  39  41  41  43  45  46  48  48
48  50  50  52  52  52  53  55  55  59  61  62

Using the methods of the previous sections, you get the following five number summary:

Min: 36

Q1: 41

Median: 48

Q3: 52.5

Max: 62
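The summary above can be checked with a short Python sketch. The quartile rule used here is an assumption, since the rule from the earlier sections is not restated on this page: Q1 and Q3 are taken as the medians of the lower and upper halves of the sorted data, leaving out the middle score when the number of scores is odd. It does reproduce the values listed above. The function name five_number_summary is illustrative only.

```python
from statistics import median


def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) for a list of at least two scores.

    Assumed quartile rule: Q1 and Q3 are the medians of the lower and
    upper halves of the sorted data (the middle score is left out of
    both halves when the number of scores is odd).
    """
    values = sorted(data)
    n = len(values)
    half = n // 2
    lower = values[:half]        # the smallest half of the scores
    upper = values[n - half:]    # the largest half of the scores
    return (values[0], median(lower), median(values), median(upper), values[-1])


ceo_ages = [36, 37, 37, 39, 39, 41, 41, 43, 45, 46, 48, 48,
            48, 50, 50, 52, 52, 52, 53, 55, 55, 59, 61, 62]

print(five_number_summary(ceo_ages))  # (36, 41.0, 48.0, 52.5, 62)
```

Other textbooks and software packages use slightly different quartile rules, so their Q1 and Q3 may differ a little from these.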

 

To construct a boxplot, create a number line with labels appropriate for your problem. Mark the five number summary above the number line, then draw a box from quartile 1 to quartile 3 with a vertical line at the median, and two straight lines extending from the ends of the box to the min and the max. Here is a completed boxplot for the data above:

The end of the first line is at 36 for the min, the first end of the box is at 41 for quartile 1, the vertical line inside the box is at the median of 48, the final vertical line of the box is at 52.5 for quartile 3, and the end of the second line is at the max of 62.
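The construction steps can also be sketched as a rough one-line text drawing in Python. This is only an illustration of where each part of the boxplot goes, not a replacement for a proper graph. The function name text_boxplot, the width of 53 characters, and the symbols chosen for the whiskers and box are all assumptions made for this sketch, and it assumes the max is larger than the min.

```python
def text_boxplot(minimum, q1, median, q3, maximum, width=53):
    """Draw a one-line text boxplot from a five number summary."""
    span = maximum - minimum  # assumed to be greater than zero

    def col(x):
        # Map a score to a column between 0 and width - 1.
        return round((x - minimum) * (width - 1) / span)

    line = [" "] * width
    for i in range(col(minimum), col(maximum) + 1):
        line[i] = "-"              # lines from the min to the max
    for i in range(col(q1), col(q3) + 1):
        line[i] = "="              # the box from Q1 to Q3
    line[col(q1)] = "["            # first end of the box (Q1)
    line[col(q3)] = "]"            # second end of the box (Q3)
    line[col(median)] = "|"        # line at the median
    return "".join(line)


# Five number summary of the CEO ages: 36, 41, 48, 52.5, 62
print(text_boxplot(36, 41, 48, 52.5, 62))
```

With a width of 53, each year of age takes up two characters, so the box starts 10 characters in (Q1 is 5 years above the min) and the median sits 24 characters in.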

 

OUTLIERS

 

An outlier is a very unusual data point, one that is far from the rest of the data. For example, consider the following data, which represent the ages of Stat 2005 students:

17  19  21  21  22  22  23  26  26  29
33  35  35  35  37  39  40  41  42  67

A value like 67 seems very far from the rest of the data. How do we know this for sure? We use the following rule:

  1. If a data point is more than 1.5 times the Interquartile Range (this is Q3 - Q1) above Q3 or below Q1, then it is a mild outlier.
  2. If a data point is more than 3 times the Interquartile Range above Q3 or below Q1, then it is an extreme outlier.
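As a minimal sketch, the rule can be written as a pair of small Python functions. The names outlier_cutoffs and classify are made up for this illustration. The example uses the quartiles of the student ages, Q1 = 22 and Q3 = 38. The sketch assumes that a point exactly on a cutoff does not count as "more than" that cutoff.

```python
def outlier_cutoffs(q1, q3):
    """Return the four cutoff values implied by the 1.5 and 3 times IQR rule."""
    iqr = q3 - q1  # Interquartile Range
    return {
        "extreme_low": q1 - 3 * iqr,
        "mild_low": q1 - 1.5 * iqr,
        "mild_high": q3 + 1.5 * iqr,
        "extreme_high": q3 + 3 * iqr,
    }


def classify(x, q1, q3):
    """Label one data point as an extreme outlier, a mild outlier, or neither."""
    c = outlier_cutoffs(q1, q3)
    if x < c["extreme_low"] or x > c["extreme_high"]:
        return "extreme outlier"
    if x < c["mild_low"] or x > c["mild_high"]:
        return "mild outlier"
    return "not an outlier"


print(outlier_cutoffs(22, 38))
print(classify(67, 22, 38))  # mild outlier
print(classify(42, 22, 38))  # not an outlier
```

The extreme check comes first because any point beyond 3 times the Interquartile Range is also beyond 1.5 times it.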

To modify a box plot so that it shows outliers, do the following:

  1. Calculate the Interquartile Range and denote it as D.
  2. Draw a box with the median and quartiles as usual, but when extending the lines that branch out from the box, go only as far as the most extreme scores that are within 1.5D of the box.
  3. Mild outliers are scores above Q3 by an amount of 1.5D to 3D, or scores below Q1 by an amount of 1.5D to 3D. Plot mild outliers as solid dots.
  4. Extreme outliers are scores that exceed Q3 by more than 3D or are below Q1 by more than 3D. Plot extreme outliers as hollow circles.
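The steps above can be sketched in Python as a helper that works out what to draw: where the two lines from the box should end, and which scores get a solid dot or a hollow circle. The function name modified_boxplot_parts and the dictionary keys are hypothetical, chosen only for this illustration, and the quartiles are passed in rather than computed. The sketch assumes at least one score lies within 1.5D of the box.

```python
def modified_boxplot_parts(data, q1, q3):
    """Work out the line ends and outlier symbols for a modified boxplot."""
    d = q3 - q1  # step 1: the Interquartile Range, D

    def distance_outside(x):
        # How far x lies beyond the box (zero or negative if inside it).
        return max(x - q3, q1 - x)

    # Step 2: the lines stop at the most extreme scores within 1.5D of the box.
    inside = [x for x in data if distance_outside(x) <= 1.5 * d]
    # Step 3: mild outliers, plotted as solid dots.
    solid_dots = [x for x in data if 1.5 * d < distance_outside(x) <= 3 * d]
    # Step 4: extreme outliers, plotted as hollow circles.
    hollow_circles = [x for x in data if distance_outside(x) > 3 * d]

    return {
        "line_ends": (min(inside), max(inside)),
        "solid_dots": solid_dots,
        "hollow_circles": hollow_circles,
    }


ages = [17, 19, 21, 21, 22, 22, 23, 26, 26, 29,
        33, 35, 35, 35, 37, 39, 40, 41, 42, 67]

print(modified_boxplot_parts(ages, q1=22, q3=38))
```

For the student ages this gives lines ending at 17 and 42, a single solid dot at 67, and no hollow circles.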

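In the data above Q3 is 38 and Q1 is 22, so the Interquartile Range is 38 - 22 = 16, and 1.5*16 = 24. Since Q3 + 24 = 62 and 67 is above 62, 67 is an outlier. Because 3*16 = 48 and 67 is below Q3 + 48 = 86, it is a mild outlier rather than an extreme one.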
