The histogram and box plot show the average amount of money, in thousands of dollars, spent on each person in the country (per capita spending) for health care in 34 countries.
Here is the data set used to create the histogram and box plot from the warm-up.
In a science class, 11 groups of students are synthesizing biodiesel. At the end of the experiment, each group recorded the mass in grams of the biodiesel they synthesized. The masses of biodiesel are
In statistics, an outlier is a data value that is unusual in that it differs quite a bit from the other values in the data set.
Outliers occur in data sets for a variety of reasons including, but not limited to:
Outliers can reveal cases worth studying in detail or errors in the data collection process. In general, they should be included in any analysis done with the data.
A value is an outlier if it is
In this box plot, the minimum and the maximum are both outliers, so the data set contains at least two outliers.
It is important to identify the source of outliers because outliers can affect measures of center and variability in significant ways. The box plot displays the resting heart rate, in beats per minute (bpm), of 50 athletes taken five minutes after a workout.
Some summary statistics include:
It appears that the maximum value of 112 bpm may be an outlier. Because the interquartile range is 14 bpm (Q3 − Q1) and 112 is greater than Q3 + 1.5 · 14 = Q3 + 21, we should label the maximum value as an outlier. Searching through the actual data set, it could be confirmed that this is the only outlier.
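The check described above can be sketched in a few lines of Python. This is a minimal sketch that assumes the widely used 1.5 × IQR rule; the function names `iqr_fences` and `is_outlier` are made up for illustration, and the quartile values are hypothetical, chosen only so that the IQR matches the 14 bpm given in the text.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) cutoffs for the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def is_outlier(value, q1, q3, k=1.5):
    """True if value lies below the lower cutoff or above the upper cutoff."""
    lower, upper = iqr_fences(q1, q3, k)
    return value < lower or value > upper


# Hypothetical quartiles (assumption): picked so that Q3 - Q1 = 14 bpm.
q1, q3 = 62, 76
print(iqr_fences(q1, q3))       # (41.0, 97.0)
print(is_outlier(112, q1, q3))  # True
```

With these assumed quartiles, any value above 97 bpm or below 41 bpm would be labeled an outlier, so 112 bpm qualifies.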
After reviewing the data collection process, it is discovered that the heart rate measurement of 112 bpm was taken one minute after a workout instead of five minutes after. The outlier should be deleted from the data set because it was not obtained under the right conditions.
Once the outlier is removed, the box plot and summary statistics are:
The mean decreased by 0.86 bpm and the median remained the same. The standard deviation decreased by 1.81 bpm, which is about 17% of its previous value. Based on the standard deviation, the data set with the outlier removed shows much less variability than the original data set containing the outlier. Because the mean and standard deviation use all of the numerical values, removing one very large data point can affect these statistics in important ways.
The median remained the same after the removal of the outlier and the IQR increased slightly. These measures of center and variability are much more resistant to change than the mean and standard deviation are. The median and IQR measure the middle of the data based on the number of values rather than the actual numerical values themselves, so the loss of a single value will not often have a great effect on these statistics.
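The contrast between the two pairs of statistics can be tried out on a small data set. The sketch below uses made-up heart rates (not the 50 athletes' measurements) and Python's standard `statistics` module. It assumes the sample standard deviation and the "inclusive" quartile method; other conventions give slightly different numbers, and the helper name `summarize` is hypothetical.

```python
import statistics


def summarize(values):
    """Compute mean, sample standard deviation, median, and IQR."""
    # quantiles(..., n=4) returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
        "median": statistics.median(values),
        "iqr": q3 - q1,
    }


# Made-up data for illustration only (assumption), with one large value.
with_outlier = [58, 60, 62, 64, 66, 68, 70, 72, 74, 112]
without_outlier = [v for v in with_outlier if v != 112]

before = summarize(with_outlier)
after = summarize(without_outlier)
for name in before:
    print(f"{name}: {before[name]:.2f} -> {after[name]:.2f}")
```

In this made-up example, removing the single large value shifts the mean and standard deviation far more than it shifts the median and IQR, which is the pattern the paragraphs above describe.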
The source of any possible errors should always be investigated. If the measurement of 112 beats per minute was found to be taken under the right conditions and merely came from an athlete whose heart rate did not slow as much as the other athletes' heart rates, it should not be deleted, so that the data reflect the actual measurements. If the situation cannot be revisited to determine the source of the outlier, it should not be removed. To avoid tampering with the data and to report accurate results, data values should not be deleted unless they can be confirmed to be an error in the data collection or data entry process.
An outlier is a data value that is far from the other values in the data set. A value is considered an outlier if it is:
In this box plot, the minimum, 0, and the maximum, 44, are both outliers.