Hello, everyone, I'm Mr. Gratton, and thank you so much for joining me for today's maths lesson.

In today's lesson, we'll be looking at the range and how it can be affected by outliers.

The definition of an outlier will be covered throughout the lesson, but pause here to take a quick look at the definitions of the range and spread, or dispersion.

Okay, first up, we will look at what an outlier is and how this influences how representative the range is as a measure of spread.

What looks unusual about the dot plot of dataset A compared to dataset B? This value on dot plot A looks very out of place.

It is extremely large in comparison to the other data points in that dataset.

Whilst in dot plot B, there are no data points that look extremely large or extremely small compared to the rest of the data points in that dataset.

And this single point is what we call an outlier.

An outlier is a data point that is extremely large or extremely small compared to the rest of the dataset.

Visually, outliers lie very far away from where the majority of the rest of the points are clustered.

In which of these data sets is there an outlier, and what are those outliers? For the first dataset, the data point of -13 is extremely small compared to the other data points.

Outliers can be negative.

For dataset B, the data point of 94.3 is extremely large compared to the other data points.

Outliers can also be decimals.

However, dataset C does not have any outliers.

This is because all of the data points are very spread out.

A data point is only considered an outlier if it lies much further from the other data points than they lie from each other.

Here's a check for understanding.

In which of these datasets is there an outlier? Pause to choose at least one correct answer, one dataset, that has an outlier in it.

The answer is A and C.

A has one data point at 60, which is very far away from the next biggest data point of 37, and C has one value of 50, which is far away from the next biggest value of 21.

As a quick recap, the range of a dataset is the value of the largest data point, take away the value of the smallest data point.

On a dot plot, this is the same as finding the distance between the largest and smallest data points.
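
As a minimal sketch of this recap, the range calculation could be written in Python. The function name `data_range` and the sample values are illustrative assumptions, not data from the lesson.

```python
def data_range(values):
    """Return the range: the largest data point minus the smallest."""
    return max(values) - min(values)


# Hypothetical data set, chosen only for illustration.
sample = [3, 7, 8, 12, 15]
print(data_range(sample))  # 15 - 3 = 12
```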

Both these data sets look nearly identical, but the range of dot plot A is so much larger.

That one outlier at 24 is affecting the largest data point part of the range calculation, making its range more than twice that of dot plot B.

If we were to remove the outlier value of 24, what would happen to the range of dot plot A? Well, the new range of dot plot A is actually smaller than that of dot plot B.

The opposite effect is also true.

Which of these dot plots actually looks more spread out? Dot plot A is pretty evenly spread out across the whole range of 0 to 30, whilst dot plot B isn't very spread out at all, it is mostly compacted around those few middle values between 10 and 20.

It has three outlier values that influence the range.

One comparatively small value at 0, then two comparatively large values, one at 28 and one at 30.

This dataset is far less varied if you do not include these three outliers in the range calculation.

By removing these three outliers, the new range is far more representative of how varied dot plot B is.

By removing these outliers, the range of dot plot B decreases from 30 down to 10.
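
The effect described for dot plot B can be sketched in Python. The transcript does not list the individual values on the dot plot, so the list below is an assumption: it only matches what is stated, namely outliers at 0, 28 and 30, the remaining points between 10 and 20, and a range that falls from 30 to 10.

```python
def data_range(values):
    """Return the range: the largest data point minus the smallest."""
    return max(values) - min(values)


# Assumed values: only the outliers (0, 28, 30) and the 10-to-20 cluster
# are given in the lesson; the exact points in the cluster are made up.
dot_plot_b = [0, 10, 12, 14, 15, 15, 16, 18, 20, 28, 30]
outliers = {0, 28, 30}

without_outliers = [v for v in dot_plot_b if v not in outliers]

range_with = data_range(dot_plot_b)           # 30 - 0 = 30
range_without = data_range(without_outliers)  # 20 - 10 = 10
print(range_with, range_without)
```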

Okay, let's go through the model solutions to a set of questions.

What is the range of this data set? Well, the range is 97, the biggest value, take away 32, the smallest value, the range is therefore 65.

If I remove the data point of 44, will the range change? No, not at all.

The range will remain at 65.

This is because 44 is neither the largest value nor the smallest value in that dataset.

Which one of two data points could be removed to change the range? If either 32 or 97 is removed, then the range will decrease.

This is because they are the smallest and largest data points.

Okay, now that we know 32 and 97 are the ones that could be removed to change the range, which of those data points if removed will change the range the most? The answer to that is 97.

Removing the 97 would decrease the range the most from 65 down to 19.

If 32 were removed instead, the range of the data set would only decrease by four, from 65 down to 61.

Because the range decreases so much more if 97 were removed instead, it is far more likely that 97 is the outlier.
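
This comparison can be sketched in Python by removing one data point at a time and recalculating the range. The full data set is not given in the transcript, so the list below is an assumption built to agree with the stated results: it contains 32, 44 and 97, and the ranges of 65, 61 and 19. The helper name `range_without` is also just illustrative.

```python
def data_range(values):
    """Return the range: the largest data point minus the smallest."""
    return max(values) - min(values)


def range_without(values, point):
    """Return the range after removing one occurrence of `point`."""
    remaining = list(values)
    remaining.remove(point)
    return data_range(remaining)


# Assumed data set: 36 and 51 are made up so that the ranges match the lesson.
data = [32, 36, 44, 51, 97]

print(data_range(data))         # 97 - 32 = 65
print(range_without(data, 44))  # still 65
print(range_without(data, 32))  # 97 - 36 = 61
print(range_without(data, 97))  # 51 - 32 = 19
```

The data point whose removal reduces the range the most, here 97, is the more likely outlier.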

As I alluded to earlier, removing a data point does not instantly mean that the range will change.

Which data point could be removed in order to change the range? Removing a data point of 1 would not change the range as there is still a second 1 there to keep the range the same, even though 1 is the lowest data point.

Removing any of these data points won't change the range either as they are all neither the smallest nor the largest data points.

The only data point that can be removed to change the range is this, 23, the outlier.

If I were to remove the outlier, the range would become a lot smaller, at 12 rather than 22.
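
A short Python sketch can check which single data points change the range when removed. The data set here is an assumption: the transcript only tells us there are two 1s, an outlier of 23, and a range that drops from 22 to 12, so the middle values are made up to fit.

```python
def data_range(values):
    """Return the range: the largest data point minus the smallest."""
    return max(values) - min(values)


def range_without(values, point):
    """Return the range after removing one occurrence of `point`."""
    remaining = list(values)
    remaining.remove(point)
    return data_range(remaining)


# Assumed data set: two 1s, the outlier 23, and made-up middle values.
data = [1, 1, 6, 9, 13, 23]
original = data_range(data)  # 23 - 1 = 22

# Collect every distinct value whose removal changes the range.
changes_range = sorted({p for p in data if range_without(data, p) != original})

print(changes_range)            # only the outlier, 23
print(range_without(data, 1))   # 22, because a second 1 remains
print(range_without(data, 23))  # 13 - 1 = 12
```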

Okay, here's another quick check for understanding.

Match the statements below with the correct numbers that would go at the end of each of those sentences.

Pause to give each a go.

And the answers are as follows.

By removing the one outlier at 28, the range is decreased by 11.

Next check, match the statements below with the correct numbers that end each sentence.

Pause to give each of these statements a go.

The answers are as follows.

Removing a 22 from the dataset does not decrease the range, but removing the outlier of 82 decreases the range significantly.

Okay, onto some practice questions.

By focusing on the outlier, what impact will removing it have on the range of this dataset? For part C, do not actually remove the data point of 39, just consider the effect on the range if it were removed.

Pause to consider each of these questions.

For question number 2, it is similar again.

Pause to try each of these questions.

For question number 3, this dataset only has even numbers in it, how much would the range reduce if the three outliers were removed? Pause to give this question a go.

For question number 4, for this dataset containing decimals, complete each of the sentences.

Pause to look through each of these five sentences and complete them all.

Here are the answers.

15 is the outlier with the current range of 27.

Removing 39 would not change the range, but removing the outlier will drop the range to 9.

For question number 2.

22 is the outlier and the dataset has a range currently of 26.

Removing the -4 would change the range down to 25, so the range is only decreased by 1.

However, removing the outlier of 22 will drop the range all the way down to 14.

For question number 3, the outliers are 12, 72 and 74.

If the outliers were removed, the range would decrease from 62 down to 22, so the range would decrease by 40.

Here are the answers for question number 4.

The original range is 81, but removing the data point 37.7 would not change this range.

If the smaller outlier was removed, the new range would be 68, but if the larger outlier was removed, the range would decrease even further to 45.3.

The outlier that is more impactful to remove is therefore the larger outlier of 82.5.

That is some amazing work so far in today's lesson.

We've looked at outliers and how the range changes if they are removed, but we cannot just change or remove data points without valid reasons.

Let's have a look at what some of those possible reasons are.

Outliers created from errors in the data collection process must always be removed.

Outliers from data collection can sometimes be easy to spot because their value is on a different magnitude to the rest of the data points in a dataset.

For example, there could be a decimal point accidentally included that mistakenly makes the value 10 or 100 times smaller than it should be, or vice versa, a decimal point accidentally forgotten, which makes a data point 10 or 100 times larger than it should be.

Outliers from these sorts of errors may also simply make no sense in the context they were collected in, such as the height of a person being far too tall or the number of hours that something happened within a day being far more than 24.
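
One way to spot values that make no sense in context is to compare each data point with the limits the context allows. This is a sketch only: the function name `implausible`, the limits and the data are all assumptions for illustration, using the idea that nothing can happen for more than 24 hours in a day.

```python
def implausible(values, lowest, highest):
    """Return the data points that fall outside the limits set by the context."""
    return [v for v in values if not lowest <= v <= highest]


# Hypothetical data: hours of sleep per day. The 75 could be a 7.5
# recorded without its decimal point.
hours_per_day = [6.5, 7, 8, 75, 7.5]

print(implausible(hours_per_day, 0, 24))  # [75]
```

A value flagged this way still needs a person to decide whether it is an error to remove or correct.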

In this data set showing temperatures, the 152 degrees is clearly an error that is making the range massive, at 148.9 degrees.

A possible reason for the error is a missing decimal point; the data point should be 15.2 degrees instead.

By making this correction, the range reduces all the way down to 9.5 degrees.

A second reason to consider is that outliers may be omitted from the graph specifically if the readability of the graph is low because of that outlier.

Please note, omitted does not mean removed completely, just taken off the graph.

The existence of this outlier may then be shown in different ways.

This dot plot shows the number of daily website visits a blogger received.

There are no errors in this data collection process.

The outlier of 300 is due to their blog being promoted on one specific day.

It is unfair to remove this data point completely because it is a valid data point, not an error in the data collection method or a typo when representing this information on the dot plot.

But, because of this outlier, the dot plot isn't very useful.

The details of the graph cannot be read properly because the vast majority of the points are congested on the far left hand side of the graph to accommodate the space on the right for that one single data point, the outlier.

Without this outlier being shown, the dot plot is far more readable.

In cases like these, we can show the range in different ways.

For example, on this dot plot, it is mentioned separately that there is one extra data point at the number 300.

But sometimes outliers must be kept and represented.

Outliers can be important data points in the dataset.

The large range created by these important outliers shows the scale of the spread or variance in a dataset, and losing this outlier would reduce how informative the range is at describing the sheer variety of the values in that dataset.

And here's an example of this.

Notice this outlier at 125,000 pounds.

The range of salaries in this business is 105,000 pounds.

This data point is essential to keep, it isn't an error, it shows the salary of the CEO, the big boss of this business.

Keeping the salary of the CEO therefore shows the variety in how much people earn in this business.

The salary of the CEO is several times higher than the lowest paid workers, at 20,000 pounds.

If we omitted the CEO from this dot plot, we would lose the ability to make this comparison between the highest earner and the lowest earners.

Okay, here's the next check.

Sophia tracks the amount of water in litres that she drinks each day for three weeks.

Find the outlier and consider whether she should keep, remove or investigate this outlier further.

Pause here to look at the data and consider your answer for yourself.

26 is the outlier.

Would she really be drinking 26 litres of water? Sophia should remove this outlier, it is clearly an error.

Maybe the data point should have said 2.6 litres instead.

Onto some independent practice questions.

This dot plot shows the number of days each month that had air frost over 21 months.

What is the outlier and should it be removed? Pause to consider this.

This dot plot shows the number of hours of sunshine per month, all data points have been correctly measured.

What is the outlier on this dot plot, and should it be removed? Pause to consider this question.

Onto question number 3.

This dataset shows the height of 24 chimpanzees, in centimetres.

By first calculating the range, what is the outlier, and should it be removed? Pause to consider this dataset.

Question number 4.

Jun wants to represent this dataset on a dot plot and wants to see if removing any outliers impacts the range of the data that he has to plot.

Calculate two ranges, one with and one without the outlier, and suggest if Jun should represent all of the data points or whether he should remove the outliers first.

Pause to look through this data.

Onto the answers.

24 is the outlier, and he should definitely investigate this further.

Because this question was talking about the number of days of air frosts in a month, well, 24 is still less than a month, so it is still a possible value.

Perhaps it was a winter month with a lot of days of air frost.

However, because it is noticeably bigger than all the other data points, it should be something that you need to investigate to check if it was an error or not.

For question number 2, 320 is the outlier, and this outlier should definitely be kept as this month might just have been a very sunny month, and this information does need to be represented on a graph.

For question 3, the range is 196.6 centimetres.

1.4 is very definitely an outlier; I don't think a chimpanzee could be 1.4 centimetres tall.

However, 198 may be an outlier or maybe it's just a very, very big chimpanzee.

The 198 does need to be investigated further, but the data point of 1.4 should definitely be removed.

For question number 4, the original range is 279.8; however, if you remove the outliers, the range decreases to 148.2.

Jun should definitely omit the outliers from the dot plot as the range nearly halves without them; however, the outliers should still be acknowledged in case those data points just represent very rainy months.

And that is all for today's lesson.

Well done for all of your efforts.

In today's lesson, we have covered identifying outliers and their impact on increasing the range of a set, as well as situations where the outliers should and should not be removed.

Thank you so much for joining me, and I hope to see you soon for another maths lesson.

Have a good day.