Hello, my name is Dr.
Rowlandson and I'll be helping you with your learning during today's lesson.
Let's get started.
Welcome to today's lesson from the unit of graphical representations of data with scatter graphs and time series.
This lesson is called, "Outliers in Scatter Graphs," and by the end of today's lesson we'll recognise outliers and understand whether they should be included in the dataset.
This lesson will use the keyword outlier a lot.
An outlier is a data point that is extremely large or small compared to the rest of the dataset, and visually, outliers lie far away from where the majority of the results are clustered.
The lesson contains two learning cycles.
In the first part of the lesson, we're going to focus on identifying outliers in a scatter graph, and then in the second part of the lesson we'll think about what to do with them.
Let's start off with identifying outliers in scatter graphs.
Here we have a scatter graph that shows data from the Met Office about the weather in Bradford, where each point represents a month from 1941 to 2022.
Across the horizontal axis we have the total sunshine duration for each month, which is measured in hours.
And on the vertical axis, we have the mean of the daily maximum temperatures for each month, which is in degrees Celsius.
Now, an additional point is about to be plotted on this scatter graph, and this point is going to be an outlier.
Where could this point be on the graph? Pause the video while you think about this and press play when you're ready to continue.
Well, let's revisit what an outlier is.
An outlier is a data point that is extremely large or small compared to the rest of the dataset, and visually, outliers lie far away from where the majority of the results are clustered.
So in this scatter graph here, we can see that the majority of the data points are clustered together and there is a lot of blank space around these points on the scatter graph.
So there are many places where an outlier could be.
On this scatter graph, an outlier could be at any point in this shaded region here because all that's in the shaded region is far away from the rest of the points.
For example, an outlier could be any data values that are significantly greater than all the other data points.
So in this case, this would be an outlier.
Let's think about this outlier in the context of the data.
This outlier is a month that had a lot of sunshine and was also much hotter than all other months on record, so that's why it's an outlier.
An outlier could also be any data values that are significantly lower than the rest of the data points.
So in this case, this would be an outlier here.
And in the context of the data, this outlier would represent a month that had a lot less sunshine than other months, and was much colder than all other months on record too.
However, an outlier could also be within the normal range for one variable but is unusually high for the other variable.
For example, this would be an outlier here.
Now we should note that this point was not the hottest month on record, but it still stands out from the rest of the data points on that scatter graph.
The reason why is that this month had a similar amount of sunshine as other months. However, it is an outlier because it was much hotter than other months that had a similar amount of sunshine.
An outlier could also be within the normal range of one variable, but also is unusually low in the other variable.
For example, this would be an outlier here.
Once again, this point here had neither the least amount of sunshine nor the most, and it wasn't the warmest month or the coldest month, but it still stands out from the rest of the dataset.
In context, this month had a similar amount of sunshine as other months, however, it's an outlier because it was much colder than other months that had a similar amount of sunshine.
Now, Sophia says, "If I just looked at this data as numbers in a table, then I probably wouldn't notice that this point is an outlier because its values are inside the range for both of the variables.
But by viewing it on a scatter graph, I can see more clearly that the data for this month was quite different to the rest of the months in the dataset." So for bi-variate data, it can be easier to spot an outlier by viewing the data on a scatter graph than it is by viewing the data as a list of numbers.
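If you wanted to try Sophia's idea on a computer, here is a minimal sketch in Python. The numbers are made up for illustration and are not the Met Office data. The function names and the cut-off of 2 are assumptions too, not a standard rule. The sketch fits a line of best fit, measures how far each point sits above or below it, and flags any point whose gap is much bigger than the typical gap.

```python
from math import sqrt

def fit_line(xs, ys):
    """Least-squares line of best fit; returns (gradient, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    gradient = sxy / sxx
    return gradient, mean_y - gradient * mean_x

def far_from_trend(xs, ys, cutoff=2.0):
    """Indices of points whose vertical gap from the line is unusually large.

    The cut-off of 2 is an illustrative assumption, not a fixed rule.
    """
    gradient, intercept = fit_line(xs, ys)
    gaps = [y - (gradient * x + intercept) for x, y in zip(xs, ys)]
    typical = sqrt(sum(g * g for g in gaps) / len(gaps))  # root-mean-square gap
    return [i for i, g in enumerate(gaps) if abs(g) > cutoff * typical]

# Invented data: sunshine in hours, temperature in degrees Celsius.
sunshine = [20, 40, 60, 80, 100, 120, 140, 160, 180, 100]
temperature = [2, 4, 6, 8, 10, 12, 14, 16, 18, 16]

flagged = far_from_trend(sunshine, temperature)
print(flagged)  # index of the last month in the lists
```

The last month has 100 hours of sunshine and 16 degrees, so both values are inside the range of the table, yet it is the only point flagged because it is much warmer than the other month with 100 hours of sunshine.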
It's also worth noting that a scatter graph may present multiple outliers.
We can see a few outliers on this scatter graph here.
Multiple outliers could be distinct from each other.
For example, these two outliers are quite different to each other, and multiple outliers could also be clustered together.
However, not every single scatter graph will have outliers.
A scatter graph may also present no outliers.
A mistake that is often made with data is that people assume that the greatest or the lowest values in a dataset are definitely outliers but they might not necessarily be so.
For example, let's take a look at these two points here which are not outliers.
These are at the edges of the dataset.
One of them is the month with the lowest mean daily maximum temperature, and the other is the month with the greatest total sunshine duration.
However, they are not outliers because they are still close to where the majority of the data is clustered.
So let's check what we've learned.
True or false? The highest value in a data set is always an outlier.
Is that true or false? Pause the video and make a choice, and then press play when you're ready for the second part of this question.
The answer is, false, so what's our justification? Here are two options to choose from.
Pause the video while you make a choice and press play when you're ready for an answer.
It's false because it is only an outlier if it is significantly greater than where the majority of the data is clustered.
True or false? A data set may have more than one outlier. Is that true or false? Choose a justification as well.
Pause the video while you make your choice and press play when you're ready for an answer.
The answer is true, and that's because there may be more than one data point that has values which are extremely large or extremely small.
True or false, a dataset can have no outliers.
Pause the video while you choose either true or false and a justification, and press play when you're ready for an answer.
The answer is true, and that's because there could be no points that differ significantly from where the majority of the data is clustered.
Okay, it's over to you now for task A.
This task contains two questions and here is question one.
You have six scatter graphs, and on each scatter graph you need to circle any points which could be considered as outliers.
Pause the video while you do this and press play when you are ready for question two.
Here is question two.
The scatter graph shows data about population and greenhouse gas emissions for a set of towns, and each point in the scatter graph represents a town.
Now, there are four outliers on this scatter graph which have been circled and labelled A to D.
And what I'd like you to do please, is explain what each outlier means about the town in the context of this data.
You might need to write a sentence for each one.
Pause the video while you do this and press play when you're ready to go through some answers.
Okay, let's see how we got on with that.
In this question we had to circle any points which could be considered as outliers.
In the first scatter graph, this would be an outlier because its values in both variables are significantly lower than the rest of the data points.
That's why it stands out.
In the second scatter graph, this would be our outlier because its values in both variables are significantly greater than the rest of the data points, and that's why it stands out.
In the third scatter graph, we have two outliers.
Now these points do have values that are within the range of the rest of the data, but the reason why they are outliers is because when you compare the value of the dependent variable to other points that have a similar value in the independent variable, these points stand out from the rest.
Now with the bottom three scatter graphs, deciding what is and isn't an outlier isn't quite so clear, and that's because data in real life can be a bit messy and sometimes data clusters seem to fade out rather than suddenly stopping.
But let's try our best with these.
In the bottom left graph, what we can see is we have two outliers.
There is one at the top right corner whose values are significantly greater than the values in both variables for all other data points.
And we also have another point that is within the range of the data, but it stands out from the rest because its dependent variable is significantly greater than other points that have a similar independent variable.
In the fifth graph, there are three points that seem to stand out.
One of them is certainly an outlier, and that's the one at the top. With the other two, you might agree or disagree depending on how much you think they stand out from the rest.
And with this last scatter graph, you may have different answers to other people who have tried this because it's quite a messy scatter graph.
These two points here, these are definitely outliers, but there might be some other ones you thought as well.
For example, any of these could be argued to be an outlier for one reason or another, especially when basing it just on how it appears on the page.
Alternatively, you could argue that none of these points are outliers because the data is so spread out throughout the scatter graph.
And in question two, "The scatter graph shows data about the population and greenhouse gas emissions for a set of towns." And there are four outliers which are circled, which you need to explain in the context of the data.
Let's start off with A.
Town A has much more greenhouse gas emissions than other towns with a similar population.
B has a much higher population and higher greenhouse gas emissions than the rest of the data.
C has much less greenhouse gas emissions than other towns with a similar population, and D has a much lower population and greenhouse gas emissions than the rest of the data.
Great work so far.
Now let's move on to the second part of today's lesson, which is all about deciding what to do with outliers.
An outlier could indicate that there's an error in the dataset.
For example, here we have a scatter graph that shows data from the Office for National Statistics, the ONS, where each point represents a region in Wales.
Along the horizontal axis we have the population for each region, which is in the thousands.
And on the vertical axis we have the number of cell towers that can be found in each region.
And what we can see from this scatter graph is that the majority of the data points are clustered together very tightly in the bottom left corner, whereas there's one point which is far, far away from the rest of them in the top right corner.
Now this point is definitely an outlier, but it's also an error as well.
What do you think might have caused this error to occur? Pause the video while you think about it and then press play when you're ready to continue.
Well, if you imagine a data table for this bi-variate data, you'd have the population and the number of cell towers for each region, where each row in this table represents a different region.
And it's not uncommon for the bottom row of a table to include the totals for all the data put together, which in this case would be the total population and the total number of cell towers for the whole of Wales.
And this point was a plotting of that particular data.
So in this case, the error's caused by including the data from the totals row in the scatter graph.
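A check for this kind of error could be sketched in Python as follows. This is only an illustration: the region names and numbers are invented, not the ONS data, and the helper names are hypothetical. The idea is that if the last row's values equal the sums of all the rows above it, it is probably a totals row and should not be plotted as a point.

```python
def is_totals_row(last, body):
    """True if the last row's numbers equal the column sums of the rows above it."""
    _, population, towers = last
    return (population == sum(row[1] for row in body)
            and towers == sum(row[2] for row in body))

def drop_totals_row(rows):
    """Return the rows to plot, leaving out a trailing totals row if there is one."""
    if len(rows) < 2:
        return list(rows)
    *body, last = rows
    return body if is_totals_row(last, body) else list(rows)

# Invented table: (region, population in thousands, number of cell towers).
table = [
    ("Region A", 120, 30),
    ("Region B", 95, 22),
    ("Region C", 150, 41),
    ("Total", 365, 93),
]

to_plot = drop_totals_row(table)
print(len(table), len(to_plot))  # the original table is left unchanged
```

Checking the sums rather than looking for the word "Total" is a design choice for this sketch. A real table might label the row differently, or round its totals, so the check would need adapting.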
The error is removed from this data and a new scatter graph is created, and here it is. However, it still contains an outlier.
So does that mean that this point is also an error? Well, no, this outlier is not an error.
So why might the values of this data point be significantly greater than the rest of the data, if it's not an error? Pause the video while you think about it and press play when you're ready to continue.
Well, let me tell you that this point represents Cardiff, which is the capital city of Wales.
So the data for this point is correct and it just represents a region that has a much higher population than the rest of the regions, and many more cell towers than the rest of the regions as well.
So sometimes an outlier can just be an unusual result rather than an error.
It is also worth noting that errors in scatter graphs do not always present themselves as outliers.
This scatter graph does actually contain an error, but it's not the outlier.
It's this point here, this point is an error.
This point shows a population of 210,000 and 3,000 cell towers, but this data is wrong and the point should really be plotted here.
This region really has a population of 120,000 and has 3,000 cell towers.
So what do you think might have caused this error? Pause the video while you think about it and press play when you're ready to continue.
This error was a typo, and typos are extremely common.
People make them when they are typing on their phones, on tablets, and on computers as well.
In this case, the digits two and one were simply transposed.
They accidentally pressed the two key before the one key when they were typing the data in, and that's caused the data to be plotted in the wrong place, but it still wasn't an outlier.
It was still within the rest of the data range.
So not all outliers are errors and not all errors are outliers.
Let's take a look at another example.
This scatter graph shows data from the Office for National Statistics where each point represents a region in the UK.
The horizontal axis shows the median annual income for each region and that's in the thousands.
And the vertical axis shows the average house price in each region as well.
And you can see that this scatter graph has an outlier in the top right corner.
This outlier is a region that has a much higher median annual income than the rest of the regions, and also has much higher average house prices.
And this outlier is not an error.
So what should we do with it? Decisions about what to do with an outlier may depend on what the scatter graph is going to be used for.
Keeping the outlier visible can be helpful if you want to highlight the difference between this region and the rest of the UK.
However, keeping this outlier visible can cause the rest of the data to be squeezed together into a small space.
And this can make it difficult to take precise readings if you're using the scatter graph for interpolation.
Another option could be to zoom in on the rest of the data points.
Zooming in on the majority of data points can make the outlier no longer visible, but it's worth noting that it's not been completely removed from the dataset.
Its data hasn't been deleted from the table, so it can be retrieved at any point.
All we need to do is just zoom back out again on this scatter graph.
So why might we want to zoom into a point where we can't see the outliers? Well, this can be helpful for taking more precise readings if you are using the graph for interpolation, and it can also make the trend for the majority of data clearer to see.
But remember, the outlier is still in the dataset so it can be revisited if needed.
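The difference between zooming in and deleting can be sketched in a few lines of Python. The values here are invented for illustration, and the function name `zoom` is just an assumption for this example. Zooming builds a separate list of the points inside a chosen window, while the full dataset stays exactly as it was.

```python
def zoom(points, x_range, y_range):
    """Return only the points inside the viewing window; the input list is not changed."""
    (x_low, x_high), (y_low, y_high) = x_range, y_range
    return [(x, y) for x, y in points
            if x_low <= x <= x_high and y_low <= y <= y_high]

# Invented data: (median annual income in thousands, average house price in thousands).
points = [(25, 180), (28, 210), (30, 230), (32, 250), (70, 1300)]

view = zoom(points, x_range=(20, 40), y_range=(150, 300))

print(view)         # the four clustered regions
print(len(points))  # still 5, so the outlier can be revisited by zooming back out
```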
Some people worry that the presence of an outlier can have a detrimental effect on the line of best fit.
In other words, they worry that the outlier makes the line of best fit less accurate. But the position of an outlier can have very little effect on the line of best fit, so long as there is a sufficiently large amount of data.
For example, here we have the same data on the scatter graph, again with a line of best fit drawn, and that line of best fit has been calculated by a computer, so it is in the most optimal place and it includes the outlier.
Let's see what happens as we move that outlier further up the scatter graph.
So that outlier just moved up one big square, which represents 100 pounds.
And as it moved, did you spot how the line of best fit changed? I wouldn't blame you if you didn't because it barely changed at all.
Each time the height of the outlier increases by 100, the gradient of the line of best fit only changes ever so slightly.
Let's watch again.
It's only a subtle difference and again, and again, and if we remove it completely, it has very little effect on the line of best fit too.
And that's because there is a sufficiently large amount of data.
If you only have a small amount of data, then the outlier will have a bigger effect on the line of best fit, which is why it's always helpful to get as much data as you can.
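Here is a rough sketch in Python of that comparison. It assumes the computer's line of best fit is a least-squares line, which the lesson doesn't actually state, and it uses invented points that lie exactly on the line y = 2x, with one outlier placed well above it. It then measures how much the gradient shifts when the outlier is added to 5 points compared with 500 points.

```python
def gradient_of_best_fit(points):
    """Gradient of the least-squares line through a list of (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx

def cluster(n):
    """n evenly spaced points lying exactly on y = 2x, for x from 0 to 10."""
    return [(10 * i / (n - 1), 20 * i / (n - 1)) for i in range(n)]

outlier = (10, 40)  # invented point, 20 units above the line at x = 10

def shift_caused_by_outlier(n):
    """How much the gradient changes when the outlier joins n clustered points."""
    base = cluster(n)
    return gradient_of_best_fit(base + [outlier]) - gradient_of_best_fit(base)

small_shift = shift_caused_by_outlier(5)
large_shift = shift_caused_by_outlier(500)
print(small_shift, large_shift)
```

With only 5 points the outlier moves the gradient by roughly 1, whereas with 500 points it moves it by only a few hundredths.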
Now, if you want to explore this for yourself and you have access to the slide deck, you can click on this link here, which takes you to a (indistinct) version of what you've just seen, where you can move that outlier around the page wherever you like, and see what effect it has on the line of best fit.
If you want to do that, pause the video and do it now.
Otherwise, continue watching.
So let's check what we've learned.
True or false, an outlier means there must be an error in the data set.
Pause the video while you choose either true or false, and then press play for the second half of this question.
The answer is, false, so what's our justification? Here are two options.
Pause the video while you make your choice and press play when you're ready for an answer.
It's false because an outlier might just be an unusual result in the data.
True or false, outliers should never be deleted from a data set, is that true or is it false? Pause the video while you choose either true or false, and then press play when you're ready to see some justifications.
The answer is, false, so what's our justification? Here are two options.
Pause the video while you choose and press play when you're ready for an answer.
It's false because an outlier should only be deleted from a data set if it was recorded in error.
True or false, outliers should always be presented visibly in a scatter graph, is that true or is it false? Pause while you choose and press play when you're ready for the second half of this question.
The answer is, false, so why is that? Here are two options to choose from.
Pause the video while you make a choice and press play when you're ready for an answer.
It's false because you may choose to zoom in on a specific part of the scatter graph without the outlier depending on what you are using the graph for.
Okay, it's over to you now for task B, this task contains two questions, and here is question one.
You are presented with a scatter graph that shows data from the Office for National Statistics, the ONS, where every data point represents a region in England or Wales.
The horizontal axis shows the population for each region in the thousands, and the vertical axis shows how many properties were sold in that region in the thousands.
Now we have Izzy and Alex who want to use a scatter graph for different purposes.
Izzy wants to use a scatter graph to estimate the property sales for a region with different populations by using interpolation.
Whereas, Alex wants to use a scatter graph to highlight regions that are highly populated, but there are not many properties sold.
So, based on this, answer questions A, B, and C.
You may also notice that next to the scatter graph there's an icon for a link, which means if you have access to the slide deck, you can click on the scatter graph and open up a Desmos version if you want to explore this in more detail.
But you don't need to use that to answer the questions.
Pause the video while you work through this and press play when you're ready for question two.
And here is question two.
We have another scatter graph with data from the Office for National Statistics where each point represents a region in England or Wales, but this time the horizontal axis shows the median disposable income, which is in the thousands.
And the vertical axis shows the mean life expectancy, which is in years.
Once again, the most extreme outlier is circled.
And you have two questions to answer about this outlier.
Pause the video while you work through this and press play when you're ready to go through some answers.
Okay, let's see how we got on with that.
So in part A, we have to draw a line of best fit, which would look something a bit like this.
Yours might be a tiny bit different, but hopefully it's pretty similar.
And then you have to explain whether each person may want to zoom in so that the outlier can't be seen.
Well, Izzy wants to use the graph to estimate the property sales for regions with different populations by interpolation.
Izzy may want to zoom in so that the outlier cannot be seen because it would allow her to take more accurate readings from the line of best fit.
Whereas Alex wants to use the graph to highlight regions that are highly populated, but there are not many properties sold.
And what you notice about that outlier is, it is below the line of best fit.
So Alex may want to keep the outlier visible because it is an example of a region with a very, very high population where the number of property sales is below the line of best fit.
And then in part C, you have to circle some other points that may be useful for Alex's intentions with the graph.
Well, you are looking for points that are below the line of best fit, but there are lots of points that are below the line of best fit.
Approximately half of them are usually below.
So you're looking for points that are quite distinct from the rest of the dataset.
Here are some options.
Now these are not necessarily outliers, but they do stand out for being lower than other points with a similar population.
And then question two, where we have median disposable income and mean life expectancy for each of these regions, and we have one extreme outlier circled.
In part A, you have to give a reason why someone may want the outlier to be visible on the graph.
One reason could be to highlight an example of a wealthy region that has a similar mean life expectancy as the majority of the other regions in the dataset.
Or it could be to show that life expectancy does not keep increasing at a constant rate as the median disposable income increases.
And in part B, you had to give a reason why someone may want the outlier to not be visible on the graph.
Well, that could be so they can zoom in on the rest of the dataset, and that would make it easier to take more precise readings when using interpolation.
Or, to zoom in on the rest of the dataset and observe the trend for the majority of data points more clearly.
Wonderful work today, now let's summarise what we learned.
An outlier is a data point that is extremely large or small compared to the rest of the dataset.
And outliers can be visually seen on a scatter graph because they lie far away from where the majority of points are clustered.
But outliers do not necessarily occur at the start or end of a dataset.
When outliers are identified, they must be considered carefully and only removed if they were recorded in error.
You could, however, zoom in on the majority of data points when you're presenting a scatter graph so that the outlier is not visible, but remember that the outlier is not removed and can be revisited again if needed.
Thank you very much, have a great day.