Oh, hello, my name is Dr. Rolinson, and I'm happy to be helping you through today's lesson.

Let's get started.

Welcome to today's lesson from the unit on graphical representations of data with scatter graphs and time series.

This lesson is called interpolation versus extrapolation.

And by the end of today's lesson, we'll be able to explain the difference between these two terms, while also acknowledging the limitations with each.

Now, during this lesson, we're going to use the words interpolation and extrapolation a lot.

And you may have come across these terms before, you may not have done.

We'll explore these words in more depth later, but what you might wanna do is pause the video while you read what these two words mean, and then press play when you're ready to continue.

This lesson contains two learning cycles.

In the first learning cycle, we're going to use interpolation and extrapolation for ourselves.

Then in the second learning cycle, we're going to think about the problems with interpolation and extrapolation, and what they may mean in context.

Well, let's start off with using interpolation and extrapolation.

Here we have a scatter graph that shows data from the Met Office about the weather in Heathrow for each month from 1941 to 2022.

So each point on this scatter graph represents a month.

The horizontal axis shows us how much sunshine there was in each month, that's measured in hours.

And the vertical axis shows us what the mean of the daily maximum temperatures was in degrees Celsius for that month as well.

And what we can see here is that this graph appears to show a positive correlation between the total sunshine duration and the mean daily maximum temperature.

Therefore, we can draw a line of best fit.

When bivariate data has correlation, a line of best fit can be used to make predictions about one variable based on the other.

Now, what you may notice with this scatter graph is that part of this line of best fit is going through the data points that have previously been recorded, but another part of this line of best fit is beyond any of the data points that have ever been recorded.

And that's what we're talking about here when it comes to interpolation and extrapolation.

Let's take a look in a bit more depth.

When predictions are between values that are inside the range of existing data, the process is called interpolation, where the prefix inter in the word usually refers to things being inside something.

In this case with our scatter graph, all of our data points are between these two dashed lines you can see on the graph now.

So interpolation is estimating values that lie within that range.

For example, for a month with 200 hours of sunshine, we could predict that the mean maximum daily temperature would be approximately 21 degrees Celsius by using the line of best fit.

And because 200 is inside the range of existing data, that process was interpolation.

However, when predictions are based on values that are outside the range of existing data, this process is called extrapolation, where the prefix extra usually refers to something being outside.

So in this case, extrapolation is referring to estimating values that are on the outside of these dashed lines here, because that is outside the range of data.

For example, for a month with 360 hours of sunshine, we could predict that the mean maximum daily temperature will be approximately 36 degrees if we use that line of best fit.

And because 360 hours is outside the range of existing data, it is extrapolation.
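To make those two predictions concrete, here is a minimal Python sketch. It assumes the line of best fit passes exactly through the two approximate readings quoted above (200 hours giving about 21 degrees, and 360 hours giving about 36 degrees); the real line on the graph may differ a little, and the name predict_temperature is made up for illustration.

```python
# Assumption: the line of best fit passes through the two approximate
# readings quoted in the lesson: (200 hours, 21 degrees C) and
# (360 hours, 36 degrees C). The real line may differ slightly.
X1, Y1 = 200, 21
X2, Y2 = 360, 36

SLOPE = (Y2 - Y1) / (X2 - X1)   # degrees Celsius per hour of sunshine
INTERCEPT = Y1 - SLOPE * X1     # value of the line at 0 hours of sunshine


def predict_temperature(sunshine_hours):
    """Read a predicted mean daily maximum temperature off the line."""
    return INTERCEPT + SLOPE * sunshine_hours


print(predict_temperature(200))  # inside the range of data: interpolation
print(predict_temperature(360))  # outside the range of data: extrapolation
```

Notice that the arithmetic works just as happily for 360 hours as it does for 200 hours. Nothing in the calculation itself tells us that 360 is outside the range of recorded data.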

Now, you may already be thinking about some problems with that prediction, but we'll come to that later.

So let's contrast these two terms with each other a bit more.

Interpolation may be used to make predictions for situations which are similar to what has previously been observed.

In other words, a typical case.

When we use interpolation, we are estimating values inside the range of existing data, so we are looking at cases that are similar to ones we have previously seen.

Whereas extrapolation may be used to make predictions for situations which are beyond, or unlike what has previously been observed.

In other words, extreme or unusual events.

When we use extrapolation, we are estimating values outside of the range of data.

In other words, we're estimating values based on things we have not seen before.

For example, the scatter graph here shows some data taken from the Office for National Statistics, the ONS, where each point represents a different region in the UK.

The horizontal axis shows us the median disposable income for each region.

Disposable income usually refers to, after you finish paying your rent or your mortgage and bills and so on, how much money you have left over.

And on this scatter graph, each number represents so many thousands of Pounds.

And the vertical axis shows the mean life expectancy for people in each region, and that's in years.

Aisha says, "In a region where the median disposable income is 12,000 Pounds, we could predict the mean life expectancy to be 77 years." Has Aisha used interpolation or extrapolation here? Pause the video while you think about it, and press play when you're ready to continue.

Well, according to data that is available here, 12,000 Pound is lower than the median disposable income for any of the regions that are plotted.

Therefore, this would be extrapolation.

Sam says, "In a region where the median disposable income is 21,000 Pound, we could predict the mean life expectancy to be 81 years." Is that interpolation or extrapolation? Pause the video while you think about it, and press play when you're ready to continue.

Well, 21,000 Pound is in the range of existing data here.

We can see we have median disposable incomes both lower and higher than 21,000 Pounds here.

So this would be interpolation.

And Jun says, "In a region where the median disposable income is 27,000 Pound, we could predict the mean life expectancy to be 84 years." Is this interpolation or extrapolation? Pause the video while you think about that, and press play when you're ready to continue.

Well, for the data that we have here, there are no regions where the median disposable income is as high as 27,000 Pounds.

So Jun is estimating based on values outside the range of existing data, and that means this is extrapolation.

Extrapolation occurs when predictions are based on values that are either less than the lowest recorded data, or greater than the highest recorded data for the independent variable.

In other words, if these two dashed lines represent the lowest and the highest recorded value in the independent variable, when we make predictions based on values in between these, it is interpolation.

And when we make predictions based on values that are lower than the lowest value, or higher than the highest value, it is extrapolation.
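That rule can be sketched in a few lines of Python. This is only an illustration: the function name classify_prediction is made up, and the list of incomes is invented rather than being the real ONS values behind the graph.

```python
def classify_prediction(x, recorded_x):
    """Say whether a prediction made at x is interpolation or extrapolation.

    x is a value of the independent variable, and recorded_x holds the
    values of the independent variable that were actually recorded.
    """
    lowest, highest = min(recorded_x), max(recorded_x)
    if lowest <= x <= highest:
        return "interpolation"
    return "extrapolation"


# Hypothetical median disposable incomes in thousands of pounds, invented
# for illustration. They are NOT the real ONS values from the graph.
incomes = [14.5, 16.0, 17.2, 18.9, 20.3, 22.1, 23.8, 25.0]

# The incomes used by Aisha, Sam and Jun.
for income in (12, 21, 27):
    print(income, classify_prediction(income, incomes))
```

The check only ever looks at the independent variable. It doesn't matter what the predicted value of the dependent variable turns out to be.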

So let's check what we've learned.

Blank is a process of estimating unknown values that are outside the range of existing data.

What word goes in the blank? Pause the video while you write it down, and press play when you're ready for an answer.

The answer is extrapolation.

Extrapolation is a process of estimating unknown values that are outside the range of existing data.

Here we have a scatter graph with a line of best fit, and it's been used to obtain estimates with values A, B, C, and D.

Which of these values have been obtained by interpolation? And there may be more than one answer.

Pause the video while you make your choices, and press play when you're ready for an answer.

The answers are B and C.

For both A and D, those values are taken from readings that are outside the range on the independent variable.

Okay, it's over to you now for task A.

This task contains two questions, and here is question one.

Here we have a scatter graph that shows data taken from the Office for National Statistics, the ONS, where each point represents a region in the UK.

The horizontal axis shows us the population for each region, which is given in thousands.

And the vertical axis shows the amount of transport emissions recorded in that region, with its units given there.

And the line of best fit is provided on this scatter graph.

You need to use the line of best fit to estimate the transport emissions for regions with the populations given below.

But as well as writing down your estimate for each answer, also write whether you used interpolation or extrapolation.

For each answer, you wanna give a number and then a word, interpolation or extrapolation.

Pause the video while you do this, and press play when you're ready for question two.

Here is question two.

Here we have a scatter graph that shows data from the ONS, where each point represents a region in the UK.

And if you want to, you can click on the image in the slide deck to open up a Desmos version of it, to explore it in more detail, but you don't necessarily need to.

In part A, you need to predict the median house price for each region listed based on its median annual income, and state whether each prediction is based on interpolation or extrapolation.

And then part B, I want you to consider, why could this data not be used to predict the median house price for a region with a median income of 18,000 Pound? Pause the video while you do that, and press play when you're ready to go through some answers.

Okay, let's see how we got on with that then.

Here's question one.

We need to use a line of best fit to estimate transport emissions, and say whether we're using interpolation or extrapolation.

For part A, we would get an estimate of 300 of those units, and it would be interpolation.

For part B, our estimate will be around 550 units, which would be interpolation.

For part C, 680 units, and that would be interpolation.

For part D, 800 units, and this now would be extrapolation.

And for part E, it is 1,050 units, and this would be extrapolation.

Parts A, B and C are interpolation because they are within the range of data, but 600,000 is just beyond the range of data in the independent variable.

So that's extrapolation, and so is anything that is greater than that.

And then question two, we need to use the line of best fit to predict the median house price for each region based on its median annual income, and state whether it is interpolation or extrapolation.

In Oakville, the median annual income is 21,000 Pound.

So if we used our line of best fit, it would give us an estimate of 50,000 Pound for its median house price.

And that would be extrapolation, because 21,000 Pound is lower than the lowest recorded value in the independent variable.

For Acornfield, the median annual income is 33,000 Pound.

So using our line of best fit, it will suggest the median house price is 400,000 Pound.

And that would be interpolation, because 33,000 Pound is inside our range of existing data.

And in Ashton, the median annual income is 50,000 Pound.

So using our line of best fit, it will suggest that the median house price will be 890,000 Pound.

And this would be extrapolation, because 50,000 Pound is greater than the highest recorded value in our independent variable.

And then we need to consider why this data could not be used to predict the median house price for a region with a median of 18,000 Pound as its income.

Well, we could base our answer either on what we can see in the graph, or on the context.

Visually, the line of best fit only starts at 19,000 Pounds, so we can't really use it to make a prediction for 18,000 Pounds.

Or, contextually, we could talk about how, even if we could extend that line of best fit further so it does cover 18,000 Pound, the prediction would be in the negatives.

It would be a negative value, and house prices cannot be negative.
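If you'd like to check that with some arithmetic, here is a rough Python sketch. It assumes the three estimates quoted in the answers above (21 giving 50, 33 giving 400, and 50 giving 890, all in thousands of pounds) lie on or very close to the line of best fit, and it rebuilds a straight line from just those three readings. The real line on the graph comes from the full dataset, so treat this as an approximation; the function name fit_line is made up for illustration.

```python
# Assumption: the three (median income, median house price) estimates
# quoted in the answers, in thousands of pounds, lie on or very near the
# line of best fit. The real line is fitted to the full ONS dataset.
estimates = [(21, 50), (33, 400), (50, 890)]


def fit_line(points):
    """Least-squares straight line through the points: (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(estimates)
price_at_18 = intercept + slope * 18  # thousands of pounds

print(round(slope, 1), round(intercept, 1), round(price_at_18, 1))
```

Under these assumptions, the predicted house price for an income of 18,000 Pounds comes out below zero, which cannot happen in real life.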

Okay, great job so far.

Hopefully you're feeling really confident now with using interpolation and extrapolation, but we don't wanna get too overconfident because there are some problems with each which we're going to acknowledge now.

Here we have a scatter graph which shows data from the Office for National Statistics, the ONS, with each point representing a region in the UK.

The horizontal axis shows us the median disposable income for each region, and each number on the scale represents that many thousands of Pounds.

Disposable income refers to how much money is left over in the year after you've paid for your rent or mortgage and bills and so on, how much money you've got to spend.

And the vertical axis shows us the mean life expectancy, and that is in years as well.

Lucas says, "If we keep extending the line of best fit, we could use extrapolation to predict the mean life expectancy in regions where the median disposable income is over 25,000 Pound." I wonder if you can anticipate what problems there could be with Lucas's idea, what could be wrong with it? Pause the video while you think about this, and press play when you're ready to continue.

The problem here is it can be difficult to predict how trends might continue beyond the range of available data.

Therefore, we should be really cautious when using extrapolation.

Just because we can see a pattern or a trend within the data we have, it doesn't necessarily mean it'll continue like that in the future.

For example, the dependent variable might continue to increase at a constant rate, like it has done so far.

It might continue to increase, but at a slower rate once it gets to a certain value.

Or it might reach a maximum level of some sort, that it can't get any higher than or is unlikely to get higher than.

So that causes a bit of a problem when it comes to using extrapolation, because we don't really know what the data is likely to do once it gets beyond the highest point, or beyond the lowest point of observed data so far.

One thing that can help us is context.

Context could be helpful when considering whether a trend is likely to continue in the same way beyond the range of available data.

Lucas says, "Mean life expectancy is unlikely to keep increasing at a constant rate, otherwise millionaires would live for thousands of years." And in fact, this scatter graph doesn't really contain all the points from the dataset from the ONS.

These ones have just been chosen to illustrate this point.

If we plot the rest of the points, the scatter graph now contains additional data from the ONS, which includes regions of a median disposable income of over 25,000 Pounds.

And what we can see here is that trend does not continue in the same way for this additional data.

Lucas says, "Now there is more data, I can see that the mean life expectancy stops increasing once the median disposable income for a region reaches around 25,000 Pound." I can also see now that a linear correlation is not a good model for the entire dataset, and that's because it's not increasing at a constant rate throughout.
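To see why Lucas was right to be suspicious, here is a small Python sketch of what happens if we keep extending a straight line. It assumes the line of best fit passes through the two predictions that Sam and Jun made earlier in the lesson (21,000 Pounds giving 81 years, and 27,000 Pounds giving 84 years); the real line may differ slightly, and the function name is made up for illustration.

```python
# Assumption: the line of best fit passes through the two predictions
# quoted earlier in the lesson, with income in thousands of pounds:
# (21, 81 years) and (27, 84 years). The real line may differ slightly.
SLOPE = (84 - 81) / (27 - 21)   # years of life per extra 1,000 pounds
INTERCEPT = 81 - SLOPE * 21


def extrapolated_life_expectancy(income_thousands):
    """Blindly extend the straight line to any income, sensible or not."""
    return INTERCEPT + SLOPE * income_thousands


# 1_000 means an income of one million pounds, 10_000 means ten million.
for income in (25, 100, 1_000, 10_000):
    print(income, extrapolated_life_expectancy(income))
```

With this line, every extra 1,000 Pounds adds half a year of life, so an income of ten million Pounds would give a life expectancy of over 5,000 years. The line has no way of knowing that the real trend levels off.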

Now, if you would like to explore this dataset some more, you can click on the image in the slide deck to load up a Desmos version, where you can zoom in and zoom out and look at different aspects of datasets that interest you.

Here we have another scatter graph.

This scatter graph shows data taken from the ONS, where each point represents a region in England and Wales in 2023.

We have the median annual income across the horizontal axis, and that's the entire income they have, not the disposable income.

And we have the median house price going up the vertical axis, and both of these are showing thousands of Pounds on the scale.

Alex says, "There is a region in the USA where the median annual income is equivalent to 30,000 Pounds." He says, "By using interpolation with this graph, I can predict the median house price in that region is equivalent to 300,000 Pounds." What could be wrong with Alex's idea here? Pause the video while you think about it, and press play when you are ready to continue.

Well, the problem here is really to do with the limitations to this particular sample.

All this data was taken from regions in the UK, it doesn't include data taken from other parts of the world.

Therefore, using interpolation might only provide valid predictions for situations where the context or conditions are similar to the data where it has been collected.

House prices can vary between countries.

Sometimes it's more expensive to buy a house in one country than it is to buy in another.

And the data in this scatter graph is for regions only in the UK, so it might not transfer directly to house prices in the USA.

So based on that, you might now be thinking that maybe you can't make predictions for other countries using this data, but we could accurately predict values within the UK using interpolation.

Well, there's a slight problem to that too.

And that is even within the UK, house prices differ depending on where a region is.

So, for example, this data here, the green spots represent regions in the south of England, and the purple crosses represent regions in the north of England.

And you can see there's quite a bit of a difference in the house prices and median annual incomes for these two regions here.

And if we were to plot a line of best fit for each subgroup separately, they wouldn't have the same line of best fit.

So that means using interpolation can still be problematic, even if we are making predictions within the UK.

Because, for example, if a region has a median annual income of 30,000 Pounds, and we are trying to predict what the median house price might be in that region, well, if it's in the north, then the median house price is likely to be around 250,000 Pounds.

And if it's in the south, it's likely to be around 375,000 Pound.

That's quite a big difference in money, which shows that if we just look at the entire UK as a whole with a single line of best fit, and make predictions based on that line of best fit, our estimates may be quite imprecise depending on whereabouts in the UK that region is.

Therefore, the more contextual information you can have about your data, the more confident you can be with the validity of any predictions you make from it.

Now, if you'd like to explore this dataset in a bit more depth for yourselves, maybe look at different regions of the UK that interest you, feel free to click on the image in the slide deck and access a Desmos version of it.

So, let's check what we've learned.

True or false? Extrapolation always produces valid predictions when a set of bivariate data has correlation.

Pause the video while you choose true or false, and then press play when you're ready for the second part of this question.

The answer is false, so what's your justification? Here are two options to choose from.

Pause the video while you make your choice, and press play when you're ready for an answer.

It's false because it can be difficult to predict how a trend might continue beyond the range of available data.

Here we have a scatter graph, which shows data from the ONS, where each point represents a region in the UK.

We have population in the thousands across the horizontal axis, and transport emissions on the vertical axis with its units given.

Jacob wants to use this data to predict the transport emissions for a region in Denmark that has a population of 200,000.

What problem could there be with this? Pause the video while you write down what you think the problem is, and then press play when you're ready for an answer.

Once again, this is a problem in terms of the limitations of the sample.

This sample only focuses on regions in the UK, whereas Jacob wants to make a prediction for a region in Denmark.

So transport emissions may not be the same in other countries such as Denmark.

Okay, it's over to you now for task B.

This task contains one question, and here it is.

We have a scatter graph that shows data from the Met Office about the weather in Armagh, which is in Northern Ireland.

And each point represents a month from 1941 to 2022.

Across the horizontal axis, we have the total duration of sunshine for each month, which is in hours.

And on the vertical axis, we have the mean of the daily maximum temperatures for each month in degrees Celsius.

And we have Andeep and Sofia.

Andeep plans to use this data to predict what the mean daily maximum temperature would be in Armagh in a month which had 300 hours of sunshine.

Whereas Sofia plans to use this data to predict what the mean daily maximum temperature would be in Spain for a month with 150 hours of sunshine.

Write a sentence describing a problem with each person's plan, a sentence for Andeep and a sentence for Sofia.

Pause the video while you do that, and press play when you're ready for answers.

Okay, let's take a look at this together.

So what is wrong with Andeep's plan? He wants to use the data to predict what the mean daily maximum temperature would be in Armagh, so the same place this data is for, in a month that had 300 hours of sunshine.

Well, the problem here is that this data only has values for up to 255 hours of sunshine.

Andeep is trying to use extrapolation.

However, he cannot be certain that the data would continue to follow the same trend beyond this point in the data.

So temperatures may not continue to increase at the same rate.

And then what's wrong with Sofia's plan? Sofia plans to use this data to predict what the mean daily maximum temperature would be in Spain for a month with 150 hours of sunshine.

Well, 150 is within our data range here.

However, Spain is much closer to the equator than Northern Ireland and tends to be hotter.

So the data for Armagh will not necessarily transfer to a region in Spain.

Great job today.

Now, let's summarise what we've learned.

Interpolation can be used to estimate values inside the range of a dataset, whereas extrapolation can be used to estimate values outside the range of a dataset.

However, there are reasons to be cautious when using either interpolation or extrapolation.

Extrapolation may not be valid, as this is outside the observed set of data values, and we don't necessarily know how trends in data will continue beyond that range.

Whereas interpolation may not be valid due to limitations within the dataset.

And either way, contextual information may help you judge the validity of any predictions that are made with either interpolation or extrapolation.

Thank you very much, have a great day.
