Hello there.

You made a great choice with today's lesson.

It's gonna be a good'un.

My name is Dr. Rownson and I'm gonna be supporting you through it.

Let's get started.

Welcome to today's lesson from the unit of graphical representations of data with scatter graphs and time series.

This lesson is called Estimating from Scatter Graphs, and by the end of today's lesson, you've guessed it, we'll be able to estimate from scatter graphs.

This lesson will introduce a new key word and that is line of best fit.

A line of best fit is a line where the distance between each data point and the line is minimised, and we'll see lots of examples of that during today's lesson.

This lesson contains two learn cycles.

In the first learn cycle, we're going to focus on drawing lines of best fit.

And in the second learn cycle, we're going to use those to estimate from scatter graphs.

Let's start off with drawing a line of best fit.

Here we have a scatter graph that shows data from the Met Office about the weather in Heathrow for each month from 1941 to 2022.

On the horizontal axis, we have the total sunshine duration for each month, which is measured in hours.

On the vertical axis, we have the mean of the daily maximum temperature, which is measured in degrees Celsius, and each point represents a month.

And we could use this scatter graph to make some estimations.

For example, what range of maximum daily temperatures might we predict for a month with 200 hours of sunshine? Well, we could look at the horizontal axis and look at the number 200 for 200 hours, go up to where the majority of the data points are above 200.

You can see that there are some data points which are a bit lower, but the bulk of the data seems to be between these two ends of the arrow, and then see where those are on the vertical axis.

All those points seem to be between 15 and 25 on the vertical axis.

So we could predict that a month with 200 hours of sunshine would have maximum daily temperatures between 15 and 25 degrees Celsius.
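
The range estimate described above can be sketched in code. This is only an illustration: the data values are invented (they are not the Met Office data), and the function name `estimate_range` and the window of 10 hours either side are assumptions.

```python
# Hypothetical (sunshine hours, mean daily maximum temperature in °C) pairs.
# These values are invented for illustration; they are not the Met Office data.
months = [
    (60, 8.1), (95, 11.4), (150, 15.2), (190, 17.0), (195, 21.3),
    (200, 19.8), (205, 23.6), (210, 16.4), (250, 24.9), (280, 26.0),
]

def estimate_range(points, x_value, window):
    """Return (lowest, highest) y among points whose x is within `window` of x_value."""
    nearby = [y for x, y in points if abs(x - x_value) <= window]
    if not nearby:
        raise ValueError("no data points near that x value")
    return min(nearby), max(nearby)

low, high = estimate_range(months, 200, 10)
print(f"Estimated range: {low} to {high} degrees Celsius")
```

With this made-up data, the months with between 190 and 210 hours of sunshine give a range of 16.4 to 23.6 degrees Celsius, which is still quite broad.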

Now, that's not a very precise estimate.

A difference between 15 degrees Celsius and 25 degrees Celsius, it's quite big.

It's a difference between whether you gotta wear your coat when you go outside or wanna just put on a T-shirt.

So Laura says, "This is quite a broad range.

What if a more precise estimate is required?" So let's explore that a bit more together now.

When bivariate data has correlation, the rate of change is approximately constant throughout the data, and that's why we can see the data here.

The majority of the data seems to sit between these two straight line segments, and these two line segments can be helpful for estimating a range of data for different data points, but aren't very precise.

So, rather than drawing two line segments around the outside of the data, we can think about drawing one line through the middle of the data.

When there is a correlation, a line of best fit can be drawn to show the general pattern that the data follows.

And this is our line of best fit.

A line of best fit can be drawn for either positive correlation or negative correlation.

For example, here we have negative correlation and here's its line of best fit.

The hardest thing about drawing a line of best fit by hand is knowing where to draw the line of best fit and how steep to make it.

Because what you're trying to achieve with a line of best fit is you're trying to draw a line with the minimal possible distance between it and the data points in the scatter graph.

For example, with these data points, we can place our line of best fit here and then take a look at how far each point is away from that line in a vertical direction.

We should try and draw the line as close to as many of the points as possible.

So in other words, we wanna make those distances as small as possible, and there are a couple of ways we can do that.
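
The idea of measuring how far each point is from the line in a vertical direction can be sketched as follows. The five data points and the candidate line y = x + 1 are invented for illustration, and `vertical_distances` is a hypothetical helper name.

```python
# Hypothetical data points (x, y), invented for illustration.
points = [(1, 2), (2, 3), (3, 5), (4, 4), (5, 6)]

def vertical_distances(points, gradient, intercept):
    """Vertical distance from each point to the line y = gradient * x + intercept."""
    return [abs(y - (gradient * x + intercept)) for x, y in points]

# Candidate line y = x + 1.
distances = vertical_distances(points, 1, 1)
print(distances)       # [0, 0, 1, 1, 0]
print(sum(distances))  # 2
```

A smaller total means the line sits closer to the points overall.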

We could think about where that line is positioned, or we can think about how steep that line is, or the angle it takes.

Now what we're about to do is adjust this line in both of those ways to see how it affects the distance from the points.

But this slide deck also contains a link to a GeoGebra version of what we're about to do if you'd like to have a go yourself before we continue.

If you'd like to do that, pause the video, click on the link and have a go at it now.

If not, continue watching.

The height of the line can affect how far it is away from each of the points.

Let's look at what happens when we adjust the height of this line.

If we make it higher, yes, it's got closer to some points, but it's got further away from other points.

And now the line's above all the points, which means it's definitely not a minimal distance away.

We can make it lower and we have the same problem again.

Generally, we want it to be going roughly through the middle of the points.

So what about the steepness? The gradient of the line can also affect how far it is away from each of the points.

Let's take a look at that now.

As I make the line steeper, it's got closer to some points but further away from other points.

And if I make the line less steep or flat in this case, we can see the same thing has happened.

So generally, we want it to go through the middle of the points and go roughly in the same direction as the majority of the points seem to be going, and that will create the minimal distance from it to each of the points.
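
The effect of adjusting the height and the gradient can also be sketched numerically. The data points and the five candidate lines below are invented for illustration, and `total_vertical_distance` is a hypothetical helper name. Each candidate is written as a (gradient, intercept) pair.

```python
# Hypothetical data points (x, y), invented for illustration.
points = [(1, 2), (2, 3), (3, 5), (4, 4), (5, 6)]

def total_vertical_distance(points, gradient, intercept):
    """Sum of vertical distances from the points to y = gradient * x + intercept."""
    return sum(abs(y - (gradient * x + intercept)) for x, y in points)

# Candidate lines as (gradient, intercept) pairs.
candidates = {
    "through the middle": (1, 1),
    "raised": (1, 3),
    "lowered": (1, -1),
    "steeper": (2, -2),
    "flat": (0, 4),
}

totals = {
    name: total_vertical_distance(points, m, c)
    for name, (m, c) in candidates.items()
}
for name, total in totals.items():
    print(f"{name}: {total}")

best = min(totals, key=totals.get)
print("Smallest total distance:", best)
```

For this made-up data the totals are 2, 10, 10, 8 and 6, so the line through the middle of the points is the closest of the five.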

Now, generally when plotting scatter graphs outside of the classroom, you may be using computer software such as a spreadsheet or something dynamic like GeoGebra or Desmos, and those pieces of software can calculate the optimal line of best fit for you, and it will do that with some quite complex calculations.
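
As a rough idea of what such a calculation can look like, here is a minimal sketch of the least squares method, which chooses the line that minimises the sum of the squared vertical distances. This is an assumption about the kind of calculation involved, not a description of what any particular piece of software does. The data points are invented and `least_squares_line` is a hypothetical name.

```python
def least_squares_line(points):
    """Return (gradient, intercept) of the least squares line through the points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # How x and y vary together, and how x varies on its own.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("all x values are the same, so no gradient can be found")
    gradient = sxy / sxx
    # The least squares line always passes through the point (mean_x, mean_y).
    intercept = mean_y - gradient * mean_x
    return gradient, intercept

# Hypothetical data points (x, y), invented for illustration.
points = [(1, 2), (2, 3), (3, 5), (4, 4), (5, 6)]
gradient, intercept = least_squares_line(points)
print(f"y = {gradient:.2f}x + {intercept:.2f}")
```

For this made-up data the gradient is about 0.9 and the intercept is about 1.3.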

What we're going to look at today is how to draw a line of best fit by hand.

When drawing a line of best fit by hand, we can approximate its position by doing the following.

First, take your ruler or a straight edge and place it over the scatter graph, and then rotate it until it follows the general direction of the majority of the data points, and then adjust the height of your ruler until the top of it is near the middle of the data points and draw your line of best fit there.

Where people go wrong with this is they just draw the line of best fit in the first place where the ruler lands on the page.

That's not always the best place for it.

Drawing a line of best fit may require a little bit of trial and error when you're doing it by hand.

You may need to just make lots of little adjustments to your ruler, adjusting your angle, adjusting the height until you are perfectly happy with where it is on the page, and only then draw your line of best fit.

So let's think a little bit more about some do's and don'ts when it comes to your line of best fit.

A line of best fit may be drawn to the edge of the graph, but it shouldn't really extend beyond the edges of the graph, and the line of best fit does not need to go through the origin, (0, 0), unless that is the best way to fit the data points.

So for example, the scatter graph on the left, you can see the line of best fit doesn't go through the origin, but it is a pretty good line of best fit.

It goes through the middle of the data points and generally follows the direction they're going in.

But in the graph on the right, the line of best fit goes through the origin and you can see it's not a great line of best fit.

The majority of those points are above it.

Also, a line of best fit can be drawn when bivariate data has correlation.

For example, we have four scatter graphs here where someone has attempted to draw a line of best fit for the data, and you can see that some are better than others.

For the ones with correlation, the line of best fit is a good model for that data, but for the other two, not so much.

As a line of best fit is straight, it is not an appropriate model for data that has a non-linear association or has no association at all.

So let's check what we've learned there.

True or false? The line of best fit always goes through the origin.

Is that true or is it false? Pause the video while you make a choice and press play when you're ready for the second part of this question.

The answer is false.

So what's our justification? Here are two options to choose from.

Pause the video while you make a choice and press play when you are ready to continue.

The answer is b.

The line of best fit can start anywhere on a vertical axis, depending on the distribution of the data points.

Here we have three scatter graphs.

They are the same scatter graphs, but with a line of best fit in different places.

Which graph has the most appropriate line of best fit? Pause the video while you choose either a, b, or c and press play when you're ready for an answer.

The answer is b.

We have a similar thing again, but this time we are changing the gradient of the line of best fit.

Which graph has the most appropriate line of best fit? Pause the video while you make a choice and press play when you're ready for an answer.

The answer is a.

And in which graphs would it be appropriate to draw a line of best fit? You can choose more than one.

Pause the video while you make your choices and press play when you're ready for an answer.

The answer is b and c.

Both of these show correlation, whereas with graph a, it seems to be more like a non-linear association.

Okay, it's over to you now for task A.

This task contains one question and here it is.

You need to draw a line of best fit on each scatter graph.

Pause the video while you do it and press play when you're ready to see some answers.

Let's take a look at some answers then.

Here they are.

Check yours against these.

Yours might not be exactly the same as these, but hopefully you've got the general idea.

Your line is going roughly through the middle of the data points and is following the general direction that the points are travelling in.

Pause the video while you check it and then press play when you're ready to continue with the lesson.

Great work so far.

Now let's move on to the second learn cycle.

We're going to use our lines of best fit.

Here we have a scatter graph that shows data from the Met Office about the weather in Heathrow for each month from 1941 to 2022.

The horizontal axis shows the total sunshine duration in hours, and the vertical axis shows the mean of the daily maximum temperature in degrees Celsius, and each point represents a month.

Now, we saw this graph earlier and we used it to estimate ranges of values.

For example, for a month with 200 hours of sunshine, we could expect the maximum daily temperature to be between 15 degrees Celsius and 25 degrees Celsius.

Now earlier, Laura didn't really like how broad that range was.

She says, "Could we obtain a more precise estimate by using a line of best fit?" Let's do that together then.

Here's our line of best fit.

And we can use this line of best fit to estimate the value of one variable given the value of another variable.

So for example, what maximum daily temperature could be predicted for a month with 200 hours of sunshine? We can do that by looking at 200 on the horizontal axis, going up to the line of best fit, and then going across until we get to the vertical axis, and reading off what we have.

It'll be helpful if you have a ruler or a straight edge to help you do that.

And we can see here the mean of maximum daily temperature, which is in line with that part of the line of best fit, is 20.5 degrees Celsius.

Lucas says, "Does that mean the maximum daily temperature will definitely be 20.5 degrees Celsius if there are 200 hours of sunshine?"

Laura says, "It is not guaranteed to be 20.5 degrees Celsius. This is just an estimate."

And we can see it's not guaranteed from the data.

Laura says, "We can see that some months have previously been warmer than 20.5 degrees Celsius and some months have been colder."

But 20.5 is a reasonable estimate if we want just a single value for our prediction.

So let's check what we've learned.

Here we have a scatter graph that shows data from the ONS, where each point represents a region in the UK, and the line of best fit is provided.

Now, a value has been marked on the vertical axis and it's been marked with an x.

What is the value of x? Pause the video while you write it down and press play when you're ready for an answer.

The answer is 800.

A point has been marked on the line of best fit.

What population is represented by this point? Pause the video while you do that and press play when you're ready for an answer.

It's the population.

So we're looking at the horizontal axis and the population is in thousands, so it is 600,000.

Use the line of best fit to estimate the transport emissions for a region with a population of 200,000.

Pause the video while you do that and press play when you're ready for an answer.

So if you're looking for the number 200,000 in that horizontal axis, you won't find it because the numbers represent thousands.

So we need to look at the number 200, which represents 200,000, go to the line of best fit, go to the vertical axis, and we can see it is 300.

Okay, it's over to you now for task B.

This task contains three questions, and here is question one.

We have a scatter graph that shows data from the ONS about regions in the UK for population and the number of cell phone towers, and those are both in the thousands.

Each point represents a different region in the UK.

And if you want to, you can click on the image in the slide deck to open up a Desmos version of this graph where you can zoom in and zoom out and look at things in a bit more depth, if you'd like to.

You don't necessarily need to, though, to answer this task.

What you need to do is use the line of best fit to estimate the number of cell towers required for regions with the populations given in parts a, b, and c.

Pause the video while you do this and press play when you're ready for question two.

Here is question two.

Once again, we have a scatter graph with data from the ONS about regions in the UK, but this time our horizontal axis tells us about the median annual income, which is in the thousands for people in each region.

And the vertical axis shows the median house price, which is also in the thousands of pounds for each region as well.

You need to first draw a line of best fit and then use your line of best fit to estimate the median house prices based on the median annual incomes that are given in parts one, two, and three.

And once again, if you would like to, you can click on the image in the slide deck to open up a Desmos version to explore it in more depth, if you'd like to do so.

Pause the video while you're doing this and press play when you are ready for question three.

And finally here is question three.

We have another scatter graph, again with data from the ONS, where each point represents a region in the UK.

This time, we have population on the horizontal axis and we have the number of properties sold in 2023 in the thousands on the vertical axis.

And this time you are given a contextual problem to deal with.

A property sales company is forecasting property sales for the following year in order to manage its staff numbers.

For example, if the company has one office in a region where there aren't that many sales, and another office in a region where there are lots of sales, the company might want to make sure there are more staff in one office than there are in the other, and might use some kind of forecasting to help 'em decide that.

Now, I'm not gonna ask you to work out staffing numbers or anything quite as complex as that.

What I'd like you to do, please, is use the graph to estimate the difference between the number of properties that are likely to be sold in a region with a population of 150,000 and a region with a population of 280,000.

Pause the video while you do this and press play when you're ready to go through some answers.

Let's see how we got on with those then.

In question one, we need to use our line of best fit to estimate the number of cell towers required for regions with the following populations.

For a, a population of 100,000 people, an estimate would be 2,500 cell towers.

For b, a population of 200,000 people would require approximately 4,500 cell towers.

And for c, a population of 300,000 people would require approximately 6,500.

Now, your answers may vary slightly to these depending on the accuracy of the readings.
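
The three readings given for question one are consistent with a single straight line, which is what we'd expect when all three are read from the same line of best fit. A minimal sketch of that check follows: it takes the line through the first and last readings and uses it to estimate the middle one. The function names `line_through` and `estimate` are hypothetical, and the line found this way is inferred from the stated readings rather than taken from the graph itself.

```python
def line_through(p1, p2):
    """Return (gradient, intercept) of the straight line through two points."""
    (x1, y1), (x2, y2) = p1, p2
    gradient = (y2 - y1) / (x2 - x1)
    intercept = y1 - gradient * x1
    return gradient, intercept

def estimate(gradient, intercept, x):
    """Read off the y value on the line for a given x value."""
    return gradient * x + intercept

# Readings from question one: (population in thousands, cell towers in thousands).
reading_a = (100, 2.5)
reading_b = (200, 4.5)
reading_c = (300, 6.5)

gradient, intercept = line_through(reading_a, reading_c)
towers_at_200 = estimate(gradient, intercept, reading_b[0])
print(f"Estimate at 200 thousand people: {towers_at_200:.1f} thousand towers")
```

The estimate at a population of 200 thousand comes out as 4.5 thousand towers, matching the reading for part b.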

And then for question two, you need to start by drawing a line of best fit, which may be somewhere around here.

However, as we're doing it by hand, yours might be slightly different to the one on the screen here, but hopefully, it's generally in a similar place.

And then we need to use our line of best fit to estimate some values.

Now, depending on precisely where you've drawn your line of best fit, your answers may vary slightly to the ones you're about to see here, but hopefully you've got something similar to what we're about to see.

In part one, 165,000 pounds.

In part two, 280,000 pounds.

And in part three, 540,000 pounds.

Once again, your answers may vary slightly depending on your line of best fit and how accurately you were able to read off the values.

And then in question three, we want to use our graph to estimate the difference between the number of properties that are likely to be sold in a region with a population of 150,000 and a region with a population of 280,000.

You'll need to start by drawing a line of best fit, go to 150 and 280 on the horizontal axis, go up to the line of best fit, read off your values from the vertical axis, and then do a subtraction.

You'll be subtracting two numbers, roughly around 3,100 and 1,800 to get an answer approximately around 1,300.

I say approximately because your answer might vary depending on how you draw your line of best fit, and also how accurately you were able to read off the values from the graph.

But hopefully, you followed the same method and you got an answer which is roughly similar to this one.

Excellent work today.

Now let's summarise what we've learned.

When bivariate data has correlation, you can draw a line of best fit on its scatter graph.

The line of best fit should represent the trend of the data, and it can be used to estimate the values of a data point.

And the line of best fit should really only be used to estimate values in the dependent variable, like we've been doing in today's lesson.

Thank you very much.

Have a great day.