Hello, my name is Dr. Rowlandson, and I'm delighted that you'll be joining me in today's lesson.

Let's get started.

Welcome to today's lesson from the unit of numerical summaries of data.

This lesson is called "Correlation", and by the end of today's lesson, we'll be able to recognise relationships between bivariate data that is presented on a scatter graph.

Here is a reminder of some keywords that you may be familiar with already, and will be using during today's lesson.

And here are new keywords that we'll be introducing and unpacking during today's lesson.

This lesson has two learn cycles.

The first learn cycle focuses on understanding what correlation is, and how to recognise it on a scatter graph.

In the second learn cycle, we'll be looking at cases where there is no correlation.

However, let's start with the first learn cycle on recognising correlation in bivariate data.

Here we have two scatter graphs.

Looking at the points in the scatter graphs, it's clear that the points aren't plotted randomly all over the grid.

They seem to be forming some kind of shape in each graph, but the shapes are different on the left graph and the right graph.

Let's consider, how do the shapes of the points differ between these graphs? Pause the video, have a think about this and press play when you're ready to continue discussing it.

On the left-hand graph, the points seem to be forming a bit of an upward sloping line.

It's not an exact line, but the points all seem to be going in an upward direction.

And in the right-hand graph, the points all seem to be going on a downward sloping line.

Again, it's not a perfect straight line, but the points all seem to generally follow that same direction of going downwards when you go from the left-hand side of the graph to the right-hand side of the graph.

So what does that mean? When the points in a scatter graph appear to form some kind of shape or pattern, it suggests that there may be an association between those two variables.

For example, here we have the points forming an upward sloping line.

What that means is, as the independent variable increases, the dependent variable also increases as well.

In other words, as we go further to the right on the horizontal axis, the points seem to be getting higher and higher on the vertical axis.

Here, we have a downward sloping line and what that means is, as the independent variable increases, the dependent variable decreases.

In other words, as we go further to the right on the horizontal axis, the points are getting lower and lower on the vertical axis.

So we've had one case where the points form an upward sloping line and another case where the points form a downward sloping line.

How can we describe these associations using mathematical terminology? The word "correlation" describes the extent to which an association is close to being a linear association.

In other words, how closely the points follow a straight line going in one direction or another.

When the points appear to form a straight line, it suggests that the rate of change is constant throughout the data.

For example here, the points form roughly a straight line that is upward sloping.

And the fact it's a straight line suggests that every time we go across the horizontal axis by a fixed amount, the points go up the vertical axis by roughly the same amount each time.

For example, every time we go across 10 on the horizontal axis, the data seems to go up 50 in the vertical axis.

So the rate of change is constant throughout this data.

Whereas here we have some points that form a line that is sloping upwards, but it's not a straight line, it's a curved line.

That means that the rate of change is not constant throughout this data.

To begin with, when we go from 10 to 20 in the horizontal axis, the data increases by about 30 in the vertical axis.

But then later on, when we go across from 30 to 40 on the horizontal axis, the data goes up by 100 in the vertical axis.

So, the rate of change is not constant throughout that data.

That's because it's not a straight line.
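As a rough illustration of the two cases just described, the sketch below works out the rate of change between neighbouring points. The values are made up to echo the numbers quoted in the lesson (up 50 for every 10 across on the straight line, and increases of 30 and then 100 on the curve); they are not read from the actual graphs, and `rates_of_change` is a hypothetical helper name.

```python
def rates_of_change(xs, ys):
    """Return the change in y per unit change in x between neighbouring points."""
    points = list(zip(xs, ys))
    return [(y2 - y1) / (x2 - x1)
            for (x1, y1), (x2, y2) in zip(points, points[1:])]


# Illustrative values only (assumed, not taken from the lesson's graphs).
xs = [10, 20, 30, 40]
straight = [50, 100, 150, 200]  # up 50 for every 10 across
curved = [20, 50, 110, 210]     # up 30, then 60, then 100

straight_rates = rates_of_change(xs, straight)
curved_rates = rates_of_change(xs, curved)

print(straight_rates)  # [5.0, 5.0, 5.0] -> constant, so a straight line
print(curved_rates)    # [3.0, 6.0, 10.0] -> changing, so a curve
```

Real data will never give exactly equal rates, so in practice we would be looking for rates that are roughly the same rather than identical.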

When the variables change at the same rate throughout the data, there is correlation.

Here, we can see the rate of change is constant on this upward sloping straight line, so we can describe this as having correlation.

Whereas in this graph, there seems to be some sort of association between the variables because those points aren't random.

They are forming some kind of pattern, but it's not a straight line.

It doesn't have a constant rate of change.

So we say that there is a non-linear association.

There seems to be an association because they're forming some kind of shape and pattern, but it's not a straight line, so it's non-linear.

Here are two more examples.

On the left-hand side, we have a downward sloping straight line.

The rate of change is constant, every time we go across 10, we go down 65 roughly.

So we can describe that as having correlation.

Whereas on the right-hand side, yes the points seem to follow some kind of pattern or shape, so that suggests there's an association, but it's not a straight line, so it's not linear.

Therefore, one way to describe that data could be to say it has a non-linear association.

Now, correlation can be positive or negative, depending on the gradient of the line.

For example here, when we look at this graph, the points are forming roughly an upward sloping straight line, and we can see that straight line has a positive gradient if we think of it in terms of gradients on a graph.

Therefore, when the points are forming an upward sloping line, we can describe the variables as having a positive correlation.

In other words, each time the independent variable increases by 10, the dependent variable increases by roughly 50.

In this example, the points seem to be roughly forming a downward sloping straight line, and that straight line has a negative gradient.

Every time we go across 10, we go down in the vertical axis.

So when the points form a downward sloping straight line, we can describe the variables as having a negative correlation.

In this case, each time the independent variable increases by 10, the dependent variable decreases by roughly 65.
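To make the link between gradient and type of correlation concrete, here is a minimal sketch that estimates the gradient of a straight line through some points and reports its sign. It assumes a least-squares line of best fit, which is one common choice and not something the lesson itself specifies. The data values and the names `least_squares_gradient` and `direction` are made up for illustration.

```python
def least_squares_gradient(xs, ys):
    """Gradient of the least-squares straight line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def direction(gradient):
    """Describe the sign of a gradient in the lesson's vocabulary."""
    if gradient > 0:
        return "positive"
    if gradient < 0:
        return "negative"
    return "neither"


# Illustrative values only (assumed, not taken from the lesson's graphs).
xs = [10, 20, 30, 40, 50]
upward = [55, 95, 160, 195, 250]     # roughly up 50 for every 10 across
downward = [330, 270, 195, 140, 70]  # roughly down 65 for every 10 across

up_gradient = least_squares_gradient(xs, upward)
down_gradient = least_squares_gradient(xs, downward)

print(up_gradient, direction(up_gradient))      # 4.9 positive
print(down_gradient, direction(down_gradient))  # -6.5 negative
```

A gradient of 4.9 per unit is about 49 for every 10 across, and -6.5 per unit is 65 down for every 10 across, which matches the "roughly 50" and "roughly 65" descriptions.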

Now, the patterns and shapes of data are not always as pronounced as the examples we've seen so far.

Correlation can be clearer to see in some bivariate data than in others.

For example, here we have two scatter graphs based on weather data taken from the Met Office about the town of Eastbourne.

Each point on these scatter graphs represents a month from 1941 to 2011.

On the left-hand scatter graph, along the horizontal axis, we have the mean daily minimum temperature.

And the way that was worked out was, they got the minimum temperature for each day of the month and then found the mean of it, and that's the data that's plotted on a scatter graph.

And the vertical axis has the mean daily maximum temperature.

Again, that was worked out in a similar way, getting the maximum temperature for each day, finding the mean and plotting that on a scatter graph.

On the right-hand graph, the horizontal axis is the total sunshine duration.

So, for how many hours during a month was sunshine visible? That's the data that's plotted there.

And again, it has the mean daily maximum temperature on the vertical axis.

Let's look at how these graphs are different to each other.

If we look at the left-hand graph, we can see that those points very clearly follow the direction of an upward sloping straight line.

So, the correlation is very clear to see in this situation because the points very tightly follow that direction of the straight line.

In the right-hand graph, the correlation is less clear to see.

The points still have some kind of shape to 'em, and that shape still seems to be sloping upwards, roughly in a straight line, but it doesn't quite follow that straight line as tightly as in the left-hand graph.

So, it does have correlation and it is positive, but it's less clear to see than in the left-hand graph.
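The lesson judges how clear a correlation is by eye. One common way to put a number on it, which the lesson does not use, is Pearson's correlation coefficient: values near 1 or -1 mean the points follow a straight line tightly, and values nearer 0 mean they follow it loosely. The sketch below computes it by hand for two made-up data sets (not the Eastbourne data); `pearson_r` is a hypothetical helper name.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's correlation coefficient for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Illustrative values only (assumed, not the Met Office data).
xs = [2, 4, 6, 8, 10, 12]
tight = [7.1, 9.0, 10.8, 13.1, 15.0, 16.9]  # hugs an upward straight line
loose = [9, 6, 14, 10, 18, 13]              # upward overall, but scattered

r_tight = pearson_r(xs, tight)
r_loose = pearson_r(xs, loose)

# Both are positive, but the tight data set is much closer to 1.
print(round(r_tight, 2), round(r_loose, 2))
```

Both values come out positive, matching the upward slope, and the tighter data set gives the value closer to 1.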

Okay, let's check what we've learned so far.

What word describes the extent to which the association is close to a linear association? Pause the video, write down the word that we've learned and press play when you're ready to continue.

The word is "correlation." Correlation describes the extent to which an association is close to a linear association.

Which scatter graph shows a linear association between the variables? Is it A, B or C? Pause the video, write down a letter and press play when you're ready for an answer.

The answer is "C", because the points in that scatter graph follow roughly a straight line.

Fill in the missing word to describe the correlation that is shown in this scatter graph.

The variables have a "blank" correlation.

Pause the video, write down the missing word and press play when you're ready for an answer.

The variables have a "positive" correlation.

That's because they are sloping upwards in a straight line.

Fully describe the association shown in this scatter graph.

You'll need a couple of words for this one.

Pause the video, write down the description and press play when you're ready for an answer.

The scatter graph shows "negative correlation." We say "correlation" because it's a linear association, and we say "negative" because it's sloping downwards.

So we've learned so far that correlation describes the extent to which the relationship or association is close to being a linear relationship.

Now, in some cases, that might mean that one variable is having an effect on the other variable.

As one variable changes, it causes the other variable to change.

Let's look at some examples of that.

Here we have a scatter graph that shows data taken from the ONS, which stands for the "Office for National Statistics", and gov.uk for 2019.

Each point on this scatter graph represents a region somewhere in either England or Wales.

Along the horizontal axis, we have the population for each of these regions, and that's in thousands.

So if it says 100, it means 100,000.

On the vertical axis, it shows the number of property sales in 2019 within that region.

And by property, it means, for example, houses, so this could be the number of house sales.

Aisha says, "There is a positive correlation between the population of a region and the number of properties sold." And we can see that because the points are forming an upward sloping straight line.

And not every single point perfectly follows that straight line, we can see there are some points further away from that straight line than others, but generally, the majority of 'em tend to follow that pattern.

Looking at this, Aisha says, "It looks like having more people in a region causes more property sales." And the graph would suggest that as the population increases, the number of property sales seems to increase as well.

And in this case, we can probably see why an increase of population may cause the number of property sales in that region to increase as well.

A greater population means there are probably more properties in that town or region in the first place.

There are more people to sell properties and more people there to buy properties as well.

So, it makes sense why an increased population would cause the number of property sales to increase as well.

Here's another example.

Here we have data taken from gov.uk for 2019.

Again, each point represents a region somewhere in England or Wales.

The horizontal axis, again, shows us population.

The vertical axis in this case, shows us greenhouse gas emissions in that region.

And what we have in brackets there are the units for measuring the greenhouse gases.

Jun says, "There is a positive correlation between the population of a region and the amount of greenhouse gases." Now, that's not as clear as in the previous graph, but we can see the points do tend to be following an upward sloping trend.

Therefore, Jun says, "It looks like having more people in a region leads to more greenhouse gases being emitted in that area as well." In other words, Jun is suggesting that increase in the population causes an increase in greenhouse gas emissions.

And in this context, we can imagine why that might be the case.

The more people there are in the region, the more things there are in that region to emit greenhouse gases.

Therefore, as one increases, the other one will increase as well.

However, correlation does not always mean that one variable is affecting the other.

Two variables may be associated without one directly affecting the other.

For example, here we have a scatter graph with data from the Office for National Statistics and gov.uk for 2019.

And each point represents a region in England and Wales again.

On the horizontal axis, we have the number of property sales in that region.

And the vertical axis shows us the total greenhouse gas emissions for that region.

Alex observes that there is a positive correlation between the number of property sales and the greenhouse gas emissions.

And we can see that because it forms roughly an upward sloping straight line going from left to right.

Alex then suggests, "It looks like selling properties causes greenhouse gas emissions.

But that doesn't really seem right." Selling a property might cause some greenhouse gas emissions, but it's unlikely to be such a direct cause on the total greenhouse gas emissions for that region in that year.

So, why is it a positive correlation still here? When two variables correlate, it does not necessarily mean that one variable directly affects the value of the other variable.

The explanation for why two variables correlate can sometimes be because there is a third variable that is having an effect on each of them.

For example, we saw earlier that the population of a region affects the number of properties sold.

We can see that with a positive correlation and understanding why that might be the case in that context.

So we can suggest that population affects property sales, and also the population of a region affects the amount of emissions in that region as well.

Again, we can see it with a positive correlation and an understanding in the context why one might affect the other.

Therefore, population causes emissions.

So, an increase in the population would cause an increase in the property sales, and also an increase in the emissions.

Therefore, if both of these things are increasing at the same time because the population is increasing, this could explain why there is a correlation between the property sales and the greenhouse gas emissions without one directly affecting the other.

These two things are associated because they are both affected by population size.

As population size increases, both of these things will increase.

Therefore, if we're seeing more property sales, it's likely we're going to see more greenhouse gases, but that is because there is likely to be a greater population in that region.
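The idea of a third variable driving two others can be sketched with made-up numbers. In the example below, property sales and emissions are each built from population plus some fixed variation, and neither is calculated from the other, yet the two still correlate strongly. All figures, multipliers and names are assumptions for illustration and are not the ONS or gov.uk data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's correlation coefficient for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical regions: population in thousands.
populations = [100, 150, 200, 250, 300, 350, 400, 450]

# Fixed "noise" so the example gives the same result every time it runs.
sales_variation = [40, -60, 25, -10, 55, -35, 15, -30]
emission_variation = [-30, 20, 45, -50, 10, 35, -40, 10]

# Each variable depends on population only, never on the other variable.
sales = [12 * p + v for p, v in zip(populations, sales_variation)]
emissions = [5 * p + v for p, v in zip(populations, emission_variation)]

r_sales_emissions = pearson_r(sales, emissions)
print(round(r_sales_emissions, 3))  # close to 1, despite no direct link
```

The strong positive value appears only because both lists rise with population, which is the same explanation given for the property sales and emissions graph.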

So let's consider what correlations we might expect from different situations.

Andeep and Sofia collect data from a sample of music students.

They asked them what their score was on the last musical performance exam they did, and how many hours they spent practising it.

So our two variables are "score" and "hours practising." Assuming that there is a correlation between these variables, whose prediction seems to be most likely? Andeep says, "I expect there to be a negative correlation.

This would mean that the more time someone spends practising, the lower they will score on the exam." Sofia says, "I expect there to be a positive correlation.

This would mean that the more time someone spends practising, the higher they will score on the exam." Who do you agree with out of these two? Pause the video, have a think and press play when you're ready to continue.

It seems most likely that Sofia's prediction is correct here.

The more time that you spend practising on a musical instrument, the more comfortable you are likely to be playing it when it comes to your exam.

Izzy and Lucas collect data from a sample of 100-metre runners.

They ask how much time it took them to finish their last 100-metre race, and how many hours they spent training for that race.

So here our variables are "time within the race" and "time spent training." Assuming that there is a correlation between the variables, whose prediction seems most likely? Izzy says, "I expect there to be a negative correlation.

This means that the more time someone spends training, the less time it'll take them to finish the race." And Lucas says, "I expect there to be a positive correlation.

This would mean that the more time someone spends training, the more time it will take them to finish the race." Who do you agree with? Pause the video, have a think and press play when you're ready to continue discussing.

In this case, it seems likely that Izzy's prediction is correct.

The more time you train, the faster you get as a runner, and therefore the faster you are as a runner, the less time it'll take you to finish a race.

Therefore, as we increase the amount of training time, we are decreasing the amount of time it takes to run the race.

Therefore, it'll be a negative correlation.

Okay, let's check what we've learned there.

True or false? Correlation always means that one variable is having an effect on the other variable.

Choose true or false, and then choose a justification.

Justification A is, correlation can mean that two variables are associated without one directly affecting the other.

And justification B is, variables can only correlate when one variable is affecting the other.

Pause the video, choose your answers and press play when you're ready for 'em.

The answer is "False." Correlation does not always mean that one variable is having an effect on the other.

And our justification is, correlation can mean that two variables are associated without one directly affecting the other.

Assume these two variables have a correlation.

Which type of correlation would you expect? Our two variables are, "The number of bricklayers in a team." And, "The number of bricks laid in a day." What type of correlation would you expect? Pause the video, write an answer and press play when you're ready for it.

We'd expect it to be "positive correlation." The more bricklayers there are to lay bricks, the more bricks we expect to be laid in a day.

Again, assume these variables have correlation.

We've got the number of bricklayers in a team and the time it takes to complete a project.

Which type of correlation would you expect? Pause the video, write down an answer and press play when you're ready.

Here we'd expect "negative correlation." The more bricklayers there are on a team, the more bricks they are likely to lay in a day, and that means the number of days it'll take to complete the project would get less.

So, the time it takes to complete a project would decrease as the number of bricklayers increases.

So there'd be a negative correlation.

Okay, on to you now for task A.

This task contains three questions, and here are the first two.

In question one, you've got two scatter graphs and you need to fully describe the correlation in each scatter graph.

In question two, you've got a table where each row provides you with a pair of bivariate data, and you need to decide whether you expect the correlation in that bivariate data to be positive or negative.

Tick the box of positive or negative.

Pause the video, have a go and press play when you're ready to continue.

Here is question three.

Here we have three scatter graphs that all show data taken from the Office for National Statistics.

And once again, each point in these scatter graphs represents a region somewhere in England or Wales in 2015.

Now, Jacob looks at the third scatter graph here, which shows cell phone towers on the horizontal axis, and transport emissions on the vertical axis.

And Jacob thinks that the third scatter graph shows that cell phone towers cause more traffic pollution because there is a positive correlation between those variables.

Use all the information on this slide to explain why Jacob could be wrong.

You may need to write down a sentence or two to answer that question.

Pause the video, have a go and press play when you're ready for answers.

Okay, well done.

Here are the answers to questions one and two.

Question one A shows "positive correlation" because we have an upward sloping straight line.

One B shows "negative correlation" because it's a downward sloping straight line.

In question two, when we have children's heights and their shoe sizes, we'd expect a "positive correlation" because as children get taller, their feet tend to get bigger, which means they need larger shoe sizes.

And for the steepness of a hill and the walking speed up the hill, we would expect "negative correlation" because the steeper a hill is, the slower people tend to walk up it.

So as one increases the other one decreases, that'll be negative correlation.

And in question three, explain why Jacob could be wrong.

Well, there's a few ways we could describe this.

Correlation does not always mean that one variable is affecting the other, so it does not necessarily mean that cell phone towers are causing traffic pollution.

Correlation may suggest that there's an association between those variables, but not necessarily that it's a cause and effect relationship.

We could also say that the number of cell phone towers and the amount of traffic emissions could both be affected by the population of the region.

And we can see that in those previous two scatter graphs, where each of those correlate with population.

So, any explanation that is similar to any of these answers will be perfect for answering that question.

Well done so far.

We're now on to learn cycle two, which is interpreting bivariate data without correlation.

Now, a lot of the scatter graphs we've seen so far have quite clear correlation, but bivariate data does not always have correlation.

In some cases, the points may not actually form a shape or any kind of distinct pattern, and in these cases, the variables can be described as having no correlation.

For example, here we have a scatter graph that shows weather data taken from the Met Office each month for Tiree from 1941 to 2022.

We've got total rainfall in millimetres along the horizontal axis.

And the mean of the daily minimum temperature going up the vertical axis.

And what we can see here is, the points do not form a shape that is a line going either upwards or downwards.

We look at, for example, 50 millimetres in the rainfall, and we can see that the points above that are somewhere between five and 15, 16 degrees on the vertical axis.

But if we go further across on that horizontal axis to 150, we can see that most of the points are still between five and 15 or 16 degrees on the vertical axis as well.

So, no correlation can suggest that there may not be a linear association between these variables.
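The check just described, comparing the spread of vertical values at different places along the horizontal axis, can be sketched as below. The points are invented to mimic the description (values between about 5 and 16 degrees whatever the rainfall) and are not the Tiree data; `y_range_by_band` is a hypothetical helper name.

```python
def y_range_by_band(points, band_width):
    """Group points into horizontal bands and return (min y, max y) per band."""
    bands = {}
    for x, y in points:
        band_start = (x // band_width) * band_width
        low, high = bands.get(band_start, (y, y))
        bands[band_start] = (min(low, y), max(high, y))
    return dict(sorted(bands.items()))


# Illustrative (rainfall in mm, temperature in degrees) pairs, all assumed.
points = [
    (52, 5), (60, 16), (75, 9), (90, 12),
    (105, 6), (120, 15), (135, 10), (148, 13),
    (152, 5), (160, 16), (175, 11), (190, 8),
]

ranges = y_range_by_band(points, 50)
print(ranges)  # {50: (5, 16), 100: (6, 15), 150: (5, 16)}
```

Because the range of temperatures is about the same in every band, moving to the right on the horizontal axis tells us nothing about where the points sit on the vertical axis, which is what no correlation looks like.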

So we've seen cases where the correlation forms a straight line, and we're seeing cases where there's no correlation whatsoever.

But sometimes an association is better described by a curve than a straight line.

Now, whilst correlation could be used to model these data, inspection of the scatter graphs would suggest that a better description of the relationship could be achievable.

For example, here we have a scatter graph that shows the results taken from an experiment.

The experiment was timing how long it takes for a kettle to boil with different volumes of water.

Along the horizontal axis, we have the volume of water in litres each time.

And on the vertical axis, we can see the amount of time it took for the kettle to boil, in seconds.

And what we can see is, there is a shape to this.

As the volume increases, the time it takes for the kettle to boil also increases as well.

But that shape is not a straight line, it's more curved than that.

In cases like this, the variables have a clear association or relationship.

You can see that they're following a pattern, but correlation is not the best model for this association because correlation describes the extent to which variables follow a straight line.

For example, is correlation a good model for this data? Now we can see that the points have a clear shape and pattern to them, so that would suggest that there is an association, but that shape is not a straight line.

It's more like a curved line.

So we would say, "No, the data shows a clear curve, so there is likely to be a better model for this than correlation." Alternatively, we could say, "There appears to be an association or relationship between the data, but it's not linear."

Does this graph show a correlation? Here, the points seem to be all over the place.

They don't seem to have much of a shape to them.

They're not sloping upwards or downwards, and they form neither a straight line nor a curved line.

So we would say, "No, the data does not have a correlation. There does not appear to be any kind of relationship between the variables on this scatter graph."

Right, let's check what we've learned there.

Which two scatter graphs should not be described using correlation? Choose two out of A, B and C.

Pause the video, have a go and press play when you're ready for an answer.

A and B should not be described using correlation.

B, because the points are all over the place and don't really have much of a direction or pattern to them.

And A, there is an association, but it's not linear, so correlation is not our best description for that.

Which scatter graph shows no association whatsoever? Pause the video, choose A, B, or C and press play when you're ready for an answer.

Our answer is B.

The points are all over the place, they're not really following any kind of pattern with any kind of direction, so they show no association.

Okay, on to you now for task B.

This task contains just one question.

Here, we have four scatter graphs, and parts A, B, C and D each ask you to choose one scatter graph as your answer.

Pause the video, have a go and press play when you're ready for answers.

Let's go through some answers.

Which scatter graph shows a positive correlation? That will be scatter graph two.

It's an upward sloping straight line.

Which scatter graph shows a negative correlation? That'd be graph three.

It's a downward sloping straight line.

Which graph shows an association which is not linear? That'll be graph four.

The points are following a pattern, but it's a curved line.

And which graph shows no association? That would be graph one.

Great work today.

Here's a summary of what we've learned in this lesson.

The distribution of points on a scatter graph may suggest a positive correlation between the variables, and we would see that if the points are forming an upward sloping straight line.

Or the distribution of points may suggest a negative correlation between the variables, and we would see that if the points are forming a downward sloping straight line.

Alternatively, the distribution of points may suggest that there is no correlation between the variables.

If there is correlation, it does not necessarily suggest that one variable is affecting the other.

It may just mean that those two variables are associated without one having a direct effect on the other.

Well done today.