Hello, my name is Dr. Rowlandson and I'll be guiding you through today's lesson.

Let's get started.

Welcome to today's lesson from the unit of numerical summaries of data.

This lesson is called Statistical Problems with Statistical Summaries, and by the end of today's lesson, we'll be able to choose appropriate statistical measures to explore a statistical problem.

Here are some keywords that you may already be familiar with and I'll be using again during today's lesson.

This lesson contains two learn cycles. In the first learn cycle, we'll be focusing on problems with statistical measures based on the data, and in the second learn cycle, we'll focus on choosing statistical measures based on the context of the data.

But let's start now with problems with statistical measures based on the data.

Here is a data handling cycle, the steps that someone may take when they are completing a data investigation.

At some point during that data handling investigation, after you've decided what it is you want to investigate, you've collected your data and you've organised it, the next step is to decide how precisely you want to analyse that data.

And this is what today's lesson is going to focus on.

Now during maths lessons, we sometimes use small datasets while we are learning a new statistical process of some kind.

That way we can practise that process on a small dataset before trying to apply it to a much larger dataset.

In data investigations, we may find ourselves working with large datasets, and the problem with large datasets is that it can be very difficult to get a sense of what the data is telling us.

This is because there's so much information that trying to look at it all by eye and get some sort of meaning out of it can be quite difficult.

For example, here we have some data taken from the Met Office.

It shows the total rainfall per month in millimetres for two regions in the UK, Cardiff and Stornoway.

The data is only taken from two years, 2021 and 2022, which is why there are 24 numbers for each region.

Look at these numbers, because it's a big set of numbers for each region, it's quite difficult to get any kind of meaning or comparison out of these things just by looking at the numbers.

For example, it's not obvious which region gets more rainfall.

We could probably figure it out by looking really carefully at it all, but it's not obvious which one gets more rainfall and it's not obvious in which region the amount of rainfall is more varied, for example.

And this is only data from two years.

If we had used data from 1942, which is where the dataset originally starts from, all the way up to 2022, there would be 960 numbers in each dataset.

So looking at those numbers in their entirety is quite difficult for us to do.

This is where statistical summaries can come in handy.

Statistical summaries can be used to represent an entire dataset with just a small amount of information.

Statistical summaries make it easier to make comparisons or draw conclusions from that data because you're just focusing on a small amount of information, rather than trying to process the entire dataset in one go.

For example, rather than showing all 960 data points for Cardiff and Stornoway, I could present this summary.

The mean rainfall per month in Cardiff is 91.3 millimetres, whereas the mean rainfall in Stornoway is 105.7 millimetres.

With just having one number for each region, we can now make a bit more of a comparison between the data of these sets.

It is now much easier to see that Stornoway got more rainfall on average than Cardiff did during these years.

Statistical summaries may include averages and measures of spread for datasets.

An average is a single number that expresses the central or typical value of that set of data.

And averages may include the mean, which is calculated by adding up all the data points and then dividing by the number of data points.

Averages include the median, which is the most central or middle piece of data, or an average might be the mode, which is the data value which has the highest frequency in a set.

So those are averages.

Measures of spread, on the other hand, describe how spread out a dataset is. The spread of a dataset can be measured using the range, which is found by subtracting the lowest value from the highest value.

It tells you how far apart those extreme ends of a dataset are.
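
As a rough sketch of those definitions, the calculations could be written in Python like this. The dataset `values` is made up purely for illustration and is not taken from the Met Office data.

```python
from statistics import median, mode

# Hypothetical small dataset, invented for illustration only.
values = [3, 5, 5, 6, 8, 9, 13]

# Mean: add up all the data points, then divide by how many there are.
mean_value = sum(values) / len(values)

# Median: the middle value once the data is in order.
median_value = median(values)

# Mode: the data value with the highest frequency.
mode_value = mode(values)

# Range: subtract the lowest value from the highest value.
range_value = max(values) - min(values)

print(mean_value, median_value, mode_value, range_value)
```

Here the three averages each give one number for the centre of the data, while the range gives one number for how spread out it is.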

For example, the statistical summaries below are based on weather data from the Met Office for two regions in the UK.

The data here is about maximum daily temperatures from 1957 to 2022, for these two places here.

We have the median maximum daily temperature and the range in the maximum daily temperatures.

Looking at this data summary of these two regions, which town tends to be warmer and in which town are the temperatures more consistent? Pause the video, have a think, and press play when you're ready to continue.

Based on these statistical summaries, if we look at the median, which is an average, we can see that Heathrow tends to be warmer than Armagh.

That doesn't necessarily mean that every single day in Heathrow is always warmer than every single day in Armagh, it might be warmer one way or the other, but, generally overall, it tends to be warmer in Heathrow because the median is higher.

And in which town are the temperatures more consistent? Well, if the data has a big range, it means there's a great difference between the warmest maximum daily temperature and the coldest maximum daily temperature.

Whereas if it has a small range, it means that those two extremes of the dataset are closer together and the data is more consistent.

So we're looking for the one with the smallest range, and that means in Armagh, the temperatures tend to be more consistent, closer together than in Heathrow.

Now, not all types of averages are appropriate to use in every dataset.

There are many factors that may affect which type of average we want to use.

For example, the data below is from a survey of 10 people about how they travel to work.

And the data that we've collected are things like car, bus, walk, and so on.

The mean and the median can only be used to represent numerical data.

So these are not appropriate to summarise the data in this set.

For example, you can't add up a car, bus, walk, and so on, and then divide by 10 to get some kind of number like you do for the mean.

So the mean and median are not appropriate here, whereas the mode would be an appropriate type of average for this data, because the mode looks at which data value or data point has the highest frequency.
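
A minimal sketch of this idea in Python is below. The ten survey responses are hypothetical, since the actual responses are only shown on the slide. The mode works on categories, but trying to add the responses up, as the mean would require, fails.

```python
from statistics import mode

# Hypothetical responses from 10 people (the real survey data is not
# given in the transcript).
responses = ["car", "bus", "walk", "car", "cycle",
             "car", "bus", "walk", "car", "train"]

# The mode only needs to count how often each response appears.
most_common = mode(responses)

# The mean needs a total, and categories cannot be added together.
try:
    total = sum(responses) / len(responses)
    mean_is_possible = True
except TypeError:
    mean_is_possible = False

print(most_common, mean_is_possible)
```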

Let's check what we've learned.

True or false, the mean is always the best average to use? Choose true or false, and a justification.

Justification, A, the usefulness of an average can depend on the data that has been collected and justification, B, the mean performs a calculation using every data point in the set.

Pause the video, make your choices, and press play when you're ready for the answers.

The answer is false.

The mean is not always the best average to use.

Sometimes it is, but not always, and that's because the usefulness of an average can depend on the data that has been collected.

Lucas collects data about the favourite genres of books from a sample of pupils in his school.

Which would be the most appropriate measure of average for Lucas's data? Your options are A, the mean; B, the median; or C, the mode? Pause video, make a choice, and press play when you're ready for an answer.

The answer is C.

The mode would be the most appropriate measure of average for Lucas's data because what we've got here is categories of data, comedy, horror, fantasy, and so on.

So the mode tells us which one is the most frequent out of those, which in this case would be fantasy.

We've seen so far how the usefulness of an average may depend on the type of data that is being used, but it may also depend on the shape of the data as well.

For example, this dot plot represents the scores of a class on a test of some sort, and we can see that everyone has scored between 40 and 49 marks in this test.

There are two people who scored 41, two people scored 42, and so on, but there's three people who scored 40.

Now, we should remember that an average aims to give us a snapshot of where the set of data is.

Averages represent typicality or the central tendency of a dataset.

So if you told someone the average, they'd get a sense of, ah, okay, most people in that class scored this or around this.

Now, the median and the mean do this pretty well.

The median's 44, the mean's at 44.3.

So if you told someone the average for this class was 44, then they would get a sense of people scored around 44 in this test.

But the mode in this case gives us a slightly different impression because the mode here being 40 is actually the lowest score that someone got.

So it doesn't really represent the class in an accurate way.

It gives a bit of a false impression that the class did worse than they probably did.

So the mode does not give an accurate representation of this particular class.

In this dataset, which also represents test scores for another class, we can see here that the vast majority of pupils scored either 40, 41, 42, or 43 marks.

There are a couple of people who scored really highly.

To try and get a sense of how well this class scored, if you wanted to summarise this class to someone else who couldn't see all this data, then the mode and median give a good sense of where the vast majority of pupils were in this test.

The mode being 41, the median being 42, and most people scored between 40 and 43.

So the mode and median are pretty representative averages here, but the mean in this case is higher than what the vast majority of students scored.

There are only two students who scored higher than the mean.

So the mean does not give an accurate representation for this particular class.

In this case, the reason why the mean is so high compared to the other data is because of those two extremely high values dragging the mean up.
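
The effect of a couple of extreme values can be sketched in Python. The scores below are hypothetical, chosen only so that the mode is 41 and the median is 42 as in the example; they are not the scores from the slide.

```python
from statistics import mean, median, mode

# Hypothetical test scores: most pupils score 40 to 43, two score very highly.
scores = [40, 40, 41, 41, 41, 42, 42, 43, 43, 58, 60]

mean_score = mean(scores)
median_score = median(scores)
mode_score = mode(scores)

# How many pupils scored above the mean?
above_mean = [s for s in scores if s > mean_score]

# The same class with the two extreme scores left out.
mean_without_extremes = mean(scores[:-2])

print(round(mean_score, 1), median_score, mode_score)
print(len(above_mean), round(mean_without_extremes, 1))
```

With the two high scores included, only those two pupils sit above the mean. With them left out, the mean falls back among the scores that most pupils achieved.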

In this example, we can see here that the mode, the median, and the mean are all the same.

They all show us where the middle of the data is, where the centre of it is, which is also the most frequently achieved score on that test.

So all the averages give a similar representation of the class.

Let's check what we've learned there.

Which measure of average will be different to the others within this set of numbers? The options are A, mean; B, median; and C, mode.

You might get a sense of it just by looking at the numbers or, if you're not sure, you could work out the mean, median, and mode and see which one's different.

Pause the video, have a go at this question, and press play when you're ready for an answer.

The answer is C, the mode.

The reason why the mode is different to the others is that the mean and median tell you where the middle or centre of the data is, but the mode tells us the most frequently occurring data value, and that just happens to be the lowest value in this set of numbers here.

So the mode would be drastically different to the mean and the median.

Which measure of average is least representative for the vast majority of data in this dataset? Your options are A, mean; B, median; and C, mode.

Pause video, make a choice, and press play when you're ready for an answer.

The answer is A. The mean is the least representative for the vast majority of data here because we can see the vast majority of data is somewhere between 19 and 22, but the mean is lower than that.

The mean is 18.9, so it doesn't represent the vast majority of this data as well as the others.

Okay, over to you now for Task A.

This task contains two questions which are both visible on this slide.

In Question 1, we've got four datasets and all four of these datasets have a mean of 7.

You need to decide in which datasets do you think that this average of 7 provides a good summary of the information in the set.

Choose from A, B, C, and D and give reasons for your answer.

And in Question 2, you need to find the mean, median, and mode for that dataset, and then decide which average out of those is the least representative of the central tendency of this data and explain why as well.

Pause video, have a go, and press play when you're ready for some answers.

Well done with that.

Let's work through Question 1 together now.

So we've got our four datasets, and we need to decide for which datasets we think an average of 7 provides a good summary of the information.

Well, let's look at dataset A.

All those numbers are 1 except for one of them, which is 43.

So an average of 7 probably doesn't represent dataset A very well.

43 is much bigger than all of those other numbers, which are all equal to 1.

In dataset B, all those numbers are 7, so to say that the average is 7 sums that dataset up pretty well.

So yes, all the numbers are 7.

In C, we've got lots of decimal numbers from 1.5 to 14.4.

If we say that the average is 7, well, yes, 7 is roughly the centre of that data and many of the numbers in that list are actually quite close to 7, so to say that the average is 7 sums that dataset up pretty well.

And in D, we can see we've got numbers from 4 to 10.

Seven is in the centre of that data and each number that is below 7 has a matching pair, which is the same amount above 7, so yes, to say the average is 7 summarises that data pretty well as well.
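
Datasets A and B from Question 1 can be checked with a short Python sketch. The values in `dataset_a` are inferred from the description (every number is 1 except a single 43, and the mean is 7, which only works with six 1s). The number of 7s in `dataset_b` is not stated, so seven values is an assumption.

```python
from statistics import mean, median

# Inferred from the description: six 1s and one 43 give a mean of 7.
dataset_a = [1, 1, 1, 1, 1, 1, 43]

# Assumed length: the transcript only says every value is 7.
dataset_b = [7, 7, 7, 7, 7, 7, 7]

mean_a, median_a = mean(dataset_a), median(dataset_a)
mean_b, median_b = mean(dataset_b), median(dataset_b)

# How many values in each set are actually close to the mean of 7?
close_a = [x for x in dataset_a if abs(x - 7) <= 1]
close_b = [x for x in dataset_b if abs(x - 7) <= 1]

print(mean_a, median_a, len(close_a))
print(mean_b, median_b, len(close_b))
```

Both sets share a mean of 7, but no value in A is anywhere near 7, whereas every value in B is exactly 7.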

And Question 2, we need to find the mean, median, and mode for this set of data.

Our mean is 6, our median is 4, and our mode is 9.

And which average is least representative? Well, the mode is the least representative because it happens to be equal to the highest value in this dataset so it doesn't really represent the whole dataset very well.

Well done so far.

Let's now move on to Learn Cycle 2, which is choosing statistical measures based on their context.

The choice of average we use in a statistical summary can depend on the context in which it's being used.

For example, the type of data used can cause some measures of average to make less sense than others in that particular context, particularly if the data's non-numerical, the mean doesn't really make a lot of sense.

The shape of the data as well can also cause some measures of average to be less representative than others for that particular data too.

And sometimes the measure of average that someone chooses to use may depend on what message they're trying to communicate to their audience.

For example, a shoe shop sells pairs of shoes in lots of different sizes.

The manager is planning how many pairs of each size to stock.

They don't want to stock loads of every single size of shoe because it's going to cost them a lot of money to stock all that and it takes up a lot of storage space.

So the manager's trying to be savvy about which sizes to have more of than others in their stock.

They intend to use averages to inform their decision.

So the manager collects some data about shoe sizes sold.

The data below shows the sizes for the last 19 pairs of shoes that were sold, and they're very handily put in order for us.

Let's find the mean, median, and mode of this data.

Pause the video, have a little think, and then press play when you're ready to continue.

So here are our averages.

We've got a mean of 7.1, a median of 7, and two modes, 5 and 9.

Let's think about which measures of average do you think will be most helpful to this manager in this context.

Pause the video while you think about this and press play when you're ready to continue.

We could probably make more sense of this question by looking at a dot plot of the data.

Here we can see the last 19 shoe sizes sold represented by dots above the numbers.

In this case, we can see that the median was actually one of the least popular shoe sizes.

So it would not be helpful for the manager to stock lots of shoes which are size 7 based on the median being 7, because not many people bought a size 7 shoe in this data collection.

The mean being 7.1 suggests that the manager should probably stock shoes which are size 7 and 8, but again, these shoe sizes were among the least popular.

So that average also isn't particularly helpful to the manager in this context.

The mode, on the other hand, tells the shop manager which shoe sizes were sold most often.

Now this would be an appropriate measure of average for this particular context because the manager wants to stock up on the shoes that sell the most.

So not all three of these types of average were as useful as each other in this particular context.

The median and mean weren't particularly useful, but the mode was.

The dot plot highlights to us why using the mean and median could have resulted in making an inappropriate decision in this particular context of the problem.
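
The shoe shop reasoning can be sketched in Python. The 19 sizes below are hypothetical: they are not the data from the slide, only a set invented to be consistent with the averages quoted (a mean of about 7.1, a median of 7, and modes of 5 and 9).

```python
from statistics import mean, median, multimode

# Hypothetical sizes of the last 19 pairs sold, already in order.
sizes_sold = [4, 4, 5, 5, 5, 5, 5, 6, 7, 7, 8, 8, 9, 9, 9, 9, 9, 10, 10]

mean_size = round(mean(sizes_sold), 1)
median_size = median(sizes_sold)
modal_sizes = multimode(sizes_sold)  # every size tied for most sales

# Tally how many pairs were sold in each size, like reading a dot plot.
sales_by_size = {size: sizes_sold.count(size) for size in sorted(set(sizes_sold))}

print(mean_size, median_size, modal_sizes)
print(sales_by_size)
```

The tally plays the role of the dot plot: in this invented data, size 7 sells only twice, while sizes 5 and 9 sell five times each, so stocking by the mean or median would favour a size that few customers buy.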

Here's another context.

A company has 37 employees and the dot plot shows the salary of each employee rounded to the nearest thousand.

So for example, we can see the lowest two points are above the 21.

Those two employees earn 21,000 pounds rounded to the nearest thousand.

We can see that most people earn between 21,000 pounds and 29,000 pounds.

There is someone who earns 37,000 pounds and someone who earns 52,000 pounds, presumably the manager or owner.

And we've got the mean, mode, and median all presented on there as well.

Now, based on this data, which measure of average could be used to give the impression that people are paid well in this company? Mean, median, or mode? Pause the video while you think about this and press play when you're ready to continue.

The type of average that gives the impression that people are paid well in this company, or better than the other averages suggest, is the mean.

You can see that the mean, 25.1, is higher than the median and the mode of 24.

The mean is also higher than what the vast majority of people in this company earn.

There are not that many people who earn more than 25,000 pounds.

So the mean gives quite a favourable impression of the salaries of these employees.

The reason why that mean is dragged up so high compared to the other averages is that there are two people who are paid much more highly than the rest of the company.

So which measure or measures of average could be used to give the impression that people are paid less well? Pause the video while you have a think and press play when you're ready to continue.

The median or mode give the impression that people are paid less well than the mean does because they are a bit lower.

So this is an example of where the choice of average that someone makes may depend on the message that they're trying to portray to people.

For example, if we want to give the impression that the employees of this company are paid very well, then we may use the mean 'cause it's the higher of the three averages, whereas if you want to give the impression that the employees are not paid so well, we may choose the median or mode because it's a lower average than the others.
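
One way to see how the choice of average changes the message is to compute all three and compare them. The sketch below uses a small hypothetical set of salaries in thousands of pounds, not the 37 salaries from the dot plot, and the helper name `averages` is made up for this illustration.

```python
from statistics import mean, median, mode


def averages(data):
    """Return the three averages of a numerical dataset by name."""
    return {"mean": mean(data), "median": median(data), "mode": mode(data)}


# Hypothetical salaries in thousands of pounds, with two high earners.
salaries = [21, 21, 22, 23, 24, 24, 24, 25, 26, 27, 37, 52]

result = averages(salaries)

# The average that gives the most favourable impression of pay.
most_favourable = max(result, key=result.get)

# The average or averages that give the least favourable impression.
lowest_value = min(result.values())
least_favourable = [name for name, value in result.items() if value == lowest_value]

print(result)
print(most_favourable, least_favourable)
```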

So let's check what we've learned there.

The data shows Sophia's last 21 scores on a video game.

Which average will be most useful to suggest that Sophia has scored well? A, the mean; B, the median; or C, the mode? Pause the video, make a choice, and press play when you're ready for an answer.

In this case, if Sophia wants to give the impression that she's scored very well in this video game, she could choose the mode as the average that she reports to people.

Same scenario again, which average will be most useful to suggest that Sophia has not scored so well? A, the mean; B, the median; C, the mode.

Pause the video, make a choice, and press play when you're ready to continue.

In this case, if we wanted to give the impression that maybe Sophia hasn't scored so well, we'd probably choose the mean because that's the lowest of the three averages.

Okay, over to you now for Task B.

This task contains two questions.

And here's Question 1.

Here we have a table of data taken from the Met Office about weather in Cambridge during the year of 2004.

What the data shows is the number of days in each month where there was frost.

So we've got the months going across the top, in the top row, and the bottom row tells you how many days of frost there were in each of those months.

And underneath you've got four questions to consider.

Pause the video, have a go at these, and press play when you're ready for Question 2.

And here is Question 2.

We have a scenario where a homeowner is selling their house.

The data below here shows the prices for the last five houses that were sold on the same street and underneath you've got five questions to consider.

Pause the video, have a go at these, and press play when you're ready for some answers.

Great work.

Let's go through Question 1 together now.

So find the mean, giving your answer to two decimal places. That would be 2.75.

The median would be zero and the mode would be zero as well.

And then part D, which measure of average would be the least useful for describing the amount of frost that could be expected during a typical month in Cambridge? Well, out of those three averages, the mean would probably be the least useful for describing a typical month in Cambridge.

The mean was 2.75, whereas you can see here that in the vast majority of months there are no days of frost.

So 2.75 doesn't really give the right impression.

2.75 would suggest that each month typically gets two to three days of frost, but most months don't have any frost.

So that average doesn't really represent a typical month in Cambridge very well.

And Question 2, part A.

We need to find the mean and the median house price.

Our mean is 284,000 and our median is 350,000.

So part B, out of these two averages, which one suggests that houses are worth more money here? That would be the median because that's 350,000. And which average suggests that the houses are worth less money? That's the mean, that's 284,000.

The seller would probably prefer to use the median because it gives the impression that the houses are worth more, so the seller would get more money for their house, whereas the buyer might prefer to refer to the mean because that suggests the house is worth less money, so it'd be cheaper for them to buy.

Marvellous work today.

Here's a summary of what we've learned in this lesson.

The overall theme of the lesson has been about how some statistical measures may be more appropriate than other statistical measures in a scenario based on a range of different factors.

That could be that the type of data that we're handling may mean that certain statistical measures are not appropriate.

It might be that the nature of the data could mean that a certain statistical measure is not appropriate, or it could be that certain statistical measures just don't make sense in the context of that data.

Great job today.