Hello, my name is Dr. Rowlandson, and I'll be helping you with your learning during today's lesson.

Let's get started.

Welcome to today's lesson from the unit of numerical summaries of data.

This lesson is called statistical problems with drawing conclusions, and by the end of today's lesson we'll be able to effectively summarise and communicate conclusions to a statistical investigation.

Here are some keywords that you may be familiar with and we'll be using again during today's lesson.

This lesson contains two learn cycles.

In the first learn cycle, we'll be focusing on how to go about drawing conclusions from a statistical investigation.

And in the second learn cycle, we'll be acknowledging the limitations and the problems that we may face with statistical conclusions.

But let's start with learn cycle one, just drawing conclusions from a statistical investigation.

Here is a data handling cycle.

The steps that someone may take when they are conducting a data investigation.

Let's go through these steps now one at a time.

A data investigation needs to start with some sort of question or problem in mind that the researcher wants to solve.

The whole reason for doing the investigation in the first place.

Following this, they then will want to collect some data.

Once they know what the question is they wanna focus on, it becomes much easier to know what data to collect.

And while we're collecting that data, we should be careful not to do anything that might cause unnecessary bias within the data.

Once we've collected the data, we then need to organise it and present it in some sort of format that makes it easy to analyse and interpret.

For example, it might get put into a table or a graph of some kind.

Following this, we'll then analyse the data by using whichever methods are appropriate for addressing that particular problem.

It might be by calculating averages, or it might be by calculating the range, or plotting some sort of graph, whatever it is that addresses the question we have in mind.

After this, once we've created our statistical summaries, or graphs, or whatever it is we need, we can then go about drawing conclusions from our data and also present the data in a way that illustrates our conclusions to other people.

And finally, we have outcomes.

Our data investigation may lead to some decisions or actions being taken that address the problem that prompted the whole investigation in the first place.

And it's these last two steps that we'll be focusing on during today's lesson.

Let's start with an example in the context of a restaurant, and maybe imagine that we are a restaurant manager about to make some decisions.

A restaurant can make money by serving lots of customers while also keeping their outgoing expenses low.

They also need to ensure that while they're doing this, they still provide a good service because if they don't provide a good service, customers won't come back and then they won't make money.

A restaurant manager needs to plan the number of waiters to work each evening of the week.

So let's say we understand the context of the manager's problem here.

What problem could be caused by not scheduling enough waiters on a particular evening?

And what problem could be caused by scheduling too many waiters on a particular evening?

Pause the video, have a think about these questions, and press play when you're ready to continue.

So problems that could be caused by not scheduling enough waiters could be that they don't serve as many customers and therefore they make less money.

Also, if they don't have enough waiters, they might end up providing a bad service to the customers.

The customers might be waiting for a long time for their food, for example, and then they don't come back, which means in the future, the restaurant makes less money.

So you might think, "Well, let's just put on loads of waiters every single evening, more than we need, and that way we can get customers through the door and out the door quickly and they all get a good service. Why not do that?"

Well, the problem with that could be that the restaurant then spends money on wages, and if they're spending more money than they need to on wages, then they don't make money that way either.

They start to lose money on outgoing expenses.

So to address this problem, the manager collects some data.

They collect data about the number of customers they serve each day for a month, and that's represented in the table here.

The restaurant manager then plans to calculate some averages, but we've got a couple of options here.

Option one would be to calculate the average number of customers for each week.

The average number of customers in week one, the average number of customers in week two.

Option two, we could calculate the average number of customers for each day of the week.

On average, how many customers do they get on a Thursday?

On average, how many do they get on a Friday?

Which of those options will be the most helpful for the manager for deciding how many staff to schedule on each evening?

Pause the video, have a think, and press play when you're ready to continue.

To address the manager's problem here, thinking about how many staff to put on in the future on each day of the week, probably option two will be most helpful because it might be that they get more customers on one day a week than the other, so they need to have more staff on one day a week than the other.

So the mean number of customers is calculated for each day of the week, and we can see that on the bottom row of the table there.

So this shows the manager how many customers they might expect to serve on each evening of the week.
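As a rough illustration of the calculation the manager carries out here, the following is a minimal Python sketch. The customer counts are invented for illustration (the table shown in the video is not reproduced in this transcript), and the names `days`, `customers` and `mean_per_day` are hypothetical.

```python
from statistics import mean

# Hypothetical customer counts: one list per week, ordered Thursday to Sunday.
# These numbers are invented for illustration; they are not the lesson's table.
days = ["Thursday", "Friday", "Saturday", "Sunday"]
customers = [
    [80, 150, 190, 140],  # week 1
    [76, 148, 186, 144],  # week 2
    [84, 146, 182, 142],  # week 3
]

# Option two: one mean per day of the week (working column by column).
mean_per_day = {
    day: mean(week[i] for week in customers) for i, day in enumerate(days)
}

busiest = max(mean_per_day, key=mean_per_day.get)
quietest = min(mean_per_day, key=mean_per_day.get)

print(mean_per_day)
print("Busiest:", busiest, "| Least busy:", quietest)
```

With these made-up figures, the per-day means point to the same kind of conclusion as in the lesson: one evening stands out as the busiest and another as the quietest, which a per-week average would not reveal.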

So based on that, what conclusions could be drawn from these averages?

And how could these conclusions help the manager decide how many waiters to schedule?

Pause the video, have a think about this, and press play when you're ready to continue.

One conclusion we might make now we can see the averages from this, is that Saturdays are the busiest day of the week for the restaurant and Thursdays are the least busy.

So what actions might this lead to?

Well, we could schedule more waiters on the busier days and fewer waiters on the less busy days.

So perhaps schedule more waiters on Saturdays and save a bit of money by having fewer waiters on a Thursday.

So statistical summaries can be a useful way to make sense of a large amount of data.

This can be helpful for drawing conclusions and deciding what actions may be needed.

Graphical representations of data can also be useful tools for drawing conclusions.

They can help you to investigate findings in more depth by allowing you to make comparisons between different aspects of a dataset.

For example, a town council is investigating ways to manage the levels of greenhouse gas emissions in their town.

They collect data about the volume of emissions from different sectors of the town.

We've got a bar chart here.

We can see we've got industry, commercial, domestic, being like home, transport, and agriculture.

There's a bar for each of those things.

And the vertical axis shows the amount of greenhouse gas emissions from each of them.

So what conclusions could we draw from this bar chart?

Pause the video, have a think, and press play when you're ready to continue.

Well, one thing we could conclude is that transport causes the most emissions in this particular town here.

We can also conclude that domestic causes the second-highest amount of emissions.

So this gives the council a good indication of what they might wanna focus on in this problem here.

So the council chooses to examine the emissions from transport further.

They plot a scatter graph.

The scatter graph here shows data from the ONS, that's the Office for National Statistics, about the population sizes and transport emissions for regions around the UK in 2019.

Along the horizontal axis, we've got the population in thousands.

So where we can see 100 on the axis, it means 100,000.

And up the vertical axis, we've got the transport greenhouse gas emissions, and in the brackets is the units for measuring greenhouse gas emissions.

What correlation does this scatter graph show then about population and transport greenhouse gas emissions?

And what does that tell us about the relationship between population and transport greenhouse gas emissions?

Pause the video, have a think, and press play when you're ready to continue.

This scatter graph shows us positive correlation, and that means the greater the population of a town or region, the more transport greenhouse gas emissions they tend to emit.

So let's compare this town now to the data from the entire UK.

This town has a population of 101,000 people, and their greenhouse gas emissions from transport were 307 of those units.

What transport emissions might you expect from a town with this population of 101,000 people?

Look at the scatter graph to get a sense of where 101,000 people is and the sorts of emissions that towns with a similar population emit to that.

And our second question we might wanna think about is, is this town's transport emissions unusually high for its population?

Pause the video, have a think about these questions, and press play when you're ready to continue.

If we look at 101 on the horizontal axis, roughly where it's just after 100, and then look at the points that are already there and how much greenhouse gas emissions they tend to emit, it's around about here.

That's between 50 and 450 of those units.

So while for this town transport was the greatest contributor to greenhouse gases, what we can see is their greenhouse gas emissions for transport is not unusually high for the amount of people who live in that town.

It kind of fits with the rest of the country.

So the council then look at ONS data to examine domestic greenhouse gas emissions 'cause that was the second highest emitter for this town.

This town's domestic emissions were 281 units.

So based on this scatter graph, what domestic emissions would you expect for a town with a population of 101,000?

And once you've found that, decide is this town's domestic emissions unusually high for its population?

Pause the video, have a think, and press play when you're ready to continue.

If we do the same thing again, and look at 101,000 on the population and look at the points that are above that, we can see for all the other towns in the UK and regions, the domestic emissions is somewhere between 120 and 180 units.

So then when we plot this town's data on this scatter graph, we can see it's up here, which is way above all the other towns with a similar population to this one.

So, yes, this town's domestic emissions is unusually high for its population.
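The comparison just made on the two scatter graphs can be sketched in Python. The figures are the ones quoted in the lesson for regions with a population near 101,000. The function name `outside_usual_range` is hypothetical, and treating "outside the range seen for similar regions" as "unusual" is a simplifying assumption rather than a formal statistical test.

```python
def outside_usual_range(value, usual_low, usual_high):
    """Return True if value lies outside the range seen for similar regions."""
    return value < usual_low or value > usual_high

# Figures quoted in the lesson (same units as the vertical axes of the graphs).
transport_unusual = outside_usual_range(307, usual_low=50, usual_high=450)
domestic_unusual = outside_usual_range(281, usual_low=120, usual_high=180)

print("Transport outside usual range?", transport_unusual)  # False
print("Domestic outside usual range?", domestic_unusual)    # True
```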

Let's now summarise what we've seen in this particular scenario.

First, we saw that transport emissions were the town's greatest cause of greenhouse gases.

However, when we compared this data to the rest of the UK, what we saw was that the amount of greenhouse gas emissions from transport was not unusual for a town of this particular population size.

So then we looked at the second highest emitter, which was domestic emissions.

And when we looked at that one on the scatter graph, we saw that, yes, the domestic emissions for this town is unusually high for a town of this population size.

So what actions might the council take following this investigation?

Pause the video and have a think about this, then press play when you're ready to continue.

Well, there's a few things I might consider here.

The council might try to maintain or reduce transport emissions because while it's not unusually high for a town of this population, transport emissions are still the greatest cause of greenhouse gases in this town, and they probably wouldn't want the amount of greenhouse gases from transport to get any higher.

So by trying to maintain that, or taking steps to reduce it, might be a helpful thing for the council to focus on.

But in particular, they might wanna focus on reducing the domestic greenhouse gas emissions because they are unusually high for a town of this size there.

Let's check what we've learned so far.

The table shows the number of customers per evening at a restaurant.

Based on this table, which week was the busiest?

Is it a, week one, b, week two, or c, week three?

Pause video, make a choice, and press play when you're ready for an answer.

The answer is a, week one was the busiest, and we can see that because the total for week one was greater than the total for the other two weeks.

How many customers should the restaurant expect on a Sunday?

Your options are a, 132, b, 142, c, 143, and d, 151.

Pause the video, make a choice, and press play when you're ready for an answer.

The answer is b, 142.

And we can see that because the mean number of customers that they serve on a Sunday is 142.

That represents a typical week or the central tendency of data on a Sunday.

In which scatter graph is the point that's marked with a cross inside the usual range of data?

Is it a, b, or c?

Pause the video, make a choice, and press play when you're ready for an answer.

The answer is b.

Okay, it's over to you now for Task A.

This task contains two questions with each question presenting you with a scenario and some data to consider with that scenario.

In question one, a bus company has 98,000 passengers per day and the table shows the mean number of passengers they have at each time of the day.

And then you've got two questions to consider with this table.

Pause the video, have a go at these two questions, and then press play when you're ready for question two.

And here is question two.

A mobile phone provider manages the number of cell phone towers per town.

Now, the table shows us the population and the number of cell phone towers for three towns, town A, town B, and town C.

And the scatter graph meanwhile shows us data from the Office for National Statistics about other regions across the UK, and each point on that scatter graph represents a region.

You've got the horizontal axis showing your population and the vertical axis showing you the number of cell phone towers.

Then you've got two questions to consider based on this data.

Pause the video, have a go at these, and press play when you're ready for some answers.

Well done with that.

Let's now work through question one.

The bus company wants to encourage more people to travel at unpopular times of the day by offering cheaper tickets.

Which time period should tickets be cheaper?

That would be between 6:00 PM and 10:00 PM.

And we can see that because the mean number of passengers at this time is at its lowest.

The bus company wants to improve profits by increasing ticket prices at the most popular times of the day.

For which time period should prices be increased?

That'll be between 6:00 in the morning and 10:00 in the morning.

And we can see that from the data because that is when it has the greatest mean number of passengers during that period of time.

And question two.

In which town should they instal more cell phone towers and explain why?

Well, town B would be a good place to instal more cell phone towers.

And that's because 4,000 cell phone towers is less than what most other regions have with a similar population.

In which town could they afford to switch off some cell phone towers, based on the population?

That'd be town C.

And the reason for that is there are 6,000 cell phone towers in town C, and when we compare that to other towns with a similar population size, that's much higher than other towns with a similar population size.

Well done so far.

We've been focusing on how to draw conclusions from a statistical investigation.

Now let's acknowledge the limitations and the problems that we may face with our conclusions.

Here we have a scenario where Lucas has surveyed a sample of 10-year-old children about which genres of books they like to read, and so far he's presented his data in a tally chart.

Sofia is organising a book collection for an elderly care home.

And Sofia looks at Lucas's data and thinks, "Lucas's data shows that fantasy and comedy books were the most popular, therefore I'll try and find lots of those for the collection."

What could be wrong with Sofia's conclusion?

Pause the video, have a think, and press play when you're ready to continue.

It's great that Sofia is trying to use data to inform her decision.

She's not just choosing books at random.

She's trying to base her decision on what people might like, but elderly people might not like the same genres of books as children.

So to take the data that Lucas has collected and try and apply it to her scenario might not be the most helpful thing.

So when we make conclusions from a data investigation, we should consider the limitations to those findings.

Care should be taken not to overgeneralise the results from a statistical investigation.

Just because there's a conclusion found in one context doesn't mean that conclusion will apply to all contexts.

For example, Izzy says, "I surveyed 20 people and found that they spend on average one hour and 32 minutes watching TV per day."

Then Sam says, "Wow, this must mean that everyone in the world watches TV for one hour and 32 minutes every day."

Sam's conclusion may not be valid because the time watching TV may vary according to age.

It may vary according to country or other factors.

Overgeneralising can be caused by extrapolating findings beyond the range of a dataset.

For example, we have a scatter graph here with a horizontal axis showing us the total sunshine duration in hours, that is, for how many hours per month sunshine was visible.

And the vertical axis shows us the mean daily maximum temperature in degrees Celsius.

And that's calculated by finding the maximum temperature for each day of the month and then finding the mean of those values.

And each of these dots on the scatter graph represents a month in this particular region.

We can use the scatter graph to make predictions about one variable based on the other, but if we do that beyond the range of the data that's already been collected, that is extrapolating and is not as reliable.

For example, Andeep says, "If next June has 450 hours of sunshine, then we should expect the maximum temperature to be somewhere between 40 and 50 degrees Celsius."

Andeep's conclusion may not be valid because he has extrapolated beyond the data range.

We've never seen data like that before.

We've never seen a month with 450 hours of sunshine on this record, so we don't necessarily know how that amount of sunshine will affect the daily temperatures.
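One way to make this check concrete is to test whether a value falls inside the range of data already collected before using it for a prediction. The sketch below is in Python. The sunshine figures are invented for illustration (the data behind the lesson's scatter graph is not reproduced in this transcript), and the function name `is_extrapolation` is hypothetical.

```python
# Hypothetical records: total sunshine hours for each month observed.
# Invented values for illustration only.
sunshine_hours = [45, 60, 110, 160, 200, 215, 230, 190, 140, 100, 65, 40]

def is_extrapolation(x, observed):
    """Return True if x lies outside the range of the observed values."""
    return x < min(observed) or x > max(observed)

# 450 hours is beyond anything in the record, so a prediction there is extrapolation.
print(is_extrapolation(450, sunshine_hours))  # True
# 150 hours sits within the observed range.
print(is_extrapolation(150, sunshine_hours))  # False
```

A check like this does not say what the temperature would be. It only flags that a prediction at that value rests on no comparable data.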

We should also consider anything that might have biassed the investigation before we go about drawing any conclusions from it.

For example, Jacob says, "I went to a gym and surveyed a sample of adults about how much time they spend exercising per week. The mean amount of time was four hours and 28 minutes."

And Laura says, "This must mean that every adult exercises for around four hours and 28 minutes per week."

Can we see a problem with this?

Laura's conclusion may not be valid because people at the gym may exercise more than the general population.

Conclusions that are made about data from a sample may only be valid for the sample of data itself or a population that is fully represented by that sample.

For example, Aisha says, "I surveyed 20 people aged 11 years old about how much time they spend doing homework. The mean was five hours per week."

Alex says, "People in the school spend around five hours per week doing homework."

Whereas Jun says, "11-year-olds in this school spend around five hours a week doing homework."

Who do you reckon has the most valid conclusion here, Alex or Jun?

Pause the video, have a think, and press play when you're ready to continue.

The most valid conclusion here would probably be Jun's.

That's because Alex has taken Aisha's results and extrapolated it to the full school.

Alternatively, Aisha says, "I surveyed four people from each year group in my school about how much time they spend doing homework. The mean was 6.2 hours per week."

Alex's conclusion was, "People in this school spend around 6.2 hours a week doing homework."

And Jun's conclusion was, "School children in the UK spend around 6.2 hours a week doing homework."

Who has the most valid conclusion this time, Alex or Jun?

Pause the video, have a think, and press play when you're ready to continue.

This time, Alex has the more valid conclusion, and that's because Jun has extrapolated Aisha's findings beyond the school and applied it to all school children around the UK.

Let's check what we've learned there.

True or false?

Conclusions made from an investigation can always be applied to the whole population.

Choose true or false and one of the justifications.

Pause video, make your choices, and press play when you're ready for an answer.

The answer is false because the conclusion is only valid for the sample or a population that is fully represented by the sample.

Let's now explore some limitations and problems by looking at a scenario in depth.

A chain of five ice cream parlours each sell the same flavours of ice cream.

The manager is looking to improve sales by replacing a flavour.

The pie chart shows the proportion of sales for each flavour in one of these parlours.

We can see we've got strawberry, vanilla, mint, and chocolate represented in this pie chart.

So based on this, which flavour is the most popular and which flavour is the least popular?

Pause the video, have a think, and press play when you're ready to continue.

We can see that the most popular flavour is vanilla.

It has the biggest sector in this pie chart.

And the least popular flavour in this parlour is strawberry 'cause it has the smallest sector in this pie chart.

So the manager investigates this further by collecting data from all five of their ice cream parlours.

The stacked bar chart shows the percentage of sales for each flavour at each of the parlours.

Based on this bar chart, which flavour tends to be the most popular across the chain, and which flavour tends to be the least popular across the chain?

Pause the video while you think about these and press play when you're ready to continue.

We can see that the flavour which is most popular across the chain tends to be vanilla.

That's because in most of the bars, vanilla has the biggest section of the bar.

We can see there's a lot of blue in each of those bars.

And which flavour tends to be the least popular across the chain?

Well, that would be the strawberry flavour.

So after recognising that strawberry flavour is unpopular in quite a few of the parlours, the manager collects the percentages of sales that are for strawberry in each of the parlours.

What is the mean percentage of sales that are for strawberry flavour across the chain?

Pause the video while you calculate this mean and press play when you're ready to continue.

The mean is 14.42%.

You may have rounded it differently.

So what conclusions might the manager then draw from this investigation bearing in mind that he's looking to replace one of the ice cream flavours to improve sales?

Pause the video, have a think, and press play when you're ready to continue.

The manager might draw the conclusion that the strawberry flavour is not a very popular flavour across the chain.

So based on the conclusions of the investigation, the manager then decides to replace strawberry with a different flavour across all five of the ice cream parlours.

If we look at the stacked bar chart again, what problem could there be with this outcome?

Pause the video while you think about this and press play when you're ready to continue.

Well, one problem is that while strawberry flavour is unpopular in four of those parlours, strawberry flavour is the most popular in one of those parlours, in parlour E.

You can actually see that the strawberry flavour takes up over half the sales in that particular parlour.

So if strawberry flavour was taken out of all of those ice cream parlours, parlour E would be losing its most popular flavour.
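The way a mean can hide one unusual parlour can be sketched in Python. The percentages below are invented for illustration: they are chosen only so that they are consistent with a mean of about 14.42% and with strawberry making up over half of sales in parlour E. The real values from the chart are not reproduced in this transcript, and the names `strawberry_pct` and `above_half` are hypothetical.

```python
# Hypothetical strawberry sales percentages per parlour (invented values).
strawberry_pct = {"A": 4.1, "B": 5.0, "C": 3.5, "D": 6.5, "E": 53.0}

# The single summary figure the manager looked at.
mean_pct = sum(strawberry_pct.values()) / len(strawberry_pct)
print(f"Mean: {mean_pct:.2f}%")

# Looking at each parlour individually instead of relying on the mean alone.
above_half = [name for name, pct in strawberry_pct.items() if pct > 50]
print("Parlours where strawberry is over half of sales:", above_half)
```

Under these assumed figures the mean looks low, yet one parlour depends heavily on the flavour, which is the detail the stacked bar chart makes visible.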

So what could an alternative outcome to this investigation be?

What could the manager decide differently?

Pause the video, have a think about this, and press play when you're ready to continue.

An alternative outcome could be to replace the strawberry flavour in all the parlours except for the one where it's most popular.

Keep it there perhaps.

So when we make conclusions from data investigations, care should be taken not to overgeneralise the results.

Using statistical summaries, such as averages, can provide an impression of what the data is showing and also direct attention towards things that could be important.

However, when we create statistical summaries, such as averages, some details can get overlooked if we just solely rely on those statistical summaries.

That's why using visual representations can give a broader impression of how the data looks overall.

And then conclusions from those statistical investigations can benefit from considering statistical summaries, and graphs, and the context of the data.

Let's check what we learned there.

We've got a stacked bar chart for four ice cream parlours and there are four ice cream flavours being sold at each parlour.

Which flavour is the most popular for the majority of these parlours?

Is it a, caramel, b, pistachio, c, raspberry, or d, vanilla?

Pause the video, make a choice, and press play when you're ready to continue.

The answer is d, vanilla.

We can see that the blue section of the bars tends to be the biggest for most of those parlours, so that's the vanilla flavour.

Which flavour is the least popular for the majority of these ice cream parlours?

Same options again.

Pause the video, make a choice, and press play when you're ready for an answer.

The answer is a, caramel tends to be the least popular for the majority of these parlours.

We can see that 'cause there's only a small section for caramel in the majority of these bars.

Which parlour would make the biggest loss if they stopped selling the caramel flavour assuming those customers then didn't buy other flavours instead?

Would it be parlour A, B, C, or D?

Pause the video, make a choice, and press play when you're ready for an answer.

The answer is parlour C would make the biggest loss because that's their most popular flavour.

Okay, it's over to you now for Task B.

This task contains two questions with each question providing you with a scenario and some data to think about.

In question one, the pie chart shows data taken from the ONS from a survey in 2021 about how people travel to work in the city of London.

And Alex says, "If we use the data from London as a sample then we can conclude that around 21% of people in the UK also travel to work by underground rail."

Explain why Alex could be wrong.

Write a sentence or two to describe that.

Pause the video, have a go, and press play when you're ready for question two.

And here is question two.

The table shows data from the ONS about the number of property sales each year in Liverpool.

And then in parts A and B, you've got conclusions from Andeep and Laura.

I need you to explain why each of them could be wrong.

Again, write a sentence or two for each question.

Pause the video, have a go, and press play when you're ready for an answer.

Great work with that.

Let's go through some answers now.

Explain why Alex could be wrong.

He said, "If we use this data from London as a sample, then we may conclude that around 21% of people in the UK also travel to work by underground rail."

Well, this sample only uses data from a single location, the city of London, and modes of transport may differ depending on where people live and whether they live in a town or a city.

And we need to bear in mind that not all towns and cities have an underground rail network.

And in question two, so Andeep concludes that most cities in the UK must sell around 5,837 properties per year because that's the mean number of properties sold in this data.

Explain why he could be wrong.

Well, cities vary in size.

There are some big cities and there are some quite little cities.

Bigger cities may sell more properties each year, and small cities might sell fewer properties each year.

Laura concludes that Liverpool will always sell around 5,837 properties per year 'cause that's the mean for the last five years for Liverpool.

Explain why Laura could be wrong.

The number of property sales might change in the future.

Yes, that's been the mean for the last five years, but if the population of Liverpool increases, then they might sell more houses because they might build more houses for the population, or if it decreases, then it might have the opposite effect.

Also, you might spot here that the year 2020 had an unusually low number of sales.

So there might be contextual reasons that affect the number of sales in one year compared to other years.

So we can't guarantee that Liverpool will always sell around a similar number of properties every year, 'cause it hasn't happened in the past either.

Fantastic work today.

Let's summarise what we've learned in this lesson.

The general theme of the lesson has been about drawing conclusions from a data investigation.

And conclusions can be drawn from statistical summaries and also from graphical representations made during an investigation.

However, we've also been considering the problems and limitations we might find in our conclusions.

Therefore, care should be taken not to overgeneralise the results from a statistical investigation, and conclusions should be understood to be limited by any biases.

Conclusions may also only be valid for the sample of data or a population that is fully represented by that sample.

And context may also be an important factor in determining the outcome of a statistical investigation.
