Hello, my name is Dr. Rowlandson, and I'll be helping you with your learning during today's lesson.

Let's get started.

Welcome to today's lesson from the unit of numerical summaries of data.

This lesson is called "Statistical Problems with Data Collection."

And by the end of today's lesson, we'll be able to choose what data needs to be collected to explore statistical problems.

Here are some keywords that we'll be introducing during today's lesson.

Each of these words will be unpacked later.

This lesson contains two learn cycles.

The first learn cycle will focus on what data can be collected, and the second will focus on where that data can be collected from.

Let's start with choosing what data to collect.

Here is the data handling cycle; the steps we usually take when we are completing an investigation with data.

Usually, working with data starts with having some sort of question in mind that we want to investigate; that is, planning a statistical inquiry and asking a question of some kind.

The question that we are investigating affects what data we should collect and also how we go about collecting it as well.

And this is what we'll be focusing on in today's lesson.

There are different types of data that could be collected.

Some data are integers.

For example, ages tend to be integers.

At a young age, people may use half years, for example, to describe ages; but usually, after a certain point, all ages tend to be whole numbers.

Also, scores in a sports competition tend to be, in lots of cases, integers.

Some data can be decimals though.

For example, if I wanted to collect data about the distance I travel somewhere, then that doesn't have to be a whole number; that could be any kind of decimal, depending on how accurately I choose to measure it.

It could be 3.7 kilometres, or 3.71 kilometres, and so on.

Also, the weights of things, such as weights of animals, are examples of data that could be decimals.

Some data are frequencies.

Frequencies tend to be where we count something; for example, the number of people who travel by bus, or how many workplace accidents occur in a year.

So, there are different types of data, but also, data can be collected using lots of different methods and tools.

For example, we could collect data from people by asking them questions through a questionnaire.

Usually, when we use a questionnaire, we either give them multiple choices to choose from, or we expect quite short responses.

But questionnaires tend to be a way of getting information through a series of short questions.

Whereas, if we want to get more in-depth information out of people, we might use an interview instead to get longer, more detailed answers out of them.

We could use a tally to collect data; that's usually very good for counting things when you are in the middle of collecting some data of some kind.

We might use weighing scales to collect data about weights, or a stopwatch to collect data which involves time, or a ruler to collect data that is about length of some kind.

So, the methods we may use to collect data really depend on what data it is we want to collect.

For example, Andeep wants to know whether the modes of transport used to travel to school differ between younger and older pupils.

What data do you think would be most helpful for Andeep to investigate this?

Pause the video, have a think, and press play when you're ready to continue.

The data that will be helpful would probably be frequencies for different modes of transport; the frequencies for people travelling by bus, on foot, by car, and so on.

So, if Andeep wants to collect frequencies, how could he collect that data?

What methods could he use?

Pause the video, have a think, and press play when you're ready to continue.

Andeep could ask pupils how they get to school; and while he's asking people, he could tally on some sort of list: bus, walk, car, and so on.

Or he could send a written questionnaire out to pupils in his school for them to complete and send back to him.
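
As a rough sketch of how Andeep's tally could be kept, here is a short Python example that counts modes of transport separately for younger and older pupils. The responses, the group labels, and the function name `tally_by_group` are all made up for illustration; they are not data from the lesson.

```python
from collections import Counter

# Hypothetical responses, invented for illustration:
# each entry is (age group, mode of transport).
responses = [
    ("younger", "bus"),
    ("younger", "walk"),
    ("younger", "car"),
    ("younger", "car"),
    ("older", "bus"),
    ("older", "bus"),
    ("older", "walk"),
    ("older", "walk"),
]


def tally_by_group(responses):
    """Count how often each mode of transport appears within each group."""
    tallies = {}
    for group, mode in responses:
        # Create an empty tally for a group the first time it appears.
        tallies.setdefault(group, Counter())[mode] += 1
    return tallies


tallies = tally_by_group(responses)
for group, counts in tallies.items():
    print(group, dict(counts))
```

Keeping one tally per age group is what lets the frequencies be compared between younger and older pupils afterwards.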

Here's another example.

Laura wants to know whether people who are fast runners are also good at long jump.

So, what data will be most helpful for Laura to investigate this?

Pause the video, have a think, and press play when you're ready to continue.

The data that would most help with this will be race times, perhaps in seconds; and long jump distances, perhaps in metres.

By collecting that data, then Laura can see whether or not fast runners are also good at long jump.

So, if that's the data Laura wants to collect, how can she go about collecting it?

What methods might she use, or what tools might she use to collect this data about race times and long jump distances?

Pause the video, have a think, and press play when you're ready to continue.

Laura could time how quickly people finish a 100 metre race, for example, using a stopwatch; and then measure the long jump distances using a measuring tape, for example.

But there are other options.

So, let's check what we've learned so far.

Which tool will be most helpful to collect data about the length of leaves?

A: a stopwatch; B: a ruler; C: weighing scales; or D: a questionnaire.

Pause the video, make a choice, and press play when you're ready for an answer.

The answer in this case would be B: a ruler will be the most helpful tool out of these ones for collecting data about lengths of leaves.

Which tool would be most helpful to collect data about how many people take each mode of transport to travel to work?

A: a stopwatch; B: a ruler; C: weighing scales; or D: a questionnaire.

Pause the video, make a choice, and press play when you're ready for an answer.

In this case, the most helpful tool out of these ones would be a questionnaire.

Send out a questionnaire, ask 'em to write down or choose which transport they use to travel to work, and give it back to you.

Jacob has designed a puzzle, and he wants to know the average time that people take to solve his puzzle.

Which method would be the most useful for Jacob to collect this data?

A: sending out a questionnaire to people about his puzzle; B: timing people doing his puzzle; or C: interviewing people about his puzzle.

Pause the video, make a choice, and press play when you're ready for an answer.

The answer to this one would be B: timing people doing his puzzle.

Using a questionnaire or interviews might be good for Jacob to get opinions about his puzzle, but not necessarily about how long it takes people to complete his puzzle.

Yes, you could ask people on the questionnaire or interview, how long did it take you to do this puzzle last time?

But they may not have timed themselves, or they may not be honest in their answers.

So, for Jacob to time people himself doing the puzzle, that might be the best way for him to collect this data.

Now, sometimes in a data investigation, we need to collect data for ourselves, but not always.

Sometimes the data that we want may have already been collected by somebody else.

So, in some cases, it can be quicker or easier to use data that is already available rather than collecting the data ourselves.

Sometimes the data might be collected by somebody else in a much more accurate way than I could possibly collect it.

Perhaps they've got some tools that I don't have for collecting that data.

So, we don't always need to collect data ourselves every time we want to do an investigation.

We saw this example earlier where Laura wants to know whether people who are fast runners are also good at long jump, and we talked about how she might collect this data herself.

However, where might this data already be available?

Pause the video, have a think, and press play when you're ready to continue.

It might be that the PE teachers at Laura's school may already have records of this data from a previous sports day or previous PE lessons, for example.

Or if she's not particularly interested in children at her school, she may look at data from professional sports competitions, which may be available online.

Sometimes it can be impractical, or even impossible, to collect the data we want for ourselves.

And so, we have to use data that is collected by others.

For example, it would be really difficult for a school-aged pupil to collect data about the weather in the Antarctic, but there may be some scientists who have already collected that data, and they might share that data somewhere with the public.

There are many large data sets that can be found on the internet.

For example, gov.uk has lots of data sets, and so does the Office for National Statistics, all of which are publicly available.

Here's an example.

Sam wants to investigate whether there is a correlation between the number of houses in a city and the number of public parks there are.

So, they decide to collect the following data for 20 cities: the number of houses per city, and the number of parks per city.

Explain why Sam may not be able to collect this data themselves.

Pause the video, have a think, and press play when you're ready to continue.

It could be really difficult for Sam to count all the houses and all the parks in each city.

Sam could try, but it would be very difficult to do that and very, very time consuming as well.

So, what could Sam do instead?

Pause the video, have a think, and press play when you're ready to continue.

Well, one thing that Sam could do is see if that data is already available somewhere; for example, online, or at the councils for each city.

It may be that that data has already been collected and is available for Sam to use.

Right, let's check what we've learned there.

True or false: You always have to collect new data for a statistical investigation.

Choose true or false and a justification.

Justification A: people are only allowed to analyse data that they have collected themselves; and justification B: sometimes the data you want to analyse has already been collected and made available to you.

Pause the video, make your choices, and press play when you're ready for answers.

The answer is false, because sometimes the data you want to analyse has already been collected and made available to you.

Okay, over to you now for task A.

This task contains one question.

A supermarket manager collects data to investigate ways to save money, increase sales, and also improve the customer's experience.

Now, down the bottom on the left hand side, you've got the data that the manager wants to collect.

On the right hand side, you've got the methods or tools that the manager may use to collect that data, but it's all jumbled up, so you need to match the data with the most appropriate tool for collecting it.

You can draw lines from one to another to match them up.

Pause the video, have a go, and press play when you're ready for an answer.

Great work with that.

Let's go through some answers.

To collect data about customers' opinions about a product that is for sale, that could be done with a questionnaire.

To collect data about the number of customers visiting at each time of day, well, that could be done with a tally chart; for example, looking at the customers coming through the door and counting them as they come in, using a tally.

If we want to collect data about the weights of vegetables that are left unpurchased, the best tool for that would be weighing scales.

And if we wanna collect data about the time that customers spend queuing at the checkout, the best method for that would be to use something like a stopwatch.

You're doing great so far.

Let's now move on to learn cycle two, which is looking at choosing where to collect the data from.

When collecting data, we should consider what population the investigation is about.

In a data investigation, the population is the entire set of people, or creatures, or plants, or items that make up the whole group that is being studied.

For example, the population might be all the people in a particular region, or the population of a data investigation might be all the fish in a lake, or it could be all the trees on an island, for example, or it could be all the subscribers to an online product.

The population is not necessarily everyone in a particular place; it is defined by what the investigation intends to study.

For example, in an investigation about how much time school children spend doing their homework, the population is defined by the research question in particular.

If the research question is, "How much time do 16-year-old pupils in the UK spend doing homework per week," then the population won't be children of any age, it'll be all 16-year-old pupils in the UK, and that population is likely to be in the hundreds of thousands.

However, if the research question was, "How much time do pupils from my school spend doing homework per week," the population is defined by all the pupils in the school that's being studied, and that is likely to be in the thousands.

Whereas, if the research question was, "How much time do 16-year-old pupils from my school spend doing homework per week," here the population is defined as pupils who are both in that school and 16 years old.

So, the population would be all 16-year-olds in that school, and that's likely to be in the hundreds.

For example, Alex wants to investigate how much time pupils in his school spend doing homework per week.

The population of Alex's investigation includes all the pupils in his school.

Alex considers collecting data by asking every pupil in the school how much time they spend doing homework per week.

What problems could Alex have if he tried to do this?

Pause the video, have a think, and press play when you're ready to continue.

It could be very, very difficult for Alex to speak to everyone in a school; not impossible, but difficult.

It could also be very time consuming for Alex to speak to everyone in the school, but also it could be very time consuming to organise that data after it's been collected, and then analyse it.

So, sometimes it is impractical to collect data from the entire population.

In these cases, data might be collected from a smaller subset of the population instead, and this would be called a sample.

A sample is a subset of the population.

When we collect or use data from a sample, we make an assumption that the results would've been similar if we had used data from the entire population instead.

So, we're saying here that the data summaries for the sample are approximately similar to the data summaries for the population.

That's why a sample can help us in an investigation.

We collect data for a small subset, analyse that, and then assume that the data would've been similar for the entire population.

So, how large should the sample be?

Well, there's no right or wrong answer to that, really, just lots of pros and cons.

The larger the sample, the more difficult it can be to collect that data.

And if the sample gets really, really large, it gets to the point where it's pretty much the population, then it kind of loses its benefit of being a sample.

However, if the sample is too small, then it might not then provide enough data to make any kind of reliable conclusion.

So, the size of your sample may really depend on what data it is you're collecting and what methods you're using to collect it.

For example, if we are interviewing people for detailed opinions about something, whether it's their opinions about a product or their experiences of visiting a particular place or something like that, well, interviewing can take a long time to do.

And usually, the information that we get, it can be quite detailed and worded, and that could take a long time to organise and then analyse.

So, if we are using interviewing as a method, then we might wanna use a small sample for that.

Whereas if we are using a questionnaire, such as an electronic questionnaire that asks people to rate something on the way out by pressing a button, well, that's really quick to collect that information.

And 'cause it's just a single question with a multiple choice answer, it's very quick then to organise it and also analyse it, so we can collect information from a lot of people there for that one, and use a much larger sample.
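
To see why a sample that is too small might not give a reliable conclusion, here is a small Python sketch. It builds a made-up population of 1000 pupils with weekly homework hours between 0 and 10, takes many samples of two different sizes, and looks at how much the sample means vary. The population, the sample sizes of 5 and 200, and the function name `spread_of_sample_means` are assumptions chosen only for illustration.

```python
import random
from statistics import mean


def spread_of_sample_means(population, sample_size, repeats, rng):
    """Take repeated samples and return the range (max - min) of their means."""
    means = [mean(rng.sample(population, sample_size)) for _ in range(repeats)]
    return max(means) - min(means)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population: weekly homework hours for 1000 pupils.
population = [rng.randint(0, 10) for _ in range(1000)]

small_spread = spread_of_sample_means(population, 5, 200, rng)
large_spread = spread_of_sample_means(population, 200, 200, rng)

print("Range of means with samples of 5:  ", small_spread)
print("Range of means with samples of 200:", large_spread)
```

With a sample of 5, the mean can change a lot from one sample to the next; with a sample of 200, the means stay much closer together, but each sample takes far more effort to collect.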

Let's look at this example again.

Alex wants to investigate how much time pupils in his school spend doing homework per week.

The population of Alex's investigation includes all the pupils in his school.

Alex decides to collect data from a sample of 20 pupils.

He chooses 20 pupils from the youngest year group for his sample.

What problems could there be with Alex's choice of sample when we think about what it is he's trying to investigate?

Pause the video, have a think, and press play when you're ready to continue.

One problem with Alex's sample is that all the pupils are in the same year group.

The reason why this is a problem, or could be a problem, is really down to what it is that Alex is aiming to investigate.

Alex wants to investigate how much time pupils, in general, in his school spend doing homework per week.

That's pupils in general, not pupils from the same year group, or a particular year group.

And it might be that older pupils get more homework than younger pupils.

So, if Alex collects data only from the youngest pupils at school, it might give him the impression that pupils spend less time doing homework across the school than they really do.

So, the sample does not accurately represent the entire population of his school; it only represents a very narrow age group of his school.

So, can you propose a different way that Alex could choose his sample of pupils?

Pause the video, have a think, and press play when you're ready to continue.

If the issue with Alex's previous sample was that the pupils were all from the youngest year group and so might not necessarily represent all the ages, then Alex could choose perhaps four pupils from each year group.

Therefore, he's got a mixture of different ages from the start of the school right the way to the end of the school.
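
Choosing four pupils from each year group could be sketched in Python like this. The school here is hypothetical: it is assumed to have five year groups (numbered 7 to 11) with 30 pupils each, and the pupil labels and the function name `sample_per_year_group` are invented for the example.

```python
import random


def sample_per_year_group(pupils_by_year, per_group, seed=None):
    """Pick the same number of pupils at random from every year group."""
    rng = random.Random(seed)
    sample = []
    for year, pupils in pupils_by_year.items():
        # rng.sample picks without replacement, so no pupil is chosen twice.
        sample.extend(rng.sample(pupils, per_group))
    return sample


# Hypothetical school: year groups 7 to 11, with 30 pupils in each.
pupils_by_year = {
    year: [f"Y{year}-pupil{n}" for n in range(1, 31)]
    for year in range(7, 12)
}

sample = sample_per_year_group(pupils_by_year, per_group=4, seed=1)
print(len(sample))  # 5 year groups x 4 pupils = 20
print(sample)
```

The sample is still 20 pupils, the same size as before, but now every year group is represented.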

The problem that we discussed there with Alex's data collection relates to a thing called bias.

Bias can be present in any sample collected from a population and may affect the results of a statistical investigation.

Bias can be caused by collecting data from a sample that is not representative of the population.

So, usually, when you collect data from a sample, you assume that the data summaries you get from that, for example averages, will be similar to the data summaries that you'd get from the entire population.

What bias can do is cause that not to be the case.

For example, if we collect data from the youngest year group, the average amount of time spent doing homework in that sample might be lower than the average amount of time spent doing homework in the entire population.

So, that's an example of bias.
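
Here is a small numerical sketch of that bias in Python. The homework hours are invented for illustration, under the assumption that older pupils get more homework; they are not real data. The example compares the mean for the whole (tiny) population with the mean for a sample from the youngest year only, and with the mean for a sample that takes two pupils from every year group.

```python
from statistics import mean

# Hypothetical weekly homework hours per pupil, keyed by year group.
# Invented for illustration, assuming older pupils get more homework.
homework_hours = {
    7: [2, 3, 2, 3],
    8: [3, 4, 3, 4],
    9: [4, 5, 4, 5],
    10: [5, 6, 5, 6],
    11: [6, 7, 6, 7],
}

# The whole population: every pupil in every year group.
population = [h for hours in homework_hours.values() for h in hours]

# A biased sample: only pupils from the youngest year group.
youngest_only = homework_hours[7]

# A more representative sample: the first two pupils from each year group.
mixed = [h for hours in homework_hours.values() for h in hours[:2]]

population_mean = mean(population)
youngest_mean = mean(youngest_only)
mixed_mean = mean(mixed)

print("Population mean:   ", population_mean)  # 4.5
print("Youngest-year mean:", youngest_mean)    # 2.5
print("Mixed-sample mean: ", mixed_mean)       # 4.5
```

In this made-up data, the youngest-year sample suggests pupils do about 2.5 hours a week, when the figure for the whole school is 4.5 hours. The mixed sample happens to match the population exactly here because the numbers were chosen to be tidy; with real data it would usually be close rather than identical.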

Let's check what we've learned so far with this.

Izzy wants to investigate which genres of books pupils in her school enjoy reading.

She chooses a class in the oldest year group and gives them a questionnaire about which books they read.

In Izzy's investigation, what is the population that she's investigating?

Is it A: all the pupils in the country; B: all the pupils in Izzy's school; C: all the pupils in the oldest year group of Izzy's school; or D: all the pupils in the class that were given the questionnaire.

Which of those describes the population in Izzy's investigation?

Pause the video, make a choice, and press play when you're ready for an answer.

The answer to this one is B: the population of Izzy's investigation is all the pupils in Izzy's school.

We can see that because the first sentence in this question says, "Izzy wants to investigate which genres of books pupils in her school enjoy reading."

So, the population is pupils in her school.

So, in this situation, what is the sample in Izzy's investigation?

Same choices again.

Pause the video, have a go, and press play when you're ready for an answer.

The answer is D.

All the pupils in the class who were given the questionnaire, they are in the sample of Izzy's investigation.

Final question about this scenario is: Why might Izzy's results be biassed?

Pause the video, write down a sentence, and press play when you're ready for an answer.

Here are some reasons why Izzy's results may be biassed.

For starters, all the pupils in the sample are the same age.

So, why might this cause a problem?

Well, the genres of books that people enjoy reading may differ with age.

Older pupils may enjoy reading different sorts of books to younger pupils.

Therefore, the opinions of her sample may not represent the opinions of the entire population.

In other words, the opinions of that particular class in the oldest year group may not actually be similar to the opinions of the entire school with all the year groups.

Okay, over to you now for Task B.

This task contains two questions where each question is a scenario about data collection.

Here is question one: Jun wants to find out the average time people take to run a 100 metre race in his school.

He assists in a PE lesson for a class in the youngest year group by timing how long each pupil takes to complete the race, and he then calculates the mean time.

So, that's the scenario.

And there are four questions for you to consider based on that scenario.

Pause the video, write either a few words or a sentence or two about each of these questions, and then press play when you're ready for question two.

Okay, here's question two.

Laura wants to investigate how much time adults in the UK spend exercising per week.

She designs a questionnaire about people's exercise habits.

She goes to her local gym and asks people to complete her questionnaire.

Once again, you've got four questions to consider.

Answer each one either with a few words or a sentence.

Pause the video, have a go, and press play when you're ready for answers.

Great work with that.

Let's work through question one together now.

A: what is the population in Jun's investigation?

The population is all the pupils in Jun's school.

That is what he wants to find out about.

B: who does Jun use in his sample?

He uses a class in the youngest year group; that is the class for which he assists in the PE lesson.

C: why could Jun's data be biassed?

Well, he's only collected data from a single age group; the youngest age group.

And it might be that older pupils run faster than younger pupils because they've grown more, they've trained more, or so on.

Therefore, the average time for the sample could be much higher than the average time for the whole school, if we assume that older students do run faster.

So, part D: propose an alternative way for Jun to collect this data.

Got a few options here.

He could use a sample that includes a mixture of people from each year group.

That's one way.

However, it does come with a complication that he would have to assist in lots of different lessons if he wants to collect that data himself, and it could be difficult to do.

So, he could see if that data's already available.

The PE teachers in his school might already have collected that data for the entire school, and he could analyse the population data rather than sample data.

And now let's work through question two.

Part A: what is the population in Laura's investigation?

The population is all the adults in the UK, because Laura wants to investigate how much time adults in the UK spend exercising.

Who does Laura use in her sample?

She uses the people who attend the gym that she went to and who also agree to take part in her questionnaire.

We should remember that, just because Laura's doing a questionnaire at a gym, doesn't mean that everyone will stop and agree to take the questionnaire.

A lot of people might walk past her or just say, "Not today, thank you," and so on.

So, the sample is only the people who actually agreed to do the questionnaire at that particular gym.

Part C: why could Laura's data be biassed?

She's only collected data from people who are members of a gym, or people who are visiting the gym.

Why is that a problem?

Well, if they're going to the gym, the chances are they go in there to do exercise.

Therefore, people who go to the gym may do more exercise than people who don't go to the gym.

So, her results might suggest that adults in the UK spend more time doing exercise than the population actually does.

Part D: propose an alternative way for Laura to collect data using her questionnaire.

Well, the problem we've had in part C is that she's only collecting data from members of gyms, and not everyone in the UK is a member of a gym.

So, she wants to collect the data in a way that gets both members and non-members of gyms.

How could she do that?

She could conduct her questionnaire in a place where both gym members and non gym members might go to; for example, a town high street, or a supermarket.

Fantastic work today.

Here's a summary of what we've learned in this lesson.

Different statistical problems may need to be investigated in different ways; that might be collecting different types of data, which may then mean using different methods or different tools to collect that data.

It all depends on what it is we're trying to address.

It's not always practical to collect data from the entire population.

So, when it's impractical to collect data from the entire population, we may collect data from a sample instead.

But when we do that, we need to just bear a few things in mind.

When sampling from a population, the sample size needs to be practical, and we need to be aware of any factors that may cause bias, and attempt to reduce those where possible, but at least just be aware that bias could be there in our results.

Well done today.

Thank you.
