Hi there.

My name is Chloe and I'm a geography field studies tutor.

This lesson is called "Using Statistical Data in Geography," and it forms part of the geographical skills unit of work.

We're gonna be looking at all kinds of different ways of analysing data using statistics.

But don't worry, it's not gonna be a maths lesson.

We're gonna break it down into simple steps, and we're really gonna show you how this can form meaningful ideas in geography.

Let's get started.

By the end of today's lesson, you'll be able to use some simple statistics to better understand the values in a set of data.

There's some key words we're gonna be looking at first of all.

Univariate data.

That's probably not a term you've heard before, but you definitely will have used this type of data.

It's data that represents values against one variable.

Frequency data is data values in a category that show the actual count of something within the data set.

A percentile is the number of values within a data set that fall below a certain percentage of the total number of values.

This lesson is in three parts.

First of all, we're going to look at how do geographers find the most likely value.

Next, how do geographers find proportions of data?

And then finally, how do geographers find the spread of data?

So let's start with that first one: how do geographers find the most likely value?

The simplest form of data that geographers can analyse is univariate data.

This is data that applies to one variable only.

For example, Izzy here is studying the precipitation data for a number of sites in a hot desert environment.

And you can see she's got 12 sites that she's looking at, and we've got the average July precipitation in millimetres, and you can see how that varies across the 12 different sites.

Now this is univariate data because there's only one variable, in this case, it's precipitation, that's being measured.

Now there are a number of things Izzy could do with this data to make it mean something more.

At the moment, it's just a list of values.

There's something more she can gain from this.

And one thing that she could do is to place the data into categories, and this means that she would actually create frequency data.

She would be talking about the frequency of data within each of those categories.

Now there's a couple of different ways that you could create these categories.

Now the first one is that she could create the categories themselves as equal in size in terms of precipitation.

So you can see here 0 to 5, 6 to 10, 11 to 15, 16 to 20, a five-millimetre difference in each of those categories.

That means that there's going to be variety in the number of sites that apply to each category.

So you can see here that there's only one site which falls into the 6 to 10-millimeter category.

It's site number nine.

There's nine millimetres of precipitation there.

Likewise, in the 16 to 20 category, there were two sites for that one.

So that's going to be site number two and site number eight.

Now, another way that she could categorise is by making the frequencies themselves equal.

So she's got 12 different sites here.

So she said, "Right, I want three different categories, four sites in each category." This means she has to then manipulate the amount of precipitation for each of those category titles.

So one category is zero millimetres, then one to nine millimetres for the next, and then 10 to 17 millimetres for the last one.

What you can see there is that the range of data has changed so that you can match the amount of sites to be equal across all of the categories.
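The two ways of categorising can be sketched in a few lines of Python. The slide's table is not reproduced in this transcript, so the twelve values below are invented. They were chosen only to agree with the figures quoted in the lesson (four sites at zero, 9 mm at site nine, sites two and eight in the 16 to 20 category, a total of 75 mm). The names `precipitation_mm`, `equal_width` and `equal_frequency` are illustrative too.

```python
# Hypothetical July precipitation (mm) for sites 1-12; NOT the slide's real table.
precipitation_mm = [0, 17, 3, 0, 11, 0, 4, 16, 9, 2, 0, 13]

# Option 1: fixed categories, using the boundaries read out in the lesson.
bands = [(0, 5), (6, 10), (11, 15), (16, 20)]
equal_width = {
    f"{low}-{high}": sum(low <= value <= high for value in precipitation_mm)
    for low, high in bands
}

# Option 2: equal frequencies. Sort the values, then cut them into three
# groups of four sites each.
ordered = sorted(precipitation_mm)
group_size = len(ordered) // 3
equal_frequency = [
    ordered[start:start + group_size]
    for start in range(0, len(ordered), group_size)
]

print(equal_width)
print(equal_frequency)
```

With these made-up values, the fixed categories hold very different numbers of sites, while the equal-frequency groups each hold four sites but cover very different ranges of precipitation.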

An advantage of categorising the data is that it makes large data sets much easier to analyse.

Geographers can more easily see where there are concentrations of data.

However, if the data set is really small, as it is here, there's only 12 different sites, creating frequency data means the geographer is likely to make unnecessary generalisations about it.

Over-categorising something actually takes the meaning away.

So let's check our understanding so far.

It is a good idea to apply categories and frequency data to all data sets.

Is that true or false? Pause the video here and have a think about it.

So hopefully you recognise that that is false, but why is it false? Well done if you've realised that creating frequency data by placing the data into categories works only really well if you've got large data sets.

This makes those data sets much more manageable.

With small data sets, it can encourage unnecessary generalisations, and we don't want to be doing that.

There are more things that Izzy could do with her data set.

She could find the mean or the average of the data.

This would summarise the data and it would allow her to make general statements about deserts.

To find the mean, she adds together all the values and divides them by the total number of values.

So in this case, the mean is 75, which is the total amount of precipitation, divided by 12, the number of sites.

So she has a mean, or an average, of 6.25 millimetres.

However, quoting the mean can leave out important information.

Stating the mean precipitation in the desert in this example as 6.25 millimetres ignores the fact that there are four sites where there's no precipitation at all.

It kinda makes it seem like there's precipitation in all of her desert sites, but actually, in a third of them, there's no precipitation at all.

Instead, Izzy could quote the mode, and this is the most common value in the data set.

And in this case, the most common value is zero millimetres.

It appears in four of the sites, and all of the other sites have different values.

However, this still gives a really misleading impression of precipitation in deserts.

Now it looks like there's no rainfall at all happening in the desert regions, but of course we know that that's not the case.

So a third option that's open to Izzy is that she could find the median value.

This is the central value when the data is placed in numerical order.

So here we have the same data, but you can see it's in a different order to how we've been looking at it previously.

Here it's been ordered from our smallest values on the left hand side of the slide to our largest value on the right hand side.

This means our site numbers now appear to be jumbled up.

In this case, the median value falls between two values.

We've got 12 sites.

So our median value is going to be that which falls between our sixth and seventh position in that order.

When this happens, we find out the mean of those two values.

So we've got three plus four, that's gonna be seven, so therefore, dividing by two, the median is going to be 3.5 millimetres.

The median value can be meaningful in very large data sets, but in this example, the set is too small for the median to be representative of the data.

So when you are thinking about finding a mean, mode or median value, do look at the size of your data set because it might actually not help you to come to any conclusion.
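As a rough check on the three calculations, here is a minimal sketch using Python's standard `statistics` module. The slide's table is not in this transcript, so the twelve values are invented to be consistent with the results quoted in the lesson (a total of 75 mm, four zeros, and 3 and 4 in the sixth and seventh positions). They are not Izzy's real data.

```python
import statistics

# Hypothetical July precipitation (mm) for 12 sites; invented to match the
# lesson's quoted results, not copied from the slide.
precipitation_mm = [0, 17, 3, 0, 11, 0, 4, 16, 9, 2, 0, 13]

mean_mm = statistics.mean(precipitation_mm)      # total divided by number of values
mode_mm = statistics.mode(precipitation_mm)      # most common value
median_mm = statistics.median(precipitation_mm)  # middle of the ordered values

print(mean_mm, mode_mm, median_mm)
```

With an even number of values, `statistics.median` takes the mean of the two middle values, which is the same step described above for the sixth and seventh positions.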

Let's check our understanding again.

Complete the sentence with the missing words.

Pause the video and have a look at the paragraph in the slide, and then come back to me with the right answers.

Okay, let's look at what you got.

So calculating the mean, mode and median values allows geographers to summarise the data in a set.

None of the methods are perfect.

The mean and mode values mean the geographer might ignore important detail in the data set.

Median values are only useful if the data set is large.

Hope you got those.

Our first practice task of today's lesson.

Here we've got a data set, a data set that shows the heights in metres of a selection of trees in a mixed woodland.

So you can see you've got 10 different trees that have been measured in this particular woodland.

Your first part of the task, calculate the mean height of the trees.

Second part, state the mode height.

And then finally, calculate the median height as well.

So the three methods that we've just looked at, work those out for that particular data set.

You're definitely going to need to pause the video here, maybe find a calculator if you need to, and then come back to me, and I'll tell you the answers.

So let's have a look at your answers.

First of all, we're looking at the mean for that data set.

26.3 metres.

The mode is 28 metres, and the median, 27.5.

Hope you got those correct.

Now moving on to the second part of today's lesson.

We're looking to answer the question, how do geographers find proportions of data?

Now, proportional data is that which can be expressed as part of a whole, and the most common proportional data that geographers use, and we use them all the time, is percentages, of course.

Jacob here wants to know the percentage of a country's GDP that comes from agriculture.

So he's got some data here.

He's got the total GDP.

And in this case, it's in US dollars, in millions.

And then he's got how much money the country makes from agriculture, the income levels there, in the same units.

So he wants to know the percentage of a country's GDP that comes from agriculture.

So in order to work that out, he's going to take his agriculture value, divide that by his total value, and then times it by 100.

Now I think we can see that Jacob's gonna need a calculator here.

8.5% is the answer.

If you're not sure, have a go yourself at that, and just check your workings out on your calculator as well.

So sometimes data is expressed as a percentage, but what the geographer actually wants to know is the real value.

So what we're actually gonna be doing here is almost a reverse of what Jacob's just done.

So now we've got a different country and it's got the following data.

Its total GDP, 9,053 million US dollars, but we know that 31% of its income comes from agriculture, quite a high value there.

How are we gonna work that out? Now we take our percentage as the starting point in our equation.

So 31 divided by 100, and then we're gonna times it by our total GDP, so in this case, 9,053, and that gives us, again in US dollars in millions, 2,806.

Hope you can see how we've changed the positioning within that equation to find the right answer.

Do note that in both examples there's a consistent unit that's being used.

In this case, it's US dollars in millions.

And so this can be kind of excluded from the calculations.

You haven't got to add a whole load of zeros into the calculation to make up the fact that we're working in millions.

Instead, we take the millions out of the equation and we work with the pure numbers.

Now, if your units are going to be different, then we might have to actually use the full numbers when you're working things out.
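The two calculations are mirror images of each other, which a short sketch can make clearer. The function names `percentage_of_total` and `value_from_percentage` are made up for illustration. Jacob's own GDP figures are not given in this transcript, so only the second example (31% of 9,053 million US dollars) is used.

```python
def percentage_of_total(part, total):
    """Express part as a percentage of total (both in the same units)."""
    return part / total * 100


def value_from_percentage(percentage, total):
    """Recover the real value that a given percentage of total represents."""
    return percentage / 100 * total


total_gdp = 9053        # million US dollars, from the second example
agriculture_share = 31  # percent of GDP from agriculture

# Percentage first, then multiply by the total (units stay in millions).
agriculture_value = value_from_percentage(agriculture_share, total_gdp)
print(round(agriculture_value))

# Going back the other way recovers the original percentage.
print(round(percentage_of_total(agriculture_value, total_gdp)))
```

Because both figures are in millions of US dollars, the "millions" never needs to enter the calculation.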

Let's check our understanding.

So here's Jun.

He's calculating the number of people in a population who are literate.

It's one of these classic development indicators.

The country has a population of 70.9 million people, and he's found out that 67% of them are literate.

They can read and write.

Which is the correct calculation for Jun to use? So remember he's trying to find the actual number of people who are literate in that population.

You've got three options here.

Is it A, 67 divided by 70.9 times 100; B, 67 divided by 100 times 70.9; or C, 70.9 divided by 100 times 67?

Think about what Jun is trying to find out, and therefore what should form the first part of the equation.

Do pause the video and have a think about it, and I'll come back to you in a moment.

So we're gonna leave Jun to actually do the actual calculations here, but we need to know which equation is the best one to use.

Yes, it's B.

So here we've got the percentage appearing first in the equation, 67 divided by 100, another way of saying 67%, and then we're timesing that by the total number of people in the population, 70.9.

So as data changes from year to year, geographers find it useful to calculate percentage increase and percentage decrease in the data values.

Lucas is studying the size of the tropical rainforest in a country and how it's changed over time.

So he's got two dates here.

We've got 2010 and 2020, and you can see that there's been deforestation in that time.

It's reduced in time.

So we're looking for a percentage decrease.

He wants to know by what percentage the rainforest cover has decreased, and we're looking at this in thousands of hectares.

To calculate the percentage change, he looks at the difference between the values, he divides that by the original value, and he times it by 100.

So we are translating that into the data that we've got here.

We're gonna take our 2010 data from our 2020 data to find the difference between them.

We're then dividing it by the original value.

Now, the original value in this case is the 2010 data.

That was our starting point, and then we times it by 100.

And that comes out as a 7.6% decrease.
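The percentage change steps can be sketched as a small function. Lucas's actual 2010 and 2020 figures are not given in this transcript, so the two values below are hypothetical, picked only because they reproduce a 7.6% decrease. The name `percentage_change` is illustrative as well.

```python
def percentage_change(original, new):
    """Difference between the values, divided by the original, times 100."""
    return (new - original) / original * 100


# Hypothetical rainforest cover in thousands of hectares (not Lucas's data).
forest_2010 = 500
forest_2020 = 462

change = percentage_change(forest_2010, forest_2020)
print(round(change, 1))
```

A negative result indicates a percentage decrease and a positive result a percentage increase. The original value is always the earlier one, here 2010.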

Geographers may prefer using other types of proportions such as data measured per capita or per person.

This can make the data more easily and more fairly comparable between places.

It also helps them to make the numbers more manageable when the values are very large.

So here Laura is looking at carbon dioxide emission data for two different countries, and we've got some pretty big numbers to be dealing with here.

We're looking at tonnes of CO2 per year.

One country has got 9,600,000 tonnes.

And the other, 21,300,000 tonnes.

So we're looking at a lot of carbon dioxide going into the atmosphere from these two countries.

Now, it would be very sensible for Laura to point out, yeah, country B is polluting far more than country A.

You can't deny that that is true, but there's more to the data than just that.

What Laura then does is calculate the carbon dioxide emissions per capita.

So here's our original data for the two countries.

She then looks at the total population for those two countries, and you can see here that country A is quite small compared to country B: just 1.1 million people in country A, and 4.5 million people in country B.

So four times as large.

Then she wants to work out the amount of CO2 per capita for that year.

So she's taking her amount of CO2, and she's dividing it by the 1.1 million people.

So how much CO2 is each person in that country producing?

In that case, it's 8.7 tonnes for country A, but we then go to country B and we get maybe a little bit of a surprising result, 4.7 tonnes.

Now, remember what Laura said originally: she noted, quite rightly, that country B is producing a lot more CO2 than country A, but now she can see that each person in country B is, on average, less polluting than each person in country A.

While in total, country B is producing a lot more CO2, when you break it down into the amount of CO2 per person, in fact, country A is a lot more polluting when it comes to carbon dioxide emissions.
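Laura's per capita comparison can be sketched as below. It assumes country A's total is 9,600,000 tonnes, since that is the figure that gives roughly 8.7 tonnes per person for 1.1 million people. The function name `per_capita` is made up for illustration.

```python
def per_capita(total, population):
    """Share of a national total attributable to each person, on average."""
    return total / population


# Tonnes of CO2 per year. Country A's total is an assumption chosen to be
# consistent with the 8.7 tonnes per person quoted in the lesson.
emissions = {"A": 9_600_000, "B": 21_300_000}
population = {"A": 1_100_000, "B": 4_500_000}

tonnes_per_person = {
    country: round(per_capita(emissions[country], population[country]), 1)
    for country in emissions
}
print(tonnes_per_person)
```

Here the full numbers are used rather than dropping the millions, which avoids mistakes when the two columns are quoted in different units.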

Let's check our understanding.

For which of these geographical concepts would it not be suitable to calculate percentage change? Four options, total population, energy consumption, precipitation levels, or the location of earthquakes.

Which one of those would not work with percentage change? Do pause the video and have a think and then come back to me.

So hopefully you've realised that, yeah, the location of earthquakes, that's not something which you can calculate percentage change for.

All of the others, you can.

Population levels go up and down, as does energy consumption, as do precipitation levels.

So all of those would work really well with a calculation for percentage change.

Let's do a practice task now.

Look at the data in the table.

Explain why geographers might prefer to use calories per capita per day as a measure of food intake, rather than the total daily calories consumed.

Let's take a quick look at the data before we reconsider that question.

We've got two countries, again, country A, country B.

Country A, 62.3 million people live there.

Country B, much smaller, 22 million people.

Then we've got the total daily calories consumed.

So how many calories the entire country eats in one day, 215 billion for country A, 76 billion for country B.

When you then look at the calories per capita per day, it's 3,458 for country A and 3,455 for country B.

So why would geographers prefer to use calories per capita per day as a measure of food intake rather than the total daily calories consumed?

Think back to our example that we had about CO2 emissions, and then think, why would geographers prefer to use per capita?

Have a chat with somebody nearby if you're not sure, and I'll come back to you with my ideas in a moment.

Right, so we'd all probably word this slightly differently, but this is roughly the kind of idea that you should be mentioning.

The idea is that by using per capita, it helps to make the larger values easier to manage, and it makes the data easier or fairer to compare.

It makes it much more comparable.

You might have actually used data from this example, well done if you have, and noted the fact that despite all of the data being quite different, at the end of the day, each country is pretty much equal in terms of how many calories each person is consuming each day.

Now let's move on to the third part of this lesson, answering the question, how do geographers find the spread of data? Let's look at some different data now.

We've got life expectancy data for 42 European countries.

The data shows the number of countries whose life expectancy is at each of the ages listed.

So you can see the bottom row there, it's got the life expectancy ranging from 71 on the left to 85 on the right.

Then we've got the number of countries that have that life expectancy.

So one country has a life expectancy of 71, eight European countries have a life expectancy of 83, and so on and so on.

Now, this means we are dealing with frequency data 'cause it shows the frequency of countries in each age category.

Now, I could try and look at the spread of data by drawing a graph.

So here I've got quite a simple dot chart.

I've got a dot representing each European country.

I've got the life expectancy along the x-axis, and so I can see how many countries appear frequency-wise in each of those age categories.

As you might expect, a few countries have really poor life expectancy, some countries have really excellent life expectancy, and then there's some countries which kind of fall in the middle.

So there's a spread of data, but is it an even spread or is it actually skewed towards one end of the spectrum?

Geographers don't always have to draw a graph to understand the spread of data.

They might instead decide to calculate percentiles within the data to show how it's spread out.

In this example, they would look at a certain percentage of the 42 countries and analyse what age categories these represent.

So if they wanted to look at very particular percentiles, particular percentages, they might look at 10% of those 42 countries, and that would be called a decile.

If instead they wanted to look at a quarter of the countries, so that's 25%, it would be called a quartile.

So certain percentiles have very specific names.

So true or false, a decile is the first 10% of the category values in the data set.

Think really carefully about that statement.

A decile is the first 10% of the category values in the data set.

Is that true or false? Really think carefully about the language that's been used there.

Pause the video and then come back to me.

Well done if you worked out that that is false, but why is it false? It's not an easy thing to explain, so have a really good think about this.

So yeah, a decile is 10% of the data being placed into the categories.

It's not 10% of the categories themselves.

In our example, a decile would be 10% of the countries, not 10% of the age values.

It's quite a hard concept to get your head around, but the idea is that it's not about the categories that you're placing the data into.

It's all about the data itself.

The other point is it doesn't have to be the first 10%.

A decile can be 10% of the data in any position within the range of data.

Likewise, a quartile can be 25% in any position within the data set.

Now let's put our 42 countries to one side for a moment and just think a little bit more broadly about the spread of data.

Think of the data within a set laid out in sequence.

So we haven't got numbers attached to anything here.

We're just thinking about data more broadly.

On my left-hand side of the screen is going to be my lowest value.

On the right hand side of my screen, it's going to be my highest value.

The first 25% of the data is known as the lower quartile, and it's marked as Q1 at its maximum extent, and you can see that in the diagram here.

A line has been intersected into the sequence, and that line is marked as Q1, and it's the maximum position where we can say that our lower quartile sits.

The marker of the second 25% is the median.

Now in effect, you kind of could call this a Q2 'cause it's the second quartile, but we would normally just be thinking of this as the median.

Remember we've laid out our data in sequence.

The middle position is going to be the median.

The final 25% is known as the upper quartile, and it's marked as Q3 at its minimum extent.

So let's look at our three markers here.

Again, Q1 marks the maximum extent of the lower quartile.

You then have your median as your second marker, and then your Q3 marks the minimum extent of your upper quartile.

The range of values between the upper and lower quartile markers is known as the interquartile range.

Think about the word interquartile, between the quartiles.

So the interquartile range, and you often see it written as IQR.

This is a measure of the spread of data.

So to calculate the position of the Q1 and Q3 markers, the data is written in sequence from lowest to highest.

So we're gonna go back to our 42 countries here.

Now, each of the countries has a life expectancy, and those have been written in sequence from lowest on the left hand side to highest on the right.

So you can see, for example, three countries had a life expectancy of 73 years.

So 73 appears three times in that sequence.

In our examples, remember there's 42 countries to consider.

The median value will therefore be that which is halfway between the 21st and the 22nd value.

So that's gonna be the exact middle of our sequence.

This means that the median value is effectively at the 21.5 position.

So the 21 1/2 position within that sequence.

Here, the 21st and the 22nd values are both 81 years.

So we know.

Quite easy maths here.

We know that our median value is gonna have to be 81.

Now to find the value of a quartile, one must find its position in the spread of data.

We use this equation.

n plus 1, where n is the number of whole positions either side of the median position, and we divide that by two.

So let's just look at that again.

The number of whole positions either side of the median position.

Remember, our median position is the 21.5 position.

So in this example, n is gonna be 21.

There's 21 values to the left of my median.

There's 21 values to the right of my median.

So using that equation, the quartile position is 21 plus 1, divided by 2.

And that comes out as 11.

Now remember that is telling us the position of our quartile values.

So to find the Q1 marker position, we count 11 positions from the start of the sequence.

So 11 positions in from one end, and we can see that our Q1 value is therefore 77 years.

Now remember a quartile in this example is 11 positions.

So we know if we need to find our Q3, we need to count 11 positions from the other direction.

So going from the other end of the sequence, we'll count 11 positions, and that gets us to our Q3 value, which is 83 years.

This means the interquartile range is the difference between 77 and 83, the difference between our lower quartile value and our upper quartile value.

And this means that our IQR, our interquartile range, is six years.
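The counting method above can be sketched in Python. The full frequency table is on the slide rather than in this transcript, so the counts below are hypothetical. They were chosen only to agree with the values mentioned (42 countries, one at 71, three at 73, eight at 83, the 21st and 22nd values at 81, and quartile markers of 77 and 83). The function names and the median position formula `(count + 1) / 2` are illustrative assumptions too.

```python
def value_at_position(ordered, position):
    """Value at a 1-based position; a half position such as 21.5 gives the
    mean of the values on either side of it."""
    below = ordered[int(position) - 1]
    if position == int(position):
        return below
    return (below + ordered[int(position)]) / 2


def lesson_quartiles(values):
    """Return (Q1, median, Q3) using the counting method from the lesson."""
    ordered = sorted(values)
    count = len(ordered)
    if count < 2:
        raise ValueError("need at least two values")
    n = count // 2                   # whole positions either side of the median position
    quartile_position = (n + 1) / 2  # the lesson's (n + 1) / 2
    median = value_at_position(ordered, (count + 1) / 2)
    q1 = value_at_position(ordered, quartile_position)        # counted from the low end
    q3 = value_at_position(ordered[::-1], quartile_position)  # counted from the high end
    return q1, median, q3


# Hypothetical frequency table: life expectancy (years) -> number of countries.
counts = {71: 1, 73: 3, 74: 1, 75: 2, 76: 2, 77: 3, 78: 3, 79: 3,
          80: 2, 81: 6, 82: 5, 83: 8, 84: 2, 85: 1}

# Write the data out in sequence, one entry per country.
values = [age for age, number in counts.items() for _ in range(number)]

q1, median, q3 = lesson_quartiles(values)
iqr = q3 - q1
print(q1, median, q3, iqr)
```

Spreadsheets and statistics libraries often use slightly different conventions for quartile positions, so their markers may not match this hand-counting method exactly on every data set.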

Whew, there was a lot to take in there.

So let's check our understanding of that.

What is the interquartile range a measure of?

Is it A, the most likely value in the data set; B, the proportion of the data set that has changed; or C, the spread of the data set?

Hopefully not too difficult, this one.

Pause the video.

Come back to me with the right answer.

Well done if you recognise that it's all about the spread of data.

Interquartile range shows you how bunched up the data is in one position within that sequence.

Let's move on to our final practice task of today's lesson.

Look at the following wind speed data in miles per hour that was collected at 15 different sites around the UK at exactly 12:00 PM on two consecutive days.

So we've got day one and day two there in the table, and you can see there are 15 values for each day.

Calculate the interquartile range for each day's data.

And then secondly, comment on which of the two days the wind speed data was more spread out.

You will definitely need to pause the video here.

Do find your calculator again, 'cause you're going to need it.

Remember the sequence of tasks that it takes in order to find out the interquartile range, and take your time to make sure the accuracy is there.

Come back to me when you're ready, and I'll tell you your answers.

Well done.

It's not easy on the first attempt, is it? So let's see what your answers are.

First of all, you needed to calculate the interquartile range for each day's data.

Day one, your IQR was seven miles per hour.

And on day two, you should have got two miles per hour.

Now, if your answers are different to mine, it is probably worth pausing the video here and having another go.

It's not uncommon for you to be wrong on the first attempt.

Don't worry.

Lots of people would be.

It's quite a complicated little calculation to make, but it's worth trying to get it right now before you move on.

Your second task, on which of the two days was the wind speed data more spread out? It was on day one.

How would you know that? Because your interquartile range was much larger on that day: it's seven miles per hour compared to two.

Let's summarise our learning from today's lesson.

Geographers can summarise univariate data sets by finding the most likely value.

To do this, they may calculate the mean, mode, or median value.

Percentages can be calculated to express proportional data, and changes in this data can be shown as percentage increases or percentage decreases.

The spread of data can be calculated as the interquartile range, and this is the difference in value between the lower and upper quartiles.

Really well done for trying so hard in that lesson.

Maths does not come easily to many people, myself included, but with lots and lots of practice, it becomes like another language.

So do have another go at some of those interquartile range calculations if you need to.

Well done.