Hello and welcome everyone to another numerical summaries lesson with me, Mr. Gratton.

Thank you for joining me in a lesson where we will identify the limitations of each measure of central tendency and consider when each can be used.

Pause here to check some of the important keywords for this lesson.

First up, let's look at some features of mean, median and mode.

Remember, all three of the measures of central tendency are methods of summarising a dataset using a single value.

The mean is the sum of all the data points divided by the number of data points.

The median is the middle value in an ordered list, or the midpoint of the two middle values if there is an even number of data points in the dataset.

Finally, the mode considers the most frequent data value.
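As a rough illustration of these three definitions, here is a minimal Python sketch using the standard `statistics` module. The dataset is hypothetical and is not the one shown on screen in the lesson.

```python
from statistics import mean, median, mode

# Hypothetical dataset, chosen only to illustrate the three definitions.
data = [3, 5, 5, 6, 7, 8, 8, 8, 13]

data_mean = mean(data)      # sum of the values divided by how many there are
data_median = median(data)  # middle value of the ordered list
data_mode = mode(data)      # most frequent value

# With an even number of values, the median is the midpoint
# of the two middle values (here 5 and 6).
even_median = median([3, 5, 6, 8])

print(data_mean, data_median, data_mode, even_median)
```

For this made-up dataset the mean and median agree while the mode sits a little higher, which is one small example of the three measures summarising slightly different things.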

Andeep notices that for this dataset, these three summary statistics all give very similar results.

So there's no point in calculating all three, right, when one gives you enough of a picture.

Well, Sam isn't so sure.

For some datasets, the three measures of central tendency will be wildly different from each other.

This is because each measure calculates something different and presents a slightly different summary of the data.

For example, the mean takes the sum of all the data points and represents that sum shared equally between all the data points.

The size of this sum is affected by very asymmetrical data or data with outliers.

For a dataset like this, the mean may be quite different from the median or the mode.

For example, in this dataset, the median and the mode are both 36, whilst the mean is slightly different at 38.4.

Andeep asks, how can you tell if a dataset is symmetrical? That is a very reasonable question, especially for large datasets where it is very difficult to get a sense of how the data are distributed.

Sam's solution is very helpful.

Sometimes it helps to represent a dataset on a dot plot in order to see how symmetrical or not symmetrical the dataset is.
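One possible way to get a quick feel for symmetry is a text-based dot plot. The sketch below is only an illustration: the function name `dot_plot` and the data are hypothetical, and it assumes the values are whole numbers.

```python
from collections import Counter

def dot_plot(values):
    """Return one text line per whole-number value, with one '*' per data point."""
    counts = Counter(values)
    lines = []
    for value in range(min(values), max(values) + 1):
        lines.append(f"{value:>3} | " + "*" * counts[value])
    return lines

# Hypothetical data: mostly clustered around 13 to 15, with one value far above.
data = [12, 13, 13, 14, 14, 14, 15, 15, 20]
for line in dot_plot(data):
    print(line)
```

In the printed plot, the long empty stretch before 20 makes the lack of symmetry easy to spot.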

Speaking of the size of a dataset, the influence an extreme value or outlier will have on the value of the mean will change depending on how many data points there are in the dataset.

Any extreme values will have a lot less of an impact if the dataset is large.

For example, in this dataset of 10 data points, the extreme value of a plus 1000 will have a massive influence on the mean: the mean is then a plus 100.

However, in a dataset with a lot of data points like this one, that's 1000 data points with 999 of them being the value a, the mean is then a plus 1, a lot less impacted than the mean of a plus 100 for the other, smaller dataset.
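The arithmetic behind these two means can be checked with a short sketch. It assumes that every data point other than the extreme value equals a (which is what the stated means of a plus 100 and a plus 1 imply), and the value chosen for a is hypothetical.

```python
def mean(values):
    """Sum of the values shared equally between all the data points."""
    return sum(values) / len(values)

a = 20  # hypothetical value; any number gives the same pattern

small = [a] * 9 + [a + 1000]    # 10 data points, one extreme value
large = [a] * 999 + [a + 1000]  # 1000 data points, one extreme value

small_mean = mean(small)  # (10a + 1000) / 10 = a + 100
large_mean = mean(large)  # (1000a + 1000) / 1000 = a + 1

print(small_mean, large_mean)
```

The same extra 1000 is shared between 10 data points in the first case and 1000 data points in the second, which is why its effect on the mean shrinks.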

Whilst the mean may be impacted a lot by one extreme value, if there are a handful of extreme values both greater than and less than the more central values, then the extreme values may balance each other out a bit.

One extremely large value may then cancel out the impact of one extremely small value.

Furthermore, in our examples above, the difference between the means of a plus 100 and a plus 1 for the two datasets is only significant if a is small.

For massive values of a, that difference of 99 really isn't significant at all, because a plus 1000 in cases like this will not be an extreme value.

For this right hand dataset, we have a median of 14 and a mode of 13.

But again, the mean is more significantly different at 17.18.

Let's have a look at the dot plot to visualise why this mean may be different.

So there's the dot plot.

Oh, hang on, this doesn't look quite right.

Ah, that's better, this is now the whole dot plot, including the two outliers of 34 and 37.

The mean will be influenced by these two values that lie well above the rest of the data.

For this check, pause here to consider in which of these datasets the mean is more likely to be influenced by outliers or asymmetrical data.

B has an outlier of 49, C has an outlier of 3, and E is very asymmetrical.

And pause again here, this time for datasets represented as a dot plot.

Which of these dot plots has a mean that is likely to be different from its median and mode? The answers are A, B, and D.

And pause here to consider what the reasons are for these datasets to have different means from the medians and modes.

A and B are not symmetrical, whilst D would be very symmetrical if you ignored those outliers of 3 and 4.

So we've seen datasets where the mean is different from the other measures of central tendency.

But are there times when the median and mode are different from each other? Definitely, especially when we have bimodal data.

Bimodal data is where there are two values in a dataset that occur the most frequently.

For example, in this dataset, we have bimodal data at 40.2 and 40.8.

On the dot plot for this dataset, we can identify the two modes by these two peaks in the data.

From both representations, we can see that the data is bimodal at these two data values.

With bimodal data, it is common for the mean and median to be somewhere between the values of the two modes, somewhere in the middle of the two peaks in the dataset.

In this particular dataset, the mean is 40.54 and the median is 40.55, which, as you can see, are somewhere roughly in the middle of these two modes.
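To see the same pattern in a smaller case, here is a minimal sketch with a hypothetical bimodal dataset (not the one from the lesson). It uses `statistics.multimode`, which returns every value that shares the highest frequency.

```python
from statistics import mean, median, multimode

# Hypothetical bimodal dataset with peaks at 2 and 6.
data = [2, 2, 2, 3, 4, 5, 6, 6, 6]

modes = multimode(data)     # every value tied for the highest frequency
data_mean = mean(data)      # 36 / 9
data_median = median(data)  # 5th value of the ordered list

print(modes, data_mean, data_median)
```

Here both the mean and the median land between the two modes, which matches what the lesson describes as common for bimodal data, although it is not guaranteed for every dataset.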

For this next check, pause here to consider in which of these datasets the median is equal to the mode.

The modes are here.

Notice how some of the data are bimodal, and the medians are here.

The correct answers are B and C, the two datasets which are not bimodal.

And pause here to identify the datasets in which the median is not close to the mode or modes.

Here are the medians.

C and D have modes not close to the median.

For C, the data are bimodal at 108 and 122.

Whilst D isn't bimodal, the mode is at one extreme end of the dataset with some quite frequent values at the other end of the dataset.

In cases like this, the median will still be somewhere a little bit more central.

Furthermore, the mode is unique amongst the mean, median and mode.

This is because a mode can be found for types of data where the other summary statistics cannot.

The mode is the only measure of central tendency that can be used to summarise qualitative data.

For this dataset consisting of colours, the mode is clear to see, the most frequently occurring colour is pink.

However, the mean is impossible because, well, you simply cannot add colours together to get a sum total colour.

Furthermore, there is no defined order that makes finding the median value even possible.
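A short sketch can show the same idea in code. The list of colours is hypothetical; the point is only that counting works on qualitative data while adding does not.

```python
from collections import Counter

# Hypothetical qualitative data.
colours = ["pink", "blue", "pink", "green", "pink", "blue"]

# The mode only needs counting, so it works for colours.
modal_colour, frequency = Counter(colours).most_common(1)[0]

# The mean needs a sum, and colours cannot be added together.
try:
    total = sum(colours)
    mean_possible = True
except TypeError:
    mean_possible = False

print(modal_colour, frequency, mean_possible)
```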

And also for data that have been grouped such as quantitative continuous data, it is not possible to calculate one single modal value.

However, a modal class can be found.

The modal class is a group of values that appears more frequently than any other distinct group in a dataset.

The modal class for this dataset is 40 to 60 with a frequency of 289.
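Finding a modal class amounts to picking the class with the largest frequency. In the sketch below, only the class 40 to 60 with frequency 289 comes from the lesson; the other classes and frequencies are hypothetical placeholders.

```python
# Grouped frequency table: class label -> frequency.
# Only "40 to 60" with 289 is from the lesson; the rest are made up.
frequency_table = {
    "0 to 20": 112,
    "20 to 40": 205,
    "40 to 60": 289,
    "60 to 80": 174,
}

# The modal class is the class with the highest frequency.
modal_class = max(frequency_table, key=frequency_table.get)
modal_frequency = frequency_table[modal_class]

print(modal_class, modal_frequency)
```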

For this check, pause here to match the words or phrases to the sentences that complete the summary statistics for each dataset.

The modal class of this frequency table is 0.2 to 0.4, whilst the mode of this raw data is car.

Because this raw data is qualitative data, the mean does not exist.

Brilliant, on to the practice.

Here are three datasets with some of their measures of central tendency calculated.

Any missing measures are shown on the right hand side.

Pause here to match the missing measures of central tendency to the correct dataset.

And for question two, pause here to match the mean to the correct dataset and explain how you know that you've matched the means correctly.

And finally, questions three and four. For question three, find the mode, the modes if the data are bimodal, or the modal class of each of these three datasets.

And for question four, find an estimate for the median or median class of each dataset if possible.

Pause now to do both of these questions.

Okay, onto the answers.

For question one, dataset A is bimodal at 3 and 10, dataset B has all three measures of central tendency at 4, and dataset C has a median of 2.5 and a mode of 1.

For question two, pause here to compare the means for each dataset to the ones on screen and check whether your explanations are similar to the ones on screen.

For question three, dataset A has a modal class, in this case of 30 to 50.

For dataset B, the mode of this qualitative dataset is geometry, and for dataset C, the data are bimodal at 205 and 217.

And finally question four, A has a median class of 10 to 30, B has no median as it is qualitative, not quantitative data.

And for C, we can estimate the median to be anything between 209 and 213.

So now that we have a sense for how each measure of central tendency differs, let's see which ones are best for a given context.

Andeep acknowledges the differences between the mean, median and mode and says they're always all useful.

However, Sam disagrees.

Whilst they are all different from each other, they may not all be useful depending on the context that they are used in.

The purpose of the mean is to share the total value of a dataset equally amongst each data point in that dataset.

Sometimes this is really helpful, but sometimes doing so just doesn't make sense.

For example, here is the number of days of air frost each month in Eastbourne.

If we were to find the mean of this dataset, then we could say that each month has an average of one day of air frost, like so, redistributing the 12 days of air frost in total to one per month.

But is this really a helpful thing to do? Is it fair to distribute the days of air frost across all the months including summer months, like July and August? Personally, I don't think there's really a purpose to this.

On the other hand, the mode considers which data point most frequently occurs.

If the modal value has a very high frequency, much more frequent than any other data point, then the mode may suggest a common trend in the data.

Well, here we can see that the mode is zero, and zero appears nine times, far, far more frequently than any other value.

Pause here to think about or discuss: is it more helpful to say that the modal number of days of air frost was zero, because 9 out of the 12 months had no air frost at all, or is it more helpful to say that the mean number of days of air frost was one per month?

And lastly, the median considers the most central value in a dataset that has been ordered in terms of the size of each value.

So let's reorder the data in terms of its value rather than the month to get this.
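The air frost figures can be sketched in code as follows. The lesson only states that there are 12 days in total and that nine months have zero days, so the three non-zero monthly values and the months they fall in are hypothetical, chosen to match those totals.

```python
from statistics import mean, median, mode

# Days of air frost for January to December.
# Hypothetical split: only the total of 12 and the nine zeros are from the lesson.
frost_days = [5, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3]

frost_mean = mean(frost_days)         # 12 days shared across 12 months
frost_mode = mode(frost_days)         # most frequent monthly value
zero_months = frost_days.count(0)     # how often the modal value occurs

ordered = sorted(frost_days)          # reorder by value rather than by month
frost_median = median(frost_days)     # midpoint of the 6th and 7th ordered values

print(ordered)
print(frost_mean, frost_mode, zero_months, frost_median)
```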

How valuable is the median of zero for this dataset, which already has an order to it, the order of the months?

And finally, pause here to think about or discuss, is the median of zero any more helpful than the mode where the majority of the months have this modal value?

As you can see, there is a lot to consider when evaluating how useful each measure is in a given context.

And of course the arguments for and against each one will differ depending on the context that we are dealing with, such as this context about the number of engines produced each day for 10 days.

Pause here to identify which of these statements are factually correct.

The most frequent number of engines produced in a day is seven.

We see this three times.

However, there are also a handful of other data values that appear twice, not quite three times.

And again, pause here to identify which of these statements linked to the mean this time are factually correct.

The total number of engines produced across the 10 days is 94.

Therefore, per one day, the mean amount of engines produced is 9.4.

And once more, pause here to identify which of these statements now linked to the median are factually correct using now the ordered list of data.

The median number of engines produced in a day is 9.5.

Let's use the three measures of central tendency to compare the performance of the 11th day to these three summary statistics.

Pause now to consider which of these statements are accurate summaries of how many engines were produced on the 11th day.

On the 11th day, nine engines are produced.

Nine is fewer than the mean and the middle number of engines produced per day.
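The comparison above can be sketched in code. The lesson's actual daily figures are not given in the transcript, so the list below is a hypothetical dataset built to match the stated summary statistics: a total of 94, a mode of 7 occurring three times, a mean of 9.4 and a median of 9.5.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical 10 days of engine production, consistent with the stated statistics.
engines = [7, 9, 12, 7, 10, 11, 9, 7, 12, 10]

total = sum(engines)
engines_mean = mean(engines)
engines_median = median(engines)
modal_value, modal_frequency = Counter(engines).most_common(1)[0]

# Compare the 11th day against the mean and the median.
day_11 = 9
below_mean = day_11 < engines_mean
below_median = day_11 < engines_median

print(total, engines_mean, engines_median, modal_value, modal_frequency)
print(below_mean, below_median)
```

In this made-up dataset the values 9, 10 and 12 each occur twice, so the modal value of 7 is only slightly more frequent than several others.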

And finally, pause here to consider which of these may be suitable reasons for why the mode is not as useful a measure of performance as the mean and the median in this context.

The reason is quite simply that the mode isn't very modal.

The modal value of seven occurs only three times.

This isn't that much more frequent than many other data values in the dataset that each occur twice.

Right, get your evaluating hats on for this practice.

Pause here to consider which measure of central tendency is most appropriate for this context in question one.

And pause here to consider why the forecaster chose to use the modal rainfall for the context in question two.

And finally, question three: using just the information given in part A, make a decision about which job is likely to give the highest salary.

And then evaluate whether your choice would change given the knowledge that the boss of each company earns 155,000 pounds.

Pause now for the last question.

Great effort.

For question one, the mode is zero students.

This is because there are two weeks where the bus company didn't transport any students.

Furthermore, a mean of 2,322 shows that each week the school bus company transports on average 2,322 students.

The median number of students transported each week is 3060.5 students.

The mode of zero is deceiving and isn't very useful at all.

In most weeks, over 3000 students were transported.

The two weeks at zero could have been holiday weeks, which are not representative of all of the other weeks.

Furthermore, the mean is not as useful, as it distributes students into these two holiday weeks, which isn't appropriate to do, and it is also influenced by those two outliers of zero.

Therefore, the median is the most appropriate measure to use in this context.

3060.5 students is a fair representation of the number of students transported on all weeks that were not the holiday weeks.

A different thing that the company could have done is to simply recalculate all the measures whilst omitting the two data points of zero, making the remaining data points and the summary statistics from them more relevant to all of these non-holiday weeks.
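The effect of leaving out the two zero weeks can be sketched as below. The weekly figures are hypothetical and much shorter than the real dataset, so the numbers will not match the 2,322 and 3060.5 from the question; only the pattern is the same.

```python
from statistics import mean, median, mode

# Hypothetical weekly student counts, including two holiday weeks of zero.
weeks = [3000, 3100, 0, 3050, 2950, 0, 3200, 3150]

all_mode = mode(weeks)      # the two zeros make zero the most frequent value
all_mean = mean(weeks)      # pulled down by the two zeros
all_median = median(weeks)  # stays close to a typical non-holiday week

# Recalculate after omitting the two data points of zero.
term_weeks = [count for count in weeks if count != 0]
term_mean = mean(term_weeks)
term_median = median(term_weeks)

print(all_mode, all_mean, all_median)
print(term_mean, term_median)
```

Once the zeros are omitted, the mean and the median of this made-up data agree, and both describe a typical non-holiday week.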

Okay, for question two, the mean rainfall is higher than the amount of rainfall for each and every single day that we currently know the amount of rainfall for.

Therefore, the amount of rainfall on that one missing day must have been huge in comparison to all other days, possibly due to that day having torrential rain.

In fact, the amount of rainfall on that missing day is 58 millimetres.
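The size of that outlier can be put into perspective using only the figures quoted in this answer: 12 days, a mean of 5.75 and a missing day of 58 millimetres. The total of 69, the remaining 11 and the mean of 1 for the other days are derived here and are not stated in the lesson. `Decimal` is used simply to keep the decimal arithmetic exact.

```python
from decimal import Decimal

days = 12
mean_all_days = Decimal("5.75")  # mean rainfall in mm across all 12 days
missing_day = Decimal("58")      # rainfall in mm on the missing day

total_rainfall = mean_all_days * days           # mean multiplied by number of days
known_days_total = total_rainfall - missing_day # rainfall on the other 11 days
known_days_mean = known_days_total / (days - 1) # mean without the missing day

print(total_rainfall, known_days_total, known_days_mean)
```

So the other eleven days share only 11 millimetres between them, which is why a mean of 5.75 is not typical of most days.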

The mean amount of rainfall has been affected a lot by that one massive outlier.

The mean of 5.75 is not super typical of the majority of the days.

The mode of 0.4 seems much more typical and representative as this much rainfall occurred on 3 out of the 12 days in this sample.

Each of the companies can be justifiably chosen from this information.

I gave the example of choosing Oakworks, as its given summary statistic, the mean, has a higher value than the summary statistics given for the other two companies.

But Oakworks is far less appealing now that we know the boss earns so much more than the mean.

If we reconsidered the mean of Oakworks to exclude the boss, the mean salary would be much, much lower.

On the other hand, the other two companies are not as affected by this one outlier value.
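A rough sketch of why one large salary affects the mean so much more than the median is given below. Only the boss's salary of 155,000 pounds comes from the question; the staff salaries and the size of the company are hypothetical.

```python
from statistics import mean, median

# Hypothetical staff salaries in pounds; only the boss's 155,000 is from the question.
staff = [24000, 25000, 26000, 27000, 28000]
boss = 155000
everyone = staff + [boss]

mean_with_boss = mean(everyone)
mean_without_boss = mean(staff)

median_with_boss = median(everyone)
median_without_boss = median(staff)

print(mean_with_boss, mean_without_boss)
print(median_with_boss, median_without_boss)
```

In this made-up company, excluding the boss nearly halves the mean but moves the median by only 500 pounds, which is the kind of difference the answer is pointing to.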

And pause here to check that some of the suggested extra pieces of information match your own, and know that there are many more other possible answers.

Thank you all so much for all the effort that you've put in to analysing and evaluating data in the lesson, where we have demonstrated that the mean, median and mode are all measures of central tendency that emphasise different properties of a dataset.

We've seen that different summary statistics can be more or less useful in different contexts, and it can be helpful to interpret what the value of a summary statistic means in that context.

But whilst some summary statistics are more helpful than others in a given context, the more summary statistics you know about, the more complete a picture of a dataset you can build.

Once again, thank you all so much for joining me here today.

Take care and stay safe.

And until next time, goodbye.