Hello, my name's Mr. Davison and I'm going to be guiding you through your learning today.

Today's lesson is called "Representing Text and Using ASCII and Unicode" from the unit Representation of Text, Images, and Sound.

By the end of the lesson today you'll be able to explain how computers represent text.

We'll also be learning about these keywords.

State, which is the value of data at a specific point in time.

Character set, a system that matches characters to a unique binary sequence.

ASCII, which is a method of character representation that uses 7 bits per character.

And Unicode, a method of character representation that uses up to 32 bits per character.

The lesson's going to involve three learning cycles today.

Let's start with the first one, calculate the number of states of a sequence.

Binary sequences are made up of binary digits, or what we refer to often as "bits." They can vary a lot in length, but the state of an individual bit is its value at a certain point in time.

So if we consider an individual binary digit, that bit, it can only ever have the values of 0 or 1.

That may change over time, but it'll only ever be those two values.

Now let's just check that you've remembered that.

What is the total number of states that a single binary digit can have? Good, you've remembered.

It's two.

And remember that's 0 or 1 as a value.

If we extend that to a binary sequence where we mean multiple bits collected together, then we know that that binary sequence is going to have a number of different states.

If a single bit has two states, then a binary sequence which is made up of multiple bits is going to have a number of different states.

Now Sam's asking, "What are all the states a 2-bit sequence can be?" So we mean a sequence that has 2 bits collected in it, and the sequence is considered those 2 bits together.

Lucas finds that easy, and he's responding with the four different states that it can be.

It can have 00, so the first bit and the second bit are both 0.

01, where that last bit has changed, through to 10 and 11.

It's quite easy to figure out all the different states that we can have.

So we would say a 2-bit sequence is going to have four different states.

So let's extend that.

And Sam's asking now, "What are all the states a 3-bit sequence can be?" Again, Lucas is pretty confident he can list them all out.

Change one bit at a time, and come up with all the states that that 3-bit sequence can be.

So we would confirm that and say a 3-bit sequence will have 8 different states.

However we have a problem the longer the binary sequence becomes.

The longer the sequence is, it follows that the more states it can have.

So Sam extends this again, and he moves it up to 5 bits.

So what are all the states a 5-bit sequence can be? It's at this point Lucas realises that that's gonna take him a long time to work out.

It was easy when there were two or three bits, but because we've got those extra bits in the sequence, listing those combinations becomes difficult.

Now actually, if you worked it out, a five-bit sequence can have 32 different states.

The method Lucas uses for determining the number of states a binary sequence has is to list them, and it helps to do it in counting order. You start with the right-most bit changing to a one. Then, as we progress, the next column along changes and the first column resets itself back to zero.

That process carries on until you've listed all of the possible states of that binary sequence.
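Here's a minimal Python sketch of that listing process, if you'd like to try it. The function name list_states is made up for illustration. It counts from 0 upwards and writes each number as a fixed-width binary sequence, which gives the same counting order Lucas uses.

```python
def list_states(n_bits):
    """Return every state of an n-bit sequence, in counting order."""
    # format(i, "03b") writes i in binary, padded with leading zeros to 3 digits;
    # here the width is filled in from n_bits.
    return [format(i, f"0{n_bits}b") for i in range(2 ** n_bits)]


print(list_states(2))       # ['00', '01', '10', '11']
print(len(list_states(5)))  # 32
```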

As Lucas found out though, even if we have just one extra bit put on to the left-hand side of this sequence, it's actually gonna take a lot more work to list all of those different possible states.

The good news is that we can use maths to determine the number of possible states of a binary sequence.

So if we think we've got n number of bits, we can say that the number of states is 2 to the power of n.

And in our example, if we have a 4-bit binary sequence, the number of states would be 2 raised to the power of 4.

Because remember we said that the n is the number of bits.

We've said we are using 4 bits.

So we do 2 to the power of 4.

And that gives us our number of states as 16.

Now if you're not sure how that's worked out, we can do it on a calculator or we can consider that 2 to the power 4 is 2 times 2 times 2 times 2.

So that's four 2s multiplied together.
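The same calculation can be checked in Python, where ** means "to the power of". This is only a small sketch, and number_of_states is a name chosen for illustration.

```python
def number_of_states(n_bits):
    """Number of different states of a binary sequence with n_bits bits."""
    return 2 ** n_bits


print(number_of_states(4))  # 16
print(2 * 2 * 2 * 2)        # 16, the same calculation written out in full
```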

Let's check that you can work that out for yourself.

How many states would a binary sequence of 3 bits have? Well done.

You worked that out perfectly.

It's 8.

2 to the power of 3 is 8.

I want you to try and practise some of that now.

First task I'm gonna get you to do is to complete a table listing all the possible different states for different amounts of bits.

You can confirm you've got the right number of states by calculating the possible number of sequences mathematically as well.

So in my first example I've realised that with 1 bit I've got the states 0 and 1.

I can check that by using my formula of 2 to the power of n, where n is the number of bits.

So 2 to the power of 1 gives me 2.

I want you to do the rest yourself.

Once you've done that, there's a second part.

So remember the change in the total number of states follows a pattern as the number of bits increases.

Have a look back through your answers in the first part, and then describe how the total changes and explain why this happens.

Pause the video and have a go now.

Well done.

There was a lot of working out there to do.

Let's check the answers that you've got.

So you can see I've listed the 4 different possible states for 2 bits, and check mathematically, 2 to the power of 2 is 4.

I've then gone on with 3 bits and written out the 8 possible states, again, checking mathematically I've got the correct number of states.

And then finally the difficult one where we've got 4 bits and we've got a lot of 1's and 0's there to write.

But again, I've done the right amount of states and checked it because 2 to the power of 4 gives me 16, and I've written the 16 different sequences there correctly.

And for part two of the task, did you spot a pattern? When the number of bits used in the sequence increases by 1, the total amount of possible states doubles from the previous total.

And if you look back and check that, 2 bits have 4 states, 3 bits have 8 states, and so on.

Each time the number of states is doubling.

So we could say as another bit is added to the sequence the existing states are joined with this new bit, which itself has 2 states.

This has the effect of doubling the previous total, therefore every bit added doubles the amount of states.
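One way to see the doubling is to build the longer sequences from the shorter ones. In this sketch (add_one_bit is a hypothetical helper, not part of the lesson materials), every existing state is joined once with a new 0 and once with a new 1, so the list is twice as long each time.

```python
def add_one_bit(states):
    """Join every existing state with a new left-hand bit of 0, then of 1."""
    return ["0" + s for s in states] + ["1" + s for s in states]


states = ["0", "1"]  # the two states of a single bit
for _ in range(3):
    states = add_one_bit(states)
    print(len(states))  # prints 4, then 8, then 16
```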

Let's move on to the second part of the lesson now where we're going to describe how ASCII is used to represent characters.

Computers have to represent lots of different things, and certainly one of those things is textual data such as words.

Words are formed from an alphabet.

So we can see here the English alphabet starting from A all the way through to Z.

Now if a computer is going to represent those characters, each of the characters is going to need its own unique binary sequence.

That means we've got to represent all 26 letters.

And if we look through the number of sequences that we had, it's likely that 5 bits are needed because 2 to the power of 5 provides 32 different states.

A little bit more, but if we used only 4 bits, we wouldn't have enough for 26 letters.

So just see if you can remember what we've just discussed.

Let's check how you got on.

So all 26 letters of the alphabet can be represented by a binary sequence of 5 bits.

2 to the power of 5 provides 32 different states.

4 bits would not be enough as 2 to the power of 4 would only provide 16 different states.
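That reasoning can be sketched as a short Python function. The name bits_needed is an assumption for illustration. It keeps adding bits until 2 to the power of n is at least the number of symbols we want to represent.

```python
def bits_needed(n_symbols):
    """Smallest number of bits whose states can cover n_symbols symbols."""
    n_bits = 0
    while 2 ** n_bits < n_symbols:
        n_bits += 1
    return n_bits


print(bits_needed(26))  # 5, because 2**4 = 16 is too few and 2**5 = 32 is enough
```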

Computers though need to represent more than just the 26 letters of the alphabet.

For example, we've got both upper and lower-case letters that need to be represented, and they need to be treated differently.

And we're not just limited to letters.

Other symbols also need to be represented such as punctuation, numbers, and spaces between words.

This example sentence that I've got here, "This sentence uses 17 different characters!", actually includes 17 different characters if we list them.

Some of them are repeated, but we need 17 separate sequences to represent those 17 different characters, including the space and the exclamation mark.
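You can check that count with a couple of lines of Python. A set keeps only one copy of each character, so its length is the number of different characters. Note that the uppercase T and the lowercase t count as two different characters.

```python
sentence = "This sentence uses 17 different characters!"

# A set drops the repeats, leaving one entry per different character.
distinct = set(sentence)

print(len(distinct))  # 17
```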

But if you think about it a bit more, character representation includes more than just the letters, numbers, punctuation, and spaces used in typical sentences.

We might have to represent the Enter character to move the cursor down to the next line.

So that's like when you press Enter in a word processing document.

We also need to represent tab spaces. When we press the Tab key, it creates a larger space to indent text, like we would have in paragraphs, or perhaps you've done it in Python where you need to indent a block of code.
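These invisible characters have binary sequences of their own, just like letters do. As a small sketch, Python's built-in ord() gives the number a character is stored as, and seven_bits is a helper name made up here to show that number as 7 bits. In Python the new-line character is written "\n" and the tab character "\t".

```python
def seven_bits(char):
    """Show the number a character is stored as, written as 7 bits."""
    return format(ord(char), "07b")


print(seven_bits("\n"))  # 0001010  (new line, number 10)
print(seven_bits("\t"))  # 0001001  (tab, number 9)
print(seven_bits(" "))   # 0100000  (space, number 32)
```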

We sometimes have to represent special characters like mathematical symbols.

Mathematical symbols such as pi, epsilon, and theta are all used in formulas, so we need to be able to represent them in a computer.

A standard English keyboard typically has 104 keys on it, give or take a few.

So to be able to give 104 different binary sequences, we are going to need to use a 7-bit sequence as 2 to the power of 7 provides 128 different binary sequences.

We've also got to consider that of those 104 keys, some of them have double symbols, meaning we can change what symbol we produce with the same key.

So 128 is probably just about right.

Now we need to be able to remember which symbol is matched to which binary sequence and that's where something called a character set comes in.

Now a character set is a system that matches binary sequences to characters.

And, as Alex knows, a computer can represent characters by picking long enough binary sequences and assigning the character it wants to each sequence.

But Izzy has thought about this.

She's asking, "What happens if another computer chooses a different order for its character set?" It's all gonna get pretty confusing if everyone has a different character set for the different characters and the binary sequences that represent them.

So it would help if we could find a way of standardising these character sets.

Computers all need to use an identical character set to know which sequence goes with which letter.

That means that using the same character set makes sure that the binary sequence matches the same character on all devices.

If we were sending an email for example, we'd then know that the characters are transferred between computers as binary sequences, and we'd be confident that using the same character set ensures that the sequences are matched to the same characters that you sent.

So let's just check that you remember what that key term is.

What is a record of characters matched to a unique binary sequence known as? Well done.

It's a character set.

In order to provide consistency for character representation, in 1963 the first edition of the ASCII standard, which is short for American Standard Code for Information Interchange, was created and used.

It listed all the possible characters that needed to be represented in its character set and chose for each character a 7-bit binary sequence that would be used to represent it.
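To see why a shared standard matters, here's a minimal sketch of one computer turning text into 7-bit sequences and another turning them back. The function names to_binary and from_binary are made up for illustration. Both rely on Python's built-in "ascii" codec, so both sides agree on which sequence goes with which character.

```python
def to_binary(text):
    """Sender: look up each character's ASCII number and write it as 7 bits."""
    return [format(byte, "07b") for byte in text.encode("ascii")]


def from_binary(sequences):
    """Receiver: turn each 7-bit sequence back into its ASCII character."""
    return bytes(int(seq, 2) for seq in sequences).decode("ascii")


sent = to_binary("Hi!")
print(sent)               # ['1001000', '1101001', '0100001']
print(from_binary(sent))  # Hi!
```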

I'm gonna want you to put some of that into practice now yourself.

Firstly, I'm gonna want you to describe how the ASCII standard is used to represent characters stored on a computer, and it's important to be able to articulate how that standard works and how it affects the characters that can be represented.

Once you've done that, I want you to try and complete the table.

The table is part of the ASCII character set.

Now you'll notice that the letters go sequentially, they go in order, but also so do the binary sequences, those 7-bit binary sequences.

I want you to have a go and see if you can predict what the next three binary sequences are if they go up in counting order.

And for the last part of the task we're going to explore the difference between upper and lowercase letters in the ASCII character set.

Now if you look at the difference between the different cases of characters, the sequences are almost the same, but the 6th bit from the right is the one that changes.

Now in Python, a string can be converted to uppercase by using the .upper() method.

I'm gonna want you to explain how this feature of the ASCII character set makes this method easier than if it had to change the ASCII value to a totally different sequence.

Pause the video at this point and have a go at these tasks, and then I'll go through the answers once you're done.

Well done.

Let's check the answers.

ASCII uses a 7-bit binary sequence to represent characters in a computer.

Each character has a standardised binary sequence, so all devices using ASCII know which character matches each sequence.

Now we saw before that letters and the binary sequences in a character set go up sequentially.

So the next three sequences are listed on the table there.
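If you want to check the counting order yourself, this sketch prints the 7-bit ASCII sequence for a few uppercase letters. The letters chosen here are only an example and may not be the same ones as in the task table. The helper name ascii_bits is made up for illustration.

```python
def ascii_bits(letter):
    """7-bit ASCII sequence for a single character."""
    return format(ord(letter), "07b")


for letter in "ABCDEF":
    print(letter, ascii_bits(letter))
# A 1000001
# B 1000010
# C 1000011
# D 1000100
# E 1000101
# F 1000110
```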

And for the last part we see that the .upper() method only has to change 1 bit in our 7-bit sequence, not the whole sequence.

And this makes the method's algorithm simpler and easier to use.

If we think about how we design our data we can actually make our algorithms that process that data easier to operate and easier to programme.
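Here's a sketch of that idea for ASCII letters only. It is not how Python's real .upper() method is written, which also has to deal with characters outside ASCII. The names CASE_BIT and to_upper_ascii are made up for illustration. The 6th bit from the right is worth 32, and switching it off turns a lowercase letter into its uppercase partner.

```python
CASE_BIT = 0b0100000  # the 6th bit from the right, worth 32


def to_upper_ascii(char):
    """Uppercase one ASCII letter by switching off a single bit."""
    if "a" <= char <= "z":
        return chr(ord(char) & ~CASE_BIT)
    return char  # anything that isn't a lowercase letter is left alone


print(format(ord("a"), "07b"))  # 1100001
print(format(ord("A"), "07b"))  # 1000001
print("".join(to_upper_ascii(c) for c in "hello!"))  # HELLO!
```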

Okay, let's get onto the last learning cycle, which is to explain why Unicode representation is needed.

ASCII provides a character set of 128 possible characters.

However computers need to be able to represent more than 128 characters.

ASCII is only specified to work with Western alphabets and not other languages from around the world.

And if you think about it a little bit more, character representation also needs to include other symbols that aren't parts of speech.

For example, a maths textbook might need to include mathematical equations with different symbols and different styles of symbols as well.

And we know if our ASCII character set is limited it wouldn't be possible to represent this fully using ASCII.

We may find over time as well that new characters need to be created and represented as part of a character set.

Emojis, for example.

Each emoji is represented by a binary sequence in a character set.

They are characters, after all, just like letters and numbers.

So what can you remember? Which of these three options cannot be represented in ASCII? Well done.

It's the Urdu characters we've got there.

The exclamation and the uppercase T both can be represented in ASCII, but we need something else to be able to represent other languages that aren't based on the same alphabet.
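Python strings have an isascii() method (available from Python 3.7) that tells you whether every character fits in the ASCII character set. In this sketch the Urdu example is the word "Urdu" written in Urdu script, typed as escape codes so it displays the same everywhere. It is an example chosen here, not necessarily the characters shown in the lesson.

```python
exclamation = "!"
upper_t = "T"
urdu = "\u0627\u0631\u062f\u0648"  # the word "Urdu" in Urdu script

print(exclamation.isascii())  # True
print(upper_t.isascii())      # True
print(urdu.isascii())         # False
```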

To get around this Unicode was created and first used in 1991, and that allowed more characters to be represented than is possible with ASCII.

Unicode uses up to a 32-bit binary sequence, which is actually 4 bytes, to represent characters.

Again, if we work out how many sequences that is, this gives 2 to the power of 32 different sequences, which can be used to represent over 4 billion different characters.

Unicode was set up to be compatible with ASCII.

The first 128 characters in Unicode are the same as the first 128 in ASCII.

That means the 7-bit sequences used by ASCII mean the same thing in Unicode, and the other 25 bits that are possible with Unicode are used for other characters.
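This match can be checked in Python, whose strings are Unicode. As a small sketch, we build the 128 characters once by decoding the numbers 0 to 127 as ASCII, and once by asking Unicode for characters 0 to 127 with chr(), then compare the two.

```python
# The 128 ASCII characters, decoded using the ASCII character set
ascii_text = bytes(range(128)).decode("ascii")

# The first 128 Unicode characters, looked up by number
unicode_text = "".join(chr(code) for code in range(128))

print(ascii_text == unicode_text)  # True
```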

Let's check if you remember what we've just seen.

How many characters in Unicode are the same as in ASCII? Well done.

It's 128.

So the first 128 characters in Unicode are the same as in ASCII.

Okay, it's time for our last practice tasks of the day.

First, I've given you a scenario and I want you to have a think and give an explanation as to why this scenario has happened.

A shipping company in Beijing, China has sent a parcel to Hull, UK.

When the parcel leaves China an automated confirmation message in Mandarin Chinese is sent via the internet to the destination.

When it arrives at the destination the tracking system in Hull is an old system and still uses ASCII encoding.

When the message from China is received it's full of missing characters represented by a rectangle.

Can you explain why the message received has these missing characters, even though it was sent correctly? Next, I want you to give Andeep some help.

He's saying, "My computer uses Unicode so I won't be able to read documents encoded in ASCII." Explain why Andeep is incorrect and why he will be able to open any ASCII documents even though he uses Unicode.

Pause the video and have a go at those tasks.

How did you get on? Let's check your answers against mine.

So in our first part we've got to understand that the message from China was probably encoded in Unicode to be able to represent Mandarin characters.

When it's received by the older system that still uses ASCII, the system can't recognise these characters and any unrecognised character appears as a rectangle.

It shows that the systems aren't compatible because ASCII can't show characters it doesn't have a record for in its character set.
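Something similar can be seen in Python. This is only a rough sketch of the scenario, not the real tracking system: the message text is invented, and Python marks each character that ASCII has no record for with a question mark rather than a rectangle.

```python
# "\u4f60\u597d" is the Chinese greeting for "hello"; the rest is plain ASCII text.
message = "\u4f60\u597d, parcel 42"

# Force the message through ASCII, replacing anything ASCII cannot represent.
received = message.encode("ascii", errors="replace").decode("ascii")

print(received)  # ??, parcel 42
```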

For the next part, we helped Andeep by telling him the first 128 Unicode characters match the ASCII character set.

So a system using Unicode can still display ASCII characters correctly.

Unicode is designed to work with ASCII ensuring the same characters are represented.
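Andeep's situation can be sketched in Python too. One assumption here that isn't in the lesson: his computer reads files using UTF-8, which is a common way of storing Unicode text. A document saved as ASCII still reads back exactly the same.

```python
# A document saved by an older system using ASCII
ascii_document = "Hello, Andeep!".encode("ascii")

# Andeep's computer reads the same data as Unicode (UTF-8 assumed here)
print(ascii_document.decode("utf-8"))  # Hello, Andeep!
```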

Well done.

You worked really well today.

Let's just check what we learned.

Binary sequences have 2 to the power of n different states, where n is the number of bits in the sequence.

Character sets keep a record of characters against their unique binary sequence.

And we saw that ASCII uses 7 bits to represent characters, whereas Unicode uses up to 32 bits.

Remember as well, Unicode can represent a wider range of languages and symbols, but also still is compatible with ASCII.
