Computer Stuff They Didn't Teach You #2 - Code Pages, Character Encoding, Unicode, UTF-8 and the BOM

hey friends I'm Scott Hanselman and this

is things they didn't teach you some of

the folks in the last video commented in

the comments that maybe they did teach

you this stuff I don't care that's the

title of the thing so maybe you learn

these things and maybe this isn't the

video for you but it's a fun title and I

like it so the idea is stuff that maybe

you forgot or didn't know or they didn't

have a class on this maybe you didn't

pick it up there's lots of folks out

there who are learning from boot camps

they're learning to be a programmer or

an IT person by picking it up and

sometimes you just don't pick these

things up as you go about your life

so in episode 1 we talked about carriage

return and line feed and we just touched

on a bit of character encoding I thought

it would be interesting to talk a little

bit about character encoding right here

so every once in a while you may find

yourself on the internets and you might

go and find a you know a character like

this like a Chinese character or a non

Latin character and then you go into

your notepad or you maybe have opened a

text file from somewhere and you paste

it in and you go oh my you know it's

happening it's it's all nonsense and

then you become frustrated or perhaps

you open up a text file and it's all

black squares and you don't know what's

going on well that gets us into the

question of character encoding in our

last episode we talked a little bit

about the ASCII character table in that

episode in the last episode I said the

word ASCII, A-S-C-I-I, the American

Standard Code for Information

Interchange and I said it all casual like

someone would know what that meant well

here's the deal back in the day back in

the day we had only a few bytes and

every byte mattered in fact every bit

mattered so someone came up with a way

to go and put the first 128 characters

that you might need into seven bits not

8 bits but 7 bit they made all that fit

in 7 bits that's 2 to the seventh which

is 128 if you added a single extra bit

if you had all the space in the world

and you would get 256 but let's talk

about 2 to the 7th it's an agreement

that this

byte means this and this byte means

that so they got around and they said you

know A is gonna mean 65 in base 10 or

it's gonna be 41 in hex let's take a

look at this the way that I'm going to

teach you this is I'm gonna do something

silly I'm gonna write a computer program

let's go and make a directory and we'll

call it write some bytes it doesn't

matter what language I use I'm going to

use C sharp but you can use whatever

makes you happy so I'm gonna say dotnet

new console we're gonna make a little

stupid console application I want to

make this point really really clear and

open up Visual Studio Code with code .

I'll open up this and we're gonna go to

our program right here and we've got

hello world we're gonna get rid of that

we're gonna say hey we're gonna need

some bytes we'll call them bytes and we

will make 128 of them so we've got a

byte array of 128 bytes okay

and what we're gonna do is we're gonna

make a for loop from the number 0 up

until, until what, 128, okay

then we're gonna say bytes hey bytes put

in I what we're gonna do is we're gonna

take I which is an integer we're gonna

instead make it a byte we're gonna shove

it in there so we're gonna make a bunch

of bytes from 0 to 127 and then we're

gonna go we're gonna say hey we need

some IO and we'll say File.

WriteAllBytes we'll call it 128

bytes.txt and then we're gonna

give it our bytes boom pow that's cool now

that's our app very simple now I'm gonna

just pop out to the terminal here I want

to say run it and we're gonna go and

look at those bytes so watch on the left

hand side here because if I did it right

boom 128 bytes I'm gonna open it and

this thing says this is using an

unsupported text encoding

well why is that, that's because the first byte

is 0 and the second byte is 1

etc etc etc so if we go back and look

at the table here right zero is NUL

nulls are not interesting but let's open

it anyway shall we it's a bunch of

schmutz until we got into recognizable

characters in fact if I delete all of

this stuff and hit save and then

right-click and look at it in a hex dump

we can see that the interesting bits

started around 21 and when we talked

about a there's 41 for upper case

there's 61 hex for lower case alright
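
The little program described above can be sketched roughly like this. The video uses C#; Python is used here only as an assumption to keep the sketch short, and the helper name write_ascii_bytes and the temp-directory location are illustrative, not from the video.

```python
import os
import tempfile

def write_ascii_bytes(path):
    """Write the byte values 0..127 to path and return them."""
    data = bytes(range(128))  # 2**7 = 128 values, one byte each
    with open(path, "wb") as f:
        f.write(data)
    return data

# Hypothetical file location; the video writes into the project folder.
path = os.path.join(tempfile.gettempdir(), "128bytes.txt")
data = write_ascii_bytes(path)

print(len(data))                         # number of bytes written
print(hex(ord("A")), hex(ord("a")))      # code for upper and lower case A
print(data[0x21:0x7F].decode("ascii"))   # the printable range, starting at '!'
```

Everything below hex 20 is a control character, which is why the editor complains before the recognizable characters start.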

now what if we took 256 bytes

256, look, it's bigger than what fits in a

byte it's gonna go too big so we're

gonna switch this back to an int and

then when we're done here we're going to

shove it into a byte so we're gonna spin

through from the number 0 to 255 we're

gonna shove it into here and I'm gonna

say dotnet run alright

got bigger but it's all on one line

what's all this crap look at all these

things here okay what's happening let's

go and look at it in a hex dump we can

see we went from 0 all the way up to FF

but does this stuff on the right

actually reflect reality who decides

that this number means a it depends
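
A minimal sketch of the 256-byte version and of what a hex dump view is doing, again in Python as an assumption. The function name hex_dump is illustrative; the text column on the right only shows printable ASCII and puts a dot for everything else, because what the other bytes "are" depends on the encoding.

```python
def hex_dump(data, width=16):
    """Return a classic hex dump: offset, hex bytes, and a text column."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02X}" for b in chunk).ljust(width * 3 - 1)
        # Printable ASCII is shown as-is; every other byte becomes a dot.
        text_part = "".join(chr(b) if 0x20 <= b < 0x7F else "." for b in chunk)
        lines.append(f"{offset:08X}  {hex_part}  {text_part}")
    return "\n".join(lines)

# The loop counter has to be wider than a byte, because 256 itself
# does not fit in one; range() here plays the role of the int counter.
data = bytes(range(256))
dump = hex_dump(data)
print(dump)
```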

that's called character encoding there's

lots of different character encodings now

all this time what I was saying is ASCII is

just a 7-bit character encoding that

means it's values all the way up to 127

there's a lot of code pages out there

the Windows code page

for whatever reason is called code page

1252 that's the one for graphical apps in

Windows and code page 437 is the one for

console applications and there's a bunch

of other code

pages they're identical until you get up

past 127 okay so for example one code

page might say that's a euro character

and another one might say that's this

cool C and another one might say

that's a non-breaking space and that might

be an a with an accent on top it

all depends these cool DOS looking

console deals will only show up like

that if you apply the right code page

okay so you've got to have a font that

supports it and you've got to know what

the code page is so the way another way

to think about this is that the string

that you have it doesn't mean anything

unless you have an associated code page
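
To make that concrete, here is a small sketch that decodes the same bytes under three code pages. Python and the helper name decode_with are assumptions for illustration; the codec names cp1252, cp437 and latin-1 are the ones Python ships with.

```python
def decode_with(raw, codec):
    """Interpret the same raw bytes under a given code page."""
    return raw.decode(codec)

# One plain ASCII byte, then three bytes above 127.
raw = bytes([0x41, 0x80, 0xA0, 0xE9])

for codec in ("cp1252", "cp437", "latin-1"):
    # repr() makes invisible characters like the non-breaking space visible.
    print(codec, [repr(ch) for ch in decode_with(raw, codec)])
```

The first byte comes out as A every time; the other three change meaning with the code page, for example hex 80 is the euro sign in code page 1252 and a C with a cedilla in code page 437.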

okay now if we take one of these files

and we open it up in notepad what's that

it looks like crap what happened here is

notepad took a guess notepad took a

guess and said I think this is what it

is I think it's UTF-16 we'll talk about

that in a second and it got it wrong

let's open it up in Notepad2 a

different notepad application it

also took a guess it took a guess it

said ANSI if I double click on that in

Notepad2 ANSI is in fact code page 1252

and the one that we saw that was for

the console is called OEM 437 who knows

why they named them those numbers it's

silly here's the deal though these are

all different views on how you can

present this stuff another common one is

ISO 8859-1 if I click that it'll say

wait a second if I switch it things

might go south if we find any characters

here we don't recognize we're gonna turn

that into something else we're gonna

turn it into default characters now in

this case nothing happened which is a

good thing but what if I switch it to

something like Unicode and we go and we

grab that Chinese character again I'm

just gonna grab character for mother I'm

gonna throw it in here a couple of times

okay and then what we're gonna do is

we're gonna switch this to ANSI in

this case the most basic 8-bit

Windows encoding it's gonna warn you hey

what everything went bad
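
A sketch of why the editor warns you, in Python as an assumption. It also assumes the character in the video is 母 (U+6BCD); any character outside code page 1252 behaves the same way.

```python
text = "母"  # assumed character for "mother", U+6BCD

# Strict conversion: the code page has no slot for this character.
try:
    text.encode("cp1252")
    stored = True
except UnicodeEncodeError:
    stored = False
print("fits in cp1252:", stored)

# Lossy conversion: the character is swapped for a default one.
lossy = text.encode("cp1252", errors="replace")
print(lossy)
```

Once the file is saved that way the original character is gone; decoding the result can only give back the default character.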

now what if I just made a new file, see

look I can't even paste it in there what

if I make a new file I'm gonna click

here where it says ANSI I'm gonna say

Unicode UTF-8 I'm going to paste in the

character for mother I'm gonna hit save

when I put it on my desktop go out to

the command line look at that in this

case here I got three bytes it doesn't

look Chinese and it's wrong but is it
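
Here is roughly what the console is doing, sketched in Python as an assumption and again assuming the character is 母. The three bytes on disk are right; the console is just reading them with code page 437.

```python
text = "母"  # assumed character, U+6BCD

utf8 = text.encode("utf-8")
print(len(utf8), utf8.hex(" "))   # three bytes: e6 af 8d

# Same bytes, viewed through the console's default code page.
as_console_sees_it = utf8.decode("cp437")
print(as_console_sees_it)

# Same bytes, viewed as what they actually are.
print(utf8.decode("utf-8"))
```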

what could we do to guarantee that folks

got this right what if we saved a

signature in front of it a byte order

mark we're gonna save it again I want to

point something out I'm gonna go ahead

and say this is a nine byte file

right now I'm gonna hit save now it is a

6 byte file okay if I switch it to

Unicode save it again it's a three

byte file change it back to Unicode

signature back to six bytes it's

still wrong in the DOS box

because that's how things are going to

work for a while but there's three

bytes in front of it that's giving

me information that I maybe didn't know

about okay what if I said I want to

change the code page

I'm in DOS and what I said was display

this character using this code page I

could go and I could say change code

page to 1252 that doesn't look right I

could change it to 437 that's where we

were at the beginning remember that's the

default code page or I could change it

to Unicode, code page 65001 for UTF-8, which enables that but what's

that first character what's going on

there what is this thing here

remember when we saved that stuff we

said save it with a signature let's open

it up find out what's going on let's go

to a hex dump those three bytes are

called the BOM the byte order mark it's

the Unicode byte order mark the idea was

if you had this magic string here it

would tell you what to expect it says

expect things to look like this and

bytes to be in this order from this

point forward so that byte order mark

would get carried around and then once I

go and have that byte order mark in my

text file it assumes that everything

from that point on is stored as a

Unicode code point which in UTF-8 is a magic

number of one two three or four bytes that

expresses a point in a map and it could

be any character that Unicode supports
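
A short sketch of the signature and of how many bytes a code point takes in UTF-8. Python is an assumption here; codecs.BOM_UTF8 and the utf-8-sig codec are part of its standard library, and the sample characters are illustrative.

```python
import codecs

print(codecs.BOM_UTF8.hex(" "))      # the three signature bytes

text = "母"                           # assumed character, U+6BCD
plain = text.encode("utf-8")         # no signature
signed = text.encode("utf-8-sig")    # signature + the same bytes
print(len(plain), len(signed))

# A reader that does not strip the signature sees it as an extra
# leading character, U+FEFF.
print(repr(signed.decode("utf-8")))
print(repr(signed.decode("utf-8-sig")))

# Bytes per code point in UTF-8 for a few sample characters.
sizes = {f"U+{ord(ch):04X}": len(ch.encode("utf-8")) for ch in "Aé母😀"}
print(sizes)
```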

in fact Unicode has this lovely

website where you can go and find all of

these characters if you're in Windows

you hit like Windows+R and you type in charmap

and you can get this old and wonderfully

fabulous application and pick any font

I'll just pick a regular font like Arial

and click on it and you can see the

Unicode code point for that character

and this thing is interesting it says

Alt+0233 if I run notepad

and pull it down here and I'm gonna use

the number pad on my keyboard here I'm

going to hold down alt with my left

hand and then with my right hand I'm

gonna type say 0 2 3 3 and I just typed

that symbol if I want to type the

registered trademark Alt+0174 makes

sense when you get way down farther you

can't type these yourself but if you're

looking for a character that you can't

type you can grab it select it hit copy

and paste it in there okay but again if

you don't watch for your encoding when

you save it you will potentially lose

information because anything over 255

anything over about 241 anything right

there that's your cutoff right there
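
The Alt codes and the cutoff can be sketched like this, in Python as an assumption. It assumes a Western Windows setup where the Alt+0nnn codes map to code page 1252; the helper name fits_in_one_byte is illustrative.

```python
def fits_in_one_byte(ch, codec="cp1252"):
    """True if the character has a slot in the given 8-bit code page."""
    try:
        return len(ch.encode(codec)) == 1
    except UnicodeEncodeError:
        return False

# Alt+0233 and Alt+0174 are byte values 233 and 174 in code page 1252.
e_acute = bytes([233]).decode("cp1252")
registered = bytes([174]).decode("cp1252")
print(e_acute, registered)

# Characters with no slot in the code page are the ones you lose on save.
for ch in ("A", e_acute, registered, "母"):
    print(ch, fits_in_one_byte(ch))
```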

what you need to understand though is

once you've got this BOM this byte

order mark you see it works immediately

and I can go and put ASCII before and

after it actually let's go ABC ABC if I look

at the file here we can see that it was

loaded correctly with utf-8 with BOM I

can right-click on it and I can see the

byte order mark ABC the Chinese

character this is interesting

see right there my hex inspector it

actually points out the string at the

bottom there then ABC again that's cool

without that byte order mark things

would go south last thing we'll do what

if I made a little bit of room and I

made a 256 bytes with a BOM this is not

how you would do this to be clear

and what I'm going to do is I'm going to

hard-code the first three bytes I'm

gonna say bytes at zero I'm gonna make

it EF byte 1 and byte 2 are gonna be BB

and BF I'm hard coding the BOM then I

will do my 256 other characters and then

we'll run this shows up over here boom

there's my byte order mark now we're

going to open this with notepad our 128

bytes text file got confused our 256 one

got confused opened as ANSI looks pretty

decent though remembering that the first

32 characters are kind of trashy

they're just control characters for

doing stuff but 256 bytes with BOM

watch right here where it says ANSI when

I drop it see how it says utf-8

signature it recognized that we wrote

out that byte order mark and it was

smart enough to even give us the

characters for those higher-level bytes

that we wrote out those higher than 127

bytes so everything from here down okay

so I realized that there are lots of

different ways to express this

information and maybe this wasn't the

easiest for you I'm doing the best I can

but I want folks to get a general sense

of encoding character encoding what it

means how it works and that you need to

know about it because when you get a

string and you don't know the encoding

of the string the best you can do is

guess if you have a byte order mark then

you have a lot more to go on but not all

bytes are made equal and if you have any

more comments or questions please put

them in the comments below and if you

have an idea for a future video please

holler at me and I'll do the best I can

to make one thank you very much and

please do subscribe