query

How do I use the MultiIndex in pandas?



Sharing buttons:

hi there this is kevin from data' school

you're about to watch an excerpt from a

private Q&A webcast that I do every

single month

this question is about multi level

indexing in pandas I'm super excited to

release this video because people have

been asking me about the multi index for

years and I finally built a lesson to

cover this crucial topic stick around to

the end to find out how you can join

these webcasts in the future

Thanks so this is a question from Manny

he says could you please elaborate a bit

on multi indexing this is in pandas

especially how it's related to group by

how to properly slice a multi index

table and also how to concatenate on

more than one level could you also

please elaborate a bit on tidy data and

if you think it is preferable to split

data into different tables per patient

which is that's his data set or to keep

all observations in a multi index okay

so his question is about multi indexing

in pandas and I've put together a little

lesson that I'm gonna live code that

should answer a lot of these I think

will answer all of these questions which

I hope you enjoy so let me go ahead and

share my screen one more time here so

the topic here is multi index in pandas

and I will import pandas as PD and then

I was thinking about a good data set to

demonstrate the multi index and I pulled

one up I'm going to read it from my

local machine it's a very small data

frame called stocks dot CSV okay but um

that would work but actually there's

also a URL you can pull this data set

from yourself so I'll go ahead and use

that actually um it dot Li slash small

stocks okay you have to type in that

whole URL but read CSV can read from

that URL we're gonna store it in a data

frame called stocks and let's look

stocks okay now before I forget let me

go ahead and hide the header in the

toolbar so that we can have one more

screen space I will be providing you

with this code or you can follow along

but I'll provide you with this code in

this notebook form if you want to study

it later on but here's our stocks data

frame and this is the entire data frame

it's only nine rows what are we looking

at we're looking at some stocks data

it's got four columns we've got date

we've got the closing price that day

we've got the volume and we've got the

stock symbol okay

so the stock symbol is the company that

Cisco Apple and Microsoft here are the

dates it's three days in October 2016

here is the price and volume of trading

for that stock on that day okay anyway

let's get into the index so this is a

regular data frame so if we look at the

index we have what's called a range

index that represents these integers 0

through 8 ok it's a regular index it's

not a multi index now before we get into

the multi index I just want to do a

quick group buy okay stock stock group

by symbol okay and you'll see what I'm

doing this in a moment close got me okay

so this is the group by means for each

symbol meaning for each company what's

the mean closing price okay and we'll

run that and we'll see apples mean 1/12

Cisco I think 31 Microsoft 57 okay so

that's a group by now to follow what I'm

about to do next you have to be familiar

with group buys because I'm now going to

group my two things I'm going to do

stock stock group group buy and I'm

gonna pass a list the pass symbol and

eight hot clothes mean so this is saying

for each hair of symbol and date what's

the mean closing price okay now it's a

little silly because there's only one

closing price for each symbol and date

but you could imagine a case if this

column was like price and it had two

prices per day like an open and closed

then you could take the mean of those so

you know yes it's true the mean is kind

of silly in this case but I think you'll

see that what I'm going after in a

minute okay so what are we looking at we

are actually looking at a series so I'm

gonna save this as s er just to

emphasize that this is a series with a

multi index and if we look at the index

will see yes this is a multi index now

the first thing I'll say about multi

index is a multi index adds another

dimension to your data so a series is

normally one dimension but a series with

a multi index is kind of like two

dimensions a data frame within normal

index is one is two dimensions but with

a multi index it's three dimensions or

at least three dimensions so the way you

can kind of notice this is you're like

okay normally my index is just one thing

and now I've got two things that I'm

indexing by okay what we're gonna do is

you can actually on stack a multi index

series and it becomes a data frame okay

so if you've always thought of as a

series as one dimensional and a data

frame is two dimensional well a series

with a multi index is two dimensions not

one but because it's two dimensions it

naturally can be represented as a data

frame

so when I unstacked it when I unstuck it

no unstacked it you now have a a

dataframe now I will say one side note

and then we're gonna talk more about

selection and then we'll move into a

data frame with multi index which is

probably what Manny was ultimately

asking about but I don't want to just

fly right through this and miss a couple

important things so I do want to mention

that if you if this is what you were

looking for there's actually a different

way to get this data and it's actually

by doing a pivot table okay so if I had

instead gun stocks the original data

frame dot pivot table values equals

close because that's what's in here the

closing price index equals symbol

because that's what's over here and then

columns equals date you will get the

exact same thing as this followed by an

unstack okay so it it renders the exact

same thing and that's because the

default aggregation function for a pivot

table is a mean but you can change the

aggregation function okay so this is a

data frame okay and I'm just gonna save

it as DF to emphasize that okay so let's

talk a little bit about selection

because understanding how to select from

a multi index series will help us when

we talk about selection from a multi

index data frame okay so let me just

highlight real quick you've got our

series okay here's our series let's say

that I want to select all of the Apple

data okay I just do and actually let's

make this a new cell series got loke

AAPL okay so selecting from a

a multi index series you can just name

with Lok the outer level and that will

allow you to select all the data that is

that it falls under Apple okay so we've

got the date and the closing price all

right now let's say I just wanted let's

say I wanted to focus on Apple on this

day okay well I would actually say

Series C wrote and I comma

whoops bad fate I need to put that in

quotes that's how I just select that

number okay now what if instead of

selecting a single symbol and getting

all this stuff what if I wanted to

select a single date and then get all

the data surrounding it okay well kind

of like I mean it's exactly like

selecting from a data frame so which

first level data do I want I want all of

the symbols which date do I want I want

your steak okay you get all of the

prices Apple get Cisco Microsoft from

that date all right

so quick recap I can select from the

outer level by just naming it I can

select from the outer and inter and

inner level and just get the value or I

can select from just the inner level

okay all right

hope that is helpful feel free as I said

before feel free to type in the chat if

you have some questions as I'm going now

what I want to show you real quick is

the really cool part when you have the

data frame which remember you can either

create this your data frame contains the

same exact data as the series it's just

in a different shape I got it by

unstacking the series okay so see dot

unstack is this the data it's the data

frame is the same thing the cool thing

is that if you have reshaped your data

to look like a data frame the selection

works exactly the same data frame got

loke Apple dataframe got lok apple

october

bird ate a frame dot Lok all rows this

column and you can see it makes sense

because when we've got a data frame and

we're using dot Lok this is how we

select rows this is how we select row

column pairs and this is how we say all

rows this column okay

so whether or not you're multi index

series is represented as a series or a

regular data frame the selection works

the same all right all right let's

finally actually move on to a data frame

with a multi index okay because again

that is probably what manny has in his

data set so stocks what I'm gonna do so

again here's the room here's the stocks

data set as it started but to create a

multi index out of this we're gonna set

the index on symbol and eight okay and

we'll do in place equals true and then

we'll print it out okay I think I got

that right okay so we have now set a

multi index for this data frame and you

can tell it's a multi index because

there are two of them that are in bold

and they're a little lower so visually

you can tell but you can also of course

check the index you'll see that it is a

multi index okay now you're probably

thinking like wow that doesn't look

something looks off and the reason is

you actually probably want to sort it

first and then it'll look correct if you

will and you'll see what I mean in a

second so all we have to do is just

stocks dot sort an index in place equals

true okay and let's look at stocks again

now this is what we really wanted so we

have a multi index and it's sorted first

by the outer level the outer index level

and then by the inner in next level okay

so this is our multi index data frame

and the challenge we want to get into

now is how do we do selection from this

multi index data frame let's start with

the easiest stuff so stocks dot Lok and

we're gonna continue to use Lok if I

choose Apple AAPL

you will get um kind of what you'd

expect I'm just selecting this little

block of data

when I select Apple and that is exactly

what I get all right now what if I want

Apple on October 3rd well your first

thought might be well how about Apple

and 16 1003 now I believe this will work

but I'm gonna show you a good practice

um that is going to save you a lot of

headache okay in debugging your your

selection statements before multi index

data frames okay so here's what you're

gonna do when you have this multi index

and you're doing anything other than

just selecting the outer level you're

going to pass in the indices you want as

a tuple comma and then you're gonna name

which columns you want so just like

before with a regular data frame for

loke you say what rows do I want what

columns do I want the first part of that

here's the what rows do I want here's

the what columns do I want okay now you

pass it as a tuple

that's what if this is right here

because when you don't have the

parentheses pandas can get confused

about like what that comma means are you

distinguishing between different index

levels or are you distinguishing between

rows between index levels and columns

okay

so you put it in parentheses to pass it

as a tuple and then you do a comma and

then you put a colon to say all columns

okay so with that I can get the close

and the volume for Apple on that day

Apple date closed and volume now if I

just want the close you probably have

figured out how to do it which is just

type closed here 11250 - all right now

get a tiny bit trickier what if you want

to get the data from multiple stocks

let's say Apple and Microsoft well we're

gonna say stocks loke and our brackets

and then we know we're gonna have

parentheses something comma colon okay

and what I'm gonna pass is a list Apple

Microsoft okay and let's um pick a

particular date so I think let's see

2016 10 three okay

so now I've got two different stocks on

one day but both the clothes and volume

columns now as you might expect I can

also I don't know how many exam or

examples of selection I just if this is

new to you I know this is probably

pretty fast but if you've been

struggling with this hopefully this is

turned on a light bulb for you so you

can throw in of course clothes only and

now I can see Apple and Mike

on that day just the closing price

what's another thing I can do here

I could say well what if and I'll copy

and paste just to save myself a little

typing what if I want let's see I want

multiple dates but I just want Apple

okay so I'm gonna remove the brackets

there because I'm don't need to pass

that list so this would work but what

I'm doing is what if I want multiple

dates I will say your state and another

day 2016 1004 okay and that's a list a

list of strings so I think I've got that

right okay now the final example I want

to show you for this and then I'll

answer some of me and these other

questions and then we'll wrap up so what

if I want these two days but I I want

all stocks not just Apple you probably

would think and so let's start with this

and I'm gonna change this to a colon

okay now the first thought if like oh I

want all symbols on those two dates and

both close and volume you might think

I'll just throw in a colon well for

technical reasons that does not work

okay so what you have to do looks really

funny but you're gonna type slice none

okay and that is how it will work okay

slice none and then your two dates and

then I want both clothes and volume okay

I know that's super complicated but this

is what selection with multi index looks

like okay takes a lot of practice to get

it right now I know I've been talking

about this for a while but I still

haven't answered all of me and these

questions though I will be able to wrap

up with them

so he asked how to concatenate on more

than one level now what does that mean I

want to just show you what he means so

I've got some code I wrote to kind of

save me a little time and don't worry

too much about the code itself but let

me just I'm just gonna copy and paste it

in and then we'll talk about it okay

now in our data frame we've got symbol

date closing price symbol date volume

okay and his question is how do I put

this data frame and this data frame

together and the answer is you're gonna

do a merge operation okay so it's

actually pretty simple we're gonna do

I'm gonna call it both is the result

we're going to BPD merge and we're gonna

say close because close is the name of

this data frame I named it close as you

can see this one's called volume so

we're going to merge close and volume

and we're gonna say rather than defining

a a column to join on we're gonna say

join on the left index meaning the index

of clothes and the right index now if

you if there were some missing values

and one of them or the other you'd have

to think about the merge type which is

the how argument but there aren't any

missing values so we're just gonna use

the default and let's go ahead and run

this and you can see that we've

reconstructed our data frame okay so

that is how when you have two data

frames with the same multi index but a

different column or columns that is how

you combine them you combine them with

the merge functionality okay final

aspect of manís question could you

please elaborate a bit on tidy data and

if you think it's preferable to split

the data into different tables per

patient

or keep all observations in a multi

index so essentially he's asking well

number one what is tidy data and I'll

just give a brief explanation from the

book are for data science written by how

they Wickham who's famous for talking

about tidy data and his definition is

here are the rules which make a dataset

tidy each variable must have its own

column each observation must have its

own row each value must have its own

cell okay and I will link to this where

you can read more about it so that's the

tidy data principle and funny enough

this the very data set we started with

is already tidy this is a tidy data set

now when we add two indices to it and

make it a multi index I don't think it

makes it any less tidy but you can

always if you need like if what you've

got is this down here and you're saying

oh well I need to do something else and

it's not technically tidy because it has

this multi index all you have to do is

just say both that reset index now

you're back to the original data frame

okay so you just reset the index and

this data is tidy to answer your

question should you split this into

multiple data frames based on patient

that would essentially be like having

one data frame for Apple one data frame

for Cisco one data frame for Microsoft

and in my opinion in most cases that's

just gonna make your job a lot more

complicated and having it all in the

same data frame is going to be much more

efficient for your workflow

hope this video was helpful to you if

you'd like to join my monthly webcasts

and ask your own question sign up for my

membership program at the $5 level by

going to patreon.com slash data school

there's a link in the description below

or you can click the Box on your screen

thank you so much for watching and I'll

see you again soon