How do I merge DataFrames in pandas?



first question from Chinese hi Kevin

thank you so much for the pandas

tutorial I've really improved my skills

in pandas after going through the

tutorials there's one topic that I could

not get a good hang of it is merge and

join based on columns and indexes I know

this is a big topic can you explain it

alright so let me go ahead and share my

screen and here's my plan I'm gonna

cover kind of four sections in this

lesson and the first section is about

selecting a function meaning which

function should you use to merge and

when the second is the bulk of the

lesson I'm gonna just talk I'm gonna go

through a relatively simple merge but

really talk through it in detail so you

feel like you can you can get it the

third section is called what-if and it's

about common variations to the basic

merge and the fourth is about the four

types of joins very briefly because

you'll also need to use that depending

upon the scenario so that's my plan

let's just jump right in and again if

you have a follow-up question feel free

to post it in the chat so part one

selecting a function here are the

functions related to merging that you

might want to use at some point so

there's two at the top one is a data

frame method called append and it takes

the format you know df1.append(df2)

and that's for

stacking things but only vertically then

there's what's called a top-level

function pd.concat and you pass it a

list of objects that you want to

concatenate and that will work either

horizontally or vertically ok so those

are kind of similar and then you've got

two more functions

you've got df1.join(df2) so that's

certain types of joins and that's a data

frame method and then you've got pd.merge

another top-level function for

merging now here's my recommendation I

never use append I only use concat I

never use join I only use merge now why

is that

well append and concat have very

similar functionality except concat is

more flexible and similarly join and

merge have similar functionality but

merge is more flexible so my philosophy

is why bother learning all four if you

can just learn two of them and by

learning these concat and merge you can

handle all scenarios you might ever see

okay so there's no point in my mind in

using append and join just use concat

and merge okay and they are both

top-level functions meaning you say pd

dot something and then you pass all the

objects okay
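
As a rough illustration of the two recommended functions, here is a minimal sketch using tiny made-up DataFrames. The names df1, df2, df3 and their columns are hypothetical and are not the lesson's files. It shows pd.concat stacking vertically and horizontally, and pd.merge matching rows on a shared column. One related fact worth knowing: DataFrame.append was deprecated and then removed in pandas 2.0, so on current versions pd.concat is the way to stack rows.

```python
import pandas as pd

# Tiny made-up frames for illustration only (not the lesson's data).
df1 = pd.DataFrame({"key": ["a", "b"], "x": [1, 2]})
df2 = pd.DataFrame({"key": ["c", "d"], "x": [3, 4]})
df3 = pd.DataFrame({"key": ["a", "b"], "y": [10, 20]})

# pd.concat takes a list of objects; axis=0 stacks them vertically.
stacked = pd.concat([df1, df2], axis=0, ignore_index=True)

# axis=1 places the objects side by side, aligned on the index.
side_by_side = pd.concat([df1, df2], axis=1)

# pd.merge takes a left and a right DataFrame and matches rows on a key.
merged = pd.merge(df1, df3)

print(stacked.shape)       # (4, 2)
print(side_by_side.shape)  # (2, 4)
print(merged.shape)        # (2, 3)
```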

now I'm not gonna cover concat because I

do talk about concat in my pandas

video series so check that out and I can

share a link to the particular video but

I don't talk about merge in any of those

videos which is why I'm gonna go

in-depth on it here so we're gonna focus

on this last one called merge okay

so part two is about the actual merging

of data frames and when I say merge or I

say join I really mean the same thing so

I've got some datasets I'm gonna pull up

that I think will demonstrate this well

and you won't be able to follow along

right now because you won't have the

files I will post them afterwards but

you're not going to have them right now

so I've got two data frames that we are

eventually going to merge and it's

important to understand the structure so

that when you see the result you can

understand what happened and why the

first is called movies and don't worry

about the file reading code that's

not the important part here what we're

looking at is a data frame of movies

it's got two columns movie ID and title

the data frame has sixteen hundred and

eighty two rows and two columns okay so

this is the if you're not super familiar

with pandas this is the index this is

the movie ID column and this is the

title column so every movie ID relates

to one title kind of makes sense there

are sixteen hundred and eighty-two

movies and we're gonna select the movie

ID column and run the nunique

function or method rather series method

and it's gonna be 1682 which is my way

of showing to you that this is a unique

identifier all right I know this set up

is taking a little while but bear with

me the payoff will be

worth it so the other data frame is

called the ratings data frame it has

four columns the user ID movie ID rating

and timestamp in other words this user

gave a rating of three to this movie at

this time and that's a unix time so

that's why you can't read it really okay

so this user gave this rating to this

movie at this time okay there are a

hundred thousand rows so that means

there's a hundred thousand ratings and

we're gonna look at the nunique of

the movie ID column there are exactly

sixteen hundred and

eighty-two movie IDs in here just like

there are sixteen hundred and eighty-two

movie IDs here and they're the same

movie ID now the last thing I want to

show you before we do the merge is let's

run this right here and what I'm saying

I'm using the .loc accessor and I'm

saying what rows do I want I want the

rows in which ratings dot movie ID is

one and I want all columns and then I

want the head of that okay so what I'm

just showing you is I want you to see

what some ratings of movie ID 1

look like and you'll notice these are

the index values so the 25th row in the

ratings data frame is a rating of 4 by

this user ID of movie 1 so I just

wanted to show you this and this

understanding what you're looking at

here will come in handy a little later

on when we're understanding the results

of the merge
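
Since the lesson's files are not available here, the inspection steps just described can be sketched on small stand-in DataFrames. The rows below are made up for illustration, and the column names movie_id, title, user_id, rating and timestamp are an assumption about how the files are labelled. The sketch shows nunique confirming a unique identifier and .loc selecting the ratings of movie 1.

```python
import pandas as pd

# Made-up stand-ins for the lesson's movies and ratings files.
movies = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["Toy Story", "GoldenEye", "Four Rooms"],
})
ratings = pd.DataFrame({
    "user_id": [308, 287, 148, 308, 287],
    "movie_id": [1, 1, 2, 3, 3],
    "rating": [4, 5, 4, 3, 2],
    "timestamp": [1000, 2000, 3000, 4000, 5000],  # placeholder Unix times
})

# nunique counts distinct values; if it equals the row count,
# movie_id is a unique identifier in movies.
print(movies.shape)                # (3, 2)
print(movies.movie_id.nunique())   # 3

# The same movie IDs appear in ratings, but repeated across many rows.
print(ratings.shape)               # (5, 4)
print(ratings.movie_id.nunique())  # 3

# .loc[rows, columns]: rows where movie_id is 1, all columns, first few rows.
movie_1_ratings = ratings.loc[ratings.movie_id == 1, :].head()
print(movie_1_ratings)
```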

ok so now I've set up what is the movies

data frame and what is the ratings data

frame now if we're merging it what would

be the point of merging it is the first

question you'd want to answer and one

reason would be let's say you want to

study the ratings data but you want to

see what the movie name is because 242

like you don't care who this person is

and you probably don't have any

information on them but the movies we

want to know what is movie 242 and we

have that data it's just not in this

data frame so that's why we want to do

the join or the merge again same thing

so that's our motivation for the merge

now let's go ahead and do the merge but

I want to show you just a recap

of what columns there are and this is

key there are two columns in

movies there are four columns in ratings

what I'm gonna do is I'm gonna use pd.merge

and I'm gonna pass it movies

and ratings because you can see that

the format of the merge function is

what's the left data frame and what's

the right data frame when you're merging

okay so I'm gonna merge these two I'm

gonna save the result in movie_ratings

and then we're gonna

look at the columns before I show you

the result now when you do a merge it

will merge on all columns with the same

name

in other words it will match on movie ID

because they have the same name by

default okay so notice the resulting

columns in movie_ratings which is the

data frame that is the merge of these

two we see five columns movie ID title

user ID rating timestamp notice the

order you get both of the columns from

the first data frame in order movie ID

then title then you get user ID rating

and timestamp in that order it skips

movie ID because that's what you joined

on so it's not gonna list it twice in

the result okay now that I've shown you

which columns let's take a look at the

head and actually see the result
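
The default merge and its column order can be reproduced with a minimal sketch. The data below is a made-up stand-in for the lesson's files, and the column names are assumed. With no extra arguments, pd.merge matches on the column name the two frames share, which here is movie_id.

```python
import pandas as pd

# Made-up stand-ins for the lesson's movies and ratings files.
movies = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["Toy Story", "GoldenEye", "Four Rooms"],
})
ratings = pd.DataFrame({
    "user_id": [308, 287, 148, 308, 287],
    "movie_id": [1, 1, 2, 3, 3],
    "rating": [4, 5, 4, 3, 2],
    "timestamp": [1000, 2000, 3000, 4000, 5000],  # placeholder Unix times
})

# Left frame first, right frame second; the shared column is the key.
movie_ratings = pd.merge(movies, ratings)

# Left columns come first, then the right columns minus the key.
print(movie_ratings.columns.tolist())
# ['movie_id', 'title', 'user_id', 'rating', 'timestamp']

print(movie_ratings.head())
```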

okay so again movie ID title user ID

rating timestamp all right so what

happened first thing that pandas did in

the merge is it said okay the first

movie ID in the movies data frame is one

and the title is Toy Story so let's

put that right here then it's gonna look

for a movie ID in the ratings data frame

and every time it finds a match it's

gonna create a new row showing the user

ID rating and timestamp from that row

now this may look familiar because

you've seen it already up here user ID

308 287 148 280 66 did these ratings and

here they are again okay so essentially

it's trying to find every instance

of movie ID 1 in the ratings data frame

and it's building a row with the other

columns from ratings that match to it

okay now there are a hundred thousand

rows because that's how many ratings

there were in the ratings data frame it

doesn't just try to find one

match it tries to find every match in

the right data frame to the things in

the left

data frame okay so that is what has

happened in the merge so far okay and

this is just a narrative about all of

that that you are welcome to read later
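
The matching behaviour just described (every match in the right data frame, not only the first one) can be checked on a small example. This is a sketch on made-up stand-in data with assumed column names, not the lesson's actual files.

```python
import pandas as pd

# Made-up stand-ins for the lesson's movies and ratings files.
movies = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["Toy Story", "GoldenEye", "Four Rooms"],
})
ratings = pd.DataFrame({
    "user_id": [308, 287, 148, 308, 287],
    "movie_id": [1, 1, 2, 3, 3],
    "rating": [4, 5, 4, 3, 2],
    "timestamp": [1000, 2000, 3000, 4000, 5000],  # placeholder Unix times
})

movie_ratings = pd.merge(movies, ratings)

# Each movie row is repeated once per matching rating row,
# so the result has one row per rating.
print(len(movies), len(ratings), len(movie_ratings))  # 3 5 5

# Movie 1 has two ratings, so its title appears on two rows.
toy_story_rows = movie_ratings.loc[movie_ratings.movie_id == 1, ["title", "user_id"]]
print(toy_story_rows)

# Ratings per movie are the same before and after the merge.
before = ratings.movie_id.value_counts().sort_index()
after = movie_ratings.movie_id.value_counts().sort_index()
print(before.tolist(), after.tolist())  # [2, 1, 2] [2, 1, 2]
```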

just a quick recap of the shapes so

movies was sixteen hundred and eighty

two rows and two columns ratings was a

hundred thousand by four and movie_ratings

is a hundred thousand by five so

you added the two columns plus four

columns minus the one matching column

and you get five and there's a hundred

thousand rows because there was a match

in all a hundred thousand cases okay so

that's what happened in this basic

merge alright these are my what-if cases

first question what if the columns you

want to join on don't have the same name

okay so let's create that scenario we

can overwrite the columns attribute of

movies and let's just

instead of movie ID let's make it m_id

okay so those are the

movies columns and here are the ratings

columns if we try to do a merge it's

going to fail because there's no

matching column names okay so if you

want to merge this remember the data

hasn't changed if you want to do the

merge you use these two arguments left_on

and right_on in other words I want to

join on this column in the left data

frame and join on this column in the

right data frame and if we do that we'll

get the same exact result as before okay
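
Here is a minimal sketch of this first what-if, again on made-up stand-in data with assumed column names. It shows the default merge failing when no column names match, then succeeding with left_on and right_on. One detail to be aware of: the matched rows are the same as before, but because the key columns have different names, pandas keeps both of them in the result.

```python
import pandas as pd

# Made-up stand-ins for the lesson's movies and ratings files.
movies = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["Toy Story", "GoldenEye", "Four Rooms"],
})
ratings = pd.DataFrame({
    "user_id": [308, 287, 148, 308, 287],
    "movie_id": [1, 1, 2, 3, 3],
    "rating": [4, 5, 4, 3, 2],
    "timestamp": [1000, 2000, 3000, 4000, 5000],  # placeholder Unix times
})

# Overwrite the columns attribute so the frames share no column name.
movies.columns = ["m_id", "title"]

# With no common column names the default merge raises an error.
try:
    pd.merge(movies, ratings)
    default_merge_failed = False
except pd.errors.MergeError:
    default_merge_failed = True
print(default_merge_failed)  # True

# Name the key column on each side explicitly.
movie_ratings = pd.merge(movies, ratings, left_on="m_id", right_on="movie_id")

# Both key columns are kept because their names differ.
print(movie_ratings.columns.tolist())
# ['m_id', 'title', 'user_id', 'movie_id', 'rating', 'timestamp']
```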

so that was the first what if if the

column names don't match the second what

if what if you want to join on an index

and I'll get to joining on two indexes

but let's start with joining on one

index and this is a little confusing but

let's go ahead and set the index as m_id

okay and you'll see now the m_id

is the index okay if you want to do the

join in this case you say left_index

equals True right_on movie ID okay so

previously we said left_on this time

we're saying left_index equals True now

why isn't it just like left_on equals

index that would have been another

option for how they could have created

it but you know you might have a column

named index and whatever the point is if

the thing you're trying to join on in

one of the data frames is an index you

just say left_index equals True and then

in the right data frame you're matching

on a movie ID okay now there's a funny

thing that happens this looks exactly

the same as before except the index has

changed so this is a little

tricky to understand why this would

happen but if you really think through

it it kind of makes sense the index from

the right data frame got used as the

index of the result okay now why would

that be well we joined on the left index

with a column called movie ID so you

don't really need to keep this index in

the results because it already matches

movie ID so there's no point in using

this as the new index there's just no

point however you might as well rather

than using the default index of an

integer starting at zero

you might as well use the index from the

right data frame so even though you

might think that's like the reverse of

what should happen it uses the index in

the result from the data frame that you

did not match on the index okay it's a

bit of a complicated logic but hopefully

that was helpful and then let me get to

joining on two indexes then I will get

back to your question Todd so what if

you want to join on two indexes so let's

do that and now we've set the ratings

data frame to have movie ID as the index

okay now both of our data frames have an

index and we want to join on it in that

case we just say left_index equals True

and right_index equals True and we'll go

ahead and run that and in this case

the index that gets created is the

index from the left data frame okay so

this is the index it came from the left

data frame you might not know that you

don't have to use a unique index in

pandas which is why the index

is 1 1 1 1 1 1 I mean the index varies

throughout the data frame of course but

it does not have to be unique okay
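
Both index scenarios can be sketched on made-up stand-in data. The column names are assumed, and the ratings frame is given the arbitrary index labels 10 through 50 purely so that it is visible where the result's index comes from. The first merge joins the left index to a right column, and the second joins index to index.

```python
import pandas as pd

# Made-up stand-in for movies, with m_id set as the index.
movies = pd.DataFrame({
    "m_id": [1, 2, 3],
    "title": ["Toy Story", "GoldenEye", "Four Rooms"],
}).set_index("m_id")

# Made-up stand-in for ratings, with arbitrary index labels.
ratings = pd.DataFrame(
    {
        "user_id": [308, 287, 148, 308, 287],
        "movie_id": [1, 1, 2, 3, 3],
        "rating": [4, 5, 4, 3, 2],
    },
    index=[10, 20, 30, 40, 50],
)

# Join the left index to a column in the right frame.
one_index = pd.merge(movies, ratings, left_index=True, right_on="movie_id")

# The result carries the index labels of the right frame (10, 20, ...).
print(one_index.index.tolist())

# Join on both indexes by also moving movie_id into the index of ratings.
ratings_by_movie = ratings.set_index("movie_id")
two_index = pd.merge(movies, ratings_by_movie, left_index=True, right_index=True)

# The result index holds the movie IDs, repeated once per rating.
print(two_index.index.tolist())
print(two_index.index.is_unique)  # False
```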

part four is the four types of joins

what we've been doing is known as an

inner join I-N-N-E-R an inner join but

there are also other types of joins

supported by the pandas merge function

there's the outer join the left join and

the right join ok and the easiest way to

understand them is by looking at some

examples so I've got these two examples

a and B alright and then we're gonna do

those four types of joins and you'll see

exactly like the logic of those types of

joins ok

and then it'll make sense why you

need to know which type of join to use

okay so um we've got A which has two

columns color and num and B which has two

columns color and size

now you're gonna ignore the index that

doesn't matter and when we merge these

they're going to merge based on the

color column okay cuz that's the

matching column name okay so notice we

have green yellow red green yellow pink

one two three SML like small medium

large okay

so inner join and the names will make

sense once you know what they do the

inner join says only include a row in

the result if the thing you joined on is

present in both data frames okay so here

is the result now notice there's this

how argument and by default it's inner so

you'd get the same result if you just

didn't have this but I just want to be

explicit so how equals inner okay now

because red was not found in B and pink

was not found in A neither of those show

up in the results okay and so green

yellow one and two from A and then green

yellow S and M from B okay so that's our

result with an inner join outer join is

kind of the opposite

it gets included regardless all the

keys get included so we get green yellow

red and pink so we get green yellow red

from A and one two three from A green

yellow red one two three but what

happens with pink because pink has no

num in A so it gets filled with NaN

which is a missing

value as well you get green yellow pink

from B with sizes small medium and large

but because red is not in B its size gets

marked as missing okay so inner join

includes only the matching outer join

includes all okay and missing values

just get marked as missing left join you

keep all the keys from the left

data frame regardless of whether they

match so in the left join since pink was

not found in A you don't include that

row and you do have a NaN value for

size by red okay

right join you have green yellow pink it

says okay we're gonna include everything

that's in B so we include green yellow

and pink we include size small medium

large but pink doesn't have a num in A

so it gets marked as missing okay
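
The four join types can be tried out with the two small example frames described above. This is a minimal sketch that rebuilds A and B from the values mentioned in the lesson (the exact construction code is an assumption). The row order of the outer join can differ between pandas versions, so the comments describe which keys appear without relying on their order.

```python
import pandas as pd

# The two example frames: both have a color column to merge on.
A = pd.DataFrame({"color": ["green", "yellow", "red"], "num": [1, 2, 3]})
B = pd.DataFrame({"color": ["green", "yellow", "pink"], "size": ["S", "M", "L"]})

# inner (the default): only keys present in both frames.
inner = pd.merge(A, B, how="inner")
print(inner)   # green and yellow only

# outer: every key from either frame; gaps are filled with NaN.
outer = pd.merge(A, B, how="outer")
print(outer)   # green, yellow, red and pink

# left: every key from A; red has no size in B, so its size is NaN.
left = pd.merge(A, B, how="left")
print(left)    # green, yellow, red

# right: every key from B; pink has no num in A, so its num is NaN.
right = pd.merge(A, B, how="right")
print(right)   # green, yellow, pink
```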

those are the four types of joins you

know now most of what you need to know

about merges and joins and there's some

more functionality available within pd.merge

that you can read about as

needed hope this video was helpful to

you if you'd like to join my monthly

webcasts and ask your own question sign

up for my membership program at the $5

level by going to patreon.com slash data

school there's a link in the description

below or you can click the box on your

screen thank you so much for watching

and I'll see you again soon