
Python 3 Programming Tutorial - Parsing Websites with re and urllib




Hello everybody, and welcome to another Python 3 tutorial video. What we're going to be doing in this video is combining two of our standard library modules and using them to parse a website, so we're going to be using urllib and re for regular expressions. With that, let's go ahead and get started.

So we need to import urllib.request, and we're also going to import urllib.parse. Finally, we're going to import re for regular expressions, and we're ready to go.
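The three imports just described can be sketched like this; everything here is standard library, nothing third-party is needed:

```python
import urllib.request  # build and open HTTP requests
import urllib.parse    # URL-encode the POST values
import re              # regular expressions for pulling data out of the HTML
```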

So first we're going to define the URL that we want to visit; let's go ahead and visit http://pythonprogramming.net. Next we're going to set up our values: we're going to do a search for 'basics' on that site, so our values dictionary will be {'s': 'basics', 'submit': 'search'}.

Okay, so those are the values that we're going to pass in with a POST. Now we're going to say data = urllib.parse.urlencode(values), which encodes the values, and then data = data.encode('utf-8') to turn that into bytes. And then we're

going to say req, for request, equals urllib.request.Request (capital R), and we make the request of our URL, passing in the data. Then the response will be urllib.request.urlopen(req). Just as a quick aside, I'm not really explaining any of this stuff as we type it, because we've already covered it in the urllib tutorial. So if you feel lost here, if you're starting on this video or something and you don't know what I'm doing, check out the urllib tutorials for Python 3.
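The request-building steps so far, sketched in one place. Building the Request is all offline; the urlopen line is the one that actually hits the network, so it's left commented out here. The 's' and 'submit' form fields are the ones used in the video for pythonprogramming.net; other sites will expect different fields:

```python
import urllib.request
import urllib.parse

url = 'http://pythonprogramming.net'

# The site's search form fields, as used in the video.
values = {'s': 'basics', 'submit': 'search'}

# urlencode turns the dict into a query string...
data = urllib.parse.urlencode(values)
print(data)                 # s=basics&submit=search

# ...and a POST body has to be bytes, so encode it.
data = data.encode('utf-8')

# A Request carrying a data payload is sent as a POST.
req = urllib.request.Request(url, data)
print(req.get_method())     # POST

# resp = urllib.request.urlopen(req)   # the actual network call
# respData = resp.read()               # response body, as bytes
```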

Now respData, for response data, equals resp.read(), and then what we want to do is use regular expressions. If you remember from last time what this printed out, we could say print(respData), so let's save and run that and make sure it works first.

So we got all this junk, right? But within this junk there's some normal content. I don't really see anything here so far, but here we go: we've got some math basics, blah blah blah, and here's a closing paragraph tag, finally. I'm not seeing the opening paragraph tag, but you'll have to take my word for it: there's an opening paragraph tag somewhere. Anyway, the idea is that we want to

parse out the paragraph data. Generally in HTML, people use header tags like h1 or h2 for the big titles, maybe h3 or h4 for subtitles, and then the content is almost always inside a paragraph tag. Your typical structure is: an opening <p> tag, then content, content, content, and then a closing </p> tag. Now, depending on the site, sometimes this might be slightly different, the tag might have attributes in it or something, but usually it's a paragraph tag of some sort. Anyway,

moving on: that's what we want, so we want to use a regular expression that says "hey, I want to parse out everything between paragraph tags." So what we're going to say is paragraphs = re.findall(), and the regular expression we want is going to be the paragraph tags. Like I said, if you want to search for specific text, you can just type in that specific text, and since so far we're not using any special characters, we don't need to escape anything.

Then the parentheses (I blanked on the name there: they mark a group) are where the pattern that we want goes, so we'll put the closing paragraph tag after them. Basically we're saying: the opening paragraph tag will be here, the closing paragraph tag will be here, and in between, inside these parentheses, goes whatever data we actually want to search for, the data we want to output. Because so far we're saying "find something between paragraph tags," but we haven't specified what that something might be.

Now, my favorite little combination to look for just anything is period asterisk question mark: .*?. That basically means "find me everything between paragraph tags." If we look back at what these characters actually mean: the period matches any character except a newline; the asterisk means zero or more repetitions of whatever comes before it (the period here), so zero is okay and anything more than that is fine too; and the question mark after the asterisk makes the match non-greedy, so it matches as little as possible, stops at the first closing paragraph tag it finds rather than running on to the last one, and then the search continues from there. When I'm parsing websites it's not always between paragraph tags, it might be between something else, but the combination of regular expressions I use is always .*?, just because it works so well.
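Here is the non-greedy pattern on a small made-up HTML snippet, to show why the question mark matters:

```python
import re

html = "<p>first</p> junk <p>second</p>"

# Non-greedy: stops at the first closing tag, so each paragraph is separate.
print(re.findall(r'<p>(.*?)</p>', html))   # ['first', 'second']

# Greedy: .* runs to the LAST closing tag, swallowing everything in between.
print(re.findall(r'<p>(.*)</p>', html))    # ['first</p> junk <p>second']
```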

What we'll do now is say for eachP in paragraphs: print(eachP). Okay, so let's save and run that and see what we get; it's going to take a second to pop up. There it is. Oh man, I think we're printing source code there, hold on. Yeah, we printed the response data; let's go ahead and comment that print out, we don't want to do that anymore. So let's try that one more time.

Okay, so paragraphs = re.findall() with our regular expression, but what are we looking in? We're looking for it in respData, and we need to convert that to a string, because it's bytes, not a string. Anyway, try it one more time, and now we get this, right? So this is all of our paragraph data.
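The bytes-to-string fix just mentioned, sketched on canned response bytes (in the real script, respData comes from resp.read()):

```python
import re

# Stand-in for resp.read(): the response body arrives as bytes, not str.
respData = b"<html><body><p>Some math basics.</p><p>More content.</p></body></html>"

# findall with a str pattern won't accept bytes, so convert first.
paragraphs = re.findall(r'<p>(.*?)</p>', str(respData))

for eachP in paragraphs:
    print(eachP)
# Some math basics.
# More content.
```

Note that str() on bytes keeps the b'...' wrapper in the text; respData.decode('utf-8') is the cleaner conversion, but str() is what's used in the video and the paragraphs come out the same either way.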

Okay, so that's going to conclude the tutorial on using urllib and regular expressions to parse a website. If you guys have any questions or comments on this, feel free to leave them below. As always, thanks for watching, thanks for all the support and subscriptions, and until next time.
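Putting the whole video together, here is the script as one piece. The 's' and 'submit' form fields match the search values used in the video; the fetch_paragraphs name is just a wrapper added here for readability:

```python
import urllib.request
import urllib.parse
import re

def fetch_paragraphs(url, values):
    """POST the given form values to url and return the text between <p> tags."""
    data = urllib.parse.urlencode(values).encode('utf-8')
    req = urllib.request.Request(url, data)       # data payload makes this a POST
    resp = urllib.request.urlopen(req)            # the actual network call
    respData = resp.read()                        # response body, as bytes
    # findall with a str pattern needs a str to search in, hence str().
    return re.findall(r'<p>(.*?)</p>', str(respData))

if __name__ == '__main__':
    url = 'http://pythonprogramming.net'
    values = {'s': 'basics', 'submit': 'search'}
    for eachP in fetch_paragraphs(url, values):
        print(eachP)
```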