There are more pages on the web than people on Earth.
And while I haven’t checked, I am sure each one is full of original, high-quality content
that would make our ancestors proud.
Most people access web pages through a browser, but as programmers we have other methods...
Today, we will learn how to use Python to send GET requests to web servers, and then
parse the response.
This way you can write software to read websites for you,
giving you more time to browse the internet.
In a browser, you access a web page by typing the URL in the address bar.
URL stands for “Uniform Resource Locator” and this string can hold a LOT of information.
At the beginning is the protocol, which is sometimes called the scheme.
Next is the host name.
Sometimes you will see a colon followed by a number.
That number is the port.
If the port is not explicitly specified, you can determine it from the protocol.
HTTP uses port 80, while HTTPS uses port 443.
After the host name comes the path.
The text after the question mark is called the “query string”.
It holds a collection of key-value pairs separated by ampersands.
And lastly, you may see a hash symbol (#) at the end followed by a string.
This value is called a fragment and is used to jump to a section within the webpage.
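To see these pieces for yourself, here is a small sketch that splits a URL using the parse module from the standard library. The example.com addresses are made up for illustration, and the DEFAULT_PORTS dictionary and effective_port function are our own helpers, not part of urllib.

```python
from urllib import parse

# Hypothetical URL, invented for illustration only.
url = "https://www.example.com:8080/videos/watch.html?v=abc123&t=90s#comments"

parts = parse.urlsplit(url)
print(parts.scheme)    # protocol, also called the scheme
print(parts.hostname)  # host name
print(parts.port)      # port as an int, or None if not specified
print(parts.path)      # path
print(parts.query)     # query string, without the question mark
print(parts.fragment)  # fragment, without the hash symbol

# The query string holds key-value pairs separated by ampersands.
params = parse.parse_qs(parts.query)
print(params)

# Our own lookup table (not part of urllib) for the default ports.
DEFAULT_PORTS = {"http": 80, "https": 443}

def effective_port(some_url):
    """Return the explicit port, or the default port for the protocol."""
    pieces = parse.urlsplit(some_url)
    return pieces.port or DEFAULT_PORTS[pieces.scheme]

print(effective_port(url))
print(effective_port("http://www.example.com/index.html"))
```

Note that parse_qs returns a list for each key, since a key is allowed to appear more than once in a query string.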
Python 3 comes equipped with a package that simplifies the task of building, loading and
parsing URLs: the “urllib” package.
This package contains five modules: request, response, error, parse and robotparser.
The request module is used to open URLs.
The response module is used internally by the request module, so you will not work with this directly.
The error module contains several error classes for use by the request module.
The parse module has a variety of functions for breaking up a URL into meaningful pieces,
like the scheme, host, port, and query string data.
And finally there is robotparser.
An exciting name, for a less than exciting module...
It is used to inspect robots.txt files to see what permissions are granted to
bots and crawlers.
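If you are curious what that looks like, here is a brief sketch. The robots.txt rules, the bot name "MyBot" and the example.com URLs are all invented for illustration. We feed the rules in as a list of lines, so nothing is downloaded.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content; a real site would serve this at /robots.txt.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(robots_lines)

# Ask whether a (hypothetical) bot may fetch each URL.
can_read_home = parser.can_fetch("MyBot", "https://www.example.com/index.html")
can_read_private = parser.can_fetch("MyBot", "https://www.example.com/private/data.html")

print(can_read_home)
print(can_read_private)
```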
Today we will focus on the request module, since this is where the action lies.
To begin, import urllib.
Now use the “dir” function (short for “directory”) to see what is available.
You may notice that the modules we just described are not listed.
This is because urllib is a package holding the modules that do the actual work.
So instead, import the module inside urllib that you want to use.
We want to use the “request” module.
If you call the “dir” function on the request module, you will see a lot of classes and functions.
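Here is a rough sketch of those steps. What you see for the bare package depends on what your Python session has already loaded, so treat the first list as something to inspect rather than a fixed result. The variable names are our own.

```python
import urllib

# Public names visible on the package before we import a submodule ourselves.
# This list is often empty or short, depending on the environment.
before = [name for name in dir(urllib) if not name.startswith("_")]
print(before)

# Import the module that does the actual work.
import urllib.request

after = [name for name in dir(urllib) if not name.startswith("_")]
print(after)

# The request module exposes many classes and functions.
request_names = [name for name in dir(urllib.request) if not name.startswith("_")]
print(len(request_names))
print("urlopen" in request_names)
```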
The function which enables you to easily open a specific URL is the “urlopen” function.
Just as the “open” function is used to open files,
“urlopen” is used to open URLs.
As an example, let us open the home page for Wikipedia.
The function returns a “response” object.
If you look at the type, you will see it is NOT the response in the urllib package, but
an “HTTPResponse” from the “http.client” module.
To see what you can do with the response, use the “dir” function.
First, let us check if the request was successful by looking at the response code.
The code is 200, and this is actually good news.
A 200 response code means everything went OK.
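The steps so far can be sketched as follows. So that the sketch runs without an internet connection, it starts a tiny web server on your own machine and opens that instead of Wikipedia. The DemoHandler class, the PAGE contents and the proxy-skipping line are our own assumptions for the demo. With a real site, you would simply pass its URL to urlopen.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import request

# Stand-in page served locally; it is NOT the Wikipedia home page.
PAGE = b"<html><body><h1>Hello from the demo server</h1></body></html>"

class DemoHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet

# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Skip any configured proxy so the request goes straight to the local server.
request.install_opener(request.build_opener(request.ProxyHandler({})))

url = f"http://127.0.0.1:{server.server_port}/"
response = request.urlopen(url)

response_type = type(response)
status = response.status  # the response code

print(response_type)
print(status)

response.close()
server.shutdown()
server.server_close()
```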
You may ask why the number 200 was chosen.
I may ask the same thing...
Next, let us see how large the response is.
This is the size of the response in BYTES.
We can use the “peek” method to look at a small part of the response, rather than
the full value.
This most definitely looks like HTML, but notice that this is not a string.
The “b” at the beginning tells us this is a “bytes object”.
The reason for this is that web servers can host binary data
in addition to plain HTML files.
Let us now read the entire response.
If you look at the type, it is indeed a bytes object.
And it is the correct size...
We can convert this to text by decoding it.
If you look at the peek value, the character set in the response is “UTF-8”.
So to decode this bytes object, call the “decode” method and specify the encoding that was used.
We now have a string…
And if you display the value, you can see all the HTML for the web page.
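Here is a sketch of the size, peek, read and decode steps together. As before, it uses a small local server as a stand-in for Wikipedia so it can run offline. The DemoHandler class and the PAGE contents are assumptions made for the demo. Also note that “length” is only filled in when the server reports the size up front, which our demo server does.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import request

# Stand-in page served locally, encoded as UTF-8 bytes.
PAGE = (
    '<!DOCTYPE html><html><head><meta charset="UTF-8">'
    "<title>Demo</title></head><body><p>Caf\u00e9</p></body></html>"
).encode("utf-8")

class DemoHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet

server = HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Skip any configured proxy so the request goes straight to the local server.
request.install_opener(request.build_opener(request.ProxyHandler({})))

response = request.urlopen(f"http://127.0.0.1:{server.server_port}/")

size = response.length       # size of the response in bytes
preview = response.peek(50)  # look ahead; may return more or fewer than 50 bytes
page_bytes = response.read() # the entire response, as a bytes object
html = page_bytes.decode("utf-8")  # convert the bytes to a string

print(size)
print(preview)
print(type(page_bytes), len(page_bytes))
print(type(html))
print(html)

server.shutdown()
server.server_close()
```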
By the way, look what happens if you try to read the response a second time.
You get an empty bytes object.
This is because once you read the response, Python closes the connection.
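You can confirm this with a short sketch. Once more, a local demo server stands in for the real website, and the DemoHandler class and PAGE contents are our own assumptions.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import request

PAGE = b"<html><body>Read me once</body></html>"  # stand-in page

class DemoHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet

server = HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Skip any configured proxy so the request goes straight to the local server.
request.install_opener(request.build_opener(request.ProxyHandler({})))

response = request.urlopen(f"http://127.0.0.1:{server.server_port}/")

first_read = response.read()          # the full page
closed_after_read = response.isclosed()
second_read = response.read()         # nothing left to read

print(first_read)
print(closed_after_read)
print(second_read)

server.shutdown()
server.server_close()
```

So if you need the data more than once, store the result of the first read in a variable.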
As a second example, let us send a search request to Google.
This time, the request fails with a 403 response code.
Earlier we said that a 200 response code meant everything was OK.
So things are definitely not OK.
A 403 response code means that while our request was valid, the server is refusing to respond.
I can understand their reaction.
If they let anyone scrape their search results without restriction, then competitors would
use this information to their advantage.
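When a server refuses like this, urlopen raises an HTTPError from the error module, and you can catch it to inspect the code. The sketch below does not contact Google. Instead, it uses a local demo server that we have set up to always answer 403. The RefusingHandler class and the search URL are hypothetical.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import error, request

class RefusingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Our stand-in server refuses every request.
        self.send_error(403)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet

server = HTTPServer(("127.0.0.1", 0), RefusingHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Skip any configured proxy so the request goes straight to the local server.
request.install_opener(request.build_opener(request.ProxyHandler({})))

url = f"http://127.0.0.1:{server.server_port}/search?q=socratica"

caught_code = None
try:
    request.urlopen(url)
except error.HTTPError as exc:
    caught_code = exc.code  # the response code sent by the server
    print("Request refused:", exc.code, exc.reason)
    exc.close()

server.shutdown()
server.server_close()
```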
Let us try a different example...
We will now load the YouTube page for this incredible video on Black Holes.
Here is the URL.
Notice that this URL contains two parameters in the query string: “v” and “t”.
“v” is the video ID, and “t” is the time in the video to begin playback.
One way to construct the query string is to append a lot of strings together.
But there is an easier way.
To see this, import the “parse” module.
Looking at the directory, you can see a large collection of functions for working with URLs.
Here, we will use the “urlencode” function.
First, we create a dictionary containing the query string parameters.
Next, call the “urlencode” function.
The result is a string that is suitable for use as the query string.
Notice, however, that the question mark is NOT included.
We can now build the URL.
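These steps can be sketched as follows. The values "VIDEO_ID" and "90s" are placeholders we made up, so substitute the real video ID and start time from your own URL.

```python
from urllib import parse

# Placeholder values, invented for illustration only.
params = {"v": "VIDEO_ID", "t": "90s"}

# Build the query string; the question mark is NOT included.
querystring = parse.urlencode(params)
print(querystring)

# Join the base address, a question mark, and the query string.
url = "https://www.youtube.com/watch" + "?" + querystring
print(url)

# urlencode also escapes characters that are not safe in a URL.
escaped = parse.urlencode({"q": "black holes"})
print(escaped)
```

This is one reason to prefer urlencode over appending strings by hand: spaces and other special characters are escaped for you.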
Next, open the URL using the “urlopen” function.
If you call the “isclosed” method,
you can see we still have a connection with the server.
The response code is 200, so our request was fulfilled.
We can then read and decode the server response in a single line.
Looking at the first 500 characters of the HTML, we see everything looks to be in order.
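Putting the whole example together, a sketch might look like this. To keep it runnable offline, a local demo server stands in for YouTube. The DemoHandler class, the PAGE contents and the parameter values are assumptions for illustration only.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import parse, request

# Stand-in page served locally; it is NOT the real video page.
PAGE = b"<!DOCTYPE html><html><body><h1>Demo video page</h1></body></html>"

class DemoHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet

server = HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Skip any configured proxy so the request goes straight to the local server.
request.install_opener(request.build_opener(request.ProxyHandler({})))

# Build the URL from a dictionary of placeholder parameters.
querystring = parse.urlencode({"v": "VIDEO_ID", "t": "90s"})
url = f"http://127.0.0.1:{server.server_port}/watch?{querystring}"

response = request.urlopen(url)
closed_before_read = response.isclosed()  # still connected to the server
status = response.status

# Read and decode the server response in a single line.
html = response.read().decode("utf-8")

print(closed_before_read)
print(status)
print(html[:500])  # first 500 characters

server.shutdown()
server.server_close()
```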
You have now taken your first step towards bypassing the browser, and interacting with
web servers programmatically.
But there is much more to learn.
What if you want to send a POST or PUT request?
How do you include cookies in your request?
What if authentication is required?
And what if you aren’t subscribed to Socratica?
Why don’t we make videos more quickly?
You will soon learn how to solve all of these problems…