Python Urllib
## Python 3 urllib
The Python urllib library is used to work with URLs and to fetch and process web page content.
This article mainly introduces Python3's urllib.
The urllib package contains the following modules:
* urllib.request - Opens and reads URLs.
* urllib.error - Contains exceptions raised by urllib.request.
* urllib.parse - Parses URLs.
* urllib.robotparser - Parses robots.txt files.
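As a rough illustration of the last two modules in the list, the sketch below splits a URL with urllib.parse and checks a path against robots.txt rules with urllib.robotparser. The URL and the rules are made-up examples, and the rules are passed in directly so that nothing is fetched over the network.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# Hypothetical URL, used only to show how it is split into components.
parts = urlparse("https://www.example.com/search?q=python#results")
print(parts.scheme)    # https
print(parts.netloc)    # www.example.com
print(parts.path)      # /search
print(parts.query)     # q=python
print(parts.fragment)  # results

# Hypothetical robots.txt rules, supplied as a list of lines instead of a download.
rules = RobotFileParser()
rules.parse([
    "User-agent: *",
    "Disallow: /private/",
])
blocked = rules.can_fetch("*", "https://www.example.com/private/data.html")
allowed = rules.can_fetch("*", "https://www.example.com/index.html")
print(blocked)  # False
print(allowed)  # True
```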
* * *
## urllib.request
urllib.request defines functions and classes for opening URLs, with support for authentication, redirection, cookies, and more.
It can be used to send requests in much the same way a browser does.
The urlopen function opens a URL. Its syntax is:
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
* **url**: URL address.
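* **data**: Additional data to send to the server, as a bytes object; the default is None. When data is given, an HTTP request is sent as POST instead of GET.
* **timeout**: Timeout in seconds for blocking operations such as the connection attempt.
* **cafile and capath**: cafile is a file containing CA certificates and capath is a directory of CA certificate files, used to verify HTTPS requests. Both have been deprecated since Python 3.6 in favor of context and were removed in Python 3.13.
* **cadefault**: Ignored. Deprecated, and removed in Python 3.13.
* **context**: An ssl.SSLContext instance that specifies the SSL/TLS settings.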
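The timeout and context parameters can be combined as in the following sketch. The helper name fetch is hypothetical, and the data: URL is only a stand-in (an assumption made so that the example runs without network access); in practice you would pass an http or https address.

```python
import ssl
from urllib.request import urlopen

def fetch(url, timeout=10.0):
    """Return the response body as bytes, giving up after `timeout` seconds."""
    # The default context verifies HTTPS certificates against the system CA store.
    context = ssl.create_default_context()
    with urlopen(url, timeout=timeout, context=context) as response:
        return response.read()

# Stand-in data: URL (assumption), so that no network connection is needed.
body = fetch("data:text/plain,hello%20urllib")
print(body)  # b'hello urllib'
```

Using the response in a with statement closes the connection automatically once the body has been read.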
The simplest call passes only the URL:
## Instance
from urllib.request import urlopen
myURL = urlopen("")
print(myURL.read())
The above code uses urlopen to open a URL, then calls the read() method to get the HTML source of the page as bytes.
read() with no argument reads the entire response body. To read only part of it, pass the maximum number of bytes to read:
## Instance
from urllib.request import urlopen
myURL = urlopen("")
print(myURL.read(300))
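Because read() returns bytes, the result usually has to be decoded before it can be handled as text. The sketch below takes the character set from the response headers and falls back to UTF-8, which is an assumption that will not suit every site. The data: URL is a stand-in so that the example runs offline.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Stand-in data: URL (assumption) carrying a small HTML snippet with a non-ASCII character.
SAMPLE_URL = "data:text/html;charset=utf-8," + quote("<p>caf\u00e9</p>")

with urlopen(SAMPLE_URL) as response:
    raw = response.read()  # bytes
    # Charset declared in the Content-Type header, or UTF-8 if none is given.
    charset = response.headers.get_content_charset() or "utf-8"

text = raw.decode(charset)  # str
print(type(raw).__name__, type(text).__name__)  # bytes str
print(text)
```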
In addition to read(), there are two other methods for reading the response:
* **readline()** - Reads one line of the response.
from urllib.request import urlopen
myURL = urlopen("")
print(myURL.readline())  # Read one line
* **readlines()** - Reads all remaining lines of the response and returns them as a list.
from urllib.request import urlopen
myURL = urlopen("")
lines = myURL.readlines()
for line in lines:
    print(line)
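The three reading methods share one position in the response, so each call continues where the previous one stopped. The sketch below shows this on a three-line body served from a data: URL, a stand-in chosen so that the example does not need a network connection.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Stand-in data: URL (assumption) whose body is three short lines of text.
SAMPLE_URL = "data:text/plain," + quote("first line\nsecond line\nthird line\n")

with urlopen(SAMPLE_URL) as response:
    head = response.read(5)             # at most 5 bytes
    rest_of_line = response.readline()  # the rest of the current line
    remaining = response.readlines()    # every remaining line, as a list

print(head)          # b'first'
print(rest_of_line)  # b' line\n'
print(remaining)     # [b'second line\n', b'third line\n']
```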
When crawling web pages, we often need to check whether a page can be accessed normally. The getcode() method returns the HTTP status code of the response: 200 means the request succeeded. For an error status such as 404 (page not found), urlopen raises urllib.error.HTTPError, and the status code is available as the exception's code attribute:
## Instance
import urllib.error
import urllib.request
myURL1 = urllib.request.urlopen("")
print(myURL1.getcode())  # 200
try:
    myURL2 = urllib.request.urlopen("")
except urllib.error.HTTPError as e:
    if e.code == 404:
        print(404)  # 404
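The same check can be wrapped in a small helper. This is only a sketch: the name fetch_status and its opener parameter are hypothetical, the latter added so that the function can be exercised without a real server. HTTPError is a subclass of URLError, so it has to be caught first.

```python
import urllib.error
import urllib.request

def fetch_status(url, opener=urllib.request.urlopen, timeout=10.0):
    """Return the HTTP status code for `url`, or None if no response was received."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.getcode()
    except urllib.error.HTTPError as e:
        # The server answered, but with an error status such as 404 or 500.
        return e.code
    except urllib.error.URLError:
        # No HTTP response at all: DNS failure, refused connection, and so on.
        return None
```

Calling fetch_status with the address of a page returns 200 when the page loads normally and 404 when the server reports that it does not exist.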
To save the fetched page locally, write the response bytes to a file opened in binary mode, using the file object's write() method:
## Instance
from urllib.request import urlopen
myURL = urlopen("")
content = myURL.read()  # Read webpage content
with open("tutorial_urllib_test.html", "wb") as f:
    f.write(content)
Executing the above code will generate a tutorial_urllib_test.html file locally, which contains the content of the webpage.
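For large responses, reading the whole body into memory before writing it can be avoided by copying in chunks. The following is a minimal sketch using shutil.copyfileobj; the helper name save_url is hypothetical, and the data: URL and temporary directory are stand-ins so that the example runs offline.

```python
import os
import shutil
import tempfile
from urllib.request import urlopen

def save_url(url, path, timeout=10.0):
    """Stream the response body at `url` into the file at `path`."""
    with urlopen(url, timeout=timeout) as response, open(path, "wb") as out:
        shutil.copyfileobj(response, out)  # copies chunk by chunk

# Stand-in data: URL and a temporary directory (assumptions for an offline demo).
target = os.path.join(tempfile.mkdtemp(), "tutorial_urllib_test.html")
save_url("data:text/html,%3Chtml%3Ehello%3C/html%3E", target)

with open(target, "rb") as f:
    saved = f.read()
print(saved)  # b'<html>hello</html>'
```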
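URL encoding and decoding are done with the **urllib.parse.quote()** and **urllib.parse.unquote()** functions. They can also be reached as urllib.request.quote() and urllib.request.unquote(), but only because urllib.request imports them internally; urllib.parse is where they are defined and documented:
## Instance
import urllib.parse
encode_url = urllib.parse.quote("")  # Encode
print(encode_url)
unencode_url = urllib.parse.unquote(encode_url)  # Decode
print(unencode_url)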
The output result is:
https%3A//www./
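By default quote() leaves "/" unescaped, which is why only the colon is encoded in the output above. The sketch below shows that behavior on a hypothetical URL, together with quote_plus() and urlencode(), which are the usual choices when building query strings.

```python
from urllib.parse import quote, quote_plus, unquote, urlencode

url = "https://www.example.com/"  # hypothetical URL

encoded = quote(url)             # "/" is in the default safe set
decoded = unquote(encoded)
fully_encoded = quote(url, safe="")  # escape "/" as well
print(encoded)        # https%3A//www.example.com/
print(decoded)        # https://www.example.com/
print(fully_encoded)  # https%3A%2F%2Fwww.example.com%2F

keyword = "Python Tutorial"
spaces_as_percent = quote(keyword)    # space becomes %20
spaces_as_plus = quote_plus(keyword)  # space becomes +
query = urlencode({"q": keyword, "page": 2})
print(spaces_as_percent)  # Python%20Tutorial
print(spaces_as_plus)     # Python+Tutorial
print(query)              # q=Python+Tutorial&page=2
```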
### Simulate Header Information
When crawling web pages, we often need to send custom request headers, for example a browser-like User-Agent. This is done with the urllib.request.Request class:
class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)
* **url**: URL address.
* **data**: Additional data to send to the server, as a bytes object; the default is None.
* **headers**: HTTP request headers, given as a dictionary.
* **origin_req_host**: The host name or IP address of the original request, used when handling cookies.
* **unverifiable**: Rarely used. Indicates whether the request is unverifiable, meaning the user had no option to approve it (for example, an image fetched automatically while rendering a page); the default is False.
* **method**: Request method, such as GET, POST, DELETE, PUT, etc.
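To see how these parameters fit together, the sketch below builds two Request objects and inspects them without sending anything. The URLs and the User-Agent string are made-up values. Note that Request stores header names in capitalized form, so the header is looked up as "User-agent".

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical header value and URLs, for illustration only.
headers = {"User-Agent": "Mozilla/5.0 (compatible; example-client/1.0)"}

get_req = Request("https://www.example.com/search?q=python",
                  headers=headers, method="GET")
print(get_req.get_method())              # GET
print(get_req.get_header("User-agent"))  # Mozilla/5.0 (compatible; example-client/1.0)

# With data and no explicit method, the request defaults to POST.
form = urlencode({"q": "Python Tutorial"}).encode("ascii")
post_req = Request("https://www.example.com/search", data=form, headers=headers)
print(post_req.get_method())  # POST
print(post_req.data)          # b'q=Python+Tutorial'
```

Passing either object to urllib.request.urlopen() would send the request with these headers.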
## Instance - py3_urllib_test.py file code
import urllib.request
import urllib.parse
url = ''  # search page
keyword = 'Python Tutorial'
key_code = urllib.parse.quote(keyword)  # Encode the search keyword
url_all = url + key_code
header = {
    'User-Agent': 'Mozilla/5.0 (X11; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}  # Header information