
Python Urllib

## Python urllib

The Python urllib library is used to work with URLs and to fetch and process web page content.

This article covers urllib in Python 3.

The urllib package contains the following modules:

* urllib.request - Opens and reads URLs.
* urllib.error - Contains the exceptions raised by urllib.request.
* urllib.parse - Parses URLs.
* urllib.robotparser - Parses robots.txt files.

* * *

## urllib.request

urllib.request defines functions and classes for opening URLs, including support for authentication, redirection, and cookies.

urllib.request can simulate the way a browser issues a request.

Use the urlopen function of urllib.request to open a URL. Its syntax is:

    urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

* **url**: The URL to open, given as a string or a Request object.
* **data**: Data to send to the server, as bytes, a file-like object, or an iterable of bytes-like objects; default is None. For HTTP, supplying data turns the request into a POST.
* **timeout**: Timeout in seconds for blocking operations such as the connection attempt.
* **cafile and capath**: cafile is a CA certificate file and capath is a directory of CA certificates, used for HTTPS. Both have been deprecated since Python 3.6 and were removed in Python 3.13; use context instead.
* **cadefault**: Deprecated and ignored; removed in Python 3.13.
* **context**: An ssl.SSLContext instance that specifies the SSL settings.

An example:

## Example

    from urllib.request import urlopen

    myURL = urlopen("")
    print(myURL.read())

The code above opens a URL with urlopen and then calls read() to get the HTML source of the page as bytes.

By default read() reads the entire response; pass a length to limit how many bytes are read:

## Example

    from urllib.request import urlopen

    myURL = urlopen("")
    print(myURL.read(300))

Besides read(), two other methods read the response content:

* **readline()** - Reads one line of the response.

      from urllib.request import urlopen

      myURL = urlopen("")
      print(myURL.readline())  # Read one line

* **readlines()** - Reads the whole response and returns its lines as a list.

      from urllib.request import urlopen

      myURL = urlopen("")
      lines = myURL.readlines()
      for line in lines:
          print(line)

When crawling web pages, we often need to check whether a page can be accessed normally. The getcode() method returns the HTTP status code of the response (since Python 3.9 the status attribute is preferred). A code of 200 means the request succeeded and 404 means the page does not exist. For a 404, urlopen raises urllib.error.HTTPError, so the code is read from the exception instead:

## Example

    import urllib.error
    import urllib.request

    myURL1 = urllib.request.urlopen("")
    print(myURL1.getcode())  # 200

    try:
        myURL2 = urllib.request.urlopen("")
    except urllib.error.HTTPError as e:
        if e.code == 404:
            print(404)  # 404

To save the fetched page locally, use the Python3 File write() method:

## Example

    from urllib.request import urlopen

    myURL = urlopen("")
    content = myURL.read()  # Read webpage content

    # Open in binary mode because read() returns bytes
    with open("tutorial_urllib_test.html", "wb") as f:
        f.write(content)

Executing the above code creates a tutorial_urllib_test.html file locally, which contains the content of the page.

URL encoding and decoding use the **urllib.parse.quote()** and **urllib.parse.unquote()** functions. They are defined in urllib.parse; the names are also reachable as urllib.request.quote() and urllib.request.unquote() only because urllib.request imports them internally.

## Example

    import urllib.parse

    encode_url = urllib.parse.quote("")  # Encode
    print(encode_url)

    unencode_url = urllib.parse.unquote(encode_url)  # Decode
    print(unencode_url)

The output result is:

    https%3A//www./

### Simulate Header Information

When crawling web pages, we generally need to send browser-like headers (HTTP request header information), which requires the urllib.request.Request class:

    class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

* **url**: The URL of the request.
* **data**: Data to send to the server; default is None.
* **headers**: HTTP request headers, given as a dictionary.
* **origin_req_host**: The request-host of the origin transaction (an IP address or domain name), as defined by RFC 2965. It defaults to the host of url.
* **unverifiable**: Rarely used. Indicates whether the request is unverifiable as defined by RFC 2965, that is, a request whose URL the user had no option to approve. Default is False.
* **method**: The HTTP request method, such as GET, POST, DELETE, or PUT.

## Example - py3_urllib_test.py file code

    import urllib.request
    import urllib.parse

    url = ''  # search page
    keyword = 'Python Tutorial'
    key_code = urllib.parse.quote(keyword)  # Encode the request
    url_all = url + key_code
    header = {
        'User-Agent': 'Mozilla/5.0 (X11; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
    }  # Header information
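
The example above stops once the header dictionary is defined. As a minimal sketch of how such headers might be attached to a request, the code below builds a Request object and inspects it without sending anything over the network. The search endpoint `https://example.com/search?q=` and the shortened User-Agent string are placeholders chosen for illustration, not values from the original example.

```python
import urllib.parse
import urllib.request

# Hypothetical search endpoint (assumption; the original leaves the URL empty).
base_url = "https://example.com/search?q="
keyword = "Python Tutorial"

# Percent-encode the keyword so it is safe to place inside a URL.
encoded_keyword = urllib.parse.quote(keyword)
full_url = base_url + encoded_keyword

# Placeholder User-Agent (assumption); any browser-like string could be used.
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}

# Constructing a Request only stores the URL, headers, and method.
request = urllib.request.Request(full_url, headers=headers, method="GET")

print(request.full_url)
print(request.get_method())
# Request stores header names capitalized, so the lookup key is "User-agent".
print(request.get_header("User-agent"))
```

Passing the resulting object to urllib.request.urlopen() in place of a plain URL string would send the request with these headers.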