Requests

Requests basics

Requests is a Python module for communicating with web servers over HTTP. We can retrieve a web page by passing its address to the requests.get() function:

[1]:
import requests

url = "https://en.wikipedia.org/wiki/Buffalo,_New_York"
page = requests.get(url)

The get() function returns a response object that holds several kinds of data sent back by the web server. For example, we can check the status code of the response. The code 200 means that everything went fine:

[2]:
page.status_code
[2]:
200

Code 404 means that the requested page has not been found:

[3]:
bad_url = "https://en.wikipedia.org/THERE_IS_NO_SUCH_PAGE"
bad_page = requests.get(bad_url)
bad_page.status_code
[3]:
404
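
The two codes above are only a small part of the full list. As a minimal sketch, Python's standard http module can translate a numeric code into its standard phrase. The helper name describe_status is made up for this illustration and is not part of requests:

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short human-readable description of an HTTP status code."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # The code is not one of the standard codes known to the http module
        return f"{code} (unknown status code)"
    return f"{code} {status.phrase}"


for code in (200, 404):
    print(describe_status(code))
```

With a response object this could be called as describe_status(page.status_code). The response object also has a raise_for_status() method, which raises an exception for 4xx and 5xx codes instead of leaving the check to us.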

The text of the web page can be retrieved using the text property:

[5]:
page_text = page.text
print(page_text[203000:204000])  # print a fragment of the wikipedia page text
ape" title="Rape">Rape</a></th><td class="infobox-data">121</td></tr><tr><th scope="row" class="info
box-label"><a href="/wiki/Robbery" title="Robbery">Robbery</a></th><td class="infobox-data">802</td>
</tr><tr><th scope="row" class="infobox-label"><a href="/wiki/Aggravated_assault" class="mw-redirect
" title="Aggravated assault">Aggravated assault</a></th><td class="infobox-data">1,563</td></tr><tr>
<th scope="row" class="infobox-label"><a href="/wiki/Violent_crime" title="Violent crime">Total viol
ent crime</a></th><td class="infobox-data">2,533 <img alt="Positive decrease" src="//upload.wikimedi
a.org/wikipedia/commons/thumb/9/92/Decrease_Positive.svg/11px-Decrease_Positive.svg.png" decoding="a
sync" title="Positive decrease" width="11" height="11" srcset="//upload.wikimedia.org/wikipedia/comm
ons/thumb/9/92/Decrease_Positive.svg/17px-Decrease_Positive.svg.png 1.5x, //upload.wikimedia.org/wik
ipedia/commons/thumb/9/92/Decrease_Positive.svg/22px-Decrease_Positive.svg.png 2x" data-file-width="
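
The slice indices used above depend on the current version of the page, so they will drift as the article is edited. A possible alternative, sketched here under the assumption that we know a piece of text to look for, is to search the string and print what surrounds the match. The helper snippet_around and the sample string are hypothetical and serve only this illustration:

```python
def snippet_around(text, needle, width=40):
    """Return the part of text surrounding the first occurrence of needle.

    Returns None if needle does not occur in text.
    """
    pos = text.find(needle)
    if pos == -1:
        return None
    start = max(0, pos - width)
    end = pos + len(needle) + width
    return text[start:end]


# A short stand-in for page.text, so that the example runs without a connection
sample_html = (
    '<tr><th scope="row"><a href="/wiki/Robbery" title="Robbery">Robbery</a>'
    '</th><td class="infobox-data">802</td></tr>'
)
print(snippet_around(sample_html, "Robbery</a>", width=10))
```

With a downloaded page the same call would be snippet_around(page.text, "Robbery</a>").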

Binary files

Some web addresses point not to web pages but to other types of data: images, videos, PDF files, etc. We can retrieve such data using requests too. For example, here is the URL of a JPG file posted on Wikimedia Commons:

[7]:
photo_url = "https://upload.wikimedia.org/wikipedia/commons/9/93/Northern_Cardinal_Male-27527-3.jpg"

The code below downloads the photo and saves it to a JPG file:

[14]:
p = requests.get(photo_url)
photo = p.content  # get the binary content of the file

# save the photo to a file
with open("northern_cardinal.jpg", 'wb') as f:
    f.write(photo)

We can now display the saved file:

[15]:
from IPython.display import Image  # public import path for the Image class
Image("northern_cardinal.jpg", width=250)
[15]:
[Image: the saved file northern_cardinal.jpg]
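
Before saving a download it can be useful to derive a file name from the URL and to check that the server really sent an image. The sketch below uses only the standard library. The helper names filename_from_url and looks_like_image, and the default name download.bin, are assumptions made for this illustration:

```python
import posixpath
from urllib.parse import urlsplit


def filename_from_url(url, default="download.bin"):
    """Return the last path component of a URL, or default if there is none."""
    path = urlsplit(url).path
    name = posixpath.basename(path)
    return name or default


def looks_like_image(content_type):
    """Return True if a Content-Type header value describes an image."""
    # Drop optional parameters such as "; charset=UTF-8" before comparing
    media_type = content_type.split(";")[0].strip().lower()
    return media_type.startswith("image/")


photo_url = "https://upload.wikimedia.org/wikipedia/commons/9/93/Northern_Cardinal_Male-27527-3.jpg"
print(filename_from_url(photo_url))
print(looks_like_image("image/jpeg"))
print(looks_like_image("text/html; charset=UTF-8"))
```

For a response object p returned by requests.get(), the header value to test would be p.headers.get("Content-Type", "").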

User-Agent

Every request sent by a web browser to a web server includes a User-Agent header, i.e. a string with information about the type of browser, the computer on which the browser runs, etc. For example, the following User-Agent identifies the browser as Firefox 73 running on a Macintosh with the operating system Mac OS X 10.14:

Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:73.0) Gecko/20100101 Firefox/73.0

The web page httpbin.org/get simply displays the data sent to it by the browser, so you can visit it to check the User-Agent used by your web browser. Using requests to connect to this website, we can inspect the default User-Agent set by the requests module:

[16]:
page = requests.get("https://httpbin.org/get")
print(page.text)
{
  "args": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate, br",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.27.1",
    "X-Amzn-Trace-Id": "Root=1-622acd3b-2da793644ef419a2752bcf87"
  },
  "origin": "98.118.166.35",
  "url": "https://httpbin.org/get"
}

This default User-Agent is good enough in many cases, but some websites expect less generic information and may return an error message if these expectations are not met. For example, Wikimedia publishes a User-Agent policy that applies, e.g., to all Wikipedia pages. We can change the request’s User-Agent to any value as follows:

[17]:
# define the User-Agent string
user_agent = "MTH548_test_script/1.0"

# create a dictionary specifying the User-Agent
headers = {'User-Agent': user_agent}
# pass the dictionary to the get() function
page = requests.get("https://httpbin.org/get", headers=headers)
# check the data we sent to the website
print(page.text)
{
  "args": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate, br",
    "Host": "httpbin.org",
    "User-Agent": "MTH548_test_script/1.0",
    "X-Amzn-Trace-Id": "Root=1-622acd48-50e323124c02d82524984ea7"
  },
  "origin": "98.118.166.35",
  "url": "https://httpbin.org/get"
}
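
The reply printed above is JSON, so instead of reading it by eye we can decode it and pick out a single field. The sketch below works on a shortened copy of such a reply, pasted in as a string so that it runs without a connection:

```python
import json

# A shortened copy of a reply of the kind shown above
reply_text = """{
  "args": {},
  "headers": {
    "Accept": "*/*",
    "Host": "httpbin.org",
    "User-Agent": "MTH548_test_script/1.0"
  },
  "url": "https://httpbin.org/get"
}"""

# Decode the JSON text into nested Python dictionaries
reply = json.loads(reply_text)
print(reply["headers"]["User-Agent"])
```

With a live response the decoding step can be done by calling page.json(), which returns the same kind of nested dictionary.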

Sending data

In some cases we need to specify more data, besides the address of a web page, to get the information we need. For example, to make use of the On-Line Encyclopedia of Integer Sequences we need to provide a sequence of integers as a search query. This can be accomplished using requests as follows:

[18]:
url = "https://oeis.org/search"

# define a dictionary with data to be sent
params = {"q": "448, 548",  # search for integer sequences containing 448 and 548
          "fmt": "text"}     # return results as plain text (and not HTML)

page = requests.get(url, headers=headers, params=params)
print(page.text[:1000]) # print a fragment of the response
# Greetings from The On-Line Encyclopedia of Integer Sequences! http://oeis.org/

Search: seq:448,548
Showing 1-3 of 3

%I A328780
%S A328780 0,1,2,3,10,20,30,100,200,245,247,249,251,253,283,300,448,548,949,
%T A328780 1000,1249,1253,1416,1747,1749,1751,1753,1755,2000,2245,2247,2249,
%U A328780 2251,2253,2429,2450,2451,2470,2490,2498,2510,2530,2647,2830,3000,3747,3751,4480,4899
%N A328780 Nonnegative integers k such that k and k^2 have the same number of nonzero digits.
%C A328780 The idea of this sequence comes from the 1st problem of the 28th British Mathematical Olympiad in 1992 (see the link).
%C A328780 This sequence is infinite because the family of integers {10^k, k >= 0} (A011557) belongs to this sequence.
%C A328780 The numbers m, m + 1, m + 2 where m = 49*10^k - 3, or m = 99*10^k - 3, k >= 3 are terms with all nonzero digits. - _Marius A. Burtea_, Dec 21 2020
%D A328780 A. Gardiner, The Mathematical Olympiad Handbook: An Introduction to Problem Solving, Oxford University Pres
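
In a GET request the params dictionary travels as a query string appended to the URL. As a rough illustration of that encoding, the standard library function urlencode builds such a string by hand. This is a sketch of the idea, and the URL that requests actually sends may differ in small details:

```python
from urllib.parse import urlencode

url = "https://oeis.org/search"
params = {"q": "448, 548", "fmt": "text"}

# Encode the dictionary as key=value pairs joined by "&";
# characters such as "," and " " are escaped
query = urlencode(params)
full_url = f"{url}?{query}"
print(full_url)
```

The address that requests really used can be checked after the fact by printing page.url.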

POST requests

The HTTP protocol specifies several different methods by which a client (e.g. a web browser or a Python script) can send a message to a web server. The function requests.get() implements one such method: a GET request. The main purpose of GET requests is to ask the server for some data. As we have seen in the last example, GET requests can also send some information to the server, but this is not their main purpose, and their functionality in this respect is limited. In order to submit data to the server, e.g. to send a password or upload a file, clients use POST requests, which carry the data in the body of the request rather than in the URL. See w3schools.com for more information about differences between GET and POST.

In the requests module POST requests are implemented by the requests.post() function. Below we use this function to send a POST request with some data to the web page httpbin.org/post. This web page simply displays the information sent to it by a client, so it will let us check how this request looks when it is received by the server.

Note

If you click on the link httpbin.org/post, or enter this URL in the address bar of a web browser, you will see a page with the message “Method not allowed”. This happens because this URL expects POST requests, and web browsers use GET unless directed otherwise.

[19]:
url = "https://httpbin.org/post"

# define the User-Agent string
user_agent = "MTH548_POST_script/1.0"
# create a dictionary specifying the User-Agent
headers = {'User-Agent': user_agent}
# define data to be sent
data = {'first_name': 'Stanislaw', 'second_name': 'Lem'}

# send a POST request
page = requests.post(url, headers=headers, data=data)
print(page.text)
{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "first_name": "Stanislaw",
    "second_name": "Lem"
  },
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate, br",
    "Content-Length": "36",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "MTH548_POST_script/1.0",
    "X-Amzn-Trace-Id": "Root=1-622acd80-1816d66522c996eb5e6857d6"
  },
  "json": null,
  "origin": "98.118.166.35",
  "url": "https://httpbin.org/post"
}
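
The Content-Type and Content-Length headers in the reply describe the body of the POST request. Assuming the form data is encoded the way the standard library function urlencode does it, we can rebuild that body by hand and compare its size with the reported Content-Length of 36:

```python
from urllib.parse import urlencode

data = {'first_name': 'Stanislaw', 'second_name': 'Lem'}

# Form data is sent as key=value pairs joined by "&"
body = urlencode(data)
print(body)

# Content-Length counts the bytes of the encoded body
print(len(body.encode("ascii")))
```

The length agrees with the Content-Length header echoed by httpbin.org, which suggests that the data was sent in the request body in this form and not as part of the URL.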