Diffstat (limited to 'Doc/howto/urllib2.rst')
 Doc/howto/urllib2.rst | 603 ------------------------------------------------
 1 file changed, 0 insertions(+), 603 deletions(-)
diff --git a/Doc/howto/urllib2.rst b/Doc/howto/urllib2.rst
deleted file mode 100644
index f8f4a2b..0000000
--- a/Doc/howto/urllib2.rst
+++ /dev/null
@@ -1,603 +0,0 @@

==============================================
 HOWTO Fetch Internet Resources Using urllib2
==============================================

----------------------------
 Fetching URLs With Python
----------------------------


.. note::

    There is a French translation of an earlier revision of this
    HOWTO, available at `urllib2 - Le Manuel manquant
    <http://www.voidspace.org.uk/python/articles/urllib2_francais.shtml>`_.

.. contents:: urllib2 Tutorial


Introduction
============

.. sidebar:: Related Articles

    You may also find useful the following article on fetching web
    resources with Python:

    * `Basic Authentication <http://www.voidspace.org.uk/python/articles/authentication.shtml>`_

      A tutorial on *Basic Authentication*, with examples in Python.

    This HOWTO is written by `Michael Foord
    <http://www.voidspace.org.uk/python/index.shtml>`_.

**urllib2** is a `Python <http://www.python.org>`_ module for fetching URLs
(Uniform Resource Locators). It offers a very simple interface, in the form
of the *urlopen* function. This is capable of fetching URLs using a variety
of different protocols. It also offers a slightly more complex interface for
handling common situations - like basic authentication, cookies, proxies and
so on. These are provided by objects called handlers and openers.

urllib2 supports fetching URLs for many "URL schemes" (identified by the
string before the ":" in the URL - for example "ftp" is the URL scheme of
"ftp://python.org/") using their associated network protocols (e.g. FTP,
HTTP). This tutorial focuses on the most common case, HTTP.

For straightforward situations *urlopen* is very easy to use. But as soon as
you encounter errors or non-trivial cases when opening HTTP URLs, you will
need some understanding of the HyperText Transfer Protocol. The most
comprehensive and authoritative reference to HTTP is :RFC:`2616`. This is a
technical document and not intended to be easy to read. This HOWTO aims to
illustrate using *urllib2*, with enough detail about HTTP to help you
through. It is not intended to replace the `urllib2 docs
<http://docs.python.org/lib/module-urllib2.html>`_, but is supplementary to
them.


Fetching URLs
=============

The simplest way to use urllib2 is as follows::

    import urllib2
    response = urllib2.urlopen('http://python.org/')
    html = response.read()

Many uses of urllib2 will be that simple (note that instead of an 'http:'
URL we could have used a URL starting with 'ftp:', 'file:', etc.). However,
it's the purpose of this tutorial to explain the more complicated cases,
concentrating on HTTP.

HTTP is based on requests and responses - the client makes requests and
servers send responses. urllib2 mirrors this with a ``Request`` object which
represents the HTTP request you are making. In its simplest form you create
a Request object that specifies the URL you want to fetch. Calling
``urlopen`` with this Request object returns a response object for the URL
requested. This response is a file-like object, which means you can for
example call ``.read()`` on the response::

    import urllib2

    req = urllib2.Request('http://www.voidspace.org.uk')
    response = urllib2.urlopen(req)
    the_page = response.read()

Note that urllib2 makes use of the same Request interface to handle all URL
schemes. For example, you can make an FTP request like so::

    req = urllib2.Request('ftp://example.com/')
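Because the interface is the same for every scheme, a local file can be
fetched in the same way. A minimal sketch - the path here is hypothetical
and will differ on your system::

    import urllib2

    # 'file:' URLs are handled by FileHandler, one of the default handlers
    response = urllib2.urlopen('file:///etc/hostname')
    print response.read()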
In the case of HTTP, there are two extra things that Request objects allow
you to do: First, you can pass data to be sent to the server. Second, you
can pass extra information ("metadata") *about* the data or about the
request itself, to the server - this information is sent as HTTP "headers".
Let's look at each of these in turn.

Data
----

Sometimes you want to send data to a URL (often the URL will refer to a CGI
(Common Gateway Interface) script [#]_ or other web application). With HTTP,
this is often done using what's known as a **POST** request. This is often
what your browser does when you submit an HTML form that you filled in on
the web. Not all POSTs have to come from forms: you can use a POST to
transmit arbitrary data to your own application. In the common case of HTML
forms, the data needs to be encoded in a standard way, and then passed to
the Request object as the ``data`` argument. The encoding is done using a
function from the ``urllib`` library, *not* from ``urllib2``. ::

    import urllib
    import urllib2

    url = 'http://www.someserver.com/cgi-bin/register.cgi'
    values = {'name' : 'Michael Foord',
              'location' : 'Northampton',
              'language' : 'Python' }

    data = urllib.urlencode(values)
    req = urllib2.Request(url, data)
    response = urllib2.urlopen(req)
    the_page = response.read()

Note that other encodings are sometimes required (e.g. for file upload from
HTML forms - see `HTML Specification, Form Submission
<http://www.w3.org/TR/REC-html40/interact/forms.html#h-17.13>`_ for more
details).

If you do not pass the ``data`` argument, urllib2 uses a **GET** request.
One way in which GET and POST requests differ is that POST requests often
have "side-effects": they change the state of the system in some way (for
example by placing an order with the website for a hundredweight of tinned
spam to be delivered to your door). Though the HTTP standard makes it clear
that POSTs are intended to *always* cause side-effects, and GET requests
*never* to cause side-effects, nothing prevents a GET request from having
side-effects, nor a POST request from having no side-effects. Data can also
be passed in an HTTP GET request by encoding it in the URL itself.

This is done as follows::

    >>> import urllib2
    >>> import urllib
    >>> data = {}
    >>> data['name'] = 'Somebody Here'
    >>> data['location'] = 'Northampton'
    >>> data['language'] = 'Python'
    >>> url_values = urllib.urlencode(data)
    >>> print url_values
    name=Somebody+Here&language=Python&location=Northampton
    >>> url = 'http://www.example.com/example.cgi'
    >>> full_url = url + '?' + url_values
    >>> data = urllib2.urlopen(full_url)

Notice that the full URL is created by adding a ``?`` to the URL, followed
by the encoded values.

Headers
-------

We'll discuss here one particular HTTP header, to illustrate how to add
headers to your HTTP request.

Some websites [#]_ dislike being browsed by programs, or send different
versions to different browsers [#]_. By default urllib2 identifies itself as
``Python-urllib/x.y`` (where ``x`` and ``y`` are the major and minor version
numbers of the Python release, e.g. ``Python-urllib/2.5``), which may
confuse the site, or just plain not work. The way a browser identifies
itself is through the ``User-Agent`` header [#]_. When you create a Request
object you can pass a dictionary of headers in.
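Alternatively, you can set a header on an existing ``Request`` object with
its ``add_header`` method, rather than passing a dictionary to the
constructor. A minimal sketch, with a placeholder URL (the dictionary
approach is shown in the example that follows)::

    import urllib2

    req = urllib2.Request('http://www.example.com/')
    # add_header replaces any existing header with the same name
    req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')
    response = urllib2.urlopen(req)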
The following example makes the same request as above, but identifies itself
as a version of Internet Explorer [#]_. ::

    import urllib
    import urllib2

    url = 'http://www.someserver.com/cgi-bin/register.cgi'
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    values = {'name' : 'Michael Foord',
              'location' : 'Northampton',
              'language' : 'Python' }
    headers = { 'User-Agent' : user_agent }

    data = urllib.urlencode(values)
    req = urllib2.Request(url, data, headers)
    response = urllib2.urlopen(req)
    the_page = response.read()

The response also has two useful methods. See the section on `info and
geturl`_ which comes after we have a look at what happens when things go
wrong.


Handling Exceptions
===================

*urlopen* raises ``URLError`` when it cannot handle a response (though as
usual with Python APIs, builtin exceptions such as ``ValueError``,
``TypeError`` etc. may also be raised).

``HTTPError`` is the subclass of ``URLError`` raised in the specific case of
HTTP URLs.

URLError
--------

Often, URLError is raised because there is no network connection (no route
to the specified server), or the specified server doesn't exist. In this
case, the exception raised will have a 'reason' attribute, which is a tuple
containing an error code and a text error message.

e.g. ::

    >>> req = urllib2.Request('http://www.pretend_server.org')
    >>> try: urllib2.urlopen(req)
    >>> except URLError as e:
    >>>     print e.reason
    >>>
    (4, 'getaddrinfo failed')


HTTPError
---------

Every HTTP response from the server contains a numeric "status code".
Sometimes the status code indicates that the server is unable to fulfil the
request. The default handlers will handle some of these responses for you
(for example, if the response is a "redirection" that requests the client
fetch the document from a different URL, urllib2 will handle that for you).
For those it can't handle, urlopen will raise an ``HTTPError``. Typical
errors include '404' (page not found), '403' (request forbidden), and '401'
(authentication required).

See section 10 of RFC 2616 for a reference on all the HTTP error codes.

The ``HTTPError`` instance raised will have an integer 'code' attribute,
which corresponds to the error sent by the server.

Error Codes
~~~~~~~~~~~

Because the default handlers handle redirects (codes in the 300 range), and
codes in the 100-299 range indicate success, you will usually only see error
codes in the 400-599 range.

``BaseHTTPServer.BaseHTTPRequestHandler.responses`` is a useful dictionary
of response codes that shows all the response codes used by RFC 2616. The
dictionary is reproduced here for convenience::

    # Table mapping response codes to messages; entries have the
    # form {code: (shortmessage, longmessage)}.
    responses = {
        100: ('Continue', 'Request received, please continue'),
        101: ('Switching Protocols',
              'Switching to new protocol; obey Upgrade header'),

        200: ('OK', 'Request fulfilled, document follows'),
        201: ('Created', 'Document created, URL follows'),
        202: ('Accepted',
              'Request accepted, processing continues off-line'),
        203: ('Non-Authoritative Information', 'Request fulfilled from cache'),
        204: ('No Content', 'Request fulfilled, nothing follows'),
        205: ('Reset Content', 'Clear input form for further input.'),
        206: ('Partial Content', 'Partial content follows.'),

        300: ('Multiple Choices',
              'Object has several resources -- see URI list'),
        301: ('Moved Permanently', 'Object moved permanently -- see URI list'),
        302: ('Found', 'Object moved temporarily -- see URI list'),
        303: ('See Other', 'Object moved -- see Method and URL list'),
        304: ('Not Modified',
              'Document has not changed since given time'),
        305: ('Use Proxy',
              'You must use proxy specified in Location to access this '
              'resource.'),
        307: ('Temporary Redirect',
              'Object moved temporarily -- see URI list'),

        400: ('Bad Request',
              'Bad request syntax or unsupported method'),
        401: ('Unauthorized',
              'No permission -- see authorization schemes'),
        402: ('Payment Required',
              'No payment -- see charging schemes'),
        403: ('Forbidden',
              'Request forbidden -- authorization will not help'),
        404: ('Not Found', 'Nothing matches the given URI'),
        405: ('Method Not Allowed',
              'Specified method is invalid for this server.'),
        406: ('Not Acceptable', 'URI not available in preferred format.'),
        407: ('Proxy Authentication Required', 'You must authenticate with '
              'this proxy before proceeding.'),
        408: ('Request Timeout', 'Request timed out; try again later.'),
        409: ('Conflict', 'Request conflict.'),
        410: ('Gone',
              'URI no longer exists and has been permanently removed.'),
        411: ('Length Required', 'Client must specify Content-Length.'),
        412: ('Precondition Failed', 'Precondition in headers is false.'),
        413: ('Request Entity Too Large', 'Entity is too large.'),
        414: ('Request-URI Too Long', 'URI is too long.'),
        415: ('Unsupported Media Type', 'Entity body in unsupported format.'),
        416: ('Requested Range Not Satisfiable',
              'Cannot satisfy request range.'),
        417: ('Expectation Failed',
              'Expect condition could not be satisfied.'),

        500: ('Internal Server Error', 'Server got itself in trouble'),
        501: ('Not Implemented',
              'Server does not support this operation'),
        502: ('Bad Gateway', 'Invalid responses from another server/proxy.'),
        503: ('Service Unavailable',
              'The server cannot process the request due to a high load'),
        504: ('Gateway Timeout',
              'The gateway server did not receive a timely response'),
        505: ('HTTP Version Not Supported', 'Cannot fulfill request.'),
        }

When an error is raised the server responds by returning an HTTP error code
*and* an error page. You can use the ``HTTPError`` instance as a response
for the error page returned. This means that as well as the code attribute,
it also has ``read``, ``geturl``, and ``info`` methods. ::

    >>> req = urllib2.Request('http://www.python.org/fish.html')
    >>> try:
    >>>     urllib2.urlopen(req)
    >>> except URLError as e:
    >>>     print e.code
    >>>     print e.read()
    >>>
    404
    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
        "http://www.w3.org/TR/html4/loose.dtd">
    <?xml-stylesheet href="./css/ht2html.css"
        type="text/css"?>
    <html><head><title>Error 404: File Not Found</title>
    ...... etc...
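Since the ``HTTPError`` instance carries the numeric code, you can combine
it with the ``responses`` table above to print a friendlier message. A
minimal sketch, assuming Python 2.6+ (for the ``except ... as`` syntax)::

    import urllib2
    from BaseHTTPServer import BaseHTTPRequestHandler

    try:
        urllib2.urlopen('http://www.python.org/fish.html')
    except urllib2.HTTPError as e:
        # fall back to a generic message for codes missing from the table
        short, detail = BaseHTTPRequestHandler.responses.get(
            e.code, ('Unknown', 'No description available'))
        print 'Error %d: %s (%s)' % (e.code, short, detail)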
Wrapping it Up
--------------

So if you want to be prepared for ``HTTPError`` *or* ``URLError`` there are
two basic approaches. I prefer the second approach.

Number 1
~~~~~~~~

::

    from urllib2 import Request, urlopen, URLError, HTTPError
    req = Request(someurl)
    try:
        response = urlopen(req)
    except HTTPError as e:
        print 'The server couldn\'t fulfill the request.'
        print 'Error code: ', e.code
    except URLError as e:
        print 'We failed to reach a server.'
        print 'Reason: ', e.reason
    else:
        # everything is fine


.. note::

    The ``except HTTPError`` *must* come first, otherwise ``except URLError``
    will *also* catch an ``HTTPError``.

Number 2
~~~~~~~~

::

    from urllib2 import Request, urlopen, URLError
    req = Request(someurl)
    try:
        response = urlopen(req)
    except URLError as e:
        if hasattr(e, 'reason'):
            print 'We failed to reach a server.'
            print 'Reason: ', e.reason
        elif hasattr(e, 'code'):
            print 'The server couldn\'t fulfill the request.'
            print 'Error code: ', e.code
    else:
        # everything is fine


info and geturl
===============

The response returned by urlopen (or the ``HTTPError`` instance) has two
useful methods, ``info`` and ``geturl``.

**geturl** - this returns the real URL of the page fetched. This is useful
because ``urlopen`` (or the opener object used) may have followed a
redirect. The URL of the page fetched may not be the same as the URL
requested.

**info** - this returns a dictionary-like object that describes the page
fetched, particularly the headers sent by the server. It is currently an
``httplib.HTTPMessage`` instance.

Typical headers include 'Content-length', 'Content-type', and so on. See the
`Quick Reference to HTTP Headers <http://www.cs.tut.fi/~jkorpela/http.html>`_
for a useful listing of HTTP headers with brief explanations of their
meaning and use.


Openers and Handlers
====================

When you fetch a URL you use an opener (an instance of the perhaps
confusingly-named ``urllib2.OpenerDirector``). Normally we have been using
the default opener - via ``urlopen`` - but you can create custom openers.
Openers use handlers. All the "heavy lifting" is done by the handlers. Each
handler knows how to open URLs for a particular URL scheme (http, ftp,
etc.), or how to handle an aspect of URL opening, for example HTTP
redirections or HTTP cookies.

You will want to create openers if you want to fetch URLs with specific
handlers installed, for example to get an opener that handles cookies, or to
get an opener that does not handle redirections.

To create an opener, instantiate an ``OpenerDirector``, and then call
``.add_handler(some_handler_instance)`` repeatedly.

Alternatively, you can use ``build_opener``, which is a convenience function
for creating opener objects with a single function call. ``build_opener``
adds several handlers by default, but provides a quick way to add more
and/or override the default handlers.

Other sorts of handlers you might want can handle proxies, authentication,
and other common but slightly specialised situations.

``install_opener`` can be used to make an ``opener`` object the (global)
default opener. This means that calls to ``urlopen`` will use the opener you
have installed.

Opener objects have an ``open`` method, which can be called directly to
fetch urls in the same way as the ``urlopen`` function: there's no need to
call ``install_opener``, except as a convenience.
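As a minimal sketch of this pattern, here is the cookie-handling opener
mentioned above, assuming Python 2.4+ where ``cookielib`` and
``HTTPCookieProcessor`` are available::

    import cookielib
    import urllib2

    # handlers passed to build_opener are added on top of the defaults
    cj = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

    # use the opener directly, without installing it globally...
    response = opener.open('http://www.example.com/')

    # ...or install it, so that plain urllib2.urlopen uses it too
    urllib2.install_opener(opener)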
Basic Authentication
====================

To illustrate creating and installing a handler we will use the
``HTTPBasicAuthHandler``. For a more detailed discussion of this subject -
including an explanation of how Basic Authentication works - see the `Basic
Authentication Tutorial
<http://www.voidspace.org.uk/python/articles/authentication.shtml>`_.

When authentication is required, the server sends a header (as well as the
401 error code) requesting authentication. This specifies the authentication
scheme and a 'realm'. The header looks like:
``WWW-Authenticate: SCHEME realm="REALM"``.

e.g. ::

    WWW-Authenticate: Basic realm="cPanel Users"


The client should then retry the request with the appropriate name and
password for the realm included as a header in the request. This is 'basic
authentication'. In order to simplify this process we can create an instance
of ``HTTPBasicAuthHandler`` and an opener to use this handler.

The ``HTTPBasicAuthHandler`` uses an object called a password manager to
handle the mapping of URLs and realms to passwords and usernames. If you
know what the realm is (from the authentication header sent by the server),
then you can use a ``HTTPPasswordMgr``. Frequently one doesn't care what the
realm is. In that case, it is convenient to use
``HTTPPasswordMgrWithDefaultRealm``. This allows you to specify a default
username and password for a URL. This will be supplied in the absence of you
providing an alternative combination for a specific realm. We indicate this
by providing ``None`` as the realm argument to the ``add_password`` method.

The top-level URL is the first URL that requires authentication. URLs
"deeper" than the URL you pass to ``.add_password()`` will also match. ::

    # create a password manager
    password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()

    # Add the username and password.
    # If we knew the realm, we could use it instead of ``None``.
    top_level_url = "http://example.com/foo/"
    password_mgr.add_password(None, top_level_url, username, password)

    handler = urllib2.HTTPBasicAuthHandler(password_mgr)

    # create "opener" (OpenerDirector instance)
    opener = urllib2.build_opener(handler)

    # use the opener to fetch a URL
    opener.open(a_url)

    # Install the opener.
    # Now all calls to urllib2.urlopen use our opener.
    urllib2.install_opener(opener)

.. note::

    In the above example we only supplied our ``HTTPBasicAuthHandler`` to
    ``build_opener``. By default openers have the handlers for normal
    situations - ``ProxyHandler``, ``UnknownHandler``, ``HTTPHandler``,
    ``HTTPDefaultErrorHandler``, ``HTTPRedirectHandler``, ``FTPHandler``,
    ``FileHandler``, ``HTTPErrorProcessor``.

top_level_url is in fact *either* a full URL (including the 'http:' scheme
component and the hostname and optionally the port number)
e.g. "http://example.com/" *or* an "authority" (i.e. the hostname,
optionally including the port number) e.g. "example.com" or
"example.com:8080" (the latter example includes a port number). The
authority, if present, must NOT contain the "userinfo" component - for
example "joe:password@example.com" is not correct.


Proxies
=======

**urllib2** will auto-detect your proxy settings and use those. This is
through the ``ProxyHandler`` which is part of the normal handler chain.
Normally that's a good thing, but there are occasions when it may not be
helpful [#]_.
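Incidentally, ``ProxyHandler`` can also be pointed the other way: passing it
an explicit mapping of schemes to proxy URLs forces requests through a
particular proxy. A minimal sketch - the proxy address is hypothetical::

    import urllib2

    # route http requests through a specific proxy
    # instead of the auto-detected one
    proxy_support = urllib2.ProxyHandler({'http': 'http://proxy.example.com:3128/'})
    opener = urllib2.build_opener(proxy_support)
    response = opener.open('http://www.python.org/')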
One way to disable the automatic proxy handling is to set up our own
``ProxyHandler`` with no proxies defined. This is done using similar steps
to setting up a `Basic Authentication`_ handler::

    >>> proxy_support = urllib2.ProxyHandler({})
    >>> opener = urllib2.build_opener(proxy_support)
    >>> urllib2.install_opener(opener)

.. note::

    Currently ``urllib2`` *does not* support fetching of ``https`` locations
    through a proxy. However, this can be enabled by extending urllib2 as
    shown in the recipe [#]_.


Sockets and Layers
==================

The Python support for fetching resources from the web is layered. urllib2
uses the httplib library, which in turn uses the socket library.

As of Python 2.3 you can specify how long a socket should wait for a
response before timing out. This can be useful in applications which have to
fetch web pages. By default the socket module has *no timeout* and can hang.
Currently, the socket timeout is not exposed at the httplib or urllib2
levels. However, you can set the default timeout globally for all sockets
using::

    import socket
    import urllib2

    # timeout in seconds
    timeout = 10
    socket.setdefaulttimeout(timeout)

    # this call to urllib2.urlopen now uses the default timeout
    # we have set in the socket module
    req = urllib2.Request('http://www.voidspace.org.uk')
    response = urllib2.urlopen(req)


--------


Footnotes
=========

This document was reviewed and revised by John Lee.

.. [#] For an introduction to the CGI protocol see `Writing Web Applications
   in Python <http://www.pyzine.com/Issue008/Section_Articles/article_CGIOne.html>`_.
.. [#] Like Google for example. The *proper* way to use Google from a program
   is to use `PyGoogle <http://pygoogle.sourceforge.net>`_ of course. See
   `Voidspace Google <http://www.voidspace.org.uk/python/recipebook.shtml#google>`_
   for some examples of using the Google API.
.. [#] Browser sniffing is a very bad practice for website design - building
   sites using web standards is much more sensible. Unfortunately a lot of
   sites still send different versions to different browsers.
.. [#] The user agent for MSIE 6 is
   *'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)'*
.. [#] For details of more HTTP request headers, see `Quick Reference to
   HTTP Headers`_.
.. [#] In my case I have to use a proxy to access the internet at work. If
   you attempt to fetch *localhost* URLs through this proxy it blocks them.
   IE is set to use the proxy, which urllib2 picks up on. In order to test
   scripts with a localhost server, I have to prevent urllib2 from using
   the proxy.
.. [#] urllib2 opener for SSL proxy (CONNECT method): `ASPN Cookbook Recipe
   <http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/456195>`_.