Diffstat (limited to 'Doc')
-rw-r--r-- | Doc/howto/urllib2.rst | 135
-rw-r--r-- | Doc/library/contextlib.rst | 4
-rw-r--r-- | Doc/library/fileformats.rst | 1
-rw-r--r-- | Doc/library/ftplib.rst | 6
-rw-r--r-- | Doc/library/http.client.rst | 9
-rw-r--r-- | Doc/library/internet.rst | 7
-rw-r--r-- | Doc/library/urllib.error.rst | 48
-rw-r--r-- | Doc/library/urllib.parse.rst (renamed from Doc/library/urlparse.rst) | 64
-rw-r--r-- | Doc/library/urllib.request.rst (renamed from Doc/library/urllib2.rst) | 344
-rw-r--r-- | Doc/library/urllib.robotparser.rst | 73
-rw-r--r-- | Doc/library/urllib.rst | 459
-rw-r--r-- | Doc/tutorial/stdlib.rst | 8
12 files changed, 565 insertions, 593 deletions
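
Reviewer note: the renames in this change follow one pattern. ``urllib2`` becomes ``urllib.request``, ``urlparse`` and the quoting helpers become ``urllib.parse``, and the exception classes move to ``urllib.error``. The sketch below is not part of the commit; it is a minimal illustration of that mapping, assuming a current Python 3 interpreter. The URL and form values are placeholders, and encoding the POST data to bytes is a requirement of present-day Python 3 that the diff itself does not show. The request is only constructed, never sent, so no network access is needed.

```python
import urllib.error
import urllib.parse
import urllib.request

# urllib.urlencode -> urllib.parse.urlencode
values = {"name": "Michael Foord", "language": "Python"}
query = urllib.parse.urlencode(values)

# urllib2.Request -> urllib.request.Request
# Assumption: current Python 3 expects POST data as bytes, so encode it.
url = "http://www.example.com/cgi-bin/register.cgi"  # placeholder URL
req = urllib.request.Request(url, data=query.encode("ascii"))

# urllib2.URLError / HTTPError -> urllib.error.URLError / HTTPError
is_subclass = issubclass(urllib.error.HTTPError, urllib.error.URLError)

# urlparse.urlsplit -> urllib.parse.urlsplit
parts = urllib.parse.urlsplit(url)

print(req.get_method(), parts.netloc, query, is_subclass)
```

Passing ``data`` makes the request a POST; leaving it out keeps the default GET, which matches the behaviour described in the HOWTO hunks below.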
diff --git a/Doc/howto/urllib2.rst b/Doc/howto/urllib2.rst index 0940d82..6342b6e 100644 --- a/Doc/howto/urllib2.rst +++ b/Doc/howto/urllib2.rst @@ -1,6 +1,6 @@ -************************************************ - HOWTO Fetch Internet Resources Using urllib2 -************************************************ +***************************************************** + HOWTO Fetch Internet Resources Using urllib package +***************************************************** :Author: `Michael Foord <http://www.voidspace.org.uk/python/index.shtml>`_ @@ -24,14 +24,14 @@ Introduction A tutorial on *Basic Authentication*, with examples in Python. -**urllib2** is a `Python <http://www.python.org>`_ module for fetching URLs +**urllib.request** is a `Python <http://www.python.org>`_ module for fetching URLs (Uniform Resource Locators). It offers a very simple interface, in the form of the *urlopen* function. This is capable of fetching URLs using a variety of different protocols. It also offers a slightly more complex interface for handling common situations - like basic authentication, cookies, proxies and so on. These are provided by objects called handlers and openers. -urllib2 supports fetching URLs for many "URL schemes" (identified by the string +urllib.request supports fetching URLs for many "URL schemes" (identified by the string before the ":" in URL - for example "ftp" is the URL scheme of "ftp://python.org/") using their associated network protocols (e.g. FTP, HTTP). This tutorial focuses on the most common case, HTTP. @@ -40,43 +40,43 @@ For straightforward situations *urlopen* is very easy to use. But as soon as you encounter errors or non-trivial cases when opening HTTP URLs, you will need some understanding of the HyperText Transfer Protocol. The most comprehensive and authoritative reference to HTTP is :rfc:`2616`. This is a technical document and -not intended to be easy to read. This HOWTO aims to illustrate using *urllib2*, +not intended to be easy to read. 
This HOWTO aims to illustrate using *urllib*, with enough detail about HTTP to help you through. It is not intended to replace -the :mod:`urllib2` docs, but is supplementary to them. +the :mod:`urllib.request` docs, but is supplementary to them. Fetching URLs ============= -The simplest way to use urllib2 is as follows:: +The simplest way to use urllib.request is as follows:: - import urllib2 - response = urllib2.urlopen('http://python.org/') + import urllib.request + response = urllib.request.urlopen('http://python.org/') html = response.read() -Many uses of urllib2 will be that simple (note that instead of an 'http:' URL we +Many uses of urllib will be that simple (note that instead of an 'http:' URL we could have used an URL starting with 'ftp:', 'file:', etc.). However, it's the purpose of this tutorial to explain the more complicated cases, concentrating on HTTP. HTTP is based on requests and responses - the client makes requests and servers -send responses. urllib2 mirrors this with a ``Request`` object which represents +send responses. urllib.request mirrors this with a ``Request`` object which represents the HTTP request you are making. In its simplest form you create a Request object that specifies the URL you want to fetch. Calling ``urlopen`` with this Request object returns a response object for the URL requested. This response is a file-like object, which means you can for example call ``.read()`` on the response:: - import urllib2 + import urllib.request - req = urllib2.Request('http://www.voidspace.org.uk') - response = urllib2.urlopen(req) + req = urllib.request.Request('http://www.voidspace.org.uk') + response = urllib.request.urlopen(req) the_page = response.read() -Note that urllib2 makes use of the same Request interface to handle all URL +Note that urllib.request makes use of the same Request interface to handle all URL schemes. 
For example, you can make an FTP request like so:: - req = urllib2.Request('ftp://example.com/') + req = urllib.request.Request('ftp://example.com/') In the case of HTTP, there are two extra things that Request objects allow you to do: First, you can pass data to be sent to the server. Second, you can pass @@ -94,20 +94,20 @@ your browser does when you submit a HTML form that you filled in on the web. Not all POSTs have to come from forms: you can use a POST to transmit arbitrary data to your own application. In the common case of HTML forms, the data needs to be encoded in a standard way, and then passed to the Request object as the ``data`` -argument. The encoding is done using a function from the ``urllib`` library -*not* from ``urllib2``. :: +argument. The encoding is done using a function from the ``urllib.parse`` library +*not* from ``urllib.request``. :: - import urllib - import urllib2 + import urllib.parse + import urllib.request url = 'http://www.someserver.com/cgi-bin/register.cgi' values = {'name' : 'Michael Foord', 'location' : 'Northampton', 'language' : 'Python' } - data = urllib.urlencode(values) - req = urllib2.Request(url, data) - response = urllib2.urlopen(req) + data = urllib.parse.urlencode(values) + req = urllib.request.Request(url, data) + response = urllib.request.urlopen(req) the_page = response.read() Note that other encodings are sometimes required (e.g. for file upload from HTML @@ -115,7 +115,7 @@ forms - see `HTML Specification, Form Submission <http://www.w3.org/TR/REC-html40/interact/forms.html#h-17.13>`_ for more details). -If you do not pass the ``data`` argument, urllib2 uses a **GET** request. One +If you do not pass the ``data`` argument, urllib.request uses a **GET** request. 
One way in which GET and POST requests differ is that POST requests often have "side-effects": they change the state of the system in some way (for example by placing an order with the website for a hundredweight of tinned spam to be @@ -127,18 +127,18 @@ GET request by encoding it in the URL itself. This is done as follows:: - >>> import urllib2 - >>> import urllib + >>> import urllib.request + >>> import urllib.parse >>> data = {} >>> data['name'] = 'Somebody Here' >>> data['location'] = 'Northampton' >>> data['language'] = 'Python' - >>> url_values = urllib.urlencode(data) + >>> url_values = urllib.parse.urlencode(data) >>> print(url_values) name=Somebody+Here&language=Python&location=Northampton >>> url = 'http://www.example.com/example.cgi' >>> full_url = url + '?' + url_values - >>> data = urllib2.open(full_url) + >>> data = urllib.request.open(full_url) Notice that the full URL is created by adding a ``?`` to the URL, followed by the encoded values. @@ -150,7 +150,7 @@ We'll discuss here one particular HTTP header, to illustrate how to add headers to your HTTP request. Some websites [#]_ dislike being browsed by programs, or send different versions -to different browsers [#]_ . By default urllib2 identifies itself as +to different browsers [#]_ . By default urllib identifies itself as ``Python-urllib/x.y`` (where ``x`` and ``y`` are the major and minor version numbers of the Python release, e.g. ``Python-urllib/2.5``), which may confuse the site, or just plain @@ -160,8 +160,8 @@ pass a dictionary of headers in. The following example makes the same request as above, but identifies itself as a version of Internet Explorer [#]_. :: - import urllib - import urllib2 + import urllib.parse + import urllib.request url = 'http://www.someserver.com/cgi-bin/register.cgi' user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)' @@ -170,9 +170,9 @@ Explorer [#]_. 
:: 'language' : 'Python' } headers = { 'User-Agent' : user_agent } - data = urllib.urlencode(values) - req = urllib2.Request(url, data, headers) - response = urllib2.urlopen(req) + data = urllib.parse.urlencode(values) + req = urllib.request.Request(url, data, headers) + response = urllib.request.urlopen(req) the_page = response.read() The response also has two useful methods. See the section on `info and geturl`_ @@ -182,7 +182,7 @@ which comes after we have a look at what happens when things go wrong. Handling Exceptions =================== -*urlopen* raises ``URLError`` when it cannot handle a response (though as usual +*urllib.error* raises ``URLError`` when it cannot handle a response (though as usual with Python APIs, builtin exceptions such as ValueError, TypeError etc. may also be raised). @@ -199,9 +199,9 @@ error code and a text error message. e.g. :: - >>> req = urllib2.Request('http://www.pretend_server.org') - >>> try: urllib2.urlopen(req) - >>> except URLError, e: + >>> req = urllib.request.Request('http://www.pretend_server.org') + >>> try: urllib.request.urlopen(req) + >>> except urllib.error.URLError, e: >>> print(e.reason) >>> (4, 'getaddrinfo failed') @@ -214,7 +214,7 @@ Every HTTP response from the server contains a numeric "status code". Sometimes the status code indicates that the server is unable to fulfil the request. The default handlers will handle some of these responses for you (for example, if the response is a "redirection" that requests the client fetch the document from -a different URL, urllib2 will handle that for you). For those it can't handle, +a different URL, urllib.request will handle that for you). For those it can't handle, urlopen will raise an ``HTTPError``. Typical errors include '404' (page not found), '403' (request forbidden), and '401' (authentication required). 
@@ -305,12 +305,12 @@ dictionary is reproduced here for convenience :: When an error is raised the server responds by returning an HTTP error code *and* an error page. You can use the ``HTTPError`` instance as a response on the page returned. This means that as well as the code attribute, it also has read, -geturl, and info, methods. :: +geturl, and info, methods as returned by the ``urllib.response`` module:: - >>> req = urllib2.Request('http://www.python.org/fish.html') + >>> req = urllib.request.Request('http://www.python.org/fish.html') >>> try: - >>> urllib2.urlopen(req) - >>> except URLError, e: + >>> urllib.request.urlopen(req) + >>> except urllib.error.URLError, e: >>> print(e.code) >>> print(e.read()) >>> @@ -334,7 +334,8 @@ Number 1 :: - from urllib2 import Request, urlopen, URLError, HTTPError + from urllib.request import Request, urlopen + from urllib.error import URLError, HTTPError req = Request(someurl) try: response = urlopen(req) @@ -358,7 +359,8 @@ Number 2 :: - from urllib2 import Request, urlopen, URLError + from urllib.request import Request, urlopen + from urllib.error import URLError req = Request(someurl) try: response = urlopen(req) @@ -377,7 +379,8 @@ info and geturl =============== The response returned by urlopen (or the ``HTTPError`` instance) has two useful -methods ``info`` and ``geturl``. +methods ``info`` and ``geturl`` and is defined in the module +``urllib.response``. **geturl** - this returns the real URL of the page fetched. This is useful because ``urlopen`` (or the opener object used) may have followed a @@ -397,7 +400,7 @@ Openers and Handlers ==================== When you fetch a URL you use an opener (an instance of the perhaps -confusingly-named :class:`urllib2.OpenerDirector`). Normally we have been using +confusingly-named :class:`urllib.request.OpenerDirector`). Normally we have been using the default opener - via ``urlopen`` - but you can create custom openers. Openers use handlers. 
All the "heavy lifting" is done by the handlers. Each handler knows how to open URLs for a particular URL scheme (http, @@ -466,24 +469,24 @@ The top-level URL is the first URL that requires authentication. URLs "deeper" than the URL you pass to .add_password() will also match. :: # create a password manager - password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm() + password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm() # Add the username and password. # If we knew the realm, we could use it instead of ``None``. top_level_url = "http://example.com/foo/" password_mgr.add_password(None, top_level_url, username, password) - handler = urllib2.HTTPBasicAuthHandler(password_mgr) + handler = urllib.request.HTTPBasicAuthHandler(password_mgr) # create "opener" (OpenerDirector instance) - opener = urllib2.build_opener(handler) + opener = urllib.request.build_opener(handler) # use the opener to fetch a URL opener.open(a_url) # Install the opener. - # Now all calls to urllib2.urlopen use our opener. - urllib2.install_opener(opener) + # Now all calls to urllib.request.urlopen use our opener. + urllib.request.install_opener(opener) .. note:: @@ -505,46 +508,46 @@ not correct. Proxies ======= -**urllib2** will auto-detect your proxy settings and use those. This is through +**urllib.request** will auto-detect your proxy settings and use those. This is through the ``ProxyHandler`` which is part of the normal handler chain. Normally that's a good thing, but there are occasions when it may not be helpful [#]_. One way to do this is to setup our own ``ProxyHandler``, with no proxies defined. This is done using similar steps to setting up a `Basic Authentication`_ handler : :: - >>> proxy_support = urllib2.ProxyHandler({}) - >>> opener = urllib2.build_opener(proxy_support) - >>> urllib2.install_opener(opener) + >>> proxy_support = urllib.request.ProxyHandler({}) + >>> opener = urllib.request.build_opener(proxy_support) + >>> urllib.request.install_opener(opener) .. 
note:: - Currently ``urllib2`` *does not* support fetching of ``https`` locations - through a proxy. However, this can be enabled by extending urllib2 as + Currently ``urllib.request`` *does not* support fetching of ``https`` locations + through a proxy. However, this can be enabled by extending urllib.request as shown in the recipe [#]_. Sockets and Layers ================== -The Python support for fetching resources from the web is layered. urllib2 uses -the http.client library, which in turn uses the socket library. +The Python support for fetching resources from the web is layered. +urllib.request uses the http.client library, which in turn uses the socket library. As of Python 2.3 you can specify how long a socket should wait for a response before timing out. This can be useful in applications which have to fetch web pages. By default the socket module has *no timeout* and can hang. Currently, -the socket timeout is not exposed at the http.client or urllib2 levels. +the socket timeout is not exposed at the http.client or urllib.request levels. 
However, you can set the default timeout globally for all sockets using :: import socket - import urllib2 + import urllib.request # timeout in seconds timeout = 10 socket.setdefaulttimeout(timeout) - # this call to urllib2.urlopen now uses the default timeout + # this call to urllib.request.urlopen now uses the default timeout # we have set in the socket module - req = urllib2.Request('http://www.voidspace.org.uk') - response = urllib2.urlopen(req) + req = urllib.request.Request('http://www.voidspace.org.uk') + response = urllib.request.urlopen(req) ------- diff --git a/Doc/library/contextlib.rst b/Doc/library/contextlib.rst index 54d2a19..2cd97c2 100644 --- a/Doc/library/contextlib.rst +++ b/Doc/library/contextlib.rst @@ -98,9 +98,9 @@ Functions provided: And lets you write code like this:: from contextlib import closing - import urllib + import urllib.request - with closing(urllib.urlopen('http://www.python.org')) as page: + with closing(urllib.request.urlopen('http://www.python.org')) as page: for line in page: print(line) diff --git a/Doc/library/fileformats.rst b/Doc/library/fileformats.rst index d2f0639..dc2e237 100644 --- a/Doc/library/fileformats.rst +++ b/Doc/library/fileformats.rst @@ -13,7 +13,6 @@ that aren't markup languages or are related to e-mail. csv.rst configparser.rst - robotparser.rst netrc.rst xdrlib.rst plistlib.rst diff --git a/Doc/library/ftplib.rst b/Doc/library/ftplib.rst index 8a35a40..f360c60 100644 --- a/Doc/library/ftplib.rst +++ b/Doc/library/ftplib.rst @@ -13,9 +13,9 @@ This module defines the class :class:`FTP` and a few related items. The :class:`FTP` class implements the client side of the FTP protocol. You can use this to write Python programs that perform a variety of automated FTP jobs, such -as mirroring other ftp servers. It is also used by the module :mod:`urllib` to -handle URLs that use FTP. For more information on FTP (File Transfer Protocol), -see Internet :rfc:`959`. +as mirroring other ftp servers. 
It is also used by the module +:mod:`urllib.request` to handle URLs that use FTP. For more information on FTP +(File Transfer Protocol), see Internet :rfc:`959`. Here's a sample session using the :mod:`ftplib` module:: diff --git a/Doc/library/http.client.rst b/Doc/library/http.client.rst index 8138467..1ea3576 100644 --- a/Doc/library/http.client.rst +++ b/Doc/library/http.client.rst @@ -9,10 +9,11 @@ pair: HTTP; protocol single: HTTP; http.client (standard module) -.. index:: module: urllib +.. index:: module: urllib.request This module defines classes which implement the client side of the HTTP and -HTTPS protocols. It is normally not used directly --- the module :mod:`urllib` +HTTPS protocols. It is normally not used directly --- the module +:mod:`urllib.request` uses it to handle URLs that use HTTP and HTTPS. .. note:: @@ -484,8 +485,8 @@ Here is an example session that uses the ``GET`` method:: Here is an example session that shows how to ``POST`` requests:: - >>> import http.client, urllib - >>> params = urllib.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0}) + >>> import http.client, urllib.parse + >>> params = urllib.parse.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0}) >>> headers = {"Content-type": "application/x-www-form-urlencoded", ... "Accept": "text/plain"} >>> conn = http.client.HTTPConnection("musi-cal.mojam.com:80") diff --git a/Doc/library/internet.rst b/Doc/library/internet.rst index 948a0b2..a676a66 100644 --- a/Doc/library/internet.rst +++ b/Doc/library/internet.rst @@ -24,8 +24,10 @@ is currently supported on most popular platforms. Here is an overview: cgi.rst cgitb.rst wsgiref.rst - urllib.rst - urllib2.rst + urllib.request.rst + urllib.parse.rst + urllib.error.rst + urllib.robotparser.rst http.client.rst ftplib.rst poplib.rst @@ -35,7 +37,6 @@ is currently supported on most popular platforms. 
Here is an overview: smtpd.rst telnetlib.rst uuid.rst - urlparse.rst socketserver.rst http.server.rst http.cookies.rst diff --git a/Doc/library/urllib.error.rst b/Doc/library/urllib.error.rst new file mode 100644 index 0000000..1cbfe7d --- /dev/null +++ b/Doc/library/urllib.error.rst @@ -0,0 +1,48 @@ +:mod:`urllib.error` --- Exception classes raised by urllib.request +================================================================== + +.. module:: urllib.error + :synopsis: Next generation URL opening library. +.. moduleauthor:: Jeremy Hylton <jhylton@users.sourceforge.net> +.. sectionauthor:: Senthil Kumaran <orsenthil@gmail.com> + + +The :mod:`urllib.error` module defines exception classes raise by +urllib.request. The base exception class is URLError, which inherits from +IOError. + +The following exceptions are raised by :mod:`urllib.error` as appropriate: + + +.. exception:: URLError + + The handlers raise this exception (or derived exceptions) when they run into a + problem. It is a subclass of :exc:`IOError`. + + .. attribute:: reason + + The reason for this error. It can be a message string or another exception + instance (:exc:`socket.error` for remote URLs, :exc:`OSError` for local + URLs). + + +.. exception:: HTTPError + + Though being an exception (a subclass of :exc:`URLError`), an :exc:`HTTPError` + can also function as a non-exceptional file-like return value (the same thing + that :func:`urlopen` returns). This is useful when handling exotic HTTP + errors, such as requests for authentication. + + .. attribute:: code + + An HTTP status code as defined in `RFC 2616 <http://www.faqs.org/rfcs/rfc2616.html>`_. + This numeric value corresponds to a value found in the dictionary of + codes as found in :attr:`http.server.BaseHTTPRequestHandler.responses`. + +.. 
exception:: ContentTooShortError(msg[, content]) + + This exception is raised when the :func:`urlretrieve` function detects that the + amount of the downloaded data is less than the expected amount (given by the + *Content-Length* header). The :attr:`content` attribute stores the downloaded + (and supposedly truncated) data. + diff --git a/Doc/library/urlparse.rst b/Doc/library/urllib.parse.rst index e305e0b..affa406 100644 --- a/Doc/library/urlparse.rst +++ b/Doc/library/urllib.parse.rst @@ -1,7 +1,7 @@ -:mod:`urlparse` --- Parse URLs into components -============================================== +:mod:`urllib.parse` --- Parse URLs into components +================================================== -.. module:: urlparse +.. module:: urllib.parse :synopsis: Parse URLs into or assemble them from components. @@ -24,7 +24,7 @@ following URL schemes: ``file``, ``ftp``, ``gopher``, ``hdl``, ``http``, ``rsync``, ``rtsp``, ``rtspu``, ``sftp``, ``shttp``, ``sip``, ``sips``, ``snews``, ``svn``, ``svn+ssh``, ``telnet``, ``wais``. -The :mod:`urlparse` module defines the following functions: +The :mod:`urllib.parse` module defines the following functions: .. function:: urlparse(urlstring[, default_scheme[, allow_fragments]]) @@ -37,7 +37,7 @@ The :mod:`urlparse` module defines the following functions: result, except for a leading slash in the *path* component, which is retained if present. For example: - >>> from urlparse import urlparse + >>> from urllib.parse import urlparse >>> o = urlparse('http://www.cwi.nl:80/%7Eguido/Python.html') >>> o # doctest: +NORMALIZE_WHITESPACE ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html', @@ -154,7 +154,7 @@ The :mod:`urlparse` module defines the following functions: particular the addressing scheme, the network location and (part of) the path, to provide missing components in the relative URL. 
For example: - >>> from urlparse import urljoin + >>> from urllib.parse import urljoin >>> urljoin('http://www.cwi.nl/%7Eguido/Python.html', 'FAQ.html') 'http://www.cwi.nl/%7Eguido/FAQ.html' @@ -183,6 +183,52 @@ The :mod:`urlparse` module defines the following functions: If there is no fragment identifier in *url*, returns *url* unmodified and an empty string. +.. function:: quote(string[, safe]) + + Replace special characters in *string* using the ``%xx`` escape. Letters, + digits, and the characters ``'_.-'`` are never quoted. The optional *safe* + parameter specifies additional characters that should not be quoted --- its + default value is ``'/'``. + + Example: ``quote('/~connolly/')`` yields ``'/%7econnolly/'``. + + +.. function:: quote_plus(string[, safe]) + + Like :func:`quote`, but also replaces spaces by plus signs, as required for + quoting HTML form values. Plus signs in the original string are escaped unless + they are included in *safe*. It also does not have *safe* default to ``'/'``. + + +.. function:: unquote(string) + + Replace ``%xx`` escapes by their single-character equivalent. + + Example: ``unquote('/%7Econnolly/')`` yields ``'/~connolly/'``. + + +.. function:: unquote_plus(string) + + Like :func:`unquote`, but also replaces plus signs by spaces, as required for + unquoting HTML form values. + + +.. function:: urlencode(query[, doseq]) + + Convert a mapping object or a sequence of two-element tuples to a "url-encoded" + string, suitable to pass to :func:`urlopen` above as the optional *data* + argument. This is useful to pass a dictionary of form fields to a ``POST`` + request. The resulting string is a series of ``key=value`` pairs separated by + ``'&'`` characters, where both *key* and *value* are quoted using + :func:`quote_plus` above. If the optional parameter *doseq* is present and + evaluates to true, individual ``key=value`` pairs are generated for each element + of the sequence. 
When a sequence of two-element tuples is used as the *query* + argument, the first element of each tuple is a key and the second is a value. + The order of parameters in the encoded string will match the order of parameter + tuples in the sequence. The :mod:`cgi` module provides the functions + :func:`parse_qs` and :func:`parse_qsl` which are used to parse query strings + into Python data structures. + .. seealso:: @@ -219,14 +265,14 @@ described in those functions, as well as provide an additional method: The result of this method is a fixpoint if passed back through the original parsing function: - >>> import urlparse + >>> import urllib.parse >>> url = 'HTTP://www.Python.org/doc/#' - >>> r1 = urlparse.urlsplit(url) + >>> r1 = urllib.parse.urlsplit(url) >>> r1.geturl() 'http://www.Python.org/doc/' - >>> r2 = urlparse.urlsplit(r1.geturl()) + >>> r2 = urllib.parse.urlsplit(r1.geturl()) >>> r2.geturl() 'http://www.Python.org/doc/' diff --git a/Doc/library/urllib2.rst b/Doc/library/urllib.request.rst index 06dbb44..4262836 100644 --- a/Doc/library/urllib2.rst +++ b/Doc/library/urllib.request.rst @@ -1,17 +1,17 @@ -:mod:`urllib2` --- extensible library for opening URLs -====================================================== +:mod:`urllib.request` --- extensible library for opening URLs +============================================================= -.. module:: urllib2 +.. module:: urllib.request :synopsis: Next generation URL opening library. .. moduleauthor:: Jeremy Hylton <jhylton@users.sourceforge.net> .. sectionauthor:: Moshe Zadka <moshez@users.sourceforge.net> -The :mod:`urllib2` module defines functions and classes which help in opening +The :mod:`urllib.request` module defines functions and classes which help in opening URLs (mostly HTTP) in a complex world --- basic and digest authentication, redirections, cookies and more. -The :mod:`urllib2` module defines the following functions: +The :mod:`urllib.request` module defines the following functions: .. 
function:: urlopen(url[, data][, timeout]) @@ -31,7 +31,8 @@ The :mod:`urllib2` module defines the following functions: timeout setting will be used). This actually only works for HTTP, HTTPS, FTP and FTPS connections. - This function returns a file-like object with two additional methods: + This function returns a file-like object with two additional methods from + the :mod:`urllib.response` module * :meth:`geturl` --- return the URL of the resource retrieved, commonly used to determine if a redirect was followed @@ -45,6 +46,11 @@ The :mod:`urllib2` module defines the following functions: Note that ``None`` may be returned if no handler handles the request (though the default installed global :class:`OpenerDirector` uses :class:`UnknownHandler` to ensure this never happens). + The urlopen function from the previous version, Python 2.6 and earlier, of + the module urllib has been discontinued as urlopen can return the + file-object as the previous. The proxy handling, which in earlier was passed + as a dict parameter to urlopen can be availed by the use of `ProxyHandler` + objects. .. function:: install_opener(opener) @@ -74,39 +80,87 @@ The :mod:`urllib2` module defines the following functions: A :class:`BaseHandler` subclass may also change its :attr:`handler_order` member variable to modify its position in the handlers list. -The following exceptions are raised as appropriate: +.. function:: urlretrieve(url[, filename[, reporthook[, data]]]) + Copy a network object denoted by a URL to a local file, if necessary. If the URL + points to a local file, or a valid cached copy of the object exists, the object + is not copied. Return a tuple ``(filename, headers)`` where *filename* is the + local file name under which the object can be found, and *headers* is whatever + the :meth:`info` method of the object returned by :func:`urlopen` returned (for + a remote object, possibly cached). Exceptions are the same as for + :func:`urlopen`. -.. 
exception:: URLError + The second argument, if present, specifies the file location to copy to (if + absent, the location will be a tempfile with a generated name). The third + argument, if present, is a hook function that will be called once on + establishment of the network connection and once after each block read + thereafter. The hook will be passed three arguments; a count of blocks + transferred so far, a block size in bytes, and the total size of the file. The + third argument may be ``-1`` on older FTP servers which do not return a file + size in response to a retrieval request. - The handlers raise this exception (or derived exceptions) when they run into a - problem. It is a subclass of :exc:`IOError`. + If the *url* uses the :file:`http:` scheme identifier, the optional *data* + argument may be given to specify a ``POST`` request (normally the request type + is ``GET``). The *data* argument must in standard + :mimetype:`application/x-www-form-urlencoded` format; see the :func:`urlencode` + function below. - .. attribute:: reason + :func:`urlretrieve` will raise :exc:`ContentTooShortError` when it detects that + the amount of data available was less than the expected amount (which is the + size reported by a *Content-Length* header). This can occur, for example, when + the download is interrupted. - The reason for this error. It can be a message string or another exception - instance (:exc:`socket.error` for remote URLs, :exc:`OSError` for local - URLs). + The *Content-Length* is treated as a lower bound: if there's more data to read, + urlretrieve reads more data, but if less data is available, it raises the + exception. + You can still retrieve the downloaded data in this case, it is stored in the + :attr:`content` attribute of the exception instance. -.. exception:: HTTPError + If no *Content-Length* header was supplied, urlretrieve can not check the size + of the data it has downloaded, and just returns it. 
In this case you just have + to assume that the download was successful. - Though being an exception (a subclass of :exc:`URLError`), an :exc:`HTTPError` - can also function as a non-exceptional file-like return value (the same thing - that :func:`urlopen` returns). This is useful when handling exotic HTTP - errors, such as requests for authentication. - .. attribute:: code +.. data:: _urlopener - An HTTP status code as defined in `RFC 2616 <http://www.faqs.org/rfcs/rfc2616.html>`_. - This numeric value corresponds to a value found in the dictionary of - codes as found in :attr:`http.server.BaseHTTPRequestHandler.responses`. + The public functions :func:`urlopen` and :func:`urlretrieve` create an instance + of the :class:`FancyURLopener` class and use it to perform their requested + actions. To override this functionality, programmers can create a subclass of + :class:`URLopener` or :class:`FancyURLopener`, then assign an instance of that + class to the ``urllib._urlopener`` variable before calling the desired function. + For example, applications may want to specify a different + :mailheader:`User-Agent` header than :class:`URLopener` defines. This can be + accomplished with the following code:: + import urllib.request + class AppURLopener(urllib.request.FancyURLopener): + version = "App/1.7" -The following classes are provided: + urllib._urlopener = AppURLopener() + + +.. function:: urlcleanup() + + Clear the cache that may have been built up by previous calls to + :func:`urlretrieve`. + +.. function:: pathname2url(path) + + Convert the pathname *path* from the local syntax for a path to the form used in + the path component of a URL. This does not produce a complete URL. The return + value will already be quoted using the :func:`quote` function. +.. function:: url2pathname(path) + + Convert the path component *path* from an encoded URL to the local syntax for a + path. This does not accept a complete URL. This function uses :func:`unquote` + to decode *path*. 
+ +The following classes are provided: + .. class:: Request(url[, data][, headers][, origin_req_host][, unverifiable]) This class is an abstraction of a URL request. @@ -145,6 +199,114 @@ The following classes are provided: an image in an HTML document, and the user had no option to approve the automatic fetching of the image, this should be true. +.. class:: URLopener([proxies[, **x509]]) + + Base class for opening and reading URLs. Unless you need to support opening + objects using schemes other than :file:`http:`, :file:`ftp:`, or :file:`file:`, + you probably want to use :class:`FancyURLopener`. + + By default, the :class:`URLopener` class sends a :mailheader:`User-Agent` header + of ``urllib/VVV``, where *VVV* is the :mod:`urllib` version number. + Applications can define their own :mailheader:`User-Agent` header by subclassing + :class:`URLopener` or :class:`FancyURLopener` and setting the class attribute + :attr:`version` to an appropriate string value in the subclass definition. + + The optional *proxies* parameter should be a dictionary mapping scheme names to + proxy URLs, where an empty dictionary turns proxies off completely. Its default + value is ``None``, in which case environmental proxy settings will be used if + present, as discussed in the definition of :func:`urlopen`, above. + + Additional keyword parameters, collected in *x509*, may be used for + authentication of the client when using the :file:`https:` scheme. The keywords + *key_file* and *cert_file* are supported to provide an SSL key and certificate; + both are needed to support client authentication. + + :class:`URLopener` objects will raise an :exc:`IOError` exception if the server + returns an error code. + + .. method:: open(fullurl[, data]) + + Open *fullurl* using the appropriate protocol. This method sets up cache and + proxy information, then calls the appropriate open method with its input + arguments. If the scheme is not recognized, :meth:`open_unknown` is called. 
+ The *data* argument has the same meaning as the *data* argument of + :func:`urlopen`. + + + .. method:: open_unknown(fullurl[, data]) + + Overridable interface to open unknown URL types. + + + .. method:: retrieve(url[, filename[, reporthook[, data]]]) + + Retrieves the contents of *url* and places it in *filename*. The return value + is a tuple consisting of a local filename and either a + :class:`email.message.Message` object containing the response headers (for remote + URLs) or ``None`` (for local URLs). The caller must then open and read the + contents of *filename*. If *filename* is not given and the URL refers to a + local file, the input filename is returned. If the URL is non-local and + *filename* is not given, the filename is the output of :func:`tempfile.mktemp` + with a suffix that matches the suffix of the last path component of the input + URL. If *reporthook* is given, it must be a function accepting three numeric + parameters. It will be called after each chunk of data is read from the + network. *reporthook* is ignored for local URLs. + + If the *url* uses the :file:`http:` scheme identifier, the optional *data* + argument may be given to specify a ``POST`` request (normally the request type + is ``GET``). The *data* argument must be in standard + :mimetype:`application/x-www-form-urlencoded` format; see the :func:`urlencode` + function below. + + + .. attribute:: version + + Variable that specifies the user agent of the opener object. To get + :mod:`urllib` to tell servers that it is a particular user agent, set this in a + subclass as a class variable or in the constructor before calling the base + constructor. + + +.. class:: FancyURLopener(...) + + :class:`FancyURLopener` subclasses :class:`URLopener` providing default handling + for the following HTTP response codes: 301, 302, 303, 307 and 401. For the 30x + response codes listed above, the :mailheader:`Location` header is used to fetch + the actual URL. 
For 401 response codes (authentication required), basic HTTP + authentication is performed. For the 30x response codes, recursion is bounded + by the value of the *maxtries* attribute, which defaults to 10. + + For all other response codes, the method :meth:`http_error_default` is called + which you can override in subclasses to handle the error appropriately. + + .. note:: + + According to the letter of :rfc:`2616`, 301 and 302 responses to POST requests + must not be automatically redirected without confirmation by the user. In + reality, browsers do allow automatic redirection of these responses, changing + the POST to a GET, and :mod:`urllib` reproduces this behaviour. + + The parameters to the constructor are the same as those for :class:`URLopener`. + + .. note:: + + When performing basic authentication, a :class:`FancyURLopener` instance calls + its :meth:`prompt_user_passwd` method. The default implementation asks the + users for the required information on the controlling terminal. A subclass may + override this method to support more appropriate behavior if needed. + + The :class:`FancyURLopener` class offers one additional method that should be + overloaded to provide the appropriate behavior: + + .. method:: prompt_user_passwd(host, realm) + + Return information needed to authenticate the user at the given host in the + specified security realm. The return value should be a tuple, ``(user, + password)``, which can be used for basic authentication. + + The implementation prompts for this information on the terminal; an application + should override this method to use an appropriate interaction model in the local + environment. .. class:: OpenerDirector() @@ -846,7 +1008,6 @@ HTTPErrorProcessor Objects Eventually, :class:`urllib2.HTTPDefaultErrorHandler` will raise an :exc:`HTTPError` if no other handler handles the error. - .. 
_urllib2-examples: Examples @@ -855,8 +1016,8 @@ Examples This example gets the python.org main page and displays the first 100 bytes of it:: - >>> import urllib2 - >>> f = urllib2.urlopen('http://www.python.org/') + >>> import urllib.request + >>> f = urllib.request.urlopen('http://www.python.org/') >>> print(f.read(100)) <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> <?xml-stylesheet href="./css/ht2html @@ -865,10 +1026,10 @@ Here we are sending a data-stream to the stdin of a CGI and reading the data it returns to us. Note that this example will only work when the Python installation supports SSL. :: - >>> import urllib2 - >>> req = urllib2.Request(url='https://localhost/cgi-bin/test.cgi', + >>> import urllib.request + >>> req = urllib.request.Request(url='https://localhost/cgi-bin/test.cgi', ... data='This data is passed to stdin of the CGI') - >>> f = urllib2.urlopen(req) + >>> f = urllib.request.urlopen(req) >>> print(f.read()) Got Data: "This data is passed to stdin of the CGI" @@ -881,17 +1042,17 @@ The code for the sample CGI used in the above example is:: Use of Basic HTTP Authentication:: - import urllib2 + import urllib.request # Create an OpenerDirector with support for Basic HTTP Authentication... - auth_handler = urllib2.HTTPBasicAuthHandler() + auth_handler = urllib.request.HTTPBasicAuthHandler() auth_handler.add_password(realm='PDQ Application', uri='https://mahler:8092/site-updates.py', user='klem', passwd='kadidd!ehopper') - opener = urllib2.build_opener(auth_handler) + opener = urllib.request.build_opener(auth_handler) # ...and install it globally so it can be used with urlopen. - urllib2.install_opener(opener) - urllib2.urlopen('http://www.example.com/login.html') + urllib.request.install_opener(opener) + urllib.request.urlopen('http://www.example.com/login.html') :func:`build_opener` provides many handlers by default, including a :class:`ProxyHandler`. 
By default, :class:`ProxyHandler` uses the environment @@ -903,8 +1064,8 @@ This example replaces the default :class:`ProxyHandler` with one that uses programmatically-supplied proxy URLs, and adds proxy authorization support with :class:`ProxyBasicAuthHandler`. :: - proxy_handler = urllib2.ProxyHandler({'http': 'http://www.example.com:3128/'}) - proxy_auth_handler = urllib2.HTTPBasicAuthHandler() + proxy_handler = urllib.request.ProxyHandler({'http': 'http://www.example.com:3128/'}) + proxy_auth_handler = urllib.request.ProxyBasicAuthHandler() proxy_auth_handler.add_password('realm', 'host', 'username', 'password') opener = urllib.request.build_opener(proxy_handler, proxy_auth_handler) @@ -915,16 +1076,16 @@ Adding HTTP headers: Use the *headers* argument to the :class:`Request` constructor, or:: - import urllib2 - req = urllib2.Request('http://www.example.com/') + import urllib.request + req = urllib.request.Request('http://www.example.com/') req.add_header('Referer', 'http://www.python.org/') - r = urllib2.urlopen(req) + r = urllib.request.urlopen(req) :class:`OpenerDirector` automatically adds a :mailheader:`User-Agent` header to every :class:`Request`. To change this:: - import urllib2 - opener = urllib2.build_opener() + import urllib.request + opener = urllib.request.build_opener() opener.addheaders = [('User-agent', 'Mozilla/5.0')] opener.open('http://www.example.com/') @@ -932,3 +1093,102 @@ Also, remember that a few standard headers (:mailheader:`Content-Length`, :mailheader:`Content-Type` and :mailheader:`Host`) are added when the :class:`Request` is passed to :func:`urlopen` (or :meth:`OpenerDirector.open`). +.. 
_urllib-examples: + +Here is an example session that uses the ``GET`` method to retrieve a URL +containing parameters:: + + >>> import urllib.request + >>> import urllib.parse + >>> params = urllib.parse.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0}) + >>> f = urllib.request.urlopen("http://www.musi-cal.com/cgi-bin/query?%s" % params) + >>> print(f.read()) + +The following example uses the ``POST`` method instead:: + + >>> import urllib.request + >>> import urllib.parse + >>> params = urllib.parse.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0}) + >>> f = urllib.request.urlopen("http://www.musi-cal.com/cgi-bin/query", params) + >>> print(f.read()) + +The following example uses an explicitly specified HTTP proxy, overriding +environment settings:: + + >>> import urllib.request + >>> proxies = {'http': 'http://proxy.example.com:8080/'} + >>> opener = urllib.request.FancyURLopener(proxies) + >>> f = opener.open("http://www.python.org") + >>> f.read() + +The following example uses no proxies at all, overriding environment settings:: + + >>> import urllib.request + >>> opener = urllib.request.FancyURLopener({}) + >>> f = opener.open("http://www.python.org/") + >>> f.read() + + +:mod:`urllib.request` Restrictions +---------------------------------- + + .. index:: + pair: HTTP; protocol + pair: FTP; protocol + +* Currently, only the following protocols are supported: HTTP, (versions 0.9 and + 1.0), FTP, and local files. + +* The caching feature of :func:`urlretrieve` has been disabled until I find the + time to hack proper processing of Expiration time headers. + +* There should be a function to query whether a particular URL is in the cache. + +* For backward compatibility, if a URL appears to point to a local file but the + file can't be opened, the URL is re-interpreted using the FTP protocol. This + can sometimes cause confusing error messages. 
+ +* The :func:`urlopen` and :func:`urlretrieve` functions can cause arbitrarily + long delays while waiting for a network connection to be set up. This means + that it is difficult to build an interactive Web client using these functions + without using threads. + + .. index:: + single: HTML + pair: HTTP; protocol + +* The data returned by :func:`urlopen` or :func:`urlretrieve` is the raw data + returned by the server. This may be binary data (such as an image), plain text + or (for example) HTML. The HTTP protocol provides type information in the reply + header, which can be inspected by looking at the :mailheader:`Content-Type` + header. If the returned data is HTML, you can use the module + :mod:`html.parser` to parse it. + + .. index:: single: FTP + +* The code handling the FTP protocol cannot differentiate between a file and a + directory. This can lead to unexpected behavior when attempting to read a URL + that points to a file that is not accessible. If the URL ends in a ``/``, it is + assumed to refer to a directory and will be handled accordingly. But if an + attempt to read a file leads to a 550 error (meaning the URL cannot be found or + is not accessible, often for permission reasons), then the path is treated as a + directory in order to handle the case when a directory is specified by a URL but + the trailing ``/`` has been left off. This can cause misleading results when + you try to fetch a file whose read permissions make it inaccessible; the FTP + code will try to read it, fail with a 550 error, and then perform a directory + listing for the unreadable file. If fine-grained control is needed, consider + using the :mod:`ftplib` module, subclassing :class:`FancyURLopener`, or changing + *_urlopener* to meet your needs. + +:mod:`urllib.response` --- Response classes used by urllib. +=========================================================== +.. module:: urllib.response + :synopsis: Response classes used by urllib. 
+ +The :mod:`urllib.response` module defines functions and classes which define a +minimal file-like interface, including read() and readline(). The typical +response object is an addinfourl instance, which defines an info() method +that returns headers and a geturl() method that returns the url. +Functions defined by this module are used internally by the +:mod:`urllib.request` module. + diff --git a/Doc/library/urllib.robotparser.rst b/Doc/library/urllib.robotparser.rst new file mode 100644 index 0000000..e351c56 --- /dev/null +++ b/Doc/library/urllib.robotparser.rst @@ -0,0 +1,73 @@ + +:mod:`urllib.robotparser` --- Parser for robots.txt +==================================================== + +.. module:: urllib.robotparser + :synopsis: Loads a robots.txt file and answers questions about + fetchability of other URLs. +.. sectionauthor:: Skip Montanaro <skip@pobox.com> + + +.. index:: + single: WWW + single: World Wide Web + single: URL + single: robots.txt + +This module provides a single class, :class:`RobotFileParser`, which answers +questions about whether or not a particular user agent can fetch a URL on the +Web site that published the :file:`robots.txt` file. For more details on the +structure of :file:`robots.txt` files, see http://www.robotstxt.org/orig.html. + + +.. class:: RobotFileParser() + + This class provides a set of methods to read, parse and answer questions + about a single :file:`robots.txt` file. + + + .. method:: set_url(url) + + Sets the URL referring to a :file:`robots.txt` file. + + + .. method:: read() + + Reads the :file:`robots.txt` URL and feeds it to the parser. + + + .. method:: parse(lines) + + Parses the lines argument. + + + .. method:: can_fetch(useragent, url) + + Returns ``True`` if the *useragent* is allowed to fetch the *url* + according to the rules contained in the parsed :file:`robots.txt` + file. + + + .. method:: mtime() + + Returns the time the ``robots.txt`` file was last fetched. 
This is + useful for long-running web spiders that need to check for new + ``robots.txt`` files periodically. + + + .. method:: modified() + + Sets the time the ``robots.txt`` file was last fetched to the current + time. + +The following example demonstrates basic use of the RobotFileParser class. :: + + >>> import urllib.robotparser + >>> rp = urllib.robotparser.RobotFileParser() + >>> rp.set_url("http://www.musi-cal.com/robots.txt") + >>> rp.read() + >>> rp.can_fetch("*", "http://www.musi-cal.com/cgi-bin/search?city=San+Francisco") + False + >>> rp.can_fetch("*", "http://www.musi-cal.com/") + True + diff --git a/Doc/library/urllib.rst b/Doc/library/urllib.rst deleted file mode 100644 index 3435e55..0000000 --- a/Doc/library/urllib.rst +++ /dev/null @@ -1,459 +0,0 @@ -:mod:`urllib` --- Open arbitrary resources by URL -================================================= - -.. module:: urllib - :synopsis: Open an arbitrary network resource by URL (requires sockets). - - -.. index:: - single: WWW - single: World Wide Web - single: URL - -This module provides a high-level interface for fetching data across the World -Wide Web. In particular, the :func:`urlopen` function is similar to the -built-in function :func:`open`, but accepts Universal Resource Locators (URLs) -instead of filenames. Some restrictions apply --- it can only open URLs for -reading, and no seek operations are available. - -High-level interface --------------------- - -.. function:: urlopen(url[, data[, proxies]]) - - Open a network object denoted by a URL for reading. If the URL does not have a - scheme identifier, or if it has :file:`file:` as its scheme identifier, this - opens a local file (without universal newlines); otherwise it opens a socket to - a server somewhere on the network. If the connection cannot be made the - :exc:`IOError` exception is raised. If all went well, a file-like object is - returned. 
This supports the following methods: :meth:`read`, :meth:`readline`, - :meth:`readlines`, :meth:`fileno`, :meth:`close`, :meth:`info`, :meth:`getcode` and - :meth:`geturl`. It also has proper support for the :term:`iterator` protocol. One - caveat: the :meth:`read` method, if the size argument is omitted or negative, - may not read until the end of the data stream; there is no good way to determine - that the entire stream from a socket has been read in the general case. - - Except for the :meth:`info`, :meth:`getcode` and :meth:`geturl` methods, - these methods have the same interface as for file objects --- see section - :ref:`bltin-file-objects` in this manual. (It is not a built-in file object, - however, so it can't be used at those few places where a true built-in file - object is required.) - - The :meth:`info` method returns an instance of the class - :class:`email.message.Message` containing meta-information associated with - the URL. When the method is HTTP, these headers are those returned by the - server at the head of the retrieved HTML page (including Content-Length and - Content-Type). When the method is FTP, a Content-Length header will be - present if (as is now usual) the server passed back a file length in response - to the FTP retrieval request. A Content-Type header will be present if the - MIME type can be guessed. When the method is local-file, returned headers - will include a Date representing the file's last-modified time, a - Content-Length giving file size, and a Content-Type containing a guess at the - file's type. - - The :meth:`geturl` method returns the real URL of the page. In some cases, the - HTTP server redirects a client to another URL. The :func:`urlopen` function - handles this transparently, but in some cases the caller needs to know which URL - the client was redirected to. The :meth:`geturl` method can be used to get at - this redirected URL. 
- - The :meth:`getcode` method returns the HTTP status code that was sent with the - response, or ``None`` if the URL is no HTTP URL. - - If the *url* uses the :file:`http:` scheme identifier, the optional *data* - argument may be given to specify a ``POST`` request (normally the request type - is ``GET``). The *data* argument must be in standard - :mimetype:`application/x-www-form-urlencoded` format; see the :func:`urlencode` - function below. - - The :func:`urlopen` function works transparently with proxies which do not - require authentication. In a Unix or Windows environment, set the - :envvar:`http_proxy`, or :envvar:`ftp_proxy` environment variables to a URL that - identifies the proxy server before starting the Python interpreter. For example - (the ``'%'`` is the command prompt):: - - % http_proxy="http://www.someproxy.com:3128" - % export http_proxy - % python - ... - - The :envvar:`no_proxy` environment variable can be used to specify hosts which - shouldn't be reached via proxy; if set, it should be a comma-separated list - of hostname suffixes, optionally with ``:port`` appended, for example - ``cern.ch,ncsa.uiuc.edu,some.host:8080``. - - In a Windows environment, if no proxy environment variables are set, proxy - settings are obtained from the registry's Internet Settings section. - - .. index:: single: Internet Config - - In a Macintosh environment, :func:`urlopen` will retrieve proxy information from - Internet Config. - - Alternatively, the optional *proxies* argument may be used to explicitly specify - proxies. It must be a dictionary mapping scheme names to proxy URLs, where an - empty dictionary causes no proxies to be used, and ``None`` (the default value) - causes environmental proxy settings to be used as discussed above. 
For - example:: - - # Use http://www.someproxy.com:3128 for http proxying - proxies = {'http': 'http://www.someproxy.com:3128'} - filehandle = urllib.urlopen(some_url, proxies=proxies) - # Don't use any proxies - filehandle = urllib.urlopen(some_url, proxies={}) - # Use proxies from environment - both versions are equivalent - filehandle = urllib.urlopen(some_url, proxies=None) - filehandle = urllib.urlopen(some_url) - - Proxies which require authentication for use are not currently supported; this - is considered an implementation limitation. - - -.. function:: urlretrieve(url[, filename[, reporthook[, data]]]) - - Copy a network object denoted by a URL to a local file, if necessary. If the URL - points to a local file, or a valid cached copy of the object exists, the object - is not copied. Return a tuple ``(filename, headers)`` where *filename* is the - local file name under which the object can be found, and *headers* is whatever - the :meth:`info` method of the object returned by :func:`urlopen` returned (for - a remote object, possibly cached). Exceptions are the same as for - :func:`urlopen`. - - The second argument, if present, specifies the file location to copy to (if - absent, the location will be a tempfile with a generated name). The third - argument, if present, is a hook function that will be called once on - establishment of the network connection and once after each block read - thereafter. The hook will be passed three arguments; a count of blocks - transferred so far, a block size in bytes, and the total size of the file. The - third argument may be ``-1`` on older FTP servers which do not return a file - size in response to a retrieval request. - - If the *url* uses the :file:`http:` scheme identifier, the optional *data* - argument may be given to specify a ``POST`` request (normally the request type - is ``GET``). 
The *data* argument must in standard - :mimetype:`application/x-www-form-urlencoded` format; see the :func:`urlencode` - function below. - - :func:`urlretrieve` will raise :exc:`ContentTooShortError` when it detects that - the amount of data available was less than the expected amount (which is the - size reported by a *Content-Length* header). This can occur, for example, when - the download is interrupted. - - The *Content-Length* is treated as a lower bound: if there's more data to read, - urlretrieve reads more data, but if less data is available, it raises the - exception. - - You can still retrieve the downloaded data in this case, it is stored in the - :attr:`content` attribute of the exception instance. - - If no *Content-Length* header was supplied, urlretrieve can not check the size - of the data it has downloaded, and just returns it. In this case you just have - to assume that the download was successful. - - -.. data:: _urlopener - - The public functions :func:`urlopen` and :func:`urlretrieve` create an instance - of the :class:`FancyURLopener` class and use it to perform their requested - actions. To override this functionality, programmers can create a subclass of - :class:`URLopener` or :class:`FancyURLopener`, then assign an instance of that - class to the ``urllib._urlopener`` variable before calling the desired function. - For example, applications may want to specify a different - :mailheader:`User-Agent` header than :class:`URLopener` defines. This can be - accomplished with the following code:: - - import urllib - - class AppURLopener(urllib.FancyURLopener): - version = "App/1.7" - - urllib._urlopener = AppURLopener() - - -.. function:: urlcleanup() - - Clear the cache that may have been built up by previous calls to - :func:`urlretrieve`. - - -Utility functions ------------------ - -.. function:: quote(string[, safe]) - - Replace special characters in *string* using the ``%xx`` escape. 
Letters, - digits, and the characters ``'_.-'`` are never quoted. The optional *safe* - parameter specifies additional characters that should not be quoted --- its - default value is ``'/'``. - - Example: ``quote('/~connolly/')`` yields ``'/%7econnolly/'``. - - -.. function:: quote_plus(string[, safe]) - - Like :func:`quote`, but also replaces spaces by plus signs, as required for - quoting HTML form values. Plus signs in the original string are escaped unless - they are included in *safe*. It also does not have *safe* default to ``'/'``. - - -.. function:: unquote(string) - - Replace ``%xx`` escapes by their single-character equivalent. - - Example: ``unquote('/%7Econnolly/')`` yields ``'/~connolly/'``. - - -.. function:: unquote_plus(string) - - Like :func:`unquote`, but also replaces plus signs by spaces, as required for - unquoting HTML form values. - - -.. function:: urlencode(query[, doseq]) - - Convert a mapping object or a sequence of two-element tuples to a "url-encoded" - string, suitable to pass to :func:`urlopen` above as the optional *data* - argument. This is useful to pass a dictionary of form fields to a ``POST`` - request. The resulting string is a series of ``key=value`` pairs separated by - ``'&'`` characters, where both *key* and *value* are quoted using - :func:`quote_plus` above. If the optional parameter *doseq* is present and - evaluates to true, individual ``key=value`` pairs are generated for each element - of the sequence. When a sequence of two-element tuples is used as the *query* - argument, the first element of each tuple is a key and the second is a value. - The order of parameters in the encoded string will match the order of parameter - tuples in the sequence. The :mod:`cgi` module provides the functions - :func:`parse_qs` and :func:`parse_qsl` which are used to parse query strings - into Python data structures. - - -.. 
function:: pathname2url(path) - - Convert the pathname *path* from the local syntax for a path to the form used in - the path component of a URL. This does not produce a complete URL. The return - value will already be quoted using the :func:`quote` function. - - -.. function:: url2pathname(path) - - Convert the path component *path* from an encoded URL to the local syntax for a - path. This does not accept a complete URL. This function uses :func:`unquote` - to decode *path*. - - -URL Opener objects ------------------- - -.. class:: URLopener([proxies[, **x509]]) - - Base class for opening and reading URLs. Unless you need to support opening - objects using schemes other than :file:`http:`, :file:`ftp:`, or :file:`file:`, - you probably want to use :class:`FancyURLopener`. - - By default, the :class:`URLopener` class sends a :mailheader:`User-Agent` header - of ``urllib/VVV``, where *VVV* is the :mod:`urllib` version number. - Applications can define their own :mailheader:`User-Agent` header by subclassing - :class:`URLopener` or :class:`FancyURLopener` and setting the class attribute - :attr:`version` to an appropriate string value in the subclass definition. - - The optional *proxies* parameter should be a dictionary mapping scheme names to - proxy URLs, where an empty dictionary turns proxies off completely. Its default - value is ``None``, in which case environmental proxy settings will be used if - present, as discussed in the definition of :func:`urlopen`, above. - - Additional keyword parameters, collected in *x509*, may be used for - authentication of the client when using the :file:`https:` scheme. The keywords - *key_file* and *cert_file* are supported to provide an SSL key and certificate; - both are needed to support client authentication. - - :class:`URLopener` objects will raise an :exc:`IOError` exception if the server - returns an error code. - - .. method:: open(fullurl[, data]) - - Open *fullurl* using the appropriate protocol. 
This method sets up cache and - proxy information, then calls the appropriate open method with its input - arguments. If the scheme is not recognized, :meth:`open_unknown` is called. - The *data* argument has the same meaning as the *data* argument of - :func:`urlopen`. - - - .. method:: open_unknown(fullurl[, data]) - - Overridable interface to open unknown URL types. - - - .. method:: retrieve(url[, filename[, reporthook[, data]]]) - - Retrieves the contents of *url* and places it in *filename*. The return value - is a tuple consisting of a local filename and either a - :class:`email.message.Message` object containing the response headers (for remote - URLs) or ``None`` (for local URLs). The caller must then open and read the - contents of *filename*. If *filename* is not given and the URL refers to a - local file, the input filename is returned. If the URL is non-local and - *filename* is not given, the filename is the output of :func:`tempfile.mktemp` - with a suffix that matches the suffix of the last path component of the input - URL. If *reporthook* is given, it must be a function accepting three numeric - parameters. It will be called after each chunk of data is read from the - network. *reporthook* is ignored for local URLs. - - If the *url* uses the :file:`http:` scheme identifier, the optional *data* - argument may be given to specify a ``POST`` request (normally the request type - is ``GET``). The *data* argument must in standard - :mimetype:`application/x-www-form-urlencoded` format; see the :func:`urlencode` - function below. - - - .. attribute:: version - - Variable that specifies the user agent of the opener object. To get - :mod:`urllib` to tell servers that it is a particular user agent, set this in a - subclass as a class variable or in the constructor before calling the base - constructor. - - -.. class:: FancyURLopener(...) 
-
-   :class:`FancyURLopener` subclasses :class:`URLopener` providing default handling
-   for the following HTTP response codes: 301, 302, 303, 307 and 401. For the 30x
-   response codes listed above, the :mailheader:`Location` header is used to fetch
-   the actual URL. For 401 response codes (authentication required), basic HTTP
-   authentication is performed. For the 30x response codes, recursion is bounded
-   by the value of the *maxtries* attribute, which defaults to 10.
-
-   For all other response codes, the method :meth:`http_error_default` is called
-   which you can override in subclasses to handle the error appropriately.
-
-   .. note::
-
-      According to the letter of :rfc:`2616`, 301 and 302 responses to POST requests
-      must not be automatically redirected without confirmation by the user. In
-      reality, browsers do allow automatic redirection of these responses, changing
-      the POST to a GET, and :mod:`urllib` reproduces this behaviour.
-
-   The parameters to the constructor are the same as those for :class:`URLopener`.
-
-   .. note::
-
-      When performing basic authentication, a :class:`FancyURLopener` instance calls
-      its :meth:`prompt_user_passwd` method. The default implementation asks the
-      users for the required information on the controlling terminal. A subclass may
-      override this method to support more appropriate behavior if needed.
-
-   The :class:`FancyURLopener` class offers one additional method that should be
-   overloaded to provide the appropriate behavior:
-
-   .. method:: prompt_user_passwd(host, realm)
-
-      Return information needed to authenticate the user at the given host in the
-      specified security realm. The return value should be a tuple, ``(user,
-      password)``, which can be used for basic authentication.
-
-      The implementation prompts for this information on the terminal; an application
-      should override this method to use an appropriate interaction model in the local
-      environment.
-
-.. exception:: ContentTooShortError(msg[, content])
-
-   This exception is raised when the :func:`urlretrieve` function detects that the
-   amount of the downloaded data is less than the expected amount (given by the
-   *Content-Length* header). The :attr:`content` attribute stores the downloaded
-   (and supposedly truncated) data.
-
-
-:mod:`urllib` Restrictions
---------------------------
-
-   .. index::
-      pair: HTTP; protocol
-      pair: FTP; protocol
-
-* Currently, only the following protocols are supported: HTTP, (versions 0.9 and
-  1.0), FTP, and local files.
-
-* The caching feature of :func:`urlretrieve` has been disabled until I find the
-  time to hack proper processing of Expiration time headers.
-
-* There should be a function to query whether a particular URL is in the cache.
-
-* For backward compatibility, if a URL appears to point to a local file but the
-  file can't be opened, the URL is re-interpreted using the FTP protocol. This
-  can sometimes cause confusing error messages.
-
-* The :func:`urlopen` and :func:`urlretrieve` functions can cause arbitrarily
-  long delays while waiting for a network connection to be set up. This means
-  that it is difficult to build an interactive Web client using these functions
-  without using threads.
-
-   .. index::
-      single: HTML
-      pair: HTTP; protocol
-
-* The data returned by :func:`urlopen` or :func:`urlretrieve` is the raw data
-  returned by the server. This may be binary data (such as an image), plain text
-  or (for example) HTML. The HTTP protocol provides type information in the reply
-  header, which can be inspected by looking at the :mailheader:`Content-Type`
-  header. If the returned data is HTML, you can use the module
-  :mod:`html.parser` to parse it.
-
-   .. index:: single: FTP
-
-* The code handling the FTP protocol cannot differentiate between a file and a
-  directory. This can lead to unexpected behavior when attempting to read a URL
-  that points to a file that is not accessible. If the URL ends in a ``/``, it is
-  assumed to refer to a directory and will be handled accordingly. But if an
-  attempt to read a file leads to a 550 error (meaning the URL cannot be found or
-  is not accessible, often for permission reasons), then the path is treated as a
-  directory in order to handle the case when a directory is specified by a URL but
-  the trailing ``/`` has been left off. This can cause misleading results when
-  you try to fetch a file whose read permissions make it inaccessible; the FTP
-  code will try to read it, fail with a 550 error, and then perform a directory
-  listing for the unreadable file. If fine-grained control is needed, consider
-  using the :mod:`ftplib` module, subclassing :class:`FancyURLOpener`, or changing
-  *_urlopener* to meet your needs.
-
-* This module does not support the use of proxies which require authentication.
-  This may be implemented in the future.
-
-   .. index:: module: urlparse
-
-* Although the :mod:`urllib` module contains (undocumented) routines to parse
-  and unparse URL strings, the recommended interface for URL manipulation is in
-  module :mod:`urlparse`.
-
-
-.. _urllib-examples:
-
-Examples
---------
-
-Here is an example session that uses the ``GET`` method to retrieve a URL
-containing parameters::
-
-   >>> import urllib
-   >>> params = urllib.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
-   >>> f = urllib.urlopen("http://www.musi-cal.com/cgi-bin/query?%s" % params)
-   >>> print(f.read())
-
-The following example uses the ``POST`` method instead::
-
-   >>> import urllib
-   >>> params = urllib.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
-   >>> f = urllib.urlopen("http://www.musi-cal.com/cgi-bin/query", params)
-   >>> print(f.read())
-
-The following example uses an explicitly specified HTTP proxy, overriding
-environment settings::
-
-   >>> import urllib
-   >>> proxies = {'http': 'http://proxy.example.com:8080/'}
-   >>> opener = urllib.FancyURLopener(proxies)
-   >>> f = opener.open("http://www.python.org")
-   >>> f.read()
-
-The following example uses no proxies at all, overriding environment settings::
-
-   >>> import urllib
-   >>> opener = urllib.FancyURLopener({})
-   >>> f = opener.open("http://www.python.org/")
-   >>> f.read()
-
diff --git a/Doc/tutorial/stdlib.rst b/Doc/tutorial/stdlib.rst
index 66e73a9..b0c6e8e 100644
--- a/Doc/tutorial/stdlib.rst
+++ b/Doc/tutorial/stdlib.rst
@@ -147,11 +147,11 @@ Internet Access
 ===============

 There are a number of modules for accessing the internet and processing internet
-protocols. Two of the simplest are :mod:`urllib2` for retrieving data from urls
-and :mod:`smtplib` for sending mail::
+protocols. Two of the simplest are :mod:`urllib.request` for retrieving data
+from urls and :mod:`smtplib` for sending mail::

-   >>> import urllib2
-   >>> for line in urllib2.urlopen('http://tycho.usno.navy.mil/cgi-bin/timer.pl'):
+   >>> import urllib.request
+   >>> for line in urllib.request.urlopen('http://tycho.usno.navy.mil/cgi-bin/timer.pl'):
    ...     if 'EST' in line or 'EDT' in line:  # look for Eastern Time
    ...         print(line)
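
Reader's note, not part of the commit: the GET, POST and proxy examples removed from ``Doc/library/urllib.rst`` above can probably be approximated with the renamed modules as sketched below. This assumes ``urllib.parse.urlencode``, ``urllib.request.Request``, ``urllib.request.ProxyHandler`` and ``urllib.request.build_opener`` as the Python 3 counterparts. The helper names and the ``example.com`` URL are hypothetical, and the code only builds request and opener objects without touching the network.

```python
import urllib.parse
import urllib.request

# Hypothetical endpoint used only for illustration; nothing is fetched.
BASE_URL = "http://www.example.com/cgi-bin/query"


def build_get_request(params):
    """Return a Request whose query string carries the encoded parameters."""
    query = urllib.parse.urlencode(params)
    return urllib.request.Request(BASE_URL + "?" + query)


def build_post_request(params):
    """Return a Request whose body carries the encoded parameters."""
    # In Python 3 the request body must be bytes, so the encoded string
    # is converted explicitly before it is attached to the request.
    body = urllib.parse.urlencode(params).encode("ascii")
    return urllib.request.Request(BASE_URL, data=body)


def build_opener_with_proxies(proxies):
    """Return an opener that uses an explicit proxy mapping.

    An empty dict disables proxies picked up from the environment.
    """
    return urllib.request.build_opener(urllib.request.ProxyHandler(proxies))


params = {"spam": 1, "eggs": 2, "bacon": 0}
get_req = build_get_request(params)
post_req = build_post_request(params)
no_proxy_opener = build_opener_with_proxies({})

# A request without a body defaults to GET; attaching data makes it POST.
print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.data)
```

Passing ``get_req`` or ``post_req`` to ``no_proxy_opener.open()`` would perform the actual fetch; that step is left out here so the sketch runs offline.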