From aca8fd7a9dc96143e592076fab4d89cc1691d03f Mon Sep 17 00:00:00 2001 From: Senthil Kumaran Date: Mon, 23 Jun 2008 04:41:59 +0000 Subject: Documentation updates for urllib package. Modified the documentation for the urllib,urllib2 -> urllib.request,urllib.error urlparse -> urllib.parse RobotParser -> urllib.robotparser Updated tutorial references and other module references (http.client.rst, ftplib.rst,contextlib.rst) Updated the examples in the urllib2-howto Addresses Issue3142. --- Doc/howto/urllib2.rst | 135 ++-- Doc/library/contextlib.rst | 4 +- Doc/library/fileformats.rst | 1 - Doc/library/ftplib.rst | 6 +- Doc/library/http.client.rst | 9 +- Doc/library/internet.rst | 7 +- Doc/library/urllib.error.rst | 48 ++ Doc/library/urllib.parse.rst | 301 +++++++++ Doc/library/urllib.request.rst | 1194 ++++++++++++++++++++++++++++++++++++ Doc/library/urllib.robotparser.rst | 73 +++ Doc/library/urllib.rst | 459 -------------- Doc/library/urllib2.rst | 934 ---------------------------- Doc/library/urlparse.rst | 255 -------- Doc/tutorial/stdlib.rst | 8 +- 14 files changed, 1703 insertions(+), 1731 deletions(-) create mode 100644 Doc/library/urllib.error.rst create mode 100644 Doc/library/urllib.parse.rst create mode 100644 Doc/library/urllib.request.rst create mode 100644 Doc/library/urllib.robotparser.rst delete mode 100644 Doc/library/urllib.rst delete mode 100644 Doc/library/urllib2.rst delete mode 100644 Doc/library/urlparse.rst diff --git a/Doc/howto/urllib2.rst b/Doc/howto/urllib2.rst index 0940d82..6342b6e 100644 --- a/Doc/howto/urllib2.rst +++ b/Doc/howto/urllib2.rst @@ -1,6 +1,6 @@ -************************************************ - HOWTO Fetch Internet Resources Using urllib2 -************************************************ +***************************************************** + HOWTO Fetch Internet Resources Using urllib package +***************************************************** :Author: `Michael Foord `_ @@ -24,14 +24,14 @@ Introduction A tutorial on 
*Basic Authentication*, with examples in Python. -**urllib2** is a `Python `_ module for fetching URLs +**urllib.request** is a `Python `_ module for fetching URLs (Uniform Resource Locators). It offers a very simple interface, in the form of the *urlopen* function. This is capable of fetching URLs using a variety of different protocols. It also offers a slightly more complex interface for handling common situations - like basic authentication, cookies, proxies and so on. These are provided by objects called handlers and openers. -urllib2 supports fetching URLs for many "URL schemes" (identified by the string +urllib.request supports fetching URLs for many "URL schemes" (identified by the string before the ":" in URL - for example "ftp" is the URL scheme of "ftp://python.org/") using their associated network protocols (e.g. FTP, HTTP). This tutorial focuses on the most common case, HTTP. @@ -40,43 +40,43 @@ For straightforward situations *urlopen* is very easy to use. But as soon as you encounter errors or non-trivial cases when opening HTTP URLs, you will need some understanding of the HyperText Transfer Protocol. The most comprehensive and authoritative reference to HTTP is :rfc:`2616`. This is a technical document and -not intended to be easy to read. This HOWTO aims to illustrate using *urllib2*, +not intended to be easy to read. This HOWTO aims to illustrate using *urllib*, with enough detail about HTTP to help you through. It is not intended to replace -the :mod:`urllib2` docs, but is supplementary to them. +the :mod:`urllib.request` docs, but is supplementary to them. 
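As a quick, connection-free illustration of the interface this HOWTO describes, a ``Request`` object can be built and inspected without opening any network connection. This is a minimal sketch with a current Python 3; the URL and the ``User-Agent`` value are placeholders, not real endpoints:

```python
import urllib.request

# Build a Request without touching the network; the URL is a placeholder.
req = urllib.request.Request('http://www.example.com/index.html',
                             headers={'User-Agent': 'Python-urllib-demo'})

print(req.get_full_url())   # the URL the request targets
print(req.get_method())     # 'GET', since no data was supplied

# Header names are normalized with str.capitalize() internally,
# so the stored key is 'User-agent'.
print(req.get_header('User-agent'))
```

Passing a ``data`` argument to the same constructor would switch ``get_method()`` to ``'POST'``, which is the distinction the sections below rely on.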
Fetching URLs ============= -The simplest way to use urllib2 is as follows:: +The simplest way to use urllib.request is as follows:: - import urllib2 - response = urllib2.urlopen('http://python.org/') + import urllib.request + response = urllib.request.urlopen('http://python.org/') html = response.read() -Many uses of urllib2 will be that simple (note that instead of an 'http:' URL we +Many uses of urllib will be that simple (note that instead of an 'http:' URL we could have used an URL starting with 'ftp:', 'file:', etc.). However, it's the purpose of this tutorial to explain the more complicated cases, concentrating on HTTP. HTTP is based on requests and responses - the client makes requests and servers -send responses. urllib2 mirrors this with a ``Request`` object which represents +send responses. urllib.request mirrors this with a ``Request`` object which represents the HTTP request you are making. In its simplest form you create a Request object that specifies the URL you want to fetch. Calling ``urlopen`` with this Request object returns a response object for the URL requested. This response is a file-like object, which means you can for example call ``.read()`` on the response:: - import urllib2 + import urllib.request - req = urllib2.Request('http://www.voidspace.org.uk') - response = urllib2.urlopen(req) + req = urllib.request.Request('http://www.voidspace.org.uk') + response = urllib.request.urlopen(req) the_page = response.read() -Note that urllib2 makes use of the same Request interface to handle all URL +Note that urllib.request makes use of the same Request interface to handle all URL schemes. For example, you can make an FTP request like so:: - req = urllib2.Request('ftp://example.com/') + req = urllib.request.Request('ftp://example.com/') In the case of HTTP, there are two extra things that Request objects allow you to do: First, you can pass data to be sent to the server. 
Second, you can pass @@ -94,20 +94,20 @@ your browser does when you submit a HTML form that you filled in on the web. Not all POSTs have to come from forms: you can use a POST to transmit arbitrary data to your own application. In the common case of HTML forms, the data needs to be encoded in a standard way, and then passed to the Request object as the ``data`` -argument. The encoding is done using a function from the ``urllib`` library -*not* from ``urllib2``. :: +argument. The encoding is done using a function from the ``urllib.parse`` library +*not* from ``urllib.request``. :: - import urllib - import urllib2 + import urllib.parse + import urllib.request url = 'http://www.someserver.com/cgi-bin/register.cgi' values = {'name' : 'Michael Foord', 'location' : 'Northampton', 'language' : 'Python' } - data = urllib.urlencode(values) - req = urllib2.Request(url, data) - response = urllib2.urlopen(req) + data = urllib.parse.urlencode(values) + req = urllib.request.Request(url, data) + response = urllib.request.urlopen(req) the_page = response.read() Note that other encodings are sometimes required (e.g. for file upload from HTML @@ -115,7 +115,7 @@ forms - see `HTML Specification, Form Submission `_ for more details). -If you do not pass the ``data`` argument, urllib2 uses a **GET** request. One +If you do not pass the ``data`` argument, urllib.request uses a **GET** request. One way in which GET and POST requests differ is that POST requests often have "side-effects": they change the state of the system in some way (for example by placing an order with the website for a hundredweight of tinned spam to be @@ -127,18 +127,18 @@ GET request by encoding it in the URL itself. 
This is done as follows:: - >>> import urllib2 - >>> import urllib + >>> import urllib.request + >>> import urllib.parse >>> data = {} >>> data['name'] = 'Somebody Here' >>> data['location'] = 'Northampton' >>> data['language'] = 'Python' - >>> url_values = urllib.urlencode(data) + >>> url_values = urllib.parse.urlencode(data) >>> print(url_values) name=Somebody+Here&language=Python&location=Northampton >>> url = 'http://www.example.com/example.cgi' >>> full_url = url + '?' + url_values - >>> data = urllib2.open(full_url) + >>> data = urllib.request.urlopen(full_url) Notice that the full URL is created by adding a ``?`` to the URL, followed by the encoded values. @@ -150,7 +150,7 @@ We'll discuss here one particular HTTP header, to illustrate how to add headers to your HTTP request. Some websites [#]_ dislike being browsed by programs, or send different versions -to different browsers [#]_ . By default urllib2 identifies itself as +to different browsers [#]_ . By default urllib identifies itself as ``Python-urllib/x.y`` (where ``x`` and ``y`` are the major and minor version numbers of the Python release, e.g. ``Python-urllib/2.5``), which may confuse the site, or just plain @@ -160,8 +160,8 @@ pass a dictionary of headers in. The following example makes the same request as above, but identifies itself as a version of Internet Explorer [#]_. :: - import urllib - import urllib2 + import urllib.parse + import urllib.request url = 'http://www.someserver.com/cgi-bin/register.cgi' user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)' @@ -170,9 +170,9 @@ Explorer [#]_. :: 'language' : 'Python' } headers = { 'User-Agent' : user_agent } - data = urllib.urlencode(values) - req = urllib2.Request(url, data, headers) - response = urllib2.urlopen(req) + data = urllib.parse.urlencode(values) + req = urllib.request.Request(url, data, headers) + response = urllib.request.urlopen(req) the_page = response.read() The response also has two useful methods.
See the section on `info and geturl`_ @@ -182,7 +182,7 @@ which comes after we have a look at what happens when things go wrong. Handling Exceptions =================== -*urlopen* raises ``URLError`` when it cannot handle a response (though as usual +*urlopen* raises ``URLError`` (defined in the :mod:`urllib.error` module) when it cannot handle a response (though as usual with Python APIs, builtin exceptions such as ValueError, TypeError etc. may also be raised). @@ -199,9 +199,9 @@ error code and a text error message. e.g. :: - >>> req = urllib2.Request('http://www.pretend_server.org') - >>> try: urllib2.urlopen(req) - >>> except URLError, e: + >>> req = urllib.request.Request('http://www.pretend_server.org') + >>> try: urllib.request.urlopen(req) + >>> except urllib.error.URLError as e: >>> print(e.reason) >>> (4, 'getaddrinfo failed') @@ -214,7 +214,7 @@ Every HTTP response from the server contains a numeric "status code". Sometimes the status code indicates that the server is unable to fulfil the request. The default handlers will handle some of these responses for you (for example, if the response is a "redirection" that requests the client fetch the document from -a different URL, urllib2 will handle that for you). For those it can't handle, +a different URL, urllib.request will handle that for you). For those it can't handle, urlopen will raise an ``HTTPError``. Typical errors include '404' (page not found), '403' (request forbidden), and '401' (authentication required). @@ -305,12 +305,12 @@ dictionary is reproduced here for convenience :: When an error is raised the server responds by returning an HTTP error code *and* an error page. You can use the ``HTTPError`` instance as a response on the page returned. This means that as well as the code attribute, it also has read, -geturl, and info, methods.
:: +geturl, and info, methods as returned by the ``urllib.response`` module:: - >>> req = urllib2.Request('http://www.python.org/fish.html') + >>> req = urllib.request.Request('http://www.python.org/fish.html') >>> try: - >>> urllib2.urlopen(req) - >>> except URLError, e: + >>> urllib.request.urlopen(req) + >>> except urllib.error.URLError as e: >>> print(e.code) >>> print(e.read()) >>> @@ -334,7 +334,8 @@ Number 1 :: - from urllib2 import Request, urlopen, URLError, HTTPError + from urllib.request import Request, urlopen + from urllib.error import URLError, HTTPError req = Request(someurl) try: response = urlopen(req) @@ -358,7 +359,8 @@ Number 2 :: - from urllib2 import Request, urlopen, URLError + from urllib.request import Request, urlopen + from urllib.error import URLError req = Request(someurl) try: response = urlopen(req) @@ -377,7 +379,8 @@ info and geturl =============== The response returned by urlopen (or the ``HTTPError`` instance) has two useful -methods ``info`` and ``geturl``. +methods ``info`` and ``geturl``; the response object itself is defined in the +module ``urllib.response``. **geturl** - this returns the real URL of the page fetched. This is useful because ``urlopen`` (or the opener object used) may have followed a @@ -397,7 +400,7 @@ Openers and Handlers ==================== When you fetch a URL you use an opener (an instance of the perhaps -confusingly-named :class:`urllib2.OpenerDirector`). Normally we have been using +confusingly-named :class:`urllib.request.OpenerDirector`). Normally we have been using the default opener - via ``urlopen`` - but you can create custom openers. Openers use handlers. All the "heavy lifting" is done by the handlers. Each handler knows how to open URLs for a particular URL scheme (http, @@ -466,24 +469,24 @@ The top-level URL is the first URL that requires authentication. URLs "deeper" than the URL you pass to .add_password() will also match.
:: # create a password manager - password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm() + password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm() # Add the username and password. # If we knew the realm, we could use it instead of ``None``. top_level_url = "http://example.com/foo/" password_mgr.add_password(None, top_level_url, username, password) - handler = urllib2.HTTPBasicAuthHandler(password_mgr) + handler = urllib.request.HTTPBasicAuthHandler(password_mgr) # create "opener" (OpenerDirector instance) - opener = urllib2.build_opener(handler) + opener = urllib.request.build_opener(handler) # use the opener to fetch a URL opener.open(a_url) # Install the opener. - # Now all calls to urllib2.urlopen use our opener. - urllib2.install_opener(opener) + # Now all calls to urllib.request.urlopen use our opener. + urllib.request.install_opener(opener) .. note:: @@ -505,46 +508,46 @@ not correct. Proxies ======= -**urllib2** will auto-detect your proxy settings and use those. This is through +**urllib.request** will auto-detect your proxy settings and use those. This is through the ``ProxyHandler`` which is part of the normal handler chain. Normally that's a good thing, but there are occasions when it may not be helpful [#]_. One way to do this is to setup our own ``ProxyHandler``, with no proxies defined. This is done using similar steps to setting up a `Basic Authentication`_ handler : :: - >>> proxy_support = urllib2.ProxyHandler({}) - >>> opener = urllib2.build_opener(proxy_support) - >>> urllib2.install_opener(opener) + >>> proxy_support = urllib.request.ProxyHandler({}) + >>> opener = urllib.request.build_opener(proxy_support) + >>> urllib.request.install_opener(opener) .. note:: - Currently ``urllib2`` *does not* support fetching of ``https`` locations - through a proxy. However, this can be enabled by extending urllib2 as + Currently ``urllib.request`` *does not* support fetching of ``https`` locations + through a proxy. 
However, this can be enabled by extending urllib.request as shown in the recipe [#]_. Sockets and Layers ================== -The Python support for fetching resources from the web is layered. urllib2 uses -the http.client library, which in turn uses the socket library. +The Python support for fetching resources from the web is layered. +urllib.request uses the http.client library, which in turn uses the socket library. As of Python 2.3 you can specify how long a socket should wait for a response before timing out. This can be useful in applications which have to fetch web pages. By default the socket module has *no timeout* and can hang. Currently, -the socket timeout is not exposed at the http.client or urllib2 levels. +the socket timeout is not exposed at the http.client or urllib.request levels. However, you can set the default timeout globally for all sockets using :: import socket - import urllib2 + import urllib.request # timeout in seconds timeout = 10 socket.setdefaulttimeout(timeout) - # this call to urllib2.urlopen now uses the default timeout + # this call to urllib.request.urlopen now uses the default timeout # we have set in the socket module - req = urllib2.Request('http://www.voidspace.org.uk') - response = urllib2.urlopen(req) + req = urllib.request.Request('http://www.voidspace.org.uk') + response = urllib.request.urlopen(req) ------- diff --git a/Doc/library/contextlib.rst b/Doc/library/contextlib.rst index 54d2a19..2cd97c2 100644 --- a/Doc/library/contextlib.rst +++ b/Doc/library/contextlib.rst @@ -98,9 +98,9 @@ Functions provided: And lets you write code like this:: from contextlib import closing - import urllib + import urllib.request - with closing(urllib.urlopen('http://www.python.org')) as page: + with closing(urllib.request.urlopen('http://www.python.org')) as page: for line in page: print(line) diff --git a/Doc/library/fileformats.rst b/Doc/library/fileformats.rst index d2f0639..dc2e237 100644 --- a/Doc/library/fileformats.rst +++ 
b/Doc/library/fileformats.rst @@ -13,7 +13,6 @@ that aren't markup languages or are related to e-mail. csv.rst configparser.rst - robotparser.rst netrc.rst xdrlib.rst plistlib.rst diff --git a/Doc/library/ftplib.rst b/Doc/library/ftplib.rst index 8a35a40..f360c60 100644 --- a/Doc/library/ftplib.rst +++ b/Doc/library/ftplib.rst @@ -13,9 +13,9 @@ This module defines the class :class:`FTP` and a few related items. The :class:`FTP` class implements the client side of the FTP protocol. You can use this to write Python programs that perform a variety of automated FTP jobs, such -as mirroring other ftp servers. It is also used by the module :mod:`urllib` to -handle URLs that use FTP. For more information on FTP (File Transfer Protocol), -see Internet :rfc:`959`. +as mirroring other ftp servers. It is also used by the module +:mod:`urllib.request` to handle URLs that use FTP. For more information on FTP +(File Transfer Protocol), see Internet :rfc:`959`. Here's a sample session using the :mod:`ftplib` module:: diff --git a/Doc/library/http.client.rst b/Doc/library/http.client.rst index 8138467..1ea3576 100644 --- a/Doc/library/http.client.rst +++ b/Doc/library/http.client.rst @@ -9,10 +9,11 @@ pair: HTTP; protocol single: HTTP; http.client (standard module) -.. index:: module: urllib +.. index:: module: urllib.request This module defines classes which implement the client side of the HTTP and -HTTPS protocols. It is normally not used directly --- the module :mod:`urllib` +HTTPS protocols. It is normally not used directly --- the module +:mod:`urllib.request` uses it to handle URLs that use HTTP and HTTPS. .. 
note:: @@ -484,8 +485,8 @@ Here is an example session that uses the ``GET`` method:: Here is an example session that shows how to ``POST`` requests:: - >>> import http.client, urllib - >>> params = urllib.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0}) + >>> import http.client, urllib.parse + >>> params = urllib.parse.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0}) >>> headers = {"Content-type": "application/x-www-form-urlencoded", ... "Accept": "text/plain"} >>> conn = http.client.HTTPConnection("musi-cal.mojam.com:80") diff --git a/Doc/library/internet.rst b/Doc/library/internet.rst index 948a0b2..a676a66 100644 --- a/Doc/library/internet.rst +++ b/Doc/library/internet.rst @@ -24,8 +24,10 @@ is currently supported on most popular platforms. Here is an overview: cgi.rst cgitb.rst wsgiref.rst - urllib.rst - urllib2.rst + urllib.request.rst + urllib.parse.rst + urllib.error.rst + urllib.robotparser.rst http.client.rst ftplib.rst poplib.rst @@ -35,7 +37,6 @@ is currently supported on most popular platforms. Here is an overview: smtpd.rst telnetlib.rst uuid.rst - urlparse.rst socketserver.rst http.server.rst http.cookies.rst diff --git a/Doc/library/urllib.error.rst b/Doc/library/urllib.error.rst new file mode 100644 index 0000000..1cbfe7d --- /dev/null +++ b/Doc/library/urllib.error.rst @@ -0,0 +1,48 @@ +:mod:`urllib.error` --- Exception classes raised by urllib.request +================================================================== + +.. module:: urllib.error + :synopsis: Exception classes raised by urllib.request. +.. moduleauthor:: Jeremy Hylton +.. sectionauthor:: Senthil Kumaran + + +The :mod:`urllib.error` module defines the exception classes raised by +urllib.request. The base exception class is URLError, which inherits from +IOError. + +The following exceptions are raised by :mod:`urllib.error` as appropriate: + + +.. exception:: URLError + + The handlers raise this exception (or derived exceptions) when they run into a + problem. It is a subclass of :exc:`IOError`.
+ + .. attribute:: reason + + The reason for this error. It can be a message string or another exception + instance (:exc:`socket.error` for remote URLs, :exc:`OSError` for local + URLs). + + +.. exception:: HTTPError + + Though being an exception (a subclass of :exc:`URLError`), an :exc:`HTTPError` + can also function as a non-exceptional file-like return value (the same thing + that :func:`urlopen` returns). This is useful when handling exotic HTTP + errors, such as requests for authentication. + + .. attribute:: code + + An HTTP status code as defined in `RFC 2616 `_. + This numeric value corresponds to a value found in the dictionary of + codes as found in :attr:`http.server.BaseHTTPRequestHandler.responses`. + +.. exception:: ContentTooShortError(msg[, content]) + + This exception is raised when the :func:`urlretrieve` function detects that the + amount of the downloaded data is less than the expected amount (given by the + *Content-Length* header). The :attr:`content` attribute stores the downloaded + (and supposedly truncated) data. + diff --git a/Doc/library/urllib.parse.rst b/Doc/library/urllib.parse.rst new file mode 100644 index 0000000..affa406 --- /dev/null +++ b/Doc/library/urllib.parse.rst @@ -0,0 +1,301 @@ +:mod:`urllib.parse` --- Parse URLs into components +================================================== + +.. module:: urllib.parse + :synopsis: Parse URLs into or assemble them from components. + + +.. index:: + single: WWW + single: World Wide Web + single: URL + pair: URL; parsing + pair: relative; URL + +This module defines a standard interface to break Uniform Resource Locator (URL) +strings up in components (addressing scheme, network location, path etc.), to +combine the components back into a URL string, and to convert a "relative URL" +to an absolute URL given a "base URL." + +The module has been designed to match the Internet RFC on Relative Uniform +Resource Locators (and discovered a bug in an earlier draft!). 
It supports the +following URL schemes: ``file``, ``ftp``, ``gopher``, ``hdl``, ``http``, +``https``, ``imap``, ``mailto``, ``mms``, ``news``, ``nntp``, ``prospero``, +``rsync``, ``rtsp``, ``rtspu``, ``sftp``, ``shttp``, ``sip``, ``sips``, +``snews``, ``svn``, ``svn+ssh``, ``telnet``, ``wais``. + +The :mod:`urllib.parse` module defines the following functions: + + +.. function:: urlparse(urlstring[, default_scheme[, allow_fragments]]) + + Parse a URL into six components, returning a 6-tuple. This corresponds to the + general structure of a URL: ``scheme://netloc/path;parameters?query#fragment``. + Each tuple item is a string, possibly empty. The components are not broken up in + smaller parts (for example, the network location is a single string), and % + escapes are not expanded. The delimiters as shown above are not part of the + result, except for a leading slash in the *path* component, which is retained if + present. For example: + + >>> from urllib.parse import urlparse + >>> o = urlparse('http://www.cwi.nl:80/%7Eguido/Python.html') + >>> o # doctest: +NORMALIZE_WHITESPACE + ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html', + params='', query='', fragment='') + >>> o.scheme + 'http' + >>> o.port + 80 + >>> o.geturl() + 'http://www.cwi.nl:80/%7Eguido/Python.html' + + If the *default_scheme* argument is specified, it gives the default addressing + scheme, to be used only if the URL does not specify one. The default value for + this argument is the empty string. + + If the *allow_fragments* argument is false, fragment identifiers are not + allowed, even if the URL's addressing scheme normally does support them. The + default value for this argument is :const:`True`. + + The return value is actually an instance of a subclass of :class:`tuple`. 
This + class has the following additional read-only convenience attributes: + + +------------------+-------+--------------------------+----------------------+ + | Attribute | Index | Value | Value if not present | + +==================+=======+==========================+======================+ + | :attr:`scheme` | 0 | URL scheme specifier | empty string | + +------------------+-------+--------------------------+----------------------+ + | :attr:`netloc` | 1 | Network location part | empty string | + +------------------+-------+--------------------------+----------------------+ + | :attr:`path` | 2 | Hierarchical path | empty string | + +------------------+-------+--------------------------+----------------------+ + | :attr:`params` | 3 | Parameters for last path | empty string | + | | | element | | + +------------------+-------+--------------------------+----------------------+ + | :attr:`query` | 4 | Query component | empty string | + +------------------+-------+--------------------------+----------------------+ + | :attr:`fragment` | 5 | Fragment identifier | empty string | + +------------------+-------+--------------------------+----------------------+ + | :attr:`username` | | User name | :const:`None` | + +------------------+-------+--------------------------+----------------------+ + | :attr:`password` | | Password | :const:`None` | + +------------------+-------+--------------------------+----------------------+ + | :attr:`hostname` | | Host name (lower case) | :const:`None` | + +------------------+-------+--------------------------+----------------------+ + | :attr:`port` | | Port number as integer, | :const:`None` | + | | | if present | | + +------------------+-------+--------------------------+----------------------+ + + See section :ref:`urlparse-result-object` for more information on the result + object. + + +.. function:: urlunparse(parts) + + Construct a URL from a tuple as returned by ``urlparse()``. The *parts* argument + can be any six-item iterable. 
This may result in a slightly different, but + equivalent URL, if the URL that was parsed originally had unnecessary delimiters + (for example, a ? with an empty query; the RFC states that these are + equivalent). + + +.. function:: urlsplit(urlstring[, default_scheme[, allow_fragments]]) + + This is similar to :func:`urlparse`, but does not split the params from the URL. + This should generally be used instead of :func:`urlparse` if the more recent URL + syntax allowing parameters to be applied to each segment of the *path* portion + of the URL (see :rfc:`2396`) is wanted. A separate function is needed to + separate the path segments and parameters. This function returns a 5-tuple: + (addressing scheme, network location, path, query, fragment identifier). + + The return value is actually an instance of a subclass of :class:`tuple`. This + class has the following additional read-only convenience attributes: + + +------------------+-------+-------------------------+----------------------+ + | Attribute | Index | Value | Value if not present | + +==================+=======+=========================+======================+ + | :attr:`scheme` | 0 | URL scheme specifier | empty string | + +------------------+-------+-------------------------+----------------------+ + | :attr:`netloc` | 1 | Network location part | empty string | + +------------------+-------+-------------------------+----------------------+ + | :attr:`path` | 2 | Hierarchical path | empty string | + +------------------+-------+-------------------------+----------------------+ + | :attr:`query` | 3 | Query component | empty string | + +------------------+-------+-------------------------+----------------------+ + | :attr:`fragment` | 4 | Fragment identifier | empty string | + +------------------+-------+-------------------------+----------------------+ + | :attr:`username` | | User name | :const:`None` | + +------------------+-------+-------------------------+----------------------+ + | :attr:`password` | 
| Password | :const:`None` | + +------------------+-------+-------------------------+----------------------+ + | :attr:`hostname` | | Host name (lower case) | :const:`None` | + +------------------+-------+-------------------------+----------------------+ + | :attr:`port` | | Port number as integer, | :const:`None` | + | | | if present | | + +------------------+-------+-------------------------+----------------------+ + + See section :ref:`urlparse-result-object` for more information on the result + object. + + +.. function:: urlunsplit(parts) + + Combine the elements of a tuple as returned by :func:`urlsplit` into a complete + URL as a string. The *parts* argument can be any five-item iterable. This may + result in a slightly different, but equivalent URL, if the URL that was parsed + originally had unnecessary delimiters (for example, a ? with an empty query; the + RFC states that these are equivalent). + + +.. function:: urljoin(base, url[, allow_fragments]) + + Construct a full ("absolute") URL by combining a "base URL" (*base*) with + another URL (*url*). Informally, this uses components of the base URL, in + particular the addressing scheme, the network location and (part of) the path, + to provide missing components in the relative URL. For example: + + >>> from urllib.parse import urljoin + >>> urljoin('http://www.cwi.nl/%7Eguido/Python.html', 'FAQ.html') + 'http://www.cwi.nl/%7Eguido/FAQ.html' + + The *allow_fragments* argument has the same meaning and default as for + :func:`urlparse`. + + .. note:: + + If *url* is an absolute URL (that is, starting with ``//`` or ``scheme://``), + the *url*'s host name and/or scheme will be present in the result. For example: + + .. doctest:: + + >>> urljoin('http://www.cwi.nl/%7Eguido/Python.html', + ... '//www.python.org/%7Eguido') + 'http://www.python.org/%7Eguido' + + If you do not want that behavior, preprocess the *url* with :func:`urlsplit` and + :func:`urlunsplit`, removing possible *scheme* and *netloc* parts. 
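The note above about preprocessing an absolute URL can be exercised without any network access. A minimal sketch (reusing the example URLs from this section) shows how dropping the *scheme* and *netloc* parts changes what :func:`urljoin` produces:

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

base = 'http://www.cwi.nl/%7Eguido/Python.html'
url = '//www.python.org/%7Eguido'

# urljoin keeps the host of an absolute (network-path) URL:
joined = urljoin(base, url)
print(joined)                   # http://www.python.org/%7Eguido

# Preprocess with urlsplit/urlunsplit, blanking scheme and netloc,
# so that only the path (and query/fragment) survive:
parts = urlsplit(url)
relative = urlunsplit(('', '') + parts[2:])
print(relative)                 # /%7Eguido

# The preprocessed URL now resolves against the base's host:
print(urljoin(base, relative))  # http://www.cwi.nl/%7Eguido
```

Slicing works here because the result of :func:`urlsplit` is a tuple subclass, so ``parts[2:]`` yields the path, query, and fragment fields as a plain tuple.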
+ + +.. function:: urldefrag(url) + + If *url* contains a fragment identifier, returns a modified version of *url* + with no fragment identifier, and the fragment identifier as a separate string. + If there is no fragment identifier in *url*, returns *url* unmodified and an + empty string. + +.. function:: quote(string[, safe]) + + Replace special characters in *string* using the ``%xx`` escape. Letters, + digits, and the characters ``'_.-'`` are never quoted. The optional *safe* + parameter specifies additional characters that should not be quoted --- its + default value is ``'/'``. + + Example: ``quote('/~connolly/')`` yields ``'/%7Econnolly/'``. + + +.. function:: quote_plus(string[, safe]) + + Like :func:`quote`, but also replaces spaces by plus signs, as required for + quoting HTML form values. Plus signs in the original string are escaped unless + they are included in *safe*. Unlike :func:`quote`, the *safe* parameter does + not default to ``'/'``. + + +.. function:: unquote(string) + + Replace ``%xx`` escapes by their single-character equivalent. + + Example: ``unquote('/%7Econnolly/')`` yields ``'/~connolly/'``. + + +.. function:: unquote_plus(string) + + Like :func:`unquote`, but also replaces plus signs by spaces, as required for + unquoting HTML form values. + + +.. function:: urlencode(query[, doseq]) + + Convert a mapping object or a sequence of two-element tuples to a "url-encoded" + string, suitable to pass to :func:`urlopen` above as the optional *data* + argument. This is useful to pass a dictionary of form fields to a ``POST`` + request. The resulting string is a series of ``key=value`` pairs separated by + ``'&'`` characters, where both *key* and *value* are quoted using + :func:`quote_plus` above. If the optional parameter *doseq* is present and + evaluates to true, individual ``key=value`` pairs are generated for each element + of the sequence.
When a sequence of two-element tuples is used as the *query* + argument, the first element of each tuple is a key and the second is a value. + The order of parameters in the encoded string will match the order of parameter + tuples in the sequence. The :mod:`cgi` module provides the functions + :func:`parse_qs` and :func:`parse_qsl` which are used to parse query strings + into Python data structures. + + +.. seealso:: + + :rfc:`1738` - Uniform Resource Locators (URL) + This specifies the formal syntax and semantics of absolute URLs. + + :rfc:`1808` - Relative Uniform Resource Locators + This Request For Comments includes the rules for joining an absolute and a + relative URL, including a fair number of "Abnormal Examples" which govern the + treatment of border cases. + + :rfc:`2396` - Uniform Resource Identifiers (URI): Generic Syntax + Document describing the generic syntactic requirements for both Uniform Resource + Names (URNs) and Uniform Resource Locators (URLs). + + +.. _urlparse-result-object: + +Results of :func:`urlparse` and :func:`urlsplit` +------------------------------------------------ + +The result objects from the :func:`urlparse` and :func:`urlsplit` functions are +subclasses of the :class:`tuple` type. These subclasses add the attributes +described in those functions, as well as provide an additional method: + + +.. method:: ParseResult.geturl() + + Return the re-combined version of the original URL as a string. This may differ + from the original URL in that the scheme will always be normalized to lower case + and empty components may be dropped. Specifically, empty parameters, queries, + and fragment identifiers will be removed. 
+
+   The result of this method is a fixpoint if passed back through the original
+   parsing function:
+
+      >>> import urllib.parse
+      >>> url = 'HTTP://www.Python.org/doc/#'
+
+      >>> r1 = urllib.parse.urlsplit(url)
+      >>> r1.geturl()
+      'http://www.Python.org/doc/'
+
+      >>> r2 = urllib.parse.urlsplit(r1.geturl())
+      >>> r2.geturl()
+      'http://www.Python.org/doc/'
+
+
+The following classes provide the implementations of the parse results:
+
+
+.. class:: BaseResult
+
+   Base class for the concrete result classes. This provides most of the attribute
+   definitions. It does not provide a :meth:`geturl` method. It is derived from
+   :class:`tuple`, but does not override the :meth:`__init__` or :meth:`__new__`
+   methods.
+
+
+.. class:: ParseResult(scheme, netloc, path, params, query, fragment)
+
+   Concrete class for :func:`urlparse` results. The :meth:`__new__` method is
+   overridden to support checking that the right number of arguments are passed.
+
+
+.. class:: SplitResult(scheme, netloc, path, query, fragment)
+
+   Concrete class for :func:`urlsplit` results. The :meth:`__new__` method is
+   overridden to support checking that the right number of arguments are passed.
+
diff --git a/Doc/library/urllib.request.rst b/Doc/library/urllib.request.rst
new file mode 100644
index 0000000..4262836
--- /dev/null
+++ b/Doc/library/urllib.request.rst
@@ -0,0 +1,1194 @@
+:mod:`urllib.request` --- extensible library for opening URLs
+=============================================================
+
+.. module:: urllib.request
+   :synopsis: Next generation URL opening library.
+.. moduleauthor:: Jeremy Hylton
+.. sectionauthor:: Moshe Zadka
+
+
+The :mod:`urllib.request` module defines functions and classes which help in opening
+URLs (mostly HTTP) in a complex world --- basic and digest authentication,
+redirections, cookies and more.
+
+The :mod:`urllib.request` module defines the following functions:
+
+
+.. 
function:: urlopen(url[, data][, timeout])
+
+   Open the URL *url*, which can be either a string or a :class:`Request` object.
+
+   *data* may be a string specifying additional data to send to the server, or
+   ``None`` if no such data is needed. Currently HTTP requests are the only ones
+   that use *data*; the HTTP request will be a POST instead of a GET when the
+   *data* parameter is provided. *data* should be a buffer in the standard
+   :mimetype:`application/x-www-form-urlencoded` format. The
+   :func:`urllib.parse.urlencode` function takes a mapping or sequence of
+   2-tuples and returns a string in this format.
+
+   The optional *timeout* parameter specifies a timeout in seconds for blocking
+   operations like the connection attempt (if not specified, the global default
+   timeout setting will be used). This actually only works for HTTP, HTTPS,
+   FTP and FTPS connections.
+
+   This function returns a file-like object with two additional methods from
+   the :mod:`urllib.response` module:
+
+   * :meth:`geturl` --- return the URL of the resource retrieved, commonly used to
+     determine if a redirect was followed
+
+   * :meth:`info` --- return the meta-information of the page, such as headers, in
+     the form of an ``http.client.HTTPMessage`` instance
+     (see `Quick Reference to HTTP Headers `_)
+
+   Raises :exc:`URLError` on errors.
+
+   Note that ``None`` may be returned if no handler handles the request (though the
+   default installed global :class:`OpenerDirector` uses :class:`UnknownHandler` to
+   ensure this never happens).
+   The ``urlopen`` function of the :mod:`urllib` module in Python 2.6 and earlier
+   has been discontinued; this function returns a file-like object just as the
+   old one did. The proxy handling that was previously configured by passing a
+   dictionary parameter to ``urlopen`` is now done through :class:`ProxyHandler`
+   objects.
+
+
+.. function:: install_opener(opener)
+
+   Install an :class:`OpenerDirector` instance as the default global opener. 
+ Installing an opener is only necessary if you want urlopen to use that opener; + otherwise, simply call :meth:`OpenerDirector.open` instead of :func:`urlopen`. + The code does not check for a real :class:`OpenerDirector`, and any class with + the appropriate interface will work. + + +.. function:: build_opener([handler, ...]) + + Return an :class:`OpenerDirector` instance, which chains the handlers in the + order given. *handler*\s can be either instances of :class:`BaseHandler`, or + subclasses of :class:`BaseHandler` (in which case it must be possible to call + the constructor without any parameters). Instances of the following classes + will be in front of the *handler*\s, unless the *handler*\s contain them, + instances of them or subclasses of them: :class:`ProxyHandler`, + :class:`UnknownHandler`, :class:`HTTPHandler`, :class:`HTTPDefaultErrorHandler`, + :class:`HTTPRedirectHandler`, :class:`FTPHandler`, :class:`FileHandler`, + :class:`HTTPErrorProcessor`. + + If the Python installation has SSL support (i.e., if the :mod:`ssl` module can be imported), + :class:`HTTPSHandler` will also be added. + + A :class:`BaseHandler` subclass may also change its :attr:`handler_order` + member variable to modify its position in the handlers list. + +.. function:: urlretrieve(url[, filename[, reporthook[, data]]]) + + Copy a network object denoted by a URL to a local file, if necessary. If the URL + points to a local file, or a valid cached copy of the object exists, the object + is not copied. Return a tuple ``(filename, headers)`` where *filename* is the + local file name under which the object can be found, and *headers* is whatever + the :meth:`info` method of the object returned by :func:`urlopen` returned (for + a remote object, possibly cached). Exceptions are the same as for + :func:`urlopen`. + + The second argument, if present, specifies the file location to copy to (if + absent, the location will be a tempfile with a generated name). 
The third
+   argument, if present, is a hook function that will be called once on
+   establishment of the network connection and once after each block read
+   thereafter. The hook will be passed three arguments: a count of blocks
+   transferred so far, a block size in bytes, and the total size of the file. The
+   third argument may be ``-1`` on older FTP servers which do not return a file
+   size in response to a retrieval request.
+
+   If the *url* uses the :file:`http:` scheme identifier, the optional *data*
+   argument may be given to specify a ``POST`` request (normally the request type
+   is ``GET``). The *data* argument must be in standard
+   :mimetype:`application/x-www-form-urlencoded` format; see the
+   :func:`urllib.parse.urlencode` function.
+
+   :func:`urlretrieve` will raise :exc:`ContentTooShortError` when it detects that
+   the amount of data available was less than the expected amount (which is the
+   size reported by a *Content-Length* header). This can occur, for example, when
+   the download is interrupted.
+
+   The *Content-Length* is treated as a lower bound: if there's more data to read,
+   urlretrieve reads more data, but if less data is available, it raises the
+   exception.
+
+   You can still retrieve the downloaded data in this case; it is stored in the
+   :attr:`content` attribute of the exception instance.
+
+   If no *Content-Length* header was supplied, urlretrieve cannot check the size
+   of the data it has downloaded, and just returns it. In this case you just have
+   to assume that the download was successful.
+
+
+.. data:: _urlopener
+
+   The public functions :func:`urlopen` and :func:`urlretrieve` create an instance
+   of the :class:`FancyURLopener` class and use it to perform their requested
+   actions. To override this functionality, programmers can create a subclass of
+   :class:`URLopener` or :class:`FancyURLopener`, then assign an instance of that
+   class to the ``urllib.request._urlopener`` variable before calling the desired function. 
+
+   For example, applications may want to specify a different
+   :mailheader:`User-Agent` header than :class:`URLopener` defines. This can be
+   accomplished with the following code::
+
+      import urllib.request
+
+      class AppURLopener(urllib.request.FancyURLopener):
+          version = "App/1.7"
+
+      urllib.request._urlopener = AppURLopener()
+
+
+.. function:: urlcleanup()
+
+   Clear the cache that may have been built up by previous calls to
+   :func:`urlretrieve`.
+
+.. function:: pathname2url(path)
+
+   Convert the pathname *path* from the local syntax for a path to the form used in
+   the path component of a URL. This does not produce a complete URL. The return
+   value will already be quoted using the :func:`quote` function.
+
+
+.. function:: url2pathname(path)
+
+   Convert the path component *path* from an encoded URL to the local syntax for a
+   path. This does not accept a complete URL. This function uses :func:`unquote`
+   to decode *path*.
+
+The following classes are provided:
+
+.. class:: Request(url[, data][, headers][, origin_req_host][, unverifiable])
+
+   This class is an abstraction of a URL request.
+
+   *url* should be a string containing a valid URL.
+
+   *data* may be a string specifying additional data to send to the server, or
+   ``None`` if no such data is needed. Currently HTTP requests are the only ones
+   that use *data*; the HTTP request will be a POST instead of a GET when the
+   *data* parameter is provided. *data* should be a buffer in the standard
+   :mimetype:`application/x-www-form-urlencoded` format. The
+   :func:`urllib.parse.urlencode` function takes a mapping or sequence of
+   2-tuples and returns a string in this format.
+
+   *headers* should be a dictionary, and will be treated as if :meth:`add_header`
+   was called with each key and value as arguments. This is often used to "spoof"
+   the ``User-Agent`` header, which is used by a browser to identify itself --
+   some HTTP servers only allow requests coming from common browsers as opposed
+   to scripts. 
For example, Mozilla Firefox may identify itself as ``"Mozilla/5.0
+   (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11"``, while
+   :mod:`urllib.request`'s default user agent string is ``"Python-urllib/2.6"``
+   (on Python 2.6).
+
+   The final two arguments are only of interest for correct handling of third-party
+   HTTP cookies:
+
+   *origin_req_host* should be the request-host of the origin transaction, as
+   defined by :rfc:`2965`. It defaults to ``http.cookiejar.request_host(self)``.
+   This is the host name or IP address of the original request that was
+   initiated by the user. For example, if the request is for an image in an
+   HTML document, this should be the request-host of the request for the page
+   containing the image.
+
+   *unverifiable* should indicate whether the request is unverifiable, as defined
+   by RFC 2965. It defaults to ``False``. An unverifiable request is one whose URL
+   the user did not have the option to approve. For example, if the request is for
+   an image in an HTML document, and the user had no option to approve the
+   automatic fetching of the image, this should be true.
+
+.. class:: URLopener([proxies[, **x509]])
+
+   Base class for opening and reading URLs. Unless you need to support opening
+   objects using schemes other than :file:`http:`, :file:`ftp:`, or :file:`file:`,
+   you probably want to use :class:`FancyURLopener`.
+
+   By default, the :class:`URLopener` class sends a :mailheader:`User-Agent` header
+   of ``urllib/VVV``, where *VVV* is the :mod:`urllib` version number.
+   Applications can define their own :mailheader:`User-Agent` header by subclassing
+   :class:`URLopener` or :class:`FancyURLopener` and setting the class attribute
+   :attr:`version` to an appropriate string value in the subclass definition.
+
+   The optional *proxies* parameter should be a dictionary mapping scheme names to
+   proxy URLs, where an empty dictionary turns proxies off completely. 
Its default + value is ``None``, in which case environmental proxy settings will be used if + present, as discussed in the definition of :func:`urlopen`, above. + + Additional keyword parameters, collected in *x509*, may be used for + authentication of the client when using the :file:`https:` scheme. The keywords + *key_file* and *cert_file* are supported to provide an SSL key and certificate; + both are needed to support client authentication. + + :class:`URLopener` objects will raise an :exc:`IOError` exception if the server + returns an error code. + + .. method:: open(fullurl[, data]) + + Open *fullurl* using the appropriate protocol. This method sets up cache and + proxy information, then calls the appropriate open method with its input + arguments. If the scheme is not recognized, :meth:`open_unknown` is called. + The *data* argument has the same meaning as the *data* argument of + :func:`urlopen`. + + + .. method:: open_unknown(fullurl[, data]) + + Overridable interface to open unknown URL types. + + + .. method:: retrieve(url[, filename[, reporthook[, data]]]) + + Retrieves the contents of *url* and places it in *filename*. The return value + is a tuple consisting of a local filename and either a + :class:`email.message.Message` object containing the response headers (for remote + URLs) or ``None`` (for local URLs). The caller must then open and read the + contents of *filename*. If *filename* is not given and the URL refers to a + local file, the input filename is returned. If the URL is non-local and + *filename* is not given, the filename is the output of :func:`tempfile.mktemp` + with a suffix that matches the suffix of the last path component of the input + URL. If *reporthook* is given, it must be a function accepting three numeric + parameters. It will be called after each chunk of data is read from the + network. *reporthook* is ignored for local URLs. 
+
+      If the *url* uses the :file:`http:` scheme identifier, the optional *data*
+      argument may be given to specify a ``POST`` request (normally the request type
+      is ``GET``). The *data* argument must be in standard
+      :mimetype:`application/x-www-form-urlencoded` format; see the
+      :func:`urllib.parse.urlencode` function.
+
+
+   .. attribute:: version
+
+      Variable that specifies the user agent of the opener object. To get
+      :mod:`urllib` to tell servers that it is a particular user agent, set this in a
+      subclass as a class variable or in the constructor before calling the base
+      constructor.
+
+
+.. class:: FancyURLopener(...)
+
+   :class:`FancyURLopener` subclasses :class:`URLopener` providing default handling
+   for the following HTTP response codes: 301, 302, 303, 307 and 401. For the 30x
+   response codes listed above, the :mailheader:`Location` header is used to fetch
+   the actual URL. For 401 response codes (authentication required), basic HTTP
+   authentication is performed. For the 30x response codes, recursion is bounded
+   by the value of the *maxtries* attribute, which defaults to 10.
+
+   For all other response codes, the method :meth:`http_error_default` is called
+   which you can override in subclasses to handle the error appropriately.
+
+   .. note::
+
+      According to the letter of :rfc:`2616`, 301 and 302 responses to POST requests
+      must not be automatically redirected without confirmation by the user. In
+      reality, browsers do allow automatic redirection of these responses, changing
+      the POST to a GET, and :mod:`urllib` reproduces this behaviour.
+
+   The parameters to the constructor are the same as those for :class:`URLopener`.
+
+   .. note::
+
+      When performing basic authentication, a :class:`FancyURLopener` instance calls
+      its :meth:`prompt_user_passwd` method. The default implementation asks the
+      user for the required information on the controlling terminal. A subclass may
+      override this method to support more appropriate behavior if needed. 
+
+   The :class:`FancyURLopener` class offers one additional method that should be
+   overloaded to provide the appropriate behavior:
+
+   .. method:: prompt_user_passwd(host, realm)
+
+      Return information needed to authenticate the user at the given host in the
+      specified security realm. The return value should be a tuple, ``(user,
+      password)``, which can be used for basic authentication.
+
+      The implementation prompts for this information on the terminal; an application
+      should override this method to use an appropriate interaction model in the local
+      environment.
+
+.. class:: OpenerDirector()
+
+   The :class:`OpenerDirector` class opens URLs via :class:`BaseHandler`\ s chained
+   together. It manages the chaining of handlers, and recovery from errors.
+
+
+.. class:: BaseHandler()
+
+   This is the base class for all registered handlers --- and handles only the
+   simple mechanics of registration.
+
+
+.. class:: HTTPDefaultErrorHandler()
+
+   A class which defines a default handler for HTTP error responses; all responses
+   are turned into :exc:`HTTPError` exceptions.
+
+
+.. class:: HTTPRedirectHandler()
+
+   A class to handle redirections.
+
+
+.. class:: HTTPCookieProcessor([cookiejar])
+
+   A class to handle HTTP Cookies.
+
+
+.. class:: ProxyHandler([proxies])
+
+   Cause requests to go through a proxy. If *proxies* is given, it must be a
+   dictionary mapping protocol names to URLs of proxies. The default is to read
+   the list of proxies from the environment variables
+   :envvar:`<protocol>_proxy`. To disable an autodetected proxy, pass an empty
+   dictionary.
+
+
+.. class:: HTTPPasswordMgr()
+
+   Keep a database of ``(realm, uri) -> (user, password)`` mappings.
+
+
+.. class:: HTTPPasswordMgrWithDefaultRealm()
+
+   Keep a database of ``(realm, uri) -> (user, password)`` mappings. A realm of
+   ``None`` is considered a catch-all realm, which is searched if no other realm
+   fits.
+
+
+.. 
class:: AbstractBasicAuthHandler([password_mgr]) + + This is a mixin class that helps with HTTP authentication, both to the remote + host and to a proxy. *password_mgr*, if given, should be something that is + compatible with :class:`HTTPPasswordMgr`; refer to section + :ref:`http-password-mgr` for information on the interface that must be + supported. + + +.. class:: HTTPBasicAuthHandler([password_mgr]) + + Handle authentication with the remote host. *password_mgr*, if given, should be + something that is compatible with :class:`HTTPPasswordMgr`; refer to section + :ref:`http-password-mgr` for information on the interface that must be + supported. + + +.. class:: ProxyBasicAuthHandler([password_mgr]) + + Handle authentication with the proxy. *password_mgr*, if given, should be + something that is compatible with :class:`HTTPPasswordMgr`; refer to section + :ref:`http-password-mgr` for information on the interface that must be + supported. + + +.. class:: AbstractDigestAuthHandler([password_mgr]) + + This is a mixin class that helps with HTTP authentication, both to the remote + host and to a proxy. *password_mgr*, if given, should be something that is + compatible with :class:`HTTPPasswordMgr`; refer to section + :ref:`http-password-mgr` for information on the interface that must be + supported. + + +.. class:: HTTPDigestAuthHandler([password_mgr]) + + Handle authentication with the remote host. *password_mgr*, if given, should be + something that is compatible with :class:`HTTPPasswordMgr`; refer to section + :ref:`http-password-mgr` for information on the interface that must be + supported. + + +.. class:: ProxyDigestAuthHandler([password_mgr]) + + Handle authentication with the proxy. *password_mgr*, if given, should be + something that is compatible with :class:`HTTPPasswordMgr`; refer to section + :ref:`http-password-mgr` for information on the interface that must be + supported. + + +.. class:: HTTPHandler() + + A class to handle opening of HTTP URLs. 
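As a sketch of how the handler and password-manager classes above fit together (the URL, realm and the ``'user'``/``'passwd'`` credentials are hypothetical placeholder values):

```python
import urllib.request

# Hypothetical example values; substitute your own URL and credentials.
password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, 'http://www.example.com/', 'user', 'passwd')

# Chain a basic-auth handler and a ProxyHandler (an empty dictionary
# disables autodetected proxies) into an OpenerDirector:
opener = urllib.request.build_opener(
    urllib.request.HTTPBasicAuthHandler(password_mgr),
    urllib.request.ProxyHandler({}))

# Make urlopen() use this handler chain for all subsequent requests:
urllib.request.install_opener(opener)
```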
+ + +.. class:: HTTPSHandler() + + A class to handle opening of HTTPS URLs. + + +.. class:: FileHandler() + + Open local files. + + +.. class:: FTPHandler() + + Open FTP URLs. + + +.. class:: CacheFTPHandler() + + Open FTP URLs, keeping a cache of open FTP connections to minimize delays. + + +.. class:: UnknownHandler() + + A catch-all class to handle unknown URLs. + + +.. _request-objects: + +Request Objects +--------------- + +The following methods describe all of :class:`Request`'s public interface, and +so all must be overridden in subclasses. + + +.. method:: Request.add_data(data) + + Set the :class:`Request` data to *data*. This is ignored by all handlers except + HTTP handlers --- and there it should be a byte string, and will change the + request to be ``POST`` rather than ``GET``. + + +.. method:: Request.get_method() + + Return a string indicating the HTTP request method. This is only meaningful for + HTTP requests, and currently always returns ``'GET'`` or ``'POST'``. + + +.. method:: Request.has_data() + + Return whether the instance has a non-\ ``None`` data. + + +.. method:: Request.get_data() + + Return the instance's data. + + +.. method:: Request.add_header(key, val) + + Add another header to the request. Headers are currently ignored by all + handlers except HTTP handlers, where they are added to the list of headers sent + to the server. Note that there cannot be more than one header with the same + name, and later calls will overwrite previous calls in case the *key* collides. + Currently, this is no loss of HTTP functionality, since all headers which have + meaning when used more than once have a (header-specific) way of gaining the + same functionality using only one header. + + +.. method:: Request.add_unredirected_header(key, header) + + Add a header that will not be added to a redirected request. + + +.. method:: Request.has_header(header) + + Return whether the instance has the named header (checks both regular and + unredirected). + + +.. 
method:: Request.get_full_url() + + Return the URL given in the constructor. + + +.. method:: Request.get_type() + + Return the type of the URL --- also known as the scheme. + + +.. method:: Request.get_host() + + Return the host to which a connection will be made. + + +.. method:: Request.get_selector() + + Return the selector --- the part of the URL that is sent to the server. + + +.. method:: Request.set_proxy(host, type) + + Prepare the request by connecting to a proxy server. The *host* and *type* will + replace those of the instance, and the instance's selector will be the original + URL given in the constructor. + + +.. method:: Request.get_origin_req_host() + + Return the request-host of the origin transaction, as defined by :rfc:`2965`. + See the documentation for the :class:`Request` constructor. + + +.. method:: Request.is_unverifiable() + + Return whether the request is unverifiable, as defined by RFC 2965. See the + documentation for the :class:`Request` constructor. + + +.. _opener-director-objects: + +OpenerDirector Objects +---------------------- + +:class:`OpenerDirector` instances have the following methods: + + +.. method:: OpenerDirector.add_handler(handler) + + *handler* should be an instance of :class:`BaseHandler`. The following methods + are searched, and added to the possible chains (note that HTTP errors are a + special case). + + * :meth:`protocol_open` --- signal that the handler knows how to open *protocol* + URLs. + + * :meth:`http_error_type` --- signal that the handler knows how to handle HTTP + errors with HTTP error code *type*. + + * :meth:`protocol_error` --- signal that the handler knows how to handle errors + from (non-\ ``http``) *protocol*. + + * :meth:`protocol_request` --- signal that the handler knows how to pre-process + *protocol* requests. + + * :meth:`protocol_response` --- signal that the handler knows how to + post-process *protocol* responses. + + +.. 
method:: OpenerDirector.open(url[, data][, timeout])
+
+   Open the given *url* (which can be a request object or a string), optionally
+   passing the given *data*. Arguments, return values and exceptions raised are
+   the same as those of :func:`urlopen` (which simply calls the :meth:`open`
+   method on the currently installed global :class:`OpenerDirector`). The
+   optional *timeout* parameter specifies a timeout in seconds for blocking
+   operations like the connection attempt (if not specified, the global default
+   timeout setting will be used). The timeout feature actually works only for
+   HTTP, HTTPS, FTP and FTPS connections.
+
+
+.. method:: OpenerDirector.error(proto[, arg[, ...]])
+
+   Handle an error of the given protocol. This will call the registered error
+   handlers for the given protocol with the given arguments (which are protocol
+   specific). The HTTP protocol is a special case which uses the HTTP response
+   code to determine the specific error handler; refer to the :meth:`http_error_\*`
+   methods of the handler classes.
+
+   Return values and exceptions raised are the same as those of :func:`urlopen`.
+
+OpenerDirector objects open URLs in three stages:
+
+The order in which these methods are called within each stage is determined by
+sorting the handler instances.
+
+#. Every handler with a method named like :meth:`protocol_request` has that
+   method called to pre-process the request.
+
+#. Handlers with a method named like :meth:`protocol_open` are called to handle
+   the request. This stage ends when a handler either returns a non-\ :const:`None`
+   value (i.e. a response), or raises an exception (usually :exc:`URLError`).
+   Exceptions are allowed to propagate.
+
+   In fact, the above algorithm is first tried for methods named
+   :meth:`default_open`. If all such methods return :const:`None`, the algorithm
+   is repeated for methods named like :meth:`protocol_open`. 
If all such methods + return :const:`None`, the algorithm is repeated for methods named + :meth:`unknown_open`. + + Note that the implementation of these methods may involve calls of the parent + :class:`OpenerDirector` instance's :meth:`.open` and :meth:`.error` methods. + +#. Every handler with a method named like :meth:`protocol_response` has that + method called to post-process the response. + + +.. _base-handler-objects: + +BaseHandler Objects +------------------- + +:class:`BaseHandler` objects provide a couple of methods that are directly +useful, and others that are meant to be used by derived classes. These are +intended for direct use: + + +.. method:: BaseHandler.add_parent(director) + + Add a director as parent. + + +.. method:: BaseHandler.close() + + Remove any parents. + +The following members and methods should only be used by classes derived from +:class:`BaseHandler`. + +.. note:: + + The convention has been adopted that subclasses defining + :meth:`protocol_request` or :meth:`protocol_response` methods are named + :class:`\*Processor`; all others are named :class:`\*Handler`. + + +.. attribute:: BaseHandler.parent + + A valid :class:`OpenerDirector`, which can be used to open using a different + protocol, or handle errors. + + +.. method:: BaseHandler.default_open(req) + + This method is *not* defined in :class:`BaseHandler`, but subclasses should + define it if they want to catch all URLs. + + This method, if implemented, will be called by the parent + :class:`OpenerDirector`. It should return a file-like object as described in + the return value of the :meth:`open` of :class:`OpenerDirector`, or ``None``. + It should raise :exc:`URLError`, unless a truly exceptional thing happens (for + example, :exc:`MemoryError` should not be mapped to :exc:`URLError`). + + This method will be called before any protocol-specific open method. + + +.. 
method:: BaseHandler.protocol_open(req) + :noindex: + + This method is *not* defined in :class:`BaseHandler`, but subclasses should + define it if they want to handle URLs with the given protocol. + + This method, if defined, will be called by the parent :class:`OpenerDirector`. + Return values should be the same as for :meth:`default_open`. + + +.. method:: BaseHandler.unknown_open(req) + + This method is *not* defined in :class:`BaseHandler`, but subclasses should + define it if they want to catch all URLs with no specific registered handler to + open it. + + This method, if implemented, will be called by the :attr:`parent` + :class:`OpenerDirector`. Return values should be the same as for + :meth:`default_open`. + + +.. method:: BaseHandler.http_error_default(req, fp, code, msg, hdrs) + + This method is *not* defined in :class:`BaseHandler`, but subclasses should + override it if they intend to provide a catch-all for otherwise unhandled HTTP + errors. It will be called automatically by the :class:`OpenerDirector` getting + the error, and should not normally be called in other circumstances. + + *req* will be a :class:`Request` object, *fp* will be a file-like object with + the HTTP error body, *code* will be the three-digit code of the error, *msg* + will be the user-visible explanation of the code and *hdrs* will be a mapping + object with the headers of the error. + + Return values and exceptions raised should be the same as those of + :func:`urlopen`. + + +.. method:: BaseHandler.http_error_nnn(req, fp, code, msg, hdrs) + + *nnn* should be a three-digit HTTP error code. This method is also not defined + in :class:`BaseHandler`, but will be called, if it exists, on an instance of a + subclass, when an HTTP error with code *nnn* occurs. + + Subclasses should override this method to handle specific HTTP errors. + + Arguments, return values and exceptions raised should be the same as for + :meth:`http_error_default`. + + +.. 
method:: BaseHandler.protocol_request(req) + :noindex: + + This method is *not* defined in :class:`BaseHandler`, but subclasses should + define it if they want to pre-process requests of the given protocol. + + This method, if defined, will be called by the parent :class:`OpenerDirector`. + *req* will be a :class:`Request` object. The return value should be a + :class:`Request` object. + + +.. method:: BaseHandler.protocol_response(req, response) + :noindex: + + This method is *not* defined in :class:`BaseHandler`, but subclasses should + define it if they want to post-process responses of the given protocol. + + This method, if defined, will be called by the parent :class:`OpenerDirector`. + *req* will be a :class:`Request` object. *response* will be an object + implementing the same interface as the return value of :func:`urlopen`. The + return value should implement the same interface as the return value of + :func:`urlopen`. + + +.. _http-redirect-handler: + +HTTPRedirectHandler Objects +--------------------------- + +.. note:: + + Some HTTP redirections require action from this module's client code. If this + is the case, :exc:`HTTPError` is raised. See :rfc:`2616` for details of the + precise meanings of the various redirection codes. + + +.. method:: HTTPRedirectHandler.redirect_request(req, fp, code, msg, hdrs) + + Return a :class:`Request` or ``None`` in response to a redirect. This is called + by the default implementations of the :meth:`http_error_30\*` methods when a + redirection is received from the server. If a redirection should take place, + return a new :class:`Request` to allow :meth:`http_error_30\*` to perform the + redirect. Otherwise, raise :exc:`HTTPError` if no other handler should try to + handle this URL, or return ``None`` if you can't but another handler might. + + .. 
note:: + + The default implementation of this method does not strictly follow :rfc:`2616`, + which says that 301 and 302 responses to ``POST`` requests must not be + automatically redirected without confirmation by the user. In reality, browsers + do allow automatic redirection of these responses, changing the POST to a + ``GET``, and the default implementation reproduces this behavior. + + +.. method:: HTTPRedirectHandler.http_error_301(req, fp, code, msg, hdrs) + + Redirect to the ``Location:`` URL. This method is called by the parent + :class:`OpenerDirector` when getting an HTTP 'moved permanently' response. + + +.. method:: HTTPRedirectHandler.http_error_302(req, fp, code, msg, hdrs) + + The same as :meth:`http_error_301`, but called for the 'found' response. + + +.. method:: HTTPRedirectHandler.http_error_303(req, fp, code, msg, hdrs) + + The same as :meth:`http_error_301`, but called for the 'see other' response. + + +.. method:: HTTPRedirectHandler.http_error_307(req, fp, code, msg, hdrs) + + The same as :meth:`http_error_301`, but called for the 'temporary redirect' + response. + + +.. _http-cookie-processor: + +HTTPCookieProcessor Objects +--------------------------- + +:class:`HTTPCookieProcessor` instances have one attribute: + +.. attribute:: HTTPCookieProcessor.cookiejar + + The :class:`http.cookiejar.CookieJar` in which cookies are stored. + + +.. _proxy-handler: + +ProxyHandler Objects +-------------------- + + +.. method:: ProxyHandler.protocol_open(request) + :noindex: + + The :class:`ProxyHandler` will have a method :meth:`protocol_open` for every + *protocol* which has a proxy in the *proxies* dictionary given in the + constructor. The method will modify requests to go through the proxy, by + calling ``request.set_proxy()``, and call the next handler in the chain to + actually execute the protocol. + + +.. 
_http-password-mgr: + +HTTPPasswordMgr Objects +----------------------- + +These methods are available on :class:`HTTPPasswordMgr` and +:class:`HTTPPasswordMgrWithDefaultRealm` objects. + + +.. method:: HTTPPasswordMgr.add_password(realm, uri, user, passwd) + + *uri* can be either a single URI, or a sequence of URIs. *realm*, *user* and + *passwd* must be strings. This causes ``(user, passwd)`` to be used as + authentication tokens when authentication for *realm* and a super-URI of any of + the given URIs is given. + + +.. method:: HTTPPasswordMgr.find_user_password(realm, authuri) + + Get user/password for given realm and URI, if any. This method will return + ``(None, None)`` if there is no matching user/password. + + For :class:`HTTPPasswordMgrWithDefaultRealm` objects, the realm ``None`` will be + searched if the given *realm* has no matching user/password. + + +.. _abstract-basic-auth-handler: + +AbstractBasicAuthHandler Objects +-------------------------------- + + +.. method:: AbstractBasicAuthHandler.http_error_auth_reqed(authreq, host, req, headers) + + Handle an authentication request by getting a user/password pair, and re-trying + the request. *authreq* should be the name of the header where the information + about the realm is included in the request, *host* specifies the URL and path to + authenticate for, *req* should be the (failed) :class:`Request` object, and + *headers* should be the error headers. + + *host* is either an authority (e.g. ``"python.org"``) or a URL containing an + authority component (e.g. ``"http://python.org/"``). In either case, the + authority must not contain a userinfo component (so, ``"python.org"`` and + ``"python.org:80"`` are fine, ``"joe:password@python.org"`` is not). + + +.. _http-basic-auth-handler: + +HTTPBasicAuthHandler Objects +---------------------------- + + +.. method:: HTTPBasicAuthHandler.http_error_401(req, fp, code, msg, hdrs) + + Retry the request with authentication information, if available. + + +.. 
_proxy-basic-auth-handler: + +ProxyBasicAuthHandler Objects +----------------------------- + + +.. method:: ProxyBasicAuthHandler.http_error_407(req, fp, code, msg, hdrs) + + Retry the request with authentication information, if available. + + +.. _abstract-digest-auth-handler: + +AbstractDigestAuthHandler Objects +--------------------------------- + + +.. method:: AbstractDigestAuthHandler.http_error_auth_reqed(authreq, host, req, headers) + + *authreq* should be the name of the header where the information about the realm + is included in the request, *host* should be the host to authenticate to, *req* + should be the (failed) :class:`Request` object, and *headers* should be the + error headers. + + +.. _http-digest-auth-handler: + +HTTPDigestAuthHandler Objects +----------------------------- + + +.. method:: HTTPDigestAuthHandler.http_error_401(req, fp, code, msg, hdrs) + + Retry the request with authentication information, if available. + + +.. _proxy-digest-auth-handler: + +ProxyDigestAuthHandler Objects +------------------------------ + + +.. method:: ProxyDigestAuthHandler.http_error_407(req, fp, code, msg, hdrs) + + Retry the request with authentication information, if available. + + +.. _http-handler-objects: + +HTTPHandler Objects +------------------- + + +.. method:: HTTPHandler.http_open(req) + + Send an HTTP request, which can be either GET or POST, depending on + ``req.has_data()``. + + +.. _https-handler-objects: + +HTTPSHandler Objects +-------------------- + + +.. method:: HTTPSHandler.https_open(req) + + Send an HTTPS request, which can be either GET or POST, depending on + ``req.has_data()``. + + +.. _file-handler-objects: + +FileHandler Objects +------------------- + + +.. method:: FileHandler.file_open(req) + + Open the file locally, if there is no host name, or the host name is + ``'localhost'``. Change the protocol to ``ftp`` otherwise, and retry opening it + using :attr:`parent`. + + +.. 
_ftp-handler-objects:
+
+FTPHandler Objects
+------------------
+
+
+.. method:: FTPHandler.ftp_open(req)
+
+ Open the FTP file indicated by *req*. The login is always done with empty
+ username and password.
+
+
+.. _cacheftp-handler-objects:
+
+CacheFTPHandler Objects
+-----------------------
+
+:class:`CacheFTPHandler` objects are :class:`FTPHandler` objects with the
+following additional methods:
+
+
+.. method:: CacheFTPHandler.setTimeout(t)
+
+ Set timeout of connections to *t* seconds.
+
+
+.. method:: CacheFTPHandler.setMaxConns(m)
+
+ Set maximum number of cached connections to *m*.
+
+
+.. _unknown-handler-objects:
+
+UnknownHandler Objects
+----------------------
+
+
+.. method:: UnknownHandler.unknown_open()
+
+ Raise a :exc:`URLError` exception.
+
+
+.. _http-error-processor-objects:
+
+HTTPErrorProcessor Objects
+--------------------------
+
+.. method:: HTTPErrorProcessor.http_response(request, response)
+
+ Process HTTP error responses.
+
+ For 200 error codes, the response object is returned immediately.
+
+ For non-200 error codes, this simply passes the job on to the
+ :meth:`protocol_error_code` handler methods, via :meth:`OpenerDirector.error`.
+ Eventually, :class:`urllib.request.HTTPDefaultErrorHandler` will raise an
+ :exc:`HTTPError` if no other handler handles the error.
+
+.. _urllib2-examples:
+
+Examples
+--------
+
+This example gets the python.org main page and displays the first 100 bytes of
+it::
+
+ >>> import urllib.request
+ >>> f = urllib.request.urlopen('http://www.python.org/')
+ >>> print(f.read(100))
+
+Here we are sending a data-stream to the stdin of a CGI and reading the data it
+returns to us. Note that this example will only work when the Python
+installation supports SSL. ::
+
+ >>> import urllib.request
+ >>> req = urllib.request.Request(url='https://localhost/cgi-bin/test.cgi',
+ ... 
data='This data is passed to stdin of the CGI')
+ >>> f = urllib.request.urlopen(req)
+ >>> print(f.read())
+ Got Data: "This data is passed to stdin of the CGI"
+
+The code for the sample CGI used in the above example is::
+
+ #!/usr/bin/env python
+ import sys
+ data = sys.stdin.read()
+ print('Content-type: text/plain\n\nGot Data: "%s"' % data)
+
+Use of Basic HTTP Authentication::
+
+ import urllib.request
+ # Create an OpenerDirector with support for Basic HTTP Authentication...
+ auth_handler = urllib.request.HTTPBasicAuthHandler()
+ auth_handler.add_password(realm='PDQ Application',
+ uri='https://mahler:8092/site-updates.py',
+ user='klem',
+ passwd='kadidd!ehopper')
+ opener = urllib.request.build_opener(auth_handler)
+ # ...and install it globally so it can be used with urlopen.
+ urllib.request.install_opener(opener)
+ urllib.request.urlopen('http://www.example.com/login.html')
+
+:func:`build_opener` provides many handlers by default, including a
+:class:`ProxyHandler`. By default, :class:`ProxyHandler` uses the environment
+variables named ``<scheme>_proxy``, where ``<scheme>`` is the URL scheme
+involved. For example, the :envvar:`http_proxy` environment variable is read to
+obtain the HTTP proxy's URL.
+
+This example replaces the default :class:`ProxyHandler` with one that uses
+programmatically-supplied proxy URLs, and adds proxy authorization support with
+:class:`ProxyBasicAuthHandler`. 
::
+
+ proxy_handler = urllib.request.ProxyHandler({'http': 'http://www.example.com:3128/'})
+ proxy_auth_handler = urllib.request.ProxyBasicAuthHandler()
+ proxy_auth_handler.add_password('realm', 'host', 'username', 'password')
+
+ opener = urllib.request.build_opener(proxy_handler, proxy_auth_handler)
+ # This time, rather than install the OpenerDirector, we use it directly:
+ opener.open('http://www.example.com/login.html')
+
+Adding HTTP headers:
+
+Use the *headers* argument to the :class:`Request` constructor, or::
+
+ import urllib.request
+ req = urllib.request.Request('http://www.example.com/')
+ req.add_header('Referer', 'http://www.python.org/')
+ r = urllib.request.urlopen(req)
+
+:class:`OpenerDirector` automatically adds a :mailheader:`User-Agent` header to
+every :class:`Request`. To change this::
+
+ import urllib.request
+ opener = urllib.request.build_opener()
+ opener.addheaders = [('User-agent', 'Mozilla/5.0')]
+ opener.open('http://www.example.com/')
+
+Also, remember that a few standard headers (:mailheader:`Content-Length`,
+:mailheader:`Content-Type` and :mailheader:`Host`) are added when the
+:class:`Request` is passed to :func:`urlopen` (or :meth:`OpenerDirector.open`).
+
+.. 
_urllib-examples:
+
+Here is an example session that uses the ``GET`` method to retrieve a URL
+containing parameters::
+
+ >>> import urllib.request
+ >>> import urllib.parse
+ >>> params = urllib.parse.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
+ >>> f = urllib.request.urlopen("http://www.musi-cal.com/cgi-bin/query?%s" % params)
+ >>> print(f.read())
+
+The following example uses the ``POST`` method instead::
+
+ >>> import urllib.request
+ >>> import urllib.parse
+ >>> params = urllib.parse.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
+ >>> f = urllib.request.urlopen("http://www.musi-cal.com/cgi-bin/query", params)
+ >>> print(f.read())
+
+The following example uses an explicitly specified HTTP proxy, overriding
+environment settings::
+
+ >>> import urllib.request
+ >>> proxies = {'http': 'http://proxy.example.com:8080/'}
+ >>> opener = urllib.request.FancyURLopener(proxies)
+ >>> f = opener.open("http://www.python.org")
+ >>> f.read()
+
+The following example uses no proxies at all, overriding environment settings::
+
+ >>> import urllib.request
+ >>> opener = urllib.request.FancyURLopener({})
+ >>> f = opener.open("http://www.python.org/")
+ >>> f.read()
+
+
+:mod:`urllib.request` Restrictions
+----------------------------------
+
+ .. index::
+ pair: HTTP; protocol
+ pair: FTP; protocol
+
+* Currently, only the following protocols are supported: HTTP (versions 0.9 and
+ 1.0), FTP, and local files.
+
+* The caching feature of :func:`urlretrieve` has been disabled until I find the
+ time to hack proper processing of Expiration time headers.
+
+* There should be a function to query whether a particular URL is in the cache.
+
+* For backward compatibility, if a URL appears to point to a local file but the
+ file can't be opened, the URL is re-interpreted using the FTP protocol. This
+ can sometimes cause confusing error messages.
+
+* The :func:`urlopen` and :func:`urlretrieve` functions can cause arbitrarily
+ long delays while waiting for a network connection to be set up. This means
+ that it is difficult to build an interactive Web client using these functions
+ without using threads.
+
+ .. index::
+ single: HTML
+ pair: HTTP; protocol
+
+* The data returned by :func:`urlopen` or :func:`urlretrieve` is the raw data
+ returned by the server. This may be binary data (such as an image), plain text
+ or (for example) HTML. The HTTP protocol provides type information in the reply
+ header, which can be inspected by looking at the :mailheader:`Content-Type`
+ header. If the returned data is HTML, you can use the module
+ :mod:`html.parser` to parse it.
+
+ .. index:: single: FTP
+
+* The code handling the FTP protocol cannot differentiate between a file and a
+ directory. This can lead to unexpected behavior when attempting to read a URL
+ that points to a file that is not accessible. If the URL ends in a ``/``, it is
+ assumed to refer to a directory and will be handled accordingly. But if an
+ attempt to read a file leads to a 550 error (meaning the URL cannot be found or
+ is not accessible, often for permission reasons), then the path is treated as a
+ directory in order to handle the case when a directory is specified by a URL but
+ the trailing ``/`` has been left off. This can cause misleading results when
+ you try to fetch a file whose read permissions make it inaccessible; the FTP
+ code will try to read it, fail with a 550 error, and then perform a directory
+ listing for the unreadable file. If fine-grained control is needed, consider
+ using the :mod:`ftplib` module, subclassing :class:`FancyURLopener`, or changing
+ *_urlopener* to meet your needs.
+
+:mod:`urllib.response` --- Response classes used by urllib.
+===========================================================
+
+.. module:: urllib.response
+ :synopsis: Response classes used by urllib.
+
+The :mod:`urllib.response` module defines functions and classes which define a
+minimal file-like interface, including ``read()`` and ``readline()``. The
+typical response object is an ``addinfourl`` instance, which defines an
+``info()`` method that returns headers and a ``geturl()`` method that returns
+the URL. Functions defined by this module are used internally by the
+:mod:`urllib.request` module.
+
diff --git a/Doc/library/urllib.robotparser.rst b/Doc/library/urllib.robotparser.rst
new file mode 100644
index 0000000..e351c56
--- /dev/null
+++ b/Doc/library/urllib.robotparser.rst
@@ -0,0 +1,73 @@
+
+:mod:`urllib.robotparser` --- Parser for robots.txt
+====================================================
+
+.. module:: urllib.robotparser
+ :synopsis: Loads a robots.txt file and answers questions about
+ fetchability of other URLs.
+.. sectionauthor:: Skip Montanaro
+
+
+.. index::
+ single: WWW
+ single: World Wide Web
+ single: URL
+ single: robots.txt
+
+This module provides a single class, :class:`RobotFileParser`, which answers
+questions about whether or not a particular user agent can fetch a URL on the
+Web site that published the :file:`robots.txt` file. For more details on the
+structure of :file:`robots.txt` files, see http://www.robotstxt.org/orig.html.
+
+
+.. class:: RobotFileParser()
+
+ This class provides a set of methods to read, parse and answer questions
+ about a single :file:`robots.txt` file.
+
+
+ .. method:: set_url(url)
+
+ Sets the URL referring to a :file:`robots.txt` file.
+
+
+ .. method:: read()
+
+ Reads the :file:`robots.txt` URL and feeds it to the parser.
+
+
+ .. method:: parse(lines)
+
+ Parses the *lines* argument.
+
+
+ .. method:: can_fetch(useragent, url)
+
+ Returns ``True`` if the *useragent* is allowed to fetch the *url*
+ according to the rules contained in the parsed :file:`robots.txt`
+ file.
+
+
+ .. method:: mtime()
+
+ Returns the time the ``robots.txt`` file was last fetched. 
This is + useful for long-running web spiders that need to check for new + ``robots.txt`` files periodically. + + + .. method:: modified() + + Sets the time the ``robots.txt`` file was last fetched to the current + time. + +The following example demonstrates basic use of the RobotFileParser class. :: + + >>> import urllib.robotparser + >>> rp = urllib.robotparser.RobotFileParser() + >>> rp.set_url("http://www.musi-cal.com/robots.txt") + >>> rp.read() + >>> rp.can_fetch("*", "http://www.musi-cal.com/cgi-bin/search?city=San+Francisco") + False + >>> rp.can_fetch("*", "http://www.musi-cal.com/") + True + diff --git a/Doc/library/urllib.rst b/Doc/library/urllib.rst deleted file mode 100644 index 3435e55..0000000 --- a/Doc/library/urllib.rst +++ /dev/null @@ -1,459 +0,0 @@ -:mod:`urllib` --- Open arbitrary resources by URL -================================================= - -.. module:: urllib - :synopsis: Open an arbitrary network resource by URL (requires sockets). - - -.. index:: - single: WWW - single: World Wide Web - single: URL - -This module provides a high-level interface for fetching data across the World -Wide Web. In particular, the :func:`urlopen` function is similar to the -built-in function :func:`open`, but accepts Universal Resource Locators (URLs) -instead of filenames. Some restrictions apply --- it can only open URLs for -reading, and no seek operations are available. - -High-level interface --------------------- - -.. function:: urlopen(url[, data[, proxies]]) - - Open a network object denoted by a URL for reading. If the URL does not have a - scheme identifier, or if it has :file:`file:` as its scheme identifier, this - opens a local file (without universal newlines); otherwise it opens a socket to - a server somewhere on the network. If the connection cannot be made the - :exc:`IOError` exception is raised. If all went well, a file-like object is - returned. 
This supports the following methods: :meth:`read`, :meth:`readline`, - :meth:`readlines`, :meth:`fileno`, :meth:`close`, :meth:`info`, :meth:`getcode` and - :meth:`geturl`. It also has proper support for the :term:`iterator` protocol. One - caveat: the :meth:`read` method, if the size argument is omitted or negative, - may not read until the end of the data stream; there is no good way to determine - that the entire stream from a socket has been read in the general case. - - Except for the :meth:`info`, :meth:`getcode` and :meth:`geturl` methods, - these methods have the same interface as for file objects --- see section - :ref:`bltin-file-objects` in this manual. (It is not a built-in file object, - however, so it can't be used at those few places where a true built-in file - object is required.) - - The :meth:`info` method returns an instance of the class - :class:`email.message.Message` containing meta-information associated with - the URL. When the method is HTTP, these headers are those returned by the - server at the head of the retrieved HTML page (including Content-Length and - Content-Type). When the method is FTP, a Content-Length header will be - present if (as is now usual) the server passed back a file length in response - to the FTP retrieval request. A Content-Type header will be present if the - MIME type can be guessed. When the method is local-file, returned headers - will include a Date representing the file's last-modified time, a - Content-Length giving file size, and a Content-Type containing a guess at the - file's type. - - The :meth:`geturl` method returns the real URL of the page. In some cases, the - HTTP server redirects a client to another URL. The :func:`urlopen` function - handles this transparently, but in some cases the caller needs to know which URL - the client was redirected to. The :meth:`geturl` method can be used to get at - this redirected URL. 
- - The :meth:`getcode` method returns the HTTP status code that was sent with the - response, or ``None`` if the URL is no HTTP URL. - - If the *url* uses the :file:`http:` scheme identifier, the optional *data* - argument may be given to specify a ``POST`` request (normally the request type - is ``GET``). The *data* argument must be in standard - :mimetype:`application/x-www-form-urlencoded` format; see the :func:`urlencode` - function below. - - The :func:`urlopen` function works transparently with proxies which do not - require authentication. In a Unix or Windows environment, set the - :envvar:`http_proxy`, or :envvar:`ftp_proxy` environment variables to a URL that - identifies the proxy server before starting the Python interpreter. For example - (the ``'%'`` is the command prompt):: - - % http_proxy="http://www.someproxy.com:3128" - % export http_proxy - % python - ... - - The :envvar:`no_proxy` environment variable can be used to specify hosts which - shouldn't be reached via proxy; if set, it should be a comma-separated list - of hostname suffixes, optionally with ``:port`` appended, for example - ``cern.ch,ncsa.uiuc.edu,some.host:8080``. - - In a Windows environment, if no proxy environment variables are set, proxy - settings are obtained from the registry's Internet Settings section. - - .. index:: single: Internet Config - - In a Macintosh environment, :func:`urlopen` will retrieve proxy information from - Internet Config. - - Alternatively, the optional *proxies* argument may be used to explicitly specify - proxies. It must be a dictionary mapping scheme names to proxy URLs, where an - empty dictionary causes no proxies to be used, and ``None`` (the default value) - causes environmental proxy settings to be used as discussed above. 
For - example:: - - # Use http://www.someproxy.com:3128 for http proxying - proxies = {'http': 'http://www.someproxy.com:3128'} - filehandle = urllib.urlopen(some_url, proxies=proxies) - # Don't use any proxies - filehandle = urllib.urlopen(some_url, proxies={}) - # Use proxies from environment - both versions are equivalent - filehandle = urllib.urlopen(some_url, proxies=None) - filehandle = urllib.urlopen(some_url) - - Proxies which require authentication for use are not currently supported; this - is considered an implementation limitation. - - -.. function:: urlretrieve(url[, filename[, reporthook[, data]]]) - - Copy a network object denoted by a URL to a local file, if necessary. If the URL - points to a local file, or a valid cached copy of the object exists, the object - is not copied. Return a tuple ``(filename, headers)`` where *filename* is the - local file name under which the object can be found, and *headers* is whatever - the :meth:`info` method of the object returned by :func:`urlopen` returned (for - a remote object, possibly cached). Exceptions are the same as for - :func:`urlopen`. - - The second argument, if present, specifies the file location to copy to (if - absent, the location will be a tempfile with a generated name). The third - argument, if present, is a hook function that will be called once on - establishment of the network connection and once after each block read - thereafter. The hook will be passed three arguments; a count of blocks - transferred so far, a block size in bytes, and the total size of the file. The - third argument may be ``-1`` on older FTP servers which do not return a file - size in response to a retrieval request. - - If the *url* uses the :file:`http:` scheme identifier, the optional *data* - argument may be given to specify a ``POST`` request (normally the request type - is ``GET``). 
The *data* argument must in standard - :mimetype:`application/x-www-form-urlencoded` format; see the :func:`urlencode` - function below. - - :func:`urlretrieve` will raise :exc:`ContentTooShortError` when it detects that - the amount of data available was less than the expected amount (which is the - size reported by a *Content-Length* header). This can occur, for example, when - the download is interrupted. - - The *Content-Length* is treated as a lower bound: if there's more data to read, - urlretrieve reads more data, but if less data is available, it raises the - exception. - - You can still retrieve the downloaded data in this case, it is stored in the - :attr:`content` attribute of the exception instance. - - If no *Content-Length* header was supplied, urlretrieve can not check the size - of the data it has downloaded, and just returns it. In this case you just have - to assume that the download was successful. - - -.. data:: _urlopener - - The public functions :func:`urlopen` and :func:`urlretrieve` create an instance - of the :class:`FancyURLopener` class and use it to perform their requested - actions. To override this functionality, programmers can create a subclass of - :class:`URLopener` or :class:`FancyURLopener`, then assign an instance of that - class to the ``urllib._urlopener`` variable before calling the desired function. - For example, applications may want to specify a different - :mailheader:`User-Agent` header than :class:`URLopener` defines. This can be - accomplished with the following code:: - - import urllib - - class AppURLopener(urllib.FancyURLopener): - version = "App/1.7" - - urllib._urlopener = AppURLopener() - - -.. function:: urlcleanup() - - Clear the cache that may have been built up by previous calls to - :func:`urlretrieve`. - - -Utility functions ------------------ - -.. function:: quote(string[, safe]) - - Replace special characters in *string* using the ``%xx`` escape. 
Letters, - digits, and the characters ``'_.-'`` are never quoted. The optional *safe* - parameter specifies additional characters that should not be quoted --- its - default value is ``'/'``. - - Example: ``quote('/~connolly/')`` yields ``'/%7econnolly/'``. - - -.. function:: quote_plus(string[, safe]) - - Like :func:`quote`, but also replaces spaces by plus signs, as required for - quoting HTML form values. Plus signs in the original string are escaped unless - they are included in *safe*. It also does not have *safe* default to ``'/'``. - - -.. function:: unquote(string) - - Replace ``%xx`` escapes by their single-character equivalent. - - Example: ``unquote('/%7Econnolly/')`` yields ``'/~connolly/'``. - - -.. function:: unquote_plus(string) - - Like :func:`unquote`, but also replaces plus signs by spaces, as required for - unquoting HTML form values. - - -.. function:: urlencode(query[, doseq]) - - Convert a mapping object or a sequence of two-element tuples to a "url-encoded" - string, suitable to pass to :func:`urlopen` above as the optional *data* - argument. This is useful to pass a dictionary of form fields to a ``POST`` - request. The resulting string is a series of ``key=value`` pairs separated by - ``'&'`` characters, where both *key* and *value* are quoted using - :func:`quote_plus` above. If the optional parameter *doseq* is present and - evaluates to true, individual ``key=value`` pairs are generated for each element - of the sequence. When a sequence of two-element tuples is used as the *query* - argument, the first element of each tuple is a key and the second is a value. - The order of parameters in the encoded string will match the order of parameter - tuples in the sequence. The :mod:`cgi` module provides the functions - :func:`parse_qs` and :func:`parse_qsl` which are used to parse query strings - into Python data structures. - - -.. 
function:: pathname2url(path) - - Convert the pathname *path* from the local syntax for a path to the form used in - the path component of a URL. This does not produce a complete URL. The return - value will already be quoted using the :func:`quote` function. - - -.. function:: url2pathname(path) - - Convert the path component *path* from an encoded URL to the local syntax for a - path. This does not accept a complete URL. This function uses :func:`unquote` - to decode *path*. - - -URL Opener objects ------------------- - -.. class:: URLopener([proxies[, **x509]]) - - Base class for opening and reading URLs. Unless you need to support opening - objects using schemes other than :file:`http:`, :file:`ftp:`, or :file:`file:`, - you probably want to use :class:`FancyURLopener`. - - By default, the :class:`URLopener` class sends a :mailheader:`User-Agent` header - of ``urllib/VVV``, where *VVV* is the :mod:`urllib` version number. - Applications can define their own :mailheader:`User-Agent` header by subclassing - :class:`URLopener` or :class:`FancyURLopener` and setting the class attribute - :attr:`version` to an appropriate string value in the subclass definition. - - The optional *proxies* parameter should be a dictionary mapping scheme names to - proxy URLs, where an empty dictionary turns proxies off completely. Its default - value is ``None``, in which case environmental proxy settings will be used if - present, as discussed in the definition of :func:`urlopen`, above. - - Additional keyword parameters, collected in *x509*, may be used for - authentication of the client when using the :file:`https:` scheme. The keywords - *key_file* and *cert_file* are supported to provide an SSL key and certificate; - both are needed to support client authentication. - - :class:`URLopener` objects will raise an :exc:`IOError` exception if the server - returns an error code. - - .. method:: open(fullurl[, data]) - - Open *fullurl* using the appropriate protocol. 
This method sets up cache and - proxy information, then calls the appropriate open method with its input - arguments. If the scheme is not recognized, :meth:`open_unknown` is called. - The *data* argument has the same meaning as the *data* argument of - :func:`urlopen`. - - - .. method:: open_unknown(fullurl[, data]) - - Overridable interface to open unknown URL types. - - - .. method:: retrieve(url[, filename[, reporthook[, data]]]) - - Retrieves the contents of *url* and places it in *filename*. The return value - is a tuple consisting of a local filename and either a - :class:`email.message.Message` object containing the response headers (for remote - URLs) or ``None`` (for local URLs). The caller must then open and read the - contents of *filename*. If *filename* is not given and the URL refers to a - local file, the input filename is returned. If the URL is non-local and - *filename* is not given, the filename is the output of :func:`tempfile.mktemp` - with a suffix that matches the suffix of the last path component of the input - URL. If *reporthook* is given, it must be a function accepting three numeric - parameters. It will be called after each chunk of data is read from the - network. *reporthook* is ignored for local URLs. - - If the *url* uses the :file:`http:` scheme identifier, the optional *data* - argument may be given to specify a ``POST`` request (normally the request type - is ``GET``). The *data* argument must in standard - :mimetype:`application/x-www-form-urlencoded` format; see the :func:`urlencode` - function below. - - - .. attribute:: version - - Variable that specifies the user agent of the opener object. To get - :mod:`urllib` to tell servers that it is a particular user agent, set this in a - subclass as a class variable or in the constructor before calling the base - constructor. - - -.. class:: FancyURLopener(...) 
- - :class:`FancyURLopener` subclasses :class:`URLopener` providing default handling - for the following HTTP response codes: 301, 302, 303, 307 and 401. For the 30x - response codes listed above, the :mailheader:`Location` header is used to fetch - the actual URL. For 401 response codes (authentication required), basic HTTP - authentication is performed. For the 30x response codes, recursion is bounded - by the value of the *maxtries* attribute, which defaults to 10. - - For all other response codes, the method :meth:`http_error_default` is called - which you can override in subclasses to handle the error appropriately. - - .. note:: - - According to the letter of :rfc:`2616`, 301 and 302 responses to POST requests - must not be automatically redirected without confirmation by the user. In - reality, browsers do allow automatic redirection of these responses, changing - the POST to a GET, and :mod:`urllib` reproduces this behaviour. - - The parameters to the constructor are the same as those for :class:`URLopener`. - - .. note:: - - When performing basic authentication, a :class:`FancyURLopener` instance calls - its :meth:`prompt_user_passwd` method. The default implementation asks the - users for the required information on the controlling terminal. A subclass may - override this method to support more appropriate behavior if needed. - - The :class:`FancyURLopener` class offers one additional method that should be - overloaded to provide the appropriate behavior: - - .. method:: prompt_user_passwd(host, realm) - - Return information needed to authenticate the user at the given host in the - specified security realm. The return value should be a tuple, ``(user, - password)``, which can be used for basic authentication. - - The implementation prompts for this information on the terminal; an application - should override this method to use an appropriate interaction model in the local - environment. - -.. 
exception:: ContentTooShortError(msg[, content]) - - This exception is raised when the :func:`urlretrieve` function detects that the - amount of the downloaded data is less than the expected amount (given by the - *Content-Length* header). The :attr:`content` attribute stores the downloaded - (and supposedly truncated) data. - - -:mod:`urllib` Restrictions -------------------------- - - .. index:: - pair: HTTP; protocol - pair: FTP; protocol - -* Currently, only the following protocols are supported: HTTP (versions 0.9 and - 1.0), FTP, and local files. - -* The caching feature of :func:`urlretrieve` has been disabled until I find the - time to hack proper processing of Expiration time headers. - -* There should be a function to query whether a particular URL is in the cache. - -* For backward compatibility, if a URL appears to point to a local file but the - file can't be opened, the URL is re-interpreted using the FTP protocol. This - can sometimes cause confusing error messages. - -* The :func:`urlopen` and :func:`urlretrieve` functions can cause arbitrarily - long delays while waiting for a network connection to be set up. This means - that it is difficult to build an interactive Web client using these functions - without using threads. - - .. index:: - single: HTML - pair: HTTP; protocol - -* The data returned by :func:`urlopen` or :func:`urlretrieve` is the raw data - returned by the server. This may be binary data (such as an image), plain text - or (for example) HTML. The HTTP protocol provides type information in the reply - header, which can be inspected by looking at the :mailheader:`Content-Type` - header. If the returned data is HTML, you can use the module - :mod:`html.parser` to parse it. - - .. index:: single: FTP - -* The code handling the FTP protocol cannot differentiate between a file and a - directory. This can lead to unexpected behavior when attempting to read a URL - that points to a file that is not accessible.
If the URL ends in a ``/``, it is - assumed to refer to a directory and will be handled accordingly. But if an - attempt to read a file leads to a 550 error (meaning the URL cannot be found or - is not accessible, often for permission reasons), then the path is treated as a - directory in order to handle the case when a directory is specified by a URL but - the trailing ``/`` has been left off. This can cause misleading results when - you try to fetch a file whose read permissions make it inaccessible; the FTP - code will try to read it, fail with a 550 error, and then perform a directory - listing for the unreadable file. If fine-grained control is needed, consider - using the :mod:`ftplib` module, subclassing :class:`FancyURLopener`, or changing - *_urlopener* to meet your needs. - -* This module does not support the use of proxies which require authentication. - This may be implemented in the future. - - .. index:: module: urlparse - -* Although the :mod:`urllib` module contains (undocumented) routines to parse - and unparse URL strings, the recommended interface for URL manipulation is in - module :mod:`urlparse`. - - -..
_urllib-examples: - -Examples --------- - -Here is an example session that uses the ``GET`` method to retrieve a URL -containing parameters:: - - >>> import urllib - >>> params = urllib.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0}) - >>> f = urllib.urlopen("http://www.musi-cal.com/cgi-bin/query?%s" % params) - >>> print(f.read()) - -The following example uses the ``POST`` method instead:: - - >>> import urllib - >>> params = urllib.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0}) - >>> f = urllib.urlopen("http://www.musi-cal.com/cgi-bin/query", params) - >>> print(f.read()) - -The following example uses an explicitly specified HTTP proxy, overriding -environment settings:: - - >>> import urllib - >>> proxies = {'http': 'http://proxy.example.com:8080/'} - >>> opener = urllib.FancyURLopener(proxies) - >>> f = opener.open("http://www.python.org") - >>> f.read() - -The following example uses no proxies at all, overriding environment settings:: - - >>> import urllib - >>> opener = urllib.FancyURLopener({}) - >>> f = opener.open("http://www.python.org/") - >>> f.read() - diff --git a/Doc/library/urllib2.rst b/Doc/library/urllib2.rst deleted file mode 100644 index 06dbb44..0000000 --- a/Doc/library/urllib2.rst +++ /dev/null @@ -1,934 +0,0 @@ -:mod:`urllib2` --- extensible library for opening URLs -====================================================== - -.. module:: urllib2 - :synopsis: Next generation URL opening library. -.. moduleauthor:: Jeremy Hylton -.. sectionauthor:: Moshe Zadka - - -The :mod:`urllib2` module defines functions and classes which help in opening -URLs (mostly HTTP) in a complex world --- basic and digest authentication, -redirections, cookies and more. - -The :mod:`urllib2` module defines the following functions: - - -.. function:: urlopen(url[, data][, timeout]) - - Open the URL *url*, which can be either a string or a :class:`Request` object. 
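Under the package split this patch documents, ``urlopen`` now lives in :mod:`urllib.request`. As a minimal offline sketch of the file-like return value (assuming a modern Python, 3.4+, where the default opener chain also resolves ``data:`` URLs, so no network connection is needed):

```python
import urllib.request

# A data: URL exercises urlopen without network access; data: support
# in the default opener chain is an assumption (added in Python 3.4).
f = urllib.request.urlopen('data:text/plain,Hello')
print(f.read())    # the payload, as bytes
print(f.geturl())  # the URL of the resource retrieved
```

The same ``read``/``geturl``/``info`` interface is returned for ``http:`` URLs; the ``data:`` scheme is used here only so the sketch runs without a server.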
- - *data* may be a string specifying additional data to send to the server, or - ``None`` if no such data is needed. Currently HTTP requests are the only ones - that use *data*; the HTTP request will be a POST instead of a GET when the - *data* parameter is provided. *data* should be a buffer in the standard - :mimetype:`application/x-www-form-urlencoded` format. The - :func:`urllib.urlencode` function takes a mapping or sequence of 2-tuples and - returns a string in this format. - - The optional *timeout* parameter specifies a timeout in seconds for blocking - operations like the connection attempt (if not specified, the global default - timeout setting will be used). This actually only works for HTTP, HTTPS, - FTP and FTPS connections. - - This function returns a file-like object with two additional methods: - - * :meth:`geturl` --- return the URL of the resource retrieved, commonly used to - determine if a redirect was followed - - * :meth:`info` --- return the meta-information of the page, such as headers, in - the form of an ``http.client.HTTPMessage`` instance - (see `Quick Reference to HTTP Headers `_) - - Raises :exc:`URLError` on errors. - - Note that ``None`` may be returned if no handler handles the request (though the - default installed global :class:`OpenerDirector` uses :class:`UnknownHandler` to - ensure this never happens). - - -.. function:: install_opener(opener) - - Install an :class:`OpenerDirector` instance as the default global opener. - Installing an opener is only necessary if you want urlopen to use that opener; - otherwise, simply call :meth:`OpenerDirector.open` instead of :func:`urlopen`. - The code does not check for a real :class:`OpenerDirector`, and any class with - the appropriate interface will work. - - -.. function:: build_opener([handler, ...]) - - Return an :class:`OpenerDirector` instance, which chains the handlers in the - order given. 
*handler*\s can be either instances of :class:`BaseHandler`, or - subclasses of :class:`BaseHandler` (in which case it must be possible to call - the constructor without any parameters). Instances of the following classes - will be in front of the *handler*\s, unless the *handler*\s contain them, - instances of them or subclasses of them: :class:`ProxyHandler`, - :class:`UnknownHandler`, :class:`HTTPHandler`, :class:`HTTPDefaultErrorHandler`, - :class:`HTTPRedirectHandler`, :class:`FTPHandler`, :class:`FileHandler`, - :class:`HTTPErrorProcessor`. - - If the Python installation has SSL support (i.e., if the :mod:`ssl` module can be imported), - :class:`HTTPSHandler` will also be added. - - A :class:`BaseHandler` subclass may also change its :attr:`handler_order` - member variable to modify its position in the handlers list. - -The following exceptions are raised as appropriate: - - -.. exception:: URLError - - The handlers raise this exception (or derived exceptions) when they run into a - problem. It is a subclass of :exc:`IOError`. - - .. attribute:: reason - - The reason for this error. It can be a message string or another exception - instance (:exc:`socket.error` for remote URLs, :exc:`OSError` for local - URLs). - - -.. exception:: HTTPError - - Though being an exception (a subclass of :exc:`URLError`), an :exc:`HTTPError` - can also function as a non-exceptional file-like return value (the same thing - that :func:`urlopen` returns). This is useful when handling exotic HTTP - errors, such as requests for authentication. - - .. attribute:: code - - An HTTP status code as defined in `RFC 2616 `_. - This numeric value corresponds to a value found in the dictionary of - codes as found in :attr:`http.server.BaseHTTPRequestHandler.responses`. - - - -The following classes are provided: - - -.. class:: Request(url[, data][, headers][, origin_req_host][, unverifiable]) - - This class is an abstraction of a URL request. 
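In the renamed :mod:`urllib.request`, the GET-versus-POST effect of the *data* argument can be checked offline by inspecting a :class:`Request` object (a sketch; ``example.invalid`` is a placeholder host and nothing is actually fetched):

```python
import urllib.request

# Without data the request is a GET; a bytes body switches it to POST.
get_req = urllib.request.Request('http://example.invalid/')
post_req = urllib.request.Request('http://example.invalid/',
                                  data=b'spam=1&eggs=2')
print(get_req.get_method())   # 'GET'
print(post_req.get_method())  # 'POST'
```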
- - *url* should be a string containing a valid URL. - - *data* may be a string specifying additional data to send to the server, or - ``None`` if no such data is needed. Currently HTTP requests are the only ones - that use *data*; the HTTP request will be a POST instead of a GET when the - *data* parameter is provided. *data* should be a buffer in the standard - :mimetype:`application/x-www-form-urlencoded` format. The - :func:`urllib.urlencode` function takes a mapping or sequence of 2-tuples and - returns a string in this format. - - *headers* should be a dictionary, and will be treated as if :meth:`add_header` - was called with each key and value as arguments. This is often used to "spoof" - the ``User-Agent`` header, which is used by a browser to identify itself -- - some HTTP servers only allow requests coming from common browsers as opposed - to scripts. For example, Mozilla Firefox may identify itself as ``"Mozilla/5.0 - (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11"``, while :mod:`urllib2`'s - default user agent string is ``"Python-urllib/2.6"`` (on Python 2.6). - - The final two arguments are only of interest for correct handling of third-party - HTTP cookies: - - *origin_req_host* should be the request-host of the origin transaction, as - defined by :rfc:`2965`. It defaults to ``http.cookiejar.request_host(self)``. - This is the host name or IP address of the original request that was - initiated by the user. For example, if the request is for an image in an - HTML document, this should be the request-host of the request for the page - containing the image. - - *unverifiable* should indicate whether the request is unverifiable, as defined - by RFC 2965. It defaults to False. An unverifiable request is one whose URL - the user did not have the option to approve. For example, if the request is for - an image in an HTML document, and the user had no option to approve the - automatic fetching of the image, this should be true. - - -.. 
class:: OpenerDirector() - - The :class:`OpenerDirector` class opens URLs via :class:`BaseHandler`\ s chained - together. It manages the chaining of handlers, and recovery from errors. - - -.. class:: BaseHandler() - - This is the base class for all registered handlers --- and handles only the - simple mechanics of registration. - - -.. class:: HTTPDefaultErrorHandler() - - A class which defines a default handler for HTTP error responses; all responses - are turned into :exc:`HTTPError` exceptions. - - -.. class:: HTTPRedirectHandler() - - A class to handle redirections. - - -.. class:: HTTPCookieProcessor([cookiejar]) - - A class to handle HTTP Cookies. - - -.. class:: ProxyHandler([proxies]) - - Cause requests to go through a proxy. If *proxies* is given, it must be a - dictionary mapping protocol names to URLs of proxies. The default is to read the - list of proxies from the environment variables :envvar:`<protocol>_proxy`. - To disable autodetected proxies, pass an empty dictionary. - - -.. class:: HTTPPasswordMgr() - - Keep a database of ``(realm, uri) -> (user, password)`` mappings. - - -.. class:: HTTPPasswordMgrWithDefaultRealm() - - Keep a database of ``(realm, uri) -> (user, password)`` mappings. A realm of - ``None`` is considered a catch-all realm, which is searched if no other realm - fits. - - -.. class:: AbstractBasicAuthHandler([password_mgr]) - - This is a mixin class that helps with HTTP authentication, both to the remote - host and to a proxy. *password_mgr*, if given, should be something that is - compatible with :class:`HTTPPasswordMgr`; refer to section - :ref:`http-password-mgr` for information on the interface that must be - supported. - - -.. class:: HTTPBasicAuthHandler([password_mgr]) - - Handle authentication with the remote host. *password_mgr*, if given, should be - something that is compatible with :class:`HTTPPasswordMgr`; refer to section - :ref:`http-password-mgr` for information on the interface that must be - supported. - - -..
class:: ProxyBasicAuthHandler([password_mgr]) - - Handle authentication with the proxy. *password_mgr*, if given, should be - something that is compatible with :class:`HTTPPasswordMgr`; refer to section - :ref:`http-password-mgr` for information on the interface that must be - supported. - - -.. class:: AbstractDigestAuthHandler([password_mgr]) - - This is a mixin class that helps with HTTP authentication, both to the remote - host and to a proxy. *password_mgr*, if given, should be something that is - compatible with :class:`HTTPPasswordMgr`; refer to section - :ref:`http-password-mgr` for information on the interface that must be - supported. - - -.. class:: HTTPDigestAuthHandler([password_mgr]) - - Handle authentication with the remote host. *password_mgr*, if given, should be - something that is compatible with :class:`HTTPPasswordMgr`; refer to section - :ref:`http-password-mgr` for information on the interface that must be - supported. - - -.. class:: ProxyDigestAuthHandler([password_mgr]) - - Handle authentication with the proxy. *password_mgr*, if given, should be - something that is compatible with :class:`HTTPPasswordMgr`; refer to section - :ref:`http-password-mgr` for information on the interface that must be - supported. - - -.. class:: HTTPHandler() - - A class to handle opening of HTTP URLs. - - -.. class:: HTTPSHandler() - - A class to handle opening of HTTPS URLs. - - -.. class:: FileHandler() - - Open local files. - - -.. class:: FTPHandler() - - Open FTP URLs. - - -.. class:: CacheFTPHandler() - - Open FTP URLs, keeping a cache of open FTP connections to minimize delays. - - -.. class:: UnknownHandler() - - A catch-all class to handle unknown URLs. - - -.. _request-objects: - -Request Objects ---------------- - -The following methods describe all of :class:`Request`'s public interface, and -so all must be overridden in subclasses. - - -.. method:: Request.add_data(data) - - Set the :class:`Request` data to *data*. 
This is ignored by all handlers except - HTTP handlers --- and there it should be a byte string, and will change the - request to be ``POST`` rather than ``GET``. - - -.. method:: Request.get_method() - - Return a string indicating the HTTP request method. This is only meaningful for - HTTP requests, and currently always returns ``'GET'`` or ``'POST'``. - - -.. method:: Request.has_data() - - Return whether the instance has a non-\ ``None`` data. - - -.. method:: Request.get_data() - - Return the instance's data. - - -.. method:: Request.add_header(key, val) - - Add another header to the request. Headers are currently ignored by all - handlers except HTTP handlers, where they are added to the list of headers sent - to the server. Note that there cannot be more than one header with the same - name, and later calls will overwrite previous calls in case the *key* collides. - Currently, this is no loss of HTTP functionality, since all headers which have - meaning when used more than once have a (header-specific) way of gaining the - same functionality using only one header. - - -.. method:: Request.add_unredirected_header(key, header) - - Add a header that will not be added to a redirected request. - - -.. method:: Request.has_header(header) - - Return whether the instance has the named header (checks both regular and - unredirected). - - -.. method:: Request.get_full_url() - - Return the URL given in the constructor. - - -.. method:: Request.get_type() - - Return the type of the URL --- also known as the scheme. - - -.. method:: Request.get_host() - - Return the host to which a connection will be made. - - -.. method:: Request.get_selector() - - Return the selector --- the part of the URL that is sent to the server. - - -.. method:: Request.set_proxy(host, type) - - Prepare the request by connecting to a proxy server. The *host* and *type* will - replace those of the instance, and the instance's selector will be the original - URL given in the constructor. - - -.. 
method:: Request.get_origin_req_host() - - Return the request-host of the origin transaction, as defined by :rfc:`2965`. - See the documentation for the :class:`Request` constructor. - - -.. method:: Request.is_unverifiable() - - Return whether the request is unverifiable, as defined by RFC 2965. See the - documentation for the :class:`Request` constructor. - - -.. _opener-director-objects: - -OpenerDirector Objects ----------------------- - -:class:`OpenerDirector` instances have the following methods: - - -.. method:: OpenerDirector.add_handler(handler) - - *handler* should be an instance of :class:`BaseHandler`. The following methods - are searched, and added to the possible chains (note that HTTP errors are a - special case). - - * :meth:`protocol_open` --- signal that the handler knows how to open *protocol* - URLs. - - * :meth:`http_error_type` --- signal that the handler knows how to handle HTTP - errors with HTTP error code *type*. - - * :meth:`protocol_error` --- signal that the handler knows how to handle errors - from (non-\ ``http``) *protocol*. - - * :meth:`protocol_request` --- signal that the handler knows how to pre-process - *protocol* requests. - - * :meth:`protocol_response` --- signal that the handler knows how to - post-process *protocol* responses. - - -.. method:: OpenerDirector.open(url[, data][, timeout]) - - Open the given *url* (which can be a request object or a string), optionally - passing the given *data*. Arguments, return values and exceptions raised are - the same as those of :func:`urlopen` (which simply calls the :meth:`open` - method on the currently installed global :class:`OpenerDirector`). The - optional *timeout* parameter specifies a timeout in seconds for blocking - operations like the connection attempt (if not specified, the global default - timeout setting will be used). The timeout feature actually works only for - HTTP, HTTPS, FTP and FTPS connections. - - -..
method:: OpenerDirector.error(proto[, arg[, ...]]) - - Handle an error of the given protocol. This will call the registered error - handlers for the given protocol with the given arguments (which are protocol - specific). The HTTP protocol is a special case which uses the HTTP response - code to determine the specific error handler; refer to the :meth:`http_error_\*` - methods of the handler classes. - - Return values and exceptions raised are the same as those of :func:`urlopen`. - -OpenerDirector objects open URLs in three stages: - -The order in which these methods are called within each stage is determined by -sorting the handler instances. - -#. Every handler with a method named like :meth:`protocol_request` has that - method called to pre-process the request. - -#. Handlers with a method named like :meth:`protocol_open` are called to handle - the request. This stage ends when a handler either returns a non-\ :const:`None` - value (i.e. a response), or raises an exception (usually :exc:`URLError`). - Exceptions are allowed to propagate. - - In fact, the above algorithm is first tried for methods named - :meth:`default_open`. If all such methods return :const:`None`, the algorithm - is repeated for methods named like :meth:`protocol_open`. If all such methods - return :const:`None`, the algorithm is repeated for methods named - :meth:`unknown_open`. - - Note that the implementation of these methods may involve calls of the parent - :class:`OpenerDirector` instance's :meth:`.open` and :meth:`.error` methods. - -#. Every handler with a method named like :meth:`protocol_response` has that - method called to post-process the response. - - -.. _base-handler-objects: - -BaseHandler Objects ------------------- - -:class:`BaseHandler` objects provide a couple of methods that are directly -useful, and others that are meant to be used by derived classes. These are -intended for direct use: - - -.. method:: BaseHandler.add_parent(director) - - Add a director as parent.
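The ``protocol_open`` dispatch described above carries over to the renamed :mod:`urllib.request` package; in this illustrative sketch, the ``echo`` scheme and ``EchoHandler`` are invented names, not part of the library:

```python
import io
import urllib.request

class EchoHandler(urllib.request.BaseHandler):
    # A method named <protocol>_open registers this handler for that
    # protocol when the handler is added to an opener.
    def echo_open(self, req):
        # Hypothetical scheme: just echo the requested URL back as bytes.
        return io.BytesIO(req.full_url.encode('ascii'))

opener = urllib.request.build_opener(EchoHandler)
print(opener.open('echo://demo').read())  # b'echo://demo'
```

Because no ``default_open`` in the chain claims the request, the opener falls through to the ``echo_open`` stage, matching the three-stage algorithm above.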
- - -.. method:: BaseHandler.close() - - Remove any parents. - -The following members and methods should only be used by classes derived from -:class:`BaseHandler`. - -.. note:: - - The convention has been adopted that subclasses defining - :meth:`protocol_request` or :meth:`protocol_response` methods are named - :class:`\*Processor`; all others are named :class:`\*Handler`. - - -.. attribute:: BaseHandler.parent - - A valid :class:`OpenerDirector`, which can be used to open using a different - protocol, or handle errors. - - -.. method:: BaseHandler.default_open(req) - - This method is *not* defined in :class:`BaseHandler`, but subclasses should - define it if they want to catch all URLs. - - This method, if implemented, will be called by the parent - :class:`OpenerDirector`. It should return a file-like object as described in - the return value of the :meth:`open` of :class:`OpenerDirector`, or ``None``. - It should raise :exc:`URLError`, unless a truly exceptional thing happens (for - example, :exc:`MemoryError` should not be mapped to :exc:`URLError`). - - This method will be called before any protocol-specific open method. - - -.. method:: BaseHandler.protocol_open(req) - :noindex: - - This method is *not* defined in :class:`BaseHandler`, but subclasses should - define it if they want to handle URLs with the given protocol. - - This method, if defined, will be called by the parent :class:`OpenerDirector`. - Return values should be the same as for :meth:`default_open`. - - -.. method:: BaseHandler.unknown_open(req) - - This method is *not* defined in :class:`BaseHandler`, but subclasses should - define it if they want to catch all URLs with no specific registered handler to - open it. - - This method, if implemented, will be called by the :attr:`parent` - :class:`OpenerDirector`. Return values should be the same as for - :meth:`default_open`. - - -.. 
method:: BaseHandler.http_error_default(req, fp, code, msg, hdrs) - - This method is *not* defined in :class:`BaseHandler`, but subclasses should - override it if they intend to provide a catch-all for otherwise unhandled HTTP - errors. It will be called automatically by the :class:`OpenerDirector` getting - the error, and should not normally be called in other circumstances. - - *req* will be a :class:`Request` object, *fp* will be a file-like object with - the HTTP error body, *code* will be the three-digit code of the error, *msg* - will be the user-visible explanation of the code and *hdrs* will be a mapping - object with the headers of the error. - - Return values and exceptions raised should be the same as those of - :func:`urlopen`. - - -.. method:: BaseHandler.http_error_nnn(req, fp, code, msg, hdrs) - - *nnn* should be a three-digit HTTP error code. This method is also not defined - in :class:`BaseHandler`, but will be called, if it exists, on an instance of a - subclass, when an HTTP error with code *nnn* occurs. - - Subclasses should override this method to handle specific HTTP errors. - - Arguments, return values and exceptions raised should be the same as for - :meth:`http_error_default`. - - -.. method:: BaseHandler.protocol_request(req) - :noindex: - - This method is *not* defined in :class:`BaseHandler`, but subclasses should - define it if they want to pre-process requests of the given protocol. - - This method, if defined, will be called by the parent :class:`OpenerDirector`. - *req* will be a :class:`Request` object. The return value should be a - :class:`Request` object. - - -.. method:: BaseHandler.protocol_response(req, response) - :noindex: - - This method is *not* defined in :class:`BaseHandler`, but subclasses should - define it if they want to post-process responses of the given protocol. - - This method, if defined, will be called by the parent :class:`OpenerDirector`. - *req* will be a :class:`Request` object. 
*response* will be an object - implementing the same interface as the return value of :func:`urlopen`. The - return value should implement the same interface as the return value of - :func:`urlopen`. - - -.. _http-redirect-handler: - -HTTPRedirectHandler Objects ---------------------------- - -.. note:: - - Some HTTP redirections require action from this module's client code. If this - is the case, :exc:`HTTPError` is raised. See :rfc:`2616` for details of the - precise meanings of the various redirection codes. - - -.. method:: HTTPRedirectHandler.redirect_request(req, fp, code, msg, hdrs) - - Return a :class:`Request` or ``None`` in response to a redirect. This is called - by the default implementations of the :meth:`http_error_30\*` methods when a - redirection is received from the server. If a redirection should take place, - return a new :class:`Request` to allow :meth:`http_error_30\*` to perform the - redirect. Otherwise, raise :exc:`HTTPError` if no other handler should try to - handle this URL, or return ``None`` if you can't but another handler might. - - .. note:: - - The default implementation of this method does not strictly follow :rfc:`2616`, - which says that 301 and 302 responses to ``POST`` requests must not be - automatically redirected without confirmation by the user. In reality, browsers - do allow automatic redirection of these responses, changing the POST to a - ``GET``, and the default implementation reproduces this behavior. - - -.. method:: HTTPRedirectHandler.http_error_301(req, fp, code, msg, hdrs) - - Redirect to the ``Location:`` URL. This method is called by the parent - :class:`OpenerDirector` when getting an HTTP 'moved permanently' response. - - -.. method:: HTTPRedirectHandler.http_error_302(req, fp, code, msg, hdrs) - - The same as :meth:`http_error_301`, but called for the 'found' response. - - -.. 
method:: HTTPRedirectHandler.http_error_303(req, fp, code, msg, hdrs) - - The same as :meth:`http_error_301`, but called for the 'see other' response. - - -.. method:: HTTPRedirectHandler.http_error_307(req, fp, code, msg, hdrs) - - The same as :meth:`http_error_301`, but called for the 'temporary redirect' - response. - - -.. _http-cookie-processor: - -HTTPCookieProcessor Objects ---------------------------- - -:class:`HTTPCookieProcessor` instances have one attribute: - -.. attribute:: HTTPCookieProcessor.cookiejar - - The :class:`http.cookiejar.CookieJar` in which cookies are stored. - - -.. _proxy-handler: - -ProxyHandler Objects --------------------- - - -.. method:: ProxyHandler.protocol_open(request) - :noindex: - - The :class:`ProxyHandler` will have a method :meth:`protocol_open` for every - *protocol* which has a proxy in the *proxies* dictionary given in the - constructor. The method will modify requests to go through the proxy, by - calling ``request.set_proxy()``, and call the next handler in the chain to - actually execute the protocol. - - -.. _http-password-mgr: - -HTTPPasswordMgr Objects ------------------------ - -These methods are available on :class:`HTTPPasswordMgr` and -:class:`HTTPPasswordMgrWithDefaultRealm` objects. - - -.. method:: HTTPPasswordMgr.add_password(realm, uri, user, passwd) - - *uri* can be either a single URI, or a sequence of URIs. *realm*, *user* and - *passwd* must be strings. This causes ``(user, passwd)`` to be used as - authentication tokens when authentication for *realm* and a super-URI of any of - the given URIs is given. - - -.. method:: HTTPPasswordMgr.find_user_password(realm, authuri) - - Get user/password for given realm and URI, if any. This method will return - ``(None, None)`` if there is no matching user/password. - - For :class:`HTTPPasswordMgrWithDefaultRealm` objects, the realm ``None`` will be - searched if the given *realm* has no matching user/password. - - -.. 
_abstract-basic-auth-handler:
-
-AbstractBasicAuthHandler Objects
---------------------------------
-
-
-.. method:: AbstractBasicAuthHandler.http_error_auth_reqed(authreq, host, req, headers)
-
-   Handle an authentication request by getting a user/password pair, and re-trying
-   the request.  *authreq* should be the name of the header where the information
-   about the realm is included in the request, *host* specifies the URL and path to
-   authenticate for, *req* should be the (failed) :class:`Request` object, and
-   *headers* should be the error headers.
-
-   *host* is either an authority (e.g. ``"python.org"``) or a URL containing an
-   authority component (e.g. ``"http://python.org/"``). In either case, the
-   authority must not contain a userinfo component (so, ``"python.org"`` and
-   ``"python.org:80"`` are fine, ``"joe:password@python.org"`` is not).
-
-
-.. _http-basic-auth-handler:
-
-HTTPBasicAuthHandler Objects
-----------------------------
-
-
-.. method:: HTTPBasicAuthHandler.http_error_401(req, fp, code, msg, hdrs)
-
-   Retry the request with authentication information, if available.
-
-
-.. _proxy-basic-auth-handler:
-
-ProxyBasicAuthHandler Objects
------------------------------
-
-
-.. method:: ProxyBasicAuthHandler.http_error_407(req, fp, code, msg, hdrs)
-
-   Retry the request with authentication information, if available.
-
-
-.. _abstract-digest-auth-handler:
-
-AbstractDigestAuthHandler Objects
----------------------------------
-
-
-.. method:: AbstractDigestAuthHandler.http_error_auth_reqed(authreq, host, req, headers)
-
-   *authreq* should be the name of the header where the information about the realm
-   is included in the request, *host* should be the host to authenticate to, *req*
-   should be the (failed) :class:`Request` object, and *headers* should be the
-   error headers.
-
-
-.. _http-digest-auth-handler:
-
-HTTPDigestAuthHandler Objects
------------------------------
-
-
-.. method:: HTTPDigestAuthHandler.http_error_401(req, fp, code, msg, hdrs)
-
-   Retry the request with authentication information, if available.
-
-
-.. _proxy-digest-auth-handler:
-
-ProxyDigestAuthHandler Objects
-------------------------------
-
-
-.. method:: ProxyDigestAuthHandler.http_error_407(req, fp, code, msg, hdrs)
-
-   Retry the request with authentication information, if available.
-
-
-.. _http-handler-objects:
-
-HTTPHandler Objects
--------------------
-
-
-.. method:: HTTPHandler.http_open(req)
-
-   Send an HTTP request, which can be either GET or POST, depending on
-   ``req.has_data()``.
-
-
-.. _https-handler-objects:
-
-HTTPSHandler Objects
---------------------
-
-
-.. method:: HTTPSHandler.https_open(req)
-
-   Send an HTTPS request, which can be either GET or POST, depending on
-   ``req.has_data()``.
-
-
-.. _file-handler-objects:
-
-FileHandler Objects
--------------------
-
-
-.. method:: FileHandler.file_open(req)
-
-   Open the file locally, if there is no host name, or the host name is
-   ``'localhost'``. Change the protocol to ``ftp`` otherwise, and retry opening it
-   using :attr:`parent`.
-
-
-.. _ftp-handler-objects:
-
-FTPHandler Objects
-------------------
-
-
-.. method:: FTPHandler.ftp_open(req)
-
-   Open the FTP file indicated by *req*. The login is always done with empty
-   username and password.
-
-
-.. _cacheftp-handler-objects:
-
-CacheFTPHandler Objects
------------------------
-
-:class:`CacheFTPHandler` objects are :class:`FTPHandler` objects with the
-following additional methods:
-
-
-.. method:: CacheFTPHandler.setTimeout(t)
-
-   Set timeout of connections to *t* seconds.
-
-
-.. method:: CacheFTPHandler.setMaxConns(m)
-
-   Set maximum number of cached connections to *m*.
-
-
-.. _unknown-handler-objects:
-
-UnknownHandler Objects
-----------------------
-
-
-.. method:: UnknownHandler.unknown_open()
-
-   Raise a :exc:`URLError` exception.
-
-
-.. _http-error-processor-objects:
-
-HTTPErrorProcessor Objects
---------------------------
-
-.. method:: HTTPErrorProcessor.http_response(request, response)
-
-   Process HTTP error responses.
-
-   For 200 error codes, the response object is returned immediately.
-
-   For non-200 error codes, this simply passes the job on to the
-   :meth:`protocol_error_code` handler methods, via :meth:`OpenerDirector.error`.
-   Eventually, :class:`urllib2.HTTPDefaultErrorHandler` will raise an
-   :exc:`HTTPError` if no other handler handles the error.
-
-
-.. _urllib2-examples:
-
-Examples
---------
-
-This example gets the python.org main page and displays the first 100 bytes of
-it::
-
-   >>> import urllib2
-   >>> f = urllib2.urlopen('http://www.python.org/')
-   >>> print(f.read(100))
-
-Here we are sending a data-stream to the stdin of a CGI and reading the data it
-returns to us. Note that this example will only work when the Python
-installation supports SSL. ::
-
-   >>> import urllib2
-   >>> req = urllib2.Request(url='https://localhost/cgi-bin/test.cgi',
-   ...                       data='This data is passed to stdin of the CGI')
-   >>> f = urllib2.urlopen(req)
-   >>> print(f.read())
-   Got Data: "This data is passed to stdin of the CGI"
-
-The code for the sample CGI used in the above example is::
-
-   #!/usr/bin/env python
-   import sys
-   data = sys.stdin.read()
-   print('Content-type: text/plain\n\nGot Data: "%s"' % data)
-
-Use of Basic HTTP Authentication::
-
-   import urllib2
-   # Create an OpenerDirector with support for Basic HTTP Authentication...
-   auth_handler = urllib2.HTTPBasicAuthHandler()
-   auth_handler.add_password(realm='PDQ Application',
-                             uri='https://mahler:8092/site-updates.py',
-                             user='klem',
-                             passwd='kadidd!ehopper')
-   opener = urllib2.build_opener(auth_handler)
-   # ...and install it globally so it can be used with urlopen.
-   urllib2.install_opener(opener)
-   urllib2.urlopen('http://www.example.com/login.html')
-
-:func:`build_opener` provides many handlers by default, including a
-:class:`ProxyHandler`. By default, :class:`ProxyHandler` uses the environment
-variables named ``<scheme>_proxy``, where ``<scheme>`` is the URL scheme
-involved. For example, the :envvar:`http_proxy` environment variable is read to
-obtain the HTTP proxy's URL.
-
-This example replaces the default :class:`ProxyHandler` with one that uses
-programmatically-supplied proxy URLs, and adds proxy authorization support with
-:class:`ProxyBasicAuthHandler`. ::
-
-   proxy_handler = urllib2.ProxyHandler({'http': 'http://www.example.com:3128/'})
-   proxy_auth_handler = urllib2.HTTPBasicAuthHandler()
-   proxy_auth_handler.add_password('realm', 'host', 'username', 'password')
-
-   opener = urllib2.build_opener(proxy_handler, proxy_auth_handler)
-   # This time, rather than install the OpenerDirector, we use it directly:
-   opener.open('http://www.example.com/login.html')
-
-Adding HTTP headers:
-
-Use the *headers* argument to the :class:`Request` constructor, or::
-
-   import urllib2
-   req = urllib2.Request('http://www.example.com/')
-   req.add_header('Referer', 'http://www.python.org/')
-   r = urllib2.urlopen(req)
-
-:class:`OpenerDirector` automatically adds a :mailheader:`User-Agent` header to
-every :class:`Request`. To change this::
-
-   import urllib2
-   opener = urllib2.build_opener()
-   opener.addheaders = [('User-agent', 'Mozilla/5.0')]
-   opener.open('http://www.example.com/')
-
-Also, remember that a few standard headers (:mailheader:`Content-Length`,
-:mailheader:`Content-Type` and :mailheader:`Host`) are added when the
-:class:`Request` is passed to :func:`urlopen` (or :meth:`OpenerDirector.open`).
-
diff --git a/Doc/library/urlparse.rst b/Doc/library/urlparse.rst
deleted file mode 100644
index e305e0b..0000000
--- a/Doc/library/urlparse.rst
+++ /dev/null
@@ -1,255 +0,0 @@
-:mod:`urlparse` --- Parse URLs into components
-==============================================
-
-.. module:: urlparse
-   :synopsis: Parse URLs into or assemble them from components.
-
-
-.. index::
-   single: WWW
-   single: World Wide Web
-   single: URL
-   pair: URL; parsing
-   pair: relative; URL
-
-This module defines a standard interface to break Uniform Resource Locator (URL)
-strings up in components (addressing scheme, network location, path etc.), to
-combine the components back into a URL string, and to convert a "relative URL"
-to an absolute URL given a "base URL."
-
-The module has been designed to match the Internet RFC on Relative Uniform
-Resource Locators (and discovered a bug in an earlier draft!). It supports the
-following URL schemes: ``file``, ``ftp``, ``gopher``, ``hdl``, ``http``,
-``https``, ``imap``, ``mailto``, ``mms``, ``news``, ``nntp``, ``prospero``,
-``rsync``, ``rtsp``, ``rtspu``, ``sftp``, ``shttp``, ``sip``, ``sips``,
-``snews``, ``svn``, ``svn+ssh``, ``telnet``, ``wais``.
-
-The :mod:`urlparse` module defines the following functions:
-
-
-.. function:: urlparse(urlstring[, default_scheme[, allow_fragments]])
-
-   Parse a URL into six components, returning a 6-tuple. This corresponds to the
-   general structure of a URL: ``scheme://netloc/path;parameters?query#fragment``.
-   Each tuple item is a string, possibly empty. The components are not broken up in
-   smaller parts (for example, the network location is a single string), and %
-   escapes are not expanded. The delimiters as shown above are not part of the
-   result, except for a leading slash in the *path* component, which is retained if
-   present. For example:
-
-      >>> from urlparse import urlparse
-      >>> o = urlparse('http://www.cwi.nl:80/%7Eguido/Python.html')
-      >>> o   # doctest: +NORMALIZE_WHITESPACE
-      ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
-                  params='', query='', fragment='')
-      >>> o.scheme
-      'http'
-      >>> o.port
-      80
-      >>> o.geturl()
-      'http://www.cwi.nl:80/%7Eguido/Python.html'
-
-   If the *default_scheme* argument is specified, it gives the default addressing
-   scheme, to be used only if the URL does not specify one. The default value for
-   this argument is the empty string.
-
-   If the *allow_fragments* argument is false, fragment identifiers are not
-   allowed, even if the URL's addressing scheme normally does support them. The
-   default value for this argument is :const:`True`.
-
-   The return value is actually an instance of a subclass of :class:`tuple`. This
-   class has the following additional read-only convenience attributes:
-
-   +------------------+-------+--------------------------+----------------------+
-   | Attribute        | Index | Value                    | Value if not present |
-   +==================+=======+==========================+======================+
-   | :attr:`scheme`   | 0     | URL scheme specifier     | empty string         |
-   +------------------+-------+--------------------------+----------------------+
-   | :attr:`netloc`   | 1     | Network location part    | empty string         |
-   +------------------+-------+--------------------------+----------------------+
-   | :attr:`path`     | 2     | Hierarchical path        | empty string         |
-   +------------------+-------+--------------------------+----------------------+
-   | :attr:`params`   | 3     | Parameters for last path | empty string         |
-   |                  |       | element                  |                      |
-   +------------------+-------+--------------------------+----------------------+
-   | :attr:`query`    | 4     | Query component          | empty string         |
-   +------------------+-------+--------------------------+----------------------+
-   | :attr:`fragment` | 5     | Fragment identifier      | empty string         |
-   +------------------+-------+--------------------------+----------------------+
-   | :attr:`username` |       | User name                | :const:`None`        |
-   +------------------+-------+--------------------------+----------------------+
-   | :attr:`password` |       | Password                 | :const:`None`        |
-   +------------------+-------+--------------------------+----------------------+
-   | :attr:`hostname` |       | Host name (lower case)   | :const:`None`        |
-   +------------------+-------+--------------------------+----------------------+
-   | :attr:`port`     |       | Port number as integer,  | :const:`None`        |
-   |                  |       | if present               |                      |
-   +------------------+-------+--------------------------+----------------------+
-
-   See section :ref:`urlparse-result-object` for more information on the result
-   object.
-
-
-.. function:: urlunparse(parts)
-
-   Construct a URL from a tuple as returned by ``urlparse()``. The *parts* argument
-   can be any six-item iterable. This may result in a slightly different, but
-   equivalent URL, if the URL that was parsed originally had unnecessary delimiters
-   (for example, a ? with an empty query; the RFC states that these are
-   equivalent).
-
-
-.. function:: urlsplit(urlstring[, default_scheme[, allow_fragments]])
-
-   This is similar to :func:`urlparse`, but does not split the params from the URL.
-   This should generally be used instead of :func:`urlparse` if the more recent URL
-   syntax allowing parameters to be applied to each segment of the *path* portion
-   of the URL (see :rfc:`2396`) is wanted. A separate function is needed to
-   separate the path segments and parameters. This function returns a 5-tuple:
-   (addressing scheme, network location, path, query, fragment identifier).
-
-   The return value is actually an instance of a subclass of :class:`tuple`. This
-   class has the following additional read-only convenience attributes:
-
-   +------------------+-------+-------------------------+----------------------+
-   | Attribute        | Index | Value                   | Value if not present |
-   +==================+=======+=========================+======================+
-   | :attr:`scheme`   | 0     | URL scheme specifier    | empty string         |
-   +------------------+-------+-------------------------+----------------------+
-   | :attr:`netloc`   | 1     | Network location part   | empty string         |
-   +------------------+-------+-------------------------+----------------------+
-   | :attr:`path`     | 2     | Hierarchical path       | empty string         |
-   +------------------+-------+-------------------------+----------------------+
-   | :attr:`query`    | 3     | Query component         | empty string         |
-   +------------------+-------+-------------------------+----------------------+
-   | :attr:`fragment` | 4     | Fragment identifier     | empty string         |
-   +------------------+-------+-------------------------+----------------------+
-   | :attr:`username` |       | User name               | :const:`None`        |
-   +------------------+-------+-------------------------+----------------------+
-   | :attr:`password` |       | Password                | :const:`None`        |
-   +------------------+-------+-------------------------+----------------------+
-   | :attr:`hostname` |       | Host name (lower case)  | :const:`None`        |
-   +------------------+-------+-------------------------+----------------------+
-   | :attr:`port`     |       | Port number as integer, | :const:`None`        |
-   |                  |       | if present              |                      |
-   +------------------+-------+-------------------------+----------------------+
-
-   See section :ref:`urlparse-result-object` for more information on the result
-   object.
-
-
-.. function:: urlunsplit(parts)
-
-   Combine the elements of a tuple as returned by :func:`urlsplit` into a complete
-   URL as a string. The *parts* argument can be any five-item iterable. This may
-   result in a slightly different, but equivalent URL, if the URL that was parsed
-   originally had unnecessary delimiters (for example, a ? with an empty query; the
-   RFC states that these are equivalent).
-
-
-.. function:: urljoin(base, url[, allow_fragments])
-
-   Construct a full ("absolute") URL by combining a "base URL" (*base*) with
-   another URL (*url*). Informally, this uses components of the base URL, in
-   particular the addressing scheme, the network location and (part of) the path,
-   to provide missing components in the relative URL. For example:
-
-      >>> from urlparse import urljoin
-      >>> urljoin('http://www.cwi.nl/%7Eguido/Python.html', 'FAQ.html')
-      'http://www.cwi.nl/%7Eguido/FAQ.html'
-
-   The *allow_fragments* argument has the same meaning and default as for
-   :func:`urlparse`.
-
-   .. note::
-
-      If *url* is an absolute URL (that is, starting with ``//`` or ``scheme://``),
-      the *url*'s host name and/or scheme will be present in the result. For example:
-
-      .. doctest::
-
-         >>> urljoin('http://www.cwi.nl/%7Eguido/Python.html',
-         ...         '//www.python.org/%7Eguido')
-         'http://www.python.org/%7Eguido'
-
-      If you do not want that behavior, preprocess the *url* with :func:`urlsplit` and
-      :func:`urlunsplit`, removing possible *scheme* and *netloc* parts.
-
-
-.. function:: urldefrag(url)
-
-   If *url* contains a fragment identifier, returns a modified version of *url*
-   with no fragment identifier, and the fragment identifier as a separate string.
-   If there is no fragment identifier in *url*, returns *url* unmodified and an
-   empty string.
-
-
-.. seealso::
-
-   :rfc:`1738` - Uniform Resource Locators (URL)
-      This specifies the formal syntax and semantics of absolute URLs.
-
-   :rfc:`1808` - Relative Uniform Resource Locators
-      This Request For Comments includes the rules for joining an absolute and a
-      relative URL, including a fair number of "Abnormal Examples" which govern the
-      treatment of border cases.
-
-   :rfc:`2396` - Uniform Resource Identifiers (URI): Generic Syntax
-      Document describing the generic syntactic requirements for both Uniform Resource
-      Names (URNs) and Uniform Resource Locators (URLs).
-
-
-.. _urlparse-result-object:
-
-Results of :func:`urlparse` and :func:`urlsplit`
-------------------------------------------------
-
-The result objects from the :func:`urlparse` and :func:`urlsplit` functions are
-subclasses of the :class:`tuple` type. These subclasses add the attributes
-described in those functions, as well as provide an additional method:
-
-
-.. method:: ParseResult.geturl()
-
-   Return the re-combined version of the original URL as a string. This may differ
-   from the original URL in that the scheme will always be normalized to lower case
-   and empty components may be dropped. Specifically, empty parameters, queries,
-   and fragment identifiers will be removed.
-
-   The result of this method is a fixpoint if passed back through the original
-   parsing function:
-
-      >>> import urlparse
-      >>> url = 'HTTP://www.Python.org/doc/#'
-
-      >>> r1 = urlparse.urlsplit(url)
-      >>> r1.geturl()
-      'http://www.Python.org/doc/'
-
-      >>> r2 = urlparse.urlsplit(r1.geturl())
-      >>> r2.geturl()
-      'http://www.Python.org/doc/'
-
-
-The following classes provide the implementations of the parse results:
-
-
-.. class:: BaseResult
-
-   Base class for the concrete result classes. This provides most of the attribute
-   definitions. It does not provide a :meth:`geturl` method. It is derived from
-   :class:`tuple`, but does not override the :meth:`__init__` or :meth:`__new__`
-   methods.
-
-
-.. class:: ParseResult(scheme, netloc, path, params, query, fragment)
-
-   Concrete class for :func:`urlparse` results. The :meth:`__new__` method is
-   overridden to support checking that the right number of arguments are passed.
-
-
-.. class:: SplitResult(scheme, netloc, path, query, fragment)
-
-   Concrete class for :func:`urlsplit` results. The :meth:`__new__` method is
-   overridden to support checking that the right number of arguments are passed.
-
diff --git a/Doc/tutorial/stdlib.rst b/Doc/tutorial/stdlib.rst
index 66e73a9..b0c6e8e 100644
--- a/Doc/tutorial/stdlib.rst
+++ b/Doc/tutorial/stdlib.rst
@@ -147,11 +147,11 @@ Internet Access
 ===============
 
 There are a number of modules for accessing the internet and processing internet
-protocols. Two of the simplest are :mod:`urllib2` for retrieving data from urls
-and :mod:`smtplib` for sending mail::
+protocols. Two of the simplest are :mod:`urllib.request` for retrieving data
+from urls and :mod:`smtplib` for sending mail::
 
-   >>> import urllib2
-   >>> for line in urllib2.urlopen('http://tycho.usno.navy.mil/cgi-bin/timer.pl'):
+   >>> import urllib.request
+   >>> for line in urllib.request.urlopen('http://tycho.usno.navy.mil/cgi-bin/timer.pl'):
    ...     if 'EST' in line or 'EDT' in line:  # look for Eastern Time
    ...         print(line)
-- 
cgit v0.12
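A note on the tutorial hunk above: it is a straight rename from ``urllib2`` to ``urllib.request``, but in Python 3 ``urlopen`` returns a *binary* stream, so each ``line`` is ``bytes`` and the ``'EST' in line`` test needs a decode first. A minimal offline sketch of the pattern (``io.BytesIO`` stands in for the network response, and the sample timestamp bytes are made up for illustration; a real call would be ``urllib.request.urlopen('http://...')``):

```python
import io

# Stand-in for the bytes stream urllib.request.urlopen() returns in Python 3.
response = io.BytesIO(b"<BR>Time now:\nJun. 23, 04:41 AM EST Eastern Time\n")

matches = []
for raw in response:
    line = raw.decode('ascii')               # bytes -> str before substring tests
    if 'EST' in line or 'EDT' in line:       # look for Eastern Time
        matches.append(line.strip())
```

Without the ``decode``, ``'EST' in line`` raises ``TypeError`` on Python 3 because ``line`` is ``bytes`` and ``'EST'`` is ``str``.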