urllib2
- extensible library for opening URLs
Note
The
urllib2
module has been split across several modules in Python 3 named
urllib.request
and
urllib.error
. The
2to3
tool will automatically adapt imports when converting your sources to Python 3.
The
urllib2
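module defines functions and classes which help in opening URLs (mostly HTTP) in a complex world: basic and digest authentication, redirections, cookies and more.
See also
The Requests package is recommended for a higher-level HTTP client interface.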
The
urllib2
module defines the following functions:
urllib2.
urlopen
(
url[, data[, timeout[, cafile[, capath[, cadefault[, context]]]]]]
)
¶
Open the URL
url
, which can be either a string or a
Request
object.
data
may be a string specifying additional data to send to the server, or
None
if no such data is needed. Currently HTTP requests are the only ones that use
data
; the HTTP request will be a POST instead of a GET when the
data
parameter is provided.
data
should be a buffer in the standard
application/x-www-form-urlencoded
format. The
urllib.urlencode()
function takes a mapping or sequence of 2-tuples and returns a string in this format. The urllib2 module sends HTTP/1.1 requests with
Connection:close
header included.
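As a minimal sketch of the encoding step described above, the following builds a form body with urlencode and shows how supplying data switches the request method from GET to POST. The URL and field names are placeholders, and the import fallback is an assumption added only so the snippet also runs on Python 3, where these names live in urllib.request and urllib.parse. No network access takes place.

```python
try:
    import urllib2                      # Python 2, the module documented here
    from urllib import urlencode
except ImportError:                     # assumption: Python 3 stand-in
    import urllib.request as urllib2
    from urllib.parse import urlencode

# A sequence of 2-tuples keeps the field order predictable.
fields = [('name', 'Ada Lovelace'), ('lang', 'en')]
body = urlencode(fields).encode('ascii')

# Placeholder URL; the requests are only constructed, never sent.
get_req = urllib2.Request('http://www.example.com/form')
post_req = urllib2.Request('http://www.example.com/form', data=body)

print(body)
print(get_req.get_method())
print(post_req.get_method())
```

Passing the same body as the second argument of urlopen() would have the same effect on the method.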
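The optional timeout parameter specifies a timeout in seconds for blocking operations like the connection attempt (if not specified, the global default timeout setting will be used). This actually only works for HTTP, HTTPS and FTP connections.
If
context
is specified, it must be an
ssl.SSLContext
instance describing the various SSL options. See
HTTPSConnection
for more details.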
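The optional
cafile
and
capath
parameters specify a set of trusted CA certificates for HTTPS requests.
cafile
should point to a single file containing a bundle of CA certificates, whereas
capath
should point to a directory of hashed certificate files. More information can be found in
ssl.SSLContext.load_verify_locations()
.
The cadefault parameter is ignored.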
This function returns a file-like object with three additional methods:
geturl()
— return the URL of the resource retrieved, commonly used to determine if a redirect was followed
info()
— return the meta-information of the page, such as headers, in the form of an
mimetools.Message
instance (see
Quick Reference to HTTP Headers
)
getcode()
— return the HTTP status code of the response.
Raises
URLError
on errors.
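Note that
None
may be returned if no handler handles the request (though the default installed global
OpenerDirector
uses
UnknownHandler
to ensure this never happens).
In addition, if proxy settings are detected (for example, when a
*_proxy
environment variable like
http_proxy
is set),
ProxyHandler
is installed by default and makes sure the requests are handled through the proxy.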
Changed in version 2.6: timeout was added.
Changed in version 2.7.9: cafile, capath, cadefault, and context were added.
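The return value and error behaviour of urlopen() can be tried without a network connection by pointing it at a local file: URL. This is only a sketch: the temporary file, its contents and the missing path are made up for illustration, and the import fallback is an assumption so that the snippet also runs on Python 3.

```python
import os
import tempfile

try:
    import urllib2                      # Python 2, the module documented here
    from urllib import pathname2url
except ImportError:                     # assumption: Python 3 stand-in
    import urllib.request as urllib2
    from urllib.request import pathname2url

# Create a small local file to fetch.
fd, path = tempfile.mkstemp(suffix='.txt')
os.write(fd, b'hello from a local file\n')
os.close(fd)

response = urllib2.urlopen('file:' + pathname2url(path), timeout=5)
try:
    first_bytes = response.read(5)          # file-like read()
    final_url = response.geturl()           # URL actually retrieved
    length = response.info().get('Content-Length')
finally:
    response.close()
    os.remove(path)

# A resource that cannot be opened raises URLError.
try:
    urllib2.urlopen('file:/no/such/directory/missing.txt')
    error_reason = None
except urllib2.URLError as exc:
    error_reason = exc.reason

print(first_bytes)
print(length)
```

For HTTP URLs the same response object also reports the status through getcode().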
urllib2.
install_opener
(
opener
)
¶
Install an
OpenerDirector
instance as the default global opener. Installing an opener is only necessary if you want urlopen to use that opener; otherwise, simply call
OpenerDirector.open()
instead of
urlopen()
. The code does not check for a real
OpenerDirector
, and any class with the appropriate interface will work.
urllib2.
build_opener
(
[
handler
,
...
]
)
¶
Return an
OpenerDirector
instance, which chains the handlers in the order given.
handler
s can be either instances of
BaseHandler
, or subclasses of
BaseHandler
(in which case it must be possible to call the constructor without any parameters). Instances of the following classes will be in front of the
handler
s, unless the
handler
s contain them, instances of them or subclasses of them:
ProxyHandler
(if proxy settings are detected),
UnknownHandler
,
HTTPHandler
,
HTTPDefaultErrorHandler
,
HTTPRedirectHandler
,
FTPHandler
,
FileHandler
,
HTTPErrorProcessor
.
If the Python installation has SSL support (i.e., if the
ssl
module can be imported),
HTTPSHandler
will also be added.
Beginning in Python 2.3, a
BaseHandler
subclass may also change its
handler_order
attribute to modify its position in the handlers list.
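The substitution rules for build_opener() can be observed by listing the handlers of the resulting opener. The sketch below passes one handler as a class and one as a subclass of a default handler; QuietRedirectHandler is a hypothetical name, and the handlers attribute it inspects is an implementation detail rather than documented API. The import fallback is an assumption so the snippet also runs on Python 3.

```python
try:
    import urllib2                      # Python 2, the module documented here
except ImportError:                     # assumption: Python 3 stand-in
    import urllib.request as urllib2


class QuietRedirectHandler(urllib2.HTTPRedirectHandler):
    """Hypothetical subclass; it replaces the default redirect handler."""


# Handlers may be given as classes (instantiated without arguments)
# or as ready-made instances.
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor,
                              QuietRedirectHandler())

# 'handlers' is an implementation detail, used here only for inspection.
names = [handler.__class__.__name__ for handler in opener.handlers]
print(sorted(names))
```

Because a subclass of HTTPRedirectHandler was supplied, the default HTTPRedirectHandler is left out, while the other defaults such as HTTPHandler are still present.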
The following exceptions are raised as appropriate:
urllib2.
URLError
¶
The handlers raise this exception (or derived exceptions) when they run into a problem. It is a subclass of
IOError
.
reason
¶
The reason for this error. It can be a message string or another exception instance (
socket.error
for remote URLs,
OSError
for local URLs).
urllib2.
HTTPError
¶
Though being an exception (a subclass of
URLError
), an
HTTPError
can also function as a non-exceptional file-like return value (the same thing that
urlopen()
returns). This is useful when handling exotic HTTP errors, such as requests for authentication.
code
¶
An HTTP status code as defined in
RFC 2616
. This numeric value corresponds to a value found in the dictionary of codes as found in
BaseHTTPServer.BaseHTTPRequestHandler.responses
.
reason
¶
The reason for this error. It can be a message string or another exception instance.
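Since HTTPError is a subclass of URLError, code that distinguishes the two has to test for HTTPError first. The following sketch constructs both exceptions by hand instead of contacting a server; classify() is a hypothetical helper, the URL and messages are placeholders, and the import fallback is an assumption so the snippet also runs on Python 3.

```python
try:
    import urllib2                      # Python 2, the module documented here
except ImportError:                     # assumption: Python 3 stand-in
    import urllib.request as urllib2


def classify(exc):
    """Hypothetical helper: describe an exception raised by urlopen()."""
    # HTTPError must be checked first because it is also a URLError.
    if isinstance(exc, urllib2.HTTPError):
        return 'http %d' % exc.code
    if isinstance(exc, urllib2.URLError):
        return 'url error: %s' % (exc.reason,)
    return 'other'


# Exceptions built by hand (url, code, msg, hdrs, fp) for illustration.
http_error = urllib2.HTTPError('http://www.example.com/missing',
                               404, 'Not Found', {}, None)
url_error = urllib2.URLError('no route to host')

print(classify(http_error))
print(classify(url_error))
```

The same ordering applies to except clauses: an `except urllib2.URLError` placed first would also catch every HTTPError.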
The following classes are provided:
urllib2.
Request
(
url[, data][, headers][, origin_req_host][, unverifiable]
)
¶
This class is an abstraction of a URL request.
url should be a string containing a valid URL.
data
may be a string specifying additional data to send to the server, or
None
if no such data is needed. Currently HTTP requests are the only ones that use
data
; the HTTP request will be a POST instead of a GET when the
data
parameter is provided.
data
should be a buffer in the standard
application/x-www-form-urlencoded
format. The
urllib.urlencode()
function takes a mapping or sequence of 2-tuples and returns a string in this format.
headers
should be a dictionary, and will be treated as if
add_header()
was called with each key and value as arguments. This is often used to “spoof” the
User-Agent
header value, which is used by a browser to identify itself – some HTTP servers only allow requests coming from common browsers as opposed to scripts. For example, Mozilla Firefox may identify itself as
"Mozilla/5.0
(X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11"
, while
urllib2
's default user agent string is
"Python-urllib/2.6"
(on Python 2.6).
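The final two arguments are only of interest for correct handling of third-party HTTP cookies:
origin_req_host
should be the request-host of the origin transaction, as defined by
RFC 2965
. It defaults to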
cookielib.request_host(self)
. This is the host name or IP address of the original request that was initiated by the user. For example, if the request is for an image in an HTML document, this should be the request-host of the request for the page containing the image.
unverifiable
should indicate whether the request is unverifiable, as defined by RFC 2965. It defaults to
False
. An unverifiable request is one whose URL the user did not have the option to approve. For example, if the request is for an image in an HTML document, and the user had no option to approve the automatic fetching of the image, this should be true.
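A short sketch of the headers argument follows. The header values are placeholders, and the import fallback is an assumption so the snippet also runs on Python 3. One detail worth noticing is that header names are stored in capitalized form, so the stored key is 'User-agent' rather than 'User-Agent'.

```python
try:
    import urllib2                      # Python 2, the module documented here
except ImportError:                     # assumption: Python 3 stand-in
    import urllib.request as urllib2

# Placeholder URL and header values; the request is never sent.
req = urllib2.Request(
    'http://www.example.com/',
    headers={'User-Agent': 'urllib-example/0.1',
             'Referer': 'http://www.python.org/'})

# Each key/value pair was passed through add_header(), which stores
# the name capitalized ('User-agent').
user_agent = req.get_header('User-agent')
has_referer = req.has_header('Referer')

print(user_agent)
print(has_referer)
```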
urllib2.
OpenerDirector
¶
The
OpenerDirector
class opens URLs via
BaseHandler
s chained together. It manages the chaining of handlers, and recovery from errors.
urllib2.
BaseHandler
¶
This is the base class for all registered handlers — and handles only the simple mechanics of registration.
urllib2.
HTTPDefaultErrorHandler
¶
A class which defines a default handler for HTTP error responses; all responses are turned into
HTTPError
exceptions.
urllib2.
HTTPRedirectHandler
¶
A class to handle redirections.
urllib2.
HTTPCookieProcessor
(
[
cookiejar
]
)
¶
A class to handle HTTP cookies.
urllib2.
ProxyHandler
(
[
proxies
]
)
¶
Cause requests to go through a proxy. If
proxies
is given, it must be a dictionary mapping protocol names to URLs of proxies. The default is to read the list of proxies from the environment variables
<protocol>_proxy
. If no proxy environment variables are set, then in a Windows environment proxy settings are obtained from the registry’s Internet Settings section, and in a Mac OS X environment proxy information is retrieved from the OS X System Configuration Framework.
To disable autodetected proxies, pass an empty dictionary.
Note
HTTP_PROXY
will be ignored if the variable
REQUEST_METHOD
is set; see the documentation on
getproxies()
.
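The difference between an explicit proxies dictionary and an empty one can be sketched as follows. The proxy URL is a placeholder, and the import fallback is an assumption so the snippet also runs on Python 3. The check relies on the handler growing one protocol_open method per configured protocol, as described for ProxyHandler objects further down.

```python
try:
    import urllib2                      # Python 2, the module documented here
except ImportError:                     # assumption: Python 3 stand-in
    import urllib.request as urllib2

# Explicit mapping of protocol name to proxy URL (placeholder host).
explicit = urllib2.ProxyHandler({'http': 'http://proxy.example.com:3128/'})

# An empty dictionary disables proxy autodetection entirely.
disabled = urllib2.ProxyHandler({})

# Openers built with these instances do not get the default ProxyHandler.
proxied_opener = urllib2.build_opener(explicit)
direct_opener = urllib2.build_opener(disabled)

print(hasattr(explicit, 'http_open'))
print(hasattr(disabled, 'http_open'))
```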
urllib2.
HTTPPasswordMgr
¶
Keep a database of
(realm, uri) -> (user, password)
mappings.
urllib2.
HTTPPasswordMgrWithDefaultRealm
¶
Keep a database of
(realm, uri) -> (user, password)
mappings. A realm of
None
is considered a catch-all realm, which is searched if no other realm fits.
urllib2.
AbstractBasicAuthHandler
(
[
password_mgr
]
)
¶
This is a mixin class that helps with HTTP authentication, both to the remote host and to a proxy.
password_mgr
, if given, should be something that is compatible with
HTTPPasswordMgr
; refer to section
HTTPPasswordMgr Objects
for information on the interface that must be supported.
urllib2.
HTTPBasicAuthHandler
(
[
password_mgr
]
)
¶
Handle authentication with the remote host.
password_mgr
, if given, should be something that is compatible with
HTTPPasswordMgr
; refer to section
HTTPPasswordMgr Objects
for information on the interface that must be supported.
urllib2.
ProxyBasicAuthHandler
(
[
password_mgr
]
)
¶
Handle authentication with the proxy.
password_mgr
, if given, should be something that is compatible with
HTTPPasswordMgr
; refer to section
HTTPPasswordMgr Objects
for information on the interface that must be supported.
urllib2.
AbstractDigestAuthHandler
(
[
password_mgr
]
)
¶
This is a mixin class that helps with HTTP authentication, both to the remote host and to a proxy.
password_mgr
, if given, should be something that is compatible with
HTTPPasswordMgr
; refer to section
HTTPPasswordMgr Objects
for information on the interface that must be supported.
urllib2.
HTTPDigestAuthHandler
(
[
password_mgr
]
)
¶
Handle authentication with the remote host.
password_mgr
, if given, should be something that is compatible with
HTTPPasswordMgr
; refer to section
HTTPPasswordMgr Objects
for information on the interface that must be supported.
urllib2.
ProxyDigestAuthHandler
(
[
password_mgr
]
)
¶
Handle authentication with the proxy.
password_mgr
, if given, should be something that is compatible with
HTTPPasswordMgr
; refer to section
HTTPPasswordMgr Objects
for information on the interface that must be supported.
urllib2.
HTTPHandler
¶
A class to handle opening of HTTP URLs.
urllib2.
HTTPSHandler
(
[
debuglevel
[
,
context
]
]
)
¶
A class to handle opening of HTTPS URLs.
context
has the same meaning as for
httplib.HTTPSConnection
.
Changed in version 2.7.9: context added.
urllib2.
FileHandler
¶
Open local files.
urllib2.
FTPHandler
¶
Open FTP URLs.
urllib2.
CacheFTPHandler
¶
Open FTP URLs, keeping a cache of open FTP connections to minimize delays.
urllib2.
UnknownHandler
¶
A catch-all class to handle unknown URLs.
urllib2.
HTTPErrorProcessor
¶
Process HTTP error responses.
The following methods describe all of
Request
’s public interface, and so all must be overridden in subclasses.
Request.
add_data
(
data
)
¶
Set the
Request
data to
data
. This is ignored by all handlers except HTTP handlers — and there it should be a byte string, and will change the request to be
POST
rather than
GET
.
Request.
get_method
(
)
¶
Return a string indicating the HTTP request method. This is only meaningful for HTTP requests, and currently always returns
'GET'
or
'POST'
.
Request.
has_data
(
)
¶
Return whether the instance has a non-
None
data.
Request.
get_data
(
)
¶
Return the instance’s data.
Request.
add_header
(
key
,
val
)
¶
Add another header to the request. Headers are currently ignored by all handlers except HTTP handlers, where they are added to the list of headers sent to the server. Note that there cannot be more than one header with the same name, and later calls will overwrite previous calls in case the key collides. Currently, this is no loss of HTTP functionality, since all headers which have meaning when used more than once have a (header-specific) way of gaining the same functionality using only one header.
Request.
add_unredirected_header
(
key
,
header
)
¶
Add a header that will not be added to a redirected request.
New in version 2.4.
Request.
has_header
(
header
)
¶
Return whether the instance has the named header (checks both regular and unredirected).
New in version 2.4.
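The header methods described above can be exercised on a request that is never sent. This is a sketch with placeholder values; the import fallback is an assumption so the snippet also runs on Python 3. It shows that a second add_header() call with a colliding name overwrites the first, and that has_header() and header_items() cover unredirected headers as well.

```python
try:
    import urllib2                      # Python 2, the module documented here
except ImportError:                     # assumption: Python 3 stand-in
    import urllib.request as urllib2

req = urllib2.Request('http://www.example.com/')

# Both calls refer to the same header once the name is normalized,
# so the second value replaces the first.
req.add_header('Accept', 'text/html')
req.add_header('accept', 'application/json')

# This header is sent with the original request only, not after a redirect.
req.add_unredirected_header('Authorization', 'Basic placeholder')

items = sorted(req.header_items())
print(items)
print(req.has_header('Authorization'))
```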
Request.
get_full_url
(
)
¶
Return the URL given in the constructor.
Request.
get_type
(
)
¶
Return the type of the URL — also known as the scheme.
Request.
get_host
(
)
¶
Return the host to which a connection will be made.
Request.
get_selector
(
)
¶
Return the selector — the part of the URL that is sent to the server.
Request.
get_header
(
header_name
,
default=None
)
¶
Return the value of the given header. If the header is not present, return the default value.
Request.
header_items
(
)
¶
Return a list of tuples (header_name, header_value) of the Request headers.
Request.
set_proxy
(
host
,
type
)
¶
Prepare the request by connecting to a proxy server. The host and type will replace those of the instance, and the instance’s selector will be the original URL given in the constructor.
Request.
get_origin_req_host
(
)
¶
Return the request-host of the origin transaction, as defined by
RFC 2965
. See the documentation for the
Request
constructor.
Request.
is_unverifiable
(
)
¶
Return whether the request is unverifiable, as defined by RFC 2965. See the documentation for the
Request
constructor.
OpenerDirector
instances have the following methods:
OpenerDirector.
add_handler
(
handler
)
¶
handler
should be an instance of
BaseHandler
. The following methods are searched, and added to the possible chains (note that HTTP errors are a special case).
protocol_open
— signal that the handler knows how to open
protocol
URLs.
http_error_type
— signal that the handler knows how to handle HTTP errors with HTTP error code
type
.
protocol_error
— signal that the handler knows how to handle errors from (non-
http
)
protocol
.
protocol_request
— signal that the handler knows how to pre-process
protocol
requests.
protocol_response
— signal that the handler knows how to post-process
protocol
responses.
OpenerDirector.
open
(
url[, data][, timeout]
)
¶
Open the given
url
(which can be a request object or a string), optionally passing the given
data
. Arguments, return values and exceptions raised are the same as those of
urlopen()
(which simply calls the
open()
method on the currently installed global
OpenerDirector
). The optional
timeout
parameter specifies a timeout in seconds for blocking operations like the connection attempt (if not specified, the global default timeout setting will be used). The timeout feature actually works only for HTTP, HTTPS and FTP connections.
Changed in version 2.6: timeout was added.
OpenerDirector.
error
(
proto
[
,
arg
[
,
...
]
]
)
¶
Handle an error of the given protocol. This will call the registered error handlers for the given protocol with the given arguments (which are protocol specific). The HTTP protocol is a special case which uses the HTTP response code to determine the specific error handler; refer to the
http_error_*()
methods of the handler classes.
Return values and exceptions raised are the same as those of
urlopen()
.
OpenerDirector objects open URLs in three stages:
The order in which these methods are called within each stage is determined by sorting the handler instances.
Every handler with a method named like
protocol_request
has that method called to pre-process the request.
Handlers with a method named like
protocol_open
are called to handle the request. This stage ends when a handler either returns a non-
None
value (i.e. a response), or raises an exception (usually
URLError
). Exceptions are allowed to propagate.
In fact, the above algorithm is first tried for methods named
default_open()
. If all such methods return
None
, the algorithm is repeated for methods named like
protocol_open
. If all such methods return
None
, the algorithm is repeated for methods named
unknown_open()
.
Note that the implementation of these methods may involve calls of the parent
OpenerDirector
instance’s
open()
and
error()
methods.
Every handler with a method named like
protocol_response
has that method called to post-process the response.
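The three stages can be made visible with a toy protocol that needs no network. In the sketch below, the mem scheme, MemoryHandler, LoggingProcessor and the events list are all hypothetical names invented for illustration; only the method naming scheme (protocol_request, protocol_open, protocol_response) comes from the text above. The import fallback is an assumption so the snippet also runs on Python 3.

```python
import io

try:
    import urllib2                      # Python 2, the module documented here
    from urllib import addinfourl
except ImportError:                     # assumption: Python 3 stand-in
    import urllib.request as urllib2
    from urllib.response import addinfourl

events = []  # records the order in which the stages run


class LoggingProcessor(urllib2.BaseHandler):
    """Hypothetical processor for the made-up 'mem' protocol."""

    def mem_request(self, req):             # stage 1: pre-process
        events.append('request')
        return req

    def mem_response(self, req, response):  # stage 3: post-process
        events.append('response')
        return response


class MemoryHandler(urllib2.BaseHandler):
    """Hypothetical handler that answers 'mem:' URLs from memory."""

    def mem_open(self, req):                # stage 2: handle the request
        events.append('open')
        return addinfourl(io.BytesIO(b'hello'), {}, req.get_full_url())


opener = urllib2.build_opener(LoggingProcessor(), MemoryHandler())
body = opener.open('mem:greeting').read()

print(events)
print(body)
```

Following the naming convention mentioned later in this document, the class with the request and response hooks is called a Processor and the one that opens the URL a Handler.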
BaseHandler
objects provide a couple of methods that are directly useful, and others that are meant to be used by derived classes. These are intended for direct use:
BaseHandler.
add_parent
(
director
)
¶
Add a director as parent.
BaseHandler.
close
(
)
¶
Remove any parents.
The following attributes and methods should only be used by classes derived from
BaseHandler
.
Note
The convention has been adopted that subclasses defining
protocol_request()
or
protocol_response()
methods are named
*Processor
; all others are named
*Handler
.
BaseHandler.
parent
¶
A valid
OpenerDirector
, which can be used to open using a different protocol, or handle errors.
BaseHandler.
default_open
(
req
)
¶
This method is
not
defined in
BaseHandler
, but subclasses should define it if they want to catch all URLs.
This method, if implemented, will be called by the parent
OpenerDirector
. It should return a file-like object as described in the return value of the
open()
of
OpenerDirector
, or
None
. It should raise
URLError
, unless a truly exceptional thing happens (for example,
MemoryError
should not be mapped to
URLError
).
This method will be called before any protocol-specific open method.
BaseHandler.
protocol_open
(
req
)
(“protocol” is to be replaced by the protocol name.)
This method is
not
defined in
BaseHandler
, but subclasses should define it if they want to handle URLs with the given
protocol
.
This method, if defined, will be called by the parent
OpenerDirector
. Return values should be the same as for
default_open()
.
BaseHandler.
unknown_open
(
req
)
¶
This method is
not
defined in
BaseHandler
, but subclasses should define it if they want to catch all URLs with no specific registered handler to open it.
This method, if implemented, will be called by the
parent
OpenerDirector
. Return values should be the same as for
default_open()
.
BaseHandler.
http_error_default
(
req
,
fp
,
code
,
msg
,
hdrs
)
¶
This method is
not
defined in
BaseHandler
, but subclasses should override it if they intend to provide a catch-all for otherwise unhandled HTTP errors. It will be called automatically by the
OpenerDirector
getting the error, and should not normally be called in other circumstances.
req
will be a
Request
object,
fp
will be a file-like object with the HTTP error body,
code
will be the three-digit code of the error,
msg
will be the user-visible explanation of the code and
hdrs
will be a mapping object with the headers of the error.
Return values and exceptions raised should be the same as those of
urlopen()
.
BaseHandler.
http_error_nnn
(
req
,
fp
,
code
,
msg
,
hdrs
)
¶
nnn
should be a three-digit HTTP error code. This method is also not defined in
BaseHandler
, but will be called, if it exists, on an instance of a subclass, when an HTTP error with code
nnn
is encountered.
Subclasses should override this method to handle specific HTTP errors.
Arguments, return values and exceptions raised should be the same as for
http_error_default()
.
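A handler for one specific status code can be tried without a server by calling the opener's error() method directly, which is what the HTTP machinery does internally when an error response arrives. In this sketch NotFoundFallback and the fallback body are hypothetical, the direct error() call merely stands in for a real response, and the import fallback is an assumption so the snippet also runs on Python 3.

```python
import io

try:
    import urllib2                      # Python 2, the module documented here
    from urllib import addinfourl
except ImportError:                     # assumption: Python 3 stand-in
    import urllib.request as urllib2
    from urllib.response import addinfourl


class NotFoundFallback(urllib2.BaseHandler):
    """Hypothetical handler that turns a 404 into a substitute response."""

    def http_error_404(self, req, fp, code, msg, hdrs):
        return addinfourl(io.BytesIO(b'fallback page'), hdrs,
                          req.get_full_url(), code)


opener = urllib2.build_opener(NotFoundFallback())
req = urllib2.Request('http://www.example.com/missing')

# Simulate a 404: the custom http_error_404() supplies the result.
fallback = opener.error('http', req, io.BytesIO(b''), 404, 'Not Found', {})
fallback_body = fallback.read()

# Simulate a 500: nothing handles it, so the default handler raises.
try:
    opener.error('http', req, io.BytesIO(b''), 500, 'Server Error', {})
    raised_code = None
except urllib2.HTTPError as exc:
    raised_code = exc.code

print(fallback_body)
print(raised_code)
```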
BaseHandler.
protocol_request
(
req
)
(“protocol” is to be replaced by the protocol name.)
This method is
not
defined in
BaseHandler
, but subclasses should define it if they want to pre-process requests of the given
protocol
.
This method, if defined, will be called by the parent
OpenerDirector
.
req
will be a
Request
object. The return value should be a
Request
object.
BaseHandler.
protocol_response
(
req
,
response
)
(“protocol” is to be replaced by the protocol name.)
This method is
not
defined in
BaseHandler
, but subclasses should define it if they want to post-process responses of the given
protocol
.
This method, if defined, will be called by the parent
OpenerDirector
.
req
will be a
Request
object.
response
will be an object implementing the same interface as the return value of
urlopen()
. The return value should implement the same interface as the return value of
urlopen()
.
Note
Some HTTP redirections require action from this module’s client code. If this is the case,
HTTPError
is raised. See
RFC 2616
for details of the precise meanings of the various redirection codes.
HTTPRedirectHandler.
redirect_request
(
req
,
fp
,
code
,
msg
,
hdrs
,
newurl
)
¶
Return a
Request
or
None
in response to a redirect. This is called by the default implementations of the
http_error_30*()
methods when a redirection is received from the server. If a redirection should take place, return a new
Request
to allow
http_error_30*()
to perform the redirect to
newurl
. Otherwise, raise
HTTPError
if no other handler should try to handle this URL, or return
None
if you can’t but another handler might.
Note
The default implementation of this method does not strictly follow
RFC 2616
, which says that 301 and 302 responses to
POST
requests must not be automatically redirected without confirmation by the user. In reality, browsers do allow automatic redirection of these responses, changing the POST to a
GET
, and the default implementation reproduces this behavior.
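An application that prefers the stricter RFC 2616 reading could override redirect_request() so that POST requests are not redirected automatically. The following is one possible sketch, not the module's own behaviour: NoPostRedirectHandler is a hypothetical name, the URLs are placeholders, and the method is called by hand here instead of through a real redirect. The import fallback is an assumption so the snippet also runs on Python 3.

```python
import io

try:
    import urllib2                      # Python 2, the module documented here
except ImportError:                     # assumption: Python 3 stand-in
    import urllib.request as urllib2


class NoPostRedirectHandler(urllib2.HTTPRedirectHandler):
    """Hypothetical handler that refuses to redirect POST requests."""

    def redirect_request(self, req, fp, code, msg, hdrs, newurl):
        if req.get_method() == 'POST':
            # Stop here; no other handler should follow this redirect.
            raise urllib2.HTTPError(req.get_full_url(), code, msg, hdrs, fp)
        return urllib2.HTTPRedirectHandler.redirect_request(
            self, req, fp, code, msg, hdrs, newurl)


handler = NoPostRedirectHandler()
get_req = urllib2.Request('http://www.example.com/old')
post_req = urllib2.Request('http://www.example.com/old', data=b'x=1')

# Called directly for illustration; normally http_error_30*() calls it.
followed = handler.redirect_request(
    get_req, io.BytesIO(b''), 302, 'Found', {}, 'http://www.example.com/new')

try:
    handler.redirect_request(
        post_req, io.BytesIO(b''), 302, 'Found', {},
        'http://www.example.com/new')
    refused = False
except urllib2.HTTPError:
    refused = True

print(followed.get_full_url())
print(refused)
```

Passing an instance to build_opener() would put it in place of the default HTTPRedirectHandler.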
HTTPRedirectHandler.
http_error_301
(
req
,
fp
,
code
,
msg
,
hdrs
)
¶
Redirect to the
Location:
or
URI:
URL. This method is called by the parent
OpenerDirector
when getting an HTTP ‘moved permanently’ response.
HTTPRedirectHandler.
http_error_302
(
req
,
fp
,
code
,
msg
,
hdrs
)
¶
The same as
http_error_301()
, but called for the ‘found’ response.
HTTPRedirectHandler.
http_error_303
(
req
,
fp
,
code
,
msg
,
hdrs
)
¶
The same as
http_error_301()
, but called for the ‘see other’ response.
HTTPRedirectHandler.
http_error_307
(
req
,
fp
,
code
,
msg
,
hdrs
)
¶
The same as
http_error_301()
, but called for the ‘temporary redirect’ response.
New in version 2.4.
HTTPCookieProcessor
instances have one attribute:
The
cookielib.CookieJar
in which cookies are stored.
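A small sketch of sharing a cookie jar with an opener follows. It assumes the attribute is named cookiejar, matching the constructor argument, and the import fallbacks are assumptions so the snippet also runs on Python 3, where cookielib is called http.cookiejar. No request is made, so the jar stays empty.

```python
try:
    import urllib2                      # Python 2, the module documented here
    import cookielib
except ImportError:                     # assumption: Python 3 stand-ins
    import urllib.request as urllib2
    import http.cookiejar as cookielib

jar = cookielib.CookieJar()
processor = urllib2.HTTPCookieProcessor(jar)
opener = urllib2.build_opener(processor)

# The processor keeps a reference to the jar that was passed in, so
# cookies set by responses can later be read from 'jar' directly.
same_jar = processor.cookiejar is jar
print(same_jar)
print(len(jar))
```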
ProxyHandler.
protocol_open
(
request
)
(“protocol” is to be replaced by the protocol name.)
The
ProxyHandler
will have a method
protocol_open
for every
protocol
which has a proxy in the
proxies
dictionary given in the constructor. The method will modify requests to go through the proxy, by calling
request.set_proxy()
, and call the next handler in the chain to actually execute the protocol.
These methods are available on
HTTPPasswordMgr
and
HTTPPasswordMgrWithDefaultRealm
objects.
HTTPPasswordMgr.
add_password
(
realm
,
uri
,
user
,
passwd
)
¶
uri
can be either a single URI, or a sequence of URIs.
realm
,
user
and
passwd
must be strings. This causes
(user, passwd)
to be used as authentication tokens when authentication for
realm
and a super-URI of any of the given URIs is given.
HTTPPasswordMgr.
find_user_password
(
realm
,
authuri
)
¶
Get user/password for given realm and URI, if any. This method will return
(None, None)
if there is no matching user/password.
For
HTTPPasswordMgrWithDefaultRealm
objects, the realm
None
will be searched if the given
realm
has no matching user/password.
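The lookup rules can be checked in isolation, without any handler or network traffic. The credentials, realm name and URLs below are placeholders, and the import fallback is an assumption so the snippet also runs on Python 3.

```python
try:
    import urllib2                      # Python 2, the module documented here
except ImportError:                     # assumption: Python 3 stand-in
    import urllib.request as urllib2

manager = urllib2.HTTPPasswordMgrWithDefaultRealm()

# Registering under the realm None makes this a catch-all entry.
manager.add_password(None, 'http://www.example.com/', 'klem', 'secret')

# Any realm matches, as long as the URI lies below the registered one.
found = manager.find_user_password(
    'Some Realm', 'http://www.example.com/private/page')

# A different host has no matching entry.
missing = manager.find_user_password(
    'Some Realm', 'http://other.example.org/')

print(found)
print(missing)
```

A manager prepared this way is what would be passed as password_mgr to one of the authentication handlers.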
AbstractBasicAuthHandler.
http_error_auth_reqed
(
authreq
,
host
,
req
,
headers
)
¶
Handle an authentication request by getting a user/password pair, and re-trying the request.
authreq
should be the name of the header where the information about the realm is included in the request,
host
specifies the URL and path to authenticate for,
req
should be the (failed)
Request
object, and
headers
should be the error headers.
host
is either an authority (e.g.
"python.org"
) or a URL containing an authority component (e.g.
"http://python.org/"
). In either case, the authority must not contain a userinfo component (so,
"python.org"
and
"python.org:80"
are fine,
"joe:password@python.org"
is not).
HTTPBasicAuthHandler.
http_error_401
(
req
,
fp
,
code
,
msg
,
hdrs
)
¶
Retry the request with authentication information, if available.
ProxyBasicAuthHandler.
http_error_407
(
req
,
fp
,
code
,
msg
,
hdrs
)
¶
Retry the request with authentication information, if available.
AbstractDigestAuthHandler.
http_error_auth_reqed
(
authreq
,
host
,
req
,
headers
)
¶
authreq
should be the name of the header where the information about the realm is included in the request,
host
should be the host to authenticate to,
req
should be the (failed)
Request
object, and
headers
should be the error headers.
HTTPDigestAuthHandler.
http_error_401
(
req
,
fp
,
code
,
msg
,
hdrs
)
¶
Retry the request with authentication information, if available.
ProxyDigestAuthHandler.
http_error_407
(
req
,
fp
,
code
,
msg
,
hdrs
)
¶
Retry the request with authentication information, if available.
HTTPHandler.
http_open
(
req
)
¶
Send an HTTP request, which can be either GET or POST, depending on
req.has_data()
.
HTTPSHandler.
https_open
(
req
)
¶
Send an HTTPS request, which can be either GET or POST, depending on
req.has_data()
.
FileHandler.
file_open
(
req
)
¶
Open the file locally, if there is no host name, or the host name is
'localhost'
. Change the protocol to
ftp
otherwise, and retry opening it using
parent
.
FTPHandler.
ftp_open
(
req
)
¶
Open the FTP file indicated by req. The login is always done with an empty username and password.
CacheFTPHandler
objects are
FTPHandler
objects with the following additional methods:
CacheFTPHandler.
setTimeout
(
t
)
¶
Set timeout of connections to t seconds.
CacheFTPHandler.
setMaxConns
(
m
)
¶
Set maximum number of cached connections to m.
New in version 2.4.
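Configuring a caching FTP handler might look like the sketch below. The timeout and connection limit are arbitrary example values, the handlers attribute used for inspection is an implementation detail, and the import fallback is an assumption so the snippet also runs on Python 3. No FTP connection is opened.

```python
try:
    import urllib2                      # Python 2, the module documented here
except ImportError:                     # assumption: Python 3 stand-in
    import urllib.request as urllib2

ftp_cache = urllib2.CacheFTPHandler()
ftp_cache.setTimeout(30)     # example value: 30 seconds
ftp_cache.setMaxConns(4)     # example value: at most 4 cached connections

# The instance takes the place of the default FTPHandler.
opener = urllib2.build_opener(ftp_cache)

# 'handlers' is an implementation detail, used here only for inspection.
uses_cache = any(h is ftp_cache for h in opener.handlers)
plain_ftp = [h for h in opener.handlers if type(h) is urllib2.FTPHandler]

print(uses_cache)
print(len(plain_ftp))
```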
HTTPErrorProcessor.
http_response
(
)
¶
Process HTTP error responses.
For 200 error codes, the response object is returned immediately.
For non-200 error codes, this simply passes the job on to the
protocol_error_code
handler methods, via
OpenerDirector.error()
. Eventually,
urllib2.HTTPDefaultErrorHandler
will raise an
HTTPError
if no other handler handles the error.
HTTPErrorProcessor.
https_response
(
)
¶
Process HTTPS error responses.
The behavior is the same as
http_response()
.
In addition to the examples below, more examples are given in HOWTO Fetch Internet Resources Using urllib2.
This example gets the python.org main page and displays the first 100 bytes of it:
>>> import urllib2
>>> f = urllib2.urlopen('http://www.python.org/')
>>> print f.read(100)
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<?xml-stylesheet href="./css/ht2html
Here we are sending a data-stream to the stdin of a CGI and reading the data it returns to us. Note that this example will only work when the Python installation supports SSL.
>>> import urllib2
>>> req = urllib2.Request(url='https://localhost/cgi-bin/test.cgi',
...                       data='This data is passed to stdin of the CGI')
>>> f = urllib2.urlopen(req)
>>> print f.read()
Got Data: "This data is passed to stdin of the CGI"
The code for the sample CGI used in the above example is:
#!/usr/bin/env python
import sys
data = sys.stdin.read()
print 'Content-type: text/plain\n\nGot Data: "%s"' % data
Use of Basic HTTP Authentication:
import urllib2
# Create an OpenerDirector with support for Basic HTTP Authentication...
auth_handler = urllib2.HTTPBasicAuthHandler()
auth_handler.add_password(realm='PDQ Application',
                          uri='https://mahler:8092/site-updates.py',
                          user='klem',
                          passwd='kadidd!ehopper')
opener = urllib2.build_opener(auth_handler)
# ...and install it globally so it can be used with urlopen.
urllib2.install_opener(opener)
urllib2.urlopen('http://www.example.com/login.html')
build_opener()
provides many handlers by default, including a
ProxyHandler
. By default,
ProxyHandler
uses the environment variables named
<scheme>_proxy
, where
<scheme>
is the URL scheme involved. For example, the
http_proxy
environment variable is read to obtain the HTTP proxy’s URL.
This example replaces the default
ProxyHandler
with one that uses programmatically-supplied proxy URLs, and adds proxy authorization support with
ProxyBasicAuthHandler
.
proxy_handler = urllib2.ProxyHandler({'http': 'http://www.example.com:3128/'})
proxy_auth_handler = urllib2.ProxyBasicAuthHandler()
proxy_auth_handler.add_password('realm', 'host', 'username', 'password')
opener = urllib2.build_opener(proxy_handler, proxy_auth_handler)
# This time, rather than install the OpenerDirector, we use it directly:
opener.open('http://www.example.com/login.html')
Adding HTTP headers:
Use the
headers
argument to the
Request
constructor, or:
import urllib2
req = urllib2.Request('http://www.example.com/')
req.add_header('Referer', 'http://www.python.org/')
# Customize the default User-Agent header value:
req.add_header('User-Agent', 'urllib-example/0.1 (Contact: . . .)')
r = urllib2.urlopen(req)
OpenerDirector
automatically adds a
User-Agent
header to every
Request
. To change this:
import urllib2
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
opener.open('http://www.example.com/')
Also, remember that a few standard headers (
Content-Length
,
Content-Type
and
Host
) are added when the
Request
is passed to
urlopen()
(or
OpenerDirector.open()
).