
[Featured] The Handiest Scraping Tool: Requests (HTTP for Humans)

Py站长 • 12 years ago • 23582 views

Why I recommend it:

  1. The official introduction (it is very capable!):

    “Python’s standard urllib2 module provides most of the HTTP capabilities you need, but the API is thoroughly broken. It was built for a different time — and a different web. It requires an enormous amount of work (even method overrides) to perform the simplest of tasks.

    Things shouldn’t be this way. Not in Python.”

  2. The answers to the Stack Overflow question "Should I use urllib or urllib2 or requests?" recommend it as well.

It works really well in practice. If you scrape web pages often, it is worth a try: scraping efficiency improved by about 10%.

Source code: https://github.com/kennethreitz/requests

The most commonly used features are listed below.

  1. Authentication, status codes, headers, encoding, JSON

    >>> r = requests.get('https://api.github.com/user', auth=('user', 'pass'))
    >>> r.status_code
    200
    >>> r.headers['content-type']
    'application/json; charset=utf8'
    >>> r.encoding
    'utf-8'
    >>> r.text
    u'{"type":"User"...'
    >>> r.json()
    {u'private_gists': 419, u'total_private_repos': 77, ...}
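
To see what the `auth=('user', 'pass')` tuple does on the wire, the request can be built locally without sending it. This is a minimal sketch, assuming a reasonably recent Requests release that provides `requests.Request(...).prepare()`; the helper name `basic_auth_header` is made up for illustration.

```python
import requests


def basic_auth_header(url, user, password):
    """Build (but do not send) a GET request and return its Authorization header."""
    request = requests.Request("GET", url, auth=(user, password))
    prepared = request.prepare()  # applies auth, headers, and URL encoding locally
    return prepared.headers["Authorization"]


header = basic_auth_header("https://api.github.com/user", "user", "pass")
print(header)  # "Basic " followed by the base64 encoding of "user:pass"
```

A two-element tuple is treated as HTTP Basic authentication, so the credentials are only base64-encoded, not encrypted; they are protected in transit only when the URL uses HTTPS.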
    
  2. Making requests

    import requests
    URL = "http://www.bsdmap.com/"
    r = requests.get(URL)
    r = requests.post(URL)
    r = requests.put(URL)
    r = requests.delete(URL)
    r = requests.head(URL)
    r = requests.options(URL)
    
  3. Passing parameters in the URL

    >>> payload = {'key1': 'value1', 'key2': 'value2'}
    >>> r = requests.get("http://httpbin.org/get", params=payload)
    >>> r.url
    u'http://httpbin.org/get?key2=value2&key1=value1'
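
The query string can also be inspected without touching the network. The sketch below assumes `requests.Request(...).prepare()` is available in the installed release; `build_url` is a hypothetical helper name, not part of Requests.

```python
import requests


def build_url(base_url, params):
    """Return the final URL Requests would use, without sending anything."""
    prepared = requests.Request("GET", base_url, params=params).prepare()
    return prepared.url


payload = {"key1": "value1", "key2": "value2"}
url = build_url("http://httpbin.org/get", payload)
print(url)

# Values are URL-encoded, and keys whose value is None are left out.
print(build_url("http://httpbin.org/get", {"q": "a b", "skip": None}))
```

On Python 3.7 and later the parameters appear in the order they were inserted into the dict; older interpreters may order them differently, as in the transcript above.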
    
  4. Response content

    >>> import requests
    >>> r = requests.get('https://github.com/timeline.json')
    >>> r.text
    '[{"repository":{"open_issues":0,"url":"https://github.com/...
    >>> r.encoding
    'utf-8'
    >>> r.encoding = 'ISO-8859-1'
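
`r.text` is essentially `r.content` decoded with `r.encoding`, which is why assigning a new `r.encoding` changes what `r.text` returns. The standard-library sketch below shows the effect on a made-up sample string; no request is sent.

```python
# Bytes as a server might send them: the text "café" encoded as UTF-8.
raw = "café".encode("utf-8")

as_utf8 = raw.decode("utf-8")         # the right codec recovers the text
as_latin1 = raw.decode("ISO-8859-1")  # the wrong codec yields mojibake

print(as_utf8)
print(as_latin1)
```

If a page shows garbled accented or CJK characters, checking `r.encoding` against the page's real charset is a reasonable first step.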
    
  5. Binary content

    You can also access the response body as bytes, for non-text requests:
    
    >>> r.content
    b'[{"repository":{"open_issues":0,"url":"https://github.com/...
    
    The gzip and deflate transfer-encodings are automatically decoded for you.
    
    For example, to create an image from binary data returned by a request,
    you can use the following code:
    
    >>> from PIL import Image
    >>> from io import BytesIO  # works on Python 2.6+ and Python 3
    >>> i = Image.open(BytesIO(r.content))
    
  6. JSON

    >>> import requests
    >>> r = requests.get('https://github.com/timeline.json')
    >>> r.json()
    [{u'repository': {u'open_issues': 0, u'url': 'https://github.com/...
    
  7. Timeouts

    >>> requests.get('http://github.com', timeout=0.001)
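
When the timeout expires, Requests raises an exception instead of returning a response, so scraping code usually wraps the call. This is a minimal sketch; `fetch_or_none` and its default of 3 seconds are illustrative choices, not part of the library.

```python
import requests


def fetch_or_none(url, timeout=3.0):
    """Return the response body as text, or None if the request timed out."""
    try:
        response = requests.get(url, timeout=timeout)
    except requests.exceptions.Timeout:
        # Covers both connect timeouts and read timeouts.
        return None
    return response.text
```

With a value as small as the `0.001` used above, a call such as `fetch_or_none('http://github.com', timeout=0.001)` would most likely return `None`. Other failures (DNS errors, refused connections) are deliberately not caught here and still propagate.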
    
  8. Custom headers

    >>> import json
    >>> url = 'https://api.github.com/some/endpoint'
    >>> payload = {'some': 'data'}
    >>> headers = {'content-type': 'application/json'}
    
    >>> r = requests.post(url, data=json.dumps(payload), headers=headers)
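
To check what such a request would carry before sending it, the same arguments can be passed to `requests.Request` and prepared locally. This sketch assumes `Request.prepare()` is available in the installed release; the endpoint is the placeholder URL from the example above.

```python
import json

import requests

url = "https://api.github.com/some/endpoint"
payload = {"some": "data"}
headers = {"content-type": "application/json"}

# Build the request locally instead of sending it, so the result can be inspected.
prepared = requests.Request(
    "POST", url, data=json.dumps(payload), headers=headers
).prepare()

print(prepared.headers["Content-Type"])  # header lookup is case-insensitive
print(prepared.body)
```

Because `data` is already a string, Requests sends it unchanged and keeps the supplied content type. Newer Requests releases also accept a `json=` argument that serializes the payload and sets the header in one step.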
    

For more, see the official documentation:

http://docs.python-requests.org/en/latest/user/quickstart/

http://docs.python-requests.org/en/latest/user/advanced/#advanced

Article URL: http://www.python88.com/topic/120