[Featured] Requests (HTTP for Humans): the handiest tool for web scraping

Py站长 • 10 years ago • 22445 views

Why I recommend it:

The official introduction (very powerful!):

“Python’s standard urllib2 module provides most of the HTTP capabilities you need, but the API is thoroughly broken. It was built for a different time — and a different web. It requires an enormous amount of work (even method overrides) to perform the simplest of tasks.

Things shouldn’t be this way. Not in Python.”
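
The answers to the Stack Overflow question "Should I use urllib or urllib2 or requests?" recommend it as well!

It works very well in practice. If you fetch web pages often, it is worth considering: fetching efficiency improves by about 10%.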

Source code: https://github.com/kennethreitz/requests

The most commonly used features are listed below.

Authentication, status codes, headers, encoding, JSON

>>> r = requests.get('https://api.github.com/user', auth=('user', 'pass'))
>>> r.status_code
200
>>> r.headers['content-type']
'application/json; charset=utf8'
>>> r.encoding
'utf-8'
>>> r.text
u'{"type":"User"...'
>>> r.json()
{u'private_gists': 419, u'total_private_repos': 77, ...}
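
The auth=('user', 'pass') tuple above is sent as HTTP Basic authentication. Here is a minimal offline sketch, assuming a reasonably recent Requests release: it prepares the request without sending it and inspects the header that would go over the wire.

```python
import base64

import requests

# Build the request but do not send it; prepare() applies auth and headers.
req = requests.Request("GET", "https://api.github.com/user", auth=("user", "pass"))
prepared = req.prepare()

# A (user, password) tuple is shorthand for HTTP Basic auth:
# the header value is "Basic " + base64("user:pass").
auth_header = prepared.headers["Authorization"]
expected = "Basic " + base64.b64encode(b"user:pass").decode("ascii")
print(auth_header == expected)
```

Basic auth is only base64-encoded, not encrypted, so it should go over https.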

Making requests

import requests
URL="http://www.bsdmap.com/"
r = requests.get(URL)
r = requests.post(URL)
r = requests.put(URL)
r = requests.delete(URL)
r = requests.head(URL)
r = requests.options(URL)
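
Each of these calls returns a Response object, and a common next step is to check whether the request succeeded. A small sketch of that check follows. The responses are built by hand purely so it runs without a network, the helper name describe is made up for illustration, and the /missing path is hypothetical.

```python
import requests


def describe(response):
    """Return 'ok' for a successful response, else the error message."""
    try:
        response.raise_for_status()  # raises HTTPError for 4xx/5xx codes
    except requests.exceptions.HTTPError as exc:
        return "error: %s" % exc
    return "ok"


# Stand-in responses built by hand so the sketch runs offline.
good = requests.Response()
good.status_code = 200

bad = requests.Response()
bad.status_code = 404
bad.url = "http://www.bsdmap.com/missing"  # hypothetical path

print(describe(good))
print(describe(bad))
```

In real code you would pass in the r returned by requests.get(URL) instead of a hand-built object.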

Passing parameters in the URL

>>> payload = {'key1': 'value1', 'key2': 'value2'}
>>> r = requests.get("http://httpbin.org/get", params=payload)
>>> print(r.url)
http://httpbin.org/get?key1=value1&key2=value2
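
To see how params is encoded without touching the network, you can prepare the request and look at its URL. This is only a sketch; the extra keys (a list value and a value containing a space) are made up for illustration.

```python
import requests

# Illustrative payload: a plain value, a list value, and a value with a space.
payload = {"key1": "value1", "key2": ["a", "b"], "q": "hello world"}

# prepare() builds the final URL without sending anything.
prepared = requests.Request("GET", "http://httpbin.org/get", params=payload).prepare()

# List values become repeated keys; special characters are percent/plus encoded.
print(prepared.url)
```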

Response content

>>> import requests
>>> r = requests.get('https://github.com/timeline.json')
>>> r.text
'[{"repository":{"open_issues":0,"url":"https://github.com/...
>>> r.encoding
'utf-8'
>>> r.encoding = 'ISO-8859-1'
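
Setting r.encoding changes how r.text decodes the raw bytes; r.content itself is untouched. A rough offline illustration follows. It fills in the private _content attribute by hand, which you would not do in real code, just to get a response without a network call.

```python
import requests

r = requests.Response()
# Private attribute, set only so this demo works offline.
r._content = "caf\u00e9".encode("utf-8")

r.encoding = "utf-8"
as_utf8 = r.text      # decoded with the correct charset

r.encoding = "ISO-8859-1"
as_latin1 = r.text    # same bytes, wrong charset: mojibake

print(repr(as_utf8), repr(as_latin1))
```

So if r.text looks garbled, check r.encoding before blaming the server.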

Binary content

You can also access the response body as bytes, for non-text requests:

>>> r.content
b'[{"repository":{"open_issues":0,"url":"https://github.com/...

The gzip and deflate transfer-encodings are automatically decoded for you.

For example, to create an image from binary data returned by a request, you can use the following code:

>>> from io import BytesIO
>>> from PIL import Image
>>> i = Image.open(BytesIO(r.content))  # r.content is bytes, so wrap it in BytesIO

JSON

>>> import requests
>>> r = requests.get('https://github.com/timeline.json')
>>> r.json()
[{u'repository': {u'open_issues': 0, u'url': 'https://github.com/...
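
r.json() raises an exception when the body is not valid JSON (for example an HTML error page). Here is a defensive sketch: the helper name parse_json is made up, and the responses are hand-built through the private _content attribute only so it runs offline.

```python
import requests


def parse_json(response):
    """Return the decoded JSON body, or None if the body is not valid JSON."""
    try:
        return response.json()
    except ValueError:  # Requests' JSON decode errors are ValueError subclasses
        return None


# Stand-in responses built by hand so the sketch runs offline.
good = requests.Response()
good._content = b'[{"repository": {"open_issues": 0}}]'

bad = requests.Response()
bad._content = b"<html>not json</html>"

print(parse_json(good))
print(parse_json(bad))
```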

Timeouts

>>> requests.get('http://github.com', timeout=0.001)
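
With a timeout that small the call is expected to raise a timeout exception rather than return a response. Below is a sketch of handling it. The helper name fetch_status and the injectable get parameter are assumptions added so the code can be exercised without a network; the default values are arbitrary.

```python
import requests


def fetch_status(url, timeout=(3.05, 10), get=requests.get):
    """Return the HTTP status code, or None if the request timed out.

    `timeout` is in seconds: one number, or a (connect, read) tuple.
    `get` is injectable only so the sketch can be tried offline.
    """
    try:
        return get(url, timeout=timeout).status_code
    except requests.exceptions.Timeout:
        # Covers both connect timeouts and read timeouts.
        return None
```

Without a timeout argument Requests waits indefinitely, so crawlers should normally set one.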

Custom headers

>>> import json
>>> url = 'https://api.github.com/some/endpoint'
>>> payload = {'some': 'data'}
>>> headers = {'content-type': 'application/json'}

>>> r = requests.post(url, data=json.dumps(payload), headers=headers)
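
Newer Requests releases (2.4.2 and later, as far as I know) also accept a json= argument that serializes the payload and sets the Content-Type header for you. A sketch comparing the two forms, preparing the requests without sending them:

```python
import json

import requests

url = "https://api.github.com/some/endpoint"
payload = {"some": "data"}

# Manual form: serialize the body and set the header yourself.
manual = requests.Request(
    "POST", url,
    data=json.dumps(payload),
    headers={"content-type": "application/json"},
).prepare()

# Shortcut: json= does the serialization and sets Content-Type.
auto = requests.Request("POST", url, json=payload).prepare()

print(manual.headers["Content-Type"], auto.headers["Content-Type"])
```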

For more, see the official documentation:

http://docs.python-requests.org/en/latest/user/quickstart/

http://docs.python-requests.org/en/latest/user/advanced/#advanced

Article URL: http://www.python88.com/topic/120

Replies [ 13 ] | Latest reply 9 years ago

• #1

Py站长 9 years ago

@olivetree I haven't read the source, but one thing is certain: it is much friendlier to use.

• #2

olivetree 9 years ago

requests is built on urllib3, so what improvements does it make on top of urllib3?

• #3

olivetree 9 years ago

@Django中国社区 It turns out compression is already enabled by default.

• #4

Py站长 9 years ago

@olivetree It follows the HTTP protocol, so if gzip is set as an accepted encoding, the transfer should be compressed automatically.
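
A quick way to check what Requests advertises by default is to look at a new session's headers. This is just a sketch; the exact value can vary with the installed version and optional packages. As far as I know this only covers responses (which are decoded transparently); request bodies are not compressed for you.

```python
import requests

session = requests.Session()

# Default headers sent with every request from this session.
accept_encoding = session.headers["Accept-Encoding"]
print(accept_encoding)  # normally includes gzip and deflate
```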

• #5

olivetree 9 years ago

Can this use compressed transfer?

• #6

Py站长 10 years ago

@zyloveszjj Nice~

• #7

lzjun567 10 years ago

It is extremely convenient when combined with BeautifulSoup.

• #8

zyloveszjj 10 years ago

The image-saving part can be written like this: with open('test.png', 'wb') as f: f.write(res.content) Personally I find that much more convenient than using StringIO~
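
Spelled out as a small helper, the one-liner above might look like the sketch below. The name save_body is made up, and res is a hand-built stand-in (private _content attribute, just the 8-byte PNG signature) so the code runs offline; normally it would come from requests.get(...).

```python
import os
import tempfile

import requests


def save_body(response, path):
    """Write the raw response bytes to `path` and return the file size."""
    with open(path, "wb") as f:  # binary mode: content is bytes
        f.write(response.content)
    return os.path.getsize(path)


# Stand-in response holding only the PNG file signature.
res = requests.Response()
res._content = b"\x89PNG\r\n\x1a\n"

path = os.path.join(tempfile.mkdtemp(), "test.png")
size = save_body(res, path)
print(size)
```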

• #9

boostbob 10 years ago

Looks good. I have written crawlers in Java; people usually recommend urllib, but this looks like it would be much nicer to use....

• #10

powgolf 10 years ago

@Django中国社区 Thanks!

• #11

Py站长 10 years ago

@powgolf

Something like this:

# `options` is assumed to be defined elsewhere (a module of key-name constants)
payload = {
    options.VERSION: '1_0',
    options.PRODUCT_LINE: 'dan',
    options.SERVICE: 'dnwebbilling',
    options.ENV: 'online',
    options.MAIN: 'master',
    options.FILENAME: 'a.properties',
}
r = requests.get("http://localhost:8000/obj", params=payload)
print(r.url)  # the final URL, after any redirects were followed
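
Besides r.url, the intermediate redirect responses are kept in r.history. Here is a sketch that returns both. The helper name final_url and the injectable get parameter are assumptions, added so it can be tried without a network.

```python
import requests


def final_url(url, get=requests.get):
    """Return (final URL, status codes of the redirects that led there)."""
    r = get(url, timeout=10)
    # r.history lists the redirect responses, oldest first;
    # r.url is the URL of the response that was finally returned.
    return r.url, [hop.status_code for hop in r.history]
```

For example, final_url("http://github.com") would be expected to report an https URL and at least one 301 hop, though that depends on the site.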

• #12

powgolf 10 years ago

A question: how do I get the final URL after redirects with requests, like geturl() in urllib2? Thanks.

• #13

Py站长 10 years ago

One thing to note: when writing a crawler, you can set keep-alive=False in Requests; otherwise the site may block you.
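
The keep-alive=False setting mentioned here is from the Requests version of that time. On current releases, one way to get a similar effect (an assumption on my part, not something from the original post) is to send a Connection: close header from a session. Whether this actually avoids being blocked depends on the site.

```python
import requests

session = requests.Session()
# Ask the server to close the connection after each response
# instead of keeping it alive (the session default is keep-alive).
session.headers["Connection"] = "close"

# Prepare (but do not send) a request to confirm the header is applied.
prepared = session.prepare_request(requests.Request("GET", "http://www.bsdmap.com/"))
print(prepared.headers["Connection"])
```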
