10 practical Python web-scraping tips to help you work more efficiently.

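If you want to pick up web scraping quickly, Python is the language to start with. It is widely used for rapid web development, web crawlers, and automation, and it can also be used to build simple websites, write auto-posting scripts, send and receive email, and develop basic CAPTCHA-recognition tools.

Many steps in crawler development can be reused over and over. This article shares 10 practical tips to help you work more efficiently.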

1 Basic web scraping

Using the GET method

import urllib.request

url = "http://www.test.com"
response = urllib.request.urlopen(url)
# read() returns the raw response body as bytes
print(response.read())

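Using the POST method

import urllib.parse
import urllib.request

url = "http://test.com"
form = {'name': 'abc', 'password': '1234'}
# The request body must be bytes, so encode the URL-encoded form
form_data = urllib.parse.urlencode(form).encode('utf-8')
request = urllib.request.Request(url, form_data)
response = urllib.request.urlopen(request)
print(response.read())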

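2 Using a proxy IP to get around IP blocks

When developing a crawler, you will often find your IP blocked. When that happens, you need a proxy IP to keep the crawl going. Python's urllib.request module provides a ProxyHandler class that lets you configure a proxy so the crawler can reach the page normally.

Here is the code snippet: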

import urllib.request

proxy_handler = urllib.request.ProxyHandler({'http': 'http://your_proxy_ip:port'})
opener = urllib.request.build_opener(proxy_handler)
urllib.request.install_opener(opener)

response = urllib.request.urlopen('http://example.com')
html = response.read()
print(html)

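3 Managing cookies

Cookies are small pieces of data, which may be encrypted, that a website stores on the user's local machine to identify the user and track the session.

Python provides the http.cookiejar module specifically for handling cookies. Its main job is to provide objects that can store cookies, which makes it easier to access internet resources together with the urllib.request module.

import urllib.request
import http.cookiejar

cookie_support = urllib.request.HTTPCookieProcessor(http.cookiejar.CookieJar())
opener = urllib.request.build_opener(cookie_support)
urllib.request.install_opener(opener)

content = urllib.request.urlopen('http://XXXX').read()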

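The key is the CookieJar() object. It manages HTTP cookies: it stores the cookies generated by HTTP requests and adds them to outgoing HTTP requests. The cookies are held entirely in memory, so once the CookieJar object is no longer in use and is reclaimed by Python's garbage collector, the cookies it holds are lost as well. All of this happens automatically and needs no separate handling.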

import urllib.request
import http.cookiejar

# Create a CookieJar object to hold the cookies
cookie_jar = http.cookiejar.CookieJar()

# Create an opener object that handles HTTP requests
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie_jar))

# Add a cookie manually
cookie = http.cookiejar.Cookie(
    version=0, name='example_cookie', value='cookie_value',
    port=None, port_specified=False,
    domain='example.com', domain_specified=True, domain_initial_dot=False,
    path='/', path_specified=True,
    secure=False, expires=None, discard=True,
    comment=None, comment_url=None, rest={}, rfc2109=False
)

cookie_jar.set_cookie(cookie)
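
Because a plain CookieJar lives only in memory, its cookies disappear when the program exits. If you want them to survive between runs, one option is the standard library's MozillaCookieJar, which can save to and load from a file. The following is a minimal sketch under that assumption; the file name cookies.txt and the helper make_cookie are illustrative and not part of the example above.

```python
import http.cookiejar
import os
import tempfile


def make_cookie(name, value, domain):
    """Build a simple cookie (hypothetical helper for this sketch)."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path='/', path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={}, rfc2109=False,
    )


# Assumed file location; any writable path works
cookie_file = os.path.join(tempfile.gettempdir(), 'cookies.txt')

# Save the jar's cookies to disk.
# ignore_discard=True keeps session cookies, which are skipped by default.
jar = http.cookiejar.MozillaCookieJar(cookie_file)
jar.set_cookie(make_cookie('example_cookie', 'cookie_value', 'example.com'))
jar.save(ignore_discard=True, ignore_expires=True)

# Later (for example in another run), load them back into a fresh jar
restored = http.cookiejar.MozillaCookieJar(cookie_file)
restored.load(ignore_discard=True, ignore_expires=True)
names = [c.name for c in restored]
print(names)
```

The restored jar can be passed to HTTPCookieProcessor exactly like the in-memory CookieJar.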

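4 Masquerading as a browser

Some websites dislike crawlers and reject their requests outright. As a result, accessing a site directly with urllib.request often fails with HTTP Error 403: Forbidden. In that case, pay close attention to the request headers, because the server inspects them to judge whether a request comes from a real browser:

User-Agent: some servers or proxies check this value to determine whether the request was sent by a browser.
Content-Type: when you call a REST interface, the server checks this value to decide how to parse the content of the HTTP body.

The following snippet shows how to masquerade as a browser:

import urllib.request

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
}

request = urllib.request.Request(
    url='http://my.test.net/xxx/blog?catalog=xxxx',
    headers=headers
)

print(urllib.request.urlopen(request).read())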
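
The Content-Type header mentioned above matters when you send a body to a REST interface. As a rough sketch, here is how a JSON POST request could be prepared with urllib.request. The endpoint http://example.com/api/items and the payload are made up for illustration, so the actual send is left commented out.

```python
import json
import urllib.request

# Hypothetical REST endpoint and payload, used only for illustration
api_url = 'http://example.com/api/items'
payload = json.dumps({'name': 'abc'}).encode('utf-8')

request = urllib.request.Request(
    api_url,
    data=payload,
    headers={
        'User-Agent': 'Mozilla/5.0',
        # Tells the server that the body is JSON
        'Content-Type': 'application/json',
    },
    method='POST',
)

# urllib stores header names with only the first letter capitalized
print(request.get_method())
print(request.get_header('Content-type'))

# Sending it would look like this (commented out because the endpoint is made up):
# response = urllib.request.urlopen(request)
```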

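5 Page parsing

For page parsing, the most powerful tool is of course the regular expression. The patterns differ for every user and every website, so there is little to be said in general. Next come the parsing libraries, of which the two most commonly used are lxml and BeautifulSoup.

Both are HTML/XML processing libraries. BeautifulSoup is implemented in pure Python; it is slower but very practical, for example letting you search the parsed document to get the source of a specific HTML node.

lxml is implemented in C, so it is very fast, and it supports XPath.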

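from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Extract the title
title = soup.title.string
print(f"Title: {title}")

# Extract all links
links = soup.find_all('a')
for link in links:
    print(f"Link: {link.get('href')}, Text: {link.string}")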
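
For comparison, the regular-expression approach mentioned at the start of this section can be sketched with the standard re module. The sample HTML and the pattern below are illustrative assumptions. A pattern like this only copes with simple, regular markup, which is why a real parser is usually the safer choice.

```python
import re

# Small made-up sample page
html = '''
<p>Once upon a time there were three little sisters:
<a href="http://example.com/elsie">Elsie</a>,
<a href="http://example.com/lacie">Lacie</a> and
<a href="http://example.com/tillie">Tillie</a>.</p>
'''

# Capture the href value and the link text of each <a> tag
link_pattern = re.compile(
    r'<a\s+[^>]*href="([^"]*)"[^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)

# findall returns one (href, text) tuple per match
links = link_pattern.findall(html)
for href, text in links:
    print(f"Link: {href}, Text: {text}")
```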

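6 Data compression

Have you ever run into a page whose content stays garbled no matter how you decode it? The likely cause is that many web services can send compressed data, which can cut the amount of data transferred over the network by more than 60%. This works especially well for data such as XML, which compresses very well.

However, a server generally will not send you compressed data unless you tell it that you can handle it.

So you need to modify your code like this: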

import urllib.request

request = urllib.request.Request('http://xxxx.com')
# Tell the server we can handle gzip-compressed responses
request.add_header('Accept-encoding', 'gzip')
opener = urllib.request.build_opener()
f = opener.open(request)

Then decompress:

import io
import gzip

compresseddata = f.read()
# The body is bytes, so wrap it in BytesIO for GzipFile
compressedstream = io.BytesIO(compresseddata)
gzipper = gzip.GzipFile(fileobj=compressedstream)
print(gzipper.read())
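
A server is free to ignore the Accept-encoding header and send plain data, in which case decompressing would fail. One way to guard against this is to check the Content-Encoding response header first. The sketch below assumes a helper named decode_body (a hypothetical name) and uses simulated data in place of a real response, where the two arguments would come from f.read() and f.headers.get('Content-Encoding').

```python
import gzip


def decode_body(raw, content_encoding):
    """Return the decompressed body if the server gzip-compressed it."""
    if content_encoding and content_encoding.lower() == 'gzip':
        return gzip.decompress(raw)
    # The server sent plain data, so return it unchanged
    return raw


# Simulated XML-like payload standing in for a real response body
original = b'<xml>' + b'<item>hello</item>' * 100 + b'</xml>'
compressed = gzip.compress(original)

print(decode_body(compressed, 'gzip') == original)
print(decode_body(original, None) == original)
# Compare the sizes to see the effect of compression on repetitive data
print(len(compressed), len(original))
```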

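7 Multithreaded concurrent fetching

If a single thread is too slow, it is time for multithreading. Here is a simple thread-pool template. The program only prints the numbers 0 through 9, but you can already see that it handles several tasks concurrently.

Multithreading is generally not Python's strong suit, but for crawlers, which spend most of their time waiting on network I/O, it can still improve throughput to some extent.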

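from threading import Thread
from queue import Queue
from time import sleep

# q is the task queue
# NUM is the total number of concurrent threads
# JOBS is the number of tasks
q = Queue()
NUM = 2
JOBS = 10

# The handler function, responsible for processing a single task
def do_something_using(arguments):
    print(arguments)

# The worker thread, which keeps taking items from the queue and processing them
def working():
    while True:
        arguments = q.get()
        do_something_using(arguments)
        sleep(1)
        q.task_done()

# Start NUM threads that wait on the queue
for i in range(NUM):
    # daemon=True lets the program exit even though the workers loop forever
    t = Thread(target=working, daemon=True)
    t.start()

# Put the JOBS tasks into the queue
for i in range(JOBS):
    q.put(i)

# Wait for all JOBS to finish
q.join()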
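
The same pattern can also be written with the standard library's concurrent.futures.ThreadPoolExecutor, which manages the queue and the worker threads for you. This is an alternative sketch, not the template above. It reuses the names NUM, JOBS, and do_something_using, and the handler here returns a value instead of printing, so the results can be collected.

```python
from concurrent.futures import ThreadPoolExecutor
from time import sleep

NUM = 2    # number of worker threads
JOBS = 10  # number of tasks


def do_something_using(arguments):
    """Stand-in for real work such as downloading a page."""
    sleep(0.1)
    return arguments * 2


# map() hands each task to a free worker and yields results in input order
with ThreadPoolExecutor(max_workers=NUM) as executor:
    results = list(executor.map(do_something_using, range(JOBS)))

print(results)
```

Leaving the with block waits for all tasks to finish, which plays the role of q.join() in the queue-based version.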

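8 Local caching

When crawling a large website, it is good practice to cache the data you have already downloaded. That way, if you need the same page again during the crawl, you do not have to fetch it from the site a second time. A key-value store such as Redis makes this simple, but you can also use MySQL or a file-system cache.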

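import redis
import requests

# Connect to the local Redis server.
# decode_responses=True makes get() return str instead of bytes,
# so cached and freshly fetched data have the same type.
cache = redis.StrictRedis(host='localhost', port=6379, db=0, decode_responses=True)

def fetch_url(url):
    # Check whether the URL is already in the cache
    cached_data = cache.get(url)
    if cached_data is not None:
        print("Using cached data")
        return cached_data
    else:
        print("Fetching new data")
        response = requests.get(url)
        data = response.text
        # Store the data in the cache, with the URL as the key
        cache.set(url, data)
        return data

# Example usage
url = "http://example.com"
content = fetch_url(url)
print(content)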
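
If you would rather not run a Redis server, the file-system option mentioned above can be sketched with the standard library alone. The directory name crawler_cache and the functions cache_path, fetch_cached, and fake_download are all assumptions made for this sketch. The downloader is passed in as a parameter so the example runs without network access; in a real crawler it would wrap something like requests.get.

```python
import hashlib
import os
import tempfile

# Assumed cache location for this sketch
CACHE_DIR = os.path.join(tempfile.gettempdir(), 'crawler_cache')


def cache_path(url):
    """Map a URL to a file name by hashing it, since URLs contain path-unsafe characters."""
    digest = hashlib.sha256(url.encode('utf-8')).hexdigest()
    return os.path.join(CACHE_DIR, digest + '.html')


def fetch_cached(url, download):
    """Return the page for url, calling download(url) only on a cache miss."""
    path = cache_path(url)
    if os.path.exists(path):
        with open(path, encoding='utf-8') as fh:
            return fh.read()
    data = download(url)
    os.makedirs(CACHE_DIR, exist_ok=True)
    with open(path, 'w', encoding='utf-8') as fh:
        fh.write(data)
    return data


# Demo with a fake downloader that records how often it is called
calls = []


def fake_download(url):
    calls.append(url)
    return f"<html>content of {url}</html>"


demo_url = "http://example.com/cache-demo"
if os.path.exists(cache_path(demo_url)):
    os.remove(cache_path(demo_url))  # start the demo from a cold cache

first = fetch_cached(demo_url, fake_download)   # cache miss: downloads
second = fetch_cached(demo_url, fake_download)  # cache hit: reads the file
print(first == second, len(calls))
```

Unlike Redis, this sketch never expires entries, so stale pages stay cached until the files are deleted.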