Python web crawler: how to convert it to concurrent crawling

Background and goal

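By default, a crawler program runs in a single thread, so crawling is fairly slow.

Converting it to multithreaded crawling lets the requests overlap: while one thread waits for a server response, other threads can send their own requests, which speeds up crawling. The gain comes from overlapping network waits rather than from extra CPU cores, because CPython threads share one interpreter lock.

The following code crawls the content of a novel and stores it in a file.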

import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
}
root_url = "http://antpython.net/novels/01.html"

resp_root = requests.get(root_url, headers=headers)

soup = BeautifulSoup(resp_root.text, "html.parser")

chapter_links = soup.find("div", id="novel_chapters").find_all("div", class_="chapter_link")

import time

start = time.time()
count = len(chapter_links)
# Open the output file; the with block closes it even if a request fails
with open("小说.txt", "w", encoding="utf8") as fout:
    for idx, chapter_link in enumerate(chapter_links):
        link = chapter_link.find("a")
        href = "http://antpython.net%s" % link["href"]
        title = link.get_text()

        print("Crawling:", href, title, idx, count, idx / count * 100)
        resp_cont = requests.get(href, headers=headers)
        soup_cont = BeautifulSoup(resp_cont.text, "html.parser")
        cont = soup_cont.find("div", id="chapter_content").get_text()
        fout.write(title + "\n")
        fout.write(cont + "\n")

print("Crawl time:", time.time() - start)

After running it, the reported crawl time was 56.08 seconds.
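In a crawler like this, most of the run time is usually spent waiting for HTTP responses rather than computing. The sketch below illustrates why threads help in that situation. It makes no real requests: `fake_download` is a hypothetical stand-in that simply sleeps, and the delay values are made up for the demonstration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(seconds):
    """Hypothetical stand-in for a network request: the thread only waits."""
    time.sleep(seconds)
    return seconds


def run_sequential(delays):
    """Run the waits one after another and return the elapsed time."""
    start = time.perf_counter()
    for delay in delays:
        fake_download(delay)
    return time.perf_counter() - start


def run_threaded(delays):
    """Run the waits in a thread pool and return the elapsed time."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=len(delays)) as pool:
        list(pool.map(fake_download, delays))
    return time.perf_counter() - start


delays = [0.2] * 5
sequential_time = run_sequential(delays)
threaded_time = run_threaded(delays)
print(f"sequential: {sequential_time:.2f}s, threaded: {threaded_time:.2f}s")
```

The sequential run takes roughly the sum of the waits, while the threaded run takes roughly the longest single wait, because the waits overlap.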

Refactoring the code

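Note that with concurrent crawling, the order in which chapters finish is not guaranteed to match the order in which they were submitted. We can give each URL a sequence number and sort by it later.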
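The ordering problem can be reproduced without any network access. In this minimal sketch, `fake_fetch` is a hypothetical stand-in for a chapter request, and the delays are chosen so that earlier chapters take longer than later ones. Tasks will typically finish out of order, and sorting on the sequence number restores the original order.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_fetch(index, delay):
    """Hypothetical stand-in for a chapter request; sleeps instead of downloading."""
    time.sleep(delay)
    return index, f"chapter {index}"


# Earlier chapters get longer delays, so they tend to finish last.
delays = [0.3, 0.2, 0.1]

with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(fake_fetch, i, d) for i, d in enumerate(delays)]
    # as_completed yields futures in completion order, not submission order.
    finished = [f.result() for f in as_completed(futures)]

print("completion order:", [i for i, _ in finished])

# Sorting by the sequence number restores the chapter order.
finished.sort(key=lambda item: item[0])
print("after sorting:   ", [i for i, _ in finished])
```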

First, turn the crawling of each chapter into a function

def craw_single(index, title, chapter_link):
    """Crawl a single chapter; return its index, title, and content."""
    resp_cont = requests.get(chapter_link, headers=headers)
    soup_cont = BeautifulSoup(resp_cont.text, "html.parser")
    cont = soup_cont.find("div", id="chapter_content").get_text()
    return index, title, cont

The index parameter exists purely for the later sort.

Then start crawling each chapter by submitting it to a thread pool

import time
import concurrent.futures
from concurrent.futures import ThreadPoolExecutor

start = time.time()
count = len(chapter_links)
pool = ThreadPoolExecutor()
futures = []
for idx, chapter_link in enumerate(chapter_links):
    link = chapter_link.find("a")
    href = "http://antpython.net%s" % link["href"]
    title = link.get_text()

    futures.append(pool.submit(craw_single, index=idx, title=title, chapter_link=href))

We submit each task with pool.submit and collect the returned Future objects in the futures list.

Wait for all threads to finish

results = []
for future in concurrent.futures.as_completed(futures):
    results.append(future.result())

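This loop receives each future as its task finishes and appends future.result(), which is the function's return value, to the list. The loop ends only after every submitted task has completed.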
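One detail worth knowing here: future.result() re-raises any exception that was raised inside the worker function, so a single failed request would stop the collection loop. The sketch below shows one possible way to keep going and record failures. `flaky_task` is a hypothetical function that fails for one index to imitate a failed request; it is not part of the crawler above.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def flaky_task(index):
    """Hypothetical task: fails for index 2 to imitate a failed request."""
    if index == 2:
        raise ValueError(f"chapter {index} could not be fetched")
    return index, f"content of chapter {index}"


results = []
failed = []

with ThreadPoolExecutor(max_workers=4) as pool:
    # Map each future back to the index it was submitted with.
    future_to_index = {pool.submit(flaky_task, i): i for i in range(5)}
    for future in as_completed(future_to_index):
        index = future_to_index[future]
        try:
            results.append(future.result())
        except ValueError as exc:
            # result() re-raised the worker's exception; record it and continue.
            failed.append((index, str(exc)))

results.sort(key=lambda item: item[0])
print("succeeded:", [i for i, _ in results])
print("failed:", failed)
```

Keeping a dictionary from future to index makes it possible to tell which chapter failed, so it can be retried or reported.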

Write the results to a file

results.sort(key=lambda x: x[0])
with open("小说结果.txt", "w", encoding="utf8") as fout:
    for index, title, cont in results:
        fout.write(title + "\n")
        fout.write(cont + "\n")
pool.shutdown()
print("Crawl time:", time.time() - start)

Here the data is sorted into chapter order.

Then the file is opened and the content is written.

Finally, the thread pool is shut down.
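As an alternative to calling shutdown by hand, ThreadPoolExecutor can be used as a context manager, and its map method returns results in input order, which removes the need for a sequence number. The following is a minimal sketch of that variant, not the approach used above. `fake_chapter` and the task list are hypothetical stand-ins that sleep instead of downloading.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_chapter(args):
    """Hypothetical stand-in for craw_single; takes one (title, delay) tuple."""
    title, delay = args
    time.sleep(delay)
    return title, f"text of {title}"


tasks = [("Chapter 1", 0.3), ("Chapter 2", 0.1), ("Chapter 3", 0.2)]

# Leaving the with block calls pool.shutdown(wait=True) automatically.
with ThreadPoolExecutor(max_workers=3) as pool:
    # map() yields results in the order of the inputs, not completion order.
    ordered = list(pool.map(fake_chapter, tasks))

print([title for title, _ in ordered])
```

The trade-off is that map gives less control over individual tasks than submit, for example when each failure should be handled separately.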

Summary

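To make a task multithreaded, first turn the unit of work you want to split out into a single function. Then submit the tasks to a thread pool. Once they are all submitted, wait for the tasks and collect their return values. After processing the returned data, write it out to a file.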
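These steps can be gathered into one helper. The sketch below is one possible arrangement under stated assumptions: `crawl_to_file` and `fetch_stub` are hypothetical names, the URLs are placeholders, and the stub returns canned text where a real fetcher would call requests.get.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_stub(index, url):
    """Hypothetical fetcher; a real one would download and parse the page."""
    return index, f"title {index}", f"body from {url}"


def crawl_to_file(fetch, urls, out_path, max_workers=8):
    """Submit one task per URL, collect results, sort by index, write to out_path."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fetch, i, url) for i, url in enumerate(urls)]
        results = [f.result() for f in as_completed(futures)]
    # Restore the original order before writing.
    results.sort(key=lambda item: item[0])
    with open(out_path, "w", encoding="utf8") as fout:
        for _, title, body in results:
            fout.write(title + "\n")
            fout.write(body + "\n")
    return len(results)


# Placeholder URLs; nothing is actually requested.
urls = [f"http://example.invalid/{n}.html" for n in range(3)]
out_path = os.path.join(tempfile.mkdtemp(), "demo.txt")
written = crawl_to_file(fetch_stub, urls, out_path)
with open(out_path, encoding="utf8") as fin:
    lines = fin.read().splitlines()
print(written, lines[:2])
```

Passing the fetch function as a parameter keeps the threading logic separate from the page-specific parsing, so the same helper can be reused for another site.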
