
Scraping the novel 圣墟 from 6毛小说网 with Python and Scrapy

Python学习交流 • 5 years ago • 464 views

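I had some free time and wanted to read a novel, preferably downloaded to my computer. After a long search I could not find a site that offered a download, so I decided to scrape the text of the novel myself and save it locally.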

圣墟, Chapter 1: 沙漠中的彼岸花 - 辰东 - 6毛小说网 http://www.6mao.com/html/40/40184/12601161.html

This is the page to scrape.

Inspecting the page structure:



The "next chapter" link:
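The three parts of the page that matter are the title, the body text, and the pager links. Here is a rough sketch of how they can be located, using only the standard library. The HTML snippet is a hypothetical, simplified stand-in for the real page: only the `content`, `neirong` and `s_page` names come from the spider's XPath expressions, while the link order and the fake hrefs are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for a chapter page. Only the id/class
# names are taken from the spider; everything else is made up for illustration.
SAMPLE_PAGE = """
<html><body>
  <div id="content">
    <h1>Chapter 1</h1>
    <div id="neirong">First paragraph.<br/><br/>Second paragraph.</div>
    <div class="s_page">
      <a href="/html/0/0/prev.html">Previous</a>
      <a href="/html/0/0/">Contents</a>
      <a href="/html/0/0/next.html">Next</a>
    </div>
  </div>
</body></html>
"""


def extract_parts(page):
    """Return (title, text_nodes, pager_hrefs) for a well-formed page."""
    root = ET.fromstring(page.strip())
    # Chapter title: the <h1> inside <div id="content">
    title = root.find(".//div[@id='content']/h1").text
    # Body: the non-empty text nodes inside <div id="neirong">
    body = root.find(".//div[@id='neirong']")
    text_nodes = list(body.itertext())
    # Pager: the href of every link inside <div class="s_page">
    hrefs = [a.get("href") for a in root.findall(".//div[@class='s_page']/a")]
    return title, text_nodes, hrefs


title, text_nodes, hrefs = extract_parts(SAMPLE_PAGE)
print(title)
print(text_nodes)
print(hrefs[2])  # assumed position of the "next chapter" link
```

Real pages are rarely well-formed XML, so in the project itself the same lookups are done with Scrapy's `response.xpath`.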


Next, create the Scrapy project:


The spider, sixmaospider.py:

# -*- coding: utf-8 -*-
import scrapy

from ..items import SixmaoItem


class SixmaospiderSpider(scrapy.Spider):
    name = 'sixmaospider'
    # allowed_domains takes bare domain names, not URLs
    # allowed_domains = ['www.6mao.com']
    start_urls = ['http://www.6mao.com/html/40/40184/12601161.html']  # 圣墟

    def parse(self, response):
        # Chapter title
        novel_biaoti = response.xpath('//div[@id="content"]/h1/text()').extract_first()
        # Text nodes of the chapter body
        novel_neirong = response.xpath('//div[@id="neirong"]/text()').extract()
        # Only every second text node is kept, as in the original code
        for i in range(0, len(novel_neirong), 2):
            # Create a fresh item each time instead of mutating one shared item
            novelitem = SixmaoItem()
            novelitem['novel_biaoti'] = novel_biaoti
            novelitem['novel_neirong'] = novel_neirong[i]
            yield novelitem

        # Next chapter: collect the pager links; the third one is the next page
        nextPageURL = response.xpath('//div[@class="s_page"]/a/@href').extract()
        if len(nextPageURL) > 2:
            # urljoin resolves the href against the URL of the current page
            url = response.urljoin(nextPageURL[2])
            self.logger.info('Next chapter: %s', url)
            # Request the next page and parse it with parse() again
            yield scrapy.Request(url, callback=self.parse, dont_filter=False)
        else:
            self.logger.info('No next chapter link found, stopping')
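The spider turns the pager href into an absolute URL before requesting it. Below is a minimal sketch of that step with the standard library's `urljoin`, which does the same kind of resolution that `response.urljoin` performs against the page URL. The helper name `next_chapter_url` and the sample hrefs are hypothetical.

```python
from urllib.parse import urljoin


def next_chapter_url(page_url, hrefs, index=2):
    """Return the absolute URL of the pager link at `index`, or None.

    `index=2` mirrors the spider's assumption that the third pager link
    points to the next chapter.
    """
    if len(hrefs) <= index:
        return None  # no such link: nothing more to crawl
    return urljoin(page_url, hrefs[index])


page = 'http://www.6mao.com/html/40/40184/12601161.html'

# A root-relative href and a page-relative href resolve to the same URL.
print(next_chapter_url(page, ['/a.html', '/b/', '/html/40/40184/next.html']))
print(next_chapter_url(page, ['/a.html', '/b/', 'next.html']))
# A pager with fewer than three links yields None instead of an IndexError.
print(next_chapter_url(page, ['/a.html']))
```

Resolving against the page URL avoids hard-coding the site prefix and also works if the site ever switches to page-relative links.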

pipelinesio.py saves the content to a local file:

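import os

print(os.getcwd())  # show the working directory that ./data is relative to


class SixmaoPipeline(object):
    def process_item(self, item, spider):
        # Make sure the output directory exists before appending to the file
        os.makedirs('./data', exist_ok=True)
        # The with block flushes and closes the file when it exits
        with open('./data/圣墟.txt', 'a', encoding='utf-8') as fp:
            fp.write(item['novel_neirong'])
        print('Written to file')
        return item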
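The pipeline above reopens the output file for every item. A possible variant, sketched here under assumptions, opens the file once per crawl using the `open_spider` and `close_spider` hooks that Scrapy calls on item pipelines. The class name `SixmaoFilePipeline`, the `path` parameter and the newline appended after each item are illustrative choices, not part of the original project.

```python
import os


class SixmaoFilePipeline:
    """Hypothetical variant of the pipeline that keeps one file handle open."""

    def __init__(self, path='./data/圣墟.txt'):
        self.path = path
        self.fp = None

    def open_spider(self, spider):
        # Called once when the spider starts: create the folder, open the file
        os.makedirs(os.path.dirname(self.path) or '.', exist_ok=True)
        self.fp = open(self.path, 'a', encoding='utf-8')

    def close_spider(self, spider):
        # Called once when the spider finishes: release the file handle
        if self.fp is not None:
            self.fp.close()
            self.fp = None

    def process_item(self, item, spider):
        # Assumption: one text node per line reads better in the output file
        self.fp.write(item['novel_neirong'] + '\n')
        return item
```

To try it, the class would be registered in `ITEM_PIPELINES` in place of the per-item version.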

items.py

import scrapy


class SixmaoItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    novel_biaoti = scrapy.Field()
    novel_neirong = scrapy.Field()

startsixmao.py: right-click this file in the IDE and run it to start the crawl.

from scrapy.cmdline import execute
execute(['scrapy', 'crawl', 'sixmaospider'])

settings.py

LOG_LEVEL = 'INFO'  # logging level
LOG_FILE = 'novel.log'  # write the log to this file
DOWNLOADER_MIDDLEWARES = {
    'sixmao.middlewares.SixmaoDownloaderMiddleware': 543,
    # Disable the built-in user-agent middleware
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    # Enable the rotating user-agent middleware
    'sixmao.rotate_useragent.RotateUserAgentMiddleware': 400,
}
ITEM_PIPELINES = {
    # 'sixmao.pipelines.SixmaoPipeline': 300,
    'sixmao.pipelinesio.SixmaoPipeline': 300,
}  # register the file-writing pipeline here
SPIDER_MIDDLEWARES = {
    'sixmao.middlewares.SixmaoSpiderMiddleware': 543,
}  # enable the spider middleware; the rest of the file should not need changes
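The numbers and the `None` value in `DOWNLOADER_MIDDLEWARES` can be read as follows: the project's dictionary is merged over Scrapy's built-in defaults, entries set to `None` are dropped, and the rest are ordered by their number. The sketch below imitates that logic in plain Python. It is not Scrapy's actual code; `active_middlewares` is a hypothetical helper, and the built-in order value of 500 is assumed for illustration.

```python
# Assumed stand-in for Scrapy's built-in defaults (only one entry shown).
BASE = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 500,
}

CUSTOM = {
    'sixmao.middlewares.SixmaoDownloaderMiddleware': 543,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'sixmao.rotate_useragent.RotateUserAgentMiddleware': 400,
}


def active_middlewares(base, custom):
    """Merge custom over base, drop None entries, sort by order value."""
    merged = {**base, **custom}
    active = {path: order for path, order in merged.items() if order is not None}
    return sorted(active, key=active.get)


for path in active_middlewares(BASE, CUSTOM):
    print(path)

# If the key does not match the built-in path exactly, nothing is disabled.
WRONG_KEY = {'some.other.path.UserAgentMiddleware': None}
print(active_middlewares(BASE, WRONG_KEY))
```

The second call shows why the path of the built-in middleware has to be spelled exactly: a `None` under a different key leaves the default user-agent middleware active.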

rotate_useragent.py rotates the User-Agent header so that the server is less likely to block the crawler:

# Import the random module
import random
# Import the UserAgentMiddleware class from Scrapy's useragent module
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware


# RotateUserAgentMiddleware inherits from UserAgentMiddleware.
# Purpose: keep a list of user-agent strings and pick one at random for
# every request, so that each request looks like it comes from a browser.
# Anti-scraping: many sites block requests that identify themselves as a
# crawler, so the spider sends a browser-like User-Agent header instead.
class RotateUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        # Pick a random user agent for this request
        ua = random.choice(self.user_agent_list)
        if ua:
            # Print the user agent chosen for this request
            print(ua)
            request.headers.setdefault('User-Agent', ua)

    # The list below contains Chrome user-agent strings; more can be found at
    # http://www.useragentstring.com/pages/useragentstring.php
    # Each entry needs a trailing comma, otherwise Python joins adjacent strings.
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    ]
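The core of the middleware is two calls: `random.choice` to pick a user agent and `setdefault` to attach it. The sketch below shows them with a plain dictionary standing in for `request.headers`. The helper `apply_user_agent` and the two example strings are hypothetical.

```python
import random

# Hypothetical short list; the middleware uses real browser strings instead.
USER_AGENTS = [
    "ExampleBrowser/1.0",
    "ExampleBrowser/2.0",
]


def apply_user_agent(headers, agents):
    """Pick a random user agent and set it unless one is already present."""
    ua = random.choice(agents)
    # setdefault only writes the header when the key is missing
    headers.setdefault('User-Agent', ua)
    return headers


# No header yet: one of the listed user agents is filled in.
print(apply_user_agent({}, USER_AGENTS))

# Header already set: setdefault leaves the existing value untouched.
print(apply_user_agent({'User-Agent': 'fixed'}, USER_AGENTS))

# Adjacent string literals without a comma are joined into a single entry.
print(len(["a" "b", "c"]))
```

Because `setdefault` never overwrites, the rotation only takes effect when nothing else has already set a User-Agent on the request, which is one reason the built-in user-agent middleware is disabled in settings.py.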

The final result:


And there it is: a small but complete Scrapy project.


