
Want to run a Python script after Scrapy finishes [duplicate]

Mute Glider • 4 years ago • 540 views

Is there a way to trigger a method just before a spider class terminates?

I can terminate the spider myself, like this:

from scrapy.exceptions import CloseSpider
from scrapy.spiders import CrawlSpider

class MySpider(CrawlSpider):
    # Config stuff goes here...

    def quit(self):
        # Do some stuff...
        raise CloseSpider('MySpider is quitting now.')

    def my_parser(self, response):
        if termination_condition:
            self.quit()

        # Parsing stuff goes here...

But I can't find any information on how to determine when the spider is going to quit naturally.

Permalink: http://www.python88.com/topic/37922

Replies [ 6 ]  |  latest reply 4 years ago
slavugan • Reply #1 • 7 years ago

If you have many spiders and want to do something before each of them closes, it may be convenient to add a stats collector to your project.

In settings:

STATS_CLASS = 'scraper.stats.MyStatsCollector'

And the collector:

from scrapy.statscollectors import StatsCollector

class MyStatsCollector(StatsCollector):
    def _persist_stats(self, stats, spider):
        # do something here, then keep the default behavior
        super(MyStatsCollector, self)._persist_stats(stats, spider)
saurabh thomasXu • Reply #2 • 4 years ago

For recent versions (v1.7), just define the closed(reason) method.

closed(reason):

Called when the spider closes. This method provides a shortcut to signals.connect() for the spider_closed signal.

Scrapy Doc : scrapy.spiders.Spider.closed

THIS USER NEEDS HELP Chris • Reply #3 • 6 years ago

For me the accepted answer does not work / is outdated, at least for Scrapy 0.19. I got it working with the following, though:

from scrapy import signals
from scrapy.signalmanager import SignalManager
from scrapy.xlib.pydispatch import dispatcher

class MySpider(CrawlSpider):
    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        SignalManager(dispatcher.Any).connect(
            self.closed_handler, signal=signals.spider_closed)

    def closed_handler(self, spider):
        # do stuff here
        pass
Levon • Reply #4 • 7 years ago

For Scrapy version 1.0.0+ (it may also work for older versions):

from scrapy import signals
from scrapy.spiders import CrawlSpider

class MySpider(CrawlSpider):
    name = 'myspider'

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(MySpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_opened, signals.spider_opened)
        crawler.signals.connect(spider.spider_closed, signals.spider_closed)
        return spider

    def spider_opened(self, spider):
        print('Opening {} spider'.format(spider.name))

    def spider_closed(self, spider):
        print('Closing {} spider'.format(spider.name))

A nice use case is adding a tqdm progress bar to a Scrapy spider:

# -*- coding: utf-8 -*-
from scrapy import signals
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from tqdm import tqdm


class MySpider(CrawlSpider):
    name = 'myspider'
    allowed_domains = ['somedomain.comm']
    start_urls = ['http://www.somedomain.comm/ccid.php']

    rules = (
        Rule(LinkExtractor(allow=r'^http://www.somedomain.comm/ccds.php\?id=.*'),
             callback='parse_item',
             ),
        Rule(LinkExtractor(allow=r'^http://www.somedomain.comm/ccid.php$',
                           restrict_xpaths='//table/tr[contains(., "SMTH")]'), follow=True),
    )

    def parse_item(self, response):
        self.pbar.update()  # update progress bar by 1
        item = MyItem()  # MyItem is defined in the project's items.py
        # parse response
        return item

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(MySpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_opened, signals.spider_opened)
        crawler.signals.connect(spider.spider_closed, signals.spider_closed)
        return spider

    def spider_opened(self, spider):
        self.pbar = tqdm()  # initialize progress bar
        self.pbar.clear()
        self.pbar.write('Opening {} spider'.format(spider.name))

    def spider_closed(self, spider):
        self.pbar.clear()
        self.pbar.write('Closing {} spider'.format(spider.name))
        self.pbar.close()  # close progress bar
THIS USER NEEDS HELP Chris • Reply #5 • 8 years ago

Just an update: you can simply define the closed function, like this:

class MySpider(CrawlSpider):
    def closed(self, reason):
        # do something here
        pass
alecxe • Reply #6 • 9 years ago

It looks like you can register a signal listener through dispatcher.

I would try something like:

from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher

class MySpider(CrawlSpider):
    def __init__(self):
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def spider_closed(self, spider):
        # second param is the instance of the spider about to be closed
        pass