
Running a Scrapy spider from a Python script

PyRar • 6 years ago • 1618 views

I'm trying to run Scrapy from a Python script. I think I'm almost there, but something isn't working. In my code I have a line like this: run_spider(quotes5). quotes5 is the name of the spider that I used to run from the command line like this: scrapy crawl quotes5. Any help? The error is: quotes5 is undefined.


Here is my code:

import scrapy
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
import json
import csv
import re
from crochet import setup
from importlib import import_module
from scrapy.utils.project import get_project_settings
setup()


def run_spider(spiderName):
    module_name="WS_Vardata.spiders.{}".format(spiderName)
    scrapy_var = import_module(module_name)   #do some dynamic import of selected spider   
    spiderObj= scrapy_var.QuotesSpider()           #get mySpider-object from spider module
    crawler = CrawlerRunner(get_project_settings())   #from Scrapy docs
    crawler.crawl(spiderObj)  

run_spider(quotes5)
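For reference, the dynamic import-plus-lookup that run_spider attempts can be exercised in isolation, with a stdlib module standing in for the spider module (the names below are only placeholders, not part of the asker's project):

```python
import importlib

def load_class(module_name, class_name):
    """Import module_name dynamically and fetch class_name from it."""
    module = importlib.import_module(module_name)
    return getattr(module, class_name)

# Stdlib stand-in: "collections" plays the role of the spider module,
# "OrderedDict" the role of the QuotesSpider class.
cls = load_class("collections", "OrderedDict")
obj = cls()   # instantiate, as run_spider does with QuotesSpider()
```

The same pattern works for "WS_Vardata.spiders.quotes_spider" and "QuotesSpider" once the script runs from the project root.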

Spider code (quotes_spider.py):

import scrapy
import json
import csv
import re

class QuotesSpider(scrapy.Spider):
    name = "quotes5"

    def start_requests(self):
        with open('input.csv','r') as csvf:
            urlreader = csv.reader(csvf, delimiter=',',quotechar='"')
            for url in urlreader:
                if url[0]=="y":
                    yield scrapy.Request(url[1])
        #with open('so_52069753_out.csv', 'w') as csvfile:
            #fieldnames = ['Category', 'Type', 'Model', 'SK']
            #writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
            #writer.writeheader()

    def parse(self, response):
        regex = re.compile(r'"product"\s*:\s*(.+?\})', re.DOTALL)
        regex1 = re.compile(r'"pathIndicator"\s*:\s*(.+?\})', re.DOTALL)
        source_json1 = response.xpath("//script[contains(., 'var digitalData')]/text()").re_first(regex)
        source_json2 = response.xpath("//script[contains(., 'var digitalData')]/text()").re_first(regex1)
        model_code = response.xpath('//script').re_first('modelCode.*?"(.*)"')

        if source_json1 and source_json2:
            source_json1 = re.sub(r'//[^\n]+', "", source_json1)
            source_json2 = re.sub(r'//[^\n]+', "", source_json2)
            product = json.loads(source_json1)
            path = json.loads(source_json2)
            product_category = product["pvi_type_name"]
            product_type = product["pvi_subtype_name"]
            product_model = path["depth_5"]
            product_name = product["model_name"]

        if source_json1 and source_json2:
            source1 = source_json1[0]
            source2 = source_json2[0]
            with open('output.csv','a',newline='') as csvfile:
                fieldnames = ['Category','Type','Model','Name','SK']
                writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
                if product_category:
                    writer.writerow({'Category': product_category, 'Type': product_type, 'Model': product_model, 'Name': product_name, 'SK': model_code})
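The regex extraction in parse can be checked outside Scrapy. The following sketch runs against a made-up digitalData script fragment (the page structure here is an assumption, not taken from the real site) using only the stdlib:

```python
import json
import re

# Hypothetical inline-script text mimicking a page with "var digitalData":
script = '''
var digitalData = {
  "product" : {"pvi_type_name": "TV", "pvi_subtype_name": "LED", "model_name": "X100"},
  "pathIndicator" : {"depth_5": "X100-EU"}
};
'''

# Same patterns as in the spider: lazily capture up to the first closing brace.
regex = re.compile(r'"product"\s*:\s*(.+?\})', re.DOTALL)
regex1 = re.compile(r'"pathIndicator"\s*:\s*(.+?\})', re.DOTALL)

product = json.loads(regex.search(script).group(1))
path = json.loads(regex1.search(script).group(1))
# product["pvi_type_name"] -> "TV", path["depth_5"] -> "X100-EU"
```

Note the lazy `.+?\}` only works while the captured object has no nested braces; a nested "product" object would need a real JSON slice instead.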


Original post: http://www.python88.com/topic/30671
 
Replies [ 1 ]  |  latest reply 6 years ago
ARR   •   Reply #1   •   6 years ago

As the error says, quotes5 is not defined: you need to define quotes5 before passing it to the method. Or try something like this:

run_spider("quotes5")
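The difference between the bare name and the string literal can be seen in a minimal stand-alone snippet (no Scrapy involved):

```python
def run_spider(spider_name):
    return spider_name

try:
    run_spider(quotes5)        # bare name: Python looks up a variable quotes5
except NameError as e:
    print(e)                   # name 'quotes5' is not defined

result = run_spider("quotes5") # string literal: passed through as data
```

With the string form, run_spider can then build the module path itself, which is exactly what the import_module call in the question expects.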

Edit:

import WS_Vardata.spiders.quotes_spiders as quote_spider_module
def run_spider(spiderName):
    #get the class from within the module
    spiderClass = getattr(quote_spider_module, spiderName)
    #create the object and you're good to go
    spiderObj= spiderClass()
    crawler = CrawlerRunner(get_project_settings())   #from Scrapy docs
    crawler.crawl(spiderObj)  

run_spider("QuotesSpider")

This script should run in the same directory as WS_Vardata.

So in your case:

- TEST
| the_code.py
| WS_Vardata
   | spiders
     | quotes_spider <= containing QuotesSpider class
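As the tree suggests, the dotted module path handed to import_module mirrors the on-disk layout under TEST; a quick sanity check of that mapping:

```python
# The dotted module path maps directly onto the directory layout:
module_path = "WS_Vardata.spiders.quotes_spider"
file_path = module_path.replace(".", "/") + ".py"
print(file_path)   # WS_Vardata/spiders/quotes_spider.py
```

This is why the script must run next to the WS_Vardata directory (or have it on sys.path): otherwise import_module cannot resolve the first component of the path.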