
HedgeHog

Topics recently created by HedgeHog
Recent replies by HedgeHog
4 years ago
Replied to the topic created by HedgeHog » How to get the title using a Python web scraper?

Use .text to get the text from your <h1>, or .get_text() when you need to pass custom arguments, for example to strip whitespace or to add a separator (e.g. title.get_text(strip=True, separator=',') ).

print(title.text)

print(title.get_text())
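
To make the difference between the two calls concrete, here is a minimal sketch. The HTML snippet is a made-up assumption for illustration and is not taken from the page being scraped.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a heading with surrounding whitespace and a nested tag
html = '<h1 class="title">  CHERRY MX <span>SILENT RED</span>  </h1>'
title = BeautifulSoup(html, 'html.parser').h1

# .text concatenates all strings as they are, whitespace included
plain = title.text

# get_text() strips each string and joins the non-empty ones with the separator
joined = title.get_text(strip=True, separator=',')

print(repr(plain))
print(joined)
```

The separator is mostly useful when the element contains nested tags whose texts would otherwise run together.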

Example

import requests
from bs4 import BeautifulSoup

URL = 'https://kbdfans.com/collections/cherry-switches/products/cherry-mx-silent-red'

headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36'}

page = requests.get(URL, headers=headers)

soup = BeautifulSoup(page.content, 'html.parser')

title = soup.find('h1', {'class' : 'product-detail__title small-title'})

print(title.text)

Output

CHERRY MX SILENT RED(10pcs)

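I'm worried that hunting for new tweaks for every new page, instead of just using selenium to get the HTML, would get quite annoying.

In principle, you could make a separate request for each of the respective container-managers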

<script>CNN.covCon.push({id: "coverageContainer_8DDF4E26-8632-6418-1586-B910547ED120",layout: "list-hierarchical-xs",src: "/data/ocs/container/coverageContainer_8DDF4E26-8632-6418-1586-B910547ED120:list-hierarchical-xs/views/containers/common/container-manager.html"});</script>

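and repeat that for every container, so you would not have to deal with selenium. However, you would also have to make such adjustments for other pages, which takes time and is not stable at all.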
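
As a rough sketch of the separate-request idea, the src of a container could be pulled out of such a script tag with a regular expression and turned into an absolute URL. Whether the resulting URL can actually be requested this way is an assumption, so the sketch stops before making any request.

```python
import re
from urllib.parse import urljoin

# The script tag shown above, as it appears in the page source
script = '<script>CNN.covCon.push({id: "coverageContainer_8DDF4E26-8632-6418-1586-B910547ED120",layout: "list-hierarchical-xs",src: "/data/ocs/container/coverageContainer_8DDF4E26-8632-6418-1586-B910547ED120:list-hierarchical-xs/views/containers/common/container-manager.html"});</script>'

# Capture everything between the quotes that follow `src:`
sources = re.findall(r'src:\s*"([^"]+)"', script)

# Resolve each relative path against the page the script came from
container_urls = [urljoin('https://edition.cnn.com/world', src) for src in sources]

print(container_urls)
```

Each URL in container_urls would then need its own request and its own parsing, which is the per-page effort described above.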

Just in case: it does not take much effort to combine selenium with BeautifulSoup:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Raw string so the backslashes in the Windows path are not treated as escapes
service = Service(executable_path=r'C:\Program Files\ChromeDriver\chromedriver.exe')
driver = webdriver.Chrome(service=service)
driver.get('https://edition.cnn.com/world')

soup = BeautifulSoup(driver.page_source,'html.parser' )
len(soup.select('.cd__wrapper'))

Output --> 116

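4 years ago
Replied to the topic created by HedgeHog » How to extract HTML links from a list with Python?

Simply use css selectors to directly access the <a> elements whose href starts with /ShowUserReviews :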

[a['href'] for a in soup.select('a[href^="/ShowUserReviews"]')]

Or concatenate each href with the base URL https://www.tripadvisor.com (no trailing slash, because the href already starts with one):

['https://www.tripadvisor.com'+a['href'] for a in soup.select('a[href^="/ShowUserReviews"]')]
Example
html='''
<div class="fpMxB MC _S b S6 H5 _a" data-test-target="review-title" dir="ltr"><a class="fCitC" dir="" href="/ShowUserReviews-g189158-d229324-r832749959-Sheraton_Lisboa_Hotel_Spa-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>ONE OF THE BEST !</span></span></a></div>
<div class="fpMxB MC _S b S6 H5 _a" data-test-target="review-title" dir="ltr"><a class="fCitC" dir="" href="/ShowUserReviews-g189158-d229324-r832190054-Sheraton_Lisboa_Hotel_Spa-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>excellent stay</span></span></a></div>
<div class="fpMxB MC _S b S6 H5 _a" data-test-target="review-title" dir="ltr"><a class="fCitC" dir="" href="/ShowUserReviews-g189158-d229324-r831182259-Sheraton_Lisboa_Hotel_Spa-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>Wow</span></span></a></div>
<div class="fpMxB MC _S b S6 H5 _a" data-test-target="review-title" dir="ltr"><a class="fCitC" dir="" href="/ShowUserReviews-g189158-d229324-r830900803-Sheraton_Lisboa_Hotel_Spa-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>Really nice hotel</span></span></a></div>
<div class="fpMxB MC _S b S6 H5 _a" data-test-target="review-title" dir="ltr"><a class="fCitC" dir="" href="/ShowUserReviews-g189158-d229324-r829471539-Sheraton_Lisboa_Hotel_Spa-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>Great front desk staff</span></span></a></div>
<div class="fpMxB MC _S b S6 H5 _a" data-test-target="review-title" dir="ltr"><a class="fCitC" dir="" href="/ShowUserReviews-g189158-d12659702-r833957443-Corpo_Santo_Lisbon_Historical_Hotel-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>Excellent walking tour by Victoria!</span></span></a></div>
<div class="fpMxB MC _S b S6 H5 _a" data-test-target="review-title" dir="ltr"><a class="fCitC" dir="" href="/ShowUserReviews-g189158-d12659702-r819463197-Corpo_Santo_Lisbon_Historical_Hotel-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>The best of the best in Lisbon</span></span></a></div>
<div class="fpMxB MC _S b S6 H5 _a" data-test-target="review-title" dir="ltr"><a class="fCitC" dir="" href="/ShowUserReviews-g189158-d12659702-r833862442-Corpo_Santo_Lisbon_Historical_Hotel-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>Great staffs, great hotel and great tours</span></span></a></div>
<div class="fpMxB MC _S b S6 H5 _a" data-test-target="review-title" dir="ltr"><a class="fCitC" dir="" href="/ShowUserReviews-g189158-d12659702-r833861014-Corpo_Santo_Lisbon_Historical_Hotel-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>Wonderful Experience and Best Hotel in Lisbon</span></span></a></div>
<div class="fpMxB MC _S b S6 H5 _a" data-test-target="review-title" dir="ltr"><a class="fCitC" dir="" href="/ShowUserReviews-g189158-d12659702-r833717753-Corpo_Santo_Lisbon_Historical_Hotel-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>Great hotel.  Great Staff.  Wonderful walking tour with David.</span></span></a></div>

'''

from bs4 import BeautifulSoup
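# Name the parser explicitly to avoid the "no parser was explicitly specified" warning
soup = BeautifulSoup(html, 'html.parser')
urls = [a['href'] for a in soup.select('a[href^="/ShowUserReviews"]')]
Output of urls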
['/ShowUserReviews-g189158-d229324-r832749959-Sheraton_Lisboa_Hotel_Spa-Lisbon_Lisbon_District_Central_Portugal.html',
 '/ShowUserReviews-g189158-d229324-r832190054-Sheraton_Lisboa_Hotel_Spa-Lisbon_Lisbon_District_Central_Portugal.html',
 '/ShowUserReviews-g189158-d229324-r831182259-Sheraton_Lisboa_Hotel_Spa-Lisbon_Lisbon_District_Central_Portugal.html',
 '/ShowUserReviews-g189158-d229324-r830900803-Sheraton_Lisboa_Hotel_Spa-Lisbon_Lisbon_District_Central_Portugal.html',
 '/ShowUserReviews-g189158-d229324-r829471539-Sheraton_Lisboa_Hotel_Spa-Lisbon_Lisbon_District_Central_Portugal.html',
 '/ShowUserReviews-g189158-d12659702-r833957443-Corpo_Santo_Lisbon_Historical_Hotel-Lisbon_Lisbon_District_Central_Portugal.html',
 '/ShowUserReviews-g189158-d12659702-r819463197-Corpo_Santo_Lisbon_Historical_Hotel-Lisbon_Lisbon_District_Central_Portugal.html',
 '/ShowUserReviews-g189158-d12659702-r833862442-Corpo_Santo_Lisbon_Historical_Hotel-Lisbon_Lisbon_District_Central_Portugal.html',
 '/ShowUserReviews-g189158-d12659702-r833861014-Corpo_Santo_Lisbon_Historical_Hotel-Lisbon_Lisbon_District_Central_Portugal.html',
 '/ShowUserReviews-g189158-d12659702-r833717753-Corpo_Santo_Lisbon_Historical_Hotel-Lisbon_Lisbon_District_Central_Portugal.html']
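
If absolute links are needed, urljoin from the standard library is one way to combine the base URL with a relative href without having to worry about doubled or missing slashes. A minimal sketch, using two of the hrefs from the output above:

```python
from urllib.parse import urljoin

base_url = 'https://www.tripadvisor.com/'

# Two relative hrefs copied from the output above
hrefs = [
    '/ShowUserReviews-g189158-d229324-r832749959-Sheraton_Lisboa_Hotel_Spa-Lisbon_Lisbon_District_Central_Portugal.html',
    '/ShowUserReviews-g189158-d12659702-r833957443-Corpo_Santo_Lisbon_Historical_Hotel-Lisbon_Lisbon_District_Central_Portugal.html',
]

# urljoin resolves the leading slash of each href against the host of base_url
absolute = [urljoin(base_url, href) for href in hrefs]

print(absolute[0])
```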
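4 years ago
Replied to the topic created by HedgeHog » Attribute error in web scraping in Python

Note: First of all, always take a look at your soup, because that is where the truth lies. The content can differ slightly or even drastically from the view in the dev tools.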

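What happens?

You should keep the following separate issues in mind:

  • base_url='https://books.toscrape.com/catalogue/page-1.html' leads to 404 errors and is the first cause of your "'NoneType' object has no attribute 'text'"

  • You try to find the category with cat=book_soup.find('"../category/books/poetry_23/index.html">Poetry').text.strip() which does not work and leads to the same error

  • Some of the other selections will not give the expected result either; take a look at my example, where I edited them to give you a clue on how to reach your goal.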

How to fix?

  1. Change base_url='https://books.toscrape.com/catalogue/page-1.html' to base_url='https://books.toscrape.com/catalogue/'

  2. Select the category more specifically; it is the last <a> in the breadcrumb:

    cat=book_soup.select('.breadcrumb a')[-1].text.strip()
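
To illustrate both the error and the fix, here is a small sketch. The breadcrumb markup below is a simplified assumption of what a book page looks like, not a copy of the real page.

```python
from bs4 import BeautifulSoup

# Simplified, assumed breadcrumb markup of a book detail page
html = '''
<ul class="breadcrumb">
  <li><a href="../../index.html">Home</a></li>
  <li><a href="../category/books_1/index.html">Books</a></li>
  <li><a href="../category/books/poetry_23/index.html">Poetry</a></li>
  <li class="active">A Light in the Attic</li>
</ul>
'''
book_soup = BeautifulSoup(html, 'html.parser')

# find() treats its first argument as a tag name, so this matches nothing
missing = book_soup.find('"../category/books/poetry_23/index.html">Poetry')
print(missing)  # calling .text on this result is what raises the AttributeError

# The last <a> inside the breadcrumb is the category
cat = book_soup.select('.breadcrumb a')[-1].text.strip()
print(cat)
```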
    

Example

import requests
from bs4 import BeautifulSoup
import pandas as pd


all_books=[]

url='https://books.toscrape.com/catalogue/page-1.html'
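# headers must be a dict and must be passed by keyword; a bare second argument is sent as query params.
# The user agent value here is only a generic placeholder.
headers={'User-Agent':'Mozilla/5.0'}
def get_page(url):
    page=requests.get(url,headers=headers)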
    status=page.status_code
    soup=BeautifulSoup(page.text,'html.parser')
    return [soup,status]

#get all books links
def get_links(soup):
    links=[]
    listings=soup.find_all(class_='product_pod')
    for listing in listings:
        bk_link=listing.find("h3").a.get("href")
        base_url='https://books.toscrape.com/catalogue/'
        cmplt_link=base_url+bk_link
        links.append(cmplt_link)
    return links
    
#extract info from each link
def extract_info(links):
    for link in links:
        r=requests.get(link).text
        book_soup=BeautifulSoup(r,'html.parser')
        name= name.text.strip() if (name := book_soup.h1) else None
        price= price.text.strip() if (price := book_soup.select_one('h1 + p')) else None
        desc= desc.text.strip() if (desc := book_soup.select_one('#product_description + p')) else None
        # guard the list itself, indexing an empty result with [-1] would raise IndexError
        cat= cats[-1].text.strip() if (cats := book_soup.select('.breadcrumb a')) else None
        book={'name':name,'price':price,'desc':desc,'cat':cat}
        all_books.append(book)

pg=48
while True:
    url=f'https://books.toscrape.com/catalogue/page-{pg}.html'
    soup_status=get_page(url)
    if soup_status[1]==200:
        print(f"scraping page {pg}")
        extract_info(get_links(soup_status[0]))
        pg+=1
    else:
        print("The End")
        break

all_books
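
The loop above stops at the first page that does not answer with status 200. That stop condition can be tried out without any network access by swapping the request for a stand-in function. In this sketch, scrape_until_missing and fake_status are hypothetical names, and the page count of 50 is an assumption made for the demonstration.

```python
def scrape_until_missing(fetch_status, start=1):
    """Return the page numbers visited before the first non-200 status."""
    visited = []
    pg = start
    while fetch_status(pg) == 200:
        visited.append(pg)
        pg += 1
    return visited


def fake_status(pg):
    # Stand-in for requests.get(url).status_code, assuming 50 existing pages
    return 200 if pg <= 50 else 404


pages = scrape_until_missing(fake_status, start=48)
print(pages)
```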
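4 years ago
Replied to the topic created by HedgeHog » Get href URLs using BeautifulSoup in Python

Select your elements more specifically, e.g. with css selectors, and be aware that you have to concatenate each href with the baseUrl :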

['https://emi.ea.govt.nz'+a['href'] for a in soup.select('td.csv a')]

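Or simply change your code and use find() instead of findAll() to locate the table. Using findAll() there is what causes the following attribute error:

AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?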

market_dataset = soup.find("table",{"class":"table table-striped table-condensed table-clean"})

Note: In new code, strictly use find_all() instead of the old syntax findAll() , and do not mix the two.
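
The error can be reproduced in isolation. The table below is a cut-down assumption of the real page's markup, with two of the hrefs taken from the output of this answer.

```python
from bs4 import BeautifulSoup

# Assumed, simplified version of the table on the page
html = '''
<table class="table table-striped table-condensed table-clean">
  <tr><td class="csv"><a href="/Wholesale/Datasets/FinalPricing/EnergyPrices/20220318_FinalEnergyPrices_I.csv">csv</a></td></tr>
  <tr><td class="csv"><a href="/Wholesale/Datasets/FinalPricing/EnergyPrices/20220317_FinalEnergyPrices_I.csv">csv</a></td></tr>
</table>
'''
soup = BeautifulSoup(html, 'html.parser')
attrs = {'class': 'table table-striped table-condensed table-clean'}

# find_all() returns a ResultSet (a list), which has no find_all() of its own
tables = soup.find_all('table', attrs)
try:
    tables.find_all('a')
    raised = False
except AttributeError:
    raised = True

# find() returns a single Tag, which can be searched further
table = soup.find('table', attrs)
links = ['https://emi.ea.govt.nz' + a['href'] for a in table.find_all('a')]

print(raised)
print(links)
```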

Example

from bs4 import BeautifulSoup
import requests

url = 'https://emi.ea.govt.nz/Wholesale/Datasets/FinalPricing/EnergyPrices'

r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

['https://emi.ea.govt.nz'+a['href'] for a in soup.select('td.csv a')]

Output

['https://emi.ea.govt.nz/Wholesale/Datasets/FinalPricing/EnergyPrices/20220318_FinalEnergyPrices_I.csv',
 'https://emi.ea.govt.nz/Wholesale/Datasets/FinalPricing/EnergyPrices/20220317_FinalEnergyPrices_I.csv',
 'https://emi.ea.govt.nz/Wholesale/Datasets/FinalPricing/EnergyPrices/20220316_FinalEnergyPrices.csv',
 'https://emi.ea.govt.nz/Wholesale/Datasets/FinalPricing/EnergyPrices/20220315_FinalEnergyPrices.csv',
 'https://emi.ea.govt.nz/Wholesale/Datasets/FinalPricing/EnergyPrices/20220314_FinalEnergyPrices.csv',
 'https://emi.ea.govt.nz/Wholesale/Datasets/FinalPricing/EnergyPrices/20220313_FinalEnergyPrices.csv',
 'https://emi.ea.govt.nz/Wholesale/Datasets/FinalPricing/EnergyPrices/20220312_FinalEnergyPrices.csv',...]