Note
First of all, always look at your soup — that is your ground truth. Its content can differ slightly, or even drastically, from what you see in the browser's dev tools.
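To check what requests actually received (it can be missing anything rendered by JavaScript), print the soup itself. A minimal offline sketch — the HTML string here is my own stand-in for `page.text`, not fetched from the site:

```python
from bs4 import BeautifulSoup

# stand-in for page.text; a real response may lack JS-rendered content
html = "<html><body><h1>A Light in the Attic</h1><p class='price_color'>£51.77</p></body></html>"
soup = BeautifulSoup(html, 'html.parser')

# print the parsed document to see exactly what the parser got
print(soup.prettify())
print(soup.h1.text)  # -> A Light in the Attic
```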
What goes wrong?
Keep the following separate problems in mind:
- base_url='https://books.toscrape.com/catalogue/page-1.html' leads to a 404 error — isn't that the first thing giving you "'NoneType' object has no attribute 'text'"?
- you try to find the category like this: cat=book_soup.find('"../category/books/poetry_23/index.html">Poetry').text.strip() — that doesn't work and raises the same error
- a few other selections don't return what you expect; look at my examples, edit them, and they should give you a clue how to reach your goal
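The "'NoneType' object has no attribute 'text'" error is easy to reproduce and to guard against — find() returns None when nothing matches, so chaining .text on it blows up. A small offline sketch on a static snippet:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul class='breadcrumb'><li><a>Home</a></li></ul>", 'html.parser')

# find() returns None when there is no match, so .text would raise AttributeError
missing = soup.find('h1')
print(missing)  # -> None

# guard before touching .text instead of chaining blindly
name = missing.text.strip() if missing else None
print(name)  # -> None
```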
How to fix it?
- change base_url='https://books.toscrape.com/catalogue/page-1.html' to base_url='https://books.toscrape.com/catalogue/'
- select the category more specifically — it is the last <a> in the breadcrumb: cat=book_soup.select('.breadcrumb a')[-1].text.strip()
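Here is that breadcrumb selector in isolation, run against a static snippet shaped like the site's markup (the HTML string is my own stand-in, not a live response):

```python
from bs4 import BeautifulSoup

# stand-in markup mirroring the breadcrumb structure on books.toscrape.com
html = """
<ul class="breadcrumb">
  <li><a href="../../index.html">Home</a></li>
  <li><a href="../category/books_1/index.html">Books</a></li>
  <li><a href="../category/books/poetry_23/index.html">Poetry</a></li>
  <li class="active">A Light in the Attic</li>
</ul>
"""
book_soup = BeautifulSoup(html, 'html.parser')

# the category is the last <a> inside the breadcrumb; the active <li> has no <a>
cat = book_soup.select('.breadcrumb a')[-1].text.strip()
print(cat)  # -> Poetry
```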
Example
import requests
from bs4 import BeautifulSoup
import pandas as pd
all_books=[]
url='https://books.toscrape.com/catalogue/page-1.html'
# headers must be a dict with a real browser User-Agent string; the value
# below is just a generic placeholder
headers={'User-Agent':'Mozilla/5.0'}
def get_page(url):
    page=requests.get(url,headers=headers)  # pass headers as a keyword argument
    status=page.status_code
    soup=BeautifulSoup(page.text,'html.parser')
    return [soup,status]
#get all book links on a listing page
def get_links(soup):
    links=[]
    listings=soup.find_all(class_='product_pod')
    for listing in listings:
        bk_link=listing.find("h3").a.get("href")
        base_url='https://books.toscrape.com/catalogue/'  # not .../page-1.html
        cmplt_link=base_url+bk_link
        links.append(cmplt_link)
    return links
#extract info from each link
def extract_info(links):
    for link in links:
        r=requests.get(link).text
        book_soup=BeautifulSoup(r,'html.parser')
        # walrus guards avoid "'NoneType' object has no attribute 'text'"
        name= name.text.strip() if (name := book_soup.h1) else None
        price= price.text.strip() if (price := book_soup.select_one('h1 + p')) else None
        desc= desc.text.strip() if (desc := book_soup.select_one('#product_description + p')) else None
        # select() returns a list, so test the list before taking [-1]
        cat= cats[-1].text.strip() if (cats := book_soup.select('.breadcrumb a')) else None
        book={'name':name,'price':price,'desc':desc,'cat':cat}
        all_books.append(book)
pg=48  # start near the last page for a quick test; use pg=1 for the full catalogue
while True:
    url=f'https://books.toscrape.com/catalogue/page-{pg}.html'
    soup_status=get_page(url)
    if soup_status[1]==200:
        print(f"scraping page {pg}")
        extract_info(get_links(soup_status[0]))
        pg+=1
    else:
        print("The End")
        break
pd.DataFrame(all_books)
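If you prefer not to hand-build absolute URLs at all, urllib.parse.urljoin resolves a relative href against the page it came from, so the stray page-1.html segment gets replaced instead of concatenated. A sketch — the bk_link value is a hypothetical href of the kind the listing pages contain:

```python
from urllib.parse import urljoin

# the page the link was scraped from, and a hypothetical relative href
page_url = 'https://books.toscrape.com/catalogue/page-1.html'
bk_link = 'a-light-in-the-attic_1000/index.html'

# urljoin replaces the last path segment of page_url with the relative link
print(urljoin(page_url, bk_link))
# -> https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
```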