
Getting href URLs with BeautifulSoup in Python

user86907 • 3 years ago • 1398 views

I am trying to download all the CSV files from the following URL: https://emi.ea.govt.nz/Wholesale/Datasets/FinalPricing/EnergyPrices, but unfortunately I have not succeeded. Here is my attempt:

soup = BeautifulSoup(page.content, "html.parser")
market_dataset = soup.findAll("table",{"class":"table table-striped table-condensed table-clean"})
for a in market_dataset.find_all('a', href=True):
    print("Found the URL:", a['href'])

Can anyone help me? How can I get the URLs of all the CSV files?

Article URL: http://www.python88.com/topic/129416
 
Reply 1 • HedgeHog • 3 years ago

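Select your elements more specifically, for example with CSS selectors, and be aware that each href is a relative path, so you have to prepend the base URL: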

['https://emi.ea.govt.nz'+a['href'] for a in soup.select('td.csv a')]
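
As a hedged variation on the one-liner above, the sketch below swaps the string concatenation for urllib.parse.urljoin, which resolves relative and absolute hrefs alike. SAMPLE_HTML is hypothetical markup written only to match the 'td.csv a' selector; the real page may be structured differently, and csv_links is an illustrative helper name, not part of any library.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical markup (an assumption): cells with class "csv" holding relative links.
SAMPLE_HTML = """
<table class="table table-striped table-condensed table-clean">
  <tr><td class="csv"><a href="/Wholesale/Datasets/FinalPricing/EnergyPrices/20220318_FinalEnergyPrices_I.csv">CSV</a></td></tr>
  <tr><td class="csv"><a href="/Wholesale/Datasets/FinalPricing/EnergyPrices/20220317_FinalEnergyPrices_I.csv">CSV</a></td></tr>
  <tr><td class="other"><a href="/Wholesale/Datasets">Not a CSV cell</a></td></tr>
</table>
"""

PAGE_URL = "https://emi.ea.govt.nz/Wholesale/Datasets/FinalPricing/EnergyPrices"


def csv_links(html, page_url):
    """Return absolute URLs for every link inside a <td class="csv"> cell."""
    soup = BeautifulSoup(html, "html.parser")
    # a[href] skips anchors without an href attribute.
    return [urljoin(page_url, a["href"]) for a in soup.select("td.csv a[href]")]


links = csv_links(SAMPLE_HTML, PAGE_URL)
for link in links:
    print(link)
```

With the live page, the same helper would be called as csv_links(r.text, url) after fetching the page with requests.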

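Alternatively, keep your code and use find() instead of findAll() to locate the table. findAll() returns a ResultSet (a list of elements), so calling find_all() on it raises the following AttributeError: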

AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?

market_dataset = soup.find("table",{"class":"table table-striped table-condensed table-clean"})

Note: In new code, use find_all() consistently rather than the old findAll() syntax or a mix of both.
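
The difference between find() and find_all() can be reproduced offline. The following is a minimal sketch that assumes a small, hypothetical HTML snippet (SAMPLE_HTML) in place of the real page; it shows that the ResultSet returned by find_all() cannot be searched directly, while the single Tag returned by find() can.

```python
from bs4 import BeautifulSoup

# Hypothetical markup (an assumption), reusing the class string from the question.
SAMPLE_HTML = """
<table class="table table-striped table-condensed table-clean">
  <tr><td><a href="/a.csv">a</a></td></tr>
  <tr><td><a href="/b.csv">b</a></td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table_class = "table table-striped table-condensed table-clean"

# find_all() returns a ResultSet (a list of Tags), even when only one table matches.
tables = soup.find_all("table", {"class": table_class})
try:
    tables.find_all("a", href=True)
    raised = False
except AttributeError as exc:
    raised = True
    print("AttributeError:", exc)

# find() returns the first matching Tag (or None), which can be searched further.
table = soup.find("table", {"class": table_class})
hrefs = [a["href"] for a in table.find_all("a", href=True)] if table else []
print(hrefs)
```

If several tables need to be processed, looping over the ResultSet and calling find_all() on each Tag is another option.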

Example

from bs4 import BeautifulSoup
import requests

url = 'https://emi.ea.govt.nz/Wholesale/Datasets/FinalPricing/EnergyPrices'

r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

['https://emi.ea.govt.nz'+a['href'] for a in soup.select('td.csv a')]

Output

['https://emi.ea.govt.nz/Wholesale/Datasets/FinalPricing/EnergyPrices/20220318_FinalEnergyPrices_I.csv',
 'https://emi.ea.govt.nz/Wholesale/Datasets/FinalPricing/EnergyPrices/20220317_FinalEnergyPrices_I.csv',
 'https://emi.ea.govt.nz/Wholesale/Datasets/FinalPricing/EnergyPrices/20220316_FinalEnergyPrices.csv',
 'https://emi.ea.govt.nz/Wholesale/Datasets/FinalPricing/EnergyPrices/20220315_FinalEnergyPrices.csv',
 'https://emi.ea.govt.nz/Wholesale/Datasets/FinalPricing/EnergyPrices/20220314_FinalEnergyPrices.csv',
 'https://emi.ea.govt.nz/Wholesale/Datasets/FinalPricing/EnergyPrices/20220313_FinalEnergyPrices.csv',
 'https://emi.ea.govt.nz/Wholesale/Datasets/FinalPricing/EnergyPrices/20220312_FinalEnergyPrices.csv',...]
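
Since the question asks about downloading the files, one possible follow-up step is sketched below. It is not part of the original answer: filename_from_url and download_all are hypothetical helper names, the 30-second timeout is an arbitrary choice, and the target directory is whatever the caller passes in.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import urlsplit

import requests


def filename_from_url(url):
    """Return the last path segment of a URL, used here as the local file name."""
    return PurePosixPath(urlsplit(url).path).name


def download_all(urls, dest_dir, session=None):
    """Download each URL into dest_dir and return the list of written paths."""
    if session is None:
        session = requests.Session()  # reuse one connection for all files
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    written = []
    for url in urls:
        response = session.get(url, timeout=30)
        response.raise_for_status()  # fail on HTTP errors instead of saving an error page
        target = dest / filename_from_url(url)
        target.write_bytes(response.content)
        written.append(target)
    return written
```

Passing the list of URLs produced by the list comprehension, for example download_all(urls, "energy_prices"), would write one file per URL into that directory.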