
Python: Web Scraping Walmart's Category Names

Wicaledon  •  3 years ago  •  1625 views

I want to get the department names from this Walmart link. As you can see, the Departments panel on the left initially shows 7 departments (Chocolate Cookies, Cookies, Butter Cookies, etc.). When I click See All Departments, 9 more categories are added, so the total becomes 16. I am trying to collect all 16 departments automatically. I wrote this code:

from selenium import webdriver

n_links = []

driver = webdriver.Chrome(executable_path='D:/Desktop/demo/chromedriver.exe')
url = "https://www.walmart.com/browse/snacks-cookies-chips/cookies/976759_976787_1001391" 
driver.get(url)

search = driver.find_element_by_xpath("//*[@id='Departments']/div/div/ul").text
driver.find_element_by_xpath("//*[@id='Departments']/div/div/button/span").click()
search2 = driver.find_element_by_xpath("//*[@id='Departments']/div/div/div/div").text

sep = search.split('\n')
sep2 = search2.split('\n')

lngth = len(sep)
lngth2 = len(sep2)

for i in range (1,lngth):
    path = "//*[@id='Departments']/div/div/ul/li"+"["+ str(i) + "]/a"
    nav_links = driver.find_element_by_xpath(path).get_attribute('href')
    n_links.append(nav_links)
    
for i in range (1,lngth2):
    path = "//*[@id='Departments']/div/div/div/div/ul/li"+"["+ str(i) + "]/a"
    nav_links2 = driver.find_element_by_xpath(path).get_attribute('href')
    n_links.append(nav_links2)   
    
print(n_links)
print(len(n_links))

Finally, when I run the code I can see the links inside the n_links list, but the problem is that sometimes there are 13 links and sometimes 14. There should be 16, and I have never gotten 16, only 13 or 14. I tried adding time.sleep(3) before the search2 line, but it doesn't help. Can you help me?

Original post: http://www.python88.com/topic/132829

Replies [ 4 ]  |  latest reply 3 years ago
frianH  •  Reply #1  •  4 years ago

Why not use visibility_of_all_elements_located?

texts = []
links =[]

driver.get('https://www.walmart.com/browse/snacks-cookies-chips/cookies/976759_976787_1001391')
wait = WebDriverWait(driver, 60)
wait.until(EC.element_to_be_clickable((By.XPATH, "//span[text()='See all Departments']/parent::button"))).click()
elements = wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "li.department-single-level a")))
for element in elements:
    #to get text
    texts.append(element.text)
    #to get link by attribute name
    links.append(element.get_attribute('href'))
    
print(texts)
print(links)

Console output:

[u'Chocolate Cookies', u'Cookies', u'Butter Cookies', u'Shortbread Cookies', u'Coconut Cookies', u'Healthy Cookies', u'Keebler Cookies', u'Biscotti', u'Gluten-Free Cookies', u'Molasses Cookies', u'Peanut Butter Cookies', u'Pepperidge Farm Cookies', u'Snickerdoodle Cookies', u'Sugar-Free Cookies', u"Tate's Cookies", u'Vegan Cookies']
[u'https://www.walmart.com/browse/food/chocolate-cookies/976759_976787_1001391_4007138', u'https://www.walmart.com/browse/food/cookies/976759_976787_1001391_8331066', u'https://www.walmart.com/browse/food/butter-cookies/976759_976787_1001391_7803640', u'https://www.walmart.com/browse/food/shortbread-cookies/976759_976787_1001391_8026949', u'https://www.walmart.com/browse/food/coconut-cookies/976759_976787_1001391_6970757', u'https://www.walmart.com/browse/food/healthy-cookies/976759_976787_1001391_7466302', u'https://www.walmart.com/browse/food/keebler-cookies/976759_976787_1001391_3596825', u'https://www.walmart.com/browse/food/biscotti/976759_976787_1001391_2224095', u'https://www.walmart.com/browse/food/gluten-free-cookies/976759_976787_1001391_4362193', u'https://www.walmart.com/browse/food/molasses-cookies/976759_976787_1001391_3338971', u'https://www.walmart.com/browse/food/peanut-butter-cookies/976759_976787_1001391_6460174', u'https://www.walmart.com/browse/food/pepperidge-farm-cookies/976759_976787_1001391_2410932', u'https://www.walmart.com/browse/food/snickerdoodle-cookies/976759_976787_1001391_8926167', u'https://www.walmart.com/browse/food/sugar-free-cookies/976759_976787_1001391_5314659', u'https://www.walmart.com/browse/food/tate-s-cookies/976759_976787_1001391_9480535', u'https://www.walmart.com/browse/food/vegan-cookies/976759_976787_1001391_8007359']

The following imports are required:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
Andrej Kesely  •  Reply #2  •  4 years ago

Using only beautifulsoup:

import json
import requests
from bs4 import BeautifulSoup

url = "https://www.walmart.com/browse/snacks-cookies-chips/cookies/976759_976787_1001391"

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:88.0) Gecko/20100101 Firefox/88.0",
    "Accept-Language": "en-US,en;q=0.5",
}

soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")
data = json.loads(soup.select_one("#searchContent").contents[0])

# uncomment to see all data:
# print(json.dumps(data, indent=4))


def find_departments(data):
    if isinstance(data, dict):
        if "name" in data and data["name"] == "Departments":
            yield data
        else:
            for v in data.values():
                yield from find_departments(v)
    elif isinstance(data, list):
        for v in data:
            yield from find_departments(v)


departments = next(find_departments(data), {})

for d in departments.get("values", []):
    print(
        "{:<30} {}".format(
            d["name"], "https://www.walmart.com" + d["baseSeoURL"]
        )
    )

Prints:

Chocolate Cookies              https://www.walmart.com/browse/food/chocolate-cookies/976759_976787_1001391_4007138
Cookies                        https://www.walmart.com/browse/food/cookies/976759_976787_1001391_8331066
Butter Cookies                 https://www.walmart.com/browse/food/butter-cookies/976759_976787_1001391_7803640
Shortbread Cookies             https://www.walmart.com/browse/food/shortbread-cookies/976759_976787_1001391_8026949
Coconut Cookies                https://www.walmart.com/browse/food/coconut-cookies/976759_976787_1001391_6970757
Healthy Cookies                https://www.walmart.com/browse/food/healthy-cookies/976759_976787_1001391_7466302
Keebler Cookies                https://www.walmart.com/browse/food/keebler-cookies/976759_976787_1001391_3596825
Biscotti                       https://www.walmart.com/browse/food/biscotti/976759_976787_1001391_2224095
Gluten-Free Cookies            https://www.walmart.com/browse/food/gluten-free-cookies/976759_976787_1001391_4362193
Molasses Cookies               https://www.walmart.com/browse/food/molasses-cookies/976759_976787_1001391_3338971
Peanut Butter Cookies          https://www.walmart.com/browse/food/peanut-butter-cookies/976759_976787_1001391_6460174
Pepperidge Farm Cookies        https://www.walmart.com/browse/food/pepperidge-farm-cookies/976759_976787_1001391_2410932
Snickerdoodle Cookies          https://www.walmart.com/browse/food/snickerdoodle-cookies/976759_976787_1001391_8926167
Sugar-Free Cookies             https://www.walmart.com/browse/food/sugar-free-cookies/976759_976787_1001391_5314659
Tate's Cookies                 https://www.walmart.com/browse/food/tate-s-cookies/976759_976787_1001391_9480535
Vegan Cookies                  https://www.walmart.com/browse/food/vegan-cookies/976759_976787_1001391_8007359
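As a side note, the recursive find_departments generator above does not depend on the page at all, so its traversal logic can be checked offline. A minimal self-contained sketch, using a made-up nested dict standing in for Walmart's #searchContent JSON (the keys and nesting here are hypothetical, only the shape matters):

```python
def find_departments(data):
    # Walk dicts and lists recursively, yielding any dict named "Departments".
    if isinstance(data, dict):
        if data.get("name") == "Departments":
            yield data
        else:
            for v in data.values():
                yield from find_departments(v)
    elif isinstance(data, list):
        for v in data:
            yield from find_departments(v)


# Hypothetical payload mimicking the nested shape of the real response.
sample = {
    "facets": [
        {"name": "Brand", "values": []},
        {
            "nested": {
                "name": "Departments",
                "values": [{"name": "Cookies", "baseSeoURL": "/browse/food/cookies/1"}],
            }
        },
    ]
}

departments = next(find_departments(sample), {})
print([d["name"] for d in departments.get("values", [])])  # → ['Cookies']
```

Because the generator only checks the "name" key, it finds the Departments facet regardless of how deeply Walmart nests it, which is why the answer does not need to hard-code a JSON path.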
MendelG  •  Reply #3  •  4 years ago

To get all 16 of them, you can try searching with the CSS selector: .collapsible-content > ul a, .sometimes-shown a

In your example:

from selenium import webdriver

driver = webdriver.Chrome()
url = (
    "https://www.walmart.com/browse/snacks-cookies-chips/cookies/976759_976787_1001391"
)
driver.get(url)

search = driver.find_element_by_xpath("//*[@id='Departments']/div/div/ul").text
driver.find_element_by_xpath("//*[@id='Departments']/div/div/button/span").click()

all_departments = [
    link.get_attribute("href")
    for link in driver.find_elements_by_css_selector(
        ".collapsible-content > ul a, .sometimes-shown a"
    )
]

print(len(all_departments))
print(all_departments)

Output:

16
['https://www.walmart.com/browse/food/chocolate-cookies/976759_976787_1001391_4007138', 'https://www.walmart.com/browse/food/cookies/976759_976787_1001391_8331066', 'https://www.walmart.com/browse/food/butter-cookies/976759_976787_1001391_7803640', 'https://www.walmart.com/browse/food/shortbread-cookies/976759_976787_1001391_8026949', 'https://www.walmart.com/browse/food/coconut-cookies/976759_976787_1001391_6970757', 'https://www.walmart.com/browse/food/healthy-cookies/976759_976787_1001391_7466302', 'https://www.walmart.com/browse/food/keebler-cookies/976759_976787_1001391_3596825', 'https://www.walmart.com/browse/food/biscotti/976759_976787_1001391_2224095', 'https://www.walmart.com/browse/food/gluten-free-cookies/976759_976787_1001391_4362193', 'https://www.walmart.com/browse/food/molasses-cookies/976759_976787_1001391_3338971', 'https://www.walmart.com/browse/food/peanut-butter-cookies/976759_976787_1001391_6460174', 'https://www.walmart.com/browse/food/pepperidge-farm-cookies/976759_976787_1001391_2410932', 'https://www.walmart.com/browse/food/snickerdoodle-cookies/976759_976787_1001391_8926167', 'https://www.walmart.com/browse/food/sugar-free-cookies/976759_976787_1001391_5314659', 'https://www.walmart.com/browse/food/tate-s-cookies/976759_976787_1001391_9480535', 'https://www.walmart.com/browse/food/vegan-cookies/976759_976787_1001391_8007359']
JD308  •  Reply #4  •  4 years ago

I think you are making this more complicated than it needs to be. You are right that after clicking the button you may need to wait before fetching the departments.

# Get all the departments currently shown
departments = driver.find_elements_by_xpath("//li[contains(@class,'department')]")

# Click on the show-all-departments button
driver.find_element_by_xpath("//button[@data-automation-id='button']//span[contains(text(),'all Departments')]").click()

# Re-fetch the departments now that the full list is shown
departments = driver.find_elements_by_xpath("//li[contains(@class,'department')]")

# Iterate through the departments and print their text
for d in departments:
    print(d.text)