
Learn Python web scraping: a real treat for anime otaku

python • 4 years ago • 582 views



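Target site: https://divnil.com

1. Collect the link to each image's detail page from the listing page

These links are fairly easy to find: press F12 to inspect the element, or right-click and view the page source. In Chrome or Firefox on a phone, put "view-source:" in front of the URL.

For example: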

view-source:https://www.baidu.com/


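2. Get the full-size image URL from the detail page

Open any image's detail page and press F12 to inspect elements. To locate the image's link, first click the icon in the top-left corner of the developer tools (the one that looks like a mouse pointer), then click the image on the page. The tools jump straight to the matching markup.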
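To make the target of that inspection concrete, here is a minimal sketch of pulling the image attributes out of a detail page. The HTML fragment is made up, inferred only from the selectors used in the article's code (a div with id "contents" holding an img with id "main_content" and an "original" attribute); the real markup may differ. It uses the standard library's html.parser rather than BeautifulSoup so it runs without extra packages, and the names MainImageFinder and find_main_image are hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical detail-page fragment; the real markup on the site may differ.
SAMPLE_HTML = """
<div id="contents">
  <img id="main_content" alt="sample wallpaper"
       src="placeholder.gif" original="../img/sample.jpg">
</div>
"""


class MainImageFinder(HTMLParser):
    """Remember the attributes of the first <img id="main_content"> tag."""

    def __init__(self):
        super().__init__()
        self.attrs = None

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == "img" and self.attrs is None:
            attr_map = dict(attrs)
            if attr_map.get("id") == "main_content":
                self.attrs = attr_map


def find_main_image(html):
    """Return the attribute dict of the main image, or None if it is absent."""
    finder = MainImageFinder()
    finder.feed(html)
    return finder.attrs


info = find_main_image(SAMPLE_HTML)
print(info["original"], info["alt"])
```

Returning None when the tag is missing lets the caller skip a page instead of crashing on a lookup.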


3. Download the image using the full-size URL

This part is simple; just read the code.

First install the Requests and BeautifulSoup libraries

pip install requests bs4

Import the libraries

import requests
from bs4 import BeautifulSoup
import sys

Request the page source

url = "https://divnil.com/wallpaper/iphone8/%E3%82%A2%E3%83%8B%E3%83%A1%E3%81%AE%E5%A3%81%E7%B4%99_2.html"
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:58.0) Gecko/20100101 Firefox/58.0",
}
resp = requests.get(url, headers=headers)
if resp.status_code != requests.codes.OK:
    print("Request Error, Code: %d" % resp.status_code)
    sys.exit()
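The long run of %E3... in that URL is just percent-encoded UTF-8 text. The sketch below decodes it and rebuilds listing-page URLs from the readable name. The pagination rule is an assumption inferred from the two URLs that appear in this article (one with no suffix, one ending in "_2"); the site may not follow it for other pages. The helper name page_url is hypothetical.

```python
from urllib.parse import quote, unquote

BASE = "https://divnil.com/wallpaper/iphone8/"
ENCODED_NAME = "%E3%82%A2%E3%83%8B%E3%83%A1%E3%81%AE%E5%A3%81%E7%B4%99"

# Decode the percent-encoded UTF-8 bytes back into readable text.
decoded_name = unquote(ENCODED_NAME)


def page_url(page):
    """Build a listing-page URL.

    ASSUMPTION: page 1 has no suffix and page n (n > 1) ends in "_n",
    as suggested by the two URLs used in the article.
    """
    suffix = "" if page == 1 else "_%d" % page
    return BASE + quote(decoded_name) + suffix + ".html"


print(decoded_name)
print(page_url(2))
```

Looping over page_url(n) for a small range of n would let the same scraping code walk several listing pages, provided the assumed pattern holds.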

Then parse out the detail-page link for every image

soup = BeautifulSoup(resp.text, "html.parser")
contents = soup.find_all("div", id="contents")[0]
wallpapers = contents.find_all("a", rel="wallpaper")
links = []
for wallpaper in wallpapers:
    links.append(wallpaper['href'])
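The href values collected here are relative, and the article later turns them into full URLs by prefixing a fixed string. As an alternative sketch, the standard library's urljoin can resolve them against the page they came from, and dict.fromkeys can drop duplicates while keeping order. The page URL and href values below are made-up examples, and absolute_links is a hypothetical helper name.

```python
from urllib.parse import urljoin


def absolute_links(page_url, hrefs):
    """Resolve hrefs against page_url and drop duplicates, keeping order."""
    resolved = [urljoin(page_url, href) for href in hrefs]
    return list(dict.fromkeys(resolved))


# Hypothetical inputs for illustration only.
page = "https://divnil.com/wallpaper/iphone8/index.html"
hrefs = ["a.html", "b.html", "a.html"]

for link in absolute_links(page, hrefs):
    print(link)
```

For bare file names like these, urljoin gives the same result as prefixing the directory URL by hand; it only behaves differently once an href contains "../" or a leading "/".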

Next, on each detail page, grab the link to what looks like the high-resolution image (whether it is really the original file is anyone's guess) and download it (/funny)

import os

head = "https://divnil.com/wallpaper/iphone8/"
if not os.path.exists("./Divnil"):
    os.mkdir("./Divnil")

for url in links:
    url = head + url
    resp = requests.get(url, headers=headers)
    if resp.status_code != requests.codes.OK:
        print("URL: %s REQUESTS ERROR. CODE: %d" % (url, resp.status_code))
        continue
    soup = BeautifulSoup(resp.text, "html.parser")
    # Chain .find() on the tag itself; .contents is a plain list with no .find()
    img = soup.find("div", id="contents").find("img", id="main_content")
    img_url = head + img['original'].replace("../", "")
    img_name = img['alt']
    print("start download %s ..." % img_url)

    resp = requests.get(img_url, headers=headers)
    if resp.status_code != requests.codes.OK:
        print("IMAGE %s DOWNLOAD FAILED." % img_name)
        continue  # skip the write so an error page is not saved as an image

    with open("./Divnil/" + img_name + ".jpg", "wb") as f:
        f.write(resp.content)
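The download code saves every file with a ".jpg" suffix regardless of what the server returns. If the site also serves other formats (an assumption, not something the article states), one option is to take the extension from the image URL and fall back to ".jpg" only when there is none. A minimal sketch, with extension_from_url as a hypothetical helper and example.com URLs as placeholders:

```python
import os
from urllib.parse import urlparse


def extension_from_url(img_url, default=".jpg"):
    """Return the lower-cased file extension found in the URL path."""
    # urlparse separates the path from any query string or fragment
    path = urlparse(img_url).path
    ext = os.path.splitext(path)[1].lower()
    return ext if ext else default


print(extension_from_url("https://example.com/img/a.PNG?size=large"))
print(extension_from_url("https://example.com/img/a"))
```

The file name would then be built as img_name + extension_from_url(img_url) instead of img_name + ".jpg".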


The complete code

import os
import sys

import requests
from bs4 import BeautifulSoup


class Divnil:
    def __init__(self):
        self.url = "https://divnil.com/wallpaper/iphone8/%E3%82%A2%E3%83%8B%E3%83%A1%E3%81%AE%E5%A3%81%E7%B4%99.html"
        self.head = "https://divnil.com/wallpaper/iphone8/"
        self.headers = {
            "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:58.0) Gecko/20100101 Firefox/58.0",
        }

    def getImageInfoUrl(self):
        # Fetch the listing page and collect the detail-page links
        resp = requests.get(self.url, headers=self.headers)
        if resp.status_code != requests.codes.OK:
            print("Request Error, Code: %d" % resp.status_code)
            sys.exit()

        soup = BeautifulSoup(resp.text, "html.parser")

        contents = soup.find("div", id="contents")
        wallpapers = contents.find_all("a", rel="wallpaper")

        self.links = []
        for wallpaper in wallpapers:
            self.links.append(wallpaper['href'])

    def downloadImage(self):
        # Create the output directory on first run
        if not os.path.exists("./Divnil"):
            os.mkdir("./Divnil")

        for url in self.links:
            url = self.head + url

            resp = requests.get(url, headers=self.headers)
            if resp.status_code != requests.codes.OK:
                print("URL: %s REQUESTS ERROR. CODE: %d" % (url, resp.status_code))
                continue

            soup = BeautifulSoup(resp.text, "html.parser")

            img = soup.find("div", id="contents").find("img", id="main_content")
            img_url = self.head + img['original'].replace("../", "")
            img_name = img['alt']

            print("start download %s ..." % img_url)

            resp = requests.get(img_url, headers=self.headers)
            if resp.status_code != requests.codes.OK:
                print("IMAGE %s DOWNLOAD FAILED." % img_name)
                continue

            # A '/' in the alt text would be read as a directory separator
            if '/' in img_name:
                img_name = img_name.split('/')[1]

            with open("./Divnil/" + img_name + ".jpg", "wb") as f:
                f.write(resp.content)

    def main(self):
        self.getImageInfoUrl()
        self.downloadImage()


if __name__ == "__main__":
    divnil = Divnil()
    divnil.main()
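The complete listing guards against a '/' in the alt text by keeping only the second '/'-separated piece. That drops part of the name and leaves other awkward characters alone. A more general sketch, not taken from the article, replaces every character that commonly breaks file names; safe_filename and its fallback value are hypothetical.

```python
import re


def safe_filename(name, fallback="image"):
    """Turn arbitrary text into a name that is safe to use as a file name."""
    # Replace path separators and the characters Windows rejects in file names
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", name)
    # Leading or trailing spaces and dots cause trouble on some systems
    cleaned = cleaned.strip(" .")
    return cleaned or fallback


print(safe_filename("anime/wallpaper: 01?"))
```

Calling safe_filename(img['alt']) before building the output path would cover the '/' case and keep the whole title.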


