Scraping loupan.com (楼盘网) with Python in 3 Minutes and Analyzing the Price Data

Background and goal

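Scrape Beijing housing price data from loupan.com (楼盘网) and look at how the prices are distributed, that is, which price range most communities fall into.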

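Because each Beijing district contains a large number of residential communities, this walkthrough uses Shijingshan District as the example.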

1. Import the libraries

import requests
import pandas as pd
from bs4 import BeautifulSoup
import time

pandas is used for data analysis, requests for fetching the pages, bs4 for parsing the HTML, and time for pausing between requests.

2. Define the headers

This site has no anti-scraping measures, so a basic set of headers is enough.

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36"
}
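
If many pages will be fetched, one option is to attach the headers to a requests.Session once so that every request reuses them. This is a minimal sketch rather than part of the original walkthrough; the variable name session is illustrative, and the headers dict is repeated only so the block runs on its own.

```python
import requests

# Same User-Agent as above, repeated so this block is self-contained
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36"
}

# A Session stores default headers and reuses the underlying connection
session = requests.Session()
session.headers.update(headers)

# Later calls such as session.get(url, timeout=10) send this header automatically
print(session.headers["User-Agent"])
```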

3. Test fetching a single URL

url = "https://bj.loupan.com/community/shijingshan/p2/"
response = requests.get(url, headers=headers)
print(response.status_code)

Running this prints the status code 200, which means the request succeeded.

print(response.text)

This prints the page content, confirming that the HTML was retrieved correctly.
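
Checking the status code by eye works for a one-off test. In a script it may be more convenient to let requests raise an exception on failure. The sketch below shows one way to do that; fetch_html, the 10-second timeout, and the injectable get parameter are illustrative assumptions, not part of the original.

```python
import requests

def fetch_html(url, headers=None, timeout=10, get=requests.get):
    """Return the page body as text, raising on network errors or a 4xx/5xx status.

    `get` is injectable (a hypothetical convenience) so the function can be
    exercised without touching the network.
    """
    response = get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return response.text
```

With the headers dict defined earlier, a call would look like fetch_html("https://bj.loupan.com/community/shijingshan/p2/", headers=headers).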

4. Parse a single page

soup = BeautifulSoup(response.text, "html.parser")
li_list = soup.find("ul", class_="list").find_all("li")

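for li in li_list:
    # Uncomment the next two lines to inspect the raw HTML of the first item
    # print(li)
    # break
    title = li.find("h2").find("a").get_text().strip()
    # Strip the "元/㎡" (yuan per square meter) unit from the price text
    price = li.find("div", class_="price").get_text().strip().replace("元/㎡", "")
    print(title, price)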

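This yields the name and price of each community. Entries without a listed price show 价格待定 ("price to be determined").

The output looks like this: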

西井三区 48827
西黄新村南里 48569
苹果汇 48411
金顶北路20号院 价格待定
金顶北路22号院 价格待定
金顶北路59号院 价格待定
金顶街9号院 55555
金顶山路168号院 价格待定
金冬苑 价格待定
金夏苑 价格待定
西福园 价格待定
九中教工楼 50059
宏开花园 价格待定
西山汇 22274
琅山村小区 价格待定
东下庄路38号院 价格待定
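
To see what the selectors expect, the parsing step can be run against a small hand-written snippet. The HTML below is a hypothetical reduction inferred only from the find calls above (a ul with class list, li items containing an h2 link and a div with class price); the real page contains more markup. parse_listing is an illustrative name, and skipping items that lack a title or price is an added assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; not copied from the real site
SAMPLE_HTML = """
<ul class="list">
  <li><h2><a href="#">西井三区</a></h2><div class="price">48827元/㎡</div></li>
  <li><h2><a href="#">金冬苑</a></h2><div class="price">价格待定</div></li>
  <li><div class="ad">no title or price here</div></li>
</ul>
"""

def parse_listing(html):
    """Return [title, price] rows, skipping items that lack either field."""
    soup = BeautifulSoup(html, "html.parser")
    ul = soup.find("ul", class_="list")
    if ul is None:
        return []
    rows = []
    for li in ul.find_all("li"):
        h2 = li.find("h2")
        link = h2.find("a") if h2 is not None else None
        price_div = li.find("div", class_="price")
        if link is None or price_div is None:
            continue  # not a community entry
        title = link.get_text().strip()
        price = price_div.get_text().strip().replace("元/㎡", "")
        rows.append([title, price])
    return rows

print(parse_listing(SAMPLE_HTML))
```

Checking for None before chaining calls avoids an AttributeError if a list item is missing one of the expected elements.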

5. Wrap it in a function

The goal is to make batch scraping possible.

def craw_page(url):
    """Fetch one listing page and return a list of [title, price] rows."""
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, "html.parser")
    # Each community is an <li> inside <ul class="list">
    li_list = soup.find("ul", class_="list").find_all("li")
    datas = []
    for li in li_list:
        title = li.find("h2").find("a").get_text().strip()
        # Drop the unit so numeric prices become plain digit strings
        price = li.find("div", class_="price").get_text().strip().replace("元/㎡", "")
        datas.append([title, price])
    return datas

Test the function:

datas = craw_page("https://bj.loupan.com/community/shijingshan/p2/")
print(datas)

It prints:

[['西井三区', '48827'], ['西黄新村南里', '48569'], ['苹果汇', '48411'], ['金顶北路20号院', '价格待定'], ['金顶北路22号院', '价格待定'], ['金顶北路59号院', '价格待定'], ['金顶街9号院', '55555'], ['金顶山路168号院', '价格待定'], ['金冬苑', '价格待定'], ['金夏苑', '价格待定'], ['西福园', '价格待定'], ['九中教工楼', '50059'], ['宏开花园', '价格待定'], ['西山汇', '22274'], ['琅山村小区', '价格待定'], ['东下庄路38号院', '价格待定'], ['馨领域', '65673'], ['重兴园小区', '52239'], ['六场宿舍', '51261'], ['京汉铂寓（京汉旭城三期）', '66176'], ['京源路1号院', '价格待定'], ['久筑社区', '价格待定'], ['京原路68号院', '42442'], ['古城西路15号小区', '56896'], ['安和家园(石景山)', '31111']]

6. Call it in a loop and store the results in a DataFrame

Scrape all the pages:

all_datas = []
for page in range(1, 25):  # pages 1 through 24
    # The first page has no /p1/ suffix
    if page == 1:
        url = "https://bj.loupan.com/community/shijingshan/"
    else:
        url = f'https://bj.loupan.com/community/shijingshan/p{page}/'
    print("爬取url：", url)  # the label means "scraping URL"
    datas = craw_page(url)
    all_datas.extend(datas)
    time.sleep(0.5)  # short pause between requests

As the loop runs, the rows from every page are collected in all_datas. The progress output is:

爬取url：https://bj.loupan.com/community/shijingshan/
爬取url：https://bj.loupan.com/community/shijingshan/p2/
爬取url：https://bj.loupan.com/community/shijingshan/p3/
爬取url：https://bj.loupan.com/community/shijingshan/p4/
爬取url：https://bj.loupan.com/community/shijingshan/p5/
爬取url：https://bj.loupan.com/community/shijingshan/p6/
爬取url：https://bj.loupan.com/community/shijingshan/p7/
爬取url：https://bj.loupan.com/community/shijingshan/p8/
爬取url：https://bj.loupan.com/community/shijingshan/p9/
爬取url：https://bj.loupan.com/community/shijingshan/p10/
爬取url：https://bj.loupan.com/community/shijingshan/p11/
爬取url：https://bj.loupan.com/community/shijingshan/p12/
爬取url：https://bj.loupan.com/community/shijingshan/p13/
爬取url：https://bj.loupan.com/community/shijingshan/p14/
爬取url：https://bj.loupan.com/community/shijingshan/p15/
爬取url：https://bj.loupan.com/community/shijingshan/p16/
爬取url：https://bj.loupan.com/community/shijingshan/p17/
爬取url：https://bj.loupan.com/community/shijingshan/p18/
爬取url：https://bj.loupan.com/community/shijingshan/p19/
爬取url：https://bj.loupan.com/community/shijingshan/p20/
爬取url：https://bj.loupan.com/community/shijingshan/p21/
爬取url：https://bj.loupan.com/community/shijingshan/p22/
爬取url：https://bj.loupan.com/community/shijingshan/p23/
爬取url：https://bj.loupan.com/community/shijingshan/p24/
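
The URL pattern used in the loop (no suffix for page 1, then p2/, p3/, and so on) can be isolated in a small helper, which makes it easy to check before any request is sent. This is a sketch using only the standard library; page_urls and its parameters are hypothetical names, not part of the original.

```python
def page_urls(base, last_page):
    """Yield listing URLs for pages 1..last_page under `base` (which must end with '/')."""
    for page in range(1, last_page + 1):
        # Page 1 is the bare listing URL; later pages append p<N>/
        yield base if page == 1 else f"{base}p{page}/"

urls = list(page_urls("https://bj.loupan.com/community/shijingshan/", 24))
print(urls[0])
print(urls[1])
print(len(urls))
```

The scraping loop could then iterate over urls directly instead of branching on the page number.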

Build the DataFrame:

df = pd.DataFrame(all_datas, columns=["小区名称", "价格"])
df

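In a notebook, the bare df on the last line displays the DataFrame, with the columns 小区名称 (community name) and 价格 (price).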
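
Because re-running the crawl takes time and sends more requests, it may be worth saving the DataFrame before analyzing it. The sketch below uses three rows copied from the output above. An in-memory buffer stands in for a file so the example is self-contained; a real script would pass a file path (for example a hypothetical shijingshan.csv) to to_csv and read_csv instead.

```python
import io
import pandas as pd

rows = [["西井三区", "48827"], ["金冬苑", "价格待定"], ["金顶街9号院", "55555"]]
df = pd.DataFrame(rows, columns=["小区名称", "价格"])

buffer = io.StringIO()
df.to_csv(buffer, index=False)  # index=False leaves out the row-index column

buffer.seek(0)
# dtype=str keeps the prices as text, matching what the scraper produced
restored = pd.read_csv(buffer, dtype=str)
print(restored)
```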

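7. Clean and analyze the data

Some rows contain 价格待定 ("price to be determined") instead of a number. We filter those rows out with pandas and then convert the column to integers.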

df_new = df[df["价格"].str.isnumeric()].copy()
df_new["价格"] = df_new["价格"].astype(int)
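
An alternative to str.isnumeric() is pd.to_numeric with errors="coerce", which turns anything it cannot parse into NaN so those rows can be dropped. This is a swapped-in technique shown for comparison, not the method used above; the sample rows are copied from the earlier output.

```python
import pandas as pd

df = pd.DataFrame(
    [["西井三区", "48827"], ["金冬苑", "价格待定"], ["金顶街9号院", "55555"]],
    columns=["小区名称", "价格"],
)

# Non-numeric strings such as "价格待定" become NaN instead of raising an error
prices = pd.to_numeric(df["价格"], errors="coerce")

# Keep only the rows with a parsed price, then store it as an integer column
df_new = df[prices.notna()].copy()
df_new["价格"] = prices[prices.notna()].astype(int)
print(df_new)
```

This variant also tolerates values that isnumeric() would reject, such as a price with a decimal point, although astype(int) would then truncate it.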

The 价格 column now contains only numbers.

Look at the price distribution:

df_new["价格"].hist(bins=10)

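The hist method in pandas draws a histogram of the data.

A histogram is a common visualization that gives a quick view of how the data is distributed.

From it you can read off the central tendency, the spread, and any outliers.

The bins parameter controls the number of bars or their boundaries: it can be an integer or a sequence of bin edges.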
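
To make the bins parameter concrete, the sketch below counts prices per bin with explicit edges. It uses pd.cut, which groups values into intervals much like histogram bins (pd.cut intervals are right-closed by default). The prices are the numeric values visible in the page-2 output earlier, so the counts describe that sample only, not the whole district; the edge values are an arbitrary illustrative choice.

```python
import pandas as pd

# Numeric prices from the page-2 sample shown earlier
prices = pd.Series([48827, 48569, 48411, 55555, 50059, 22274, 65673,
                    52239, 51261, 66176, 42442, 56896, 31111])

# Explicit edges: five bins of width 10000 from 20000 to 70000
edges = [20000, 30000, 40000, 50000, 60000, 70000]

# pd.cut labels each price with its interval, e.g. (40000, 50000]
counts = pd.cut(prices, bins=edges).value_counts().sort_index()
print(counts)

# The same edges could be passed to the plot: prices.hist(bins=edges)
```

In this small sample, 9 of the 13 prices fall between 40000 and 60000 yuan per square meter.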

