Py学习 » chitown88 » All replies
Total replies: 9
3 years ago
Replied to the topic created by chitown88 » Scraping some data from a football website using Selenium and Python

Are you sure you need Selenium? You can pull those tables quite easily with pandas and requests.

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.soccerstats.com/matches.asp?matchday=1#'
response = requests.get(url)

soup = BeautifulSoup(response.text, 'html.parser')
links = soup.find_all('a', text='stats')

filtered_links = []
for link in links:
    if 'pmatch' in link['href']:
        filtered_links.append(link['href'])

tables = {}
for count, link in enumerate(filtered_links, start=1):
    teamsStr = ''
    try:
        html = requests.get('https://www.soccerstats.com/' + link).text
        soup = BeautifulSoup(html, 'html.parser')

        # The 'Goal statistics' heading precedes the table with the team names
        goalsTable = soup.find('h2', text='Goal statistics')

        teams = goalsTable.find_next('table')
        teamsStr = teams.find_all('td')[0].text + ' ' + teams.find_all('td')[-1].text

        # The goal statistics table follows the team-names table
        goalsTable = teams.find_next('table')
        df = pd.read_html(str(goalsTable))[0]

        print(f'{count} of {len(filtered_links)}: {teamsStr}')
        tables[teamsStr] = df

    except Exception:
        # No 'Goal statistics' section on this match page
        print(f'{count} of {len(filtered_links)}: {teamsStr} !! NO GOALS STATISTICS !!')

Output: a screenshot of the scraped goal-statistics tables (image not reproduced here).
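If you want all of the matches in a single DataFrame afterwards, a minimal sketch (assuming the `tables` dict built above) is to concatenate them with pandas:

import pandas as pd

# Combine the per-match DataFrames into one; the dict keys
# (the '<home team> <away team>' strings) become the outer index level.
combined = pd.concat(tables)
print(combined.head())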

3 years ago
Replied to the topic created by chitown88 » Python: sum based on a group, and display it as an additional column

Do a .groupby() on channel and get the sum of units. Then simply divide units by units_per_channel:

import pandas as pd


df = pd.DataFrame([['Offline',    'Bournemouth',    62],
['Offline' ,    'Kettering'  ,    90],
['Offline' ,    'Manchester' ,    145],
['Online'  ,    'Bournemouth',    220],
['Online'  ,    'Kettering',      212],
['Online'  ,    'Manchester',     272]],
                  columns=['channel','store','units'],)


df['units_per_channel'] = df.groupby('channel')['units'].transform('sum')
df['store_share'] = df['units'] / df['units_per_channel']

Output:

print(df)
   channel        store  units  units_per_channel  store_share
0  Offline  Bournemouth     62                297     0.208754
1  Offline    Kettering     90                297     0.303030
2  Offline   Manchester    145                297     0.488215
3   Online  Bournemouth    220                704     0.312500
4   Online    Kettering    212                704     0.301136
5   Online   Manchester    272                704     0.386364
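transform('sum') is used above so that the per-channel total is broadcast back onto every row. An equivalent approach (a minimal sketch, assuming df as originally constructed, before the two new columns are added) is to compute the totals separately and merge them back in:

# Per-channel totals as their own small frame, then merged back onto each row.
totals = (df.groupby('channel', as_index=False)['units']
            .sum()
            .rename(columns={'units': 'units_per_channel'}))

out = df.merge(totals, on='channel')
out['store_share'] = out['units'] / out['units_per_channel']
print(out)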
6 years ago
Replied to the topic created by chitown88 » Getting data in Python from a web page with multiple tables

The site has an API endpoint that returns the data to you in a nice JSON format. You can get the response as JSON and then normalize it to create the tables. Now, when it does this it returns two tables, and I'm not sure whether you need the second one. If not, keep them stored separately; otherwise append them together as below.

import requests    
from pandas.io.json import json_normalize

url = 'https://api.bseindia.com/BseIndiaAPI/api/MktHighLowData/w?Grpcode=&HLflag=H&indexcode=&scripcode='

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36'}

payload = {
'Grpcode':'', 
'HLflag': 'H',
'indexcode':'' ,
'scripcode':'' }

jsonObj = requests.get(url, headers=headers, params=payload).json()

df_table = json_normalize(jsonObj['Table'])
df_table1 = json_normalize(jsonObj['Table1'])

df = df_table.append(df_table1)

Output:

print (df)
     ALLTimeHigh         ...                         dt_tm
0        1019.95         ...           2019-02-25T16:00:03
1         263.00         ...           2019-02-25T16:00:03
2          24.00         ...           2019-02-25T16:00:03
3          35.90         ...           2019-02-25T16:00:03
4          29.75         ...           2019-02-25T16:00:03
5          43.00         ...           2019-02-25T16:00:03
6         140.40         ...           2019-02-25T16:00:03
7          15.39         ...           2019-02-25T16:00:03
8         724.00         ...           2019-02-25T16:00:03
9        1495.00         ...           2019-02-25T16:00:03
10        123.15         ...           2019-02-25T16:00:03
11        121.00         ...           2019-02-25T16:00:03
12        238.50         ...           2019-02-25T16:00:03
13         89.00         ...           2019-02-25T16:00:03
14        819.95         ...           2019-02-25T16:00:03
15        112.40         ...           2019-02-25T16:00:03
16         49.95         ...           2019-02-25T16:00:03
17        330.85         ...           2019-02-25T16:00:03
18        167.45         ...           2019-02-25T16:00:03
19         25.10         ...           2019-02-25T16:00:03
20        940.00         ...           2019-02-25T16:00:03
21        165.00         ...           2019-02-25T16:00:03
22           NaN         ...           2019-02-25T16:00:03
23        239.00         ...           2019-02-25T16:00:03
24        151.55         ...           2019-02-25T16:00:03
25         34.35         ...           2019-02-25T16:00:03
26        256.15         ...           2019-02-25T16:00:03
27         49.75         ...           2019-02-25T16:00:03
28        103.25         ...           2019-02-25T16:00:03
29         50.50         ...           2019-02-25T16:00:03
..           ...         ...                           ...
87        135.00         ...           2019-02-25T16:00:03
88        219.80         ...           2019-02-25T16:00:03
89         58.00         ...           2019-02-25T16:00:03
90        494.00         ...           2019-02-25T16:00:03
91        285.30         ...           2019-02-25T16:00:03
92         55.65         ...           2019-02-25T16:00:03
93          4.45         ...           2019-02-25T16:00:03
94         50.00         ...           2019-02-25T16:00:03
95         50.00         ...           2019-02-25T16:00:03
96         92.50         ...           2019-02-25T16:00:03
97        154.80         ...           2019-02-25T16:00:03
98         82.40         ...           2019-02-25T16:00:03
99        293.85         ...           2019-02-25T16:00:03
100       396.00         ...           2019-02-25T16:00:03
101        98.00         ...           2019-02-25T16:00:03
102       144.60         ...           2019-02-25T16:00:03
103        11.50         ...           2019-02-25T16:00:03
104        42.95         ...           2019-02-25T16:00:03
105       313.00         ...           2019-02-25T16:00:03
106      1120.00         ...           2019-02-25T16:00:03
107        87.00         ...           2019-02-25T16:00:03
108        82.00         ...           2019-02-25T16:00:03
109       214.00         ...           2019-02-25T16:00:03
110       505.00         ...           2019-02-25T16:00:03
111      1525.00         ...           2019-02-25T16:00:03
112       220.00         ...           2019-02-25T16:00:03
113        36.00         ...           2019-02-25T16:00:03
114       170.00         ...           2019-02-25T16:00:03
115       549.50         ...           2019-02-25T16:00:03
116      4990.00         ...           2019-02-25T16:00:03

[168 rows x 19 columns]
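On current pandas versions, json_normalize is exposed as pd.json_normalize (the pandas.io.json import path is deprecated) and DataFrame.append has been removed in pandas 2.0, so the same idea would look roughly like this (a sketch, not re-tested against the live API):

import requests
import pandas as pd

url = 'https://api.bseindia.com/BseIndiaAPI/api/MktHighLowData/w'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
payload = {'Grpcode': '', 'HLflag': 'H', 'indexcode': '', 'scripcode': ''}

jsonObj = requests.get(url, headers=headers, params=payload).json()

# pd.json_normalize replaces pandas.io.json.json_normalize,
# and pd.concat replaces the removed DataFrame.append.
df = pd.concat([pd.json_normalize(jsonObj['Table']),
                pd.json_normalize(jsonObj['Table1'])],
               ignore_index=True)
print(df)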
6 years ago
Replied to the topic created by chitown88 » Reading JSON in Python

A couple of things:

1 - The emotionsAll key is inside the objects key, in the first element of the list [0], under the attributes key.

2 - Your JSON file was written with the bytes prefix, so when it's read it starts with b'. You can a) decode/encode so the file is written without that prefix, or b) just manipulate the string, as below.

import json

data = repr(open('file.json', 'rb').read())

data = data.split('{', 1)[-1]
data = data.rsplit('}', 1)[0]

data = ''.join(['{', data, '}'])
jsonObj = json.loads(data)

print(jsonObj['objects'][0]['attributes']['emotionsAll']['neutral']) 

Output:

print(jsonObj['objects'][0]['attributes']['emotionsAll']['neutral']) 
0.0
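For option (a), a minimal sketch, assuming the data starts out as a bytes object (here a hypothetical raw_bytes) before it is written to disk:

import json

# Hypothetical bytes payload; in practice this would come from wherever
# the JSON was originally produced (e.g. an API response body).
raw_bytes = b'{"objects": [{"attributes": {"emotionsAll": {"neutral": 0.0}}}]}'

# Decode before writing so the file does not contain the b'...' prefix.
with open('file.json', 'w', encoding='utf-8') as f:
    f.write(raw_bytes.decode('utf-8'))

with open('file.json', encoding='utf-8') as f:
    jsonObj = json.load(f)

print(jsonObj['objects'][0]['attributes']['emotionsAll']['neutral'])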
6 years ago
Replied to the topic created by chitown88 » How to structure elements with multiple attributes in Python

In my opinion, pandas would be a nice way to do this. But you can certainly use a dictionary:

elements = ['A', 'B', 'C', 'D']
colors = ['red','red', 'blue', 'red']
shapes = ['square', 'circle', 'circle', 'triangle']


dict1 = { element: {'color':colors[index], 'shape':shapes[index]} for index,element in enumerate(elements)}


def find_keys(keyword):
    result = []
    for key, val in dict1.items():
        for k, v in val.items():
            if v == keyword:
                result.append(key)
    return result

print (find_keys('red'))

Output:

 print (find_keys('red'))
['A', 'B', 'D']

print (find_keys('circle')) 
['B', 'C']
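Since pandas was mentioned at the start, a minimal sketch of that approach with the same data:

import pandas as pd

elements = ['A', 'B', 'C', 'D']
colors = ['red', 'red', 'blue', 'red']
shapes = ['square', 'circle', 'circle', 'triangle']

df = pd.DataFrame({'element': elements, 'color': colors, 'shape': shapes})

# Return the elements whose color or shape matches the keyword.
def find_keys(keyword):
    mask = (df['color'] == keyword) | (df['shape'] == keyword)
    return df.loc[mask, 'element'].tolist()

print(find_keys('red'))     # ['A', 'B', 'D']
print(find_keys('circle'))  # ['B', 'C']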
6 years ago
Replied to the topic created by chitown88 » Extracting attributes from an HTML tag using Python Selenium [duplicate]

Use .attrs:

import bs4

html = '''<input  
class="text header_login_text_box ignore_interaction" 
type="text" 
name="email" tabindex="1"
data-group="js-editable"
placeholder="Email"
w2cid="wZgD2YHa18" 
id="__w2_wZgD2YHa18_email">'''

soup = bs4.BeautifulSoup(html, 'html.parser')


for tag in soup:
    attr_dict = (tag.attrs)

Output: print (attr_dict)

{'class': ['text', 'header_login_text_box', 'ignore_interaction'], 
'type': 'text', 
'name': 'email', 
'tabindex': '1', 
'data-group': 'js-editable', 
'placeholder': 'Email', 
'w2cid': 'wZgD2YHa18', 
'id': '__w2_wZgD2YHa18_email'}

This will open the browser and then click the dropdown menu. You can then click the option you want to continue:

from selenium import webdriver 

driver = webdriver.Chrome()
url = 'http://www.mpcci.com/members_list.php'
driver.get(url) 

driver.find_element_by_xpath('//*[@id="select_gp_id"]').click()
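Note that the find_element_by_* helpers no longer exist in current Selenium 4 releases; the equivalent call there would be:

from selenium.webdriver.common.by import By

# Locate the dropdown by XPath and click it (Selenium 4 style).
driver.find_element(By.XPATH, '//*[@id="select_gp_id"]').click()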
6 years ago
Replied to the topic created by chitown88 » Need help on how to scrape live streaming data with Python

Your code is incomplete. Specifically, 1) you aren't actually using BeautifulSoup to do anything, and 2) your function doesn't return anything, which is why it prints 'None'.

import pandas as pd
import bs4 
from requests_html import HTMLSession 
import time

def get_count():

    url = 'http://10.0.0.206/apps/cy8ckit_062_demo/main.html'

    session = HTMLSession()
    r = session.get(url)
    r.html.render(sleep=5,timeout=8)

    soup = bs4.BeautifulSoup(r.text,'html.parser')

    data = soup.findAll('div', {'id':'currentData'})[0]
    temp_data = data.findAll('p')
    current_time = temp_data[0].text
    current_date = temp_data[1].text
    current_usage = temp_data[2].text

    print ('%s\n%s\n%s' %(current_time, current_date, current_usage))



while True:
    get_count()
    time.sleep(8)
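To address point 2 directly, the function can return the scraped values rather than printing inside it; a minimal sketch of that variant (which also parses the rendered page via r.html.html instead of the raw r.text):

import bs4
from requests_html import HTMLSession
import time

def get_count():
    url = 'http://10.0.0.206/apps/cy8ckit_062_demo/main.html'

    session = HTMLSession()
    r = session.get(url)
    r.html.render(sleep=5, timeout=8)

    # Parse the rendered HTML, not the raw response text.
    soup = bs4.BeautifulSoup(r.html.html, 'html.parser')

    data = soup.find('div', {'id': 'currentData'})
    paragraphs = data.find_all('p')

    # Return the values so the caller decides what to do with them.
    return paragraphs[0].text, paragraphs[1].text, paragraphs[2].text


while True:
    current_time, current_date, current_usage = get_count()
    print(f'{current_time}\n{current_date}\n{current_usage}')
    time.sleep(8)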
5 years ago
Replied to the topic created by chitown88 » Python list only prints one row into the CSV

The way it is now won't write row by row, because you're basically building the lists column-wise. There is a way to do it with zip, I believe, but you can also write each row into a DataFrame and then write the DataFrame to a file with pandas:

import requests
from bs4 import BeautifulSoup
import pandas as pd

def write_output(data):
    data.to_csv('data.csv', index=False)    



def fetch_data():
    df = pd.DataFrame()
    base_url = 'http://leevers.com/'
    r = requests.get(base_url)
    soup = BeautifulSoup(r.text, 'lxml')

    locations = soup.find_all('div',{'class':'border'})

    for stores in locations:
        store = stores.find_all('p')

        name = store[0].text
        address = store[1].text
        city, state_zip = store[2].text.split(',')
        state, zip_code = state_zip.strip().split(' ')
        phone = store[3].text

        temp_df = pd.DataFrame([[base_url,name,address,city,state, zip_code,'US','<MISSING>',
                                phone,'<MISSING>','<MISSING>','<MISSING>','<MISSING>']],
                                columns=["locator_domain", "location_name", "street_address", "city", "state", "zip", "country_code",
                                         "store_number", "phone", "location_type", "latitude", "longitude", "hours_of_operation"])

        df = df.append(temp_df).reset_index(drop=True)
    return df

data = fetch_data()
write_output(data)

Output:

print (df.to_string())
         locator_domain          location_name        street_address           city state    zip country_code store_number             phone location_type   latitude  longitude hours_of_operation
0   http://leevers.com/  Colorado Ranch Market   11505 E. Colfax Ave         Aurora    CO  80010           US    <MISSING>  PH: 720-343-2195     <MISSING>  <MISSING>  <MISSING>          <MISSING>
1   http://leevers.com/             Save-A-Lot    4255 W Florida Ave         Denver    CO  80219           US    <MISSING>  PH: 303-935-0880     <MISSING>  <MISSING>  <MISSING>          <MISSING>
2   http://leevers.com/             Save-A-Lot      15220 E. 6th Ave         Aurora    CO  80011           US    <MISSING>  PH: 720-343-2011     <MISSING>  <MISSING>  <MISSING>          <MISSING>
3   http://leevers.com/             Save-A-Lot      3045 W. 74th Ave    Westminster    CO  80030           US    <MISSING>  PH: 303-339-2610     <MISSING>  <MISSING>  <MISSING>          <MISSING>
4   http://leevers.com/             Save-A-Lot    1110 Bonforte Blvd         Pueblo    CO  81001           US    <MISSING>  PH: 719-544-6057     <MISSING>  <MISSING>  <MISSING>          <MISSING>
5   http://leevers.com/             Save-A-Lot          698 Peria St         Aurora    CO  80011           US    <MISSING>  PH: 303-365-0393     <MISSING>  <MISSING>  <MISSING>          <MISSING>
6   http://leevers.com/             Save-A-Lot         4860 Pecos St         Denver    CO  80221           US    <MISSING>  PH: 720-235-3900     <MISSING>  <MISSING>  <MISSING>          <MISSING>
7   http://leevers.com/             Save-A-Lot      2630 W. 38th Ave         Denver    CO  80211           US    <MISSING>  PH: 303-433-4405     <MISSING>  <MISSING>  <MISSING>          <MISSING>
8   http://leevers.com/             Save-A-Lot       405 S Circle Dr  Colo. Springs    CO  80910           US    <MISSING>  PH: 719-520-5620     <MISSING>  <MISSING>  <MISSING>          <MISSING>
9   http://leevers.com/             Save-A-Lot       1750 N. Main St       Longmont    CO  80501           US    <MISSING>  PH: 720-864-8060     <MISSING>  <MISSING>  <MISSING>          <MISSING>
10  http://leevers.com/             Save-A-Lot       630 W. 84th Ave       Thornton    CO  80260           US    <MISSING>  PH: 303-468-6290     <MISSING>  <MISSING>  <MISSING>          <MISSING>
11  http://leevers.com/             Save-A-Lot  1951 S. Federal Blvd         Denver    CO  80219           US    <MISSING>  PH: 303-407-0430     <MISSING>  <MISSING>  <MISSING>          <MISSING>
12  http://leevers.com/             Save-A-Lot        7290 Manaco St  Commerce City    CO  80022           US    <MISSING>  PH: 303-288-1747     <MISSING>  <MISSING>  <MISSING>          <MISSING>
13  http://leevers.com/             Save-A-Lot    6601 W. Colfax Ave       Lakewood    CO  80214           US    <MISSING>  PH: 303-468-6290     <MISSING>  <MISSING>  <MISSING>          <MISSING>
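DataFrame.append has been removed in pandas 2.0, so on current pandas the same idea would be to collect the rows in a list and build the DataFrame once at the end; a minimal sketch of just that part:

import pandas as pd

columns = ["locator_domain", "location_name", "street_address", "city", "state",
           "zip", "country_code", "store_number", "phone", "location_type",
           "latitude", "longitude", "hours_of_operation"]

rows = []
# Inside the `for stores in locations:` loop, append one list per store, e.g.:
# rows.append([base_url, name, address, city, state, zip_code, 'US', '<MISSING>',
#              phone, '<MISSING>', '<MISSING>', '<MISSING>', '<MISSING>'])

df = pd.DataFrame(rows, columns=columns)
df.to_csv('data.csv', index=False)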
14  http://leevers.com/             Save-A-Lot           816 25th St        Greeley    CO  80631           US    <MISSING>  PH: 970-356-7498     <MISSING>  <MISSING>  <MISSING>          <MISSING>