
chitown88

Topics recently created by chitown88
Recent replies by chitown88
3 years ago
Replied to a topic created by chitown88 » Scraping some data from a football website with Selenium and Python

Are you sure you need Selenium? You can pull those tables easily enough with pandas and requests.

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.soccerstats.com/matches.asp?matchday=1#'
response = requests.get(url)

soup = BeautifulSoup(response.text, 'html.parser')
links = soup.find_all('a', text='stats')

filtered_links = []
for link in links:
    if 'pmatch' in link['href']:
        filtered_links.append(link['href'])

tables = {}
for count, link in enumerate(filtered_links, start=1):
    try:
        html = requests.get('https://www.soccerstats.com/' + link).text
        soup = BeautifulSoup(html, 'html.parser')
        
        goalsTable = soup.find('h2', text='Goal statistics')
        
        teams = goalsTable.find_next('table')
        teamsStr = teams.find_all('td')[0].text + ' ' + teams.find_all('td')[-1].text
        
        goalsTable = teams.find_next('table')
        df = pd.read_html(str(goalsTable))[0]
        
        print(f'{count} of {len(filtered_links)}: {teamsStr}')
        tables[teamsStr] = df
        
    except Exception:
        # The 'Goal statistics' section wasn't found for this match
        print(f'{count} of {len(filtered_links)}: {link} !! NO GOALS STATISTICS !!')

Output: (screenshot of the collected tables in the original post)
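
If you then want every match in one frame, a minimal follow-up sketch; it assumes the per-match goal tables share the same column layout, which the site doesn't guarantee:

# Stack the per-match frames; the dict keys ('Home Away' strings) become an extra index level
combined = pd.concat(tables)
print(combined.head())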

3 years ago
Replied to a topic created by chitown88 » Python: Sum based on a group and show it as an additional column

Do a .groupby() on channel and sum the units. Then simply divide units by units_per_channel:

import pandas as pd


df = pd.DataFrame([['Offline',    'Bournemouth',    62],
['Offline' ,    'Kettering'  ,    90],
['Offline' ,    'Manchester' ,    145],
['Online'  ,    'Bournemouth',    220],
['Online'  ,    'Kettering',      212],
['Online'  ,    'Manchester',     272]],
                  columns=['channel','store','units'],)


df['units_per_channel'] = df.groupby('channel')['units'].transform('sum')
df['store_share'] = df['units'] / df['units_per_channel']

Output:

print(df)
   channel        store  units  units_per_channel  store_share
0  Offline  Bournemouth     62                297     0.208754
1  Offline    Kettering     90                297     0.303030
2  Offline   Manchester    145                297     0.488215
3   Online  Bournemouth    220                704     0.312500
4   Online    Kettering    212                704     0.301136
5   Online   Manchester    272                704     0.386364
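
Equivalently, if you don't need to keep the intermediate column, the share can be computed in one step with the same transform:

# Same result without keeping units_per_channel around
df['store_share'] = df['units'] / df.groupby('channel')['units'].transform('sum')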
6 years ago
Replied to a topic created by chitown88 » Getting data in Python from a web page with multiple tables

The site has an API endpoint that returns the data to you in a nice JSON format. You can get the response as JSON and then normalize it to create the table. It actually returns two tables, so I'm not sure whether you need the second one. If not, just keep them separate; here I append them together.

import requests    
from pandas.io.json import json_normalize

url = 'https://api.bseindia.com/BseIndiaAPI/api/MktHighLowData/w?Grpcode=&HLflag=H&indexcode=&scripcode='

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36'}

payload = {
'Grpcode':'', 
'HLflag': 'H',
'indexcode':'' ,
'scripcode':'' }

jsonObj = requests.get(url, headers=headers, params=payload).json()

df_table = json_normalize(jsonObj['Table'])
df_table1 = json_normalize(jsonObj['Table1'])

df = df_table.append(df_table1)

Output:

print (df)
     ALLTimeHigh         ...                         dt_tm
0        1019.95         ...           2019-02-25T16:00:03
1         263.00         ...           2019-02-25T16:00:03
2          24.00         ...           2019-02-25T16:00:03
3          35.90         ...           2019-02-25T16:00:03
4          29.75         ...           2019-02-25T16:00:03
5          43.00         ...           2019-02-25T16:00:03
6         140.40         ...           2019-02-25T16:00:03
7          15.39         ...           2019-02-25T16:00:03
8         724.00         ...           2019-02-25T16:00:03
9        1495.00         ...           2019-02-25T16:00:03
10        123.15         ...           2019-02-25T16:00:03
11        121.00         ...           2019-02-25T16:00:03
12        238.50         ...           2019-02-25T16:00:03
13         89.00         ...           2019-02-25T16:00:03
14        819.95         ...           2019-02-25T16:00:03
15        112.40         ...           2019-02-25T16:00:03
16         49.95         ...           2019-02-25T16:00:03
17        330.85         ...           2019-02-25T16:00:03
18        167.45         ...           2019-02-25T16:00:03
19         25.10         ...           2019-02-25T16:00:03
20        940.00         ...           2019-02-25T16:00:03
21        165.00         ...           2019-02-25T16:00:03
22           NaN         ...           2019-02-25T16:00:03
23        239.00         ...           2019-02-25T16:00:03
24        151.55         ...           2019-02-25T16:00:03
25         34.35         ...           2019-02-25T16:00:03
26        256.15         ...           2019-02-25T16:00:03
27         49.75         ...           2019-02-25T16:00:03
28        103.25         ...           2019-02-25T16:00:03
29         50.50         ...           2019-02-25T16:00:03
..           ...         ...                           ...
87        135.00         ...           2019-02-25T16:00:03
88        219.80         ...           2019-02-25T16:00:03
89         58.00         ...           2019-02-25T16:00:03
90        494.00         ...           2019-02-25T16:00:03
91        285.30         ...           2019-02-25T16:00:03
92         55.65         ...           2019-02-25T16:00:03
93          4.45         ...           2019-02-25T16:00:03
94         50.00         ...           2019-02-25T16:00:03
95         50.00         ...           2019-02-25T16:00:03
96         92.50         ...           2019-02-25T16:00:03
97        154.80         ...           2019-02-25T16:00:03
98         82.40         ...           2019-02-25T16:00:03
99        293.85         ...           2019-02-25T16:00:03
100       396.00         ...           2019-02-25T16:00:03
101        98.00         ...           2019-02-25T16:00:03
102       144.60         ...           2019-02-25T16:00:03
103        11.50         ...           2019-02-25T16:00:03
104        42.95         ...           2019-02-25T16:00:03
105       313.00         ...           2019-02-25T16:00:03
106      1120.00         ...           2019-02-25T16:00:03
107        87.00         ...           2019-02-25T16:00:03
108        82.00         ...           2019-02-25T16:00:03
109       214.00         ...           2019-02-25T16:00:03
110       505.00         ...           2019-02-25T16:00:03
111      1525.00         ...           2019-02-25T16:00:03
112       220.00         ...           2019-02-25T16:00:03
113        36.00         ...           2019-02-25T16:00:03
114       170.00         ...           2019-02-25T16:00:03
115       549.50         ...           2019-02-25T16:00:03
116      4990.00         ...           2019-02-25T16:00:03

[168 rows x 19 columns]
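
A note for newer pandas versions: json_normalize now lives at the top level of pandas, and DataFrame.append has since been deprecated (and removed in pandas 2.0), so combining the same two tables would look roughly like this (reusing jsonObj from the request above):

import pandas as pd

df_table = pd.json_normalize(jsonObj['Table'])
df_table1 = pd.json_normalize(jsonObj['Table1'])

# Concatenate instead of append, resetting the row index
df = pd.concat([df_table, df_table1], ignore_index=True)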
6 years ago
Replied to a topic created by chitown88 » Reading JSON in Python

A couple of things:

1 - The emotionsAll key sits inside the objects key, in the first element of that list [0], under the attributes key.

2 - Your JSON file was written with a bytes prefix, so when you read it back it starts with b'. You can either a) decode/encode so the file gets written without that prefix, or just manipulate the string:

import json

# Read the raw bytes and work on their string representation
data = repr(open('file.json', 'rb').read())

# Keep only what sits between the first '{' and the last '}' (drops the b'...' wrapper)
data = data.split('{', 1)[-1]
data = data.rsplit('}', 1)[0]

# Rebuild a clean JSON string and parse it
data = ''.join(['{', data, '}'])
jsonObj = json.loads(data)

print(jsonObj['objects'][0]['attributes']['emotionsAll']['neutral']) 

Output:

print(jsonObj['objects'][0]['attributes']['emotionsAll']['neutral']) 
0.0
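
For option (a) above, the cleaner fix is to decode the bytes before writing, so the file holds plain JSON and none of the string surgery is needed. A minimal sketch, assuming the writing side has the raw response bytes in a variable called raw_bytes (hypothetical name):

import json

# Decode first, so the file contains plain JSON text rather than a bytes repr
with open('file.json', 'w') as f:
    f.write(raw_bytes.decode('utf-8'))

# Reading it back then becomes a straightforward json.load
with open('file.json') as f:
    jsonObj = json.load(f)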
6 years ago
Replied to a topic created by chitown88 » How to structure elements with multiple properties in Python

In my opinion, pandas would be a good way to do this. But you can of course use a dictionary:

elements = ['A', 'B', 'C', 'D']
colors = ['red','red', 'blue', 'red']
shapes = ['square', 'circle', 'circle', 'triangle']


dict1 = { element: {'color':colors[index], 'shape':shapes[index]} for index,element in enumerate(elements)}


def find_keys(keyword):
    result = []
    for key, val in dict1.items():
        for k, v in val.items():
            if v == keyword:
                result.append(key)
    return result

print (find_keys('red'))

Output:

 print (find_keys('red'))
['A', 'B', 'D']

print (find_keys('circle')) 
['B', 'C']
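
For reference, a rough sketch of the pandas route mentioned at the top (reusing the elements/colors/shapes lists from above), where the reverse lookup becomes a simple boolean filter:

import pandas as pd

df = pd.DataFrame({'element': elements, 'color': colors, 'shape': shapes})

def find_elements(keyword):
    # Rows whose color or shape matches the keyword
    mask = (df['color'] == keyword) | (df['shape'] == keyword)
    return df.loc[mask, 'element'].tolist()

print(find_elements('red'))      # ['A', 'B', 'D']
print(find_elements('circle'))   # ['B', 'C']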
6 years ago
Replied to a topic created by chitown88 » Extracting attributes from an HTML tag with Python Selenium [duplicate]

Use .attrs:

import bs4

html = '''<input  
class="text header_login_text_box ignore_interaction" 
type="text" 
name="email" tabindex="1"
data-group="js-editable"
placeholder="Email"
w2cid="wZgD2YHa18" 
id="__w2_wZgD2YHa18_email">'''

soup = bs4.BeautifulSoup(html, 'html.parser')


for tag in soup:
    attr_dict = (tag.attrs)

Output: print(attr_dict)

{'class': ['text', 'header_login_text_box', 'ignore_interaction'], 
'type': 'text', 
'name': 'email', 
'tabindex': '1', 
'data-group': 'js-editable', 
'placeholder': 'Email', 
'w2cid': 'wZgD2YHa18', 
'id': '__w2_wZgD2YHa18_email'}

This will open the browser and then click the dropdown menu. You can then click the option you want to continue:

from selenium import webdriver 

driver = webdriver.Chrome()
url = 'http://www.mpcci.com/members_list.php'
driver.get(url) 

driver.find_element_by_xpath('//*[@id="select_gp_id"]').click()
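
On newer Selenium (4.x) the find_element_by_* helpers were deprecated and later removed, so the same click would look roughly like this:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('http://www.mpcci.com/members_list.php')

# Same dropdown click, written with the By locator API
driver.find_element(By.XPATH, '//*[@id="select_gp_id"]').click()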
6 years ago
Replied to a topic created by chitown88 » Need help with scraping live streaming data using Python

Your code is incomplete. Specifically: 1) you never actually use BeautifulSoup to do anything, and 2) your function doesn't return anything, which is why it prints "None".

import pandas as pd
import bs4 
from requests_html import HTMLSession 
import time

def get_count():

    url = 'http://10.0.0.206/apps/cy8ckit_062_demo/main.html'

    session = HTMLSession()
    r = session.get(url)
    r.html.render(sleep=5,timeout=8)

    soup = bs4.BeautifulSoup(r.html.html, 'html.parser')  # parse the rendered HTML, not the raw response

    data = soup.findAll('div', {'id':'currentData'})[0]
    temp_data = data.findAll('p')
    current_time = temp_data[0].text
    current_date = temp_data[1].text
    current_usage = temp_data[2].text

    print ('%s\n%s\n%s' %(current_time, current_date, current_usage))



while True:
    get_count()
    time.sleep(8)
5 years ago
Replied to a topic created by chitown88 » Python list only prints a single row into the CSV

The way it is, it won't write row by row, because you're basically building the lists column by column. There's a way to do it with zip, I believe, but you can also write each row into a dataframe and then write the dataframe to a file with pandas:

import requests
from bs4 import BeautifulSoup
import pandas as pd

def write_output(data):
    data.to_csv('data.csv', index=False)    



def fetch_data():
    df = pd.DataFrame()
    base_url = 'http://leevers.com/'
    r = requests.get(base_url)
    soup = BeautifulSoup(r.text, 'lxml')

    locations = soup.find_all('div',{'class':'border'})

    for stores in locations:
        store = stores.find_all('p')

        name = store[0].text
        address = store[1].text
        city, state_zip = store[2].text.split(',')
        state, zip_code = state_zip.strip().split(' ')
        phone = store[3].text

        temp_df = pd.DataFrame([[base_url,name,address,city,state, zip_code,'US','<MISSING>',
                                phone,'<MISSING>','<MISSING>','<MISSING>','<MISSING>']],
                                columns=["locator_domain", "location_name", "street_address", "city", "state", "zip", "country_code",
                                         "store_number", "phone", "location_type", "latitude", "longitude", "hours_of_operation"])

        df = df.append(temp_df).reset_index(drop=True)
    return df

data = fetch_data()
write_output(data)

Output:

print (df.to_string())
         locator_domain          location_name        street_address           city state    zip country_code store_number             phone location_type   latitude  longitude hours_of_operation
0   http://leevers.com/  Colorado Ranch Market   11505 E. Colfax Ave         Aurora    CO  80010           US    <MISSING>  PH: 720-343-2195     <MISSING>  <MISSING>  <MISSING>          <MISSING>
1   http://leevers.com/             Save-A-Lot    4255 W Florida Ave         Denver    CO  80219           US    <MISSING>  PH: 303-935-0880     <MISSING>  <MISSING>  <MISSING>          <MISSING>
2   http://leevers.com/             Save-A-Lot      15220 E. 6th Ave         Aurora    CO  80011           US    <MISSING>  PH: 720-343-2011     <MISSING>  <MISSING>  <MISSING>          <MISSING>
3   http://leevers.com/             Save-A-Lot      3045 W. 74th Ave    Westminster    CO  80030           US    <MISSING>  PH: 303-339-2610     <MISSING>  <MISSING>  <MISSING>          <MISSING>
4   http://leevers.com/             Save-A-Lot    1110 Bonforte Blvd         Pueblo    CO  81001           US    <MISSING>  PH: 719-544-6057     <MISSING>  <MISSING>  <MISSING>          <MISSING>
5   http://leevers.com/             Save-A-Lot          698 Peria St         Aurora    CO  80011           US    <MISSING>  PH: 303-365-0393     <MISSING>  <MISSING>  <MISSING>          <MISSING>
6   http://leevers.com/             Save-A-Lot         4860 Pecos St         Denver    CO  80221           US    <MISSING>  PH: 720-235-3900     <MISSING>  <MISSING>  <MISSING>          <MISSING>
7   http://leevers.com/             Save-A-Lot      2630 W. 38th Ave         Denver    CO  80211           US    <MISSING>  PH: 303-433-4405     <MISSING>  <MISSING>  <MISSING>          <MISSING>
8   http://leevers.com/             Save-A-Lot       405 S Circle Dr  Colo. Springs    CO  80910           US    <MISSING>  PH: 719-520-5620     <MISSING>  <MISSING>  <MISSING>          <MISSING>
9   http://leevers.com/             Save-A-Lot       1750 N. Main St       Longmont    CO  80501           US    <MISSING>  PH: 720-864-8060     <MISSING>  <MISSING>  <MISSING>          <MISSING>
10  http://leevers.com/             Save-A-Lot       630 W. 84th Ave       Thornton    CO  80260           US    <MISSING>  PH: 303-468-6290     <MISSING>  <MISSING>  <MISSING>          <MISSING>
11  http://leevers.com/             Save-A-Lot  1951 S. Federal Blvd         Denver    CO  80219           US    <MISSING>  PH: 303-407-0430     <MISSING>  <MISSING>  <MISSING>          <MISSING>
12  http://leevers.com/             Save-A-Lot        7290 Manaco St  Commerce City    CO  80022           US    <MISSING>  PH: 303-288-1747     <MISSING>  <MISSING>  <MISSING>          <MISSING>
13  http://leevers.com/             Save-A-Lot    6601 W. Colfax Ave       Lakewood    CO  80214           US    <MISSING>  PH: 303-468-6290     <MISSING>  <MISSING>  <MISSING>          <MISSING>
14  http://leevers.com/             Save-A-Lot           816 25th St        Greeley    CO  80631           US    <MISSING>  PH: 970-356-7498     <MISSING>  <MISSING>  <MISSING>          <MISSING>