How to find the most common pair of values in given data with Python

Sam • 6 years ago • 1937 views

I have data like this:

Start Time         End Time       Trip Duration    Start Station   End Station 
01/01/17 15:09    01/01/17 15:14     321           A               B
01/02/17 15:09    01/02/17 15:14     321           C               D
12/03/17 15:09    12/03/17 15:14     321           E               F
05/01/17 15:09    05/01/17 15:14     321           B               D
17/02/17 15:09    17/02/17 15:14     321           A               B
12/04/17 15:09    12/04/17 15:14     321           E               H
13/05/17 15:09    13/05/17 15:14     321           S               K
17/01/17 15:09    17/01/17 15:14     321           A               B

With the following code I can find the most common start station:

start_station = filtered['Start Station'].mode()[0]

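I need to find the most common trip, i.e. the combination of start station and end station that occurs most often. For the data above, the most common trip should be between A and B.

Can someone tell me how to find the most common trip?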

Article URL: http://www.python88.com/topic/38384

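Replies [ 4 ] | Latest reply 6 years ago

• Reply #1

Humpe 7 years ago

Take a look at the pandas documentation on GroupBy: split-apply-combine.

It gives you a wide range of aggregation functions.

Using GroupBy (the column names here use underscores instead of the spaces in the question, and the output also shows the trip_id helper column created in the second approach below):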

import pandas as pd

counts = df.groupby(["Start_Station","End_Station"]).count()

print(counts)

                           Start_Time  End_Time  Trip_Duration  trip_id
Start_Station End_Station                                              
A             B                     3         3              3        3
B             D                     1         1              1        1
C             D                     1         1              1        1
E             F                     1         1              1        1
              H                     1         1              1        1
S             K                     1         1              1        1

Using value_counts and a helper column:

import pandas as pd

df["trip_id"] = df.Start_Station + df.End_Station

counts = df["trip_id"].value_counts()

print(counts)

AB    3
BD    1
EH    1
SK    1
EF    1
CD    1
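To make the GroupBy approach easy to reproduce, here is a minimal sketch that rebuilds the station columns from the question (with the question's space-separated column names; the timestamp and duration columns are left out for brevity, which is a simplification of this sketch) and reads off the top pair with `idxmax`:

```python
import pandas as pd

# Station columns from the question's sample data; other columns are omitted.
df = pd.DataFrame({
    "Start Station": ["A", "C", "E", "B", "A", "E", "S", "A"],
    "End Station":   ["B", "D", "F", "D", "B", "H", "K", "B"],
})

# size() counts the rows in each (start, end) group and returns a Series
# indexed by the pair.
pair_counts = df.groupby(["Start Station", "End Station"]).size()

# idxmax() returns the index label of the largest count, here a tuple.
most_common_trip = pair_counts.idxmax()
print(most_common_trip, pair_counts.max())
```

Unlike `count()`, `size()` yields a single number per pair rather than one column per remaining field, so there is nothing redundant to read past.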

• Reply #2

Edgar Ramírez Mondragón 7 years ago

I would do it like this:

# mode() returns a Series; [0] picks the first (most frequent) value
trip = (filtered["Start Station"] + " -> " + filtered["End Station"]).mode()[0]
# 'A -> B'
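One caveat with concatenation is that the result is a single string, so the two station names have to be split apart again if they are needed separately. A possible alternative, assuming pandas 1.1 or later (where `DataFrame.value_counts` is available), keeps the pair as a tuple. The `filtered` frame below is rebuilt from the question's sample data for illustration:

```python
import pandas as pd

# Stand-in for the question's `filtered` DataFrame (station columns only).
filtered = pd.DataFrame({
    "Start Station": ["A", "C", "E", "B", "A", "E", "S", "A"],
    "End Station":   ["B", "D", "F", "D", "B", "H", "K", "B"],
})

# Counts each distinct row of the two columns, sorted from most to least frequent.
pair_counts = filtered[["Start Station", "End Station"]].value_counts()

# The first index entry is the most frequent (start, end) tuple.
start, end = pair_counts.index[0]
print(start, end, pair_counts.iloc[0])
```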

• Reply #3

Rudolf Morkovskyi 7 years ago

Do you need the counts? Then try this:

import pandas as pd

df = pd.DataFrame({'Start':['A','B','C','D','A'],'End':['B']*5,'Trip Duration':[321]*5})
# Count trips per (Start, End) pair, most frequent pair first
df.groupby(['Start','End'])['Trip Duration'].count().sort_values(ascending=False, na_position='first')

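• Reply #4

jezrael 7 years ago

Use GroupBy.size with nlargest, or sort_values with iloc to select the last value.

The function remove_unused_levels removes MultiIndex level values that no longer appear in the filtered Series.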

a = (df.groupby(['Start Station','End Station'])
       .size()
       .nlargest(1)
       .index.remove_unused_levels()
       .tolist()
     )

Or:

a = (df.groupby(['Start Station','End Station'])
       .size()
       .sort_values()
       .iloc[[-1]]
       .index.remove_unused_levels()
       .tolist()
       )

print(a)
[('A', 'B')]
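To see what `remove_unused_levels` changes, the following sketch (using the question's station columns as sample data) compares the index levels before and after the call. The tuples returned by `tolist()` are the same either way; the call only prunes the level metadata stored on the MultiIndex.

```python
import pandas as pd

df = pd.DataFrame({
    "Start Station": ["A", "C", "E", "B", "A", "E", "S", "A"],
    "End Station":   ["B", "D", "F", "D", "B", "H", "K", "B"],
})

top = df.groupby(["Start Station", "End Station"]).size().nlargest(1)

# After filtering, the MultiIndex may still carry every start station
# that groupby saw, not just the one that remains.
print(list(top.index.levels[0]))

# remove_unused_levels() keeps only level values that are actually used.
pruned = top.index.remove_unused_levels()
print(list(pruned.levels[0]))

# The (start, end) tuples themselves are unaffected.
print(top.index.tolist() == pruned.tolist())
```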

If you need the output as a DataFrame:

df1 = (df.groupby(['Start Station','End Station'])
       .size()
       .reset_index(name='count')
       .nlargest(1, 'count')[['Start Station','End Station']]
)
print (df1)
  Start Station End Station
0             A           B
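Note that `nlargest(1, ...)` returns a single row even when several trips share the highest count. If ties matter, one option is the `keep="all"` argument of `DataFrame.nlargest`. The data below is hypothetical and was chosen so that two pairs tie:

```python
import pandas as pd

# Hypothetical data in which A -> B and C -> D both occur twice.
df = pd.DataFrame({
    "Start Station": ["A", "C", "A", "C", "E"],
    "End Station":   ["B", "D", "B", "D", "F"],
})

counts = (df.groupby(["Start Station", "End Station"])
            .size()
            .reset_index(name="count"))

# keep="all" retains every row tied for the largest count,
# so the result can have more than one row.
top = counts.nlargest(1, "count", keep="all")
print(top)
```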
