Py学习  »  Python

Python Dataframe:基于特定条件删除重复项

est • 4 年前 • 867 次点击  

具有重复的车间ID的数据帧,其中某些车间ID发生两次,而有些发生三次:
我只想根据分配给其区域的最短店铺距离保留唯一的店铺ID。

    Area  Shop Name  Shop Distance  Shop ID   

0   AAA   Ly         86             5d87790c46a77300
1   AAA   Hi         230            5ce5522012138400
2   BBB   Hi         780            5ce5522012138400
3   CCC   Ly         450            5d87790c46a77300
...
91  MMM   Ju         43             4f76d0c0e4b01af7
92  MMM   Hi         1150           5ce5522012138400
...

使用pandas drop_duplicates drop the row duplicates,但条件基于第一个/最后一个出现的店铺ID,这不允许我按距离排序:

shops_df = shops_df.drop_duplicates(subset='Shop ID', keep= 'first')

我也试着按商店ID分组然后排序,但是sort返回错误:重复

bbtshops_new['C'] = bbtshops_new.groupby('Shop ID')['Shop ID'].cumcount()
bbtshops_new.sort_values(by=['C'], axis=1)

到目前为止,我试着一直做到这个阶段:

# filter all the duplicates into a new df
df_toclean = shops_df[shops_df['Shop ID'].duplicated(keep= False)]

# create a mask for all unique Shop ID
mask = df_toclean['Shop ID'].value_counts()

# create a mask for the Shop ID that occurred 2 times
shop_2 = mask[mask==2].index

# create a mask for the Shop ID that occurred 3 times
shop_3 = mask[mask==3].index

# create a mask for the Shops that are under radius 750 
dist_1 = df_toclean['Shop Distance']<=750

# returns results for all the Shop IDs that appeared twice and under radius 750
bbtshops_2 = df_toclean[dist_1 & df_toclean['Shop ID'].isin(shop_2)]

* if i use df_toclean['Shop Distance'].min() instead of dist_1 it returns 0 results

我想我已经做了很长的一段时间了,但仍然没有找到删除副本的方法,有谁知道如何用更短的方式解决这个问题?我是python新手,谢谢你的帮助!

Python社区是高质量的Python/Django开发社区
本文地址:http://www.python88.com/topic/50769
 
867 次点击  
文章 [ 2 ]  |  最新文章 4 年前