
jezrael

jezrael's recent replies
3 years ago
Replied to a topic created by jezrael » Python: unable to successfully drop rows containing only whitespace values

Replace values that are empty or contain only whitespace with missing values, then drop the rows:

df = df.replace(r'^\s*$', np.nan, regex=True).dropna(subset=["NUMBER","STREET","POSTCODE"])

Or, if you need to drop rows by the values in certain columns without changing the original DataFrame's values:

cols = ["NUMBER","STREET","POSTCODE"]
df = df[df[cols].replace(r'^\s*$', np.nan, regex=True).notna().all(axis=1)]
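A minimal runnable sketch of the first approach with hypothetical data (the column names are taken from the question):

import numpy as np
import pandas as pd

df = pd.DataFrame({"NUMBER": ["1", "  ", "3"],
                   "STREET": ["Main", "High", " "],
                   "POSTCODE": ["111", "222", "333"]})

df = df.replace(r'^\s*$', np.nan, regex=True).dropna(subset=["NUMBER","STREET","POSTCODE"])
print (df)
  NUMBER STREET POSTCODE
0      1   Main      111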
3 years ago
Replied to a topic created by jezrael » Python DataFrame: mapping an index to a column of lists to extract elements

To test for non-missing values, use notna and convert the index to an integer:

df['A_element'] = [a[int(i)] if pd.notna(i) else np.nan
                               for a, i in zip(df['A'], df['match_idx'])]

Or:

df['A_element'] = df.apply(lambda x: x['A'][int(x['match_idx'])] 
                           if pd.notna(x['match_idx']) else np.nan,axis=1)

print (df)
             A  match_idx  A_element
0       [7, 8]        1.0        8.0
1  [4, 5, nan]        0.0        4.0
2  [nan, 1, 9]        NaN        NaN
3 years ago
Replied to a topic created by jezrael » Python: optimizing a loop over a DataFrame with max and min values

Use numpy.select, and instead of max use numpy.maximum:

m1 = df['Forecast'] == 0
m2 = df['def'] == 1
m3 = df['def'] == 0

s1 = df['Qty'].clip(lower=0)
s3 = round(np.maximum(df['Qty'] - (np.maximum(df['Forecast_total']*14,(df['Qty_12m_1']+df['Qty_12m_2'])) * np.maximum(1, (df['Total']/df['Forecast'])/54)),0 ))

df['Calc2'] = np.select([m1, m2, m3], [s1, 0, s3], default=None)
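Since the question's full data is not shown here, a reduced sketch of the numpy.select pattern with invented values and a simplified third branch:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Forecast': [0, 5, 5],
                   'def': [0, 1, 0],
                   'Qty': [-3, 10, 10]})

m1 = df['Forecast'] == 0
m2 = df['def'] == 1
m3 = df['def'] == 0

s1 = df['Qty'].clip(lower=0)
df['Calc2'] = np.select([m1, m2, m3], [s1, 0, df['Qty']], default=None)
print (df)
   Forecast  def  Qty Calc2
0         0    0   -3     0
1         5    1   10     0
2         5    0   10    10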
3 years ago
Replied to a topic created by jezrael » Calculating monthly averages of runoff data with Python

Use month periods generated from the DatetimeIndex, to avoid losing the year information for the months:

n = df.groupby(df.index.to_period('m'))[['Flow_1', 'Flow_2']].mean()

Or:

n = df.groupby(pd.Grouper(freq='m'))[['Flow_1', 'Flow_2']].mean()

Your solution should be changed to:

n = df.groupby(['Year', 'Month'])[['Flow_1', 'Flow_2']].mean()
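For reference, a small sketch of the period-based approach with made-up daily flow data:

import pandas as pd

rng = pd.date_range('2020-01-01', periods=90, freq='D')
df = pd.DataFrame({'Flow_1': range(90), 'Flow_2': range(90, 180)}, index=rng)

n = df.groupby(df.index.to_period('m'))[['Flow_1', 'Flow_2']].mean()
print (n)
         Flow_1  Flow_2
2020-01    15.0   105.0
2020-02    45.0   135.0
2020-03    74.5   164.5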
4 years ago
Replied to a topic created by jezrael » Python: change column values - every row at once

Use DataFrame.mask with DataFrame.lt and DataFrame.gt comparisons chained by | for bitwise OR:

df = df.mask(df.lt(lower_df['l'], axis=0) | df.gt(upper_df['u'], axis=0), 0)
print (df)
             A   B   C   D
06/09/2019  56   0  33   0
06/10/2019  54  66  47  23
06/11/2019   0   0  43  19
06/12/2019  21  38   0   0

Use crosstab and limit the counts to 1 with DataFrame.clip:

df1  = (pd.crosstab(df['ID'], df['value'])
          .clip(upper=1)
          .reset_index()
          .rename_axis(None, axis=1))
print (df1)
   ID  A  B  C
0   1  1  1  1
1   2  0  1  0
2   4  1  0  1
3  10  0  0  1
3 years ago
Replied to a topic created by jezrael » Grouping data series by timestamp in a DataFrame (python)

Use a list comprehension to create the Series, join them with concat, and finally convert to datetime if necessary:

print (df)
                                 redacted_name_1  \
0     [1628377576.0,1628377939.98,1628377942.04]   
1  [295.257080078125,295.1255187988281,295.2570]   

                                 redacted_name_2  
0  [1628377494.927,1628377855.377,1628377957.39]  
1                      [9.3e-09,9.3e-09,9.2e-09] 

#if necessary convert strings to lists
#import ast
#df = df.applymap(ast.literal_eval)

L = [pd.Series(df.loc[1, x], index=df.loc[0, x], name=x) for x in df.columns]
df = pd.concat(L,axis=1)

df.index = np.round(df.index).astype(np.int64)
print (df)
            redacted_name_1  redacted_name_2
1628377495              NaN     9.300000e-09
1628377576       295.257080              NaN
1628377855              NaN     9.300000e-09
1628377940       295.125519              NaN
1628377942       295.257000              NaN
1628377957              NaN     9.200000e-09

Or:

df.index = pd.to_datetime(np.round(df.index).astype(np.int64), unit='s')
print (df)
                     redacted_name_1  redacted_name_2
2021-08-07 23:04:55              NaN     9.300000e-09
2021-08-07 23:06:16       295.257080              NaN
2021-08-07 23:10:55              NaN     9.300000e-09
2021-08-07 23:12:20       295.125519              NaN
2021-08-07 23:12:22       295.257000              NaN
2021-08-07 23:12:37              NaN     9.200000e-09

One approach is to convert the datetimes to their string representation and then replace the missing values with the original values:

df['col1'] = (pd.to_datetime(df['col1'], errors='coerce').dt.strftime('%Y-%m-%d')
                .fillna(df['col1']))

Or convert the datetimes to python dates:

df['col1'] = pd.to_datetime(df['col1'], errors='coerce').dt.date.fillna(df['col1'])

print (df)
         col1
0  2018-01-30
1         Yes
2            
3         NaN
4        None
5 years ago
Replied to a topic created by jezrael » Python logging does not log pd.info()

Change the buffer parameter of DataFrame.info to a StringIO and read it back with .getvalue():

from io import StringIO
buf = StringIO()
df.info(buf=buf)

logger.info(type(df))    
logger.info(buf.getvalue())
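A self-contained sketch, with the logger configuration assumed (logging.basicConfig is one possible setup):

import logging
from io import StringIO
import pandas as pd

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

df = pd.DataFrame({'a': [1, 2, 3]})

buf = StringIO()
df.info(buf=buf)
logger.info(buf.getvalue())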
5 years ago
Replied to a topic created by jezrael » Python error: time data does not match format

Use %Y for matching YYYY instead of %y, which matches YY:

df.Timestamp = pd.to_datetime(df.Datetime,format='%d-%m-%Y %H:%M') 

Then create a DatetimeIndex:

df = df.set_index('Timestamp').resample('W').mean()
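A hypothetical sample in DD-MM-YYYY HH:MM format to demonstrate the parsing step:

import pandas as pd

df = pd.DataFrame({'Datetime': ['25-08-2018 00:00', '26-08-2018 12:30']})
df['Timestamp'] = pd.to_datetime(df.Datetime, format='%d-%m-%Y %H:%M')
print (df['Timestamp'])
0   2018-08-25 00:00:00
1   2018-08-26 12:30:00
Name: Timestamp, dtype: datetime64[ns]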
5 years ago
Replied to a topic created by jezrael » In python pandas, merge multiple values of a column after group by into one column

Using GroupBy.agg with a lambda function is the most general solution:

df1 = df.groupby(['ID1','ID2'])['Status'].agg(lambda x: ','.join(x.dropna())).reset_index()
print (df1)
   ID1 ID2 Status
0    1   a      1
1    2   b      1
2    3   c    2,1

Another idea is to drop the missing values in a first step, but if some group has only missing values it is removed from the output, so it is necessary to add it back with merge:

#first group with only NaNs
df= pd.DataFrame({'ID1' : [1,1,2,2,3,3,3],'ID2' : ['a','a','b','b','c','c','c'],
             'Status' : pd.Series([np.nan,np.nan, np.nan,'1','2',np.nan,'1'], 
                                  dtype="category")})


#first group is dropped
df11 = (df.dropna(subset=['Status'])
          .groupby(['ID1','ID2'])['Status']
          .agg(','.join)
          .reset_index())
print (df11)
   ID1 ID2 Status
0    2   b      1
1    3   c    2,1

#added missing pairs and `NaN`s converted to empty strings:
df2 = df.drop_duplicates(['ID1','ID2'])[['ID1','ID2']].merge(df11, how='left').fillna('')
print (df2)
   ID1 ID2 Status
0    1   a       
1    2   b      1
2    3   c    2,1

First solution:

df1 = df.groupby(['ID1','ID2'])['Status'].agg(lambda x: ','.join(x.dropna())).reset_index()
print (df1)
   ID1 ID2 Status
0    1   a       
1    2   b      1
2    3   c    2,1
5 years ago
Replied to a topic created by jezrael » How to combine rows of a df using python?

The idea is to create a helper Series for grouping.

Test for NaNs with DataFrame.isna and DataFrame.any, create a Series with the constructor, replace non-matching values with NaN via Series.where, and back fill with limit=1 so that only the one row above is replaced.

Finally replace all missing values with empty strings, group, and aggregate with join:

m = df.isna().any(axis=1)
s = pd.Series(np.arange(len(m)), index=df.index)
g = s.where(m).bfill(limit=1).fillna(s)

df = df.fillna('').groupby(g).agg(''.join).reset_index(drop=True)
print (df)
    0  1  2  3   4
0  dd  t  u  y  eo
1   y  p  p  w   r
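To make the grouping visible, here are the intermediates for a small hypothetical frame consistent with the output above:

import numpy as np
import pandas as pd

df = pd.DataFrame([['dd', 't', 'u', 'y', 'e'],
                   [np.nan, np.nan, np.nan, np.nan, 'o'],
                   ['y', 'p', 'p', 'w', 'r']])

m = df.isna().any(axis=1)
s = pd.Series(np.arange(len(m)), index=df.index)
g = s.where(m).bfill(limit=1).fillna(s)
print (g.tolist())
[1.0, 1.0, 2.0]

Rows 0 and 1 share group 1.0, so they are joined into one row.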

Use:

#first sorting data if necessary
df1 = df.sort_values('count', ascending=False)

#then get top 4 rows
df2 = df1.head(4)
#filter column `count` for all values after 4 rows
summed = df1.loc[df1.index[4:], 'count'].sum()

#create DataFrame by another counts
df3 = pd.DataFrame({'useragent':['Other'], 'count':[summed]})

#join together
df4 = pd.concat([df2, df3], sort=False, ignore_index=True)
print (df4)
  useragent  count
0    iPhone  11298
1       Mac   3206
2      iPad    627
3  SM-N960F    433
4     Other    435

#filter by threshold
mask = df['count'] > 500
#filtered rows by boolean indexing
df2 = df[mask]
#inverted mask - sum by count
summed = df.loc[~mask, 'count'].sum()
#same like above
df3 = pd.DataFrame({'useragent':['Other'], 'count':[summed]})

df5 = pd.concat([df2, df3], sort=False, ignore_index=True)
print (df5)
  useragent  count
0    iPhone  11298
1       Mac   3206
2      iPad    627
3     Other    868
5 years ago
Replied to a topic created by jezrael » I want to change string values in a DataFrame (Python)

Use Series.str.strip to remove " characters if present:

df['id'] = df['id'].str.strip('"')

Or, if necessary, remove ' characters:

df['id'] = df['id'].str.strip("'")
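A tiny demo with made-up values:

import pandas as pd

df = pd.DataFrame({'id': ['"a1"', '"b2"']})
df['id'] = df['id'].str.strip('"')
print (df)
   id
0  a1
1  b2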
6 years ago
Replied to a topic created by jezrael » How to add a rolling 24-hour time using Python and Pandas

I suggest working with Timedeltas:

df = pd.read_csv("stackoverflow.txt", header=None)

Convert the column with to_timedelta, then take the differences, compare them with Timedelta(0), and add one day (pd.Timedelta(24, 'h')) after the rollover:

td = pd.to_timedelta(df[0])
df['new'] = td.mask(td.diff().lt(pd.Timedelta(0)).cumsum().gt(0), td + pd.Timedelta(1, 'days'))
df['newint'] = (df['new'].dt.total_seconds() * 1000).astype(int)

print (df)
               0                    new    newint
0   23:58:03.458 0 days 23:58:03.458000  86283458
1   23:58:13.446 0 days 23:58:13.446000  86293446
2   23:58:23.447 0 days 23:58:23.447000  86303447
3   23:58:33.440 0 days 23:58:33.440000  86313440
4   23:58:43.440 0 days 23:58:43.440000  86323440
5   23:58:53.440 0 days 23:58:53.440000  86333440
6   23:59:03.434 0 days 23:59:03.434000  86343434
7   23:59:13.435 0 days 23:59:13.435000  86353435
8   23:59:23.428 0 days 23:59:23.428000  86363428
9   23:59:33.428 0 days 23:59:33.428000  86373428
10  23:59:43.429 0 days 23:59:43.429000  86383429
11  23:59:53.435 0 days 23:59:53.435000  86393435
12  00:00:03.429 1 days 00:00:03.429000  86403429
13  00:00:13.423 1 days 00:00:13.423000  86413423
14  00:00:23.417 1 days 00:00:23.417000  86423417
15  00:00:33.411 1 days 00:00:33.411000  86433411
16  00:00:43.418 1 days 00:00:43.418000  86443418
17  00:00:53.411 1 days 00:00:53.411000  86453411
18  00:01:03.405 1 days 00:01:03.405000  86463405
19  00:01:13.406 1 days 00:01:13.406000  86473406
20  00:01:23.400 1 days 00:01:23.400000  86483400
21  00:01:33.406 1 days 00:01:33.406000  86493406
22  00:01:43.400 1 days 00:01:43.400000  86503400
23  00:01:53.411 1 days 00:01:53.411000  86513411
24  00:02:03.400 1 days 00:02:03.400000  86523400
25  00:02:13.406 1 days 00:02:13.406000  86533406
26  00:02:23.394 1 days 00:02:23.394000  86543394
27  00:02:33.400 1 days 00:02:33.400000  86553400
28  00:02:43.394 1 days 00:02:43.394000  86563394

This solution also works for multi-day data - for the first rollover add 1 day, for the next one 2 days, and so on.

Create the differences, take the cumulative sum, convert the output to day timedeltas, and add it to the original data:

print (df)
               0
0   23:59:23.428
1   23:59:33.428
2   23:59:43.429
3   23:59:53.435
4   00:00:03.429
5   00:00:13.423
6   00:00:23.417
7   00:00:33.411
8   23:59:23.428
9   23:59:33.428
10  23:59:43.429
11  23:59:53.435
12  00:00:03.429
13  00:00:13.423
14  00:00:23.417
15  00:00:33.411

td = pd.to_timedelta(df[0])
days = pd.to_timedelta(td.diff().lt(pd.Timedelta(0)).cumsum(), unit='d')

df['new'] = td + days
df['newint'] = (df['new'].dt.total_seconds() * 1000).astype(int)
print (df)
               0                    new     newint
0   23:59:23.428 0 days 23:59:23.428000   86363428
1   23:59:33.428 0 days 23:59:33.428000   86373428
2   23:59:43.429 0 days 23:59:43.429000   86383429
3   23:59:53.435 0 days 23:59:53.435000   86393435
4   00:00:03.429 1 days 00:00:03.429000   86403429
5   00:00:13.423 1 days 00:00:13.423000   86413423
6   00:00:23.417 1 days 00:00:23.417000   86423417
7   00:00:33.411 1 days 00:00:33.411000   86433411
8   23:59:23.428 1 days 23:59:23.428000  172763428
9   23:59:33.428 1 days 23:59:33.428000  172773428
10  23:59:43.429 1 days 23:59:43.429000  172783429
11  23:59:53.435 1 days 23:59:53.435000  172793435
12  00:00:03.429 2 days 00:00:03.429000  172803429
13  00:00:13.423 2 days 00:00:13.423000  172813423
14  00:00:23.417 2 days 00:00:23.417000  172823417
15  00:00:33.411 2 days 00:00:33.411000  172833411

EDIT:

First get the differences with diff:

print (td.diff())
0                         NaT
1                    00:00:10
2             00:00:10.001000
3             00:00:10.006000
4    -1 days +00:00:09.994000
5             00:00:09.994000
6             00:00:09.994000
7             00:00:09.994000
8             23:58:50.017000
9                    00:00:10
10            00:00:10.001000
11            00:00:10.006000
12   -1 days +00:00:09.994000
13            00:00:09.994000
14            00:00:09.994000
15            00:00:09.994000
Name: 0, dtype: timedelta64[ns]

Then compare with lt (<) to find negative timedeltas:

print (td.diff().lt(pd.Timedelta(0)))
0     False
1     False
2     False
3     False
4      True
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12     True
13    False
14    False
15    False
Name: 0, dtype: bool

Take the cumulative sum with cumsum:

print (td.diff().lt(pd.Timedelta(0)).cumsum())
0     0
1     0
2     0
3     0
4     1
5     1
6     1
7     1
8     1
9     1
10    1
11    1
12    2
13    2
14    2
15    2
Name: 0, dtype: int32

Finally convert to day timedeltas:

days = pd.to_timedelta(td.diff().lt(pd.Timedelta(0)).cumsum(), unit='d')

print (days)
0    0 days
1    0 days
2    0 days
3    0 days
4    1 days
5    1 days
6    1 days
7    1 days
8    1 days
9    1 days
10   1 days
11   1 days
12   2 days
13   2 days
14   2 days
15   2 days
Name: 0, dtype: timedelta64[ns]

The same idea can be used in your solution:

...
df['Total Time(ms)'] = (df['Hours']*3600000 + df['Minutes']*60000 +
                        df['Seconds']*1000 + df['Milliseconds'])

s = df['Total Time(ms)'].diff().lt(0).cumsum() * 24 * 60 * 60 * 1000
df['newint'] = s + df['Total Time(ms)']

print (df)
               0  Hours  Minutes  Seconds  Milliseconds  Total Time(ms)  \
0   23:59:23.428     23       59       23           428        86363428   
1   23:59:33.428     23       59       33           428        86373428   
2   23:59:43.429     23       59       43           429        86383429   
3   23:59:53.435     23       59       53           435        86393435   
4   00:00:03.429      0        0        3           429            3429   
5   00:00:13.423      0        0       13           423           13423   
6   00:00:23.417      0        0       23           417           23417   
7   00:00:33.411      0        0       33           411           33411   
8   23:59:23.428     23       59       23           428        86363428   
9   23:59:33.428     23       59       33           428        86373428   
10  23:59:43.429     23       59       43           429        86383429   
11  23:59:53.435     23       59       53           435        86393435   
12  00:00:03.429      0        0        3           429            3429   
13  00:00:13.423      0        0       13           423           13423   
14  00:00:23.417      0        0       23           417           23417   
15  00:00:33.411      0        0       33           411           33411   

       newint  
0    86363428  
1    86373428  
2    86383429  
3    86393435  
4    86403429  
5    86413423  
6    86423417  
7    86433411  
8   172763428  
9   172773428  
10  172783429  
11  172793435  
12  172803429  
13  172813423  
14  172823417  
15  172833411 
5 years ago
Replied to a topic created by jezrael » Filling NaN with the value of the longest column in Python

The idea is to get the string lengths via DataFrame.apply with Series.str.len, keep only the longest values with DataFrame.where, back fill the missing values by position, and finally take the first column:

df1 = df.apply(lambda x: x.str.len())

df['v5'] = df.where(df1.eq(df1.max(axis=1), axis=0)).bfill(axis=1).iloc[:, 0]
print (df)
     v1   v2   v3    v4     v5
0     a   ab  abc  abcd  abcde
1  abcd  abc   ab   NaN   abcd
2     a  abc   ac   cde    abc
3   cde    c  NaN    cd    cde
5 years ago
Replied to a topic created by jezrael » Python: pandas DataFrame splitting strings in each row, with occasional empty rows

Because | is a special regex character it needs to be escaped with \ in Series.str.split; then get the lengths via Series.str.len, replace the missing values, and convert to integers:

df["num_attr"] = df["attr"].str.split("\|\|").str.len().fillna(0).astype(int)
print (df)
               attr  num_attr
0              valA         1
1               NaN         0
2  valA||valB||valC         3
3         vaB||valC         2

Another similar solution with Series.str.count plus 1:

df["num_attr"] = df["attr"].str.count("\|\|").add(1).fillna(0).astype(int)

If you want to use your solution, add an if-else statement with pandas.notna:

df["num_attr"] = df["attr"].map(lambda a: len(a.split("||")) if pd.notna(a) else 0)

Or:

df["num_attr"] = df["attr"].map(lambda a: a.count("||") + 1 if pd.notna(a) else 0)
5 years ago
Replied to a topic created by jezrael » Creating a column based on column text in Pandas Python

Use numpy.select with Series.str.startswith:

df['Company'] = np.select([df.Name.str.startswith('A'), 
                           df.Name.str.startswith('B')], 
                           ['Alpha', 'Bravo'], 
                           default='Other')
print (df)
    Name Company
0  A1001   Alpha
1  A1002   Alpha
2  B1001   Bravo
3  C1001   Other
4  A1003   Alpha
5  B1002   Bravo
6  B1003   Bravo
7  C1002   Other
8  D1001   Other
9  D1002   Other
6 years ago
Replied to a topic created by jezrael » Converting a dataframe into a list with python

First use dropna with the column outlier specified, to check for NaNs:

L = df.dropna(subset=['outlier']).values.tolist()
print (L)
[['12:28:31', 3, 3, 3.0], ['14:28:31', 6, 7, 7.0], ['14:28:31', 4, 9, 9.0]]

Use Index.difference for sum over all columns not specified in the list, and mean for the columns in the list, with Series.append for joining:

cols = ["Repeat Rate", "NewCust Rate"]
diff_cols = Table2.columns.difference(cols)
Table2.loc['Total'] = Table2[diff_cols].sum().append(Table2[cols].mean())

Use DataFrame.to_json with the parameters orient='records' and lines=True:

df.to_json(file, orient='records', lines=True)
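A sketch of the resulting line-delimited JSON for a hypothetical frame (with no file argument, to_json returns the string):

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})
print (df.to_json(orient='records', lines=True))
{"a":1,"b":"x"}
{"a":2,"b":"y"}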

Use applymap with dict.get over a DataFrame for the colors and pass it to Styler.apply:

df1 = pd.DataFrame({
         'B':[4,5,4],
         'C':[7,8,9],
         'D':[1,3,5],
})

df2 = pd.DataFrame({
         'B':[1,np.nan,4],
         'C':[np.nan,2,np.nan],
         'D':[1,3,np.nan],
})

def apply_color(x):
    colors = {1: 'green',2: 'blue', 3: 'yellow', 4: 'orange', 5: 'grey'}
    return df2.applymap(lambda val: 'background-color: {}'.format(colors.get(val,'')))

df1.style.apply(apply_color, axis=None)

(image: screenshot of the styled DataFrame output)

7 years ago
Replied to a topic created by jezrael » How to perform comparison operations on time with python [duplicate]

You can convert the timedeltas to seconds with total_seconds and compare against a scalar:

df = df[df['col'].dt.total_seconds() < 30]

Or compare against a Timedelta:

df = df[df['col'] < pd.Timedelta(30, unit='s')]

Sample:

df = pd.DataFrame({'col':pd.to_timedelta(['25:10:01','00:01:20','00:00:20'])})
print (df)
              col
0 1 days 01:10:01
1 0 days 00:01:20
2 0 days 00:00:20

df = df[df['col'].dt.total_seconds() < 30]
print (df)
       col
2 00:00:20
6 years ago
Replied to a topic created by jezrael » python: how to replace missing values with the median, column by column

Use:

df = pd.DataFrame({'A': [1, 2, 3,5,7], 'B': [1.45, 2.33, np.nan, np.nan, np.nan], 
                   'C': [4, 5, 6,8,7], 'D': [4.55, 7.36, np.nan,9,10],
                   'E':list('abcde')}) 
print (df)
   A     B  C      D  E
0  1  1.45  4   4.55  a
1  2  2.33  5   7.36  b
2  3   NaN  6    NaN  c
3  5   NaN  8   9.00  d
4  7   NaN  7  10.00  e

def treat_mis_value_nu(df):
    #get only numeric columns to dataframe
    df_nu = df.select_dtypes(include=['number'])
    #get only columns with NaNs
    df_nu = df_nu.loc[:, df_nu.isnull().any()]
    #get columns for remove with mean instead sum/len, it is same
    cols_to_drop = df_nu.columns[df_nu.isnull().mean() <= 0.30]
    #replace missing values of original columns and remove above thresh    
    return df.fillna(df_nu.median()).drop(cols_to_drop, axis=1)

print (treat_mis_value_nu(df))
   A  C      D  E
0  1  4   4.55  a
1  2  5   7.36  b
2  3  6   8.18  c
3  5  8   9.00  d
4  7  7  10.00  e
6 years ago
Replied to a topic created by jezrael » python: groupby and create a new column based on the max of another column [duplicate]

Create a boolean mask by comparing with eq (==) and convert it to integers - 0, 1 for False, True:

s = dframe.groupby("Loc_Id")["Revenue"].transform('max')
dframe["max_value"]= s.eq(dframe["Revenue"]).astype(int)
print (dframe)
   Loc_Id Store  Revenue  max_value
0       1     A       50          0
1       2     B       70          0
2       2     C       45          0
3       1     B       35          0
4       2     D       80          1
5       1     B       70          1
6       3     A       90          1
7       3     C       65          0
6 years ago
Replied to a topic created by jezrael » python - numpy.where and dictionaries

I believe you need to change:

df.replace({'Hr':HourMap})

to map, which returns NaN for values with no match; replace those with the original values using fillna:

df['Hr'].map(HourMap).fillna(df['Hr'])
#alternative solution if performance is not important in large df
#df['Hr'].replace(HourMap)

This is because df.replace returns all columns of the DataFrame, with the column Hr replaced.
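The HourMap from the question is not shown, so a hypothetical mapping to illustrate the fallback behaviour:

import pandas as pd

HourMap = {0: 'night', 6: 'morning', 12: 'noon'}
df = pd.DataFrame({'Hr': [0, 6, 7, 12]})

df['Hr'] = df['Hr'].map(HourMap).fillna(df['Hr'])
print (df)
        Hr
0    night
1  morning
2        7
3     noon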

6 years ago
Replied to a topic created by jezrael » Downsampling in python

If the data file comes from the link, the problem is that some missing values are ?.

So the parameter na_values='?' is necessary:

dataset = pd.read_csv('household_power_consumption.txt', 
                      sep=';', 
                      header=0, 
                      low_memory=False, 
                      infer_datetime_format=True, 
                      parse_dates={'datetime': [0,1]},  #Date and time has been combined
                      index_col=['datetime'],
                      na_values='?') 
print(dataset.head())
                     Global_active_power  Global_reactive_power  Voltage  \
datetime                                                                   
2006-12-16 17:24:00                4.216                  0.418   234.84   
2006-12-16 17:25:00                5.360                  0.436   233.63   
2006-12-16 17:26:00                5.374                  0.498   233.29   
2006-12-16 17:27:00                5.388                  0.502   233.74   
2006-12-16 17:28:00                3.666                  0.528   235.68   

                     Global_intensity  Sub_metering_1  Sub_metering_2  \
datetime                                                                
2006-12-16 17:24:00              18.4             0.0             1.0   
2006-12-16 17:25:00              23.0             0.0             1.0   
2006-12-16 17:26:00              23.0             0.0             2.0   
2006-12-16 17:27:00              23.0             0.0             1.0   
2006-12-16 17:28:00              15.8             0.0             1.0   

                     Sub_metering_3  
datetime                             
2006-12-16 17:24:00            17.0  
2006-12-16 17:25:00            16.0  
2006-12-16 17:26:00            17.0  
2006-12-16 17:27:00            17.0  
2006-12-16 17:28:00            17.0  

print (dataset.info())
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2075259 entries, 2006-12-16 17:24:00 to 2010-11-26 21:02:00
Data columns (total 7 columns):
Global_active_power      float64
Global_reactive_power    float64
Voltage                  float64
Global_intensity         float64
Sub_metering_1           float64
Sub_metering_2           float64
Sub_metering_3           float64
dtypes: float64(7)
memory usage: 126.7 MB
None

dataset=dataset.resample('H', label='left').mean()
print(dataset.head())
                     Global_active_power  Global_reactive_power     Voltage  \
datetime                                                                      
2006-12-16 17:00:00             4.222889               0.229000  234.643889   
2006-12-16 18:00:00             3.632200               0.080033  234.580167   
2006-12-16 19:00:00             3.400233               0.085233  233.232500   
2006-12-16 20:00:00             3.268567               0.075100  234.071500   
2006-12-16 21:00:00             3.056467               0.076667  237.158667   

                     Global_intensity  Sub_metering_1  Sub_metering_2  \
datetime                                                                
2006-12-16 17:00:00         18.100000             0.0        0.527778   
2006-12-16 18:00:00         15.600000             0.0        6.716667   
2006-12-16 19:00:00         14.503333             0.0        1.433333   
2006-12-16 20:00:00         13.916667             0.0        0.000000   
2006-12-16 21:00:00         13.046667             0.0        0.416667   

                     Sub_metering_3  
datetime                             
2006-12-16 17:00:00       16.861111  
2006-12-16 18:00:00       16.866667  
2006-12-16 19:00:00       16.683333  
2006-12-16 20:00:00       16.783333  
2006-12-16 21:00:00       17.216667  
6 years ago
Replied to a topic created by jezrael » Summing three dataframes with python

I believe you need concat with aggregate sum:

df = pd.concat([df1, df2, df3]).groupby('account_number', as_index=False)['hkd_margin'].sum()
print (df)
   account_number    hkd_margin
0              20  2.361950e+05
1              40  1.375593e+07
2              60  2.174999e+07
3              90  4.695600e+04
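A sketch with made-up margins (the real values differ):

import pandas as pd

df1 = pd.DataFrame({'account_number': [20, 40], 'hkd_margin': [100.0, 200.0]})
df2 = pd.DataFrame({'account_number': [20, 60], 'hkd_margin': [50.0, 300.0]})
df3 = pd.DataFrame({'account_number': [40, 90], 'hkd_margin': [25.0, 10.0]})

df = (pd.concat([df1, df2, df3])
        .groupby('account_number', as_index=False)['hkd_margin'].sum())
print (df)
   account_number  hkd_margin
0              20       150.0
1              40       225.0
2              60       300.0
3              90        10.0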
5 years ago
Replied to a topic created by jezrael » python: reading a file with start and stop conditions

Use Series.str.startswith with Series.cumsum for the groups, then aggregate with GroupBy.agg and join:

df1 = (df['col'].groupby(df['col'].str.startswith('SRV').cumsum())
                .agg(' '.join)
                .reset_index(drop=True)
                .to_frame(name='new'))
print (df1)
                                                 new
0                             SRV: this is for bryan
1                             SRV: this is for terry
2  SRV: this is for torain sec01: This is reserve...
3                               SRV: this is for Jun

Details:

print (df['col'].str.startswith('SRV').cumsum())
0    1
1    2
2    3
3    3
4    3
5    3
6    4
Name: col, dtype: int32

For a DataFrame use:

import pandas as pd
from io import StringIO

temp=u"""col
SRV: this is for bryan

SRV: this is for terry

SRV: this is for torain
sec01: This is reserved
sec02: This is open for all
sec03: Closed!

SRV: this is for Jun"""
#after testing replace 'StringIO(temp)' to 'filename.csv'
df = pd.read_csv(StringIO(temp), sep="|")

print (df)
                           col
0       SRV: this is for bryan
1       SRV: this is for terry
2      SRV: this is for torain
3      sec01: This is reserved
4  sec02: This is open for all
5               sec03: Closed!
6         SRV: this is for Jun

Pure python solution:

out = []
with open("file.csv") as f1:
        last = 0
        for i, line in enumerate(f1.readlines()):
            if line.strip().startswith('SRV'):
                last = i
            out.append([line.strip(), last])

from itertools import groupby
from operator import itemgetter

with open("out_file.csv", "w") as f2:
    groups = groupby(out, key=itemgetter(1))
    for _, g in groups:
        gg = list(g)
        h = ' '.join(list(map(itemgetter(0), gg)))
        f2.write('\n' + h)

If you want to use numpy.select, set the last parameter default to the original values:

data['CO BORROWER_STATUS'] = np.select([data['CO BORROWER NAME'] == 'NOT_AVAILABLE'],
                                       ['NOT_AVAILABLE'], 
                                       default=data['CO BORROWER NAME'])

Because there is only one condition, numpy.where is better here:

data['CO BORROWER_STATUS'] = np.where(data['CO BORROWER NAME'] == 'NOT_AVAILABLE',
                                      'NOT_AVAILABLE', 
                                      data['CO BORROWER NAME'])

Use concat with numpy.where:

df = pd.concat([df1, df2], axis=1)
df['result'] = np.where(df['col1'] == df['col2'], 'no change', 'changed')
print (df)
   col1  col2     result
0  audi  audi  no change
1  cars  bike    changed
6 years ago
Replied to a topic created by jezrael » Creating a python list from a dataframe based on conditions

Use crosstab and convert the DataFrame to lists:

df1 = pd.crosstab(df[48], [df[150], df[39]])
#alternative solutions
#df1 = df.groupby([48, 150, 39]).size().unstack(level=[1,2], fill_value=0)
#df1 = df.pivot_table(index=48, columns=[150, 39], aggfunc='size', fill_value=0)
print (df1)
150           0  3  8  F   
39            0  3  8  1  2
48                         
BE0974302342  1  1  0  0  1
FR0000073843  1  1  0  0  1
FR0000076861  1  1  0  1  1
FR0000076887  1  1  1  0  1
FR0000077562  1  1  0  0  1
FR0000079147  1  1  0  1  1
FR0004034072  1  1  1  0  1
FR0004152874  1  1  0  1  1
FR0004178572  1  0  0  0  0

L = df1.reset_index().values.tolist()
print (L)

[['BE0974302342', 1, 1, 0, 0, 1], 
 ['FR0000073843', 1, 1, 0, 0, 1], 
 ['FR0000076861', 1, 1, 0, 1, 1], 
 ['FR0000076887', 1, 1, 1, 0, 1], 
 ['FR0000077562', 1, 1, 0, 0, 1], 
 ['FR0000079147', 1, 1, 0, 1, 1], 
 ['FR0004034072', 1, 1, 1, 0, 1], 
 ['FR0004152874', 1, 1, 0, 1, 1], 
 ['FR0004178572', 1, 0, 0, 0, 0]]

If you need the combinations, convert the MultiIndex in the columns to a list of tuples:

print (df1.columns.tolist())
[('0', 0), ('3', 3), ('8', 8), ('F', 1), ('F', 2)]
6 years ago
Replied to a topic created by jezrael » How to find common data pairs in Python from given data

Use GroupBy.size with nlargest, or sort_values with iloc to select the last value.

The remove_unused_levels function removes the MultiIndex levels for values dropped from the Series.

a = (df.groupby(['Start Station','End Station'])
       .size()
       .nlargest(1)
       .index.remove_unused_levels()
       .tolist()
     )

Or:

a = (df.groupby(['Start Station','End Station'])
       .size()
       .sort_values()
       .iloc[[-1]]
       .index.remove_unused_levels()
       .tolist()
       )

print(a)
[('A', 'B')]

If you need the output as a DataFrame:

df1 = (df.groupby(['Start Station','End Station'])
       .size()
       .reset_index(name='count')
       .nlargest(1, 'count')[['Start Station','End Station']]
)
print (df1)
  Start Station End Station
0             A           B
5 years ago
Replied to a topic created by jezrael » Rounding down a datetime object in python

Use Timestamp.floor:

print (pd.Timestamp('2019-08-16 11:15:00').floor('15min'))
2019-08-16 11:15:00
print (pd.Timestamp('2019-08-16 11:23:00').floor('15min'))
2019-08-16 11:15:00
print (pd.Timestamp('2019-08-16 11:30:00').floor('15min'))
2019-08-16 11:30:00

For testing:

df = pd.DataFrame({'dates':pd.date_range('2009-01-01', freq='T', periods=20)})

df['new'] = df['dates'].dt.floor('15min')
print (df)

0  2009-01-01 00:00:00 2009-01-01 00:00:00
1  2009-01-01 00:01:00 2009-01-01 00:00:00
2  2009-01-01 00:02:00 2009-01-01 00:00:00
3  2009-01-01 00:03:00 2009-01-01 00:00:00
4  2009-01-01 00:04:00 2009-01-01 00:00:00
5  2009-01-01 00:05:00 2009-01-01 00:00:00
6  2009-01-01 00:06:00 2009-01-01 00:00:00
7  2009-01-01 00:07:00 2009-01-01 00:00:00
8  2009-01-01 00:08:00 2009-01-01 00:00:00
9  2009-01-01 00:09:00 2009-01-01 00:00:00
10 2009-01-01 00:10:00 2009-01-01 00:00:00
11 2009-01-01 00:11:00 2009-01-01 00:00:00
12 2009-01-01 00:12:00 2009-01-01 00:00:00
13 2009-01-01 00:13:00 2009-01-01 00:00:00
14 2009-01-01 00:14:00 2009-01-01 00:00:00
15 2009-01-01 00:15:00 2009-01-01 00:15:00
16 2009-01-01 00:16:00 2009-01-01 00:15:00
17 2009-01-01 00:17:00 2009-01-01 00:15:00
18 2009-01-01 00:18:00 2009-01-01 00:15:00
19 2009-01-01 00:19:00 2009-01-01 00:15:00
6 years ago
Replied to a topic created by jezrael » python pandas: dynamic query passed into a function

I think it is necessary to create a boolean mask:

#boolean mask
cond = df.str_column.str.len()>10
#return all values
cond = [True] * len(df)

myfunc(cond)
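One possible shape for myfunc, assuming it filters rows by the mask (the real function is not shown in the question):

import pandas as pd

df = pd.DataFrame({'str_column': ['short', 'a much longer value']})

def myfunc(cond):
    #hypothetical: filter rows of df by the boolean mask
    return df[cond]

cond = df.str_column.str.len() > 10
print (myfunc(cond))
            str_column
1  a much longer value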
7 years ago
Replied to a topic created by jezrael » How to plot df.groupby.value_counts() results with python? [duplicate]

Use rename_axis for the name of the index and reset_index for the counts column:

df = value_counts.rename_axis('unique_values').reset_index(name='counts')
print (df)
   unique_values  counts
0              2       3
1              1       2

Or, if you need a one-column DataFrame, use Series.to_frame:

df = value_counts.rename_axis('unique_values').to_frame('counts')
print (df)
               counts
unique_values        
2                   3
1                   2
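For context, value_counts could come from a Series like this (values assumed so they match the sample output):

import pandas as pd

s = pd.Series([2, 2, 2, 1, 1])
value_counts = s.value_counts()
print (value_counts)
2    3
1    2
dtype: int64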