
Optimizing Python code with numpy and pandas

Yeison H. Arias • 4 years ago • 660 views

I have the following code:

import numpy as np
import pandas as pd
colum1 = [0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05]
colum2 = [1,2,3,4,5,6,7,8,9,10,11,12]
colum3 = [0.85,0.80,0.80,0.80,0.85,0.0,0.0,0.0,0.0,0.0,0.0,0.0]
colum4 = [1743.85, 1485.58, 1250.07, 1021.83, 818.96, 628.05, 455.40, 319.03, 190.86 , 97.07, 26.96 , 0.00]
df = pd.DataFrame({
    'colum1' : colum1,
    'colum2' : colum2,
    'colum3' : colum3,
    'colum4' : colum4,
})

df['result'] = 0
for i in range(len(colum2)):
    df['result'] = np.where(
        df['colum2'] <= 5,
        np.where(
            df['colum2'] == 1,
            df['colum4'],
            np.where(
                ( df['colum4'] - (df['result'].shift(1) * (df['colum1'] * df['colum3'])) )>0,
                ( df['colum4'] - (df['result'].shift(1) * (df['colum1'] * df['colum3'])) ),
                0
            )
        ),
        np.where(
            ( df['colum4'] - (df['result'].shift(1) * df['colum1']) )>0,
            ( df['colum4'] - (df['result'].shift(1) * df['colum1']) ),
            0
        )
    )

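I need to perform the same operation without using a for loop. That would be very helpful, because I am working with thousands of records and this approach is very slow.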

My expected result is as follows:

    colum1  colum2  colum3   colum4       result
0     0.05       1    0.85  1743.85  1743.850000
1     0.05       2    0.80  1485.58  1415.826000
2     0.05       3    0.80  1250.07  1193.436960
3     0.05       4    0.80  1021.83   974.092522
4     0.05       5    0.85   818.96   777.561068
5     0.05       6    0.00   628.05   589.171947
6     0.05       7    0.00   455.40   425.941403
7     0.05       8    0.00   319.03   297.732930
8     0.05       9    0.00   190.86   175.973354
9     0.05      10    0.00    97.07    88.271332
10    0.05      11    0.00    26.96    22.546433
11    0.05      12    0.00     0.00     0.000000
Reply 1 • jpp • 5 years ago

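A first step is to simplify the loop body by replacing the innermost np.where calls with np.maximum. This works because np.where(a > 0, a, 0) is, for our purposes, equivalent to np.maximum(0, a). Keep in mind that the loop variable i is unused, but the repetition is not redundant: every pass reads the previous pass's result through shift(1), so the assignment has to be repeated until the values have propagated down the column.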
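As a quick sanity check of that equivalence, here is a minimal sketch on made-up values (the arrays a and b are illustrative and not taken from the question). It also shows the one case where the two forms differ, NaN input, which is presumably why the equivalence only holds "for our purposes": in the question the only NaN comes from shift(1) in the first row, and that row takes colum4 directly.

```python
import numpy as np

# Illustrative values only; not taken from the question's data.
a = np.array([-2.5, 0.0, 3.0, 7.25])

clipped_where = np.where(a > 0, a, 0)   # keep positives, replace the rest with 0
clipped_max = np.maximum(0, a)          # element-wise maximum against 0
same_on_finite = np.array_equal(clipped_where, clipped_max)

# The two forms disagree on NaN: the comparison NaN > 0 is False,
# so np.where falls back to 0, while np.maximum propagates the NaN.
b = np.array([np.nan, 1.0])
where_nan = np.where(b > 0, b, 0)
max_nan = np.maximum(0, b)

print(same_on_finite)
print(where_nan, max_nan)
```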

At the same time, define the longer expressions separately to make the code readable:

s1 = df['colum4'] - (df['result'].shift(1) * (df['colum1'] * df['colum3']))
s2 = df['colum4'] - (df['result'].shift(1) * df['colum1'])

df['result'] = np.where(df['colum2'] <= 5,
                        np.where(df['colum2'] == 1, df['colum4'],
                                 np.maximum(0, s1)),
                        np.maximum(0, s2))

A next step is to use np.select to remove the nested np.where statements:

m1 = df['colum2'] <= 5
m2 = df['colum2'] == 1

conds = [m1 & m2, m1 & ~m2]
choices = [df['colum4'], np.maximum(0, s1)]

df['result'] = np.select(conds, choices, np.maximum(0, s2))

This version will be easier to manage.
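Because result in each row depends on result in the previous row, a single vectorized assignment starting from result = 0 does not reproduce the expected column by itself. The original loop gets there by repeating the whole-column update once per row, which is roughly quadratic in the number of records. One possible alternative, sketched below, evaluates the recurrence in a single pass over plain NumPy arrays. The helper name sequential_result is hypothetical, and the sketch assumes, as in the sample data, that the first row has colum2 == 1.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'colum1': [0.05] * 12,
    'colum2': list(range(1, 13)),
    'colum3': [0.85, 0.80, 0.80, 0.80, 0.85, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    'colum4': [1743.85, 1485.58, 1250.07, 1021.83, 818.96, 628.05,
               455.40, 319.03, 190.86, 97.07, 26.96, 0.00],
})


def sequential_result(frame):
    """Hypothetical helper: evaluate the recurrence row by row in one pass.

    Assumes the first row has colum2 == 1, as in the question's sample data.
    """
    c1 = frame['colum1'].to_numpy()
    c2 = frame['colum2'].to_numpy()
    c3 = frame['colum3'].to_numpy()
    c4 = frame['colum4'].to_numpy()

    # Rows with colum2 <= 5 multiply the previous result by colum1 * colum3;
    # later rows multiply it by colum1 alone.
    rate = np.where(c2 <= 5, c1 * c3, c1)

    out = np.empty(len(frame))
    prev = 0.0
    for i in range(len(frame)):
        if c2[i] == 1:
            out[i] = c4[i]
        else:
            # Same clipping at zero as np.maximum(0, ...) in the answer.
            out[i] = max(0.0, c4[i] - prev * rate[i])
        prev = out[i]
    return out


df['result'] = sequential_result(df)
print(df)
```

This still contains a Python-level loop, but it touches each row once instead of recomputing the entire column on every iteration, so for thousands of records it should be considerably cheaper than the original.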