In [5]: url = ('https://raw.github.com/pandas-dev' ...: '/pandas/master/pandas/tests/data/tips.csv') ...:
In [6]: tips = pd.read_csv(url)
In [7]: tips.head() Out[7]: total_bill tip sex smoker day time size 016.991.01 Female No Sun Dinner 2 110.341.66 Male No Sun Dinner 3 221.013.50 Male No Sun Dinner 3 323.683.31 Male No Sun Dinner 2 424.593.61 Female No Sun Dinner 4
与 import delimited 一样,read_csv() 可以使用更多参数来导入数据。例如,如果数据是用 tab 分隔的,没有列名,并且存在于当前工作目录中, panda 命令将是:
pandas 通过在 DataFrame 中 x 选定各个 Series 来进行类似的向量化操作。可以用同样的方法产生新的列。可以用 DataFrame.drop() 方法从 DataFrame 中删除列。
In [8]: tips['total_bill'] = tips['total_bill'] - 2
In [9]: tips['new_bill'] = tips['total_bill'] / 2
In [10]: tips.head() Out[10]: total_bill tip sex smoker day time size new_bill 014.991.01 Female No Sun Dinner 27.495 18.341.66 Male No Sun Dinner 34.170 219.013.50 Male No Sun Dinner 39.505 321.683.31 Male No Sun Dinner 210.840 422.593.61 Female No Sun Dinner 411.295
In [11]: tips = tips.drop('new_bill', axis=1)
3.2 条件过滤
Stata 中使用 if 语句对一个或多个列的数据进行筛选。
list if total_bill > 10
DataFrames 有多种过滤方式,最直观的是使用 boolean indexing 。
In [12]: tips[tips['total_bill'] > 10].head() Out[12]: total_bill tip sex smoker day time size 014.991.01 Female No Sun Dinner 2 219.013.50 Male No Sun Dinner 3 321.683.31 Male No Sun Dinner 2 422.593.61 Female No Sun Dinner 4 523.294.71 Male No Sun Dinner 4
3.3 If 条件语句
在 Stata 中,可以使用 if 语句来创建新列。
generate bucket = "low" if total_bill < 10 replace bucket = "high" if total_bill >= 10
在 pandas 中也可以使用 numpy 的 where 方法来完成相同的操作。
In [13]: tips['bucket'
] = np.where(tips['total_bill'] 10, 'low', 'high')
In [14]: tips.head() Out[14]: total_bill tip sex smoker day time size bucket 014.991.01 Female No Sun Dinner 2 high 18.341.66 Male No Sun Dinner 3 low 219.013.50 Male No Sun Dinner 3 high 321.683.31 Male No Sun Dinner 2 high 422.593.61 Female No Sun Dinner 4 high
# keep In [22]: tips[['sex', 'total_bill', 'tip']].head() Out[22]: sex total_bill tip 0 Female 14.991.01 1 Male 8.341.66 2 Male 19.013.50 3 Male 21.683.31 4 Female 22.593.61
# drop In [23]: tips.drop('sex', axis=1).head() Out[23]: total_bill tip smoker day time size 014.991.01 No Sun Dinner 2 18.341.66 No Sun Dinner 3 219.013.50 No Sun Dinner 3 321.683.31 No Sun Dinner 2 422.593.61 No Sun Dinner 4
# rename In [24]: tips.rename(columns={'total_bill': 'total_bill_2'}).head() Out[24]: total_bill_2 tip sex smoker day time size 014.991.01 Female No Sun Dinner 2 18.341.66 Male No Sun Dinner 3 219.013.50 Male No Sun Dinner 3 321.683.31 Male No Sun Dinner 2 422.593.61 Female No Sun Dinner 4
In [25]: tips = tips.sort_values(['sex', 'total_bill'])
In [26]: tips.head() Out[26]: total_bill tip sex smoker day time size 671.071.00 Female Yes Sat Dinner 1 923.751.00 Female Yes Fri Dinner 2 1115.251.00 Female No Sat Dinner 1 1456.351.50 Female No Thur Lunch 2 1356.511.25 Female No Thur Lunch 2
4. 字符串处理
4.1 字符串长度
Stata 分别使用 strlen() 和 ustrlen() 函数来确定 ASCII 和 Unicode 码的字符串长度。
In [43]: df2 Out[43]: key value 0 B 1.212112 1 D -0.173215 2 D 0.119209 3 E -1.044236
在 Stata 中,要执行合并命令,一个数据必须在内存中,另一个数据必须通过引用硬盘上的文件名称。相反,Python 必须将两个数据 DataFrames 同时放在内存中。
默认情况下, Stata 通过外部连接数据,在合并之后,来自两个数据的所有观察值都留在内存中。通过使用创建的 _merge 变量中的不同值,可以只保留来自初始数据集、合并数据集或两者交集的观察值。
* First create df2 and save to disk clear input str1 key B D D E end generate value = rnormal() save df2.dta
* Now create df1 in memory clear input str1 key A B C D end generate value = rnormal()
preserve
* Left join merge 1:n key using df2.dta keep if _merge == 1
* Right join restore, preserve merge 1:n key using df2.dta keep if _merge == 2
* Inner join restore, preserve merge 1:n key using df2.dta keep if _merge == 3
* Outer join restore merge 1:n key using df2.dta
pandas DataFrames 有一个 DataFrame.merge() 方法,提供了类似的功能。不同的合并类型是通过设定 how 关键字实现的。
In [44]: inner_join = df1.merge(df2, on=['key'], how='inner')
In [45]: inner_join Out[45]: key value_x value_y 0 B -0.2828631.212112 1 D -1.135632-0.173215 2 D -1.1356320.119209
In [46]: left_join = df1.merge(df2, on=['key'], how='left')
In [47]: left_join Out[47]: key value_x value_y 0 A 0.469112 NaN 1 B -0.2828631.212112 2 C -1.509059 NaN 3 D -1.135632-0.173215 4 D -1.1356320.119209
In [48]: right_join = df1.merge(df2, on=['key'], how='right')
In [49]: right_join Out[49]: key value_x value_y 0 B -0.2828631.212112 1 D -1.135632
-0.173215 2 D -1.1356320.119209 3 E NaN -1.044236
In [50]: outer_join = df1.merge(df2, on=['key'], how='outer')
In [51]: outer_join Out[51]: key value_x value_y 0 A 0.469112 NaN 1 B -0.2828631.212112 2 C -1.509059 NaN 3 D -1.135632-0.173215 4 D -1.1356320.119209 5 E NaN -1.044236
6. 缺失值
与 Stata 一样,在 Pandas 中使用浮点值 NaN (不是数字)表示数组中的缺失值。许多含义是相同的,例如,缺失值可以进行数值运算操作,默认情况下汇总运算将忽略这些缺失值。
In [52]: outer_join Out[52]: key value_x value_y 0 A 0.469112 NaN 1 B -0.2828631.212112 2 C -1.509059 NaN 3 D -1.135632-0.173215 4 D -1.1356320.119209 5 E NaN -1.044236
In [53]: outer_join['value_x'] + outer_join['value_y'] Out[53]: 0 NaN 10.929249 2 NaN 3-1.308847 4-1.016424 5 NaN dtype: float64
In [54]: outer_join['value_x'].sum() Out[54]: -3.5940742896293765
与 Stata 不同之处在于,不能使用 pandas 缺失值的标记值进行比较。例如,在 Stata 中,可以使用标记值来过滤丢失数据。
* Keep missing values list if value_x == . * Keep non-missing values list if value_x != .
In [55]: outer_join[pd.isna(outer_join['value_x'])] Out[55]: key value_x value_y 5 E NaN -1.044236
In [56]: outer_join[pd.notna(outer_join['value_x'])] Out[56]: key value_x value_y 0 A 0.469112 NaN 1 B -0.2828631.212112 2 C -1.509059 NaN 3 D -1.135632-0.173215 4 D -1.1356320.119209
pandas 还提供了其他处理缺失值的方法——一些在 Stata 中很难操作。例如,有一些方法可以删除所有有缺失值的行,可以用指定的值(比如平均值,或者前行值)替换缺失值。获取更多信息,请查阅缺失值文档。
# Drop rows with any missing value In [57]: outer_join.dropna()
Out[57]: key value_x value_y 1 B -0.2828631.212112 3 D -1.135632-0.173215 4 D -1.1356320.119209
# Fill forwards In [58]: outer_join.fillna(method='ffill') Out[58]: key value_x value_y 0 A 0.469112 NaN 1 B -0.2828631.212112 2 C -1.5090591.212112 3 D -1.135632-0.173215 4 D -1.1356320.119209 5 E -1.135632-1.044236
# Impute missing values with the mean In [59]: outer_join['value_x'].fillna(outer_join['value_x'].mean()) Out[59]: 00.469112 1-0.282863 2-1.509059 3-1.135632 4-1.135632 5-0.718815 Name: value_x, dtype: float64
7. 分组运算
7.1 分组求和
Stata的 collapse 函数可按一个或多个关键变量分组,并计算数值列上的和。
collapse (sum) total_bill tip, by(sex smoker)
pandas 通过 groupby 方法,可以实现类似的求和。
In [60]: tips_summed = tips.groupby(['sex', 'smoker'])['total_bill', 'tip'].sum()
In [61]: tips_summed.head() Out[61]: total_bill tip sex smoker Female No 869.68149.77 Yes 527.2796.74 Male No 1725.75302.00 Yes 1217.07183.07
7.2 分组变换
在 Stata 中,对原始数据分组运算,通常会使用 bysort 和 egen() 命令。例如,按吸烟变量分组后,每个观察值减去的组内平均值。
bysort sex smoker: egen group_bill = mean(total_bill) generate adj_total_bill = total_bill - group_bill
panda 使用 groubpy 方法提供了一个 分组变换 功能,可以这些类型的操作在一个操作中简洁地表达出来。
In [62]: gb = tips.groupby('smoker')[
'total_bill']
In [63]: tips['adj_total_bill'] = tips['total_bill'] - gb.transform('mean')
In [64]: tips.head() Out[64]: total_bill tip sex smoker day time size adj_total_bill 671.071.00 Female Yes Sat Dinner 1-17.686344 923.751.00 Female Yes Fri Dinner 2-15.006344 1115.251.00 Female No Sat Dinner 1-11.938278 1456.351.50 Female No Thur Lunch 2-10.838278 1356.511.25 Female No Thur Lunch 2-10.678278
7.3 分组处理
除了汇总分析之外,pandas 的 groupby 函数可以完成 Stata 中 bysort 其他大多数的处理功能。例如,按性别、吸烟者变量分组排序列后列出第一个观察结果。
bysort sex smoker: list if _n == 1
在 pandas 中,写法如下:
In [65]: tips.groupby(['sex', 'smoker']).first() Out[65]: total_bill tip day time size adj_total_bill sex smoker Female No 5.251.00 Sat Dinner 1-11.938278 Yes 1.071.00 Sat Dinner 1-17.686344 Male No 5.512.00 Thur Lunch 2-11.678278 Yes 5.255.15 Sun Dinner 2-13.506344