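Here is an answer whose result takes a slightly different form from the one framed in the question: it uses the values of 'B' as the index and the values of 'A' as the columns of the resulting DataFrame, which may describe the final result more clearly: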
import pandas as pd

lists = {'A' : ['AA', 'BB', 'CC'], 'B' : ['AC', 'BC', 'CC']}
# Cell [i, j] counts the positions where lists['B'][i] and lists['A'][j] differ.
df = pd.DataFrame(data=[[sum(c != d for c, d in zip(lists['B'][i], lists['A'][j])) for j in range(len(lists['A']))] for i in range(len(lists['B']))], index=lists['B'], columns=lists['A'])
print(df)
Output:
    AA  BB  CC
AC   1   2   1
BC   2   1   1
CC   2   2   0
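To make a single cell of that matrix easier to follow, the inner expression can be pulled out into a small helper. This is only a sketch: the name `mismatches` is hypothetical and does not appear in the code above, and it inherits the behavior of `zip`, which stops at the shorter string, so any extra trailing characters are ignored.

```python
def mismatches(s, t):
    # Count the positions at which the two strings differ.
    # zip() stops at the shorter string, so extra trailing characters are ignored.
    return sum(c != d for c, d in zip(s, t))

print(mismatches('AC', 'AA'))  # row 'AC', column 'AA' -> 1
print(mismatches('CC', 'CC'))  # row 'CC', column 'CC' -> 0
```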
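Here is a timing comparison between the approach above, which builds a general matrix, and the numpy approach shown in another answer, which uses hardcoded column names: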
import timeit

import numpy as np
import pandas as pd

lists = {'A' : ['AA', 'BB', 'CC'], 'B' : ['AC', 'BC', 'CC']}
df = pd.DataFrame(data=[[sum(c != d for c, d in zip(lists['B'][i], lists['A'][j])) for j in range(len(lists['A']))] for i in range(len(lists['B']))], index=lists['B'], columns=lists['A'])
print(df)

dfa = pd.DataFrame(['AA', 'BB', 'CC'], columns=list('A'))
dfb = pd.DataFrame(['AC', 'BC', 'CC'], columns=list('B'))

def foo(dfa, dfb):
    # General matrix approach: labels come from the data itself.
    df = pd.DataFrame(data=[[sum(c != d for c, d in zip(dfb['B'][i], dfa['A'][j])) for j in range(len(dfa['A']))] for i in range(len(dfb['B']))], index=dfb['B'], columns=dfa['A'])
    return df

def bar(dfa, dfb):
    # numpy approach from the other answer: split each string into characters,
    # then compare every row of b against every row of a via broadcasting.
    a = np.array(dfa['A'].str.split('').str[1:-1].tolist())
    b = np.array(dfb['B'].str.split('').str[1:-1].tolist())
    # The result column names are hardcoded; note that this modifies dfb in place.
    dfb[['disB_1', 'disB_2', 'disB_3']] = (a != b[:, None]).sum(axis=2)
    return dfb

print("\nGeneral matrix approach:")
t = timeit.timeit(lambda: foo(dfa, dfb), number = 100)
print(f"timeit: {t}")
print("\nHardcoded columns approach:")
t = timeit.timeit(lambda: bar(dfa, dfb), number = 100)
print(f"timeit: {t}")
Output, including the timeit results:
    AA  BB  CC
AC   1   2   1
BC   2   1   1
CC   2   2   0

General matrix approach:
timeit: 0.023536499997135252

Hardcoded columns approach:
timeit: 0.03922149998834357
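This seems to suggest that, on this small 3x3 input and over 100 runs, the numpy approach takes roughly 1.5 to 2 times as long as the general matrix approach in this answer.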
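For completeness, the numpy broadcasting idea can also produce the general labeled matrix without hardcoded column names. The following is a minimal sketch, not code from either answer: the helper name `hamming_matrix` is hypothetical, and it assumes every string has the same length. It has not been timed here, so whether it beats either approach above on a given input would need to be measured with timeit.

```python
import numpy as np
import pandas as pd

def hamming_matrix(rows, cols):
    """Pairwise character mismatch counts as a labeled DataFrame.

    Hypothetical helper; assumes all strings in rows and cols have equal length.
    """
    r = np.array([list(s) for s in rows])  # shape (len(rows), string_length)
    c = np.array([list(s) for s in cols])  # shape (len(cols), string_length)
    # Broadcast to (len(rows), len(cols), string_length), then count differences.
    counts = (r[:, None, :] != c[None, :, :]).sum(axis=2)
    return pd.DataFrame(counts, index=rows, columns=cols)

lists = {'A': ['AA', 'BB', 'CC'], 'B': ['AC', 'BC', 'CC']}
result = hamming_matrix(lists['B'], lists['A'])
print(result)
```

With the sample lists this yields the same values as the matrix shown earlier, with 'B' values as the index and 'A' values as the columns.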