NumPy's random.choice already has a built-in parameter "p" (for probabilities) that generates weighted samples. A minimal example:
import pandas as pd, numpy as np
from collections import Counter
df = pd.DataFrame(dict(words=["a","e","i","o","u"],weights=np.random.randint(5,15,5)))
df["normalized"] = df["weights"] / df["weights"].sum()
print(df)
  words  weights  normalized
0     a        9    0.204545
1     e       13    0.295455
2     i        8    0.181818
3     o        6    0.136364
4     u        8    0.181818
n = 3
l=np.random.choice(df.words,size=(n,),p=df.normalized)
print(l)
['u' 'i' 'i']
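How do you know whether the probabilities are respected? The answer is simple: if n is large enough, the number of times each word is drawn, divided by n, should be approximately equal to that word's normalized weight: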
n = 10000
l=np.random.choice(df.words,size=(n,),p=df.normalized)
c=Counter(l)
for key in c: c[key]=c[key]/n
print(c, sum(c.values()))
Counter({'e': 0.2907, 'a': 0.2055, 'u': 0.1882, 'i': 0.1791, 'o': 0.1365}) 1.0
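The same check can be made reproducible with NumPy's newer Generator interface. The following is only a sketch under some assumptions: the helper name weighted_sample, the seed, and the sample size are made up for illustration, and the weights are hard-coded from the table above rather than drawn at random.

```python
import numpy as np

def weighted_sample(items, weights, n, seed=None):
    """Draw n items with replacement, with probability proportional to weights.

    Hypothetical helper for illustration; returns the sample and the
    normalized probabilities that were passed to choice().
    """
    weights = np.asarray(weights, dtype=float)
    p = weights / weights.sum()          # p must sum to 1
    rng = np.random.default_rng(seed)    # seeded Generator for reproducibility
    return rng.choice(items, size=n, p=p), p

words = ["a", "e", "i", "o", "u"]
weights = [9, 13, 8, 6, 8]               # weights copied from the table above
n = 100_000

sample, p = weighted_sample(words, weights, n, seed=0)

# Empirical frequency of each word in the sample
values, counts = np.unique(sample, return_counts=True)
freqs = dict(zip(values.tolist(), (counts / n).tolist()))

# Print expected probability next to observed frequency
for word, prob in zip(words, p):
    print(word, round(float(prob), 4), round(freqs.get(word, 0.0), 4))
```

With a large n, the two printed columns should agree to about two decimal places; the gap shrinks roughly in proportion to 1/sqrt(n).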