I have the following requirements:

- For a given word or token, determine how many paragraphs it appears in (call this the document frequency)
- Create a data structure (dict, pandas DataFrame, etc.) that contains a word, its collection (overall) frequency, and its document frequency

An example dataset looks like this:
<P ID=1>
I have always wanted to try like, multiple? Different rasteraunts. Not quite sure which kind, maybe burgers!
</P>
<P ID=2>
Nice! I love burgers. Cheeseburgers, too. Have you ever gone to a diner type restauraunt? I have always wanted to try every diner in the country.
</P>
<P ID=3>
I am not related to the rest of these paragraphs at all.
</P>
A "paragraph" is delimited by `<P ID=x> </P>` tags.
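Since each paragraph is wrapped in those tags, the paragraph bodies can be pulled out individually with a regular expression before the tags are thrown away. A minimal sketch (the sample text here is a shortened, hypothetical version of the dataset above):

```python
import re

text = """<P ID=1>
I have always wanted to try like, multiple? Different rasteraunts.
</P>
<P ID=2>
Nice! I love burgers.
</P>"""

# Each match is the text between one <P ID=x> tag and its matching </P>;
# re.DOTALL lets '.' match across the newlines inside a paragraph.
paragraphs = re.findall(r'<P ID=\d+>(.*?)</P>', text, flags=re.DOTALL)
print(len(paragraphs))  # 2
```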
What I need is to create a data structure (I think a dict) like this:

{'i': X Y, 'have': X Y, etc}

or, possibly, a pandas DataFrame that looks like:

| Word | Content Frequency | Document Frequency |
|------|-------------------|--------------------|
| i    | 4                 | 3                  |
| have | 3                 | 2                  |
| etc  | etc               | etc                |
Currently, I can find the content frequency without a problem using the code below.
import nltk
import string
from nltk.tokenize import word_tokenize, RegexpTokenizer
import csv
import numpy
import operator
import re

# Requisite
def get_input(filepath):
    with open(filepath, 'r') as f:
        content = f.read()
    return content

# 1
def normalize_text(file):
    # Strip the paragraph markers so they are not counted as words
    file = re.sub(r'<P ID=(\d+)>', '', file)
    file = re.sub(r'</P>', '', file)
    tokenizer = RegexpTokenizer(r'\w+')
    all_words = tokenizer.tokenize(file)
    lower_case = []
    for word in all_words:
        lower_case.append(word.lower())
    return lower_case

# Requisite for 3
# Answer for 4
def get_collection_frequency(a):
    g = {}
    for i in a:
        if i in g:
            g[i] += 1
        else:
            g[i] = 1
    return g

myfile = get_input('example.txt')
words = normalize_text(myfile)

## ANSWERS
collection_frequency = get_collection_frequency(words)
print("Collection frequency: ", collection_frequency)
This returns:
Collection frequency: {'i': 4, 'have': 3, 'always': 2, 'wanted': 2,
'to': 4, 'try': 2, 'like': 1, 'multiple': 1, 'different': 1,
'rasteraunts': 1, 'not': 2, 'quite': 1, 'sure': 1, 'which': 1,
'kind': 1, 'maybe': 1, 'burgers': 2, 'nice': 1, 'love': 1,
'cheeseburgers': 1, 'too': 1, 'you': 1, 'ever': 1, 'gone': 1, 'a': 1,
'diner': 2, 'type': 1, 'restauraunt': 1, 'every': 1, 'in': 1, 'the': 2,
'country': 1, 'am': 1, 'related': 1, 'rest': 1, 'of': 1, 'these': 1,
'paragraphs': 1, 'at': 1, 'all': 1}
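As an aside, the hand-rolled counting loop in `get_collection_frequency` is equivalent to the standard library's `collections.Counter`, which behaves like a dict:

```python
from collections import Counter

# A small hypothetical token list standing in for the output of normalize_text
words = ['i', 'have', 'always', 'i', 'have', 'i']

# Counter tallies each distinct token, exactly like the manual dict loop
collection_frequency = Counter(words)
print(collection_frequency['i'])  # 3
```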
However, I am currently stripping the paragraph markers in the `normalize_text` function with the lines:

file = re.sub(r'<P ID=(\d+)>', '', file)
file = re.sub(r'</P>', '', file)

because I don't want P, ID, 1, 2, 3 counted in my dictionary, since those are just paragraph titles. This also throws away the information about which paragraph each word came from.
So, how can I tie the occurrence of a word to its instance within a paragraph, producing the desired results above? I'm not even sure of the logic needed to create such a data structure.
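One possible logic, sketched under the assumption that the file has been read into a single string: split the text into paragraphs first, tokenize each paragraph separately, then count every token for the collection frequency but count each token at most once per paragraph for the document frequency. The function name `word_frequencies` and the inline sample text are placeholders, not part of the original code:

```python
import re
from collections import Counter

def word_frequencies(text):
    # Pull out each paragraph's body, so the <P ID=x> markers never reach the tokenizer
    paragraphs = re.findall(r'<P ID=\d+>(.*?)</P>', text, flags=re.DOTALL)

    collection_freq = Counter()  # total occurrences across all paragraphs
    document_freq = Counter()    # number of paragraphs containing the word

    for para in paragraphs:
        tokens = [t.lower() for t in re.findall(r'\w+', para)]
        collection_freq.update(tokens)
        # Deduplicating with set() means each word adds at most 1 per paragraph
        document_freq.update(set(tokens))

    # {'word': (collection frequency, document frequency)}
    return {w: (collection_freq[w], document_freq[w]) for w in collection_freq}

text = """<P ID=1>
I have always wanted to try burgers!
</P>
<P ID=2>
I love burgers.
</P>"""

freqs = word_frequencies(text)
print(freqs['i'])        # (2, 2)
print(freqs['burgers'])  # (2, 2)
```

From there, the tabular form could be built with something like `pd.DataFrame([(w, cf, df) for w, (cf, df) in freqs.items()], columns=['Word', 'Content Frequency', 'Document Frequency'])`.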