I am trying to lowercase a long string (read from a text file) and strip the punctuation from it.
I have an example text file that looks like this:
This. this, Is, is. An; an, Example. example! Sentence? sentence.
Then I have the following script:
import string
import numpy
from nltk.tokenize import word_tokenize

def get_input(filepath):
    f = open(filepath, 'r')
    content = f.read()
    return content

def normalize_text(file):
    all_words = word_tokenize(file)
    for word in all_words:
        word = word.lower()
        word = word.translate(str.maketrans('', '', string.punctuation))
    return all_words

def get_collection_size(mydict):
    total = sum(mydict.values())
    return total

def get_vocabulary_size(mylist):
    unique_list = numpy.unique(mylist)
    vocabulary_size = len(unique_list)
    return vocabulary_size

myfile = get_input('D:\\PythonHelp\\example.txt')
total_words = normalize_text(myfile)
mydict = countElement(total_words)
print(total_words)
print(mydict)
print("Collection Size: {}".format(get_collection_size(mydict)))
print("Vocabulary Size: {}".format(get_vocabulary_size(total_words)))
I get the following result:
['This', '.', 'this', ',', 'Is', ',', 'is', '.', 'An', ';', 'an', ',', 'Example', '.', 'example', '!', 'Sentence', '?', 'sentence', '.']
{'This': 1, '.': 4, 'this': 1, ',': 3, 'Is': 1, 'is': 1, 'An': 1, ';': 1, 'an': 1, 'Example': 1, 'example': 1, '!': 1, 'Sentence': 1, '?': 1, 'sentence': 1}
Collection Size: 20
Vocabulary Size: 15
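So the punctuation tokens and the capitalised variants are clearly still in the list; the two numbers match the raw token list exactly:

# sanity check against the printed token list above
tokens = ['This', '.', 'this', ',', 'Is', ',', 'is', '.', 'An', ';', 'an', ',',
          'Example', '.', 'example', '!', 'Sentence', '?', 'sentence', '.']
print(len(tokens))       # 20 -> the "Collection Size" above
print(len(set(tokens)))  # 15 -> the "Vocabulary Size" above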
However, I expected:
['this', 'is', 'an', 'example', 'sentence']
{'this': 2, 'is': 2, 'an': 2, 'example': 2, 'sentence': 2}
Collection Size: 10
Vocabulary Size: 5
Why is normalize_text(file), which uses str.maketrans and .lower(), not working as expected?
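As far as I can tell, both calls behave correctly on a single string in isolation; for example, a check like this prints example:

import string

word = "Example!"
print(word.lower().translate(str.maketrans('', '', string.punctuation)))  # prints: example

So the problem seems to be specific to the loop inside normalize_text.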
When I run python --version I get 3.7.0.