
Custom data structure for counting words paragraph by paragraph in Python 3.7

Jerry M. • 5 years ago • 1998 views

I have the following requirements:

  • For a given word or token, determine how many paragraphs it appears in (call this the document frequency)
  • Create a data structure (dict, pandas dataframe, etc.) that contains the word, its collection (overall) frequency, and its document frequency

An example dataset looks like this:

<P ID=1>
I have always wanted to try like, multiple? Different rasteraunts. Not quite sure which kind, maybe burgers!
</P>

<P ID=2>
Nice! I love burgers. Cheeseburgers, too. Have you ever gone to a diner type restauraunt? I have always wanted to try every diner in the country.
</P>

<P ID=3>
I am not related to the rest of these paragraphs at all.
</P>

A "paragraph" is anything delimited by <P ID=x> </P> tags.

What I need is to create a data structure like this (I think it should be a dict):

{'i': X Y, 'have': X Y, etc}

Or, possibly, a pandas dataframe that looks like this:

| Word | Content Frequency | Document Frequency |
|   i  |         4         |          3         |
| have |         3         |          2         |
| etc  |         etc       |          etc       |
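To make the target shape concrete, here is a hypothetical sketch of the dict version, filled in with the example counts from the table above (the first value is the overall/collection frequency, the second the document frequency):

# Hypothetical target dict: word -> (collection frequency, document frequency)
target = {
    'i':    (4, 3),
    'have': (3, 2),
    # ... one entry per distinct word
}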

Currently, I can find the collection frequency without any problem using the code below.

import nltk
import string
from nltk.tokenize import word_tokenize, RegexpTokenizer
import csv
import numpy
import operator
import re

# Requisite
def get_input(filepath):
    f = open(filepath, 'r')
    content = f.read()
    return content

# 1
def normalize_text(file):
    file = re.sub('<P ID=(\d+)>', '', file)
    file = re.sub('</P>', '', file)
    tokenizer = RegexpTokenizer(r'\w+')
    all_words = tokenizer.tokenize(file)
    lower_case = []
    for word in all_words:
        curr = word.lower()
        lower_case.append(curr)

    return lower_case

# Requisite for 3
# Answer for 4
def get_collection_frequency(a):
    g = {}
    for i in a:
        if i in g: 
            g[i] +=1
        else: 
            g[i] =1
    return g

myfile = get_input('example.txt')
words = normalize_text(myfile)

## ANSWERS
collection_frequency = get_collection_frequency(words)
print("Collection frequency: ", collection_frequency)

which returns:

Collection frequency:  {'i': 4, 'have': 3, 'always': 2, 'wanted': 2, 
'to': 4, 'try': 2, 'like': 1, 'multiple': 1, 'different': 1,
'rasteraunts': 1, 'not': 2, 'quite': 1, 'sure': 1, 'which': 1,
'kind': 1, 'maybe': 1, 'burgers': 2, 'nice': 1, 'love': 1,
'cheeseburgers': 1, 'too': 1, 'you': 1, 'ever': 1, 'gone': 1, 'a': 1,
'diner': 2, 'type': 1, 'restauraunt': 1, 'every': 1, 'in': 1, 'the': 2,
'country': 1, 'am': 1, 'related': 1, 'rest': 1, 'of': 1, 'these': 1, 
'paragraphs': 1, 'at': 1, 'all': 1}

However, I am currently stripping the paragraph tags in my normalize_text function with the lines:

file = re.sub('<P ID=(\d+)>', '', file)
file = re.sub('</P>', '', file)

because I don't want P, ID, 1, 2, 3 to be counted in my dictionary, since those are just paragraph markers.

So, how can I tie a word's occurrences to its presence in a particular paragraph, in order to produce the desired results above? I'm not even sure what the logic would be for building such a data structure.

Replies [ 3 ]  |  latest reply 5 years ago
DarrylG
Reply   •   #1
DarrylG    6 years ago
import re
from collections import defaultdict, Counter

def create_dict(text):
    " Dictionary contains strings for each paragraph using paragraph ID as key "
    d = defaultdict(lambda: "")
    lines = text.splitlines()
    for line in lines:
        matchObj = re.match(r'<P ID=(\d+)>', line)
        if matchObj:
            dictName = matchObj.group(0)
            continue  # skip line containing paragraph ID
        elif re.match(r'</P>', line):
            continue  # skip line containing paragraph ending token
        d[dictName] += line.lower()
    return d

def document_frequency(d):
    " Frequency of words in the document "
    c = Counter()
    for paragraph in d.values():
        words = re.findall(r'\w+', paragraph)
        c.update(words)
    return c

def paragraph_frequency(d):
    " Frequency of words by paragraph "
    c = Counter()
    for sentences in d.values():
        words = re.findall(r'\w+', sentences)
        set_words = set(words)  # a set allows at most one occurrence
                                # of a word per paragraph
        c.update(set_words)
    return c

text = """<P ID=1>
I have always wanted to try like, multiple? Different rasteraunts. Not quite sure which kind, maybe burgers!
</P>

<P ID=2>
Nice! I love burgers. Cheeseburgers, too. Have you ever gone to a diner type restauraunt? I have always wanted to try every diner in the country.
</P>

<P ID=3>
I am not related to the rest of these paragraphs at all.
</P>"""

d = create_dict(text)
doc_freq = document_frequency(d)    # Number of times in document
para_freq = paragraph_frequency(d)  # Number of times in paragraphs
print("document:", doc_freq)
print("paragraph: ", para_freq)

Result:

document: Counter({'i': 4, 'to': 4, 'have': 3, 'always': 2, 'wanted': 2, 'try': 2, 'not': 2,'burgers': 2, 'diner': 2, 'the': 2, 'like': 1, 'multiple': 1, 'different': 1, 'rasteraunts':1, 'quite': 1, 'sure': 1, 'which': 1, 'kind': 1, 'maybe': 1, 'nice': 1, 'love': 1, 'cheeseburgers': 1, 'too': 1, 'you': 1, 'ever': 1, 'gone': 1, 'a': 1, 'type': 1, 'restauraunt': 1, 'every': 1, 'in': 1, 'country': 1, 'am': 1, 'related': 1, 'rest': 1, 'of': 1, 'these': 1, 'paragraphs': 1, 'at': 1, 'all': 1})
paragraph:  Counter({'to': 3, 'i': 3, 'try': 2, 'have': 2, 'burgers': 2, 'wanted': 2, 'always': 2, 'not': 2, 'the': 2, 'which': 1, 'multiple': 1, 'quite': 1, 'rasteraunts': 1, 'kind': 1, 'like': 1, 'maybe': 1, 'sure': 1, 'different': 1, 'love': 1, 'too': 1, 'in': 1, 'restauraunt': 1, 'every': 1, 'nice': 1, 'cheeseburgers': 1, 'diner': 1, 'ever': 1, 'a': 1, 'type': 1, 'you': 1, 'country': 1, 'gone': 1, 'at': 1, 'related': 1, 'paragraphs': 1, 'rest': 1, 'of': 1,'am': 1, 'these': 1, 'all': 1})
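In the question's own terms, doc_freq above is the collection frequency (total occurrences) and para_freq is the document frequency (number of paragraphs containing the word), so the two Counters can be merged into the single structure the question asks for; a minimal sketch:

# Merge the two Counters into word -> (collection frequency, document frequency)
combined = {word: (doc_freq[word], para_freq[word]) for word in doc_freq}
print(combined['i'])  # (4, 3) for the example text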
Reinstate Monica
Reply   •   #2
Reinstate Monica    6 years ago

Try this:

import re
from nltk.tokenize import word_tokenize, RegexpTokenizer

def normalize_text(file):
    file = re.sub('<P ID=(\d+)>', '', file)
    file = re.sub('</P>', '', file)
    tokenizer = RegexpTokenizer(r'\w+')
    all_words = tokenizer.tokenize(file)
    lower_case = []
    for word in all_words:
        curr = word.lower()
        lower_case.append(curr)

    return lower_case

def find_words(filepath):
    with open(filepath, 'r') as f:
        file = f.read()
    word_list = normalize_text(file)
    data = file.replace('</P>','').split('<P ID=')
    result = {}
    for word in word_list:
        result[word] = {}
        for p in data:
            if p:
                result[word][f'paragraph_{p[0]}'] = p[2:].count(word)
    print(result)
    return result

find_words('./test.txt')

If you would rather group by paragraph first, and then by word counts within each paragraph:

def find_words(filepath):
    with open(filepath, 'r') as f:
        file = f.read()
    word_list = normalize_text(file)
    data = file.replace('</P>','').split('<P ID=')
    result = {}
    for p in data:
        if p:
            result[f'paragraph_{p[0]}'] = {}
            for word in word_list:
                result[f'paragraph_{p[0]}'][word] = p[2:].count(word)


    print(result)
    return result 

It's still a bit hard to read, though. If pretty-printing the object matters to you, you could try a pretty printing package.
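For example, the standard-library pprint module works for this; a small sketch reusing the find_words function above:

from pprint import pprint

result = find_words('./test.txt')  # find_words returns the nested dict it builds
pprint(result)                     # print it in a more readable, indented layout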

To find the number of paragraphs in which each word appears:

def find_paragraph_occurrences(filepath):
    with open(filepath, 'r') as f:
        file = f.read()
    word_list = normalize_text(file)
    data = file.replace('</P>','').lower().split('<P ID=')
    result = {}
    for word in word_list:
        result[word] = 0
        for p in data:
            if word in p:
                result[word] += 1

    print(result)
    return result
wwii
Reply   •   #3
wwii    6 years ago

So, how can I tie a word's occurrences to its presence in a particular paragraph, in order to produce the desired results above?

Split the process into two parts: finding the paragraphs and finding the words.

from nltk.tokenize import RegexpTokenizer
import re, collections

p = r'<P ID=\d+>(.*?)</P>'
paras = RegexpTokenizer(p)
words = RegexpTokenizer(r'\w+')

Keep two dictionaries while parsing: one for the collection frequency and one for the document frequency.

col_freq = collections.Counter()
doc_freq = collections.Counter()

Iterate over the paragraphs; get the words in each paragraph; feed the words to the col_freq dict and the set of those words to the doc_freq dict.

for para in paras.tokenize(text):
    tokens = [word.lower() for word in words.tokenize(para)]
    col_freq.update(tokens)
    doc_freq.update(set(tokens))

Combine the two dictionaries.

d = {word:(col_freq[word], doc_freq[word]) for word in col_freq}
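If the tabular layout from the question is preferred, that same dict converts directly into a pandas DataFrame; a sketch, assuming pandas is installed:

import pandas as pd

# one row per word: word, collection frequency, document frequency
rows = [(word, cf, dfreq) for word, (cf, dfreq) in d.items()]
table = pd.DataFrame(rows, columns=['Word', 'Collection Frequency', 'Document Frequency'])
print(table)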

There is some inefficiency here (the text is parsed twice), but that can be tweaked if it becomes a problem.

RegexpTokenizer doesn't really offer much more than re.findall() in this situation, but it hides some of the details and makes this less verbose, so I used it.


Sometimes re doesn't cope well with malformed markup. Parsing the paragraphs could also be done with BeautifulSoup:

from bs4 import BeautifulSoup
soup = BeautifulSoup(text,"html.parser")
for para in soup.find_all('p'):
    tokens = [word.lower() for word in words.tokenize(para.text)]
    print(tokens)
##    col_freq.update(tokens)
##    doc_freq.update(set(tokens))