Extracting data from a text file with Python

kalle • 4 years ago • 827 views

I have a text file with the following high-level structure:

CATEG:
DATA1
DATA2
...
DATA_N
CATEG:
DATA1
....

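I want to open this text file and parse each instance of CATEG:, separating out the content between consecutive instances. However, I am confused by the open method and how it handles the newline at the end of each line when reading.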

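That is, using f = open('mydata.txt', 'r') followed by f.readlines() leaves a lot of unwanted newline characters in the result, which makes splitting by the structure above annoying. Does anyone have any suggestions? Unfortunately, the dataset itself is the annoying part.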

Source: http://www.python88.com/topic/43325


Replies [ 5 ] | Latest reply 4 years ago

• Reply #1

Peter Wood 5 years ago

You can use itertools.groupby:

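from itertools import groupby

with open(filename) as f:
    lines = f.read().splitlines()  # file objects have no splitlines(); read first

# key must be a callable: flag the 'CATEG:' lines, then keep only the data runs
groups = groupby(lines, key=lambda line: line == 'CATEG:')
categs = [list(group) for is_categ, group in groups if not is_categ]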
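
To see what the groupby approach produces, here is a minimal sketch run against in-memory data instead of a real file. The sample text and the helper name is_header are assumptions made for illustration; they are not part of the original answer.

```python
from io import StringIO
from itertools import groupby

# Hypothetical sample data standing in for the real file.
sample = StringIO("CATEG:\nDATA1\nDATA2\nCATEG:\nDATA3\n")
lines = sample.read().splitlines()


def is_header(line):
    """Return True for the marker line that starts a new category."""
    return line == "CATEG:"


# groupby yields alternating runs of header lines and data lines;
# only the data runs are kept.
categs = [list(group) for header, group in groupby(lines, key=is_header) if not header]
print(categs)  # [['DATA1', 'DATA2'], ['DATA3']]
```

One caveat of this sketch: two consecutive CATEG: lines form a single header run, so an empty category does not show up as an empty list.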

• Reply #2

PaulMcG 5 years ago

Write a small wrapper around the sequence that strips off the newlines:

def newline_stripper(seq):
    for s in seq:
        # or change this to just s.rstrip() to remove all trailing whitespace
        yield s.rstrip('\n')

Then wrap the file object with it when you iterate:

with open('text_file.txt') as f:
    for line in newline_stripper(f):
        # do something with your now newline-free lines
        print(line)

This keeps reading the file one line at a time, rather than loading it all at once as read().splitlines() would.
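
Building on the wrapper, the lines can also be grouped into categories while still reading lazily. The following is only a sketch under assumptions: iter_categories, its marker parameter, and the sample data are hypothetical names introduced here, not part of the original answer.

```python
from io import StringIO


def newline_stripper(seq):
    for s in seq:
        yield s.rstrip("\n")


def iter_categories(lines, marker="CATEG:"):
    """Yield one list of data lines per marker line, one category at a time."""
    current = None  # stays None until the first marker is seen
    for line in lines:
        if line == marker:
            if current is not None:
                yield current
            current = []
        elif current is not None:
            current.append(line)
    if current is not None:
        yield current  # emit the final category


# Hypothetical sample data standing in for the real file.
sample = StringIO("CATEG:\nDATA1\nDATA2\nCATEG:\nDATA3\n")
result = list(iter_categories(newline_stripper(sample)))
print(result)  # [['DATA1', 'DATA2'], ['DATA3']]
```

Unlike the groupby version, this keeps empty categories as empty lists and ignores any lines that appear before the first marker.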

• Reply #3

keramat 5 years ago

Try this:

with open('text.txt') as file:
    text = file.read()
text = text.replace('\n', ' ')
s = text.split('CATEG:')
s = [x.strip() for x in s if x != '']
print(s)
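
Note that replacing newlines with spaces joins each category into a single string. If the individual data lines are needed, a variation on the same split idea could look like the sketch below. The sample text is an assumption, and the approach presumes that the string CATEG: never occurs inside a data line.

```python
# Hypothetical sample text standing in for file.read().
text = "CATEG:\nDATA1\nDATA2\nCATEG:\nDATA3\n"

chunks = []
for chunk in text.split("CATEG:"):
    # Each chunk starts with the newline that followed the marker,
    # so blank lines are filtered out here.
    data = [line for line in chunk.splitlines() if line.strip()]
    if data:
        chunks.append(data)

print(chunks)  # [['DATA1', 'DATA2'], ['DATA3']]
```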

• Reply #4

grapes 5 years ago

Try the following code:

with open('mydata.txt') as f:
  for line in f:
    line = line.strip(' \t\r\n')  # remove spaces and line endings
    if line.endswith(':'):  # category lines such as 'CATEG:' end with a colon
      pass # this is category definition
    else:
      pass # this is data line
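
One way the two branches might be filled in is to collect (header, data lines) pairs, which also works if the category names differ. This is a sketch only: it assumes every header line ends with a colon and no data line does, and the sample data and the name sections are made up for illustration.

```python
from io import StringIO

# Hypothetical sample data with stray spaces and a blank line.
sample = StringIO("CATEG:\n  DATA1\nDATA2\n\nCATEG:\nDATA3\n")

sections = []  # list of (header, data_lines) pairs
for line in sample:
    line = line.strip(" \t\r\n")
    if not line:
        continue  # skip blank lines
    if line.endswith(":"):
        sections.append((line, []))  # category definition
    elif sections:
        sections[-1][1].append(line)  # data line for the latest category

print(sections)  # [('CATEG:', ['DATA1', 'DATA2']), ('CATEG:', ['DATA3'])]
```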

• Reply #5

David 5 years ago

Try read().splitlines().

For example:

from io import StringIO

def mkString():
    return StringIO("""CATEG:
        DATA1
        DATA2
        ...
        DATA_N
        CATEG:
        DATA1
        ....""")

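# strip() also drops the indentation inside the triple-quoted string
print([line.strip() for line in mkString().read().splitlines()])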
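
For comparison, this short sketch shows the difference between readlines() and read().splitlines() that the question asks about. The sample text is an assumption used only for illustration.

```python
from io import StringIO

# Hypothetical sample text standing in for the file contents.
text = "CATEG:\nDATA1\nDATA2\n"

# readlines() keeps the trailing newline on every line.
with_newlines = StringIO(text).readlines()

# read().splitlines() splits on line boundaries and drops them.
without_newlines = StringIO(text).read().splitlines()

print(with_newlines)     # ['CATEG:\n', 'DATA1\n', 'DATA2\n']
print(without_newlines)  # ['CATEG:', 'DATA1', 'DATA2']
```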
