I solved this with nested generators:
import re

SECTION_START = re.compile(r'^\s*theta\s+sigma\s*$')
SECTION_END = re.compile(r'^\s*END\s*$')

def fresco_iter(stream):
    def inner(stream):
        # Yields each line until an end marker is found (or EOF)
        for line in stream:
            if line and not SECTION_END.match(line):
                yield line
                continue
            break

    # Find a start marker, then break off into a nested iterator
    for line in stream:
        if line:
            if SECTION_START.match(line):
                yield inner(stream)
            continue
        break
The fresco_iter function returns a generator that you can loop over. It yields one nested generator per theta sigma section:
>>> with open('fort.16', 'r') as fh:
...     print(list(fresco_iter(fh)))
[<generator object fresco_iter.<locals>.inner at 0x7fbc6da15678>,
<generator object fresco_iter.<locals>.inner at 0x7fbc6da15570>]
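One thing worth spelling out: the inner generators are lazy and all read from the same stream, so each one should be consumed before asking for the next. Here is a minimal sketch of that behaviour. It uses a condensed restatement of fresco_iter so it runs on its own, and the SAMPLE contents are made up for illustration, not taken from a real fort.16 file.

```python
import io
import re

SECTION_START = re.compile(r'^\s*theta\s+sigma\s*$')
SECTION_END = re.compile(r'^\s*END\s*$')

def fresco_iter(stream):
    # Condensed restatement of the function above, repeated so this runs standalone
    def inner(stream):
        for line in stream:
            if SECTION_END.match(line):
                break
            yield line

    for line in stream:
        if SECTION_START.match(line):
            yield inner(stream)

# Hypothetical file contents, invented for this example
SAMPLE = (
    "header text\n"
    " theta sigma\n"
    " 1 0.1\n"
    " 2 0.1\n"
    "END\n"
    " theta sigma\n"
    " 1 0.3\n"
    "END\n"
)

# Consuming each section before requesting the next one works as intended
in_order = [list(section) for section in fresco_iter(io.StringIO(SAMPLE))]

# Collecting the generators first lets the outer loop read through to EOF,
# so every inner generator then finds the stream already exhausted
collected = list(fresco_iter(io.StringIO(SAMPLE)))
too_late = [list(section) for section in collected]

print(in_order)
print(too_late)
```

So the list(...) call above is fine for showing what comes back, but the inner generators it collects will be empty by the time you loop over them.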
So, to make use of this, write your own nested loop to consume the nested generators:
filename = 'fort.16'

with open(filename, 'r') as fh:
    for nested_iter in fresco_iter(fh):
        print('--- start')
        for line in nested_iter:
            print(line.rstrip())
        print('--- end')
This will output:
--- start
1 0.1
2 0.1
3 0.2
--- end
--- start
1 0.3
2 0.2
--- end
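This strategy only ever holds one line of the input file in memory at a time, so it works for files of any size, even on the smallest of devices... because generators are awesome.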
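To illustrate the one-line-at-a-time point, here is a rough sketch that summarises each section without ever building a list of its lines. The summarize helper and the SAMPLE data are hypothetical, and it assumes every data row has exactly two whitespace-separated columns (theta, then sigma), as in the output shown above. fresco_iter is restated in condensed form so the snippet runs standalone.

```python
import io
import re

SECTION_START = re.compile(r'^\s*theta\s+sigma\s*$')
SECTION_END = re.compile(r'^\s*END\s*$')

def fresco_iter(stream):
    # Condensed restatement of the function above, repeated so this runs standalone
    def inner(stream):
        for line in stream:
            if SECTION_END.match(line):
                break
            yield line

    for line in stream:
        if SECTION_START.match(line):
            yield inner(stream)

def summarize(stream):
    # Hypothetical helper: yields (row_count, sigma_total) for each section
    for section in fresco_iter(stream):
        rows = 0
        total = 0.0
        for line in section:
            # Assumes two whitespace-separated columns per row: theta, sigma
            _theta, sigma = line.split()
            rows += 1
            total += float(sigma)
        yield rows, total

# Hypothetical file contents, invented for this example
SAMPLE = (
    "some header\n"
    " theta sigma\n"
    " 1 0.1\n"
    " 2 0.1\n"
    " 3 0.2\n"
    "END\n"
    " theta sigma\n"
    " 1 0.3\n"
    " 2 0.2\n"
    "END\n"
)

# Safe to collect here: summarize drains each section before yielding its result
results = list(summarize(io.StringIO(SAMPLE)))
for rows, total in results:
    print(rows, round(total, 3))
```

Only the running count and total are kept per section, so memory use stays flat no matter how long a section is.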
So, going all the way... splitting the output into individual files:
with open(filename, 'r') as fh_in:
    for (i, nested_iter) in enumerate(fresco_iter(fh_in)):
        with open('{}.part-{:04d}'.format(filename, i), 'w') as fh_out:
            for line in nested_iter:
                fh_out.write(line)
This writes just the sections to separate files named fort.16.part-0000 and fort.16.part-0001.
I hope this helps, happy coding!