How to remove repeated blocks of text with Python

dratoms • 5 years ago • 1469 views

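The text files I am working with are radiology reports. If a document has two pages, a block of text containing the patient name and other metadata is repeated at the top of every page, and the rest of each page holds the report content. I have already merged the pages into a single text object. I want to keep the first block and delete all the other repeats. Is there a way to remove these blocks programmatically from all such files? The repeated block looks like this: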

 Patient ID            xxx                 Patient Name           xxx
 Gender                 Female                         Age                     43Y 8M
 Procedure Name         CT Scan - Brain (Repeat)       Performed Date          14-03-2018
 Study DateTime         14-03-2018 07:10 am            Study Description       BRAIN REPEAT
 Study Type             CT                             Referring Physician     xxx

Article URL: http://www.python88.com/topic/40170

Replies [ 3 ] | Latest reply 5 years ago

• Reply #1

SA12345 6 years ago

You can find the start index of every occurrence of the patient data with:

str.find(sub,start,end)

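where

sub: the substring to search for in the given string; in your case, the patient data
start: the position in the string at which the search begins
end: the position in the string at which the search stops

It returns the lowest index at which the searched string (the patient data) occurs, or -1 if it does not occur.

You can run this in a loop to collect every index at which the patient data occurs.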
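A minimal sketch of that loop, assuming the repeated block is byte-for-byte identical on every page. The helper name `find_all` and the short `header` string are made up for illustration and merely stand in for the real patient block.

```python
def find_all(text: str, sub: str) -> list:
    """Return the start index of every non-overlapping occurrence of sub in text."""
    if not sub:
        raise ValueError("sub must be a non-empty string")
    indices = []
    start = text.find(sub)
    while start != -1:
        indices.append(start)
        # Resume the search just past the occurrence that was found
        start = text.find(sub, start + len(sub))
    return indices


header = "Patient ID xxx\n"  # hypothetical stand-in for the repeated block
report = header + "page 1\n" + header + "page 2\n" + header + "page 3\n"
indices = find_all(report, header)
print(indices)
```

If the block differs slightly between pages (for example, trailing whitespace), an exact `str.find` will miss it, so it may be worth normalising the text first.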

Then you can strip the patient data from the second occurrence onward with something like:

# indices holds the start of each patient-data block; sub is the block text
str_new = ''.join((str_old[:indices[1]], str_old[indices[1] + len(sub):indices[2]], str_old[indices[2] + len(sub):]))
  ... assuming a total of 3 pages in your record.
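The three-page slice can be generalised to any number of pages. The following is only a sketch under the same assumption that the block repeats verbatim; `remove_repeats_by_index` is a hypothetical name, not part of any library.

```python
def remove_repeats_by_index(text: str, block: str) -> str:
    """Keep the first occurrence of block and cut out every later one."""
    if not block:
        return text
    pieces = []
    cursor = 0          # start of the text not yet copied
    seen_first = False
    idx = text.find(block)
    while idx != -1:
        end = idx + len(block)
        if not seen_first:
            pieces.append(text[cursor:end])  # copy up to and including the first block
            seen_first = True
        else:
            pieces.append(text[cursor:idx])  # copy up to the repeat, skipping the block itself
        cursor = end
        idx = text.find(block, end)
    pieces.append(text[cursor:])             # whatever follows the last block
    return "".join(pieces)


header = "Patient ID xxx\n"  # hypothetical stand-in for the repeated block
report = header + "page 1\n" + header + "page 2\n" + header + "page 3\n"
cleaned = remove_repeats_by_index(report, header)
print(cleaned)
```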

Another option:

str.replace(old, new [, max])

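where

old: the substring to be replaced; in your case, the patient data
new: the substring that replaces it; it can be "" (an empty string)
max: if this optional argument is given, only the first max occurrences are replaced (Python's documentation calls this parameter count). Because the earliest occurrences are the ones removed, the patient data will then appear only on the last page.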
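Since the question asks to keep the first block rather than the last, one possible workaround is to split the text at the first occurrence and run `replace` only on what follows. This is a sketch, not the answerer's method; `keep_first_block` is a hypothetical helper and `header` is a made-up stand-in for the patient block.

```python
def keep_first_block(text: str, block: str) -> str:
    """Remove every occurrence of block except the first one."""
    if not block:
        return text
    # partition splits at the FIRST occurrence; sep is "" when block is absent
    head, sep, tail = text.partition(block)
    return head + sep + tail.replace(block, "")


header = "Patient ID xxx\n"  # hypothetical stand-in for the repeated block
report = header + "page 1\n" + header + "page 2\n" + header + "page 3\n"

last_only = report.replace(header, "", 2)      # count=2 strips the first two blocks
first_only = keep_first_block(report, header)  # strips the second and third blocks
print(first_only)
```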

• Reply #2

Charles Landau 6 years ago

In Python, a plain text file can be treated as a sequence of lines. Consider plain.txt below:

This is the first line!\n
This is the second line!\n
This is the third line!\n

You can use the with keyword to create a context that manages the open/close logic, like this:

with open("./plain.txt", "r") as file:
    for line in file:
        # program logic
        pass

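"r" is the mode passed to open; it opens the file for reading.

Using this idiom, you can store the repeated values in whatever way suits your file access pattern, and skip a value when you encounter it again.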
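One way this idea might look for the report header, as a rough sketch: remember each header line the first time it appears and skip it afterwards. The labels in `HEADER_LABELS` are taken from the sample block in the question; treating any line that begins with one of them as a header line is an assumption, and `drop_repeated_header_lines` is a hypothetical name. `io.StringIO` stands in for an open file so the example runs on its own.

```python
import io

# Field labels that begin each line of the sample header block (assumption:
# report body lines never start with these labels)
HEADER_LABELS = ("Patient ID", "Gender", "Procedure Name", "Study DateTime", "Study Type")


def drop_repeated_header_lines(lines):
    """Yield lines, skipping header lines that have already been seen."""
    seen = set()
    for line in lines:
        key = line.strip()
        if key.startswith(HEADER_LABELS):
            if key in seen:
                continue  # a repeat of a header line from an earlier page
            seen.add(key)
        yield line


# StringIO behaves like a file opened with open(..., "r")
sample = io.StringIO(
    " Patient ID   xxx   Patient Name   xxx\n"
    "Findings for page 1\n"
    " Patient ID   xxx   Patient Name   xxx\n"
    "Findings for page 2\n"
)
cleaned = "".join(drop_repeated_header_lines(sample))
print(cleaned)
```

Blank lines between blocks are left untouched here; whether to collapse them depends on how the pages were merged.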

Edit: I saw your edit, and it looks like this is actually a CSV, right? If so, I recommend the pandas package.

import pandas as pd # Conventional namespace is pd

# Check out glob, os.walk, os.path for programmatic ways to generate this list
files = ["file.csv", "names.csv", "here.csv"]

# DataFrame.append was removed in pandas 2.0, so read every file and concatenate once
df = pd.concat([pd.read_csv(filepath) for filepath in files])

# To display result
print(df)

# To save to new csv
df.to_csv("big.csv")

• Reply #3

seventyseven 6 years ago

Assuming you can put each individual page into a list of documents:

def remove_patient_data(documents: list, pattern: str) -> str:
    document_buffer = ""
    for count, document in enumerate(documents):
        if count != 0:
            document = document.replace(pattern, "")
        document_buffer += document + '\n'
    return document_buffer

my_documents = ["blah foo blah", "blah foo bar", "blah foo baz"]
remove_patient_data(my_documents, "foo")

returns

'blah foo blah\nblah  bar\nblah  baz\n'
