Py学习  »  kantal  »  全部回复
回复总数  1
6 年前
回复了 kantal 创建的主题 » 基于python中两个短序列的过滤行

如果文件不太大,可以立即读取,并使用re.findall():

    import re
    with open("infile.txt") as finp:
        data=finp.read()
    with open('outfile.txt', "w") as f:
        for item in re.findall(r">.+?[\r\n\f][AGTC]*?AATAAA[AGTC]{2,}GGAC[AGTC]*", data):
            f.write(item+"\n")

"""
+? and *?       means non-greedy process;
>.+?[\r\n\f]    matches a line starting with '>' and followed by any characters to the end of the line; 
[AGTC]*?AATAAA  matches any number of A,G,T,C characters, followed by the AATAAA pattern; 
[AGTC]{2,}      matches at least two or more characters of A,G,T,C;
GGAC            matches the GGAC pattern;
[AGTC]*         matches the empty string or any number of A,G,T,C characters.
"""