Using multiprocessing for file reading in Python 3

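I have several very large files, each almost 2 GB, so I would like to process multiple files in parallel. This should be possible because all the files have a similar format, so each one can be read independently. I know I should use the multiprocessing library, but I am really confused about how to use it with my code.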

My file-reading code is:

def file_reading(file,num_of_sample,segsites,positions,snp_matrix):
    with open(file,buffering=2000009999) as f:
        ###I read file here. I am not putting that code here.
        try:
            assert len(snp_matrix) == len(positions)
            return positions,snp_matrix ## return statement
        except:
            print('length of snp matrix and length of position vector not the same.')
            sys.exit(1)

My main block is:

if __name__ == "__main__":    
    segsites = []
    positions = []
    snp_matrix = []




    path_to_directory = '/dataset/example/'
    extension = '*.msOut'

    num_of_samples = 162
    filename = glob.glob(path_to_directory+extension)

    ###How can I use multiprocessing with function file_reading
    number_of_workers = 10

    x,y,z = [],[],[]

    array_of_number_tuple = [(filename[file], segsites,positions,snp_matrix) for file in range(len(filename))]
    with multiprocessing.Pool(number_of_workers) as p:
        pos,snp = p.map(file_reading,array_of_number_tuple)
        x.extend(pos)
        y.extend(snp)

So the inputs to my function are as follows:

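file - a list containing the file names
num_of_samples - an int value
segsites - initially an empty list, which I append to while reading the file.
positions - initially an empty list, which I append to while reading the file.
snp_matrix - initially an empty list, which I append to while reading the file.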

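The function returns the positions list and the snp_matrix list at the end. How can I use multiprocessing when my arguments are lists and an integer? The way I am using multiprocessing gives me the following error: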

TypeError: file_reading() missing 3 required positional arguments: 'segsites', 'positions', and 'snp_matrix'
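The error follows from how Pool.map calls the target: each element of the iterable is passed as one positional argument, so the whole tuple arrives as the first parameter and the remaining parameters are missing. The sketch below illustrates this with a hypothetical two-argument function, describe, and made-up file names (neither is part of the original code). It shows two common ways around the problem: unpack the tuple inside a wrapper, or use Pool.starmap, which unpacks each tuple into separate arguments.

```python
import multiprocessing


def describe(name, count):
    # Hypothetical stand-in for a worker that needs several arguments.
    return f"{name}:{count}"


def describe_packed(args):
    # Pool.map hands over one item per call, so unpack the tuple here.
    name, count = args
    return describe(name, count)


def run_demo():
    # Hypothetical task list: one tuple of arguments per file.
    tasks = [("a.msOut", 1), ("b.msOut", 2)]
    with multiprocessing.Pool(2) as pool:
        packed = pool.map(describe_packed, tasks)  # each tuple passed as one argument
        starred = pool.starmap(describe, tasks)    # each tuple unpacked into arguments
    return packed, starred


if __name__ == "__main__":
    print(run_demo())
```

With starmap the original five-parameter signature of file_reading could stay unchanged, provided every tuple supplies all five values.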

import glob
import multiprocessing
import sys


def file_reading(args):
    # Pool.map passes each item as a single argument, so unpack the tuple here.
    file, num_of_sample, segsites, positions, snp_matrix = args
    with open(file, buffering=2000009999) as f:
        ### I read file here. I am not putting that code here.
        try:
            assert len(snp_matrix) == len(positions)
            return positions, snp_matrix  ## return statement
        except AssertionError:
            print('length of snp matrix and length of position vector not the same.')
            sys.exit(1)


if __name__ == "__main__":
    segsites = []
    positions = []
    snp_matrix = []

    path_to_directory = '/dataset/example/'
    extension = '*.msOut'

    num_of_samples = 162
    filename = glob.glob(path_to_directory + extension)

    number_of_workers = 10

    x, y, z = [], [], []

    # One tuple of arguments per file; num_of_samples must be included.
    array_of_number_tuple = [(file, num_of_samples, segsites, positions, snp_matrix) for file in filename]
    with multiprocessing.Pool(number_of_workers) as p:
        # p.map returns a list with one (positions, snp_matrix) pair per file.
        results = p.map(file_reading, array_of_number_tuple)
    for pos, snp in results:
        x.extend(pos)
        y.extend(snp)
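One caveat about the list arguments: each worker process receives pickled copies of whatever is passed in, so appending to segsites, positions, or snp_matrix inside a worker does not change the lists in the parent process. Only the return value travels back. A minimal sketch of that pattern follows, assuming a hypothetical worker read_one that builds fresh lists per file (the real parsing is not shown in the question, so a placeholder stands in for it) and a hypothetical helper merge that combines the results in the parent.

```python
import multiprocessing


def read_one(path):
    # Hypothetical worker: builds its own lists instead of appending to
    # lists passed in from the parent. The real parsing is not shown in
    # the question, so these placeholder values just echo the path.
    positions = [path]
    snp_matrix = [path.upper()]
    return positions, snp_matrix


def merge(results):
    # Combine the per-file (positions, snp_matrix) pairs in the parent.
    all_positions, all_snps = [], []
    for positions, snp_matrix in results:
        all_positions.extend(positions)
        all_snps.extend(snp_matrix)
    return all_positions, all_snps


if __name__ == "__main__":
    paths = ["a.msOut", "b.msOut", "c.msOut"]  # hypothetical file names
    with multiprocessing.Pool(2) as pool:
        results = pool.map(read_one, paths)  # results keep the order of paths
    print(merge(results))
```

With this shape the worker needs only the file name (plus any read-only settings such as num_of_samples), which also avoids sending empty lists to every process.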