I am using Python's multiprocessing machinery to parallelize a task, but I've noticed that the execution time varies with the amount of data loaded into instance attributes. In the code below, self.pg_data_download_run() runs a function across multiple processes. The first run, with no data loaded into RAM, takes 2.6 seconds. However, the execution time then changes:
self.pg_data_download_run()
for symbol in self.tradeable_universe_symbol_list:
    self.data_day[symbol] = pd.read_csv(self.data_dir_day + symbol + '.csv')
self.pg_data_download_run()
After loading 20 dataframes (a RAM impact of about 10 MB), the same call takes 3.2 seconds.

The execution time appears to correlate with the size of the objects held in RAM, but I don't understand why such a small RAM footprint causes such a sharp drop in performance. I have 4 GB of free RAM on this machine. What is causing the slowdown?
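In case it is relevant: the callable passed to executor.map below is a bound method, so my assumption is that every task has to pickle self, including whatever is stored in self.data_day. A minimal check of the size of that payload (pickled_task_size is just an illustrative name, not part of my code):

import pickle

def pickled_task_size(self):
    # Approximate size in bytes of what ProcessPoolExecutor serializes for
    # each task: pickling the bound method drags the whole instance along,
    # self.data_day included.
    return len(pickle.dumps(self.multiprocessing_test))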
import concurrent.futures
import datetime as dt
import os
import time

import pandas as pd

def multiprocessing_test_run(self):
    print('multiprocess testing')
    iterations = list(range(200))
    tstart = dt.datetime.now()
    with concurrent.futures.ProcessPoolExecutor(max_workers=os.cpu_count() - 1) as executor:
        executor.map(self.multiprocessing_test, iterations)
    # Each task sleeps for half a second, so 0.5 * len(iterations) is the
    # single-threaded time; divide by the measured time to get the
    # multiprocessing speed-up.
    print('Duration (s): ', (dt.datetime.now() - tstart).total_seconds())

def multiprocessing_test(self, iteration):
    print("Iteration Number: ", str(iteration))
    time.sleep(0.5)

def multiprocessing_speed_test(self, data_dir):
    final_df = pd.DataFrame()
    self.multiprocessing_test_run()
    # Load ~200 CSVs into one DataFrame, attach it to the instance, then
    # repeat the timing run with the extra state on self.
    for file in os.listdir(data_dir)[:200]:
        final_df = pd.concat([final_df, pd.read_csv(data_dir + file)], axis=0)
    self.data_day = final_df
    self.multiprocessing_test_run()
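In case a standalone script is easier to experiment with, here is a self-contained sketch of the same test (the Repro class and the random DataFrame are mine; the DataFrame just stands in for the CSV data attached to self.data_day):

import concurrent.futures
import datetime as dt
import os
import time

import numpy as np
import pandas as pd


class Repro:
    def __init__(self, rows):
        # Instance state that every pickled task carries along.
        self.data_day = pd.DataFrame(np.random.rand(rows, 10))

    def task(self, iteration):
        time.sleep(0.5)
        return iteration

    def run(self):
        tstart = dt.datetime.now()
        with concurrent.futures.ProcessPoolExecutor(max_workers=os.cpu_count() - 1) as executor:
            # list() forces the map to complete before timing stops.
            list(executor.map(self.task, range(200)))
        print('Duration (s):', (dt.datetime.now() - tstart).total_seconds())


if __name__ == '__main__':
    Repro(rows=0).run()        # no instance state
    Repro(rows=100_000).run()  # roughly 8 MB of instance state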