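Once you have worked through the Python basics (see the collection "python生信笔记"), the next step is to consolidate them with plenty of hands-on data analysis. For this round I picked a dataset from a paper published on 12 December 2022 in Cancer Cell (IF=48.8), titled "High-resolution single-cell atlas reveals diversity and plasticity of tissue-resident neutrophils in non-small cell lung cancer".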
import scanpy as sc
import pandas as pd
from glob import glob
from pathlib import Path
import re
import scipy.sparse
from multiprocessing import Pool
import anndata
3.2 Reading the sample metadata
Another new thing learned here: reading an xlsx file in Python:
meta = pd.read_excel("tables/patient_table_batch1_3_patients.xlsx", engine="openpyxl", skiprows=1)
meta
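The `skiprows=1` argument is what lets the real header row be picked up when a sheet starts with a title line. Here is a minimal sketch of that behaviour. It uses `pd.read_csv` on in-memory text as a stand-in for the Excel file, on the assumption that `skiprows` behaves the same way in both readers. The table contents are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical table: one title line, then the real header, then the data.
raw = "Patient overview batch 1\nTumornummer,Patient,Alter\nT1,P1,60\nT2,P2,71\n"

# skiprows=1 drops the title line, so the second line becomes the header.
demo = pd.read_csv(io.StringIO(raw), skiprows=1)
print(demo.columns.tolist())
```

Without `skiprows`, the title line would be parsed as the header and the column names would be lost.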
Rename the column headers:
meta.rename(columns={"Tumornummer": "tumor_id", "Patient": "patient", "Alter": "age", "Geschlecht": "sex", "Tumor": "tumor_type"}, inplace=True)
meta["sex"] = [{"W": 'f', "M": 'm'}[x] for x in meta["sex"]]
meta
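The dict lookup inside the list comprehension raises a `KeyError` as soon as the column contains a code that is not in the dict. As a possible alternative (not what the original code does), `Series.map` leaves such values as `NaN`. A small sketch on a made-up table, where the code "X" stands for an unexpected value:

```python
import pandas as pd

# Hypothetical metadata with German column names, as in the table above.
demo = pd.DataFrame({"Patient": ["P1", "P2", "P3"], "Geschlecht": ["W", "M", "X"]})
demo = demo.rename(columns={"Patient": "patient", "Geschlecht": "sex"})

# Series.map turns codes that are missing from the dict into NaN,
# whereas {"W": "f", "M": "m"}[x] would raise KeyError for "X".
demo["sex"] = demo["sex"].map({"W": "f", "M": "m"})
print(demo)
```

Which behaviour is preferable depends on whether an unknown code should stop the analysis or be handled later as a missing value.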
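# Pool(16) creates a pool of 16 worker processes,
# i.e. up to 16 processes run at the same time
# map distributes the tasks across the workers and runs them in parallel
# adatas is the list holding the results returned by the workers
with Pool(16) as p:
    adatas = p.map(load_counts, filenames)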
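To try the `Pool.map` pattern without the real count files, here is a self-contained sketch. The worker `sample_name` and the file paths are hypothetical and only stand in for `load_counts` and `filenames`. The worker is defined at module level so that it can be sent to the worker processes.

```python
from multiprocessing import Pool
from pathlib import Path


def sample_name(path):
    """Hypothetical worker: derive a sample name from a file path."""
    return Path(path).name.split(".")[0]


def run_parallel(paths, processes=2):
    # Pool.map splits `paths` across the workers and returns the results
    # in the same order as the inputs.
    with Pool(processes) as pool:
        return pool.map(sample_name, paths)


if __name__ == "__main__":
    print(run_parallel(["data/P1_tumor.csv.gz", "data/P2_normal.csv.gz"]))
```

The `if __name__ == "__main__":` guard matters on platforms that start workers by spawning a fresh interpreter, because the script is imported again in each worker.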
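Nice! Another trick learned!
3.4 Creating the h5ad object
Concatenate the adatas from above into a single AnnData object (the object that is stored as an h5ad file):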
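adata = anndata.concat(adatas, index_unique="_", join="outer")
adata

# Tidy up the metadata: strip stray whitespace from the patient IDs
meta["patient"] = [x.strip() for x in meta["patient"]]
# Attach the patient-level metadata to every cell while keeping cell_id as the index
adata.obs = adata.obs.reset_index().merge(meta, on=["patient"], how="left").set_index("cell_id")
adata
adata.obs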
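The `reset_index().merge(...).set_index("cell_id")` chain is there because a column-based `merge` returns a fresh integer index, which would lose the cell IDs. A minimal pandas-only sketch on made-up tables (`obs` stands in for `adata.obs`, and all values are hypothetical):

```python
import pandas as pd

# Hypothetical per-cell table indexed by cell_id, and a per-patient table.
obs = pd.DataFrame(
    {"patient": ["P1", "P1", "P2"]},
    index=pd.Index(["c1", "c2", "c3"], name="cell_id"),
)
meta = pd.DataFrame({"patient": ["P1 ", " P2"], "age": [60, 71]})

# Strip stray whitespace so that the join keys actually match.
meta["patient"] = meta["patient"].str.strip()

# Move cell_id into a column before merging and restore it afterwards;
# how="left" keeps every cell, in the original order.
merged = obs.reset_index().merge(meta, on="patient", how="left").set_index("cell_id")
print(merged)
```

If the whitespace were not stripped, the left join would still keep all three cells, but their `age` values would be missing.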
The authors also adjusted a few details:
adata.obs.drop_duplicates()
adata.obs["condition"] = "NSCLC"
adata.obs["origin"] = ["tumor_primary" if c == "Tumor" else "normal_adjacent" for c in adata.obs["tissue"]]
adata.obs["sample"] = [f"{patient}_{origin}" for patient, origin in zip(adata.obs["patient"], adata.obs["origin"])]
adata.obs["sex"] = [{"m": "male", "f": "female"}[s] for s in adata.obs["sex"]]
adata.obs["tissue"] = "lung"
# Displays the unique rows only; adata.obs itself is not modified
adata.obs.drop_duplicates()
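Two points in this step are easy to miss: the `origin` and `sample` columns are derived row by row, and `drop_duplicates()` returns a new table rather than changing `adata.obs`. A small pandas-only sketch on made-up annotations (`obs` stands in for `adata.obs`):

```python
import pandas as pd

# Hypothetical per-cell annotations.
obs = pd.DataFrame(
    {"patient": ["P1", "P1", "P2"], "tissue": ["Tumor", "Tumor", "Normal"]}
)

obs["origin"] = [
    "tumor_primary" if t == "Tumor" else "normal_adjacent" for t in obs["tissue"]
]
# String concatenation on columns gives the same result as the f-string loop.
obs["sample"] = obs["patient"] + "_" + obs["origin"]

# drop_duplicates() returns a new frame; `obs` itself keeps all rows.
summary = obs.drop_duplicates()
print(len(obs), len(summary))
```

Used this way, `drop_duplicates()` works as a quick per-sample summary of the per-cell annotations.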