Is there a better way to create indexes for a document's headings and sub-headings in Python?

Sonu Gupta • 3 years ago • 1352 views

I have a list of the headings and sub-headings of a document.

test_list = ['heading', 'heading','sub-heading', 'sub-heading', 'heading', 'sub-heading', 'sub-sub-heading', 'sub-sub-heading', 'sub-heading', 'sub-heading', 'sub-sub-heading', 'sub-sub-heading','sub-sub-heading', 'heading']

I want to assign a unique index to each heading and sub-heading, like this:

seg_ids = ['1', '2', '2_1', '2_2', '3', '3_1', '3_1_1', '3_1_2', '3_2', '3_3', '3_3_1', '3_3_2', '3_3_3', '4']

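Here is the code I wrote to produce this result, but it is messy and limited to a depth of 3. If a document has deeper levels of sub-headings, the code becomes even more complicated. Is there a more Pythonic way to do this?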

seg_ids = []
for idx, an_ele in enumerate(test_list):
    
    head_id = 0
    subh_id = 0
    subsubh_id = 0
    if an_ele == 'heading' and idx == 0:  # if it is the first element 
        head_id = '1'
        seg_ids.append(head_id)
        
        
    else:
        last_seg_ids = seg_ids[idx-1].split('_')  # find the depth of the last element
        head_id = last_seg_ids[0]
        
        if len(last_seg_ids) == 2:  
            subh_id = last_seg_ids[1]
        elif len(last_seg_ids) == 3:
            subh_id = last_seg_ids[1]
            subsubh_id = last_seg_ids[2]
            
           
        if an_ele == 'heading':
            head_id= str(int(head_id)+1) 
            subh_id = 0  # reset sub_heading index 
            subsubh_id = 0 # reset sub_sub_heading index 

        elif an_ele == 'sub-heading':
            subh_id= str(int(subh_id)+1)
            subsubh_id = 0  # reset sub_sub_heading index 
        elif an_ele == 'sub-sub-heading':
            subsubh_id= str(int(subsubh_id)+1)
        else:
            print('ERROR')
            
        
        if subsubh_id==0:
            if subh_id !=0:
                seg_ids.append(head_id+'_'+subh_id)
                
            else:
                seg_ids.append(head_id)
                
        if subsubh_id !=0:
            seg_ids.append(str(head_id)+'_'+str(subh_id)+'_'+str(subsubh_id))
            
          
            
print(seg_ids)

Post URL: http://www.python88.com/topic/130161


Replies [ 2 ] | Latest reply 3 years ago

• Reply 1

Lolrenz Omega 3 years ago

You can use split('-') to find the level of a heading:

subs_amount = an_ele.split('-')

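You can infer the level of a heading from the length of the subs_amount list: a length of 1 means a "heading", a length of 3 means a "sub-sub-heading", and so on. Then, as Tim Roberts said in the comments, keep a list store_levels that holds the current index of each level, from the top level down to the current heading: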

depth = len(subs_amount)
if depth > len(store_levels):
    store_levels.append(1)  # one level deeper: start a new counter
else:
    del store_levels[depth:]  # back at a higher level: drop the deeper counters (no-op at the same level)
    store_levels[-1] += 1  # next heading at this level

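Now, to build the output, you just need "_".join(str(n) for n in store_levels) (the counters are integers, so they have to be converted to strings first) and append the result to the output list.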

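Sorry for not using the same variable names as you. I did that so as not to confuse them or change their purpose. I hope my code is clear enough for you to implement it.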
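For reference, the pieces of this answer can be put together into one function. This is only a sketch: the name index_headings is made up for illustration, and it assumes that every label is 'heading' prefixed by zero or more 'sub-' parts and that levels are never skipped (a 'sub-sub-heading' directly after a 'heading' would get a two-part id).

```python
def index_headings(labels):
    """Return ids such as '3_1_2' for labels such as 'sub-sub-heading'."""
    store_levels = []  # one counter per currently open level
    seg_ids = []
    for label in labels:
        # 'heading' -> 1, 'sub-heading' -> 2, 'sub-sub-heading' -> 3, ...
        depth = len(label.split('-'))
        if depth > len(store_levels):
            # Going one level deeper (assumption: levels are never skipped).
            store_levels.append(1)
        else:
            # Same level or back up: drop deeper counters, then advance.
            del store_levels[depth:]
            store_levels[-1] += 1
        seg_ids.append('_'.join(map(str, store_levels)))
    return seg_ids


test_list = ['heading', 'heading', 'sub-heading', 'sub-heading', 'heading',
             'sub-heading', 'sub-sub-heading', 'sub-sub-heading',
             'sub-heading', 'sub-heading', 'sub-sub-heading',
             'sub-sub-heading', 'sub-sub-heading', 'heading']
print(index_headings(test_list))
```

Because store_levels only ever holds the counters of the currently open levels, the depth is not limited to 3.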

• Reply 2

Tim Roberts 3 years ago

def get_level(s):
    return s.count('-')

def translate(test_list):
    seg_ids = []
    levels = [0]*9
    last_level = 99
    for an_ele in test_list:
        level = get_level(an_ele)
        if level <= last_level:
            levels[level] += 1
        else:
            levels[level] = 1
        seg_ids.append( '_'.join(str(k) for k in levels[:level+1]))
        last_level = level
    return seg_ids

print(translate(['heading', 'heading','sub-heading', 'sub-heading', 'heading', 'sub-heading', 'sub-sub-heading', 'sub-sub-heading', 'sub-heading', 'sub-heading', 'sub-sub-heading', 'sub-sub-heading','sub-sub-heading', 'heading']))

Output:

['1', '2', '2_1', '2_2', '3', '3_1', '3_1_1', '3_1_2', '3_2', '3_3', '3_3_1', '3_3_2', '3_3_3', '4']

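This fixes the maximum number of levels at 9. You could avoid that by starting with levels = [0] and extending the list whenever a new level runs past its end, but you get the idea.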
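A possible version of that variant is sketched below. The name translate_unbounded is hypothetical, and using None instead of 99 as the initial last_level is a small change of mine; the rest follows the function above.

```python
def translate_unbounded(test_list):
    seg_ids = []
    levels = [0]       # grows on demand instead of being fixed at 9 slots
    last_level = None  # no heading seen yet
    for an_ele in test_list:
        level = an_ele.count('-')  # 'heading' -> 0, 'sub-heading' -> 1, ...
        # Extend the list whenever the new level runs past its end.
        while level >= len(levels):
            levels.append(0)
        if last_level is None or level <= last_level:
            levels[level] += 1  # same level or back up: advance the counter
        else:
            levels[level] = 1   # deeper level: restart its counter
        seg_ids.append('_'.join(str(k) for k in levels[:level + 1]))
        last_level = level
    return seg_ids


# Twelve nested levels, more than the fixed limit of 9.
deep = ['sub-' * i + 'heading' for i in range(12)]
print(translate_unbounded(deep)[-1])
```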
