Python: Sogou Lexicons, a Must-Have for Text Analysis


Author: Liang Shuzhen (梁淑珍), Huaqiao University
Email: 13514084150@163.com


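1. Introduction
2. What a dictionary can do
3. Downloading the Sogou lexicons

3.1 Collecting the 12 category page links
3.2 Scraping all lexicon names and download links
3.3 Downloading the cell lexicons

4. Converting scel cell-lexicon files
5. Related posts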

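1. Introduction

jieba is one of the most widely used tools for Chinese word segmentation, but its built-in dictionary is far from complete. In practice, users often need to add domain-specific dictionaries to improve segmentation accuracy. The Sogou cell lexicons are an important source of such external dictionaries: the site offers nearly 6,000 cell lexicons in 12 categories. This post walks through scraping and converting the Sogou lexicons in detail and provides the converted lexicons as text files (TXT format), which readers can download via the 「搜狗词库」 link.

2. What a dictionary can do

This section briefly shows how to load a dictionary. The code below compares jieba's segmentation result using its built-in dictionary with the result after loading an external dictionary.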

# Install the jieba package
!pip install jieba

Segmenting with jieba's built-in dictionary:

# Import libraries
import jieba
import re
import os

# Set the working directory
os.chdir(r'D:\分词') # adjust to your own folder
os.getcwd()

# Load the stop-word list
# Stop-word source: https://github.com/goto456/stopwords
with open(r'cn_stopwords.txt','r',encoding='utf8') as f:
    stopwords=f.readlines()
    stop_list=[w.strip() for w in stopwords]

# Return False for tokens that are numbers (integers, decimals, percentages)
def not_digit(w):
    w = w.replace(',', '')
    if re.match(r'\d+', w) is not None or re.match(r'\d%', w) is not None or re.match(
            r'\d*\.\d+', w) is not None:
        return False
    else:
        return True

# A passage taken from an annual report
text='''公司实现营业收入3,156,955.46万元，发生营业成本2,542,539.55万元，\
        发生销售费用、管理费用及财务费用合计362,273.65万元，            \
        主要是报告期内职工薪酬、固定资产折旧、无形资产摊销增加等综合影响。\
        公司发生研发投入37,251.33万元，同比增长18.12%，                 \
        主要是本期内对已有研发项目持续投入及新增研发项目使研发投入同比增加。'''

# Segment with jieba, then drop stop words, empty tokens, and numbers
word_cut=jieba.cut(text)
word_result=[w.strip() for w in word_cut if w not in stop_list and len(w.strip())>0 and not_digit(w)]
print(word_result)

The segmentation result is:

['公司', '实现', '营业', '收入', '万元', '发生', '营业', '成本',
 '万元', '发生', '销售费用', '管理费用', '财务费用', '合计', '万元', 
 '主要', '报告', '期内', '职工', '薪酬', '固定资产', '折旧', '无形资产', 
 '摊销', '增加', '综合', '影响', '公司', '发生', '研发', '投入', '万元', 
 '同比', '增长', '主要', '本期', '已有', '研发', '项目', '持续', '投入', 
 '新增', '研发', '项目', '研发', '投入', '同比增加']

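The result shows that jieba's built-in dictionary fails to recognize some of the technical terms in the annual report, such as “营业收入” (operating revenue), “固定资产折旧” (depreciation of fixed assets), and “无形资产摊销” (amortization of intangible assets). Segmentation accuracy affects any downstream analysis. Next, we load the “财会词汇大全” (accounting vocabulary) lexicon from the Sogou collection as an example and check whether the segmentation improves.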

# Load the external dictionary; pass the path to the dictionary file
jieba.load_userdict(r'财务会计 财会词汇大全【官方推荐】.txt')

# Segment again
word_cut=jieba.cut(text)
word_result=[w.strip() for w in word_cut if w not in stop_list and len(w.strip())>0 and not_digit(w)]
print(word_result)

The segmentation result is:

['公司', '实现', '营业收入', '万元', '发生', '营业成本', '万元', '发生', 
'销售费用', '管理费用', '财务费用', '合计', '万元', '主要', '报告', '期内', 
'职工', '薪酬', '固定资产折旧', '无形资产摊销', '增加', '综合', '影响', '公司', 
'发生', '研发投入', '万元', '同比', '增长', '主要', '本期', '已有', '研发', 
'项目', '持续', '投入', '新增', '研发', '项目', '研发投入', '同比增加']

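The result shows that terms such as “营业收入”, “营业成本”, “研发投入”, and “固定资产折旧” are now recognized, while terms such as “职工薪酬” (employee compensation) are still split because they are not in the loaded dictionary. In that case there are two options: 1) load additional relevant dictionaries, or 2) add the words to the dictionary manually. Overall, loading a relevant lexicon improves segmentation accuracy.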
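For option 2, the missing words can be collected in a small user dictionary file. The sketch below uses only the standard library and assumes jieba's user-dictionary layout of one entry per line (word, optionally followed by a frequency and a part-of-speech tag, separated by spaces). The helper name `write_user_dict`, the word list, and the file name are hypothetical.

```python
import os
import tempfile
from pathlib import Path

def write_user_dict(words, path):
    """Write one word per line, skipping blanks and duplicates (order kept)."""
    seen = set()
    lines = []
    for word in words:
        word = word.strip()
        if word and word not in seen:
            seen.add(word)
            lines.append(word)
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines

# Hypothetical words that the loaded lexicon did not cover
extra_words = ["职工薪酬", "研发项目", "职工薪酬", " "]

# A temporary folder keeps the demo self-contained; use a real path in practice
with tempfile.TemporaryDirectory() as tmp:
    dict_path = os.path.join(tmp, "my_finance_words.txt")
    written = write_user_dict(extra_words, dict_path)
    content = Path(dict_path).read_text(encoding="utf-8")

print(written)
```

A file written this way can be loaded with `jieba.load_userdict(path)`, exactly like the Sogou lexicon above. For a handful of words, `jieba.add_word('职工薪酬')` adds an entry at run time without any file.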

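3. Downloading the Sogou lexicons

Target URL: https://pinyin.sogou.com/dict/

The page lists 12 lexicon categories containing nearly 6,000 cell lexicons. Right-clicking and choosing "View page source" shows that the site is a static page, so everything we need can be parsed from the returned HTML. The scraping proceeds in three steps: 1) collect the page links of the 12 categories; 2) get the number of pages in each category and loop over the pages to collect lexicon names and download links; 3) visit each download link and download the scel file.

3.1 Collecting the 12 category page links

Use XPath Helper to locate the XPath of the category links. The expressions used in the code below retrieve all of them: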

# Import libraries
import requests
import time
from lxml import etree
import re
import os
import pandas as pd

# Add request headers
headers={
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36')
}
url="https://pinyin.sogou.com/dict/"

# Request the page
html=requests.get(url,headers=headers)
html=html.text

# Parse the page content
tree=etree.HTML(html)

# XPath for the category names (some class attributes carry a trailing space)
first_name_xpath=("//div[@class='dict_category_list_title']/a/text()"
                  "|//div[@class='dict_category_list_title ']/a/text()")
first_name_list=tree.xpath(first_name_xpath)

# XPath for the category page links
first_href_xpath=("//div[@class='dict_category_list_title']/a/@href"
                  "|//div[@class='dict_category_list_title ']/a/@href")
first_href_list=tree.xpath(first_href_xpath)

# Print the first two links
print(first_href_list[:2])

The scraped results are relative paths rather than complete URLs, so they need further processing.

['/dict/cate/index/167?rf=dictindex&pos=dict_rcmd', '/dict/cate/index/436?rf=dictindex']

Take the second category, 「电子游戏」 (video games), as an example. The scraped link is /dict/cate/index/436?rf=dictindex.

URL of page 1: https://pinyin.sogou.com/dict/cate/index/436/default/
URL of page 2: https://pinyin.sogou.com/dict/cate/index/436/default/2
URL of page 3: https://pinyin.sogou.com/dict/cate/index/436/default/3

The URL therefore has the structure 'https://pinyin.sogou.com' + '/dict/cate/index/436' + '/default/' + page number. We need to replace the ? and everything after it in the scraped link with /default/, which a regular expression handles:

# A raw string keeps the backslash in the pattern from being treated as a string escape
new_href_list=['https://pinyin.sogou.com'+re.sub(r"\?.*","/default/",href) for href in first_href_list]
print(new_href_list[:2])

The result is:

['https://pinyin.sogou.com/dict/cate/index/167/default/',
 'https://pinyin.sogou.com/dict/cate/index/436/default/']

Appending a page number to these URLs gives pages that can be requested directly.
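The same URL construction can also be sketched with `urllib.parse` instead of a regular expression. The helper name `page_url` and the constant `BASE` are illustrative, not part of the original code; the hrefs are the two printed earlier.

```python
from urllib.parse import urljoin, urlsplit

BASE = "https://pinyin.sogou.com"

def page_url(href, page=None):
    """Turn a relative category href into an absolute listing URL."""
    # urlsplit separates the path from the '?rf=...' query string
    path = urlsplit(href).path.rstrip("/")
    url = urljoin(BASE, path + "/default/")
    return url if page is None else url + str(page)

first = page_url("/dict/cate/index/167?rf=dictindex&pos=dict_rcmd")
second_page2 = page_url("/dict/cate/index/436?rf=dictindex", 2)
print(first)
print(second_page2)
```

Splitting the URL into components avoids depending on where the `?` sits in the string, at the cost of a few more lines than the one-line `re.sub`.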

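3.2 Scraping all lexicon names and download links

Next, we scrape all the cell lexicons under 「电子游戏」. Because the number of cell lexicons differs across categories, the total number of pages has to be retrieved for each category separately.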

basic_url=new_href_list[1]
url=basic_url+'1'
html=requests.get(url,headers=headers)
html=html.text
tree=etree.HTML(html)

# XPath for the total number of pages (the last pagination item)
page_num_xpath="//li[6]/span/a/text()"
page_num=int(tree.xpath(page_num_xpath)[0])
print(page_num) # prints 104, which matches the actual number of pages
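The XPath `//li[6]/span/a` depends on the position of the last pagination item. As a hedged alternative, the sketch below takes the largest page number found in the pagination links. It assumes those links end in `/default/<page number>`, like the page URLs shown in section 3.1; the HTML fragment is made up for illustration and is not copied from the site.

```python
import re

def last_page(html, default=1):
    """Return the largest page number found in '/default/<n>' links."""
    pages = [int(n) for n in re.findall(r"/default/(\d+)", html)]
    return max(pages, default=default)

# Hypothetical pagination fragment
sample_html = """
<li><span><a href="/dict/cate/index/436/default/2">2</a></span></li>
<li><span><a href="/dict/cate/index/436/default/3">3</a></span></li>
<li><span><a href="/dict/cate/index/436/default/104">104</a></span></li>
"""

total_pages = last_page(sample_html)
print(total_pages)
```

If a category has a single page and no pagination links at all, the function falls back to `default` instead of raising an IndexError.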

Loop over the pages:

# Empty lists for lexicon names and download links
all_name=[]
all_download=[]

# Loop over the pages; range() is needed here, a bare tuple would only visit two pages
for page in range(1,page_num+1):
    # Build the page URL
    url=basic_url+str(page)
    html=requests.get(url,headers=headers)
    html=html.text
    tree=etree.HTML(html)

    # XPath for the lexicon names
    name_xpath="//div[@class='detail_title']/a/text()"
    name_list=tree.xpath(name_xpath)

    # XPath for the download links
    download_xpath="//div[@class='dict_dl_btn']/a/@href"
    download_list=tree.xpath(download_xpath)

    all_name.extend(name_list)
    all_download.extend(download_list)

# Print the results
for name,download_link in zip(all_name,all_download):
    print(name+'\n'+download_link)

The output is:

《剑网3》官方词库大全
https://pinyin.sogou.com/dict/download_cell.php?id=54030&name=%E3%80%8A%E5%89%91%E7%BD%913%E3%80%8B%E5%AE%98%E6%96%B9%E8%AF%8D%E5%BA%93%E5%A4%A7%E5%85%A8
梦幻西游【官方推荐】
https://pinyin.sogou.com/dict/download_cell.php?id=15235&name=%E6%A2%A6%E5%B9%BB%E8%A5%BF%E6%B8%B8%E3%80%90%E5%AE%98%E6%96%B9%E6%8E%A8%E8%8D%90%E3%80%91
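Each download link carries the lexicon id and its URL-encoded name as query parameters. A minimal sketch of decoding them with the standard library follows; the helper name `parse_download_link` is illustrative, and the link is the first one in the output above.

```python
from urllib.parse import parse_qs, urlsplit

def parse_download_link(link):
    """Return (id, name) from a download_cell.php link."""
    # parse_qs percent-decodes the values as UTF-8 by default
    query = parse_qs(urlsplit(link).query)
    return query["id"][0], query["name"][0]

link = ("https://pinyin.sogou.com/dict/download_cell.php?id=54030&"
        "name=%E3%80%8A%E5%89%91%E7%BD%913%E3%80%8B%E5%AE%98%E6%96%B9"
        "%E8%AF%8D%E5%BA%93%E5%A4%A7%E5%85%A8")

cell_id, cell_name = parse_download_link(link)
print(cell_id, cell_name)
```

The decoded name matches the lexicon title scraped from the page, which gives a quick consistency check between the two lists.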

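Pasting a scraped link into the browser's address bar downloads the scel file directly. We then loop over all 12 categories to collect the names and download links of every lexicon.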

# Initialize empty lists
all_name=[]
all_download=[]

# Loop over the category links
for i in range(len(new_href_list)):
    basic_url=new_href_list[i]

    # Request the first page to get the total number of pages
    url=basic_url+'1'
    # The print statements show progress
    print(f'Start: {first_name_list[i]}')
    html=requests.get(url,headers=headers)
    html=html.text
    tree=etree.HTML(html)
    page_num_xpath="//li[6]/span/a/text()"
    page_num=int(tree.xpath(page_num_xpath)[0])

    # Loop over the pages
    for page in range(1,page_num+1):
        url=basic_url+str(page)
        print(f'Page {page}')
        # Retry on timeouts and connection errors instead of crashing
        sourcecode=True
        while sourcecode:
            try:
                html=requests.get(url,headers=headers,timeout=10)
                sourcecode=False
            except requests.exceptions.RequestException:
                time.sleep(10)
        # Reuse the response from the retry loop; no second request is needed
        html=html.text
        tree=etree.HTML(html)
        name_xpath="//div[@class='detail_title']/a/text()"
        # Some names contain characters that are not allowed in file names; remove them.
        # Assumed character set: the characters Windows forbids in file names.
        name_list=[re.sub(r'[\\/:*?"<>|]',"",name) for name in tree.xpath(name_xpath)]
        download_xpath="//div[@class='dict_dl_btn']/a/@href"
        download_list=tree.xpath(download_xpath)
        all_name.extend(name_list)
        all_download.extend(download_list)
    print(f'{first_name_list[i]} done')

With the code above we have collected all lexicon names and download links.
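The retry loop above keeps trying forever if a page never responds. A bounded variant can be sketched as follows. The helper `fetch_with_retry` and its parameters are hypothetical; the flaky function only simulates timeouts so that the example runs without network access.

```python
import time

def fetch_with_retry(fetch, retries=3, wait=10.0, sleep=time.sleep):
    """Call fetch() until it succeeds or `retries` attempts are used up."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch()
        except Exception as err:  # with requests: requests.exceptions.RequestException
            last_error = err
            if attempt < retries:
                sleep(wait)  # pause before the next attempt
    raise last_error

# Simulated request that fails twice and then succeeds
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"

result = fetch_with_retry(flaky, retries=5, wait=0, sleep=lambda s: None)
print(result, calls["n"])
```

In the scraper, the call would look like `fetch_with_retry(lambda: requests.get(url, headers=headers, timeout=10))`, so a dead link raises after a few attempts instead of blocking the whole run.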

3.3 Downloading the cell lexicons

Loop over the download links and save each scel file under its lexicon name.

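# Folder where the files are stored
basic_storepath=r'D:\分词'
for i in range(len(all_name)):
    # os.path.join inserts the path separator between folder and file name
    storepath=os.path.join(basic_storepath,all_name[i])
    down_link=all_download[i]

    # Retry on timeouts and connection errors instead of crashing
    sourcecode=True
    while sourcecode:
        try:
            r=requests.get(down_link,headers=headers,timeout=10)
            sourcecode=False
        except requests.exceptions.RequestException:
            time.sleep(10)

    # Write the response from the retry loop; no second request is needed
    with open(f"{storepath}.scel",'wb') as f:
        f.write(r.content)
    print(f'Lexicon {i} ({all_name[i]}) downloaded')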

Once the code finishes, the scel cell-lexicon files are saved in the local folder. There are currently 5,771 cell lexicons.
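Building the target path is easy to get wrong, because the lexicon name has to be both cleaned and joined to the folder with a separator. The sketch below shows one way to do both. The helper name `safe_scel_path`, the folder, and the sample name are hypothetical, and the set of removed characters is the one Windows forbids in file names.

```python
import os
import re

def safe_scel_path(folder, name):
    """Build '<folder>/<cleaned name>.scel' for a lexicon name."""
    # Remove \ / : * ? " < > | and surrounding whitespace
    cleaned = re.sub(r'[\\/:*?"<>|]', "", name).strip()
    return os.path.join(folder, cleaned + ".scel")

target = safe_scel_path("lexicons", 'DNF/地下城与勇士: "词库"?')
print(target)
```

Using `os.path.join` rather than string concatenation means the file lands inside the folder whether or not the folder path ends with a separator.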

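4. Converting scel cell-lexicon files

A dictionary loaded into jieba must be a plain-text file (.txt here) encoded in UTF-8. The scel files can be converted in batch with Python, and readers can run the ready-made code below. It is adapted from https://blog.csdn.net/gsch_12/article/details/82083474 with minor modifications.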

# Import libraries
import struct
import re
import os

# Basic variables
startPy = 0x1540       # offset of the pinyin table
startChinese = 0x2628  # offset of the Chinese word table
GPy_Table = {}         # pinyin index -> pinyin string

# Decode bytes two at a time into a string
def byte2str(data):
    i = 0
    length = len(data)
    ret = u''
    while i < length:
        x = data[i:i+2]
        t = chr(struct.unpack('H', x)[0])
        if t == u'\r':
            ret += u'\n'
        elif t != u' ':
            ret += t
        i += 2
    return ret

# Read the pinyin table into GPy_Table
def getPyTable(data):
    if data[0:4] != bytes(map(ord,"\x9D\x01\x00\x00")):
        return None
    data = data[4:]
    pos = 0
    length = len(data)
    while pos < length:
        index = struct.unpack('H', data[pos:pos +2])[0]
        pos += 2
        l = struct.unpack('H', data[pos:pos + 2])[0]
        pos += 2
        py = byte2str(data[pos:pos + l])
        GPy_Table[index] = py
        pos += l

# Look up the pinyin of one word
def getWordPy(data):
    pos = 0
    length = len(data)
    ret = u''
    while pos < length:
        index = struct.unpack('H', data[pos:pos + 2])[0]
        ret += GPy_Table[index]
        pos += 2
    return ret

def getWord(data):
    pos = 0
    length = len(data)
    ret = u''
    while pos < length:
        index = struct.unpack('H', data[pos:pos +2])[0]
        ret += GPy_Table[index]
        pos += 2
    return ret

# Read the word table and append (count, pinyin, word) to the global GTable
def getChinese(data):
    pos = 0
    length = len(data)
    while pos < length:
        # Number of words that share this pinyin
        same = struct.unpack('H', data[pos:pos + 2])[0]
        pos += 2
        py_table_len = struct.unpack('H', data[pos:pos + 2])[0]
        pos += 2
        py = getWordPy(data[pos: pos + py_table_len])
        pos += py_table_len
        for i in range(same):
            c_len = struct.unpack('H', data[pos:pos +2])[0]
            pos += 2
            word = byte2str(data[pos: pos + c_len])
            pos += c_len
            ext_len = struct.unpack('H', data[pos:pos +2])[0]
            pos += 2
            count = struct.unpack('H', data[pos:pos +2])[0]
            GTable.append((count, py, word))
            pos += ext_len

# Parse one scel file
def deal(file_name):
    print('-' * 60)
    with open(file_name, 'rb') as f:
        data = f.read()
    # Header check kept from the original code; a mismatch is ignored
    if data[0:12] != bytes(map(ord,"\x40\x15\x00\x00\x44\x43\x53\x01\x01\x00\x00\x00")):
        pass
    getPyTable(data[startPy:startChinese])
    getChinese(data[startChinese:])

def txt_dict(txt):
    txts = txt.copy()
    for i in range(len(txt)):
        tr = txt[0][i]
        m = re.search(r' ',tr)
        txts[0][i]=tr[m.start()+1:]
    return txts

The code above can be copied as is. Readers only need to set the file paths in the code below:

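# Folder containing the scel cell-lexicon files
path=r'D:\分词'
os.chdir(path)
# Keep only the scel files; the folder may also hold other files
files=[name for name in os.listdir(path) if name.endswith('.scel')]

# Loop over all files
for i in range(len(files)):
    try:
        # getChinese() appends to this global list
        GTable=[]
        deal(files[i])
        # Path of the converted txt file (the target folder must already exist)
        with open(f'F:\\搜狗词库txt\\{files[i][:-5]}.txt','w',encoding='utf8') as f:
            for word in GTable:
                f.write(word[2]+'\n')
        print(f"File {i} converted")
    except Exception:
        print(f'File {i} could not be converted')
        continue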

Running the code above converted 5,756 files successfully.
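The core of `byte2str` is reading the data two bytes at a time and turning each 16-bit value into a character. For the Chinese characters involved here this amounts to decoding UTF-16 little-endian text. The sketch below illustrates that step on bytes produced in memory rather than on a real scel file; the function name `decode_units` is illustrative, and the explicit `'<H'` format is an assumption that the data is little-endian (the converter's plain `'H'` uses the machine's native byte order).

```python
import struct

def decode_units(data):
    """Decode bytes two at a time as little-endian 16-bit code units."""
    chars = []
    for i in range(0, len(data) - 1, 2):
        (code,) = struct.unpack("<H", data[i:i + 2])
        chars.append(chr(code))
    return "".join(chars)

raw = "营业收入".encode("utf-16-le")  # 4 characters -> 8 bytes
decoded = decode_units(raw)
print(decoded)
print(decoded == raw.decode("utf-16-le"))
```

Unlike this sketch, `byte2str` in the converter also turns carriage returns into newlines and drops spaces.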

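5. Related posts

Note: the list of posts below was generated with the Stata command:
lianxh python, m
To install the latest version of the lianxh command:
ssc install lianxh, replace

Topic: Data sharing

Python+Stata: How to obtain historical weather data for China

Topic: Getting started with Stata

Using Jupyter Notebook with Stata\Python\Julia\R

Topic: Text analysis and web scraping

Python: Computing the cosine similarity of Management Discussion and Analysis sections
Stata+Python: Scraping the list of stocks hitting all-time highs
Python: Scraping Eastmoney Guba (东方财富股吧) comments for sentiment analysis
Python scraping: Research hotspots and topics in 《经济研究》

Topic: Python-R-Matlab

Python: Jaccard similarity and distance
Python: Multiprocessing, multithreading, and their use in web scraping
Python: Scraping dynamic websites
Scraping static websites with Python: historical weather as an example
Python: Scraping announcements from CNINFO (巨潮网)
司继春: Advice and resources for learning Python
Python: Scraping listed-company announcements-Wind-CSMAR
Python: Scraping the annual-report inquiry letters of the Shanghai and Shenzhen stock exchanges in 6 hours
Python: Using regular expressions to locate and extract content from text
Python: Batch-downloading PDF papers from CNKI (中国知网)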
