New Book Recommendation | Text as Data: A New Framework for Machine Learning and the Social Sciences

Text as Data:

A New Framework for Machine Learning and the Social Sciences

1 About the Book

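Text as Data: A New Framework for Machine Learning and the Social Sciences was published by Princeton University Press in March 2022. From social media posts and text messages to digital government documents and archives, researchers are inundated with text that reflects the social world. This textual data offers unprecedented insight into fundamental questions in the social sciences, the humanities, and industry. At the same time, new machine learning tools are rapidly transforming how science and business are conducted. Text as Data shows how to combine new sources of data, machine learning tools, and social science research design to develop and evaluate new insights.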

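Text as Data is organized around the core tasks of research projects that use text: representation, discovery, measurement, prediction, and causal inference. The authors offer a sequential, iterative, and inductive approach to research design. Each research task is presented with real-world applications, example methods, and a distinct, task-focused style of research.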

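Text as Data bridges many divides: computer science and social science, the qualitative and the quantitative, and industry and academia. It is an ideal resource for anyone who wants to analyze large collections of text in an era when data is abundant and computation is cheap, but the enduring challenges of social science remain.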

2 About the Authors

Justin Grimmer

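Professor of political science at Stanford University and

senior fellow at the Hoover Institution.

His research focuses on Congress, representation, and political methodology.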

Margaret E. Roberts

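Associate professor of political science at the University of California, San Diego, and a member of the Halıcıoğlu Data Science Institute.

Her research lies at the intersection of political methodology and the politics of information.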

Brandon M. Stewart

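Associate professor of sociology at Princeton University.

He develops new quantitative statistical methods for applications across the social sciences, with a focus on building tools that facilitate automated text analysis and the modeling of complex heterogeneity in regression models.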

3 Table of Contents

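Chapter 1 Introduction

Chapter 2 Social Science Research and Text Analysis

Chapter 3 Principles of Selection and Representation

Chapter 4 Selecting Documents

Chapter 5 Bag of Words

Chapter 6 The Multinomial Language Model

Chapter 7 The Vector Space Model and Similarity Metrics

Chapter 8 Distributed Representations of Words

Chapter 9 Representations from Language Sequences

Chapter 10 Principles of Discovery

Chapter 11 Discriminating Words

Chapter 12 Clustering

Chapter 13 Topic Models

Chapter 14 Low-Dimensional Document Embeddings

Chapter 15 Principles of Measurement

Chapter 16 Word Counting

Chapter 17 An Overview of Supervised Classification

Chapter 18 Coding a Training Set

Chapter 19 Classifying Documents with Supervised Learning

Chapter 20 Checking Performance

Chapter 21 Repurposing Discovery Methods

Chapter 22 Principles of Inference

Chapter 23 Prediction

Chapter 24 Causal Inference

Chapter 25 Text as Outcome

Chapter 26 Text as Treatment

Chapter 27 Text as Confounder

Chapter 28 Conclusion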

4 Excerpt

This is a book about the use of texts and language to make inferences about human behavior. Our framework for using text as data is aimed at a wide variety of audiences—from informing social science research, offering guidance for researchers in the digital humanities, providing solutions to problems in industry, and addressing issues faced in government. This book is relevant to such a wide range of scholars and practitioners because language is an important component of social interaction—it is how laws are recorded, religious beliefs articulated, and historical events reported. Language is also how individuals voice complaints to representatives, organizers appeal to their fellow citizens to join in protest, and advertisers persuade consumers to buy their product. And yet, quantitative social science research has made surprisingly little use of texts—until recently.

Texts were used sparingly because they were cumbersome to work with at scale. It was difficult to acquire documents because there was no clear way to collect and transcribe all the things people had written and said. Even if the texts could be acquired, it was impossibly time consuming to read collections of documents filled with billions of words. And even if the reading were possible, it was often perceived to be an impossible task to organize the texts into relevant categories, or to measure the presence of concepts of interest. Not surprisingly, texts did not play a central role in the evidence base of the social sciences. And when texts were used, the usage was either in small datasets or as the product of massive, well-funded teams of researchers.

Recently, there has been a dramatic change in the cost of analyzing large collections of text. Social scientists, digital humanities scholars, and industry professionals are now routinely making use of document collections. It has become common to see papers that use millions of social media messages, billions of words, and collections of books larger than the world’s largest physical libraries. Part of this change has been technological. With the rapid expansion of the internet, texts became much easier to acquire. At the same time, computational power increased—laptop computers could handle computations that previously would require servers. And part of the change was also methodological. A burgeoning literature—first in computer science and computational linguistics, and later in the social sciences and digital humanities—developed tools, models, and software that facilitated the analysis and organization of texts at scale.

However, the knowledge transfer from computer science and related fields has created confusion in how text as data models are applied, how they are validated, and how their output is interpreted. This confusion emerges because tasks in academic computer science are different than the tasks in social science, the digital humanities, and even parts of industry. While computer scientists are often (but not exclusively!) interested in information retrieval, recommendation systems, and benchmark linguistic tasks, a different community is interested in using “text as data” to learn about previously studied phenomena such as in social science, literature, and history. Despite these differences of purpose, text as data practitioners have tended to reflexively adopt the guidance from the computer science literature when doing their own work. This blind importing of the default methods and practices used to select, evaluate, and validate models from the computer science literature can lead to unintended consequences.

This book will demonstrate how to treat “text as data” for social science tasks and social science problems. We think this perspective can be useful beyond just the social sciences in the digital humanities, industry, and even mainstream computer science. We organize our argument around the core tasks of social science research: discovery, measurement, prediction, and causal inference. Discovery is the process of creating new conceptualizations or ways to organize the world. Measurement is the process where concepts are connected to data, allowing us to describe the prevalence of those concepts in the real world. These measures are then used to make a causal inference about the effect of some intervention or to predict values in the future.

5 Praise for the Book

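“This is the definitive guide for social scientists who want to work with text-based data. Written by pioneers in the field, Text as Data offers a comprehensive overview of the state of the art. But the authors do not stop there: they put forward a fresh agenda for social science research, showing how algorithms can augment our ability to develop theories of human behavior rather than trying to replace us.”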

——Chris Bail,

author of Breaking the Social Media Prism

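“Text as Data is a long-awaited book written by a team of leading methodologists. The explosion of text data has given us an unprecedented opportunity to learn about human behavior and society at scale. With this authoritative book, Grimmer, Roberts, and Stewart lay the foundation of text analysis for students and researchers.”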

——Kosuke Imai,

author of Quantitative Social Science

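“This book provides a clear and comprehensive introduction to the key computational techniques for analyzing text data. The technical material is explained in the context of a broader research philosophy that will enable exciting new applications in computational social science, the digital humanities, and commercial data science. I strongly recommend it.”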

——Jacob Eisenstein,

author of Introduction to Natural Language Processing

The content above is drawn from:

https://press.princeton.edu/books/paperback/9780691207551/text-as-data

Translated and compiled by the editors of Digital Humanities News (数字人文资讯)

Do not reproduce without permission

Compiled and translated by丨韩竺媛

Proofread by丨刘佳臻

Layout by丨付靖宜