However, the transfer of knowledge from computer science and related fields has created confusion about how text-as-data models are applied, how they are validated, and how their output is interpreted. This confusion arises because the tasks of academic computer science differ from those of social science, the digital humanities, and even parts of industry. While computer scientists are often (but not exclusively!) interested in information retrieval, recommendation systems, and benchmark linguistic tasks, a different community is interested in using “text as data” to learn about previously studied phenomena in fields such as social science, literature, and history. Despite these differences of purpose, text-as-data practitioners have tended to reflexively adopt the guidance of the computer science literature in their own work. Blindly importing the default methods and practices that this literature uses to select, evaluate, and validate models can lead to unintended consequences.