One criticism leveled at the use of data is that systems built from data reflect aspects of society that the critic objects to. A clear example of this kind of misreading involves Word2Vec [13], a system developed at Google some years ago (since superseded by BERT) that embeds words in a high-dimensional vector space so that words with similar meanings have nearby vectors. The intuition is to look at the words that typically surround a given word w; the vector for w is then a weighted combination of the directions associated with those surrounding words. For example, we would expect "Coca-Cola" and "Pepsi" to have similar vectors, because people talk about them in much the same way.
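The intuition above can be illustrated with a toy sketch. This is not Word2Vec's actual training algorithm (which learns embeddings with a shallow neural network); it simply builds each word's vector from unweighted counts of the words in a small window around it, enough to show that words used in similar contexts, like "coke" and "pepsi" in this made-up corpus, end up with similar vectors.

```python
import math
from collections import defaultdict

# Toy corpus: "coke" and "pepsi" appear in nearly identical contexts.
corpus = [
    "i drink coke with pizza".split(),
    "i drink pepsi with pizza".split(),
    "she likes coke very much".split(),
    "she likes pepsi very much".split(),
    "the stock market fell today".split(),
]

# Each word's vector counts the words co-occurring with it in a
# +/-2 window -- a crude, unweighted stand-in for the "weighted
# combination of context directions" described in the text.
vectors = defaultdict(lambda: defaultdict(float))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - 2), min(len(sent), i + 3)):
            if j != i:
                vectors[w][sent[j]] += 1.0

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v.get(k, 0.0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv)

sim_coke_pepsi = cosine(vectors["coke"], vectors["pepsi"])
sim_coke_stock = cosine(vectors["coke"], vectors["stock"])
print(sim_coke_pepsi, sim_coke_stock)
```

In this corpus "coke" and "pepsi" share exactly the same contexts, so their cosine similarity is 1, while "coke" and "stock" share none, giving similarity 0.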
[1] R. Agrawal, T. Imielinski, and A. Swami, “Mining associations between sets of items in massive databases,” Proc. ACM SIGMOD Intl. Conf. on Management of Data, pp. 207–216, 1993.
[2] R. Agrawal and R. Srikant, “Fast algorithms for mining association rules,” Intl. Conf. on Very Large Databases, pp. 487–499, 1994.
[3] T. Bolukbasi, K.-W. Chang, J. Zou, V. Saligrama, and A. Kalai, “Man is to computer programmer as woman is to homemaker? Debiasing word embeddings,” 30th Conference on Neural Information Processing Systems, Barcelona, 2016.
[4] A.Z. Broder, M. Charikar, A.M. Frieze, and M. Mitzenmacher, “Min-wise independent permutations,” ACM Symposium on Theory of Computing, pp. 327–336, 1998.
[5] T. Buonocore, “Man is to doctor as woman is to nurse: the gender bias of word embeddings,” https://towardsdatascience.com/gender-bias-word-embeddings-76d9806a0e17
[6] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv:1810.04805, 2018.
[7] A. Gionis, P. Indyk, and R. Motwani, “Similarity search in high dimensions via hashing,” Proc. Intl. Conf. on Very Large Databases, pp. 518–529, 1999.
[8] B. Howe, M.J. Franklin, L.M. Haas, T. Kraska, and J.D. Ullman: “Data science education: we’re missing the boat, again,” ICDE, pp. 1473–1474, 2017.
[11] J. Leskovec, A. Rajaraman, and J.D. Ullman, Mining of Massive Datasets, 3rd edition, Cambridge Univ. Press, 2020. Available for download at http://www.mmds.org
[12] P. Li, A.B. Owen, and C.H. Zhang, “One permutation hashing,” Conf. on Neural Information Processing Systems, pp. 3122–3130, 2012.
[13] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” ArXiv:1301.3781, 2013.