Reading Wikipedia to Answer Open-Domain Questions

Danqi Chen∗, Computer Science, Stanford University, Stanford, CA 94305, USA, danqi@cs.stanford.edu
Adam Fisch, Jason Weston & Antoine Bordes, Facebook AI Research, 770 Broadway, New York, NY 10003, USA, {afisch,jase,abordes}@fb.com

Abstract

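This paper proposes to tackle open-domain question answering using Wikipedia as the unique knowledge source: the answer to any factoid question is a span of text in a Wikipedia article. This task of machine reading at scale combines the challenge of document retrieval (finding the relevant articles) with that of machine comprehension of text (identifying the answer within those articles). Our approach combines a search component based on bigram hashing and TF-IDF matching with a multi-layer recurrent neural network model trained to detect answers in Wikipedia paragraphs. Our experiments on multiple existing QA datasets indicate that (1) both modules are highly competitive with their existing counterparts and (2) multitask learning with distant supervision on their combination yields an effective complete system for this challenging task.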

1. Introduction

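This paper considers the problem of answering factoid questions in an open-domain setting using Wikipedia as the unique knowledge source, much as a person looks for answers in an encyclopedia. Wikipedia is a constantly evolving source of detailed information that could help machines become more intelligent, provided they can leverage it. Knowledge bases (KBs) such as Freebase (Bollacker et al., 2008) or DB-Pedia (Auer et al., 2007) are easier for computers to process, but they are too limited in coverage for open-domain question answering (Miller et al., 2016). Wikipedia, in contrast, contains up-to-date knowledge that people are interested in. It is designed, however, to be read by humans, not machines.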

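Using Wikipedia articles as the knowledge source means the QA task combines the challenges of large-scale open-domain QA and of machine comprehension of text. To answer any question, one must first retrieve the few relevant articles from among more than 5 million items and then read them carefully to identify the answer. We term this setting machine reading at scale (MRS). Our work treats Wikipedia as a collection of articles and does not rely on its internal graph structure. As a result, our approach is generic and could be switched to other collections of documents, books, or even daily updated newspapers.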

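Large-scale QA systems such as IBM's DeepQA (Ferrucci et al., 2010) rely on multiple sources to answer: besides Wikipedia, they also use KBs, dictionaries, and even news articles, books, etc. As a result, such systems rely heavily on information redundancy among the sources to answer correctly. Having a single knowledge source forces the model to be very precise when searching for an answer, because the evidence might appear only once. This challenge thus encourages research into the ability of a machine to read, a key motivation for the machine comprehension subfield and for the creation of datasets such as SQuAD (Rajpurkar et al., 2016), CNN/Daily Mail (Hermann et al., 2015) and CBT (Hill et al., 2016).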

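However, those machine comprehension resources typically assume that a short piece of relevant text has already been identified and given to the model, which is not realistic when building an open-domain QA system. In sharp contrast, methods that use KBs or information retrieval over documents must employ search as an integral part of the solution. MRS instead keeps both difficulties at once: the challenge of machine comprehension, which requires a deep understanding of text, and the realistic constraint of searching over a large open resource.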

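In this paper, we show how multiple existing QA datasets can be used to evaluate MRS by requiring an open-domain system to perform well on all of them at once. We develop DrQA, a strong system for question answering from Wikipedia composed of: (1) Document Retriever, a module that uses bigram hashing and TF-IDF matching to efficiently return, given a question, a subset of relevant articles, and (2) Document Reader, a multi-layer recurrent neural network machine comprehension model trained to detect answer spans in those few returned documents. Figure 1 gives an illustration of DrQA.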

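Our experiments show that Document Retriever outperforms the built-in Wikipedia search engine and that Document Reader reaches state-of-the-art results on the very competitive SQuAD benchmark. Finally, our full system is evaluated on multiple benchmarks. In particular, we show that multitask learning and distant supervision improve performance on all datasets compared to single-task training.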

2. Related Work

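Open-domain QA was originally defined as finding answers in collections of unstructured documents, following the setting of the annual TREC competitions[1]. With the development of KBs, resources were created for QA from KBs, such as WebQuestions (Berant et al., 2013) and SimpleQuestions (Bordes et al., 2015) based on the Freebase KB (Bollacker et al., 2008), or based on automatically extracted KBs, e.g., OpenIE triples and NELL (Fader et al., 2014). However, KBs have inherent limitations (incompleteness, fixed schemas) that motivated researchers to return to the original setting of answering from raw text.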

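A second motivation for taking a fresh look at this problem is machine comprehension of text, i.e., answering questions after reading a short text or story. That subfield has made considerable progress recently thanks to new deep learning architectures such as attention-based and memory-augmented neural networks (Bahdanau et al., 2015; Weston et al., 2015; Graves et al., 2014) and the release of new training and evaluation datasets such as QuizBowl (Iyyer et al., 2014), CNN/Daily Mail based on news articles (Hermann et al., 2015), CBT based on children's books (Hill et al., 2016), or SQuAD (Rajpurkar et al., 2016) and WikiReading (Hewlett et al., 2016), both based on Wikipedia. One objective of this paper is to test how such new methods perform in an open-domain QA framework.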

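QA using Wikipedia as a resource has been explored previously. Ryu et al. (2014) perform open-domain QA using a Wikipedia-based knowledge model. They combine article content with multiple other answer matching modules based on different types of semi-structured knowledge such as infoboxes, article structure, category structure, and definitions. Similarly, Ahn et al. (2004) combine Wikipedia as a text resource with other resources, in this case with information retrieval over other documents. Buscaldi and Rosso (2006) also mine knowledge from Wikipedia for QA. Instead of using it as a resource for seeking answers to questions, they focus on validating the answers returned by their QA system, and use Wikipedia categories to determine a set of patterns that the expected answer should fit. In our work, as described in the introduction, we consider only the comprehension of text and use Wikipedia text documents as the sole resource, in order to emphasize the task of machine reading at scale (MRS).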

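There are a number of highly developed full-pipeline QA approaches that use either the Web, as does QuASE (Sun et al., 2015), or Wikipedia as a resource, as do Microsoft's AskMSR (Brill et al., 2002), IBM's DeepQA (Ferrucci et al., 2010) and YodaQA (Baudiš, 2015; Baudiš and Šedivý, 2015). The last of these is open source and hence reproducible for comparison purposes. AskMSR is a search-engine-based QA system that relies on "data redundancy rather than sophisticated linguistic analyses of either questions or candidate answers", i.e., it does not focus on machine comprehension as we do. DeepQA is a very sophisticated system that relies on both unstructured information, including text documents, and structured data such as KBs, databases and ontologies to generate candidate answers or vote over evidence. YodaQA is an open-source system modeled after DeepQA, similarly combining websites, information extraction, databases and Wikipedia in particular. Our comprehension task is made more challenging by using only a single resource. Comparing against these methods provides a useful data point for an "upper bound" benchmark on performance.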

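Multitask learning (Caruana, 1998) and task transfer have a rich history in machine learning (e.g., using ImageNet in the computer vision community (Huh et al., 2016)), as well as in NLP in particular (Collobert and Weston, 2008). Several works have attempted to combine multiple QA training datasets via multitask learning to (i) achieve improvement across the datasets via task transfer and (ii) provide a single general system capable of answering different kinds of questions, given the different data distributions across the source datasets. Fader et al. (2014) used WebQuestions, TREC and WikiAnswers with four KBs as knowledge sources and reported improvement on the latter two datasets through multitask learning. Bordes et al. (2015) combined WebQuestions and SimpleQuestions using distant supervision with Freebase as the KB and obtained slight improvements on both datasets, although they reported poor performance when training on one dataset and testing on the other, showing that task transfer is indeed a challenging subject; see also Kadlec et al. (2016) for a similar conclusion. Our work follows similar themes, but in the setting of retrieving and then reading text documents rather than using a KB, and obtains positive results.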

3. Our System: DrQA

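In the following we describe our system DrQA for MRS, which consists of two components: (1) the Document Retriever module for finding relevant articles and (2) a machine comprehension model, Document Reader, for extracting answers from a single document or a small collection of documents.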
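The two-component structure described above can be sketched as a retrieve-then-read pipeline. The sketch below is only an illustration: the names `answer_question`, `toy_retrieve` and `toy_read` are hypothetical, the toy retriever and reader are trivial word-overlap stand-ins rather than the actual DrQA modules, and choosing the highest-scoring answer across documents is an assumption made for this example.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces for the two components.
Retriever = Callable[[str, int], List[str]]       # (question, k) -> documents
Reader = Callable[[str, str], Tuple[str, float]]  # (question, document) -> (answer, score)


def answer_question(question: str, retrieve: Retriever, read: Reader,
                    k: int = 5) -> Optional[str]:
    """Retrieve k candidate documents, read each one, keep the best-scoring answer."""
    candidates = [read(question, doc) for doc in retrieve(question, k)]
    if not candidates:
        return None
    # Assumption: the final answer is the candidate with the highest score.
    return max(candidates, key=lambda pair: pair[1])[0]


# Toy stand-ins so the pipeline can be run end to end.
CORPUS = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
]


def _words(text: str) -> set:
    return set(text.lower().replace("?", " ").replace(".", " ").split())


def toy_retrieve(question: str, k: int) -> List[str]:
    """Rank documents by the number of words shared with the question."""
    ranked = sorted(CORPUS, key=lambda d: len(_words(d) & _words(question)),
                    reverse=True)
    return ranked[:k]


def toy_read(question: str, document: str) -> Tuple[str, float]:
    """Placeholder reader: first document word absent from the question."""
    q = _words(question)
    overlap = len(_words(document) & q)
    unseen = [w for w in document.rstrip(".").split() if w.lower() not in q]
    return (unseen[0] if unseen else "", float(overlap))


result = answer_question("What is the capital of France?", toy_retrieve, toy_read, k=2)
```

Keeping the retriever and reader behind plain function interfaces means either stage can be replaced (for example, by a stronger search component or a trained neural reader) without touching the other.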

3.1. Document Retriever

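Following classical QA systems, we first use an efficient (non-machine learning) document retrieval system to narrow the search space and focus on reading only articles that are likely to be relevant. For many question types, a simple inverted index lookup followed by term vector model scoring performs quite well on this task compared to the built-in ElasticSearch-based Wikipedia Search API (Gormley and Tong, 2015). Articles and questions are compared as TF-IDF weighted bag-of-words vectors. We further improve our system by taking local word order into account with n-gram features.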
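As a rough illustration of the retrieval scheme described above, the following is a minimal sketch of an inverted index whose postings carry TF-IDF weights over hashed unigrams and bigrams. It is not the authors' implementation: the bucket count `NUM_BINS`, the use of `zlib.crc32` as the hash function, the specific log-scaled TF and IDF formulas, and all class and function names are assumptions chosen for the example.

```python
import math
import re
import zlib
from collections import Counter, defaultdict

NUM_BINS = 2 ** 24  # assumed bucket count; not specified in the text above


def hashed_ngrams(text):
    """Tokenize, then hash every unigram and bigram into a bucket id."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    grams = tokens + [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    # crc32 is an arbitrary stand-in hash chosen for this sketch.
    return [zlib.crc32(g.encode("utf-8")) % NUM_BINS for g in grams]


class TfidfIndex:
    """Toy inverted index with TF-IDF weighted postings."""

    def __init__(self, docs):
        # docs: dict mapping article title -> article text
        self.titles = list(docs)
        counts = [Counter(hashed_ngrams(docs[t])) for t in self.titles]
        # Document frequency: number of articles containing each bucket.
        doc_freq = Counter(b for c in counts for b in c)
        n_docs = len(self.titles)
        self.idf = {b: math.log(1 + n_docs / df) for b, df in doc_freq.items()}
        # bucket id -> list of (document index, TF-IDF weight)
        self.postings = defaultdict(list)
        for doc_id, c in enumerate(counts):
            for b, tf in c.items():
                self.postings[b].append((doc_id, math.log(1 + tf) * self.idf[b]))

    def search(self, question, k=5):
        """Return the k best (title, score) pairs for a question."""
        scores = defaultdict(float)
        for b, tf in Counter(hashed_ngrams(question)).items():
            if b not in self.idf:
                continue  # n-gram never seen in the collection
            q_weight = math.log(1 + tf) * self.idf[b]
            # Inverted index lookup: only touch articles that share this bucket.
            for doc_id, d_weight in self.postings[b]:
                scores[doc_id] += q_weight * d_weight
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        return [(self.titles[i], s) for i, s in ranked[:k]]


DOCS = {
    "Paris": "Paris is the capital of France",
    "Berlin": "Berlin is the capital of Germany",
    "Python": "Python is a programming language",
}
index = TfidfIndex(DOCS)
top = index.search("What is the capital of France?", k=2)
```

Hashing n-grams into a fixed number of buckets keeps memory bounded regardless of vocabulary size, at the cost of occasional collisions. Including bigrams lets a phrase such as "capital of" contribute to the score beyond its individual words.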

4. Data

5. Experiments

6. Conclusion

References

David Ahn, Valentin Jijkoun, Gilad Mishne, Karin Müller, Maarten de Rijke, and Stefan Schlobach. 2004. Using Wikipedia at the TREC QA track. In Proceedings of TREC 2004.

Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. 2007. DBpedia: A nucleus for a web of open data. In The semantic web, Springer, pages 722–735.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations (ICLR).

Petr Baudiš. 2015. YodaQA: a modular question answering system pipeline. In POSTER 2015-19th International Student Conference on Electrical Engineering. pages 1156–1165.

Petr Baudiš and Jan Šedivý. 2015. Modeling of the question answering task in the YodaQA system. In International Conference of the Cross-Language Evaluation Forum for European Languages. Springer, pages 222–228.

Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on freebase from question-answer pairs. In Empirical Methods in Natural Language Processing (EMNLP). pages 1533–1544.

Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data. ACM, pages 1247–1250.

Antoine Bordes, Nicolas Usunier, Sumit Chopra, and Jason Weston. 2015. Large-scale simple question answering with memory networks. arXiv preprint arXiv:1506.02075.

Eric Brill, Susan Dumais, and Michele Banko. 2002. An analysis of the AskMSR question-answering system. In Empirical Methods in Natural Language Processing (EMNLP). pages 257–264.

Davide Buscaldi and Paolo Rosso. 2006. Mining knowledge from Wikipedia for the question answering task. In International Conference on Language Resources and Evaluation (LREC). pages 727–730.

Rich Caruana. 1998. Multitask learning. In Learning to learn, Springer, pages 95–133.

Danqi Chen, Jason Bolton, and Christopher D Manning. 2016. A thorough examination of the CNN/Daily Mail reading comprehension task. In Association for Computational Linguistics (ACL).

Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: deep neural networks with multitask learning. In International Conference on Machine Learning (ICML).

Anthony Fader, Luke Zettlemoyer, and Oren Etzioni. 2014. Open question answering over curated and extracted knowledge bases. In ACM SIGKDD international conference on Knowledge discovery and data mining. pages 1156–1165.

David Ferrucci, Eric Brown, Jennifer Chu-Carroll, James Fan, David Gondek, Aditya A Kalyanpur, Adam Lally, J William Murdock, Eric Nyberg, John Prager, et al. 2010. Building Watson: An overview of the DeepQA project. AI magazine 31(3):59–79.

Clinton Gormley and Zachary Tong. 2015. Elasticsearch: The Definitive Guide. O’Reilly Media, Inc.

Alex Graves, Greg Wayne, and Ivo Danihelka. 2014. Neural turing machines. arXiv preprint arXiv:1410.5401.

Karl Moritz Hermann, Tomáš Kočiský, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems (NIPS).

Daniel Hewlett, Alexandre Lacoste, Llion Jones, Illia Polosukhin, Andrew Fandrianto, Jay Han, Matthew Kelcey, and David Berthelot. 2016. WikiReading: A novel large-scale language understanding task over Wikipedia. In Association for Computational Linguistics (ACL). pages 1535–1545.

Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. 2016. The Goldilocks Principle: Reading children’s books with explicit memory representations. In International Conference on Learning Representations (ICLR).

Minyoung Huh, Pulkit Agrawal, and Alexei A Efros. 2016. What makes ImageNet good for transfer learning? arXiv preprint arXiv:1608.08614.

Mohit Iyyer, Jordan L Boyd-Graber, Leonardo Max Batista Claudino, Richard Socher, and Hal Daumé III. 2014. A neural network for factoid question answering over paragraphs. In Empirical Methods in Natural Language Processing (EMNLP). pages 633–644.

Rudolf Kadlec, Ondrej Bajgar, and Jan Kleindienst. 2016. From particular to general: A preliminary case study of transfer learning in reading comprehension. Machine Intelligence Workshop, NIPS.

Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Kenton Lee, Tom Kwiatkowski, Ankur Parikh, and Dipanjan Das. 2016. Learning recurrent span representations for extractive question answering. arXiv preprint arXiv:1611.01436.

Christopher D Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL). pages 55–60.

Alexander H. Miller, Adam Fisch, Jesse Dodge, Amir-Hossein Karimi, Antoine Bordes, and Jason Weston. 2016. Key-value memory networks for directly reading documents. In Empirical Methods in Natural Language Processing (EMNLP). pages 1400–1409.

Mike Mintz, Steven Bills, Rion Snow, and Daniel Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Association for Computational Linguistics and International Joint Conference on Natural Language Processing (ACL/IJCNLP). pages 1003–1011.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP). pages 1532–1543.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Empirical Methods in Natural Language Processing (EMNLP).

Pum-Mo Ryu, Myung-Gil Jang, and Hyun-Ki Kim. 2014. Open domain question answering using Wikipedia-based knowledge model. Information Processing & Management 50(5):683–692.

Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2016. Bidirectional attention flow for machine comprehension. arXiv preprint arXiv:1611.01603.

Huan Sun, Hao Ma, Wen-tau Yih, Chen-Tse Tsai, Jingjing Liu, and Ming-Wei Chang. 2015. Open domain question answering via semantic enrichment. In Proceedings of the 24th International Conference on World Wide Web. ACM, pages 1045–1055.

Zhiguo Wang, Haitao Mi, Wael Hamza, and Radu Florian. 2016. Multi-perspective context matching for machine comprehension. arXiv preprint arXiv:1612.04211.

Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola, and Josh Attenberg. 2009. Feature hashing for large scale multitask learning. In International Conference on Machine Learning (ICML). pages 1113–1120.

Jason Weston, Sumit Chopra, and Antoine Bordes. 2015. Memory networks. In International Conference on Learning Representations (ICLR).

Caiming Xiong, Victor Zhong, and Richard Socher. 2016. Dynamic coattention networks for question answering. arXiv preprint arXiv:1611.01604.

[1] http://trec.nist.gov/data/qamain.html