SciPDFindexer: Distributed Information Retrieval system using MapReduce
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | Sangyoon Oh | - |
dc.contributor.author | Murtazaev, Aziz | - |
dc.date.accessioned | 2018-11-08T08:04:04Z | - |
dc.date.available | 2018-11-08T08:04:04Z | - |
dc.date.issued | 2011-08 | - |
dc.identifier.other | 11807 | - |
dc.identifier.uri | https://dspace.ajou.ac.kr/handle/2018.oak/10064 | - |
dc.description | Thesis (Master's), Ajou University Graduate School: Department of Computer Engineering, 2011. 8 | - |
dc.description.tableofcontents | I. Introduction; 1.1. Introduction: Indexing problem at large scale; 1.2. Motivation: MapReduce programming model as an efficient means of indexing in parallel; 1.3. Thesis Summary and Contributions; 1.4. Thesis Outline; II. Background and Related Works; 2.1. Information Retrieval; 2.2. MapReduce framework; 2.3. Semantic issues of scientific papers; III. Distributed Information Retrieval system for indexing and querying scientific articles in PDF; 3.1. High-level system overview; 3.2. Indexing large-scale PDF articles in distributed system; 3.2.1. Preprocessing step description; 3.2.2. Preprocessing implementation with Hadoop MapReduce; 3.2.3. Indexing schemes and Text-indexing step; 3.3. Querying system; 3.3.1. Ranking model used in SciPDFindexer; 3.3.2. Querying system implementation; IV. Evaluation and Analysis; 4.1. Experiment Environment; 4.2. Experiment Setup; 4.2.1. Experiment Objectives; 4.2.2. Obtaining input data; 4.3. Experiment Results; 4.3.1. Parsing PDF files into textual format; 4.3.2. Effectiveness of indexing at larger scale; 4.3.3. Finding optimal MapReduce parameters; 4.3.4. Evaluating response time of querying indices; V. Conclusion and Future Work; 5.1. Conclusion; 5.2. Future work; References | - |
dc.language.iso | eng | - |
dc.publisher | The Graduate School, Ajou University | - |
dc.rights | Ajou University theses are protected by copyright. | - |
dc.title | SciPDFindexer: Distributed Information Retrieval system using MapReduce | - |
dc.type | Thesis | - |
dc.contributor.affiliation | Ajou University Graduate School | - |
dc.contributor.department | Graduate School, Department of Computer Engineering | - |
dc.date.awarded | 2011. 8 | - |
dc.description.degree | Master | - |
dc.identifier.localId | 569707 | - |
dc.identifier.url | http://dcoll.ajou.ac.kr:9080/dcollection/jsp/common/DcLoOrgPer.jsp?sItemId=000000011807 | - |
dc.description.alternativeAbstract | Indexing converts a raw document collection into an easily searchable representation. Web search engines such as Google and Yahoo provide sub-second response times, which is made possible by efficient indexing of pages across the entire Web. Indexing becomes challenging as the scale grows. Parallel techniques such as the MapReduce framework can make large-scale indexing efficient. We target the problem of large-scale indexing of documents with a specific structure. We propose SciPDFindexer, a system for indexing and querying scientific papers in PDF using the MapReduce programming model in a distributed system. Unlike Web search engines, our target domain is scientific papers, which have a predefined structure: title, abstract, sections, and references. The proposed system parses a large number of scientific papers in PDF, recreates their structure, and performs efficient distributed indexing with the MapReduce framework on a cluster of nodes. Our contributions are a distributed indexing scheme suited to the structure of scientific articles and a corresponding fully functional implementation that includes parsing, indexing, and querying logic. We show how our scheme differs from the distributed indexing scheme described in the original MapReduce paper. We also describe each part of the system in detail; in particular, besides the indexing scheme, we present our PDF parsing logic for scientific articles and the ranking model used in our system. We conducted three types of experimental evaluation of the system on a cluster of nodes. First, we show that our distributed indexing scheme can be parallelized efficiently and scales as nodes are added. Second, we found optimal MapReduce parameters for our system under the given conditions. Third, we showed that our querying system provides sub-second response times for queries of various lengths. | - |