AJOU Central Library Repository: SciPDFindexer: Distributed Information Retrieval system using MapReduce

SciPDFindexer: Distributed Information Retrieval system using MapReduce

DC Field	Value	Language
dc.contributor.advisor	Sangyoon Oh	-
dc.contributor.author	Murtazaev, Aziz	-
dc.date.accessioned	2018-11-08T08:04:04Z	-
dc.date.available	2018-11-08T08:04:04Z	-
dc.date.issued	2011-08	-
dc.identifier.other	11807	-
dc.identifier.uri	https://dspace.ajou.ac.kr/handle/2018.oak/10064	-
dc.description	Thesis (Master's), Graduate School of Ajou University: Department of Computer Engineering, 2011. 8	-
dc.description.tableofcontents	I. Introduction; 1.1. Introduction: Indexing problem at large scale; 1.2. Motivation: MapReduce programming model as an efficient means of indexing in parallel; 1.3. Thesis Summary and Contributions; 1.4. Thesis Outline; II. Background and Related Works; 2.1. Information Retrieval; 2.2. MapReduce framework; 2.3. Semantic issues of scientific papers; III. Distributed Information Retrieval system for indexing and querying scientific articles in PDF; 3.1. High-level system overview; 3.2. Indexing large-scale PDF articles in distributed system; 3.2.1. Preprocessing step description; 3.2.2. Preprocessing implementation with Hadoop MapReduce; 3.2.3. Indexing schemes and Text-indexing step; 3.3. Querying system; 3.3.1. Ranking model used in SciPDFindexer; 3.3.2. Querying system implementation; IV. Evaluation and Analysis; 4.1. Experiment Environment; 4.2. Experiment Setup; 4.2.1. Experiment Objectives; 4.2.2. Obtaining input data; 4.3. Experiment Results; 4.3.1. Parsing PDF files into textual format; 4.3.2. Effectiveness of indexing at larger scale; 4.3.3. Finding optimal MapReduce parameters; 4.3.4. Evaluating response time of querying indices; V. Conclusion and Future Work; 5.1. Conclusion; 5.2. Future work; References	-
dc.language.iso	eng	-
dc.publisher	The Graduate School, Ajou University	-
dc.rights	Ajou University theses are protected by copyright.	-
dc.title	SciPDFindexer: Distributed Information Retrieval system using MapReduce	-
dc.type	Thesis	-
dc.contributor.affiliation	Graduate School, Ajou University	-
dc.contributor.department	Graduate School, Department of Computer Engineering	-
dc.date.awarded	2011. 8	-
dc.description.degree	Master	-
dc.identifier.localId	T000000011807	-
dc.identifier.url	http://dcoll.ajou.ac.kr:9080/dcollection/jsp/common/DcLoOrgPer.jsp?sItemId=000000011807	-
dc.description.alternativeAbstract	Indexing converts a raw document collection into an easily searchable representation. Web search engines such as Google and Yahoo provide sub-second response times, made possible by efficient indexing of pages across the entire Web. Indexing becomes challenging as the scale grows. Parallel techniques such as the MapReduce framework can make large-scale indexing efficient. We target the problem of large-scale indexing of documents with a specific structure. We propose SciPDFindexer, a system for indexing and querying scientific papers in PDF using the MapReduce programming model in a distributed system. Unlike Web search engines, our target domain is scientific papers, which have a pre-defined structure: title, abstract, sections, and references. The proposed system parses a large number of scientific papers in PDF, recreates their structure, and performs efficient distributed indexing with the MapReduce framework on a cluster of nodes. Our contributions are a distributed indexing scheme suited to the structure of scientific articles and a corresponding fully functional implementation that includes parsing, indexing, and querying logic. We show how our scheme differs from the distributed indexing scheme described in the original MapReduce paper. We also describe each part of the system in detail; in particular, besides the indexing scheme, we present our PDF parsing logic for scientific articles and the ranking model used in our system. We conducted three types of experimental evaluation of our system on a cluster of nodes. First, we show that our distributed indexing scheme can be parallelized efficiently and scales as nodes are added. Second, we found optimal MapReduce parameters for our system under the given conditions. Third, we show that our querying system provides sub-second response time for queries of various lengths.	-

Appears in Collections: Graduate School of Ajou University > Department of Computer Engineering > 3. Theses(Master)

Files in This Item: There are no files associated with this item.


AJOU Central Library Repository was built as part of the National Library of Korea's OAK distribution project.
