AJOU Central Library Repository: SciPDFindexer: Distributed Information Retrieval system using MapReduce

SciPDFindexer: Distributed Information Retrieval system using MapReduce

Author(s): Murtazaev, Aziz

Advisor: Sangyoon Oh

Department: Graduate School, Department of Computer Engineering

Publisher: The Graduate School, Ajou University

Publication Year: 2011-08

Language: eng

Alternative Abstract: Indexing converts a raw document collection into an easily searchable representation. Web search engines such as Google and Yahoo provide sub-second response times, which is made possible by efficient indexing of pages across the Web. Indexing becomes more challenging as the scale of the collection grows. Parallel techniques such as the MapReduce framework can make large-scale indexing efficient. We target the problem of large-scale indexing of documents with a specific structure. We propose SciPDFindexer, a system for indexing and querying scientific papers in PDF using the MapReduce programming model in a distributed system. Unlike Web search engines, our target domain is scientific papers, which have a predefined structure: title, abstract, sections, and references. The proposed system parses a large number of scientific papers in PDF, recreates their structure, and performs efficient distributed indexing with the MapReduce framework on a cluster of nodes. Our contributions are a distributed indexing scheme suited to the structure of scientific articles and a corresponding fully functional implementation that includes parsing, indexing, and querying logic. We show how our scheme differs from the distributed indexing scheme described in the original MapReduce paper. We also describe each part of the system in detail: besides the indexing scheme, we present our PDF parsing logic for scientific articles and the ranking model used in our system. We conducted three types of experimental evaluation on a cluster of nodes. First, we show that our distributed indexing scheme parallelizes efficiently and scales as nodes are added. Second, we identified optimal MapReduce parameters for our system under the given conditions. Third, we show that our querying system provides sub-second response times for queries of various lengths.
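The abstract does not spell out the indexing scheme itself, so the following is only a minimal single-process sketch of the general idea: a MapReduce-style inverted index whose keys carry the paper field (title, abstract, and so on) in addition to the term. The function names (`map_paper`, `shuffle`, `reduce_postings`, `build_index`), the tokenizer, and the sample papers are all illustrative assumptions, not the thesis's actual implementation, and the in-memory `shuffle` merely stands in for the grouping a real MapReduce framework performs across nodes.

```python
import re
from collections import defaultdict


def map_paper(doc_id, fields):
    """Map step (hypothetical): emit ((term, field), (doc_id, tf)) for one parsed paper."""
    for field, text in fields.items():
        counts = defaultdict(int)
        # Assumed tokenizer: lowercase alphanumeric runs only.
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            counts[term] += 1
        for term, tf in counts.items():
            yield (term, field), (doc_id, tf)


def shuffle(pairs):
    """Group mapper output by key, standing in for the framework's shuffle phase."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_postings(key, values):
    """Reduce step (hypothetical): build a posting list sorted by document id."""
    return key, sorted(values)


def build_index(papers):
    """Run map, shuffle, and reduce sequentially over a dict of parsed papers."""
    pairs = (
        pair
        for doc_id, fields in papers.items()
        for pair in map_paper(doc_id, fields)
    )
    return dict(reduce_postings(k, v) for k, v in shuffle(pairs).items())


# Made-up sample data, already split into fields as a PDF parser might produce.
papers = {
    "p1": {
        "title": "Distributed indexing with MapReduce",
        "abstract": "Indexing scientific papers in PDF.",
    },
    "p2": {
        "title": "Ranking scientific papers",
        "abstract": "A ranking model for indexing.",
    },
}
index = build_index(papers)
```

Keying postings by `(term, field)` is one plausible way to let a query restrict or weight matches by field, for example looking up `index[("indexing", "title")]` separately from `index[("indexing", "abstract")]`; whether the thesis uses this layout is not stated in the abstract.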

URI: https://dspace.ajou.ac.kr/handle/2018.oak/10064

Appears in Collections: Graduate School of Ajou University > Department of Computer Engineering > 3. Theses(Master)
