Indexing converts a raw document collection into an easily searchable representation. Web search engines such as Google and Yahoo achieve sub-second response times through efficient indexing of pages across the entire Web. Indexing becomes more challenging as the scale of the collection grows. Parallel techniques, such as the MapReduce framework, can make large-scale indexing efficient.
We address the problem of large-scale indexing of documents with a specific structure. We propose SciPDFindexer, a system for indexing and querying scientific papers in PDF using the MapReduce programming model in a distributed system. Unlike Web search engines, our target domain is scientific papers, which have a pre-defined structure, such as title, abstract, sections, and references. The proposed system parses a large number of scientific papers in PDF, recreates their structure, and performs efficient distributed indexing with the MapReduce framework on a cluster of nodes. Our contributions are a distributed indexing scheme suited to the structure of scientific articles and a corresponding fully functional implementation that includes parsing, indexing, and querying logic. We show how our scheme differs from the distributed indexing scheme described in the original MapReduce paper. We also describe each part of the system in detail: besides the indexing scheme, we present our PDF parsing logic for scientific articles and the ranking model used in our system.
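To make the idea of structure-aware indexing concrete, the following is a minimal single-process sketch of a MapReduce-style inverted index keyed by (term, field). It is an illustration under stated assumptions, not the SciPDFindexer scheme itself: the function names (`map_paper`, `reduce_postings`, `build_index`), the field names, the tokenizer, and the posting format are all hypothetical, and the shuffle phase is simulated with an in-memory dictionary rather than a cluster.

```python
import re
from collections import defaultdict


def map_paper(doc_id, fields):
    """Map step (hypothetical): emit ((term, field), doc_id) for each token."""
    for field, text in fields.items():
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            yield (term, field), doc_id


def reduce_postings(key, doc_ids):
    """Reduce step (hypothetical): build sorted (doc_id, term frequency) postings."""
    counts = defaultdict(int)
    for doc_id in doc_ids:
        counts[doc_id] += 1
    return key, sorted(counts.items())


def build_index(papers):
    """Simulate the shuffle locally: group map output by key, then reduce."""
    grouped = defaultdict(list)
    for doc_id, fields in papers.items():
        for key, value in map_paper(doc_id, fields):
            grouped[key].append(value)
    return dict(reduce_postings(k, v) for k, v in grouped.items())


# Toy input: papers already parsed into structural fields (assumed format).
papers = {
    "p1": {"title": "Distributed indexing", "abstract": "Indexing with MapReduce"},
    "p2": {"title": "Ranking models", "abstract": "Distributed ranking and indexing"},
}
index = build_index(papers)
print(index[("indexing", "title")])
print(index[("indexing", "abstract")])
```

Keeping the field in the key lets a query restrict a term to, say, titles only, or weight title matches differently from abstract matches at ranking time. This is one possible design and may differ from the one used in the system.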
We conducted three types of experimental evaluation of our system on a cluster of nodes. First, we show that our distributed indexing scheme parallelizes efficiently and scales as nodes are added. Second, we identify the optimal MapReduce parameters for our system under the given conditions. Third, we show that our querying system provides sub-second response times for queries of various lengths.