About

Description

The information retrieval (IR) module for SANCTUM indexes and searches the Twitter dataset in a cluster environment, using both the pre-existing Hadoop framework and a self-developed MapReduce framework. This part of the project covers the development and analysis of the module, which pre-processes, indexes and searches the Twitter dataset provided to us. Data is indexed first so that tweet queries submitted from the front-end can be answered quickly. The system was built to be as robust as possible, with the aim of running on any computer with little to no setup.
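To illustrate why indexing first makes queries fast, here is a minimal sketch of a lookup against an inverted index. The index layout (token mapped to a sorted list of tweet IDs), the `search` function and the conjunctive query semantics are assumptions made for illustration, not the module's actual format or API.

```python
def search(index, query):
    """Return the IDs of tweets that contain every query term.

    `index` maps a lowercase token to a list of tweet IDs (assumed layout).
    """
    terms = query.lower().split()
    if not terms:
        return []
    # One set of tweet IDs per term; unknown terms contribute an empty set.
    postings = [set(index.get(term, ())) for term in terms]
    # A tweet matches only if it appears in every term's postings.
    return sorted(set.intersection(*postings))


# Hypothetical index contents, for illustration only.
index = {"mapreduce": [1, 2], "index": [2, 3], "hadoop": [1]}
results = search(index, "MapReduce index")
```

Because each term is resolved with a dictionary lookup, the cost of a query depends on the length of the matching postings lists rather than on the size of the whole dataset.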

Design

At their core, SANCTUM's primary functions are problems that MapReduce solves efficiently. The Information Retrieval module uses the Hadoop implementation of MapReduce, as well as a self-developed MapReduce framework, to index the Twitter data.
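As a rough illustration of how indexing fits the MapReduce model, the sketch below builds an inverted index in a single process using separate map, shuffle and reduce steps. It is not the SANCTUM or Hadoop code: the function names, the whitespace tokenisation and the sample tweets are all assumptions.

```python
from collections import defaultdict


def map_tweet(tweet_id, text):
    """Map phase: emit one (token, tweet_id) pair per distinct token."""
    for token in set(text.lower().split()):
        yield token, tweet_id


def shuffle(pairs):
    """Shuffle phase: group every emitted value under its key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_postings(token, tweet_ids):
    """Reduce phase: collapse grouped IDs into a sorted postings list."""
    return token, sorted(set(tweet_ids))


def build_index(tweets):
    """Run the three phases in order over a dict of {tweet_id: text}."""
    pairs = (
        pair
        for tweet_id, text in tweets.items()
        for pair in map_tweet(tweet_id, text)
    )
    grouped = shuffle(pairs)
    return dict(reduce_postings(token, ids) for token, ids in grouped.items())


# Hypothetical sample data, for illustration only.
tweets = {
    1: "Hadoop runs MapReduce jobs",
    2: "MapReduce builds the index",
    3: "search the index",
}
index = build_index(tweets)
```

In a cluster setting the map and reduce calls would run on different nodes and the shuffle would move data between them; here everything runs in one process to keep the phases visible.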

The figures below show the flow of events and interactions between objects during a typical MapReduce job and during a SANCTUM MapReduce job. SANCTUM MapReduce's design was inspired by the Hadoop MapReduce framework, but it uses a simpler implementation and model.

Figure 1. MapReduce Flow of Events

Figure 2. SANCTUM MapReduce Flow of Events

The module consists of a single JAR that is executed from the command line. A configuration file and an indexing token blacklist (used to prevent certain words from being indexed) are supplied to configure the program. The provided readme gives more information on running the module.
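The effect of the token blacklist can be sketched as a filter applied before tokens are indexed. The file format (one token per line), the tokenisation pattern and the function names below are assumptions for illustration; the readme describes the actual files.

```python
import re


def load_blacklist(lines):
    """Parse blacklist lines: one token per line, blanks ignored (assumed format)."""
    return {line.strip().lower() for line in lines if line.strip()}


def index_tokens(text, blacklist):
    """Tokenise a tweet and drop every token that appears in the blacklist."""
    # Assumed tokenisation: lowercase runs of letters, digits, '#', '@' and '_'.
    tokens = re.findall(r"[a-z0-9#@_]+", text.lower())
    return [token for token in tokens if token not in blacklist]


# Hypothetical blacklist contents and tweet, for illustration only.
blacklist = load_blacklist(["the", "a", "", "RT"])
tokens = index_tokens("RT @user: the index is a map", blacklist)
```

Filtering at this stage keeps very common words out of the index, which reduces its size without changing how the remaining tokens are searched.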

Figure 3. SANCTUM IR Usage

Results

The graphs below show the results of evaluating the indexing and searching systems on an Amazon Web Services (AWS) cluster. SANCTUM MapReduce indexing performed very well on single-node setups. Hadoop ran slowly by comparison, because of the Hadoop overheads described in the final report on the indexing application. Searching the indices proved efficient enough to give good results.

[Graphs: indexing and searching evaluation results]