About

Findmail is a Web application that ingests emails from archives (in maildir or mbox format) and allows users to search and browse those archives accurately and efficiently while offline. It works across different browsers (Firefox, Microsoft Edge, Safari and Google Chrome) and different operating systems (Windows, Linux and Mac OS). It requires minimal software installation: only Python and any standard Web browser. The main purpose of the tool is to help individuals better manage their email and retrieve past email information more easily, while implementing preservation policies that guard against hardware and/or software obsolescence over time.

Project Breakdown

Problem Statement

In the working world, there is often a need to search and browse through large collections of emails for purposes such as auditing, tracking down individuals and verifying decisions. However, for the sake of sanity, most users either delete or archive email after it has been handled. These archives can later become large and cumbersome to search through, especially after a long period of time has passed. There is also the issue of archives becoming obsolete through software aging, and of unreliable Internet bandwidth in Africa. The goal of this project was to create an email browse and search system for an individual. The system would ideally be simple to use, able to ingest emails from archives, and provide fast and accurate search and browse functions over email. The tool that was produced works offline.

System Design
Pre-Processing:

This involves parsing and indexing the input archives, which may be in various email formats. Parsing extracts and structures relevant information from the input archive, while indexing creates indices from the parser output. The indices are stored as plain-text XML documents in order to uphold simplicity and mitigate obsolescence. These indices are accessible to the query system and facilitate search and ranking of the emails in the archive.
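The parse-then-index flow described above can be sketched with Python's standard library alone. This is a minimal illustration, not Findmail's actual code: the function names (`parse_mbox`, `build_index`, `index_to_xml`), the choice of fields, the tokenisation rule and the XML layout (`<index>`, `<term>`, `<doc>`) are all assumptions made for the example.

```python
import mailbox
import re
import xml.etree.ElementTree as ET
from collections import defaultdict


def parse_mbox(path):
    """Yield one dict of header fields and body text per message in an mbox file."""
    box = mailbox.mbox(path, create=False)  # raise instead of creating a missing file
    try:
        for key, msg in box.items():
            payload = msg.get_payload()
            # Multipart messages return a list of parts; this sketch keeps
            # only simple string bodies and ignores attachments.
            body = payload if isinstance(payload, str) else ""
            yield {
                "id": key,  # mbox keys are integers in file order
                "from": msg.get("From", ""),
                "subject": msg.get("Subject", ""),
                "body": body,
            }
    finally:
        box.close()


def build_index(records):
    """Map each lower-cased term to the sorted ids of messages containing it."""
    postings = defaultdict(set)
    for rec in records:
        text = rec["subject"] + " " + rec["body"]
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            postings[term].add(rec["id"])
    return {term: sorted(ids) for term, ids in postings.items()}


def index_to_xml(index):
    """Serialise the index as a plain-text XML string (hypothetical layout)."""
    root = ET.Element("index")
    for term in sorted(index):
        node = ET.SubElement(root, "term", name=term)
        for doc_id in index[term]:
            ET.SubElement(node, "doc", id=str(doc_id))
    return ET.tostring(root, encoding="unicode")
```

Calling `index_to_xml(build_index(parse_mbox("archive.mbox")))` would produce one XML string for a hypothetical `archive.mbox`. A maildir archive could be read in the same way through `mailbox.Maildir`.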

The parser and indexer were built with scalability and portability in mind. They should thus be able to run on the major operating systems, i.e. Windows, MacOS and Linux. Both the parser and the indexer were written in Python.

Email Processing:

This involves the user interface and a query system that allows for fast and efficient retrieval of emails. The user interface should display emails clearly to the user and be easy to use. It is a standard email Web interface offering the user search and browse functions. It was developed using a user-driven approach in order to understand users' needs and preferences. It consists mainly of static HTML, CSS and JavaScript, which display the relevant results when a user invokes one of the services.

The query system should be able to handle various queries and facilitate discovery of relevant emails using an extended Boolean implementation and the indices generated by the search indexer.
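As an illustration of how extended Boolean ranking can work, the sketch below uses the p-norm formulation commonly associated with the extended Boolean model. Which variant Findmail implements is not stated here, so the formulas, the parameter `p`, the function names and the per-document term weights (assumed to lie between 0 and 1) are all assumptions for the example.

```python
def or_score(weights, p=2.0):
    """p-norm OR: high when at least one query term has a high weight."""
    if not weights:
        raise ValueError("at least one term weight is required")
    return (sum(w ** p for w in weights) / len(weights)) ** (1.0 / p)


def and_score(weights, p=2.0):
    """p-norm AND: high only when every query term has a high weight."""
    if not weights:
        raise ValueError("at least one term weight is required")
    shortfall = sum((1.0 - w) ** p for w in weights) / len(weights)
    return 1.0 - shortfall ** (1.0 / p)


def rank(doc_weights, terms, combine, p=2.0):
    """Return document ids ordered from best to worst match.

    doc_weights maps a document id to a {term: weight} dict; terms absent
    from a document are treated as weight 0.0.
    """
    scores = {
        doc_id: combine([weights.get(t, 0.0) for t in terms], p)
        for doc_id, weights in doc_weights.items()
    }
    return sorted(scores, key=scores.get, reverse=True)
```

Unlike strict Boolean retrieval, a document that matches only some terms of an AND query still receives a non-zero score, so partial matches can be ranked below full matches rather than dropped.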


Conclusions and Results

Pre-Processing:
Both maildir and mbox formats can be parsed and indexed. However, the parser and indexer only scale for collections of size 30 000 or smaller (as shown in Fig. 1). Findmail is portable: we concluded that the parser and indexer work on the major operating systems, namely Windows 10, Mac OS, and Ubuntu Linux 16.04 and 17.10. They also work on multiple versions of Python.

Figure 1: Time to parse and index maildir archives of various sizes with one level of nesting

Email Processing:
The user interface, when rendered across the different browsers, worked as intended with respect to look and feel, code validation and functional behaviour. The browse time (browser rendering time) increased linearly with collection size, and similar results were obtained for search testing (refer to Fig. 2).

Figure 2: Time taken to download and display the entire content of a Web page after a search action over a collection

Project documents

Literature Review

Final Report

Other