A portable searchable email archive system
Findmail is a Web application that ingests emails from archives (in maildir or mbox format) and allows users to search and browse those archives accurately and efficiently offline. It works across different browsers (Firefox, Microsoft Edge, Safari and Google Chrome) and different operating systems (Windows, Linux and Mac OS). It requires minimal software installation: only Python and any standard Web browser. The main purpose of the tool is to help individuals better manage their email and retrieve past email information more easily, while implementing preservation policies that guard against hardware and/or software obsolescence over time.
In the working world, there is often a need to search and browse through large collections of emails, for example for auditing purposes, to track down individuals or to verify decisions. However, to keep their mailboxes manageable, most users will either delete or archive email after it has been handled. These archives can later become large and cumbersome to search through, especially after a long period of time has passed. There are also the issues of archives becoming obsolete through software aging and of unreliable Internet bandwidth in Africa. The goal of this project was to create an email browse and search system for an individual. The system would ideally be simple to use, able to ingest emails from archives, and provide fast and accurate search and browse functions over email. The tool that was produced works offline.
Pre-processing involves parsing and indexing the input archives, which may be in various email formats. Parsing extracts and structures the relevant information from the input archive, while indexing creates indices from the parser output. The indices are stored as plain-text XML documents in order to uphold simplicity and mitigate obsolescence. These indices are accessible to the query system and facilitate search and ranking of the emails in the archive.
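The parse-then-index step described above might look roughly like the following. This is a minimal sketch rather than Findmail's actual code: it assumes Python's standard mailbox and xml.etree.ElementTree modules, and the function names, the tokenizer and the index/term/doc XML layout are all hypothetical choices made for illustration.

```python
import mailbox
import os
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

TOKEN = re.compile(r"[a-z0-9]+")  # hypothetical tokenizer: lowercase alphanumeric runs


def open_archive(path):
    """Open a maildir (directory) or mbox (single file) archive read-only."""
    if os.path.isdir(path):
        return mailbox.Maildir(path, create=False)
    return mailbox.mbox(path, create=False)


def _decode(payload, charset):
    """Decode bytes, falling back to UTF-8 if the declared charset is unknown."""
    try:
        return payload.decode(charset, errors="replace")
    except LookupError:
        return payload.decode("utf-8", errors="replace")


def extract_record(msg):
    """Parser step: pull a few headers and the plain-text body out of one message."""
    body_parts = []
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)
            if payload is not None:
                charset = part.get_content_charset() or "utf-8"
                body_parts.append(_decode(payload, charset))
    return {
        "from": str(msg.get("From", "")),
        "subject": str(msg.get("Subject", "")),
        "date": str(msg.get("Date", "")),
        "body": "\n".join(body_parts),
    }


def build_index(records):
    """Indexer step: map each term to {doc_id: term frequency}."""
    postings = defaultdict(lambda: defaultdict(int))
    for doc_id, rec in enumerate(records):
        text = (rec["subject"] + " " + rec["body"]).lower()
        for term in TOKEN.findall(text):
            postings[term][doc_id] += 1
    return postings


def index_to_xml(postings):
    """Serialise the postings as a plain-text XML tree (assumed layout)."""
    root = ET.Element("index")
    for term in sorted(postings):
        term_el = ET.SubElement(root, "term", name=term)
        for doc_id, tf in sorted(postings[term].items()):
            ET.SubElement(term_el, "doc", id=str(doc_id), tf=str(tf))
    return ET.ElementTree(root)
```

Under these assumptions, an archive could be indexed by passing `extract_record(m) for m in open_archive(path)` to `build_index`, converting the result with `index_to_xml`, and calling `.write("index.xml", encoding="utf-8", xml_declaration=True)` on the returned tree. Keeping the output as human-readable XML is in line with the stated goal of mitigating obsolescence.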
The parser and indexer were built with scalability and portability in mind. They should thus be able to run on the major operating systems, i.e. Windows, Mac OS and Linux. Both the parser and indexer were written in Python.
Email processing involves the user interface and a query system that allows for fast and efficient retrieval of emails. The user interface should display emails clearly to the user and be easy to use. It is a standard email Web interface offering the user search and browse functions.
It was developed using a user-driven approach in order to understand users’ needs and preferences. It consists mainly of static HTML, CSS and JavaScript, which display the relevant results when a user invokes one of the services.
The query system should be able to handle various queries and facilitate discovery of relevant emails, using an implementation of the extended Boolean model together with the indices generated by the search indexer.
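To illustrate what extended Boolean ranking involves, the sketch below scores documents with the standard p-norm formulation of the extended Boolean model. It is written in Python for consistency with the parser and indexer, and it is an assumption-laden example rather than Findmail's query code: the postings structure, the term weighting (term frequency divided by the term's maximum frequency in any document), the default p = 2 and all function names are hypothetical.

```python
def term_weight(postings, term, doc_id):
    """Weight in [0, 1]: tf normalised by the term's largest tf (assumed scheme)."""
    docs = postings.get(term, {})
    if doc_id not in docs:
        return 0.0
    return docs[doc_id] / max(docs.values())


def score_or(weights, p=2.0):
    """p-norm OR: high if any term weight is high."""
    n = len(weights)
    return (sum(w ** p for w in weights) / n) ** (1.0 / p)


def score_and(weights, p=2.0):
    """p-norm AND: penalises every term weight that falls short of 1."""
    n = len(weights)
    return 1.0 - (sum((1.0 - w) ** p for w in weights) / n) ** (1.0 / p)


def rank(postings, terms, mode="and", p=2.0):
    """Return (doc_id, score) pairs, best first, for documents matching any term."""
    if not terms:
        return []
    candidates = set()
    for term in terms:
        candidates.update(postings.get(term, {}))
    scorer = score_and if mode == "and" else score_or
    scored = [
        (doc_id, scorer([term_weight(postings, t, doc_id) for t in terms], p))
        for doc_id in candidates
    ]
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Example postings: term -> {doc_id: term frequency}
example = {"budget": {0: 2, 1: 1}, "meeting": {0: 1}}
results = rank(example, ["budget", "meeting"], mode="and")
```

Unlike strict Boolean retrieval, an AND query here still returns a document that matches only some of the terms, but with a lower score. In the example, document 1 contains only "budget" and is ranked below document 0, which contains both terms.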
Pre-Processing:
Both maildir and mbox formats can be parsed and indexed. However, the parser and indexer only scale to collections of 30 000 emails or fewer (as shown in Fig. 1).
Findmail is portable: we concluded that the parser and indexer work on the major operating systems, namely Windows 10, Mac OS and Ubuntu Linux 16.04 and 17.10. They also work on multiple versions of Python.
Email Processing:
The user interface, when rendered across the different browsers, worked as intended with respect to look and feel, code validation and functional behaviour. The browse time (browser rendering time) increased linearly with collection size, and similar results were obtained for search testing (see Fig. 2).