Ingestor

– Craig Stevenson

Problem

The Archive Of Archives as a whole comprises many moving parts that need to communicate with each other: the UI, SimpleDL and the archive collector. Being able to collect data from digital libraries to be archived in the Archive Of Archives is fundamental; however, if the collected information cannot be correctly stored in the Archive Of Archives it is useless. This is where the Archive Ingestor comes into play. The main goal of the Ingestor is to understand and process the archive data produced by the archive collector, transforming it into a format that the SimpleDL repository can understand and therefore store.

Objectives of the Ingestor

  1. Ingest an archive scraped by the archive collector into SimpleDL in a format that SimpleDL understands
  2. Version the archive being ingested by recording metrics such as the date scraped, the number of added or deleted documents, and which documents have been modified since the previous version
  3. Report items that were not scraped by the archive collector
  4. Perform the above three objectives quickly and efficiently

How these objectives were achieved

Objective One

This was achieved by harvesting the metadata for all items in a scraped archive using the OAI-PMH protocol. This metadata was then used to produce metadata for each item that SimpleDL can understand, by examining each element in the original metadata and mapping it to an equivalent element understood by SimpleDL. Key metadata elements, such as the identifier, were modified to correctly reflect the new location of the item in the SimpleDL repository. Once the metadata had been correctly produced for an archive, the archive was compressed and inserted into its correct position in the SimpleDL flat-file repository.
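
As a rough illustration of this step, the sketch below harvests Dublin Core records with the Sickle OAI-PMH client and maps them onto SimpleDL fields. The field names in DC_TO_SIMPLEDL and the repository path layout are assumptions; the report does not specify the actual SimpleDL schema or which OAI-PMH library, if any, was used.

    from sickle import Sickle

    # Assumed mapping from Dublin Core elements to SimpleDL fields;
    # the real element names depend on the SimpleDL metadata schema.
    DC_TO_SIMPLEDL = {
        'title': 'title',
        'creator': 'creator',
        'date': 'date',
    }

    def harvest_and_map(base_url, repo_root):
        """Harvest all item metadata via OAI-PMH and map it for SimpleDL."""
        sickle = Sickle(base_url)
        for record in sickle.ListRecords(metadataPrefix='oai_dc'):
            dc = record.metadata  # dict of element name -> list of values
            item = {field: dc[element]
                    for element, field in DC_TO_SIMPLEDL.items()
                    if element in dc}
            # Rewrite the identifier so it points at the item's new
            # location in the SimpleDL repository (path layout assumed).
            local_name = record.header.identifier.split(':')[-1]
            item['identifier'] = [f'{repo_root}/{local_name}']
            yield item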

Objective Two

This was achieved through a simple function that checks up to which version a scraped archive has already been ingested and stores the new version accordingly. A hash function is applied to each item being ingested, and the result is compared with the hash of the corresponding item in the previous version; if the hashes differ, that item is tagged as modified in the metadata. The total number of items ingested is recorded at the end of the ingestion process and compared with the total for the previous version of the archive, and the difference is recorded in the metadata.
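
A minimal sketch of the change-detection step, assuming SHA-256 as the hash function (the report does not name one) and a stored map of per-item hashes from the previous version:

    import hashlib

    def item_hash(item_path):
        """Hash an item's content so changes between versions can be detected."""
        digest = hashlib.sha256()
        with open(item_path, 'rb') as f:
            for chunk in iter(lambda: f.read(8192), b''):
                digest.update(chunk)
        return digest.hexdigest()

    def detect_modified(current_items, previous_hashes):
        """Return identifiers whose content differs from the previous version.

        current_items:   dict of identifier -> file path in the new scrape
        previous_hashes: dict of identifier -> hex digest from the prior version
        """
        return [identifier
                for identifier, path in current_items.items()
                if identifier in previous_hashes
                and previous_hashes[identifier] != item_hash(path)]

The count difference described above then reduces to comparing len(current_items) against the item total recorded for the previous version.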

Objective Three

Because the OAI-PMH protocol produces metadata for all items in an archive, this listing can be compared against the scraped archive to determine which items should be present but are not. If an item was found to be missing from the scraped archive, the Ingestor records the link to that item in the original archive. At the end of the ingestion process, the Ingestor returns a text file containing the links of all items that should have been present in the scraped archive but were not.
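
The comparison is essentially a set difference between the OAI-PMH listing and the scraped items. A sketch, with the parameter names and output file name assumed rather than taken from the Ingestor:

    def report_missing(harvested_links, scraped_ids, out_path='missing_items.txt'):
        """Write links for items listed by OAI-PMH but absent from the scrape.

        harvested_links: dict of identifier -> link in the original archive
        scraped_ids:     set of identifiers found in the scraped archive
        """
        missing = set(harvested_links) - scraped_ids
        with open(out_path, 'w') as out:
            for identifier in sorted(missing):
                out.write(harvested_links[identifier] + '\n')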

Objective Four

Multithreading was leveraged for both the metadata scraping and the ingestion process to improve performance, since archives can become rather large and processing each item sequentially wastes time, especially when mass-scraping archives.
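
A sketch of how such concurrency might look using Python's standard library; the worker count and the per-item function are assumptions rather than the Ingestor's actual design:

    from concurrent.futures import ThreadPoolExecutor

    def ingest_item(item):
        """Per-item work: map the metadata, hash the content and write the
        item into the SimpleDL repository (placeholder body)."""
        ...

    def ingest_all(items, workers=8):
        """Process items concurrently instead of sequentially."""
        with ThreadPoolExecutor(max_workers=workers) as pool:
            list(pool.map(ingest_item, items))

Threads, rather than processes, are a reasonable fit here because the work is dominated by network fetches and disk writes, during which Python releases the global interpreter lock.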

Evaluation

Because the Ingestor is a backend tool, no user testing could be done. Instead, automated testing was performed using a Python script. Test archives were used and augmented to check whether the Ingestor could detect the changes between archive versions.
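
The test script itself is not reproduced here; the fragment below is a self-contained sketch of the kind of check it might perform for the versioning behaviour, with all identifiers and hash values invented for illustration:

    def test_detects_modified_item():
        previous = {'item-1': 'aaa', 'item-2': 'bbb'}  # hashes from version 1
        current = {'item-1': 'aaa', 'item-2': 'ccc'}   # item-2 was modified
        modified = [i for i, h in current.items() if previous.get(i) != h]
        assert modified == ['item-2']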

In total four test cases were performed:

Test One: Ingestion

Two archives, of 133 and 1117 items respectively, were ingested. This test was passed: all metadata was correctly produced in the correct hierarchy.

Test Two: Versioning

The same archive was ingested four times; before each ingestion the archive was augmented by adding, removing or modifying items. The Ingestor was able to successfully detect and document the changes between versions.

Test Three: Performance

Tests were run on archives of different sizes and the execution time for each was recorded. The largest archive ingested contained 35 681 items, which the Ingestor was able to ingest in around 51.84 seconds.

Test Four: SimpleDL Integration

The metadata produced by the Ingestor was run through the SimpleDL indexing engine to check whether the data could be processed by SimpleDL. The result was that SimpleDL processed it successfully.

Conclusion and Future Work

The Ingestor was successfully able to ingest and version archives quickly and efficiently, and should therefore allow for increased scalability of the Archive Of Archives. Future work could include adding further threading to the Ingestor so that it can ingest multiple archives at once, as well as improving the versioning system to detect precise changes in digital objects, for example exactly what was modified and where in a digital object the modification took place.