Library of Congress

Digital Preservation

NDIIPP Partner Tools and Services Inventory

This is a list of tools and services designed, developed or used by NDIIPP partners during their projects. By making this list available, the Library encourages NDIIPP partners and others in the preservation community to share in, and take advantage of, the work and resources of our partners.

A

ACE (Audit Control Environment)

ACE is a prototype tool that validates the integrity of digital files through mathematical techniques. Its purpose is to ensure the authenticity of digital objects in long-term archives. ACE consists of a third-party Integrity Management Service (IMS), which issues integrity tokens for digital objects, and a local archive Audit Manager (AM), which periodically validates the repository. Consistency in ACE is guaranteed through the use of the archive-independent IMS to validate integrity tokens and through the publication of witness values to prove the correctness of the system.

Archive-It

Archive-It, a subscription service from the Internet Archive, allows institutions to build and preserve collections of born-digital content. Through a web application, Archive-It partners can harvest, catalog, manage and browse their archived collections. Collections are hosted at the Internet Archive data center and are accessible to the public with full-text search. Over 65 memory institutions around the world partner with the Internet Archive to archive the web using Archive-It.

B

BagIt

A specification for the packaging of digital content for transfer. Content is packaged (the bag) along with a small amount of machine-readable text (the tag) to help automate the content's receipt, storage and retrieval. There is no software to install. A bag consists of a base directory containing the tag and a subdirectory that holds the content files. The tag is a simple text-file manifest, like a packing slip, that consists of two elements:

1. An inventory of the content files in the bag
2. A checksum for each file.

A slightly more sophisticated bag lists URLs instead of simple directory paths. A script then consults the tag, detects the URLs and retrieves the files over the Internet, ten or more at a time. This type of simultaneous multiple transfer reduces the overall data-transfer time. In another optional file, users can supply metadata that describes the bag.
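
The layout described above can be sketched in a few lines of Python. This is a minimal illustration, not the Library's tooling: the function name `make_bag` is hypothetical, and the file names (`bagit.txt`, `manifest-md5.txt`, a `data` subdirectory) and the MD5 checksum choice are assumptions based on common BagIt practice, since the text above does not name them.

```python
import hashlib
import tempfile
from pathlib import Path


def make_bag(bag_dir: Path, files: dict[str, bytes]) -> Path:
    """Write content files under data/ and a manifest with one checksum per file."""
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True)
    manifest_lines = []
    for name, content in sorted(files.items()):
        (data_dir / name).write_bytes(content)
        digest = hashlib.md5(content).hexdigest()
        # One manifest line per file: checksum, then the path inside the bag.
        manifest_lines.append(f"{digest}  data/{name}")
    # Bag declaration file (assumed layout, see the lead-in).
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n"
    )
    manifest = bag_dir / "manifest-md5.txt"
    manifest.write_text("\n".join(manifest_lines) + "\n")
    return manifest


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        manifest = make_bag(Path(tmp) / "example_bag", {"hello.txt": b"hello\n"})
        print(manifest.read_text(), end="")
```

The manifest plays the role of the packing slip: a receiver can recompute each checksum and compare it with the listed value.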

SEE ALSO: BagIt Library, BagIt Transfer Utilities, RubyBagit (external link) and the BagIt video.

  • Developer: Library of Congress, California Digital Library and Stanford University
  • Written in: n/a
  • OS and run-time environment: n/a
  • Application: n/a
  • Documentation: BagIt Specification (PDF, 63KB)
  • License: n/a

BagIt Library (BIL)

A Java software library that supports the creation, manipulation and validation of bags.

SEE ALSO: BagIt, BagIt Transfer Utilities and RubyBagit (external link)

BagIt Transfer Utilities

A collection of tools developed by the Library of Congress and their partners in the National Digital Information Infrastructure and Preservation Program (NDIIPP) for the purpose of validation and transfer of bags.

The Parallel Retriever optimizes the retrieval of bags through parallelization, and produces a bag when given a file manifest and a "fetch.txt" file. VerifyIt verifies a bag manifest using parallel md5 processes. The Bag Validator validates a bag against the BagIt specification, as well as checking for files in the manifest that are missing from the disk, files on the disk that are not listed in the manifest and duplicate entries in manifest.
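
The manifest checks attributed to the Bag Validator can be illustrated with a short Python sketch. It is not the Transfer Utilities code: the function name `check_bag` and the report keys are hypothetical, and the manifest file name `manifest-md5.txt` and `data` directory are assumptions about the bag layout.

```python
import hashlib
from collections import Counter
from pathlib import Path


def check_bag(bag_dir: Path) -> dict[str, list[str]]:
    """Compare manifest-md5.txt against the files actually present under data/."""
    listed = []
    checksums = {}
    for line in (bag_dir / "manifest-md5.txt").read_text().splitlines():
        if not line.strip():
            continue
        digest, path = line.split(maxsplit=1)
        listed.append(path)
        checksums[path] = digest
    on_disk = {
        p.relative_to(bag_dir).as_posix()
        for p in (bag_dir / "data").rglob("*")
        if p.is_file()
    }
    report = {
        "missing_from_disk": sorted(set(listed) - on_disk),
        "not_in_manifest": sorted(on_disk - set(listed)),
        "duplicate_entries": sorted(p for p, n in Counter(listed).items() if n > 1),
        "checksum_mismatch": [],
    }
    # Recompute the checksum of every file that is both listed and present.
    for path in sorted(set(listed) & on_disk):
        actual = hashlib.md5((bag_dir / path).read_bytes()).hexdigest()
        if actual != checksums[path]:
            report["checksum_mismatch"].append(path)
    return report
```

A bag would pass this sketch when all four lists in the returned report are empty.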

SEE ALSO: BagIt, BagIt Library, RubyBagit (external link) and the BagIt video.

C

Conspectus Database for Private LOCKSS Networks (PLNs)

The conspectus tool was developed by the MetaArchive Cooperative to provide a web-based data management tool for Private LOCKSS Networks. It maintains the metadata of preserved collections, including LOCKSS-specific metadata. The conspectus offers a Graphical User Interface to contributing institutions so they can create, update, and maintain their collection descriptions. The conspectus generates LOCKSS-specific configuration data which is used by system scripts to configure the Private LOCKSS Network's production and test networks (as currently deployed by the MetaArchive preservation network).

NOTE: Software freely available to other PLNs upon request. Please contact katherine.skinner@metaarchive.org.

ContextMiner

ContextMiner is a framework to collect, analyze, and present contextual information along with the data. It is based on the idea that while describing or archiving an object, contextual information helps to make sense of that object or to preserve it better.

D

Dataverse Network

The Dataverse Network software is an open-source, digital library system for management, dissemination, exchange, and citation of virtual collections (dataverses) of quantitative data. Dataverses can be used or administered through Web-based clients that communicate with a host Dataverse Network.

A Dataverse Network, usually running at a major institution, requires installation of application software. Individual dataverses are self-contained virtual data archives, served out by a Dataverse Network, appearing on the Web sites of their owners (e.g., individuals, departments, projects, or publications). Dataverses are branded in the style of the owning entity, but are easy to set up, require no local software installations, and offer the services of a modern archive controlled by the dataverse owner. Data is displayed in a hierarchy; descriptive information (using DDI, Data Documentation Initiative) can be searched.

Depending on the policies of the dataverse owner, end users may be able not only to download files, but also to extract subsets, and perform statistical analysis online. Dataverses and Dataverse Networks can federate with each other, and with other systems through open protocols (OAI-PMH and Z39.50).

Digital Archive

The Digital Archive provides a secure storage environment to manage and monitor the health of master files and digital originals. It also provides a managed storage environment for digital master files that fits in with the workflows for acquiring digital content.

For users of CONTENTdm® (either hosted or direct) the Digital Archive is an optional capability integrated with the various workflows for building collections. Master files are secured for ingest to the Archive using the CONTENTdm Acquisition Station, the Connexion digital import capability, and the Web Harvesting service.

For users of other content management systems the Digital Archive provides a low-overhead mechanism for safely storing master files.

DiscoverInfo

DiscoverInfo is a tool to explore a collection of documents. The tool enables the user to:

  • Search: Run a full text search in the collection. DiscoverInfo indexes text, HTML, XML, and PDF documents.
  • Browse: Builds term clouds based on the term occurrences in the collection as well as across the documents. You can browse through the clickable term clouds to discover documents.
  • Discover: Retrieves relevant information from the indexed collection, but also evaluates the novelty of information in documents with respect to other documents in the collection.
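
The counts behind a term cloud can be sketched as follows. This is an illustration only, not DiscoverInfo's code: the function name `term_counts` and the simple letters-only tokenizer are assumptions. It computes how often each term occurs in the whole collection and in how many documents it appears.

```python
import re
from collections import Counter


def term_counts(documents: dict[str, str]) -> tuple[Counter, Counter]:
    """Return (term frequency across the collection, number of documents per term)."""
    collection_freq: Counter = Counter()
    document_freq: Counter = Counter()
    for text in documents.values():
        terms = re.findall(r"[a-z]+", text.lower())
        collection_freq.update(terms)      # every occurrence counts
        document_freq.update(set(terms))   # each document counts once per term
    return collection_freq, document_freq
```

A term cloud would then size each term by one of these counts, for example `collection_freq.most_common(50)`.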

Duracloud

A hosted service and open technology that makes it easy for organizations and end users to use cloud services. It offers cloud storage across multiple commercial and noncommercial providers, together with compute services to unlock the digital content stored in the cloud. It provides services that enable digital preservation, data access, transformation and data sharing. Customers are offered "elastic capacity" coupled with a "pay as you go" approach.

E

EchoDep Hub and Spoke Framework Tool Suite

With a set of simple tools, Hub and Spoke provides a method for exchanging digital files and metadata among different types of digital management systems built on different platforms. It provides basic interoperability between repositories via a common METS-based profile, a standard programming API, and a series of scripts that use the API and METS profile for creating SIPs and DIPs that can be used across different repositories. Key architectural components are:

The METS profile, which remains mostly neutral regarding content files and structure but defines a minimum level of descriptive (MODS) and administrative (PREMIS) metadata, with an emphasis on preserving technical data and provenance.

The REST-based Lightweight Repository Create, Retrieve, Update, and Delete Service (LRCRUDS), which maps URIs to local identifiers and uses HTTP methods (PUT, GET, POST, and DELETE) to handle packages submitted or disseminated from a repository. Packages are shipped as Zip archives containing a header, METS file, and content files in a format suitable for repository import.

The Hub, which converts from and to the METS profile and manages generation and validation of technical and provenance metadata. At present, the Hub is a package-staging area; the goal is to develop the Hub into a digital repository capable of disseminating packages and handling submissions from other repositories.
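
The idea of mapping URIs to local identifiers and HTTP methods to package operations can be sketched with an in-memory stand-in. This is not the LRCRUDS implementation: the class `PackageStore`, the pairing of each method with an operation (PUT creates, POST updates), and the status codes are all assumptions made for illustration.

```python
from typing import Optional, Tuple


class PackageStore:
    """In-memory stand-in for a repository that keeps packages by local identifier."""

    def __init__(self) -> None:
        self._packages: dict = {}

    def handle(
        self, method: str, uri: str, body: Optional[bytes] = None
    ) -> Tuple[int, Optional[bytes]]:
        # Map the URI to a local identifier: here, simply its last path segment.
        local_id = uri.rstrip("/").rsplit("/", 1)[-1]
        if method == "PUT":  # create (assumed pairing)
            self._packages[local_id] = body
            return 201, None
        if local_id not in self._packages:
            return 404, None
        if method == "GET":  # retrieve
            return 200, self._packages[local_id]
        if method == "POST":  # update (assumed pairing)
            self._packages[local_id] = body
            return 200, None
        if method == "DELETE":  # delete
            del self._packages[local_id]
            return 204, None
        return 405, None
```

In a real deployment the body would be the Zip package described above; here it is treated as opaque bytes.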

Embedded Metadata Extraction Tool (EMET)

EMET is a stand-alone tool designed to extract metadata embedded in JPEG and TIFF files. EMET is intended to facilitate the management and preservation of digital images and their incorporation into external databases and applications. 

After installing EMET, users will be able to download Photoshop installers for the ARTstor XMP panel (a link to download the XMP panel will also be provided from the help documentation). This panel can be used with Adobe Bridge and Photoshop CS3, 4, and 5 on a PC or a Mac and was developed primarily to demonstrate how data verification could be used with EMET. It is not required to use EMET, and while it maps very loosely to such standards as VRA Core and could be used for very basic cataloging of cultural heritage materials, it was not created as a standard for image cataloging or ARTstor submissions.

F

Federated Archive Cyberinfrastructure Testbed (FACIT)

FACIT is a technology testbed that explores the use of geographically distributed storage in a networked environment. It builds on logistical networking technology (see http://loci.cs.utk.edu/ (external link)) using the Internet Backplane Protocol (IBP) (see http://loci.cs.utk.edu/ibp/ (external link)) to provide a generic interface for managing distributed storage resources. Each FACIT archive will use L-Store (see http://www.lstore.org) to manage data storage in both its private infrastructure and in the shared storage pool that the federation makes available.

Using L-Store, and leveraging IBP, FACIT archives will automatically mirror each other's content to provide fault-tolerance and increased accessibility. For its wide area storage infrastructure, FACIT archives will participate in the larger Research and Education Data Depot Network (REDDnet) storage network (see http://www.reddnet.org/ (external link)). Since REDDnet is based on IBP and supports L-Store, FACIT archives will have seamless access to this larger, shared pool of storage.

File Information Tool Set (FITS)

Acts as a wrapper around multiple open-source file format identification, validation and metadata extraction tools. Invokes and manages the output of these tools. The native output from these tools is converted into a common format, "FITS XML", compared to one another and consolidated into a single XML output file. The tools currently wrapped by FITS are:

  • JHOVE
  • Exiftool from Phil Harvey
  • National Library of New Zealand Metadata Extractor
  • DROID from the UK National Archives
  • Ffident from Marco Schmidt
  • File Utility

FITS includes two original tools: FileInfo and XmlMetadata. There are a number of tools that will be evaluated for incorporation into FITS in the future, including:

  • Apache Tika
  • JHOVE 2
  • Aduna Aperture
  • MediaInfo
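
The compare-and-consolidate step can be illustrated with a small Python sketch. It does not reproduce FITS or its XML output: the function name `consolidate`, the dictionary shape, and the majority rule used to pick a single answer are assumptions, and the tool results in the usage note are invented sample values.

```python
from collections import Counter


def consolidate(tool_outputs: dict[str, str]) -> dict:
    """Merge per-tool format identifications into one record and flag disagreement."""
    if not tool_outputs:
        raise ValueError("no tool output to consolidate")
    votes = Counter(tool_outputs.values())
    # Assumed policy: report the format named by the most tools.
    top_format, _ = votes.most_common(1)[0]
    return {
        "format": top_format,
        "agreeing_tools": sorted(t for t, f in tool_outputs.items() if f == top_format),
        "dissenting": {t: f for t, f in tool_outputs.items() if f != top_format},
        "conflict": len(votes) > 1,
    }
```

For example, `consolidate({"JHOVE": "TIFF", "DROID": "TIFF", "File Utility": "JPEG"})` reports TIFF and marks the record as a conflict so that a person can review it.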

G

GIS Archiving Toolset

The Toolset prepares vector and raster datasets for archive ingest. Basic pre-ingest functions include limited format validation, fidelity management, virus scanning, dataset characterization, metadata creation and remediation, and SIP organization.

  • Developer: NCSU
  • NDIIPP Project: North Carolina Geospatial Data Archiving Project
  • Written in: Python
  • OS and runtime requirements: The Toolset was written to run cross-platform, but has only been tested in Linux. Core requirements are met by Python. Extended functionality requires calls to external applications including ClamAV, NOID, 4Suite XML, Unix File, and JHOVE.
  • Application: Tool is not shared.
  • Documentation: Tool is not shared.
  • License: Tool is not shared.

H

Heritrix

Heritrix is a flexible, extensible, robust, and scalable Web crawler capable of fetching, archiving, and analyzing Internet-accessible content.

I

Integrated Rule Oriented Data Systems (iRODS)

iRODS is a data grid that allows the end-user powerful control over storage management policies and procedures through definition of business rules tailored to the characteristics of the files being managed. It provides an abstraction for data management processes and policies in the same way that the Storage Resource Broker provided abstractions for data objects, collections, resources, users and metadata, but is flexible and customizable.

This is accomplished by coding the processes as micro-services that are controlled by explicit rules. Management policies are mapped to sets of rules. Management processes are mapped to sets of micro-services. Assessment criteria are mapped to queries on the persistent state information generated by execution of each micro-service. A distributed rule engine is installed at each storage location to ensure enforcement of policies independently of the choice of access mechanism. iRODS architecture features include:

  • Peer-to-peer data grid servers, based on a client/server model and distributed storage resources
  • A database system, for maintaining the attributes and states of data and operations
  • A rule system, for enforcing and executing adaptive rules
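
The relationship between rules, micro-services and persistent state can be sketched in Python. This is a conceptual illustration only and does not use the iRODS rule language or API: the event name `on_ingest`, the two micro-service functions, and the helper `under_replicated` are hypothetical.

```python
import hashlib


def compute_checksum(obj: dict, state: dict) -> None:
    """Micro-service: record a checksum of the object's content."""
    state["checksum"] = hashlib.sha256(obj["data"]).hexdigest()


def replicate(obj: dict, state: dict) -> None:
    """Micro-service: record one additional replica (no data is moved here)."""
    state["replicas"] = state.get("replicas", 1) + 1


# A policy expressed as a rule: an event mapped to an ordered set of micro-services.
RULES = {"on_ingest": [compute_checksum, replicate]}


def apply_rule(event: str, obj: dict, state_db: dict) -> dict:
    """Run each micro-service for the event and persist the state it generates."""
    state = state_db.setdefault(obj["name"], {})
    for micro_service in RULES.get(event, []):
        micro_service(obj, state)
        state.setdefault("log", []).append(micro_service.__name__)
    return state


def under_replicated(state_db: dict, minimum: int) -> list[str]:
    """Assessment criterion expressed as a query on the persistent state."""
    return sorted(n for n, s in state_db.items() if s.get("replicas", 1) < minimum)
```

Changing a policy then means editing the rule table rather than the micro-services themselves.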

J

JSTOR/Harvard Object Validation Environment (JHOVE)

JHOVE is an extensible system designed to provide automated and efficient identification and validation of the format of digital files with minimal human intervention. JHOVE can:

  • Identify the format to which a digital object conforms
  • Determine the compliance of an object to its format's specification, both in terms of syntax (well-formedness) and semantics (validity)
  • Characterize an object in terms of its format-specific significant properties

JHOVE defines a Java API and also provides a stand-alone application that runs in either command line or GUI mode. JHOVE supports the following formats: AIFF, ASCII, GIF, HTML, JPEG, JPEG 2000, PDF, TIFF, UTF-8, WAVE, and XML.
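
Format identification, the first of the three functions listed above, is often done by inspecting leading "magic" bytes. The Python sketch below illustrates that general technique for a few of the supported formats. It is not JHOVE's Java API and performs no validation or characterization; the function name `identify` is hypothetical.

```python
# Leading byte signatures for a few formats.
SIGNATURES = [
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"%PDF-", "PDF"),
    (b"II*\x00", "TIFF"),  # little-endian TIFF
    (b"MM\x00*", "TIFF"),  # big-endian TIFF
]


def identify(data: bytes) -> str:
    """Guess a format from the first bytes of a file; return "unknown" otherwise."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    # RIFF and IFF containers carry their form type at byte offset 8.
    if data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "WAVE"
    if data[:4] == b"FORM" and data[8:12] in (b"AIFF", b"AIFC"):
        return "AIFF"
    return "unknown"
```

Identification of this kind only suggests a format; determining well-formedness and validity requires parsing the whole object against its specification.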

L

L-Store (Logistical Storage)

L-Store is low-level system software that leverages the basic powerful protocols of the Internet to move and manage large chunks of data through digital networks, much as the Internet moves and manages emails and other traffic. It is built on the Internet Backplane Protocol (IBP) (see http://loci.cs.utk.edu/ibp/ (external link)). The L-Store client provides a storage framework for distributed, scalable, and secure access to data. It is to be used on the Research and Education Data Depot Network (REDDnet) infrastructure (see http://www.reddnet.org/ (external link)). L-Store is designed to provide:

  • High scalability in both raw storage and associated file system metadata
  • A decentralized management system
  • Security
  • Fault-tolerant metadata support
  • User-controlled replication and striping of data on file and directory level
  • Scalable performance in both raw data movement and metadata queries
  • A virtual file system interface in both a Web and command line form
  • Support for the concept of geographical locations for data migration to facilitate quicker access.

LOCKSS

LOCKSS provides inexpensive digital preservation through replication of data storage in multiple locations. Copies of the same content in multiple LOCKSS replicas are automatically compared to each other, and can be repaired by the comparisons automatically.

  • Developer: Stanford University
  • Written in: Java
  • OS and run-time environment: All POSIX (Linux/BSD/UNIX-like OS), Linux. Most LOCKSS installations use a CD which bundles the LOCKSS daemon with an operating system based on OpenBSD. The LOCKSS team also supports running the daemon on RPM-based Linux distributions and on Solaris. The LOCKSS daemon can run in any environment with a Java VM 1.5 or above and a Unix-like file system. The hosting PC needs at least 1 GB of memory, a CD drive, and at least 250 GB of storage. The current CD distribution supports parallel (PATA) and serial (SATA) ATA and SCSI drives. On Linux and Solaris the daemon can use the full set of storage options.
  • Application: http://sourceforge.net/projects/lockss/ (external link)
  • Documentation: http://www.lockss.org/lockss/Installing_LOCKSS (external link)
  • License: BSD, http://www.lockss.org/lockss/Software_License (external link)
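
The compare-and-repair idea can be illustrated with a simplified majority vote over content digests. This sketch is not the LOCKSS daemon or its polling protocol: the function name `audit_and_repair`, the use of SHA-256, and the strict-majority rule are assumptions chosen to keep the example short.

```python
import hashlib
from collections import Counter


def audit_and_repair(replicas: dict[str, bytes]) -> list[str]:
    """Compare replicas by digest and overwrite minority copies with the majority copy.

    Returns the names of the sites that were repaired.
    """
    digests = {site: hashlib.sha256(c).hexdigest() for site, c in replicas.items()}
    winner, votes = Counter(digests.values()).most_common(1)[0]
    if votes <= len(replicas) / 2:
        raise ValueError("no majority; cannot decide which copy is good")
    good_copy = next(c for site, c in replicas.items() if digests[site] == winner)
    damaged = [site for site in replicas if digests[site] != winner]
    for site in damaged:
        replicas[site] = good_copy
    return damaged
```

With three replicas, one damaged copy is outvoted and repaired; with no clear majority the sketch refuses to guess.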

Logistical Distribution Network (LoDN)

An experimental content distribution tool. LoDN allows users to store content on the REDDnet and to manage or retrieve that stored content without installing anything or learning to use any complicated software. LoDN consists of three elements: 1) Upload and 2) Download clients (powered by Java Web Start) for storing and retrieving data, and 3) a Web interface for managing stored data and browsing public content.

LoDN uses the Logistical Networking infrastructure provided by the Internet Backplane Protocol (IBP) (see http://loci.cs.utk.edu/ibp/ (external link)) deployed on REDDnet (http://www.reddnet.org (external link)) to store file content on IBP storage "depots." Content publishers can use LoDN's Web interface to manage stored data. Content distributors can make LoDN data files available by including an active LoDN link on a Webpage, in an email, or through the LoDN content directory. Users access a file by clicking a LoDN link, thereby starting the LoDN Download Client, and then using the download client to retrieve the file content directly from IBP storage.

N

National Geospatial Digital Archive Tools: main page

NGDA Tools provide a suite of tools for graphical search and display of geospatial and map digital data.
http://www.ngda.org/research.php (external link)

NGDA/ Alexandria Digital Library: ADL Middleware Server

A distributed, peer-to-peer software component that provides mediated access to digital library collections.

  • Developer: UCSB
  • NDIIPP Project: National Geospatial Digital Archive
  • Written in: Java and Python
  • OS and runtime environments: It can be run as a Web application inside a servlet container, as an RMI server, or both; has been tested and run under Tomcat in Windows, *nix, and MacOSX. Build requirements include Java and the Apache Ant build tool. Python modules are run inside of Java through an interpreter, so Python is not a requirement.
  • Application: http://www.alexandria.ucsb.edu/middleware/download/webapp/ (external link)
  • Documentation: http://www.alexandria.ucsb.edu/~gjanee/middleware/ (external link)
  • License: Open Source for non-commercial use with attribution; see source code for details. We use CVS (Concurrent Versions System) to store the most up-to-date versions of our code. If you are interested in downloading the source code for this tool, please contact programmers@library.ucsb.edu.

NGDA/ Alexandria Digital Library: Globetrotter

A Google Maps-based Web client for the Alexandria Digital Library middleware. Globetrotter enables a user to perform spatial searches on spatial data. The user can tune his or her search by adjusting a number of different constraints.

NGDA: Format Registry

A wiki-based expert community Web site for collaborative description of geospatial formats.

NGDA: NGDA Server

Software responsible for the creation of Archive Objects within the archive. Accepts requests with attached data, and properly formats and places that data within the archive.

  • Developer: UCSB
  • NDIIPP Project: National Geospatial Digital Archive
  • Written in: Java, using the Spring Framework. The Spring Framework is an open-source Web application framework that works with any servlet container.
  • OS and runtime: Runs in a servlet container. Tested and run under Tomcat 5 in Windows and *nix. Build requirements include Java (1.5 or 1.6) and the Apache Ant build tool.
  • Application: The NGDA Server is not available for public use.
  • Documentation: http://www.ngda.org/research.php (external link)
  • Licensing: Open source for non-commercial applications with attribution. We use CVS (Concurrent Versions System) to store the most up-to-date versions of our code. If you are interested in downloading the source code for this tool, please contact programmers@library.ucsb.edu.

NGDA: Bulk Ingest Tool

A tool used for preparing large collections of data for addition to the archive. After the user has created a template and a configuration file, the Ingest Tool is able to collect files and other data and tie them to an Archive Object identifier. This information is later used to create objects within the Archive itself.

  • Developer: UCSB
  • NDIIPP Project: National Geospatial Digital Archive
  • Written in: Java
  • OS and runtime environment: Uses a MySQL database for persistent data storage. Users will need to have a user account with write access to a MySQL database. Tested in Windows and *nix. Requires Java (1.5 or 1.6). Current build runs in NetBeans, but code is not dependent on NetBeans as a platform.
  • Application: An offline tool.
  • Documentation: http://www.ngda.org/research.php (external link)
  • Licensing: Open source for non-commercial applications with attribution. We use CVS (Concurrent Versions System) to store the most up-to-date versions of our code. If you are interested in downloading the source code for this tool, please contact programmers@library.ucsb.edu.

NGDA: Workflow Tool

A GUI-based tool for taking items prepared by the Bulk Ingest Tool and inserting them into the Archive.

  • Developer: UCSB
  • NDIIPP Project: National Geospatial Digital Archive
  • Written in: Java
  • OS and runtime environment: As a Java-based GUI, the Workflow Tool should work on any OS that supports Java. Tested under Windows XP and Ubuntu with Java 1.5 and 1.6.
  • Application: An offline tool.
  • Documentation: To be posted.
  • Licensing: Freely available for non-commercial use with attribution. We use CVS (Concurrent Versions System) to store the most up-to-date versions of our code. If you are interested in downloading the source code for this tool, please contact programmers@library.ucsb.edu.

NutchWAX

Software for indexing ARC files (archived Web sites gathered using Heritrix) for full text search. NutchWAX is based on the open-source Web-search software, Nutch.

R

Replication Monitor and Verification

The Replication Monitor is currently designed to monitor copies of data in a federated SRB installation. The monitor periodically checks a master site for new data and ensures that copies are created at designated sites. Each replica site operates independently of the other sites, ensuring that replication will occur even if the entire data grid is in a degraded state. Extensive logging of any action on data in a collection is provided. In addition, it provides a Web interface for quickly reporting the current state of a distributed collection and its copies. Extensions of the current tools to other distributed environments are planned.
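
One pass of such a monitor can be sketched in Python, with plain dictionaries standing in for the master site and the replica sites. This is an illustration under stated assumptions, not the tool itself: the function names `sync_site` and `run_monitor` are hypothetical, nothing here talks to SRB, and a real monitor would run the pass on a schedule.

```python
def sync_site(master: dict, replica: dict, site: str, log: list) -> int:
    """Copy objects that exist at the master but not yet at this replica site."""
    copied = 0
    for name, content in master.items():
        if name not in replica:
            replica[name] = content
            log.append(f"{site}: copied {name}")  # log every action taken
            copied += 1
    return copied


def run_monitor(master: dict, replicas: dict, log: list) -> dict:
    """Run one monitoring pass; each site is handled independently of the others."""
    results = {}
    for site, store in replicas.items():
        try:
            results[site] = sync_site(master, store, site, log)
        except Exception as exc:  # a failing site must not block the remaining sites
            log.append(f"{site}: failed ({exc})")
            results[site] = -1
    return results
```

The returned per-site counts and the log are the kind of information a status report could be built from.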

S

Storage Resource Broker

The Storage Resource Broker (SRB) is a software tool that allows end users to organize their digital files in a way meaningful to them, without having to be knowledgeable about the underlying storage technologies. Data may be stored in file systems, tape archives, object-relational databases, and object ring buffers. State information is maintained for each registered entity, enabling uniform access support.

A Metadata Catalog (MCAT) supports retrieval based on queries on attributes instead of physical names and/or locations. The logical name used to identify a file does not change as the file is moved to other storage systems. The access controls on the file do not change as the file is moved, and the metadata associated with the file remain attached to the file or directory.

  • Developer: Data Intensive Cyber Environments Center (DICE Center) at University of North Carolina (UNC)
  • Written in: SRB servers written in C. The SRB clients are written in the appropriate language: Perl load library, Python load library, Java I/O library, C library calls
  • OS and runtime environment: SRB has been ported to UNIX platforms including Linux, Mac OS X, AIX (e.g., SP-2 machines), Solaris, SunOS, and SGI Irix, and to Windows. If you are setting up an MCAT-enabled SRB, you will require an Oracle, DB2, Sybase, MySQL, or PostgreSQL database. The SRB software system itself requires only about 200 MB of storage. For MCAT-enabled servers, the DBMS will require additional space; on Linux, for example, the SRB with PostgreSQL and ODBC takes about 700 MB. Any Linux system with a 1.5 GHz CPU should have good performance. Memory size of 1/2 GB or 1 GB will be sufficient. For a heavy-load instance of SRB, it is best to use a commercial DBMS like Oracle. PostgreSQL works fine for initial testing and light to moderate data loads.
  • Application: http://www.sdsc.edu/srb/index.php/Downloads (external link)
  • Documentation: http://www.sdsc.edu/srb/index.php/Documentation (external link)
  • Licensing: Freely available only to academic organizations and government agencies through a source code distribution. http://www.sdsc.edu/srb/index.php/Client_License (external link)
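
The separation of logical names from physical locations described for the Metadata Catalog can be illustrated with a toy catalog in Python. This is not the MCAT schema or the SRB client API: the class `Catalog`, its method names, and the location strings in the usage note are hypothetical.

```python
class Catalog:
    """Toy metadata catalog: logical names stay fixed while storage locations change."""

    def __init__(self) -> None:
        self._entries: dict = {}

    def register(self, logical_name: str, location: str, **attributes) -> None:
        self._entries[logical_name] = {"location": location, "attributes": attributes}

    def move(self, logical_name: str, new_location: str) -> None:
        # Only the physical location changes; name and metadata stay attached.
        self._entries[logical_name]["location"] = new_location

    def locate(self, logical_name: str) -> str:
        return self._entries[logical_name]["location"]

    def query(self, **criteria) -> list[str]:
        """Find logical names by attribute values rather than by physical path."""
        return sorted(
            name
            for name, entry in self._entries.items()
            if all(entry["attributes"].get(k) == v for k, v in criteria.items())
        )
```

For example, after `cat.register("reports/2009.pdf", "disk://server1/vol3/0001", year=2009)` and a later `cat.move("reports/2009.pdf", "tape://silo2/78")`, the query `cat.query(year=2009)` still returns the same logical name.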

T

TubeKit

TubeKit is a toolkit for creating YouTube crawlers. It allows the user to build a tool that can crawl YouTube based on a set of seed queries and collect up to 17 different attributes. TubeKit assists in all the phases of the process, from database creation to browsing and searching interfaces that provide access to the collected data.

W

Wayback Machine

The Wayback Machine is a powerful search and discovery tool for use with collections of Web site "snapshots" collected through Web harvesting, usually with Heritrix (ARC or WARC files).

Web Archives Workbench

The Web Archives Workbench is a suite of Web capture tools based on principles of managing archived content in aggregates rather than as individual objects. The suite comprises:

  • Discovery Tool, which helps identify potentially relevant Web sites by crawling relevant "seed" Entry Points to generate a list of domains that they link to
  • Properties Tool, which enables you to maintain information about content creators, associate them with the Web sites they are responsible for, and enter high-level metadata
  • Analysis Tool, which enables you to look at the structure of the Web site to see what kind of content is represented by the file directory
  • Harvest Tool, which is used to monitor crawl status, to review and modify harvest settings, and to package harvests for transfer to a repository. The Harvest Tool also offers a separate Quick Harvest feature, which schedules one-time harvests of content. Harvest packages are encoded in METS with Dublin Core metadata embedded.
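
The Discovery Tool's core step, turning a seed page into a list of linked domains, can be sketched with the Python standard library. This is an illustration only, not the Workbench code: the names `LinkCollector` and `linked_domains` are hypothetical, and the page is passed in as a string so that the sketch does no network access.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every anchor tag in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def linked_domains(seed_url: str, html: str) -> list[str]:
    """Return the sorted host names that the seed page links to, excluding its own."""
    parser = LinkCollector()
    parser.feed(html)
    seed_host = urlparse(seed_url).hostname
    hosts = set()
    for href in parser.hrefs:
        # Resolve relative links against the seed before extracting the host.
        host = urlparse(urljoin(seed_url, href)).hostname
        if host and host != seed_host:
            hosts.add(host)
    return sorted(hosts)
```

A curator would then review the resulting list to decide which domains are relevant enough to harvest.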

Web Archiving Service

The Web Archiving Service (WAS) is a Web-based curatorial tool that enables libraries and archivists to capture, curate, analyze, and preserve Web-based government and political information. The WAS allows users to set parameters of Web crawls, capture sites, provide metadata for archived sites, and build collections of archived Web sites.

  • Developer: California Digital Library
  • NDIIPP Project: Web-at-Risk
  • Written in: Java, Ruby on Rails
  • OS and runtime environment:
    WEB PAGE: Javascript must be enabled in the user’s browser. User must be able to install browser bookmarklets to use the "add sites while browsing" feature. Login and password are required.
    BACKEND: Infrastructure consists of Solaris 10 and Linux machines. The heaviest infrastructure demands are processing power for crawling, processing power for indexing, and storage. Other tools used are Heritrix, NutchWAX, Open Source Wayback Machine, MySQL, and Storage Resource Broker.
  • Application: Web based, http://was.cdlib.org (external link)
  • Documentation: http://was.cdlib.org (external link)
  • Licensing: n/a

Web Harvester

A service that enables users to harvest content from the Web, review it and add the harvested items to their CONTENTdm® collections during the Connexion cataloging process. By integrating digital collection development and capture with standard cataloging workflows, the Web Harvester provides an additional option for expanding participation in growing and maintaining digital collections.

Harvested items added to CONTENTdm Digital Collection Management Software using the Web Harvester are discoverable from the CONTENTdm Web interface, as well as WorldCat.org, WorldCat Local and OCLC FirstSearch. Each harvested item added to CONTENTdm using the Web Harvester is associated with its WorldCat record via a persistent URL based on the OCLC number of the WorldCat record. With an additional subscription to the OCLC Digital Archive, master files will be automatically placed in the Archive's secure, managed storage system.
