ARTICLE

A 21st Century Technical Infrastructure for Digital Preservation

Nathan Tallman

INFORMATION TECHNOLOGY AND LIBRARIES | DECEMBER 2021
https://doi.org/10.6017/ital.v40i4.13355

Nathan Tallman (ntt7@psu.edu) is Digital Preservation Librarian, Pennsylvania State University. © 2021.

ABSTRACT

Digital preservation systems and practices are rooted in research and development efforts from the late 1990s and early 2000s, when the cultural heritage sector started to tackle these challenges in isolation. Since then, the commercial sector has sought to solve similar challenges, using different technical strategies such as software-defined storage and function-as-a-service. While commercial sector solutions are not necessarily created with long-term preservation in mind, they are well aligned with the digital preservation use case. The cultural heritage sector can benefit from adapting these modern approaches to increase sustainability and leverage technological advancements widely in use across Fortune 500 companies.

INTRODUCTION

Most digital preservation systems and practices are rooted in research and development efforts from the late 1990s and early 2000s, when the cultural heritage sector started to tackle these challenges in isolation. Since then, the commercial sector has sought to solve similar challenges, using different technical strategies. While commercial sector solutions are not necessarily created with long-term preservation in mind, they are well aligned with the digital preservation use case because of similar features. The cultural heritage sector can benefit from adapting these modern approaches to increase sustainability and leverage technological advancements widely in use across Fortune 500 companies.

In order to understand the benefits, this article will examine the principles of sustainability and how they apply to digital preservation. Typical preservation activities that use technology will be described, followed by how these activities occur in a 20th-century technical infrastructure model. After a discussion of advancements in the IT industry since the conceptualization of the 20th-century model, a theoretical 21st-century model is presented that attempts to show how the cultural heritage sector can employ industry advancements, and the beneficial impact on sustainability.

Galleries, libraries, archives, and museums cannot afford to ignore the sustainability of managing and preserving digital content, and neither can distributed digital preservation or commercial service providers.1 Budgets lag behind economic inflation while the cost and amount of materials to purchase rise, coupled with the need to hire more employees to do this work. If digital preservation programs are going to scale up to enterprise levels and operate in perpetuity, it is imperative to update technical approaches, adopt industry advancements, and embrace cloud technology.

SUSTAINABILITY

For digital preservation programs to succeed, they must be sustainable per the Triple Bottom Line or they risk subverting their mission.
The Triple Bottom Line definition of sustainability identifies three pillars: people (labor), planet (environmental), and profit (economic).2 While there are typically few people with digital preservation in their job title within an organization, it’s a collaborative domain with roles and responsibilities distributed throughout organizations, reflecting the digital object lifecycle. It’s important that the underlying technical infrastructure can easily be supported and is not so complicated that it is hard to recruit systems administration staff.

Digital preservation consumes many technical resources, and data centers have a substantial environmental impact. As Ben Goldman points out in “It’s Not Easy Being Green(e),” data centers consume an immense amount of power and require extravagant cooling systems that use precious fresh water resources.3 Because there is no point in preserving digital content if there will be no future generation of users, responsible digital preservation programs will seek to reduce carbon outputs and the number of rare-earth elements in our technical infrastructure.4 While cultural heritage organizations rarely seek to make a profit, economic sustainability is vital to organizational health, and costs for digital preservation must be controlled.

Modern technological infrastructures discussed here will help to increase sustainability by using widespread technologies and strategies for which support can be easily obtained, by reducing energy consumption, by minimizing reliance on hardware using rare-earth elements, and by leveraging advances in infrastructure components such as storage to perform digital preservation activities.

BASIC DIGITAL PRESERVATION ACTIVITIES

This paper will examine technical preservation activities; the author acknowledges that basic digital preservation activities are likely to include risk management and other non-technical concepts. While there is no formal, agreed-upon definition of what constitutes a set of basic digital preservation activities, bit-level digital preservation is a common baseline. Bit-level digital preservation seeks to preserve the digital object as it was received, ensuring that you can get out an exact copy of what you put in, no matter how long ago the ingest occurred; however, with no guarantees as to the renderability of said digital object. Two basic digital preservation activities are key to this strategy: fixity and replication.

Fixity

Fixity checking, or the “practice of algorithmically reviewing digital content to ensure that it has not changed over time,” is a foundational digital preservation strategy for verifying integrity that aligns with Rosenthal et al.’s “Audit” strategy.5 Fixity is how preservationists demonstrate mathematically that the content has not changed since it was received. Not all fixity is the same, however; fixity can be broken up into three types: transactional fixity, authentication fixity, and fixity-at-rest.6

Transactional Fixity

Transactional fixity is checked after some sort of digital preservation event,7 such as ingest or replication. Depending on the event, it’s desirable to use a fast, lightweight algorithm, such as CRC32 or MD5, when files move within a trusted system. When it’s only necessary to prove that a file hasn’t immediately changed, such as when copying between filesystems, stronger cryptographic algorithms are unnecessarily complex and too expensive in terms of compute consumption.
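To make the distinction concrete, the following minimal sketch shows what a transactional fixity check might look like in Python when a file is copied within a trusted system: a lightweight checksum is calculated before and after the copy and compared. The helper names, the use of MD5 as the default, and the CRC32 variant are illustrative assumptions rather than a prescribed implementation.

import hashlib
import shutil
import zlib
from pathlib import Path

def md5_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Stream the file and return its MD5 hex digest."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def crc32_of(path: Path, chunk_size: int = 1024 * 1024) -> int:
    """Stream the file and return its CRC32 value as an unsigned integer."""
    value = 0
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            value = zlib.crc32(chunk, value)
    return value & 0xFFFFFFFF

def copy_with_transactional_fixity(source: Path, destination: Path) -> None:
    """Copy a file between filesystems and verify that it arrived unchanged."""
    before = md5_of(source)
    shutil.copy2(source, destination)
    after = md5_of(destination)
    if before != after:
        raise IOError(f"Transactional fixity failure copying {source} -> {destination}")

if __name__ == "__main__":
    # Hypothetical paths for illustration only.
    copy_with_transactional_fixity(Path("ingest/report.pdf"), Path("/mnt/storage/report.pdf"))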
Authentication Fixity

Authentication fixity proves that a file hasn’t changed over a long period of time, particularly since ingest. Although one could use a chain of transactional fixity checks to cumulatively prove there has been no change, it’s often desirable to conduct one fixity check that can be independently verified. Unbroken cryptographic algorithms, such as those from the SHA-2 and SHA-3 families, are well suited to this use case and worth the complexity and compute expense, particularly since this type of fixity check doesn’t have to be run as often.

Fixity-at-Rest

Fixity-at-rest is when fixity is monitored while content is stored on disk. While some organizations may choose to only conduct fixity checks when files move or migrate, this strategy can miss bit loss due to media degradation, software or human error, or malfeasance that is only discovered when the file is retrieved.8 A common approach for monitoring fixity-at-rest is to systematically conduct fixity checks on all or a sample of files at regular intervals. These types of fixity checks may or may not use cryptographic algorithms, depending on their availability.9

Replication

Replication is another cornerstone of achieving bit-level digital preservation. The National Digital Stewardship Alliance’s 2019 Levels of Digital Preservation, a popular community standard, recommends maintaining at least two copies in separate locations, while noting that three copies in geographic locations with different disaster threats is stronger.10 All of these copies must be compared to ensure fixity is maintained. An important concept to consider when thinking about replication is the independence of each copy. According to Schaefer et al.’s User Guide for the Preservation Storage Criteria, “The copies should exist independently of one another in order to mitigate the risk of having one event or incident which can destroy enough copies to cause loss of data.”11 In other words, a replica should not depend on another replica, but instead depend on the original file.

ADVANCED DIGITAL PRESERVATION ACTIVITIES

When considering more robust digital preservation strategies beyond bit-level preservation, additional activities must be considered to ensure that the information contained within digital files can be understood. Implementation of these activities may vary by digital object, depending on the digital preservation goal and appraised value of the content. This paper only describes a handful of the many advanced digital preservation activities as illustrative examples; the ideas in this paper could be applied to most advanced activities.

Metadata Extraction

Digital files often contain various types of embedded metadata that can be used to help describe both their intellectual content and technical characteristics. This metadata can be extracted and used to populate basic descriptive metadata fields, such as title or author. Extracted technical metadata is useful for broader preservation planning, but also for validating technical characteristics in derivative files. For example, if generating an access file for digitized motion picture film, it’s necessary to know the color encoding, aspect ratio, and frame rate. If these details are ignored, the access derivative may appear significantly different from the original file and give a false impression to users.
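As an illustration of technical metadata extraction, the short Python sketch below shells out to ffprobe (part of the FFmpeg suite) and pulls the color encoding, aspect ratio, and frame rate from a video file’s embedded metadata. The tool choice, field selection, and file name are assumptions made for the sake of example, not a recommended workflow.

import json
import subprocess

def probe_video(path: str) -> dict:
    """Run ffprobe and return technical metadata for the first video stream."""
    result = subprocess.run(
        ["ffprobe", "-v", "quiet", "-print_format", "json",
         "-show_format", "-show_streams", path],
        capture_output=True, check=True, text=True,
    )
    data = json.loads(result.stdout)
    video = next(s for s in data["streams"] if s["codec_type"] == "video")
    return {
        "codec": video.get("codec_name"),
        "pixel_format": video.get("pix_fmt"),            # color encoding
        "display_aspect_ratio": video.get("display_aspect_ratio"),
        "frame_rate": video.get("r_frame_rate"),          # e.g., "24000/1001"
        "duration_seconds": data["format"].get("duration"),
    }

if __name__ == "__main__":
    # Hypothetical digitized film scan used only for illustration.
    print(probe_video("film_scan_0001.mov"))

Values returned this way can be recorded as technical metadata and later checked against the corresponding values in access derivatives.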
File Format Conversions

File format conversions help to ensure the renderability of digital content. There are two types of file format conversions to consider: normalization and migration. Normalization generally refers to proactively converting file formats upon ingest to retain informational content, e.g., converting a WordPerfect document to plain text or PDF when only the informational content is desired. Migration may occur at any time: upon ingest, upon access, or any time while an object is in storage. Migration occurs when file formats are converted to a newer version of the same format, e.g., Microsoft Access 2003 (MDB) to Microsoft Access 2016 (ACCDB), or to a more stable and open format that retains features, e.g., Microsoft Access 2016 (ACCDB) to SQLite.

Versioning

Versioning, or the retention of past states of a digital object with the ability to restore previous states, is complex to implement and not always necessary. An organization might choose to apply versioning to subsets of digital content, such as within an institutional repository, but not for born-analog (digitized) material. Additionally, an organization may choose to version metadata only, ignoring changes to the bitstream, such as for born-analog digital objects.

Figure 1. The infrastructure architecture for a typical 20th-century stack.

THE 20TH-CENTURY TECHNICAL INFRASTRUCTURE

The technical infrastructure that enables digital preservation can come in many forms. While technology has advanced over the past thirty years, the cultural heritage sector, particularly where digital preservation is concerned, has been slow to adapt. Below are descriptions of three common components of a typical server stack (technical infrastructure), though the author acknowledges that some organizations have already moved past this model. Figure 1 is a diagram of the typical 20th-century stack.

Storage

Storage, at the core of digital preservation, has benefitted from rapid technological advancement since computers first started storing information on punch cards and magnetic media. Twentieth-century servers often use three main types of storage: file, block, and object.

File Storage

File storage is what most people are familiar with. A filesystem interfaces with the underlying storage technology (block or object) and physical media (hard disk drives, solid state drives, tape-based media, or optical media) to present users with a hierarchy of directories and subdirectories to store data. This data can easily be accessed by users or applications using file paths, while the filesystem negotiates the actual bit-locations on the physical media. The choice of filesystem can impact data integrity (fixity), although choice may be limited by operating system. In the 20th century, journaling filesystems offered the most data protection, as the filesystem keeps track of all changes; in the event of a disk failure, it’s possible to recover more data if a journaling filesystem is used.

Block Storage

Block storage uses blocks of memory on physical media (disk, tape, etc.) that are managed through a filesystem to present volumes of storage to the server. All interactions between server and storage are handled by the filesystem via file paths, though the data is stored on scattered blocks on the media.
Block storage directly attached to a server is often the most performant option, as the data does not travel outside the server. Network attached storage, in which an external file system is mounted to the server as if it were locally attached block storage, requires data to travel through cables and networks before it gets to the server, which decreases performance.

Object Storage

Object storage, which still uses tape and disk media, is an abstraction on top of a filesystem. Instead of using a filesystem to interact directly with storage media, the storage media is managed by software. The software pools storage media and interactions happen through an API, with files being organized into “buckets” instead of using a filesystem with paths. Object storage is web-native and the basis for commercial cloud storage. Software-defined storage, which is discussed in more detail later in this article, also allows users to create block storage volumes that can be directly mounted to virtual servers as part of a filesystem, or to create network shares that present the underlying storage to users via a filesystem.12

Both block and object storage can be used for high-performance storage, hot storage (online), cold storage (nearline), and offline storage. Generally, tape and slower performing hard disks are used for offline and nearline storage; faster performing hard disks are used for online storage. Solid-state drives (SSDs) using Non-Volatile Memory Express (NVMe) protocols are best suited for high-performance storage.13

In the 2019 Storage Infrastructure Survey by the National Digital Stewardship Alliance, 60% of those aware of their organizational infrastructure reported a reliance on hardware-based filesystems (file and block storage), while about 15% used software-based filesystems (object storage) and 14% reported a hybrid approach.14 This indicates that the cultural heritage sector continues to rely more on file and block storage and is not yet fully embracing object storage. The survey did not probe into why this might be.

Servers: Physical and Virtual

Twentieth-century technical infrastructures relied primarily upon physical servers. Physical servers, also called bare metal, dominated the server landscape up through roughly 2005. Virtual servers arrived on the scene after “VMware introduced a new kind of virtualization technology which … [ran] on the x86 system” in 1999.15 Server virtualization facilitated a fresh wave of innovation by making it easier and less expensive to create, manage, and destroy servers as necessary. Dedicating physical servers to one or a limited number of applications requires more resources and expends a higher carbon cost; virtual servers can be highly configured for their precise needs, and this configuration can be changed using software rather than changing parts on a physical server, resulting in less waste.

Cultural heritage organizations have been slow to fully adopt virtual servers. The 2019 NDSA Storage Infrastructure Survey reports that 81% of respondents continue to rely on physical servers, with 63% of respondents using virtual servers.
Fewer than 10% reported using containers, an even more efficient virtualization technology.16 Containers are an evolution of virtual servers that act like highly optimized, self-contained servers doing a specific activity.17

Applications and Microservices

In the 20th century, applications often required dedicated servers. Business logic was handled by applications or microservices that ran on top of the server and storage, the highest level in the stack. There are advantages to handling the business logic at this high level: it’s completely in the control of the developer and can be finely tuned to the needs. Unfortunately, this is also an expensive place to handle all business logic, as the application needs to be maintained over time and there’s overhead involved in working at this level of the stack. Microservices, in this server model, are generally specific commands that can be invoked as needed. While called microservices because they can be applied individually, they still run in this expensive part of the stack and have the same downsides as applications. In digital preservation systems using this type of architecture, basic and advanced digital preservation activities occur within this application layer.

Fixity can be a costly activity. Garnett, Winter, and Simpson, in their paper “Checksums on Modern Filesystems, or: On the Virtuous Consumption of CPU Cycles,” point out that “calculating full checksums in software is not efficient” and “increases wear and tear on the disks themselves, actually accelerating degradation.”18 Fixity, when done this way, is a linear process that requires every file to be read from disk so a checksum can be calculated; when performing fixity over large amounts of content, this is very inefficient and time consuming.

PRESERVATION ACTIVITIES IN THE 20TH-CENTURY STACK

In this model of infrastructure, many cultural heritage institutions are relying on practices created when the field of digital preservation was emerging.

Basic Activities

Basic preservation activities take a generalized approach and mostly occur in the costly application and microservices layer. This follows the general approach of application development from the commercial sector in the 20th century.

Fixity

Although there are differences in frequency, most organizations do not currently make distinctions between transactional fixity, authentication fixity, and fixity-at-rest. Common current practices use the same method (MD5, SHA-256, SHA-512) for all fixity checks.19 This inefficient approach takes place in the application and microservices layer and uses more compute power than necessary, increasing the environmental impact.

Replication

In most 20th-century stacks, replication is handled in the application layer, where it is most costly in terms of computational power and labor to maintain, having a negative impact on sustainability. Some organizations use 20th-century microservices as well.

Advanced Digital Preservation Activities

Like basic preservation activities, advanced ones chiefly take place in the application and microservices layer, if they occur at all.

Metadata Extraction and File Format Conversion

Metadata extraction and file format conversion tend to occur only upon ingest as a one-time event.
Archivematica, the popular open-source digital preservation system, uses 20th-century microservices for each of these activities, and they occur only during the transfer (ingest) process.20 Other systems often include this in the business logic of the application layer.

Versioning

Version control is a feature that many organizations choose not to implement. The 2019 NDSA Storage Infrastructure Survey shows that fewer than half of respondents (40 of 83) used any type of version control.21 Version control is hard to implement in a custom system, though alternative approaches exist. Fedora, a digital preservation backend repository, introduced support for versioning in the application layer around 2004.22

ADVANCES IN THE COMMERCIAL SECTOR

Since the conceptualization of the 20th-century stack, there have been significant advancements made in the general IT industry. Virtualization technology developed in the 1990s led to the proliferation of cloud computing and infrastructure that transformed the IT industry in the early 2000s, leading to the “long-held dream of computing as a utility” or commodity.23 Clouds can be public, where anyone is able to provision and use services, or private, where services are only available to a group of authorized users. Public clouds are run in commercial data centers while private clouds are typically built in privately owned data centers, though it’s possible to use commercial data centers to build private clouds. Hybrid clouds are also possible, typically combining private and public clouds, or combining on-premises infrastructure with a private or public cloud.

In 2009, researchers at UC Berkeley identified three strong reasons why cloud computing has been so widely adopted: the illusion of vertical scaling on demand, elimination of upfront cost, and the ability to pay for short-term resources.24 Surveys from the NDSA and the Beyond the Repository grant project show a steady but slow adoption of cloud infrastructure by the cultural heritage community.25 It is unclear whether early adopters have chosen independently or simply followed IT changes in their parent organizations.

Any organization can build a private cloud and take advantage of the benefits described in this article. Using the cloud does not mean that you must contract with commercial cloud providers. Some organizations may choose to build a private cloud if there are concerns over data sovereignty, mistrust in public clouds, or for other reasons. The Ontario Council of University Libraries in Canada has built a private cloud for its members called the Ontario Library Research Cloud using OpenStack, a suite of open-source software for building clouds.26

Software-Defined Storage

While virtualization enables cloud computing, software-defined storage is the foundation for cloud storage. Software-defined storage combines inexpensive hardware with software abstractions to create a flexible, scalable storage solution that provides data integrity.27 Software-defined storage can use the same pool of disks to present all three of the common types of storage: file, object, and block. File storage is what most users are familiar with. Software-defined file storage creates a network file share from which files can be accessed on local devices via a filesystem.28 Object storage in this environment is like a web-native file share; files are stored in buckets, which can be further organized by folders.
Files are not accessed through a filesystem, but are instead accessed through URIs, which makes object storage very amenable to web applications and avoids some of the pitfalls of relying on filesystems. Block storage is mostly used to mount storage to virtual servers, storage that is directly attached to the server as if it were a physical disk or volume mounted to the server. Block storage is more performant than either file or object storage; as such, it’s typically used for things like the operating system and application code, but not for storing content. All storage can be managed through APIs, adding to its suitability for automation, software development, and IT operations.29

Hardware Diversity

Software-defined storage also has features that make a compelling use case for digital preservation. First, software-defined storage accommodates hardware diversity. Because software-defined storage is an abstraction, it’s possible to combine different types of storage media from different manufacturers and production batches, to ensure some technical diversity and avoid the risk of catastrophic failure from a hardware monoculture.

Fixity and Integrity

Second, like the use of RAID in traditional filesystems, file integrity can be strengthened through the use of erasure coding.30 Erasure coding splits files into chunks and spreads them across multiple disks, or potentially nodes, such that the file can be reconstructed if some of the disks or nodes fail. This can be configured in different ways, depending on the amount of parity desired.31

Replication

Third is replication of content. For cloud administrators, replication might be an alternative to erasure coding for ensuring data integrity; for digital preservationists, it’s a distinct strategy and basic preservation activity. Operating nodes in a software-defined storage network can be in different availability zones; through object storage policies, content can be replicated as many times as necessary to provide mitigation of geographic-based threats. It’s even possible to replicate to object storage in a different software-defined storage network, helping to achieve organizational diversity as well.

Figure 2. The infrastructure architecture for a theoretical 21st-century stack.

AN UPDATED TECHNICAL INFRASTRUCTURE FOR THE 21ST CENTURY

A theoretical 21st-century stack for digital preservation has many of the same components as its 20th-century antecedent. However, these components are used in different ways, largely due to technological advancements. Leveraging these advancements to handle digital preservation activities at lower levels of the stack reduces the complexity of the business logic in the application layer. Figure 2 shows an updated architecture diagram for this 21st-century stack, which may be used by an individual organization, consortium, or service provider planning to build a digital preservation system. The storage layer is built on software-defined storage with digital content primarily being stored as objects; these objects are stored using the Oxford Common File Layout (discussed later). Physical bare metal servers are used to power virtual machines that host applications such as a digital repository.
Physical servers also host container and function-as-a-service platforms that provide a suite of microservices for processing digital content.

Storage

In this new stack, storage is primarily managed through software-defined storage, with data flowing over networks. There are currently two primary open-source options for running a software-defined storage service: Gluster and Ceph. Both can be installed and run on-premises, in a private or public data center, or even contracted through infrastructure as a service (IaaS). In his presentation at the 2018 Designing Storage Architectures for Digital Collections meeting, hosted by the Library of Congress, Glenn Heinle recommended Ceph over Gluster where data integrity is the highest priority; however, others argue that Gluster is better for long-term storage.32 This is likely because Ceph is better able to recover from hardware failures.33

File Storage

Reliance on file storage has become minimal in this theoretical stack, with data primarily stored as objects. However, file storage may still be used; when it is, it benefits from using a modern filesystem. Several modern filesystems have emerged since 2000, most notably ZFS and OpenZFS,34 with their innovative copy-on-write transactional model and methods for managing free space.35 Both ZFS and OpenZFS can also be configured to use RAID-Z, which maintains block-level fixity by calculating checksums for each block of data and verifying the checksum when accessed. This can be combined with simple software to touch every block on a regular basis to ensure fixity-at-rest. Although this is a different approach than file-level fixity checks, it accomplishes the same thing in a much more efficient way: preservation metadata could be recorded for each block that contains part of the file.36 ZFS has also inspired similar modern filesystems such as BTRFS, Apple File System (APFS), ReFS, and Reiser5.37 However, even if this theoretical stack isn’t relying on file storage to persist data, software-defined storage is an abstraction that sits atop servers and disks (or tape) that do use filesystems.38 Ironically, ZFS is not the best option for the underlying disks, as its data integrity features come with more overhead and data integrity can be achieved through different means with software-defined storage.39

Block Storage

Block storage comes in two forms in this future stack. Many virtual servers will leverage the block storage offerings of the software-defined storage service, attaching virtual disk blocks to virtual servers. However, the physical servers that support virtualization will still have some physically attached storage using SSDs (through NVMe) to support high-performance storage needs. This physically attached block storage is more performant than virtually attached block storage, since the system has direct access to the disks and does not have to work through a virtually abstracted filesystem.

Object Storage

Object storage has become the primary method of storing data in this theoretical stack. The flexibility of object storage, with its web-native APIs and authentication, gives it an advantage as systems become less centralized and more integrations are needed. The natural scalability of object storage and the variety of private, public, and commercial offerings greatly simplifies geographic and organizational redundancy when replicating data.
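As a brief illustration of these web-native APIs, the sketch below stores and retrieves a preservation file through an S3-compatible endpoint using the boto3 library; Ceph, for example, can expose such an endpoint through its object gateway, as can most commercial object storage services. The endpoint URL, bucket name, key, and credentials are placeholders.

import boto3

# Placeholder endpoint and credentials; a software-defined storage service such as
# Ceph (through its S3-compatible object gateway) or a commercial provider would
# supply real values.
s3 = boto3.client(
    "s3",
    endpoint_url="https://objects.example.edu",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

BUCKET = "preservation-masters"

# Store a preservation master as an object; the key acts like a path inside the bucket.
s3.upload_file("film_scan_0001.mov", BUCKET, "collection-42/film_scan_0001.mov")

# Retrieve the object's metadata, including the ETag the service calculated on write.
head = s3.head_object(Bucket=BUCKET, Key="collection-42/film_scan_0001.mov")
print(head["ETag"], head["ContentLength"])

# Download a copy for access or for an independent fixity check.
s3.download_file(BUCKET, "collection-42/film_scan_0001.mov", "/tmp/film_scan_0001.mov")

Because the same calls work against private, public, and commercial endpoints, replication across providers becomes largely a matter of configuration rather than new integration work.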
With software-defined storage, it’s also possible to offer hot (live) and cold (nearline, offline) options, giving flexibility for how data is stored to better optimize the storage for various needs. Hot storage may use either hard disk or solid-state drives, while cold storage would rely on tape or optical media. Presently, options for running software-defined storage on tape and optical media are mostly proprietary.40 While this would be a concern if these systems held the only copy, if the data is replicated to systems using other technology and media, this risk can be managed. While optical media has long been criticized for use as a preservation medium, when well-managed, the risk may be overstated.41

Oxford Common File Layout

The Oxford Common File Layout (OCFL) is a “shared approach to filesystem layouts for institutional and preservation repositories.”42 OCFL is a specification for organizing digital objects in a way that supports preservation while being computationally efficient. It has several advantages for use in digital preservation: it makes it possible to rebuild a repository from the files alone, it’s both human and machine readable, it supports native error detection, it allows objects to be efficiently versioned, and it is designed to work with a variety of storage infrastructures.43 Although some implementation details are still being worked out, OCFL can be used with object storage.44 OCFL is in production use and client libraries are available for JavaScript, Java, Ruby, and Rust.45 In this future stack, all storage operations are handled by an OCFL client, which then interacts with the underlying software-defined storage network as shown in figure 2.

Servers

Physical servers are used chiefly to support virtualization in this future stack. However, this stack moves beyond virtual servers and supports containers and serverless computing. Virtual servers are chiefly used to support applications and databases, while containers are perfectly suited for microservices running preservation activities. Serverless, or function-as-a-service, is the next evolution in virtualization. While a container may be idling all the time, spinning into action when a microservice is called, serverless functions are run on demand only. They can cost less when using commercial services such as AWS Lambda or AWS Fargate, where the customer is billed for usage only.46 Though serverless functions can make use of containers, function-as-a-service platforms have emerged, such as Apache OpenWhisk and OpenFaaS, that don’t always require containers.

PRESERVATION ACTIVITIES IN THE 21ST-CENTURY STACK

This 21st-century stack performs the same preservation activities as its predecessor. However, it generally does this at lower levels of the stack, in the infrastructure layers as opposed to the application and microservice layers. This change reduces the computational load on the stack and simplifies the business logic.

Basic Activities

Fixity and replication are achieved by leveraging a combination of microservices and software-defined storage. By optimizing the approach to fixity for each use case, instead of using the same computationally intensive method for all fixity, organizations can use less compute power. While fixity and replication still involve the microservice layer, it is a more targeted approach.

Transactional Fixity

Transactional fixity is maintained through a function-as-a-service-based microservice.
Each time a file is moved, the microservice is triggered; it calculates an MD5 checksum and compares it to a stored value that was created upon ingest. If there is a mismatch between the MD5 values, a second microservice is called that fetches a valid file replica. While CRC32 might be preferred (because it’s slightly less CPU-intensive), Box has shown that CRC32 values can differ depending on how they are calculated.47 A stored CRC32 can only be reliably used to confirm fixity if the new calculation uses the exact same method, because CRC32 is not a single specification in the way MD5 is, and implementations may differ. CRC32 is recommended only when it’s possible to calculate it in the same manner, such as within the same microservice. As this introduces technical complexity, some organizations may prefer to rely solely on MD5 for transactional fixity.

Authentication Fixity

Authentication fixity is maintained in much the same way as in the 20th-century model, except using a cryptographically secure checksum algorithm, such as SHA-512 (part of the SHA-2 family). However, distinguishing between transactional and authentication fixity allows more precise use of algorithms, only requiring more computationally intensive cryptography when it’s truly needed. Authentication fixity may require the use of a container-based microservice, versus a function-as-a-service, due to the increased computational need.

Fixity-at-Rest

Fixity-at-rest, the most common type of fixity, is managed by the software-defined storage service and reported in preservation metadata. How this is achieved might look different, depending on which software-defined storage service is used. The Ceph community has developed a new technology called BlueStore, which serves as a custom storage backend that directly interacts with disks, essentially replacing the need to use an underlying filesystem.48 BlueStore calculates checksums for every file and verifies them when read. Because this is all internal and managed by the same system, CRC32 is the default algorithm, but multiple algorithms are supported. Ceph will “scrub” data every week. Scrubbing is the process of reading the file simply to verify the checksum, even if no user has accessed the file. Because of the way Ceph performs erasure coding, if a checksum is invalid, the file can be repaired. What remains to be done is writing a script that will read Ceph’s internal metadata and record preservation events within the object’s preservation metadata for the fixity verification and any reparative actions. Ryu and Park propose a “Markov failure and repair model” to optimize the frequency of data scrubbing and the number of replicas such that the least amount of power is consumed and scrubbing occurs at off-peak times.49 It appears that this optimization causes less media degradation from the process of regularly reading the file, though empirical studies are needed to confirm that there is overall less degradation than conducting fixity checks through an application.
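A rough sketch of such a script is given below. It assumes a Ceph cluster whose scrub findings can be read with the rados list-inconsistent-pg and list-inconsistent-obj commands, and it uses a hypothetical record_premis_event helper as a stand-in for whatever preservation metadata store an organization actually runs; the exact shape of Ceph’s JSON output may vary by release.

import json
import subprocess
from datetime import datetime, timezone

def rados_json(*args: str) -> object:
    """Run a rados command and parse its JSON output."""
    result = subprocess.run(
        ["rados", *args, "--format=json"],
        capture_output=True, check=True, text=True,
    )
    return json.loads(result.stdout)

def record_premis_event(object_id: str, event_type: str, outcome: str) -> None:
    """Hypothetical helper: write a PREMIS event into the repository's metadata store."""
    event = {
        "eventType": event_type,                        # e.g., "fixity check" or "repair"
        "eventDateTime": datetime.now(timezone.utc).isoformat(),
        "eventOutcome": outcome,
        "linkingObjectIdentifier": object_id,
    }
    print(json.dumps(event))                            # stand-in for a real metadata API

def report_scrub_results(pool: str) -> None:
    """Translate Ceph scrub findings for a pool into preservation events."""
    inconsistent_pgs = rados_json("list-inconsistent-pg", pool)
    for pgid in inconsistent_pgs:
        details = rados_json("list-inconsistent-obj", pgid)
        for entry in details.get("inconsistents", []):
            name = entry["object"]["name"]
            record_premis_event(name, "fixity check", "failure detected by scrub")
            # Ceph's replication or erasure coding normally repairs the object;
            # a repair event could be recorded here once the repair is confirmed.

if __name__ == "__main__":
    report_scrub_results("preservation-pool")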
Gluster has a similar scrubbing process for fixity-at-rest in the optional BitRot feature, although it uses SHA-256 by default instead of CRC32, which requires more computing power.50

Replication

Replication in this future stack is mostly handled by the software-defined storage service, but microservices may play a role in achieving independence of copies.51 Object storage policies allow the automatic copying of data into another region or availability zone that is within the software-defined storage network. However, these copies are not replicas, or independent instances, because all copies are in a chain derived from the primary instance; if there is a problem anywhere in the chain, bad data will be copied. In addition to using object storage policies, microservices could be used to independently verify the fixity of downstream copies as well as trigger true replications to independent instantiations, such as an alternative storage service or a different storage area within the same software-defined storage network. Bill Branan suggested a similar approach in his Cloud Native Preservation presentation at NDSA Digital Preservation 2019.52

Advanced Digital Preservation Activities

Advanced digital preservation activities in a 21st-century stack also make use of microservices for metadata extraction and file format conversion. Versioning, however, relies upon features of the Oxford Common File Layout, even though object storage often supports versioning natively.

Metadata Extraction

Function-as-a-service microservices are well suited to metadata extraction, triggered upon ingest or as needed. Since embedded metadata is machine-readable, this activity will not be resource intensive or time consuming. In addition to extracting metadata and storing it as discrete, sidecar files, these microservices can be used to populate specific metadata fields used by the repository, including descriptive, administrative, and technical metadata. This approach is more efficient as it gives flexibility to reuse the functions in multiple workflows, as opposed to tying them to specific events like ingest.

File Format Conversion

File format conversions use a combination of function-as-a-service and container-based microservices, depending upon the original format. Like metadata extraction, conversion may occur at ingest (normalization) or as needed (migration). Function-as-a-service is well suited for small to medium files, such as converting WordPerfect to OpenDocument Format. Function-as-a-service is also well suited for logical preservation, when only the informational content is necessary to preserve, such as converting a TIF to a TXT file through OCR. Container-based microservices are better suited for converting large media files that may take more memory and time, since function-based services often have a time constraint; an example is migrating proprietarily encoded digital video to open codecs and container formats.

Versioning

Although object storage typically supports versioning, it is inefficient because each version is an entirely new object. This means that unchanged data is duplicated, taking up more disk space. The Oxford Common File Layout, which negotiates storage between the application and microservices layers and a software-defined storage service, supports forward delta versioning in which each new version only contains the changes, as the example below illustrates.
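To make the forward-delta mechanism concrete, the Python sketch below walks a simplified, hypothetical OCFL-style inventory (the real specification requires additional fields, such as type, created timestamps, and fixity blocks) and resolves a version’s logical file paths to the stored content paths; note that the unchanged file in version 2 still points at the content stored under version 1.

# A simplified OCFL-style inventory for one object: version v2 adds a corrected
# TIFF but re-uses the unchanged metadata file from v1 by digest, so those bits
# are stored only once. Digests are shortened placeholders for readability.
inventory = {
    "id": "urn:example:object-42",
    "digestAlgorithm": "sha512",
    "head": "v2",
    "manifest": {
        "aaa111": ["v1/content/master.tif"],
        "bbb222": ["v1/content/metadata.xml"],
        "ccc333": ["v2/content/master.tif"],   # only the changed file is stored in v2
    },
    "versions": {
        "v1": {"state": {"aaa111": ["master.tif"], "bbb222": ["metadata.xml"]}},
        "v2": {"state": {"ccc333": ["master.tif"], "bbb222": ["metadata.xml"]}},
    },
}

def resolve_version(inv: dict, version: str) -> dict:
    """Map each logical path in a version to the stored content path holding its bits."""
    resolved = {}
    for digest, logical_paths in inv["versions"][version]["state"].items():
        content_path = inv["manifest"][digest][0]
        for logical_path in logical_paths:
            resolved[logical_path] = content_path
    return resolved

# metadata.xml in v2 still resolves to the v1 content path: no duplicated bits.
print(resolve_version(inventory, "v2"))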
Using the object inventories, it’s possible to rebuild any object to any version without duplicating bits.53 An additional benefit of using OCFL is that it inherently uses checksums, and any changes or corruption are detected when an interaction occurs with the object, creating a layered approach to maintaining fixity-at-rest.

SUSTAINABILITY IN THE 21ST-CENTURY STACK

The differences between our 20th- and 21st-century stacks result in a more sustainable approach to digital preservation, per the triple-bottom-line definition.54 By adopting commercial sector approaches, cultural heritage organizations can become more efficient data center consumers.

People (Labor)

By shifting activities to lower levels in the stack and letting infrastructure components self-manage, fewer people are needed to develop and maintain the business logic that formerly handled the same action. The application and microservice layers use programming languages and libraries that can become outdated quickly, requiring development work to maintain functionality. While there is still a need for specialized knowledge, fewer people are needed to do the work when these actions take place in more stable parts of the stack.

Planet (Environmental)

Our new stack has a lower environmental impact for a variety of reasons. First, per Kryder’s Law (the storage parallel to Moore’s Law for computing), the areal density of storage media predictably increases annually, and our new stack uses the latest hard disk and tape technology.55 This results in needing less media, some of which doesn’t need constant power to run, decreasing the carbon impact. Additionally, our new stack uses a mix of hot and cold storage, making it possible to implement automatic tiering to shift objects to less resource-intensive storage, like tape.56

Second, as the stack becomes more serverless, fewer computational resources are needed. Even though container and function-based microservices may incur some overhead in terms of CPU cycles, it is more efficient in terms of system idling to run these as microservices on one platform rather than doing the same action in the application or VM layer. This further decreases the carbon impact while also decreasing the dependency on rare-earth elements. Relatedly, by leveraging software-defined storage to maintain fixity-at-rest, the compute load is greatly decreased; the CPU cost of calculating checksums in the storage layer is less than when this is done through applications or microservices.

Profit (Economic)

Sustainability improvements for both people and planet may also result in a lower total cost of ownership for a digital preservation system. Cost is a prime motivator when administrators and leaders make long-term decisions; decreasing the annual operating cost related to digital preservation is crucial to the viability of a program.

FUTURE AND RELATED WORK

The 21st-century stack proposed in this paper is not the only way to increase sustainability or the only way in which digital preservation stacks will change. The planet is running out of network bandwidth and will need to expand into 5G and low-earth orbit satellite communications. New, quantum-resistant algorithms will need to be introduced as quantum computing advances.57

Blockchain technology introduces many possibilities. Inherently, blockchain maintains fixity.
The ARCHANGEL project is exploring practical methods of recording provenance and proving authenticity by using a permissioned blockchain.58 Blockchain is also the technology behind the InterPlanetary File System (IPFS), a content-addressed distributed storage network, which is in turn used by Filecoin, a marketplace for IPFS storage. Small Data Industries is building Starling, a Filecoin-based application designed for cultural heritage organizations to securely store digital content.59 It’s important to note that these blockchain-based projects use a Proof-of-Stake model instead of a Proof-of-Work model, which has a significantly lower environmental impact than other blockchain implementations like the cryptocurrency Bitcoin.60

While some organizations, like Stanford University, may already leverage software-defined storage, most in the cultural heritage sector do not.61 The MetaArchive Cooperative, a nonprofit consortium for digital preservation, has begun a noteworthy project to explore using software-defined storage in a distributed digital preservation network. MetaArchive, which currently uses LOCKSS, is one of the few public digital preservation services that mitigates risk through organizational and administrative diversity. Because members host and administer the LOCKSS nodes that contain the replications, each copy is managed by a different set of organizational and administrative policies and often uses different technology to do so. Diversifying in this way protects against the single point of failure that would exist if only one organization managed the technical infrastructure. How this same diversity is achieved in a software-defined storage-based distributed digital preservation network will be a great contribution to the community.

It would also be useful to study the reasons cultural heritage organizations have been so reluctant to adopt commercial sector technologies. Identifying these hesitations would make it possible to create strategies that would encourage adoption of these approaches. It may simply be that when it comes to digital preservation, familiar and proven technologies provide a level of comfort. Organizations may also be entrenched in custom developed solutions that are hard to move away from.

CONCLUSION

Digital preservation is a long-term commitment. While re-appraisal may take place, it’s inevitable that the amount of content stored in digital repositories will only increase over time. It is fiduciarily incumbent upon the cultural heritage community to examine our practices and look for better alternatives. Exceptionalism ignores technological advancements made by the commercial industry, advancements that are very well suited to digital preservation. By adopting commercial industry data practices, such as software-defined storage, while simultaneously implementing innovations from within the cultural heritage community, like the Oxford Common File Layout, it is possible to reduce the ongoing costs, resource consumption, and environmental impact of digital preservation.

ENDNOTES

1 Ben Goldman, “It’s Not Easy Being Green(e): Digital Preservation in the Age of Climate Change,” in Archival Values: Essays in Honor of Mark A. Greene, ed. Mary A. Caldera and Christine Weidman (Chicago: American Library Association, 2018), 274–95, https://scholarsphere.psu.edu/concern/generic_works/bvq27zn11p.
2 “A Simple Explanation of the Triple Bottom Line,” University of Wisconsin Sustainable Management, October 2, 2019, https://perma.cc/2HWF-3MMQ.

3 Goldman, “It’s Not Easy Being Green(e).”

4 Keith L. Pendergrass et al., “Toward Environmentally Sustainable Digital Preservation,” The American Archivist 82, no. 1 (2019): 165–206, https://doi.org/10.17723/0360-9081-82.1.165.

5 Sarah Barsness et al., 2017 Fixity Survey Report: An NDSA Report (OSF, April 24, 2018), https://doi.org/10.17605/OSF.IO/SNJBV; David S. H. Rosenthal et al., “Requirements for Digital Preservation Systems: A Bottom-Up Approach,” D-Lib Magazine 11, no. 11 (2005), https://perma.cc/X2R7-R5XP.

6 Matthew Addis, Which Checksum Algorithm Should I Use? (DPC Technology Watch Guidance note, Digital Preservation Coalition, December 11, 2020), https://doi.org/10.7207/twgn20-12.

7 PREMIS Editorial Committee, PREMIS Data Dictionary for Preservation Metadata, version 3.0 (Library of Congress, November 2015), https://perma.cc/L79V-GQV7.

8 Some organizations may continue to use a strategy where fixity is only checked when a file is accessed, if the potential loss fits within a defined acceptable loss. While this strategy may not work for all organizations, recognizing that loss is inevitable and defining a level of acceptable loss is an effective and pragmatic approach to managing the risk of bit decay.

9 Barsness et al., 2017 Fixity Survey Report.

10 NDSA Levels of Preservation Revisions Working Group, “Levels of Digital Preservation,” 2019 LOP Matrix, V2.0 (OSF, October 14, 2019), https://osf.io/2mkwx/.

11 Sibyl Schaefer et al., “User Guide for the Preservation Storage Criteria,” February 25, 2020, https://doi.org/10.17605/OSF.IO/SJC6U.

12 Mark Carlson et al., “Software Defined Storage” (white paper, Storage Networking Industry Association, January 2015), https://perma.cc/AQ4T-9YXQ.

13 Abutalib Aghayev et al., “File Systems Unfit as Distributed Storage Backends” (Proceedings of the 27th ACM Symposium on Operating Systems Principles, SOSP ’19, Huntsville, Ontario, Canada: Association for Computing Machinery, 2019): 353–69, https://doi.org/10.1145/3341301.3359656.

14 NDSA Storage Infrastructure Survey Working Group, 2019 Storage Infrastructure Survey: Results of the Storage Infrastructure Survey (OSF, 2020), https://doi.org/10.17605/OSF.IO/UWSG7.

15 Joseph Migga Kizza, “Virtualization Technology and Security,” in Guide to Computer Network Security, Computer Communications and Networks (Springer, Cham, 2017), 457–75, https://doi.org/10.1007/978-3-319-55606-2_21.

16 NDSA Storage Infrastructure Survey Working Group, 2019 Storage Infrastructure Survey.

17 Eric Jonas et al., “Cloud Programming Simplified: A Berkeley View on Serverless Computing” (University of California, Berkeley, February 10, 2019), https://perma.cc/YAM2-TZ8W.

18 Alex Garnett, Mike Winter, and Justin Simpson, “Checksums on Modern Filesystems, or: On the Virtuous Consumption of CPU Cycles,” in iPres 2018 Conference [Proceedings] (International Conference on Digital Preservation, Boston, Mass., 2018), https://doi.org/10.17605/OSF.IO/Y4Z3E.

19 Barsness et al., 2017 Fixity Survey Report.

20 “Import Metadata,” documentation for Archivematica 1.12.1, Artefactual Systems, Inc., accessed May 21, 2021, https://perma.cc/UE3R-BDGZ; “Ingest,” documentation for Archivematica 1.12.1, Artefactual Systems, Inc., accessed May 21, 2021, https://perma.cc/5SN5-GFX3.
21 NDSA Storage Infrastructure Survey Working Group, 2019 Storage Infrastructure Survey.

22 “Fedora Content Versioning,” 2005, https://duraspace.org/archive/fedora/files/download/2.0/userdocs/server/features/versioning.html.

23 Michael Armbrust et al., Above the Clouds: A Berkeley View of Cloud Computing (technical report, EECS Department, University of California, Berkeley, February 10, 2009), https://perma.cc/QJ5W-8S5Y.

24 Armbrust et al., Above the Clouds.

25 Micah Altman et al., “NDSA Storage Report: Reflections on National Digital Stewardship Alliance Member Approaches to Preservation Storage Technologies,” D-Lib Magazine 19, no. 5/6 (May 2013), https://doi.org/10.1045/may2013-altman; Michelle Gallinger et al., “Trends in Digital Preservation Capacity and Practice: Results from the 2nd Bi-Annual National Digital Stewardship Alliance Storage Survey,” D-Lib Magazine 23, no. 7/8 (2017), https://doi.org/10.1045/july2017-gallinger; NDSA Storage Infrastructure Survey Working Group, 2019 Storage Infrastructure Survey; Evviva Weinraub et al., Beyond the Repository: Integrating Local Preservation Systems with National Distribution Services (Northwestern University, 2018), https://doi.org/10.21985/N28M2Z.

26 Ontario Council of University Libraries, “Ontario Library Research Cloud,” accessed April 14, 2021, https://perma.cc/KMP9-FS8K; “Open Source Cloud Computing Infrastructure,” OpenStack, accessed April 14, 2021, https://perma.cc/G9GE-92JD.

27 Nathan Tallman, “Software Defined Storage” (presentation for the NDSA Infrastructure Interest Group, March 16, 2020), https://doi.org/10.26207/3nn2-zv13.

28 These network shares typically use the SMB (Server Message Block) or CIFS (Common Internet File System) protocols to present file shares through a graphical user interface in operating systems such as Windows or macOS, while the NFS (Network File System) protocol is more often used to mount storage in Linux.

29 Carlson et al., “Software Defined Storage.”

30 RAID, or the Redundant Array of Independent Disks, is a technology that splits a file into multiple chunks and spreads them across multiple disks in a storage device, adding extra copies of the chunks so that the file can be recovered if an individual drive fails.

31 Abhijith Shenoy, “The Pros and Cons of Erasure Coding & Replication vs RAID in Next-Gen Storage Platforms” (Software Developer Conference, Storage Networking Industry Association, 2015), https://perma.cc/YFS5-KXKK.

32 Glenn Heinle, “Unlocking Ceph” (presentation, Designing Storage Architectures for Digital Collections, Washington, DC: Library of Congress, 2019), https://perma.cc/Z2R9-79ZE; Tamara Scott, “Big Data Storage Wars: Ceph vs Gluster,” TechnologyAdvice (blog), May 14, 2019, https://perma.cc/2YY2-BBXG.

33 Giacinto Donvito, Giovanni Marzulli, and Domenico Diacono, “Testing of Several Distributed File-Systems (HDFS, Ceph and GlusterFS) for Supporting the HEP Experiments Analysis,” Journal of Physics: Conference Series 513, no. 4 (June 2014): 042014, https://doi.org/10.1088/1742-6596/513/4/042014.

34 Matthew Ahrens, “OpenZFS: A Community of Open Source ZFS Developers,” in AsiaBSDCon 2014 (AsiaBSDCon, Tokyo, Japan: BSD Research, 2014), 27–32, https://perma.cc/XG79-PBU7.

35 Brian Hickmann and Kynan Shook, “ZFS and RAID-Z: The Über-FS?” (University of Wisconsin–Madison, December 2007), https://perma.cc/W5PD-ENPP.
36 Garnett, Winter, and Simpson, “Checksums on Modern Filesystems.”

37 Edward Shishkin, “Reiser5 (Format Release 5.X.Y),” MARC mailing list archive, 2019, https://perma.cc/DN8Y-V8KQ.

38 “Fujifilm Launches ‘Fujifilm Software-Defined Tape,’” FUJIFILM Europe, May 19, 2020, https://perma.cc/B3GN-PLR9.

39 Aghayev et al., “File Systems Unfit as Distributed Storage Backends.”

40 IBM Systems, “Tape Goes High Speed,” 2016, https://perma.cc/FNV9-RTG9; “Fujifilm Launches ‘Fujifilm Software-Defined Tape’”; Desire Athow, “Here’s What Sony’s Million Gigabyte Storage Cabinet Looks Like,” TechRadar (blog), 2020, https://perma.cc/VHN4-LAYT.

41 David Rosenthal, “Optical Media Durability: Update,” DSHR’s Blog, August 20, 2020, https://perma.cc/VKW9-83J3.

42 Andrew Hankinson et al., “The Oxford Common File Layout: A Common Approach to Digital Preservation,” Publications 7, no. 2 (June 2019): 39, https://doi.org/10.3390/publications7020039.

43 Andrew Hankinson et al., “Oxford Common File Layout Specification,” July 7, 2020, https://perma.cc/S73Z-3N6K.

44 Marco La Rosa et al., “Our Thoughts on OCFL over S3 · Issue #522 · OCFL/Spec,” GitHub, accessed March 12, 2021, https://perma.cc/PA3G-CB78.

45 Hannah Frost, “Version 1.0 of the Oxford Common File Layout (OCFL) Released,” Stanford Libraries (blog), July 23, 2020, https://perma.cc/5J5M-GYQW; Andrew Woods, “Implementations | OCFL/Spec,” GitHub, February 10, 2021, https://github.com/OCFL/spec.

46 While serverless might be the ultimate microservice, requiring the least amount of overhead, costs may still be hard to predict.

47 Ryan Luecke, “CRC32 Checksums; The Good, the Bad, and the Ugly,” Box Blog, October 12, 2011, https://perma.cc/MVP7-YVZV.

48 Aghayev et al., “File Systems Unfit as Distributed Storage Backends.”

49 Junkil Ryu and Chanik Park, “Effects of Data Scrubbing on Reliability in Storage Systems,” IEICE Transactions on Information and Systems E92-D, no. 9 (September 1, 2009): 1639–49, https://doi.org/10.1587/transinf.E92.D.1639.

50 Raghavendra Talur, “BitRot Detection | Gluster/Glusterfs-Specs,” GitHub, August 15, 2015, https://github.com/gluster/glusterfs-specs/blob/fe4c5ecb4688f5fa19351829e5022bcb676cf686/done/GlusterFS%203.7/BitRot.md.

51 Schaefer et al., “User Guide for the Preservation Storage Criteria.”

52 Bill Branan, “Cloud-Native Preservation” (OSF, October 22, 2019), https://osf.io/kmdyf/.

53 Andrew Hankinson et al., “Implementation Notes, Oxford Common File Layout Specification,” July 7, 2020, https://perma.cc/PVF8-SQFN.

54 Although out of scope in terms of the stack, the policies and practices implemented by organizations can have a direct impact on digital preservation sustainability. For example, appraisal can be the most powerful tool available to an organization to control the amount of content being preserved. Despite storage vendors’ proclamations that storage is cheap, digital preservation is not. It is neither wise nor necessary to keep every digital file. Organizations will benefit from applying flexible appraisal systems that reduce the amount of content needing preservation, but also from establishing different classes of preservation so the most advanced activities are only applied as needed. Additionally, organizations should consider allowing lossy compression to decrease disk usage, where appropriate; compression as an appraisal choice is very similar to choosing to sample a grouping of material rather than preserving the whole. For additional information see Nathan Tallman and Lauren Work, “Approaching Appraisal: Framing Criteria for Selecting Digital Content for Preservation,” in iPres 2018 Conference [Proceedings] (International Conference on Digital Preservation, Boston, Mass.: OSF, 2018), https://doi.org/10.17605/OSF.IO/8Y6DC.

55 David Rosenthal, “Cloud for Preservation,” DSHR’s Blog, 2019, https://perma.cc/ZLS9-R857.

56 Pendergrass et al., “Toward Environmentally Sustainable Digital Preservation.”

57 Henry Newman, “Industry Trends” (presentation, Designing Storage Architectures for Digital Collections, Washington, DC: Library of Congress, 2019), https://perma.cc/3MGK-N5U3.

58 T. Bui et al., “ARCHANGEL: Trusted Archives of Digital Public Documents,” in Proceedings ACM Document Engineering 2018 (Association for Computing Machinery, arxiv.org, 2018), https://doi.org/10.1145/3209280.3229120.

59 Ben Fino-Radin and Michelle Lee, “[Starling]” (presentation, Designing Storage Architectures for Digital Collections, Washington, DC: Library of Congress, 2019), https://perma.cc/7LGU-UEW9.

60 For additional information on the differences of proof-of-stake vs. proof-of-work models, see Peter Fairley, “Ethereum Plans to Cut Its Absurd Energy Consumption by 99 Percent,” IEEE Spectrum (blog), January 2, 2019, https://perma.cc/GCH7-T556.

61 Julian Morley, “Storage Cost Modeling” (presentation, PASIG, Mexico City, Mexico, 2019), https://doi.org/10.6084/m9.figshare.7795829.v1.