Digitization of Text Documents Using PDF/A
Yan Han
and Xueheng Wan
INFORMATION TECHNOLOGY AND LIBRARIES | MARCH 2018 52
Yan Han (yhan@email.arizona.edu) is Full Librarian at the University of Arizona Libraries, and
Xueheng Wan (wanxueheng@email.arizona.edu) is a student in the Department of Computer Science,
University of Arizona.
ABSTRACT
The purpose of this article is to demonstrate a practical use case of PDF/A for digitization of text
documents following FADGI’s recommendation of using PDF/A as a preferred digitization file format.
The authors demonstrate how to convert and combine TIFFs with associated metadata into a single
PDF/A-2b file for a document. Using real-life examples and open source software, the authors show
readers how to convert TIFF images, extract associated metadata and International Color
Consortium (ICC) profiles, and validate against the newly released PDF/A validator. The generated
PDF/A file is a self-contained and self-described container that accommodates all the data from
digitization of textual materials, including page-level metadata and ICC profiles. Providing
theoretical analysis and empirical examples, the authors show that PDF/A has many advantages over
the traditionally preferred file format, TIFF/JPEG2000, for digitization of text documents.
BACKGROUND
PDF has been primarily used as a file delivery format across many platforms in almost every
device since its initial release in 1993. PDF/A was designed to address concerns about long-term
preservation of PDF files, but there has been little research and few implementations of this file
format. Since the first standard in the family (ISO 19005-1, PDF/A-1) was published in 2005,
several articles have discussed the PDF/A family of standards, relevant information, and how to
implement PDF/A for born-digital documents.1
There is growing interest in the PDF and PDF/A standards after both the US Library of Congress
and the National Archives and Records Administration (NARA) joined the PDF Association in
2017. NARA joined the PDF Association because PDF files are used as electronic documents in
every government and business agency. As explained in a blog post, the Library of Congress joined
the PDF Association because of the benefits to libraries, including participating in developing PDF
standards, promoting best-practice use of PDF, and access to the global expertise in PDF
technology.2
Few articles, if any, have been published about using this file format for preservation of digitized
content. Yan Han published a related article in 2015 about theoretical research on using PDF/A for
text documents.3 In this article, Han discussed the shortcomings of the widely used TIFF and
JPEG2000 as master preservation file formats and proposed using the then-emerging PDF/A as
the preferred file format for digitization of text documents. Han further analyzed the requirements
DIGITIZATION OF TEXT DOCUMENTS USING PDF/A | HAN AND WAN 53
HTTPS://DOI.ORG/10.6017/ITAL.V37I1.9878
of digitization of text documents and discussed the advantages of PDF/A over TIFF and JPEG2000.
These benefits include platform independence, smaller file size, better compression algorithms,
and metadata encoding. In addition, the file format reduces workload and simplifies post-
digitization processing such as quality control, adding and updating missing pages, and creating
new metadata and OCR data for discovery and digital preservation. As a result, PDF/A can be used
in every phase of a digital object in an Open Archival Information System (OAIS)—for example, a
Submission Information Package (SIP), Archive Information Package (AIP), and Dissemination
Information Package (DIP). In summary, a PDF/A file can be a structured, self-contained, and self-
described container allowing a simpler one-to-one relationship between an original physical
document and its digital surrogate.
In September 2016, the Federal Agencies Digital Guidelines Initiative (FADGI) released its latest
guidelines for digitization of raster images, Technical Guidelines for Digitizing Cultural
Heritage Materials.4 The de facto best practices for digitization, these guidelines were written
to guide federal agencies and have been adopted by many cultural heritage institutions. Both the
PDF Association and the authors welcomed the recognition of PDF/A as the preferred master file
format for digitization of text documents such as unbound documents, bound volumes, and
newspapers.5
GOALS AND TASKS
Since Han has previously provided theoretical methods of coding raster images, metadata, and
related information in PDF/A, the goals of this article are threefold:
1. present real-life experience of converting TIFFs/JPEG2000s to PDF/A and back, along with
image metadata
2. test open source libraries to create and manipulate images, image metadata, and PDF/A
3. validate generated PDF/As with the first legitimate validator for PDF/A validation
The tasks included the following:
● Convert all the master files in TIFFs/JPEG2000 from digitization of text documents into
single PDF/A files losslessly. One document, one PDF/A file.
● Evaluate and extract metadata from each TIFF/JPEG2000 image and encode it along with
its image when creating the corresponding PDF/A file.
● Demonstrate the runtimes of the above tasks for feasibility evaluation.
● Validate the PDF/A files against the newly released open source PDF/A validator veraPDF.
● Extract each digital image from the PDF/A file back to its original master image files along
with associated metadata.
● Verify the extracted image files in the back-and-forth conversion process against the
original master image files.
Choices of PDF/A Standards and Conformance Level
This article demonstrates using PDF/A-2b as a self-contained self-describing file format.
Currently, there are three related PDF/A standards (PDF/A-1, PDF/A-2, and PDF/A-3), each with
three conformance levels (a, b, and u). The reasons for choosing PDF/A-2 (instead of PDF/A-1 or
PDF/A-3) are the following:
● PDF/A-1 is based on PDF 1.4. In this standard, images coded in PDF/A-1 cannot use
JPEG2000 compression (named in PDF/A as JPXDecode). One can still convert TIFFs to
PDF/A-1 using other lossless compression methods such as Flate (ZIP). However, the space-
saving benefits of JPEG2000 compression over other methods would not be utilized.
● PDF/A-2 and PDF/A-3 are based on PDF 1.7. One significant feature of PDF 1.7 is that it
supports JPEG2000 compression, which saves 40–60 percent of space for raster images
compared to uncompressed TIFFs.
● PDF/A-3 has one major feature that PDF/A-2 does not have, which is to allow arbitrary
files to be embedded within the PDF file. In this case, there is no file to be embedded.
The authors chose conformance level b for simplicity.
● b is basic conformance, which requires only necessary components (e.g., all fonts
embedded in the PDF) for reproduction of a document’s visual appearance.
● a is accessible conformance, meaning conformance level b plus additional accessibility
requirements (structural and semantic features such as document structure). One can add
tags to convert a PDF/A-2b file to PDF/A-2a.
● u represents a conformance level with the additional requirement that all text in the
document have Unicode equivalents.
This article does not cover any post-processing of additional manual or computational features
such as adding OCR text to the generated PDF/A files. These features do not help faithfully capture
the look and feel of original pages in digitization, and they can be added or updated later without
any loss of information. In addition, OCR results rely on the availability of OCR engines for the
document’s language, and results can vary between different OCR engines over time. OCR
technology is getting better and will produce better results in the future. For example, current
OCR technology for English achieves high accuracy (more than 90 percent), while traditional
Chinese manuscripts and Pashto/Persian documents yield unacceptably low accuracy (less than 60
percent). Cutting-edge OCR engines have started to use artificial neural networks, and the
authors believe that a breakthrough will happen soon.
Data Source
The University of Arizona Libraries (UAL) and Afghanistan Center at Kabul University (ACKU)
have been partnering to digitize and preserve ACKU’s permanent collection held in Kabul. This
collaborative project created the largest Afghan digital repository in the world. Currently the
Afghan digital repository (http://www.afghandata.org) contains more than fifteen thousand titles
and 1.6 million pages of documents. Digitization of these text documents follows the previous
version of the FADGI guideline, which recommended scanning each page of a text document into a
separate TIFF file as the master file. These TIFFs were organized by directories in a file system,
where each directory represents a corresponding document containing all the scanned pages of
this title. An example of the directory structure can be found in Han’s article.
PDF/A and Image Manipulation Tools
There are a few open source and proprietary PDF software development kits (SDKs). Adobe PDF
Library and Foxit SDK are the best-known commercial tools for manipulating PDFs. To show
readers that they can manipulate and generate PDF/A documents themselves, open source
software, rather than commercial tools, was used. Currently, only a very limited number of open
source PDF SDKs are available, including iText and PDFBox. iText was chosen because it has good
documentation and provides a well-built set of APIs supporting almost all PDF and PDF/A
features. Bruno Lowagie (a member of the ISO PDF standards working group) initially wrote iText
in 1998 as an in-house project; he later started his own company, iText, and published iText
in Action with many code examples.6 Moreover, iText offers Java and C# coding options with good
code documentation. It is worth mentioning that iText has different versions: the authors used
iText 5.5.10 and 5.4.4. Using an older version in our implementation generated a noncompliant
PDF/A file because it was not aligned with the PDF/A standard.7
For image processing, there were a few popular open source options, including ImageMagick and
GIMP. ImageMagick was chosen because of its popularity, stability, and cross-platform
implementation. Our implementation identified one issue with ImageMagick: the current version
(7.0.4) could not retrieve all the metadata from TIFF files as it did not extract certain information
such as the Image File Directory and color profile. These metadata are critical because they are
part of the original data from digitization. Unfortunately, the authors observed that some image
editors were unable to preserve all the metadata from image files during the conversion
process. Hart and de Vries used case studies to show the vulnerability of metadata,
demonstrating that metadata elements in a digital object can be lost or corrupted by use or by
conversion of a file to another format. They suggested that action is needed to ensure proper
metadata creation and preservation, so that all types of metadata are captured and preserved
for the most authentic, consistent, and complete digital preservation for future use.8
Metadata Extraction Tools and Color Profiles
As we digitize physical documents and manipulate images, color management is important. The
goal of color management is to obtain a controlled conversion between the color representations
of various devices such as image scanners, digital cameras, and monitors. A color profile is a set of
data that characterizes the color input or output of a device or color space. The International
Color Consortium (ICC) standards and profiles were created to bring various manufacturers
together; embedding color profiles into images is one of the most important color-management
solutions. Image formats such as TIFF and JPEG2000 and document formats such as PDF may contain
embedded color profiles. The authors identified a few open source tools to extract TIFF
metadata, including ExifTool, Exiv2, and tiffinfo. ExifTool is an open source tool for reading,
writing, and manipulating metadata of media files. Exiv2 is another free metadata tool
supporting different image formats. The tiffinfo program is widely used on the Linux platform,
but it has not been updated for at least ten years. Our implementations showed that ExifTool
most easily extracted the
full ICC profiles and other metadata from TIFF and JPEG2000 files. ImageMagick and other image
processing software were examined in Van der Knijff’s article discussing JPEG2000 for long-term
preservation.9 He found that ICC profiles were lost in ImageMagick. Our implementation showed
that the current version of ImageMagick has fixed this issue. A metadata sample can be found
in appendix A.
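For background on what these tools read: an ICC profile begins with a fixed 128-byte header whose fields sit at fixed byte offsets, including the profile size, the preferred CMM type, the device class, the data color space, the profile connection space, and the mandatory 'acsp' file signature. The following sketch builds a toy header (mimicking the APPL/RGB/XYZ values in the appendix A sample) rather than reading a real profile:

```python
import struct


def parse_icc_header(profile: bytes) -> dict:
    """Parse a few fixed-offset fields of a 128-byte ICC profile header."""
    if len(profile) < 128:
        raise ValueError("ICC profile header must be at least 128 bytes")
    size = struct.unpack_from(">I", profile, 0)[0]    # bytes 0-3: profile size
    cmm = profile[4:8].decode("ascii")                # bytes 4-7: preferred CMM type
    device_class = profile[12:16].decode("ascii")     # bytes 12-15: profile/device class
    color_space = profile[16:20].decode("ascii")      # bytes 16-19: data color space
    pcs = profile[20:24].decode("ascii")              # bytes 20-23: profile connection space
    signature = profile[36:40].decode("ascii")        # bytes 36-39: must be 'acsp'
    return {"size": size, "cmm": cmm, "class": device_class,
            "space": color_space, "pcs": pcs, "signature": signature}


# Build a toy 128-byte header for illustration only.
header = bytearray(128)
struct.pack_into(">I", header, 0, 128)
header[4:8] = b"APPL"
header[12:16] = b"mntr"   # 'mntr' = display device profile class
header[16:20] = b"RGB "
header[20:24] = b"XYZ "
header[36:40] = b"acsp"
```

In practice one would extract the real embedded profile with a tool such as ExifTool rather than hand-parsing.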
IMPLEMENTATION
Converting and Ordering TIFFs into a Single PDF/A-2 File
When ordering and combining all individual TIFFs of a document into a single PDF/A-2b file, the
authors intended to preserve all information from the TIFFs, including raster image data streams
and metadata stored in each TIFF’s header. The raster image data streams are the main images
reflecting the original look and feel of these pages, while the metadata (including technical and
administrative metadata such as BitsPerSample, DateTime, and Make/Model/Software) tells us
important digitization and provenance information. Both are critical for delivery and digital
preservation.
The TIFF images were first converted to JPEG2000 with lossless compression using the open
source ImageMagick software. Our tests of ImageMagick demonstrated that it can handle different
color profiles and will convert images correctly if the original TIFF comes with a color profile. This
gave us confidence that past concerns about JPEG2000 and ImageMagick had been resolved. These
images were then properly sorted into their original order and combined into a single PDF/A-2
file. An alternative is to directly code TIFF’s image data stream into a PDF/A file, but this approach
would miss one benefit of PDF/A-2: tremendous file size reduction with JPEG2000. The following
is the pseudocode of ordering and combining all the TIFFs in a text document into a single PDF/A-
2 file.
CreatePDFA2(queue TiffList) {
    Create an empty queue XMLQ;
    Create an empty queue JP2Q;
    /* TiffList is a queue pre-sorted in the document's original page order */
    /* Convert each TIFF to JPEG2000 losslessly, then add each JPEG2000 and
       its metadata to a queue */
    while (TiffList is NOT empty) {
        String tiffFilePath = TiffList.dequeue();
        String xmlFilePath = TIFF metadata extracted using ExifTool;
        XMLQ.enqueue(xmlFilePath);
        String jp2FilePath = location of the JPEG2000 file converted from
                             the TIFF by ImageMagick;
        JP2Q.enqueue(jp2FilePath);
    }
    /* Convert each image's metadata to XMP, then add each JPEG2000 and its
       metadata to the PDF/A-2 file in its original order */
    Document pdf2b = new Document();
    /* Create a writer at PDF/A-2b conformance level */
    PdfAWriter writer = PdfAWriter.getInstance(pdf2b,
        new FileOutputStream(PdfAFilePath), PdfAConformanceLevel.PDF_A_2B);
    writer.createXmpMetadata(); // create root XMP
    pdf2b.open();
    while (JP2Q is NOT empty) {
        Image jp2 = Image.getInstance(JP2Q.dequeue());
        Rectangle size = new Rectangle(jp2.getWidth(), jp2.getHeight());
        pdf2b.setPageSize(size); // set the PDF page size to the image size
        pdf2b.newPage(); // create a new page for the new image
        byte[] bytearr = XmpManipulation(XMLQ.dequeue()); // convert the original
                                                          // metadata to XMP
        writer.setPageXmpMetadata(bytearr);
        pdf2b.add(jp2);
    }
    pdf2b.close();
}
Converting PDF/A-2 Files back to TIFFs and JPEG2000s
To ensure that we can extract raster images from the newly created PDF/A-2 file, the authors also
wrote code to convert a PDF/A-2 file back to the original TIFF or JPEG2000 format. This
implementation was a reverse process of the above operation. Once the reverse conversion
process was completed, the authors verified that the image files created from the PDF/A-2 file
were the same as before the conversion to PDF/A-2. Note that we generated MD5 checksums to
verify the image data streams. The image data streams are identical, but metadata locations can
vary because of the inconsistent TIFF tags used over the years; when converting one TIFF to
another TIFF, ImageMagick applies its own implementation of metadata tags. The code can be found
in appendix B.
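The checksum comparison can be sketched with standard-library MD5 hashing (file paths are placeholders); note that, as described above, it is the extracted image data streams that should be compared, since whole re-serialized TIFF files can differ in metadata tag placement:

```python
import hashlib


def md5_of(path) -> str:
    """Compute the MD5 checksum of a file, reading in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def same_image_data(original, extracted) -> bool:
    """True if the two files carry byte-identical data streams."""
    return md5_of(original) == md5_of(extracted)
```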
PDF/A Validation
PDF/A is one of the most recognized digital preservation formats, specially designed for long-term
preservation and access. However, no commonly accepted PDF/A validator was available in the
past, although several commercial and open source PDF preflight and validation engines (e.g.,
Acrobat) were available. Validating a PDF/A against the PDF/A standards is a challenging task for
a few reasons, including the complexity of the PDF and PDF/A formats. The PDF Association and
the Open Preservation Foundation recognized the need and started a project to develop an open
source PDF/A validator and build a maintenance community. Their result, veraPDF, is an open
source validator designed for all PDF/A parts and conformance levels, released in January 2017
with the goal of becoming the commonly accepted PDF/A validator.10 Our generated PDF/As have
been validated with veraPDF 1.4 and Adobe Acrobat Pro DC Preflight. Both products validated the
PDF/A-2b files as fully compliant. Our implementations showed that veraPDF 1.4 verified more
cases than Acrobat DC Preflight. Figure 1 shows a PDF file structure and its metadata.
Figure 1. A PDF object tree with root-level metadata.
RUNTIME AND CONCLUSION
The time complexity of our code is O(n log n) for n files because of the sorting algorithm used;
the conversion work itself is linear in the total amount of image data. TIFFs were first
converted to JPEG2000. When JPEG2000 images are added to a PDF/A-2 file, no further image
manipulation is required because the generated PDF/A-2 uses JPEG2000 directly (in other words,
it uses the JPXDecode filter). Tables 1 and 2 show the performance comparison running in our
computer hardware and software environment (Intel Core i7-2600 CPU@3.4GHz, 8GB DDR3 RAM,
3TB 7200-RPM 64MB-cache hard disk running Ubuntu 16.10).
Table 1. Runtimes of converting grayscale TIFFs to JPEG2000s and to PDF/A-2b

No. of Files | Total File Size (MB) | Image Conversion Runtime (TIFFs to JP2s, seconds) | Total Runtime (TIFFs to JP2s to a single PDF/A-2b, seconds)
  1 |   9.1 |   3.61 |   3.98
 10 |  91.1 |  35.63 |  36.71
 20 | 182.2 |  71.83 |  73.98
 50 | 455.5 | 179.06 | 184.63
100 | 910.9 | 358.3  | 370.91
Table 2. Runtimes of converting color TIFFs to JPEG2000s and to PDF/A-2b

No. of Files | Total File Size (MB) | Image Conversion Runtime (TIFFs to JP2s, seconds) | Total Runtime (TIFFs to JP2s to a single PDF/A-2b, seconds)
  1 |    27.3 |   14.80 |   14.94
 10 |   273   |  150.51 |  151.55
 20 |   546   |  289.95 |  293.21
 50 | 1,415   |  741.89 |  749.75
100 | 2,730   | 1490.49 | 1509.23
The results show that (a) the majority of the runtime (more than 95 percent) is spent converting
TIFFs to JPEG2000s using ImageMagick (see figure 2); (b) the average runtime of converting a
TIFF scales linearly with the file's size (see figure 2); (c) the runtime of converting a color
TIFF is significantly higher than that of converting a greyscale TIFF (see figure 2); and (d) it
is feasible in terms of time and resources to convert existing master images of digital document
collections to PDF/A-2b. For example, converting 1 TB of color TIFFs would take 552,831 seconds
(153.5 hours; 6.398 days) on the above hardware. The authors have already processed more than
600,000 TIFFs using this method.
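The 1 TB figure is a linear projection from the largest color run in table 2 (100 files, 2,730 MB, 1,509.23 seconds); the arithmetic can be checked directly:

```python
# Linear projection of color-TIFF conversion time from table 2's largest run.
measured_mb = 2730          # total size of the 100 color TIFFs (MB)
measured_seconds = 1509.23  # total runtime for that run (seconds)

terabyte_mb = 1_000_000     # 1 TB expressed in MB (decimal units)
seconds = measured_seconds / measured_mb * terabyte_mb
hours = seconds / 3600
days = hours / 24           # about 6.4 days
print(f"{seconds:,.0f} s = {hours:.1f} h = {days:.3f} days")
```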
The authors conclude that adopting PDF/A, the newly preferred master file format for
digitization of text documents, gives institutions advantages over TIFF/JPEG2000. The above
implementation demonstrates the ease, the reasonable runtime, and the availability of open
source software to perform such conversions. From both theoretical analysis and empirical
evidence, the authors show that PDF/A has advantages over the traditionally preferred file
format, TIFF, for digitization of text documents. Following best practice, a PDF/A file can be a
self-contained and self-described container that accommodates all the data from digitization of
textual materials, including page-level metadata and ICC profiles.
SUMMARY
The goal of this article is to present empirical evidence of using PDF/A for digitization of
text documents. The authors evaluated and used multiple open source software programs for
processing raster images, extracting image metadata, and generating PDF/A files. These PDF/A
files were validated using the up-to-date PDF/A validators veraPDF and Acrobat Preflight.
The authors also calculated the time complexity of the program and measured the total runtime in
multiple test cases. Most of the runtime was spent on image conversion from TIFF to JPEG2000;
the creation of the PDF/A-2b file with associated page-level metadata accounted for less than 5
percent of the total runtime. The runtime of converting a color TIFF was much higher than that
of a greyscale one. Our theoretical analysis and empirical examples show that PDF/A-2 presents
many advantages over the traditionally preferred file formats (TIFF/JPEG2000) for digitization
of text documents.
Figure 2. File size, greyscale and color TIFFs and runtime ratio.
APPENDIX A: SAMPLE TIFF METADATA WITH ICC HEADER
8
3400
4680
8 8 8
Uncompressed
RGB
(Binary data 41025 bytes, use -b option to extract)
3
1
(Binary data 28079 bytes, use -b option to extract)
400
400
Chunky
APPL
2.2.0
Display Device Profile
RGB
XYZ
2006:02:02 02:20:00
acsp
Apple Computer Inc.
Not Embedded, Independent
none
Reflective, Glossy, Positive, Color
Perceptual
0.9642 1 0.82491
EPSO
0
EPSON sRGB
0.43607 0.22249 0.01392
0.38515 0.71687 0.09708
0.14307 0.06061 0.7141
0.95045 1 1.08905
Copyright (c) SEIKO EPSON CORPORATION 2000 - 2006. All rights reserved.
(Binary data 8204 bytes, use -b option to extract)
(Binary data 8204 bytes, use -b option to extract)
(Binary data 8204 bytes, use -b option to extract)
0 0 0
APPENDIX B: SAMPLE CODE TO CONVERT PDF/A-2 BACK TO JPEG2000S
/* Assumption: the PDF/A-2b file was generated from image objects converted
   from TIFF images with JPXDecode, along with page-level metadata */
public static void parse(String src, String dest) throws IOException {
    PdfReader reader = new PdfReader(src);
    PdfObject obj;
    int counter = 0;
    for (int i = 1; i <= reader.getXrefSize(); i++) {
        obj = reader.getPdfObject(i);
        if (obj != null && obj.isStream()) {
            PRStream stream = (PRStream) obj;
            byte[] b;
            try {
                b = PdfReader.getStreamBytes(stream);
            } catch (UnsupportedPdfException e) {
                b = PdfReader.getStreamBytesRaw(stream);
            }
            PdfObject pdfsubtype = stream.get(PdfName.SUBTYPE);
            FileOutputStream fos = null;
            if (pdfsubtype != null
                    && pdfsubtype.toString().equals(PdfName.XML.toString())) {
                // Page-level XMP metadata stream
                fos = new FileOutputStream(dest + "_xml/" + counter + ".xml");
                System.out.println("Page metadata extracted!");
            }
            if (pdfsubtype != null
                    && pdfsubtype.toString().equals(PdfName.IMAGE.toString())) {
                // JPEG2000 image stream (JPXDecode)
                counter++;
                fos = new FileOutputStream(dest + "_jp2/" + counter + ".jp2");
            }
            if (fos != null) {
                fos.write(b);
                fos.flush();
                fos.close();
            }
        }
    }
    System.out.println("JPEG2000 conversion from PDF completed!");
}
/* Then use the ImageMagick library to convert the JPEG2000s to TIFFs */
REFERENCES
1 PDF-Tools.com and PDF Association, “PDF/A—The Standard for Long-Term Archiving,” version 2.4, white paper, May 20, 2009, http://www.pdf-tools.com/public/downloads/whitepapers/whitepaper-pdfa.pdf; Duff Johnson, “White Paper: How to Implement PDF/A,” Talking PDF, August 24, 2010, https://talkingpdf.org/white-paper-how-to-implement-pdfa/; Alexandra Oettler, “PDF/A in a Nutshell 2.0: PDF for Long-Term Archiving,” Association for Digital Standards, 2013, https://www.pdfa.org/wp-content/until2016_uploads/2013/05/PDFA_in_a_Nutshell_211.pdf; Library of Congress, “PDF/A, PDF for Long-Term Preservation,” last modified July 27, 2017, https://www.loc.gov/preservation/digital/formats/fdd/fdd000318.shtml.

2 Library of Congress, “The Time and Place for PDF: An Interview with Duff Johnson of the PDF Association,” The Signal (blog), December 12, 2017, https://blogs.loc.gov/thesignal/2017/12/the-time-and-place-for-pdf-an-interview-with-duff-johnson-of-the-pdf-association/.

3 Yan Han, “Beyond TIFF and JPEG2000: PDF/A as an OAIS Submission Information Package Container,” Library Hi Tech 33, no. 3 (2015): 409–23, https://doi.org/10.1108/LHT-06-2015-0068.

4 Federal Agencies Digital Guidelines Initiative, Technical Guidelines for Digitizing Cultural Heritage Materials (Washington, DC: Federal Agencies Digital Guidelines Initiative, 2016), http://www.digitizationguidelines.gov/guidelines/FADGI%20Federal%20%20Agencies%20Digital%20Guidelines%20Initiative-2016%20Final_rev1.pdf.

5 Duff Johnson, “US Federal Agencies Approve PDF/A,” PDF Association, September 2, 2016, http://www.pdfa.org/new/us-federal-agencies-approve-pdfa/.

6 Bruno Lowagie, iText in Action, 2nd ed. (Stamford, CT: Manning, 2010).

7 “iText 5.4.4,” iText, last modified September 16, 2013, http://itextpdf.com/changelog/544.

8 Timothy Robert Hart and Denise de Vries, “Metadata Provenance and Vulnerability,” Information Technology and Libraries 36, no. 4 (2017), https://doi.org/10.6017/ital.v36i4.10146.

9 Johan Van der Knijff, “JPEG 2000 for Long-Term Preservation: JP2 as a Preservation Format,” D-Lib 17, no. 5/6 (2011), https://doi.org/10.1045/may2011-vanderknijff.

10 PDF Association, “How veraPDF Does PDF/A Validation,” 2016, http://www.pdfa.org/how-verapdf-does-pdfa-validation/.