Accessibility of Tables in PDF Documents: Issues, Challenges, and Future Directions ARTICLE Accessibility of Tables in PDF Documents Issues, Challenges, and Future Directions Nosheen Fayyaz, Shah Khusro, and Shakir Ullah INFORMATION TECHNOLOGY AND LIBRARIES | SEPTEMBER 2021 https://doi.org/10.6017/ital.v40i3.12325 Nosheen Fayyaz (nosheenfayaz@uop.edu.pk) is doctoral candidate, University of Peshawar. Shah Khusro (khusro@uop.edu.pk) is Professor, University of Peshawar. Shakir Ullah (ullah@ulm.edu) is Instructor, University of Louisiana Monroe. © 2021. ABSTRACT People access and share information over the web and in other digital environments, including digital libraries, in the form of documents such as books, articles, technical reports, etc. These documents are in a variety of formats, of which the Portable Document Format (PDF) is most widely used because of its emphasis on preserving the layout of the original material. The retrieval of relevant material from these derivative documents is challenging for information retrieval (IR) because the rich semantic structure of these documents is lost. The retrieval of important units such as images, figures, algorithms, mathematical formulas, and tables becomes a challenge. Among these elements, tables are particularly important because they can add value to the resource description, discovery, and accessibility of documents not only on the web but also in libraries if they are made retrievable and presentable to readers. Sighted users comprehend tables for sensemaking using visual cues, but blind and visually impaired users must rely on assistive technologies, including text- to-speech and screen readers, to comprehend tables. However, these technologies do not pay sufficient attention to tables in order to effectively present tables to visually impaired individuals. Therefore, ways must be found to make tables in PDF documents not only retrievable but also comprehensible. Before developing such solutions, it is necessary to review the available assistive technologies, tools, and frameworks for their capabilities, strengths, and limitations from the comprehension perspective of blind and visually impaired people, along with suitable environments like digital libraries. We found no such review article that critically and analytically presents and evaluates these technologies. To fill this gap in the literature, this review paper reports on the current state of the accessibility of PDF documents, digital libraries, assistive technologies, tools, and frameworks that make PDF tables comprehensible and accessible to blind and visually impaired people. The study findings have implications for libraries, information sciences, and information retrieval. INTRODUCTION The web has a huge collection of documents, including pages, books, blogs, articles, reports, etc., available in different formats. These formats include HTML (HyperText Markup Language), EPUB (Electronic PUBlication), AZW (AmaZon Word), and the ubiquitous PDF (Portable Document Format) format. PDF is layout oriented and unstructured, having elements such as text, images, tables, and metadata. All these elements carry specific information and have their relative importance. Tables can be part of a structured or unstructured document. A structured table, like in HTML, is relatively easy to extract and interpret, as it has a starting and ending tag pair for the table itself, its headings, each row, and discrete values. However, unstructured documents, which can include books, journals, audio, video, images, and documents, do not follow a specified format or structure for the organization of information.1 A table has levels of abstraction; the higher levels mailto:nosheenfayaz@uop.edu.pk mailto:khusro@uop.edu.pk mailto:ullah@ulm.edu INFORMATION TECHNOLOGY AND LIBRARIES SEPTEMBER 2021 ACCESSIBILITY OF TABLES IN PDF DOCUMENTS | FAYYAZ, KHUSRO, AND ULLAH 2 of abstraction have fewer details whereas a lower level gives more information. The human has to understand and comprehend the underlying semantics of the table content for sensemaking. The content of a table has a strong bond with its context, as it has concrete information regarding the surrounding text; therefore, tables are hard to comprehend when taken out of the context. Poorly conceived information is more dangerous, as it can lead to misconceptions and poor decisions. Any system or component that interacts with humans must be capable of offering comprehensible explanations.2 A reader understands a table in at least three cognitive processes: comprehension, searching, and interpretation & comparison.3 In contrast, blind and visually impaired persons need assistance in comprehending the tabulated information, for example, understanding table structure and its content, searching for particular information in a table, and comparing and interpreting tabular data. Therefore, they need technical solutions for reading documents.4 According to the World Health Organization (WHO), the number of blind and visually impaired people has increased significantly and has risen to 2.2 billion, so technical solutions or assistive technology are a must for their reading.5 Assistive technologies are supposed to handle the three main kinds of print disabilities: vision problems, motor skill problems, and cognitive problems.6 For vision problems we have tools like text-to-speech and screen readers to help blind and visually impaired people to read text documents. However, these tools work on the upper level of abstraction and give limited information to users because they focus on text and ignore components related to presentation such as tables, graphs, images, etc. This limitation is not only found on the web but also affects other digital environments including digital libraries, where more reliable document collections are present but their retrieval and presentation to blind and visually impaired people is challenging.7 For example, a study identified the limitations of digital libraries in meeting the specific needs of blind and visually impaired people and suggested including help features for a more user-friendly experience.8 Michigan State University has taken an initiative to make digital content accessible by adopting WCAG 2.0 as official technical guidelines. They presented a five- year accessibility plan for making new, existing, and purchased content accessible along with resource allocations, training for their staff, and future requirements.9 These efforts may motivate other libraries to adopt such measures and make it a convenient place for people with disabilities. Common accessibility problems include the lack of alternate descriptions, using visual cues to describe interactions in the user interface, fuzzy visuals, and audios.10 Furthermore, the sight- centered nature of the digital library creates problems for blind users, such as the absence of meaningful descriptions for nontext content and instructions, along with information about the digital library’s features due to missing textual or verbal instructions.11 The traditional usage of a digital library makes a canned or routine utilization of its collections, which may be broadened by making computational ready collections.12 The accessibility of these documents will help in knowledge dissemination to blind and visually impaired people. Researchers have presented frameworks and algorithms for exploring and interpreting PDF elements like images, charts, tables, and graphs. These interpretations are the basis for both humans and machines to gain meaningful insights out of tabular data. This paper highlights the significance of the rich semantics of PDF tables and the challenges in their interpretation and their presentation to blind and visually impaired people. It is proposed to present the tables’ explicit and implicit information in a progressive manner to reduce the cognitive overload on blind and visually impaired individuals. This might be achieved by providing some basic information (such as a table caption and the number of rows and columns in the table), which may be followed by INFORMATION TECHNOLOGY AND LIBRARIES SEPTEMBER 2021 ACCESSIBILITY OF TABLES IN PDF DOCUMENTS | FAYYAZ, KHUSRO, AND ULLAH 3 navigation and querying within the table. This stepwise approach of leveraging a table’s semantics may help in its better comprehension. Table semantics will also be helpful in libraries, information science, and information retrieval, as it has the potential to improve library cataloging and classification. The next section of this paper discusses and reviews the efforts and limitations in the existing literature and presents a general model of table processing and interpretation. The prominent issues and challenges are identified regarding general table structure, format, interpretations, and evaluation; the presentation of tables to blind and visually impaired people; and specifically the accessibility issues in digital library are presented in the following section. The last section contains some future research directions that will unleash some new dynamics of this domain. THE CURRENT STATE OF TABLE PROCESSING A table presents summarized information in a particular arrangement, where the structure of the table reveals some implicit semantics. In 1996, Xinxin Wang defined the logical structure and style rules of tables and presents Wang’s abstract model.13 The model separated the logical structure from layout specification and is considered a generic and complete model in the literature.14 Unstructured tables can be regular or irregular. A regular table has intersecting vertical and horizontal borders that develops a table of cell bounding boxes, while in an irregular table, there is no relationship between the number of rows and columns.15 Tables can be n-dimensional, having spanning cells or multiline cells. Tables can be long and span multiple pages and they can be floating (can be placed to the left or right of the page, with text wrapped around them). Sometimes tables have no explicit boundaries and even worse, cell separators may not be visible. A table can have a variety of content that includes numerical data, text, symbols, images, and equations.16 The location of table and identification of table structure in documents is evaluated in the International Conference on Document Analysis and Recognition (ICDAR) table competition.17 The methods used for the identification of table structure are rule-based methods, 18 data-driven methods,19 and the graphical neural network.20 Multiple frameworks and approaches are used for the extraction and processing of tables from structured documents like HTML,21 and unstructured documents like images and PDF documents.22 Keeping in view all the conducted studies and research, we present a general model of table processing and interpretation in figure 1, showing the prominent inputs (web, PDF, and images), processes, and some notable outputs. The model has a table extraction process that is followed by processing which yields multiple outputs including organized data, analyzed structure, and analyzed content. Processing can be followed by establishing relationships within the table, within table and context, or with other related tables. Moreover, tables presented in open formats such as CSV, XML, JSON, and RDF extend their potential by exploration, creating or extending ontologies and knowledge bases, and publishing tables on an LOD cloud to establish links with other open data sources. Below is a detailed discussion of table extraction and processing and the relationships of tables with content and context. INFORMATION TECHNOLOGY AND LIBRARIES SEPTEMBER 2021 ACCESSIBILITY OF TABLES IN PDF DOCUMENTS | FAYYAZ, KHUSRO, AND ULLAH 4 Figure 1. A general model for table interpretation. Table Extraction and Processing The recognition, extraction, and processing of tables, from a variety of documents, have used multiple approaches.23 The hidden table semantics will not only help in understanding tables but can also contribute to digital library cataloging. These approaches are categorized in the following sections. Using Heuristics Different heuristic approaches are presented for extraction and processing of unstructured tables . For example, PDFTREX uses spatial features and follows a bottom up approach for the recognition and extraction of tables. It represents a table as a two-dimensional grid on a Cartesian plane and extracts the table as a set of cells along with their coordinates.24 In another approach, natural language processing (NLP) features are used for deeper understanding of text. These tools use parts of speech and dependency paths for the extraction of tables and for finding relations among tables by using the NLP toolkit.25 Milosevic et al. presented five steps for table processing, i.e., table detection, functional analysis, structural analysis, synthetic analysis, and semantic analysis,26 while Roya Rastan endorses the first three steps in his PhD dissertation and proposed a framework for the processing of tables. This framework consists of four layers: input management, table processing, storage, and management.27 To recognize and extract tables from documents, ad hoc heuristics are used with the existing methods and techniques, which includes three steps: (1) “preprocessing” to define and prepare text chunks from a source table by using the features of text like font, space, and bounding box; (2) “text block recovering” to identify the set of text chunks that could be treated as the content of a single cell; and (3) “cell recovering” to observe the arrangement of cells for identifying the rows and columns.28 The authors exploited appearance features of text printing instructions and position of drawing cursor for table detection and structure recognition in their web-based solution. They claimed to attain an accuracy or F-score of 83.18% for table extraction and 93.64% for structure recognition.29 Furthermore, an interactive document reader was presented by the researchers of Stanford University, in which structural analysis was combined with rule-based matching and natural language processing to associate a table’s values with the related text to develop sentence-table INFORMATION TECHNOLOGY AND LIBRARIES SEPTEMBER 2021 ACCESSIBILITY OF TABLES IN PDF DOCUMENTS | FAYYAZ, KHUSRO, AND ULLAH 5 pairs in the document. They also tried to relate tables in two data sets but unfortunately obtained 48.8 % results.30 Although the results were not satisfactory, this effort opens up new endeavors for practitioners and researchers. Using Segmentation The segmentation approach is used for identification of tables in unstructured and untagged PDF documents, along with its columns and rows.31 Visual separators and geometric content layout information is used for the extraction of tables from multiple pages of documents, and the technique is tested on e-books and scientific documents.32 Ali et al. adopted a segmentation approach to deal with incomplete, impure, and complex tables by extracting table schema, data, and reading paths of data to represent in a layout independent format.33 The extraction of tables from images also used segmentation through a top-down pipeline approach. The text and tables in medical laboratory reports were identified, where the content of tables needs to be correctly captured and interpreted.34 As the medical tables include text, numbers, characters, and symbols, therefore, their correct interpretations is critical in medical reports and a minor error can lead to very dangerous outcomes. A system named TEXUS used segmentation to prepare text chunks for finding relations among cells. The system provided end-to-end processing of tables, which claimed to detect a variety of tables in layout-independent format from a data set of complex financial tables. The system interpreted the tables and produced an XML file about the structure of the tables showing the access paths of each data cell as an attribute. 35 Similarly, page segmentation is carried out by using deep learning methods to identify tables, text, and figures.36 Using Machine Learning and Deep Learning Approaches Machine learning and deep learning techniques are also used for automatic detection and extraction of table data. Random forest classifiers are used to detect a table header.37 Multitask fully convolutional neural networks (FCN) are used for page segmentation to identify the tables, text block, and figure elements.38 The K-nearest neighbor method and layout heuristics are used in a system named TAO for the automatic detection, extraction, and organization of data in tables in order to generate an enriched representation of data. 39 Similarly, deep learning techniques like R- CNN are used to capture tables from a University of Las Vegas data set of document images. Based on the assumption that tabular data is mostly numeric, the researchers used color-coding/coloration to distinguish numeric and textual data and claim to have achieved improved performance.40 Table detection and recognition in born-digital documents and images is carried out by using transfer learning for faster R-CNN in order to overcome the problem of labeled data sets and FCN semantic segmentation is used for table structure recognition. The method is evaluated using the ICDAR 2013 data set.41 Another approach, named DeCNT, worked on images (in any format) for the extraction of tables by using a combination of deformable CNN with faster R-CNN/FPN. The method is evaluated using the ICDAR 2013 data set and the ICDAR 2017 POD data set, UNLV, Mormont.42 In other research, the authors pointed out the weakness in the existing methods and techniques for understanding tables and presented a “graph neural network” approach to analyze the structure of PDF tables and handle the spanning cells.43 Along with that, multiple deep learning techniques are used for integrating and querying tables using word embedding, RNN, KNN, and LSTM for the classification of financial tables. 44 In the case of web tables, annotation of table columns is performed by using Convolutional Neural Network (CNN) along with transfer learning INFORMATION TECHNOLOGY AND LIBRARIES SEPTEMBER 2021 ACCESSIBILITY OF TABLES IN PDF DOCUMENTS | FAYYAZ, KHUSRO, AND ULLAH 6 in order to overcome the problem of a shortage of target data sets. The column semantics are embedded into vector space and are used for predicting the type of columns without using the metadata. The method is tested on two web table data sets, T2Dv2 and Limaye.45 Using Ontologies Ontologies have also played a vital role in the detection, recognition, and annotation of tables from the web, images, and PDF documents. Ontologies consider the content and structure of a table for their conceptual representation. A system named TableMiner was used to interpret the tables semantically by identifying the semantic concepts for the columns and disambiguating the cell content by using RDFa and microdata for improved annotations of the table. 46 TableMiner considered relational tables and mapped the headers of table with the properties of ontology for linking the cell values with entities.47 The relationships between a table and its context are also extracted and annotated, and to remove disambiguation, the provenance of relationships is preserved.48 A framework named TABEL, developed by Varish Mulwad, has a module for converting a table to a graphical model to infer the semantics of table header, cells, and their relation to each other. These semantics are used to convert the graphical representation to RDF triples by using knowledge bases along with the author’s own defined ontology or any other ontology.49 The ontology is also used for finding the relevant tables in a domain of technical documents only.50 For easy interpretation by ontologies and more usability of the government data, the unstructured tabular data is suggested to be published in open format like CSV (comma separated value).51 The studies mentioned above have mostly used relational tables, technical document tables, government data, and medical data with a main objective of making tables open and interrelated. Along with that, another study argued that besides the metadata of a resource, user-generated content may also be considered and published as linked open data for improved consumption and would also contribute to better cataloging of digital libraries.52 Unfortunately, it still has problems like disambiguation and correlation of complex tables, besides other issues involved in publishing and consuming data as open and linked data.53 Relationship of Tables with Content and Context The content of a table is present in a particular arrangement in order to give some specific information. Therefore, the table content should be interpreted for the hidden semantics among the cell content, context, and with other related tables in a particular domain. In this reg ard, natural language techniques are used for the identification of relationships in the table and the related text using the NLP toolkit. The researchers claimed improvement in table schema identification and quality of relation.54 Similarly, ontologies are used to identify the semantic relations among the text, table contents, and table structure.55 Another research project used rule- based matching and structural analysis for finding the relationship in table cell and sentence text, by developing sentence-table pair in the document. This project tried to develop a relationship between tables of two data sets but achieved only 48.8% success rate.56 A system named TEXUS tried to find out the relations among the values of cells using cell entries, categories, and access paths. They used segmentation techniques for preparation of text chunks and produced an XML file about the structure of the table showing the access paths of each data cell as an attribute.57 Narrowing the table-understanding domain to clinical literature, with a focus on just the numerical and textual data of tables from XML documents, Milosevic et al. extended their previous work and tried to identify the relationship between the table and the surrounding text. They added pragmatic analysis, cell selection, and syntactic analysis, defined five categories of cells, depending upon the data in the cells, and identified seven semantic categories for the INFORMATION TECHNOLOGY AND LIBRARIES SEPTEMBER 2021 ACCESSIBILITY OF TABLES IN PDF DOCUMENTS | FAYYAZ, KHUSRO, AND ULLAH 7 specification of table extraction process. The PubMedCentral data set is used to test the developed system with regard to task, variables and complexity. The authors claimed to achieve an accuracy measure F-score between 82% and 92%.58 A “graph neural network” model was developed to build an undirected graph for the prediction of relations among adjacent cells. The model was tested on benchmark data sets, i.e., ICDAR 2013 and tableBank-2019, and claimed to outperform.59 Another system, named Tablepedia, unified the tables of experimental results with regard to method, data set, metric, score, and source into tuples. The system extracted the related tables and identified the conflicted results by using the rule-based and learning-based methods with the help of SQL operations.60 An SQL-like query was proposed for the financial tables in PDF formats by using deep learning approaches.61 All the mentioned techniques for establishing relationships follow the rule-based, learning-based, segmentation, neural network, heuristics, and ontologies. Among these, the ontologies can establish inter domain relationships and explorations. However, it still has issues that will be discussed in the conclusion section. Existing Accessibility-Driven Solutions for PDF Documents Apart from the systems and frameworks for table understanding and processing, a mech anism or a solution is needed to present tables in a meaningful way to blind and visually impaired people. The accessibility of digital documents is based on the captured structural information and its availability for processing by other software and applications, such as tagged PDF, can help in summarizing, navigating, and providing structural information of the content.62 Nazimi made an effort to present a framework for understanding the complex documents and its components, including images, charts, and tables, in a nonvisual representation to blind and visually impaired people.63 The existing available solutions for reading PDF documents to blind and visually impaired people focus on text and give little attention to its elements such as tables, images, graphs, and charts. Particularly speaking for tables, these solutions either read the table caption and ignore the content, or read the table as if it were text, which renders it meaningless. These assistive technologies are divided into four main categories. 1. Text-to-speech tools 2. Screen readers 3. Voice assistant 4. Natural Language Generator (NLG) Text to speech tools include products like WordTalk, Virtual Speaker, Audiobook Reader Voice, Voicepaper, Dream Reader, etc. These tools can read text from txt, PDF, or doc files aloud and have an interface for user interaction, where the user can copy-paste the text or mention the path of the file to read. Some of the tools are free with limited features while others are proprietary. They need human interaction, which might be difficult to use for a visually impaired or blind person. Screen Readers include JAWS, NVDA, COBRA, VoiceOver, and Talkback. They speak out every user activity that is taking place, like opening or closing a window, clicking on a button, reading text from a txt or PDF file. These tools are helpful for visually impaired people, as they do not need the user to open a specific software and then specify the path of files to read. The most popular tool , JAWS, is proprietary and used for Windows. NVDA is free software for Windows, while VoiceOver is a free tool provided with Apple’s operating systems (including macOS and iOS). NVDA reads the text, taking note of punctuation; it reads the table row by row like text and then reads the caption of the table at the end. It can also read the alternate text of the table if it is included in Acrobat Pro. These tools need the user to be aware of what he or she is doing. INFORMATION TECHNOLOGY AND LIBRARIES SEPTEMBER 2021 ACCESSIBILITY OF TABLES IN PDF DOCUMENTS | FAYYAZ, KHUSRO, AND ULLAH 8 Similarly, there are voice assistants like Apple Siri, Microsoft Cortana, Google Assistant, and Amazon Alexa. All these tools take instructions from the user and then try to provide solutions. These tools may ask users several queries to clarify what they want and provide limited functionalities such as reading out GPS coordinates, playing music, etc. The Natural Language Generator (NLG) is used to convert raw text or data into narrations. Existing popular systems or tools include ARRIA NLG, Quill, IBM Watson, AX NLG Cloud, Amazon Polly, and Wordsmith. These systems are used to perform data analysis and convert the extracted analytics to narrations, which could be easily understand by the user. These tools are not for narrating tables from unstructured documents. However, a framework has been developed using a neural encoder-decoder to generate text from tables. It is claimed that the solution outperformed the existing solutions and achieved higher BLEU score and F-score using data sets WEATHERGOV, WIKIBIO, WIKITABLE, and a Chinese data set WIKIBIOCN.64 This technique focuses on formal tables and ignores complex tables as well as unstructured PDF. A new emerging category in this paradigm is the document-centered assistant, which tries to help users review documents by asking questions. The field is currently studied for the type of questions that a user may ask and the candidate machine learning models that can be used for answering them. The questions would be different from factoid questions and chitchat, because here the focus would be on relevant information from that specific document.65 This category seems to have a big scope for understanding, reviewing, and inferring knowledge from documents. Apart from the solutions mentioned above, there are some Java and Python tools and libraries that are used for table extraction from PDF documents and are shown in table 1. Some of the tools are commercial and claim to extract tables, table rows, and even table cells from documents and images, like PDFTables, DocParser, and PDFTron. Similarly, there are also open-source Java and Python libraries for table and metadata extraction from images and documents. The libraries that extract tables from images are Camelot, Excalibur, and PDFPlumber, whereas the libraries that extract tables from non-image-based documents are Tabula, PyPDF2, PDF Table Extractor, PDFPlumber, and PDFMiner. Among these five, PDF Table Extractor is browser-based and PDFMiner works with structured tables and digs out the semantic relations. For working with unstructured tables in PDF documents and developing a table extractor component for an integrated environment, Tabula, PDFPlumber, and PyPDF2 might be better choices. The research and solutions mentioned above regarding table detection and understanding are carried out to make them meaningful to machines and humans who have no visual impairment or dyslexia. Therefore, the future documentation may consider the inclusion of translations and lay summaries (concise descriptions in simple words) of objects or elements within the document, as essential components, to make them accessible to blind and visually impaired individuals as well.66 In this regard, the World Wide Web Consortium (W3C) developed the Web Accessibility Guidelines for developing web documents to make the nontext elements accessible. These guidelines include elements such as captions for tables and figures, description of figures, and summaries of the tables.67 Similarly, HTML has tags to include summaries of a table, including , ,

. Microsoft Word has an option “text alternative” to add a description of a table or figure for visually impaired people, who will use screen readers for reading the document. Adobe Acrobat Reader also has an accessibility pane to tag tables and add alternative text and descriptions of tables, which is used by the NVDA screen reader to read aloud. Moreover, CommonLook Office, whose motto is “build accessibility into documents early,” has add-ins for Microsoft Word or PowerPoint to add enough accessibility content to the documents to INFORMATION TECHNOLOGY AND LIBRARIES SEPTEMBER 2021 ACCESSIBILITY OF TABLES IN PDF DOCUMENTS | FAYYAZ, KHUSRO, AND ULLAH 9 Table 1. Solutions and libraries for table extraction and processing. S no. Tools Open source Image based Comments 1 Tabula Y N Extracts data tables from PDF and saves as CSV or Excel spreadsheet. It works on native PDF files and cannot extract scanned tables. It supports multiple platforms but does not support batch processing. 2 PDFTables N N Extracts page, table, table row, and even table cell. It is a fully automated API. It supports multiple platforms and multiple programming languages. 3 DocParser N Y Extracts information from images and forms. It is a cloud-based application and supports batch processing. It parses the documents and offers more features but needs human intervention. It shows poor accuracy in handwritten application forms. 4 PDFTron N N Supports multiple platforms and multiple programming languages. 5 Camelot Y Y A Python library that extracts table from images. It has built-in OCR. 6 Excalibur Y Y A web-based solution which is powered by Camelot. 7 PyPDF2 Y N A Python library that can do batch processing with multiple files. 8 PDFPlumber Y Y A Python library built on PDFMiner. 9 PDF Table Extractor Y N A web-based tool built on Tabula. It supports scraping of multiple page tables and comparison of cell values. 10 PDFMiner Y N A Python library that extracts information like location, fonts, and lines of the text. It focuses on analyzing text. It has a PDF parser. It figures out the semantic relationships among structured tables. make the resulting PDF accessible. However, already-developed unstructured documents, without any accessibility features, still need some measures to make the documents understandable to visually impaired or blind users. Keeping in mind the statistics of visually impaired people and the unstructured data of the future—the global data sphere will grow from 33ZB to 175ZB and 80% of this worldwide data will be unstructured—visually impaired individuals cannot be ignored for their access to knowledge.68 Therefore, we would need mechanisms for making these unstructured documents understandable to as many people as possible by incorporating accessibility measures in the document readers. The following section highlights some of the key issues in this domain. ISSUES AND CHALLENGES IN THE EXISTING SYSTEMS Tables can be utilized in multiple scenarios including information extraction, table search, ontology engineering, conversion to DBMS, and document engineering. 69 The situation becomes difficult when a blind or visually impaired person needs to understand the tables. The issues and challenges in dealing with PDF tables are categorized in the following sections. https://tabula.technology/ https://pdftables.com/ https://docparser.com/ https://www.pdftron.com/ https://resourcegovernance.org/analysis-tools/tools/pdf-table-extractor https://resourcegovernance.org/analysis-tools/tools/pdf-table-extractor INFORMATION TECHNOLOGY AND LIBRARIES SEPTEMBER 2021 ACCESSIBILITY OF TABLES IN PDF DOCUMENTS | FAYYAZ, KHUSRO, AND ULLAH 10 Table Structure Tables in PDF documents need more focus on table structure detection because they do not follow a defined formal structure.70 Several knowledge gaps are identified in literature regarding table structure, such as the identification of functional areas of tables, for which Silva argued the use of multiple heuristics and machine learning algorithms in parallel or in sequence.71 The variety of structural layouts creates problems in their identification, which can be handled by defining more rules at the lexical and syntactic layer of table processing. This could also be fruitful for better semantic annotations.72 In addition, the variety of cell content or inconsistent cell content, along with implicit header cells, creates problems in understanding the tables, especially by machines.73 The vector representation of web tables may be applied to PDF tables for semantic annotations and identification of column types.74 Along with that approach, graphical representation and a graphical neural network (GNN) can also be used for better structure identification in multiple domains.75 New data sets need to be introduced for structure recognition in various domains, including business and finance, as they use a huge amount of tables in their documents.76 From the discussion above, the table structure inconsistencies, cell content inconsistencies, functional and logical processing of tables needs more research effort to eliminate the stated problems. Along with that, the inclusion of more data sets will also help in handling the diversity in the field. Table Formats The existing format of tables in PDF lacks the metadata needed for further processing; therefore, the conversion of PDF tables to other formats, especially open formats, will open new endeavors. Some researchers have worked on converting tables to CSV format, which retains the basic structure but lacks some cell formatting. Researchers worked on the transformation of web tables to relational tables for easy manipulation.77 In contrast, XML can handle complex data and is more easily read by humans. Therefore, a methodology is presented to work on tables in XML format, but it considers tables having text and numerical data only.78 JSON, another format, can also be used as an alternative to XML; it is smaller in size than the XML and can handle complex and hierarchical data. The JSON format has less support than XML but is preferred for web application due to its interoperability and lightweight features. Table Interpretation The variable representation patterns of table values, dense content and natural languag e processing create problems in the correct interpretation of tables.79 Anaphoric resolution techniques and documenting level discourse parsers are suggested to handle complex references among multiple domains.80 Moreover, handling the locality features of a table and the annotation of its property feature can lead to better interpretation of tables.81 The use of a knowledge base is suggested for understanding and annotating the relationships among tables and text to get more information about the extracted entities from tables and text.82 Similarly, the extraction of data and its precision in medical and financial tables is an issue that needs the attention of researchers, as both fields have crucial and important data in its tables. 83 For easy interpretation of tables, machine learning classifiers, based on table headings and captions, can be used to classify them into their respective domain.84 The relationship of tables in a specific domain and or among multiple domains can be achieved by developing ontologies.85 This will enable the tables to be published on an LOD cloud that will establish more relationships and infer insights from multiple domains. INFORMATION TECHNOLOGY AND LIBRARIES SEPTEMBER 2021 ACCESSIBILITY OF TABLES IN PDF DOCUMENTS | FAYYAZ, KHUSRO, AND ULLAH 11 Table Evaluation Most of the researchers working on PDF tables have tried to evaluate their work with popular data sets such as ICDAR 2013, ICDAR 2015, ICDAR 2017 POD, PubMed, UNLV, and Mormont. As we have PDF documents in multiple domains, therefore, new data sets should be introduced for structure recognition, especially in business and finance, as these domains use a large number of tables in their documents.86 An evaluation methodology was proposed for table detection, structure recognition, and its functional and semantic analysis.87 Unfortunately, there are no standard metrics, parameters, and formal methodology for table processing evaluation.88 Therefore, standard evaluation metrics should be defined for PDF tables, in order to standardize the evaluation of algorithms and frameworks. Table Presentation to Blind and Visually Impaired Users The available tools and techniques for reading aloud documents to blind and visually impaired people either read the table caption only and ignore the content or treat the tables as text and read the rows line by line. This does not help these users to understand the semantics of the table and its content. Besides the content of the table, its layout shows grouping and connections among the content which is not presented to blind and visually impaired people by current solutions.89 Therefore, tools and screen readers need to present tables in nonvisual format or give a summarized view of tables by following the guidelines of W3C, instead of reading the table like text.90 The summarized view of tables can become part of bibliographic metadata and can contribute in cataloging in the perspective of linked and open data. 91 A study highlighted the accessibility of published PDF articles by four journal publishers and presented the findings in graphs to show the trend from 2009 to 2013, by taking parameters including meaningful title, alternate text for images, and logical reading order.92 The author further applied the same methodology to analyze the articles published in next four years (2014 to 2018) and came to the conclusion that accessibility of PDF documents had improved. However, the journal publishers , who should be more aware of disability and accessibility, did not consistently follow the PDF/UA accessibility requirements and WCAG 2.0 when producing PDF versions of their articles.93 Therefore, visually impaired individuals should be provided with a mechanism for understanding the digital content and underlying semantics at multiple levels of abstractions, like the general information about the document and its elements—including tables—its structure and content, navigation in the table, and querying the table to get more details and lessen cognitive overload. Accessibility of Digital Library Collection The accessibility of large-scale digital library collections can enhance content for sighted as well as visually impaired users. The traditional utilization of digital library collections needs to be broadened by making computation-ready collections meant to be used and consumed in multiple domains.94 An effort was made by researchers to digitize and archive a digital repository of images and convert them to PDF/A documents but, unfortunately, the researchers came up with limited semantics as they did not consider the elements within the documents themselves.95 The accessibility of these converted documents may be compromised with these limited semantics. The rich semantics of tables can be used in the bibliographic classification of a digital library’s collection to increase the search width of the digital library.96 Blind and visually impaired users can be assisted in using digital libraries, as they may need help at physical and cognitive levels. At the physical level, the blind may face difficulty in accessing information, identifying path and status, and efficiently evaluating information. At the cognitive level, they may face problems in understanding multiple structures, programs, information, features of the digital library, and the need to stick to some specific formats. Therefore, the inclusion of help features will make the INFORMATION TECHNOLOGY AND LIBRARIES SEPTEMBER 2021 ACCESSIBILITY OF TABLES IN PDF DOCUMENTS | FAYYAZ, KHUSRO, AND ULLAH 12 digital library friendly to blind and visually impaired people by incorporating meaningful descriptions for nontextual elements.97 The sight-centered nature of the digital library creates problems for blind and visually impaired users due to missing textual or verbal instructions. Some researchers identified the inclusion of labels and meaningful descriptions for hyperlinks, instructions, structure, multimedia content and nontext content to make digital libraries friendly to blind and visually impaired people.98 At the same time, others argue for improvement in usability by introducing help features in terms of usefulness, ease of use, and user satisfaction.99 The accessibility of digital libraries in general and its content in specific may be improved by accommodating help features in the interface and meaningful descriptions for the contents’ nontext elements including tables. CONCLUSIONS AND FUTURE RESEARCH DIRECTIONS This study discusses the accessibility of tables included in PDF documents in general as well as in the specific environment of digital libraries. Existing frameworks, algorithms, and solutions for the processing and interpretation of PDF tables, specifically their presentation to blind and visually impaired people, are thoroughly discussed. A general workflow of table processing is also presented in figure 1. The available solutions for reading out PDF documents to blind and visually impaired people are analyzed for their output, specifically for their attitude towards handling tables. Furthermore, a list of resources for table interpretation and presentation are discussed along with their different features. The issues and challenges in table structure, format, interpretation, evaluation, its presentation to blind, and accessibility of digital library collection are discussed. The researchers working in the domain of accessibility, digital library, and PDF tables can extend and modify the current solutions and algorithms by following the future research directions given below. • The structure of a table has implicit semantic information which a sighted reader can infer but a blind reader needs assistance to understand. The structure of a PDF table is extracted using multiple approaches like heuristics, ontologies, machine learning and segmentation, whereas vectors are used for a web table.100 Therefore, the combinations of multiple approaches and use of vectors for PDF tables may produce better results. • The content of a table is usually numeric or very short text and needs proper interpretation. Therefore, a knowledge base can be used to get more information about the extracted entities from tables and text in order to understand and annotate the relationships among tables and text.101 These knowledge bases can be predetermined or may be selected automatically according to the table content or domain. • Table interpretation can become easy if tables are classified according to their domains by using machine learning classifiers. The classification can be based on table headings and captions, as well as the title and author of the document.102 • Ontologies are used to relate the tables in a specific domain and or among multiple domains, and publishing them on an LOD cloud will establish new relationships.103 This will help in inferring new insights from complex, long, and numerical tables. • Unstructured data and content can be made available for multiple usage and interpretations if it is converted to open formats like CSV, JSON and XML.104 Among these, CSV comes with repeated content, XML needs special parsers, whereas JSON is lightweigh t and easy to write and read.105 It has support from NoSQL databases like MongoDB and Apache CouchDB, and web Application APIs like Twitter, You Tube, and Facebook. INFORMATION TECHNOLOGY AND LIBRARIES SEPTEMBER 2021 ACCESSIBILITY OF TABLES IN PDF DOCUMENTS | FAYYAZ, KHUSRO, AND ULLAH 13 Therefore, JSON might be a better option for the conversion of PDF tables for its multiple interpretation and navigation within tables. • The processes used for evaluation of tables have no defined matrices.106 Therefore, the table evaluation processes should be defined with their respective matrices in order to standardize the research in this domain. • The precision of extracted content of table is very crucial especially in medical, financial, and experimental tables that have numeric data. Therefore, the preprocessing of tables or conversion to other formats would need more attention to avoid any truncation or round off of the data. • The presentation of tables to blind or visually impaired people can be in nonvisual or summarized form.107 The summaries may be presented nonvisually, including the structural layout as well as a brief introduction of the table, to minimize the cognitive overload on these individuals. • To evaluate the accessibility of digital library interfaces, 16 heuristics were proposed to make the digital libraries in reach of users, however, more heuristics are needed to make generalized interfaces for all individuals.108 • The nontext elements of digital library collections should have meaningful descriptions for better understandability of blind and visually impaired individuals. The user-generated content about these nontext elements could be used for cataloging.109 • The rich semantics of tables can be exploited for cataloging and classification that will be helpful in exploratory searching. • As the Michigan State University Libraries has taken the initiative of assessing and improving the accessibility of digital library content by adopting the WCAG guidelines, other libraries can also adopt the model for providing accessible content to their users including blind and visually impaired individuals. • The development of new data sets for tables in multiple domains can facilitate the researchers in interpreting tables and establishing relationships in cross-domains. This review paper is an attempt to highlight the knowledge gap in processing the PDF tables and its accessibility for blind and visually impaired individuals. An efficient and open-source solution for making PDF documents accessible to blind and visually impaired people needs to exploit the heuristics, ontologies, machine learning, and deep learning by using open-source libraries and tools for understanding and interpreting the tabular content in order to reduce information overload. ENDNOTES 1 Roya Rastan, “Automatic Tabular Data Ex WCAG traction and Understanding” (PhD diss., University of New South Wales, 2017). 2 Mark T. Maybury, “Communicative Acts for Explanation Generation,” International Journal of Man-Machine Studies 37, no. 2 (1992): 135–72. 3 Patricia Wright, “The Comprehension of Tabulated Information: Some Similarities between Reading Prose and Reading Tables,” NSPI Journal 19, no. 8 (1980): 25–29, https://doi.org/10.1002/pfi.4180190810. https://doi.org/10.1002/pfi.4180190810 INFORMATION TECHNOLOGY AND LIBRARIES SEPTEMBER 2021 ACCESSIBILITY OF TABLES IN PDF DOCUMENTS | FAYYAZ, KHUSRO, AND ULLAH 14 4 Jean-Claude Guédon et al., Future of Scholarly Publishing and Scholarly Communication: Report of the Expert Group to the European Commission (Brussels: European Commission, Directorate- General for Research and Innovation, 2019), https://doi.org/10.2777/836532. 5 World Health Organization, World Report on Vision, October 8, 2019, https://www.who.int/publications-detail/world-report-on-vision/. 6 Mireia Ribera Turró, “Are PDF Documents Accessible?” Information Technology and Libraries 27, no. 3 (2008): 25–43, https://doi.org/10.6017/ital.v27i3.3246. 7 Kyunghye Yoon, Laura Hulscher, and Rachel Dols, “Accessibility and Diversity in Library and Information Science: Inclusive Information Architecture for Library Websites,” Library Quarterly 86, no. 2 (2016): 213–29, https://doi.org/10.1086/685399. 8 Iris Xie et al., “Using Digital Libraries Non-Visually: Understanding the Help-Seeking Situations of Blind Users,” Information Research 20, no. 2 (2015): 673. 9 Heidi M. Schroeder, “Implementing Accessibility Initiatives at the Michigan State University Libraries,” Reference Services Review 46, no. 3 (2018): 399–413, https://doi.org/10.1108/RSR- 04-2018-0043. 10 Joanne Oud, “Accessibility of Vendor-Created Database Tutorials for People with Disabilities,” Information Technology and Libraries 35, no.4 (2016): 7–18, https://doi.org/10.6017/ital.v35i4.9469. 11 Rakesh Babu and Iris Xie, “Haze in the Digital Library: Design Issues Hampering Accessibility for Blind Users,” Electronic Library 35, no. 5 (2017): 1052–65, https://doi.org/10.1108/EL-10- 2016-0209. 12 Rachel Wittmann et al., “From Digital Library to Open Datasets,” Information Technology and Libraries 38, no. 4 (2019): 49–61, https://doi.org/10.6017/ital.v38i4.11101. 13 Xinxin Wang, “Tabular Abstraction, Editing, and Formatting” (PhD diss., University of Waterloo, 1996). 14 Rastan, “Automatic Tabular Data Extraction,” 25. 15 Azadeh Nazemi, “Non-Visual Representation of Complex Documents for Use in Digital Talking Books” (PhD diss., Curtin University, 2015). 16 Rastan, “Automatic Tabular Data Extraction,” 14. 17 Max Göbel et al., “ICDAR 2013 Table Competition,” in 2013 12th International Conference on Document Analysis and Recognition (2013): 1449–53, https://doi.org/10.1109/ICDAR.2013.292. 18 Burcu Yildiz, Katharina Kaiser, and Silvia Miksch, “pdf2table: A Method to Extract Table Information from PDF Files,” in Proceedings of the 2nd Indian International Conference on Artificial Intelligence (IICAI, 2005): 1773–85; Tamir Hassan and Robert Baumgartner, “Table Recognition and Understanding from PDF Files,” in Ninth International Conference on https://doi.org/10.2777/836532 https://www.who.int/publications-detail/world-report-on-vision/ https://doi.org/10.6017/ital.v27i3.3246. https://doi.org/10.1086/685399 https://doi.org/10.1108/RSR-04-2018-0043 https://doi.org/10.1108/RSR-04-2018-0043 https://doi.org/10.6017/ital.v35i4.9469 https://doi.org/10.1108/EL-10-2016-0209 https://doi.org/10.1108/EL-10-2016-0209 https://doi.org/10.6017/ital.v38i4.11101 https://doi.org/10.1109/ICDAR.2013.292 INFORMATION TECHNOLOGY AND LIBRARIES SEPTEMBER 2021 ACCESSIBILITY OF TABLES IN PDF DOCUMENTS | FAYYAZ, KHUSRO, AND ULLAH 15 Document Analysis and Recognition (ICDAR 2007) (2007): 1143–47, https://doi.org/ 10.1109/ICDAR.2007.4377094; Alexey Shigarov et al., “Tabbypdf: Web-Based System for PDF Table Extraction,” in International Conference on Information and Software Technologies (Springer International Publishing, 2018): 257–69, https://doi.org/10.1007/978-3-319- 99972-2_20. 19 Minghao Li et al., “TableBank: Table Benchmark for Image-Based Table Detection and Recognition,” preprint, arXiv:1903.01949; Sebastian Schreiber et al., “Deepdesrt: Deep Learning for Detection and Structure Recognition of Tables in Document Images,” in 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR) (2017): 1162–67, https://doi.org/10.1109/ICDAR.2017.192. 20 Zewen Chi et al., “Complicated Table Structure Recognition,” preprint, arXiv:1908.04729. 21 Michael Cafarella et al., “Ten Years of Webtables,” in Proceedings of the VLDB Endowment 11, no. 12 (August 2018): 2140–49, https://doi.org/10.14778/3229863.3240492. 22 Shah Khusro, Asima Latif, and Irfan Ullah. “On Methods and Tools of Table Detection, Extraction and Annotation in PDF Documents,” Journal of Information Science 41, no. 1 (2015): 41–57, https://doi.org/10.1177/0165551514551903. 23 Hassan, “Table Recognition and Understanding”; Richard Zanibbi, Dorothea Blostein, and James R Cordy, “A Survey of Table Recognition,” Document Analysis and Recognition 7, no. 1 (2004): 1–16, https://doi.org/10.1007/s10032-004-0120-9; Andreiwid Sheffer Corrêa and Pär-Ola Zander, “Unleashing Tabular Content to Open Data: A Survey on PDF Table Extraction Methods and Tools,” in Proceedings of the 18th Annual International Conference on Digital Government Research (June 2017): 54–63, https://doi.org/10.1145/3085228.3085278; Christopher Clark and Santosh Divvala, “Looking beyond Text: Extracting Figures, Tables and Captions from Computer Science Papers” (paper, AAAI Workshops at the Twenty-Ninth AAAI Conference on Artificial Intelligence, Austin, TX, January 25–26, 2015)., 24 Ermelinda Oro and Massimo Ruffolo, “PDF–Trex: An Approach for Recognizing and Extracting Tables from PDF Documents,” in 2009 10th International Conference on Document Analysis and Recognition (ICDAR) (2009): 906–10, https://doi.org/10.1109/ICDAR.2009.12. 25 Vidhya Govindaraju, Ce Zhang, and Christopher Ré, “Understanding Tables in Context Using Standard NLP Toolkits,” in Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Sofia, Bulgaria: Association for Computational Linguistics, August 2013): 658–64. 26 Nikola Milosevic et al., “Disentangling the Structure of Tables in Scientific Literature,” in Natural Language Processing and Information Systems, NLDB 2016, Lecture Notes in Computer Science 9612 (Springer, Cham), https://doi.org/10.1007/978-3-319-41754-7_14. 27 Rastan, “Automatic Tabular Data Extraction,” 48. https://10.0.4.85/ICDAR.2007.4377094 https://10.0.4.85/ICDAR.2007.4377094 https://doi.org/10.1007/978-3-319-99972-2_20 https://doi.org/10.1007/978-3-319-99972-2_20 https://doi.org/10.1109/ICDAR.2017.192 https://doi.org/10.14778/3229863.3240492 https://doi.org/10.1177/0165551514551903 https://doi.org/10.1007/s10032-004-0120-9 https://doi.org/10.1145/3085228.3085278 https://doi.org/10.1109/ICDAR.2009.12 https://doi.org/10.1007/978-3-319-41754-7_14 INFORMATION TECHNOLOGY AND LIBRARIES SEPTEMBER 2021 ACCESSIBILITY OF TABLES IN PDF DOCUMENTS | FAYYAZ, KHUSRO, AND ULLAH 16 28 Alexey Shigarov, Andrey Mikhailov, and Andrey Altaev, “Configurable Table Structure Recognition in Untagged PDF Documents,” in Proceedings of the 2016 ACM Symposium on Document Engineering, (2016): 119–22, https://doi.org/10.1145/2960811.2967152. 29 Shigarov et al., “Tabbypdf,” 262, 263, 265. 30 Dae Hyun Kim et al., “Facilitating Document Reading by Linking Text and Tables,” in Proceedings of the 31st Annual ACM Symposium on User Interface Software and Technology (October 2018): 423–34, https://doi.org/10.1145/3242587.3242617. 31 Hassan, “Table Recognition and Understanding,” 1145. 32 Jing Fang et al., “A Table Detection Method for Multipage PDF Documents via Visual Separators and Tabular Structures,” in 2011 International Conference on Document Analysis and Recognition (2011): 779–83, https://doi.org/10.1109/ICDAR.2011.304. 33 Bahadar Ali and Shah Khusro, “A Divide-and-Merge Approach for Deep Segmentation of Document Tables,” in Proceedings of the 10th International Conference on Informatics and Systems (May 2016): 43–49, https://doi.org/10.1145/2908446.2908473. 34 Wenyuan Xue et al., “Table Analysis and Information Extraction for Medical Laboratory Reports,” in 2018 IEEE 16th Intl Conf on Dependable, Autonomic and Secure Computing, 16th Intl Conf on Pervasive Intelligence and Computing, 4th Intl Conf on Big Data Intelligence and Computing and Cyber Science and Technology Congress (DASC/PiCom/DataCom/CyberSciTech) (2018): 193–99, https://doi.org/10.1109/DASC/PiCom/DataCom/CyberSciTec.2018.00043. 35 Roya Rastan, Hye-Young Paik, and John Shepherd, “TEXUS: A Unified Framework for Extracting and Understanding Tables in PDF Documents,” Information Processing & Management 56, no. 3 (2019): 895–918, https://doi.org/10.1016/j.ipm.2019.01.008. 36 Dafang He et al., “Multi-scale Multi-task FCM for Semantic Page Segmentation and Table Detection,” in 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR) (2017): 254–61, https://doi.org/10.1109/ICDAR.2017.50. 37 Jing Fang et al., “Table Header Detection and Classification,” in Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence (July 2012): 599–605. 38 He et al., “Multi-scale Multi-task,” 255. 39 Martha O. Perez-Arriaga, Trilce Estrada, and Soraya Abad-Mota, “TAO: System for Table Detection and Extraction from PDF Documents,” Florida Artificial Intelligence Research Society Conference, North America (2016). 40 Saman Arif and Faisal Shafait, “Table Detection in Document Images using Foreground and Background Features,” in 2018 Digital Image Computing: Techniques and Applications (DICTA), (2018): 1–8, https://doi.org/10.1109/DICTA.2018.8615795. 41 Schreiber et al., “Deepdesrt,” 1163, 1164. https://doi.org/10.1145/2960811.2967152 https://doi.org/10.1145/3242587.3242617 https://doi.org/10.1109/ICDAR.2011.304 https://doi.org/10.1145/2908446.2908473 https://doi.org/10.1109/DASC/PiCom/DataCom/CyberSciTec.2018.00043 https://doi.org/10.1016/j.ipm.2019.01.008 https://doi.org/10.1109/ICDAR.2017.50 https://doi.org/10.1109/DICTA.2018.8615795 INFORMATION TECHNOLOGY AND LIBRARIES SEPTEMBER 2021 ACCESSIBILITY OF TABLES IN PDF DOCUMENTS | FAYYAZ, KHUSRO, AND ULLAH 17 42 Shoaib Ahmed Siddiqui et al., “Decnt: Deep Deformable CNN for Table Detection,” IEEE Access 6 (2018): 74151–61, https://doi.org/10.1109/ACCESS.2018.2880211. 43 Chi et al., “Complicated Table Structure Recognition.” 44 Rahul Anand, Hye-Young Paik, and Cheng Wang, “Integrating and Querying Similar Tables from PDF Documents Using Deep Learning,” 2019, preprint, arXiv:1901.04672. 45 Jiaoyan Chen et al., “Colnet: Embedding the Semantics of Web Tables for Column Type Prediction,” in Proceedings of the AAAI Conference on Artificial Intelligence 33, no. 1: 29–36, https://doi.org/10.1609/aaai.v33i01.330129. 46 Ziqi Zhang, “Towards Efficient and Effective Semantic Table Interpretation,” in International Semantic Web Conference (2014): 487–502, https://doi.org/10.1007/978-3-319-11964-9_31. 47 Ivan Ermilov, Sören Auer, and Claus Stadler, “User-Driven Semantic Mapping of Tabular Data,” in Proceedings of the 9th International Conference on Semantic Systems (September 2013): 105–12, https://doi.org/10.1145/2506182.2506196. 48 Martha O Perez-Arriaga, Trilce Estrada, and Soraya Abad-Mota, “Table Interpretation and Extraction of Semantic Relationships to Synthesize Digital Documents,” in Proceedings of the 6th International Conference on Data Science, Technology and Application—DATA (2017): 223– 32, https://doi.org/10.5220/0006436902230232. 49 Varish Mulwad, “TABEL—A Domain-Independent and Extensible Framework for Inferring the Semantics of Tables,” (PhD diss., University of Maryland, 2015). 50 Syed Tahseen Raza Rizvi et al., “Ontology-based Information Extraction from Technical Documents,” in Proceedings of the 10th International Conference on Agents and Artificial Intelligence (ICAART) (2018): 493–500, https://doi.org/10.5220/0006596604930500. 51 Corrêa and Zander, “Unleashing Tabular Content to Open Data,” 55. 52 Irfan Ullah et al., “An Overview of the Current State of Linked and Open Data in Cataloging,” Information Technology and Libraries 37, no. 4 (2018): 47–80, https://doi.org/10.6017/ital.v37i4.10432. 53 Nosheen Fayyaz, Irfan Ullah, and Shah Khusro, “On the Current State of Linked Open Data: Issues, Challenges, and Future Directions,” International Journal on Semantic Web and Information Systems (IJSWIS) 14, no. 4 (2018): 110–28, https://doi.org/10.4018/IJSWIS.2018100106. 54 Govindaraju, Zhang, and Ré , “Understanding Tables in Context Using Standard NLP Toolkits,” 660, 661. 55 Perez-Arriaga, Estrada, and Abad-Mota, “Table Interpretation and Extraction,” 227. 56 Kim et al., “Facilitating Document Reading,” 425, 426. https://doi.org/10.1109/ACCESS.2018.2880211 https://doi.org/10.1609/aaai.v33i01.330129 https://doi.org/10.1007/978-3-319-11964-9_31 https://doi.org/10.1145/2506182.2506196 https://doi.org/10.5220/0006436902230232 https://doi.org/10.5220/0006596604930500 https://doi.org/10.6017/ital.v37i4.10432 https://doi.org/10.4018/IJSWIS.2018100106 INFORMATION TECHNOLOGY AND LIBRARIES SEPTEMBER 2021 ACCESSIBILITY OF TABLES IN PDF DOCUMENTS | FAYYAZ, KHUSRO, AND ULLAH 18 57 Rastan, Pail, and Shepherd, “TEXUS,” 906. 58 Nikola Milosevic et al., “A Framework for Information Extraction from Tables in Biomedical Literature,” International Journal on Document Analysis and Recognition (IJDAR) 22, no. 1 (2019): 55–78, https://doi.org/10.1007/s10032-019-00317-0. 59 Chi et al., “Complicated Table Structure Recognition.” 60 Wenhao Yu et al., “Tablepedia: Automating PDF Table Reading in an Experimental Evidence Exploration and Analytic System,” in The World Wide Web Conference (May 2019): 3615–19, https://doi.org/10.1145/3308558.3314118. 61 Anand, Paik, and Wang, “Integrating and Querying Similar Tables.” 62 Turró, “Are PDF Documents Accessible?” 2, 4. 63 Nazemi, “Non-Visual Representation of Complex Documents,” 110, 111, 112, 118. 64 Juan Cao, “Generating Natural Language Descriptions from Tables,” IEEE Access 8 (2020): 46206–16, https://doi.org/10.1109/ACCESS.2020.2979115. 65 Maartje ter Hoeve et al., “Conversations with Documents: An Exploration of Document-Centered Assistance,” in Proceedings of the 2020 Conference on Human Information Interaction and Retrieval (March 2020): 43–52, https://doi.org/10.1145/3343413.3377971. 66 Guédon et al., “Future of Scholarly Publishing,” 42. 67 W3C, “WCAG 2.0.” 68 World Health Organization, “World Report on Vision”; David Reinsel, John Gantz, and John Rydning, “Data Age 2025: The Digitization of the World, From Edge to Core,” IDC white paper, #US44413318 (Framingham, MA: IDC, November 2018), https://www.seagate.com/files/www-content/our-story/trends/files/idc-seagate-dataage- whitepaper.pdf/. 69 Rastan, “Automatic Tabular Data Extraction,” 18, 19. 70 Arif and Shafait, “Table Detection in Document Images,” 1. 71 Ana Costa e Silva, “Parts that Add up to a Whole: A Framework for the Analysis of Tables,” (PhD diss., Edinburgh University, UK, 2010). 72 Milosevic et al., “A Framework for Information Extraction from Tables,” 60. 73 Rastan, “Automatic Tabular Data Extraction,” 14. 74 Chen et al., “Colnet,” 31. 75 Mulwad, “TABEL,” 23; Zewen, “Complicated Table Structure Recognition.” 76 Siddiqui et al., “Decnt,” 74160. https://doi.org/10.1007/s10032-019-00317-0 https://doi.org/10.1145/3308558.3314118 https://doi.org/10.1109/ACCESS.2020.2979115 https://doi.org/10.1145/3343413.3377971 https://www.seagate.com/files/www-content/our-story/trends/files/idc-seagate-dataage-whitepaper.pdf/ https://www.seagate.com/files/www-content/our-story/trends/files/idc-seagate-dataage-whitepaper.pdf/ INFORMATION TECHNOLOGY AND LIBRARIES SEPTEMBER 2021 ACCESSIBILITY OF TABLES IN PDF DOCUMENTS | FAYYAZ, KHUSRO, AND ULLAH 19 77 David W Embley, Sharad Seth, and George Nagy, “Transforming Web Tables to a Relational Database,” 2014 22nd International Conference on Pattern Recognition (2014) 2781–86, https://doi.org/10.1109/ICPR.2014.479. 78 Milosevic et al., “A Framework for Information Extraction from Tables,” 56. 79 Milosevic et al., “A Framework for Information Extraction from Tables,” 55, 56. 80 Kim et al., “Facilitating Document Reading,” 432. 81 Chen et al., “Colnet,” 36. 82 Asima Latif et al., “A Hybrid Technique for Annotating Book Tables,” Int. Arab J. Inf. Technol 15, no. 4 (2018): 777–83. 83 Rastan, Paik, and Shepherd, “TEXUS,” 909. 84 Milosevic et al., “A Framework for Information Extraction from Tables,” 61, 62, 65, 66. 85 Rizvi et al., “Ontology-based Information Extraction,” 496. 86 Siddiqui et al., “Decnt,” 74160. 87 Max Göbel et al., “A Methodology for Evaluating Algorithms for Table Understanding in PDF Documents,” in Proceedings of the 2012 ACM Symposium on Document Engineering (September 2012): 45–48, https://doi.org/10.1145/2361354.2361365. 88 Rastan, Paik, and Shepherd, “TEXUS,” 917. 89 David Pinto et al., “Table Extraction Using Conditional Random Fields,” in Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (July 2003): 235–42, https://doi.org/10.1145/860435.860479. 90 Nazemi, “Non-Visual Representation of Complex Documents,” 118–44; W3C, “WCAG 2.0.” 91 Ullah et al., “Current State of Linked and Open Data in Cataloging,” 47, 48. 92 Julius T. Nganji, “The Portable Document Format (PDF) Accessibility Practice of Four Journal Publishers,” Library and Information Science Research 37, no.3 (2015): 254–62, https://doi.org/10.1016/j.lisr.2015.02.002. 93 Julius T. Nganji, “An Assessment of the Accessibility of PDF Versions of Selected Journal Articles Published in a WCAG 2.0 Era (2014–2018),” Learned Publishing 31, no. 4 (2018): 391–401, https://doi.org/10.1002/leap.1197. 94 Wittmann et al., “From Digital Library to Open Datasets,” 49, 50. 95 Yan Han and Xueheng Wan, “Digitization of Text Documents Using PDF/A,” Information Technology and Libraries 37, no. 1 (2018): 52–64, https://doi.org/10.6017/ital.v37i1.9878. https://doi.org/10.1109/ICPR.2014.479 https://doi.org/10.1145/2361354.2361365 https://doi.org/10.1145/860435.860479 https://doi.org/10.1016/j.lisr.2015.02.002 https://doi.org/10.1002/leap.1197 https://doi.org/10.6017/ital.v37i1.9878 INFORMATION TECHNOLOGY AND LIBRARIES SEPTEMBER 2021 ACCESSIBILITY OF TABLES IN PDF DOCUMENTS | FAYYAZ, KHUSRO, AND ULLAH 20 96 Asim Ullah, Shah Khusro, and Irfan Ullah, “Bibliographic Classification in the Digital Age: Current Trends & Future Directions,” Information Technology and Libraries 36, no. 3 (2017): 48–77, https://doi.org/10.6017/ital.v36i3.8930. 97 Xie et al., “Using Digital Libraries Non-Visually,” paper 673. 98 Babu and Xie, “Haze in the Digital Library,” 1057–59. 99 Iris Xie et al., “Enhancing Usability of Digital Libraries: Designing Help Features to Support Blind and Visually Impaired Users,” Information Processing and Management 57, no. 3 (2020): 102110, https://doi.org/10.1016/j.ipm.2019.102110. 100 Chen et al., “Colnet,” 31, 32. 101 Kim et al., “Facilitating Document Reading,” 432. 102 Milosevic et al., “A Framework for Information Extraction from Tables,” 61. 103 Rizvi et al., “Ontology-based Information Extraction,” 496. 104 Embley, Seth, and Nagy, “Transforming Web Tables to a Relational Database,” 2783; Milosevic et al., “A Framework for Information Extraction from Tables,” 60. 105 Nicholas J Tierney and Karthik Ram, “A Realistic Guide to Making Data Available Alongside Code to Improve Reproducibility,” preprint, arXiv:2002.11626. 106 Rastan, Paik, and Shepherd, “TEXUS,” 917. 107 Nazemi, “Non-Visual Representation of Complex Documents,” 118–44; W3C, “WCAG 2.0.” 108 Mexhid Ferati and Wondwossen M. Beyene, “Developing Heuristics for Evaluating the Accessibility of Digital Library Interfaces,” in Universal Access in Human–Computer Interaction, Design and Development Approaches and Methods, UAHCI 2017, Lecture Notes in Computer Science 10277 (Springer, Cham), https://doi.org/10.1007/978-3-319-58706-6_14. 109 Ullah et al., “Current State of Linked and Open Data in Cataloging,” 64. https://doi.org/10.6017/ital.v36i3.8930 https://doi.org/10.1016/j.ipm.2019.102110 https://doi.org/10.1007/978-3-319-58706-6_14 ABSTRACT INTRODUCTION THE CURRENT STATE OF TABLE PROCESSING Table Extraction and Processing Using Heuristics Using Segmentation Using Machine Learning and Deep Learning Approaches Using Ontologies Relationship of Tables with Content and Context Existing Accessibility-Driven Solutions for PDF Documents ISSUES AND CHALLENGES IN THE EXISTING SYSTEMS Table Structure Table Formats Table Interpretation Table Evaluation Table Presentation to Blind and Visually Impaired Users Accessibility of Digital Library Collection CONCLUSIONS AND FUTURE RESEARCH DIRECTIONS ENDNOTES