ExtractIQ

ExtractIQ Process

Throughout history, documents have always needed organization and classification. Physical libraries have document indexes which identify books by unique identifier, title, author, library location, publication date and other metadata. This helps organize the library and allows library users to search for documents using the metadata. Electronic files are no different and require similar organization.

The challenge with electronic files, however, is often the sheer volume of files that require manual processing. A subject matter expert must open each file, review its contents and then manually assign metadata to describe it. Processing even a hundred or so files can take a day, making the process expensive in both time and resources. ExtractIQ Process performs this automatically, lowering the costs and dramatically accelerating the time required to process the files.

The digitization of paper records and electronic files opens a wealth of new insight and intelligence about the organization’s operation and history. Information that was previously locked away within a few domain experts is now disseminated to a much wider audience supporting faster and better decision making. New employee on-boarding is accelerated and enhanced as there is a ready knowledge base now available.

ExtractIQ Process unlocks this wealth of information through its configurable Recognition Engine which automatically extracts metadata from within your files.

The system can perform metadata extraction from a broad range of document types as shown in the diagram opposite.

ExtractIQ Process takes an electronic file and uses natural language processing to automatically extract useful metadata from the contents of the file. If the file is currently in paper form, it can be scanned and converted to an electronic file before being processed. The metadata for each file is stored in the ExtractIQ database and can then be used to bulk upload the files into a target repository with ExtractIQ Upload and/or to search for files using the ExtractIQ Search app.

ExtractIQ processing involves a sequence of processing steps, with the output from one step being fed into the next. The first step is to extract all the words from the source file. As many types of file are supported, several different extraction technologies are integrated into ExtractIQ Process to achieve this.
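The step-by-step flow described above can be sketched as a simple chain of functions, where each step's output feeds the next. The step names and functions below are illustrative only, not the actual ExtractIQ implementation:

```python
# Illustrative sketch of a processing pipeline: each step receives the
# previous step's output. Real ExtractIQ steps are far richer; this
# shows only the chaining pattern.

def extract_words(source_text):
    """Step 1: split raw file contents into individual words."""
    return source_text.split()

def summarize(words):
    """Placeholder for a later linguistic processing step."""
    return {"word_count": len(words), "words": words}

def run_pipeline(data, steps):
    """Feed the output of each step into the next step."""
    for step in steps:
        data = step(data)
    return data

metadata = run_pipeline("Invoice 2021 Acme Corp", [extract_words, summarize])
print(metadata["word_count"])  # 4
```

The same chaining pattern would hold however many extraction and recognition steps are configured.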

As an example, paper documents that have been scanned into an electronic format are processed with Optical Character Recognition (OCR) technology, which automatically recognizes individual characters and words from the content of raster images. One might not expect photographs to contain any useful metadata, but modern cameras and phones record EXIF metadata in addition to the photographic image, including aperture, shutter speed, ISO, focal length, camera model, the date the photo was taken, geographic location and much more.

Some business documents, like invoices and purchase orders, have a positional layout which allows metadata extraction to be zonal: specific zones or areas of the document contain specific metadata items. The ExtractIQ recognition engine can be configured to extract metadata from spatial zones within documents.
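Zonal extraction can be sketched as a mapping from configured bounding boxes to metadata fields, applied to positioned OCR output. The field names, coordinates and word positions below are invented for illustration:

```python
# Hypothetical sketch of zonal extraction. OCR output is modeled as a
# list of (word, x, y) positions; each configured zone maps a metadata
# field to a bounding box (x0, y0, x1, y1).

def words_in_zone(ocr_words, zone):
    """Return the words whose position falls inside the zone."""
    x0, y0, x1, y1 = zone
    return [w for w, x, y in ocr_words if x0 <= x <= x1 and y0 <= y <= y1]

ocr_words = [
    ("INV-1042", 520, 40),   # top-right corner: invoice number
    ("Acme", 60, 120),       # top-left block: supplier name
    ("Corp", 110, 120),
]

zones = {
    "invoice_number": (400, 0, 600, 80),
    "supplier": (0, 80, 200, 160),
}

metadata = {field: " ".join(words_in_zone(ocr_words, box))
            for field, box in zones.items()}
print(metadata)  # {'invoice_number': 'INV-1042', 'supplier': 'Acme Corp'}
```

A production engine would work on real OCR coordinates and handle rotation, scaling and multi-page layouts, but the zone-to-field mapping is the core idea.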

Following the extraction of all words from the source document, three further ExtractIQ linguistic processing steps will take place in order to extract metadata from these words.

Step 1

Named Entity Recognition

The first step, Named-Entity Recognition, seeks to locate named entities within the extracted words and classify them into predefined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.
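The shape of this step can be illustrated with simple regular expressions for two easily patterned categories (monetary values and years). Production NER relies on trained language models rather than regexes; this sketch only shows the kind of labeled output the step produces:

```python
import re

# Illustrative named-entity recognition over extracted text. The two
# patterns and category labels here are assumptions for the example,
# not the categories or technology ExtractIQ actually uses.

PATTERNS = {
    "monetary_value": re.compile(r"[$\u00a3\u20ac]\s?\d[\d,]*(?:\.\d{2})?"),
    "year": re.compile(r"\b(?:19|20)\d{2}\b"),
}

def recognize(text):
    """Return (category, matched_text) pairs found in the text."""
    entities = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            entities.append((label, match.group()))
    return entities

print(recognize("Invoice dated 2021 for $1,250.00"))
# [('monetary_value', '$1,250.00'), ('year', '2021')]
```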

Step 2

Organizational Entity Recognition

The second natural language processing step is called Organizational-Entity Recognition. In this step, real-world taxonomies drawn from the organization are used to classify the types of information that are relevant and important to the history and future operation of that organization.

The classes of information can be weighted to control their relative impact on the assessment of relevancy. As an example, suppose the organization has a database of suppliers that is linked to ExtractIQ. When processing all the words within a file, ExtractIQ will perform entity recognition of supplier names and can therefore cross-reference a specific supplier with that file.

Step 3

Container Recognition

The third processing step allows pattern matching to be configured to extract metadata values from the filename and path. Frequently, useful information about the file is encoded within the filename and directory location. As an example, financial or tax records are often classified by year and grouped together in a filing cabinet drawer or file server folder representing the year the record applies to.

ExtractIQ will automatically extract this information and assign it to a metadata property such as “financial year”. This process can apply to just a subset of files or to the whole corpus.
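Pattern matching over a path can be sketched with a regular expression that pulls a four-digit year out of the directory structure and assigns it to a metadata property. The path layout and the "financial_year" property name are assumptions for the example:

```python
import re

# Illustrative container recognition: extract a year component from a
# file path and assign it to a metadata property. The folder layout
# shown is an assumption, not a required ExtractIQ convention.

YEAR_PATTERN = re.compile(r"(?:^|[/\\])((?:19|20)\d{2})(?:[/\\]|$)")

def extract_financial_year(path):
    """Return a metadata dict if the path contains a year folder."""
    match = YEAR_PATTERN.search(path)
    return {"financial_year": match.group(1)} if match else {}

print(extract_financial_year("finance/tax/2019/return.pdf"))
# {'financial_year': '2019'}
```

Applying such a rule to a subset of files, or to the whole corpus, is then just a matter of which paths the rule is run against.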