Wipro has filed a patent application for a method and system that extracts information from an input document containing multiple data formats. The method creates an HTML document from the input document, realigns it based on column count and the spatial information of its words, determines a document ID for each merged document, generates a hierarchy configuration file, and extracts information using data extractors. The system uses a pretrained machine learning model to classify pages and can handle documents with different data formats. GlobalData’s report on Wipro gives a 360-degree view of the company, including its patenting strategy.
According to GlobalData’s company profile on Wipro, hybrid cloud management was a key innovation area identified from its patents. Wipro's grant share as of September 2023 was 62%. Grant share is the ratio of the number of grants to the total number of patents.
Method and system for extracting information from multi-format documents
A recently filed patent (Publication Number: US20230315799A1) describes a method and system for extracting information from an input document that contains multi-format information. The input document may consist of multiple documents merged into one file, each in a different data format. The method creates an HTML document corresponding to the input document and then realigns it based on the number of columns and the spatial information of the words.
To determine the document identifier (ID) associated with each of the merged documents, a pretrained machine learning (ML) model is used to classify the information on each page of the HTML document. A hierarchy configuration file is generated, which includes the free-flowing text of the merged documents along with the document ID associated with each document. The lines of the merged documents are extracted and categorized as main headings, sub-headings, sub-sections, paragraphs, sub-paragraphs, or a combination thereof, based on various text features. The hierarchy configuration file can be split at the document level based on each document ID.
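The document-level split described above can be illustrated with a minimal Python sketch. The filing does not publish the layout of the hierarchy configuration file, so the flat list of entries, the field names (`doc_id`, `category`, `text`) and the function `split_by_document` are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical flat representation: one entry per extracted line, tagged with
# the document ID predicted for its page and an assumed line category.
hierarchy = [
    {"doc_id": "invoice", "category": "main_heading", "text": "Invoice"},
    {"doc_id": "invoice", "category": "paragraph", "text": "Total due: 120 USD"},
    {"doc_id": "contract", "category": "main_heading", "text": "Service Agreement"},
    {"doc_id": "contract", "category": "sub_heading", "text": "1. Term"},
    {"doc_id": "contract", "category": "paragraph", "text": "Twelve months."},
]


def split_by_document(entries):
    """Group hierarchy entries by document ID, preserving line order."""
    per_doc = defaultdict(list)
    for entry in entries:
        per_doc[entry["doc_id"]].append(entry)
    return dict(per_doc)


parts = split_by_document(hierarchy)
for doc_id, lines in parts.items():
    print(doc_id, [line["category"] for line in lines])
```

Each group can then be handed to the extractors that apply to that document type.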
Information extraction from the hierarchy configuration file is performed by orchestrating one or more data extractors, which can be dependent or independent, to extract data attributes. The data extraction system includes a processor and memory to execute the necessary instructions for performing these operations.
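One plausible way to orchestrate dependent and independent extractors is to run them in dependency order. The sketch below uses Python's standard `graphlib.TopologicalSorter` for that ordering; the filing does not say how orchestration is implemented, and the extractor names, the `EXTRACTORS` table and `run_extractors` are hypothetical.

```python
from graphlib import TopologicalSorter


def _value_after(text, label):
    """Return the float that follows 'label:' on some line, or None."""
    for line in text.splitlines():
        if line.lower().startswith(label + ":"):
            return float(line.split(":", 1)[1])
    return None


def extract_total(text, results):
    # Independent extractor: needs only the document text.
    return _value_after(text, "total")


def extract_tax_rate(text, results):
    # Independent extractor.
    return _value_after(text, "tax rate")


def extract_tax_amount(text, results):
    # Dependent extractor: reuses the outputs of two other extractors.
    total, rate = results["total"], results["tax_rate"]
    if total is None or rate is None:
        return None
    return round(total * rate, 2)


# Hypothetical registry: attribute name -> (extractor, names it depends on).
EXTRACTORS = {
    "total": (extract_total, set()),
    "tax_rate": (extract_tax_rate, set()),
    "tax_amount": (extract_tax_amount, {"total", "tax_rate"}),
}


def run_extractors(text, extractors=EXTRACTORS):
    """Run every extractor after the extractors it depends on."""
    graph = {name: deps for name, (_, deps) in extractors.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        func, _ = extractors[name]
        results[name] = func(text, results)
    return results


results = run_extractors("Total: 200\nTax rate: 0.1")
print(results)
```

Independent extractors have no predecessors in the graph, so they could also run in parallel; this sketch runs everything sequentially for simplicity.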
The patent also describes additional features, such as creating the HTML document directly from text portions of a text document or converting scanned documents into images and performing data pre-processing operations on the images to validate the information. The number of columns in the HTML document can be determined by analyzing the distance between consecutive words and creating clusters of words. The realignment of the HTML document involves sorting words and determining if they are the starting words of lines within each column.
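The column detection and realignment steps can be approximated with a simple gap-based clustering. This is a simplified sketch, not the patented algorithm: the `Word` record, the thresholds `gap_threshold` and `line_tolerance`, and both function names are assumptions, and real layouts would need more robust handling of skewed or overlapping text.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x0: float  # left edge
    x1: float  # right edge
    y: float   # baseline position of the word's line


def cluster_columns(words, gap_threshold=40.0):
    """Cluster words into columns: a new column starts wherever the horizontal
    gap to everything seen so far exceeds gap_threshold."""
    columns = []
    right_edge = None
    for word in sorted(words, key=lambda w: w.x0):
        if right_edge is None or word.x0 - right_edge > gap_threshold:
            columns.append([])
            right_edge = word.x1
        else:
            right_edge = max(right_edge, word.x1)
        columns[-1].append(word)
    return columns


def reading_order(words, gap_threshold=40.0, line_tolerance=2.0):
    """Realign text column by column; a word starts a new line when its
    vertical position differs from the previous word's."""
    lines = []
    for column in cluster_columns(words, gap_threshold):
        prev_y = None
        for word in sorted(column, key=lambda w: (w.y, w.x0)):
            if prev_y is None or abs(word.y - prev_y) > line_tolerance:
                lines.append([])
            lines[-1].append(word.text)
            prev_y = word.y
    return [" ".join(line) for line in lines]


page = [
    Word("Alpha", 10, 50, 10), Word("Delta", 300, 345, 10),
    Word("beta", 55, 90, 10), Word("epsilon", 350, 410, 10),
    Word("gamma", 10, 60, 25), Word("zeta", 300, 335, 25),
]
print(reading_order(page))
```

Reading a two-column page strictly top to bottom would interleave the columns; clustering first keeps each column's lines together.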
The generation of the hierarchy configuration file involves determining various text features for each line, such as coordinates, indentations, font style, boldness, line height, uppercase and lowercase letters, special characters, and numbers. The lines are then arranged based on heading categories determined by the pretrained ML model.
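A few of the per-line text features listed above can be computed as in the sketch below. The feature names and the function `line_features` are illustrative assumptions; in the filing, the heading category itself is assigned by the pretrained ML model, which this sketch does not attempt to reproduce.

```python
import string


def line_features(text, x0=0.0, is_bold=False, height=0.0):
    """Build an assumed feature dictionary for one line of text."""
    stripped = text.strip()
    letters = [c for c in stripped if c.isalpha()]
    upper_ratio = (
        sum(c.isupper() for c in letters) / len(letters) if letters else 0.0
    )
    return {
        "indent": x0,                  # left coordinate, used as indentation
        "is_bold": is_bold,
        "line_height": height,
        "upper_ratio": upper_ratio,    # share of letters that are uppercase
        "n_digits": sum(c.isdigit() for c in stripped),
        "n_special": sum(c in string.punctuation for c in stripped),
        "starts_with_number": stripped[:1].isdigit(),
    }


features = line_features("1. INTRODUCTION", x0=72.0, is_bold=True, height=14.0)
print(features)
```

A bold, fully uppercase line that starts with a number is the kind of signal a classifier might associate with a main heading.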
The information extraction process includes determining the appropriate extractors based on the document type and document ID, splitting the hierarchy configuration file, extracting text and image attributes, orchestrating between extractors, and aggregating the extracted information into an output document.
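Selecting extractors by document ID and aggregating their results into a single output might look like the following sketch. The registry `EXTRACTORS_BY_DOC_ID`, the regular expressions and the JSON output format are assumptions; the filing only states that the extracted information is aggregated into an output document.

```python
import json
import re


def find_amount(text):
    match = re.search(r"Total due:\s*([\d.]+)", text)
    return float(match.group(1)) if match else None


def find_term(text):
    match = re.search(r"Term:\s*(.+)", text)
    return match.group(1).strip() if match else None


# Hypothetical registry: document ID -> {attribute name: extractor}.
EXTRACTORS_BY_DOC_ID = {
    "invoice": {"amount_due": find_amount},
    "contract": {"term": find_term},
}


def extract_all(documents):
    """Apply the extractors registered for each document ID and aggregate."""
    output = {}
    for doc_id, text in documents.items():
        extractors = EXTRACTORS_BY_DOC_ID.get(doc_id, {})
        output[doc_id] = {name: fn(text) for name, fn in extractors.items()}
    return output


documents = {
    "invoice": "Total due: 120.50 USD",
    "contract": "Term: Twelve months",
}
aggregated = extract_all(documents)
print(json.dumps(aggregated, indent=2))
```

A document ID with no registered extractors yields an empty result rather than an error, which keeps unknown document types from stopping the run.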
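Overall, the filing describes an end-to-end pipeline for merged, multi-format input documents: conversion to HTML, column-aware realignment, page-level classification into document IDs, generation of a hierarchy configuration file, and orchestrated extraction into a structured output document.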