Custom Data Parsing Solution
About the project
Businesses generate vast amounts of data, and analyzing it can answer many questions that help a business grow. For example, how are customers using a product? Which customer segment generates the most revenue? In most cases, however, data is not generated in a structured format, which makes it difficult to analyze. This is where data parsers come to the rescue: data parsing is the process of transforming data into a structured format that is easy to consume or analyze.
Why did we build Optimus?
A client of 91social, a rapidly growing startup in the FinTech sector, faced the problem of extracting data from millions of PDFs and parsing it into a structured format. Extracting and parsing data from PDFs is a feature at the core of their product. They looked for a reliable tool that accurately extracts specific data, then wrangles and parses it into their target format. After trying a couple of existing out-of-the-box solutions, they realized that a few factors limited those solutions:
- Configuring the extraction rules was too technical and complex for their requirements.
- They had to use multiple tools to build different templates for extracting data from the various PDF structures.
- The licensing and implementation cost of multiple tools was high.
These limitations led the client to pursue a custom parsing solution tailored to their needs. 91social architected and developed Optimus, a parser that is tightly integrated into their app.
How did we build Optimus?
To scan, extract, and parse the unstructured data (from PDFs) into a structured format, Optimus uses three modules:
- Transformation Module: This module takes care of wrangling and transforming the extracted data to the desired target format.
- Page Resolution Module: This module gathers the inputs that define what to scan: the number of pages that need to be scanned, and the fields that need to be read from the specified pages.
- Extraction Module: This module takes care of rule building, where the regular expressions are developed according to the requirements and the filters for the rules are specified.
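The division of work between the three modules can be illustrated with a minimal Python sketch. It is not the Optimus implementation: every name in it (`Rule`, `resolve_pages`, `extract`, `transform`, `parse`) is hypothetical. It assumes the PDF text has already been extracted into one string per page, and that the modules run in the order page resolution, extraction, transformation.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Rule:
    """Hypothetical rule: which pages to scan, what to match, what to keep."""
    field: str          # name of the output field
    pattern: str        # regular expression with one capture group
    pages: list[int]    # 0-based indexes of the pages to scan
    keep: Callable[[str], bool] = lambda value: True  # filter on matched values


def resolve_pages(pages: list[str], rule: Rule) -> list[str]:
    """Page resolution: select only the pages the rule asks for."""
    return [pages[i] for i in rule.pages if 0 <= i < len(pages)]


def extract(texts: list[str], rule: Rule) -> Optional[str]:
    """Extraction: return the first match that passes the rule's filter."""
    for text in texts:
        for match in re.finditer(rule.pattern, text):
            value = match.group(1).strip()
            if rule.keep(value):
                return value
    return None


def transform(raw: dict) -> dict:
    """Transformation: convert raw strings to the target format."""
    out = dict(raw)
    if out.get("balance") is not None:
        out["balance"] = float(out["balance"].replace(",", ""))
    return out


def parse(pages: list[str], rules: list[Rule]) -> dict:
    raw = {r.field: extract(resolve_pages(pages, r), r) for r in rules}
    return transform(raw)


# Made-up page text standing in for the output of a PDF text extractor.
pages = [
    "Account Holder: Jane Doe\nStatement Date: 2023-01-31",
    "Closing Balance: see below\nClosing Balance: 12,345.67",
]
rules = [
    Rule("name", r"Account Holder:\s*(.+)", pages=[0]),
    Rule("balance", r"Closing Balance:\s*(\S+)", pages=[1],
         keep=lambda v: v[0].isdigit()),  # skip non-numeric matches
]
result = parse(pages, rules)
print(result)
```

In this sketch the filter on the `balance` rule discards the first match ("see") and keeps the numeric one, which is the kind of job the rule filters mentioned above would do.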
Use Cases for Optimus:
Optimus is designed as a generic solution for extracting and parsing textual data from PDFs into a structured format. The solution could be put to use in multiple industries and functions. However, a tested and reliable use case within the financial services industry is KYC, where financial services providers retrieve particular customer information to better understand customer behavior.
Benefits of Optimus:
- Optimus provides great flexibility to customize the rules for extracting specific data from different types of PDFs.
- Optimus reduces the dependency on regular expressions when creating rules by offering proprietary customizable settings, which makes it faster and easier for the ops team to scale the extraction volume.
- Optimus can retrieve particular values from multiple sources, not just one.
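The last two benefits can be pictured with a short Python sketch. The actual Optimus settings are proprietary and not described here, so the settings format below (a `label` plus a value `type`) is purely an assumption, as are the names `VALUE_TYPES`, `compile_setting`, and `first_value`. The sketch also assumes that "multiple sources" means the text of several documents, tried in order.

```python
import re
from typing import Optional

# Hypothetical value types, so an ops user picks a type instead of writing a regex.
VALUE_TYPES = {
    "text": r"[^\n]+",
    "date": r"\d{4}-\d{2}-\d{2}",
    "amount": r"\d[\d,]*(?:\.\d+)?",
}


def compile_setting(setting: dict) -> re.Pattern:
    """Build a regex from a declarative setting: a label followed by a typed value."""
    label = re.escape(setting["label"])
    value = VALUE_TYPES[setting["type"]]
    return re.compile(rf"{label}\s*[:\-]?\s*({value})", re.IGNORECASE)


def first_value(setting: dict, sources: list[str]) -> Optional[str]:
    """Try each source in order and return the first value found."""
    pattern = compile_setting(setting)
    for text in sources:
        match = pattern.search(text)
        if match:
            return match.group(1).strip()
    return None


# Made-up text from two documents belonging to the same customer.
sources = [
    "Customer Name: Jane Doe\nPAN: ABCDE1234F",
    "Name of Customer - Jane Doe\nDate of Birth: 1990-04-12",
]
name = first_value({"label": "Customer Name", "type": "text"}, sources)
dob = first_value({"label": "Date of Birth", "type": "date"}, sources)
print(name, dob)
```

Here the name is found in the first source and the date of birth only in the second, so a single setting covers both documents without the user writing a regular expression by hand.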
Value-add to the client:
Compared with the other tools, some of the value that Optimus adds to the client's stack is:
- It provides flexibility to customize the rules for their parsing needs. Optimus is now the primary parser, replacing multiple tools available in the market.
- The efficiency of the custom data parsing solution is 99.95%, making it a reliable choice for their data extraction and parsing needs.
- The time the data ops team takes to write rules has reduced considerably. The team size needed to maintain the customer service was reduced by 75%, compared with writing and maintaining rules for two tools.
- It is a scalable solution that currently parses more than 45 million PDFs and serves about 10 million requests a month.