Engineering

Robust L2 Support for Lending Integrations: zero SLA breaches across 12 third-party connectors

May 22, 2024

12
Third-party Connectors
100K
Issues / Month
100×
Business Growth
0
SLA Breaches

The Challenge

A fast-growing lending platform integrated with 12 third-party financial connectors: credit bureaus, bank account verifiers, KYC providers, and disbursement rails. As the business scaled 100× in 18 months, so did the volume and variety of integration failures. The existing L2 support team of 10 was overwhelmed, SLA breaches were rising, and on-call engineers were fielding the same classes of failure repeatedly with no systemic fix.

What We Built

  • Scaled the dedicated L2 ops team from 10 to 30 engineers with clear ownership per connector domain.
  • Authored comprehensive runbooks for every known failure class across all 12 connectors: timeout escalation paths, retry windows, and manual override procedures.
  • Built automation scripts to handle the top 40% of recurring issue types without human intervention: auto-retry on transient failures, alert deduplication, and status-page sync.
  • Implemented a ticket triage pipeline that classified incoming issues by connector, severity, and recurrence pattern before any engineer touched them.
  • Introduced weekly failure-mode reviews so runbooks evolved with the connectors rather than going stale.
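
The auto-retry on transient failures mentioned above can be sketched roughly as follows. This is a minimal illustration, not the team's actual scripts: `TransientConnectorError`, `call_with_retry`, and the default delays are hypothetical names and values chosen for the example.

```python
import random
import time


class TransientConnectorError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, 5xx)."""


def call_with_retry(operation, max_attempts=4, base_delay=0.5,
                    max_delay=8.0, sleep=time.sleep):
    """Run operation(); retry transient failures with jittered backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientConnectorError:
            if attempt == max_attempts:
                # Retry budget exhausted: let the error surface so the
                # ticket can be escalated to an engineer.
                raise
            # Exponential backoff capped at max_delay, with full jitter
            # so many retrying callers do not hit the connector in sync.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))
```

Only errors explicitly marked as transient are retried; anything else propagates immediately, which keeps permanent failures (bad credentials, schema changes) on the human path described in the runbooks.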

The Outcome

The team now handles over 100,000 support issues per month with zero SLA breaches. Automation absorbs the long tail of repetitive failures, freeing engineers to focus on new connector classes and edge cases. As the business continues its 100× growth trajectory, the support infrastructure scales with it rather than against it.

Engineering

One-Stop Customer Information Gateway: unified data access across 8 disconnected enterprise systems

May 22, 2024

8
Systems Unified
1
Single View of Truth
~0ms
Cache Hit Latency

The Challenge

A large enterprise had customer data siloed across 8 separate systems: CRM, ERP, billing, identity, support, product usage, contract management, and communications. Every internal team queried each system independently. The result was inconsistent customer records, duplicated API calls hammering upstream systems, and customer-facing reps unable to see a complete picture of any account in under 10 minutes.

What We Built

  • A single gateway service that acts as the authoritative source of customer truth, aggregating identity, account, billing, support, and usage data into one unified customer record.
  • Connector adapters for all 8 upstream systems with normalised schemas so consumers never need to know which source system holds which field.
  • Layered caching: in-memory L1 for hot records, distributed L2 for warm data, with TTL policies tuned per data domain (identity vs. billing vs. usage).
  • Role-based access controls at the field level: support agents see contact and ticket history, finance teams see billing and contract data, engineers see usage metrics.
  • A change-event pipeline so the gateway invalidates cached records the moment any upstream system writes a change, keeping the single view current within seconds.
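
To make the layered caching and event-driven invalidation concrete, here is a minimal single-process sketch. It assumes both tiers can be modelled as simple TTL maps; in the system described, the L2 tier is a distributed cache. `TTLCache`, `LayeredCache`, and `on_change_event` are hypothetical names used only for illustration.

```python
import time


class TTLCache:
    """Tiny TTL cache; stands in for either tier in this sketch."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._items = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._items[key]  # lazily evict expired entries
            return None
        return value

    def put(self, key, value):
        self._items[key] = (self.clock() + self.ttl, value)

    def invalidate(self, key):
        self._items.pop(key, None)


class LayeredCache:
    """L1 (hot, short TTL) in front of L2 (warm, longer TTL)."""

    def __init__(self, l1, l2, load_from_source):
        self.l1, self.l2, self.load = l1, l2, load_from_source

    def get(self, key):
        value = self.l1.get(key)
        if value is not None:
            return value
        value = self.l2.get(key)
        if value is None:
            value = self.load(key)  # miss in both tiers: call upstream
            self.l2.put(key, value)
        self.l1.put(key, value)     # promote to the hot tier
        return value

    def on_change_event(self, key):
        # An upstream write makes both cached copies stale.
        self.l1.invalidate(key)
        self.l2.invalidate(key)
```

Per-domain TTLs would be expressed by giving identity, billing, and usage records their own `LayeredCache` instances with different `ttl_seconds`. One limitation of the sketch: a record whose value is `None` is never cached.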

The Outcome

Customer-facing teams now have a complete account view in under 3 seconds. Upstream API call volume dropped by 70% due to caching. Data inconsistencies between systems surfaced and were remediated as a side-effect of building the normalisation layer. The gateway became the foundation for every new internal tool the organisation built thereafter.

Client Stories

Creating customer proxies to enable easy access to all banking services

Aug 16, 2023

1000s
Concurrent Sessions
2FA
CAPTCHA Handled
IP
Rotation Cluster

The Challenge

A fintech aggregator needed to access customer banking data from dozens of banks that did not offer open APIs. The only path was to act on behalf of the customer through the bank's own web and mobile interfaces: reliably, at scale, and across banks with wildly different session management, CAPTCHA implementations, and 2FA flows.

What We Built

  • Headless browser proxies that mimic real customer behaviour: navigating login flows, handling OTPs, and clicking through authenticated sessions exactly as a human would.
  • A CAPTCHA resolution layer integrated with third-party solvers, with fallback queuing so sessions never drop mid-flow.
  • A 2FA orchestration module that intercepts OTP delivery, routes the code to the proxy session in real time, and completes authentication within the bank's timeout window.
  • An IP rotation cluster ensuring each proxy session originates from a geographically plausible, clean IP, reducing bank-side bot detection triggers to near zero.
  • Session pooling to run thousands of concurrent customer proxy sessions with stable performance and automatic recovery on session failure.

The Outcome

The client successfully accessed account data across all target banks without any open API dependency. Thousands of concurrent sessions run reliably with automatic failover. The system handles the full spectrum of authentication complexity, from simple password login to OTP + CAPTCHA + security questions, without manual intervention.

Engineering

Building a secure credit card store that could easily integrate and scale: PCI-DSS compliant from day one

Aug 16, 2023

50ms
p99 Card Retrieval
100ms
Card Addition
100 RPS
2 Cores / 2GB RAM
PCI-DSS
Compliant

The Challenge

A payments company needed a dedicated, auditable store for customer card data: one that could be integrated by any internal service without those services ever touching raw card numbers, while meeting PCI-DSS requirements without requiring every consuming service to enter scope.

What We Built

  • A standalone card vault service built on gRPC and Protobuf, exposing a clean API for card storage, retrieval by token, and deletion.
  • AWS KMS-backed encryption: card data encrypted at write time with envelope encryption. No service outside the vault ever sees a raw PAN.
  • Single-use tokenisation: consuming services receive a time-bound token, not the card number. Tokens expire after one use or a configurable TTL, whichever comes first.
  • PCI-DSS scope isolation: the vault is the only component in scope. All other services interact via tokens, keeping the audit surface minimal.
  • Performance designed for efficiency: p99 card retrieval under 50ms, card addition under 100ms, sustaining 100 RPS on just 2 cores and 2GB RAM.
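
The single-use, time-bound token behaviour described above can be sketched as follows. This is an in-memory illustration under stated assumptions, not the vault's implementation: `TokenStore`, `issue`, `redeem`, and the `card_ref` identifier are hypothetical, and a real vault would persist tokens and guard redemption against concurrent use.

```python
import secrets
import time


class TokenStore:
    """Maps opaque single-use tokens to card references, never raw PANs."""

    def __init__(self, ttl_seconds=60.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._tokens = {}  # token -> (expires_at, card_ref)

    def issue(self, card_ref):
        # Cryptographically random token that reveals nothing about the card.
        token = secrets.token_urlsafe(32)
        self._tokens[token] = (self.clock() + self.ttl, card_ref)
        return token

    def redeem(self, token):
        # pop() removes the token whether or not it is still valid,
        # so a second redemption always fails.
        entry = self._tokens.pop(token, None)
        if entry is None:
            raise KeyError("unknown or already used token")
        expires_at, card_ref = entry
        if self.clock() >= expires_at:
            raise KeyError("token expired")
        return card_ref
```

Because the token is consumed on first use or rejected after the TTL, whichever comes first, a leaked token has a very small window of usefulness.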

The Outcome

The vault passed PCI-DSS audit on first submission. Consuming services integrated in days using the gRPC client libraries. Card data exposure risk was eliminated across the platform: no service outside the vault ever processes a raw card number. The system runs cost-efficiently in production with headroom for significant traffic growth.

Client Stories

Building a parser that processed millions of credit card statements quickly

Aug 16, 2023

100%
Parse Accuracy
<200ms
Per Statement
M+/hr
Throughput

The Challenge

A credit analytics company received bank statements as PDFs from customers, each bank with its own layout, font, column arrangement, and formatting quirks. They needed to extract structured transaction data from millions of these PDFs reliably, fast enough to support real-time credit decisions, and accurately enough that errors couldn't slip through to underwriting.

What We Built

  • A PDF-to-HTML spatial parsing engine written in Java. PDFs are first converted to an intermediate HTML representation that preserves the spatial position of every text element on the page.
  • A bank-specific rule layer: each bank template is described by a configurable extraction ruleset (column positions, date formats, transaction delimiters) rather than hard-coded regex.
  • A spatial grouping algorithm that reconstructs tabular rows from text fragments even when the PDF renderer has scattered them across the coordinate space.
  • An accuracy validation step that cross-checks extracted totals against statement summary fields, flagging any statement where the numbers don't reconcile for human review.
  • A horizontally scalable processing cluster capable of handling millions of statements per hour.
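
The spatial grouping and reconciliation steps can be illustrated with a short sketch. It is written in Python for brevity (the engine described above is in Java) and assumes each text fragment carries an x/y position; `Fragment`, `group_rows`, `reconciles`, and the 2-unit tolerance are hypothetical choices for the example, not the production algorithm.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class Fragment:
    x: float   # left edge of the text fragment
    y: float   # baseline position on the page
    text: str


def group_rows(fragments, y_tolerance=2.0):
    """Cluster fragments with nearby baselines into left-to-right rows."""
    rows = []
    for frag in sorted(fragments, key=lambda f: (f.y, f.x)):
        # Compare against the first fragment of the current row so the
        # row cannot drift downwards as more fragments are added.
        if rows and abs(frag.y - rows[-1][0].y) <= y_tolerance:
            rows[-1].append(frag)
        else:
            rows.append([frag])
    return [[f.text for f in sorted(row, key=lambda f: f.x)] for row in rows]


def reconciles(amounts, stated_total):
    """Check extracted amounts against the statement's summary total."""
    return sum(Decimal(a) for a in amounts) == Decimal(stated_total)
```

Amounts are summed as `Decimal` rather than floats so that a statement is flagged only for a genuine mismatch, never for binary rounding noise.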

The Outcome

The parser reached 100% parse accuracy across all supported bank templates. A typical 4–5 page statement is parsed in under 200ms. The system processes millions of statements per hour in production, enabling the client to deliver real-time credit decisions based on fully structured, verified transaction history.

Client Stories

Consolidating bank offers for customers in one place

Aug 16, 2023

50+
Banks Crawled
Static +
Dynamic Sites
Low-code
Config Ruleset

The Challenge

A consumer fintech wanted to aggregate credit card and banking offers from across 50+ banks into a single discovery platform. Each bank published offers differently: some through static web pages, some through JavaScript-rendered portals, some behind login walls. Manual curation was too slow and error-prone for daily freshness.

What We Built

  • A low-code crawler platform where each bank's offer pages are described by a configurable ruleset (selectors, pagination patterns, authentication flows, and data field mappings) rather than bespoke scraping code.
  • A cluster of virtual browsers (headless Chromium) capable of rendering JavaScript-heavy pages and executing the same interaction sequences a human would: scrolling, clicking "load more", dismissing cookie banners.
  • Handling for both static HTML sites (fast, lightweight parsing) and dynamic SPA sites (full browser rendering pipeline) within the same orchestration layer.
  • A change-detection layer that compares today's crawl against yesterday's and surfaces only new or modified offers for review, reducing noise for content editors.
  • A normalisation schema that maps each bank's raw offer data into a consistent structure: bank, card name, offer type, value, eligibility, expiry.
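
The change-detection step can be sketched as a comparison of fingerprints over normalised offers. The field names follow the normalisation schema listed above; the choice of (bank, card name, offer type) as the identity key, and the names `fingerprint` and `diff_crawls`, are assumptions made for illustration.

```python
import hashlib
import json


def fingerprint(offer):
    """Stable hash of an offer's normalised fields."""
    canonical = json.dumps(offer, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def offer_key(offer):
    # Assumed identity of an offer across crawls (hypothetical choice).
    return (offer["bank"], offer["card_name"], offer["offer_type"])


def diff_crawls(yesterday, today):
    """Return (new, modified) offers in today's crawl."""
    previous = {offer_key(o): fingerprint(o) for o in yesterday}
    new, modified = [], []
    for offer in today:
        key = offer_key(offer)
        if key not in previous:
            new.append(offer)
        elif previous[key] != fingerprint(offer):
            modified.append(offer)
    return new, modified
```

Unchanged offers produce identical fingerprints and are dropped, so editors review only what actually moved since the previous crawl.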

The Outcome

Offers from 50+ banks are crawled daily and made available to customers on a single platform. Adding a new bank requires writing a configuration file, not writing new code. The platform supports both static and dynamic sites without architectural changes. Customers now discover offers they would have missed by checking bank sites individually.

Engineering

Episilia: the power of logs (full-stack observability at billion-event scale)

Aug 16, 2023

1%
Index Size vs Log
10 MBps
Per Core Indexing
TB→100GB
LZ4 Compression

The Challenge

Log management at scale breaks most commercial and open-source tools. Lucene-based systems (Elasticsearch, OpenSearch) build indexes that consume 10–50% of the original log volume in storage. At terabyte scale, that becomes hundreds of gigabytes of index overhead. Query latency degrades as index size grows. The cost of running these systems at billion-event-per-day volumes is prohibitive for most organisations.

What We Built

  • Episilia: a purpose-built log management system written in C++, designed from first principles for high-volume ingestion and query at minimal cost.
  • S3 as the primary storage layer: log data is written directly to object storage, eliminating the need for expensive attached volumes at scale.
  • Kafka as the ingestion pipeline, decoupling producers from the indexing layer and providing durable buffering for traffic spikes.
  • LZ4 compression applied to raw log streams before storage, reducing a terabyte of log data to approximately 100GB, a 10× reduction versus uncompressed storage.
  • A custom index structure that maintains index size at just 1% of the original log volume, compared to 10–50% for Lucene-based alternatives.
  • Indexing throughput of 10 MBps per CPU core, making the system horizontally scalable without the per-node overhead of JVM-based alternatives.

The Outcome

Episilia delivers full observability at a fraction of the infrastructure cost of Elasticsearch or Splunk. A terabyte of logs becomes 100GB stored, with a 10GB index. Query performance remains consistent as data volume grows because the index never balloons. Organisations running billions of log events per day can do so on commodity hardware rather than enterprise storage clusters.
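
The storage arithmetic behind these figures can be checked with a few lines. The sketch uses only the ratios quoted in this post (10× compression, a 1% index, and 10–50% for Lucene-based indexes) and assumes 1 TB is counted as 1,000 GB; `storage_footprint_gb` is a hypothetical helper name.

```python
def storage_footprint_gb(raw_gb, compression_ratio, index_percent):
    """Return (stored_gb, index_gb) for a given raw log volume."""
    stored_gb = raw_gb / compression_ratio
    index_gb = raw_gb * index_percent / 100
    return stored_gb, index_gb


raw_gb = 1000  # 1 TB of raw logs, taken as 1,000 GB

# Figures quoted for Episilia: 10x compression, index at 1% of raw volume.
stored_gb, index_gb = storage_footprint_gb(raw_gb, 10, 1)

# Index overhead range quoted for Lucene-based systems: 10% to 50%.
lucene_index_low = raw_gb * 10 / 100
lucene_index_high = raw_gb * 50 / 100
```

With these inputs the sketch gives 100 GB stored and a 10 GB index, against 100 to 500 GB of index alone at the Lucene-based ratios.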

Engineering

Building a high throughput Snowflake data warehouse that scales without query timeouts

Aug 16, 2023

4×
Cost Reduction
8–10×
Throughput Gain
$1K/mo
Down from $4K

The Challenge

A data-heavy SaaS company had built its analytics stack on Snowflake but was hitting query timeouts on its largest tables and paying $4,000/month in compute credits. The ingestion pipeline was batching data inefficiently, table clustering was absent, and analytical queries were scanning every micro-partition on every run. Costs and latency were both trending upward.

What We Built

  • Redesigned the batch ingestion pipeline to align write patterns with Snowflake's micro-partition boundaries, reducing partition sprawl and improving scan efficiency on subsequent reads.
  • Defined clustering keys on the highest-cardinality query dimensions (date, account_id, event_type), allowing Snowflake's automatic clustering to keep hot partitions sorted and prunable.
  • Rewrote the heaviest analytical queries to push filter predicates earlier and avoid correlated subqueries, reducing the rows touched before aggregation by an average of 85%.
  • Right-sized the virtual warehouse fleet: identified which query patterns needed large warehouses vs. which could run on XS or S with proper query structure.
  • Introduced result caching discipline: standardising query patterns so Snowflake's result cache could be hit reliably rather than bypassed by minor query variations.
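
One way to picture the result caching discipline is a small query builder that always renders the same SQL text for the same logical request. Snowflake reuses a cached result only when the new query text matches a previously executed query, so cosmetic differences defeat the cache. The sketch below is an assumption about how such standardisation could look, not the client's tooling: `canonical_query` is a hypothetical helper, the `%s` placeholder style is assumed, and table and column names must come from trusted code because they are interpolated directly.

```python
def canonical_query(table, columns, filters):
    """Render identical SQL text for logically identical requests.

    Columns and equality filters are sorted, so the order in which a
    caller lists them never changes the generated statement.
    """
    select_list = ", ".join(sorted(columns))
    ordered = sorted(filters.items())
    sql = f"SELECT {select_list} FROM {table}"
    if ordered:
        where = " AND ".join(f"{column} = %s" for column, _ in ordered)
        sql += f" WHERE {where}"
    # Values are returned separately, in the same order as the placeholders.
    params = tuple(value for _, value in ordered)
    return sql, params
```

Two analysts asking for the same columns and filters in a different order get byte-identical SQL from this helper, which is the property the result cache depends on.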

The Outcome

Monthly Snowflake spend dropped from $4,000 to $1,000, a 4× cost reduction, while query throughput improved 8–10× and timeouts were eliminated entirely. The client's analysts moved from scheduling long-running queries overnight to running them interactively during the day. The same architecture has since scaled to 3× the original data volume without regression.

Engineering

Re-engineer subledger PnL engine from slow batch processing to real-time financial reporting

Aug 16, 2023

0.5M
RPS Target
100M
Records / 10 min
Multi-tenant
Stateless Cluster

The Challenge

A large financial institution ran its subledger PnL calculations as nightly batch jobs. With growing transaction volumes, the batch window was expanding to the point where results were not available until mid-morning, too late for risk desks that needed end-of-day positions at market open. The existing system was single-tenant, stateful, and could not be horizontally scaled without a fundamental redesign.

What We Built

  • Completely re-architected the PnL engine in Java from a stateful batch processor to a stateless, horizontally scalable cluster: any node can process any tenant's data without local state.
  • Multi-tenant isolation at the data and computation layer: each tenant's ledger entries are partitioned and processed independently with no cross-tenant data leakage.
  • A bulk copy persistence layer replacing row-by-row database writes: ledger records are accumulated in memory and flushed to the database in bulk batches, achieving 100 million records persisted in under 10 minutes.
  • A parallel computation pipeline that distributes PnL calculations across the cluster, targeting 500,000 requests per second aggregate throughput.
  • An event-driven trigger model replacing the scheduled batch: PnL calculations begin as transactions are confirmed rather than waiting for a nightly window.
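
The accumulate-and-flush pattern behind the bulk persistence layer can be sketched as below. The engine described above is written in Java and uses the database's bulk copy path; this Python version uses SQLite's `executemany` purely to show the buffering logic. `BulkWriter`, the `ledger` table, and its columns are hypothetical names for the example.

```python
import sqlite3


class BulkWriter:
    """Buffer ledger records in memory and write them in large batches."""

    INSERT_SQL = (
        "INSERT INTO ledger (tenant_id, account, amount_minor) "
        "VALUES (?, ?, ?)"
    )

    def __init__(self, conn, batch_size=10_000):
        self.conn = conn
        self.batch_size = batch_size
        self._buffer = []

    def add(self, record):
        self._buffer.append(record)
        if len(self._buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self._buffer:
            return
        # One transaction per batch instead of one per row; the
        # connection context manager commits, or rolls back on error.
        with self.conn:
            self.conn.executemany(self.INSERT_SQL, self._buffer)
        self._buffer.clear()
```

A caller adds records as they are computed and calls `flush()` once at the end, so the final partial batch is not left in memory.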

The Outcome

The institution moved from nightly batch PnL results delivered mid-morning to real-time PnL available within minutes of transaction confirmation. 100 million records are processed and persisted in under 10 minutes. The stateless cluster scales horizontally to meet peak demand. Risk desks now have accurate end-of-day positions available at market open rather than hours later.

91social Blog

Insights. Stories. Engineering.

Deep-dives into client engagements, engineering challenges solved, and lessons from building production software at scale.

Robust L2 Support for Lending Integrations
Engineering May 22, 2024

Robust L2 Support for Lending Integrations: zero SLA breaches across 12 third-party connectors

91Social 1 min read
One-Stop Customer Information Gateway
Engineering May 22, 2024

One-Stop Customer Information Gateway: unified data access across 8 disconnected enterprise systems

91Social 1 min read
Creating customer proxies
Client Stories Aug 16, 2023

Creating customer proxies to enable easy access to all banking services

91Social 1 min read
Building a secure credit card store
Engineering Aug 16, 2023

Building a secure credit card store that could easily integrate and scale: PCI-DSS compliant from day one

91Social 1 min read
Building a parser for credit card statements
Client Stories Aug 16, 2023

Building a parser that processed millions of credit card statements quickly

91Social 1 min read
Consolidating bank offers
Client Stories Aug 16, 2023

Consolidating bank offers for customers in one place

91Social 1 min read
Episilia: the power of logs
Engineering Aug 16, 2023

Episilia: the power of logs (full-stack observability at billion-event scale)

91Social 3 min read
Building a high throughput Snowflake data warehouse
Engineering Aug 16, 2023

Building a high throughput Snowflake data warehouse that scales without query timeouts

91Social 2 min read
Re-engineer subledger PnL engine
Engineering Aug 16, 2023

Re-engineer subledger PnL engine from slow batch processing to real-time financial reporting

91Social 3 min read