Turn the PDF pile into
queryable rows by Friday.

Invoices, contracts, statements, forms, and scanned documents, pulled into clean structured data with confidence scores. We combine OCR, layout-aware AI, and human review for the edge cases, so you can trust every field that lands in your database.

Scope
OCR · LLM · review
Timeline
1–3 weeks
Starts at
CAD $4k
Accuracy
~99% post-review
IN

Ingest

Email · folder · API · scanner, any document, any format
formats PDF/IMG/DOC
OCR

OCR + layout

Detects tables, line items, signatures, stamps, handwriting
languages 100+
AI

LLM extraction

Schema-aware extraction with confidence per field
fields your schema
GATE

Confidence gate

Below threshold: human reviews. Above: straight to DB
auto-rate ~85%
OUT

Structured out

JSON · CSV · DB · API webhook, wired into your stack
webhook real-time
// Section 02 · What we ship

What we extract, cleaned.

6 doc patterns · ~99% post-review accuracy
D.01 AP

Invoices & receipts

Vendor, line items, totals, tax. Straight into your accounting system. Duplicate detection on the way in.

Line-item parse · Tax + currency · Dup detection
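One way duplicate detection on ingest could work, as a minimal sketch: build a normalized key per invoice and check it against keys already seen. The key fields (`vendor`, `invoice_number`, `total`) and the function names are assumptions for illustration, not a description of the production pipeline.

```python
import hashlib
from decimal import Decimal


def invoice_key(vendor, invoice_number, total):
    """Build a stable fingerprint from normalized invoice fields (hypothetical choice of fields)."""
    parts = [
        " ".join(vendor.lower().split()),                   # collapse case and whitespace
        invoice_number.strip().upper(),                     # normalize the invoice number
        str(Decimal(str(total)).quantize(Decimal("0.01"))),  # force two decimal places
    ]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


def is_duplicate(invoice, seen):
    """Return True if this invoice's key was already seen; otherwise remember it."""
    key = invoice_key(invoice["vendor"], invoice["invoice_number"], invoice["total"])
    if key in seen:
        return True
    seen.add(key)
    return False
```

With this sketch, "Acme  Corp / inv-001 / 120" and "acme corp / INV-001 / 120.00" produce the same key, so the second one is flagged. Real pipelines often add fuzzy matching on top; exact keys are only the simplest case.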
D.02 LEG

Contracts & agreements

Parties, dates, renewal terms, payment terms, jurisdictions. Pulled into a contract register you can actually search.

Renewal alerts · Clause flags · Searchable register
D.03 FIN

Bank & financial statements

Transactions, balances, fees. Categorized and reconciled against your books. Multi-currency, multi-account.

Auto-categorize · Reconcile rules · Multi-currency
D.04 FRM

Forms & applications

Onboarding forms, applications, intake docs. Into a validated, structured record with the original file attached.

Field validation · Attach original · Webhook on save
D.05 TAB

Tables in PDFs

Tables that run across several pages, including ones with merged cells, footnotes, and rotated headers. Into clean CSV.

Cross-page join · Cell de-merge · Schema enforce
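A simplified sketch of a cross-page join, assuming each page's table has already been parsed into rows of strings: keep the first header, drop any repeated header on later pages, and write one CSV. The function names are hypothetical, and merged cells and rotated headers are out of scope here.

```python
import csv
import io


def join_pages(pages):
    """Merge per-page tables (lists of rows) into one table with a single header row."""
    pages = [p for p in pages if p]          # ignore pages with no rows
    if not pages:
        return []
    header = pages[0][0]
    rows = [header]
    for table in pages:
        # Continuation pages may or may not repeat the header; skip it if present.
        body = table[1:] if table[0] == header else table
        rows.extend(body)
    return rows


def to_csv(rows):
    """Serialize rows to CSV text using the standard library writer."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()
```

The header comparison is exact, which is the main limitation of the sketch: a header that OCR reads slightly differently on page two would be kept as a data row.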
D.06 SCN

Scanned & handwritten

Phone-camera photos, scanned faxes, mixed handwriting and print. With confidence scores you can trust.

De-skew + clean · Handwriting OCR · Confidence per field
// Section 03 · Stack we use

The extraction engines we use.

Picked per document · always with a fallback
GPT-4o
reasoning
Claude 3.5
long-context
Gemini
multimodal
Tesseract
open ocr
AWS Textract
cloud ocr
Azure Doc
forms
Unstructured
parser
LayoutLM
layout
// Section 04 · How we engage

From folder to database in 10 days.

Pilot first · production second
Day 1

Sample

You send 50 representative documents. We extract them blind and ship an accuracy report by document type.

Accuracy + cost report
Day 2–3

Schema

We write the JSON schema for the extracted data: every field, every type, every validation rule.

Schema + validation
Day 4–7

Pipeline

Build the ingest, the extraction, the confidence gate, and the human review UI for low-confidence records.

End-to-end pipeline
Day 8–9

Pilot batch

Run the next 500 documents. We tune prompts, fix edge cases, raise the confidence threshold.

Tuned · benchmarked
Day 10+

Production

Live ingest from email, drive, or API. Slack alert on anything below threshold. You get a dashboard.

Live · monitored
// Section 05 · How we work

How we keep the data clean.

Confidence-gated · human-in-loop · audit log
01 · CONFIDENCE

Every field is scored.

Each extracted field has a 0–1 confidence. Below your threshold: reviewer queue. Above: straight through.
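The gate described above can be sketched in a few lines. This is an illustration only: the `Field` type, the `route` function, the 0.9 default, and the choice to send a whole record to review when any single field falls short are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Field:
    name: str
    value: str
    confidence: float  # score between 0 and 1


def route(fields, threshold=0.9):
    """Send a record to review if any field scores below the threshold (assumed policy)."""
    low = [f.name for f in fields if f.confidence < threshold]
    if low:
        return ("review", low)      # reviewer queue, with the fields that need a look
    return ("database", [])         # straight through


record = [Field("vendor", "Acme", 0.98), Field("total", "120.00", 0.72)]
destination, flagged = route(record)
```

Here `destination` is `"review"` and `flagged` is `["total"]`, so a reviewer only needs to check the one uncertain field. A field scoring exactly at the threshold passes in this sketch.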

02 · REVIEW UI

Humans only for edges.

Reviewers see the document and the extraction side by side and fix fields in place. Their corrections train the next batch.

03 · VALIDATION

Schema-enforced.

Date is a date. Amount is a number with currency. We never let a malformed record into your system.
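As a rough sketch of what "date is a date, amount is a number with currency" could mean in code, using only the standard library. The field names (`invoice_date`, `amount`, `currency`) and the three-letter currency rule are assumptions; a real engagement would take them from the agreed schema.

```python
from datetime import date
from decimal import Decimal, InvalidOperation


def validate_record(record):
    """Return a list of problems; an empty list means the record may pass."""
    errors = []

    try:
        date.fromisoformat(record.get("invoice_date", ""))
    except (TypeError, ValueError):
        errors.append("invoice_date is not a valid ISO date")

    try:
        amount = Decimal(str(record.get("amount")))
        if not amount.is_finite():           # reject NaN and Infinity
            raise InvalidOperation
    except InvalidOperation:
        errors.append("amount is not a number")

    currency = record.get("currency")
    if not (isinstance(currency, str) and len(currency) == 3
            and currency.isalpha() and currency.isupper()):
        errors.append("currency is not a three-letter code")

    return errors
```

A record is written only when the returned list is empty; anything else would go back to review rather than into the target system.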

04 · AUDIT

Original always linked.

Every extracted record links to the page and bounding box it came from. One click to verify.
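One possible shape for that link, sketched as plain data classes. The type names, the 1-based page number, and the `(x0, y0, x1, y1)` box layout are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceRef:
    document_id: str
    page: int      # 1-based page number (assumed convention)
    bbox: tuple    # (x0, y0, x1, y1) in page coordinates (assumed layout)


@dataclass(frozen=True)
class ExtractedField:
    name: str
    value: str
    source: SourceRef


def highlights_for_page(fields, document_id, page):
    """List the (field name, bounding box) pairs a viewer would highlight on one page."""
    return [
        (f.name, f.source.bbox)
        for f in fields
        if f.source.document_id == document_id and f.source.page == page
    ]
```

Because every field carries its own `SourceRef`, a review or audit screen can jump from a value to the exact region of the original page without re-running extraction.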

// Section 06 · Common questions

What people ask before signing.

Will it work on our messy scans?
Most likely yes. We benchmark on 50 of your real documents before quoting, and tell you honestly if the pile needs pre-processing first.
How accurate is "accurate"?
After the pilot, we typically hit 95–98% field accuracy without review, and ~99% with human-in-loop on the bottom 15%. We commit to numbers in writing.
Where does our data go?
Your cloud, your storage, your choice. We support OpenAI, Anthropic, Google, AWS Bedrock, or fully on-prem with open models. Nothing trains on your data.
What if document layouts change?
Schema-aware extraction handles most layout drift. For seasonal forms (e.g. tax season), we baseline once a year and re-tune in a day.

Sitting on a pile of
PDFs you can't query?

Send us 5–10 sample documents. We'll extract them blind, share an accuracy report, and quote the pipeline. Usually 1–3 weeks to production.