Extract clean data from tricky documents.

Extract clean markdown or structured data from PDFs, Word docs, webpages, and more. Powered by vision-language models and fully open source.

Get clean data, fast.

Deploy an AI-native ETL pipeline in minutes, powered by state-of-the-art vision-language models.

Scrape complex documents: Extract markdown, data tables, figures, and equations from complex documents and webpages. Our models are trained on a diverse range of tricky layouts and data sources.
Extract structured data: Deploy an AI document pipeline in 5 minutes. Get clean markdown and structured data using top vision-language models. Output to Markdown, JSON, CSV, or SQL.
Double-check every extraction: Use agentic "double checking" with a human-in-the-loop (HITL) workflow for mission-critical data extraction and auditing.
Secure by design: Easily set up your own on-prem, air-gapped deployment with local AI models.
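As a rough illustration of the double-checking idea, the sketch below compares two independent extraction passes over the same document and routes any disagreement to a human reviewer. This is an assumption about how such a workflow could look, not a description of thepipe's actual mechanism; the function names, field names, and sample values are all hypothetical.

```python
from typing import Any


def find_disagreements(first_pass: dict[str, Any], second_pass: dict[str, Any]) -> dict[str, tuple]:
    """Return every field whose value differs between two extraction passes."""
    fields = sorted(set(first_pass) | set(second_pass))
    return {
        field: (first_pass.get(field), second_pass.get(field))
        for field in fields
        if first_pass.get(field) != second_pass.get(field)
    }


def route_extraction(first_pass: dict[str, Any], second_pass: dict[str, Any]) -> dict[str, Any]:
    """Accept matching extractions; flag mismatches for human review (HITL)."""
    disagreements = find_disagreements(first_pass, second_pass)
    if disagreements:
        return {"status": "needs_review", "fields": disagreements}
    return {"status": "accepted", "data": first_pass}


# Made-up example: the two passes disagree on the invoice total.
pass_a = {"invoice_id": "INV-001", "total": "120.50"}
pass_b = {"invoice_id": "INV-001", "total": "126.50"}
result = route_extraction(pass_a, pass_b)
```

Only the disputed fields reach the reviewer, which keeps the manual step small for documents where most fields agree.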

Developer-first API

Instantly get scraping results via API. Use our GPU-accelerated cloud or set up thepipe on your own local hardware.

Get markdown for any LLM

Use our /scrape endpoint to get clean markdown and table data from tricky documents. Use pre-built functions to convert the output to OpenAI format, ready for any language model such as GPT-4o or Claude-3.5-Sonnet.
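To make "OpenAI format" concrete, here is a minimal sketch that turns scraped chunks into chat-style messages with text and image parts. It does not use thepipe's own pre-built functions; the chunk shape (a dict with "text" and "images" keys) and the helper name are assumptions made for illustration, and the sample content is made up.

```python
from typing import Any


def chunks_to_openai_messages(chunks: list[dict[str, Any]], role: str = "user") -> list[dict[str, Any]]:
    """Convert scraped chunks into OpenAI chat-style messages.

    Assumed (hypothetical) chunk shape:
        {"text": "<markdown>", "images": ["<image URL or data URL>", ...]}
    """
    messages = []
    for chunk in chunks:
        content = []
        text = (chunk.get("text") or "").strip()
        if text:
            content.append({"type": "text", "text": text})
        for url in chunk.get("images", []):
            content.append({"type": "image_url", "image_url": {"url": url}})
        if content:  # skip chunks with no text and no images
            messages.append({"role": role, "content": content})
    return messages


# Made-up chunks standing in for scraper output.
example_chunks = [
    {"text": "# Q3 Report\n\n| Region | Revenue |\n|---|---|\n| EMEA | 1.2M |", "images": []},
    {"text": "Figure 1: revenue by region", "images": ["https://example.com/fig1.png"]},
    {"text": "   ", "images": []},
]
messages = chunks_to_openai_messages(example_chunks)
```

The resulting list can be passed as the messages argument of a chat completion request to a model that accepts image inputs.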

Chunking for vector databases

Select per-doc, per-page, or semantic chunking to integrate with vector databases such as ChromaDB or RAG frameworks such as LlamaIndex.
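The three chunking modes can be sketched in plain Python. This is an illustration under stated assumptions rather than thepipe's implementation: the input is assumed to be a list of per-page markdown strings, the function names are hypothetical, and splitting at markdown headings is used as a simple stand-in for semantic chunking.

```python
import re


def chunk_by_document(pages: list[str]) -> list[str]:
    """Per-doc: the whole document becomes a single chunk."""
    return ["\n\n".join(pages)]


def chunk_by_page(pages: list[str]) -> list[str]:
    """Per-page: one chunk per non-empty page."""
    return [page for page in pages if page.strip()]


def chunk_by_section(pages: list[str]) -> list[str]:
    """Stand-in for semantic chunking: start a new chunk at every markdown heading."""
    text = "\n\n".join(pages)
    # Zero-width split before any line that begins with 1-6 '#' and a space.
    parts = re.split(r"^(?=#{1,6} )", text, flags=re.MULTILINE)
    return [part.strip() for part in parts if part.strip()]


# Made-up two-page document.
pages = [
    "# Intro\nThis report covers two topics.",
    "## Methods\nWe scanned invoices.\n\n## Results\nAll fields matched.",
]
sections = chunk_by_section(pages)
```

Each returned string can then be embedded and stored as one record in a vector database.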

Get unstructured or structured data

Extract data from documents accurately with SOTA vision-language models. Get results ready in Markdown, JSON, CSV, or SQL.
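Once a document has been reduced to structured records, exporting to JSON, CSV, or SQL needs only the Python standard library. The sketch below assumes the extraction step already returned a list of flat dicts; the invoice fields and values are made up, and the table schema is a hypothetical example.

```python
import csv
import io
import json
import sqlite3

# Made-up records standing in for extracted data.
records = [
    {"invoice_id": "INV-001", "vendor": "Acme", "total": 120.50},
    {"invoice_id": "INV-002", "vendor": "Globex", "total": 89.99},
]


def to_json(rows: list[dict]) -> str:
    """Serialize records as a JSON array."""
    return json.dumps(rows, indent=2)


def to_csv(rows: list[dict]) -> str:
    """Serialize records as CSV, using the first record's keys as the header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_sqlite(rows: list[dict]) -> sqlite3.Connection:
    """Load records into an in-memory SQLite table using parameterized inserts."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE invoices (invoice_id TEXT PRIMARY KEY, vendor TEXT, total REAL)")
    conn.executemany("INSERT INTO invoices VALUES (:invoice_id, :vendor, :total)", rows)
    conn.commit()
    return conn


conn = to_sqlite(records)
```

Named placeholders keep the insert safe even when extracted values contain quotes or other SQL-significant characters.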

API Pricing

We offer a hosted API and cloud platform to scrape and extract data. One token is roughly one word.

Self-Hosted

Free

Deploy on your own infrastructure with full control and privacy.

✔️ Bring your own key

✔️ Open source

✔️ Community support

Hobby

$25/month

Instantly extract clean data from documents and webpages with our hosted platform.

✔️ 1M tokens/month

✔️ Platform access

✔️ Files up to 20 MB

✔️ Official support

Scale

$210/month

Instantly extract clean data from documents and webpages with our hosted platform.

✔️ 10M tokens/month

✔️ Platform access

✔️ Files up to 50 MB

✔️ Official support

Business

Book a chat to discuss custom projects, integrations, on-prem support, and pricing.

✔️ Custom limits

✔️ Custom features

✔️ Unlimited upload

✔️ On-premise support
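Using the rule of thumb that one token is roughly one word, the plan limits above can be turned into a quick estimate. The sketch below counts whitespace-separated words as an approximation and picks the smallest hosted plan that covers a monthly volume. The helper names are hypothetical, and actual billing may count tokens differently.

```python
# (name, price in USD per month, included tokens per month), from the plans above.
PLANS = [
    ("Hobby", 25, 1_000_000),
    ("Scale", 210, 10_000_000),
]


def estimate_tokens(text: str) -> int:
    """Approximate token count as the number of whitespace-separated words."""
    return len(text.split())


def smallest_plan(monthly_tokens: int) -> str:
    """Return the first hosted plan whose monthly limit covers the volume."""
    for name, _price, limit in PLANS:
        if monthly_tokens <= limit:
            return name
    return "Business"  # custom limits


def price_per_million(price: float, limit: int) -> float:
    """Effective USD price per one million included tokens."""
    return price / (limit / 1_000_000)


hobby_rate = price_per_million(25, 1_000_000)
scale_rate = price_per_million(210, 10_000_000)
```

On these numbers, Hobby works out to $25 per million included tokens and Scale to $21 per million.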

Machine learning has a large environmental impact, so we contribute 2% of our revenue to CO₂ removal. Ask how our offset program can help you meet your sustainability goals.

Cancel anytime. If you are not satisfied within the first 30 days, contact emmett@thepi.pe for a full refund.

Frequently Asked Questions

What is thepi.pe?

What files can thepi.pe work with?

What websites can thepi.pe work with?

Does thepi.pe work with RAG frameworks?

Is thepi.pe free to use?

Recent Articles

Extracting Structured Data From Tricky PDFs With Gemini Flash

Automate Posts To Medium Based on Live Web Data

Get Clean Data from Any Document: Using AI to “Learn” PDF Formats On-the-Fly