Python Quickstart
Thepipe can be installed via the command line:
```bash
pip install thepipe-api
```

If you need full functionality with media-rich sources such as webpages, video, and audio, you can choose to install the following dependencies:

```bash
apt-get update && apt-get install -y git ffmpeg
python -m playwright install --with-deps chromium
```

Default setup (OpenAI)
By default, thepipe uses the OpenAI API, so VLM features will work out-of-the-box provided you pass in an OpenAI client.
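For example:

```python
from openai import OpenAI

# the OpenAI client reads the OPENAI_API_KEY environment variable by default
client = OpenAI()
```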
Custom VLM server setup (OpenRouter, OpenLLM, etc.)
If you wish to use a local vision-language model or a different cloud provider, you can provide a custom OpenAI client, for example by setting the base URL to https://openrouter.ai/api/v1 for OpenRouter, or http://localhost:3000/v1 for a local server such as OpenLLM. Note that you must also pass your non-OpenAI provider's API key into the OpenAI client. The model name can be changed with the model parameter; by default, the model is gpt-4o.
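For example, a sketch using OpenRouter (the API key below is a placeholder, and the exact model identifier depends on your provider):

```python
from openai import OpenAI
from thepipe.scraper import scrape_file

# point the OpenAI client at OpenRouter instead of the default endpoint
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_API_KEY",
)

chunks = scrape_file(
    filepath="paper.pdf",
    openai_client=client,
    model="openai/gpt-4o",  # provider-specific model name
)
```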
Scraping
```python
from thepipe.scraper import scrape_file

# scrape text and page images from a PDF
chunks = scrape_file(filepath="paper.pdf")
```

For enhanced scraping with a vision-language model, you can pass in an OpenAI-compatible client and a model name.
```python
from openai import OpenAI
from thepipe.scraper import scrape_file

# create an OpenAI-compatible client
client = OpenAI()

# scrape clean markdown and page images from a PDF
chunks = scrape_file(
    filepath="paper.pdf",
    openai_client=client,
    model="gpt-4o"
)
```

Chunking
To satisfy token-limit constraints, the following chunking methods are available to split the content into smaller chunks:

- chunk_by_document: Returns one chunk with the entire content of the file.
- chunk_by_page: Returns one chunk for each page (for example: each webpage, PDF page, or PowerPoint slide).
- chunk_by_length: Splits chunks by length.
- chunk_by_section: Splits chunks by markdown section.
- chunk_by_keyword: Splits chunks at keywords.
- chunk_semantic (experimental, requires sentence-transformers): Returns chunks split by spikes in semantic change, with a configurable threshold.
- chunk_agentic (experimental, requires OpenAI): Returns chunks split by an LLM agent that attempts to find semantically meaningful sections.
For example,
```python
from thepipe.scraper import scrape_file
from thepipe.chunker import chunk_by_document, chunk_by_page

# optionally, pass in chunking_method
# chunk_by_document returns one chunk for the entire document
chunks = scrape_file(
    filepath="paper.pdf",
    chunking_method=chunk_by_document
)

# you can also re-chunk later:
# chunk_by_page returns one chunk for each page
# (for example: each webpage, PDF page, or PowerPoint slide)
chunks = chunk_by_page(chunks)
```

OpenAI Chat Integration 🤖
```python
from openai import OpenAI
from thepipe.core import chunks_to_messages

# Initialize OpenAI client
client = OpenAI()

# Use OpenAI-formatted chat messages
messages = [{
    "role": "user",
    "content": [{
        "type": "text",
        "text": "What is the paper about?"
    }]
}]

# Simply add the scraped chunks to the messages
messages += chunks_to_messages(chunks)

# Call LLM
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
)
```

chunks_to_messages takes an optional text_only parameter to output only text from the source document. This is useful for downstream use with LLMs that lack multimodal capabilities.
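For example:

```python
# text-only messages, for models without vision capabilities
messages = chunks_to_messages(chunks, text_only=True)
```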
⚠️ Be mindful of your model's token limit: make sure your prompt fits within it, and use chunking to split your messages into smaller chunks if needed.
LlamaIndex Integration 🦙
A chunk can be converted to a LlamaIndex Document/ImageDocument with .to_llamaindex.
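For example, a minimal sketch (whether .to_llamaindex returns a single document or a list may vary, so check your installed version):

```python
# convert each scraped chunk into LlamaIndex document(s)
llama_docs = [chunk.to_llamaindex() for chunk in chunks]
```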
Structured extraction 🗂️
Note that structured extraction is being deprecated and will be removed in future releases. The current implementation is a simple wrapper around OpenAI's chat API, which is not ideal for structured data extraction. We recommend OpenAI's structured outputs for structured data extraction, or Trellis AI for automated workflows with structured data.
```python
from thepipe.extract import extract
from openai import OpenAI

client = OpenAI()

schema = {
    "description": "string",
    "amount_usd": "float"
}

results, tokens_used = extract(
    chunks=chunks,
    schema=schema,
    multiple_extractions=True,  # extract multiple rows of data per chunk
    openai_client=client
)
```
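The exact shape of each result depends on your schema and the multiple_extractions flag; a minimal sketch of consuming the return values:

```python
# inspect the extracted rows and the token count reported by extract
for result in results:
    print(result)
print(f"tokens used: {tokens_used}")
```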
How it works 🛠️

thepipe uses a combination of computer-vision models and heuristics to scrape clean content from the source and process it for downstream use with large language models or vision-language models. You can feed these messages directly into the model, or you can chunk them for downstream storage in a vector database such as ChromaDB, in LlamaIndex, or in an equivalent RAG framework.
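For instance, a minimal sketch of storing scraped chunk text in ChromaDB (assuming chunks_to_messages yields OpenAI-style messages whose content is either a string or a list of text parts; verify against your installed version):

```python
import chromadb
from thepipe.core import chunks_to_messages

# store the text of each scraped chunk in an in-memory ChromaDB collection
chroma_client = chromadb.Client()
collection = chroma_client.create_collection(name="paper_chunks")

for i, message in enumerate(chunks_to_messages(chunks, text_only=True)):
    content = message["content"]
    # handle both string content and lists of text parts
    text = content if isinstance(content, str) else " ".join(
        part.get("text", "") for part in content
    )
    collection.add(documents=[text], ids=[f"chunk-{i}"])
```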
Supported File Types 📚
| Source | Input types | Multimodal | Notes |
|---|---|---|---|
| Webpage | URLs starting with http, https, ftp | ✔️ | Scrapes markdown, images, and tables from web pages. AI extraction available by passing an OpenAI client for screenshot analysis |
| PDF | .pdf | ✔️ | Extracts page markdown and page images. AI extraction available when an OpenAI client is supplied for complex or scanned documents |
| Word Document | .docx | ✔️ | Extracts text, tables, and images |
| PowerPoint | .pptx | ✔️ | Extracts text and images from slides |
| Video | .mp4, .mov, .wmv | ✔️ | Uses Whisper for transcription and extracts frames |
| Audio | .mp3, .wav | ✔️ | Uses Whisper for transcription |
| Jupyter Notebook | .ipynb | ✔️ | Extracts markdown, code, outputs, and images |
| Spreadsheet | .csv, .xls, .xlsx | ❌ | Converts each row to JSON format, including row index for each |
| Plaintext | .txt, .md, .rtf, etc. | ❌ | Simple text extraction |
| Image | .jpg, .jpeg, .png | ✔️ | Uses VLM for OCR in text-only mode |
| ZIP File | .zip | ✔️ | Extracts and processes contained files |
| Directory | any path/to/folder | ✔️ | Recursively processes all files in directory. Optionally use inclusion_pattern to pass regex strings for file inclusion rules. |
| YouTube Video | YouTube video URLs starting with https://youtube.com or https://www.youtube.com | ✔️ | Uses pytube for video download and Whisper for transcription. For consistent extraction, you may need to modify your pytube installation to send a valid user-agent header (see this known issue) |
| Tweet | URLs starting with https://twitter.com or https://x.com | ✔️ | Uses unofficial API, may break unexpectedly |
| GitHub Repository | GitHub repo URLs starting with https://github.com or https://www.github.com | ✔️ | Requires GITHUB_TOKEN environment variable |
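For example, a minimal sketch of scraping a GitHub repository (assuming scrape_url lives alongside scrape_file in thepipe.scraper; the token and repository URL are placeholders):

```python
import os
from thepipe.scraper import scrape_url

# requires a valid GitHub token in the environment
os.environ["GITHUB_TOKEN"] = "ghp_your_token_here"
chunks = scrape_url("https://github.com/yourname/yourrepo")
```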
Configuration & Environment
Set these environment variables to control API keys, hosting, and model defaults:
```bash
# If you want longer-term image storage and hosting (saves to ./images and serves via HOST_URL)
export HOST_IMAGES=true

# GitHub token for scraping private/public repos via `scrape_url`
export GITHUB_TOKEN=ghp_...

# Control scraping defaults
export DEFAULT_AI_MODEL=gpt-4o
export DEFAULT_EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2

# Filesize limit for scraped files and webpages, in MB
export FILESIZE_LIMIT_MB=50

# Max duration (in seconds) for audio transcription
export MAX_WHISPER_DURATION=600
```

CLI Usage
```bash
thepipe <source> [options]
```
AI scraping options
--openai-api-key=KEY To enable VLM scraping, pass in your OpenAI API key
--openai-model=MODEL Model to use for scraping (default is DEFAULT_AI_MODEL, currently gpt-4o)
--openai-base-url=URL Custom LLM endpoint, for local LLMs or hosted APIs like OpenRouter (default: https://api.openai.com/v1)
--ai_extraction ⚠️ DEPRECATED; reads the API key from the OPENAI_API_KEY environment variable
General scraping options
--text_only Output text only (suppress images)
--inclusion_pattern=REGEX Include only files whose full path matches REGEX (for dirs/zips)
--verbose Print detailed progress messages
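For example, a representative invocation (the API key is a placeholder):

```bash
thepipe paper.pdf --openai-api-key=sk-placeholder --text_only --verbose
```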