The Pipe API Documentation

Welcome to The Pipe API Documentation! The Pipe is a powerful tool designed to prepare PDFs, Word documents, slides, web pages, and more for use with vision-language models such as GPT-4. This documentation will walk you through using The Pipe to feed your data into LLM and RAG applications.

Active Development Notice 🚧

The Pipe is open source and under active development. Expect errors, and reduce your exposure to them by upgrading frequently with pip install thepipe_api --upgrade.

Getting Started 🚀

Installation

First, install the latest version of The Pipe Python client via pip:

pip install thepipe_api --upgrade

API Key

Ensure you have set the THEPIPE_API_KEY environment variable with your API key. If you don't have an API key, click here to get one. Don't know how to set an environment variable? If you're a Windows user, click here. If you're a Mac user, click here. You must restart the terminal for the changes to take effect.
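
If you are unsure whether the variable is visible to Python, a small check like the one below can help. This is only a sketch: the helper name require_env is hypothetical and not part of thepipe_api, and it does nothing beyond reading the environment.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Set it and restart your terminal before retrying."
        )
    return value


if __name__ == "__main__":
    # THEPIPE_API_KEY is the variable name given in this documentation.
    try:
        require_env("THEPIPE_API_KEY")
        print("THEPIPE_API_KEY is set")
    except RuntimeError as err:
        print(err)
```

The same helper can be reused for OPENAI_API_KEY, which is needed later when feeding messages to OpenAI.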

Extract from a File

from thepipe_api import thepipe
messages = thepipe.extract("example.pdf")

Extract from a Website

messages = thepipe.extract("https://example.com")

Feeding Data into LLMs

After extraction, you can feed the data into any LLM, including GPT-4-Vision. (You will need to similarly set your OPENAI_API_KEY environment variable.)

from openai import OpenAI
openai_client = OpenAI()
response = openai_client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=messages,
)

The Pipe's output is a sensible list of messages, ready for a vision-language model:

[
  {
    "role": "user",
    "content": [
      {
        "type": "text",
        "text": "..."
      },
      {
        "type": "image_url",
        "image_url": {
          "url": "data:image/jpeg;base64,..."
        }
      }
    ]
  }
]
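
Because the output follows this message structure, it can be inspected with plain Python. The sketch below collects every text part into one string, which can be handy for logging or for a text-only model. The function name messages_to_text is hypothetical and not part of thepipe_api; it assumes only the structure shown above.

```python
from typing import Any


def messages_to_text(messages: list[dict[str, Any]]) -> str:
    """Join every text part found in a list of chat messages."""
    parts: list[str] = []
    for message in messages:
        content = message.get("content", [])
        if isinstance(content, str):
            # Some chat APIs also allow a plain string as content.
            parts.append(content)
            continue
        for item in content:
            if item.get("type") == "text":
                parts.append(item.get("text", ""))
    return "\n".join(parts)
```

Image parts are skipped rather than decoded, so the result contains only the extracted text.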

Note ⚠️

If you want to feed these messages directly into the model, it is important to be mindful of the token limit. OpenAI does not allow too many images in the prompt (see discussion here), so long files should be extracted with text_only=True to avoid this issue. Even longer files can be compressed by setting a token limit with the limit parameter, albeit with quality loss correlating to the compression ratio.
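
One way to decide when to fall back to text_only=True is to count the image parts in the extracted messages first. The sketch below does that. The helper names and the max_images threshold are hypothetical: this documentation does not state OpenAI's actual image limit, so the caller must supply a value.

```python
from typing import Any


def count_images(messages: list[dict[str, Any]]) -> int:
    """Count the image_url parts across all messages."""
    return sum(
        1
        for message in messages
        if isinstance(message.get("content"), list)
        for item in message["content"]
        if item.get("type") == "image_url"
    )


def needs_text_only(messages: list[dict[str, Any]], max_images: int) -> bool:
    """Return True when the messages carry more images than the caller allows."""
    return count_images(messages) > max_images


# If needs_text_only(...) is True, re-run the extraction as described above:
# messages = thepipe.extract("example.pdf", text_only=True)
```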

API Web Portal

You can find a portal to help you debug your API requests here.

Advanced Usage

These messages may also be prepared for a vector database with thepipe.core.create_chunks_from_messages. See a guide here on how to use The Pipe with a RAG framework (ChromaDB).

If you are looking to use other LLMs, LiteLLM can be useful to integrate the output with LLM providers other than OpenAI.

While data tables and bar charts will be readily interpreted by the vision-LLM, you may still want extra features such as data table extraction into text, bar chart extraction into text, web scrape blocker bypassing, and more. These features cost extra (see pricing) on our backend and thus are opt-in by request to emmett@thepi.pe.

Usage limits 📏

We ask that you respect our rate limits of 100 requests per hour (if you need more, contact emmett@thepi.pe).
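
To stay under the limit from the client side, you could track your own request timestamps. The following is a minimal sketch of a sliding-window limiter. The class HourlyRateLimiter is hypothetical and not part of thepipe_api, and how the server actually measures its hourly window is an assumption.

```python
import time
from collections import deque
from typing import Callable


class HourlyRateLimiter:
    """Client-side sliding-window limiter (100 requests per hour by default)."""

    def __init__(
        self,
        max_requests: int = 100,
        window_seconds: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock
        self._timestamps: deque = deque()

    def seconds_until_allowed(self) -> float:
        """Return 0.0 if a request may be sent now, else the wait in seconds."""
        now = self._clock()
        # Drop timestamps that have left the window.
        while self._timestamps and now - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()
        if len(self._timestamps) < self.max_requests:
            return 0.0
        return self.window_seconds - (now - self._timestamps[0])

    def record(self) -> None:
        """Record that a request was just sent."""
        self._timestamps.append(self._clock())
```

Before each call to The Pipe, sleep for seconds_until_allowed() and then call record(). The clock argument exists so the limiter can be exercised without waiting in real time.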

Want to monitor your usage? Sign into the API manager here.

Command Line Interface 🖥️

To use The Pipe via the command line, you can run:

thepipe path/to/document.pdf

or, for directories:

thepipe path/to/directory --match .tsx --ignore node_modules

Supported File Types 📚

Source Type	Input types	Token Compression 🗜️	Image Extraction 👁️	Notes 📌
Plaintext	`.txt`, `.md`, etc	✔️	❌	Regular text files
PDF	`.pdf`	✔️	✔️	Extracts text and images of each page; opt-in (see pricing details) AI extraction (this converts table data, equations, etc into text, and returns images within pages rather than images of them)
Code	`.py`, `.tsx`, `.js`, `.html`, `.css`, `.cpp`, etc	✔️ (varies)	❌	Combines all code files. `.c`, `.cpp`, `.py` are compressible with ctags, others are not
Image	`.jpg`, `.jpeg`, `.png`	❌	✔️	Extracts images, uses OCR if text_only
Spreadsheet	`.csv`, `.xls`, `.xlsx`	✔️	❌	Extracts data from spreadsheets; converts to text representation. For very large datasets, will only extract column names and types
Jupyter Notebook	`.ipynb`	❌	✔️	Extracts code, markdown, and images from Jupyter notebooks
Microsoft Word Document	`.docx`	✔️	✔️	Extracts text and images from Word documents
Microsoft PowerPoint Presentation	`.pptx`	✔️	✔️	Extracts text and images from PowerPoint presentations
Video	`.mp4`, `.avi`, `.mov`, `.wmv`	✔️	✔️	Extracts frames from video files and transcripts using OpenAI Whisper; Returns 1 chunk per minute
Audio	`.mp3`, `.wav`	✔️	❌	Extracts text from audio files using OpenAI Whisper; supports multiple languages and accents
ZIP File	`.zip`	✔️	✔️	Extracts contents of ZIP files; supports nested directory extraction
YouTube Video	YouTube video URLs (inputs containing `https://youtube.com`, `https://www.youtube.com`)	✔️	✔️	Extracts from YouTube videos (see Video).
Website	URLs (inputs containing `http`, `https`)	✔️	✔️	Extracts text from web page along with image (or images if scrollable); text-only extraction available
GitHub Repository	GitHub repo URLs. Must be public. (inputs containing `https://github.com`, `https://www.github.com`)	✔️	✔️	Extracts from GitHub repositories; supports branch specification
Directory	Any `/path/to/directory`	✔️	✔️	Extracts from all files in directory, supports match and ignore patterns

REST API Endpoint 🌐

If you want to use The Pipe from a language other than Python, we provide a versatile API endpoint for data extraction:

POST https://thepipe.up.railway.app/extract

The request body should be a FormData object as described below.

Request Parameters 📋

When making requests to the API, you can include the following parameters:

source (string): The source from which to extract data. Can be a file path, a URL, or a directory path.
limit (integer, optional): The token limit for the output. Results that exceed the limit will be compressed. NOTE: This feature may not work as expected, as it is still in active development.
ai_extraction (boolean, optional): Enables AI-based extraction for tables, figures, and math from PDFs. This feature is opt-in and requires additional charges.
text_only (boolean, optional): If true, extracts only text (images get OCR'd and converted to text).
api_key (string, required): Your API key. Note that we are moving to JWT tokens, so this may change in the future.
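
The parameters above can be assembled into form fields as sketched below. The function build_extract_form is hypothetical. Encoding booleans as the strings "true" and "false" is an assumption, as this documentation does not specify how the endpoint parses them.

```python
from typing import Optional

# Endpoint as given in this documentation.
EXTRACT_URL = "https://thepipe.up.railway.app/extract"


def build_extract_form(
    source: str,
    api_key: str,
    limit: Optional[int] = None,
    ai_extraction: bool = False,
    text_only: bool = False,
) -> dict[str, str]:
    """Build the form fields for a POST to the extract endpoint."""
    if not api_key:
        raise ValueError("api_key is required")
    form = {
        "source": source,
        "api_key": api_key,
        # Assumption: booleans are sent as lowercase strings.
        "ai_extraction": str(ai_extraction).lower(),
        "text_only": str(text_only).lower(),
    }
    if limit is not None:
        # limit is optional, so it is only included when set.
        form["limit"] = str(int(limit))
    return form
```

The resulting dictionary can be handed to any HTTP client as form fields. Whether the endpoint requires multipart encoding is not stated here, so verify your request against the API web portal.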

Response Format 📬

The Pipe's API response is structured to provide easily consumable data for LLMs. Here's what you can expect:

Success Response

A successful request returns a list of chunks containing the extracted data:

[
  {
    "role": "user",
    "content": [
      {
        "type": "text",
        "text": "What’s in this image?"
      },
      {
        "type": "image_url",
        "image_url": {
          "url": "data:image/jpeg;base64,..."
        }
      }
    ]
  },
  ...
]

Error Response

In case of an error, the response will detail the issue:

{
  "error": "..."
}
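
Since a success is a list and an error is an object with an "error" key, a client can tell the two apart by shape. The sketch below shows one way to do that. PipeAPIError and parse_extract_response are hypothetical names, not part of thepipe_api.

```python
import json


class PipeAPIError(Exception):
    """Raised when the API response reports an error or has an unexpected shape."""


def parse_extract_response(body: str) -> list:
    """Parse a response body and return the list of messages, or raise."""
    data = json.loads(body)
    if isinstance(data, dict) and "error" in data:
        # Error responses look like {"error": "..."}.
        raise PipeAPIError(str(data["error"]))
    if not isinstance(data, list):
        raise PipeAPIError("Unexpected response shape")
    return data
```

The raised message can then be matched against the Error Message column of the Common Errors table.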

Local Installation 🛠️

If you do not wish to use our API, you are welcome to host The Pipe locally. Doing so requires installing several additional dependencies, some of which will likely incur compute costs and/or require a GPU for reasonable performance: pytorch, universal-ctags, playwright, pytesseract, llmlingua, moviepy, and pytube. The installation process depends on your system and compute capabilities. After installing them, follow these steps for a local setup:

git clone https://github.com/emcf/thepipe
cd thepipe
pip install -r requirements_local.txt

Windows users must install python-magic-bin for file type detection:

pip install python-magic-bin

To use The Pipe locally, adjust your extraction code to use the local parameter:

from thepipe_api import thepipe
chunks = thepipe.extract("example.pdf", local=True)

or, from the command line:

thepipe example.pdf --local

Common Errors 🚫

Below is a table of common errors you might encounter when using The Pipe API, along with their descriptions and suggestions for how to fix them.

Error Name	Error Message	Description	How to Fix
Failed URL extraction	None	Many websites enforce anti-scraping policies that prevent content extraction.	Contact support for assistance getting data from your specific URL.
Source Type Detection Failure	"Could not detect source type for source."	The system was unable to determine the type of the source provided.	Ensure the source path or URL is correct and accessible. Check if the source's format is supported by The Pipe.
Rate Limit Exceeded	"Rate limit exceeded."	The number of requests made to the API has exceeded the allowed rate limit.	Wait until the rate limit resets or contact support to discuss rate limit adjustments.
Invalid API Key	"Invalid API key."	The provided API key is invalid or not found in the system.	Check the API key for typos and ensure it is correctly set in your request. For additional help, see the error message below.
No API Key Provided	"No valid API key given."	No API key was provided in the request. This can also occur if your THEPIPE_API_KEY environment variable is not set correctly.	Include a valid `api_key` in your request. If you are struggling with environment variables, you can make a `.env` file where your script resides (see details here).
Usage Limit Exceeded	"Usage limit exceeded."	The usage limit associated with the provided API key has been exceeded.	Check your usage or contact support to increase your usage limit.
AI PDF Extraction Not Allowed	"AI PDF extraction is available for enterprise users only."	The request attempted to use AI-based PDF extraction, which is not enabled for the user's account.	Upgrade to an enterprise account or contact support for access to AI PDF extraction.
File Size Limit Exceeded	"File size exceeds limit."	The size of the uploaded file exceeds the allowed limit.	Reduce the file size or contact support to discuss file size limit adjustments.
OpenAI BadRequestError	"Invalid 'messages': array too long. Expected an array with maximum length X, but got an array with length Y instead."	The array of messages sent to OpenAI's API exceeds the maximum allowed length.	Send fewer messages in a single request to OpenAI's API so that the array stays within the maximum length X reported in the error.
Content Extraction Failure	"No content extracted from URL."	The system was unable to extract any content from the provided URL.	Check the URL for correctness and ensure the webpage is accessible and contains extractable content.
GitHub Clone Error	"Failed to clone GitHub repository."	An error occurred while attempting to clone the specified GitHub repository.	Ensure the GitHub URL is correct. If it is private, it must be accessible to the GitHub user `emcf`.
Invalid Source Error	"The provided source is not supported or invalid."	The source provided for extraction is not supported or cannot be processed by The Pipe.	Verify the source type and format. Use a supported source type such as a PDF, DOCX, PPTX, or a valid URL.
File Not Found	"The specified file was not found."	The file specified for extraction does not exist or is not accessible.	Check the file path for typos and ensure the file exists and is accessible.
Unsupported File Type	"The file type is not supported for extraction."	The file type provided is not supported by The Pipe for content extraction. Please suggest this feature to emmett@thepi.pe!	Use a supported file type such as PDF, DOCX, PPTX, images, or spreadsheets.
API Access Error	"Error accessing the API."	An unspecified error occurred while accessing The Pipe API. Check if your API key is valid.	Check your network connection and API endpoint URL. If the issue persists, contact support.

If you encounter an error not listed here or need further assistance, please contact support at emmett@thepi.pe.

Support and Feedback 📢

For support, feedback, or questions about using The Pipe API, please contact emmett@thepi.pe.

Thank you for keeping The Pipe flowing! Happy extracting! 🎉