# Extract API
The Extract API allows you to extract structured information from various sources using Large Language Models (LLMs) or Vision Language Models (VLMs). You can extract data from URLs, local files, or text content by defining the structure of the data you want to extract using a schema.
## Basic Usage
Here's an example of extracting financial data from a PDF file:
```python
from thepipe.extract import extract_from_file

# Define the schema for the data extraction
schema = {
    "Fund": "string",
    "NAV": "float",
    "Distribution": "float",
    "Record Date": "string (yyyy-mm-dd)",
}

# Extract multiple rows of data from a PDF file
results = extract_from_file(
    "capital_gains.pdf",
    schema=schema,
    multiple_extractions=True,
)

# Print the extracted data
for result in results:
    if 'extraction' in result:
        for extraction in result['extraction']:
            for key in schema:
                print(f"{key}: {extraction[key]}")
```
This example would produce output similar to the following:
| Fund | NAV | Distribution | Record Date |
|---|---|---|---|
| International Infrastructure Trust Class C | $45.08 | $1.2606 | 2024-12-07 |
| LargeCap Yield Fund ETF | $39.03 | $1.1264 | 2024-12-07 |
| Healthcare Growth Trust Class C | $49.75 | $1.5268 | 2024-12-07 |
| ... | ... | ... | ... |
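If you want to work with the extracted rows as a table in code rather than printing them, one option is to flatten them into a pandas DataFrame. This is only a sketch; it assumes pandas is installed and reuses `results` and `schema` from the example above:

```python
import pandas as pd

# `results` and `schema` come from the example above
rows = [
    extraction
    for result in results
    if 'extraction' in result
    for extraction in result['extraction']
]

df = pd.DataFrame(rows, columns=list(schema))
print(df)
```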
You can also extract data from URLs. For example, to collect basic details about a research paper from its URL, we can set `text_only=True` to extract only the text content, set the chunking method to `chunk_by_document` to treat the entire document as a single chunk, and set `multiple_extractions=False` to get a single extraction result per chunk.
```python
from thepipe.chunker import chunk_by_document
from thepipe.extract import extract_from_url

schema = {
    "title": "string",
    "author": "string",
    "abstract": "string",
    "year": "string (yyyy-mm-dd)",
    "section_names": "string[]"
}

results = extract_from_url(
    "https://arxiv.org/pdf/2201.02177.pdf",
    schema=schema,
    multiple_extractions=False,
    ai_extraction=False,
    chunking_method=chunk_by_document,
    ai_model="gpt-4o",
)

# Print the extracted data
for result in results:
    for key in schema:
        print(f"{key}: {result[key]}")
```
## Defining the Schema
The schema defines the structure of the data you want to extract. It should be a dictionary where the keys are the field names and the values are the data types.
Supported data types:

- `"string"`: for text data
- `"int"`: for integer numbers
- `"float"`: for decimal numbers
- `"bool"`: for true/false values
- `"string[]"`: for arrays of strings
- `"string (yyyy-mm-dd)"`: for date strings in the format `yyyy-mm-dd`
Example schema:
```python
schema = {
    "title": "string",
    "author": "string",
    "publication_year": "int",
    "abstract": "string",
    "keywords": "string[]",
    "is_peer_reviewed": "bool",
    "submission_date": "string (yyyy-mm-dd)"
}
```
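The extracted values follow the schema, but if you want to coerce them into native Python types on your side, a small helper along the following lines can map the schema's type strings to converters. This helper is not part of thepipe; it is only an illustrative sketch, and it leaves date strings untouched:

```python
def coerce(value, type_str):
    # Convert one extracted value according to its schema type string (illustrative only)
    if value is None:
        return None
    if type_str == "int":
        return int(value)
    if type_str == "float":
        return float(value)
    if type_str == "bool":
        return str(value).strip().lower() in ("true", "1", "yes")
    if type_str == "string[]":
        return list(value) if isinstance(value, (list, tuple)) else [str(value)]
    return str(value)  # "string" and "string (yyyy-mm-dd)" stay as strings

# Apply the schema above to one extracted record
record = {"title": "Example Paper", "publication_year": "2021", "is_peer_reviewed": "true"}
typed = {key: coerce(record.get(key), type_str) for key, type_str in schema.items()}
```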
## Advanced Options

Both `extract_from_url` and `extract_from_file` accept several optional parameters:

- `ai_model`: the AI model to use for extraction (default is `'google/gemma-2-9b-it'`)
- `multiple_extractions`: allow multiple extractions per chunk (default is `False`)
- `extraction_prompt`: custom prompt for extraction (default is a predefined prompt)
- `host_images`: whether to host images on the server (default is `False`)
- `text_only`: extract only text content (default is `False`)
- `ai_extraction`: use AI to analyze the layout before extracting structured content (default is `False`)
- `verbose`: print status messages (default is `False`)
- `chunking_method`: the method used to chunk the content (default is `chunk_by_page`)
- `local`: use local processing instead of the API (default is `False`)
Example with advanced options:
```python
from thepipe.chunker import chunk_semantic
from thepipe.extract import extract_from_url

results = extract_from_url(
    "https://arxiv.org/pdf/2201.02177.pdf",
    schema={
        "title": "string",
        "abstract": "string",
        "sections": "string[]"
    },
    ai_model="anthropic/claude-3-haiku",
    multiple_extractions=True,
    ai_extraction=True,
    chunking_method=chunk_semantic
)
```
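The `extraction_prompt` option listed above is not shown in the examples; a minimal sketch of supplying one might look like this (the prompt text itself is only an illustration, not a recommended prompt):

```python
from thepipe.extract import extract_from_file

results = extract_from_file(
    "example.pdf",
    schema={"title": "string", "abstract": "string"},
    extraction_prompt=(
        "Extract the requested fields from the given document. "
        "If a field is not present, leave it empty."
    ),
)
```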
## Chunking Methods

The Extract API supports various chunking methods to split the input content:

- `chunk_by_page`: the default method; splits content by page
- `chunk_by_document`: combines all content into a single chunk
- `chunk_by_section`: splits content based on markdown headers
- `chunk_semantic`: uses semantic similarity to group related content
- `chunk_by_keywords`: splits content based on specified keywords

To use a specific chunking method, import it from `thepipe.chunker` and pass it to the `chunking_method` parameter:
```python
from thepipe.chunker import chunk_by_section
from thepipe.extract import extract_from_file

results = extract_from_file(
    "example.pdf",
    schema={"section_title": "string", "content": "string"},
    chunking_method=chunk_by_section
)
```
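Chunking methods combine naturally with the other options; for instance, pairing `chunk_by_section` with `multiple_extractions=True` (see the next section) aims to produce one or more records per section. A brief sketch, with a hypothetical file name:

```python
from thepipe.chunker import chunk_by_section
from thepipe.extract import extract_from_file

results = extract_from_file(
    "report.pdf",  # hypothetical file
    schema={"section_title": "string", "key_points": "string[]"},
    chunking_method=chunk_by_section,
    multiple_extractions=True,
)
```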
## Multiple vs. Single Extractions

The extraction results are returned as a list of dictionaries. Each dictionary represents a chunk and contains the following keys:

- `chunk_index`: the index of the chunk from which the data was extracted
- `source`: the source URL or file path

The structure of the extracted data depends on whether `multiple_extractions` is enabled.

If `multiple_extractions` is `False`, the extracted fields (as defined in your schema) are included directly in each chunk's dictionary. Example:

```python
[
    {
        'chunk_index': 0,
        'source': 'https://example.com',
        'title': '...',
        'abstract': '...',
    },
    # ... more chunks ...
]
```
If `multiple_extractions` is `True`, each chunk's dictionary includes an `extraction` key containing a list of dictionaries, each representing a separate extraction from that chunk. Example:

```python
[
    {
        'chunk_index': 0,
        'source': 'example.pdf',
        'extraction': [
            {
                'section_title': '...',
                'section_content': '...'
            },
            {
                'section_title': '...',
                'section_content': '...'
            }
        ]
    },
    # ... more chunks ...
]
```
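Whichever mode you use, a single processing loop can handle both result shapes by checking for the `extraction` key first. A sketch based on the structures shown above:

```python
for result in results:
    if 'error' in result:
        continue  # see "Error Handling" below
    if 'extraction' in result:
        # multiple_extractions=True: a list of records per chunk
        records = result['extraction']
    else:
        # multiple_extractions=False: fields sit directly on the chunk dictionary
        records = [result]
    for record in records:
        print(result['chunk_index'], record)
```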
## Error Handling
If an error occurs during extraction, the result dictionary will contain an `error` key with a description of the error. It's important to check for this key when processing results:
```python
for result in results:
    if 'error' in result:
        print(f"Error in chunk {result['chunk_index']}: {result['error']}")
    else:
        # Process the successful extraction
        pass
```
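For larger jobs it can help to separate failed chunks from successful ones so they can be inspected or retried later; for example:

```python
successes = [r for r in results if 'error' not in r]
failures = [r for r in results if 'error' in r]

print(f"{len(successes)} chunks extracted, {len(failures)} failed")
for failure in failures:
    print(f"chunk {failure['chunk_index']}: {failure['error']}")
```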
## API vs Local Processing

The Extract API supports both API-based and local processing:

- **API processing** (the default):
  - Set `local=False` (the default behavior)
  - Uses the thepipe API for extraction
  - Supports streaming responses for real-time processing
  - Handles large files and complex extractions efficiently
- **Local processing**:
  - Set `local=True`
  - Performs extraction on the local machine
  - Useful for offline work or when processing sensitive data
  - May have limitations on file size and processing speed compared to the API
Example of local processing:
```python
from thepipe.extract import extract_from_file

results = extract_from_file(
    "local_document.pdf",
    schema={"title": "string", "content": "string"},
    local=True
)
```
When using the API, the extraction process is streamed, allowing results to be processed in real time as they become available. The stream ends when an `extraction_complete` flag is received.