# Extract API
The Extract API allows you to extract structured information from various sources using LLMs or VLMs. You can extract data from URLs, local files, or text content by simply defining the structure of the data you want to extract using a schema.
## Basic Usage
To extract data from a URL or a local file, use the `extract_from_url` and `extract_from_file` functions:
```python
from thepipe.extract import extract_from_url, extract_from_file

# Extract structured data from a URL
results = extract_from_url(
    "https://arxiv.org/abs/2106.14789",
    schema={
        "title": "string",
        "authors": "string[]",
        "year": "int"
    },
    text_only=True
)
```
In the above example, we extract the title, authors, and publication year from an arXiv paper.
```python
from thepipe.extract import extract_from_url

# Extract multiple pieces of data from one URL
results = extract_from_url(
    "https://www.bbc.co.uk/",
    schema={
        "article_title": "string",
        "image_sentiment": "string",
        "article_sentiment": "float"
    },
    multiple_extractions=True
)
```
In the above example, we set `multiple_extractions=True`, since we want to extract multiple articles from a news website.
### Extract from a File
```python
from thepipe.extract import extract_from_file

# Extract structured data from a local PDF
results = extract_from_file(
    "example.pdf",
    schema={
        "author": "string",
        "abstract": "string",
        "keywords": "string[]"
    }
)
```
## Defining the Schema
The schema defines the structure of the data you want to extract. It can be either a dictionary or a JSON string representing an object, where the keys are the field names and the values are the data types (both forms appear in the examples on this page).
Supported data types:

- `"string"`: for text data
- `"int"`: for integer numbers
- `"float"`: for decimal numbers
- `"bool"`: for true/false values
- `"string[]"`: for arrays of strings
Example schema:

```json
{
    "title": "string",
    "author": "string",
    "publication_year": "int",
    "abstract": "string",
    "keywords": "string[]",
    "is_peer_reviewed": "bool"
}
```
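Since the schema may be supplied either as a dictionary or as a JSON string, a small helper can normalize and sanity-check it before calling the extract functions. This is an illustrative sketch, not part of thepipe; `normalize_schema` and `SUPPORTED_TYPES` are hypothetical names:

```python
import json

# Types listed as supported by the Extract API (illustrative helper, not part of thepipe)
SUPPORTED_TYPES = {"string", "int", "float", "bool", "string[]"}

def normalize_schema(schema):
    """Accept a dict or a JSON string; return a validated dict."""
    if isinstance(schema, str):
        schema = json.loads(schema)
    for field, dtype in schema.items():
        if dtype not in SUPPORTED_TYPES:
            raise ValueError(f"Unsupported type {dtype!r} for field {field!r}")
    return schema

# Both forms normalize to the same dict
as_dict = normalize_schema({"title": "string", "year": "int"})
as_str = normalize_schema('{"title": "string", "year": "int"}')
assert as_dict == as_str == {"title": "string", "year": "int"}
```

Catching malformed schemas locally like this avoids spending tokens on a request that would fail anyway.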
## Advanced Options
The `extract_from_url` and `extract_from_file` functions accept several optional parameters:
- `ai_model`: the AI model to use for extraction (default is `'google/gemma-2-9b-it'`)
- `multiple_extractions`: allow multiple extractions per chunk (default is `False`)
- `extraction_prompt`: custom prompt for extraction (default is a predefined prompt)
- `host_images`: whether to host images on the server (default is `False`)
- `text_only`: extract only text content (default is `False`)
- `ai_extraction`: use AI to analyze layout before extracting structured content; see the Scrape docs for details (default is `False`)
- `verbose`: print status messages (default is `False`)
- `chunking_method`: method used to chunk the content (default is `chunk_by_document`)
- `local`: use local processing instead of the API (default is `False`)
Example with advanced options:
```python
from thepipe.extract import extract_from_url
from thepipe.chunker import chunk_semantic

results = extract_from_url(
    "https://arxiv.org/pdf/2201.02177.pdf",
    schema='{"title": "string", "abstract": "string", "sections": "string[]"}',
    ai_model="anthropic/claude-3-haiku",
    multiple_extractions=True,
    ai_extraction=True,
    chunking_method=chunk_semantic
)
```
## Handling Results
The extraction results are returned as a list of dictionaries. Each dictionary represents an extraction and contains the following keys:
- `chunk_index`: the index of the chunk from which the data was extracted
- `source`: the source URL or file path
- the fields defined in your schema
If `multiple_extractions` is `True`, the results will contain an additional `extraction` key with an array of extracted data.
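To make the two result shapes concrete, here is an illustrative sketch. The field values are made up; only the `chunk_index`, `source`, and `extraction` keys come from the description above:

```python
# Hypothetical single-extraction result: schema fields sit at the top level
single = {
    "chunk_index": 0,
    "source": "https://arxiv.org/abs/2106.14789",
    "title": "Example Paper",
    "year": 2021,
}

# Hypothetical result with multiple_extractions=True: schema fields are
# nested under the "extraction" key, one dict per extracted item
multiple = {
    "chunk_index": 0,
    "source": "https://www.bbc.co.uk/",
    "extraction": [
        {"article_title": "Headline A", "article_sentiment": 0.8},
        {"article_title": "Headline B", "article_sentiment": -0.2},
    ],
}

# Branch on the presence of "extraction" to handle both shapes
def titles(result):
    if "extraction" in result:
        return [e["article_title"] for e in result["extraction"]]
    return [result.get("title")]

assert titles(single) == ["Example Paper"]
assert titles(multiple) == ["Headline A", "Headline B"]
```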
Example of processing results:
```python
for result in results:
    print(f"Chunk {result['chunk_index']} from {result['source']}:")
    if 'extraction' in result:
        # Multiple extractions: iterate over the nested array
        for extraction in result['extraction']:
            print(f"  Title: {extraction['title']}")
            print(f"  Abstract: {extraction['abstract']}")
    else:
        # Single extraction: schema fields are at the top level
        print(f"  Title: {result['title']}")
        print(f"  Abstract: {result['abstract']}")
    print()
```
## Error Handling
If an error occurs during extraction, the result dictionary will contain an `error` key with a description of the error. It's important to check for this key when processing results, since LLMs are not perfect and may fail on some inputs:
```python
for result in results:
    if 'error' in result:
        print(f"Error in chunk {result['chunk_index']}: {result['error']}")
    else:
        # Process the successful extraction
        pass
```