Extract API

The Extract API extracts structured information from URLs, local files, or raw text using Large Language Models (LLMs) or Vision Language Models (VLMs). You define the structure of the data you want with a schema, and the API returns results that follow it.

Basic Usage

Here's an example of extracting financial data from a PDF file:

from thepipe.extract import extract_from_file
 
# Define the schema for the data extraction
schema = {
    "Fund": "string",
    "NAV": "float",
    "Distribution": "float",
    "Record Date": "string (yyyy-mm-dd)",
}
 
# Extract multiple rows of data from a PDF file
results = extract_from_file(
    "capital_gains.pdf",
    schema=schema,
    multiple_extractions=True,
)
 
# Print the extracted data
for result in results:
    if 'extraction' in result:
        for extraction in result['extraction']:
            for key in schema:
                print(f"{key}: {extraction[key]}")

This example would produce output similar to the following:

Fund                                        | NAV    | Record Date | Distribution
International Infrastructure Trust Class C | $45.08 | 2024-12-07  | $1.2606
LargeCap Yield Fund ETF                     | $39.03 | 2024-12-07  | $1.1264
Healthcare Growth Trust Class C             | $49.75 | 2024-12-07  | $1.5268
...

You can also extract data from URLs. For example, to collect basic details about a research paper, we can set text_only=True to extract only the text content, pass chunk_by_document as the chunking method so the entire document becomes a single chunk, and set multiple_extractions=False to get a single extraction result per chunk.

from thepipe.chunker import chunk_by_document
from thepipe.extract import extract_from_url
 
schema = {
    "title": "string",
    "author": "string",
    "abstract": "string",
    "year": "string (yyyy-mm-dd)",
    "section_names": "string[]"
}
 
results = extract_from_url(
    "https://arxiv.org/pdf/2201.02177.pdf",
    schema=schema,
    multiple_extractions=False,
    text_only=True,
    ai_extraction=False,
    chunking_method=chunk_by_document,
    ai_model="gpt-4o",
)
 
# Print the extracted data
for result in results:
    for key in schema:
        print(f"{key}: {result[key]}")

Defining the Schema

The schema defines the structure of the data you want to extract. It should be a dictionary where the keys are the field names and the values are the data types.

Supported data types:

  • "string": For text data
  • "int": For integer numbers
  • "float": For decimal numbers
  • "bool": For true/false values
  • "string[]": For arrays of strings
  • "string (yyyy-mm-dd)": For date strings in the format yyyy-mm-dd

Example schema:

schema = {
    "title": "string",
    "author": "string",
    "publication_year": "int",
    "abstract": "string",
    "keywords": "string[]",
    "is_peer_reviewed": "bool",
    "submission_date": "string (yyyy-mm-dd)"
}
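
Whether values come back as native Python types or as strings can vary, so a light sanity check on the results can help. A minimal sketch; the type mapping below is our own convention, not part of thepipe:

PYTHON_TYPES = {"string": str, "int": int, "float": float, "bool": bool}
 
def check_record(record, schema):
    # Best-effort check of one extracted record against the schema above
    for field, declared in schema.items():
        value = record.get(field)
        if value is None:
            continue  # field was not extracted
        if declared == "string[]":
            assert isinstance(value, list) and all(isinstance(v, str) for v in value)
        elif declared.startswith("string"):
            assert isinstance(value, str)  # covers date-formatted strings too
        else:
            assert isinstance(value, PYTHON_TYPES[declared])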

Advanced Options

Both the extract_from_url and extract_from_file functions accept several optional parameters:

  • ai_model: The AI model to use for extraction (default is 'google/gemma-2-9b-it')
  • multiple_extractions: Allow multiple extractions per chunk (default is False)
  • extraction_prompt: Custom prompt for extraction (default is a predefined prompt)
  • host_images: Whether to host images on the server (default is False)
  • text_only: Extract only text content (default is False)
  • ai_extraction: Use AI to analyze layout before extracting structured content (default is False)
  • verbose: Print status messages (default is False)
  • chunking_method: Method to chunk the content (default is chunk_by_page)
  • local: Use local processing instead of the API (default is False)

Example with advanced options:

from thepipe.chunker import chunk_semantic
 
results = extract_from_url(
    "https://arxiv.org/pdf/2201.02177.pdf",
    schema={
        "title": "string",
        "abstract": "string",
        "sections": "string[]"
    },
    ai_model="anthropic/claude-3-haiku",
    multiple_extractions=True,
    ai_extraction=True,
    chunking_method=chunk_semantic
)
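
You can also steer the model with extraction_prompt, which replaces the predefined default prompt. A minimal sketch; the prompt wording and file name here are illustrative:

results = extract_from_file(
    "annual_report.pdf",  # hypothetical file
    schema={"company": "string", "revenue": "float"},
    # Illustrative instructions; not the library's predefined default prompt
    extraction_prompt=(
        "Extract the requested fields exactly as they appear in the document. "
        "Leave a field empty rather than guessing."
    ),
)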

Chunking Methods

The Extract API supports various chunking methods to split the input content:

  • chunk_by_page: Default method, splits content by page
  • chunk_by_document: Combines all content into a single chunk
  • chunk_by_section: Splits content based on markdown headers
  • chunk_semantic: Uses semantic similarity to group related content
  • chunk_by_keywords: Splits content based on specified keywords

To use a specific chunking method, import it from thepipe.chunker and pass it to the chunking_method parameter:

from thepipe.chunker import chunk_by_section
 
results = extract_from_file(
    "example.pdf",
    schema={"section_title": "string", "content": "string"},
    chunking_method=chunk_by_section
)
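
For chunkers that take extra arguments, such as chunk_by_keywords, one option is to bind the arguments with functools.partial before passing the callable in. A sketch, assuming chunk_by_keywords accepts a keywords parameter (check the signature in your installed version):

from functools import partial
from thepipe.chunker import chunk_by_keywords
from thepipe.extract import extract_from_file
 
# Keyword list is illustrative; content splits wherever a keyword appears
chunker = partial(chunk_by_keywords, keywords=["Introduction", "Results", "Conclusion"])
 
results = extract_from_file(
    "example.pdf",
    schema={"section_title": "string", "content": "string"},
    chunking_method=chunker,
)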

Multiple vs. Single Extractions

The extraction results are returned as a list of dictionaries. Each dictionary represents a chunk and contains the following keys:

  • chunk_index: The index of the chunk from which the data was extracted
  • source: The source URL or file path

The structure of the extracted data depends on whether multiple_extractions is enabled; a helper that handles both shapes is sketched after the examples below:

  1. If multiple_extractions is False:

    • The extracted fields (as defined in your schema) are directly included in each chunk's dictionary.

    Example:

    [
        {
            'chunk_index': 0,
            'source': 'https://example.com',
            'title': '...',
            'abstract': '...',
        },
        # ... more chunks ...
    ]
  2. If multiple_extractions is True:

    • Each chunk's dictionary includes an extraction key, which contains a list of dictionaries, each representing a separate extraction from that chunk.

    Example:

    [
        {
            'chunk_index': 0,
            'source': 'example.pdf',
            'extraction': [
                {
                    'section_title': '...',
                    'section_content': '...'
                },
                {
                    'section_title': '...',
                    'section_content': '...'
                }
            ]
        },
        # ... more chunks ...
    ]
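
Because the two modes return different shapes, code that must handle both can normalize them first. A small sketch based on the structures above:

def iter_extractions(results):
    # Yield (chunk_index, fields) pairs for either extraction mode
    for result in results:
        if 'extraction' in result:
            # multiple_extractions=True: one chunk may yield several records
            for fields in result['extraction']:
                yield result['chunk_index'], fields
        elif 'error' not in result:
            # multiple_extractions=False: fields sit directly on the chunk dict
            fields = {k: v for k, v in result.items()
                      if k not in ('chunk_index', 'source')}
            yield result['chunk_index'], fields
 
for chunk_index, fields in iter_extractions(results):
    print(chunk_index, fields)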

Error Handling

If an error occurs during extraction, the result dictionary will contain an error key with a description of the error. It's important to check for this key when processing results:

for result in results:
    if 'error' in result:
        print(f"Error in chunk {result['chunk_index']}: {result['error']}")
    else:
        # Process successful extraction
        pass
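
A common follow-on pattern is to split failures from successful chunks in a single pass, so the successes can be processed in bulk:

rows, errors = [], []
for result in results:
    if 'error' in result:
        # Keep the chunk index alongside the message for easier debugging
        errors.append((result['chunk_index'], result['error']))
    else:
        rows.append(result)
 
print(f"{len(rows)} chunks extracted, {len(errors)} failed")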

API vs Local Processing

The Extract API supports both API-based and local processing:

  1. API Processing (default):

    • Set local=False (default behavior)
    • Utilizes the thepipe API for extraction
    • Supports streaming responses for real-time processing
    • Handles large files and complex extractions efficiently
  2. Local Processing:

    • Set local=True
    • Performs extraction on the local machine
    • Useful for offline work or when processing sensitive data
    • May have limitations on file size and processing speed compared to the API

Example of local processing:

results = extract_from_file(
    "local_document.pdf",
    schema={"title": "string", "content": "string"},
    local=True
)

When using the API, the extraction process is streamed, allowing for real-time processing of results as they become available. The stream ends when an 'extraction_complete' flag is received.
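
If you call the HTTP API directly instead of using the Python helpers, the stream can be consumed line by line. A rough sketch, assuming a newline-delimited JSON stream; the endpoint URL and payload fields are hypothetical, so consult the API reference for the real request format:

import json
import requests
 
# Hypothetical endpoint and payload, for illustration only
payload = {"urls": ["https://arxiv.org/pdf/2201.02177.pdf"], "schema": schema}
with requests.post("https://api.example.com/extract", json=payload, stream=True) as resp:
    for line in resp.iter_lines():
        if not line:
            continue
        message = json.loads(line)
        # Per the docs, the stream ends when 'extraction_complete' is received
        if message.get("extraction_complete"):
            break
        print(message)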