Extract API

⚠️ Note: The cloud-based API is currently being deprecated in favour of local installation. Please upgrade to the latest version of thepipe-api and follow the instructions in the README to set up the local API. The cloud-based API will be removed in the next release.
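If you have not already migrated, upgrading is typically a single install command (shown here with pip, assuming a standard Python environment):

pip install --upgrade thepipe-api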

The Extract API allows you to extract structured information from various sources using Large Language Models (LLMs) or Vision Language Models (VLMs). You can extract data from URLs, local files, or text content by defining the structure of the data you want to extract using a schema.

Basic Usage

Here's an example of extracting financial data from a PDF file:

from thepipe.extract import extract_from_file
 
# Define the schema for the data extraction
schema = {
    "Fund": "string",
    "NAV": "float",
    "Distribution": "float",
    "Record Date": "string (yyyy-mm-dd)",
}
 
# Extract multiple rows of data from a PDF file
results = extract_from_file(
    "capital_gains.pdf",
    schema=schema,
    multiple_extractions=True,
)
 
# Print the extracted data
for result in results:
    if 'extraction' in result:
        for extraction in result['extraction']:
            for key in schema:
                print(f"{key}: {extraction[key]}")

This example would produce output similar to the following:

| Fund | NAV | Record Date | Distributions |
| --- | --- | --- | --- |
| International Infrastructure Trust Class C | $45.08 | 2024-12-07 | $1.2606 |
| LargeCap Yield Fund ETF | $39.03 | 2024-12-07 | $1.1264 |
| Healthcare Growth Trust Class C | $49.75 | 2024-12-07 | $1.5268 |
| ... | ... | ... | ... |
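
Once results are in this shape, converting them to tabular data is straightforward. Below is a minimal sketch, assuming pandas is installed and that results comes from the extract_from_file call above:

import pandas as pd

# Flatten the per-chunk extraction lists into one list of rows
rows = [
    extraction
    for result in results
    if 'extraction' in result
    for extraction in result['extraction']
]

# Build a table with columns in schema order and save it
df = pd.DataFrame(rows, columns=list(schema))
df.to_csv("capital_gains.csv", index=False)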

You can also extract data from URLs. For example, to collect basic details about a research paper from its URL, we can set text_only=True to extract only the text content, use chunk_by_document as the chunking method so the entire document is treated as a single chunk, and set multiple_extractions=False to get a single extraction result per chunk.

from thepipe.chunker import chunk_by_document
from thepipe.extract import extract_from_url
 
schema = {
    "title": "string",
    "author": "string",
    "abstract": "string",
    "year": "string (yyyy-mm-dd)",
    "section_names": "string[]"
}
 
results = extract_from_url(
    "https://arxiv.org/pdf/2201.02177.pdf",
    schema=schema,
    multiple_extractions=False,
    text_only=True,
    ai_extraction=False,
    chunking_method=chunk_by_document,
    ai_model="gpt-4o",
)
 
# Print the extracted data
for result in results:
    for key in schema:
        print(f"{key}: {result[key]}")

Defining the Schema

The schema defines the structure of the data you want to extract. It should be a dictionary where the keys are the field names and the values are the data types.

Supported data types:

  • "string": For text data
  • "int": For integer numbers
  • "float": For decimal numbers
  • "bool": For true/false values
  • "string[]": For arrays of strings
  • "string (yyyy-mm-dd)": For date strings in the format yyyy-mm-dd

Example schema:

schema = {
    "title": "string",
    "author": "string",
    "publication_year": "int",
    "abstract": "string",
    "keywords": "string[]",
    "is_peer_reviewed": "bool",
    "submission_date": "string (yyyy-mm-dd)"
}
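
Depending on the model, extracted values may come back as strings even for numeric or boolean fields. A small, hypothetical helper (not part of thepipe itself) can coerce raw values to native Python types based on the schema type strings above:

from datetime import date

def coerce(value, type_name):
    # Convert one raw extracted value according to its schema type string
    if type_name == "int":
        return int(value)
    if type_name == "float":
        return float(value)
    if type_name == "bool":
        return str(value).strip().lower() in ("true", "1", "yes")
    if type_name == "string (yyyy-mm-dd)":
        return date.fromisoformat(value)
    return value  # "string" and "string[]" pass through unchanged

# Usage: 'record' is one extracted dict matching the schema
typed = {key: coerce(record[key], schema[key]) for key in schema}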

Advanced Options

Both extract_from_url and extract_from_file functions accept several optional parameters:

  • ai_extraction: Use AI to analyze layout before extracting structured content (default is False)
  • ai_model: The AI model to use for extraction (default is 'google/gemma-2-9b-it')
  • multiple_extractions: Allow multiple extractions per chunk (default is False)
  • extraction_prompt: Custom prompt for extraction (default is a predefined prompt)
  • host_images: Whether to host images on the server (default is False)
  • verbose: Print status messages (default is False)
  • chunking_method: Method to chunk the content (default is chunk_by_page)

Example with advanced options:

from thepipe.chunker import chunk_by_section
from thepipe.extract import extract_from_url
 
results = extract_from_url(
    "https://arxiv.org/pdf/2201.02177.pdf",
    schema={
        "title": "string",
        "abstract": "string",
        "subsection_titles": "string[]"
    },
    ai_model="anthropic/claude-3-haiku",
    multiple_extractions=True,
    ai_extraction=True,
    chunking_method=chunk_by_section,
)
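
You can also override the default instruction sent to the model via extraction_prompt. A minimal sketch; the prompt wording here is purely illustrative:

from thepipe.extract import extract_from_url

results = extract_from_url(
    "https://arxiv.org/pdf/2201.02177.pdf",
    schema={"title": "string", "abstract": "string"},
    # Illustrative prompt text; by default a predefined prompt is used
    extraction_prompt=(
        "Extract the requested fields from the document. "
        "If a field is not present, return an empty string."
    ),
)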

Chunking Methods

The Extract API supports various chunking methods to split the input content:

  • chunk_by_page: Default method, splits content by page
  • chunk_by_document: Combines all content into a single chunk
  • chunk_by_section: Splits content based on markdown headers
  • chunk_by_keywords: Splits content based on specified keywords
  • chunk_by_length: Splits content into chunks of a specified length
  • chunk_semantic: Uses semantic similarity to group related content

To use a specific chunking method, import it from thepipe.chunker and pass it to the chunking_method parameter:

from thepipe.chunker import chunk_by_section
from thepipe.extract import extract_from_file
 
results = extract_from_file(
    "example.pdf",
    schema={"section_title": "string", "content": "string"},
    chunking_method=chunk_by_section,
)
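
Some chunkers need extra arguments, e.g. chunk_by_keywords needs the keywords to split on. Because chunking_method takes a callable, one way to bind such arguments is functools.partial. Note that the keywords parameter name below is an assumption; check the signatures in thepipe.chunker for your installed version:

from functools import partial
from thepipe.chunker import chunk_by_keywords
from thepipe.extract import extract_from_file

results = extract_from_file(
    "example.pdf",
    schema={"topic": "string", "summary": "string"},
    # 'keywords' is the assumed argument name for chunk_by_keywords
    chunking_method=partial(chunk_by_keywords, keywords=["Introduction", "Results"]),
)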

Multiple vs. Single Extractions

The extraction results are returned as a list of dictionaries. Each dictionary represents a chunk and contains the following keys:

  • chunk_index: The index of the chunk from which the data was extracted
  • source: The source URL or file path

The structure of the extracted data depends on whether multiple_extractions is enabled:

  1. If multiple_extractions is False:

    • The extracted fields (as defined in your schema) are directly included in each chunk's dictionary.

    Example:

    [
        {
            'chunk_index': 0,
            'source': 'https://example.com',
            'title': '...',
            'abstract': '...'
        },
        # ... more chunks ...
    ]
  2. If multiple_extractions is True:

    • Each chunk's dictionary includes an extraction key, which contains a list of dictionaries, each representing a separate extraction from that chunk.

    Example:

    [
        {
            'chunk_index': 0,
            'source': 'example.pdf',
            'extraction': [
                {
                    'section_title': '...',
                    'section_content': '...'
                },
                {
                    'section_title': '...',
                    'section_content': '...'
                }
            ]
        },
        # ... more chunks ...
    ]
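
If downstream code needs to handle both shapes, a small normalization helper keeps the logic uniform. A minimal sketch based on the structures above:

def iter_extractions(results):
    # Yield (chunk_index, source, fields) for either result shape
    for result in results:
        meta = (result.get('chunk_index'), result.get('source'))
        if 'extraction' in result:
            # multiple_extractions=True: one dict per extraction
            for fields in result['extraction']:
                yield (*meta, fields)
        else:
            # multiple_extractions=False: fields live on the chunk itself
            fields = {k: v for k, v in result.items()
                      if k not in ('chunk_index', 'source')}
            yield (*meta, fields)

for chunk_index, source, fields in iter_extractions(results):
    print(chunk_index, source, fields)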