Extract API

The Extract API lets you extract structured information from a variety of sources using LLMs or VLMs. You can extract data from URLs, local files, or raw text by defining a schema that describes the structure of the data you want back.

Basic Usage

To extract data from a URL or a local file, you can use the extract_from_url and extract_from_file functions:

from thepipe.extract import extract_from_url
 
# Extract structured data from a URL
results = extract_from_url(
    "https://arxiv.org/abs/2106.14789",
    schema={
        "title": "string",
        "authors": "string[]",
        "year": "int"
    },
    text_only=True
)

In the above example, we extract the title, authors, and publication year from an arXiv paper.
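
Each result is a dictionary keyed by the fields in your schema (the full result shape is described under Handling Results below). A minimal sketch of reading the extracted fields:

for result in results:
    print(result['title'])    # the paper's title as a string
    print(result['authors'])  # a list of author name strings
    print(result['year'])     # the publication year as an integer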

from thepipe.extract import extract_from_url
 
# Extract multiple pieces of data from one URL
results = extract_from_url(
    "https://www.bbc.co.uk/",
    schema={
        "article_title": "string",
        "image_sentiment": "string",
        "article_sentiment": "float"
    },
    multiple_extractions=True
)

In the above example, we set multiple_extractions=True because a news homepage contains many articles, and we want one extraction per article rather than a single result per page.
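
Each chunk's result then carries an extraction key holding an array of extracted objects, one per article (see Handling Results below). A minimal sketch of iterating over them:

for result in results:
    # Each entry in 'extraction' is one article found in the chunk
    for extraction in result.get('extraction', []):
        print(extraction['article_title'], extraction['article_sentiment'])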

Extract from a File

To extract the same kind of structured data from a local file, use extract_from_file:

from thepipe.extract import extract_from_file
 
results = extract_from_file(
    "example.pdf",
    schema={
        "author": "string",
        "abstract": "string",
        "keywords": "string[]"
    }
)

Defining the Schema

The schema defines the structure of the data you want to extract. It can be a Python dictionary or a JSON string representing an object whose keys are the field names and whose values are the data types; both forms appear in the examples on this page.

Supported data types:

  • "string": For text data
  • "int": For integer numbers
  • "float": For decimal numbers
  • "bool": For true/false values
  • "string[]": For arrays of strings

Example schema:

{
  "title": "string",
  "author": "string",
  "publication_year": "int",
  "abstract": "string",
  "keywords": "string[]",
  "is_peer_reviewed": "bool"
}
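
Both forms shown on this page are interchangeable: you can pass the schema as a Python dictionary, or serialize it to a JSON string first. A quick sketch using the standard library:

import json
 
schema = {
    "title": "string",
    "author": "string",
    "publication_year": "int",
    "abstract": "string",
    "keywords": "string[]",
    "is_peer_reviewed": "bool"
}
 
# Equivalent JSON-string form, as used in the advanced example below
schema_str = json.dumps(schema)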

Advanced Options

The extract_from_url and extract_from_file functions accept several optional parameters:

  • ai_model: The AI model to use for extraction (default is 'google/gemma-2-9b-it')
  • multiple_extractions: Allow multiple extractions per chunk (default is False)
  • extraction_prompt: Custom prompt for extraction (default is a predefined prompt)
  • host_images: Whether to host images on the server (default is False)
  • text_only: Extract only text content (default is False)
  • ai_extraction: Use AI to analyze layout before extracting structured content; see the Scrape docs for details (default is False)
  • verbose: Print status messages (default is False)
  • chunking_method: Method to chunk the content (default is chunk_by_document)
  • local: Use local processing instead of the API (default is False)

Example with advanced options:

from thepipe.chunker import chunk_semantic
 
results = extract_from_url(
    "https://arxiv.org/pdf/2201.02177.pdf",
    schema='{"title": "string", "abstract": "string", "sections": "string[]"}',
    ai_model="anthropic/claude-3-haiku",
    multiple_extractions=True,
    ai_extraction=True,
    chunking_method=chunk_semantic
)

Handling Results

The extraction results are returned as a list of dictionaries. Each dictionary represents an extraction and contains the following keys:

  • chunk_index: The index of the chunk from which the data was extracted
  • source: The source URL or file path
  • Fields defined in your schema

If multiple_extractions is True, the results will contain an additional extraction key with an array of extracted data.

Example of processing results:

for result in results:
    print(f"Chunk {result['chunk_index']} from {result['source']}:")
    if 'extraction' in result:
        for extraction in result['extraction']:
            print(f"  Title: {extraction['title']}")
            print(f"  Abstract: {extraction['abstract']}")
    else:
        print(f"  Title: {result['title']}")
        print(f"  Abstract: {result['abstract']}")
    print()

Error Handling

If an error occurs during extraction, the result dictionary will contain an error key with a description of the error. It's important to check for this key when processing results, since LLMs are not perfect and may fail on some inputs:

for result in results:
    if 'error' in result:
        print(f"Error in chunk {result['chunk_index']}: {result['error']}")
    else:
        # Process successful extraction
        pass
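
For larger jobs, it can help to partition the results into successes and failures before further processing. A minimal sketch:

# Separate clean extractions from failed chunks
successes = [r for r in results if 'error' not in r]
failures = [r for r in results if 'error' in r]
print(f"{len(successes)} chunks extracted successfully, {len(failures)} failed")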