Scrape API
To scrape markdown and visuals from URLs or files, you can use the `scrape_url` and `scrape_file` functions. The scraped content is ready to be fed into any LLM, and can be used for training models, structured data extraction, or storage in a vector database.
Basic Usage
To scrape the contents of a URL or a local file, use the `scrape_url` and `scrape_file` functions:
```python
from thepipe.scraper import scrape_url, scrape_file

# Scrape a local file
chunks = scrape_file("example.pdf")

# Or scrape a URL
chunks = scrape_url("https://example.com")
```
Feed into LLM
To feed the scraped content into an LLM, you can convert the chunks into an OpenAI messages format:
```python
from openai import OpenAI
from thepipe.core import chunks_to_messages

# Feed the scraped results to OpenAI
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=chunks_to_messages(chunks),
)
```
Feed into Vector Database
Alternatively, you can re-chunk the content before embedding it into a vector database:
```python
from thepipe.chunker import chunk_by_page

# Can also use `chunk_by_document`, `chunk_by_section`, `chunk_semantic`
chunks = chunk_by_page(chunks)
```
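Once re-chunked, each chunk's text can be embedded and stored. The sketch below shows the general pattern only: the `Chunk` class here is a hypothetical stand-in for thepipe's chunk objects, the `embed` function is a toy placeholder (swap in a real embedding model), and the in-memory list stands in for an actual vector database.

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for a thepipe chunk, for illustration only
@dataclass
class Chunk:
    path: str
    texts: list = field(default_factory=list)

def embed(text: str) -> list:
    # Placeholder embedding: a toy character-frequency vector.
    # Replace with a real embedding model in practice.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

# In-memory "vector database": (id, vector, metadata) records
index = []
chunks = [
    Chunk(path="example.pdf", texts=["Page one text"]),
    Chunk(path="example.pdf", texts=["Page two text"]),
]
for i, chunk in enumerate(chunks):
    text = " ".join(chunk.texts)
    index.append((i, embed(text), {"path": chunk.path, "text": text}))
```

The same loop works with any vector store client: replace the `index.append` call with the store's own upsert method.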
Raw Scraping Results
For maximum flexibility, you can use these chunks directly in your own processing pipeline or storage system. Each chunk contains the following attributes:
```python
for chunk in chunks:
    print(chunk.path)
    print(chunk.texts)
    print(chunk.images)
```
We will be releasing implementations for `chunk.video` and `chunk.audio` in the coming months as fully multimodal updates to GPT-4o roll out.
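For example, chunks can be serialized for storage in your own system. A minimal sketch, again using a hypothetical stand-in class carrying the same `path`, `texts`, and `images` attributes shown above:

```python
import json
from dataclasses import dataclass, field

# Hypothetical stand-in for a thepipe chunk, for illustration only
@dataclass
class Chunk:
    path: str
    texts: list = field(default_factory=list)
    images: list = field(default_factory=list)

def chunk_to_record(chunk: Chunk) -> dict:
    # Keep text and provenance; images could be written out separately
    return {
        "path": chunk.path,
        "texts": chunk.texts,
        "num_images": len(chunk.images),
    }

chunks = [Chunk(path="example.pdf", texts=["Hello world"], images=[])]
records = [chunk_to_record(c) for c in chunks]
payload = json.dumps(records)
```

The resulting JSON can be written to disk, a document store, or a message queue for downstream processing.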
Advanced Options
The `scrape_url` and `scrape_file` functions accept the following optional parameters:
- `ai_extraction`: Accurately extract clean markdown, cropped images, tables, and equations using a fine-tuned vision model. A 20-page PDF takes ~1 minute. (default is `False`)
- `ai_model`: The name of the AI model on your LLM server to use for extraction (default is `openai/gpt-4o-mini`)
- `text_only`: Extract only text content, ideal for models without vision or large documents (default is `False`)
- `local`: Use local processing instead of the API (default is `False`)
Example with advanced options:
```python
# AI extraction takes ~1 minute for a 20-page PDF
chunks = scrape_url("https://arxiv.org/pdf/2201.02177.pdf", ai_extraction=True, text_only=True)
```
LlamaIndex Integration
You can re-chunk the content, then embed it with LlamaIndex:
```python
from thepipe.chunker import chunk_by_page

# Can also use `chunk_by_document`, `chunk_by_section`, `chunk_semantic`
chunks = chunk_by_page(chunks)

# Ready to be indexed by LlamaIndex
llama_docs = [chunk.to_llamaindex() for chunk in chunks]
```