Scrape API
To scrape markdown and visuals from URLs or files, you can use the `scrape_url` and `scrape_file` functions. The scraped content is ready to be fed into any LLM, and can be used for training models, structured data extraction, or storage in a vector database.
Basic Usage
To scrape the contents of a URL or a local file, use the `scrape_url` and `scrape_file` functions:
```python
from thepipe.scraper import scrape_url, scrape_file

# Scrape a local file
chunks = scrape_file("example.pdf")

# Or scrape a URL
chunks = scrape_url("https://example.com")
```
Feed into LLM
To feed the scraped content into an LLM, you can convert the chunks into an OpenAI messages format:
```python
from openai import OpenAI
from thepipe.core import chunks_to_messages

# Feed the scraped results to OpenAI
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=chunks_to_messages(chunks),
)
```
Feed into Vector Database
Alternatively, you can re-chunk the content before embedding it into a vector database:
```python
from thepipe.chunker import chunk_by_page

# Can also use `chunk_by_document`, `chunk_by_section`, `chunk_semantic`
chunks = chunk_by_page(chunks)
```
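Once re-chunked, each chunk's text can be embedded and stored. The sketch below shows the general pattern only: the `Chunk` class here is a hypothetical stand-in for thepipe's chunk objects, the `embed` function is a toy placeholder (swap in a real embedding model), and the in-memory list stands in for an actual vector database.

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for a thepipe chunk, for illustration only
@dataclass
class Chunk:
    path: str
    texts: list = field(default_factory=list)

def embed(text: str) -> list:
    # Placeholder embedding: a toy character-frequency vector.
    # Replace with a real embedding model in practice.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

# In-memory "vector database": (id, vector, metadata) records
index = []
chunks = [
    Chunk(path="example.pdf", texts=["Page one text"]),
    Chunk(path="example.pdf", texts=["Page two text"]),
]
for i, chunk in enumerate(chunks):
    text = " ".join(chunk.texts)
    index.append((i, embed(text), {"path": chunk.path, "text": text}))
```

The same loop works with any vector store client: replace the `index.append` call with the store's own upsert method.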
Raw Scraping Results
For maximum flexibility, you can use these chunks directly in your own processing pipeline or storage system. Each chunk contains the following attributes:
```python
for chunk in chunks:
    print(chunk.path)
    print(chunk.texts)
    print(chunk.images)
```
We will be releasing implementations for `chunk.video` and `chunk.audio` in the coming months as fully multimodal updates to GPT-4o roll out.
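For example, chunks can be serialized for storage in your own system. A minimal sketch, again using a hypothetical stand-in class carrying the same `path`, `texts`, and `images` attributes shown above:

```python
import json
from dataclasses import dataclass, field

# Hypothetical stand-in for a thepipe chunk, for illustration only
@dataclass
class Chunk:
    path: str
    texts: list = field(default_factory=list)
    images: list = field(default_factory=list)

def chunk_to_record(chunk: Chunk) -> dict:
    # Keep text and provenance; images could be written out separately
    return {
        "path": chunk.path,
        "texts": chunk.texts,
        "num_images": len(chunk.images),
    }

chunks = [Chunk(path="example.pdf", texts=["Hello world"], images=[])]
records = [chunk_to_record(c) for c in chunks]
payload = json.dumps(records)
```

The resulting JSON can be written to disk, a document store, or a message queue for downstream processing.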
Advanced Options
The `scrape_url` and `scrape_file` functions accept the following optional parameters:
- `ai_extraction`: Accurately extract clean markdown, cropped images, tables, and equations using a fine-tuned vision model. A 20-page PDF takes ~1 minute. (default is `False`)
- `ai_model`: The name of the AI model on your LLM server to use for extraction (default is `openai/gpt-4o-mini`)
- `text_only`: Extract only text content, ideal for models without vision or large documents (default is `False`)
- `local`: Use local processing instead of the API (default is `False`)
Example with advanced options:
```python
# AI extraction takes ~1 minute for a 20-page PDF
chunks = scrape_url("https://arxiv.org/pdf/2201.02177.pdf", ai_extraction=True, text_only=True)
```
LlamaIndex Integration
You can re-chunk the content, then embed it with LlamaIndex:
```python
from thepipe.chunker import chunk_by_page

# Can also use `chunk_by_document`, `chunk_by_section`, `chunk_semantic`
chunks = chunk_by_page(chunks)

# Ready to be indexed by LlamaIndex
llama_docs = [chunk.to_llamaindex() for chunk in chunks]
```