API Quickstart
Useful Endpoints
- POST /scrape → stream chunks from files/URLs
- POST /extract → stream structured extraction; ends with { "extraction_complete": true }
- GET /get_tokens_available → remaining tokens
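As a rough sketch of calling the simplest endpoint, the snippet below builds a GET /get_tokens_available request carrying the bearer Authorization header that every request requires. The host is the one used in this guide's curl examples; the helper name `tokens_request` and the placeholder key are illustrative assumptions. The response body format is not documented here, so the sketch stops at building the request.

```python
import urllib.request

# Host taken from the curl examples in this guide.
BASE_URL = "https://thepipe-api.up.railway.app"


def tokens_request(api_key: str) -> urllib.request.Request:
    """Build (but do not send) a GET /get_tokens_available request.

    `tokens_request` is a hypothetical helper name, not part of the API.
    """
    return urllib.request.Request(
        f"{BASE_URL}/get_tokens_available",
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )


req = tokens_request("YOUR_API_KEY")  # placeholder key
print(req.get_method(), req.full_url)
# To send it: urllib.request.urlopen(req).read()
```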
Auth header
Every request uses an Authorization header:
Authorization: Bearer <thepipe-api-key>
Scrape API
Stream chunks (text + optional images) from files and/or URLs.
Endpoint: POST /scrape
Content-Type: multipart/form-data
Auth: Authorization: Bearer <token>
Response: streaming NDJSON (one JSON object per line)
Form fields
- files: one or more uploaded files
- urls: one or more URL strings (can repeat the key)
- text_only_input: true|false (default: false)
- text_only_output: true|false (default: false)
- ai_extraction: true|false (default: false) → lets the scraper use the LLM to enrich extraction when available
- chunking_method: one of chunk_by_document | chunk_by_page | chunk_by_section | chunk_by_keywords
- keywords: JSON array of strings (only used with chunk_by_keywords)
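The form fields above could be assembled on the client roughly as follows. This is a minimal sketch, not an official client: the function name `build_scrape_fields` is hypothetical, and rejecting `keywords` without `chunk_by_keywords` is a client-side choice made here rather than documented server behaviour. Uploaded files are left out because they are attached separately under the `files` key.

```python
import json
from typing import List, Optional, Tuple

CHUNKING_METHODS = {
    "chunk_by_document",
    "chunk_by_page",
    "chunk_by_section",
    "chunk_by_keywords",
}


def build_scrape_fields(
    urls: Optional[List[str]] = None,
    text_only_input: bool = False,
    text_only_output: bool = False,
    ai_extraction: bool = False,
    chunking_method: Optional[str] = None,
    keywords: Optional[List[str]] = None,
) -> List[Tuple[str, str]]:
    """Return (key, value) pairs for the non-file form fields of POST /scrape."""
    if chunking_method is not None and chunking_method not in CHUNKING_METHODS:
        raise ValueError(f"unknown chunking_method: {chunking_method}")
    # Assumption: fail early instead of sending keywords the server would ignore.
    if keywords and chunking_method != "chunk_by_keywords":
        raise ValueError("keywords is only used with chunk_by_keywords")

    fields: List[Tuple[str, str]] = []
    for url in urls or []:
        fields.append(("urls", url))  # the key repeats once per URL
    # Booleans are sent as lowercase strings, as in the curl examples.
    fields.append(("text_only_input", str(text_only_input).lower()))
    fields.append(("text_only_output", str(text_only_output).lower()))
    fields.append(("ai_extraction", str(ai_extraction).lower()))
    if chunking_method is not None:
        fields.append(("chunking_method", chunking_method))
    if keywords:
        fields.append(("keywords", json.dumps(keywords)))  # JSON array of strings
    return fields


fields = build_scrape_fields(
    urls=["https://a.example", "https://b.example"],
    text_only_output=True,
    chunking_method="chunk_by_keywords",
    keywords=["invoice", "total"],
)
```

A list of pairs (rather than a dict) is used so that `urls` can appear more than once; most HTTP clients that support multipart forms accept this shape.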
Curl examples
Single PDF file
curl -N \
-H "Authorization: Bearer $USER_ID" \
-F "files=@/path/to/file.pdf" \
-F "text_only_output=true" \
-F "chunking_method=chunk_by_page" \
https://thepipe-api.up.railway.app/scrape
Sample streamed line
{
"result": {
"chunk_index": 0,
"source": "file.pdf",
"content": [
{ "type": "text", "text": "..." },
{ "type": "image_url", "image_url": { "url": "http://.../images/1.png" } }
]
},
"tokens_used": 123
}
Consume the response line by line: each line is a complete JSON object (the sample above is pretty-printed for readability).
Token usage is tracked and deducted automatically per chunk.
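One way to consume the scrape stream is sketched below: it reads NDJSON lines, collects the text parts of each chunk, and sums `tokens_used`. The field names follow the sample line above; the function name `read_scrape_stream`, the sample data, and the tolerance for blank lines are assumptions made for illustration.

```python
import json
from typing import Iterable, List, Tuple


def read_scrape_stream(lines: Iterable[str]) -> Tuple[List[str], int]:
    """Collect text parts and total tokens from NDJSON lines of POST /scrape."""
    texts: List[str] = []
    total_tokens = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # assumption: skip blank lines if the server sends any
        message = json.loads(line)  # each line is one complete JSON object
        total_tokens += message.get("tokens_used", 0)
        for part in message.get("result", {}).get("content", []):
            if part.get("type") == "text":
                texts.append(part["text"])
    return texts, total_tokens


# Made-up lines shaped like the sample above.
sample = [
    '{"result": {"chunk_index": 0, "source": "file.pdf", '
    '"content": [{"type": "text", "text": "page one"}]}, "tokens_used": 123}',
    '{"result": {"chunk_index": 1, "source": "file.pdf", '
    '"content": [{"type": "text", "text": "page two"}]}, "tokens_used": 77}',
]
texts, total = read_scrape_stream(sample)
print(texts, total)  # ['page one', 'page two'] 200
```

Because the function takes any iterable of strings, it can be fed decoded lines from a streaming HTTP response as they arrive rather than waiting for the full body.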
Extract API
Structured data extraction against a schema from files and/or URLs.
Endpoint: POST /extract
Content-Type: multipart/form-data
Auth: Authorization: Bearer <token>
Response: streaming NDJSON (one JSON object per line) ending with { "extraction_complete": true }
Required fields
- schema: JSON describing fields. Two accepted shapes:
  - Ordered form (recommended):
    { "name": { "type": "string", "order": 0 }, "price": { "type": "number", "order": 1 } }
  - Simple form (backend will assign order automatically):
    { "name": "string", "price": "number" }
- One or both of:
  - files: uploads
  - urls: repeated field
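To make the relationship between the two schema shapes concrete, the sketch below converts the simple form into the ordered form on the client. How the backend actually assigns order is not documented here; this assumes declaration order, and `to_ordered_schema` is a hypothetical helper name.

```python
import json
from typing import Any, Dict


def to_ordered_schema(schema: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Convert a simple-form schema to the ordered form.

    Assumption: "order" follows the position of each key in the input dict.
    Entries already in ordered form keep their explicit "order".
    """
    ordered: Dict[str, Dict[str, Any]] = {}
    for position, (name, spec) in enumerate(schema.items()):
        if isinstance(spec, str):
            ordered[name] = {"type": spec, "order": position}
        else:
            entry = dict(spec)  # copy so the caller's dict is not mutated
            entry.setdefault("order", position)
            ordered[name] = entry
    return ordered


ordered = to_ordered_schema({"name": "string", "price": "number"})
schema_field = json.dumps(ordered)  # string value for the "schema" form field
print(schema_field)
```

Sending the ordered form explicitly keeps the field order under the caller's control, which is presumably why it is the recommended shape.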
Optional fields
- ai_extraction: true|false (default: true is fine)
- text_only_input, text_only_output
- chunking_method: chunk_by_document | chunk_by_page | chunk_by_section | chunk_by_keywords
- multiple_extractions: true|false → return many rows per chunk
- keywords: JSON array (for chunk_by_keywords)
- custom_prompt: extra guidance appended to the default extraction prompt
Curl examples
Multiple PDFs, many rows per chunk
curl -N \
-H "Authorization: Bearer $USER_ID" \
-F 'schema={"item":{"type":"string","order":0},"qty":{"type":"number","order":1}}' \
-F 'multiple_extractions=true' \
-F files=@/a.pdf -F files=@/b.pdf \
-F chunking_method=chunk_by_page \
https://thepipe-api.up.railway.app/extract
Sample streamed line
{
"result": {
"extraction": [{ "name": "Widget", "price": 9.99 }]
},
"tokens_used": 456,
"chunk": { "path": "a.pdf", "text": "..." }
}
The stream ends with:
{ "extraction_complete": true }
Errors
Standard HTTP status codes; body contains { "error": "message" }.
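A client for the extract stream might combine the terminal marker and the error shape as in the sketch below. Several things here are assumptions rather than documented behaviour: that an `{ "error": ... }` object could also appear as a streamed line, that `extraction` may be either a list of rows or a single object, and the names `ExtractError` and `read_extract_stream`.

```python
import json
from typing import Any, Dict, Iterable, List


class ExtractError(RuntimeError):
    """Raised when the stream reports an error or ends prematurely."""


def read_extract_stream(lines: Iterable[str]) -> List[Dict[str, Any]]:
    """Collect extracted rows until { "extraction_complete": true } arrives."""
    rows: List[Dict[str, Any]] = []
    complete = False
    for line in lines:
        line = line.strip()
        if not line:
            continue
        message = json.loads(line)
        if "error" in message:  # assumption: errors may also arrive mid-stream
            raise ExtractError(message["error"])
        if message.get("extraction_complete"):
            complete = True
            break
        extraction = message.get("result", {}).get("extraction", [])
        if isinstance(extraction, list):
            rows.extend(extraction)
        else:
            rows.append(extraction)  # assumption: a single row may be an object
    if not complete:
        raise ExtractError("stream ended before extraction_complete")
    return rows


# Made-up lines shaped like the sample above.
sample = [
    '{"result": {"extraction": [{"name": "Widget", "price": 9.99}]}, '
    '"tokens_used": 456, "chunk": {"path": "a.pdf", "text": "..."}}',
    '{"extraction_complete": true}',
]
rows = read_extract_stream(sample)
print(rows)  # [{'name': 'Widget', 'price': 9.99}]
```

Treating a missing terminal marker as a failure lets the caller distinguish a finished extraction from a dropped connection.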