API Quickstart

Useful Endpoints

  • POST /scrape → stream chunks from files/URLs
  • POST /extract → stream structured extraction; ends with { "extraction_complete": true }
  • GET /get_tokens_available → remaining tokens

Auth header

Every request uses an Authorization header:

Authorization: Bearer <thepipe-api-key>
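
A minimal Python sketch (using the requests library) of the same header on a call to GET /get_tokens_available. The environment variable name and the exact shape of the response body are assumptions; only the endpoint and header are documented here.

import os
import requests

API_BASE = "https://thepipe-api.up.railway.app"
# Assumed: the API key is kept in an environment variable.
HEADERS = {"Authorization": f"Bearer {os.environ['THEPIPE_API_KEY']}"}

resp = requests.get(f"{API_BASE}/get_tokens_available", headers=HEADERS)
resp.raise_for_status()
# The response body shape is not specified above; inspect it to confirm.
print(resp.json())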

Scrape API

Stream chunks (text + optional images) from files and/or URLs.

Endpoint: POST /scrape
Content-Type: multipart/form-data
Auth: Authorization: Bearer <token>
Response: streaming NDJSON (one JSON object per line)

Form fields

  • files: one or more uploaded files
  • urls: one or more URL strings (can repeat the key)
  • text_only_input: true|false (default: false)
  • text_only_output: true|false (default: false)
  • ai_extraction: true|false (default: false) → allow the scraper to use an LLM to enrich extraction when available
  • chunking_method: one of chunk_by_document | chunk_by_page | chunk_by_section | chunk_by_keywords
  • keywords: JSON array of strings (only used with chunk_by_keywords)

Curl examples

Single PDF file

curl -N \
  -H "Authorization: Bearer $USER_ID" \
  -F "files=@/path/to/file.pdf" \
  -F "text_only_output=true" \
  -F "chunking_method=chunk_by_page" \
  https://thepipe-api.up.railway.app/scrape

Sample streamed line

{
  "result": {
    "chunk_index": 0,
    "source": "file.pdf",
    "content": [
      { "type": "text", "text": "..." },
      { "type": "image_url", "image_url": { "url": "http://.../images/1.png" } }
    ]
  },
  "tokens_used": 123
}

Consume as an event stream (each line is a complete JSON object).

Token usage is tracked and deducted automatically per chunk.
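
A hedged Python sketch of this streaming pattern: it posts two placeholder URLs with chunk_by_keywords and parses each NDJSON line as it arrives. The URLs, keyword values, and environment variable name are illustrative assumptions; the response fields follow the sample above.

import json
import os
import requests

API_BASE = "https://thepipe-api.up.railway.app"
HEADERS = {"Authorization": f"Bearer {os.environ['THEPIPE_API_KEY']}"}

# Repeat the "urls" key for multiple URLs; the (None, value) tuples send
# plain multipart/form-data fields without attaching a file.
form = [
    ("urls", (None, "https://example.com/report")),   # placeholder URL
    ("urls", (None, "https://example.com/pricing")),  # placeholder URL
    ("chunking_method", (None, "chunk_by_keywords")),
    ("keywords", (None, json.dumps(["revenue", "pricing"]))),
    ("text_only_output", (None, "true")),
]

total_tokens = 0
with requests.post(f"{API_BASE}/scrape", headers=HEADERS, files=form, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line:
            continue  # skip blank keep-alive lines
        obj = json.loads(line)
        total_tokens += obj.get("tokens_used", 0)
        for part in obj.get("result", {}).get("content", []):
            if part["type"] == "text":
                print(part["text"][:80])

print("tokens used:", total_tokens)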

Extract API

Structured data extraction against a schema from files and/or URLs.

Endpoint: POST /extract
Content-Type: multipart/form-data
Auth: Authorization: Bearer <token>
Response: streaming NDJSON (one JSON object per line) ending with { "extraction_complete": true }

Required fields

  • schema: JSON describing fields (see the Python sketch after this list). Two accepted shapes:

    1. Ordered form (recommended):
      {
        "name": { "type": "string", "order": 0 },
        "price": { "type": "number", "order": 1 }
      }
    2. Simple form (backend will assign order automatically):
      { "name": "string", "price": "number" }
  • One or both of:

    • files: uploads
    • urls: repeated field
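
For illustration, the ordered form above can be built as a plain dict in Python and serialized with json.dumps before being sent as the schema form field:

import json

# Ordered form of the schema from the example above; "order" controls column order.
schema = {
    "name":  {"type": "string", "order": 0},
    "price": {"type": "number", "order": 1},
}
schema_field = json.dumps(schema)  # pass this string as the "schema" multipart field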

Optional fields

  • ai_extraction: true|false (default: true)
  • text_only_input, text_only_output
  • chunking_method: chunk_by_document | chunk_by_page | chunk_by_section | chunk_by_keywords
  • multiple_extractions: true|false → allow multiple extraction rows per chunk
  • keywords: JSON array (for chunk_by_keywords)
  • custom_prompt: extra guidance appended to the default extraction prompt

Curl examples

Multiple PDFs, many rows per chunk

curl -N \
  -H "Authorization: Bearer $USER_ID" \
  -F 'schema={"item":{"type":"string","order":0},"qty":{"type":"number","order":1}}' \
  -F 'multiple_extractions=true' \
  -F files=@/a.pdf -F files=@/b.pdf \
  -F chunking_method=chunk_by_page \
  https://thepipe-api.up.railway.app/extract

Sample streamed line

{
  "result": {
    "extraction": [{ "name": "Widget", "price": 9.99 }]
  },
  "tokens_used": 456,
  "chunk": { "path": "a.pdf", "text": "..." }
}

The stream ends with:

{ "extraction_complete": true }

Errors

Errors are returned with standard HTTP status codes; the response body contains { "error": "message" }.
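
A small Python sketch of handling a failed request under that convention (the invalid key is only for illustration):

import requests

resp = requests.get(
    "https://thepipe-api.up.railway.app/get_tokens_available",
    headers={"Authorization": "Bearer not-a-real-key"},
)
if not resp.ok:
    # Error bodies follow { "error": "message" }; the status code gives the HTTP error class.
    print(resp.status_code, resp.json().get("error"))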