Documentation
Platform Quickstart

Platform Quickstart

The thepi.pe platform provides a user-friendly interface for scraping and extracting data from various sources. This guide will walk you through the main features of the platform.

AI Model

All operations on the platform use Google Gemini 2.5 Flash Lite, a state-of-the-art vision-language model optimized for:

  • Fast processing of documents and images
  • High accuracy on complex layouts
  • Support for multiple languages
  • Understanding of charts, tables, and diagrams
  • Excellent performance on structured data extraction

This model is automatically selected for all operations to ensure consistent, high-quality results across all your extractions.

Scraping

Scraping Interface

The scraping interface allows you to extract data from websites, PDFs, and other sources.

  1. Upload files or enter URLs in the designated area.
  2. Choose your scraping options:
    • Input text only: Ignore images and non-text content
    • Output text only: Return only text content in the response
    • AI Extraction: Use AI to analyze layout and extract structured content
  3. Select a chunking method:
    • By Document: Keep entire documents together
    • By Page: Split content page by page
    • By Section: Split by headers and sections
    • By Keywords: Split when specified keywords are found
    • Semantic: AI-powered semantic chunking
  4. Click "Scrape" to start the process.

The scraped data will appear in the table on the right. You can view the full API response by clicking "View API Response" or download all chunks as a ZIP file.

Structured Extraction

Extraction Interface

The extraction interface helps you extract structured data from your scraped content using Google Gemini 2.5 Flash Lite.

  1. Upload files or enter URLs as in the scraping interface.
  2. Define your schema:
    • Add fields and specify their types (string, int, float, bool, date)
    • Use recent schemas from previous extractions
    • Fields are processed in the order you define them
  3. Configure advanced options:
    • Choose a chunking method (each chunk will be processed by the AI model)
    • Enable/disable Text Only and AI Extraction
    • Add custom prompts for specific extraction instructions
    • Configure keyword-based chunking if needed
  4. Click "Extract" to start the process.

Advanced Options

The extracted data will appear in the table on the right. You can download the results as a CSV file.

CSV

Job History

Job History

The job history section in the dashboard shows your recent scraping and extraction jobs. For each job, you can see:

  • Endpoint used (scrape or extract)
  • Source (file or URL)
  • Date and time
  • Tokens used
  • Status code
  • Any errors encountered

Supported File Types

The platform supports a wide variety of file types:

  • Documents: PDF, Word (.docx, .doc), PowerPoint (.pptx, .ppt)
  • Images: JPG, PNG, GIF, and other common formats
  • Spreadsheets: Excel (.xlsx, .xls), CSV
  • Media: Videos (MP4, MOV, AVI), Audio (MP3, WAV)
  • Web: Any public webpage or URL
  • Code: Jupyter notebooks (.ipynb), text files
  • Archives: ZIP files containing supported formats

Chunking Strategies

Choose the appropriate chunking method based on your use case:

  • By Document: Best for small documents or when you want to keep content together
  • By Page: Ideal for PDFs and paginated documents
  • By Section: Good for documents with clear section headers
  • By Keywords: Useful when you want to split on specific terms
  • Semantic: AI-powered chunking that groups related content together

For questions or support:

Email: emmett@thepi.pe