Source Types
The scraping and extraction functions handle each source type differently:
| Source Type | Input pattern | Multimodal | Notes |
|---|---|---|---|
| | | ✔️ | Can use ai_extraction for clean markdown, images, and tables |
| Word Document | *.docx | ✔️ | Scrapes text, tables, and images |
| PowerPoint | *.pptx | ✔️ | Scrapes text and images from each slide |
| Image | *.jpg, *.jpeg, *.png | ✔️ | Can extract text using OCR if text_only is False |
| Spreadsheet | *.csv, *.xlsx | ❌ | Scrapes each row into a chunk containing JSON, with row_index added as a key |
| Jupyter Notebook | *.ipynb | ✔️ | Scrapes markdown, code, outputs, and images |
| Plain Text | *.txt | ❌ | Scrapes text content |
| Video | *.mp4 | ✔️ | Transcribes audio and extracts frames |
| Audio | *.wav, *.mp3 | ✔️ | Transcribes audio content |
| ZIP Archive | *.zip | ✔️ | Extracts contents and scrapes each file |
| Web Page | http://, https:// | ✔️ | Scrapes markdown and images; can use ai_extraction for better results |
| GitHub Repository | https://github.com/ | ✔️ | Clones repo and processes files |
| Tweet | https://twitter.com/, https://x.com/ | ✔️ | Extracts tweet text and images |
| YouTube Video | https://youtube.com/, https://www.youtube.com/ | ✔️ | Downloads video, transcribes audio, and extracts a thumbnail |
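
To illustrate how the input patterns in the table could map a source string to a source type, here is a minimal sketch. It is not the library's own dispatch logic: `classify_source`, `FILE_PATTERNS`, and `URL_HOSTS` are hypothetical names introduced for illustration, and the sketch covers only the rows that name both a source type and an input pattern.

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

# Hypothetical lookup tables that mirror the rows above (assumption: the
# real library may detect source types in a different way).
FILE_PATTERNS = {
    "Word Document": ["*.docx"],
    "PowerPoint": ["*.pptx"],
    "Image": ["*.jpg", "*.jpeg", "*.png"],
    "Spreadsheet": ["*.csv", "*.xlsx"],
    "Jupyter Notebook": ["*.ipynb"],
    "Plain Text": ["*.txt"],
    "Video": ["*.mp4"],
    "Audio": ["*.wav", "*.mp3"],
    "ZIP Archive": ["*.zip"],
}

URL_HOSTS = {
    "GitHub Repository": {"github.com"},
    "Tweet": {"twitter.com", "x.com"},
    "YouTube Video": {"youtube.com", "www.youtube.com"},
}


def classify_source(source: str) -> str:
    """Return the source type name for a file path or URL (illustrative only)."""
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https"):
        host = (parsed.hostname or "").lower()
        # Site-specific hosts take priority over the generic web page case.
        for name, hosts in URL_HOSTS.items():
            if host in hosts:
                return name
        return "Web Page"
    # Lowercase the path so that extensions such as .DOCX still match.
    lowered = source.lower()
    for name, patterns in FILE_PATTERNS.items():
        if any(fnmatch(lowered, pattern) for pattern in patterns):
            return name
    return "Unknown"


if __name__ == "__main__":
    for example in ["report.DOCX", "https://x.com/user/status/1", "https://example.com/page"]:
        print(example, "->", classify_source(example))
```

Checking the specific hosts before falling back to "Web Page" matters here, because every GitHub, Twitter/X, and YouTube URL also matches the generic http:// and https:// patterns.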