Source Types
The scraping and extraction functions handle each source type differently:
Source Type | Input pattern | Multimodal | Notes |
---|---|---|---|
| | | ✔️ | Can use ai_extraction for clean markdown, images, and tables |
Word Document | *.docx | ✔️ | Scrapes text, tables, and images |
PowerPoint | *.pptx | ✔️ | Scrapes text and images from each slide |
Image | *.jpg, *.jpeg, *.png | ✔️ | Can extract text using OCR if text_only is False |
Spreadsheet | *.csv, *.xlsx | ❌ | Scrapes each row into a chunk containing JSON (with row_index added as a key) |
Jupyter Notebook | *.ipynb | ✔️ | Scrapes markdown, code, outputs, and images |
Plain Text | *.txt | ❌ | Scrapes text content |
Video | *.mp4 | ✔️ | Transcribes audio and extracts frames |
Audio | *.wav, *.mp3 | ✔️ | Transcribes audio content |
ZIP Archive | *.zip | ✔️ | Extracts contents and scrapes each file |
Web Page | http://, https:// | ✔️ | Scrapes markdown and images; can use ai_extraction for better results |
GitHub Repository | https://github.com/ | ✔️ | Clones repo and processes files |
Tweet | https://twitter.com/, https://x.com/ | ✔️ | Extracts tweet text and images |
YouTube Video | https://youtube.com/, https://www.youtube.com/ | ✔️ | Downloads video, transcribes audio, and extracts a thumbnail |
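
The input patterns in the table can be read as a simple dispatch rule: URLs are matched by host, local files by extension. The sketch below illustrates that idea only; `classify_source`, `EXTENSION_TYPES`, and `HOST_TYPES` are hypothetical names invented for this example and are not part of the library's API, and the library's actual detection logic may differ. It covers only the rows whose input pattern is listed in the table.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical lookup tables built from the rows above (assumption, not library code).
EXTENSION_TYPES = {
    ".docx": "Word Document",
    ".pptx": "PowerPoint",
    ".jpg": "Image",
    ".jpeg": "Image",
    ".png": "Image",
    ".csv": "Spreadsheet",
    ".xlsx": "Spreadsheet",
    ".ipynb": "Jupyter Notebook",
    ".txt": "Plain Text",
    ".mp4": "Video",
    ".wav": "Audio",
    ".mp3": "Audio",
    ".zip": "ZIP Archive",
}

HOST_TYPES = {
    "github.com": "GitHub Repository",
    "twitter.com": "Tweet",
    "x.com": "Tweet",
    "youtube.com": "YouTube Video",
    "www.youtube.com": "YouTube Video",
}


def classify_source(source: str) -> str:
    """Return the table's source type for a URL or file path (illustrative only)."""
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https"):
        # Known hosts get a dedicated type; every other URL is a generic web page.
        host = (parsed.hostname or "").lower()
        return HOST_TYPES.get(host, "Web Page")
    # Anything else is treated as a file path and matched on its extension.
    suffix = PurePosixPath(source).suffix.lower()
    return EXTENSION_TYPES.get(suffix, "Unknown")
```

For instance, `classify_source("slides.pptx")` falls through to the extension lookup, while `classify_source("https://x.com/user/status/1")` is resolved by host before the generic web page fallback applies.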