Source Types

The scraping and extraction functions handle each source type differently:

Source Type	Input pattern	Multimodal	Notes
PDF	*.pdf	✔️	Can use `ai_extraction` for clean markdown, images, and tables
Word Document	*.docx	✔️	Scrapes text, tables, and images
PowerPoint	*.pptx	✔️	Scrapes text and images from each slide
Image	*.jpg, *.jpeg, *.png	✔️	Can extract text using OCR if `text_only` is False
Spreadsheet	*.csv, *.xlsx	❌	Scrapes each row into a chunk containing JSON (`row_index` added as a key)
Jupyter Notebook	*.ipynb	✔️	Scrapes markdown, code, outputs, and images
Plain Text	*.txt	❌	Scrapes text content
Video	*.mp4	✔️	Transcribes audio and extracts frames
Audio	*.wav, *.mp3	✔️	Transcribes audio content
ZIP Archive	*.zip	✔️	Extracts contents and scrapes each file
Web Page	http://, https://	✔️	Scrapes markdown and images; can use `ai_extraction` for better results
GitHub Repository	https://github.com/	✔️	Clones repo and processes files
Tweet	https://twitter.com/, https://x.com/	✔️	Extracts tweet text and images
YouTube Video	https://youtube.com/, https://www.youtube.com/	✔️	Downloads video, transcribes audio, and extracts a thumbnail
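
To make the input patterns in the table concrete, here is a minimal sketch of how a source string might be mapped to one of the source types listed. It is an illustration only and not the library's own dispatch logic: `classify_source`, `EXTENSION_TYPES`, and `HOST_TYPES` are hypothetical names, and the rule that any other http(s) URL counts as a Web Page is an assumption drawn from the table.

```python
import os
from urllib.parse import urlparse

# Hypothetical lookup tables that mirror the table above.
# They are NOT taken from the library's source code.
EXTENSION_TYPES = {
    ".pdf": "PDF",
    ".docx": "Word Document",
    ".pptx": "PowerPoint",
    ".jpg": "Image",
    ".jpeg": "Image",
    ".png": "Image",
    ".csv": "Spreadsheet",
    ".xlsx": "Spreadsheet",
    ".ipynb": "Jupyter Notebook",
    ".txt": "Plain Text",
    ".mp4": "Video",
    ".wav": "Audio",
    ".mp3": "Audio",
    ".zip": "ZIP Archive",
}

HOST_TYPES = {
    "github.com": "GitHub Repository",
    "twitter.com": "Tweet",
    "x.com": "Tweet",
    "youtube.com": "YouTube Video",
    "www.youtube.com": "YouTube Video",
}


def classify_source(source: str) -> str:
    """Return the source type name for a file path or URL (illustrative only)."""
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https"):
        # Assumption: URLs are matched by host first; anything else is a Web Page.
        # This sketch does not inspect file extensions inside URLs.
        host = parsed.hostname or ""
        return HOST_TYPES.get(host, "Web Page")
    # Local files are matched by extension, case-insensitively.
    extension = os.path.splitext(source)[1].lower()
    return EXTENSION_TYPES.get(extension, "Unknown")


if __name__ == "__main__":
    for example in ("report.PDF", "data.xlsx", "https://x.com/user/status/1"):
        print(example, "->", classify_source(example))
```

Checking URL hosts before file extensions keeps a link such as a GitHub repository from being treated as a generic web page; a real implementation may order or refine these checks differently.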