Data Sources

Source Types

The scraping and extraction functions handle each source type differently:

| Source Type | Input pattern | Multimodal | Notes |
| --- | --- | --- | --- |
| PDF | `*.pdf` | ✔️ | Can use `ai_extraction` for clean markdown, images, and tables |
| Word Document | `*.docx` | ✔️ | Scrapes text, tables, and images |
| PowerPoint | `*.pptx` | ✔️ | Scrapes text and images from each slide |
| Image | `*.jpg`, `*.jpeg`, `*.png` | ✔️ | Can extract text using OCR if `text_only` is False |
| Spreadsheet | `*.csv`, `*.xlsx` | | Scrapes each row to a chunk containing JSON (`row_index` added as key) |
| Jupyter Notebook | `*.ipynb` | ✔️ | Scrapes markdown, code, outputs, and images |
| Plain Text | `*.txt` | | Scrapes text content |
| Video | `*.mp4` | ✔️ | Transcribes audio and extracts frames |
| Audio | `*.wav`, `*.mp3` | ✔️ | Transcribes audio content |
| ZIP Archive | `*.zip` | ✔️ | Extracts contents and scrapes each file |
| Web Page | `http://`, `https://` | ✔️ | Scrapes markdown and images; can use `ai_extraction` for better results |
| GitHub Repository | `https://github.com/` | ✔️ | Clones repo and processes files |
| Tweet | `https://twitter.com/`, `https://x.com/` | ✔️ | Extracts tweet text and images |
| YouTube Video | `https://youtube.com/`, `https://www.youtube.com/` | ✔️ | Downloads video, transcribes audio, and extracts a thumbnail |
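
To make the input patterns above concrete, here is a minimal sketch of how a source string could be matched to a source type. It is not the library's own dispatch code: `classify_source`, `EXTENSION_TYPES`, and `HOST_TYPES` are hypothetical names introduced for illustration, and the sketch assumes that any `http://` or `https://` URL not on a listed host counts as a generic web page.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical lookup tables that mirror the source type table above.
EXTENSION_TYPES = {
    ".pdf": "PDF",
    ".docx": "Word Document",
    ".pptx": "PowerPoint",
    ".jpg": "Image",
    ".jpeg": "Image",
    ".png": "Image",
    ".csv": "Spreadsheet",
    ".xlsx": "Spreadsheet",
    ".ipynb": "Jupyter Notebook",
    ".txt": "Plain Text",
    ".mp4": "Video",
    ".wav": "Audio",
    ".mp3": "Audio",
    ".zip": "ZIP Archive",
}

HOST_TYPES = {
    "github.com": "GitHub Repository",
    "twitter.com": "Tweet",
    "x.com": "Tweet",
    "youtube.com": "YouTube Video",
}


def classify_source(source: str) -> str:
    """Return the source type name for a file path or URL (illustrative only)."""
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https"):
        # Normalize the host so that www.youtube.com matches youtube.com.
        host = (parsed.hostname or "").lower()
        if host.startswith("www."):
            host = host[4:]
        # Assumption: URLs on unlisted hosts are treated as generic web pages.
        return HOST_TYPES.get(host, "Web Page")
    # Otherwise treat the input as a file name and match on its extension.
    suffix = PurePosixPath(source).suffix.lower()
    return EXTENSION_TYPES.get(suffix, "Unknown")


if __name__ == "__main__":
    for example in ("report.pdf", "https://github.com/user/repo", "notes.md"):
        print(example, "->", classify_source(example))
```

Checking the URL host before the file extension keeps a link such as a GitHub repository from being misread as a plain file; the real functions may resolve such cases differently.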