Fine-tune Multi-modal Video + Text Models

Published: 17 May 2024
Channel: Trelis Research

➡️ Get Lifetime Access to the Complete Scripts (and future improvements): https://Trelis.com/ADVANCED-vision/
➡️ One-click fine-tuning and LLM templates: https://github.com/TrelisResearch/one...
➡️ Trelis Livestreams: Thursdays 5 pm Irish time on YouTube and X.
➡️ Newsletter: https://blog.Trelis.com
➡️ Resources/Support/Discord: https://Trelis.com/About

VIDEO RESOURCES:
IDEFICS 2 Chatty Model: https://huggingface.co/Trelis/idefics...
IDEFICS 2 Blog: https://huggingface.co/blog/idefics2

TIMESTAMPS:
0:00 "Video + Text" from "Image + Text" models
0:25 Clipping and querying videos with an IDEFICS 2 endpoint
4:57 Fine-tuning video + text models
6:01 Dataset generation for video fine-tuning + pushing to hub
11:42 Clipping and querying videos with image splitting in a Jupyter Notebook
15:54 Side note: IDEFICS 2 vision-to-text adapter architecture
18:23 Video clip notebook evaluation - continued
19:40 Loading a video dataset for fine-tuning
20:55 Recap of video + text model fine-tuning