Distillation of Transformer Models

Published: 25 September 2024
on the channel: Trelis Research

➡️ Get Life-time Access to the Complete Scripts (and future improvements): https://Trelis.com/ADVANCED-fine-tuning
➡️ One-click fine-tuning and LLM templates: https://github.com/TrelisResearch/one...
➡️ Newsletter: https://blog.Trelis.com
➡️ Resources/Support/Discord: https://Trelis.com/About
➡️ Thumbnail made with this tutorial: • Fine Tune Flux Diffusion Models with ...

With credit to Rohan Sharma for his work on these scripts during a Trelis internship: https://trelis.com/internships/. Find Rohan on GitHub: https://github.com/rs545837/

Thanks also to Elie Bakouch of HuggingFace for guidance on using the SmolLM corpus: https://huggingface.co/eliebak

VIDEO RESOURCES:
Slides: https://docs.google.com/presentation/...
Minitron Distillation Paper: https://d1qx31qr3h6wln.cloudfront.net...
Distil-Whisper Paper: https://arxiv.org/pdf/2311.00430
SmolLM Corpus: https://huggingface.co/datasets/Huggi...
Trelis SmolLM 2% split: https://huggingface.co/datasets/Treli...
WebInstruct: https://huggingface.co/datasets/TIGER...

TIMESTAMPS:
0:00 AI model distillation (Whisper, Flux, Minitron, gpt-4o-mini?)
0:46 Video Overview - Distillation Tutorial and Code Walk-through
2:00 Distillation Examples (Diffusion - Flux Schnell / Dev, Transcription - Distil-Whisper, LLMs - Nvidia Minitron)
6:51 How distillation works
7:22 Student model initialization
8:36 Layer / depth pruning
11:52 Width pruning
15:25 Pre-training versus distillation
18:40 Cross-entropy loss vs KL-divergence
22:41 Instruction fine-tuning
23:28 Distilling SmolLM 135M to a 99M model
24:43 Code walk-through setup
26:49 Pruning Notebook
28:56 Layer Pruning
31:41 Width Pruning
35:01 Why pruning works
36:17 Distillation Script - Multi-GPU Setup
39:36 Distillation Script Walk-through
54:05 Distillation Configuration File Walk-through
56:32 Distillation Startup and Performance Monitoring with TensorBoard
1:03:01 Instruction fine-tuning and dataset selection
1:09:02 Instruction FT Startup and Performance Monitoring with TensorBoard
1:12:40 Running inference to evaluate distillation performance
1:12:54 Teacher model performance (base SmolLM 135M)
1:13:53 SmolLM Instruct model performance
1:14:15 Raw pruned model performance (layer pruned) 99M
1:14:38 Width + Layer pruning performance (raw) 99M
1:15:18 Distilled model performance (before instruction tuning) 99M
1:15:57 Instruction tuning performance evaluation
1:16:21 SmolLM 135M Instruct performance
1:17:17 Instruction tuned distilled model performance (99M model)
1:18:33 Final Tips (best pruning approach, learning rate, batch size and model size effects)
1:20:21 Video Resources
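
The video contrasts cross-entropy loss with KL-divergence for distillation (18:40): instead of training the student on hard labels, the student matches the teacher's full output distribution. As a rough illustration only (not the actual Trelis scripts), here is a minimal pure-Python sketch of a temperature-scaled KL distillation loss; the function names (`softmax`, `kl_divergence`, `distillation_loss`) and the temperature value are illustrative assumptions:

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax: higher T softens the distribution,
    # exposing more of the teacher's "dark knowledge" across wrong classes.
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q): how far the student distribution q is from the teacher p.
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    # KL between the temperature-softened teacher and student distributions,
    # scaled by T^2 so gradient magnitudes stay comparable across temperatures.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return temperature ** 2 * kl_divergence(p, q)

# Identical logits give zero loss; diverging logits give a positive loss.
print(distillation_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1]))  # ~0.0
print(distillation_loss([2.0, 1.0, 0.1], [0.1, 1.0, 2.0]))  # > 0
```

In practice frameworks compute this over the whole vocabulary per token (e.g. PyTorch's `kl_div` on log-softmax outputs), often blended with the ordinary cross-entropy loss on ground-truth labels.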