Full Fine-tuning with Fewer GPUs - GaLore, Optimizer Tricks, Adafactor

Published: 17 September 2024
Channel: Trelis Research

➡️ Get Life-time Access to the Complete Scripts (and future improvements): https://Trelis.com/ADVANCED-fine-tuning
➡️ One-click fine-tuning and LLM templates: https://github.com/TrelisResearch/one...
➡️ Newsletter: https://blog.Trelis.com
➡️ Resources/Support/Discord: https://Trelis.com/About

VIDEO RESOURCES:
Slides: https://docs.google.com/presentation/...
Original GaLore repo: https://github.com/jiaweizzhao/GaLore
GaLore with Subspace Descent: https://github.com/kyleliang919/Onlin...

TIMESTAMPS:
0:00 LLM Full fine-tuning with lower VRAM
0:37 Video Overview
4:02 Understanding Optimizers
6:17 Stochastic Gradient Descent (SGD)
7:53 AdamW Optimizer and VRAM requirements
9:31 AdamW 8-bit optimizer
11:03 Adafactor optimizer and memory requirements
14:28 GaLore - reducing gradient and optimizer VRAM
19:10 LoRA versus GaLore
19:49 Better and Faster GaLore via Subspace Descent
22:59 Layerwise gradient updates
26:17 Training Scripts (see the sketches below)
27:10 How gradient checkpointing works to reduce memory
40:30 AdamW Performance
41:14 AdamW 8-bit Performance
42:45 Adafactor with manual learning rate and schedule
44:10 Adafactor with default/auto learning rate
45:47 GaLore AdamW
50:22 GaLore AdamW with Subspace Descent
52:25 Using AdamW 8-bit and Adafactor with GaLore
53:14 Notebook demo of layerwise gradient updates
55:28 Running with LoRA
58:36 Inferencing and Pushing Models to Hub
1:00:00 Single GPU Recommendations
1:25:00 Multi-GPU Recommendations
1:03:25 Resources
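
EXAMPLE SKETCHES (not the Trelis scripts):

The complete training scripts are behind the Trelis.com link above, but as a rough illustration of the optimizer options compared in the video (AdamW, AdamW 8-bit, Adafactor, GaLore), here is a minimal sketch using the Hugging Face Trainer. It assumes transformers >= 4.39 (which added GaLore support) plus the galore-torch and bitsandbytes packages; the model id, target-module regexes, and GaLore hyperparameters are placeholders, not values from the video.

```python
# Minimal sketch: choosing a low-VRAM optimizer via Hugging Face TrainingArguments.
# Assumes transformers >= 4.39, galore-torch, and bitsandbytes are installed.
from transformers import AutoModelForCausalLM, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")  # placeholder model id
model.gradient_checkpointing_enable()  # recompute activations in backward to save VRAM

args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=1e-5,
    bf16=True,
    # One of the optimizers compared in the video:
    #   "adamw_torch"    - full-precision optimizer states (two extra tensors per weight)
    #   "adamw_bnb_8bit" - 8-bit optimizer states via bitsandbytes
    #   "adafactor"      - factored second moment, much smaller optimizer state
    #   "galore_adamw"   - AdamW on low-rank projected gradients (GaLore)
    optim="galore_adamw",
    # GaLore projects gradients only for 2-D weights whose names match these regexes:
    optim_target_modules=["attn", "mlp"],
    # Projection rank, refresh interval, and scale (illustrative values):
    optim_args="rank=128, update_proj_gap=200, scale=0.25",
)
# Trainer(model=model, args=args, train_dataset=...) would then run training as usual.
```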
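
The layerwise gradient-update trick (22:59 and 53:14) can be sketched in a similar spirit: give each parameter its own optimizer and step it from a post-accumulate-grad hook, so full gradients for every layer never need to sit in memory at once. This is an illustrative sketch, assuming PyTorch >= 2.1 (for register_post_accumulate_grad_hook) and the galore-torch package from the repo linked above; the rank, scale, and update-gap values are placeholders.

```python
# Illustrative sketch of layerwise gradient updates with GaLore optimizers.
# Assumes PyTorch >= 2.1 and the galore-torch package; values are placeholders.
from galore_torch import GaLoreAdamW

def attach_layerwise_galore(model, lr=1e-5, rank=128):
    # One optimizer per trainable parameter.
    optimizers = {}
    for p in model.parameters():
        if not p.requires_grad:
            continue
        group = {"params": [p]}
        if p.dim() == 2:  # GaLore's low-rank projection only applies to 2-D weight matrices
            group.update({"rank": rank, "update_proj_gap": 200, "scale": 0.25, "proj_type": "std"})
        optimizers[p] = GaLoreAdamW([group], lr=lr)

    def step_and_clear(param):
        # Runs as soon as this parameter's gradient is accumulated in backward,
        # then frees the gradient so it never accumulates across all layers.
        optimizers[param].step()
        optimizers[param].zero_grad()

    for p in optimizers:
        p.register_post_accumulate_grad_hook(step_and_clear)

# With the hooks attached, the training loop only calls loss.backward();
# there is no global optimizer.step(), and this pattern is not compatible
# with gradient accumulation across micro-batches.
```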