Over the past few months, I have built a distillation toolkit that supports cross-tokenizer distillation (e.g., distilling from a LLaMA teacher into the Qwen vocabulary, among other pairings). This approach has worked well on reasoning datasets like AIME, and we've validated it on models like Phi and Qwen.
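For anyone curious what cross-tokenizer distillation can look like, here's a rough sketch of one common approach: a ULD-style loss that compares the sorted probability mass of teacher and student, so mismatched vocabularies never need a token-level mapping. This is a simplified illustration under my own assumptions, not necessarily the toolkit's actual implementation:

    import torch
    import torch.nn.functional as F

    def cross_tokenizer_kd_loss(teacher_logits, student_logits, k=100):
        # Illustrative ULD-style loss: compare the *sorted* top-k
        # probabilities of teacher and student, so the two models can
        # use completely different vocabularies.
        # Different tokenizers also produce different sequence lengths;
        # a simple (lossy) fix is truncating both to the shorter one.
        min_len = min(teacher_logits.size(1), student_logits.size(1))
        t_probs = F.softmax(teacher_logits[:, :min_len], dim=-1)
        s_probs = F.softmax(student_logits[:, :min_len], dim=-1)
        t_top = t_probs.topk(k, dim=-1).values  # sorted descending
        s_top = s_probs.topk(k, dim=-1).values
        # L1 distance between the sorted probability vectors.
        return (t_top - s_top).abs().sum(dim=-1).mean()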
We’ve also integrated Modal for quick deployment (with $30/month credits to try it out).
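If you haven't used Modal before, a run looks roughly like the sketch below; the app name, entrypoint, and GPU choice are placeholders, not our actual deployment script:

    import modal

    app = modal.App("distill-demo")  # placeholder app name
    image = modal.Image.debian_slim().pip_install("torch", "transformers")

    @app.function(gpu="A10G", image=image, timeout=3600)
    def distill():
        # Kick off a distillation run on a cloud GPU.
        # Body omitted; this only shows the Modal scaffolding.
        ...

    # launched locally with: modal run distill_demo.py::distill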
Would love any feedback!
vikramxD 2 days ago
Cool, are you accepting contributions for adding new models?
shikharM07 2 days ago
This is kinda interesting, but I'm curious: what's the smallest model size I can distill to without compromising accuracy?
agokrani 2 days ago
We can distill a 14B model down to a 4B model with performance improvements on AIME24 and GSM8K. We'll share our results in a detailed blog post later.
vijit-singh 2 days ago
this is very cool. will try it out.