Another great video Mark! Wish you could share some good options for locally hosting large models like Llama 3 70B quantized to 4 bits. I'm curious about the cheapest ways to host these models on my own server. Thank you!
If it's full, unquantized models, I think Hugging Face's inference server is best - github.com/huggingface/text-generation-inference If it's quantized models, the llamafile project is doing some cool work to make a super fast server - github.com/Mozilla-Ocho/llamafile
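For anyone else reading, here's a rough sketch of running a llamafile. A llamafile is a single self-contained executable bundling the model weights and a llama.cpp-based server; the filename below is a placeholder, grab a real one from the repo's releases:

```shell
# Make the downloaded llamafile executable (the name is a placeholder
# for whichever quantized model release you downloaded)
chmod +x model.llamafile

# Running it starts a local chat web UI and server on port 8080 by default
./model.llamafile
```

On Windows the same file runs after renaming it with a `.exe` extension; check `--help` for server flags since they follow llama.cpp's conventions.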
This looks quite slow on a 1.5B model. What time did you get for your model?