Another great video Mark! Wish you could share some good options for locally hosting large models like Llama 3 70B quantized to 4 bits. I'm curious about the cheapest ways to host these models on my own server. Thank you!
If it's full, unquantized models, I think Hugging Face's inference server is best - github.com/huggingface/text-generation-inference If it's quantized models, the llamafile project is doing some cool work to make a super fast server - github.com/Mozilla-Ocho/llamafile
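For anyone else reading, here's a rough sketch of running a llamafile. A llamafile is a single self-contained executable bundling the model weights and a llama.cpp-based server; the filename below is a placeholder, grab a real one from the repo's releases:

```shell
# Make the downloaded llamafile executable (the name is a placeholder
# for whichever quantized model release you downloaded)
chmod +x model.llamafile

# Running it starts a local chat web UI and server on port 8080 by default
./model.llamafile
```

On Windows the same file runs after renaming it with a `.exe` extension; check `--help` for server flags since they follow llama.cpp's conventions.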
This looks quite slow on a 1.5B model. What time did you get for your model?