Converting Safetensors to GGUF (for use with Llama.cpp)

  • Published: 21 Oct 2024

Comments • 24

  • @zacharycangemi9525
    @zacharycangemi9525 21 days ago

    Hey - before I ask a question: I found your videos on YouTube last week - outstanding content, thank you so much, massive shout out. You helped me get some local AI up and running for work and I just got major shoutouts at work - thank you so much!
    How do we convert an already quantized version of a Llama 8B to a .gguf file? I keep getting a tensor issue!

    • @cognibuild
      @cognibuild  11 days ago

      I found this on GitHub, which might help. You'll need to install and use it in Linux or WSL:
      github.com/kevkid/gguf_gui
      Just be certain to run: pip install llama-cpp-python
      because it's not included in the installation.
      Let me know if it works for you!

  • @gostjoke
    @gostjoke 1 month ago +1

    thanks bro you are my god

    • @cognibuild
      @cognibuild  1 month ago +1

      @@gostjoke I'm not a god. I'm just a dude 😎

    • @gostjoke
      @gostjoke 1 month ago

      @@cognibuild Actually, I want to ask a question. After my safetensors model became GGUF, I tried some questions, but the model's answers seem a lot worse than when it was in safetensors. Do you know why?

    • @cognibuild
      @cognibuild  1 month ago +1

      @@gostjoke A lot of times it has to do with the parameters. Check the ChatML mode / chat template settings.
      Also try running your GGUF files in KoboldCPP (kcpp) to see if they work.
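      If you're running the GGUF with llama-cpp-python, you can also force the ChatML template explicitly - a rough sketch, where the model path is just an example:
      from llama_cpp import Llama
      # chat_format="chatml" wraps the messages in ChatML tags before generation
      llm = Llama(model_path="./model-Q4_K_M.gguf", chat_format="chatml")
      resp = llm.create_chat_completion(
          messages=[{"role": "user", "content": "Say hi in one sentence."}]
      )
      print(resp["choices"][0]["message"]["content"])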

    • @gostjoke
      @gostjoke 1 month ago

      @@cognibuild got it, thanks

  • @robert_nissan
    @robert_nissan 1 month ago +1

    excellent, bro

  • @xspydazx
    @xspydazx 4 months ago +1

    Yes, this is the normal stuff... but you may not realize that you can open a GGUF with the transformers library!
    Hence you can use save_pretrained to unquantize the GGUF file back to safetensors!

    • @cognibuild
      @cognibuild  4 months ago

      How would you unquantize something? The numbers are lost

    • @xspydazx
      @xspydazx 4 months ago

      @@cognibuild They are not, my friend: I thought that too, but the numbers are not lost; the model is just in permanent 4-bit mode.
      So:
      import torch
      from transformers import AutoTokenizer, AutoModelForCausalLM

      # model_id is the HF repo (or local folder); filename is the .gguf file inside it
      tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=filename)
      model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=filename)
      print('Extract and Convert to FP16...')
      model = model.to(torch.float16)  # .to() returns the converted model
      model.save_pretrained("dequantized-model")  # example output folder; writes safetensors
      In this way transformers loads the model as normal (from the 4-bit pretrained file), and then you can save it like normal.
      I was searching for this in the beginning, but when I couldn't find it I gave up.
      But I fed my model all of the Hugging Face docs,
      so when I was talking about GGUF it told me it was possible, and I found it in the docs.

    • @xspydazx
      @xspydazx 4 months ago

      Really, I could not believe it, so I tried it!!! Technically GGUF is just another form of zip,
      but for tensors: it converts the model into a Llama clone, but it remains Mistral inside. Technically it's only a wrapper for a Q4, etc. Yes, the tensor sizes are changed, but the calculation to compress is the same as to decompress... so it can unzip again?
      When you compress the model, i.e. a 7B, it turns into 3.5B? So did it shrink?... But Unsloth uses 4-bit models, so we use quantized LoRAs?
      So there should be no problem once the model is loaded!

    • @cognibuild
      @cognibuild  4 months ago +1

      Right, transferring it back to the format makes more sense. Because if you cut off decimals, those decimals are gone. Which is why you're saying it stayed at the quantized size but is now able to run as safetensors. Cool, man!

    • @xspydazx
      @xspydazx 4 months ago

      ​@@cognibuild I actually discovered it today, bro! So I thought I would share...
      GGUF locks the model for transport... so you can unlock it... but as you say, I think there will be some loss on Q4 and the harsher quantizes. I always train in 4-bit to make sure that when I quantize the model afterwards it's basically the same as it was in training.
      But if I was going to use it for transporting, I would probably do a Q8 or even an FP16 GGUF,
      just to make sure... (This is something quite hidden: you know it can be done, but not the syntax.) When you choose the folder location or repo location, you also need to specify the filename (wow)... or you can even just specify the full path of the filename with the kwarg (wow)...
      (But it's still better to run them with llama_cpp for its speed - on a laptop or PC transformers runs a bit slow, but llama.cpp runs fast. So on a laptop, if you have to use the weights, use pipelines, as they're also much faster for some reason!)
      (Today I actually conquered Stable Audio (local)), with minor adjustments to their code to recompile it for local use and not the repo (quite easy in the end)... now I can do sound generation... I'm still using BLIP-1 for image captioning etc. (to learn the craft). For me, I have been concentrating on getting media IN first; all outputs lead to text, but now sound too (speech and noises)... (Really enjoyable stuff, bro... perhaps you should do a few tutorials...)
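      For the pipeline route, a minimal sketch - it assumes the model and tokenizer were already loaded from the GGUF as in the earlier snippet:
      from transformers import pipeline
      # wrap the already-dequantized model in a text-generation pipeline
      pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
      print(pipe("Hello, my name is", max_new_tokens=32)[0]["generated_text"])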

  • @heliotek1212
    @heliotek1212 11 days ago

    What if I don't have llama.cpp and I wanna run my model in Jan?

  • @cikokid
    @cikokid 1 month ago +1

    What is your PC setup, bro? Share it.

    • @cognibuild
      @cognibuild  1 month ago

      @@cikokid ASUS ProArt X670E motherboard, Ryzen 9 7950X, 128 GB DDR5, and an Nvidia 4090

  • @alwekalanet885
    @alwekalanet885 1 month ago

    I always wonder why the hell people do coding tutorials in a video?

    • @cognibuild
      @cognibuild  1 month ago

      @@alwekalanet885 I didn't know... ask everyone else who appreciates it