With modern methods, running a larger model split between GPU and CPU can sometimes be fast enough. Here’s an example: https://dev.to/maximsaplin/llamacpp-cpu-vs-gpu-shared-vram-and-inference-speed-3jpl
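If the backend is llama.cpp (or a wrapper like llama-cpp-python), the split comes down to how many layers you offload to the GPU; the rest stay on the CPU and system RAM. Rough sketch, assuming llama-cpp-python built with GPU support and a GGUF file already on disk (the path and layer count are just placeholders):

```python
# Rough sketch: partial GPU offload with llama-cpp-python.
# Assumes a build with GPU support (CUDA/Metal/etc.) and a local GGUF file;
# the path and layer count are placeholders, not recommendations.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/some-model-Q6_K.gguf",  # hypothetical path
    n_gpu_layers=20,  # layers offloaded to VRAM; the rest run on CPU / system RAM
    n_ctx=4096,       # context window
)

out = llm("Write a Python function that reverses a string.", max_tokens=128)
print(out["choices"][0]["text"])
```

Raise n_gpu_layers until you run out of VRAM; whatever doesn’t fit just runs on the CPU.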
fp8 would probably be fine, though the method used to make the quant would greatly influence that.
I don’t know exactly how Ollama works, but I think a more ideal model would be one of these quants:
https://huggingface.co/bartowski/Qwen2.5-Coder-1.5B-Instruct-GGUF
A GGUF model would also allow some overflow into system RAM, if Ollama has that capability like some other inference backends do.
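Something like this would grab one of those quants and load it. The exact filename is a guess based on bartowski’s usual naming, so check the repo’s file list first (needs huggingface_hub and llama-cpp-python installed):

```python
# Sketch: download a single quant from the bartowski repo and load it.
# The filename follows bartowski's usual naming but is an assumption;
# verify it against the repo's file list before running.
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

gguf_path = hf_hub_download(
    repo_id="bartowski/Qwen2.5-Coder-1.5B-Instruct-GGUF",
    filename="Qwen2.5-Coder-1.5B-Instruct-Q6_K.gguf",  # assumed name; check the repo
)

llm = Llama(model_path=gguf_path, n_gpu_layers=-1)  # -1 = offload all layers to the GPU
resp = llm("def fizzbuzz(n):", max_tokens=64)
print(resp["choices"][0]["text"])
```

If you’d rather stay in Ollama, it can also load a local GGUF by pointing a Modelfile’s FROM line at the downloaded file.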
Quantisation tech has improved a lot this past year, making very small quants viable for some uses. I think the general consensus is that an 8-bit quant will be nearly identical to the full model, and a 6-bit quant can feel so close that you may not even notice any loss of quality.
Going smaller than that is where the real trade-off occurs. 2-3 bit quants of much larger models can absolutely surprise you, though they will probably be inconsistent.
So it comes down to the task you’re trying to accomplish. If it’s programming related, go 6-bit and up for consistency, with the largest coding model you can fit. If it’s creative writing or something similar, a much lower quant of a larger model is the way to go in my opinion.
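For a rough sense of “the largest model you can fit”: size is roughly parameter count times bits per weight. Quick back-of-the-envelope sketch (the bits-per-weight numbers are my approximations and vary by quant scheme, and you still need headroom for the KV cache and context):

```python
# Back-of-the-envelope GGUF size estimate: parameters * bits-per-weight / 8.
# Bits-per-weight values are approximate; K-quants carry some overhead, and
# the KV cache / context needs extra memory on top of this.
APPROX_BPW = {"Q2_K": 3.35, "Q4_K_M": 4.85, "Q6_K": 6.6, "Q8_0": 8.5}

def approx_size_gb(params_billion: float, quant: str) -> float:
    """Rough on-disk / in-memory size of the quantised weights in GB."""
    return params_billion * APPROX_BPW[quant] / 8

for quant, bpw in APPROX_BPW.items():
    print(f"32B model at {quant} (~{bpw} bpw): ~{approx_size_gb(32, quant):.0f} GB")
```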
I just bought a 4a. I don’t want a 6-6.5" phone, damnit.
No, you don’t get to touch those.
Set up a security camera? Though they usually get in as eggs attached to corals/plants. Have you put any new plants in during the past few months?
Muscle relaxers?
Could you name the unnecessary regulations specifically?
Oh, that part is. But the splitting tech is built into llama.cpp