Advice - Getting started with LLMs

its_me_xiphos@beehaw.org · 6 months ago

BaroqueInMind@lemmy.one · 6 months ago

Ollama is so fucking slow. Even with a 16-core overclocked Intel CPU, 64 GB of RAM, and an Nvidia 3080 with 10 GB of VRAM, generating a simple haiku with a 22B parameter model takes 20 minutes.

xcjs@programming.dev · 6 months ago

Ok, so using my “older” 2070 Super, I was able to get a response from a 70B parameter model in 9-12 minutes. (Llama 3 in this case.)

I’m fairly certain that you’re using your CPU or having another issue. Would you like to try and debug your configuration together?

BaroqueInMind@lemmy.one · 6 months ago

I think I fucked up my docker setup and will wipe and start over.

xcjs@programming.dev · 6 months ago

Good luck! I’m definitely willing to spend a few minutes offering advice/double checking some configuration settings if things go awry again. Let me know how things go. :-)

BaroqueInMind@lemmy.one · 6 months ago

My setup is Win 11 Pro ➡️ WSL2 / Debian ➡️ Docker Desktop (for Windows)

Should I install the nvidia drivers within Debian even though the host OS already has drivers?

xcjs@programming.dev · 6 months ago

I think there was a special process to get Nvidia working in WSL. Let me check… (I’m running natively on Linux, so my experience doing it with WSL is limited.)

https://docs.nvidia.com/cuda/wsl-user-guide/index.html - I’m sure you’ve followed this already, but according to this guide, it looks like you don’t want to install the Nvidia drivers inside WSL, and only want to install the cuda-toolkit metapackage. I’d follow the instructions from that link closely.

You may also run into performance issues within WSL due to the virtual machine overhead.

BaroqueInMind@lemmy.one · 6 months ago

I did indeed follow that guide already, thank you for the respect; I am an idiot and installed the Nvidia WSL driver on top of the host OS driver, as well as the CUDA driver. So I’ll try again with only that guide and see what breaks.

Zworf@beehaw.org · edit-2 6 months ago

Hmmm weird. I have a 4090 / Ryzen 5800X3D and 64GB and it runs really well. Admittedly it’s the 8B model because the intermediate sizes aren’t out yet and 70B simply won’t fly on a single GPU.

But it really screams. Much faster than I can read. PS: Ollama is just llama.cpp under the hood.

Edit: Ah, wait, I know what’s going wrong here. The 22B parameter model is probably too big for your VRAM. Then it gets extremely slow, yes.

BaroqueInMind@lemmy.one · 6 months ago

What is the appropriate model size for 10 GB of VRAM?

Zworf@beehaw.org · 6 months ago

It depends on your prompt/context size too. The longer the context, the more memory you need. Try checking your GPU’s memory usage with GPU-Z across different models and scenarios.
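For a rough first guess before downloading anything, the weights alone take about (parameter count × bits per weight ÷ 8) bytes. Here is a minimal sketch of that arithmetic. The 4.5 bits per weight figure (a ballpark for a 4-bit quantized model) and the 1.5 GiB allowance for context and runtime buffers are my own assumptions, not measured values, so treat the result as an estimate and confirm it against real usage.

```python
def estimate_weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


def fits_in_vram(params_billion: float, bits_per_weight: float,
                 vram_gib: float, overhead_gib: float = 1.5) -> bool:
    """True if weights plus a guessed overhead fit in the given VRAM.

    overhead_gib is an assumed allowance for context (KV cache) and
    runtime buffers; it grows with prompt/context size.
    """
    needed = estimate_weight_gib(params_billion, bits_per_weight) + overhead_gib
    return needed <= vram_gib


if __name__ == "__main__":
    for size in (8, 22, 70):
        weights = estimate_weight_gib(size, 4.5)
        verdict = "fits" if fits_in_vram(size, 4.5, 10) else "does not fit"
        print(f"{size}B: ~{weights:.1f} GiB of weights, {verdict} in 10 GiB")
```

Under those assumptions an 8B model fits comfortably in 10 GB, while a 22B model needs more than 11 GiB for the weights alone, which would match the slowdown described above.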

xcjs@programming.dev · 6 months ago

No offense intended, but are you sure it’s using your GPU? Twenty minutes is about how long my CPU-locked instance takes to run some 70B parameter models.

On my RTX 3060, I generally get responses in seconds.
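One way to check whether the GPU is actually being used is to poll nvidia-smi while a prompt is generating. This is a small sketch that wraps its CSV query mode; it assumes nvidia-smi is on the PATH wherever you run it, and the dictionary keys are names I made up for illustration.

```python
import shutil
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=name,memory.used,memory.total,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def _to_int(field: str):
    """Convert a numeric field, returning None for values like '[N/A]'."""
    try:
        return int(field)
    except ValueError:
        return None


def parse_gpu_line(line: str) -> dict:
    """Parse one CSV line of nvidia-smi output into a dict."""
    # Split from the right so a comma in the GPU name cannot shift the numbers.
    name, used, total, util = [f.strip() for f in line.rsplit(",", 3)]
    return {
        "name": name,
        "memory_used_mib": _to_int(used),
        "memory_total_mib": _to_int(total),
        "utilization_pct": _to_int(util),
    }


def query_gpus() -> list:
    """Return one dict per GPU, or an empty list if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True)
    if result.returncode != 0:
        return []
    return [parse_gpu_line(l) for l in result.stdout.splitlines() if l.strip()]


if __name__ == "__main__":
    gpus = query_gpus()
    if not gpus:
        print("No GPU visible from this environment.")
    for gpu in gpus:
        print(gpu)
```

If memory used barely moves and utilization stays near zero while the model is answering, it is most likely running on the CPU. Running this inside the container as well as on the host should show whether the GPU is visible there at all.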

kiku123@feddit.de · 6 months ago

I agree. My 3070 runs the 8B Llama 3 model in about 250 ms, especially for short responses.
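Wall-clock times are hard to compare because response lengths differ, so tokens per second is a fairer number. Below is a sketch that computes it from a non-streaming Ollama response. It assumes Ollama is listening on its default local port and that the response carries eval_count and eval_duration (in nanoseconds), which is my understanding of its REST API; the model name in the usage comment is only an example.

```python
import json
import urllib.request


def tokens_per_second(response: dict) -> float:
    """Generation speed from Ollama's eval_count and eval_duration (ns)."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def generate(model: str, prompt: str,
             host: str = "http://localhost:11434") -> dict:
    """Send one non-streaming generate request and return the parsed JSON."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode("utf-8")
    request = urllib.request.Request(
        f"{host}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)


# Example usage (requires a running Ollama server with the model pulled):
#   result = generate("llama3", "Write a haiku about GPUs.")
#   print(f"{tokens_per_second(result):.1f} tokens/s")
```

A model that fits in VRAM should report a much higher figure than the same model spilling over to the CPU, which makes this a quick way to tell the two cases apart.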