Very good idea. I mean, there are frameworks for programmers that do exactly that, like LangChain. But I also end up doing this manually. I use Kobold.cpp, and most of the time I just switch it to Story mode so I get one large notebook / text area. I’ll put in the questions, prompts, and special tokens if it’s an instruct-tuned variant, and start the bullet point list for it. Or I click generate after I’ve already typed in the chapter names or a table of contents. Or I open a code block with the proper markdown. So pretty much what you laid out. It’s super useful for guiding the LLM in the proper direction, or for steering it back on track with a small edit to its output and a subsequent call to generate from there.
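The trick of typing the opening of the desired output yourself (a bullet, a heading, an open code fence) works just as well when calling a model programmatically. A minimal sketch, assuming you build the prompt string yourself before handing it to whatever backend you use (the helper name and the "Answer:" framing are made up for illustration):

```python
# Sketch of output "prefilling": end the prompt with the opening of the
# format you want, so the model continues inside it instead of free-forming.
FENCE = "`" * 3  # markdown code-fence delimiter, built to avoid nesting issues

def build_prefilled_prompt(question: str) -> str:
    """Append an opened markdown code fence so the model's continuation
    lands inside a code block (hypothetical helper, not a Kobold.cpp API)."""
    return f"{question}\n\nAnswer:\n{FENCE}python\n"

prompt = build_prefilled_prompt("Write a function that reverses a string.")
# The prompt now ends mid-structure; send it to your backend of choice
# and the model is strongly biased to emit Python code next.
print(prompt)
```

The same idea covers the other cases from above: end the prompt with `- ` to force a bullet list, or with a chapter heading to get prose for that chapter.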
Kobold.cpp uses llama.cpp under the hood. It just adds a few extras, a web server, and a user interface, plus some backwards compatibility for older model file formats, and it’s relatively easy to install. But the project builds on llama.cpp and uses that same code for inference.