Full-sized LLMs are fookin' massive, though. There's a good analysis of GLM-130B here:
https://fullstackdeeplearning.com/blog/posts/running-llm-glm-130b/
...so 320GB VRAM needed if you're running with GPUs for a full model. There's simply no way to do that without seriously hefty iron.
https://www.city.ac.uk/about/schools/law/academic-programmes/llm-master-of-laws
https://www.sussex.ac.uk/study/masters/courses/information-technology-and-intellectual-property-law-llm
Or
https://theamericangenius.com/tech-news/large-language-models/
https://www.london.ac.uk/specialisation-computer-and-communications-law
I then spent the better part of a week trying to get it working, fighting with Intel's awful documentation and buggy code. The end goal was to run Mistral 0.2, because that seems like the absolute best model around at the moment (at least, the best one that can run on consumer hardware). I got it working with the basic worker, but it was ssslllooooww - less than ChatGPT speed. Intel's version of vLLM (the speed-optimised worker) just farted out errors when trying to load Mistral, although it kinda worked with less-capable models.
The next attempt was buying a pair of Nvidia Tesla P100 16GB cards - older cards, but plenty of RAM, still fast, and everything's built for Nvidia CUDA these days. Unfortunately, what turned up were the 12GB cards, but no matter...I ploughed on, printed a fan shroud for them, and...from there, it was just an exercise in pain. Even with the most optimised worker, it was only slightly faster than ChatGPT, and it kept bombing out when I was using it to convert code of any useful length. So...I've sent those cards back, on the grounds that they were supposed to be 16GB cards anyway.
Tried a few different models back on the Arc card, and it really does seem to work, but there just aren't any models that compete with Mistral, and I hate the fact that I couldn't get it running with vLLM. The thing that's been burning up my brain and costing me sleep is the fact that Intel's code is just broken. That shouldn't happen, especially when AI is a big part of their strategy going forward.
Now, their AI code is open source. It's all Python, which I don't know, but it's kinda close enough to Ruby that I can understand it to a point. So I poked around their code repository, and noticed that none of the vLLM code has really been updated lately. That seems...odd.
I just spent three hours this evening tracing the errors, experimenting with their code, and figuring out how this AI thing works (after figuring out how this Python thing works).
Long story short, I just fixed Intel's code for them. Needs tidying up, but I'll have a go at that in the morning and chuck them a pull request.
This is how fast a good AI model can run on a cheap £299 GPU that everybody treats like the dumb kid in class:
Yeah, that'll do. Bargain.
https://llm.mlc.ai/ is also good.
If you choose a compiled model (MLC) or a quantized model (the others), it isn't that slow on local hardware without a massive GPU.
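The back-of-envelope maths on why quantization matters is straightforward - weights dominate the memory bill, so cutting bits-per-weight cuts VRAM almost linearly. A rough sketch (weights only; KV cache and activations add overhead on top, which is roughly where GLM-130B's ~320GB figure comes from):

```python
# Rough VRAM estimate for model weights alone - ignores KV cache
# and activation overhead, so real requirements are higher.

def weight_vram_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate decimal GB needed just to hold the weights."""
    bytes_total = n_params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# GLM-130B at fp16: ~260 GB of weights before any overhead
print(round(weight_vram_gb(130, 16)))    # 260

# Mistral 7B quantized to 4-bit: ~3.5 GB - fits a 16GB Arc easily
print(round(weight_vram_gb(7, 4), 1))    # 3.5
```

Same model, a quarter of the bits, a quarter of the memory - which is the whole reason any of this runs on consumer cards at all.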
Now that I've got the Intel stuff working, though, and optimised it a bit...jeebus, it's fast. I'm running Mistral at well over 75 tokens/s now, and Mixtral 2x7b AWQ at 40 tokens/s. That's approaching RTX 3090 territory, for a fraction of the price.
If anybody else has an Intel Arc card, I'd be happy to share my Docker setup (I might even stick it on GitHub, if I can be bothered). The trick is that it needs the absolute latest Intel drivers (which aren't easily found). Installing those more than quadrupled the performance.
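Until I get round to GitHub, the shape of it is roughly this - image name and mount path here are placeholders, not my actual setup. The important bit is passing the GPU render nodes in /dev/dri through to the container, which is the standard way to expose an Intel GPU to Docker:

```shell
# Sketch only - "my-ipex-llm-image" and the mount are placeholders.
# --device /dev/dri passes the Intel GPU render nodes into the container.
docker run -it --rm \
  --device /dev/dri \
  -v "$HOME/models:/models" \
  my-ipex-llm-image
```

You still need the up-to-date Intel drivers on the host for this to be worth anything - the container doesn't save you from that.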
This is actually relevant to the forum - one of the projects I've got in mind is processing the entire history of the classifieds through it, to extract a make, model and price for every item that's been listed. The idea is that eventually I'll be able to implement proper listings and item search.
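The extraction side of that is the easy bit to sketch. Assuming the model is asked for a JSON-only reply (prompt wording and keys below are my guesses, not a finished pipeline), the fiddly part is parsing the reply defensively, because models wrap JSON in chatter:

```python
import json

# Hypothetical sketch of the classifieds-extraction step: ask a local
# model for JSON, then pull it out of the reply defensively.

PROMPT_TEMPLATE = (
    "Extract the make, model and asking price from this classified ad. "
    "Reply with JSON only, using keys make, model, price:\n\n{ad}"
)

def build_prompt(ad_text: str) -> str:
    return PROMPT_TEMPLATE.format(ad=ad_text)

def parse_reply(reply: str):
    """Return the first JSON object in a model reply, or None on failure."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end == -1:
        return None
    try:
        data = json.loads(reply[start:end + 1])
    except json.JSONDecodeError:
        return None
    # Reject replies that are valid JSON but missing the fields we need.
    if not isinstance(data, dict) or not {"make", "model", "price"} <= data.keys():
        return None
    return data
```

Run every historic listing through build_prompt, feed the reply to parse_reply, and anything that comes back None goes in a retry pile - that failure path matters more than the happy path when you're batch-processing years of ads.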
In fact, the lack of available knowledge for the Intel cards actually kinda made it enjoyable. It's quite nice not following the pack and going Nvidia (to be honest, they'll never have another penny from me).
Really must get round to sending them back...
https://thenewstack.io/how-to-run-a-local-llm-via-localai-an-open-source-project/
I'm currently on Ollama, and it works well...until you shove another GPU into the system. I'm getting 40 tokens/s with Llama 3 q6 using a single GPU, and 8 tokens/s using two GPUs. I don't expect it to be faster, but I also don't expect it to be 80% slower.
Even more annoying is that I'd be perfectly happy running just the one GPU in there, given how good Llama 3 is, and then shove the second A770 into my desktop. However, there's no way to control the fans on Linux, and they're really annoying in their default state. By comparison, my 8GB RX 6600XT is absolutely silent.
Thankfully, Intel are pretty responsive on GitHub, so hopefully I'll find a resolution to the Ollama problem soon. I have no confidence that they'll sort the desktop fan issue out, mainly because their driver team are much less responsive than the IPEX team.
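In the meantime, one thing worth trying is oneAPI's device selector, which can restrict a SYCL application to a single GPU. Whether Ollama's Intel build actually honours it in this setup is an assumption on my part - treat this as an untested guess:

```shell
# Untested sketch: pin SYCL/oneAPI workloads to the first Level Zero GPU
# so the second card is ignored entirely.
export ONEAPI_DEVICE_SELECTOR=level_zero:0
ollama run llama3
```

If it works, it'd at least confirm whether the slowdown is the multi-GPU splitting itself rather than something else in the stack.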