Has anyone tried running an LLM locally?

Looks like most of the models need chonking GPUs, but does anyone know if there are any small enough to mess about with locally? @digitalscream maybe?
ဈǝᴉʇsɐoʇǝsǝǝɥɔဪቌ

Comments

  • digitalscream Frets: 26608
    Haven't really tried, to be honest. I was considering it - there are USB machine-learning accelerators you can get which are designed to handle the matrix manipulation for use with SBCs like the Pi...maybe look at those?

    Comprehensive LLMs are fookin' massive, though. There's a good analysis of GLM-130b here:

    https://fullstackdeeplearning.com/blog/posts/running-llm-glm-130b/

    ...so 320GB VRAM needed if you're running with GPUs for a full model. There's simply no way to do that without seriously hefty iron.
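
    (For anyone who wants the back-of-the-envelope version of that figure, it's basically just parameter count times bytes per weight - a rough sketch, ignoring activations and other overhead:)

        # GLM-130B: ~130 billion weights at 2 bytes each (fp16/bf16)
        params = 130e9
        bytes_per_weight = 2
        weights_gb = params * bytes_per_weight / 1e9
        print(f"~{weights_gb:.0f} GB for the weights alone")  # ~260 GB, before overhead pushes it towards that 320 GB figure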
    <space for hire>
  • digitalscream Frets: 26608
    To be fair, I got the context from the "GPUs" part. And the fact that nobody in their right mind would tag me in a question about law qualifications :D
    <space for hire>
  • PolarityMan Frets: 7288
    I've found some references to people replacing the datatype used to store the weights to shrink the models a bit, but it still looks like you need a lot of VRAM.
    ဈǝᴉʇsɐoʇǝsǝǝɥɔဪቌ
  • digitalscream Frets: 26608
    PolarityMan said:
    I've found some references to people replacing the datatype used to store the weights to shrink the models a bit, but it still looks like you need a lot of VRAM.
    To be fair, the article does say that the VRAM estimate is for two bytes per weight - even if you cut that in half, you're still looking at a shitload. The only other way to do it is compression, but then you're looking at streaming and decompressing small chunks on the fly, which will hurt performance and you still won't get it down to a single GPU. Maybe with three or four GPU accelerators with 24GB+ you could make that work, but still...that's a budget way beyond the reach of a home lab.
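
    (For the curious, that "replacing the datatype" trick is quantisation - a minimal sketch of the idea in numpy, with one scale per tensor; the real libraries are a lot cleverer than this, so treat it as illustrative only:)

        import numpy as np

        # pretend this is one fp16 weight matrix from the model
        w = np.random.randn(4096, 4096).astype(np.float16)

        # quantise: keep int8 weights plus a single floating-point scale
        scale = float(np.abs(w).max()) / 127.0
        w_int8 = np.round(w / scale).astype(np.int8)

        # dequantise on the fly whenever you actually need to multiply
        w_restored = w_int8.astype(np.float16) * scale

        print(w.nbytes // 2**20, "MiB at fp16 ->", w_int8.nbytes // 2**20, "MiB at int8")  # roughly half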
    <space for hire>
  • digitalscream Frets: 26608
    edited April 6
    @PolarityMan Just bumping this one. I've spent the last couple of weeks farting around with local AI models. I initially bought an Intel Arc A770 16GB GPU, thinking it'd be the cheapest way in; 16GB for £299 (at Currys) seems like a bargain.

    I then spent the better part of a week trying to get it working, fighting with Intel's awful documentation and buggy code. The end goal was to run Mistral 0.2, because that seems like the absolute best model around at the moment (at least, the best one that can run on consumer hardware). I got it working with the basic worker, but it was ssslllooooww - less than ChatGPT speed. Intel's version of vLLM (the speed-optimised worker) just farted out errors when trying to load Mistral, although it kinda worked with less-capable models.
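
    (For context, the vLLM side of that is only a handful of lines - roughly what loading Mistral looks like with the standard upstream API, which Intel's fork is supposed to mirror. The model name here is just the Hugging Face repo, included for illustration:)

        # plain upstream vLLM usage - Intel's fork is meant to expose the same API
        from vllm import LLM, SamplingParams

        llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")
        params = SamplingParams(temperature=0.7, max_tokens=256)
        outputs = llm.generate(["Summarise how a valve amp works."], params)
        print(outputs[0].outputs[0].text)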

    Next attempt was buying a pair of Nvidia Tesla P100 16GB cards - older cards, but plenty of RAM and still fast, and everything's built for Nvidia CUDA these days. Unfortunately, what turned up were the 12GB cards, but no matter...I ploughed on, printed a fan shroud for them, and...from there, it was just an exercise in pain. Even with the most optimised worker, it was only slightly faster than ChatGPT, and it kept bombing out when I was using it to convert code of any useful length. So...I've sent those cards back, on the grounds that they were supposed to be 16GB cards anyway.

    Tried a few different models back on the Arc card, and it really does seem to work but there just aren't any models that compete with Mistral, and I hate the fact that I couldn't get it running with vLLM. The thing that's been burning up my brain and costing me sleep is the fact that Intel's code is just broken. That shouldn't happen, especially when AI is a big part of their strategy going forward.

    Now, their AI code is open source. It's all Python, which I don't know, but it's kinda close enough to Ruby that I can understand it to a point. So I poked around their code repository, and noticed that none of the vLLM code has really been updated lately. That seems...odd.

    I just spent three hours this evening tracing the errors, experimenting with their code, and figuring out how this AI thing works (after figuring out how this Python thing works).

    Long story short, I just fixed Intel's code for them. Needs tidying up, but I'll have a go at that in the morning and chuck them a pull request.

    This is how fast a good AI model can run on a cheap £299 GPU that everybody treats like the dumb kid in class:
    [embedded demo of the generation speed]
    Yeah, that'll do. Bargain.
    <space for hire>
  • Matt_McG Frets: 323
    You can use Ollama or llama.cpp, which make running models locally quite easy. MLC (https://llm.mlc.ai/) is also good.

    If you choose a compiled model (MLC) or a quantized model (the others), it isn't that slow on local hardware without a massive GPU.
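
    As a rough example of the quantized route (llama.cpp via its Python bindings - the GGUF filename is a placeholder for whichever quant you download):

        # llama-cpp-python running a 4-bit quantized GGUF model on modest hardware
        from llama_cpp import Llama

        llm = Llama(model_path="mistral-7b-instruct-v0.2.Q4_K_M.gguf")  # placeholder filename
        out = llm("Q: What is a humbucker? A:", max_tokens=64)
        print(out["choices"][0]["text"])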
  • digitalscream Frets: 26608
    edited April 7
    Matt_McG said:
    You can use Ollama or llama.cpp, which make running models locally quite easy. MLC (https://llm.mlc.ai/) is also good.

    If you choose a compiled model (MLC) or a quantized model (the others), it isn't that slow on local hardware without a massive GPU.
    Maybe...I have to say, I was going to go for MLC, but the limited model options were a bit of a sticking point, and their documentation is just as obtuse as Intel's.

    Now that I've got the Intel stuff working, though, and optimised it a bit...jeebus, it's fast. I'm running Mistral at well over 75 tokens/s now, and Mixtral 2x7b AWQ at 40 tokens/s. That's approaching RTX 3090 territory, for a fraction of the price.

    If anybody else has an Intel Arc card, I'd be happy to share my Docker setup (I might even stick it on GitHub, if I can be bothered). The trick is that it needs the absolute latest Intel drivers (which aren't easily found). Installing those more than quadrupled the performance.
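
    (If anyone does try it, a quick sanity check that the drivers are actually being picked up - just PyTorch plus Intel's extension, nothing clever, and assuming your install matches mine:)

        # check the Arc card is visible to PyTorch through Intel's XPU backend
        import torch
        import intel_extension_for_pytorch as ipex  # noqa: F401 - importing this registers the xpu device

        print(torch.xpu.is_available())       # should be True once the drivers are right
        print(torch.xpu.get_device_name(0))   # should name the Arc A770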

    This is actually relevant to the forum - one of the projects I've got in mind is processing the entire history of the classifieds through it, to extract a make, model and price for every item that's been listed. The idea is that eventually I'll be able to implement proper listings and item search.
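
    (The classifieds job is basically just structured extraction - something along these lines against a local OpenAI-compatible endpoint; the URL, port, model name and example ad below are placeholders:)

        import json
        import requests

        AD_TEXT = "FS: Fender Player Stratocaster, sunburst, great condition, £450 posted"

        prompt = ("Extract the make, model and asking price from this classified ad. "
                  "Reply with JSON only, using the keys make, model and price.\n\n" + AD_TEXT)

        # placeholder URL/model - whatever the local worker exposes
        resp = requests.post(
            "http://localhost:8000/v1/chat/completions",
            json={"model": "mistral", "messages": [{"role": "user", "content": prompt}]},
        )
        listing = json.loads(resp.json()["choices"][0]["message"]["content"])
        print(listing)  # e.g. {"make": "Fender", "model": "Player Stratocaster", "price": 450}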
    <space for hire>
  • PolarityMan Frets: 7288
    I kinda paused this for a while as I now have access to LLMs via work, running on a big server farm, which is nice. May well come back to it at some point though.
    ဈǝᴉʇsɐoʇǝsǝǝɥɔဪቌ
    0reaction image LOL 0reaction image Wow! 0reaction image Wisdom
  • guitartango Frets: 1020
    PolarityMan said:
    I kinda paused this for a while as I now have access to LLMs via work, running on a big server farm, which is nice. May well come back to it at some point though.
    We run this sort of data at work; currently we can run around 10 GPUs per server, and it still takes time to process the data. I have seen some jobs take over three weeks to process. Can't see many people using Windows for this :)
    “Ken sent me.”
  • digitalscream Frets: 26608
    PolarityMan said:
    I kinda paused this for a while as I now have access to LLMs via work, running on a big server farm, which is nice. May well come back to it at some point though.
    Yeah...nice as that'd be, I must admit that there's an element of fun involved in getting it working on your own with bugger-all budget to work with.

    In fact, the lack of available knowledge for the Intel cards actually kinda made it enjoyable. It's quite nice not following the pack and going Nvidia (to be honest, they'll never have another penny from me).
    <space for hire>
  • PolarityMan Frets: 7288
    I'm def still interested but not really in a position to spend on anything right now even on the cheap :( 
    ဈǝᴉʇsɐoʇǝsǝǝɥɔဪቌ
  • digitalscream Frets: 26608
    PolarityMan said:
    I kinda paused this for a while as I now have access to LLMs via work, running on a big server farm, which is nice. May well come back to it at some point though.
    guitartango said:
    We run this sort of data at work; currently we can run around 10 GPUs per server, and it still takes time to process the data. I have seen some jobs take over three weeks to process. Can't see many people using Windows for this :)
    You kind of have to assume that machine learning is near the top of the list of reasons for Microsoft directly supporting Linux in Windows.
    <space for hire>
  • digitalscream Frets: 26608
    PolarityMan said:
    I'm def still interested but not really in a position to spend on anything right now even on the cheap :(
    Well, if you change your mind in the next 18 hours or so, I've got a pair of Tesla P100 12GB cards I can let go :D

    Really must get round to sending them back...
    <space for hire>
  • PolarityMan Frets: 7288
    ဈǝᴉʇsɐoʇǝsǝǝɥɔဪቌ
  • digitalscream Frets: 26608
    Interesting...at the moment, I'm kinda stuck to the runners supported by Intel, given that I'm running their A770 GPUs.

    I'm currently on Ollama, and it works well...until you shove another GPU into the system. I'm getting 40 tokens/s with Llama 3 q6 using a single GPU, and 8 tokens/s using two GPUs. I don't expect two GPUs to be faster, but I also don't expect them to be 80% slower.
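
    (If anyone wants to reproduce those numbers, the tokens/s figure falls straight out of Ollama's /api/generate response - eval_count divided by eval_duration, which is reported in nanoseconds; the model tag below is whatever you pulled:)

        import requests

        resp = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": "llama3", "prompt": "Why is the sky blue?", "stream": False},
        ).json()

        tokens_per_s = resp["eval_count"] / resp["eval_duration"] * 1e9  # eval_duration is in nanoseconds
        print(f"{tokens_per_s:.1f} tokens/s")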

    Even more annoying is that I'd be perfectly happy running just the one GPU in there, given how good Llama 3 is, and then shove the second A770 into my desktop. However, there's no way to control the fans on Linux, and they're really annoying in their default state. By comparison, my 8GB RX 6600XT is absolutely silent.

    Thankfully, Intel are pretty responsive on GitHub, so hopefully I'll find a resolution to the Ollama problem soon. I have no confidence that they'll sort the desktop fan issue out, mainly because their driver team are much less responsive than the IPEX team.
    <space for hire>