The most popular performant desktop LLM runtimes are either pure GPU (exLLAMA) or GPU + CPU (llama.cpp). People stuff the biggest model that will fit into the collective RAM + VRAM pool, up to ~48GB for Llama 70B, and sometimes split models across two 24GB CUDA GPUs. Inference for me is bottlenecked by GPU and CPU RAM bandwidth.

TBH the biggest frustration is that OEMs like y'all can't double up VRAM like you could in the old days, or sell platforms with a beefy IGP, and that quad-channel+ CPUs are too expensive. Vulkan is a popular target that runtimes seem to be heading for, and IGPs with access to lots of memory capacity and bandwidth will be very desirable; I hear Intel/AMD are cooking up quad-channel IGPs. On the server side everyone is running Nvidia boxes, I guess, but I had a dream about an affordable llama.cpp host: the cheapest Sapphire Rapids HBM SKUs, with no DIMM slots, on a tiny, dirt-cheap motherboard you can pack into a rack like sardines. Llama.cpp is bottlenecked by memory bandwidth, and ~64GB is perfect.

If you're at a motherboard manufacturer, I have some definite points for you to hear. There are essentially zero motherboards that space out x16 PCIe slots so that you can properly use more than two triple-slot GPUs. 3090s and 4090s are all triple-slot cards, yet motherboards typically put x16 slots two slots apart, with x8 or smaller slots in between. There may be a few that let you fit two cards, but none that would support three, I don't think, and definitely none that do four. Obviously, spacing them out properly would result in a non-standard-length motherboard (much taller), but in the ML world it would be appreciated, because it would make quad-card builds possible without watercooling the cards or using A5000/A6000 or other dual-slot, expensive datacenter cards.

And then, even for dual-slot cards like the A5000/A6000, there are very few motherboards where you can get the x16 slots spaced appropriately. The Supermicro H12SSL-i is about the only one that gets four x16 slots double-slot spaced, in a way that lets you run four blower or watercooled cards without overlapping anything else. And even when you do, the pin headers on the bottom of the motherboard interfere with the last card. That location for the pin headers is archaic and annoying and just needs to die.
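To make the RAM + VRAM split mentioned at the top concrete, here is a minimal back-of-the-envelope sketch of the idea behind llama.cpp-style GPU + CPU offloading: put as many layers as fit in VRAM, keep the rest in system RAM. The helper function and all the sizes (model bytes, layer count, VRAM reserve) are illustrative assumptions, not anything taken from llama.cpp itself.

```python
# Hypothetical helper: estimate how many transformer layers fit in VRAM,
# which is the basic idea behind GPU + CPU layer offloading.
# All sizes are rough assumptions, not measured values.

GB = 1e9

def layers_on_gpu(total_layers: int, model_bytes: float, vram_bytes: float,
                  reserve_bytes: float = 2 * GB) -> int:
    """Layers that fit in VRAM, keeping some headroom for KV cache/scratch."""
    per_layer = model_bytes / total_layers
    usable = max(vram_bytes - reserve_bytes, 0.0)
    return min(total_layers, int(usable // per_layer))

# Assume a ~70B model quantized to ~4 bits: roughly 40 GB over 80 layers.
model_bytes, total_layers = 40 * GB, 80

print(layers_on_gpu(total_layers, model_bytes, 24 * GB))  # one 24 GB card -> ~44 layers
print(layers_on_gpu(total_layers, model_bytes, 48 * GB))  # two 24 GB cards pooled -> all 80
```

In llama.cpp itself this split boils down to choosing how many layers to offload to the GPU; the rest run from system RAM, which is why the combined RAM + VRAM pool is what matters.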
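And on the bandwidth bottleneck: for single-stream decoding, each generated token has to stream roughly the whole set of weights once, so memory bandwidth divided by model size gives a quick ceiling on tokens per second. A rough sketch, with assumed order-of-magnitude bandwidth figures rather than spec quotes:

```python
# Memory-bandwidth-bound ceiling on decode speed: bandwidth / model size.
# Bandwidth figures below are rough, assumed values for illustration.

GB = 1e9

def tok_per_s_ceiling(model_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Upper bound on tokens/s when streaming the weights is the bottleneck."""
    return bandwidth_bytes_per_s / model_bytes

model_bytes = 40 * GB  # ~70B model at ~4-bit quantization (assumption)

for name, bw in [
    ("dual-channel DDR5, ~80 GB/s", 80 * GB),
    ("24 GB GDDR6X card, ~1000 GB/s", 1000 * GB),
    ("Sapphire Rapids HBM, ~1000+ GB/s", 1000 * GB),
]:
    print(f"{name}: ~{tok_per_s_ceiling(model_bytes, bw):.1f} tok/s ceiling")
```

That ratio is why a cheap HBM-fed host with no DIMM slots is such an attractive llama.cpp target: capacity only has to cover the model, and everything else is bandwidth.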