Hi guys, I really want to run this model as Unsloth's Q6_K_XL quant (194 GB), or perhaps one of the AWQ/FP8 quants.
My setup is complex though; I have two servers:
Server A - 4x RTX 3090, Threadripper 1900X, 64GB DDR4 RAM (2133 MT/s), quad channel
Server B - 2x RTX 3090, 2x Xeon E5-2695 v4 CPUs, 512GB DDR4 ECC RAM (2133 MT/s), quad channel per CPU (8 channels total when using both NUMA nodes, 4 channels when using one)
I have a 7th 3090 in my main work PC; I could throw it in somewhere if it made a difference, but I'd prefer to get this done with 6.
I can't place all 6 GPUs in Server B, as its motherboard doesn't support PCIe bifurcation and doesn't have enough PCIe lanes for all 6 GPUs alongside the other PCIe cards (NVMe storage over PCIe and the NIC).
I CAN place all 6 GPUs in Server A, but the most RAM that server supports is 128GB (motherboard limitation).
I know there are technologies out there such as Ray that would let me pool both servers' GPUs over the network (I have a 40Gbps network, so plenty fast for inference), but I don't know if Ray will even work in my setup. Even if I balance 3 GPUs on each server, PP needs power-of-two GPU counts per server (1, 2, 4, 8, ...). Can I do PP=2 on Server A and PP=4 on Server B?
Even if I got PP working with Ray, would I still be able to offload to Server B's RAM as well?
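For the AWQ/FP8 route, here's roughly the pooled setup I was picturing - a minimal sketch assuming vLLM's Ray backend, with a placeholder model path and an untested TP/PP split (from what I've read you set one global TP x PP across the cluster, not a per-server PP size):

```python
# Ray cluster first (the usual recipe, as I understand it):
#   Server A (head):   ray start --head --port=6379
#   Server B (worker): ray start --address=<server_a_ip>:6379
from vllm import LLM, SamplingParams

# Hypothetical model path and split: TP x PP = 2 x 3 = 6 GPUs total,
# rather than "PP2 on Server A and PP4 on Server B".
llm = LLM(
    model="/models/placeholder-awq-quant",  # hypothetical path
    tensor_parallel_size=2,                 # 2-way TP inside each stage
    pipeline_parallel_size=3,               # 3 pipeline stages
    distributed_executor_backend="ray",     # pool GPUs across both nodes
)

out = llm.generate(["hello"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```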
Ideally I'd use all 6 GPUs for the maximum 144GB of VRAM for the KV cache and some of the weights, and add ~100GB of weights from RAM. (I also need full context; I'm a software engineer.)
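For context, my back-of-the-envelope memory budget looks like this (the KV/overhead figure is a pure guess; the real number depends on the model and context length):

```python
# Rough memory budget in GB; every number here is my own estimate.
gpus, vram_per_gpu = 6, 24                      # 6x RTX 3090
total_vram = gpus * vram_per_gpu                # 144 GB

weights = 194                                   # Q6_K_XL file size
kv_and_overhead = 40                            # guess: full-context KV cache + activations

weights_on_gpu = total_vram - kv_and_overhead   # ~104 GB of weights in VRAM
weights_in_ram = weights - weights_on_gpu       # ~90 GB served from system RAM

print(f"weights on GPU: {weights_on_gpu} GB, weights in RAM: {weights_in_ram} GB")
```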
Lastly, if I can't get 15+ t/s inference and 1000+ t/s prompt processing, it won't suffice, as I need this for agentic work and agentic coding.
What do you guys think?
If it's not doable with said hardware, would you recommend I upgrade my motherboard & CPU to a 7xx2/7xx3 Epyc (reusing the same RAM) for faster offloading, or go for more GPUs and a cheaper motherboard that does support PCIe bifurcation, so I could run say 8-10 RTX 3090s in one rig? If I can fit the whole model in VRAM, I don't need the RAM or memory channels either way.
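The napkin math behind that question, a crude bandwidth-bound ceiling on decode speed (the per-token RAM read is a guess, so treat these as upper bounds, not predictions):

```python
# Upper-bound decode t/s from RAM bandwidth alone; all inputs are assumptions.
def bandwidth_gb_s(channels: int, mt_s: int) -> float:
    return channels * mt_s * 8 / 1000  # 8 bytes per 64-bit channel transfer

current = bandwidth_gb_s(4, 2133)      # ~68 GB/s  (quad channel, one NUMA node)
epyc    = bandwidth_gb_s(8, 2133)      # ~137 GB/s (same DIMMs on an 8-channel Epyc)

ram_read_per_token = 20  # GB of RAM-resident weights touched per token (guess)

print(f"current: {current:.0f} GB/s -> <= {current / ram_read_per_token:.1f} t/s")
print(f"epyc:    {epyc:.0f} GB/s -> <= {epyc / ram_read_per_token:.1f} t/s")
```

By this math even 8 channels may not hit my 15 t/s target unless most of the active weights sit in VRAM, which is part of why the all-GPU option tempts me.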