gcr6 hours ago
DwarfStar4 is a small LLM inference runtime that can run DeepSeek 4. The blog post implies that it currently requires 96GB of VRAM.
For others who are lacking context :-)
foresto6 hours ago
Thanks. Outside of LLM circles, DS4 is usually a video game controller.
artyom5 hours ago
Well, I was sitting here expecting the Redis creator to have an opinion on the still-unannounced Dark Souls 4.
low_tech_love2 minutes ago
Haha the same here!!
oezi4 hours ago
Or a car from Citroen
pavlovan hour ago
Technically DS is an independent sibling of Citroën within Stellantis, a sprawling car conglomerate that owns a dog’s dinner of car brands in Europe and USA.
insensible4 hours ago
Trekkies are experiencing a major regression from Deep Space Nine.
jofzar5 hours ago
I am actually kind of disappointed it wasn't a deep dive on the DualShock 4
zozbot234an hour ago
> The blog post implies that it currently requires 96GB of VRAM.
Has anyone tested what happens if you try and run this on lower-RAM Macs? It might work and just be a bit slower as it falls back on fetching model layers from storage.
conradkay37 minutes ago
It'd be way slower since you'd be doing that work every token
zozbot23432 minutes ago
True (with 64GB RAM it'd have to fetch 20% of its active experts from disk already, about 650MB/tok at 2-bit quant - and that percentage rises quickly as you lower RAM further); my question is just a more practical one about whether it runs at all, how bad the slowdown is, and to what extent you might be able to get some of that compute throughput back by running multiple (slower) agent sessions in parallel under a single Dwarf Star 4 server.
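A rough back-of-the-envelope check of the 650MB/tok figure above. It assumes the 13B active parameters mentioned elsewhere in this thread and a flat 2 bits per weight (the real quant mixes bit widths, so this is approximate). The SSD read speed is a hypothetical placeholder, not a measurement.

```python
def bytes_per_token_from_disk(active_params: float,
                              bits_per_weight: float,
                              miss_fraction: float) -> float:
    """Bytes of weights that must be read from storage for each generated token."""
    active_bytes = active_params * bits_per_weight / 8
    return active_bytes * miss_fraction


def disk_bound_tokens_per_sec(bytes_per_token: float,
                              read_bytes_per_sec: float) -> float:
    """Upper bound on generation speed if storage reads are the only bottleneck."""
    return read_bytes_per_sec / bytes_per_token


ACTIVE_PARAMS = 13e9           # from the thread: "13b active"
BITS_PER_WEIGHT = 2            # simplification: flat 2-bit quant
MISS_FRACTION = 0.20           # from the comment: 20% fetched from disk at 64GB
SSD_READ_BYTES_PER_SEC = 5e9   # hypothetical sustained read speed

per_tok = bytes_per_token_from_disk(ACTIVE_PARAMS, BITS_PER_WEIGHT, MISS_FRACTION)
print(f"{per_tok / 1e6:.0f} MB/token")  # 650 MB/token
print(f"{disk_bound_tokens_per_sec(per_tok, SSD_READ_BYTES_PER_SEC):.1f} tok/s ceiling")
```

Under these assumptions a single session is capped at a few tokens per second by storage alone, which is why batching several sessions over the same fetched weights is the interesting question.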
DeathArrow3 hours ago
>The blog post implies that it currently requires 96GB of VRAM.
From the Github page it seems it only supports Apple and DGX Spark. I have 128 GB of RAM and a 3090 but it probably won't work.
thomasm6m62 hours ago
FYI, llama.cpp (which antirez/ds4 is inspired by) supports system ram. E.g. [1] is a good guide for running a similar-sized model with 128gb ram and a 3090-sized GPU.
[1] https://unsloth.ai/docs/models/tutorials/minimax-m27
(Unsloth's deepseek-v4 support is still WIP)
DeathArrow2 hours ago
Thanks, I can run Qwen 3.6 27B with vLLM, but I was curious about antirez's tool.
manmal2 hours ago
It wouldn’t be useful with your setup, probably 3-4 tokens per second.
DeathArrow33 minutes ago
Yep, maybe I can open a feature request if it makes sense technically.
karmakaze6 hours ago
Great to find this narrow focused thing:
> We support the following backends:
Metal is our primary target. Starting from MacBooks with 96GB of RAM.
NVIDIA CUDA with special care for the DGX Spark.
AMD ROCm is only supported in the rocm branch. It is kept separate from main
since I (antirez) don't have direct hardware access, so the community rebases
the branch as needed.
> This project would not exist without llama.cpp and GGML, make sure to read the acknowledgements section, a big thank you to Georgi Gerganov and all the other contributors.
Edit: aww, doesn't seem to support offloading to system RAM[0] (yet)
[0] https://github.com/antirez/ds4/issues/108
Guess I'll have to keep watching the llama.cpp issue[1]
zimmerfreian hour ago
> AMD ROCm is only supported in the rocm branch.
Has anybody tried it? There is a lot of emphasis on MacBook Pro in this thread, but I would like to use it with an AMD Halo Strix with 128GB of unified RAM.
keyle3 hours ago
If only you could still buy Macs with that much RAM
shric2 hours ago
You can buy 128GB M5 MacBook Pros?
Configured one just now, delivers in 2 weeks
keyle10 minutes ago
Interesting, there was news last week or so of Apple removing Mac mini options.
zmmmmm5 hours ago
I'm very curious where we will saturate the curve on "enough" intelligence for coding. At some point, you can let a less smart model hammer at a problem for longer and get to the same result, and as long as you are not involved it comes to the same thing. I feel like DeepSeek V4 Pro is nearly there. Maybe Flash is too.
Once we hit that point, I am curious how much of Anthropic's current business model falls apart? So far it's always been clear that you just pay for the most intelligent model you can get because it is worth it. It now seems clear to me that there is limited runway on that concept. It is just a question of how long that runway is. I honestly wonder how much of their frantic push to broaden out into enterprise / productivity is because they see this writing on the wall already.
nlan hour ago
Kilo (the open source coding agent) tested Deepseek v4 Pro and Flash vs Opus 4.7 and Kimi K2[1].
It did ok, but scored substantially less than Opus. It also cost nearly as much, even with the current launch promo pricing for Deepseek.
That cost is interesting - I've seen similar things with Sonnet vs Opus, and in my own benchmarking there are some models that benchmark well, seem to have a good price but use so many tokens they cost just as much as "more expensive" models.
[1] https://blog.kilo.ai/p/we-tested-deepseek-v4-pro-and-flash
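A small sketch of why a lower per-token price does not guarantee a lower bill: cost per task is tokens used times price, so a chatty model can erase its price advantage. All prices and token counts below are made-up illustrative numbers, not figures from the Kilo post or any provider's price list.

```python
def task_cost_usd(input_tokens: int, output_tokens: int,
                  in_price_per_m: float, out_price_per_m: float) -> float:
    """Cost of one task given token usage and per-million-token prices."""
    return (input_tokens / 1e6) * in_price_per_m + (output_tokens / 1e6) * out_price_per_m


# Hypothetical "cheap" model: low prices, but burns many tokens per task.
cheap = task_cost_usd(4_000_000, 1_000_000, in_price_per_m=1.0, out_price_per_m=2.0)

# Hypothetical "expensive" model: high prices, but finishes with far fewer tokens.
pricey = task_cost_usd(800_000, 80_000, in_price_per_m=5.0, out_price_per_m=25.0)

print(f"cheap model:  ${cheap:.2f}")
print(f"pricey model: ${pricey:.2f}")
```

With these invented numbers both tasks cost about the same, so comparing cost per completed task is more informative than comparing list prices.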
loeg5 hours ago
> At some point, you can let a less smart model hammer at a problem for longer and get to the same result, and as long as you are not involved it comes to the same thing.
Is that true? I find the smarter models can just be effective when smaller models can't. It isn't a matter of just waiting longer.
davnicwil4 hours ago
it's almost certainly not true yet but at some point there might be an equilibrium reached of speed Vs quality (and let's not forget, cost) where it's true for most of what you do.
Perhaps you'd still turn to hosted models for the hardest tasks, but most tasks go local. It does seem like that would make demand go down significantly.
Of course that's all predicated on model advances plateauing, or at least getting increasingly more expensive for incremental improvements, such that local open source models can catch up on that speed/quality/cost curve. But there is a fair amount of evidence that's happening. The models are still getting noticeably better, but relative improvement does seem to be slowing, and cost is seemingly only going up.
vlovich1232 hours ago
Why is this presumed to be de facto inevitable:
* local compute isn’t scaling as before, so algorithmic improvements are the only ways models get meaningfully faster and smarter
* all those same algorithmic improvements would also be true for larger models
* hardware manufacturers have an incentive against local LLMs because cloud LLMs are so much more lucrative (+ corps would buy desktop variants if they were good enough)
So no it’s not clear quality will ever be comparable. It may be good enough for what you want but there will always be a harder problem that you need to throw more compute and more memory at.
jofzar5 hours ago
> I'm very curious where we will saturate the curve on "enough" intelligence for coding. At some point, you can let a less smart model hammer at a problem for longer and get to the same result, and as long as you are not involved it comes to the same thing. I feel like DeepSeek V4 Pro is nearly there. Maybe Flash is too.
It's always going to be cost;
developer time vs developer cost vs AI cost vs developer productivity.
With 4.6 it's looking like we are at the upper limit of appetite for cost (for "regular" Business) so the other levers will probably need to change.
skybrian3 hours ago
I imagine we'll get to "good enough" for hobbyist programmers fairly quickly, but businesses will still be willing to pay more for faster and smarter. Why make your programmers wait?
zmmmmm2 hours ago
> Why make your programmers wait?
That depends on where the methodology goes. But more and more it's hands off. If the trajectory continues it won't matter because nobody is sitting there waiting / watching the LLM code anyway. It is all happening in the background. We might see hybrid approaches where the weaker / cheaper agent tries to solve it and just "asks for help" from the more expensive agent when it needs it etc.
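The hybrid idea could look roughly like the sketch below: try the cheap agent a few times and only escalate when it gives up. The `cheap` and `strong` callables are hypothetical stand-ins for real model calls, and "returns None" is an assumed convention for "I need help"; nothing here reflects an existing agent framework.

```python
from typing import Callable, Optional, Tuple


def solve_with_escalation(task: str,
                          cheap: Callable[[str], Optional[str]],
                          strong: Callable[[str], str],
                          max_cheap_attempts: int = 3) -> Tuple[str, str]:
    """Return (answer, which_model). The cheap model returns None to ask for help."""
    for _ in range(max_cheap_attempts):
        answer = cheap(task)
        if answer is not None:
            return answer, "cheap"
    # The cheap model never produced an answer: hand off to the expensive one.
    return strong(task), "strong"


# Toy stand-ins: the cheap model only handles tasks it recognises.
def toy_cheap(task: str) -> Optional[str]:
    return "renamed" if task == "rename variable" else None


def toy_strong(task: str) -> str:
    return f"solved: {task}"


print(solve_with_escalation("rename variable", toy_cheap, toy_strong))
print(solve_with_escalation("fix race condition", toy_cheap, toy_strong))
```

The hard part in practice is the cheap model knowing when it is stuck, which this sketch simply assumes.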
ilakshan hour ago
I want something like this but not only for my own computer but also for client projects or stuff I might run in cloud GPUs. Because the core idea of having a strong model that is efficient and doesn't require a cluster still applies to a lot of business cases. I am hoping something like this can work in batch mode.
Right now I feel like a 4bit Qwen 3.6 27B with MTP is one of the best for agentic tool calling for some smart voice agents on an H200. I wonder if DS4 Flash using 80b at 2 bit with 13b active and MTP could be even faster and smarter and allow more concurrent sequences?
This special 2bit quantization seems like a big deal.
FuckButtons7 hours ago
It’s shocking how close this feels to Claude. Obviously it's much slower, but I don’t know that it’s significantly dumber. Interestingly the imatrix quantization seems to be better than whatever quant the zdr inference backends on OpenRouter are using. It was self-aware enough yesterday to realize that its own server process was itself without me telling it, which is not something I’ve ever observed a local model doing before.
stavros7 hours ago
In my (obviously anecdotal) testing, DeepseekV4 Pro was better than Sonnet at coding. However, it is much slower, but also many times cheaper, especially with the promotion right now.
DeathArrow3 hours ago
Do they have a coding plan or do you only pay per API call?
trollbridge2 hours ago
It’s just per token, but burning up 100 million+ tokens is a $3 transaction with their pricing right now
DeathArrow2 hours ago
Do you use the official API or another provider?
0xbadcafebee8 hours ago
I don't see an explanation of why they would make a model-specific inference engine vs just using llamacpp. There are already lots of people working on the llamacpp integration. This is a lot of effort spent on a single model which is likely to become obsolete when a different model comes out that does better. In some discussions, people are now making PRs against both the llamacpp branches and ds4... so it's taking a rare commodity (people investing development time in this model) and fragmenting it
dilap3 hours ago
way easier to work on a focussed c codebase you own than a mature unwieldy c++ codebase you don't. but it's fine, people will take that work and port to llamacpp and everyone wins.
(the ux of ds4 is fantastic too -- it's dead-easy to get a known-good model, great quant. llamacpp you're much more hacking in the wilderness, w/ many many knobs.)
flakiness8 hours ago
I believe the assumption is: The code is cheap. The collaboration (eg. upstreaming) is expensive.
Is it true? We'll see, in a few years.
fgfarben7 hours ago
At a certain point the level of abstraction / genericization necessary for a big flexible project (like llama.cpp or Linux) blows things up into a huge number of files. Something newer and smaller can move faster.
zozbot2348 hours ago
Author has mentioned many times that the llama.cpp maintainers don't want code that's prevalently written by AI with no human revision. If anyone wants to try and get the support upstreamed into that project, they're quite free to do that: the code is MIT licensed.
kristianp7 hours ago
Also Antirez has been able to use GPT to iterate on the code and performance. He/they (others contributed to DS4) has a set of result files to ensure that correctness is maintained, and benchmarks to verify performance, and the LLM is able to iterate within that framework. Having a small, focussed codebase helps here.
Antirez explained the dev process when he posted a pure C implementation of the Flux 2 Klein image gen model, at https://news.ycombinator.com/item?id=46670279
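The "result files" idea amounts to a golden-output regression check: after each optimization, compare the new numbers against stored reference values within a tolerance. The sketch below is a generic illustration of that pattern, not the actual ds4 test harness; the function name, tolerances and sample values are all assumptions.

```python
import math
from typing import Sequence


def matches_reference(actual: Sequence[float],
                      expected: Sequence[float],
                      rel_tol: float = 1e-3,
                      abs_tol: float = 1e-5) -> bool:
    """True if every value is within tolerance of the stored reference."""
    if len(actual) != len(expected):
        return False
    return all(math.isclose(a, e, rel_tol=rel_tol, abs_tol=abs_tol)
               for a, e in zip(actual, expected))


# Made-up reference logits and two candidate outputs after a kernel change.
reference = [0.1250, -3.4000, 7.0010]
after_safe_change = [0.12501, -3.4001, 7.0012]    # tiny numerical drift
after_broken_change = [0.1250, -3.4000, 6.5000]   # a real regression

print(matches_reference(after_safe_change, reference))
print(matches_reference(after_broken_change, reference))
```

A check like this gives an LLM-driven optimization loop a hard pass/fail signal, so speedups that silently change the output get rejected.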
somewhatrandom96 hours ago
With "intelligence" (or whatever you want to call it) and speed both seeming to ramp up quickly with local models I wonder what the growth rate and ceiling(?) might be in this space. Will this kind of iq and performance work with just e.g: 16GB RAM in a couple years? Is there a new kind of Moore's law to be defined here?
famouswaffles3 hours ago
Squeezing a model like this complete with 'big model smell' into 16GB...Honestly it's not even possible or feasibly possible today.
It'll require some kind of:
- breakthrough in architecture or
- breakthrough in hardware or
- some breakthrough quantization technique
The problem is that all the parameters need to be in memory, even the ones that aren't active (say for Mixture of Experts models), because switching parameters in and out of RAM is far too slow.
marci44 minutes ago
"That’s where EMO comes in.
We show that EMO – a 1B-active, 14B-total-parameter (8-expert active, 128-expert total) MoE trained on 1 trillion tokens – supports selective expert use: for a given task or domain, we can use only a small subset of experts (just 12.5% of total experts) while retaining near full-model performance."
hadlock4 hours ago
640gb ought to be enough for anybody
lwansbrough6 hours ago
The people working at the leading edge of this stuff seem to believe that there is a need for parallel models that solve different problems.
A crow exhibits some degree of intelligence in what is a very small brain compared to humans. There is overlap in the problem solving skills of the dumbest humans and the smartest crows.
So the question is: what is that? Yann LeCun seems to think it’s what we now call world models. World models predict behaviour as opposed to predicting structured data (like language.)
If your model can predict how some world works (how you define world largely depends on the size of your training data), then in theory it is able to reason about cause and effect.
If you can combine cause and effect reasoning with language, you might get something truly intelligent.
That’s where things seem to be going. Once we have a prototype of that system, there will be many questions about how much data you really need. We’ve seen how even shrinking LLMs with 1-bit quantization can lead to models that exhibit a fairly strong understanding of language.
I don’t think it’s unreasonable to expect to see some very intelligent low (relatively) memory AI systems in the next couple years.
simonw8 hours ago
I got this running on a 128GB M5 the other day - pretty painless, model runs in about 80GB of RAM and it seemed to be very capable at writing code and tool execution.
perfmode8 hours ago
How’s the token throughput / response time?
simonw8 hours ago
Healthy!
prefill: 30.91 t/s, generation: 29.58 t/s
From https://gist.github.com/simonw/31127f9025845c4c9b10c3e0d8612...
embedding-shape8 hours ago
Comparison with a RTX Pro 6000, with DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf:
prefill: 121.76 t/s, generation: 47.85 t/s
Main target seems to be Apple's Metal, so makes sense. Might be fun to see how fast one could make it go though :) The model seems really good too, even though it's in IQ2.
xienze8 hours ago
I don't want to be a jerk but 31t/s prefill is basically unusable in an agentic situation. A mere 10k in context and you're sitting there for 5+ minutes before the first token is generated.
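The arithmetic behind that estimate, taking the prefill rates quoted in this thread at face value: time to first token is roughly prompt length divided by prefill speed. The function name is just for illustration.

```python
def time_to_first_token_s(prompt_tokens: int, prefill_tokens_per_sec: float) -> float:
    """Seconds spent processing the prompt before the first output token."""
    return prompt_tokens / prefill_tokens_per_sec


PROMPT_TOKENS = 10_000

# Prefill rates as reported by commenters above (tokens per second).
reported_rates = {
    "128GB M5 (simonw)": 30.91,
    "RTX Pro 6000 (embedding-shape)": 121.76,
}

for label, rate in reported_rates.items():
    seconds = time_to_first_token_s(PROMPT_TOKENS, rate)
    print(f"{label}: {seconds / 60:.1f} min")
# 128GB M5 (simonw): 5.4 min
# RTX Pro 6000 (embedding-shape): 1.4 min
```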
throwdbaaway37 minutes ago
Hah, that's because the prompt itself was only about 30 tokens. We need a much bigger prompt to properly test PP.
fgfarben7 hours ago
That prefill number isn't right. M4 Max hits 200-300: https://github.com/antirez/ds4/blob/main/speed-bench/m4_max_...
hadlock4 hours ago
M5 studio is gonna sell like hot cakes
aiscoming8 hours ago
if it's just the coding agent system prompt and tools, you can cache that
xienze8 hours ago
Yeah the problem is that's just the start of the context. There's, you know, all the tool call results and file reads and stuff.
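Both points can be seen in a toy model of prefix caching: only the tokens after the longest shared prefix need to be prefilled, so a cached system prompt helps, but everything appended later still has to be processed. This is a simplified sketch over lists of token IDs, not how DS4 or any particular server implements its KV cache.

```python
from typing import Sequence


def common_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of leading tokens the two sequences share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def tokens_to_prefill(cached: Sequence[int], request: Sequence[int]) -> int:
    """Tokens that still need prefill when the cached prefix is reused."""
    return len(request) - common_prefix_len(cached, request)


# Hypothetical sizes: an 8k-token system prompt + tools, then 2k tokens of tool output.
system_prompt = list(range(8_000))
tool_results = list(range(100_000, 102_000))

cached = system_prompt
request = system_prompt + tool_results

print(tokens_to_prefill(cached, request))  # 2000
print(tokens_to_prefill([], request))      # 10000
```

If the server also keeps the cache for the growing conversation, each turn only pays for the newly appended tokens, which is where most of the benefit in agentic loops would come from.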
rtpg6 hours ago
what are token speeds like for frontier models, if that gives a rough idea of how much "slower" slow is?
chatmasta6 hours ago
So you’re saying I should buy the M5? :) I’ve been resisting, thinking I’ll never use it… it’ll be better in a year… I’ll wait for the Studio (do we still think that’s coming in June?)… etc.
simonw5 hours ago
I expect this to be my main machine for the next 3-4 years (which is how I justified the 128GB one). It's a beast of a machine - I love that I can run an 80GB model and still have 48GB left for everything else.
Can't say that it wouldn't be a better idea to spend that cash on tokens from the frontier hosted models though.
I'm an LLM nerd so running local models is worth it from a research perspective.
simpaticoder3 hours ago
An M5 Max MBP with 128G of RAM costs ~$5k. An Nvidia RTX 5090 with 32G RAM is $4-5k, and RTX PRO 6000 with 96GB RAM $10k. Do you have any data on which is the best price/performance for local inference? Do you know what the big OpenAI/Anthropic/Google datacenters are running?
driese34 minutes ago
As always: it depends on your needs. Here's a very basic heuristics rundown:
- More RAM: bigger models, more intelligence.
- More FLOPs: higher pre-fill (reading large files and long prompts before answering, the so-called "time to first token").
- More RAM bandwidth: higher token generation (speed of output).
So basically Macs (high RAM, okay bandwidth, lowish FLOPs) can run pretty intelligent models at an okay output speed but will take a long time to reply if you give them a lot of context (like code bases). Consumer GPUs have great speed and pre-fill time, but low RAM, so you need multiple if you want to run large intelligent models. Big boy GPUs like the RTX 6000 have everything (which is why they are so expensive).
There are some more nuances like the difference of Metal vs. CUDA, caching, parallelization etc., but the things above should hold true generally.
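The last two heuristics can be turned into rough ceilings: generation is bounded by how fast the active weights can be streamed from memory, and prefill by how fast the chip can do about 2 FLOPs per active parameter per token (a common approximation that ignores attention cost). The hardware numbers below are hypothetical round figures, not specs for any real machine; the 13B active parameters and 2-bit weights come from this thread.

```python
def max_generation_tps(mem_bandwidth_bytes_s: float, active_bytes_per_token: float) -> float:
    """Bandwidth-bound ceiling: every token must read all active weights once."""
    return mem_bandwidth_bytes_s / active_bytes_per_token


def min_prefill_seconds(prompt_tokens: int, active_params: float, flops_per_s: float) -> float:
    """Compute-bound floor, assuming ~2 FLOPs per active parameter per token."""
    return prompt_tokens * 2 * active_params / flops_per_s


ACTIVE_PARAMS = 13e9
ACTIVE_BYTES = ACTIVE_PARAMS * 2 / 8   # flat 2-bit weights (simplification)

HYPOTHETICAL_BANDWIDTH = 500e9         # bytes/s, illustrative only
HYPOTHETICAL_FLOPS = 50e12             # FLOP/s, illustrative only

print(f"generation ceiling: {max_generation_tps(HYPOTHETICAL_BANDWIDTH, ACTIVE_BYTES):.0f} tok/s")
print(f"10k-token prefill floor: {min_prefill_seconds(10_000, ACTIVE_PARAMS, HYPOTHETICAL_FLOPS):.1f} s")
```

Real throughput lands well below these ceilings, but the ratios show why doubling bandwidth helps output speed while doubling FLOPs helps time to first token.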
easythrees6 hours ago
I thought for a moment there was a Dark Souls 4
NDlurker6 hours ago
I was thinking dual shock 4
txhwind40 minutes ago
Fucking abbreviations. Who knows it's DeepSeek, Dark Souls or DualShock? All possible on HN.
JavierFlores096 hours ago
Glad I wasn't the only one, my second thought was Dual Shock controller but that wasn't it either lol
minimaxir8 hours ago
A relevant recent tweet from antirez: https://x.com/antirez/status/2054854124848415211
> Gentle reminder on how, in the recent DS4 fiesta, not just me but every other contributor found GPT 5.5 able to help immensely and Opus completely useless.
I've noticed the same for lower level squeezing-as-much-performance-as-possible code work.
throwaway0412077 hours ago
Assuming we are talking about Code/Codex are you on API billing or subscription? I have essentially unlimited API billing at my disposal and I haven't noticed any degradation of quality across Opus versions.
chatmasta6 hours ago
Same here, the enterprise version of Claude has been great. Luckily I’m not the one paying for it. We also have Copilot, and when GPT-5.4 came out at 1x request cost I was very impressed, but I haven’t had much time to compare the two.
I also don’t have time to do much personal coding outside of work, so I haven’t subscribed to a personal one yet. But I intend to go for Codex just to balance the Claude at work and also because of the hostile moves from Anthropic toward their consumer business.
sanxiyn6 hours ago
There is a benchmark for performance work, and I think it is not being optimized by model vendors. The latest result from GSO is that both Opus 4.6 and 4.7 slightly outperform GPT 5.5. This also matches my experience.
vitorsr6 hours ago
Tasks are taken from commit histories in public Git repositories which defeats the purpose.
Riany2 hours ago
I think local models just need to be good enough that privacy, latency, and control become worth the tradeoff, rather than needing to beat the best cloud models.
vrighter2 hours ago
Damn it I was expecting something interesting about the ps4 controller. Not some more junk about AI. Such a rugpull
bjconlan9 hours ago
This is great! I feel the same way about the deepseek v4 architecture for commodity hardware.
Also have enjoyed playing with https://huggingface.co/HuggingFaceTB/nanowhale-100m-base (but early days for me understanding this space)
kamranjon8 hours ago
Very cool! I had no idea that HF was doing this - I really love their small model experiments.
sbinnee8 hours ago
It is a big thing for sure to have a competitive local agentic model. I've replaced gemini 3 flash preview with DeepSeek v4 flash for all of my personal use cases. Starting from chat app, language learning, and even hobby coding. For coding, I couldn't get decent results no matter which sota latest models I used before. It's not close to Opus or Codex models. It's a flash model and makes mistakes here and there (I just saw `from opentele while import trace`, new Python syntax!)
But I found its tool calling is more reliable than other oss models I tried. I assume that is attributable to interleaved thinking. Its reasoning effort is adjusted automatically per query. I enjoy reading these reasoning traces from open models because you can't see them from proprietary models.
I would love to try DS4 so bad. Well, I don't have a machine for it. I will just stick to openrouter. I wish I can run a competitive oss model on 32GB machine in 3 years.
kristianp7 hours ago
> I wish I can run a competitive oss model on 32GB machine in 3 years.
It's so hard to predict what size the open-weight models will be, even in 6 months time. Will a 96GB machine turn out to be a complete waste of money? Who knows.
zozbot2347 hours ago
> I wish I can run a competitive oss model on 32GB machine in 3 years.
You could try DS4 on that machine anyway and see how gracefully it degrades (assuming that it runs and doesn't just OOM immediately). Experimenting with 36GB/48GB/64GB would also be nice; they might be able to gain some compute throughput back by batching multiple sessions together (though obviously at the expense of speed for any single session).
thegeomaster7 hours ago
> `from opentele while import trace`
FYI, this to me points to an inference bug, bad sampling, or a non-native quant. OpenRouter is known to route requests to absolutely terrible, borked implementations. A model like DeepSeek V4 Flash shouldn't be making syntax errors like this.
kgeist4 hours ago
Did someone compare DeepSeek 4 Flash to Qwen3.6-27B on real tasks (quality + speed)? According to the benchmarks at artificialanalysis.ai, Qwen3.6-27B is better at agentic tasks, and DS4 is only 2 points better at coding (both with max reasoning effort, full weights). At the same time, DS4 requires 5 times more VRAM even at 2 bits. Last time I explored this topic, large MoE models at 2-3 bits usually performed worse (quality-wise) than dense ~30B models at 4-8 bits, despite being much heavier to run.
Sure, MoE models have more knowledge, but extreme quantization may negate the benefits. And generally for coding tasks, you don't need a model that has memorized all the irrelevant trivia like, I don't know, the list of all villages in country X. DS4 also seems to run much slower on Mac Studio Ultra, which appears to be more or less in the same price range as RTX 5090. RTX 5090 gives me 50-60 tok/sec and 260k context with Unsloth's 5-bit quantization (only some layers are 5-bit too) and an 8-bit KV cache; prefill is instant too. It works flawlessly in OpenCode.
If you already have a spare high-end Mac, I can see the benefit, but I'm not sure it's a good configuration overall. Unless Qwen3.6 is more benchmaxxed than DS4 :)
kamranjon8 hours ago
Just want to mention that I've been pulling down and using DwarfStar locally and it's incredible. I actually have it running on my personal macbook m4 max with 128gb of ram and I am running the server to share it through tailscale with my work laptop and just have pi running there.
The long context reasoning is something I haven't even seen in frontier models - I was running at 124k tokens earlier and it was still just buzzing along with no issues or fatigue.
I am amazed at how well it works, I'm using it right now for some pretty complex frontend work, and it is much much faster than, for example running a dense 27b or 31b model (like qwen or gemma) for me (The benefits of MoE) - but the long context capabilities have been what have been absolutely flooring me.
Super excited about this project and hope Antirez can keep himself from burning out - I've been following the repo pretty closely and there are a ton of PRs flooding in and it seems like he's had to do a lot of filtering out of slop code.
le-mark8 hours ago
Is DS4 dwarf star 4 or deep seek 4?
kamranjon8 hours ago
Just updated! Sorry I meant Dwarf Star - it's the only way I've actually managed to run DeepSeek flash on my local hardware
zackify6 hours ago
Are you on q2?
kamranjon5 hours ago
Yea I'm on the imatrix q2 version now
wolttam8 hours ago
DwarfStar 4 is DeepSeek 4 (check the repo)
brcmthrowaway7 hours ago
This guy is falling deep into Yegge-tier psychosis.
linkregister7 hours ago
Empirically, DS4 is hosting the DeepSeek v4 Flash model with good performance on home hardware. I'm curious how you came to this conclusion.
dakolli7 hours ago
"Empirically", have you tested this yourself?
linkregister5 hours ago
It's trivial to find reviews and benchmarks of DS4 online. Also, there are benchmarks in the article.
Here's one of the top hits: https://forums.developer.nvidia.com/t/fully-custom-cuda-nati...
Bizarre comment; sounds like "How do you know Porsches are fast? Did you drive one?"
calmingsolitude4 hours ago
Parent is simply pointing out the incorrect usage of "empirically", which should typically only be mentioned when you've tested it yourself.
linkregister3 hours ago
I'm having trouble finding dictionaries or other references that add the qualifier that it needs to be self-tested and not relying on the research of others. Can you point me to one?
dakolli3 hours ago
I don't think comments on the internet count as "empirical" evidence, but sure.
linkregister2 hours ago
If you think antirez's benchmarks in the blog post are false, you should make the claim. Continue to move the goal posts.
dakolli4 hours ago
Are you comparing an LLM running on a laptop to a Porsche?
I just find it really funny people are willing to write things like "empirically speaking, X is obvious" without actually testing it themselves.
I've seen mixed reviews, and the most honest sounding ones have said it has latency issues.
I don't really care that much what the average LLM power user says at this point, they're impressed by anything an LLM does. They're like toddlers entertained by the sound their Velcro shoes make.
You LLM people are going to be like my mom, once she got a Maps app she completely gave up on navigating anywhere with her own brain, and is lost without a phone.
Except for you LLM people, it's going to be reading, writing, problem solving and thinking in general. You'll be completely reliant on an LLM to get anything done, have fun with that. You're cooked bro.
linkregister3 hours ago
It's funny because you make these assertions without any empiricism of your own. They're just speculations.
"You LLM people". Has it occurred to you that individuals have variation within groups?
wren69913 hours ago
Not even close. "I made this DSP task faster by focusing on exactly one compute graph on one machine instead of a compute graph compiler that runs on every possible machine" is a real engineering approach, and the AI usage is incidental. Things like Gas Town are self-serving turboslop whose only purpose is to generate more slop.
fgfarben7 hours ago
Nope.
codedokode8 hours ago
I thought DeepSeek was closed-weights and proprietary? I wonder how it compares against Western open-weight models. The hugging face page contains the comparison only with proprietary models for some reason.
itishappy8 hours ago
DeepSeek has always been open-weight, and the DeepSeek HuggingFace page does not contain any comparisons. Where did you form these opinions?
codedokode7 hours ago
It contains comparisons: https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash
itishappy7 hours ago
Just the first one then...
Apologies. Where did I form my opinions?
zozbot2348 hours ago
Nemotron would be a comparable Western open model AIUI.