Hacker News

tosh

Matrix Multiplications on GPUs Run Faster When Given "Predictable" Data (2024) thonking.ai

dan_sbl 2 hours ago

> For example, when the GPU is fully idle, nvidia-smi tells me that it’s only pulling 88W of power.

I haven't used a non-laptop GPU in some time, but that is a crazy amount of "idle" power consumption. Is this normal for cards like this?

wildzzz an hour ago

If my GPU is sitting idle, and I mean idle with nothing loaded into its memory, it's sitting at about 18W. If I load in a model that uses nearly all of the memory but that model is idle, it's at 36W. If that model is actively thinking, it's like 118W. I think this is likely due to the GPU being aware that there is real data loaded into memory and turning up the DRAM refresh rate, whereas when nothing is loaded, the dynamic power is as low as possible.

Aurornis 2 hours ago

Server cards are not optimized for idle power usage. They’re expected to be fully utilized.

For server gear it’s more common to have less dynamic power and voltage switching because it produces more predictable performance and latency.

cmovq an hour ago

For GeForce cards you can get similar behavior by setting “Prefer maximum performance” which disables some of the low power states.

umanwizard 29 minutes ago

I suspect the act of running nvidia-smi itself prevents the GPU from being put into a low-power state.

zacmps 10 minutes ago

From memory this is true, and NVML (the NVIDIA Management Library) is the way to get stats without causing the GPU to wake.
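A minimal sketch of reading power draw through NVML from Python, assuming the `pynvml` bindings are installed and an NVIDIA driver is present. It does not verify whether querying this way keeps the GPU in a low-power state. `milliwatts_to_watts` and `read_power_watts` are names made up for this example.

```python
def milliwatts_to_watts(mw: int) -> float:
    """NVML reports power in milliwatts; convert to watts."""
    return mw / 1000.0


def read_power_watts(index: int = 0) -> float:
    """Return the current power draw of GPU `index` in watts via NVML."""
    import pynvml  # imported lazily so the helper above works without a GPU

    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(index)
        return milliwatts_to_watts(pynvml.nvmlDeviceGetPowerUsage(handle))
    finally:
        pynvml.nvmlShutdown()
```

On a machine with a supported GPU, `read_power_watts(0)` should return a figure comparable to the one `nvidia-smi` shows.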

jetsamflotsam 40 minutes ago

I feel like many of the comments missed the point or didn't read the article. What I believe this article is stating (and I've read this many times during my PhD for various reasons) is that the input data distribution affects how many transistor state changes there are during multiplication. Since these events account for a large portion of energy loss/heat generation, the clocks won't be throttled as much for certain data patterns.

There was a workshop paper from SC24 that did more experiments around this I believe. I can't find it now though.
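A toy illustration of the state-change idea above, not a GPU power model: it counts how many bit positions change between consecutive 16-bit words in a stream, once for random words and once for a repeated constant. The 16-bit width, the stream length, and the name `toggle_count` are assumptions chosen for the example.

```python
import random


def toggle_count(words, width=16):
    """Count bit positions that change between consecutive words."""
    mask = (1 << width) - 1
    return sum(bin((a ^ b) & mask).count("1") for a, b in zip(words, words[1:]))


rng = random.Random(0)
n = 10_000
random_words = [rng.getrandbits(16) for _ in range(n)]
constant_words = [0x3C00] * n  # the fp16 bit pattern for 1.0, repeated

print("random  :", toggle_count(random_words))
print("constant:", toggle_count(constant_words))
```

With random words roughly half of the 16 bits flip on every step, while the constant stream never flips any, which is the kind of difference in switching activity the comment describes.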

amelius 2 hours ago

Sounds like a side channel attack waiting to happen.

unglaublich an hour ago

So I guess we'll all be applying a random rotation to our matrices now to obscure their contents, like TurboQuant does. https://arkaung.github.io/interactive-turboquant/#rotation

buildbot an hour ago

Not that it super matters, but random Hadamard transforms for quantization have been a thing since way before TurboQuant.

https://arxiv.org/abs/2404.00456
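A small sketch of why a rotation can be slipped into a matmul without changing the result: for an orthogonal Q, (A Q)(Qᵀ B) = A B. Here Q is a Sylvester Hadamard matrix with random column signs, scaled by 1/√n. This only illustrates the identity in plain Python; it is not the procedure used by TurboQuant or the linked paper, and all function names are made up for the example.

```python
import random


def hadamard(n):
    """Sylvester construction; n must be a power of two."""
    h = [[1]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    return h


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def random_rotation(n, rng):
    """Q = H * diag(signs) / sqrt(n): orthogonal, with random column signs."""
    signs = [rng.choice((-1, 1)) for _ in range(n)]
    scale = n ** -0.5
    return [[h * s * scale for h, s in zip(row, signs)] for row in hadamard(n)]


rng = random.Random(0)
n = 4
a = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
b = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
q = random_rotation(n, rng)

plain = matmul(a, b)
rotated = matmul(matmul(a, q), matmul(transpose(q), b))
max_err = max(abs(x - y) for r1, r2 in zip(plain, rotated) for x, y in zip(r1, r2))
print(max_err)  # floating-point rounding error only
```

The rotated operands `a @ q` and `qᵀ @ b` look nothing like the originals, yet their product matches `a @ b` up to rounding.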

jayd16 2 hours ago

I can't tell from the blog, is this actually verified or is it theory and then numbers showing plausibility?

I could certainly come up with alternative theories about memory compression and prefetching if we were talking about texture reads.

nzach 3 hours ago

I went in expecting to find 'branch prediction'[0] as the answer, but apparently things are even more complex nowadays.

[0] - https://stackoverflow.com/questions/11227809/why-is-conditio...

bee_rider 34 minutes ago

I expected a “torch is smart enough to keep track of cases where it just initialized the C in C <= A*B+C to zero, avoiding the add” type situation but I was wrong.

gruez an hour ago

>I went in expecting to find 'branch prediction'[0]

GPUs do branch prediction? I thought they didn't bother and try to minimize wasted effort by using high amounts of concurrent threads?

jayd16 an hour ago

They do texture prefetching, which is sorta similar.

ryanisnan 29 minutes ago

That's exactly what I thought.

kangalioo 2 hours ago

To be fair, the culprit in the article is _less complex_ than branch prediction: "with random data, bits are flipped often, and bit flips in transistors inherently draw power" is less mental gymnastics than "with random data, the cpu fails to predict the future, causing redundant speculative execution"

gdevenyi 4 hours ago

People have been noticing the effects of this in local LLM inference. Power limiting seems to improve overall performance!

Aurornis 2 hours ago

This is not observable from LLM inference, where you would not encounter uniform matrices.

Power limiting does not improve performance but it does improve efficiency. You might be able to get 90% of the performance for only 70% of the power usage, for example. It does not make the card go faster though.
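The arithmetic behind that example, as a quick sketch. The 90% and 70% figures are the illustrative numbers from the comment, not measurements, and `perf_per_watt_gain` is a name made up here.

```python
def perf_per_watt_gain(perf_fraction, power_fraction):
    """Performance per watt relative to the uncapped card (1.0 = unchanged)."""
    return perf_fraction / power_fraction


gain = perf_per_watt_gain(0.90, 0.70)
print(f"{gain:.2f}x performance per watt")  # prints "1.29x performance per watt"
```

So the capped card does about 29% more work per watt while still finishing each job more slowly.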

Lerc an hour ago

When thermal throttling occurs you can perform faster by running slower.

This is precisely because of the efficiency. The lower efficiency at the higher speed triggers much lower performance sooner.

Aurornis 42 minutes ago

> When thermal throttling occurs you can perform faster by running slower.

This is not true unless the throttling algorithm is so broken that it's oscillating between extremes.

The parts have a curve of clock speed versus voltage. More clock speed means higher performance. That goes further up the voltage curve, meaning more power.

Throttling just moves the card further down the voltage to clock speed curve. It reduces clock speed, reducing performance.

The cards don't "perform faster by running slower". If you run the card slower, it performs slower.

PcChip 18 minutes ago

With a lower power cap set, it runs cooler, which sometimes allows the GPU to reach higher boost speeds. This is a real effect on gaming GPUs. However, I have no idea if it applies to datacenter GPUs.

gchamonlive 4 hours ago

In general, constraints require optimizations and rearchitectures. I'd also expect the RAM shortage, for instance, to have a big impact on the software industry as a whole, especially in games. They will need to make do with what people have: a PS5/Pro or the equivalent in PC power.

aNoob7000 3 hours ago

I actually think it is a good thing to introduce constraints to AI and the overall tech industry. Hopefully everyone will have to look at improving performance without having to add RAM or increase CPU/GPU performance.

gchamonlive an hour ago

As long as these constraints are for everyone, not "for thee and not for me", and don't become an instrument for big tech to keep consumers dependent on their infra.

bitwize 2 hours ago

It wouldn't surprise me to see some ML algorithm in silico somewhere to select faster matmul paths on favorable data. Yo dawg, I heard you like AI, so we put some AI in your AI so you can infer while you're inferring.

Nevermark an hour ago

Here is one: an adjustment to weight updates that makes it more likely for weights to stay uniformly distributed.

~257.5 teraflops for normal distribution, versus ~268 teraflops uniform, reported on the first graph.

I would have liked to see a straight graph of performance vs. clock speed, for each type of data. Pick your data statistics, then pick the peak performance clock speed accordingly.

And for actual runs, from a pre-run sampled curve.

falcor84 an hour ago

And there's at least one more level of inception at the data center level, where they use AI to optimize power usage (particularly by predictively controlling cooling, and adaptively rescheduling tasks).

evanjrowley 2 hours ago

Designing for predictable execution flow is one of the advantages of Tenstorrent hardware.

https://clehaxze.tw/gemlog/2025/04-21-programming-tensotrren...

https://clehaxze.tw/gemlog/2026/01-22-the-real-tenstorrent-t...

https://arxiv.org/html/2604.03279
