Hacker News

Prompt caching: 10x cheaper LLM tokens, but how? (ngrok.com)
by samwho

aitchnyu 16 minutes ago

Took me a minute to see this is the same ngrok that provides freemium tunnels to localhost. How did they adapt to the AI revolution?


coderintherye an hour ago

Really well done article.

I'd note that when I gave the input/output screenshot to ChatGPT 5.2, it failed (with lots of colorful chain of thought), though Gemini got it right away.

est 3 hours ago

This is a surprisingly good read on how LLMs work in general.

wesammikhail 30 minutes ago

Amazing article. I was under the misapprehension that temperature and other sampling parameters affect caching. Turns out I was wrong, and this explains why beautifully.

Great work. Learned a lot!
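
To make the "why" concrete: the work that gets cached depends only on the token prefix, while temperature and top_p merely reshape the output logits after that forward pass. A rough sketch (prompt_cache_key here is hypothetical, not any provider's real implementation):

    import hashlib

    def prompt_cache_key(token_ids, temperature=1.0, top_p=1.0):
        # Hypothetical cache key: only the token prefix goes into it.
        # temperature/top_p are accepted just to show they are ignored --
        # they only reshape the output logits *after* the forward pass,
        # which is the part that can be cached and reused.
        return hashlib.sha256(repr(list(token_ids)).encode()).hexdigest()

    prefix = [101, 2023, 2003, 1037, 3231]  # example token ids
    # Same prefix with different sampling settings hits the same cache entry.
    assert prompt_cache_key(prefix, temperature=0.0) == prompt_cache_key(prefix, temperature=1.5)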

simedw 3 days ago

Thanks for sharing; you clearly spent a lot of time making this easy to digest. I especially like the tokens-to-embedding visualisation.

I recently had some trouble converting a HF transformer I trained with PyTorch to Core ML. I just couldn’t get the KV cache to work, which made it unusably slow after 50 tokens…
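
For anyone hitting the same wall: with Hugging Face models in PyTorch, the usual pattern is a manual decode loop that carries past_key_values forward so each step only processes the newest token; without that, every step re-runs attention over the whole sequence and generation crawls after a few dozen tokens. A rough sketch, with "gpt2" standing in as a placeholder model:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    ids = tok("Prompt caching is", return_tensors="pt").input_ids
    past = None  # the KV cache

    with torch.no_grad():
        for _ in range(50):
            if past is None:
                out = model(ids, use_cache=True)                # full prefix, once
            else:
                out = model(ids[:, -1:], past_key_values=past,  # only the new token
                            use_cache=True)
            past = out.past_key_values                          # reuse cached keys/values
            next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
            ids = torch.cat([ids, next_id], dim=-1)

    print(tok.decode(ids[0]))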

samwho (op) 2 days ago

Thank you so much <3

Yes, I recently wrote https://github.com/samwho/llmwalk and had a similar experience with cache vs no cache. It’s so impactful.

mrgaro 3 hours ago

Hopefully you can write the teased next article about how the feed-forward and output layers work. The article was super helpful for getting a better understanding of how GPT-style LLMs work!
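
Until that article exists, here is a rough, illustrative sketch of those two pieces in a GPT-style model (GPT-2-small sizes chosen purely for illustration): a position-wise feed-forward block that expands and re-projects each hidden state, and an output ("unembedding") layer that maps the final hidden state to one logit per vocabulary token.

    import torch
    import torch.nn as nn

    class FeedForward(nn.Module):
        # The position-wise MLP that follows attention inside each block.
        def __init__(self, d_model=768, d_hidden=3072):
            super().__init__()
            self.up = nn.Linear(d_model, d_hidden)    # expand
            self.act = nn.GELU()
            self.down = nn.Linear(d_hidden, d_model)  # project back down

        def forward(self, x):                         # x: (batch, seq, d_model)
            return self.down(self.act(self.up(x)))

    d_model, vocab_size = 768, 50257
    ffn = FeedForward(d_model)
    lm_head = nn.Linear(d_model, vocab_size, bias=False)  # output / unembedding layer

    hidden = torch.randn(1, 5, d_model)               # stand-in for attention output
    logits = lm_head(ffn(hidden))                     # (1, 5, vocab_size)
    print(logits.shape)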

Youden 2 days ago

The link seems to be broken: the content briefly loads, then is replaced with "Something Went Wrong" and then "D is not a function". It stays broken with adblock disabled.
