Hacker News

rvz
Gemma 4 12B: A unified, encoder-free multimodal model (blog.google)

ethanpil 4 minutes ago

What's Google's business case for releasing open models? Don't get me wrong, I am grateful and appreciative of these releases. I'm trying to understand how it fits into their bigger picture as a for-profit company. Aren't they helping competitors build on the novel technology they have developed?

Is it simply goodwill and/or marketing? Or am I missing something strategic?

mmarian a minute ago

Marketing + professional services, if I had to take a guess.

minimaxir 14 minutes ago

The big story here is the encoder-free part, which I still don't fully understand.

> Vision: We replaced Gemma 4’s vision encoder with a lightweight embedding module consisting of a single matrix multiplication, positional embedding and normalizations.

That's technically encoding, just without using a dedicated model for it like SigLIP? The Developer's Guide elaborates: it's still a 35M layer, and I am curious whether it is robust enough. https://developers.googleblog.com/gemma-4-12b-the-developer-...

> Small enough to run locally on consumer laptops with 16GB of RAM, it unlocks powerful multimodal and agentic experiences right on your machine.

I am assuming that involves quantization, which due to the quality loss makes that statement somewhat misleading IMO.

kristjansson 9 minutes ago

> quantization

12B means 12 GB at 8 bits/param (basically lossless) and 6 GB at 4 bits/param (the generally accepted 'pretty close' level). Not too bad?

But it's TBD how well the base model performs, so it's early to think too much about quantization.
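
The arithmetic above can be checked with a small sketch. It assumes "12B" means exactly 12e9 parameters and counts the weights only; KV cache, activations, and runtime overhead are ignored, so real usage would be higher. The function name and constants are illustrative, not taken from any Gemma tooling.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 12e9      # "12B" taken at face value (assumption)
SYSTEM_RAM_GB = 16   # laptop RAM figure quoted from the announcement

for bits in (16, 8, 4):
    gb = weight_memory_gb(N_PARAMS, bits)
    share = gb / SYSTEM_RAM_GB
    print(f"{bits:>2} bits/param: {gb:.0f} GB of weights ({share:.0%} of {SYSTEM_RAM_GB} GB RAM)")
```

At 16 bits/param the weights alone exceed 16 GB, which is why the 16GB-laptop claim implies some quantization.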

reactordev 7 minutes ago

It actually works well because, unlike with a separate encoder, the latent space is trained on that initial layer, so it “knows” what to do with that sparse density. I’ve been using gemma4-12b with Flux2, and its ability to reason on visual input is pretty good. That said, each model is good in its own way, so YMMV, but overall it’s about as solid as Qwen, just with a more advanced architecture.

wolttam 5 minutes ago

I think the idea is that the model is seeing embeddings that map directly to underlying pixel data, rather than being fed semantically rich embeddings from an encoder model which itself had seen the raw pixel data.

jszymborski 8 minutes ago

Totally agree that it is "encoding" in the general sense, but I think they are referring to the lack of an "encoder" neural network.

minimaxir 5 minutes ago

In hindsight I may have been pedantic.

LarsDu88 7 minutes ago

Well, it's a real simple encoder, I guess.

GaggiX 9 minutes ago

> That's technically encoding

Isn't that just projecting the patches into the d_model-sized vectors that the model takes?

> I am assuming that involves quantization

A 12B model in 16GB seems very reasonable to me; int8 is top quality for running models.

minimaxir 3 minutes ago

The guide describes it as projection, although there is apparently an extra step: "A factorized coordinate lookup (X and Y matrices) attaches spatial location information directly to the input."

12B at int8 would take up 12GB of memory, or 75% of system memory, which technically fits within 16GB, but the OS will not like that.
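
The projection plus factorized coordinate lookup quoted from the guide can be sketched roughly as below. This is a toy illustration in plain Python, not Gemma's implementation: the patch size, `d_model`, the random weights, the use of RMS normalization, and the order of the steps are all assumptions, and every function name here is hypothetical.

```python
import math
import random


def rms_norm(vec, eps=1e-6):
    """Scale a vector to roughly unit root-mean-square (no learned gain)."""
    rms = math.sqrt(sum(v * v for v in vec) / len(vec) + eps)
    return [v / rms for v in vec]


def patchify(image, patch):
    """Split an H x W x C nested-list image into (grid_y, grid_x, flat_pixels).

    Assumes H and W are multiples of the patch size.
    """
    h, w = len(image), len(image[0])
    patches = []
    for py in range(0, h, patch):
        for px in range(0, w, patch):
            flat = [c
                    for y in range(py, py + patch)
                    for x in range(px, px + patch)
                    for c in image[y][x]]
            patches.append((py // patch, px // patch, flat))
    return patches


def embed_patches(image, patch, proj, x_table, y_table):
    """Project raw patches to d_model, add X and Y position rows, normalize."""
    d_model = len(proj[0])
    tokens = []
    for gy, gx, flat in patchify(image, patch):
        # The "single matrix multiplication": (patch_dim,) @ (patch_dim, d_model)
        vec = [sum(flat[i] * proj[i][j] for i in range(len(flat)))
               for j in range(d_model)]
        # Factorized position: one row looked up per column, one per row
        vec = [v + x_table[gx][j] + y_table[gy][j] for j, v in enumerate(vec)]
        tokens.append(rms_norm(vec))
    return tokens


# Toy sizes chosen only for illustration.
rng = random.Random(0)
H, W, C, PATCH, D_MODEL = 8, 8, 3, 4, 16
image = [[[rng.random() for _ in range(C)] for _ in range(W)] for _ in range(H)]
proj = [[rng.gauss(0, 0.02) for _ in range(D_MODEL)] for _ in range(PATCH * PATCH * C)]
x_table = [[rng.gauss(0, 0.02) for _ in range(D_MODEL)] for _ in range(W // PATCH)]
y_table = [[rng.gauss(0, 0.02) for _ in range(D_MODEL)] for _ in range(H // PATCH)]

tokens = embed_patches(image, PATCH, proj, x_table, y_table)
print(len(tokens), len(tokens[0]))  # 4 patches, each a 16-dimensional vector
```

The factorized tables store one row per column index and one per row index, so position costs grow with width plus height rather than width times height.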

nickandbro 13 minutes ago

Wow, Google is becoming the new pre-Llama 4 Meta when it comes to releasing open-weights models.

redman25 2 minutes ago

IDK, this model release is a bit disappointing considering the community has been chomping at the bit for the 124ba4b model. There was some leaked info about it, but people suspect it was not released because it was too close to Gemini Flash in performance.

embedding-shape 5 minutes ago

I dunno, it feels a bit unfair to companies that actually do FOSS releases (Gemma 4 being released under the Apache 2.0 license) to compare them to a company that has never done any FOSS releases and has mostly done proprietary "available to download" releases.
