Hacker News

DSemba

Moebius: 0.2B image inpainting model with 10B-level performance hustvl.github.io

simonw16 hours ago

I got this working with ONNX (thanks, Claude Opus 4.8) and now I have an interactive demo of the model running entirely in the browser here (~1.3GB download): https://simonw.github.io/moebius-web/ - code here: https://github.com/simonw/moebius-web

(Claude Code transcript: https://gisthost.github.io/?58039ba5c1ca3ed177e8659168996ee4)

Wrote this up in more detail on my blog: https://simonwillison.net/2026/Jun/22/porting-moebius/

g5889288110 hours ago

well done!

unet weights are in fp32. did you by any chance try something lower, fp16?

da_grift_shift3 hours ago

The model considered it.

There are 25 or so mentions of fp16 and fp32 weights across the 7500+ words of Markdown text it generated. So the next question might be: Did it make the right calls?

https://github.com/simonw/moebius-web/blob/main/notes.md

https://github.com/simonw/moebius-web/blob/main/plan.md

https://github.com/simonw/moebius-web/blob/main/research.md

https://github.com/simonw/moebius-web/blob/main/understandin...

K0IN16 hours ago

Awesome, I wanted to do the exact same thing (used gpt 5.5 + code) but it didn't get the model to work in onnx...

Zopieux an hour ago

Not great. The inpainted areas are, as usual, very smooth compared to the detailed, "high frequency" look of natural photos.

Barely useful enough to erase things in thumbnails.

lifthrasiir20 hours ago

Tried it a bit, and while it is very impressive for a 0.2B model, it would be very hard to convince me that this matches 10B models. It did work reasonably well with natural images, but inpainted regions were visibly smoother than their surroundings, and it performed very badly on novel objects. It is also limited to 512x512 output, which limits its practical usefulness.

amelius17 hours ago

Do you think the provided examples are representative of its performance, or do you think they were cherry picked?

lifthrasiir17 hours ago

Given its limited output dimension it's hard to tell. I haven't exactly tested fine-tuned variants but I think they would work well under certain situations. After all, some (possibly cherry-picked) examples still exhibit similar problems when you inspect them in detail.

nickandbro9 hours ago

Here is a little app I made that runs entirely in your browser and lets you experiment with all of the fine-tuned models:

https://inpaintlab.com/

james2doyle21 hours ago

There are some demo spaces using this. This one seems the best (paint your own mask) but it failed on all the images I tried: https://huggingface.co/spaces/multimodalart/Moebius

hex4def617 hours ago

I've been playing around and got it to work, although quality was a bit crappy. Still playing around with the settings that get exposed, but you're welcome to look at: https://huggingface.co/spaces/jonatei/MoebiusDemo

Note that I'm actively messing with it, so it may break for short periods of time :)

It's also running on the free CPU, so it's like 80 seconds per image...

xrd19 hours ago

I did an inpainting project for a client a few years ago. They were trying to inpaint banner ads for concert promoters, and find a way to make it easy to produce a bunch of different sized ads for a variety of placements. I was tasked with inpainting an Xmas-themed ad for a few major singers.

The weirdest thing was when the inpainting tool added strange people to an image. This singer was all decked out in tinsel and red, and the inpainting model added a grumpy old man in a top hat. I don't recall clicking the "Add creepy old man" button.

At the time this was Stable Diffusion on the backend, run by a variety of model hosting services, Amazon being one. They all had different requirements for the input image and that made things really complex. For some the aspect ratio was impossible to meet, and it would fail if the banner was 200x60. For others, you had to resize it before input, which meant you were adding an image with poor resolution to start. Garbage in, garbage out.

All of this to say, there is a lot of preproduction that went into it, and the client never ended up using my attempts.

Yokohiii2 hours ago

> At the time this was Stable Diffusion on the backend

The community-made models (merges, fine tunes, etc) of that era are all completely overtrained and optimized for portraits and frontal shots. They would try to make a person out of anything. Inpainting faces is already a chore, even with a lot of tooling around that, but inpainting anything else is almost impossible. These models are also especially bad at fitting objects naturally into scenes. You can make a crappy necklace or belt work, but introducing a new object into a scene just fails with infinite variety.

They are also much better using 512x512 as resolution, any larger deviation introduces more problems.

Considering you wanted to inpaint banner ads, they would probably get distorted heavily. Those models can't deal with fonts and are bad at pixel-perfect transfers. The only viable way to do this, at that time, would be to manually insert the banner ads and fix the seams with AI. Requires some artistic skill of course.

Your attempt was bold, but with the expectation of just supplying two images and letting the models do it, it was impossible.

anigbrowl15 hours ago

> This singer was all decked out in tinsel and red, and the inpainting model added a grumpy old man in a top hat. I don't recall clicking the "Add creepy old man" button.

Obvious reference to the Dickens story A Christmas Carol. In the UK there's a bylaw that requires Christmassy events to hire a Scrooge-like figure to lurk in the background so people keep their enthusiasm in check.

giancarlostoro19 hours ago

> For others, you had to resize it before input, which meant you were adding an image with poor resolution to start.

That's because small models like SD (Stable Diffusion) are trained on very specific resolutions; it's the fancier models that are trained on higher quality, or more diverse sets of resolutions. If you use a higher quality model to generate lower resolution images, what's actually happening is you're trimming a much bigger image and getting a chunk of it output, at least that's how it feels based on my many hours of experimenting. If I use major models and try to center a thing, I never see it in the center. :) My GPU can only handle so much.

vunderba18 hours ago

So traditionally, the way you’d do this (and why some UIs like automatic1111 let you configure inpainting so flexibly) is that you didn’t have to shrink the entire image.

The general idea was: you mask the area you want changed, and the model inpaints that region at full resolution. The advantage of masking, compared to plain img2img, is that you’re not sending the entire picture to the model.

With the classic setups like SD 1.5 and SDXL, you’d effectively inpaint at full resolution: take the masked area from a larger image, scale just that region to the model’s native resolution, process it at the full ~1 megapixel then scale it back and composite it into the original. This lets you add MORE detail.

Unfortunately if the OP is using hosted SD models, they might not have that granular control and thus would suffer pretty bad quality loss.
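The crop-and-scale workflow described above can be sketched in a few lines. This is a minimal illustration, not the code of any particular UI: the function names `mask_bbox` and `crop_scale`, the `padding` parameter, and the native size of 512 are all assumptions made for the example, and the actual resizing and model call are left out.

```python
def mask_bbox(mask, padding=0):
    """Return (left, top, right, bottom) around the masked pixels.

    `mask` is a list of rows of 0/1 values. Right and bottom are exclusive.
    `padding` (hypothetical parameter) adds surrounding context for the model.
    """
    height, width = len(mask), len(mask[0])
    rows = [y for y in range(height) if any(mask[y])]
    cols = [x for x in range(width) if any(mask[y][x] for y in range(height))]
    if not rows:
        raise ValueError("mask is empty")
    left = max(cols[0] - padding, 0)
    top = max(rows[0] - padding, 0)
    right = min(cols[-1] + 1 + padding, width)
    bottom = min(rows[-1] + 1 + padding, height)
    return left, top, right, bottom


def crop_scale(bbox, native_size):
    """Factor that maps the crop's longer side onto the model's native size."""
    left, top, right, bottom = bbox
    return native_size / max(right - left, bottom - top)


# Toy 8x8 mask with a 2x2 masked square.
mask = [[0] * 8 for _ in range(8)]
for y in (4, 5):
    for x in (2, 3):
        mask[y][x] = 1

bbox = mask_bbox(mask, padding=1)
scale = crop_scale(bbox, native_size=512)  # 512 is an assumed native size
print(bbox, scale)
```

Only the cropped region would be upscaled by `scale`, inpainted, scaled back down by `1 / scale`, and pasted at `(left, top)`, which is why the rest of the image keeps its original pixels.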

giancarlostoro18 hours ago

I was kind of speaking more in general I realized, not just strictly inpainting, but yeah that makes sense, though I've had inpainting also limited by the image being too big for my GPU to handle as well. I may be using it incorrectly though, not really experimented with much of that in a while, maybe when I get a newer gaming rig.

vunderba18 hours ago

Yeah, the landscape also changes a lot as well. It’s just really hard to keep up with everything. Especially if you’re using it casually because some of the UI wrappers (the Gradio-based ones) have more obscure knobs and dials than a TI‑82 calculator.

This is the image I always think of when first introducing someone to ComfyUI or even Automatic1111.

https://imgur.com/a/G0Xlznj

pattilupone19 hours ago

I want a version of this for manga (for translation). Right now I think the go-to lightweight inpainting model for anime and manga is LaMa which is several years old now and it feels like there is room for improvement.

matthewfcarlson19 hours ago

I've been working on trying to outpaint an animated program for my son (Leapfrog Letter Factory if you're curious) and then upscale it. Doing so locally has been actually fairly difficult. I wonder if you could retrain or fine tune this model. They mention building an expert, I wonder if that expert could understand more about translating various characters.

chatmasta15 hours ago

What is inpainting? Everyone in the comments seems to be familiar with the term, and I don’t see it described in the linked page.

torgoguys15 hours ago

Click on the visualizations to see it in action. The purple areas are areas a user highlighted to tell the system to inpaint, and when you click on the image you see the results of the inpainting. Basically the model redraws sections of an image (the purple areas) using the context of what's in the non-purple areas to decide what might look best in the purple areas. Often used for removing objects but as you can see in the examples it can do other things too.
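As a rough illustration of the idea, the final step of most inpainting pipelines is a per-pixel blend: masked pixels come from the model's output and everything else is kept from the original. The sketch below assumes grayscale images stored as nested lists and a hypothetical `composite` helper; it does not claim to reflect how Moebius itself combines the two.

```python
def composite(original, generated, mask):
    """Blend per pixel: mask 1 takes the generated value, mask 0 keeps the original."""
    return [
        [m * g + (1.0 - m) * o for o, g, m in zip(o_row, g_row, m_row)]
        for o_row, g_row, m_row in zip(original, generated, mask)
    ]


original = [[10, 20], [30, 40]]
generated = [[99, 99], [99, 99]]   # stand-in for model output
mask = [[0, 1], [0, 0]]            # only the top-right pixel is "purple"

result = composite(original, generated, mask)
print(result)
```

Fractional mask values between 0 and 1 would feather the edge of the edited region instead of producing a hard seam.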

NooneAtAll3 3 hours ago

> and when you click on the image

ah, bad UX

NooneAtAll3a day ago

I don't understand. Is it available somewhere to try or is it just an ad?

owebmastera day ago

Yeah it's great but how do I use it?

Edit: I think I found it https://huggingface.co/hustvl/Moebius

K0INa day ago

with this size we could have an interactive web demo.

james2doyle21 hours ago

Like this? https://huggingface.co/spaces/multimodalart/Moebius

IvanK_net18 hours ago

Were you able to make it work? It never works in my case.

teroshana day ago

Unrelated but when I read inpainting and Moebius I was scared it was related and using the art of the great Jean Giraud [0] a.k.a. Moebius

https://characterdesignreferences.com/artist-of-the-week-3/m...

[0] https://en.wikipedia.org/wiki/Jean_Giraud

coldteaa day ago

Scared why?

teroshana day ago

Scared for the same reason I found last year's 'Ghibli filter' craze upsetting, I would have personally hated to have seen this artist's legacy used for promoting AI image generation.

TeMPOraLa day ago

In case that happened then the rest of the world would probably appreciate the art, and a subset of it, the artist (and even a small subset of ~whole Internet-connected population is a lot of people). Some silver lining, perhaps.

teroshan18 hours ago

Perhaps.

I like the idea that a piece of art, in addition to ultimately ending up as pixels on my screen, is also a window into a world that has been dreamt up by real human imagination, driven by their hopes and fears.

Semiconductor-based generation may give me the first part, but not the second.

I'm speaking for myself here, I agree with your point though.

NooneAtAll3 3 hours ago

> I like the idea that a piece of art, in addition of ultimately ending up as pixels on my screen, is also a window into a world that has been dreamt up by real human imagination, driven by their hopes and fears.

I guess this actually defines the fringe between ai-art enjoyers and haters - some people prefer what art does to their imagination, while others look at what art does to others'

zmgsabst15 hours ago

You just refuse to see certain people’s hopes and fears because they didn’t express them in a way you personally find pleasing.

The LLMs didn’t prompt themselves.

bilekas5 hours ago

> The LLMs didn’t prompt themselves.

I refuse to accept that real humans believe prompting is art.

solid_fuel21 hours ago

> In case that happened then the rest of the world would probably appreciate the art

What art?

We’re talking about generated pictures, aka slop, not art made by a real human.

And I don’t know if you’ve been paying attention but people seem to be pretty tired of the slop. I don’t think it would be appreciated nearly as much as you think.

inigyou20 hours ago

It is possible to use generative AI in nonslop ways btw

TeMPOraL21 hours ago

This definition of "slop" doesn't cut reality just quite at the joints.

People are tired of marketing. The AI-generated slop people are annoyed with is garbage produced for marketing reasons, and it's distinctly noticeable precisely because all the bottom-feeder marketing houses switched to using it. But it's not the AI itself that's the problem here. Slop was here before, but it was made with cheap protein-based image generators. Silicon-based generators are just cheaper.

solid_fuel20 hours ago

> This definition of "slop" doesn't cut reality just quite at the joints.

> People are tired of marketing.

You know what, I'll give you that one. I find most generated art pretty tasteless, but I have enjoyed the occasional piece of fiction with small generated elements for atmosphere. I still hesitate to call it 'art', but I will grant it's not all 'slop'.

But for the second part:

> But it's not the AI itself that's the problem here. Slop was here before, but it was made with cheap protein-based image generators. Silicon-based generators are just cheaper.

I think the problem is how much cheaper it is now. I would estimate generating a picture is at least 2 orders of magnitude cheaper than paying even a cheap human, so with the same amount of money being invested into slop we are due for - and seeing - a huge tidal wave of it, because the same amount of money turns out way more crap now.

delis-thumbs-7ea day ago

This is the useful AI stuff. There are so many use cases this makes possible.

zerobees19 hours ago

Right, and that's what I find frustrating. There are so many use cases where a local, purpose-built model that's dependably good at one thing would really make a difference. But no one is going to throw a billion dollars to give us amazing dust removal, flawless scene segmentation, etc.

Instead, you're supposed to upload it to the cloud and ask a big, multimodal frontier model to maybe please do the thing you want and nothing else.

Yokohiii2 hours ago

I have the feeling that the cloud based providers are just using the freely available segmentation models. It's just speculation, but it doesn't seem to be top priority for them, so they'd just bolt on anything that works.

A problem is also that the cloud solutions need a complex UI to surface segmentation to the user. But the point you have there is that those models are probably not prime time ready yet, surfacing them would actually reveal they are not as powerful as the user expects. Destroying the illusion that AI can just do anything at will.

Someone5 hours ago

> There are so many use cases where a local, purpose-built model that's dependably good at one thing would really make a difference. But no one is going to throw a billion dollars to give us amazing dust removal, flawless scene segmentation, etc.

iPhones have models for text extraction and in-painting in the Photos App.

Both don’t have knobs to tune them, but, I think, they are decent for their intended audience (definitely not flawless, but I don’t think that exists anywhere, even if dropping the ‘local’ requirement)

For scene segmentation, iOS has models for detecting persons (https://developer.apple.com/documentation/Vision/segmenting-...).

It also has models for detecting faces, face features, body and hand poses, or for picking the ‘best’ selfie from a set.

(And dust removal is fairly niche compared to these, I think. Or do I overlook some common use case for it that many people want?)

somenameforme7 hours ago

You can do all of this locally on a cheap video card. Search for fooocus or automatic1111 for a couple of setups that are fairly low friction to get going. Amuse AI is another one. It's not quite state of the art and also censored, but it's by far the least friction (especially if you have an AMD card) - it's pretty much plug and play. ComfyUI is the advanced do-everything workhorse. However, it's anything but comfy if you don't already have a lot of knowledge about this domain. I'd generally recommend fooocus for a balance between usability and power/flexibility.

The million image gen services online are mostly just making bank off ignorance. People don't realize that their own cheap video cards are more than enough to do everything they're paying a service an orders of magnitude markup for.

krackers19 hours ago

The highest return small local model for me has been the in-built OCR that macOS has. It has finally "solved" OCR by making high-quality results accessible to everyone. Yet the state of the art outside the Apple ecosystem seems to be Tesseract (poor results), or extremely heavy VLMs.

crimsonnoodle5812 hours ago

PaddleOCR? Qwen3-VL 30B-A3B?

doctorpangloss21 hours ago

how many times have you edited a photo you took on your phone in the last 7 days?

stusmall21 hours ago

I think 3? I feel like that's often enough. Sometimes it's nice to do a quick dumb ass gag on a whim. If I am anything I am a man who loves a dumb ass gag.

inigyou20 hours ago

Good on you. I've laughed at many dumbass gags but I've only been a passive consumer of them.

stusmall17 hours ago

Become the dumbass change you want to see in the world

TeMPOraL21 hours ago

Half a dozen at least.

(I'm counting only times I used generative editing options in my Galaxy phone - if I were to take your question literally, it would be "at least once every other day", simply due to rotating and cropping.)

dogomatic21 hours ago

Personally, about 9 times. Would be higher if it was even easier and cheaper

epolanskia day ago

What is the current SOTA for inpainting?

I have a potential project for my e-commerce where I want to allow users to upload images of their house exteriors and inpaint awnings.

vunderbaa day ago

Proprietary? Either gpt-image-2 or NB2.

I have an example of interior decorating inpainting where I replaced a large floor-to-ceiling window with a mirror, and the result was pretty impressive using NB Pro from nearly a year ago.

https://imgpb.com/ZXkiXV

Locally hostable? For my money I'd argue Flux.2 Klein but Qwen-Edit still puts in the work.

CharlesWa day ago

NB2 means "Nano Banana 2", a Google image generation model. https://blog.google/innovation-and-ai/technology/ai/nano-ban...

woadwarrior0120 hours ago

For locally hostable image editing models, the edit variant of the recently released Boogu-Image[1] model is very good. Anecdotally, I'd say way better than Flux.2 Klein 9B and Qwen-Edit.

[1]: https://github.com/boogu-project/Boogu-Image

IAmGraydona day ago

As far as I know, gpt-image-2 doesn't even let you define a mask unless you've already run it through one iteration, and once you do define the mask, it just ignores it 90% of the time. It's utterly useless for inpainting. Also, this and other proprietary models are severely limited in their output resolution.

I do agree, however, that the Flux2 family is the SoTA at the moment. Running locally via something like Comfy gets incredible results.

vunderbaa day ago

Yeah definitely. You can do workarounds like drawing circles or using highlighters to create pseudo-masks for use with OpenAI or Google models but it’s really just a visual indication more than anything.

If you want real precision (especially for complex polygonal masks), or if you’re concerned about image degradation over multiple edit rounds, you'll slam against the limitations of those approaches.

Even with SOTA proprietary models, repeatedly editing and re-uploading an image is like making a copy of a copy of a VHS tape: you're gonna see subtle color shifts and quality loss steadily accumulate.

At that point, you either need to put in the manual work in something like Photoshop (bringing elements in as layers and masking them properly) or, as you mentioned, use a model or workflow that properly supports masking.

TeMPOraLa day ago

Awnings, if I understand correctly (I just learned this word right now), are purely additive attachments to structure exteriors - so perhaps they wouldn't necessarily need a full inpainting model? Wouldn't it be enough to estimate an affine transform for a quad and blend the image of awning directly (and the same with shadow map to fake shade)? Is classical photogrammetry up to such task these days?

jdiff21 hours ago

I'm quite perplexed by this comment. If I'm understanding you correctly, sure, what you describe is possible through significantly more effort, orchestration, and source photos. Or we can grab one still image and throw an inpainting model at it.

epolanski20 hours ago

I have no idea but I think you might be onto something.

So you're saying that, if I can calculate from the picture the position (height, inclination and such), and I can render the model (should be doable) for that height and angle, my best course of action could be to combine original + render and only at the end use a visual model? That could be interesting.

TeMPOraL18 hours ago

More or less, yes. I was actually thinking about taking a high-resolution rendering of an awning directly facing the camera, and transforming that quad onto the user image. That requires computing the transformation matrix that would right the user image so the building is level and directly facing the camera, and then applying the inverse of that to the quad with your rendering. But I now realize I assumed user photos would mostly be nearly straight images of the building, not at large, odd angles. For the general case, you'd need a 3D model (even if approximate), apply the inverse transform to that, and then render it on top.

This idea rests on the assumption that my understanding of what "awnings" are is correct and matches your project, i.e. additive structures. In that case, your primary problem is adding pixels on top of the user image. Additive modifications are easy to pull off. Inpainting seems like overkill here; it's something that shines when you need to poke holes or replace some of the aspects of the original that is not covered by the part you're adding.

OTOH, it might still be that inpainting is your best bet for operational reasons - additive modification itself may not be a problem, but fixing lighting and shadows might, and current image generations models should handle this in stride.

(I say should because that's my expectation, but I have never tested any of the current models for the ability to fix shadows that cover areas similar to the targeted modification but lie beyond it. It might be that you'll still need a model and a transform estimate just to generate a shadow map as a hint for the model where it needs to act and how.)
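A minimal sketch of the additive approach, under stated assumptions: the awning texture is addressed by normalized coordinates `(u, v)` in the unit square, and three of its corners are placed by hand on the photo. That gives an affine map, which only covers parallelogram-shaped placements; a photo taken at a strong angle would need a full perspective transform (homography), which is not shown. The helper names and the corner coordinates are hypothetical.

```python
def affine_from_corners(p0, p1, p2):
    """Build the affine map sending (0,0)->p0, (1,0)->p1, (0,1)->p2.

    Returned as the rows of the 2x3 matrix [[a, b, c], [d, e, f]].
    """
    (x0, y0), (x1, y1), (x2, y2) = p0, p1, p2
    return ((x1 - x0, x2 - x0, x0), (y1 - y0, y2 - y0, y0))


def apply_affine(matrix, u, v):
    """Map texture coordinates (u, v) to photo pixel coordinates."""
    (a, b, c), (d, e, f) = matrix
    return (a * u + b * v + c, d * u + e * v + f)


def blend(background, overlay, alpha):
    """Alpha-blend a single channel value of the overlay onto the background."""
    return alpha * overlay + (1.0 - alpha) * background


# Hypothetical corner positions of the awning in the user's photo.
matrix = affine_from_corners((100, 50), (300, 70), (110, 150))
far_corner = apply_affine(matrix, 1, 1)   # the fourth corner is implied
shaded = blend(100, 200, 0.25)            # e.g. a faint overlay on a wall pixel
print(far_corner, shaded)
```

The same `blend` call with a dark overlay and a low alpha is one simple way to fake the shadow map mentioned above, though getting lighting right is exactly where a generative model may still be the easier option.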

epolanski8 hours ago

Yes, that's what we need to add (the red awning) to user provided pictures.

https://imgur.com/a/Y0Q4mfu

By the way, I have tried a handful of NB2 queries providing a reference image of the awning and of a user-uploaded garden-facing building, and I was very impressed by the results. I think that combining a 3D render of the awning at the right angle with NB2 should do great.

Thanks a lot for your help, your feedback has been crucial! Dziekuje!

BoredPositrona day ago

Flux Klein with a LoRA. GPT image and nano often produce high frequency artifacts when editing.

michaelfm121120 hours ago

> The core insight of Moebius can be summarized in a single equation: Synergy × (Architecture + Distillation) = Shattering the "Impossible Triangle" of Low Parameters, Fast Inference, and High Quality

Is it just me or is it weird seeing these clickbaity AI-generated taglines in an otherwise scientific work?

kevin_thibedeau18 hours ago

It signals a paradigm shift in vacuous prose.

dormento19 hours ago

It IS weird, but it "converts" (ugh...), that's why they keep coming.

Apart from this, the text details amazing work. Congrats.

soperj19 hours ago

After "In Good Company" I can't hear (or see) the word Synergy without cringing.

Jackson__17 hours ago

Judging by the performance of the shown examples, the quality is closer to pre-2022 Photoshop content aware fill than actual 10B models.

I think it is safe to say this is pretty far from a "scientific" work.

gspr21 hours ago

Nitpick: in the showcase on that page, under Comparison of Natural Scenes, Moebius should definitely get a "structural confusion" tag for the back of the surfboard. If other models get deducted for truncating the surfboard, then surely the elongation that Moebius does should count too.

Also, what's going on behind the in-painted corner of the house? We'd need to see higher resolution pictures, but I'm not convinced that it too shouldn't get a flag. Likewise with the beach just behind the surfboard. Not terrible, but what gets flagged in the competitors is similar.

N_Lensa day ago

The gallery of their samples is pretty impressive!

GL26a day ago

Could this run locally on a smartphone?

rasz21 hours ago

It sure has a thing for chins, jaws and removing weight, looksmaxxing built in.

hari112321 hours ago

A lot of the photo editors on mobiles have this, maybe even some apps?

zb3a day ago

1) What are RAM requirements?

2) If these are reasonable, a WebGPU demo would be great..

lifthrasiir20 hours ago

The total model size is about 1.2GB (UNet + SDXL VAE included), so probably ~3GB of RAM?
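For a back-of-the-envelope check on figures like these, raw weight size is just parameter count times bytes per parameter. The sketch below assumes the "0.2B" from the title refers to the UNet alone and uses a hypothetical helper `weight_size_gb`; it does not account for the VAE, activations, or runtime overhead, which is why actual RAM use is higher than the weight file.

```python
# Bytes per parameter for common weight dtypes.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_size_gb(num_params, dtype):
    """Raw weight footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


unet_params = 0.2e9  # assumption: the "0.2B" in the title covers the UNet only

for dtype in ("fp32", "fp16"):
    print(f"{dtype}: {weight_size_gb(unet_params, dtype):.2f} GB")
```

Under that assumption, halving the precision halves the download for the UNet part, which is the trade-off the fp16 question earlier in the thread is asking about.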
