Nano Banana can be prompt engineered for nuanced AI image generation

minimaxir | 661 points

I have been generating a few dozen images per day for storyboarding purposes. The more I try to perfect it, the easier it becomes to control the outputs and keep the entire visual story and its characters consistent over a few dozen different scenes, even controlling the time of day throughout the story. I am currently working with 7-layer prompts to control for environment, camera, subject, composition, light, colors, and overall quality (it might be overkill, but it's also an experiment).
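
Roughly, the layer stack gets assembled like this (a simplified sketch with illustrative values, not my actual prompts):

  # Simplified sketch of a 7-layer prompt template (illustrative values only).
  LAYERS = {
      "environment": "rain-soaked market street at the edge of the old town",
      "camera": "35mm lens, low angle, shallow depth of field",
      "subject": "the courier from scene 3, same red jacket and satchel",
      "composition": "subject off-center left, leading lines from the lanterns",
      "light": "blue hour, warm practical lights, soft rim light",
      "colors": "muted teal and amber palette",
      "quality": "highly detailed, clean edges, no text artifacts",
  }

  prompt = " ".join(f"{name.upper()}: {value}." for name, value in LAYERS.items())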

I also created a small editing suite for myself where I can draw bounding boxes on images when they aren't perfect and have them fixed, either with just a prompt or by feeding them to Claude as images and having it write the prompt that fixes the issue for me (as a workflow on the API). It's been quite a lot of fun to figure out what works. I am incredibly impressed by where this is all going.

Once you do have good storyboards, you can easily do start-to-end GenAI video generation (hopping from scene to scene), bring them to life, and build your own small visual animated universes.

Genego | 11 hours ago

I like the Python library that accompanies this: https://github.com/minimaxir/gemimg

I added a CLI to it (using Gemini CLI) and submitted a PR, you can run that like so:

  GEMINI_API_KEY="..." \
  uv run --with https://github.com/minimaxir/gemimg/archive/d6b9d5bbefa1e2ffc3b09086bc0a3ad70ca4ef22.zip \
    python -m gemimg "a racoon holding a hand written sign that says I love trash"
Result in this comment: https://github.com/minimaxir/gemimg/pull/7#issuecomment-3529...

simonw | 13 hours ago

This works with the OpenRouter API as well, which skips having to make a Google account etc. Here's a Claude-coded OpenRouter-compatible adaptation which seems to work fine: https://github.com/RomeoV/gemimg

A 1024x1024 image seems to cost about 3 cents to generate.

Bromeo | 23 minutes ago

Good read, minimaxir! From the article:

> Nano Banana supports a context window of 32,768 tokens: orders of magnitude above T5’s 512 tokens and CLIP’s 77 tokens.

In my pipeline for generating highly complicated images (particularly comics [1]), I take advantage of this by sticking a Mistral 7B LLM in between that takes a given prompt as input and creates 4 variations of it before sending them all out.
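
The expansion step itself is small; roughly this, assuming Mistral 7B is served behind a local OpenAI-compatible endpoint (URL and model name are placeholders for whatever your server uses):

  from openai import OpenAI

  # Local OpenAI-compatible server (llama.cpp, vLLM, Ollama, ...) hosting Mistral 7B.
  llm = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

  def expand_prompt(prompt: str, n: int = 4) -> list[str]:
      """Ask the small LLM for n richer variations of a base image prompt."""
      resp = llm.chat.completions.create(
          model="mistral-7b-instruct",
          messages=[
              {"role": "system", "content": "Rewrite the user's image prompt. "
               "Return one variation per line, with no numbering or commentary."},
              {"role": "user", "content": f"Give {n} variations of: {prompt}"},
          ],
          temperature=0.9,
      )
      lines = [line.strip() for line in resp.choices[0].message.content.splitlines() if line.strip()]
      return lines[:n]

  # Each variation is then sent to the image model as its own generation.
  variants = expand_prompt("a comic panel of a robot arguing with a tortoise")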

> Surprisingly, Nano Banana is terrible at style transfer even with prompt engineering shenanigans, which is not the case with any other modern image editing model.

This is true - though I find it works better when you provide a minimum of two images. The first image is intended to be transformed, and the second image is used as a "stylistic aesthetic reference". This doesn't always work since you're still bound by the original training data, but it is sometimes more effective than attempting to type out a long flavor-text description of the style.
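
In practice the two-image call is just an ordered contents list; a rough sketch with the google-genai SDK (model id, file names, and prompt wording are illustrative):

  from google import genai
  from PIL import Image

  client = genai.Client()  # expects GEMINI_API_KEY in the environment

  source = Image.open("panel.png")     # the image to be transformed
  style_ref = Image.open("style.png")  # the stylistic/aesthetic reference

  response = client.models.generate_content(
      model="gemini-2.5-flash-image",  # Nano Banana's API model id (may vary)
      contents=[
          "Redraw the first image. Use the second image only as a stylistic "
          "reference: match its linework, palette, and shading, but keep the "
          "first image's subjects and layout unchanged.",
          source,
          style_ref,
      ],
  )

  # Image parts come back as inline bytes alongside any text parts.
  for i, part in enumerate(response.candidates[0].content.parts):
      if part.inline_data is not None:
          with open(f"styled_{i}.png", "wb") as f:
              f.write(part.inline_data.data)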

[1] - https://mordenstar.com/portfolio/zeno-paradox

vunderba | 10 hours ago

Use Google AI Studio to submit requests. To remove the watermark, open the browser developer tools, right-click the request for the “watermark_4” image, and select the option to block it. From the next generation onward there will be no watermark!

dostick | 14 hours ago

The author overlooked an interesting error in the second skull pancake image: the strawberry is on the right eye socket (to the left of the image), and the blackberry is on the left eye socket (to the right of the image)!

This looks like it's caused by 99% of relative directions in image descriptions being given from the viewer's point of view, and by 99% of the ones that aren't referring to a human rather than to a skull-shaped pancake.

mFixman | 13 hours ago

I was kind of surprised by this line:

>Nano Banana is terrible at style transfer even with prompt engineering shenanigans

My context: I'm kind of fixated on visualizing my neighborhood as it would have appeared in the 18th century. I've been doing it in SketchUp, and then in Twinmotion, but neither of those produces "photorealistic" images... Twinmotion can get pretty close with a lot of work, but that's easier with modern architecture than it is with the more hand-made, brick-by-brick structures I'm modeling out.

As different AI image generators have emerged, I've tried them all in an effort to add the proverbial rough edges to snapshots of the models I've created, and it was not until Nano Banana that I ever saw anything even remotely workable.

Nano Banana manages to maintain the geometry of the scene, while applying new styles to it. Sometimes I do this with my Twinmotion renders, but what's really been cool to see is how well it takes a drawing, or engraving, or watercolor - and with as simple a prompt as "make this into a photo" it generates phenomenal results.

Similarly to the Paladin/Starbucks/Pirate example in the link though, I find that sometimes I need to misdirect a little bit, because if I'm peppering the prompt with details about the 18th century, I sometimes get a painterly image back. Instead, I'll tell it I want it to look like a photograph of a well preserved historic neighborhood, or a scene from a period film set in the 18th century.

As fantastic as the results can be, I'm not abandoning my manual modeling of these buildings and scenes. However, Nano Banana's interpretation of contemporary illustrations has helped me reshape how I think about some of the assumptions I made in my own models.

leviathant | 14 hours ago

"prompt engineered"...i.e. by typing in what you want to see.

ml-anon | 14 hours ago

It's really nice to see long-form, obviously human-written blogs from people deep into the LLM space - maybe we writers will be around for a while yet, in spite of all the people saying we've been replaced.

sixhobbits | 2 hours ago

My personal project is illustrating arbitrary stories with consistent characters and settings. I've rewritten it at least 5 times, and Nano Banana has been a game-changer. My kids are willing to listen to much more sophisticated stories as long as they have pictures, so I've used it to illustrate text like Ender's Game. Unfortunately, it's getting harder to legally acquire books in a format you can feed to an LLM.

I first extract all the entities from the text, generate characters from an art style, and then start stitching them together into individual illustrations. It works much better with NB than with anything else I tried before.
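
The stitching step is roughly this shape; a simplified sketch with the google-genai SDK (file names, style wording, and the model id are illustrative, not my actual code):

  from google import genai
  from PIL import Image

  client = genai.Client()  # expects GEMINI_API_KEY in the environment
  MODEL = "gemini-2.5-flash-image"  # Nano Banana's API model id (may vary)

  # Character sheets generated once per extracted entity, then reused as
  # reference images in every scene to keep the designs consistent.
  characters = {
      "Ender": Image.open("ender_sheet.png"),
      "Valentine": Image.open("valentine_sheet.png"),
  }

  def illustrate(scene_text: str, present: list[str], out_path: str) -> None:
      refs = [characters[name] for name in present]
      prompt = (
          "Illustrate this scene in the same storybook style as the attached "
          f"character sheets, keeping each character's design unchanged: {scene_text}"
      )
      response = client.models.generate_content(model=MODEL, contents=[prompt, *refs])
      for part in response.candidates[0].content.parts:
          if part.inline_data is not None:
              with open(out_path, "wb") as f:
                  f.write(part.inline_data.data)

  illustrate("They talk quietly on the raft at the lake.", ["Ender", "Valentine"], "scene_12.png")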

achatham | 7 hours ago

I tried asking for a shot from a live-action remake of My Neighbor Totoro. This is a task I’ve been curious about for a while. Like Sonic, Totoro is the kind of stylized cartoon character that can’t be rendered photorealistically without a great deal of subjective interpretation, which (like in Sonic’s case) is famously easy to get wrong even for humans. Unlike Sonic, Totoro hasn’t had an actual live-action remake, so the model would have to come up with a design itself. I was wondering what it might produce – something good? something horrifying? Unfortunately, neither; it just produced a digital-art style image, despite being asked for a photorealistic one, and kept doing so even when I copied some of the keyword-stuffing from the post. At least it tried. I can’t test this with ChatGPT because it trips the copyright filter.

comex | 12 hours ago

In my own experience, nano banana still has the tendency to:

- make massive, seemingly random edits to images

- adjust image scale

- make very fine-grained but pervasive detail changes obvious in an image diff

For instance, I have found that nano-banana will sporadically add a (convincing) fireplace to a room or a new garage behind a house, even with explicit "ALL CAPS" instructions not to do so. This happens even when the temperature is set to zero, and it makes it impossible to build a reliable app.
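
For reference, this is roughly the call shape, with the sampler pinned; a sketch with the google-genai SDK (model id, image, and instruction are illustrative):

  from google import genai
  from google.genai import types
  from PIL import Image

  client = genai.Client()  # expects GEMINI_API_KEY in the environment

  response = client.models.generate_content(
      model="gemini-2.5-flash-image",  # Nano Banana's API model id (may vary)
      contents=[
          "Replace the sofa with a leather armchair. Do NOT add, remove, or move "
          "anything else in the room.",
          Image.open("room.png"),
      ],
      config=types.GenerateContentConfig(temperature=0.0),  # still not deterministic in practice
  )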

Has anyone had a better experience?

peetle | 13 hours ago

Nano Banana can be frustrating at times. Yesterday I tried to get it to do several edits to an image, and it would return pretty much the same photo.

Things like: Convert the people to clay figures similar to what one would see in a claymation.

And it would think it did it, but I could not perceive any change.

After several attempts, I added "Make the person 10 years younger". Suddenly it made a clay figure of the person.

BeetleB | 11 hours ago

For images of people generated from scratch, Nano Banana always adds a background blur; it can't seem to create more realistic or candid images, such as those taken with a point-and-shoot or smartphone. Has anyone solved this sort of issue? It seems to work alright if you give it an existing image to edit, however. I saw some other threads online about it, but I didn't see anyone come up with solutions.

satvikpendem | 13 hours ago

I use it for technical design docs, where I sketch out something on paper and ask Nano Banana to make a flow chart; it's incredibly good at this kind of editing. (Also, if you want to borrow an image from someone and change some bridges, that's usually hard since it's an embedded image, but Nano Banana solves that.)

AuthError | 11 hours ago

It's really cool how good of a job it did rendering a page given its HTML code. I was not expecting it to do nearly as well.

sebzim4500 | 14 hours ago

>> "The image style is definitely closer to Vanity Fair (the photographer is reflected in his breastplate!)"

I didn't expect that. I would have definitely counted that as a "probably real" tally mark if grading an image.

sejje | 12 hours ago

There's lots these models can do, but I despise when people suggest they can do edits "with only the necessary aspects changed".

No, that simply is not true. If you actually compare the before and after, you can see it still regenerates all the details of the "unchanged" aspects. Texture, lighting, sharpness, even scale: it's all different, even if varyingly similar to the original.

Sure, they're cute for casual edits, but it really pains me when people suggest these things are suitable replacements for actual photo editing. Especially when it comes to people, or details outside their training data, there's a lot of nuance that can be lost as it regenerates them, no matter how you prompt things.

Even if you

miladyincontrol | 14 hours ago

The blueberry and strawberry are not actually where they were prompted to be.

ainiriand | 13 hours ago

> Nano Banana is still bad at rendering text perfectly/without typos as most image generation models.

I figured out that if you write the text in Google Docs and share the screenshot with Banana, it will not make any spelling mistakes.

So, something like "can you write my name on this Wimbledon trophy, both images are attached. Use them" will work.

mkagenius | 14 hours ago

Well, I just asked it for a 13-sided irregular polygon (is it that hard?)…

https://imgur.com/a/llN7V0W

pfortuny | 14 hours ago

I found this well written. I read it start to finish. The author does a good job of taking you through their process.

4b11b4 | 10 hours ago

Regarding buzzword usage:

"YOU WILL BE PENALIZED FOR USING THEM"

That is disconcerting.

sigspec | 4 hours ago

Created a tool you can try out! Sorry to self-plug, but I launch on Product Hunt next week; it lets you do this :)

www.brandimagegen.com

If you want a premium account to try out, you can find my email in my bio!

smerrill25 | 10 hours ago

> It’s one of the best results I’ve seen for this particular test, and it’s one that doesn’t have obvious signs of “AI slop” aside from the ridiculous premise.

It’s pretty good, but one conspicuous thing is that most of the blueberries are pointing upwards.

layer8 | 13 hours ago

The kicker for Nano Banana is not prompt adherence, which is a really nice-to-have, but the fact that it's either working in pixel space or with a really low spatial scaling factor. It's the only model that doesn't kill your details with VAE encode/decode.

BoredPositron | 14 hours ago

Cute. What’s the use case?

tomalbrc | 11 hours ago

I really wish the real expert stuff, like how to use ControlNet, regional prompting, or most other advanced ComfyUI techniques, got upvoted to the top instead.

Der_Einzige | 11 hours ago

how did you do NSFW?

icemelt8 | 10 hours ago

I'm getting annoyed by "prompt engineered" being used as a verb. Does this mean I'm finally old and bitter?

(Do we say we software engineered something?)

squigz | 14 hours ago

Another thing it can't do is remove reflections in windows; it's nearly a no-op.

roywiggins | 12 hours ago

I haven't paid much attention to image generation models (not my area of interest), but these examples are shockingly good.

insane_dreamer | 12 hours ago

This article was a good read, but the writer doesn't seem to understand how model-based image generation actually works, using language that suggests the image is somehow progressively constructed the way a human would do it. Which is absurd.

I've noticed a lot of this misinformation floating around lately, and I can't help but wonder if it's intentional?

empressplay | 8 hours ago

lots of words

okay, look at imagen 4 ultra:

https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...

In this link, Imagen is instructed to render the verbatim prompt “the result of 4+5”, and it shows that text; when not so instructed, it renders “4+5=9”.

Is Imagen thinking?

Let's compare to gemini 2.5 flash image (nano banana):

look carefully at the system prompt here: https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...

Gemini is instructed to reply in images first, and if it thinks, to think using the image thinking tags. It seemingly cannot be prompted to show the verbatim text “the result of 4+5” without showing the answer “4+5=9”. Of course it can show whatever exact text you want; the question is, does it rewrite the prompt (no) or do something else (yes)?

compare to ideogram, with prompt rewriting: https://ideogram.ai/g/GRuZRTY7TmilGUHnks-Mjg/0

without prompt rewriting: https://ideogram.ai/g/yKV3EwULRKOu6LDCsSvZUg/2

We can do the same exercises with Flux Kontext for editing versus Flash-2.5, if you think that editing is somehow unique in this regard.

Is prompt rewriting "thinking"? My point is, this article can't answer that question without dElViNg into the nuances of what multi-modal models really are.

doctorpangloss | 14 hours ago

I don't feel like I should search for "nano banana" on my work laptop.

jdc0589 | 12 hours ago