Pico-Banana-400K: Apple's Open Dataset for Image Editing

Apple releases Pico-Banana-400K, a 400,000-image open dataset for text-guided image editing — built from real photos using Google's Nano Banana, scored by Gemini, and available on GitHub under CC BY-NC-ND 4.0.

The biggest bottleneck in text-guided image editing isn't the models — it's the training data. Existing datasets are either small (human-curated) or synthetic (generated from proprietary models you can't redistribute). Apple just released one that's neither.

Pico-Banana-400K is a 400,000-image dataset for instruction-based image editing, built entirely from real photographs. The source images come from OpenImages. The edits were generated using Google's Nano Banana, then scored and filtered by Gemini 2.5 Pro. The result is openly available on GitHub under CC BY-NC-ND 4.0.

How It Was Built

Apple's pipeline has three stages. First, they selected real photographs from OpenImages — images of humans, objects, and text-bearing scenes. Second, they generated editing prompts using Gemini 2.5 Flash, with a system prompt instructing it to write natural, concise editing instructions grounded in what's actually visible in the image. Third, those prompts drove Nano Banana to produce edited versions. Gemini 2.5 Pro then evaluated each result on four criteria: instruction compliance (40% weight), editing realism (25%), preservation balance (20%), and technical quality (15%).
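The weighted scoring can be sketched as follows. Only the four criteria and their weights come from the paper; the 0.0–1.0 score scale, field names, and threshold are illustrative assumptions.

```python
# Sketch of the quality-scoring step. The weights are from Apple's
# description; the 0-1 scale and criterion keys are assumptions.
WEIGHTS = {
    "instruction_compliance": 0.40,
    "editing_realism": 0.25,
    "preservation_balance": 0.20,
    "technical_quality": 0.15,
}

def overall_score(scores: dict) -> float:
    """Combine per-criterion scores (each 0.0-1.0) into one weighted score."""
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)

example = {
    "instruction_compliance": 0.9,
    "editing_realism": 0.8,
    "preservation_balance": 0.7,
    "technical_quality": 0.6,
}
print(round(overall_score(example), 3))  # 0.79
```

An edit scoring below some cutoff would be routed to the preference subset rather than discarded — which is exactly what the next paragraph describes.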

Failed edits weren't discarded. About 56,000 of them were retained as a preference subset — useful for training reward models and alignment research.

The prompts themselves went through a compression step. The initial Gemini-generated instructions tend to be verbose. Apple ran them through Qwen 2.5-7B-Instruct to produce shorter, more human-like versions. Both long and short forms ship with the dataset.

What's Inside

The dataset covers 35 edit types organized into eight categories — pixel and photometric adjustments (change color tone), object-level semantics (relocate an object, swap a color), scene composition (add background), stylistic transformation (convert photo to sketch), and more. That taxonomy is the key differentiator. Previous synthetic datasets tend to cluster around a few edit types. Pico-Banana-400K was designed to cover the full space.

Subset                       | Size              | Purpose
Single-turn edits            | 257K examples     | Core training data for image editing models
Multi-turn edits             | 72K examples      | Sequential editing, reasoning, planning across consecutive modifications
Preference / failures        | 56K examples      | Alignment research, reward model training
Long-short instruction pairs | Paired with above | Instruction rewriting and summarization

The multi-turn subset is particularly interesting. Most image editing datasets give you a single instruction and a single result. Pico-Banana-400K includes sequences — edit A, then edit B on the result of A, then edit C. That's what you need for studying how models handle consecutive modifications without losing coherence.
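A multi-turn record can be pictured as a chain where each instruction applies to the previous turn's output. The field names and file paths below are hypothetical, not the dataset's actual schema:

```python
# Hypothetical layout of one multi-turn editing sequence; the real
# Pico-Banana-400K field names and paths may differ.
session = {
    "source_image": "openimages/abc123.jpg",
    "turns": [
        {"instruction": "Make the sky overcast.", "result": "edit_0.jpg"},
        {"instruction": "Add an umbrella to the person on the left.", "result": "edit_1.jpg"},
        {"instruction": "Convert the scene to a pencil sketch.", "result": "edit_2.jpg"},
    ],
}

# Each turn edits the previous turn's output, so supervised pairs are
# (previous image, instruction) -> next image.
pairs = []
prev = session["source_image"]
for turn in session["turns"]:
    pairs.append((prev, turn["instruction"], turn["result"]))
    prev = turn["result"]

for src, instr, dst in pairs:
    print(f"{src} + {instr!r} -> {dst}")
```

The point of the chain structure is that turn N's input is turn N-1's output, which is what lets you measure whether a model preserves earlier edits while applying new ones.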

Who Covered It

InfoQ's Sergio De Simone published a detailed breakdown in November 2025, walking through Apple's pipeline from prompt generation to quality scoring. The paper has already accumulated 11 citations on arXiv. Reddit's r/StableDiffusion community discussed it with 20+ comments, and it appeared on Hugging Face's paper listings. Apple hosts the dataset on its CDN through the GitHub repository (apple/pico-banana-400k).

Why It Matters

If you're training or fine-tuning image editing models, your options until now were InstructPix2Pix's synthetic data (generated from Stable Diffusion + GPT-3, limited in diversity), or proprietary datasets you couldn't share. Pico-Banana-400K is large, built from real photos, quality-scored by a frontier model, and openly licensed.

The irony: Apple used Google's models — Nano Banana for generation, Gemini for scoring and prompt creation — to build an open dataset that will help competitors train better image editors. That's the kind of move that accelerates the entire image generation space, not just Apple's position in it.

For researchers building the next generation of text-guided editing models, the 72K multi-turn subset alone might be worth the download.

Tags: AI