Show HN: I invented a new generative model and got accepted to ICLR

diyer22 | 624 points

This sounds somewhat like a normalizing flow from a discrete space to a continuous space. I think there's a way you can rewrite your DDN layer as a normalizing flow which avoids the whole split-and-prune method.

1. Replace the DDN layer with a flow between images and a latent variable. During training, compute in the direction image -> latent. During inference, compute in the direction latent -> image.

2. For your discrete options 1, ..., k, have trainable latent variables z_1, ..., z_k. This is a "code book".

Training looks like the following: Start with an image and run a flow from the image to the latent space (with conditioning, etc.). Find the closest option z_i, and compute the L2 loss between z_i and your flowed latent variable. Additionally, add a loss corresponding to the log determinant of the Jacobian of the flow. This second loss is the way a normalizing flow avoids mode collapse. Finally, I think you should divide the resulting gradient by the softmax of the negative L2 losses for all the latent variables. This gradient division is done for the same reason as dividing the gradient when training a mixture-of-experts model.

During inference, choose any latent variable z_i and flow from that to a generated image.
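
Roughly, in PyTorch (a minimal sketch of my suggestion, not the paper's method; one toy coupling layer stands in for a real multi-layer flow, and the gradient division is implemented as a detached rescaling of the loss):

    import torch
    import torch.nn as nn

    class AffineCoupling(nn.Module):
        """Toy invertible layer; a real flow would stack several with permutations."""
        def __init__(self, dim, hidden=64):
            super().__init__()
            self.d = dim // 2
            self.net = nn.Sequential(nn.Linear(self.d, hidden), nn.ReLU(),
                                     nn.Linear(hidden, 2 * (dim - self.d)))

        def forward(self, x):  # image -> latent (training direction)
            a, b = x[:, :self.d], x[:, self.d:]
            log_s, t = self.net(a).chunk(2, dim=1)
            log_s = torch.tanh(log_s)  # keep scales bounded for stability
            return torch.cat([a, b * log_s.exp() + t], dim=1), log_s.sum(dim=1)

        def inverse(self, z):  # latent -> image (inference direction)
            a, bz = z[:, :self.d], z[:, self.d:]
            log_s, t = self.net(a).chunk(2, dim=1)
            log_s = torch.tanh(log_s)
            return torch.cat([a, (bz - t) * (-log_s).exp()], dim=1)

    def training_step(flow, codebook, images):
        """codebook: nn.Parameter of shape (K, dim), the trainable z_1..z_k."""
        z, logdet = flow(images.flatten(1))  # flow image -> latent, with log|det J|
        d2 = ((z.unsqueeze(1) - codebook.unsqueeze(0)) ** 2).sum(-1)  # (B, K) squared L2
        i = d2.argmin(dim=1)  # closest option z_i
        l2 = d2.gather(1, i[:, None]).squeeze(1)
        # MoE-style gradient division: rescale by the (detached) softmax of -L2
        w = torch.softmax(-d2, dim=1).gather(1, i[:, None]).squeeze(1).detach()
        return ((l2 - logdet) / w.clamp_min(1e-6)).mean()

Inference is then just flow.inverse(codebook[i:i+1]) for whichever code i you pick.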

cooljoseph | 17 hours ago

Very impressive to see a single-author paper at ICLR, especially for an innovative method. Well done!

michaeldoron | a day ago

Pretty interesting architecture, and it seems very easy to debug, but as a downside you effectively discard K-1 computations at each layer, since it uses a sampler rather than an MoE-style router.

The best way I can summarize it is a Mixture-of-Experts combined with an 'x0-target' latent diffusion model. The main innovations are the guided sampler (rather than a router) and the split-and-prune optimizer, which make it easier to train.
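
For concreteness, here's a minimal sketch of one layer as I read the paper (PyTorch; the names and shapes are my guesses, not the authors' code). The 1x1 heads emit K candidate images; training backpropagates only through the candidate closest to the ground truth, and inference just samples a branch:

    import torch
    import torch.nn as nn

    class DDNLayer(nn.Module):
        def __init__(self, channels, K):
            super().__init__()
            self.K = K
            # K heads emit K candidate RGB images from shared features
            self.heads = nn.Conv2d(channels, 3 * K, kernel_size=1)

        def forward(self, feat, target=None):
            cands = torch.stack(self.heads(feat).chunk(self.K, dim=1), dim=1)  # (B, K, 3, H, W)
            if target is not None:  # training: the "guided sampler" picks argmin L2
                err = ((cands - target[:, None]) ** 2).flatten(2).mean(-1)     # (B, K)
                idx = err.argmin(dim=1)
                loss = err.gather(1, idx[:, None]).mean()  # gradient reaches the chosen head only
            else:                   # inference: sample a random branch of the tree
                idx = torch.randint(self.K, (feat.shape[0],), device=feat.device)
                loss = None
            chosen = cands[torch.arange(cands.shape[0]), idx]  # (B, 3, H, W)
            return chosen, idx, loss

The K-1 unchosen candidates are exactly the discarded computation mentioned above.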

f_devd | a day ago

Green flag that he references the I Ching; most original ideas come through analogy. Paul Werbos claims he invented backprop to formalize Freud's theory of "psychic energy" into an algorithm.

mysterEFrank | 18 hours ago

After reading the paper, there is one thing I don't understand about the DDL. It seems each "concat" will increase the size of the "output feature" relative to the "input feature" by the size of the "generated image".

Is that right?

If so, how is this increased size handled by each downstream DDL?

Or is there a 2x pooling in the "concat" step, so that the final size remains unchanged?

frumiousirc | 8 hours ago

An uninformed question: if the network is fully composed of 1x1 convolutions, doesn't that mean no information mixing between pixels occurs? Wouldn't that imply that each pixel is independent of the others? How can that not lead to incoherent results?

qazxcvbnm | a day ago

Super cool concept!

Looking at the examples below the abstract, there are several details that surprise me with how correct the model is. For example: hairline in row 2 column 3; shirt color in row 2 columns 7, 8, 9, 11; lipstick throughout rows 4 and 6; face/hair position and shape in row 6 column 4. Of particular note is the red in the bottom left of row 6 column 4. It's a bit surprising, but still understandable, that the model realized there is something red there; it's very surprising that it chose to put the red blob in exactly the right spot.

I think some of this can be explained by bias in the dataset (e.g. lipstick) and cherry picking on my part (I'm ignoring the ones it got wildly wrong), but I can't explain the red shoulder strap/blob. Is there any possibility of data leakage and/or overfitting of a biased dataset, or are these just coincidences?

hatthew | a day ago

I built something similar in structure, if not in method, using a hierarchy of cross attention and learned queries, made sparse by applying L1 to the attention matrices.

Discrete hierarchical representations are super cool. The pattern of activations across layers amounts to a “parse tree” for each input. You have effectively compressed the image into a short sequence of integers.
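
A rough sketch of one level, in case it's useful (hypothetical; I use unnormalized ReLU attention here, since an L1 penalty on softmax rows would be constant, as each row already sums to one):

    import torch
    import torch.nn as nn

    class SparseCrossAttnLevel(nn.Module):
        """Learned queries cross-attend to input tokens; L1 sparsifies the attention."""
        def __init__(self, n_queries, dim, l1_weight=1e-3):
            super().__init__()
            self.queries = nn.Parameter(torch.randn(n_queries, dim) * 0.02)
            self.kv = nn.Linear(dim, 2 * dim)
            self.l1_weight = l1_weight

        def forward(self, tokens):  # tokens: (B, N, dim)
            k, v = self.kv(tokens).chunk(2, dim=-1)
            q = self.queries.expand(tokens.shape[0], -1, -1)
            scores = torch.einsum('bqd,bnd->bqn', q, k) / q.shape[-1] ** 0.5
            attn = torch.relu(scores)  # unnormalized, so L1 can actually zero entries
            penalty = self.l1_weight * attn.sum(dim=(1, 2)).mean()
            out = torch.einsum('bqn,bnd->bqd', attn, v)  # (B, n_queries, dim)
            return out, penalty

Stack a few of these and the surviving attention entries at each level give you the "parse tree".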

intalentive | a day ago

Pretty interesting. I was just doing research on diffusion using symbolic transform matrices to try to parallelize a deep graph-reactive system a few days ago. This seems to be a general direction people are going; I wouldn't be surprised to see diffusion-adjacent models take over for codegen in the next year or two.

CuriouslyC | a day ago

Super cool. I spent a lot of time playing with representation learning back in the day, and the grids of MNIST digits took me right back :)

A genuinely interesting and novel approach, I'm very curious how it will perform when scaled up and applied to non-image domains! Where's the best place to follow your work?

moconnor | a day ago

Can you train this model to detect objects (e.g. detect a fish in the picture)?

FitchApps | a day ago

Slightly meta-level: I'm glad the author finds the ICLR reviews useful, and this illustrates one of the successes of ICLR's policy of always open-sourcing the reviews (regardless of whether the paper is accepted or rejected).

Authors benefit from having "testimonials" of how anonymous reviewers interpreted their work, and it also opens the door for people outside the classic academic pipeline to see the behind-the-scenes arguments for accepting or rejecting a paper.

Here are the reviews for this paper btw: https://openreview.net/forum?id=xNsIfzlefG

And here's a list of all the rejected papers: https://openreview.net/group?id=ICLR.cc/2025/Conference#tab-...

aseg | a day ago

It's not often you read a title like that and expect it to pan out, but from a quick browse, it looks pretty good.

Now I just need a time-turner.

Lerc | a day ago

This looks like great work.

I've added it to my reading list.

Thank you for sharing it on HN.

cs702 | a day ago

I don't have a super deep understanding of the underlying algorithms involved, but going off the demo and that page, is this mainly a model for image-related tasks, or could it also be trained to do things like what GPT/Claude/etc. do (chat conversations)?

VoidWhisperer | a day ago

Could this be used to train a text -> audio model? I'm thinking of an architecture that uses RVQ. Would RVQ still be necessary?

cellis | a day ago

How does it compare to state of the art models? Does it scale?

p1esk | a day ago

Reminds me of particle filters.

v9v | 5 hours ago

fwiw ICLR:

International Conference on Learning Representations

https://en.wikipedia.org/wiki/International_Conference_on_Le...

mellosouls | a day ago

It seems to pass both a feature and a discrete number into the next layer. Which did you think of first, or was it both by design?

cttet | 17 hours ago

Do you have any details on the experimental procedures? E.g., hardware, training time, loss curves? It is difficult to confidently reproduce research without at least some of these details.

highd | a day ago

Isn't this kind of like an 80% VQ-VAE?

serf | a day ago

The part about pruning and selecting sounds similar to genetic algorithms from before the popularity of neural networks.

0xdeadbeefbabe | a day ago

How did this get accepted without any baseline comparisons? They should have compared this to VQ-VAE, diffusion inpainting and a lot more.

gurtinator | a day ago

Amazing! So basically the statistical LLM concept for imaging.

wordglyph | a day ago

It's so cool to see the model's hierarchical generation. On their GitHub page they have one with L=4: https://discrete-distribution-networks.github.io/img/tree-la...

The one shown on their page is L=3.

GaggiX | a day ago

Impressive, congrats.

nothrowaways | a day ago

very interesting stuff! great work and congratulations on the ICLR acceptance!

kaiokendev | a day ago

Congrats!

jama211 | 14 hours ago

Congrats!! Very cool.

nvr219 | a day ago

Deeply uninformed person here:

Is the inference cost of generating this tree to be pruned something of a hindrance? In particular, I'm watching your MNIST example and thinking: does each cell in that video require a full inference? Or is this done in parallel at least? In any case, you're basically trading memory for "faster" runtime (for more correct outputs), no?

throwaway314155 | a day ago

First, I think this is really cool. It's great to see novel generative architectures.

Here are my thoughts on the statistics behind this. First, let D be the data sample. Start with the expectation of -Log[P(D)] (standard generative model objective).

We then condition on the model output at step N.

- expectation of Log[Sum over model outputs at step N {P(D | model output at step N) * P(model output at step N)}]

Now use Jensen's inequality to transform this to

<= - expectation of Sum over model outputs at step N {Log[P(D | model output at step N) * P(model output at step N)]}

Apply the log product-to-sum rule

= - expectation of Sum over model outputs at step N {Log(P(D | model output at step N)) + Log(P(model output at step N))}

If we assume there is some normally distributed noise we can transform the first term into the standard L2 objective.

= - expectation of Sum over model outputs at step N {L2 distance(D, model output at step N) + Log(P(model output at step N))}

Apply linearity of expectation

= Sum over model outputs at step N [expectation of{L2 distance(D, model output at step N)}] - Sum over model outputs at step N [expectation of {Log(P(model output at step N))}]

and the summations can be replaced with sampling

= expectation of {L2 distance(D, model output at step N)} - expectation of {Log(P(model output at step N))}

Now, focusing on just the - expectation of Log(P(sampled model output at step N)) term.

= - expectation of Log[P(model output at step N)]

and condition on the prior step to get

= - expectation of Log[Sum over possible samples at N-1 of (P(sample at step N | sample at step N - 1) * P(sample at step N - 1))]

Now, each P(sample at step T | sample at step T - 1) is approximately equal to 1/K. This is enforced by the Split-and-Prune operations, which try to keep each output sampled at roughly equal frequency.

So this is approximately equal to

≃ - expectation of Log[Sum over possible samples at N-1 of (1/K * P(possible sample at step N - 1))]

And you get an upper bound by only considering the actual sample.

<= -Log[1/K * expectation of P(actual sample at step N - 1)]

And applying some log rules you get

= Log(K) - expectation of Log[P(sample at step N - 1)]

Now, you have (approximately) expectation of -Log[P(sample at step N)] <= Log(K) - expectation of Log[P(sample at step N - 1)]. You can repeatedly apply this transformation until step 0 to get

(approximately) expectation of -Log[P(sample at step N)] <= N * Log(K) - expectation of Log[P(sample at step 0)]

and WLOG assume that expectation of P(sample at step 0) is 1 to get

expectation of -Log[P(sample at step N)] <= N * Log(K)

Plugging this back into the main objective, we get (assuming the Split-and-Prune is perfect)

expectation of -Log[P(D)] <= expectation of {L2 distance(D, sampled model output at step N)} + N * Log(K)

And this makes sense. You are providing the model with an additional Log_2(K) bits of information every time you perform an argmin operation, so in total you have provided the model with N * Log_2(K) bits of information. However, this term is constant, so the gradient-based optimizer can ignore it.
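
Writing y_n for the sampled model output at step n, the whole chain collapses to the following (approximate, since it assumes Split-and-Prune keeps every branch near probability 1/K):

    % y_n = sampled model output at step n
    -\mathbb{E}[\log P(y_n)] \;\lesssim\; \log K - \mathbb{E}[\log P(y_{n-1})]
    \;\Longrightarrow\;
    -\mathbb{E}[\log P(y_N)] \;\lesssim\; N \log K,
    \qquad
    \mathbb{E}[-\log P(D)] \;\lesssim\; \mathbb{E}\!\left[\lVert D - y_N \rVert_2^2\right] + N \log K.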

So, given this analysis my conclusions are:

1) The Split-and-Prune is a load-bearing component of the architecture with regard to its statistical correctness. I'm not entirely sure how this fits with the gradient-based optimizer: is it working with it, fighting it, or somewhere in the middle? I think the answer will strongly affect this approach's scalability. It will also take a more in-depth analysis to study how deviations from perfect splitting affect the upper bound on the loss. (A toy sketch of how I picture the mechanism is below, after point 3.)

2) With regards to statistical correctness, the L2 distance between the output at step N and D is the only one that is important. The L2 losses in the middle layers can be considered auxiliary losses. Maybe the final L2 loss / L2 losses deeper in the model should be weighted more heavily? In final evaluation the intermediate L2 losses can be ignored.

3) Future possibilities could include some sort of RL to determine the number of samples K and the depth N on a dynamic basis. Even a split with K=2 increases the NLL loss by Log_2(2) = 1 bit. For many samples, after a given depth the increase in loss due to the additional information outweighs the decrease in L2 loss. This also points to another difficulty: it is hard to give fractional information in this Discrete Distribution Network architecture. In contrast, diffusion models and autoregressive models can handle fractional bits. This could be another point of future development.
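
And the toy sketch promised in point 1 (my guess at the mechanism, not the paper's exact procedure): track how often each head wins the argmin, then overwrite the most starved head with a jittered copy of the busiest one, so the pair shares that mode's traffic.

    import torch

    @torch.no_grad()
    def split_and_prune(heads, counts):
        """heads: (K, D) parameters of the K output heads.
        counts: (K,) float tensor of argmin wins since the last call."""
        K = heads.shape[0]
        freq = counts / counts.sum()
        hot, cold = freq.argmax(), freq.argmin()
        if freq[cold] < 0.5 / K and freq[hot] > 2.0 / K:
            # prune `cold` and split `hot` in one move
            heads[cold] = heads[hot] + 1e-3 * torch.randn_like(heads[hot])
            counts[cold] = counts[hot] = counts[hot] / 2
        return heads, counts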

elchananHaas | a day ago

Why do you refer to yourself as "we" in the paper?

Invictus0 | a day ago

Wtf, ICLR reviews are happening right now. Did you get accepted into a workshop? How do you know it’s been accepted?

Der_Einzige | a day ago

>Figure 18: The Taiji-DDN exhibits a surprising similarity to the ancient Chinese philosophy of Taiji. Records of Taiji can be traced back to the I Ching (Book of Changes) from the late 9th century BC, often described by the quote on the left (a) that explains the universe’s generation and transformation. This description coincidentally also summarizes the generation process and the transformations in the generative space of Taiji-DDN. Moreover, the diagram (b) from the book Tom (2013) bears a close resemblance to the tree structure of DDN’s latent fig. 1b. Therefore, we have named the DDN with K = 2 as Taiji-DDN.

Very nitpicky comment, but I personally find that such things make for a bad impression. To be more specific, branching structures are a fairly universal idea, so the choice of relating them to ancient proverbs instead of something more mundane raises an eyebrow.

E-Reverance | a day ago
