## RWKV RNN: Better than ChatGPT?

The best thing about this model is that it has O(T) time and O(1) memory during inference, versus the O(T^2) time and O(T) memory (even with FlashAttention) of a GPT model. Yet it can still be trained in parallel like a GPT model.
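A toy sketch (not the actual RWKV kernel) of why the memory footprints differ: a recurrent model carries only a fixed-size state between tokens, while attention-based inference keeps a key/value cache that grows with every token. All weight shapes and the tiny dimension `d = 4` here are illustrative choices, not anything from the RWKV codebase.

```python
import numpy as np

def rnn_step(state, x, W_state, W_in):
    # O(1) memory: only a fixed-size state survives between tokens.
    return np.tanh(state @ W_state + x @ W_in)

def attention_step(kv_cache, q, k, v):
    # O(T) memory: the key/value cache grows with every token.
    kv_cache.append((k, v))
    keys = np.stack([kv[0] for kv in kv_cache])
    values = np.stack([kv[1] for kv in kv_cache])
    scores = keys @ q / np.sqrt(len(q))
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values, kv_cache

d = 4
rng = np.random.default_rng(0)
W_state, W_in = rng.normal(size=(d, d)), rng.normal(size=(d, d))
state, cache = np.zeros(d), []
for t in range(16):
    x = rng.normal(size=d)
    state = rnn_step(state, x, W_state, W_in)   # state stays size d forever
    _, cache = attention_step(cache, x, x, x)   # cache now holds t + 1 entries
print(state.shape, len(cache))  # (4,) 16
```

After 16 tokens the RNN state is still a single length-4 vector, while the attention cache holds 16 key/value pairs; that gap is exactly the O(1) vs O(T) inference-memory claim.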

Imagine having ChatGPT level AI running in an ASIC inside earphones. This could be like an always-on buddy, available offline and able to access resources when you're connected.

Or in Google Glasses. The Readme states that it's more optimized for ASIC than the transformer architecture used by ChatGPT.

Interestingly, it goes against the grain.

Since the seminal paper 'Attention Is All You Need', we moved from RNN-style networks to purely attention-based ones. That started the LLM revolution: attention-only networks can be trained in parallel, and you got record-breaking performance to boot.

Now we learn that going back to the old RNN paradigm is actually better. RWKV even advertises itself as totally 'attention-free'!

It generates quite a lot of random content to be honest.

Prompt: If Nancy had two apples and Becky had 1 apple. Becky gives her 1 apple to Nancy, how many apples becky has ? Full Answer:

RWKV:

Two apples. If Nancy had 2 apples and Becky had 1 apple. Becky gives her 1 apple to Nancy, how many apples becky has ? Two apples.

Q : Two girls are playing with a ball, one of them throws the ball so that it goes straight and falls on the other's feet, the other bends her knees and catches it, how many times will the ball fall on the knees ? Full Answer: The ball will fall on the knees three times.

Q : Two sisters are playing with a stick. The first sister says 'let me hold it', the second sister says 'no'. Now what will happen ? Full Answer: The second sister will hold it.

Q : How many

GPT-3.5 Turbo: After Becky gives 1 apple to Nancy, Becky will have zero apples left. Becky gave her only apple to Nancy, so she doesn't have any apples remaining.

So, the answer is Becky has zero apples left.

It requires a lot more improvement.

Also check out Alpaca; you can self-host this one, the 7B and 13B variants produce surprisingly good results and are fast enough just running on CPU: https://github.com/antimatter15/alpaca.cpp

What test cases do folks here recommend for measuring this new model's ability to reason? Specifically, can it reason about code with similar (or better!) performance to GPT-4? Has anyone managed to get it running locally?

> We can predict that RWKV 100B will be great, and RWKV 1T is probably all you need :)

That sounds awfully similar to this quote: "There is no reason for any individual to have a computer in his home." by the founder of DEC in 1977.

There’s a similar one that’s supposed to be Bill Gates’ but apparently it’s not.

From the README:

*dear god rwkv is fast. i switched to another tab after starting training it from scratch & when i returned it was emitting plausible english & maori words, i left to go microwave some coffee & when i came back it was producing fully grammatically correct sentences.*

Comparison with GPT-3.5 Turbo. Prompt:

Given two numbers provide answer as follows

3 + 9 = 39
1 + 2 = 12
2 + 9 = 29

What is 29 + 49

RWKV:

?

A:

The answer is

$29+49 = 78$

Explanation:

Let the two numbers be $a$ and $b$. Then the sum is $a+b$. Now let the sum be $c$. Then the sum is $a+b+c$. Now let the sum be $d$. Then the sum is $a+b+c+d$. Now let the sum be $e$. Then the sum is $a+b+c+d+e$. Now let the sum be $f$. Then the sum is $a+b+c+d+e+f$. Now let the sum be $g$. Then the sum is $a+b+

GPT-3.5 Turbo:

Based on the pattern provided, 29 + 49 would be equal to 2949.

Explanation:

In each of the equations given, the result is obtained by concatenating the two numbers together instead of adding them. For example, 3 + 9 becomes 39 because we concatenate 3 and 9 to get 39.

So, to find the answer to 29 + 49 using this pattern, we concatenate the two numbers to get 2949.

This is actually not that bad.
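The rule GPT-3.5 inferred from those examples is plain digit concatenation, which a one-liner reproduces (`concat_sum` is a hypothetical name for illustration):

```python
def concat_sum(a, b):
    # The "addition" in the prompt is really string concatenation.
    return int(str(a) + str(b))

# The three worked examples from the prompt:
assert concat_sum(3, 9) == 39
assert concat_sum(1, 2) == 12
assert concat_sum(2, 9) == 29
print(concat_sum(29, 49))  # 2949
```

So GPT-3.5's answer of 2949 is the consistent continuation of the pattern, while RWKV fell back on ordinary arithmetic and ignored the in-context examples.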

Is there a research paper / arxiv which describes it in detail?

Off-topic, but this submission's title feels unusually editorialized/click-baity for HN.

Ok

I'm skeptical that RNNs alone will outperform transformers. Perhaps some sort of transformer + rnn combo?

The issue with RNNs is that feedback signals decay over time, so the model will be biased towards more recent words.

Transformers on the other hand don't have this bias. A word 10,000 words ago could be just as important as a word 5 words ago. The tradeoff is that the context window for transformers is a hard cutoff point.
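The recency bias in the comment above can be made concrete with a toy linear recurrence (not RWKV itself, which learns per-channel decay rates): if the hidden state updates as h_t = decay * h_{t-1} + x_t, the influence of a token seen k steps ago is decay**k, so it fades geometrically, whereas attention can in principle weight position 10,000 as heavily as position 5.

```python
# Toy illustration: geometric decay of an old token's influence
# in a linear recurrence h_t = decay * h_{t-1} + x_t.
def influence(decay, steps_ago):
    return decay ** steps_ago

d = 0.9
recent = influence(d, 5)        # token seen 5 steps back
distant = influence(d, 10_000)  # token seen 10,000 steps back
print(recent, distant)  # ~0.59 vs effectively 0.0 (underflows)
```

This is why pure RNNs historically struggled with long-range dependencies, and why gating tricks and learned, channel-wise decay (as RWKV uses) matter so much.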

From the project page: pronounced as "RwaKuv"

That is still quite challenging to pronounce, maybe one of "rwkv" -> "raw-kv" -> "rawk-v" -> "rock-v"?

It just takes one new approach to dethrone the incumbent. I called this when everyone was screaming CHATGPT!!! The problem is no one knows what they are talking about and just screams AI!!! ChatGPT is not AI. It does something automated, with accuracy information baked in, and builds new information around that accuracy data. It does not think like an AI; it takes the most probable data and responds with it. That is machine learning. It is fundamentally a cornerstone toward AI, but not AI itself.

I've followed updates on this project on r/machinelearning, and for me the existence of projects like this is good evidence that the OpenAI moat is not that strong. It gives some hope that you won't need massive compute clusters and GPUs to run decent language models.

I hope this project will thrive.