Which is why the term Artificial Intelligence is really a misnomer for LLMs. Artificial Mediocracy might be more fitting.

Sums up the issues with democracy too, and a ton of other stuff

I&#x27;m not sure you can claim that the essential functionality of something is the issue with something.
The whole idea of LLMs is that they choose the most likely token based on the tokens before, and then sometimes choose less likely tokens. But it&#x27;s all based on likelihoods.
Probably there is a huge education part missing from this, if people aren&#x27;t aware that this is how it works, and they think that any LLM can &quot;creatively&quot; come up with its own chain of tokens based on nothing.
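
To make the "most likely token, sometimes a less likely one" point concrete, here is a minimal sketch of temperature sampling in Python. The token names and scores are made up for illustration, and `sample_token` is a hypothetical helper rather than any real model's API; actual LLMs do this over a vocabulary of many thousands of tokens.

```python
import math
import random

def sample_token(logits, temperature=1.0, rng=random):
    """Pick one token from a {token: score} dict using temperature sampling."""
    if temperature <= 0:
        # Greedy decoding: always take the single most likely token.
        return max(logits, key=logits.get)
    # Softmax with temperature; subtract the max for numerical stability.
    scaled = {tok: score / temperature for tok, score in logits.items()}
    top = max(scaled.values())
    weights = {tok: math.exp(s - top) for tok, s in scaled.items()}
    tokens = list(weights)
    # Draw one token with probability proportional to its weight.
    return rng.choices(tokens, weights=[weights[t] for t in tokens], k=1)[0]

# Hypothetical scores for the next token after `print("new user", username, `
logits = {"password": 2.0, "role": 1.0, "email": 0.5}
print(sample_token(logits, temperature=0))    # greedy: the top-scored token
print(sample_token(logits, temperature=1.0))  # usually the top token, sometimes another
```

Either way the choice is driven by the scores, so a pattern that dominates the training data dominates the output.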

Evals do help to account for correctness when it comes to LLMs

I propose calling it artificial non-diligence.

&gt; Copilot gives you popular responses not correct ones.
That also sums up most of the issues with LLMs in general in one sentence.

It would be very interesting to fine-tune Copilot on the code of people widely regarded in their communities as experts, to see how the suggestions would change.

I&#x27;m not sure that code being newer inherently means it will be more secure

This makes me wonder about training an LLM on one language and then fine-tuning it for another. If you train over only, say, JavaScript, and then fine-tune for C, I imagine it will be quite bad at writing safe code, even if it makes the code look like C, because it didn&#x27;t have to learn about freeing memory and such.
Similarly, would it pick up patterns from one language and keep them in the other? Maybe an LLM trained on Kotlin would be more likely to write functional code once fine-tuned.

&gt; Most of the data is probably around average.
I know this is not how distributions work, but I had to chuckle at the literal interpretation of this.

I wonder if LLMs are biased towards older, more insecure implementations because there is a higher volume of old code vs new code.
Same thing with the data they are trained on: not all code requires all levels of refinement. Most of the data is probably around average.

That&#x27;s not how it&#x27;s presented or how managers expect it to be used.

Brawndo is great for plants because it has electrolytes.

&gt; It only means programmers commonly talk about it. This isn&#x27;t the same thing as measuring incidence in production or distribution.
Copilot was primarily trained on GitHub projects, not on communication between programmers. Patterns that frequently show up in Copilot output are most likely prevalent on GitHub, which is a pretty good indicator that they&#x27;re common in production code.

&gt; Yet if a weakness is common, it also means that human coders frequently make the same mistake as well.
It only means programmers commonly talk about it. This isn&#x27;t the same thing as measuring incidence in production or distribution.
Anyway, I&#x27;d argue the real question is &quot;can the chatbot fix the code if requested to&quot;.

The added complication is now you&#x27;ll have to watch out for the junior+copilot combo, though it&#x27;s a trade I personally am very willing to take.

A junior programmer&#x27;s code? This makes no sense. It&#x27;s happening right in front of you. A junior programmer isn&#x27;t going to write on my screen. I can just correct it right here while I am holding the context in my head.
These &quot;security weakness&quot; examples are<pre><code> print(&quot;first user registered, role set to admin&quot;, user, password)
</code></pre>
and<pre><code> pprint({&quot;json&quot;:&quot;somejunk&quot;, &quot;classes&quot;: somefunc(user)})

</code></pre>
Nah, this stuff I can easily spot while I&#x27;m writing code. For a junior programmer, I&#x27;m going to be looking at design, and then at common specific mistakes. For Copilot, it&#x27;s writing in front of me. I can easily exclude anything that isn&#x27;t obviously correct because I&#x27;m in the state right there.
It&#x27;s a fantastic tool. If you go and use it and end up with `print(user_credentials)`, I don&#x27;t know what to tell you.

If a weakness is common, then of course Copilot is going to suggest it. Copilot gives you popular responses not correct ones. Yet if a weakness is common, it also means that human coders frequently make the same mistake as well.
The study’s results are rather unsurprising and its conclusions are oft-repeated advice. As many have said, treat Copilot’s code in the same light you would treat a junior programmer’s code.

&gt; The results show that (1) 35.8% of Copilot generated code snippets contain CWEs
What percent of non-Copilot generated public GitHub repos contain CWEs?
Edit: According to this study, Copilot generates C&#x2F;C++ code with vulnerabilities, but at a lower rate than your average human coder: <a href="https:&#x2F;&#x2F;arxiv.org&#x2F;pdf&#x2F;2204.04741.pdf" rel="nofollow noreferrer">https:&#x2F;&#x2F;arxiv.org&#x2F;pdf&#x2F;2204.04741.pdf</a>

&quot;...The results show that (1) 35.8% of Copilot generated code snippets contain CWEs, and those issues are spread across multiple languages, (2) the security weaknesses are diverse and related to 42 different CWEs, in which CWE-78: OS Command Injection, CWE-330: Use of Insufficiently Random Values, and CWE-703: Improper Check or Handling of Exceptional Conditions occurred the most frequently, and (3) among the 42 CWEs identified, 11 of those belong to the currently recognized 2022 CWE Top-25. Our findings confirm that developers should be careful when adding code generated by Copilot (and similar AI code generation tools) and should also run appropriate security checks as they accept the suggested code...&quot;
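
For anyone unfamiliar with two of the CWEs named in that abstract, here is a rough Python sketch of the usual mitigations for CWE-78 (OS command injection) and CWE-330 (insufficiently random values). The function names `run_echo` and `new_session_token` are hypothetical and not taken from the paper; the child process is just the current Python interpreter echoing its argument back.

```python
import secrets
import subprocess
import sys

def run_echo(user_input):
    """Pass untrusted input as one argv element, never through a shell."""
    # With an argument list and the default shell=False, characters such as
    # ';' or '&&' are plain data and are not interpreted as shell syntax.
    cmd = [sys.executable, "-c", "import sys; print(sys.argv[1])", user_input]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.strip()

def new_session_token():
    """Use the secrets module (a CSPRNG) rather than random for tokens."""
    return secrets.token_urlsafe(32)
```

The risky variants are the ones that look almost the same: building the command as one string and passing `shell=True`, or generating tokens with the `random` module.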

This is probably the next step for the LLM providers. They need to find ways to increase quality, and for code, there are many options. Perhaps code repos could get in on this too.

Yes, but presumably in the training data those two are quite correlated.

Wouldn&#x27;t this train it to avoid detection more than to avoid bad patterns?

I wonder if it would be possible to rate the code used during the training phase. For example the code could go through various static analysis tools and the result would be assigned as metadata to the code being used to train the model. The final model would then know that a given pattern is flagged as problematic by some tool and could take this into account not just to suggest new snippets but also to suggest improvements of existing snippets. Though I suppose if it was that easy, they&#x27;d have done it already.
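
As a rough illustration of that idea, here is a toy sketch that runs a snippet through a tiny hand-rolled check built on Python's `ast` module and attaches the findings as metadata. Everything here is an assumption for illustration: the labels, the `flag_snippet` and `annotate` helpers, and the three patterns checked are made up, and a real pipeline would presumably lean on established analyzers instead.

```python
import ast

def flag_snippet(source):
    """Return a list of (line, label) findings for a Python snippet."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return [(0, "unparseable")]
    findings = []
    for node in ast.walk(tree):
        if not isinstance(node, ast.Call):
            continue
        # Function name for plain calls (eval) or attribute calls (subprocess.run).
        name = getattr(node.func, "id", getattr(node.func, "attr", ""))
        if name in {"eval", "exec"}:
            findings.append((node.lineno, "dynamic-code-execution"))
        for kw in node.keywords:
            if kw.arg == "shell" and isinstance(kw.value, ast.Constant) and kw.value.value is True:
                findings.append((node.lineno, "shell-true"))
        if name == "print" and any(
            isinstance(arg, ast.Name) and "password" in arg.id.lower() for arg in node.args
        ):
            findings.append((node.lineno, "secret-in-output"))
    return findings

def annotate(source):
    """Pair a training snippet with hypothetical quality metadata."""
    findings = flag_snippet(source)
    return {"code": source, "findings": findings, "clean": not findings}

print(annotate('print("new user", username, password)'))
```

Whether a model trained on such labels learns to avoid the pattern or merely to avoid the detector is a separate question.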

They didn&#x27;t improve on human truck drivers yet.

As always the statistic is useless without the human comparison. If it improves on human coders, no amount of gnashing and wailing will stop the layoffs.

I don&#x27;t know if it still does it, but it used to be that if you did something like<pre><code> NonQueryResult StoreUser(User user) {
 var sql = &quot;INSERT...

</code></pre>
It would use string interpolation to fill out the properties
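
The snippet above is C#, but the safer alternative looks much the same in any language: bound parameters instead of interpolation. A minimal sketch in Python with the standard `sqlite3` module, where the `users` table and the `store_user` helper are hypothetical:

```python
import sqlite3

def store_user(conn, name, email):
    """Insert a user with bound parameters instead of string interpolation."""
    # The ? placeholders are filled in by the driver, so the values are
    # treated as data and never parsed as SQL.
    conn.execute("INSERT INTO users (name, email) VALUES (?, ?)", (name, email))
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, email TEXT)")
store_user(conn, "Robert'); DROP TABLE users;--", "bobby@example.com")
print(conn.execute("SELECT COUNT(*) FROM users").fetchone()[0])  # prints 1
```

The hostile name ends up stored verbatim as a string, and the table is still there afterwards.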

&gt; I&#x27;ve seen this in the wild too but that&#x27;s no excuse.
See, the LLM also saw it in the wild...

Not best practice? That&#x27;s a very generous way to describe storing plaintext passwords in logs. I&#x27;ve seen this in the wild too but that&#x27;s no excuse.

That is the CWE that they identify, but the code seems to store the apparently unhashed password in the database on top of that?
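
For reference, a minimal sketch of what storing a salted hash instead of the password could look like, using only Python's standard library. The helper names are hypothetical, the iteration count is an illustrative assumption rather than a recommendation, and this is not the code from the paper.

```python
import hashlib
import hmac
import secrets

def hash_password(password, iterations=600_000):
    """Return (salt, iterations, digest) to store instead of the plaintext."""
    salt = secrets.token_bytes(16)  # fresh random salt per password
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return salt, iterations, digest

def verify_password(password, salt, iterations, expected):
    """Recompute the digest and compare it in constant time."""
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return hmac.compare_digest(candidate, expected)
```

At login you would call `verify_password` with the stored salt, iteration count, and digest; the plaintext never needs to reach the database or the logs.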

There&#x27;s only one weakness specifically identified that I can see.<pre><code> print(&quot;new user&quot;, username, password)
</code></pre>
Yeah, not best practice, but also pretty common for development if you wanted to check that everything is being passed to the correct function.

A related headline could be &quot;Security weaknesses of code produced by a junior developer&quot;.
It says copilot in the product name -&gt; it&#x27;s not intended to replace the pilot&#x27;s (aka the developer&#x27;s) brain.

They did not prompt at all. They used GitHub’s code search to find projects where the repo owner specified that the code was generated “by Copilot” and the authors took that at face value for all code in the project. Whether the code was actually suggested by Copilot is not at all analyzed in the paper. As such, the results are highly questionable.

This would likely help a little bit. We&#x27;ve already seen LLMs improve performance on some tasks by being instructed to &quot;think carefully&quot; first; presumably this biases them towards parts of the training set that are higher quality.
But security ultimately requires comprehension, which is not something LLMs have.

I think it’s more likely that you would use a security-graded bot.
It’s perfectly reasonable to not use secure code for a large number of use cases.

&gt; also make it secure
[proceeds to simply refactor the same code]

That would be kind of wild. Imagine a world where whether your system was secure was just a matter of remembering to tell the AI agent &quot;&amp; also make it secure&quot; before it writes your code.
(could be quite real!)

Did they prompt it to consider security weaknesses?

Security weaknesses of Copilot generated code in GitHub