Show HN: Hacker News em dash user leaderboard pre-ChatGPT

tkgally | 353 points

v1 (the submitted URL) was https://www.gally.net/miscellaneous/hn-em-dash-user-leaderbo.... We've now replaced it with v2, for more complex analytical em dash explorations :) - see https://news.ycombinator.com/item?id=45075379 and https://news.ycombinator.com/item?id=45072635.

dang | 2 days ago

Using the HN public dataset in Google BigQuery [0], which I think fits easily within the free query allowance:

  SELECT 
    EXTRACT(YEAR FROM timestamp) AS year, 
    SUM(CASE WHEN text LIKE '%—%' THEN 1 ELSE 0 END) AS withDash, 
    COUNT(*) AS total, 
    SUM(CASE WHEN text LIKE '%—%' THEN 1 ELSE 0 END) / COUNT(*) AS fraction
  FROM `bigquery-public-data.hacker_news.full` 
    WHERE type = 'comment' 
  GROUP BY year 
  ORDER BY year;

  year with—   total  frac
  2006     0      12 0.000
  2007    13   70858 0.000
  2008   461  247922 0.001
  2009  1497  491034 0.003
  2010  3835  842438 0.005
  2011  4719 1044913 0.005
  2012  5648 1246782 0.005
  2013  7881 1665185 0.005
  2014  8400 1510814 0.006
  2015  9967 1642912 0.006
  2016 12081 2093612 0.006
  2017 14530 2361709 0.006
  2018 19246 2384086 0.008
  2019 23662 2755063 0.009
  2020 27316 3243173 0.008
  2021 32863 3765921 0.009
  2022 34657 4062159 0.009
  2023 36611 4221940 0.009
  2024 32543 3339861 0.010
  2025 30608 2231919 0.014

So there's definitely been an increase.
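
To put a rough number on that increase, here is a small worked check in Python. It only reuses three rows copied from the table above and does not re-query BigQuery; the variable names are illustrative.

```python
# Rows copied from the table above: year -> (comments with an em dash, total comments).
rows = {
    2012: (5648, 1246782),
    2022: (34657, 4062159),
    2025: (30608, 2231919),
}

# Fraction of comments containing at least one em dash, per year.
fractions = {year: with_dash / total for year, (with_dash, total) in rows.items()}

for year, frac in sorted(fractions.items()):
    print(f"{year}: {frac:.4f}")

# Relative growth between two years (1.0 would mean no change).
growth_2012_to_2022 = fractions[2022] / fractions[2012]
growth_2022_to_2025 = fractions[2025] / fractions[2022]
print(f"2012 -> 2022: x{growth_2012_to_2022:.2f}")
print(f"2022 -> 2025: x{growth_2022_to_2025:.2f}")
```

Comparing the two growth factors separates the slow rise over the decade before late 2022 from the change since then. Note that the 2025 row covers only part of a year.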

Querying for the users who use "—" most as a proportion of all their comments:

  SELECT
    `by`,
    SUM(CASE WHEN text LIKE '%—%' THEN 1 ELSE 0 END) / COUNT(*) AS fraction,
    COUNT(*) AS total,
    MIN(timestamp) AS minTime,
    MAX(timestamp) AS maxTime
  FROM `bigquery-public-data.hacker_news.full` 
  WHERE 
    type = 'comment' AND 
    timestamp < '2022-11-30' 
  GROUP BY `by`
  HAVING COUNT(*) > 100
  ORDER BY fraction DESC
  LIMIT 250;

zmgsabst uses them the most [1]; westoncb [2] is an older account that uses them fourth-most.

[0] https://console.cloud.google.com/marketplace/product/y-combi...

[1] https://news.ycombinator.com/threads?id=zmgsabst

[2] https://news.ycombinator.com/threads?id=westoncb

Symbiote | 2 days ago

You might also want to rank by how often people use double hyphens-- like so.

I'm probably not alone here in being a longtime Linux user who started using a MacBook after the Apple Silicon transition, in late 2022.

On Windows and Linux, inserting an em-dash is a laborious alt-code process. But on macOS with an Apple keyboard, the `option` key acts like a tertiary shift, so an `–` en dash is just <option><->, and an `—` em dash is <option><shift><->.

I didn't start using em-dashes (typing -- is just second nature to me and I'm still on Linux most of the time) when I got a MacBook, but I imagine some people in my shoes did.

lynndotpy | 2 days ago

You can count your own with this snippet. Just replace my username with your own. My count before this comment was 46.

  curl -s "https://hn.algolia.com/api/v1/search?tags=comment,author_sjs382&hitsPerPage=10000" \
    | jq -r '.hits[].comment_text' \
    | grep -o "—" \
    | wc -l
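
For anyone without jq installed, the same count can be sketched in Python. This is a minimal sketch, not a drop-in replacement: the function names are made up for illustration, `fetch_hits` calls the same endpoint as the curl command above, and I have not checked how many hits the API really returns for a large `hitsPerPage`, so the result may undercount for prolific accounts.

```python
import json
import urllib.request

EM_DASH = "\u2014"  # the em dash character, written as an escape


def count_em_dashes(hits):
    """Count every em dash across the comment_text fields (what grep -o | wc -l does)."""
    total = 0
    for hit in hits:
        text = hit.get("comment_text") or ""  # comment_text can be missing or null
        total += text.count(EM_DASH)
    return total


def count_comments_with_em_dash(hits):
    """Count comments that contain at least one em dash."""
    return sum(1 for hit in hits if EM_DASH in (hit.get("comment_text") or ""))


def fetch_hits(username, hits_per_page=10000):
    """Fetch one page of a user's comments from the endpoint used in the curl call."""
    url = (
        "https://hn.algolia.com/api/v1/search"
        f"?tags=comment,author_{username}&hitsPerPage={hits_per_page}"
    )
    with urllib.request.urlopen(url) as response:
        return json.load(response)["hits"]
```

Usage would be something like `count_em_dashes(fetch_hits("your_username"))`. The second counter gives comments containing an em dash rather than total em dashes, which is the unit the BigQuery queries elsewhere in this thread use.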

sjs382 | 2 days ago

I’d be interested in seeing how the data changes if, instead of the raw number of posts with em-dashes, you checked them as a percentage of each user's total posts. I guess the folks who registered later would be bumped up the list?

latexr | 2 days ago

This is the kind of top-tier content we need on HN. These are the issues that really matter!

ayaros | 2 days ago

Fun, but perhaps the ratio of em-dash per comment would be more interesting?

Otherwise it looks like the "race" is biased towards just the number of comments posted.

riffraff | 2 days ago

The em-dash giveaway is an actual Unicode em-dash character, right? I had to learn LaTeX professionally to write a paper in the 1990s and have had a "---" habit ever since, and I've been wondering if that's some kind of weird LLM tell now.

tptacek | 2 days ago

Due to the interest in this project, I created a second, more comprehensive version of the leaderboard:

https://www.gally.net/miscellaneous/hn-em-dash-user-leaderbo...

This second version was vibe-coded with Codex CLI. I also tried Gemini CLI, but it didn’t work very well. The SQL scripts I ran on BigQuery were written by Claude.

I am not a programmer or web designer, so I will leave these pages as they are, warts and all. It was a fun project, though. I never would have attempted something like this pre-vibe-coding.

tkgally | 2 days ago

As an em dash appreciator—and there are dozens of us!—I have mixed feelings on ChatGPT embracing our little guy. My suspicion is that it's a quirk of their RLHF tuning where the em dash—which is definitely distinct from the en dash and hyphen—came to be associated with authoritative writing.

bhickey | 2 days ago

Heh. A top 50. No way that I'm in there — I don't post that much.

Oh look, a more complete leaderboard — click.

Oh. I'm at position 51.

Freak_NL | 2 days ago

It might be more fun to see users whose em dash usage increased after the release.

PUSH_AX | 2 days ago

It would be interesting to compare the post-2022 usage trends among the top contenders.

kevin_thibedeau | 2 days ago

I have started using triple dots, as on Linux I can get them with Alt Gr + .

Compared to Windows, a lot more symbols can be accessed with Alt Gr.

ThatMedicIsASpy | 2 days ago

Feature request: Sort by em-dashes per comment.

Feature request 2: Em-dash regular-dash ratio.

LeoPanthera | 2 days ago

I mourn and celebrate the emdash as a sign and signal. I mourn our memories of it and laugh at myself in the future thinking about this when I have forgotten about it.

It's like the memory of the jokes about the wacky phrases of GPT-2, or the "ew" at the yellow-hued, saturated AI-generated images.

In the future this sign will be gone and our pattern recognition will adapt and our memory of this will also mostly be gone. Hello to future tech archeologists. The emdash isn't a meme, it will never survive and replicate but it's fun while it's lasting and I'm enjoying it in the meantime!

I mourn also because in the future we may have few or no obvious signs of LLM use. These are the golden years.

thinkingemote | 2 days ago

I just realized I’ve been using en-dash this whole time. This is an identity crisis.

chatmasta | 2 days ago

Microsoft Word converts your dashes to em-dashes for you automatically, and has for at least the last decade. So as a sibling comment said, if it's professionally written, em dashes are probably used more than regular ones.

kristianp | 2 days ago

I think a somewhat more interesting statistic would be to count only \w—\w. This excludes cases like "(—)" and em dashes surrounded by spaces, which is apparently what Russian-speaking users like to use. It is also a very old tradition to format page titles as <title>[Page name] — [Website name]</title>: depending on the language, this is a default setting for MediaWiki, WordPress, etc.
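
As a rough illustration of that distinction, here is a Python sketch that counts "tight" and "spaced" em dashes separately. It is a variation on the literal \w—\w pattern: it uses lookarounds so that a run like a—b—c counts both dashes, which a plain non-overlapping match would not. The names `TIGHT_EM_DASH`, `SPACED_EM_DASH` and `classify` are made up for this example.

```python
import re

# An em dash (U+2014) with a word character directly on each side.
# Lookarounds keep the neighbours out of the match, so adjacent dashes are not skipped.
TIGHT_EM_DASH = re.compile(r"(?<=\w)\u2014(?=\w)")

# An em dash with whitespace on both sides, the spaced style mentioned above.
SPACED_EM_DASH = re.compile(r"(?<=\s)\u2014(?=\s)")


def classify(text):
    """Return (tight_count, spaced_count) for the em dashes in text."""
    return len(TIGHT_EM_DASH.findall(text)), len(SPACED_EM_DASH.findall(text))
```

Em dashes that are neither tight nor spaced, such as the one in "(—)", are counted by neither pattern. In Python, `\w` is Unicode-aware for `str` patterns, so Cyrillic letters count as word characters too.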

Lockal | 2 days ago

As someone who leans heavily on emdashes, this has all been very annoying.

mkbelieve | 2 days ago

Ironically, I personally prefer good typography, but unless the editor for the desktop app is autocorrecting -- to —, I usually don't bother. But when I type on the phone with the on-screen keyboard, I almost always do bother, even though entering text on mobile is objectively slower and more difficult, and often offers fewer options.

Andrew_nenakhov | 2 days ago

How about en dash usage? Has that been used as a similar false indicator?

rasse | 2 days ago

How can I get to the top of the leaderboard?

Is the amount of em dashes counted or the comments that have at least one em dash inside them?

You know, I am asking for...science(?).

I also wanted to point out that these could be Cantonese/Mandarin/Japanese/Southeast Asian users who use their local keymapping software, because a lot of them use the idiom symbols (e.g. the dot character, too) when they switch to the English keymaps.

Check out what laptops usually look like over there; a lot of manufacturers build that right into the firmware.

cookiengineer | 2 days ago

I'm actually one of the people who use the em dash regularly. I treat it like a pause—like sighing. It's very easy to type on a Mac, and it becomes muscle memory: Opt+Shift+Dash.

wiradikusuma | 2 days ago

This is kind of pointless given that iOS’s autocorrect has been adding em dashes, ellipses and smart quotes to comments since… forever.

(Like now)

It’s become a weird kind of witch hunting regarding blogs, too, and I have a 20+ year old site that renders all of its content using Markdown extensions that do the same (and that also convert dual hyphens to em dashes—something I’ve been typing for about as long).

rcarmo | 2 days ago

I started using em dashes in my academic career, after my advisor pointed out the subtle differences to me. Since then, I have liked and used the em dash a lot. In LaTeX it is easily produced; just keep the spacing rules in mind. The Punctuation Guide is a nice reference on it: https://www.thepunctuationguide.com/

astahlx | 2 days ago

I think this whole em dash topic should lead to some deeper (though not very deep) conversations:

* If it was not widely used before, where/how did (Chat)GPT pick it up?

  * If it was widely used, then it shouldn't be a topic at all. But there seems to be informal agreement that it wasn’t widely used.

  * Or could GPT have inferred that, even though it's not widely used, it's the better way to go (to use it)? That then makes one wonder about the whole probability-of-next-token idea. Maybe this line of thinking falls too short of what might really be going on internally.

* If it had picked up something that is widely used but in the wrong way, it should make us pause (again) about the future feedback loops these LLMs, which aren't going away, are already creating. Not just in terms of grammar and spelling but also in terms of ways of thinking and seeing the world.

(edit: formatting)

maaaaattttt | 2 days ago

The one thing LLMs do well is manipulating text. The danger is obviously that it will reduce individual expression and make everything the same mediocre sludge.

For me writing is a way to capture a stream of consciousness so I don’t really see the advantage of using an LLM.

When I see some trivial mediocrity I simply stop reading. It’s just not interesting.

Gud | 2 days ago

I applaud this data. But how are people actually creating an em-dash in the "add comment" box? Some non-obvious OS-level shortcut?

zdw | 2 days ago

Someone should make something like this for the wider world outside of just HN. Go through all my publications through gScholar or elsewhere, and scour and parse anything I wrote publicly pre-11/30/22 to establish some kind of proof-of-humanity. Sincerely, an em-dash user who got overtaken by the GenAI wave of the mid-2020s.

thoughtpeddler | a day ago

Confused by the year stats below - that shows an increase much earlier than, say, the GPT-3 release date. So I'm guessing whatever is going on isn't just AI?

Havoc | 2 days ago

I noticed them in the Economist around 2010, and thought they were slick. Tons of software will autodetect "---" as an emdash so that works.

Honestly, even if it doesn't make it pretty I find stringing together a few hyphens does the trick in less formal settings.

loughnane | a day ago

Slightly tweaked, a leaderboard of em dash containing comments after ChatGPT release, limited to users who used them in fewer than 1% of comments before ChatGPT release, and who posted at least 200 comments before and after ChatGPT release. Data is recent (August 28th).

Of course this doesn't mean they're using ChatGPT either, they could've switched devices or started using them because they felt like it.

  #   user           before_chatgpt after_chatgpt  
  1   fao_           9/1777 (1 %)   36/225 (16 %)
  2   tlogan         1/962 (0 %)    59/399 (15 %)
  3   whynotminot    1/250 (0 %)    36/356 (10 %)
  4   unclebucknasty 13/2566 (1 %)  38/378 (10 %)
  5   iLemming       0/793 (0 %)    61/628 (10 %)
  6   nostrebored    10/1045 (1 %)  32/331 (10 %)
  7   freeone3000    0/2128 (0 %)   74/791 (9 %) 
  8   pdabbadabba    6/932 (1 %)    20/225 (9 %) 
  9   thebooktocome  4/632 (1 %)    18/208 (9 %) 
  10  tnecniv        0/671 (0 %)    34/446 (8 %) 
  11  dkersten       39/5092 (1 %)  24/318 (8 %) 
  12  stared         8/1565 (1 %)   29/392 (7 %) 
  13  ETH_start      3/385 (1 %)    75/1029 (7 %)
  14  tcbawo         2/792 (0 %)    15/218 (7 %) 
  15  jbm            2/406 (0 %)    22/350 (6 %)

Query [2]:

  WITH by_user AS (
    SELECT
      `by` AS user,
      COUNTIF(text LIKE '%—%') AS match_count,
      COUNT(*) AS total_count,
      (timestamp >= '2022-11-30') AS after_chatgpt
    FROM `bigquery-public-data.hacker_news.full` 
    WHERE type = 'comment'
    GROUP BY user, after_chatgpt
  ),
  combined AS (
    SELECT
      user,
      MAX(IF(NOT after_chatgpt, match_count, 0)) AS match_before_chatgpt,
      MAX(IF(NOT after_chatgpt, total_count, 0)) AS total_before_chatgpt,
      MAX(IF(after_chatgpt, match_count, 0)) AS match_after_chatgpt,
      MAX(IF(after_chatgpt, total_count, 0)) AS total_after_chatgpt,
    FROM by_user
    GROUP BY user
    HAVING total_before_chatgpt >= 200 AND total_after_chatgpt >= 200
  ),
  with_fractions AS (
    SELECT
      *,
      SAFE_DIVIDE(match_before_chatgpt, total_before_chatgpt)  AS fraction_before_chatgpt,
      SAFE_DIVIDE(match_after_chatgpt, total_after_chatgpt) AS fraction_after_chatgpt
    FROM combined
  )
  SELECT
    user,
    FORMAT('%d/%d (%.0f %%)', match_before_chatgpt, total_before_chatgpt, ROUND(fraction_before_chatgpt*100)) AS before_chatgpt,
    FORMAT('%d/%d (%.0f %%)', match_after_chatgpt, total_after_chatgpt, ROUND(fraction_after_chatgpt*100)) AS after_chatgpt
  FROM with_fractions
  WHERE fraction_before_chatgpt < 0.01
  ORDER BY fraction_after_chatgpt DESC
  LIMIT 15

[1] https://news.ycombinator.com/item?id=45072937

[2] https://console.cloud.google.com/marketplace/product/y-combi...

dns_snek | 2 days ago

A related question - if you fed each comment into an LLM and asked it to classify it as one of {human-produced, llm-produced, not-sure}, how many would it think are from LLMs? How could you try to investigate the true answer?

ks2048 | 2 days ago

For most write-ups, I’ve switched to en-dash flanked by two spaces these days. Easier to type and looks less gippitified imo.

> But British usage - instead - uses spaces, so an en-dash or an em-dash is acceptable.

rednafi | a day ago

Sadly, I’ve been editing it out of my writing, at least online and in emails.

JumpCrisscross | a day ago

I suspect they are generated via "autocorrect", the same way as "smart (more like stupid) quotes" and other characters that tend to cause a great deal of frustration should they find their way into source code. It would be interesting to see how many users regularly make posts containing non-ASCII characters.

userbinator | 2 days ago

This kind of thing is the only way I'm likely to get in a top-10-HackerNews-users list ^_^;

ben_w | 2 days ago

How do people type an em dash on the keyboard? On iOS you have to long-press the dash key; on a full hardware keyboard you have to key in the code(?). That’s all very cumbersome and unnatural! Why do people bother using the em dash at all?

sinuhe69 | 2 days ago

So does that mean ChatGPT was trained on these HNers' comments?

apparent | 2 days ago

I probably would have made the list, but regular dashes are good enough for me - ASCII forever!!!

phendrenad2 | 2 days ago

As #10 on this list, here’s how I do it on my laptop.

I remap a key to the right of Space to Compose, and add various custom sequences. Before long, I was completely comfortably and casually typing dashes and curly quotes and more, and in fact it takes conscious effort for me to limit myself to ASCII when typing prose. (Writing code, writing *, /, -, ' and " is easy. But writing prose, I genuinely will write ×, ÷ if it feels the right one in that place, −, ‘/’ and “/”.)

On one previous laptop keyboard I mapped Menu, on my current one RAlt is more suitable.

When on Windows, I use WinCompose. On Linux, I used to just use it bare, which had advantages and disadvantages—apps implement a Compose key inconsistently, some messing things up related to includes and some handling overlapping sequences differently. More recently I wanted to be able to type Telugu and installed fcitx5 which is no longer mostly broken under Wayland like it was last time I tried, so now fcitx5 is handling the Compose sequences across the entire system, and working more consistently. Also I can use Ctrl+Alt+Shift+U and get a popup where I can search Unicode by code or description. Now if only that pesky popup would handle Shift+Space and Ctrl+Backspace itself rather than letting them fall through to the parent…

In my ~/.config/sway/config:

  input * {
      xkb_options "caps:backspace,compose:ralt"
  }

(caps:backspace isn’t entirely relevant here, but it’s on the same line and I choose to mention it. When people are remapping Caps Lock, I’ve never understood why so many seem to choose to make it Escape. Just extend the left hand and slap the corner of the keyboard with the ring finger, it’s not a huge movement and is easy to reach and return. Backspace, however, tends to be needed at least as often (and yes, I say that despite using Vim), and is much harder to hit. In my mind, a far better candidate for shifting to that prime real estate.)

For my ~/.XCompose, I start with the defaults and one good set of additions, https://raw.githubusercontent.com/kragen/xcompose/master/dot...:

  include "/usr/share/X11/locale/en_US.UTF-8/Compose"
  include "/home/chris/.XCompose-kragen"

Then I add all kinds of additions. Lots of fine typography stuff like zero-width space and non-joiner, narrow no-break space, thin space… a few more hyphen/dash mappings… and lots of other things like nice emoji sequences, music notation stuff, Greek letters matching Vim digraphs, superscript ordinals (ˢᵗ, ⁿᵈ, ʳᵈ, ᵗʰ), the keyboard shortcut symbols macOS uses (⌘⌃⌥⇧⌫ and another dozen less common ones), control pictures like ␆, and a handful of other things.

When all’s said and done:

• Compose - - - gets me — EM DASH (stock)

• Compose - - . gets me – EN DASH (stock)

• Compose - - = gets me − MINUS SIGN (custom)

• Compose - - w gets me ⸺ TWO EM DASH (custom; w for wide)

• Compose - - W gets me ⸻ THREE EM DASH (custom; W for Wider)

The last two I use occasionally, the other three I use very frequently. I went through a phase of using HYPHEN and SOFT HYPHEN, now I seldom use them.

I also like to write &c. (italic where supported) for et cetera.

For quotation marks, I also use custom mappings:

  <Multi_key> <semicolon> <semicolon>   : "‘"   U2018 # LEFT SINGLE QUOTATION MARK
  <Multi_key> <apostrophe> <apostrophe> : "’"   U2019 # RIGHT SINGLE QUOTATION MARK
  <Multi_key> <colon> <colon>           : "“"   U201c # LEFT DOUBLE QUOTATION MARK
  <Multi_key> <quotedbl> <quotedbl>     : "”"   U201d # RIGHT DOUBLE QUOTATION MARK

Think about how you physically type them, and I reckon these mappings make a lot of sense, very easy to type. Much better than the stock bindings (<' >' <" >") or kragen ones (`Space 'Space `` ''; or 6' 9' 6" 9").

—⁂—

(Oh yeah, that one’s <Multi_key> <h> <r> : "—⁂—".)

Now, I have one question I’d like answered. Overlapping sequences. If you have -> → and <- ← you’re fine, but when you add <-> ↔, I can’t find any way of using the <- sequence any more. Before fcitx5, some apps would ignore one or the other (in ways difficult to explain which I think involved the fact that some definitions came from includes), and some would let you terminate the sequence early and match the shorter one (e.g. Compose < - Enter). Is there some proper solution I’ve missed?
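
This does not answer the question, but the overlap itself is easy to audit. Below is a small Python sketch that lists every pair of sequences where one is a proper prefix of another, which is the situation described above. It is a simplification under stated assumptions: each character in a string stands for one key press after Compose, the function name is invented for this example, and nothing here reflects how fcitx5 or any other Compose implementation actually resolves the ambiguity.

```python
def find_prefix_conflicts(sequences):
    """Return (shorter, longer) pairs where one sequence is a proper prefix of another."""
    conflicts = []
    ordered = sorted(sequences)
    for i, shorter in enumerate(ordered):
        for longer in ordered[i + 1:]:
            if not longer.startswith(shorter):
                # In sorted order, all strings sharing a prefix sit together,
                # so the first miss means there are no further matches.
                break
            if longer != shorter:
                conflicts.append((shorter, longer))
    return conflicts


if __name__ == "__main__":
    # The three arrow sequences from the paragraph above.
    for shorter, longer in find_prefix_conflicts(["->", "<-", "<->"]):
        print(f"{shorter!r} is shadowed by {longer!r}")
```

Running a check like this over the key sequences extracted from a .XCompose file would at least show which short sequences are at risk before adding a longer one.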

I have plans for an article on my keyboard arrangements, including sharing a full .XCompose, but I’m going to finish my next major revision to my website first. Because then I’ll be able to draw things instead of just writing.

—⁂—

On mobile, I think I use FUTO keyboard at present, which lets me access most of these things, but not elegantly. I want to make my own keyboard layout that lets me access the good stuff more easily, but I haven’t got to it yet.

Also: anyone want to join me in advocating for completion dictionaries and libraries to replace their ' apostrophes with ’, or at least to support both approaches equally? I’m fed up with not having this stuff, Vim is the only place where it was straightforward to get it about right, and mobile is just a mess.

chrismorgan | 2 days ago

Well─────that was bound to happen.

notpushkin | 2 days ago

Some of us use a triple dash to indicate the same thing. Like LaTeX. You should add that too.

mickeyp | 2 days ago

I guess I’m confused. Why is it interesting to know how many em dashes were used before the dawn of ChatGPT? It’s how many AFTER that seems like it would be far more interesting.

IAmGraydon | 2 days ago

Yes! #21! A list I finally made — and I was not surprised to find I was on it.

JKCalhoun | a day ago

There's also https://news.ycombinator.com/item?id=27787448

dang | 2 days ago

I was hoping to see a graph of em-dash usage over time across all comments - would be interesting to see the spike post LLM

nullandvoid | 2 days ago

This is amazing. The rise of the AI-generated em dash is insane.

almostbasic | a day ago

I do em dash on my phone, and --- on the computer. Can we expand this further? I wanna make at least the top 200!

Ericson2314 | 2 days ago

If I had a key for it on my keyboard, I'd use it more often too.

k__ | 2 days ago

The post where we discovered dang was an AI.

qingcharles | 2 days ago

We need a column for em-dashes per 1000 words.
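
A minimal Python sketch of such a metric, assuming words are simply whitespace-separated tokens (so "foo—bar" with no spaces counts as one word). The function name is made up for illustration and this is not how the leaderboard computes anything.

```python
EM_DASH = "\u2014"  # the em dash character, written as an escape


def em_dashes_per_1000_words(comments):
    """Em dashes per 1000 whitespace-separated words across a list of comment strings."""
    total_words = sum(len(comment.split()) for comment in comments)
    total_dashes = sum(comment.count(EM_DASH) for comment in comments)
    if total_words == 0:
        return 0.0  # avoid dividing by zero for users with no text
    return 1000 * total_dashes / total_words
```

Normalising by words rather than by comments keeps long-form commenters from ranking high just because each comment has more room for punctuation.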

qwertytyyuu | 2 days ago

Between the comments running correlations BC and AC, things still seem inconclusive.

@dang - can we add to the HN guidelines whether we should or should not call out AI when we see it? On one hand people might get defensive and the threads get out of hand. On the other hand, we don’t want AI slop.

firesteelrain | 2 days ago

I was surprised I only ranked 34th for earliest -- but then I saw it was the date my account was created.

nullc | 2 days ago

This shows absolute numbers. It would be better to see frequency.

EDIT: There's a second ranking linked at the top that shows this.

lo_zamoyski | 2 days ago

Place 33. I hate the whole LLMs em-dash thing since I now have to consider how em-dash usage impacts the perception of those reading what I wrote.

At least I have always tended to use the em-dash with spaces surrounding it — like so. I know the Anglosphere convention is to use it without spaces, but I just don't like that visually. That is at least one way to tell me apart from typical LLM-generated text.

atoav | 2 days ago

So now some folks will intentionally add in em dashes to get on the leaderboard — oops!

attogram | 2 days ago
