Using the HN public dataset in Google BigQuery [0], which I think fits easily within the free query quota:
SELECT
  EXTRACT(YEAR FROM timestamp) AS year,
  SUM(CASE WHEN text LIKE '%—%' THEN 1 ELSE 0 END) AS withDash,
  COUNT(*) AS total,
  SUM(CASE WHEN text LIKE '%—%' THEN 1 ELSE 0 END) / COUNT(*) AS fraction
FROM `bigquery-public-data.hacker_news.full`
WHERE type = 'comment'
GROUP BY year
ORDER BY year;
year  with—    total  frac
2006      0       12  0.000
2007     13    70858  0.000
2008    461   247922  0.001
2009   1497   491034  0.003
2010   3835   842438  0.005
2011   4719  1044913  0.005
2012   5648  1246782  0.005
2013   7881  1665185  0.005
2014   8400  1510814  0.006
2015   9967  1642912  0.006
2016  12081  2093612  0.006
2017  14530  2361709  0.006
2018  19246  2384086  0.008
2019  23662  2755063  0.009
2020  27316  3243173  0.008
2021  32863  3765921  0.009
2022  34657  4062159  0.009
2023  36611  4221940  0.009
2024  32543  3339861  0.010
2025  30608  2231919  0.014
So there's definitely been an increase.
Querying for the users who use "—" most as a proportion of all their comments:
SELECT
  `by`,
  SUM(CASE WHEN text LIKE '%—%' THEN 1 ELSE 0 END) / COUNT(*) AS fraction,
  COUNT(*) AS total,
  MIN(timestamp) AS minTime,
  MAX(timestamp) AS maxTime
FROM `bigquery-public-data.hacker_news.full`
WHERE
  type = 'comment' AND
  timestamp < '2022-11-30'
GROUP BY `by`
HAVING COUNT(*) > 100
ORDER BY fraction DESC
LIMIT 250;
zmgsabst uses them the most [1]; westoncb [2] is an older account that uses them fourth-most.
[0] https://console.cloud.google.com/marketplace/product/y-combi...
You might also want to rank by how often people use double hyphens-- like so.
I'm probably not alone here in being a longtime Linux user who started using a MacBook after the Apple Silicon transition, late 2022.
On Windows and Linux, inserting an em-dash is a laborious alt-code process. But on macOS with an Apple keyboard, the `option` key acts like a tertiary shift, so an `–` en dash is just <option><->, and an em dash is <option><shift><->.
I didn't start using em-dashes (typing -- is just second nature to me and I'm still on Linux most of the time) when I got a MacBook, but I imagine some people in my shoes did.
You can count your own with this snippet. Just replace my username with your own. My count before this comment was 46.
curl -s "https://hn.algolia.com/api/v1/search?tags=comment,author_sjs382&hitsPerPage=10000" \
| jq -r '.hits[].comment_text' \
| grep -o "—" \
| wc -l
I’d be interested in seeing how the data changes if instead of the total raw number of posts with em-dashes you instead check for their percentage considering the total number of posts. I guess the folks who registered later would be bumped up the list?
This is the kind of top-tier content we need on HN. These are the issues that really matter!
Fun, but perhaps the ratio of em-dashes per comment would be more interesting?
Otherwise it looks like the "race" is biased towards just the number of comments posted.
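A minimal sketch of that per-user ratio, in case it helps. It assumes the comments are already loaded as (user, text) pairs; the function name `em_dash_fraction`, the `min_comments` cutoff, and the sample rows are made up for illustration and are not taken from the leaderboard.

```python
from collections import defaultdict

EM_DASH = "\u2014"  # the Unicode em dash character


def em_dash_fraction(comments, min_comments=1):
    """Return {user: share of that user's comments containing an em dash}."""
    totals = defaultdict(int)
    with_dash = defaultdict(int)
    for user, text in comments:
        totals[user] += 1
        if EM_DASH in text:
            with_dash[user] += 1
    # Skip users below the (hypothetical) minimum comment count.
    return {
        user: with_dash[user] / totals[user]
        for user in totals
        if totals[user] >= min_comments
    }


# Hypothetical sample data, not real HN comments.
sample = [
    ("alice", "First point \u2014 second point."),
    ("alice", "No dash here."),
    ("bob", "Plain hyphen - only."),
]
fractions = em_dash_fraction(sample)
ranking = sorted(fractions, key=fractions.get, reverse=True)
```

Ranking by the fraction instead of the raw count means a prolific commenter no longer wins just by posting more, though a minimum comment count is probably still needed to keep tiny accounts from dominating.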
The em-dash giveaway is an actual Unicode em-dash character, right? I had to learn LaTeX professionally to write a paper in the 1990s and have had a "---" habit ever since, and I've been wondering if that's some kind of weird LLM tell now.
Due to the interest in this project, I created a second, more comprehensive version of the leaderboard:
https://www.gally.net/miscellaneous/hn-em-dash-user-leaderbo...
This second version was vibe-coded with Codex CLI. I also tried Gemini CLI, but it didn’t work very well. The SQL scripts I ran at BigQuery were by Claude.
I am not a programmer or web designer, so I will leave these pages as they are, warts and all. It was a fun project, though. I never would have attempted something like this pre-vibe-coding.
As an em dash appreciator—and there are dozens of us!—I have mixed feelings on ChatGPT embracing our little guy. My suspicion is that it's a quirk of their RLHF tuning where the em dash—which is definitely distinct from the en dash and hyphen—came to be associated with authoritative writing.
Heh. A top 50. No way that I'm in there — I don't post that much.
Oh look, a more complete leaderboard — click.
Oh. I'm at position 51.
It might be more fun to see users whose em-dash usage increased after the ChatGPT release.
It would be interesting to compare the post-2022 usage trends among the top contenders.
I have started using triple dots as on Linux I can get them with Alt Gr + .
A lot more symbols can be accessed with Alt Gr than on Windows.
Feature request: Sort by em-dashes per comment.
Feature request 2: Em-dash regular-dash ratio.
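For the second request, one rough way to compute it, assuming "regular dash" means the ASCII hyphen-minus. The function name is hypothetical, and returning None when there are no hyphens is just one possible choice.

```python
EM_DASH = "\u2014"  # Unicode em dash
HYPHEN = "-"        # ASCII hyphen-minus, assumed to be the "regular dash"


def em_to_hyphen_ratio(text):
    """Em dashes divided by ASCII hyphens; None when there are no hyphens."""
    hyphens = text.count(HYPHEN)
    if hyphens == 0:
        return None
    return text.count(EM_DASH) / hyphens
```

Note that this counts every hyphen, including the ones inside compound words, and that "--" counts as two.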
I mourn and celebrate the emdash as a sign and signal. I mourn our memories of it and laugh at myself in the future thinking about this when I have forgotten about it.
It's like the memory of the jokes about the wacky phrases of gpt2 or the ew at the yellow hue saturated ai generated images.
In the future this sign will be gone and our pattern recognition will adapt and our memory of this will also mostly be gone. Hello to future tech archeologists. The emdash isn't a meme, it will never survive and replicate but it's fun while it's lasting and I'm enjoying it in the meantime!
I mourn also because in the future we may have few or no obvious signs of LLM use. These are the golden years.
I just realized I’ve been using en-dash this whole time. This is an identity crisis.
Microsoft Word converts your dashes to em-dashes for you automatically, and has for at least the last decade. So as a sibling comment said, if it's professionally written, em dashes are probably used more than regular ones.
I think a slightly more interesting statistic is to count only \w—\w. This excludes cases like "(—)" and em dashes surrounded by spaces, which is, apparently, what Russian-speaking users like to use. Also, it is a very old tradition to format page titles as <title>[Page name] — [Website name]</title>: depending on the language, this is a default setting for MediaWiki, WordPress, etc.
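A small sketch of that stricter count, using Python's `re` module. The name `count_tight_em_dashes` is made up here, and using lookarounds rather than a plain `\w...\w` match is my own choice so that adjacent matches don't swallow each other's neighbours.

```python
import re

# An em dash (U+2014) with a word character immediately before and after it.
# Lookarounds keep the neighbours out of the match, so "a\u2014b\u2014c" counts twice.
TIGHT_EM_DASH = re.compile(r"(?<=\w)\u2014(?=\w)")


def count_tight_em_dashes(text):
    """Count em dashes that touch a word character on both sides."""
    return len(TIGHT_EM_DASH.findall(text))
```

A spaced em dash, a parenthesised one, or a title separator with spaces on either side all come out as zero under this rule.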
As someone who leans heavily on emdashes, this has all been very annoying.
Ironically, I personally prefer good typography, but unless the editor for the desktop app is autocorrecting -- to —, I usually don't bother. But when I type on the phone with the on-screen keyboard, I almost always do bother, even though entering text on mobile is objectively slower and more difficult, often with fewer options.
How about en dash usage? Has that been used as a similar false indicator?
How can I get to the top of the leaderboard?
Is it the number of em dashes that is counted, or the number of comments that have at least one em dash in them?
You know, I am asking for...science(?).
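The two ways of counting can give quite different numbers. For what it's worth, the `LIKE` filter in the BigQuery queries posted in this thread counts comments with at least one em dash, while the `grep -o ... | wc -l` snippet counts every occurrence. A quick sketch of the difference, with a made-up function name and sample texts:

```python
EM_DASH = "\u2014"  # Unicode em dash


def count_both_ways(texts):
    """Return (comments containing an em dash, total em dash occurrences)."""
    comments_with_dash = sum(1 for text in texts if EM_DASH in text)
    total_dashes = sum(text.count(EM_DASH) for text in texts)
    return comments_with_dash, total_dashes


# Hypothetical sample texts, not real HN comments.
texts = ["a\u2014b\u2014c", "none here", "x \u2014 y"]
per_comment, per_occurrence = count_both_ways(texts)
```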
I also wanted to point out that these could be Cantonese/Mandarin/Japanese/Southeast Asian users that use their local keymapping software because a lot of them use the idiom symbols (e.g. the dot character, too) when they switch to the English keymaps.
Check out what laptops usually look like over there; a lot of manufacturers build that right into the firmware.
I'm actually one of the people who use the em dash regularly. I treat it like a pause—like sighing. It's very easy to type on a Mac, and it becomes muscle memory: Opt+Shift+Dash.
This is kind of pointless given that iOS’s autocorrect has been adding em dashes, ellipses and smart quotes to comments since… forever.
(Like now)
It’s become a weird kind of witch hunting regarding blogs, too, and I have a 20+ year old site that renders all of its content using Markdown extensions that do the same (and that also convert dual hyphens to em dashes—something I’ve been typing for about as long).
I started using em dashes in my academic career, after my advisor pointed out the subtle differences to me. Since then, I have liked and used the em dash a lot. In LaTeX, it is easily produced; just keep the spacing rules in mind. The Punctuation Guide is a nice reference on it: https://www.thepunctuationguide.com/
I think this whole em dash topic should lead to some deeper (though not very deep) conversations:
* If it was not widely used before, where/how did (Chat)GPT pick it up?
* If it was widely used, then it shouldn't be a topic at all. But, there seems to be informal agreement that it wasn’t widely used.
* Or, could GPT have inferred that even though it's not widely used, it's the better way to go (to use it). Which then makes one wonder about the whole probability of next token idea. Maybe this line of thinking falls too short of what might be really going on internally.
* If it had picked up something that is widely used but in the wrong way, it should make us pause (again) about the future feedback loops these LLMs, which aren't going away, are already creating. Not just in terms of grammar and spelling but also in terms of way of thinking and seeing the world.
(edit: formatting)
The one thing LLMs do well is manipulating text. The danger is obviously that it will reduce individual expression and make everything the same mediocre sludge.
For me writing is a way to capture a stream of consciousness so I don’t really see the advantage of using an LLM.
When I see some trivial mediocrity I simply stop reading. It’s just not interesting.
I applaud this data. But how are people actually creating an em-dash in the "add comment" box? Some non-obvious OS-level shortcut?
Someone should make something like this for the wider world outside of just HN. Go through all my publications through gScholar or elsewhere, and scour and parse anything I wrote publicly pre-11/30/22 to establish some kind of proof-of-humanity. Sincerely, an em-dash user who got overtaken by the GenAI wave of the mid-2020s.
Confused by the year stats below - they show an increase much earlier than, say, the GPT-3 release date. So I'm guessing whatever is going on isn't just AI?
I noticed them in the Economist around 2010, and thought they were slick. Tons of software will autodetect "---" as an emdash so that works.
Honestly, even if it doesn't make it pretty I find stringing together a few hyphens does the trick in less formal settings.
Slightly tweaked, a leaderboard of em dash containing comments after ChatGPT release, limited to users who used them in fewer than 1% of comments before ChatGPT release, and who posted at least 200 comments before and after ChatGPT release. Data is recent (August 28th).
Of course this doesn't mean they're using ChatGPT either; they could've switched devices or started using them because they felt like it.
 #  user            before_chatgpt  after_chatgpt
 1  fao_            9/1777 (1 %)    36/225 (16 %)
 2  tlogan          1/962 (0 %)     59/399 (15 %)
 3  whynotminot     1/250 (0 %)     36/356 (10 %)
 4  unclebucknasty  13/2566 (1 %)   38/378 (10 %)
 5  iLemming        0/793 (0 %)     61/628 (10 %)
 6  nostrebored     10/1045 (1 %)   32/331 (10 %)
 7  freeone3000     0/2128 (0 %)    74/791 (9 %)
 8  pdabbadabba     6/932 (1 %)     20/225 (9 %)
 9  thebooktocome   4/632 (1 %)     18/208 (9 %)
10  tnecniv         0/671 (0 %)     34/446 (8 %)
11  dkersten        39/5092 (1 %)   24/318 (8 %)
12  stared          8/1565 (1 %)    29/392 (7 %)
13  ETH_start       3/385 (1 %)     75/1029 (7 %)
14  tcbawo          2/792 (0 %)     15/218 (7 %)
15  jbm             2/406 (0 %)     22/350 (6 %)
Query [2]:
WITH by_user AS (
  SELECT
    `by` AS user,
    COUNTIF(text LIKE '%—%') AS match_count,
    COUNT(*) AS total_count,
    (timestamp >= '2022-11-30') AS after_chatgpt
  FROM `bigquery-public-data.hacker_news.full`
  WHERE type = 'comment'
  GROUP BY user, after_chatgpt
),
combined AS (
  SELECT
    user,
    MAX(IF(NOT after_chatgpt, match_count, 0)) AS match_before_chatgpt,
    MAX(IF(NOT after_chatgpt, total_count, 0)) AS total_before_chatgpt,
    MAX(IF(after_chatgpt, match_count, 0)) AS match_after_chatgpt,
    MAX(IF(after_chatgpt, total_count, 0)) AS total_after_chatgpt
  FROM by_user
  GROUP BY user
  HAVING total_before_chatgpt >= 200 AND total_after_chatgpt >= 200
),
with_fractions AS (
  SELECT
    *,
    SAFE_DIVIDE(match_before_chatgpt, total_before_chatgpt) AS fraction_before_chatgpt,
    SAFE_DIVIDE(match_after_chatgpt, total_after_chatgpt) AS fraction_after_chatgpt
  FROM combined
)
SELECT
  user,
  FORMAT('%d/%d (%.0f %%)', match_before_chatgpt, total_before_chatgpt, ROUND(fraction_before_chatgpt*100)) AS before_chatgpt,
  FORMAT('%d/%d (%.0f %%)', match_after_chatgpt, total_after_chatgpt, ROUND(fraction_after_chatgpt*100)) AS after_chatgpt
FROM with_fractions
WHERE fraction_before_chatgpt < 0.01
ORDER BY fraction_after_chatgpt DESC
LIMIT 15
[1] https://news.ycombinator.com/item?id=45072937
[2] https://console.cloud.google.com/marketplace/product/y-combi...
A related question - if you fed each comment into an LLM and asked it to classify it as one of {human-produced, llm-produced, not-sure}, how many would it think are from LLMs? How could you try to investigate the true answer?
For most write-ups, I’ve switched to en-dash flanked by two spaces these days. Easier to type and looks less gippitified imo.
> But British usage - instead - uses spaces, so an en-dash or an em-dash is acceptable.
Sadly, I’ve been editing it out of my writing, at least online and in emails.
I suspect they are generated via "autocorrect", the same way as "smart (more like stupid) quotes" and other characters that tend to cause a great deal of frustration should they find their way into source code. It would be interesting to see how many users regularly make posts containing non-ASCII characters.
This kind of thing is the only way I'm likely to get in a top-10-HackerNews-users list ^_^;
How do people type an em dash on the keyboard? On iOS you have to long-press the dash key; on a full hardware keyboard you have to key in the code(?). That’s all very cumbersome and unnatural! Why do people bother using the em dash at all?
So does that mean ChatGPT was trained on these HNers' comments?
I probably would have made the list, but regular dashes are good enough for me - ASCII forever!!!
As #10 on this list, here’s how I do it on my laptop.
I remap a key to the right of Space to Compose, and add various custom sequences. Before long, I was completely comfortably and casually typing dashes and curly quotes and more, and in fact it takes conscious effort for me to limit myself to ASCII when typing prose. (Writing code, writing *, /, -, ' and " is easy. But writing prose, I genuinely will write ×, ÷ if it feels the right one in that place, −, ‘/’ and “/”.)
On one previous laptop keyboard I mapped Menu, on my current one RAlt is more suitable.
When on Windows, I use WinCompose. On Linux, I used to just use it bare, which had advantages and disadvantages—apps implement a Compose key inconsistently, some messing things up related to includes and some handling overlapping sequences differently. More recently I wanted to be able to type Telugu and installed fcitx5 which is no longer mostly broken under Wayland like it was last time I tried, so now fcitx5 is handling the Compose sequences across the entire system, and working more consistently. Also I can use Ctrl+Alt+Shift+U and get a popup where I can search Unicode by code or description. Now if only that pesky popup would handle Shift+Space and Ctrl+Backspace itself rather than letting them fall through to the parent…
In my ~/.config/sway/config:
input * {
    xkb_options "caps:backspace,compose:ralt"
}
(caps:backspace isn’t entirely relevant here, but it’s on the same line and I choose to mention it. When people are remapping Caps Lock, I’ve never understood why so many seem to choose to make it Escape. Just extend the left hand and slap the corner of the keyboard with the ring finger, it’s not a huge movement and is easy to reach and return. Backspace, however, tends to be needed at least as often (and yes, I say that despite using Vim), and is much harder to hit. In my mind, a far better candidate for shifting to that prime real estate.)
For my ~/.XCompose, I start with the defaults and one good set of additions, https://raw.githubusercontent.com/kragen/xcompose/master/dot...:
include "/usr/share/X11/locale/en_US.UTF-8/Compose"
include "/home/chris/.XCompose-kragen"
Then I add all kinds of additions. Lots of fine typography stuff like zero-width space and non-joiner, narrow no-break space, thin space… a few more hyphen/dash mappings… and lots of other things like nice emoji sequences, music notation stuff, Greek letters matching Vim digraphs, superscript ordinals (ˢᵗ, ⁿᵈ, ʳᵈ, ᵗʰ), the keyboard shortcut symbols macOS uses (⌘⌃⌥⇧⌫ and another dozen less common ones), control pictures like ␆, and a handful of other things.
When all’s said and done:
• Compose - - - gets me — EM DASH (stock)
• Compose - - . gets me – EN DASH (stock)
• Compose - - = gets me − MINUS SIGN (custom)
• Compose - - w gets me ⸺ TWO EM DASH (custom; w for wide)
• Compose - - W gets me ⸻ THREE EM DASH (custom; W for Wider)
The last two I use occasionally, the other three I use very frequently. I went through a phase of using HYPHEN and SOFT HYPHEN, now I seldom use them.
I also like to write &c. (italic where supported) for et cetera.
For quotation marks, I also use custom mappings:
<Multi_key> <semicolon> <semicolon> : "‘" U2018 # LEFT SINGLE QUOTATION MARK
<Multi_key> <apostrophe> <apostrophe> : "’" U2019 # RIGHT SINGLE QUOTATION MARK
<Multi_key> <colon> <colon> : "“" U201c # LEFT DOUBLE QUOTATION MARK
<Multi_key> <quotedbl> <quotedbl> : "”" U201d # RIGHT DOUBLE QUOTATION MARK
Think about how you physically type them, and I reckon these mappings make a lot of sense, very easy to type. Much better than the stock bindings (<' >' <" >") or kragen ones (`Space 'Space `` ''; or 6' 9' 6" 9").
—⁂—
(Oh yeah, that one’s <Multi_key> <h> <r> : "—⁂—".)
Now, I have one question I’d like answered. Overlapping sequences. If you have -> → and <- ← you’re fine, but when you add <-> ↔, I can’t find any way of using the <- sequence any more. Before fcitx5, some apps would ignore one or the other (in ways difficult to explain which I think involved the fact that some definitions came from includes), and some would let you terminate the sequence early and match the shorter one (e.g. Compose < - Enter). Is there some proper solution I’ve missed?
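Not an answer, but a tiny sketch for spotting where the overlap problem will bite: it lists every sequence that is a strict prefix of another one. It assumes each character stands for one key pressed after Compose, which is a simplification of the real keysym names, and `prefix_conflicts` is a made-up helper name.

```python
def prefix_conflicts(sequences):
    """Return (shorter, longer) pairs where shorter is a strict prefix of longer."""
    seqs = sorted(set(sequences))
    return [(a, b) for a in seqs for b in seqs if a != b and b.startswith(a)]


# The three arrow sequences mentioned above, written as typed after Compose.
conflicts = prefix_conflicts(["->", "<-", "<->"])
```

Here only `<-` collides, because it is a prefix of `<->`; `->` is unaffected.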
I have plans for an article on my keyboard arrangements, including sharing a full .XCompose, but I’m going to finish my next major revision to my website first. Because then I’ll be able to draw things instead of just writing.
—⁂—
On mobile, I think I use FUTO keyboard at present, which lets me access most of these things, but not elegantly. I want to make my own keyboard layout that lets me access the good stuff more easily, but I haven’t got to it yet.
Also: anyone want to join me in advocating for completion dictionaries and libraries to replace their ' apostrophes with ’, or at least to support both approaches equally? I’m fed up with not having this stuff, Vim is the only place where it was straightforward to get it about right, and mobile is just a mess.
Well─────that was bound to happen.
Some of us use a triple dash to indicate the same thing. Like LaTeX. You should add that too.
I guess I’m confused. Why is it interesting to know how many em dashes were used before the dawn of ChatGPT? It’s how many AFTER that seems like it would be far more interesting.
Yes! #21! A list I finally made — and I was not surprised to find I was on it.
There's also https://news.ycombinator.com/item?id=27787448
I was hoping to see a graph of em-dash usage over time across all comments - would be interesting to see the spike post LLM
This is amazing. The rise of the AI-generated em dash is insane.
I do em dash on my phone, and --- on the computer. Can we expand this further? I wanna make at least the top 200!
If I had a key for it on my keyboard, I'd use it more often too.
The post where we discovered dang was an AI.
We need a column for em-dashes per 1000 words.
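A rough sketch of that metric, assuming words are simply whitespace-separated tokens once em dashes are treated as separators. The function name and the tokenization rule are my own assumptions, not how the leaderboard works.

```python
EM_DASH = "\u2014"  # Unicode em dash


def em_dashes_per_1000_words(text):
    """Em dash occurrences per 1000 whitespace-separated words."""
    # Treat the em dash as a separator so "two\u2014three" counts as two words.
    words = text.replace(EM_DASH, " ").split()
    if not words:
        return 0.0
    return 1000 * text.count(EM_DASH) / len(words)
```

Normalizing by word count rather than by comment count means long comments no longer look more dash-heavy just because they are long.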
Between the comments running correlations BC and AC, things still seem inconclusive.
@dang - can we add it to the HN guidelines that we should not or should call out AI when we see it? On one hand people might get defensive and the threads get out of hand. On the other hand, we don’t want AI slop.
I was surprised I only ranked 34th for earliest -- but then I saw it was the date my account was created.
This shows absolute numbers. It would be better to see frequency.
EDIT: There's a second ranking linked at the top that shows this.
Place 33. I hate the whole LLM em-dash thing since I now have to consider how em-dash usage impacts the perception of those reading what I wrote.
At least I tended to use em-dash always with spaces surrounding it — like so. I know the anglospace-convention is to use it without spaces, but I just don't like that visually. At least one way to tell me apart from typical LLM-generated text.
So now some folks will intentionally add in em dashes to get on the leaderboard — oops!
v1 (the submitted URL) was https://www.gally.net/miscellaneous/hn-em-dash-user-leaderbo.... We've replaced it with v2 now, for more complex analytical em dash explorations :) - see https://news.ycombinator.com/item?id=45075379 and https://news.ycombinator.com/item?id=45072635.