Why is Apple Rosetta 2 fast? (2022)
Super interesting. Putting my PM hat on, I wonder: how many x86 apps on Apple still benefit from this much performance? What's the coverage? The switch to M1 happened 4 years ago, so the software was designed for hardware nearly half a decade old.
Excellent engineering and nice that it was built properly. Is this something that Linux / Wine / the Steam compatibility layer already benefit from?
Standardization of future Arm PCs, https://news.ycombinator.com/item?id=42182442
The Arm PC Base System Architecture 1.0 (PC-BSA) specifies a standard hardware system architecture for Personal Computers (PCs) that are based on the Arm 64-bit Architecture. PC system software, for example operating systems, hypervisors, and firmware can rely on this standard system architecture. PC-BSA extends the requirements specified in the Arm BSA.
Tangent: this is also why OrbStack, a Docker replacement on Mac, is fast [1] (I'm not affiliated in any way, just a fan and happy user :-).
--
I wonder if these lessons might be applied to Wasm runtimes, where the Wasm could be JIT-compiled into native code. This does raise security concerns if the Wasm compilation has a bug, and there’s also the question of whether Wasm’s requirements might mean native compilation doesn’t give much of a performance boost (as seems to be the case with, e.g., Java bytecode).
One other thing that is not mentioned is that Apple has an extension to compute rarely used x86 flags such as the parity flag in hardware rather than in software.
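For anyone wondering what "in software" costs here, a rough sketch of the parity flag computation follows. On x86, PF is set when the low 8 bits of a result contain an even number of 1 bits. The function names are made up for illustration and this is not Rosetta's actual code; it only shows the kind of extra work a translator would have to emit after flag-setting instructions if the hardware did not produce the flag for it.

```rust
/// x86 parity flag (PF): true when the low byte of `result`
/// has an even number of set bits. Hypothetical helper name.
fn parity_flag(result: u32) -> bool {
    (result as u8).count_ones() % 2 == 0
}

/// Same result using an XOR fold, roughly what a translator
/// might emit on a target without a cheap popcount.
/// Hypothetical helper name.
fn parity_flag_fold(result: u32) -> bool {
    let mut b = result as u8;
    b ^= b >> 4; // fold the high nibble into the low nibble
    b ^= b >> 2; // fold down to 2 bits
    b ^= b >> 1; // fold down to 1 bit: the XOR of all 8 bits
    (b & 1) == 0 // XOR of all bits is 0 when the count is even
}
```

Even the short version is several dependent instructions per flag-setting x86 instruction, unless the translator can prove the flag is never read.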
(2022)
Good article.
Post got the big one: Total Store Ordering (TSO).
The rest are all techniques in reasonably common use, but unless you have hardware support for x86's strong memory ordering, you cannot get very good x86-on-ARM performance. From inspecting existing code, it's by no means clear when strong memory ordering matters and when it doesn't, so you have to liberally sprinkle memory barriers around, which really kills performance.
The huge and fast L1I/L1D cache doesn't hurt things either... emulation tends to be cache-intensive.
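To make the memory-ordering point concrete, here is a minimal sketch of the classic message-passing pattern. The names (`DATA`, `READY`, `publish`, `try_consume`) are invented for illustration. On x86, both stores in `publish` are ordinary stores and the hardware keeps them in order, so a compiled binary carries no marker saying "this ordering matters". On a weakly ordered ARM core, the same guarantee needs the explicit release/acquire shown below, and a translator without TSO can't tell which loads and stores are the important ones, so it has to treat nearly all of them this way.

```rust
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};

// Hypothetical shared state: a payload and a "payload is ready" flag.
static DATA: AtomicU32 = AtomicU32::new(0);
static READY: AtomicBool = AtomicBool::new(false);

/// Producer: write the payload, then raise the flag.
/// Release ordering keeps the DATA store from being reordered
/// after the READY store.
fn publish(value: u32) {
    DATA.store(value, Ordering::Relaxed);
    READY.store(true, Ordering::Release);
}

/// Consumer: if the flag is raised, the payload is guaranteed visible.
/// Acquire ordering keeps the DATA load from being reordered
/// before the READY load.
fn try_consume() -> Option<u32> {
    if READY.load(Ordering::Acquire) {
        Some(DATA.load(Ordering::Relaxed))
    } else {
        None
    }
}
```

With TSO enabled in hardware, translated x86 loads and stores can stay plain loads and stores, which is the saving the comment above is describing.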