Migrating to Postgres
For all the Prisma-haters: I salute you. But I want to reply to numerous comments with the following:
ORMs come in two main types that I'm aware of: Active Record (named after the original Ruby one, I think) and Data Mapper (think Hibernate or SQLAlchemy).
Active Record ORMs are slightly more ergonomic, at the cost of doing loads of work in application memory. Data Mapper ORMs look slightly more like SQL in your code, but they are much more direct wrappers over things you can do in SQL.
Data Mapper also lets you keep various niceties, such as generated migration code, that stem from having your table definitions as objects.
Use Data Mapper ORMs if you want to use an ORM.
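Roughly, the difference looks like this. These are hypothetical types sketching the two patterns, not any particular library's API:

```typescript
// Hypothetical types sketching the two patterns -- not any real library's API.

// Active Record: the row object knows how to persist itself; the ORM tracks
// state on the instance and does the bookkeeping in application memory.
class UserRecord {
  constructor(public id: number, public email: string) {}

  async save(): Promise<void> {
    // The ORM diffs dirty fields on this instance and emits the INSERT/UPDATE.
  }
}

// Data Mapper: entities are plain data; a separate mapper translates operations
// into SQL, so what you write maps much more directly onto queries.
interface User {
  id: number;
  email: string;
}

interface UserMapper {
  findByEmail(email: string): Promise<User | null>;
  insert(user: User): Promise<void>;
}
```

Because the Data Mapper side keeps the table shape as explicit objects, things like generated migrations fall out naturally.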
I read it as: Why You Shouldn't Use Prisma and How Cockroach Hung Us Out To Dry
I already knew about prisma from the infamous https://github.com/prisma/prisma/discussions/19748
I can't help thinking more startups need greybeards around. (Of which, realistically, I'm now one.)
Largest table 100 million rows and they were paying 6 figures for database services annually? I have one now that sits happily enough on an 8-year-old laptop. I've worked on systems that had similar-scale tables chugging along on very average MSSQL 2000 boxes twenty years ago. There just isn't a need for cloud-scale systems and cloud-scale bills for that data volume.
The problems they're describing should never have got that far without an experienced hand pointing out they didn't make sense, and if they'd hired that greybeard they'd have spotted it long before.
The answer to the question, "what database should I use?" is "postgres". If you are in a situation where postgres actually won't work, then you already would know exactly why postgres won't work.
In other words: [Postgres -> exotic solution] is the path everyone should take (and 99% will just stay in postgres), and not [exotic solution -> postgres].
It's wild and hilarious how often startups and companies go for distributed databases like CockroachDB/TiDB/Yugabyte before they actually need distribution; this trend sucks. 100 million rows is nothing that a well-tuned Postgres or MySQL instance (or even a read-replicated setup) can't handle comfortably. Scale when you hit the wall.
I've lost count of how many "Migrating from X to Postgres" articles I've seen.
I don't think I've once seen a migrating away from Postgres article.
It's an awkward article. To answer why a query is slow, you need a bit more detail than just the query. I also reread the part about timeouts and still didn't get it: what the database was, whether it was a database issue, or how it related to the migration.
The only information I could extract was that the company made bad architectural decisions, believes in ORMs (looking at the queries, there are serious doubts that the data layout in the DB is adequate), and cannot clearly explain the situation. But that is only interesting to their candidates or investors.
It may sound rude, so I apologise.
Feels like postgres is always the answer. I mean, there's gotta be some edge case somewhere where postgres just can't begin to compete with a more specialized database, but I'd think that going from postgres to something else is much easier than the other way around.
> First, the support portal is a totally different website that doesn’t share auth with the main portal. Second, you have to re-input a lot of data they already know about you (cluster ID, etc). And by the time they respond it’s typically been a week.
I was about to ask what the main constraint for CockroachDB was, like iostat and atop info for CPU/disk drives, but realized that is probably offloaded to some SaaS - so still curious.
Great blog. It seems like you might benefit from columnar storage in Postgres for that slow query that took ~20 seconds.
It's interesting that people typically think of columnstores for strict BI / analytics. But there are so many app / user-facing workloads that actually need them.
PS: we're working on pg_mooncake v0.2: create a columnstore in Postgres that's always consistent with your OLTP tables.
It might help for this workload.
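To make "app / user-facing workloads" concrete, here's the rough shape of query we mean; a sketch with a hypothetical table and columns, sent through the plain node-postgres client. It scans a lot of rows but only touches a few columns, which is exactly where a row store slows down and a columnstore shines:

```typescript
import { Client } from "pg";

// Hypothetical user-facing analytics query: scans many rows, touches few
// columns -- the access pattern where columnar storage tends to pay off.
async function usageSummary(userId: string) {
  const client = new Client(); // connection settings come from PG* env vars
  await client.connect();
  try {
    const { rows } = await client.query(
      `SELECT date_trunc('day', created_at) AS day,
              count(*)                      AS tasks,
              avg(duration_ms)              AS avg_duration_ms
         FROM task_events                   -- hypothetical table
        WHERE user_id = $1
          AND created_at >= now() - interval '90 days'
        GROUP BY 1
        ORDER BY 1`,
      [userId]
    );
    return rows;
  } finally {
    await client.end();
  }
}
```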
Did I miss something, or does the article not mention anything about sharding in Postgres? Was that just not needed?
Also, query planner maturity is a big deal. It's hard to get Spanner to use the indexes you want.
It is forever enraging to me that ORMs turn SELECT * into each individual column, mostly because people then post the whole thing and it’s obnoxiously large.
Similarly maddening, the appalling lack of normalization that is simply taken for granted. “It’s faster, bro.” No, no, it is not. Especially not at the hundreds-of-millions or billions-of-rows scale. If you store something low-cardinality like a status column as text, with an average length of perhaps 7 characters, that’s 8 bytes (1 byte of overhead assumed, but it could be 2). Multiply that by 2 billion rows and you’re wasting 16 GB. Disk is cheap, but (a) memory isn’t, and (b) don’t be lazy. There’s a right way to use an RDBMS, and a wrong way. If you want a KV store, use a damn KV store.
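Concretely, a minimal sketch of the normalized version, with hypothetical table names: the status text lives once in a lookup table and the big table carries a 2-byte key.

```typescript
import { Client } from "pg";

// Hypothetical DDL: keep the status text once in a lookup table and store a
// 2-byte smallint key on the big table instead of repeating an ~8-byte string.
const ddl = `
  CREATE TABLE statuses (
    id   smallint PRIMARY KEY,
    name text NOT NULL UNIQUE            -- 'pending', 'running', 'done', ...
  );

  CREATE TABLE events (
    id        bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    status_id smallint NOT NULL REFERENCES statuses(id),
    payload   jsonb
  );
`;

async function createSchema(): Promise<void> {
  const client = new Client(); // connection settings come from PG* env vars
  await client.connect();
  try {
    await client.query(ddl); // multiple statements are fine when unparameterized
  } finally {
    await client.end();
  }
}
```

At 2 billion rows that's roughly 4 GB of status keys instead of ~16 GB of repeated strings, and the narrower rows are what keeps more of the table in memory.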
Finally, I’d be remiss if I failed to point out that Prisma is an unbelievably immature organization who launched without the ability to do JOINS [0]. They are forever dead to me for that. This isn’t “move fast and break things,” it’s “move fast despite having zero clue what we’re doing but convince JS devs that we do.”
I'm curious about Motion's experience with "Unused Indices". They suggest Cockroach's dashboard listed used indexes in the "Unused Indices" list.
I think the indexes they suspected were used really were unused: Motion didn't realize CockroachDB was doing zigzag joins on other indexes to accomplish the same thing, leaving the indexes that would obviously have been used genuinely unused.
It's a great feature, but CRDB's optimizer would prefer a zigzag join over a covering index; getting around this required writing indexes in a way that persuades the optimizer not to plan a zigzag join.
A 100-million-row table is fairly small, and you just don't need a distributed database. But you will need one if you hit 10 billion rows.
I increasingly wonder, with tools like ChatGPT, whether ORMs make sense anymore. The argument I've always heard for ORMs is that they make it quick and easy to write the initial repository methods, and that they make migrating to a new DB easier (hypothetically).
It seems to me LLMs now help fill both those roles, but with the benefit that you can now tune the SQL as needed. I also always prefer actually knowing exactly what queries are going to my DB, but YMMV.
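For illustration, here's roughly what the no-ORM version of a repository method looks like with the node-postgres client (table and column names are made up): the SQL the database sees is exactly the string you wrote, so it's trivial to match against pg_stat_statements or EXPLAIN output.

```typescript
import { Pool } from "pg";

const pool = new Pool(); // connection settings come from PG* env vars

// Hypothetical row shape for the example table.
type Task = {
  id: string;
  title: string;
  status: string;
};

// A hand-written repository method: the SQL sent to the database is exactly this string.
export async function openTasksForUser(userId: string): Promise<Task[]> {
  const { rows } = await pool.query<Task>(
    `SELECT id, title, status
       FROM tasks                -- hypothetical table
      WHERE user_id = $1
        AND status <> 'done'
      ORDER BY created_at DESC
      LIMIT 50`,
    [userId]
  );
  return rows;
}
```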
Did you try using the native pg library, or the postgres or pg-promise libraries, and scrapping the ORM completely to see what effect it has? If you are looking explicitly for migrations, you can simply use node-pg-migrate https://www.npmjs.com/package/node-pg-migrate and scrap the rest of all the FLUFF that ORMs come with. ORMs in general are horribly bloated, and their performance for anything more than select from table where name = $1 is very questionable.
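For the migrations piece, a node-pg-migrate file is just an exported up/down pair; a minimal sketch with a hypothetical table looks something like this:

```typescript
import { MigrationBuilder } from "node-pg-migrate";

// Hypothetical migration file: node-pg-migrate runs the exported up/down
// functions in order and records what has been applied in a bookkeeping table.
export async function up(pgm: MigrationBuilder): Promise<void> {
  pgm.createTable("tasks", {
    id: "id", // shorthand for a serial primary key
    user_id: { type: "bigint", notNull: true },
    status: { type: "text", notNull: true },
    created_at: { type: "timestamptz", notNull: true, default: pgm.func("now()") },
  });
  pgm.createIndex("tasks", ["user_id", "created_at"]);
}

export async function down(pgm: MigrationBuilder): Promise<void> {
  pgm.dropTable("tasks");
}
```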
Why not optimise the bad queries first?
Aside. Job section says not 9-5. What does that mean? Long hours? Or not 9-5 attitude?
It still makes me sad when half the queries I see are json_* - I know it's far too late, but a big sad trombone in query performance is constantly left joining against json_* expansions that the planner is going to estimate at 100 rows forever.
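To illustrate the pattern (hypothetical table; the estimate behaviour is the point, not the exact numbers): the planner has no statistics for the rows a json_* set-returning function will produce, so it typically falls back to a fixed guess, and that guess feeds every join above it in the plan.

```typescript
import { Client } from "pg";

// Hypothetical illustration: the planner has no statistics for the rows a
// json_* set-returning function will produce, so it typically uses a fixed
// per-call estimate, and that guess feeds every join above it in the plan.
async function explainJsonJoin(): Promise<void> {
  const client = new Client(); // connection settings come from PG* env vars
  await client.connect();
  try {
    const { rows } = await client.query(`
      EXPLAIN
      SELECT o.id, item->>'sku' AS sku
        FROM orders o                                    -- hypothetical table
        LEFT JOIN LATERAL jsonb_array_elements(o.items) AS item ON true
    `);
    for (const row of rows) console.log(row["QUERY PLAN"]);
  } finally {
    await client.end();
  }
}

explainJsonJoin().catch(console.error);
```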
WHERE CONDITION AND 1=1 results in scanning the whole table? I don't think so...
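Easy enough to check: Postgres constant-folds the always-true clause during planning, so the plan with AND 1=1 should match the plan without it. A quick sketch against a hypothetical table:

```typescript
import { Client } from "pg";

// Quick check: Postgres constant-folds the always-true clause during planning,
// so both statements should produce the same plan.
async function comparePlans(): Promise<void> {
  const client = new Client(); // connection settings come from PG* env vars
  await client.connect();
  try {
    for (const sql of [
      "EXPLAIN SELECT * FROM tasks WHERE user_id = 42",          // hypothetical table
      "EXPLAIN SELECT * FROM tasks WHERE user_id = 42 AND 1=1",
    ]) {
      const { rows } = await client.query(sql);
      console.log(sql);
      for (const row of rows) console.log("  " + row["QUERY PLAN"]);
    }
  } finally {
    await client.end();
  }
}

comparePlans().catch(console.error);
```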
That was an interesting read; it seemed like an overwhelming amount of data for why they should move off Cockroach. All of my DB work has been read-heavy, and I've never had a need for super-fast multi-region writes. Is a multi-region write architecture possible in Postgres? I'm trying to understand whether GDPR was the requirement that resulted in Cockroach, or whether the lackluster multi-region writes were the bigger driver.
indeed interesting
> By Jan 2024, our largest table had roughly 100 million rows.
I did a double take at this. At the onset of the article, the fact they're using a distributed database and the mention of a "mid 6 figure" DB bill made me assume they have some obscenely large database that's far beyond what a single node could do. They don't detail the Postgres setup that replaced it, so I assume it's a pretty standard single primary and a 100 million row table is well within the abilities of that—I have a 150 million row table happily plugging along on a 2vCPU+16GB instance. Apples and oranges, perhaps, but people shouldn't underestimate what a single modern server can do.