Ask HN: How to learn how to scale software?

Pooge | 35 points

A lot in this area comes from experience in operations, or at least an operations mindset.

Even if you haven't solved problems caused by real users and real load on production systems, you can still develop that mindset.

Build a habit of overloading your app with tons of requests, using load-testing tools like wrk, wrk2, or k6. Try to exercise all layers of your system this way, including the database (for that you obviously need to be able to create load-testing requests that cause a real conversation with the database). Check how much I/O, memory, and CPU (in that order) your app, database, cache, messaging, etc. are using under that load.

Develop a habit of opening the inspector in your browser and peeking at HTTP conversations, not only for your own services but for any random page you visit. Get acquainted with Brendan Gregg's perf tools and his blog. Know how to use strace (although its use on production is discouraged, as it slows down the examined process substantially) and understand its output. In general, become interested in when your code ends up making kernel syscalls and when it doesn't, how to limit syscalls by buffering, and how to read what's really going on. Make sure you understand how virtual memory works on a modern system.

That's just what comes to mind off the top of my head.
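
Since the comment above talks about limiting syscalls with buffering, here is a minimal sketch (assumptions: Linux, Python 3, strace installed; the script name, file path, and counts are made up) that you can run under strace to see the difference for yourself:

    # Run as: strace -c -e trace=write python3 buffering_demo.py unbuffered
    #     vs: strace -c -e trace=write python3 buffering_demo.py buffered
    # and compare the write() counts in the two strace summaries.
    import sys

    N = 100_000
    mode = sys.argv[1] if len(sys.argv) > 1 else "buffered"

    if mode == "unbuffered":
        # buffering=0 is only allowed in binary mode; every f.write() below
        # goes straight to the kernel, so you get roughly N write() syscalls.
        with open("/tmp/out.bin", "wb", buffering=0) as f:
            for _ in range(N):
                f.write(b"x" * 16)
    else:
        # The default buffered writer batches data in userspace and flushes in
        # large chunks, so only a handful of write() syscalls are made.
        with open("/tmp/out.bin", "wb") as f:
            for _ in range(N):
                f.write(b"x" * 16)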

kunley | a month ago

You learn by doing it, but it's also not incredibly complex for nearly all SaaS products (most simply are not at the scale that actually necessitates it).

At its most basic, you need insight into bottlenecks. That means some sort of profiling tool. From there, you start looking at slow things and making decisions.

In my experience, those decisions largely come down to a few things:

* Did I miss a basic optimization, like an index on a query, some sort of caching, an N+1 query, etc.? Don't look too deep here. (There's a small N+1 sketch after this list.)

* Am I hitting hard limits of my hardware? Memory, CPU, disk throughput, network throughput, etc. If you are, ask whether you can simply add more or bigger servers. If you are small, the answer is almost always "yes".

* Am I doing something really inefficiently? This is similar to the first point, but looks at things a bit more holistically: slow external network requests, a lack of caching, an inefficient join or query. Spend a lot of time here before moving on to the next point. These fixes are almost always easier than architectural changes.

* Do I need to make an architectural change? Look for things like poor dependency isolation, spaghetti code (really spaghetti data), and components that can be generalized.
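
For the N+1 item above, a minimal sketch using Python's built-in sqlite3 module (the schema and data are made up for illustration): the first version issues one query per author, the second fetches the same data in a single JOIN.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE books (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
        INSERT INTO authors VALUES (1, 'Ann'), (2, 'Bob');
        INSERT INTO books VALUES (1, 1, 'A1'), (2, 1, 'A2'), (3, 2, 'B1');
    """)

    # N+1: one query for the authors, then one more query per author.
    for author_id, name in conn.execute("SELECT id, name FROM authors").fetchall():
        books = conn.execute(
            "SELECT title FROM books WHERE author_id = ?", (author_id,)
        ).fetchall()

    # Fix: a single JOIN returns the same data in one round trip.
    rows = conn.execute(
        "SELECT a.name, b.title FROM authors a JOIN books b ON b.author_id = a.id"
    ).fetchall()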

SkyPuncher | a month ago

What the contemporary dev environment teaches about scale is that it barely matters until you hit certain thresholds, and then it suddenly dominates everything.

A corollary of this is that as your program becomes more complex, relatively more of the code ceases to matter to performance, because its purpose is to configure a certain inner loop, and the inner loop becomes the only real bottleneck.

So, when coding for very small systems (retro or embedded), the whole program architecture matters, but at hyperscale the problem is seen mostly in terms of distributing the load efficiently and deeply optimizing the program only in very specific places.
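
As a toy illustration of that inner-loop point (the function names and sizes are made up), a profiler run attributes nearly all of the time to the one hot function, even though most of the code is setup:

    import cProfile

    def configure():
        # Cheap setup/"configuration" code that prepares the data.
        return list(range(200_000))

    def inner_loop(data):
        # The hot path: this is where optimization effort actually pays off.
        total = 0
        for x in data:
            total += x * x
        return total

    def main():
        return inner_loop(configure())

    cProfile.run("main()", sort="cumulative")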

A good starting point for thinking about this is a system like BitTorrent: the ideal torrent experience will most likely use multiple peers, and performance needs to remain consistent as the number of peers increases and the load becomes more complex to distribute. But it's not really about local performance so much as it is about maintaining overall network conditions: if every peer is doing "OK" at serving and retrieving files, that's better than a very good experience served inconsistently.

crq-yml | a month ago

Another way to think about it is to be aware of and keep in mind all the various limitations of your implementation.

If you have a program that reads some stuff into memory, then transforms it, and then spits it back out (which is basically every program)... how can it still work if the data set is bigger than the amount of memory? What if it's bigger than the address space? (Figure out how to split the data up, work on segments, then combine the results.)
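
A minimal sketch of that split-the-data idea in Python (the file name and segment size are arbitrary): process one bounded segment at a time and combine the partial results, so memory use stays flat no matter how big the input is.

    def partial_sums(path, lines_per_segment=100_000):
        segment = []
        with open(path) as f:
            for line in f:
                segment.append(int(line))
                if len(segment) >= lines_per_segment:
                    yield sum(segment)  # result for one segment
                    segment = []
        if segment:
            yield sum(segment)

    # Combine the per-segment results; memory is bounded by the segment size.
    total = sum(partial_sums("numbers.txt"))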

Using ints or longs or whatever as an identifier? What if we have more things to identify than UINT_MAX or ULONG_MAX? What if we need to allocate identifiers so quickly, and in so many different places, that contention for the next available number becomes a bottleneck? (Use UUIDs or random identifiers.)
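
A tiny sketch of the UUID route: random identifiers are generated locally, so there is no shared counter to fight over and no UINT_MAX-style ceiling to hit.

    import uuid

    def new_id() -> str:
        # uuid4 is built from random bits; any number of processes or machines
        # can call this concurrently without coordinating with each other.
        return str(uuid.uuid4())

    print(new_id())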

A lot of people write software with the assumption that there is an unlimited amount of memory, CPU, etc., and usually that works fine, until it doesn't and it blows up. Scalability is about being aware of all the finite limitations, and having strategies for dealing with them instead of just hoping you don't get close to them.

tacostakohashi | a month ago

Skills like that are only really learned by doing.

There are no shortcuts, sorry. No amount of synthetic experiments, simulations, or theoretical examples will give you the actual hands-on knowledge.

For sure, books etc. are useful to gain some basic understanding.

But beyond that - you should set more practical and specific goals that would lead you to scaling challenges naturally.

For example - build your own web crawler etc.

Or just set a goal to join a company with demanding products.

aristofun | a month ago

I spent about a decade doing performance work, i.e.: app crashes, can't handle the load, can't scale up to what we need, etc. I think there are at least two parts to answering your question:

1) Get good at testing and profiling applications so that you can identify specific bottlenecks and then find architectural solutions to those pain points.

2) Get good at looking at an architectural model and identifying those pain points even before you've built a system. BUT sometimes the problem is that you have incomplete information about the demands on a system, and in other cases someone will want to over-architect from the beginning, which can itself become the bottleneck.

I guess I'd add perhaps a third option based on your specific question: Learn what it means to scale software. Sometimes that means low level things in software. Sometimes that means architecture. And sometimes it's just throwing a shit-ton of hardware at a problem. So in your desire to scale software, you need to know what technical solutions are available to you and make the tradeoffs depending on the resources (time, money, people, etc) available.

So how do you learn this stuff? Well, short of companies paying you like they did in my case, maybe consider some problem you want to explore, build a prototype, and consider introducing different types of both load and errors into the system so you can see how the system performs. A few years back I was asked to build a system to ingest what the client thought might be 10,000 images a day and run through a series of steps with those images. Well, the client effed up and I discovered that the real demand was more like 10,000,000 images a day that needed to be processed. So I had to modify the architecture. Point being - you could come up with some scenario like that, build yourself a little prototype lab experiment, and start playing with what happens as you modify parts of the architecture.

poulsbohemian | a month ago

Standard engineering practice scales to large engineering projects.

Know what problem the design should solve.

Know what resources are available.

Measure, prototype, test iteratively.

Eat the elephant one byte at a time.

Good luck.

brudgers | a month ago

First of all, the fact that you're asking this question puts you ahead of most engineers that I know. There's a well known saying that goes something like "Make it work, make it work well, then make it fast."

One of the simplest ways to think about scale is to think in terms of speed. This is a very very gross oversimplification and glosses over a lot of really important concepts, but at its core, you can say "if it's fast enough, it'll scale."

In a very simple mathematical sense, consider the idea that you have a single-instance, single-threaded application with no concurrency. If a request takes 1000ms to run, then you can do, at most, 1 request per second. If the request takes 100ms, you can do 10 requests per second. If it takes 10ms, you can do 100 requests per second, and if it takes 1ms, you can do 1000 requests per second.
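
That arithmetic as a few lines of Python, just to make the latency-to-throughput relationship explicit for the single-threaded case:

    # Max throughput of one non-concurrent worker is simply 1 / latency.
    for latency_ms in (1000, 100, 10, 1):
        max_rps = 1000 / latency_ms
        print(f"{latency_ms:>4} ms per request -> {max_rps:>6.0f} requests/second")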

See? Speed is throughput is scale.

But that is, obviously, an oversimplification of the problem. Real applications are multi-threaded, multi-instance, and offer concurrency. So now the problem is identifying your bottlenecks and fixing them. But again, at its core, the main idea is speed. How can you make things as fast as possible?

(Note: There is a need to consider concurrency and parallelism, plus certain data stores have inherent speed limitations that may need to be overcome, and those things can offset poor speed, but the simplest path to scalability is speed and optimizing throughput.)

The analogy I like to use is the grocery store. Imagine you own a grocery store, and you want to make as much money as possible. Well, the best way to do that is to make sure your customers can get their food and check out as fast as possible. That means making sure the food is easy to find (i.e., read access is fast!), that they don't have to wait to check out (i.e., queue depth is low), and that checking out is fast (i.e., writes are fast). The faster your customers can walk in the door and back out again, the more customers you can sustain over a period of time.

On the other hand, if your customers take too long to find their groceries, or they spend too long waiting in line, or they have to write checks instead of swiping a smart phone, then you wind up with a backlog. And the larger the backlog, the longer it takes for money to hit your bank account.

So in this sense, time is literally money. The faster they can get through your system, the better.

I mentioned three different ways of thinking about speed: reads, writes, and queue depth.

Keeping with our grocery store analogy, consider how to improve each of those things. How do you make sure your customers can find what they're looking for as fast as possible? You "index" things. You put signs on the aisle, you organize your content in a way that is intuitive and puts related things near each other. If you want spaghetti, the pasta and the sauce and the parmesan cheese are all right next to each other. If you want breakfast, the eggs and milk and cinnamon rolls are right next to each other. In and out.

Similarly, your data needs to be organized smartly so that the user can get in and out as fast as possible. In a database, this means optimizing data structures, adding indices, and optimizing queries. Reduce expensive queries, keep cheap fast queries. Find ways to cache hot data. Make it easy to find what you need.
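
A small sketch of what an index buys you, using Python's built-in sqlite3 module (the schema is made up); EXPLAIN QUERY PLAN typically reports a full table scan before the index exists and an index search afterwards:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
    )

    query = "EXPLAIN QUERY PLAN SELECT total FROM orders WHERE customer_id = 42"

    print(conn.execute(query).fetchall())  # typically: SCAN orders (full table scan)

    conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")

    print(conn.execute(query).fetchall())  # typically: SEARCH ... USING INDEX idx_orders_customer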

For writes, how do you speed up writes? One way is to make things asynchronous. Throw things that can be eventually consistent into queues and let an asynchronous job handle it outside the normal flow. The customer experiences minimal latency, and you've introduced concurrency to keep the data flowing while the customer is doing something else. This is, in part, why those little screens at the checkout counter ask you so many questions. They're distracting you while the cashier is scanning your groceries.
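
A minimal sketch of that pattern with Python's standard library (the function names are made up): the request handler just enqueues the work and returns, while a background worker drains the queue outside the request path.

    import queue
    import threading

    write_queue = queue.Queue()

    def save_to_database(event):
        pass  # stand-in for the real, slow, eventually-consistent write

    def worker():
        while True:
            event = write_queue.get()
            save_to_database(event)
            write_queue.task_done()

    def handle_request(payload):
        write_queue.put(payload)  # cheap and fast; the user never waits on the write
        return {"status": "accepted"}

    threading.Thread(target=worker, daemon=True).start()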

Queue depth optimization is important as well. If the queue gets really long at the grocery store, how do you improve that? You add more cashiers! The more cashiers you have, the more concurrent customers you can handle. But does it make sense to have 1 cashier per customer? Probably not. Now you've overscaled and you're spending too much money.
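
A small sketch of the "more cashiers" effect with I/O-bound work (the numbers are arbitrary): the same backlog drains faster as you add workers, but well before one worker per task the extra workers stop paying for themselves.

    import time
    from concurrent.futures import ThreadPoolExecutor

    def checkout(_):
        time.sleep(0.05)  # one slow, I/O-bound "customer"

    for workers in (1, 4, 16):
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=workers) as pool:
            list(pool.map(checkout, range(32)))
        elapsed = time.perf_counter() - start
        print(f"{workers:>2} workers -> {elapsed:.2f}s for 32 customers")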

As you can see, this is a complex operation, and again, my analogy is overly simplified and very dumb, but I hope this gives you a decent idea of how to visualize a scalability problem.

I'm not familiar with Elixir, but frankly the concepts should translate to any language, although the details may vary.

My suggestion? Learn how to do profiling, identify bottlenecks, and target the biggest bang for your buck. The big risk here is micro-optimization, so fight for changes that give you order of magnitude improvements. Saving 50 microseconds isn't worth your time, but shaving off 1500 milliseconds almost certainly is.

Best of luck.

Jemaclus | a month ago