Why We're Building Stategraph: Terraform State as a Distributed Systems Problem

lawnchair | 45 points

Hmm

I don’t see the state file as a complete downside. It is very simple and very easy to understand. It makes it easy to tell or predict what terraform will do given the current state and desired state.

Its simplicity makes troubleshooting easier: the state files are easy to read and manipulate or repair in the event of a drift, mismatch, or botched provider update.
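To make the "easy to read" point concrete, here is a minimal sketch of inspecting a state file with nothing but the standard library. It assumes the version 4 JSON layout (a top-level `resources` list whose entries carry `mode`, `type`, `name`, and `instances`); the sample document and the `list_addresses` helper are hypothetical, made up for illustration.

```python
import json

# Hypothetical, heavily trimmed state document (assumed version 4 layout).
SAMPLE_STATE = """
{
  "version": 4,
  "serial": 12,
  "resources": [
    {"mode": "managed", "type": "aws_s3_bucket", "name": "logs",
     "instances": [{"attributes": {"id": "example-logs"}}]},
    {"mode": "managed", "type": "aws_sqs_queue", "name": "jobs",
     "instances": [{"attributes": {"id": "example-jobs"}}]},
    {"mode": "data", "type": "aws_caller_identity", "name": "current",
     "instances": [{"attributes": {"id": "123456789012"}}]}
  ]
}
"""


def list_addresses(state):
    """Return resource addresses in the order they appear in the state."""
    addresses = []
    for res in state.get("resources", []):
        # Data sources are addressed with a "data." prefix.
        prefix = "data." if res.get("mode") == "data" else ""
        addr = f"{prefix}{res['type']}.{res['name']}"
        # Resources inside a module carry a "module" key such as "module.vpc".
        if res.get("module"):
            addr = f"{res['module']}.{addr}"
        addresses.append(addr)
    return addresses


state = json.loads(SAMPLE_STATE)
for address in list_addresses(state):
    print(address)
```

A real state file has many more fields per resource, but the same few lines are usually enough to see what is tracked before attempting a repair.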

With the solution proposed it feels like the state becomes a black box I shouldn’t put my hands in. I wonder how the troubleshooting scenarios change with it.

Personally, I haven’t run into the scaling issue described; at any given time there is usually only one entity working with the state file. We do use terragrunt for larger systems but it is manageable. ~1000 engineer org.

eschatology | 3 hours ago

This is very cool. I love the idea of querying the state, and it opens up a ton of very easy reporting options.
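As a rough idea of what such a report might look like, here is a sketch that assumes state is stored as rows in a SQL database. The `resources` table, its columns, and the sample rows are all hypothetical (this is not Stategraph's actual schema), and an in-memory SQLite database stands in for whatever backend is really used.

```python
import sqlite3

# Hypothetical rows: (resource type, resource name, environment).
ROWS = [
    ("aws_s3_bucket", "logs", "prod"),
    ("aws_s3_bucket", "assets", "prod"),
    ("aws_sqs_queue", "jobs", "staging"),
]


def count_by_type(rows):
    """Load rows into a throwaway table and count resources per type."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE resources (type TEXT, name TEXT, environment TEXT)"
        )
        conn.executemany("INSERT INTO resources VALUES (?, ?, ?)", rows)
        return conn.execute(
            "SELECT type, COUNT(*) FROM resources GROUP BY type ORDER BY type"
        ).fetchall()
    finally:
        conn.close()


for resource_type, total in count_by_type(ROWS):
    print(resource_type, total)
```

The same query against flat state files would mean downloading and parsing every file first, which is the reporting gap being pointed at here.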

sgarland | 26 minutes ago

It's an interesting proposal because they correctly call out that segmenting state files by workspace/environment in a very judicious way causes its own issues as you approach scale or have to work across environments. There is an entire industry of tools and services that help to streamline this process for you, but it still feels very hacky.

I'm curious if this will be compatible with tools like Spacelift or Env Zero, or if they are going to build their own runner/agent to compete in that space.

sylens | 2 hours ago

Hey! One of the Stategraph developers here, happy to answer any questions. The major motivation is how small the scale is at which Terraform/Tofu starts to break down and creates work for users, who then have to refactor around performance issues that shouldn't exist. So we want a drop-in solution that just dissolves those issues without the user having to do anything.

sausagefeet | 3 hours ago

Not an expert, but don't microservices help with this? Each microservice has its own YAMLesque resource descriptor (TF, CloudFormation, whatever) and is managed independently. My team can add an SQS queue or S3 bucket without locking your team.

I might be wrong regarding more sophisticated infra though.

giveita | 3 hours ago

How does this compare with Pulumi? AFAIK they also don't have a state file and rely on an external database to store state. Is your locking granularity better?

angio | 2 hours ago

This is awesome. Having a single state for all resources in an environment is critical for keeping all the moving pieces in check and a core design aspect of Kubestack. But the growing state files quickly become a bottleneck. I'm definitely giving this a good test drive. Very excited.

pst | 3 hours ago

Are there any statistics/analyses for the popularity of these different configuration management languages/frameworks (Terraform, Pulumi, etc.) in cloud settings? Trying to figure out which one(s) are worth learning.

anonymousDan | 2 hours ago

If you use a tool like Atmos (https://atmos.tools/) you kind of fix this issue already for free - because it takes the place of the root module, it actually manages the state of each sub module separately (they each have their own individual state file rather than being converged into one).

dwroberts | 2 hours ago

So kind of like Crossplane, where each resource is managed individually?

arccy | 3 hours ago

Can it be a SQLite DB in S3, with locking implemented with S3?

tuananh | 3 hours ago