Show HN: Zig Topological Sort Library for Parallel Processing

ww520 | 109 points

I've been enjoying zig myself quite a bit, I'm fairly confident I could do some larger projects in it (excluding comptime since it's missing features I sorely need for some of my current projects.) I like it a bit more than C/C++ in a lot of cases, I need it to be pushed just a tiny bit further before I can really dedicate effort towards large projects in it. I was even curious if I could implement the features I need myself (there is even a proposal already), but got the answer "Just don't." (I don't blame andrew, he already has a lot on his plate and I'm a stranger to him.) So I'm at the point of either waiting for the feature or forking it, haven't decided what I'm going to do though.

More on topic for the project though, did you have any project ideas for this? I think I could use this for my opencv node editor, I just did the naive method of marking all outputs dirty as their inputs nodes got reprocessed. I assume this would fix the potential problem of recomputing a node twice? I also see you mention "Cycle detection and cycle reporting." but what specifically happens if I do have a cycle? Does it just give up or is it something like best effort?

Cieric | 2 days ago

For something like task scheduling in a build system, if you use an up-front partition of the tasks like this, you may have a task that has all its dependencies fulfilled because tasks may have different durations, but still remains unscheduled.

For example, with this input:

    $ cat data.file
    root: parentA parentB
    parentA: C D
    parentB: E F
Asking the package tool for the sorted sets gives me:

    $ ./zig-out/bin/toposort-cli --data data.file
    Processing succeeded.
      Topologically sorted sets: [ { C D E F }  { parentA parentB }  { root }  ]
      Topologically sorted list: [ C D E F parentA parentB root ]
      Nodes: [ root parentA parentB C D E F ]
      Dependency tree:
      [ C D E F ]
        C -> [ parentA ]
          parentA -> [ root ]
            root ->
        D -> [ parentA ]
          parentA -> [ root ]
            root ->
        E -> [ parentB ]
          parentB -> [ root ]
            root ->
        F -> [ parentB ]
          parentB -> [ root ]
            root ->
        
If we follow the README's advice to parallel process set-by-set, we can easily have starvation: once `C` and `D` are finished, we are ready to start `parentA`, even if E and F are still work-in-progress.

How would I use the API of this package to detect and start the task `parentA` as soon as `C` and `D` are finished? I guess when a task is finished, I can ask for the list of dependent tasks, and for each of them, check if all of their dependencies are finished, and if so start the task. But this real-world need feels un-met; it feels odd to me to focus on the sorted sets instead of more practical scheduling.

That is kind of doing Kahn's algorithm iteratively during the build. It would be cool to try and optimize that for maximum performance.

jitl | a day ago

For those that have not implemented toposort or don't remember it: 1) only directed graphs without cycles can be topologically sorted (DAGs) 2) there can be more than one topological order and 2) reverse post order of depth first traversal from every unvisited node shields a topological order.

In JavaScript:

    function toposort(graph = { a: ['b', 'c'], b: ['d'] }) {
      const order = [];
      const visited = new Map();
      
      function dfs(node) {
        const status = visited.get(node);

        if (status === 1) throw new Error("Cycle found.");
        if (status === 2) return;

        visited.set(node, 1); // In progress.
        const adjacent = graph[node] ?? [];
        for (const neighbor of adjacent) dfs(neighbor);
        visited.set(node, 2); // Done.

        order.unshift(node); // Reverse post order.
      }

      for (const node in graph) {
        if (!visited.has(node)) dfs(node);
      }

      return order;
    }
emmanueloga_ | a day ago

Hm this reminds me that the Python stdlib has grown a module for topological sorting fo a graph:

https://docs.python.org/3/library/graphlib.html

I haven't used it yet, I'd be curious if anyone has

chubot | a day ago

I really like this, this is the perfect size project for exploring a new piece of tech. I especially like that you implemented an actual cli and not just tests.

tediousgraffit1 | a day ago

> Generating dependence-free subsets for parallel processing.

Unsure how this is defined (*) but graph cutting approaches to concurrent task scheduling is both pessimistic (poor utilization of available resources) and (iirc) NP-hard, so you pay an big cost upfront.

On the other hand, if you know the indegree/outdegree of each node at the time they are visited (meaning the graph is static) you can run Kahn's algorithm concurrently and put each node into a shared work queue. You can optimize that further by having per-thread work queues and stealing between them. Depending on what the nodes represent there are even more optimizations and heuristics, concurrent task scheduling is a hard problem.

* imagine the graph

(a, b) (a, c) (b, d) (c, d)

Is it possible to get nodes b and c in parallel "subsets" in your library?

duped | a day ago

Congrats! I like Zig a lot even though I never implemented a full project with it.

FWIW we had to build a related lib in TypeScript: https://github.com/okcontract/graph as part of the runtime of https://github.com/okcontract/cells

hbbio | a day ago

Curious: have you benchmarked it against any similar tools in other languages (like Go or Rust) for comparison?

sharmasachin98 | a day ago

Good job! This is really cool!

jedisct1 | a day ago