SQL, Homomorphisms and Constraint Satisfaction Problems

xlinux | 125 points

The post mentions the idea that querying a database D can be understood algebraically as enumerating all morphisms Q -> D, where Q is the "classifying" database of the query, i.e. a minimal database instance that admits a single "generic" result of the query. You can use this to give a neat formulation of Datalog evaluation. A Datalog rule then corresponds a morphism P -> H, where P is the classifying database instance of the rule body and H is the classifying database instance for matches of both body and head. For example, for the the transitivity rule

  edge(x, z) :- edge(x, y), edge(y, z).
you'd take for P the database instance containing two rows (a_1, a_2) and (a_2, a_3), and the database instance H contains additionally (a_1, a_3). Now saying that a Database D satisfies this rule means that every morphism P -> D (i.e., every match of the premise of the rule) can be completed to a commuting diagram

  P --> D
  |    ^
  |   /
  ⌄  /
  Q 
where the additional map is the arrow Q -> D, which corresponds to a match of both body and head.

This kind of phenomenon is known in category theory as a "lifting property", and there's rich theory around it. For example, you can show in great generality that there's always a "free" way to add data to a database D so that it satisfies the lifting property (the orthogonal reflection construction/the small object argument). Those are the theoretical underpinnings of the Datalog engine I'm sometimes working on [1], and there they allow you to prove that Datalog evaluation is also well-defined if you allow adjoining new elements during evaluation in a controlled way. I believe the author of this post is involved in the egglog project [2], which might have similar features as well.

[1] https://github.com/eqlog/eqlog [2] https://github.com/egraphs-good/egglog

mbid | 11 hours ago

For anyone curious: the performance difference between Clang and GCC on the example C solution for verbal arithmetic comes down to Clang's auto-vectorisation (deducing SIMD) whilst GCC here sticks with scalar, which is why the counter brings Clang closer in line to GCC (https://godbolt.org/z/xfdxGvMYP), and it's actually a pretty nice example of auto-vectorisation (and its limitations) in action, which is a fun tangent from this article (given its relevance to high-performance SMT/SAT solving for CSP)

babel_ | 12 hours ago

The topic of huge queries on tiny databases makes me think of this recent discussion on the SQLite forum: https://sqlite.org/forum/forumpost/0d18320369

Someone had an issue because SQLite failed to optimize the following query

    select * from t where x = 'x' or '' = 'x'
Someone said that SQLite could not optimize out the "or '' = 'x'" because it would be too expensive to compute. Which is obviously true only for huge queries on tiny datasets.
lovasoa | 10 hours ago