hackerzr | 39 points

Using OpenSearch/Elasticsearch "exists" is on the face of it a sensible choice. But it has a couple of big pitfalls.

One potential pitfall here is if your schema supports nested objects. For example:

    {
      "user": {
        "name": "bob",
        "age": 20
      }
    }
A condition such as this:

    user != null 
would get translated to:

    {"exists": "user"}
But this is a "macro" that OS/ES internally expands at query time:

  {"boolean": {"should": [
    {"exists": "user.name"},
    {"exists": "user.age"}
  ]}
As you can imagine, this can cause an explosion of field names if you have many nested fields. OS/ES has a global limit on the number of boolean clauses a single query can have, and it can cause a query to suddenly fail. Such queries are also very much not performant.

Another pitfall is that empty values do not get indexed. This includes empty objects and empty arrays. Empty values aren't null, yet they will match here because "exists" evaluates to false for them. So "exists" don't work for this edge case.

You have at least two decent choices here. One is to index a special adjacent field of type "boolean" for every field. For any data you index, you also index this metadata field:

    {
      "user": {},
      "user$notNull": true
    }
However, you end up with many fields, and OS/ES performance degrades quite a bit the more fields you have.

Another option is to have a top-level field listing the names of all fields that are empty. So something like:

    {
      "user": {},
      "$notNullFields": ["user"]
    }
Now you can use a boolean match on this special field rather than "exists", and it will be performant.

Of course, the list has to include all parent key paths as well. So for a non-empty object:

    {
      "user": {
        "name": "bob", 
        "age": 20
      }, 
      "$notNullFields": [
        "user", "user.name", "user.age"
      ]
    }
But you have the option of only including empty objects/arrays in this list, and using "exists" for all value types, assuming you know which fields can be which.
atombender | 4 days ago

An engineer at Coralogix, a full-stack observability platform, recently shared an intriguing solution to translating SQL expressions with null semantics into OpenSearch DSL. This challenge arose while building the DataPrime query language and engine, which needed to maintain backwards compatibility with OpenSearch.

Key points:

The engineer confronted the disparity between SQL's three-valued logic (TRUE, FALSE, NULL) and OpenSearch DSL's binary filter system.

They devised a method to reduce three-valued logic to two-valued logic by considering the context of expressions, such as WHERE clauses.

The solution introduces is_false_or_null() and is_true_or_null() functions to bridge SQL and OpenSearch DSL. Boolean operators are handled by analyzing truth tables and deriving corresponding OpenSearch DSL translations. This approach enables the translation of complex SQL expressions to OpenSearch DSL while maintaining correct null semantics. The team implemented additional optimizations on the intermediate representation to enhance query efficiency.

This solution allows Coralogix users to seamlessly query both Parquet files and OpenSearch using a unified query language. The article offers valuable insights for developers working on query engines or database compatibility layers, demonstrating a creative approach to a common challenge in data querying and observability platforms.

hackerzr | 4 days ago

Someone meant to write single backticks in their markdown, but somehow got triple backticks. The article is quite difficult to read with so many highlighted words getting their own line.

wodenokoto | 3 days ago

I wrote a LINQ provider (.Net) for Elasticsearch. Was quite the challenge dealing with the sorts of quirks and subtle differences between the two. Case sensitivity and the tokenization are also hurdles to overcome but mapping Group by to Elastics aggregates was facets back then) was the toughest part.

Hats off to Coralogix for taking up the challenge in converting SQL.

damieng | 3 days ago

Please update your blog CSS to use a monospace font/typeface for the code blocks.

whalesalad | 3 days ago