Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SQL-style null comparison is not parsed as a binary expression in a search #5545

Open
philrz opened this issue Dec 21, 2024 · 0 comments
Open

Comments

@philrz
Copy link
Contributor

philrz commented Dec 21, 2024

tl;dr

After the changes in #5490, comparisons formerly expressed in the Zed language like san.ip==null are now expressed SQL style like san is null. However, I noticed that if it appears in a search rather than a where, the comparison using the SQL approach isn't treated as a binary expression, and this had some unexpected side effects for me.

Details

Repro is with super commit 95e746d, which is associated with the changes in #5490.

This repro is a simplification of an Autoperf query. As the original query was more complex and the data queried consisted of millions of large records across tens-of-GB of input data, as a user it was much more difficult to figure out what was going on than shown in the repro below.

Take this sample data.json:

{
    "_path": "x509",
    "id": "FnUlL92kZIhBUyfHZb",
    "f": "This field",
    "san": {
        "ip": null
    }
}
{
    "_path": "x509",
    "id": "FnUlL92kZIhBUyfHZb",
    "f": "This is also field",
    "san": {
        "ip": "foo"
    }
}

With the last GA Zed v1.18.0, isolating the record with the null value for san.ip looked like the following:

$ zq -version
Version: v1.18.0

$ zq '_path=="x509" san.ip==null' data.json 
{_path:"x509",id:"FnUlL92kZIhBUyfHZb",f:"This field",san:{ip:null}}

Though of course I could have also been explicit and done either of these to get the same result:

$ zq 'where _path=="x509" and san.ip==null' data.json 
{_path:"x509",id:"FnUlL92kZIhBUyfHZb",f:"This field",san:{ip:null}}

$ zq 'search _path=="x509" and san.ip==null' data.json 
{_path:"x509",id:"FnUlL92kZIhBUyfHZb",f:"This field",san:{ip:null}}

That is, like I expect is true for many of our users, I've traditionally been lazy about noticing when implied operators turn my "searches" into where or search, since I've always been able to start with a binary expression like the one shown above (which turns into a where), tack on something like a regexp (which will turn it into a search), and the parser makes the necessary adjustments for me so it works as expected.

$ zq '/also/ or (_path=="x509" and san.ip==null)' data.json 
{_path:"x509",id:"FnUlL92kZIhBUyfHZb",f:"This field",san:{ip:null}}
{_path:"x509",id:"FnUlL92kZIhBUyfHZb",f:"This is also field",san:{ip:"foo"}}

In the SuperDB era at commit 95e746d (which is associated with the changes in #5490), If we start from the same query, it no longer parses.

$ super -version
Version: v1.18.0-161-g95e746d3

$ super -c '_path=="x509" san.ip==null' data.json
parse error at line 1, column 15:
_path=="x509" san.ip==null
          === ^ ===

There's two things in particular that may need to be adjusted:

  1. The changes in Bring back the "|" symbol and eliminate search rakes #5436 would require the use of ? to express this as a search
  2. The changes in sam: Use SQL semantics for null comparison #5490 mean the null comparison is now expressed via is rather than ==

If we do just the first change, we get no output at all. FWIW, SQL still allows for the ==null comparison but it just doesn't behave the same as it did in Zed (blah!).

$ super -c '? _path=="x509" san.ip==null' data.json
[no output]

And if we do just the second change, it still doesn't parse.

$ super -c '_path=="x509" san.ip is null' data.json
parse error at line 1, column 15:
_path=="x509" san.ip is null
          === ^ ===

Therefore, acting as a fully educated user (i.e., let's assume I read some release notes for the Zed->SuperDB transition), I do both changes together. But the query result is not the same as in the Zed era.

$ super -c '? _path=="x509" san.ip is null' data.json
{_path:"x509",id:"FnUlL92kZIhBUyfHZb",f:"This field",san:{ip:null}}
{_path:"x509",id:"FnUlL92kZIhBUyfHZb",f:"This is also field",san:{ip:"foo"}}

What seems to be happening is that these are now being treated as separate search terms, e.g.,

  1. The san.ip matches against the presence of the field key san.ip.
  2. The is matches against strings like This and is in the contents of field f.
  3. There's a case-insensitive string match on the nUlL in the id field.
$ super compile -Z '? _path=="x509" san.ip is null'
[
    {
        kind: "Search",
        expr: {
            kind: "BinaryExpr",
            op: "and",
            lhs: {
                kind: "BinaryExpr",
                op: "and",
                lhs: {
                    kind: "BinaryExpr",
                    op: "and",
                    lhs: {
                        kind: "BinaryExpr",
                        op: "==",
                        lhs: {
                            kind: "ID",
                            name: "_path",
                            loc: {
                                first: 2,
                                last: 6
                            }
                        },
                        rhs: {
                            kind: "Primitive",
                            type: "string",
                            text: "x509",
                            text_pos: 0,
                            loc: {
                                first: 9,
                                last: 14
                            }
                        },
                        loc: {
                            first: 2,
                            last: 14
                        }
                    },
                    rhs: {
                        kind: "Term",
                        text: "san.ip",
                        value: {
                            kind: "Primitive",
                            type: "string",
                            text: "san.ip",
                            text_pos: 0,
                            loc: {
                                first: 16,
                                last: 21
                            }
                        },
                        loc: {
                            first: 16,
                            last: 21
                        }
                    },
                    loc: {
                        first: 2,
                        last: 29
                    }
                },
                rhs: {
                    kind: "Term",
                    text: "is",
                    value: {
                        kind: "Primitive",
                        type: "string",
                        text: "is",
                        text_pos: 0,
                        loc: {
                            first: 23,
                            last: 24
                        }
                    },
                    loc: {
                        first: 23,
                        last: 24
                    }
                },
                loc: {
                    first: 2,
                    last: 29
                }
            },
            rhs: {
                kind: "Term",
                text: "null",
                value: {
                    kind: "Primitive",
                    type: "null",
                    text: "",
                    text_pos: 0,
                    loc: {
                        first: 26,
                        last: 29
                    }
                },
                loc: {
                    first: 26,
                    last: 29
                }
            },
            loc: {
                first: 2,
                last: 29
            }
        },
        loc: {
            first: 0,
            last: 29
        }
    }
]

If I was less lazy and hip to the fact that this all could be expressed as a where, I'd get my previous result (though I also had to now add the explicit and, per docs).

$ super -c 'where _path=="x509" and san.ip is null' data.json 
{_path:"x509",id:"FnUlL92kZIhBUyfHZb",f:"This field",san:{ip:null}}

Overall, however, my guess is that we probably intended the is null (or is not null) to still be treated as a binary expression even in these searches and hence avoid the surprise here. I think this would allow us to give guidance to users to just add ? to the start of implied-searches-that-worked-in-Zed and still get the same expected results.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant