Evolvable serialized filters

Natural approaches to filter serialization

There we are. We have this endpoint.

POST api/v1/city_group/

{
    "name": "cities_to_visit"
}

For this endpoint to store cities we’d like to visit, we need to be able to filter down from the collection of all cities ever to a list of cities that we care about. Let’s add to the body.

{
    "name": "cities_to_visit",
    "city_names": [
        "Sydney",
        "Montréal",
        "Montpellier",
        "Dublin"
    ]
}

We recognize that this list of city_names is really just a filter. We’d like to be able to add more filters in the future.

{
    "name": "cities_to_visit",
    "filters": [
        {
            "city_names": [
                "Sydney",
                "Montréal",
                "Montpellier",
                "Dublin"
            ]
        },
        {
            "area": // A GeoJSON representing all of Andorra
        }
    ]
}

This is the normal, simple approach that everyone takes the first time they see this problem. It also suffers from three recurring pitfalls:

Filter composition is implicit. Should the service and or or the filters? The user doesn’t know. This design choice leads to confusion and surprise.
We can only express flat filters. I cannot express “Is an interesting city in Andorra or is one of these globally interesting cities”. I.e.,

(in_andorra AND in_first_city_name_list) OR in_second_city_name_list

Expressing simple filters is verbose. While we’d like to leave the door open to supporting multiple filters, right now, we only have a single filter use-case. To allow future growth we force our customers to pass in a singleton list. Less than ideal.

If we’re sure that those three points aren’t a problem – we have good reason to believe that flat filters will always be sufficient for our customers’ use-cases and we, say, slap on an argument to make composition explicit, then stop reading. Stick with the very simple approach, accept or mitigate its limitations, and go play outside¹.

Nested filters

If, however, usage dictates, we can consider an alternative form.

{
    "name": "cities_to_visit",
    "filter": {
        "and": [
            {
                "city_names": [
                    "Sydney",
                    "Montréal",
                    "Montpellier",
                    "Dublin"
                ]
            },
            {
                "area": // A GeoJSON representing all of Andorra
            }
        ]
    }
}

Our filter composition is explicit! We’re definitely anding. We can also express our nested filter:

{
    "name": "cities_to_visit",
    "filter": {
        "or": [
            {
                "and": [
                    {
                        "city_names": [
                            "Canillo",
                            "Encamp"
                        ]
                    },
                    {
                        "area": // A GeoJSON representing all of Andorra
                    }
                ]
            },
            {
                "city_names": [
                    "Sydney",
                    "Montréal",
                    "Montpellier",
                    "Dublin"
                ]
            }
        ]
    }
}

We can finally express our simple filter – the one we’re supporting initially at launch – without any verbosity whatsoever.

{
    "name": "cities_to_visit",
    "filter": {
        "city_names": [
            "Sydney",
            "Montréal",
            "Montpellier",
            "Dublin"
        ]
    }
}

In sum, this approach gives us:

Explicit filter composition. No guesswork about and vs. or.
Expressiveness. We can express compound filters easily.
Terse simple filters. Only providing a single filter? No singleton lists; just write your bare filter.

API evolution for nested filters

This approach leaves ample room for API evolution. Launch with just a single filter, implemented in the simplest possible way. Learn from that usage. Then support just a flat and. Maybe that’s all you need! If not, your setup allows you to progressively accommodate more involved use-cases and to further optimize your implementation along the way to support those progressions, all while maintaining backwards-compatibility.

That is: this approach allows you future-looking flexibility without premature optimization.

Formalizing nested filters

The recipe below uses the two example filter types from above. Replace them with your own.

Filter:
    { "city_names": string[] }
    | { "area": GeoJSON }
    | { "and": Filter[] }
    | { "or": Filter[] }
    | { "not": Filter }

Note: in parsing we need to ensure that inputs are well-formed and have appropriate arguments (e.g., no { "city_names": GeoJSON }). Frameworks won’t always give you this for free, and you might need to – e.g., in Pydantic – slap on a custom model validator that enforces a single present key that corresponds with an argument of the correct type.

Considerations

Tooling

See those pipes above? The |s ? Right, that’s a union type. Not all API frameworks like playing with union types. Python / Pydantic2 / FastAPI will, but depending on your language of choice and ecosystem, you might be more constrained.

In this case, we can work around:

FilterType: Enum(
    city_name_filter, 
    area_filter,
    and,
    or,
    not
)

Filter: { 
    "type": FilterType, 
    "city_names": Optional(string[]),
    "area": Optional(GeoJSON),
    "args": Optional(Filter[]),
    "arg": Optional(Filter)
}

where our API layer becomes responsible for validating proper correspondence between FilterTypes and applicable parameters.

Our examples become a bit more verbose.

{
    "name": "cities_to_visit",
    "filter": {
        "type": "or",
        "args": [
            {
                "type": "and",
                "args": [
                    {
                        "type": "city_name_filter",
                        "city_names": [
                            "Canillo",
                            "Encamp"
                        ]
                    },
                    {
                        "type": "area_filter",
                        "area": // A GeoJSON representing all of Andorra
                    }
                ]
            },
            {
                "type": "city_name_filter",
                "city_names": [
                    "Sydney",
                    "Montréal",
                    "Montpellier",
                    "Dublin"
                ]
            }
        ]
    }
}

Implementation

Let’s assume you’ve elected to support a non-trivial subset of the nested filter options. You have two options for implementing the filter: in application code or in database queries.

Filtering in application code means a small exercise in writing functional code. I’d find that fun and jump at the opportunity. Then again, my first language was scheme. Reasonable people can have different opinions here. Depending on your use-case and if there’s a database involved, this can involve an overhead of data transfer, some to most of which the filter then throws away.

Which leads us to the second option. Filtering in a database query can be more complex. First, your application code needs to figure out how to translate from serialized filter form into database query filters. This can have overlap with the exercise above. The hard part: ensuring that the generated query can be executed in a “reasonable” amount of time, producing bounded effort on your database. This problem is complicated further by ORMs; are you sure you understand how the ORM-generated query will execute and the amount of load it will put on your database?

There’s a lot of “it depends” to be found here, and a lot of room for “good enough” solutions that can be tailored to your particular setup. Plus, you’ve already started with a more restricted form of the API, and you’ve learned quite a bit about strengths / weaknesses of your existing implementation. You – at this point – have good ideas about how to make it more optimal.

Additional structural limits

As discussed before, consider limiting initially supported filter types. Also, consider setting additional structural limits. In particular, consider bounding depth and number of nodes – no customer is writing filters that are 10 layers deep, and those bounds can help make your life easier to reason about. You might also want to consider bounding number of ors or geospatial queries, which can put excessive load on your implementation².

Closing recommendations

Start with a restricted interface that solves well-understood customer need. Favor a dumb, simple implementation. Test it. Extensively. Does it meet your performance requirements? If so, take off early, and go play in the park. If not, instrument, figure out what’s causing the problem and figure out a “good enough” solution that will work here. Your implementation doesn’t need to be perfect. It needs to work.

There are plenty of well-used APIs that follow this pattern. Take eventbridge for example, in which it’s exceedingly clear that arguments are ored together by default. ↩︎
Very much an “it depends” here. If your implementation queries a database and performs the full filter in the database query, depending on the database type and its index structure, a lot of ors or geospatial queries can result in inefficient queries. ↩︎

Natural approaches to filter serialization#

Nested filters#

API evolution for nested filters#

Formalizing nested filters#

Considerations#

Tooling#

Implementation#

Additional structural limits#

Closing recommendations#