Functional Programming with SQL Query lambdas, functions and pipelines

Your input and feedback on this article are welcome on github.com/DavidPratten/sql-fp

This article aims to add lambdas, functions and pipelines to the functional programming core of standard SQL, and thereby:

  • separate transformation from data,
  • increase SQL use cases
  • increase SQL code re-use
  • reduce situations where programmers need to cross the "impedance barrier" from queries to use the procedural CREATE FUNCTION, and
  • increase processing speed through full utilisation of the query planner/ optimiser,

using an approach that may be implemented at scale with reasonable cost.

The outline of this article is as follows:

  • Introduce query lambdas and the GIVEN clause, including recursive query lambdas
  • Describe query functions and query pipelines
  • Conclude with a discussion of the benefits and costs of the proposed features.

Query Lambdas

Query lambdas are the foundation of this proposal, so we will consider them first. Query lambdas have a function signature of (table_or_sub-query, ...) -> table, that is:

  • accepting one, or more, tables or sub-queries as input
  • and returning a table.

All table references within the lambda are local to it and tables in the global namespace are invisible.

Realising Query Lambdas in SQL with GIVEN

SQL’s SELECT may be turned into a lambda by prefixing it with the proposed GIVEN clause and wrapping the whole expression in brackets. The local_tables identified in the GIVEN clause are the only ones able to be referenced in the context of the SELECT and it’s sub-queries.

Here is an example lambda that sums the supplied state level population data at the country level.

select *
FROM (GIVEN(mapping(summary_id, id), rawdata(id, amount)) 
      SELECT mapping.summary_id, sum(rawdatea.amount) amount
          FROM rawdata rawdata
              JOIN mapping mapping ON rawdata.id = mapping.id
          GROUP BY mapping.summary_id
     )((SELECT country_id summary_id, state_id id FROM states),
       (SELECT state_id id, population amount FROM population_by_states)
      );

which may be read as:

  • apply the input relation(s)
    • (select country_id summary_id, state_id id from states) which maps states to countries as the mapping table and
    • (select state_id id, population amount from population_by_states) as the rawdata table
  • and return a population summary by country.

The query inside the lambda may be reused in any context, by passing in different tables as parameters.

This particular query lambda can be transpiled to SQL-99 as:

select *
FROM (SELECT mapping.summary_id, sum(rawdatea.amount) amount
      FROM (select state_id id, population amount from population_by_states) rawdata
               JOIN (select country_id summary_id, state_id id from states) mapping ON rawdata.id = mapping.id
      GROUP BY mapping.summary_id
     );

Recursive Query Lambdas

SQL provides a recursive (iterative and fixed-point) relational transformation via the CTE. This makes SQL Turing Complete.

A GIVEN clause may be used in conjunction with the WITH clause and turns a CTE into a (potentially recursive) query lambda.

The following recursive query lambda will generate a table of the first 50 integers counting from 1.

SELECT * FROM
(GIVEN(count(n))
 WITH 
     enumerate
        AS (
            SELECT 1 AS n 
            
            UNION ALL
            
            SELECT n + 1 FROM enumerate WHERE n < (SELECT n from count LIMIT 1) 
            )
SELECT n from enumerate)((SELECT 50 n));

This query lambda transpiles to SQL-99 as follows:

SELECT * FROM
(WITH enumerate
        AS (
            SELECT 1 AS n 
            
            UNION ALL
            
            SELECT n + 1 FROM enumerate WHERE n < (SELECT n from (SELECT 50 n) LIMIT 1) 
            )
SELECT n from enumerate);

Let’s now look at how query lambdas are the building blocks for query functions and pipelines.

Query Functions

SQL provides the VIEW mechanism to name queries, and the GIVEN clause causes VIEWs to become parameterised. Our example query lambdas may be turned into query functions summarise() and generate_series() as follows:

CREATE VIEW summarise(summary_id, amount) AS
    GIVEN (mapping(summary_id, id), rawdata(id, amount))
    SELECT mapping.summary_id, sum(rawdatea.amount) amount
    FROM rawdata rawdata
             JOIN mapping mapping ON rawdata.id = mapping.id
    GROUP BY mapping.summary_id;
)

and

CREATE VIEW generate_series(n) AS
    GIVEN(count(n))
    WITH enumerate
    AS (
      SELECT 1 AS n
      UNION ALL
      SELECT n + 1 FROM enumerate WHERE n < (SELECT n from count LIMIT 1)
    )
    SELECT n from enumerate;

The query function signature, or contract, is immediately visible in the first two lines of each!

The examples above include the declaration of the return table’s columns in brackets after the query function name. This is a feature of CTE’s backported to CREATE VIEW. In SQL-99 VIEWs must infer the returned columns from the provided SELECT.

Query Function Invocation

Just like regular VIEWs (and CTEs), query functions stand in a SELECT query in the place of a table. Query functions are invoked with the view_name followed by bracketed table, or subquery, parameters. Here is how we can invoke the query function summarise defined above.

SELECT *
from summarise((SELECT country_id summary_id, state_id id FROM states),
               (SELECT state_id id, population amount FROM population_by_states)
              );

And we can generate a table of 5 numbers starting with 1 with:

SELECT *
from generate_series((select 5 n));

Query Pipelines

Some data transformations may be usefully modelled as a staged pipeline of changes with each stage consuming the results of an earlier stage. Here is a simple pipeline diagram that illustrates our summarisation example modelled as a two stage transformation:

Building on the definitions of query functions and lambdas, a PIPES TO compound operator is proposed to allow functional programmers to express computations using a pipeline model.

Here is the PIPES TO operator used to express this two stage pipeline:

SELECT *
from ( (SELECT country_id summary_id, state_id id FROM states)
      PIPES TO
      summarise((SELECT state_id id, population amount FROM population_by_states) );

Notice that this invocation of summarise() has just one parameter. In this case the first parameter is provided by the PIPES TO compound operator.

Transpilation to SQL-99 is straight forward. The PIPES TO compound operator may be reduced to application of query functions, by moving the left-handed expression to be the first parameter of the right-hand expression, and thence to SQL-99 as shown above. Here is the PIPES TO reduced to query functions.

select *
from (SELECT *
      from summarise((SELECT country_id summary_id, state_id id FROM states),
                     (SELECT state_id id, population amount FROM population_by_states)
          )
     );

Discussion of costs and benefits

Adding a new feature to the SQL standard needs to have compelling benefits and manageable implementation costs.

Implementation costs

My high-level estimate is that the implementation costs are of O(costs of adding CTEs as part of SQL99). At this early stage in the lifecycle of query lambdas, functions, and pipelines there is much uncertainty, however here are a few factors that suggest that the implementation costs are manageable:

  • "Not the first rodeo" – the GIVEN clause operates similarly to the WITH clause in that it shapes the visible tables in the following SELECT
  • Equivalent to SQL-99 as demonstrated above, query lambdas, functions, and pipelines (just like VIEWs) are close to zero-cost abstractions. They may all be trivially reduced back to SQL-99 and made available to the query planner / optimiser to execute.
  • Clean introduction of new capabilities no changes to any legacy code is required,
  • Ergonomics overloading VIEWs with parameters is a natural extension and leverages user’s experience with built-in and CREATE FUNCTION functions.

Benefits

The benefits claimed for query lambdas, functions and pipelines are not novel. They are all consequences of bringing well understood functional programming patterns to the world of SQL queries. Here is a short sketch justifying the claimed benefits:

Separate transformation from data,

From the point of view of the development team, maintaining multiple hand-coded transformation queries is hard to justify. In SQL-99 the available mechanisms for abstracting transformations and reusing them (VIEWs and CTEs) work to the extent that at every invocation they are referring to exactly the same tables and columns. In SQL-99 every query is, at it’s base, hard coded to a global namespace of tables and columns. This means that there is no mechanism to cleanly separate the query transformation from the actual data that is being processed. Query lambdas, functions and pipelines provide precisely this capability for SQL.

Increase SQL use cases

By separating the data from transformations, SQL queries becomes more attractive to organisations to encode re-usable business rules which need to be applied to multiple data sets in multiple contexts.

Increase SQL code re-use

From the point of view of the broader SQL community, hand coding every solution means that there are no libraries of SQL transformations that can become a shared resource and built on. Query functions enable SQL query transformations to be defined once and used in a myriad of contexts. A good use case here would be providing a library of hierarchy handling query functions (ie recursive CTEs) which embody the graph theoretic algorithms which can be applied to any data.

Reduce situations where programmers need to cross the "impedance barrier" from pure functional code to use the procedural CREATE FUNCTION

For the developer, the necessity in SQL-99 to hand-code every transformation for each use-case motivates programmers to cross the impedance barrier to alternative technologies to reduce or eliminate code duplication. In addition, if a programmer wants to build or use a pre-packed solution then a procedural CREATE FUNCTION is the only currently available option.

Query lambdas, functions and pipelines lessen, or remove, these drivers for programmers.

Increase processing speed through full utilisation of the query planner/ optimiser,

The relational query planner/optimiser plays a crucial role in the functional nature of SQL. It is this part of the execution environment that enables:

  • queries to be reordered and elided,
  • materialised, indexed,
  • lazily evaluated and/or
  • parallelised.

In many cases the procedural code is not able to be optimised by the planner, leading to relatively poor performance.

The "Impedance barrier" between SQL queries and procedural code is not just a cognitive load for programmers, it is also a performance blockage on real world workloads. There is current research ongoing on how to convert procedural code back into CTEs so as to boost application performance.

Query lambdas, functions and pipelines enable programmers to stay within the purely function world of SQL queries which can be executed in an optimal way by the planner / optimiser.

Last word

In summary the proposed GIVEN clause develops the pure functional core of the SQL query, improves programmer ergonomics, and broadens the use-cases for SQL and promises improved application performance.

Composable Recursive Relational Queries

Your input and feedback on this article are welcome on github.com/DavidPratten/sql-fp

Hierarchical and graph data may be transformed with relational queries that are recursive, iterative, and fixed-point. These are available in SQL through the recursive Common Table Expression (CTE):

WITH RECURSIVE ...

This article argues that SQL’s CTE machinery is a distraction from the core idea of a recursive query and a hinderance to composition of queries. We propose a simpler SQL form which is equally declarative, expressive and composes nicely:

SELECT ... UNION SELECT ... FROM UNION

In addition, we explore how this form may be translated into Morel, a descendent of ML with relational queries.

Let’s begin by introducing a motivating example

Transitive Closure of a Tree

If we have a tree

Example Tree

This can be represented in a relation as:

idparent_id
11
21
31
42
52
65
75

The transitive closure of the tree is:

Transitive Closure of the Tree with depth shown on the edges

And this may be represented by the following relation.

idparent_idlevel
110
220
330
440
550
660
770
211
311
421
521
651
751
412
512
622
722
613
713

Recursive CTE query

Using the Sqlite CTE form here is the query that computes the Transitive Closure given the Tree.

/* WITH RECURSIVE transitive closure of a tree (SQLITE)
   n.b. Tree's root node convention is id = parent_id
*/
with items (id, parent_id) as (
    select 1 id, 1 parent_id
    union
    select 2, 1
    union
    select 3, 1
    union
    select 4, 2
    union
    select 5, 2
    union
    select 6, 5
    union
    select 7, 5
),
     transitive_closure (id, root_id, level) as (
         select id, id, 0
            from items
         union
         select items1.id, tcitems.root_id, tcitems.level + 1
            from transitive_closure tcitems
            inner join items items1 on
                items1.parent_id = tcitems.id and
                items1.id <> items1.parent_id
     )
select * from transitive_closure;

Focusing in on the CTE let’s analyse its parts

  • A language closure (WITH). (Note this is a different concept than transitive closure.)
  • A name
  • Core Queries and Union
  • Final query.

As it turns out 3/4 of this structure may be dropped without loss.

SQL – SELECT … UNION SELECT … FROM UNION

We propose simplifying SQL by dropping the name, the closure and the final query and simplifying the composable recursive relational query down to:

select id, id root_id, 0 level 
    from items
union
select items1.id, tcitems.root_id, tcitems.level + 1
    from union tcitems
        inner join items items1 on items1.parent_id = tcitems.id 
        and items1.id <> items1.parent_id

This form:

SELECT ... UNION SELECT ... FROM UNION

SELECTs the fixed-point result of the UNION without the overhead of a name (repeated three times), the closure and the final query.

This demonstrates that CTE machinery is not required to express recursive fixed point relational queries.

The simplified recursive relational query is orthogonal to all constructs in SQL and is composable:

  • if it is desired to hide some of the columns included in the recursive query, then the simplified recursive query can be wrapped in an outer SELECT, and
  • if the simplified recursive query needs to be consumed by more than one query then it can be
    • wrapped in a CTE, or
    • wrapped in a VIEW.

Multi-UNION Queries

There do need to be disambiguation rules if there are multiple UNIONs in scope. However, this is true whatever the linguistic form. See for example the multi-UNION rules for Sqlite. All non-recursive UNION sub-queries are constrained to be first, and all recursive sub-queries follow. And each recursive sub-query is presented with all generated rows from every other subquery in the UNION.

Composability

To be fair, the CTE variant of recursive queries is composable and SELECT FROM (WITH RECURSIVE …. SELECT …) works (at least in Sqlite). However, the ergonomics are poor, requiring the developer to build a mental model that includes a linguistic closure, a temporary local name, and a SELECT to pop back out of the closure. Here is the equivalent composable CTE-based query (Sqlite) with the required extra machinery in bold.

select * from (
with _unionfixedpoint (id, root_id, level) as (
select id, id root_id, 0 level from items union select items1.id, tcitems.root_id, tcitems.level + 1 from _unionfixedpoint tcitems inner join items items1 on items1.parent_id = tcitems.id and items1.id <> items1.parent_id)
select * from _unionfixedpoint
);

Application for other languages

The simplification of the recursive relational query to its essence is designed to improve developer experience and also points to a pattern for incorporating recursive queries into another language Morel.

Morel – from … union … from … in query

Having separated the idea of a recursive relational query from the CTE machinery, it becomes straight forward to propose the analog for a future release of Morel:

(from item in items 
    yield {id = item.id, root_id=item.id, level=0})
union
(from tcitem in query,
    item1 in items
    where item1.parent_id = tcitem.id and item1.id != item1.parent_id
    yield {id = item1.id, root_id=tcitem.root_id, level=tcitem.level+1});

The form

in query

is proposed for Morel, because ‘in union’ is too specific. Morel’s roadmap includes consideration for other UNION-like operators including unnamed lambdas.

Aligning Real Estate Agent’s interests with those of the Vendor

The interests of property sellers (vendors) and the interests of their real estate agent can be better aligned. Firstly, this article shows how the "x% on sale price" commission model fails to reward agents for the effort required to find the … [Continue reading]

Table Valued Constrained Functions (TVCF) – Constraint Programming and SQL

I propose seamless integration of solvers into SQL via Table Valued Constrained Functions (TVCF). By "seamless", I mean that the SQL programmer would have to ask the query planner if a constraint solver was invoked to complete the query. A … [Continue reading]

I just found three bugs in SQLite!

SQLite is the most widely used and arguably the most rigorously tested database in the world. Today, the developers confirmed that valid queries that I created for the Rosella Model project exposed not one, not two, but three bugs hiding in the … [Continue reading]

Introducing tree “node elevation” to replace both “node depth and “node height”.

A tree's nodes may be assigned a depth and a height. If you would like to brush up on these concepts, then baeldung.com has a great explainer article Difference Between Tree Depth and Height. The difference may be summarised as MeasureMeasures … [Continue reading]

Finalist!

Celebrating Australian Underwater Running being chosen as a finalist in ABC RN Sporty program's "Invent-a-sport" competition! Here is the award broadcast starting around 6:20  the conversation is between host, Amanda Smith (AS), and judges Liz … [Continue reading]

Introducing Australian Underwater Running (AUR)

This is David Pratten's entry into the ABC's April, 2020 Invent a brand new sport and win a prize competition. Lets get ready for a post-COVID-19 world! When its time to come out and play, try Australian Underwater Running (AUR)! Play this … [Continue reading]

Applied Category Theory

Recently I reached out to Simon Willerton about applying Category Theory to Project Management models, here is the response. Putting Project Management models on a sound footing. Hi Simon, Thanks for your article on the Project Management Schedule … [Continue reading]

A relational language – Parasat

I am impressed by the logic language Picat and a recent presentation by Peter Alvaro "Three Things I Wish I Knew When I Started Designing Languages". Update 2: William Byrd (@webyrd) the author of miniKanren recommended that I use the … [Continue reading]

Using a Gantt Chart to Show Schedule Uncertainty

Temporary organisations (or projects) frequently work with uncertain timelines. Briefing senior management on the extent of uncertainty and progress towards reducing uncertainty is a crucial Project Management challenge. This article explores a … [Continue reading]

Formal and informal Latin script for Kazakh language

If you missed the news, recently President Nursultan Nazarbayev announced that Kazakhstan will fully transition from Cyrillic to Latin letters by 2025. I'm writing this post in response. As my good friend Maktaxali taught me, we often find that "The … [Continue reading]

Underwater Running Survey

The Australian style of underwater running is unique in that it works irrespective of the runner's buoyancy and depth of water. This article surveys several different ways of running underwater. There is a beauty to this slow motion running in a … [Continue reading]

Underwater Running Demos

Underwater running on sand in deep water provides high resistance aerobic exercise. (No weights or other equipment required.) Here are a couple of videos showing the … [Continue reading]

Management – Reading

The effect of power on the manager, a brilliant paper.  Very sobering. "Power and Perspectives Not Taken" by Galinsky et. al. … [Continue reading]