An Introduction to Window Functions

The aggregate functions perform calculations across rows and return a single output row.

The following query uses the SUM() aggregate function to calculate the total salary of all employees in the company:

SELECT SUM(salary) sum_salary FROM employees

Here is the output:

sum_salary
322400

As we can see above, all rows from the Employee table are grouped into a single row. Like an aggregate function, a window function calculates on a set of rows. However, a window function does not cause rows to become grouped into a single output row.

The following query uses the SUM() as a window function. It returns the sum salary of all employees along with the salary of each individual employee:

SELECT
    first_name,
    last_name,
    salary,
    SUM(salary) OVER() sum_salary
FROM
    employees

Here is the output:

first_namelast_namesalarysum_salary
StevenKing24000322400
NeenaKochhar17000322400
LexDe Haan17000322400
AlexanderHunold9000322400
BruceErnst6000322400
DavidAustin4800322400
ValliPataballa4800322400
DianaLorentz4200322400
NancyGreenberg12000322400
DanielFaviet9000322400

In this example, the OVER() clause signals that the SUM() function is used as a window function.

The following picture illustrates the main difference between aggregate functions and window functions:

SQL window function syntax

The syntax of the window functions is as follows:

window_function_name ( expression ) OVER (
    partition_clause
    order_clause
    frame_clause
)

window_function_name: The name of the supported window function such as ROW_NUMBER()RANK(), and SUM().

Expression: The target expression or column on which the window function operates.

OVER clause: The OVER clause defines window partitions to form groups of rows and specifies the orders of rows in a partition. The OVER clause consists of three clauses: the partition, order, and frame clauses.

Partition By clause: The partition clause divides the rows into partitions to which the window function applies. It has the following syntax:

PARTITION BY expr1, expr2, ...

If the PARTITION BY clause is not specified, then the whole result set is treated as a single partition.

Order By clause: The order clause specifies the orders of rows in a partition on which the window function operates:

ORDER BY
    expression [ASC | DESC]  [NULL {FIRST| LAST}]
,...

A frame is the subset of the current partition. To define the frame, you use one of the following syntaxes:

{ RANGE | ROWS } frame_start
{ RANGE | ROWS } BETWEEN frame_start AND frame_end  

where frame_start is one of the following options:

N PRECEDING
UNBOUNDED PRECEDING
CURRENT ROW

and frame_end is one of the following options:

CURRENT ROW
UNBOUNDED FOLLOWING
N FOLLOWING

The following picture illustrates a frame and its options:

  • UNBOUNDED PRECEDING: the frame starts at the first row of the partition.
  • N PRECEDING: the frame starts at Nth rows before the current row.
  • CURRENT ROW: means the current row that is being evaluated.
  • UNBOUNDED FOLLOWING: the frame ends at the final row in the partition.
  • N FOLLOWING: the frame ends at the Nh row after the current row.

The ROWS or RANGE specifies the type of relationship between the current row and frame rows.

  •  ROWS: the offsets of the current row and frame rows are row numbers.
  •  RANGE: the offset of the current row and frame rows are row values.

Window Functions in Druid

In Apache Druid, window functions are used in SQL queries to perform calculations over a specified range of rows related to the current row. Window functions allow you to work with a “window” of rows within the result set and perform computations on that window. Druid supports various window functions to facilitate advanced analytics and data manipulation. 

Below are some commonly used window functions in Apache Druid:

  1. LAG and LEAD:
  • LAG(column, offset, default) returns the value of the column at the given offset before the current row.
  • LEAD(column, offset, default) returns the value of the column at the given offset after the current row.

Example:

SELECT __time, value, LAG(value, 1, 0) OVER (ORDER BY __time) AS lagged_value
FROM your_datasource
  1. FIRST_VALUE and LAST_VALUE:
  • FIRST_VALUE(column) returns the value of the specified column for the first row in the window.
  • LAST_VALUE(column) returns the value of the specified column for the last row in the window.

Example:

SELECT __time, value, FIRST_VALUE(value) OVER (ORDER BY __time) AS first_value
FROM your_datasource
  1. SUM, AVG, MIN, MAX:
  • These aggregate functions can be used as window functions when combined with the OVER clause, allowing you to perform cumulative or sliding window calculations.

Example:

SELECT __time, value, SUM(value) OVER (ORDER BY __time) AS cumulative_sum
FROM your_datasource
  1. RANK and DENSE_RANK:
  • RANK() returns the rank of the current row within the result set.
  • DENSE_RANK() returns the rank of the current row without gaps in the ranking.

Example:

SELECT __time, value, RANK() OVER (ORDER BY value DESC) AS ranking
FROM your_datasource
  1. ROW_NUMBER:
  • ROW_NUMBER() returns the current row number within the result set.

Example:

SELECT __time, value, ROW_NUMBER() OVER (ORDER BY __time) AS row_num
FROM your_datasource

These are just a few examples of window functions in Apache Druid. The OVER clause is crucial when using window functions, as it defines the window specification (e.g., ordering and partitioning) for the function. Druid supports a variety of window specifications, providing flexibility in how you analyze and manipulate your data.

A Sample Use Case In Druid

Let’s say we have a sample dataset that contains 2 columns, as shown below:

  • DateTime
  • Quantity

We have to write a query that calculates a rolling seven-day sum of Quantity from the First DateTime value. Following would be the Query in Druid using Windows Function that will give the desired result:

WITH cte AS
(
    SELECT   
        floor(__time to day) AS "Event_DateTime",
        row_number() OVER (ORDER BY cast(floor (__time TO day) AS bigint)) AS row_no,
        sum(sum(quantity)) OVER (ORDER BY cast(floor (__time TO day) AS bigint) range BETWEEN 0 PRECEDING AND 6 following) AS "7_Day_Rolling"
    FROM "004Q2M61FD4QEO8"
    GROUP BY 
        __time 
)
SELECT
    CASE WHEN row_no=1 THEN Event_DateTime ELSE (CASE WHEN mod(row_no,7)=1 THEN Event_DateTime END) END ,
    "7_Day_Rolling"
FROM  cte
WHERE 
    (CASE WHEN row_no=1 THEN Event_DateTime ELSE ( CASE WHEN mod(row_no,7)=1 THEN Event_DateTime END) END) IS NOT NULL

The following would be the result of this query:

Newsletter Signup

Let us help with your analytics apps

Request a Demo