Monday, May 30, 2022
HomeProgrammingVaried Meanings of SQL’s PARTITION BY Syntax – Java, SQL and jOOQ.

Varied Meanings of SQL’s PARTITION BY Syntax – Java, SQL and jOOQ.


For SQL newbies, there’s a little bit of an esoteric syntax named PARTITION BY, which seems all over in SQL. It at all times has the same that means, although in fairly totally different contexts. The that means is much like that of GROUP BY, particularly to group/partition information units by some grouping/partitioning standards.

For instance, when querying the Sakila database:

SELECT actor_id, film_id
FROM film_actor

One thing just like the beneath may seem:

|actor_id|film_id|
|--------|-------|
|1       |1      |
|2       |3      |
|10      |1      |
|20      |1      |
|1       |23     |
|1       |25     |
|30      |1      |
|19      |2      |
|40      |1      |
|3       |17     |
|53      |1      |
|19      |3      |
|2       |31     |

And we might partition the information like this for the ACTOR_ID = 1 partition:

                      |actor_id|film_id|
                      |--------|-------|
                 +--> |1       |1      |
All ACTOR_ID = 1 |    |2       |3      |
                 |    |10      |1      |
                 |    |20      |1      |
                 +--> |1       |23     |
                 +--> |1       |25     |
                      |30      |1      |
                      |19      |2      |
                      |40      |1      |
                      |3       |17     |
                      |53      |1      |
                      |19      |3      |
                      |2       |31     |

For the ACTOR_ID = 2 partition:

                      |actor_id|film_id|
                      |--------|-------|
                      |1       |1      |
All ACTOR_ID = 2 +--> |2       |3      |
                 |    |10      |1      |
                 |    |20      |1      |
                 |    |1       |23     |
                 |    |1       |25     |
                 |    |30      |1      |
                 |    |19      |2      |
                 |    |40      |1      |
                 |    |3       |17     |
                 |    |53      |1      |
                 |    |19      |3      |
                 +--> |2       |31     |

How can we use these partitions in SQL, particularly? What do they imply? In brief:

A partition separates a knowledge set into subsets, which don’t overlap.

Window partitions

The very first thing we will do is the window PARTITION clause, which we use when calculating window features. For instance, we’d calculate:

SELECT 
  actor_id, 
  film_id,
  COUNT(*) OVER (PARTITION BY actor_id)
FROM film_actor

If we’re assuming that we’re seeing your complete information set (the precise desk has extra rows), then the next end result could be displayed:

|actor_id|film_id|depend|
|--------|-------|-----|
|1       |1      |3    |
|2       |3      |2    |
|10      |1      |1    |
|20      |1      |1    |
|1       |23     |3    |
|1       |25     |3    |
|30      |1      |1    |
|19      |2      |2    |
|40      |1      |1    |
|3       |17     |1    |
|53      |1      |1    |
|19      |3      |2    |
|2       |31     |2    |

In different phrases, we’re “counting rows over the partition“. It really works virtually like GROUP BY, the place we depend rows from the group, although the GROUP BY clause transforms the end result set and the projectable columns, making non-grouped columns unavailable:

SELECT actor_id, COUNT(*)
FROM film_actor
GROUP BY actor_id

Leading to:

|actor_id|depend|
|--------|-----|
|1       |3    |
|2       |2    |
|10      |1    |
|20      |1    |
|30      |1    |
|19      |2    |
|40      |1    |
|3       |1    |
|53      |1    |

If you’ll, the partition contents are actually collapsed such that every partition key / group key solely seems as soon as within the end result set. This distinction makes window features vastly extra highly effective than atypical combination features and grouping.

See additionally our article about GROUP BY for extra particulars.

MATCH_RECOGNIZE partitions

MATCH_RECOGNIZE is a part of the SQL customary, invented by Oracle, and the envy of all different RDBMS (although some have began adopting it). It combines the facility of normal expressions, sample matching, information technology and SQL. It is likely to be sentient, who is aware of.

For instance, let’s take a look at clients who make small funds inside a small period of time. Take a look at this magnificence. Simply look!

SELECT 
  customer_id,
  payment_date,
  payment_id, 
  quantity
FROM cost
MATCH_RECOGNIZE (

  -- Partition the information set by customer_id
  PARTITION BY customer_id

  -- Order every partition by payment_date
  ORDER BY payment_date

  -- Return all of the matched rows
  ALL ROWS PER MATCH

  -- Match rows with 3 occurrences of occasion "A" in a row
  PATTERN (A {3})

  -- Outline the occasion "A" as...
  DEFINE A AS 

      -- Being a cost whose quantity is lower than 1
      A.quantity < 1 

      -- And whose cost date is lower than 1 day after 
      -- the earlier cost
      AND A.payment_date - prev(A.payment_date) < 1
)
ORDER BY customer_id, payment_date

Whew! This makes use of so many fancy key phrases this low-cost weblog’s syntax highlighter right here can’t even remotely sustain!

The result’s:

|CUSTOMER_ID|PAYMENT_DATE           |PAYMENT_ID|AMOUNT|
|-----------|-----------------------|----------|------|
|72         |2005-08-18 10:59:04.000|1961      |0.99  |
|72         |2005-08-18 16:17:54.000|1962      |0.99  |
|72         |2005-08-19 12:53:53.000|1963      |0.99  |
|152        |2005-08-20 01:16:52.000|4152      |0.99  |
|152        |2005-08-20 19:13:23.000|4153      |0.99  |
|152        |2005-08-21 03:01:01.000|4154      |0.99  |
|207        |2005-07-08 17:14:14.000|5607      |0.99  |
|207        |2005-07-09 01:26:22.000|5608      |0.99  |
|207        |2005-07-09 13:56:56.000|5609      |0.99  |
|244        |2005-08-20 11:54:01.000|6615      |0.99  |
|244        |2005-08-20 17:12:28.000|6616      |0.99  |
|244        |2005-08-21 09:31:44.000|6617      |0.99  |

So, we will verify that for every of these teams of three funds, there are:

  • Quantities lower than 1.
  • Consecutive dates lower than 1 day aside.
  • Teams are per buyer, which is once more the partition.

Need to study extra about MATCH_RECOGNIZE? I feel this text explains it a lot better than anything on the net. You’ll be able to mess around with it without cost utilizing the Oracle XE 21c, e.g. accessible on Docker by Gerald Venzl.

MODEL partitions

Much more arcane than MATCH_RECOGNIZE is the Oracle-specific MODEL or SPREADSHEET clause. Each complicated software ought to have at the least one MODEL question simply to maintain your coworkers questioning. An instance will be present in our earlier article. In brief, you are able to do something you could possibly in any other case do in a spreadsheet software program, resembling MS Excel. I’ll give one other instance right here, with out a deep dive into the way it works:

SELECT 
  customer_id,
  payment_date,
  payment_id, 
  quantity
FROM (
  SELECT *
  FROM (
    SELECT p.*, 0 AS s, 0 AS n
    FROM cost p
  )
  MODEL

    -- We once more partition our information set by customer_id
    PARTITION BY (customer_id)

    -- The "spreadsheet dimension" is the row quantity ordered
    -- by cost date, inside a partition
    DIMENSION BY (
      row_number () OVER (
        PARTITION BY customer_id 
        ORDER BY payment_date
      ) AS rn
    )

    -- Measures is what we wish to challenge, together with
    -- o Desk columns
    -- o Extra calculated values
    MEASURES (payment_date, payment_id, quantity, s, n)

    -- These guidelines are the spreadsheet formulae
    RULES (

      -- S is the sum of earlier quantities which might be smaller than 1
      -- and whose cost dates are lower than 1 day aside
      s[any] = CASE 
          WHEN quantity[cv(rn)] < 1 
          AND payment_date[cv(rn)] - payment_date[cv(rn) - 1] < 1 
          THEN coalesce(s[cv(rn) - 1], 0) + quantity[cv(rn)] 
          ELSE 0 
      END,

      -- N is the variety of consecutive quantities with these properties
      n[any] = CASE 
          WHEN quantity[cv(rn)] < 1 
          AND payment_date[cv(rn)] - payment_date[cv(rn) - 1] < 1 
          THEN coalesce(n[cv(rn) - 1], 0) + 1 
          ELSE 0 
      END
    )
) t

-- Filter out solely these rows the place we had greater than 3 
-- consecutive occasions
WHERE n >= 3
ORDER BY customer_id, rn

Drop certainly one of these into your manufacturing code base on Friday earlier than deployment, and also you’ll be everybody’s darling, assured.

Anyway, MATCH_RECOGNIZE was a bit nicer, I feel. The result’s:

|CUSTOMER_ID|PAYMENT_DATE           |PAYMENT_ID|AMOUNT|
|-----------|-----------------------|----------|------|
|72         |2005-08-19 12:53:53.000|1963      |0.99  |
|152        |2005-08-21 03:01:01.000|4154      |0.99  |
|207        |2005-07-09 13:56:56.000|5609      |0.99  |
|244        |2005-08-21 09:31:44.000|6617      |0.99  |
|244        |2005-08-21 19:39:43.000|6618      |0.99  |
|252        |2005-07-28 02:44:25.000|6800      |0.99  |
|377        |2005-07-07 12:24:37.000|10211     |0.99  |
|425        |2005-08-01 12:37:46.000|11499     |0.99  |
|511        |2005-07-11 18:50:55.000|13769     |0.99  |

In the event you’re in for a thrill, strive modifying my question to return the same old triple rows that kind a gaggle, similar to within the MATCH_RECOGNIZE instance, and depart your answer within the feedback. It’s positively doable!

Partitioned tables

At the very least Oracle and PostgreSQL assist desk partitioning on a storage degree, in all probability others, too. The function helps tame your storage bother by separating information into separate bodily tables, whereas transparently pretending you’ve gotten a single logical desk in your software, and by introducing other forms of bother.

The standard instance is partitioning information units by date ranges, e.g. that’s what’s documented in PostgreSQL.

CREATE TABLE cost (
  customer_id int not null,
  quantity numeric not null,
  payment_date date not null
)
PARTITION BY RANGE (payment_date);

Now, we can’t use this desk but, as a result of it solely exists logically. It doesn’t know but the way to retailer information bodily:

INSERT INTO cost (customer_id, quantity, payment_date)
VALUES (1, 10, DATE '2000-01-01');

This produces:

SQL Error [23514]: ERROR: no partition of relation “cost” discovered for row
Element: Partition key of the failing row accommodates (payment_date) = (2000-01-01).

So, let’s create some bodily storage for a sure date vary, e.g.:

CREATE TABLE payment_2000
PARTITION OF cost
FOR VALUES FROM (DATE '2000-01-01') TO (DATE '2000-12-31');

Now, the insert works. This interpretation of PARTITION once more matches the window operate one, the place we partition our information set into subsets, that are clearly separated with out overlaps.

Bizarre one: Outer Be a part of Partitions

The subsequent partitioning function is a part of the SQL customary, however I’ve solely seen it being carried out in Oracle, up to now, which has had it without end: partitioned outer joins. They’re not trivial to elucidate, and regrettably, their partitions don’t have anything to do with the window partitions. They’re extra like CROSS JOIN syntax sugar (or vinegar, relying in your tastes).

Take into consideration them this manner, you should utilize partitioned outer joins to fill gaps in in any other case sparse information. Let’s take a look at an instance:

SELECT
  f.film_id,
  f.title,
  c.category_id,
  c.identify,
  depend(*) OVER ()
FROM movie f
  LEFT OUTER JOIN film_category fc
    ON f.film_id = fc.film_id
  LEFT OUTER JOIN class c
    ON fc.category_id = c.category_id
ORDER BY f.film_id, c.category_id

This question produces the class per movie. If a class doesn’t seem with a movie, there’s no report within the end result:

|FILM_ID|TITLE           |CATEGORY_ID|NAME       |COUNT(*)OVER()|
|-------|----------------|-----------|-----------|--------------|
|1      |ACADEMY DINOSAUR|6          |Documentary|1000          |
|2      |ACE GOLDFINGER  |11         |Horror     |1000          |
|3      |ADAPTATION HOLES|6          |Documentary|1000          |
|4      |AFFAIR PREJUDICE|11         |Horror     |1000          |
|5      |AFRICAN EGG     |8          |Household     |1000          |
|6      |AGENT TRUMAN    |9          |Overseas    |1000          |
|7      |AIRPLANE SIERRA |5          |Comedy     |1000          |
|8      |AIRPORT POLLOCK |11         |Horror     |1000          |
|9      |ALABAMA DEVIL   |11         |Horror     |1000          |
|10     |ALADDIN CALENDAR|15         |Sports activities     |1000          |

As you’ll be able to see, now we have 1000 movies, and since the Sakila database is so boring, each movie solely has 1 class, even when the many-to-many relationship would enable for multiple task.

What occurs if we add a PARTITION BY clause to one of many outer joins?

SELECT
  f.film_id,
  f.title,
  c.category_id,
  c.identify,
  depend(*) OVER ()
FROM movie f
  LEFT OUTER JOIN film_category fc
    ON f.film_id = fc.film_id
  LEFT OUTER JOIN class c
  PARTITION BY (c.category_id) -- Magic right here
    ON fc.category_id = c.category_id
ORDER BY f.film_id, c.category_id

I received’t present your complete end result, however as you’ll be able to see with the window operate end result, we now have 16000 rows in complete, not 1000. That’s as a result of now we have 1000 movies x 16 classes, so a cross product with clean class names (however not clean class IDs) in case there’s no match, if you’ll:

|FILM_ID|TITLE           |CATEGORY_ID|NAME       |COUNT(*)OVER()|
|-------|----------------|-----------|-----------|--------------|
|1      |ACADEMY DINOSAUR|1          |           |16000         |
|1      |ACADEMY DINOSAUR|2          |           |16000         |
|1      |ACADEMY DINOSAUR|3          |           |16000         |
|1      |ACADEMY DINOSAUR|4          |           |16000         |
|1      |ACADEMY DINOSAUR|5          |           |16000         |
|1      |ACADEMY DINOSAUR|6          |Documentary|16000         |
|1      |ACADEMY DINOSAUR|7          |           |16000         |
|1      |ACADEMY DINOSAUR|8          |           |16000         |
|1      |ACADEMY DINOSAUR|9          |           |16000         |
|1      |ACADEMY DINOSAUR|10         |           |16000         |
|1      |ACADEMY DINOSAUR|11         |           |16000         |
|1      |ACADEMY DINOSAUR|12         |           |16000         |
|1      |ACADEMY DINOSAUR|13         |           |16000         |
|1      |ACADEMY DINOSAUR|14         |           |16000         |
|1      |ACADEMY DINOSAUR|15         |           |16000         |
|1      |ACADEMY DINOSAUR|16         |           |16000         |
|2      |ACE GOLDFINGER  |1          |           |16000         |
|2      |ACE GOLDFINGER  |2          |           |16000         |
|2      |ACE GOLDFINGER  |3          |           |16000         |
|2      |ACE GOLDFINGER  |4          |           |16000         |
|2      |ACE GOLDFINGER  |5          |           |16000         |
|2      |ACE GOLDFINGER  |6          |           |16000         |
|2      |ACE GOLDFINGER  |7          |           |16000         |
|2      |ACE GOLDFINGER  |8          |           |16000         |
|2      |ACE GOLDFINGER  |9          |           |16000         |
|2      |ACE GOLDFINGER  |10         |           |16000         |
|2      |ACE GOLDFINGER  |11         |Horror     |16000         |
|2      |ACE GOLDFINGER  |12         |           |16000         |
|2      |ACE GOLDFINGER  |13         |           |16000         |
|2      |ACE GOLDFINGER  |14         |           |16000         |
|2      |ACE GOLDFINGER  |15         |           |16000         |
|2      |ACE GOLDFINGER  |16         |           |16000         |

In a manner, that is helpful everytime you wish to create a report primarily based on sparse information, and generate data for these gaps. An identical question with out PARTITION BY could be on utilizing CROSS JOIN

SELECT
  f.film_id,
  f.title,
  c.category_id,
  NVL2(fc.category_id, c.identify, NULL) AS identify,
  depend(*) OVER ()
FROM movie f
  CROSS JOIN class c
  LEFT JOIN film_category fc
    ON fc.film_id = f.film_id
    AND fc.category_id = c.category_id
ORDER BY f.film_id, c.category_id;

I need to say, I haven’t discovered these partitioned outer be part of very helpful or intelligible prior to now, and I’m not satisfied that different RDBMS are actually missing some vital function right here, regardless of this being customary SQL.

To date, jOOQ doesn’t emulate the function in different RDBMS but.

RELATED ARTICLES

LEAVE A REPLY

Please enter your comment!
Please enter your name here

- Advertisment -
Google search engine

Most Popular

Recent Comments