A Summary of PostgreSQL Window Function Usage

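PostgreSQL has supported window functions since version 8.4. Two descriptions of the feature are quoted below; the first is from the official documentation.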
    A window function performs a calculation across a set of table rows that are somehow related to the current row. This is comparable to the type of calculation that can be done with an aggregate function. But unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row — the rows retain their separate identities. Behind the scenes, the window function is able to access more than just the current row of the query result.
    Window Functions in SQL is an OLAP functionality that provides ranking, cumulative computation, and partitioning aggregation. Many commercial RDBMS such as Oracle, MS SQL Server and DB2 have implemented part of this specification, while open source RDBMS including PostgreSQL, MySQL and Firebird haven't yet. To implement this functionality on PostgreSQL not only helps many users move from those RDBMS to PostgreSQL but encourages OLAP applications such as BI (Business Intelligence) to analyze large data sets. This specification is defined first in SQL:2003, and improved in SQL:2008.
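In short, an aggregate function returns one result per group, whereas a window function returns a result for every row. Examples follow.
1. Create the sample table and load the data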
DROP TABLE IF EXISTS empsalary;
CREATE TABLE empsalary(
  depname varchar,
  empno bigint,
  salary int,
  enroll_date date
);
INSERT INTO empsalary VALUES('develop',10, 5200, '2007/08/01');
INSERT INTO empsalary VALUES('sales', 1, 5000, '2006/10/01');
INSERT INTO empsalary VALUES('personnel', 5, 3500, '2007/12/10');
INSERT INTO empsalary VALUES('sales', 4, 4800, '2007/08/08');
INSERT INTO empsalary VALUES('sales', 6, 5500, '2007/01/02');
INSERT INTO empsalary VALUES('personnel', 2, 3900, '2006/12/23');
INSERT INTO empsalary VALUES('develop', 7, 4200, '2008/01/01');
INSERT INTO empsalary VALUES('develop', 9, 4500, '2008/01/01');
INSERT INTO empsalary VALUES('sales', 3, 4800, '2007/08/01');
INSERT INTO empsalary VALUES('develop', 8, 6000, '2006/10/01');
INSERT INTO empsalary VALUES('develop', 11, 5200, '2007/08/15');

postgres=# select * from empsalary ;
  depname  | empno | salary | enroll_date 
-----------+-------+--------+-------------
 develop   |    10 |   5200 | 2007-08-01
 sales     |     1 |   5000 | 2006-10-01
 personnel |     5 |   3500 | 2007-12-10
 sales     |     4 |   4800 | 2007-08-08
 sales     |     6 |   5500 | 2007-01-02
 personnel |     2 |   3900 | 2006-12-23
 develop   |     7 |   4200 | 2008-01-01
 develop   |     9 |   4500 | 2008-01-01
 sales     |     3 |   4800 | 2007-08-01
 develop   |     8 |   6000 | 2006-10-01
 develop   |    11 |   5200 | 2007-08-15
(11 rows)

2. Aggregation examples
a. Show each department's total and average salary alongside the detail rows
postgres=# select sum(salary) OVER (PARTITION BY depname),avg(salary) OVER (PARTITION BY depname),* from empsalary;
  sum  |          avg          |  depname  | empno | salary | enroll_date 
-------+-----------------------+-----------+-------+--------+-------------
 25100 | 5020.0000000000000000 | develop   |    10 |   5200 | 2007-08-01
 25100 | 5020.0000000000000000 | develop   |     7 |   4200 | 2008-01-01
 25100 | 5020.0000000000000000 | develop   |     9 |   4500 | 2008-01-01
 25100 | 5020.0000000000000000 | develop   |     8 |   6000 | 2006-10-01
 25100 | 5020.0000000000000000 | develop   |    11 |   5200 | 2007-08-15
  7400 | 3700.0000000000000000 | personnel |     2 |   3900 | 2006-12-23
  7400 | 3700.0000000000000000 | personnel |     5 |   3500 | 2007-12-10
 20100 | 5025.0000000000000000 | sales     |     3 |   4800 | 2007-08-01
 20100 | 5025.0000000000000000 | sales     |     1 |   5000 | 2006-10-01
 20100 | 5025.0000000000000000 | sales     |     4 |   4800 | 2007-08-08
 20100 | 5025.0000000000000000 | sales     |     6 |   5500 | 2007-01-02
(11 rows)
b. Rank each employee's salary within their department
postgres=# select rank() OVER (PARTITION BY depname ORDER BY salary),* from empsalary;
 rank |  depname  | empno | salary | enroll_date 
------+-----------+-------+--------+-------------
    1 | develop   |     7 |   4200 | 2008-01-01
    2 | develop   |     9 |   4500 | 2008-01-01
    3 | develop   |    10 |   5200 | 2007-08-01
    3 | develop   |    11 |   5200 | 2007-08-15
    5 | develop   |     8 |   6000 | 2006-10-01
    1 | personnel |     5 |   3500 | 2007-12-10
    2 | personnel |     2 |   3900 | 2006-12-23
    1 | sales     |     4 |   4800 | 2007-08-08
    1 | sales     |     3 |   4800 | 2007-08-01
    3 | sales     |     1 |   5000 | 2006-10-01
    4 | sales     |     6 |   5500 | 2007-01-02
(11 rows)
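
To make the contrast between an aggregate and a window function concrete, here is a minimal sketch that runs both forms over the same salary data. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL (an assumption for the sake of a runnable example; it needs SQLite 3.25 or later), and the enroll_date column is left out for brevity.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for PostgreSQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE empsalary (depname TEXT, empno INTEGER, salary INTEGER)")
conn.executemany(
    "INSERT INTO empsalary VALUES (?, ?, ?)",
    [
        ("develop", 10, 5200), ("sales", 1, 5000), ("personnel", 5, 3500),
        ("sales", 4, 4800), ("sales", 6, 5500), ("personnel", 2, 3900),
        ("develop", 7, 4200), ("develop", 9, 4500), ("sales", 3, 4800),
        ("develop", 8, 6000), ("develop", 11, 5200),
    ],
)

# Aggregate with GROUP BY: the rows of each department collapse into one row.
grouped = conn.execute(
    "SELECT depname, sum(salary) FROM empsalary "
    "GROUP BY depname ORDER BY depname"
).fetchall()

# Window function: every input row survives and carries its department total.
windowed = conn.execute(
    "SELECT depname, empno, sum(salary) OVER (PARTITION BY depname) "
    "FROM empsalary ORDER BY depname, empno"
).fetchall()

print(len(grouped), "grouped rows;", len(windowed), "windowed rows")
for row in windowed:
    print(row)
```

The grouped query yields one row per department, while the windowed query yields one row per employee with the same totals repeated on each row.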
3. An interesting case: adding ORDER BY to the window changes the result
create table foo(a int, b int);
insert into foo values (1,1);
insert into foo values (1,1);
insert into foo values (2,1);
insert into foo values (4,1);
insert into foo values (2,1);
insert into foo values (4,1);
insert into foo values (5,1);
insert into foo values (11,3);
insert into foo values (12,3);
insert into foo values (22,3);
insert into foo values (16,3);
insert into foo values (16,3);
insert into foo values (16,3);

postgres=# select sum(a) over (partition by b), a, b from foo;
 sum | a  | b 
-----+----+---
  19 |  1 | 1
  19 |  1 | 1
  19 |  2 | 1
  19 |  4 | 1
  19 |  2 | 1
  19 |  4 | 1
  19 |  5 | 1
  93 | 11 | 3
  93 | 12 | 3
  93 | 22 | 3
  93 | 16 | 3
  93 | 16 | 3
  93 | 16 | 3
(13 rows)

postgres=# select sum(a) over (partition by b order by a), a, b from foo;
 sum | a  | b 
-----+----+---
   2 |  1 | 1
   2 |  1 | 1
   6 |  2 | 1
   6 |  2 | 1
  14 |  4 | 1
  14 |  4 | 1
  19 |  5 | 1
  11 | 11 | 3
  23 | 12 | 3
  71 | 16 | 3
  71 | 16 | 3
  71 | 16 | 3
  93 | 22 | 3
(13 rows)

postgres=# select a, b, sum(a) over (partition by b order by a ROWS 
postgres(# BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) from foo;
 a  | b | sum 
----+---+-----
  1 | 1 |  19
  1 | 1 |  19
  2 | 1 |  19
  2 | 1 |  19
  4 | 1 |  19
  4 | 1 |  19
  5 | 1 |  19
 11 | 3 |  93
 12 | 3 |  93
 16 | 3 |  93
 16 | 3 |  93
 16 | 3 |  93
 22 | 3 |  93
(13 rows)
The official documentation explains: "By default, if ORDER BY is supplied then the frame consists of all rows from the start of the partition up through the current row, plus any following rows that are equal to the current row according to the ORDER BY clause. When ORDER BY is omitted the default frame consists of all rows in the partition."
In other words, when ORDER BY is present, sum() accumulates from the first row of the partition up to the current row, including any rows that tie with it. When ORDER BY is omitted, sum() covers every row in the partition.
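
The "plus any following rows that are equal to the current row" part is what makes tied rows share one running total. The sketch below compares the default frame with an explicit ROWS frame on the b = 1 rows of foo. It runs on Python's built-in sqlite3 module as a stand-in for PostgreSQL (an assumption; SQLite 3.25 or later uses the same default frame), and the outer ORDER BY is added only to make the printed order stable.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for PostgreSQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (a INTEGER, b INTEGER)")
conn.executemany(
    "INSERT INTO foo VALUES (?, ?)",
    [(1, 1), (1, 1), (2, 1), (4, 1), (2, 1), (4, 1), (5, 1)],  # b = 1 rows only
)

rows = conn.execute(
    """
    SELECT a,
           -- default frame: start of partition through the current row's peers
           sum(a) OVER (PARTITION BY b ORDER BY a) AS range_sum,
           -- explicit ROWS frame: stops exactly at the current row
           sum(a) OVER (PARTITION BY b ORDER BY a
                        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS rows_sum
    FROM foo
    ORDER BY a, rows_sum
    """
).fetchall()

for row in rows:
    print(row)
```

With the default frame the two rows with a = 1 both report 2, whereas the ROWS frame reports 1 and then 2, a strict row-by-row running total.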
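4. Other window functions
row_number(): numbers the rows of the partition from 1, for example 1, 2, 3, 4, 5, 6.
rank(): rank starting from 1, with gaps after ties, for example 1, 2, 2, 4, 5, 6.
dense_rank(): rank starting from 1, without gaps; ties repeat, for example 1, 2, 2, 3, 4, 5.
percent_rank(): relative rank of the current row, (rank - 1) / (rows in partition - 1). For a five-row partition with one tie this gives, for example, 0, 0.25, 0.25, 0.75, 1. The value starts again from 0 in each partition.
cume_dist(): the number of rows up to and including the current row and its peers, divided by the number of rows in the partition. For a partition of 4 rows with no ties the values are 0.25, 0.5, 0.75, 1.
ntile(num_buckets integer): an integer from 1 to num_buckets that divides the partition into buckets as evenly as possible.
lag(value any [,offset integer [,default any]]): returns value from the row that is offset rows before the current row within the partition. offset defaults to 1; if there is no such row, default is returned, and default itself defaults to null. For example, lag(column_name, 2, 0) looks 2 rows back and returns 0 when that row does not exist.
lead(value any [,offset integer [,default any]]): like lag(), but returns value from the row that is offset rows after the current row.
first_value(value any): returns value from the first row of the window frame.
last_value(value any): returns value from the last row of the window frame.
nth_value(value any, nth integer): returns value from the nth row of the window frame, counting from 1, or null if there is no such row. For example, nth_value(salary, 2) returns the salary of the second row in the frame.
5. More window function examples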
postgres=# select row_number() over (partition by depname order by salary desc),* from empsalary;
 row_number |  depname  | empno | salary | enroll_date 
------------+-----------+-------+--------+-------------
          1 | develop   |     8 |   6000 | 2006-10-01
          2 | develop   |    10 |   5200 | 2007-08-01
          3 | develop   |    11 |   5200 | 2007-08-15
          4 | develop   |     9 |   4500 | 2008-01-01
          5 | develop   |     7 |   4200 | 2008-01-01
          1 | personnel |     2 |   3900 | 2006-12-23
          2 | personnel |     5 |   3500 | 2007-12-10
          1 | sales     |     6 |   5500 | 2007-01-02
          2 | sales     |     1 |   5000 | 2006-10-01
          3 | sales     |     3 |   4800 | 2007-08-01
          4 | sales     |     4 |   4800 | 2007-08-08
(11 rows)

postgres=# select rank() over(partition by depname order by salary desc),* from empsalary;
 rank |  depname  | empno | salary | enroll_date 
------+-----------+-------+--------+-------------
    1 | develop   |     8 |   6000 | 2006-10-01
    2 | develop   |    10 |   5200 | 2007-08-01
    2 | develop   |    11 |   5200 | 2007-08-15
    4 | develop   |     9 |   4500 | 2008-01-01
    5 | develop   |     7 |   4200 | 2008-01-01
    1 | personnel |     2 |   3900 | 2006-12-23
    2 | personnel |     5 |   3500 | 2007-12-10
    1 | sales     |     6 |   5500 | 2007-01-02
    2 | sales     |     1 |   5000 | 2006-10-01
    3 | sales     |     3 |   4800 | 2007-08-01
    3 | sales     |     4 |   4800 | 2007-08-08
(11 rows)


postgres=# select dense_rank() over(partition by depname order by salary desc),* from empsalary;
 dense_rank |  depname  | empno | salary | enroll_date 
------------+-----------+-------+--------+-------------
          1 | develop   |     8 |   6000 | 2006-10-01
          2 | develop   |    10 |   5200 | 2007-08-01
          2 | develop   |    11 |   5200 | 2007-08-15
          3 | develop   |     9 |   4500 | 2008-01-01
          4 | develop   |     7 |   4200 | 2008-01-01
          1 | personnel |     2 |   3900 | 2006-12-23
          2 | personnel |     5 |   3500 | 2007-12-10
          1 | sales     |     6 |   5500 | 2007-01-02
          2 | sales     |     1 |   5000 | 2006-10-01
          3 | sales     |     3 |   4800 | 2007-08-01
          3 | sales     |     4 |   4800 | 2007-08-08
(11 rows)

postgres=# select percent_rank() over(partition by depname order by salary desc),* from empsalary;
   percent_rank    |  depname  | empno | salary | enroll_date 
-------------------+-----------+-------+--------+-------------
                 0 | develop   |     8 |   6000 | 2006-10-01
              0.25 | develop   |    10 |   5200 | 2007-08-01
              0.25 | develop   |    11 |   5200 | 2007-08-15
              0.75 | develop   |     9 |   4500 | 2008-01-01
                 1 | develop   |     7 |   4200 | 2008-01-01
                 0 | personnel |     2 |   3900 | 2006-12-23
                 1 | personnel |     5 |   3500 | 2007-12-10
                 0 | sales     |     6 |   5500 | 2007-01-02
 0.333333333333333 | sales     |     1 |   5000 | 2006-10-01
 0.666666666666667 | sales     |     3 |   4800 | 2007-08-01
 0.666666666666667 | sales     |     4 |   4800 | 2007-08-08
(11 rows)

postgres=# select cume_dist()over(partition by depname order by salary desc),* from empsalary;
 cume_dist |  depname  | empno | salary | enroll_date 
-----------+-----------+-------+--------+-------------
       0.2 | develop   |     8 |   6000 | 2006-10-01
       0.6 | develop   |    10 |   5200 | 2007-08-01
       0.6 | develop   |    11 |   5200 | 2007-08-15
       0.8 | develop   |     9 |   4500 | 2008-01-01
         1 | develop   |     7 |   4200 | 2008-01-01
       0.5 | personnel |     2 |   3900 | 2006-12-23
         1 | personnel |     5 |   3500 | 2007-12-10
      0.25 | sales     |     6 |   5500 | 2007-01-02
       0.5 | sales     |     1 |   5000 | 2006-10-01
         1 | sales     |     3 |   4800 | 2007-08-01
         1 | sales     |     4 |   4800 | 2007-08-08
(11 rows)
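
The percent_rank and cume_dist columns for the develop department can be reproduced by hand. The sketch below is plain Python, not SQL, and assumes the usual definitions: percent_rank is (rank - 1) / (rows - 1), and cume_dist is the number of rows up to and including the current row's peers divided by the number of rows in the partition.

```python
# Salaries of the develop department, already sorted by salary DESC.
salaries = [6000, 5200, 5200, 4500, 4200]
n = len(salaries)


def rank(value):
    """rank() under ORDER BY salary DESC: 1 + number of strictly higher salaries."""
    return 1 + sum(1 for s in salaries if s > value)


# percent_rank = (rank - 1) / (rows - 1); a one-row partition would yield 0.
percent_rank = [(rank(s) - 1) / (n - 1) if n > 1 else 0.0 for s in salaries]

# cume_dist = rows sorting at or before the current row (peers included) / rows.
cume_dist = [sum(1 for t in salaries if t >= s) / n for s in salaries]

for s, p, c in zip(salaries, percent_rank, cume_dist):
    print(s, p, c)
```

The two 5200 rows tie, so they share rank 2 and both count each other as peers, which is why cume_dist jumps straight from 0.2 to 0.6.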

postgres=# select ntile(3)over(partition by depname order by salary desc),* from empsalary;
 ntile |  depname  | empno | salary | enroll_date 
-------+-----------+-------+--------+-------------
     1 | develop   |     8 |   6000 | 2006-10-01
     1 | develop   |    10 |   5200 | 2007-08-01
     2 | develop   |    11 |   5200 | 2007-08-15
     2 | develop   |     9 |   4500 | 2008-01-01
     3 | develop   |     7 |   4200 | 2008-01-01
     1 | personnel |     2 |   3900 | 2006-12-23
     2 | personnel |     5 |   3500 | 2007-12-10
     1 | sales     |     6 |   5500 | 2007-01-02
     1 | sales     |     1 |   5000 | 2006-10-01
     2 | sales     |     3 |   4800 | 2007-08-01
     3 | sales     |     4 |   4800 | 2007-08-08
(11 rows)

postgres=# select lag(salary,2,null)over(partition by depname order by salary desc),* from empsalary;
 lag  |  depname  | empno | salary | enroll_date 
------+-----------+-------+--------+-------------
      | develop   |     8 |   6000 | 2006-10-01
      | develop   |    10 |   5200 | 2007-08-01
 6000 | develop   |    11 |   5200 | 2007-08-15
 5200 | develop   |     9 |   4500 | 2008-01-01
 5200 | develop   |     7 |   4200 | 2008-01-01
      | personnel |     2 |   3900 | 2006-12-23
      | personnel |     5 |   3500 | 2007-12-10
      | sales     |     6 |   5500 | 2007-01-02
      | sales     |     1 |   5000 | 2006-10-01
 5500 | sales     |     3 |   4800 | 2007-08-01
 5000 | sales     |     4 |   4800 | 2007-08-08
(11 rows)
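
lead() is the mirror image of lag() but has no example of its own here, so the following sketch shows both side by side for the sales department. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL (an assumption; SQLite 3.25 or later is needed). Unlike the queries in this article, empno is added to the window's ORDER BY as a tiebreaker so that the two 4800 rows come out in a fixed order.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for PostgreSQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE empsalary (depname TEXT, empno INTEGER, salary INTEGER)")
conn.executemany(
    "INSERT INTO empsalary VALUES (?, ?, ?)",
    [("sales", 1, 5000), ("sales", 4, 4800), ("sales", 6, 5500), ("sales", 3, 4800)],
)

rows = conn.execute(
    """
    SELECT empno,
           salary,
           -- previous row's salary, or 0 when there is no previous row
           lag(salary, 1, 0) OVER w AS prev_salary,
           -- next row's salary, or NULL when there is no next row
           lead(salary)      OVER w AS next_salary
    FROM empsalary
    WINDOW w AS (PARTITION BY depname ORDER BY salary DESC, empno)
    ORDER BY salary DESC, empno
    """
).fetchall()

for row in rows:
    print(row)
```

The first row has no predecessor, so lag() falls back to the supplied default 0; the last row has no successor and lead() returns NULL (None in Python) because no default was given.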

postgres=# select first_value(salary)over(partition by depname order by salary desc),* from empsalary;
 first_value |  depname  | empno | salary | enroll_date 
-------------+-----------+-------+--------+-------------
        6000 | develop   |     8 |   6000 | 2006-10-01
        6000 | develop   |    10 |   5200 | 2007-08-01
        6000 | develop   |    11 |   5200 | 2007-08-15
        6000 | develop   |     9 |   4500 | 2008-01-01
        6000 | develop   |     7 |   4200 | 2008-01-01
        3900 | personnel |     2 |   3900 | 2006-12-23
        3900 | personnel |     5 |   3500 | 2007-12-10
        5500 | sales     |     6 |   5500 | 2007-01-02
        5500 | sales     |     1 |   5000 | 2006-10-01
        5500 | sales     |     3 |   4800 | 2007-08-01
        5500 | sales     |     4 |   4800 | 2007-08-08
(11 rows) 

postgres=# select last_value(salary)over(partition by depname order by salary desc),* from empsalary;
 last_value |  depname  | empno | salary | enroll_date 
------------+-----------+-------+--------+-------------
       6000 | develop   |     8 |   6000 | 2006-10-01
       5200 | develop   |    10 |   5200 | 2007-08-01
       5200 | develop   |    11 |   5200 | 2007-08-15
       4500 | develop   |     9 |   4500 | 2008-01-01
       4200 | develop   |     7 |   4200 | 2008-01-01
       3900 | personnel |     2 |   3900 | 2006-12-23
       3500 | personnel |     5 |   3500 | 2007-12-10
       5500 | sales     |     6 |   5500 | 2007-01-02
       5000 | sales     |     1 |   5000 | 2006-10-01
       4800 | sales     |     3 |   4800 | 2007-08-01
       4800 | sales     |     4 |   4800 | 2007-08-08
(11 rows)

postgres=# select last_value(aa.salary)over(partition by aa.depname),* from     
(select depname,empno,salary,enroll_date from empsalary order by depname,salary ) aa;
 last_value |  depname  | empno | salary | enroll_date 
------------+-----------+-------+--------+-------------
       6000 | develop   |     7 |   4200 | 2008-01-01
       6000 | develop   |     9 |   4500 | 2008-01-01
       6000 | develop   |    10 |   5200 | 2007-08-01
       6000 | develop   |    11 |   5200 | 2007-08-15
       6000 | develop   |     8 |   6000 | 2006-10-01
       3900 | personnel |     5 |   3500 | 2007-12-10
       3900 | personnel |     2 |   3900 | 2006-12-23
       5500 | sales     |     4 |   4800 | 2007-08-08
       5500 | sales     |     3 |   4800 | 2007-08-01
       5500 | sales     |     1 |   5000 | 2006-10-01
       5500 | sales     |     6 |   5500 | 2007-01-02
(11 rows)

postgres=# select nth_value(salary,2)over(partition by depname order by salary desc),* from empsalary;
 nth_value |  depname  | empno | salary | enroll_date 
-----------+-----------+-------+--------+-------------
           | develop   |     8 |   6000 | 2006-10-01
      5200 | develop   |    10 |   5200 | 2007-08-01
      5200 | develop   |    11 |   5200 | 2007-08-15
      5200 | develop   |     9 |   4500 | 2008-01-01
      5200 | develop   |     7 |   4200 | 2008-01-01
           | personnel |     2 |   3900 | 2006-12-23
      3500 | personnel |     5 |   3500 | 2007-12-10
           | sales     |     6 |   5500 | 2007-01-02
      5000 | sales     |     1 |   5000 | 2006-10-01
      5000 | sales     |     3 |   4800 | 2007-08-01
      5000 | sales     |     4 |   4800 | 2007-08-08
(11 rows)
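When a query uses several window functions over the same window, you can define the window once under a name with the WINDOW clause, which is more concise.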
postgres=# select sum(salary)over w,avg(salary) over w,* from empsalary window w as (partition by depname order by salary desc);
  sum  |          avg          |  depname  | empno | salary | enroll_date 
-------+-----------------------+-----------+-------+--------+-------------
  6000 | 6000.0000000000000000 | develop   |     8 |   6000 | 2006-10-01
 16400 | 5466.6666666666666667 | develop   |    10 |   5200 | 2007-08-01
 16400 | 5466.6666666666666667 | develop   |    11 |   5200 | 2007-08-15
 20900 | 5225.0000000000000000 | develop   |     9 |   4500 | 2008-01-01
 25100 | 5020.0000000000000000 | develop   |     7 |   4200 | 2008-01-01
  3900 | 3900.0000000000000000 | personnel |     2 |   3900 | 2006-12-23
  7400 | 3700.0000000000000000 | personnel |     5 |   3500 | 2007-12-10
  5500 | 5500.0000000000000000 | sales     |     6 |   5500 | 2007-01-02
 10500 | 5250.0000000000000000 | sales     |     1 |   5000 | 2006-10-01
 20100 | 5025.0000000000000000 | sales     |     3 |   4800 | 2007-08-01
 20100 | 5025.0000000000000000 | sales     |     4 |   4800 | 2007-08-08
(11 rows)
This is equivalent to the following query, but shorter:
SELECT sum(salary) OVER (PARTITION BY depname ORDER BY salary DESC), avg(salary) OVER (PARTITION BY depname ORDER BY salary DESC),* FROM empsalary;
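Two last_value examples are given above. The first one, with ORDER BY in the window, runs without error but does not return the last value of the partition: each row simply gets its own salary back. first_value does not have this problem. The cause is again ORDER BY, as described in the documentation passage quoted in section 3: when ORDER BY is supplied, the default frame ends at the current row (plus its peers), so the "last" row of the frame is the current row or one of its ties. When ORDER BY is omitted, the frame is the whole partition, which is why the second example returns the same value for every row of a department. Note that the second example depends on the row order coming out of the subquery, which SQL does not formally guarantee. To keep ORDER BY and still get the last value of the partition, specify the frame explicitly with ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING, as in the sum example in section 3. Thanks to [email protected] for the pointer and to 디골 for the explanation.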
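
The last_value behavior can be reproduced with a short script. This is a sketch that runs on Python's built-in sqlite3 module as a stand-in for PostgreSQL (an assumption; SQLite 3.25 or later is needed), using only the develop rows. It compares the default frame with an explicit full-partition frame.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for PostgreSQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE empsalary (depname TEXT, empno INTEGER, salary INTEGER)")
conn.executemany(
    "INSERT INTO empsalary VALUES (?, ?, ?)",
    [
        ("develop", 10, 5200), ("develop", 7, 4200), ("develop", 9, 4500),
        ("develop", 8, 6000), ("develop", 11, 5200),
    ],
)

rows = conn.execute(
    """
    SELECT empno,
           salary,
           -- default frame ends at the current row's peers
           last_value(salary) OVER (PARTITION BY depname ORDER BY salary DESC)
               AS default_frame,
           -- explicit frame spans the whole partition
           last_value(salary) OVER (PARTITION BY depname ORDER BY salary DESC
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
               AS full_frame
    FROM empsalary
    ORDER BY salary DESC, empno
    """
).fetchall()

for row in rows:
    print(row)
```

With the default frame, each row gets its own salary back; with the explicit frame, every row reports 4200, the lowest salary and therefore the last value under ORDER BY salary DESC.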
Reference: http://umitanuki.net/pgsql/wfv08/design.html