Welcome to aparke’s blog!
窗口分析函数 sum() over() :可以实现在窗口中进行逐行累加
有如下三个人,求出每个人截止到每个月的访问次数总额
数据如下A,2015-01,5
A,2015-01,15
B,2015-01,5
A,2015-01,8
B,2015-01,25
A,2015-01,5
C,2015-01,10
C,2015-01,20
A,2015-02,4
A,2015-02,6
C,2015-02,30
C,2015-02,10
B,2015-02,10
B,2015-02,5
A,2015-03,14
A,2015-03,6
B,2015-03,20
B,2015-03,25
C,2015-03,10
C,2015-03,20
创建表导入数据create table t_access_times(uid string,month string,times int)
row format delimited
fields terminated by ',';
load data local inpath '/root/hive1/baobiao.txt' into table t_access_times;
select * from t_access_times; |
方法一:传统方法解决:
第一步,先分组聚合,求出每个用户每月的总次数select uid,month,sum(times) as times from t_access_times
group by uid,month;
然后将上面查询的结果保存到另一张表 t_access_amountcreate table t_access_amount
as
select uid,month,sum(times) as amount
from t_access_times
group by uid,month;
select * from t_access_amount; |
第二步:将总额表进行join自连接 条件是uid相同select a.*,b.*
from t_access_amount a join t_access_amount b
on a.uid=b.uid;
+--------+----------+-----------+--------+----------+-----------+--+
| a.uid | a.month | a.amount | b.uid | b.month | b.amount |
+--------+----------+-----------+--------+----------+-----------+--+
| A | 2015-01 | 33 | A | 2015-01 | 33 |
| A | 2015-02 | 10 | A | 2015-01 | 33 |
| A | 2015-03 | 20 | A | 2015-01 | 33 |
| A | 2015-01 | 33 | A | 2015-02 | 10 |
| A | 2015-02 | 10 | A | 2015-02 | 10 |
| A | 2015-03 | 20 | A | 2015-02 | 10 |
| A | 2015-01 | 33 | A | 2015-03 | 20 |
| A | 2015-02 | 10 | A | 2015-03 | 20 |
| A | 2015-03 | 20 | A | 2015-03 | 20 |
| B | 2015-01 | 30 | B | 2015-01 | 30 |
| B | 2015-02 | 15 | B | 2015-01 | 30 |
| B | 2015-03 | 45 | B | 2015-01 | 30 |
| B | 2015-01 | 30 | B | 2015-02 | 15 |
| B | 2015-02 | 15 | B | 2015-02 | 15 |
| B | 2015-03 | 45 | B | 2015-02 | 15 |
| B | 2015-01 | 30 | B | 2015-03 | 45 |
| B | 2015-02 | 15 | B | 2015-03 | 45 |
| B | 2015-03 | 45 | B | 2015-03 | 45 |
| C | 2015-01 | 30 | C | 2015-01 | 30 |
| C | 2015-02 | 40 | C | 2015-01 | 30 |
| C | 2015-03 | 30 | C | 2015-01 | 30 |
| C | 2015-01 | 30 | C | 2015-02 | 40 |
| C | 2015-02 | 40 | C | 2015-02 | 40 |
| C | 2015-03 | 30 | C | 2015-02 | 40 |
| C | 2015-01 | 30 | C | 2015-03 | 30 |
| C | 2015-02 | 40 | C | 2015-03 | 30 |
| C | 2015-03 | 30 | C | 2015-03 | 30 |
+--------+----------+-----------+--------+----------+-----------+--+
第三步:将总额表进行join自连接,并只保留每个用户当月的次数(在b表保留本月的次数)此时a表就可以实现累加
a.month<=b.month即可满足select a.*,b.*
from t_access_amount a join t_access_amount b
on a.uid=b.uid
where a.month<=b.month;
+--------+----------+-----------+--------+----------+-----------+--+
| a.uid | a.month | a.amount | b.uid | b.month | b.amount |
+--------+----------+-----------+--------+----------+-----------+--+
| A | 2015-01 | 33 | A | 2015-01 | 33 |
| A | 2015-01 | 33 | A | 2015-02 | 10 |
| A | 2015-02 | 10 | A | 2015-02 | 10 |
| A | 2015-01 | 33 | A | 2015-03 | 20 |
| A | 2015-02 | 10 | A | 2015-03 | 20 |
| A | 2015-03 | 20 | A | 2015-03 | 20 |
| B | 2015-01 | 30 | B | 2015-01 | 30 |
| B | 2015-01 | 30 | B | 2015-02 | 15 |
| B | 2015-02 | 15 | B | 2015-02 | 15 |
| B | 2015-01 | 30 | B | 2015-03 | 45 |
| B | 2015-02 | 15 | B | 2015-03 | 45 |
| B | 2015-03 | 45 | B | 2015-03 | 45 |
| C | 2015-01 | 30 | C | 2015-01 | 30 |
| C | 2015-01 | 30 | C | 2015-02 | 40 |
| C | 2015-02 | 40 | C | 2015-02 | 40 |
| C | 2015-01 | 30 | C | 2015-03 | 30 |
| C | 2015-02 | 40 | C | 2015-03 | 30 |
| C | 2015-03 | 30 | C | 2015-03 | 30 |
+--------+----------+-----------+--------+----------+-----------+--+
第四步:按照右边的uid和month进行分组,对以上的结果进行保存分析create table t_tmp
as
select a.uid as a_uid,a.month as a_month,a.amount as a_amount,b.*
from t_access_amount a join t_access_amount b
on a.uid=b.uid
where a.month<=b.month;
+--------------+----------------+-----------------+------------+--------------+---------------+--+
| t_tmp.a_uid | t_tmp.a_month | t_tmp.a_amount | t_tmp.uid | t_tmp.month | t_tmp.amount |
+--------------+----------------+-----------------+------------+--------------+---------------+--+
| A | 2015-01 | 33 | A | 2015-01 | 33 |
| A | 2015-01 | 33 | A | 2015-02 | 10 |
| A | 2015-02 | 10 | A | 2015-02 | 10 |
| A | 2015-01 | 33 | A | 2015-03 | 20 |
| A | 2015-02 | 10 | A | 2015-03 | 20 |
| A | 2015-03 | 20 | A | 2015-03 | 20 |
| B | 2015-01 | 30 | B | 2015-01 | 30 |
| B | 2015-01 | 30 | B | 2015-02 | 15 |
| B | 2015-02 | 15 | B | 2015-02 | 15 |
| B | 2015-01 | 30 | B | 2015-03 | 45 |
| B | 2015-02 | 15 | B | 2015-03 | 45 |
| B | 2015-03 | 45 | B | 2015-03 | 45 |
| C | 2015-01 | 30 | C | 2015-01 | 30 |
| C | 2015-01 | 30 | C | 2015-02 | 40 |
| C | 2015-02 | 40 | C | 2015-02 | 40 |
| C | 2015-01 | 30 | C | 2015-03 | 30 |
| C | 2015-02 | 40 | C | 2015-03 | 30 |
| C | 2015-03 | 30 | C | 2015-03 | 30 |
+--------------+----------------+-----------------+------------+--------------+---------------+--+
第五步:对b表的uid和month分组聚合,求出uid month 当月次数 amount 以及累计次数select uid,month,max(amount) as amount, sum(a_amount) as accumulate
from t_tmp
group by uid,month;
+------+----------+---------+-------------+--+
| uid | month | amount | accumulate |
+------+----------+---------+-------------+--+
| A | 2015-01 | 33 | 33 |
| A | 2015-02 | 10 | 43 |
| A | 2015-03 | 20 | 63 |
| B | 2015-01 | 30 | 30 |
| B | 2015-02 | 15 | 45 |
| B | 2015-03 | 45 | 90 |
| C | 2015-01 | 30 | 30 |
| C | 2015-02 | 40 | 70 |
| C | 2015-03 | 30 | 100 |
+------+----------+---------+-------------+--+
方法二:sum() over()函数解决
sum() over() 可以实现再窗口进行逐行累加
select uid,month,amount, |
解析partition by uid表示窗口 order by month 按照月份排序 rows between 夹..之间的行 unbounded没有边界的 preceding前面的 意思是往前面夹是没有边界的 and and current row 当前行的意思
+------+----------+---------+-------------+--+ |
示例 测试 熟练掌握 sum() over()函数累计次数
select uid,month,times, |