The Common Cascading (Cumulative) Sum Pattern in Hive, Explained
1. Requirement:
Given the following visitor access-count table, t_access_times:
Visitor Date Visits
A 2015-01-02 5
A 2015-01-03 15
B 2015-01-01 5
A 2015-01-04 8
B 2015-01-05 25
A 2015-01-06 5
A 2015-02-02 4
A 2015-02-06 6
B 2015-02-06 10
B 2015-02-07 5
…… …… ……
2. Required output report: t_access_times_accumulate
Visitor Month Monthly total Cumulative total
A 2015-01 33 33
A 2015-02 10 43
……. ……. ……. …….
B 2015-01 30 30
B 2015-02 15 45
……. ……. ……. …….
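3. From the daily table t_access_times, first compute the number of visits per month, then use the monthly totals to derive a running total, for example:
January: 30 for the month, 30 in total
February: 10 for the month, 40 in total
March: 20 for the month, 60 in total
...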
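4. Approach:
-- Create the table (note: the salary column holds the visit count)
create table t_access_times(username string,month string,salary int) row format delimited fields terminated by ',';
-- Load the data
load data local inpath '/home/hadoop/t_access_times.dat' into table t_access_times;
Raw data (the date is already truncated to the month):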
A,2015-01,5
A,2015-01,15
B,2015-01,5
A,2015-01,8
B,2015-01,25
A,2015-01,5
A,2015-02,4
A,2015-02,6
B,2015-02,10
B,2015-02,5
5. Step 1: compute each user's monthly total. sum is the built-in aggregate function.
select username,month,sum(salary) as salary from t_access_times group by username,month;
+-----------+----------+---------+
| username  |  month   | salary  |
+-----------+----------+---------+
| A         | 2015-01  | 33      |
| A         | 2015-02  | 10      |
| B         | 2015-01  | 30      |
| B         | 2015-02  | 15      |
+-----------+----------+---------+
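The sample table in section 1 lists full dates such as 2015-01-02, while the loaded file already carries the month only. If the source data did hold full dates, the monthly total could be computed roughly as below. This is a minimal sketch: the table t_access_times_daily and its columns visit_date and visits are hypothetical names introduced for illustration, and dates are assumed to be strings in yyyy-MM-dd form.

```sql
-- Hypothetical daily table; its name and columns are assumptions,
-- not part of the original example.
create table t_access_times_daily(username string, visit_date string, visits int)
row format delimited fields terminated by ',';

-- Collapse daily rows into one row per user and month.
-- substr(visit_date, 1, 7) keeps the leading 'yyyy-MM' of 'yyyy-MM-dd'.
select username,
       substr(visit_date, 1, 7) as month,
       sum(visits) as salary
from t_access_times_daily
group by username, substr(visit_date, 1, 7);
```

The result has the same shape as the Step 1 output, so the later steps would apply unchanged.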
Step 2: join the monthly-total table with itself on username.
select A.*,B.* from
(select username,month,sum(salary) as salary from t_access_times group by username,month) A
inner join
(select username,month,sum(salary) as salary from t_access_times group by username,month) B
on A.username=B.username;
±------------±---------±----------±------------±---------±----------±-+
| a.username | a.month | a.salary | b.username | b.month | b.salary |
±------------±---------±----------±------------±---------±----------±-+
| A | 2015-01 | 33 | A | 2015-01 | 33 |
| A | 2015-01 | 33 | A | 2015-02 | 10 |
| A | 2015-02 | 10 | A | 2015-01 | 33 |
| A | 2015-02 | 10 | A | 2015-02 | 10 |
| B | 2015-01 | 30 | B | 2015-01 | 30 |
| B | 2015-01 | 30 | B | 2015-02 | 15 |
| B | 2015-02 | 15 | B | 2015-01 | 30 |
| B | 2015-02 | 15 | B | 2015-02 | 15 |
±------------±---------±----------±------------±---------±----------±-+
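Step 3: take the result of the previous step
and group it by a.username and a.month.
To get the cumulative total for each month, sum every b.salary whose b.month <= a.month. Since a.salary is identical on every row of a group, max(a.salary) simply returns the monthly total.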
select A.username,A.month,max(A.salary) as salary,sum(B.salary) as accumulate
from
(select username,month,sum(salary) as salary from t_access_times group by username,month) A
inner join
(select username,month,sum(salary) as salary from t_access_times group by username,month) B
on
A.username=B.username
where B.month <= A.month
group by A.username,A.month -- group and sum
order by A.username,A.month; -- produce a globally ordered result
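As an alternative to the self-join, the same running total can be sketched with a window function, which Hive has supported since version 0.11. This is not the method described above, only a variant for comparison; the subquery alias t is introduced here.

```sql
-- Monthly totals first, then a running sum per user ordered by month.
select username,
       month,
       salary,
       sum(salary) over (partition by username
                         order by month
                         rows between unbounded preceding and current row) as accumulate
from (
  select username, month, sum(salary) as salary
  from t_access_times
  group by username, month
) t
order by username, month;
```

On the sample data this should yield the four rows of the report in section 2 (A: 33/33 and 10/43, B: 30/30 and 15/45). It avoids the self-join, whose intermediate result grows quadratically with the number of months per user.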