多维聚合函数

使用 SQL 的时候,可以用多维聚合函数,有 grouping set, cube, roll up,具体区别可以参考这里 https://webcache.googleusercon… 这个原文挂了,要看 google cahche,摘录如下

Hive分析窗口函数之GROUPING SETS,CUBE和ROLLUP
2016年04月04日 18:06:46 Free World 阅读数:2629
个人分类: Hive
所属专栏: Hive实战
版权声明:本文为博主原创文章,未经博主允许不得转载。 https://blog.csdn.net/jiangshouzhuang/article/details/51057616
这几个分析函数通常用于OLAP中,不能累加,而且需要根据不同维度上钻和下钻的指标统计。

环境信息:

Hive版本为apache-hive-0.14.0-bin
Hadoop版本为hadoop-2.6.0
Tez版本为tez-0.7.0

数据:

2016-03,2016-03-10,user1

2016-03,2016-03-10,user5

2016-03,2016-03-12,user7

2016-04,2016-04-12,user3

2016-04,2016-04-13,user2

2016-04,2016-04-13,user4

2016-04,2016-04-16,user4

2016-03,2016-03-10,user2

2016-03,2016-03-10,user3

2016-04,2016-04-12,user5

2016-04,2016-04-13,user6

2016-04,2016-04-15,user3

2016-04,2016-04-15,user2

2016-04,2016-04-16,user1

创建表:

CREATE TABLE windows_gcr (

op_month STRING,

op_day STRING,

userno STRING

) ROW FORMAT DELIMITED

FIELDS TERMINATED BY ‘,’

stored as textfile;

将数据导入Hive表中:

load data local inpath ‘/home/hadoop/testhivedata/windows_gcr.txt’ into table windows_gcr;

1. GROUPING SETS

在一个GROUP BY查询中,根据不同的维度组合进行聚合,等价于将不同维度的GROUP BY结果集进行UNION ALL

按照op_month和op_day维度进行聚合:

SELECT

op_month,

op_day,

COUNT(DISTINCT userno) AS userno_count,

GROUPING__ID

FROM windows_gcr

GROUP BY op_month,op_day

GROUPING SETS (op_month,op_day)

ORDER BY GROUPING__ID;

结果:

op_month op_day userno_count grouping__id

2016-03 NULL 5 1

2016-04 NULL 6 1

NULL 2016-03-10 4 2

NULL 2016-03-12 1 2

NULL 2016-04-12 2 2

NULL 2016-04-13 3 2

NULL 2016-04-15 2 2

NULL 2016-04-16 2 2

其中的 GROUPING__ID,表示结果属于哪一个分组集合。

等价于

SELECT op_month,NULL asop_day,COUNT(DISTINCT userno) AS userno_count,1 AS GROUPING__ID FROM windows_gcr GROUP BY op_month

UNION ALL

SELECT NULL asop_month,op_day,COUNT(DISTINCT userno) AS userno_count,2 AS GROUPING__ID FROM windows_gcr GROUP BY op_day;

再看一个例子:

SELECT

op_month,

op_day,

COUNT(DISTINCT userno) AS userno_count,

GROUPING__ID

FROM windows_gcr

GROUP BY op_month,op_day

GROUPING SETS(op_month,op_day,(op_month,op_day))

ORDER BY GROUPING__ID;

结果:

op_month op_day userno_count grouping__id

2016-04 NULL 6 1

2016-03 NULL 5 1

NULL 2016-03-12 1 2

NULL 2016-04-12 2 2

NULL 2016-04-13 3 2

NULL 2016-04-15 2 2

NULL 2016-04-16 2 2

NULL 2016-03-10 4 2

2016-03 2016-03-10 4 3

2016-03 2016-03-12 1 3

2016-04 2016-04-12 2 3

2016-04 2016-04-13 3 3

2016-04 2016-04-15 2 3

2016-04 2016-04-16 2 3

等价于

SELECT op_month,NULL asop_day,COUNT(DISTINCT userno) AS userno_count,1 AS GROUPING__ID FROM windows_gcr GROUP BY op_month

UNION ALL

SELECT NULL asop_month,op_day,COUNT(DISTINCT userno) AS userno_count,2 AS GROUPING__ID FROM windows_gcr GROUP BY op_day

UNION ALL

SELECT op_month,op_day,COUNT(DISTINCTuserno) AS userno_count,3 AS GROUPING__ID FROM windows_gcr GROUP BY op_month,op_day;

2. CUBE

根据GROUP BY的维度的所有组合进行聚合。

SELECT

op_month,

op_day,

COUNT(DISTINCT userno) AS userno_count,

GROUPING__ID

FROM windows_gcr

GROUP BY op_month,op_day

WITH CUBE

ORDER BY GROUPING__ID;

结果:

op_month op_day userno_count grouping__id

NULL NULL 7 0

2016-03 NULL 5 1

2016-04 NULL 6 1

NULL 2016-04-12 2 2

NULL 2016-04-13 3 2

NULL 2016-04-15 2 2

NULL 2016-04-16 2 2

NULL 2016-03-10 4 2

NULL 2016-03-12 1 2

2016-03 2016-03-10 4 3

2016-03 2016-03-12 1 3

2016-04 2016-04-16 2 3

2016-04 2016-04-12 2 3

2016-04 2016-04-13 3 3

2016-04 2016-04-15 2 3

等价于

SELECT NULL as op_month,NULL asop_day,COUNT(DISTINCT userno) AS userno_count,0 AS GROUPING__ID FROM windows_gcr

UNION ALL

SELECT op_month,NULL asop_day,COUNT(DISTINCT userno) AS userno_count,1 AS GROUPING__ID FROM windows_gcr GROUP BY op_month

UNION ALL

SELECT NULL asop_month,op_day,COUNT(DISTINCT userno) AS userno_count,2 AS GROUPING__ID FROM windows_gcr GROUP BY op_day

UNION ALL

SELECT op_month,op_day,COUNT(DISTINCTuserno) AS userno_count,3 AS GROUPING__ID FROM windows_gcr GROUP BY op_month,op_day

3. ROLLUP

是CUBE的子集,以最左侧的维度为主,从该维度进行层级聚合。

比如,以op_month维度进行层级聚合:

SELECT

op_month,

op_day,

COUNT(DISTINCT userno) AS userno_count,

GROUPING__ID

FROM windows_gcr

GROUP BY op_month,op_day

WITH ROLLUP

ORDER BY GROUPING__ID;

结果:

op_month op_day userno_count grouping__id

NULL NULL 7 0

2016-03 NULL 5 1

2016-04 NULL 6 1

2016-03 2016-03-10 4 3

2016-03 2016-03-12 1 3

2016-04 2016-04-12 2 3

2016-04 2016-04-13 3 3

2016-04 2016-04-15 2 3

2016-04 2016-04-16 2 3

可以实现这样的上钻过程:

月天的用户数->月的用户数->总用户数

–把op_month和op_day调换顺序,则以op_day维度进行层级聚合:

SELECT

op_month,

op_day,

COUNT(DISTINCT userno) AS userno_count,

GROUPING__ID

FROM windows_gcr

GROUP BY op_day,op_month

WITH ROLLUP

ORDER BY GROUPING__ID;

结果:

op_month op_day userno_count grouping__id

NULL NULL 7 0

NULL 2016-04-13 3 1

NULL 2016-03-12 1 1

NULL 2016-04-15 2 1

NULL 2016-03-10 4 1

NULL 2016-04-16 2 1

NULL 2016-04-12 2 1

2016-04 2016-04-12 2 3

2016-03 2016-03-10 4 3

2016-03 2016-03-12 1 3

2016-04 2016-04-13 3 3

2016-04 2016-04-15 2 3

2016-04 2016-04-16 2 3

可以实现这样的上钻过程:

天月的UV->天的UV->总UV

具体可以访问Hive官网:

https://cwiki.apache.org/confl…

另外还有这篇 https://blog.csdn.net/moon_yan…

GROUPING SETS
GROUPING SETS作为GROUP BY的子句,允许开发人员在GROUP BY语句后面指定多个统计选项,可以简单理解为多条group by语句通过union all把查询结果聚合起来结合起来,下面是几个实例可以帮助我们了解,

以acorn_3g.test_xinyan_reg为例:

[dp@YZSJHL19-87 xjob]$ hive -e “use acorn_3g;desc test_xinyan_reg;”
user_id                 bigint                  None                
device_id               int                     None   手机,平板             
os_id                   int                     None   操作系统类型             
app_id                  int                     None   手机app_id             
client_version          string                  None   客户端版本             
from_id                 int                     None  四级渠道

几个demo帮助大家了解:

grouping sets语句 等价hive语句
select device_id,os_id,app_id,count(user_id) from  test_xinyan_reg group by device_id,os_id,app_id grouping sets((device_id)) 
SELECT device_id,null,null,count(user_id) FROM test_xinyan_reg group by device_id
select device_id,os_id,app_id,count(user_id) from  test_xinyan_reg group by device_id,os_id,app_id grouping sets((device_id,os_id)) SELECT device_id,os_id,null,count(user_id) FROM test_xinyan_reg group by device_id,os_id
select device_id,os_id,app_id,count(user_id) from  test_xinyan_reg group by device_id,os_id,app_id grouping sets((device_id,os_id),(device_id)) SELECT device_id,os_id,null,count(user_id) FROM test_xinyan_reg group by device_id,os_id 
UNION ALL 
SELECT device_id,null,null,count(user_id) FROM test_xinyan_reg group by device_id
select device_id,os_id,app_id,count(user_id) from  test_xinyan_reg group by device_id,os_id,app_id grouping sets((device_id),(os_id),(device_id,os_id),()) SELECT device_id,null,null,count(user_id) FROM test_xinyan_reg group by device_id 
UNION ALL 
SELECT null,os_id,null,count(user_id) FROM test_xinyan_reg group by os_id 
UNION ALL 
SELECT device_id,os_id,null,count(user_id) FROM test_xinyan_reg group by device_id,os_id  
UNION ALL 
SELECT null,null,null,count(user_id) FROM test_xinyan_reg
CUBE函数
cube简称数据魔方,可以实现hive多个任意维度的查询,cube(a,b,c)则首先会对(a,b,c)进行group by,然后依次是(a,b),(a,c),(a),(b,c),(b),(c),最后在对全表进行group by,他会统计所选列中值的所有组合的聚合

select device_id,os_id,app_id,client_version,from_id,count(user_id) 
from test_xinyan_reg 
group by device_id,os_id,app_id,client_version,from_id with cube;
手工实现需要写的hql语句(写个程序自己生成的,手写累死):

SELECT device_id,null,null,null,null ,count(user_id) FROM test_xinyan_reg group by device_id
UNION ALL
SELECT null,os_id,null,null,null ,count(user_id) FROM test_xinyan_reg group by os_id
UNION ALL
SELECT device_id,os_id,null,null,null ,count(user_id) FROM test_xinyan_reg group by device_id,os_id
UNION ALL
SELECT null,null,app_id,null,null ,count(user_id) FROM test_xinyan_reg group by app_id
UNION ALL
SELECT device_id,null,app_id,null,null ,count(user_id) FROM test_xinyan_reg group by device_id,app_id
UNION ALL
SELECT null,os_id,app_id,null,null ,count(user_id) FROM test_xinyan_reg group by os_id,app_id
UNION ALL
SELECT device_id,os_id,app_id,null,null ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,app_id
UNION ALL
SELECT null,null,null,client_version,null ,count(user_id) FROM test_xinyan_reg group by client_version
UNION ALL
SELECT device_id,null,null,client_version,null ,count(user_id) FROM test_xinyan_reg group by device_id,client_version
UNION ALL
SELECT null,os_id,null,client_version,null ,count(user_id) FROM test_xinyan_reg group by os_id,client_version
UNION ALL
SELECT device_id,os_id,null,client_version,null ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,client_version
UNION ALL
SELECT null,null,app_id,client_version,null ,count(user_id) FROM test_xinyan_reg group by app_id,client_version
UNION ALL
SELECT device_id,null,app_id,client_version,null ,count(user_id) FROM test_xinyan_reg group by device_id,app_id,client_version
UNION ALL
SELECT null,os_id,app_id,client_version,null ,count(user_id) FROM test_xinyan_reg group by os_id,app_id,client_version
UNION ALL
SELECT device_id,os_id,app_id,client_version,null ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,app_id,client_version
UNION ALL
SELECT null,null,null,null,from_id ,count(user_id) FROM test_xinyan_reg group by from_id
UNION ALL
SELECT device_id,null,null,null,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,from_id
UNION ALL
SELECT null,os_id,null,null,from_id ,count(user_id) FROM test_xinyan_reg group by os_id,from_id
UNION ALL
SELECT device_id,os_id,null,null,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,from_id
UNION ALL
SELECT null,null,app_id,null,from_id ,count(user_id) FROM test_xinyan_reg group by app_id,from_id
UNION ALL
SELECT device_id,null,app_id,null,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,app_id,from_id
UNION ALL
SELECT null,os_id,app_id,null,from_id ,count(user_id) FROM test_xinyan_reg group by os_id,app_id,from_id
UNION ALL
SELECT device_id,os_id,app_id,null,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,app_id,from_id
UNION ALL
SELECT null,null,null,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by client_version,from_id
UNION ALL
SELECT device_id,null,null,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,client_version,from_id
UNION ALL
SELECT null,os_id,null,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by os_id,client_version,from_id
UNION ALL
SELECT device_id,os_id,null,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,client_version,from_id
UNION ALL
SELECT null,null,app_id,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by app_id,client_version,from_id
UNION ALL
SELECT device_id,null,app_id,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,app_id,client_version,from_id
UNION ALL
SELECT null,os_id,app_id,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by os_id,app_id,client_version,from_id
UNION ALL
SELECT device_id,os_id,app_id,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,app_id,client_version,from_id
UNION ALL
SELECT null,null,null,null,null ,count(user_id) FROM test_xinyan_reg
看着很蛋疼是不是,体会到cube的强大了吗!(低版本hive可以通过union all方式解决,算是没有办法的办法)

ROLL UP函数
rollup可以实现从右到做递减多级的统计,显示统计某一层次结构的聚合。

select device_id,os_id,app_id,client_version,from_id,count(user_id) 
from test_xinyan_reg 
group by device_id,os_id,app_id,client_version,from_id with rollup;
等价以下sql语句:

select device_id,os_id,app_id,client_version,from_id,count(user_id) 
from test_xinyan_reg 
group by device_id,os_id,app_id,client_version,from_id 
grouping sets ((device_id,os_id,app_id,client_version,from_id),(device_id,os_id,app_id,client_version),(device_id,os_id,app_id),(device_id,os_id),(device_id),());
Grouping_ID函数
当我们没有统计某一列时,它的值显示为null,这可能与列本身就有null值冲突,这就需要一种方法区分是没有统计还是值本来就是null。(写一个排列组合的算法,就马上理解了,grouping_id其实就是所统计各列二进制和)

直接拿官方文档一个例子,O(∩_∩)O哈哈~

Column1 (key) Column2 (value)
1 NULL
1
1
2
2
3
3
3
NULL
4
5
hql统计:

 SELECT key, value, GROUPING__ID, count(*) from T1 GROUP BY key, value WITH ROLLUP
统计结果如下:

       
NULL NULL 0     00 6
1 NULL 1     10 2
1 NULL 3     11 1
1 1 3     11 1
2 NULL 1     10 1
2 2 3     11 1
3 NULL 1     10 2
3 NULL 3     11 1
3 3 3     11 1
4 NULL 1     10 1
4 5 3     11 1
GROUPING__ID转变为二进制,如果对应位上有值为null,说明这列本身值就是null。(通过类DataFilterNull.py 扫描,可以筛选过滤掉列中null、“”统计结果),
———————
作者:扫大街的程序员
来源:CSDN
原文:https://blog.csdn.net/moon_yang_bj/article/details/17200367
版权声明:本文为博主原创文章,转载请附上博文链接!

其实简单总结,就是 cube 最全,grouping set 可以自定义,而 rollup 就是逐级上卷

Leave a Reply

Your email address will not be published. Required fields are marked *