Creating Collect Set using Hive Analytical Functions

Hi,

I got a use case to score some claim related records based on the flag information they get through some processing logic.

The Input is below:

CLAIM1 HAS 1 EXPOSURE WHICH FALL INTO MULTIPLE FLAGS. HERE E1 IS SET WITH ALL FLAGS.
CLAIM2 HAS 1 EXPOSURE WHICH FALL INTO FLAG1
CLAIM3 HAS 2 EXPOSURES WHICH FALL INTO FLAG1,FLAG2

Input:

c1|p1|e1|1|0|0|0|0|giri
c1|p1|e1|0|1|0|0|0|giri
c1|p1|e1|0|0|1|0|0|giri
c1|p1|e1|0|0|0|1|0|giri
c1|p1|e1|0|0|0|0|1|giri
c2|p1|e1|1|0|0|0|0|puppa
c3|p3|e1|1|0|0|0|0|pappa
c3|p3|e2|0|1|0|0|0|pappa

Step1:

Create a Hive table having these input records.

create table hive_aggr
(claimnumber string,policynumber string,exposurenumber int,flag1 string,flag2 string,flag3 string,flag4 string,flag5 string,name string)
row format delimited
fields termintated by ‘|’
stored as textfile;

Step 2:

load data local inpath ‘/opt/md/giri/hive_aggr/’ overwrite into table hive_aggr;

Step 3:

Use the Collect_set udf to get all the flags in an array and use array_contains udf to filter the ‘1’ and ‘0’ flags.

select
s.claimnumber,
s.policynumber,
s.exposurenumber,
s.flag1,
s.flag2,
s.flag3,
s.flag4,
s.flag5,
s.name,
CAST(((s.flag1*10)+(s.flag2*20)+(s.flag3*30)+(s.flag4*40)+(s.flag5*50)) as INT) as totalsiupoints from
(select distinct A.claimnumber,
B.policynumber,
B.exposurenumber,
case when array_contains(B.flag1set,”1″)=true then ‘1’ else ‘0’ end as flag1,
case when array_contains(B.flag2set,”1″)=true then ‘1’ else ‘0’ end as flag2,
case when array_contains(B.flag3set,”1″)=true then ‘1’ else ‘0’ end as flag3,
case when array_contains(B.flag4set,”1″)=true then ‘1’ else ‘0’ end as flag4,
case when array_contains(B.flag5set,”1″)=true then ‘1’ else ‘0’ end as flag5,
A.name
from hive_aggr A
join
(select claimnumber,policynumber,exposurenumber,collect_set(flag1) as flag1set,collect_set(flag2) as flag2set,collect_set(flag3) as flag3set,collect_set(flag4) as flag4set,collect_set(flag5) as flag5set from hive_aggr group by claimnumber,policynumber,exposurenumber)B
on A.claimnumber=B.claimnumber)s

output:
——-
c1 p1 e1 1 1 1 1 1 giri 150
c2 p1 e1 1 0 0 0 0 puppa 10
c3 p3 e1 1 0 0 0 0 pappa 10
c3 p3 e2 0 1 0 0 0 pappa 20

Basically If you see the Claims, Policy, ExposureNumbers are grouped and given with an aggregated flag records and it;s corresponding score.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s