如何使用Data Lake Analytics + OSS分析CSV格式的TPC-H数据集
如何使用Data Lake Analytics + OSS分析CSV格式的TPC-H数据集
如何使用Data Lake Analytics + OSS分析CSV格式的TPC-H数据集,很多新手对此不是很清楚,为了帮助大家解决这个难题,下面小编将为大家详细讲解,有这方面需求的人可以来学习下,希望你能有所收获。
1. 开通Data Lake Analytics与OSS服务
如果您已经开通,可以跳过该步骤。如果没有开通,可以参考:https://help.aliyun.com/document_detail/70386.html
进行产品开通服务申请。
2. 下载TPC-H测试数据集
可以从这下载TPC-H 100MB的数据集:
https://public-datasets-cn-hangzhou.oss-cn-hangzhou.aliyuncs.com/tpch_100m_data.zip
3. 上传数据文件到OSS
登录阿里云官网的OSS控制台:https://oss.console.aliyun.com/overview
规划您要使用的OSS bucket,创建或选择好后,点击“文件管理”,因为有8个数据文件,为每个数据文件创建对应的文件目录:
创建好8个目录如下:
点击进入目录,上传相应的数据文件,例如,customer目录,则上传customer.tbl文件。
上传好后,如下图。然后,依次把其他7个数据文件也上传到对应的目录下。
至此,8个数据文件都上传到了您的OSS bucket中:
oss://xxx/tpch_100m/customer/customer.tbloss://xxx/tpch_100m/lineitem/lineitem.tbloss://xxx/tpch_100m/nation/nation.tbloss://xxx/tpch_100m/orders/orders.tbloss://xxx/tpch_100m/part/part.tbloss://xxx/tpch_100m/partsupp/partsupp.tbloss://xxx/tpch_100m/region/region.tbloss://xxx/tpch_100m/supplier/supplier.tbl
4. 登录Data Lake Analytics控制台
https://openanalytics.console.aliyun.com/
点击“登录数据库”,输入开通服务时分配的用户名和密码,登录Data Lake Analytics控制台。
5. 创建Schema和Table
输入创建SCHEMA的语句,点击“同步执行”。
CREATESCHEMAtpch_100mwithDBPROPERTIES(LOCATION='oss://test-bucket-julian-1/tpch_100m/',catalog='oss');
(注意:目前在同一个阿里云region,Data Lake Analytics的schema名全局唯一,建议schema名尽量根据业务定义,已有重名schema,在创建时会提示报错,则请换一个schema名字。)
Schema创建好后,在“数据库”的下拉框中,选择刚刚创建的schema。然后在SQL文本框中输入建表语句,点击同步执行。
建表语句语法参考:https://help.aliyun.com/document_detail/72006.html
TPC-H对应的8个表的建表语句如下,分别贴入文档框中执行(LOCATION子句中的数据文件位置请根据您的实际OSS bucket目录相应修改)。(注意:目前控制台中还不支持多个SQL语句执行,请单条语句执行。)
CREATEEXTERNALTABLEnation(N_NATIONKEYINT,N_NAMESTRING,N_IDSTRING,N_REGIONKEYINT,N_COMMENTSTRING)ROWFORMATDELIMITEDFIELDSTERMINATEDBY'|'STOREDASTEXTFILELOCATION'oss://test-bucket-julian-1/tpch_100m/nation';CREATEEXTERNALTABLElineitem(L_ORDERKEYINT,L_PARTKEYINT,L_SUPPKEYINT,L_LINENUMBERINT,L_QUANTITYDOUBLE,L_EXTENDEDPRICEDOUBLE,L_DISCOUNTDOUBLE,L_TAXDOUBLE,L_RETURNFLAGSTRING,L_LINESTATUSSTRING,L_SHIPDATEDATE,L_COMMITDATEDATE,L_RECEIPTDATEDATE,L_SHIPINSTRUCTSTRING,L_SHIPMODESTRING,L_COMMENTSTRING)ROWFORMATDELIMITEDFIELDSTERMINATEDBY'|'STOREDASTEXTFILELOCATION'oss://test-bucket-julian-1/tpch_100m/lineitem';CREATEEXTERNALTABLEorders(O_ORDERKEYINT,O_CUSTKEYINT,O_ORDERSTATUSSTRING,O_TOTALPRICEDOUBLE,O_ORDERDATEDATE,O_ORDERPRIORITYSTRING,O_CLERKSTRING,O_SHIPPRIORITYINT,O_COMMENTSTRING)ROWFORMATDELIMITEDFIELDSTERMINATEDBY'|'STOREDASTEXTFILELOCATION'oss://test-bucket-julian-1/tpch_100m/orders';CREATEEXTERNALTABLEsupplier(S_SUPPKEYINT,S_NAMESTRING,S_ADDRESSSTRING,S_NATIONKEYINT,S_PHONESTRING,S_ACCTBALDOUBLE,S_COMMENTSTRING)ROWFORMATDELIMITEDFIELDSTERMINATEDBY'|'STOREDASTEXTFILELOCATION'oss://test-bucket-julian-1/tpch_100m/supplier';CREATEEXTERNALTABLEpartsupp(PS_PARTKEYINT,PS_SUPPKEYINT,PS_AVAILQTYINT,PS_SUPPLYCOSTDOUBLE,PS_COMMENTSTRING)ROWFORMATDELIMITEDFIELDSTERMINATEDBY'|'STOREDASTEXTFILELOCATION'oss://test-bucket-julian-1/tpch_100m/partsupp';CREATEEXTERNALTABLEcustomer(C_CUSTKEYINT,C_NAMESTRING,C_ADDRESSSTRING,C_NATIONKEYINT,C_PHONESTRING,C_ACCTBALDOUBLE,C_MKTSEGMENTSTRING,C_COMMENTSTRING)ROWFORMATDELIMITEDFIELDSTERMINATEDBY'|'STOREDASTEXTFILELOCATION'oss://test-bucket-julian-1/tpch_100m/customer';CREATEEXTERNALTABLEpart(P_PARTKEYINT,P_NAMESTRING,P_MFGRSTRING,P_BRANDSTRING,P_TYPESTRING,P_SIZEINT,P_CONTAINERSTRING,P_RETAILPRICEDOUBLE,P_COMMENTSTRING)ROWFORMATDELIMITEDFIELDSTERMINATEDBY'|'STOREDASTEXTFILELOCATION'oss://test-bucket-julian-1/tpch_100m/part';CREATEEXTERNALTABLEregion(R_REGIONKEYINT,R_NAMESTRING,R_COMMENTSTRING)ROWFORMATDELIMITEDFIELDSTERMINATEDBY'|'STOREDASTEXTFILELOCATION'oss://test-bucket-julian-1/tpch_100m/region';
查看更多
建表完毕后,刷新页面,在左边导航条中能看到schema下的8张表。
6. 执行TPC-H查询
TPC-H总共22条查询,如下:
Q1:
SELECTl_returnflag,l_linestatus,Sum(l_quantity)ASsum_qty,Sum(l_extendedprice)ASsum_base_price,Sum(l_extendedprice*(1-l_discount))ASsum_disc_price,Sum(l_extendedprice*(1-l_discount)*(1+l_tax))ASsum_charge,Avg(l_quantity)ASavg_qty,Avg(l_extendedprice)ASavg_price,Avg(l_discount)ASavg_disc,Count(*)AScount_orderFROMlineitemWHEREl_shipdate<=date'1998-12-01'-INTERVAL'93'dayGROUPBYl_returnflag,l_linestatusORDERBYl_returnflag,l_linestatusLIMIT1;
查看更多
Q2:
SELECTs_acctbal,s_name,n_name,p_partkey,p_mfgr,s_address,s_phone,s_commentFROMpart,supplier,partsupp,nation,regionWHEREp_partkey=ps_partkeyANDs_suppkey=ps_suppkeyANDp_size=35ANDp_typeLIKE'%NICKEL'ANDs_nationkey=n_nationkeyANDn_regionkey=r_regionkeyANDr_name='MIDDLEEAST'
查看更多
Q3:
SELECTl_orderkey,Sum(l_extendedprice*(1-l_discount))ASrevenue,o_orderdate,o_shippriorityFROMcustomer,orders,lineitemWHEREc_mktsegment='AUTOMOBILE'ANDc_custkey=o_custkeyANDl_orderkey=o_orderkeyANDo_orderdate<date'1995-03-31'ANDl_shipdate>date'1995-03-31'GROUPBYl_orderkey,o_orderdate,o_shippriorityORDERBYrevenueDESC,o_orderdateLIMIT10;
Q4:
SELECTo_orderpriority,Count(*)ASorder_countFROMorders,lineitemWHEREo_orderdate>=date'1997-10-01'ANDo_orderdate<date'1997-10-01'+INTERVAL'3'monthANDl_orderkey=o_orderkeyANDl_commitdate<l_receiptdateGROUPBYo_orderpriorityORDERBYo_orderpriorityLIMIT1;
Q5:
SELECTn_name,Sum(l_extendedprice*(1-l_discount))ASrevenueFROMcustomer,orders,lineitem,supplier,nation,regionWHEREc_custkey=o_custkeyANDl_orderkey=o_orderkeyANDl_suppkey=s_suppkeyANDc_nationkey=s_nationkeyANDs_nationkey=n_nationkeyANDn_regionkey=r_regionkeyANDr_name='ASIA'ANDo_orderdate>=date'1995-01-01'ANDo_orderdate<date'1995-01-01'+INTERVAL'1'yearGROUPBYn_nameORDERBYrevenueDESCLIMIT1;
Q6:
SELECTsum(l_extendedprice*l_discount)ASrevenueFROMlineitemWHEREl_shipdate>=date'1995-01-01'ANDl_shipdate<date'1995-01-01'+interval'1'yearANDl_discountbetween0.04-0.01AND0.04+0.01ANDl_quantity<24LIMIT1;
Q7:
SELECTsupp_nation,cust_nation,l_year,Sum(volume)ASrevenueFROM(SELECTn1.n_nameASsupp_nation,n2.n_nameAScust_nation,Extract(yearFROMl_shipdate)ASl_year,l_extendedprice*(1-l_discount)ASvolumeFROMsupplier,lineitem,orders,customer,nationn1,nationn2WHEREs_suppkey=l_suppkeyANDo_orderkey=l_orderkeyANDc_custkey=o_custkeyANDs_nationkey=n1.n_nationkeyANDc_nationkey=n2.n_nationkeyAND((n1.n_name='GERMANY'ANDn2.n_name='INDIA')OR(n1.n_name='INDIA'ANDn2.n_name='GERMANY'))ANDl_shipdateBETWEENdate'1995-01-01'ANDdate'1996-12-31')ASshippingGROUPBYsupp_nation,cust_nation,l_yearORDERBYsupp_nation,cust_nation,l_yearLIMIT1;
查看更多
Q8:
SELECTo_year,Sum(CASEWHENnation='INDIA'THENvolumeELSE0end)/Sum(volume)ASmkt_shareFROM(SELECTExtract(yearFROMo_orderdate)ASo_year,l_extendedprice*(1-l_discount)ASvolume,n2.n_nameASnationFROMpart,supplier,lineitem,orders,customer,nationn1,nationn2,regionWHEREp_partkey=l_partkeyANDs_suppkey=l_suppkeyANDl_orderkey=o_orderkeyANDo_custkey=c_custkeyANDc_nationkey=n1.n_nationkeyANDn1.n_regionkey=r_regionkeyANDr_name='ASIA'ANDs_nationkey=n2.n_nationkeyANDo_orderdateBETWEENdate'1995-01-01'ANDdate'1996-12-31'ANDp_type='STANDARDANODIZEDSTEEL')ASall_nationsGROUPBYo_yearORDERBYo_yearLIMIT1;
查看更多
Q9:
SELECTnation,o_year,Sum(amount)ASsum_profitFROM(SELECTn_nameASnation,Extract(yearFROMo_orderdate)ASo_year,l_extendedprice*(1-l_discount)-ps_supplycost*l_quantityASamountFROMpart,supplier,lineitem,partsupp,orders,nationWHEREs_suppkey=l_suppkeyANDps_suppkey=l_suppkeyANDps_partkey=l_partkeyANDp_partkey=l_partkeyANDo_orderkey=l_orderkeyANDs_nationkey=n_nationkeyANDp_nameLIKE'%aquamarine%')ASprofitGROUPBYnation,o_yearORDERBYnation,o_yearDESCLIMIT1;
查看更多
Q10:
SELECTc_custkey,c_name,Sum(l_extendedprice*(1-l_discount))ASrevenue,c_acctbal,n_name,c_address,c_phone,c_commentFROMcustomer,orders,lineitem,nationWHEREc_custkey=o_custkeyANDl_orderkey=o_orderkeyANDo_orderdate>=date'1994-08-01'ANDo_orderdate<date'1994-08-01'+INTERVAL'3'monthANDl_returnflag='R'ANDc_nationkey=n_nationkeyGROUPBYc_custkey,c_name,c_acctbal,c_phone,n_name,c_address,c_commentORDERBYrevenueDESCLIMIT20;
查看更多
Q11:
SELECTps_partkey,Sum(ps_supplycost*ps_availqty)ASvalueFROMpartsupp,supplier,nationWHEREps_suppkey=s_suppkeyANDs_nationkey=n_nationkeyANDn_name='PERU'GROUPBYps_partkeyHAVINGSum(ps_supplycost*ps_availqty)>(SELECTSum(ps_supplycost*ps_availqty)*0.0001000000assum_valueFROMpartsupp,supplier,nationWHEREps_suppkey=s_suppkeyANDs_nationkey=n_nationkeyANDn_name='PERU')ORDERBYvalueDESCLIMIT1;
Q12:
SELECTl_shipmode,sum(casewheno_orderpriority='1-URGENT'oro_orderpriority='2-HIGH'then1else0end)AShigh_line_count,sum(casewheno_orderpriority<>'1-URGENT'ando_orderpriority<>'2-HIGH'then1else0end)ASlow_line_countFROMorders,lineitemWHEREo_orderkey=l_orderkeyANDl_shipmodein('MAIL','TRUCK')ANDl_commitdate<l_receiptdateANDl_shipdate<l_commitdateANDl_receiptdate>=date'1996-01-01'ANDl_receiptdate<date'1996-01-01'+interval'1'yearGROUPBYl_shipmodeORDERBYl_shipmodeLIMIT1;
Q13:
SELECTc_count,count(*)AScustdistFROM(SELECTc_custkey,count(o_orderkey)ASc_countFROMcustomer,ordersWHEREc_custkey=o_custkeyANDo_commentNOTLIKE'%pending%accounts%'GROUPBYc_custkey)ASc_ordersGROUPBYc_countORDERBYcustdistDESC,c_countDESCLIMIT1;
Q14:
SELECT100.00*sum(casewhenp_typelike'PROMO%'thenl_extendedprice*(1-l_discount)else0end)/sum(l_extendedprice*(1-l_discount))ASpromo_revenueFROMlineitem,partWHEREl_partkey=p_partkeyANDl_shipdate>=date'1996-01-01'ANDl_shipdate<date'1996-01-01'+interval'1'monthLIMIT1;
Q15:
WITHrevenue0AS(SELECTl_suppkeyASsupplier_no,sum(l_extendedprice*(1-l_discount))AStotal_revenueFROMlineitemWHEREl_shipdate>=date'1993-01-01'ANDl_shipdate<date'1993-01-01'+interval'3'monthGROUPBYl_suppkey)SELECTs_suppkey,s_name,s_address,s_phone,total_revenueFROMsupplier,revenue0WHEREs_suppkey=supplier_noANDtotal_revenueIN(SELECTmax(total_revenue)FROMrevenue0)ORDERBYs_suppkey;
Q16:
SELECTp_brand,p_type,p_size,count(distinctps_suppkey)ASsupplier_cntFROMpartsupp,partWHEREp_partkey=ps_partkeyANDp_brand<>'Brand#23'ANDp_typeNOTLIKE'PROMOBURNISHED%'ANDp_sizeIN(1,13,10,28,21,35,31,11)ANDps_suppkeyNOTIN(SELECTs_suppkeyFROMsupplierWHEREs_commentLIKE'%Customer%Complaints%')GROUPBYp_brand,p_type,p_sizeORDERBYsupplier_cntDESC,p_brand,p_type,p_sizeLIMIT1;
Q17:
SELECTsum(l_extendedprice)/7.0ASavg_yearlyFROMlineitem,partWHEREp_partkey=l_partkeyANDp_brand='Brand#44'ANDp_container='WRAPPKG'ANDl_quantity<(SELECT0.2*avg(l_quantity)FROMlineitem,partWHEREl_partkey=p_partkey);
Q18:
SELECTc_name,c_custkey,o_orderkey,o_orderdate,o_totalprice,sum(l_quantity)FROMcustomer,orders,lineitemWHEREo_orderkeyIN(SELECTl_orderkeyFROMlineitemGROUPBYl_orderkeyHAVINGsum(l_quantity)>315)ANDc_custkey=o_custkeyANDo_orderkey=l_orderkeyGROUPBYc_name,c_custkey,o_orderkey,o_orderdate,o_totalpriceORDERBYo_totalpriceDESC,o_orderdateLIMIT100;
Q19:
SELECTsum(l_extendedprice*(1-l_discount))ASrevenueFROMlineitem,partWHERE(p_partkey=l_partkeyandp_brand='Brand#12'andp_containerin('SMCASE','SMBOX','SMPACK','SMPKG')andl_quantity>=6andl_quantity<=6+10andp_sizebetween1and5andl_shipmodein('AIR','AIRREG')andl_shipinstruct='DELIVERINPERSON')or(p_partkey=l_partkeyandp_brand='Brand#13'andp_containerin('MEDBAG','MEDBOX','MEDPKG','MEDPACK')andl_quantity>=10andl_quantity<=10+10andp_sizebetween1and10andl_shipmodein('AIR','AIRREG')andl_shipinstruct='DELIVERINPERSON')or(p_partkey=l_partkeyandp_brand='Brand#24'andp_containerin('LGCASE','LGBOX','LGPACK','LGPKG')andl_quantity>=21andl_quantity<=21+10andp_sizebetween1and15andl_shipmodein('AIR','AIRREG')andl_shipinstruct='DELIVERINPERSON')LIMIT1;
查看更多
Q20:
withtemp_tableas(select0.5*sum(l_quantity)ascol1fromlineitem,partsuppwherel_partkey=ps_partkeyandl_suppkey=ps_suppkeyandl_shipdate>=date'1993-01-01'andl_shipdate<date'1993-01-01'+interval'1'year)selects_name,s_addressfromsupplier,nationwheres_suppkeyin(selectps_suppkeyfrompartsupp,temp_tablewhereps_partkeyin(selectp_partkeyfrompartwherep_namelike'dark%')andps_availqty>temp_table.col1)ands_nationkey=n_nationkeyandn_name='JORDAN'orderbys_namelimit1;
Q21:
selects_name,count(*)asnumwaitfromsupplier,lineiteml1,orders,nationwheres_suppkey=l1.l_suppkeyando_orderkey=l1.l_orderkeyando_orderstatus='F'andl1.l_receiptdate>l1.l_commitdateandexists(select*fromlineiteml2wherel2.l_orderkey=l1.l_orderkeyandl2.l_suppkey<>l1.l_suppkey)andnotexists(select*fromlineiteml3wherel3.l_orderkey=l1.l_orderkeyandl3.l_suppkey<>l1.l_suppkeyandl3.l_receiptdate>l3.l_commitdate)ands_nationkey=n_nationkeyandn_name='SAUDIARABIA'groupbys_nameorderbynumwaitdesc,s_namelimit100;
查看更多
Q22:
withtemp_table_1as(selectavg(c_acctbal)asavg_valuefromcustomerwherec_acctbal>0.00andsubstring(c_phonefrom1for2)in('33','29','37','35','25','27','43')),temp_table_2as(selectcount(*)ascount1fromorders,customerwhereo_custkey=c_custkey)selectcntrycode,count(*)asnumcust,sum(c_acctbal)astotacctbalfrom(selectsubstring(c_phonefrom1for2)ascntrycode,c_acctbalfromcustomer,temp_table_1,temp_table_2wheresubstring(c_phonefrom1for2)in('33','29','37','35','25','27','43')andc_acctbal>temp_table_1.avg_valueandtemp_table_2.count1=0)ascustsalegroupbycntrycodeorderbycntrycodelimit1;
7. 异步执行查询
Data Lake Analytics支持“同步执行”模式和“异步执行”模式。“同步执行”模式下,控制台界面等待执行结果返回;“异步执行”模式下,立刻返回查询任务的ID。
点击“执行状态”,可以看到该异步查询任务的执行状态,主要分为:“RUNNING”,“SUCCESS”,“FAILURE”。
点击“刷新”,当STATUS变为“SUCCESS”时,表示查询成功,同时可查看查询耗时“ELAPSE_TIME”和查询扫描的数据字节数“SCANNED_DATA_BYTES”。
8. 查看查询历史
点击“执行历史”,可以看到您执行的查询的历史详细信息,包括:
1)查询语句;
2)查询耗时与执行具体时间;
3)查询结果返回行数;
4)查询状态;
5)查询扫描的字节数;
6)结果集回写到的目标OSS文件(Data Lake Analytics会将查询结果集保存用户的bucket中)。
查询结果文件自动上传到用户同region的OSS bucket中,其中包括结果数据文件和结果集元数据描述文件。
{QueryLocation}/{query_name}|Unsaved}/{yyyy}/{mm}/{dd}/{query_id}/xxx.csv{QueryLocation}/{query_name}|Unsaved}/{yyyy}/{mm}/{dd}/{query_id}/xxx.csv.metadata
其中QueryLocation为:
aliyun-oa-query-results-<your_account_id>-<oss_region>
看完上述内容是否对您有帮助呢?如果还想对相关知识有进一步的了解或阅读更多相关文章,请关注恰卡编程网行业资讯频道,感谢您对恰卡编程网的支持。