Data Lake Analytics + OSS: How Are the Different Data File Formats Handled?
0. Preface
Data Lake Analytics is a serverless, interactive query and analytics service in the cloud. Using standard SQL statements, users can directly query and analyze data stored on OSS and TableStore, without moving the data.

Besides plain text files (e.g., CSV and TSV), Data Lake Analytics can also query and analyze data files of other formats stored on OSS, including ORC, PARQUET, JSON, RCFILE, and AVRO, even geographic JSON data following the Esri specification, as well as files that can be matched with regular expressions.

This article explains in detail how to use Data Lake Analytics (DLA below) to analyze files on OSS according to their format. DLA ships with built-in implementations of various SerDes (short for Serializer/Deserializer, used to serialize and deserialize file data), so users do not need to write their own programs: one or more of DLA's SerDes should match the format of your data files on OSS.
1. Storage Formats and SerDes
You create a table according to the data files stored on OSS, specifying the file format with STORED AS. For example,
CREATE EXTERNAL TABLE nation (
    N_NATIONKEY INT,
    N_NAME STRING,
    N_REGIONKEY INT,
    N_COMMENT STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 'oss://test-bucket-julian-1/tpch_100m/nation';
Once the table has been created, you can view the original CREATE TABLE statement with SHOW CREATE TABLE:
mysql> show create table nation;
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Result                                                                                                                                                                                       |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| CREATE EXTERNAL TABLE `nation` (`n_nationkey` int, `n_name` string, `n_regionkey` int, `n_comment` string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS `TEXTFILE` LOCATION 'oss://test-bucket-julian-1/tpch_100m/nation' |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (1.81 sec)
The table below lists the file formats DLA currently supports. When creating a table for files in one of these formats, you can use STORED AS directly and DLA will choose a suitable SERDE/INPUTFORMAT/OUTPUTFORMAT.
| Storage format | Description |
|---|---|
| STORED AS TEXTFILE | The data file is stored as plain text. This is the default format. Each line of the file corresponds to one record in the table. |
| STORED AS ORC | The data file is stored in ORC format. |
| STORED AS PARQUET | The data file is stored in PARQUET format. |
| STORED AS RCFILE | The data file is stored in RCFILE format. |
| STORED AS AVRO | The data file is stored in AVRO format. |
| STORED AS JSON | The data file is stored in JSON format (except geographic JSON data files for Esri ArcGIS). |
Along with STORED AS, you can also specify a SerDe (used to parse the data file and map it onto the DLA table), a special column delimiter, and so on, depending on the characteristics of the file. The sections below explain this in more detail.
2. Examples
2.1 CSV files
A CSV file is essentially still a plain text file, so STORED AS TEXTFILE applies. Columns are separated by commas, expressed with ROW FORMAT DELIMITED FIELDS TERMINATED BY ','.
Plain CSV files
For example, the data file oss://bucket-for-testing/oss/text/cities/city.csv contains:
Beijing,China,010
ShangHai,China,021
Tianjin,China,022
The CREATE TABLE statement can be:
CREATE EXTERNAL TABLE city (
    city STRING,
    country STRING,
    code INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'oss://bucket-for-testing/oss/text/cities';
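As a quick sanity check against the three sample rows above, a simple query should return one record per line. The output below is illustrative only; note that code is declared INT, so the leading zeros in the file would be dropped:

mysql> select * from city;
+----------+---------+------+
| city     | country | code |
+----------+---------+------+
| Beijing  | China   |   10 |
| ShangHai | China   |   21 |
| Tianjin  | China   |   22 |
+----------+---------+------+
3 rows in set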
Using OpenCSVSerde to handle quoted fields
Note the following when using OpenCSVSerde:

- You can specify the field separator, the quote character for field contents, and the escape character, for example: WITH SERDEPROPERTIES ("separatorChar" = ",", "quoteChar" = "`", "escapeChar" = "\" );
- Line separators embedded inside fields are not supported;
- All fields must be defined as STRING;
- Values of other data types can be converted with functions in your SQL, as shown in the query sketch after the example below.
For example,
CREATE EXTERNAL TABLE test_csv_opencsvserde (
    id STRING,
    name STRING,
    location STRING,
    create_date STRING,
    create_timestamp STRING,
    longitude STRING,
    latitude STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
    'separatorChar' = ',',
    'quoteChar' = '"',
    'escapeChar' = '\\'
)
STORED AS TEXTFILE
LOCATION 'oss://test-bucket-julian-1/test_csv_serde_1';
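Since every column above must be STRING, type conversion happens at query time. A minimal sketch (a hypothetical query, assuming the stored values are well formed):

SELECT id,
       CAST(create_date AS DATE) AS create_date,            -- string -> date
       CAST(create_timestamp AS TIMESTAMP) AS create_timestamp,
       CAST(longitude AS DOUBLE) AS longitude,              -- string -> double
       CAST(latitude AS DOUBLE) AS latitude
FROM test_csv_opencsvserde;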
Custom delimiters
If you need a custom column delimiter (FIELDS TERMINATED BY), escape character (ESCAPED BY), or line terminator (LINES TERMINATED BY), specify them in the CREATE TABLE statement:
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    ESCAPED BY '\\'
    LINES TERMINATED BY '\n'
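Embedded in a complete statement, this could look like the following sketch (the table name and OSS path are hypothetical, chosen only for illustration):

CREATE EXTERNAL TABLE test_custom_delim (
    id STRING,
    name STRING
)
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    ESCAPED BY '\\'
    LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 'oss://bucket-for-testing/oss/text/custom_delim/';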
Ignoring the header in a CSV file
CSV files sometimes carry a header that should be ignored when the data is read. In that case, define skip.header.line.count in the CREATE TABLE statement.
For example, the data file oss://my-bucket/datasets/tpch/nation_csv/nation_header.tbl contains:
N_NATIONKEY|N_NAME|N_REGIONKEY|N_COMMENT
0|ALGERIA|0| haggle. carefully final deposits detect slyly agai|
1|ARGENTINA|1|al foxes promise slyly according to the regular accounts. bold requests alon|
2|BRAZIL|1|y alongside of the pending deposits. carefully special packages are about the ironic forges. slyly special|
3|CANADA|1|eas hang ironic, silent packages. slyly regular packages are furiously over the tithes. fluffily bold|
4|EGYPT|4|y above the carefully unusual theodolites. final dugouts are quickly across the furiously regular d|
5|ETHIOPIA|0|ven packages wake quickly. regu|
The corresponding CREATE TABLE statement is:
CREATE EXTERNAL TABLE nation_header (
    N_NATIONKEY INT,
    N_NAME STRING,
    N_REGIONKEY INT,
    N_COMMENT STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 'oss://my-bucket/datasets/tpch/nation_csv/nation_header.tbl'
TBLPROPERTIES ("skip.header.line.count" = "1");
The value x of skip.header.line.count relates to the actual number of lines n in the data file as follows (an illustrative query follows this list):

- When x <= 0, DLA does not filter anything out and reads the entire file;
- When 0 < x < n, DLA skips the first x lines of the file when reading;
- When x >= n, DLA filters out the entire file content.
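With "skip.header.line.count" = "1" as above (x = 1, and n = 7 for nation_header.tbl: one header line plus six data lines), the header line is dropped and only data rows come back. The check below is illustrative, based on the sample file shown earlier:

mysql> select N_NATIONKEY, N_NAME from nation_header limit 3;
+-------------+-----------+
| N_NATIONKEY | N_NAME    |
+-------------+-----------+
|           0 | ALGERIA   |
|           1 | ARGENTINA |
|           2 | BRAZIL    |
+-------------+-----------+
3 rows in set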
2.2 TSV files
Similar to CSV files, TSV files are plain text files in which columns are separated by tabs.
For example, the data file oss://bucket-for-testing/oss/text/cities/city.tsv contains:
Beijing	China	010
ShangHai	China	021
Tianjin	China	022
The CREATE TABLE statement can be:
CREATE EXTERNAL TABLE city (
    city STRING,
    country STRING,
    code INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 'oss://bucket-for-testing/oss/text/cities';
2.3 Files with multi-character field delimiters
If your field delimiter consists of multiple characters, you can use the CREATE TABLE template below, in which the field delimiter is "||"; replace it with your actual delimiter string.
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
WITH SERDEPROPERTIES ("field.delim" = "||")
Example:
CREATE EXTERNAL TABLE test_csv_multidelimit (
    id STRING,
    name STRING,
    location STRING,
    create_date STRING,
    create_timestamp STRING,
    longitude STRING,
    latitude STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
WITH SERDEPROPERTIES ("field.delim" = "||")
STORED AS TEXTFILE
LOCATION 'oss://bucket-for-testing/oss/text/cities/';
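For reference, a data file matching this table separates fields with ||. The rows below are hypothetical, made up only to illustrate the layout:

1||jack||Hangzhou||2018-01-01||2018-01-01 12:00:00||120.19||30.26
2||rose||Beijing||2018-01-02||2018-01-02 12:00:00||116.40||39.90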
2.4 JSON files
The JSON files DLA can process are normally stored as plain text, and the data files must be UTF-8 encoded. Each line of a JSON file must be one complete JSON object.
For example, a file laid out like this is not accepted:
{"id":123,"name":"jack","c3":"2001-02-0312:34:56"}{"id":456,"name":"rose","c3":"1906-04-1805:12:00"}{"id":789,"name":"tom","c3":"2001-02-0312:34:56"}{"id":234,"name":"alice","c3":"1906-04-1805:12:00"}
It must be rewritten so that each object sits on a single line:
{"id":123,"name":"jack","c3":"2001-02-0312:34:56"}{"id":456,"name":"rose","c3":"1906-04-1805:12:00"}{"id":789,"name":"tom","c3":"2001-02-0312:34:56"}{"id":234,"name":"alice","c3":"1906-04-1805:12:00"}
JSON data without nesting
The CREATE TABLE statement can be written as:
CREATE EXTERNAL TABLE t1 (
    id int,
    name string,
    c3 timestamp
)
STORED AS JSON
LOCATION 'oss://path/to/t1/directory';
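With the four rewritten JSON lines above placed under the LOCATION directory, each line becomes one record. An illustrative query against that sample data:

mysql> select * from t1 where id = 123;
+------+------+---------------------+
| id   | name | c3                  |
+------+------+---------------------+
|  123 | jack | 2001-02-03 12:34:56 |
+------+------+---------------------+
1 row in set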
Nested JSON files
Use struct and array to define nested JSON data. For example, given the raw user data below (note: whether nested or not, a complete JSON record must sit on a single line for Data Lake Analytics to process it):
{"DocId":"Alibaba","User_1":{"Id":1234,"Username":"bob1234","Name":"Bob","ShippingAddress":{"Address1":"969WenyiWestSt.","Address2":null,"City":"Hangzhou","Province":"Zhejiang"},"Orders":[{"ItemId":6789,"OrderDate":"11/11/2017"},{"ItemId":4352,"OrderDate":"12/12/2017"}]}}
After formatting it with an online JSON formatter, the data looks like this:
{"DocId":"Alibaba","User_1":{"Id":1234,"Username":"bob1234","Name":"Bob","ShippingAddress":{"Address1":"969WenyiWestSt.","Address2":null,"City":"Hangzhou","Province":"Zhejiang"},"Orders":[{"ItemId":6789,"OrderDate":"11/11/2017"},{"ItemId":4352,"OrderDate":"12/12/2017"}]}}
The CREATE TABLE statement can then be written as follows (note: the path in LOCATION must be the directory containing the JSON data files, and all JSON files in that directory are recognized as data of this table):
CREATE EXTERNAL TABLE json_table_1 (
    docid string,
    user_1 struct<
        id: INT,
        username: string,
        name: string,
        shippingaddress: struct<
            address1: string,
            address2: string,
            city: string,
            province: string
        >,
        orders: array<struct<itemid: INT, orderdate: string>>
    >
)
STORED AS JSON
LOCATION 'oss://xxx/test/json/hcatalog_serde/table_1/';
Querying the table:
select * from json_table_1;
+---------+----------------------------------------------------------------------------------------------------------------+
| docid   | user_1                                                                                                         |
+---------+----------------------------------------------------------------------------------------------------------------+
| Alibaba | [1234, bob1234, Bob, [969 Wenyi West St., null, Hangzhou, Zhejiang], [[6789, 11/11/2017], [4352, 12/12/2017]]] |
+---------+----------------------------------------------------------------------------------------------------------------+
For nested structures defined with struct, use "." to reference fields level by level; for arrays defined with array, use "[subscript]" (note: subscripts start at 1) to reference elements.
select DocId,
       User_1.Id,
       User_1.ShippingAddress.Address1,
       User_1.Orders[1].ItemId
from json_table_1
where User_1.Username = 'bob1234'
  and User_1.Orders[2].OrderDate = '12/12/2017';
+---------+------+--------------------+-------+
| DocId   | id   | address1           | _col3 |
+---------+------+--------------------+-------+
| Alibaba | 1234 | 969 Wenyi West St. | 6789  |
+---------+------+--------------------+-------+
Processing data with JSON functions
For example, the nested JSON value of "value_string" is stored as a string:
{"data_key":"com.taobao.vipserver.domains.meta.biz.alibaba.com","ts":1524550275112,"value_string":"{\"appName\":\"\",\"apps\":[],\"checksum\":\"50fa0540b430904ee78dff07c7350e1c\",\"clusterMap\":{\"DEFAULT\":{\"defCkport\":80,\"defIPPort\":80,\"healthCheckTask\":null,\"healthChecker\":{\"checkCode\":200,\"curlHost\":\"\",\"curlPath\":\"/status.taobao\",\"type\":\"HTTP\"},\"name\":\"DEFAULT\",\"nodegroup\":\"\",\"sitegroup\":\"\",\"submask\":\"0.0.0.0/0\",\"syncConfig\":{\"appName\":\"trade-ma\",\"nodegroup\":\"tradema\",\"pubLevel\":\"publish\",\"role\":\"\",\"site\":\"\"},\"useIPPort4Check\":true}},\"disabledSites\":[],\"enableArmoryUnit\":false,\"enableClientBeat\":false,\"enableHealthCheck\":true,\"enabled\":true,\"envAndSites\":\"\",\"invalidThreshold\":0.6,\"ipDeleteTimeout\":1800000,\"lastModifiedMillis\":1524550275107,\"localSiteCall\":true,\"localSiteThreshold\":0.8,\"name\":\"biz.alibaba.com\",\"nodegroup\":\"\",\"owners\":[\"junlan.zx\",\"张三\",\"李四\",\"cui.yuanc\"],\"protectThreshold\":0,\"requireSameEnv\":false,\"resetWeight\":false,\"symmetricCallType\":null,\"symmetricType\":\"warehouse\",\"tagName\":\"ipGroup\",\"tenantId\":\"\",\"tenants\":[],\"token\":\"1cf0ec0c771321bb4177182757a67fb0\",\"useSpecifiedURL\":false}"}
After formatting with an online JSON formatter, the data looks like this (the value of value_string remains a single escaped string):
{"data_key":"com.taobao.vipserver.domains.meta.biz.alibaba.com","ts":1524550275112,"value_string":"{\"appName\":\"\",\"apps\":[],\"checksum\":\"50fa0540b430904ee78dff07c7350e1c\",\"clusterMap\":{\"DEFAULT\":{\"defCkport\":80,\"defIPPort\":80,\"healthCheckTask\":null,\"healthChecker\":{\"checkCode\":200,\"curlHost\":\"\",\"curlPath\":\"/status.taobao\",\"type\":\"HTTP\"},\"name\":\"DEFAULT\",\"nodegroup\":\"\",\"sitegroup\":\"\",\"submask\":\"0.0.0.0/0\",\"syncConfig\":{\"appName\":\"trade-ma\",\"nodegroup\":\"tradema\",\"pubLevel\":\"publish\",\"role\":\"\",\"site\":\"\"},\"useIPPort4Check\":true}},\"disabledSites\":[],\"enableArmoryUnit\":false,\"enableClientBeat\":false,\"enableHealthCheck\":true,\"enabled\":true,\"envAndSites\":\"\",\"invalidThreshold\":0.6,\"ipDeleteTimeout\":1800000,\"lastModifiedMillis\":1524550275107,\"localSiteCall\":true,\"localSiteThreshold\":0.8,\"name\":\"biz.alibaba.com\",\"nodegroup\":\"\",\"owners\":[\"junlan.zx\",\"张三\",\"李四\",\"cui.yuanc\"],\"protectThreshold\":0,\"requireSameEnv\":false,\"resetWeight\":false,\"symmetricCallType\":null,\"symmetricType\":\"warehouse\",\"tagName\":\"ipGroup\",\"tenantId\":\"\",\"tenants\":[],\"token\":\"1cf0ec0c771321bb4177182757a67fb0\",\"useSpecifiedURL\":false}"}
The CREATE TABLE statement is:
CREATE EXTERNAL TABLE json_table_2 (
    data_key string,
    ts bigint,
    value_string string
)
STORED AS JSON
LOCATION 'oss://xxx/test/json/hcatalog_serde/table_2/';
Once the table is created, it can be queried:
mysql> select * from json_table_2;
+---------------------------------------------------+---------------+--------------+
| data_key                                          | ts            | value_string |
+---------------------------------------------------+---------------+--------------+
| com.taobao.vipserver.domains.meta.biz.alibaba.com | 1524550275112 | {"appName":"","apps":[],"checksum":"50fa0540b430904ee78dff07c7350e1c","clusterMap":{"DEFAULT":{"defCkport":80,"defIPPort":80,"healthCheckTask":null,"healthChecker":{"checkCode":200,"curlHost":"","curlPath":"/status.taobao","type":"HTTP"},"name":"DEFAULT","nodegroup":"","sitegroup":"","submask":"0.0.0.0/0","syncConfig":{"appName":"trade-ma","nodegroup":"tradema","pubLevel":"publish","role":"","site":""},"useIPPort4Check":true}},"disabledSites":[],"enableArmoryUnit":false,"enableClientBeat":false,"enableHealthCheck":true,"enabled":true,"envAndSites":"","invalidThreshold":0.6,"ipDeleteTimeout":1800000,"lastModifiedMillis":1524550275107,"localSiteCall":true,"localSiteThreshold":0.8,"name":"biz.alibaba.com","nodegroup":"","owners":["junlan.zx","张三","李四","cui.yuanc"],"protectThreshold":0,"requireSameEnv":false,"resetWeight":false,"symmetricCallType":null,"symmetricType":"warehouse","tagName":"ipGroup","tenantId":"","tenants":[],"token":"1cf0ec0c771321bb4177182757a67fb0","useSpecifiedURL":false} |
+---------------------------------------------------+---------------+--------------+
1 row in set
The following SQL shows how to use common JSON functions such as json_parse, json_extract_scalar, and json_extract (the queries operate on the value_string column defined above):
mysql> select json_extract_scalar(json_parse(value_string), '$.owners[1]') from json_table_2;
+--------+
| _col0  |
+--------+
| 张三   |
+--------+

mysql> select json_extract_scalar(json_obj.json_col, '$.DEFAULT.submask')
    -> from (select json_extract(json_parse(value_string), '$.clusterMap') as json_col from json_table_2) json_obj
    -> where json_extract_scalar(json_obj.json_col, '$.DEFAULT.healthChecker.curlPath') = '/status.taobao';
+-----------+
| _col0     |
+-----------+
| 0.0.0.0/0 |
+-----------+

mysql> with json_obj as (select json_extract(json_parse(value_string), '$.clusterMap') as json_col from json_table_2)
    -> select json_extract_scalar(json_obj.json_col, '$.DEFAULT.submask')
    -> from json_obj
    -> where json_extract_scalar(json_obj.json_col, '$.DEFAULT.healthChecker.curlPath') = '/status.taobao';
+-----------+
| _col0     |
+-----------+
| 0.0.0.0/0 |
+-----------+
2.5 ORC files
Optimized Row Columnar (ORC) is an optimized columnar file format supported by the Apache open source project Hive. Compared with CSV files, it not only saves storage space but also delivers better query performance.

For ORC files, simply specify STORED AS ORC when creating the table.
For example,
CREATE EXTERNAL TABLE orders_orc_date (
    O_ORDERKEY INT,
    O_CUSTKEY INT,
    O_ORDERSTATUS STRING,
    O_TOTALPRICE DOUBLE,
    O_ORDERDATE DATE,
    O_ORDERPRIORITY STRING,
    O_CLERK STRING,
    O_SHIPPRIORITY INT,
    O_COMMENT STRING
)
STORED AS ORC
LOCATION 'oss://bucket-for-testing/datasets/tpch/1x/orc_date/orders_orc';
2.6 Parquet files
Parquet is a columnar file format supported by the Apache open source project Hadoop. When creating the table in DLA, simply specify STORED AS PARQUET.
For example,
CREATE EXTERNAL TABLE orders_parquet_date (
    O_ORDERKEY INT,
    O_CUSTKEY INT,
    O_ORDERSTATUS STRING,
    O_TOTALPRICE DOUBLE,
    O_ORDERDATE DATE,
    O_ORDERPRIORITY STRING,
    O_CLERK STRING,
    O_SHIPPRIORITY INT,
    O_COMMENT STRING
)
STORED AS PARQUET
LOCATION 'oss://bucket-for-testing/datasets/tpch/1x/parquet_date/orders_parquet';
2.7 RCFile files
Record Columnar File (RCFile) is a columnar file format that efficiently stores relational table structures in distributed systems and can be read and processed efficiently. When creating the table in DLA, specify STORED AS RCFILE.
For example,
CREATE EXTERNAL TABLE lineitem_rcfile_date (
    L_ORDERKEY INT,
    L_PARTKEY INT,
    L_SUPPKEY INT,
    L_LINENUMBER INT,
    L_QUANTITY DOUBLE,
    L_EXTENDEDPRICE DOUBLE,
    L_DISCOUNT DOUBLE,
    L_TAX DOUBLE,
    L_RETURNFLAG STRING,
    L_LINESTATUS STRING,
    L_SHIPDATE DATE,
    L_COMMITDATE DATE,
    L_RECEIPTDATE DATE,
    L_SHIPINSTRUCT STRING,
    L_SHIPMODE STRING,
    L_COMMENT STRING
)
STORED AS RCFILE
LOCATION 'oss://bucket-for-testing/datasets/tpch/1x/rcfile_date/lineitem_rcfile';
2.8 Avro files
When creating a table for Avro files, DLA requires STORED AS AVRO, and the columns you define must match the schema of the Avro file.

If you are unsure of the schema, you can obtain it with the tools Avro provides and create the table accordingly. Download avro-tools-1.8.2.jar from the Apache Avro website to your local machine, then run the following command to get the schema of an Avro file:
java -jar avro-tools-1.8.2.jar getschema /path/to/your/doctors.avro
{
    "type": "record",
    "name": "doctors",
    "namespace": "testing.hive.avro.serde",
    "fields": [
        {"name": "number", "type": "int", "doc": "Order of playing the role"},
        {"name": "first_name", "type": "string", "doc": "first name of actor playing role"},
        {"name": "last_name", "type": "string", "doc": "last name of actor playing role"}
    ]
}
The CREATE TABLE statement is shown below, where each name in fields maps to a column name in the table, and each type must be converted to a DLA-supported type according to the table later in this section:
CREATE EXTERNAL TABLE doctors (
    number int,
    first_name string,
    last_name string
)
STORED AS AVRO
LOCATION 'oss://mybucket-for-testing/directory/to/doctors';
In most cases, an Avro type converts directly to the corresponding DLA type. If DLA does not support a type, it is converted to the closest type. See the following table:
| Avro type | DLA type |
|---|---|
| null | void |
| boolean | boolean |
| int | int |
| long | bigint |
| float | float |
| double | double |
| bytes | binary |
| string | string |
| record | struct |
| map | map |
| list | array |
| union | union |
| enum | string |
| fixed | binary |
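As an illustration of this mapping, an Avro record containing long, bytes, and map fields would be declared in DLA with bigint, binary, and map columns. The table name and path below are hypothetical:

CREATE EXTERNAL TABLE avro_mapping_demo (
    event_time bigint,            -- Avro long
    payload binary,               -- Avro bytes
    tags map<string,string>       -- Avro map
)
STORED AS AVRO
LOCATION 'oss://bucket-for-testing/directory/to/avro_mapping_demo/';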
2.9 Files that can be matched with regular expressions
Files of this kind are usually stored on OSS as plain text; each line corresponds to one record in the table, and each line can be matched with a regular expression. Apache web server log files, for example, are of this type. Suppose a log file contains:
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
127.0.0.1 - - [26/May/2009:00:00:00 +0000] "GET /someurl/?track=Blabla(Main) HTTP/1.1" 200 5864 - "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/525.19 (KHTML, like Gecko) Chrome/1.0.154.65 Safari/525.19"
Each line can be matched with the following regular expression, with columns separated by spaces:
([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?
For files in this format, the CREATE TABLE statement can be written as:
CREATE EXTERNAL TABLE serde_regex (
    host STRING,
    identity STRING,
    userName STRING,
    time STRING,
    request STRING,
    status STRING,
    size INT,
    referer STRING,
    agent STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
    "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?"
)
STORED AS TEXTFILE
LOCATION 'oss://bucket-for-testing/datasets/serde/regex';
Query result:
mysql> select * from serde_regex;
+-----------+----------+----------+------------------------------+---------------------------------------------+--------+------+---------+--------------------------------------------------------------------------------------------------------------------------+
| host      | identity | userName | time                         | request                                     | status | size | referer | agent                                                                                                                      |
+-----------+----------+----------+------------------------------+---------------------------------------------+--------+------+---------+--------------------------------------------------------------------------------------------------------------------------+
| 127.0.0.1 | -        | frank    | [10/Oct/2000:13:55:36 -0700] | "GET /apache_pb.gif HTTP/1.0"               | 200    | 2326 | NULL    | NULL                                                                                                                       |
| 127.0.0.1 | -        | -        | [26/May/2009:00:00:00 +0000] | "GET /someurl/?track=Blabla(Main) HTTP/1.1" | 200    | 5864 | -       | "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/525.19 (KHTML, like Gecko) Chrome/1.0.154.65 Safari/525.19"   |
+-----------+----------+----------+------------------------------+---------------------------------------------+--------+------+---------+--------------------------------------------------------------------------------------------------------------------------+
2.10 Esri ArcGIS geographic JSON data files
DLA provides SerDe handling for Esri ArcGIS geographic JSON data files. For a description of this geographic JSON format, see: https://github.com/Esri/spatial-framework-for-hadoop/wiki/JSON-Formats

Example:
CREATE EXTERNAL TABLE IF NOT EXISTS california_counties (
    Name string,
    BoundaryShape binary
)
ROW FORMAT SERDE 'com.esri.hadoop.hive.serde.JsonSerde'
STORED AS INPUTFORMAT 'com.esri.json.hadoop.EnclosedJsonInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'oss://test_bucket/datasets/geospatial/california-counties/';
As the examples above show, DLA supports files in most open source storage formats. For the same data, different storage formats differ considerably in file size on OSS and in DLA's query speed; ORC is the recommended format for both storage and query. DLA is being continuously optimized for faster queries and will support more data sources over time, bringing users a better big data analytics experience.