Importing Parquet data into Hive

Start by inspecting the format of the existing Parquet data, using either Python or the parquet-tools jar.

Inspecting Parquet data with Python

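# Requires pandas plus fastparquet or pyarrow; the engine argument must match the installed library
import pandas as pd
df = pd.read_parquet("<filename>", engine="fastparquet")
# Print each column name with its pandas dtype
for col in df.columns:
    print(col, df[col].dtype)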

Inspecting with the jar tool (you can try building it yourself)

# Show the schema
java -jar parquet-tools-1.6.0rc3-SNAPSHOT.jar schema -d activity.201711171437.0.parquet | head -n 30
# Show the first rows
java -jar parquet-tools-1.6.0rc3-SNAPSHOT.jar head -n 2 activity.201711171437.0.parquet

Here I am using the report_voice Parquet data. Its schema looks like this:

message schema {
  optional int64 uid;
  optional int64 voice_id;
  optional int64 time (TIMESTAMP_MILLIS);
  optional binary text (UTF8);
  optional int32 login (INT_8);
  optional int32 network (INT_8);
  optional int32 recognition_type (INT_8);
  optional int32 err_code1;
  optional double err_msg1;
  optional int32 err_code2;
  optional binary json (UTF8);
  optional int32 download (INT_8);
  optional binary txz_app_id (UTF8);
  optional int64 time_create (TIMESTAMP_MILLIS);
  optional int32 record_type (INT_8);
  optional int32 sex (INT_8);
  optional int32 singal_type (INT_8);
  optional int32 sample_rate (INT_8);
  optional int32 language (INT_8);
  optional int64 __index_level_0__;
}

Once the data format is confirmed, create the table (either an EXTERNAL table or a regular managed table works here).

Upload the existing Parquet files and attach them through an external table

First, upload the Parquet files to HDFS.

hdfs dfs -mkdir -p /warehouse/original/report_voice
hdfs dfs -put <local_file> /warehouse/original/report_voice
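
Creating an external table does not read the files, so a bad file in the directory only shows up later, at query time. A quick local check before uploading can catch obvious mistakes. The sketch below relies on the Parquet file layout, where a file starts and ends with the 4-byte magic `PAR1`; the helper name `looks_like_parquet` is made up for illustration, and the check does not validate the footer or the data itself.

```python
import os

PARQUET_MAGIC = b"PAR1"


def looks_like_parquet(path):
    """Cheap sanity check: the file starts and ends with the PAR1 magic."""
    # Lower bound on size: leading magic + 4-byte footer length + trailing magic
    if os.path.getsize(path) < 12:
        return False
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, os.SEEK_END)  # jump to the last 4 bytes of the file
        tail = f.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC
```

Calling `looks_like_parquet("<local_file>")` on each file before `hdfs dfs -put` is enough to filter out truncated uploads or files of a different format that happen to share the directory.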

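Create an external table in Hive, stored as Parquet. The 'parquet.compression'='GZIP' property applies to data that Hive writes into the table; the files already uploaded keep whatever codec they were written with.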

CREATE EXTERNAL TABLE original_report_voice (
 uid bigint,
 voice_id bigint,
 time TIMESTAMP,
 text string,
 login int,
 network int,
 recognition_type int,
 err_code1 int,
 err_msg1 string,
 err_code2 int,
 json string,
 download int,
 txz_app_id string,
 time_create TIMESTAMP,
 record_type int,
 sex TINYINT,
 singal_type int,
 sample_rate int,
 language int
) STORED AS parquet
LOCATION '/warehouse/original/report_voice/'
TBLPROPERTIES('parquet.compression'='GZIP');
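
The column list above follows the parquet-tools schema dump fairly mechanically, so it can be generated instead of typed by hand. The following is a minimal sketch under stated assumptions: the type mapping in `HIVE_TYPES` is my own illustrative choice (not an official table), and the function names are hypothetical. It maps `INT_8` to `int` and `double` to `double`, so its output differs from the hand-written DDL in a couple of places (`err_msg1` is declared as `string` and `sex` as `TINYINT` there). Whether Hive can read a given logical type also depends on the Hive version, so treat the result as a starting point to review.

```python
import re

# Assumed mapping from (physical type, annotation) in the schema dump
# to a Hive column type; extend it for other types as needed.
HIVE_TYPES = {
    ("int64", None): "bigint",
    ("int64", "TIMESTAMP_MILLIS"): "timestamp",
    ("int32", None): "int",
    ("int32", "INT_8"): "int",
    ("double", None): "double",
    ("binary", "UTF8"): "string",
}

# Matches lines such as "  optional int64 time (TIMESTAMP_MILLIS);"
FIELD_RE = re.compile(
    r"^\s*(?:optional|required)\s+(\w+)\s+(\w+)(?:\s+\((\w+)\))?;\s*$"
)


def hive_columns(schema_text):
    """Return (name, hive_type) pairs for the flat fields in a schema dump."""
    columns = []
    for line in schema_text.splitlines():
        match = FIELD_RE.match(line)
        if not match:
            continue  # skips "message schema {" and "}"
        physical, name, annotation = match.groups()
        if name.startswith("__index_level_"):
            continue  # index column written by pandas, not real data
        # An unmapped type raises KeyError, which is intentional here
        columns.append((name, HIVE_TYPES[(physical, annotation)]))
    return columns


def create_table_sql(table, location, schema_text):
    """Build a CREATE EXTERNAL TABLE statement for a Parquet directory."""
    cols = ",\n".join(f" {n} {t}" for n, t in hive_columns(schema_text))
    return (
        f"CREATE EXTERNAL TABLE {table} (\n{cols}\n"
        f") STORED AS parquet\nLOCATION '{location}';"
    )


sample = """message schema {
  optional int64 uid;
  optional int64 time (TIMESTAMP_MILLIS);
  optional binary text (UTF8);
  optional int32 login (INT_8);
  optional double err_msg1;
  optional int64 __index_level_0__;
}"""
print(create_table_sql("original_report_voice",
                       "/warehouse/original/report_voice/", sample))
```

The `__index_level_0__` column is dropped because pandas adds it when it saves an unnamed DataFrame index; leaving it out of the table definition matches the DDL above.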

Open Hive and run a query. All the files under the directory are presented as one large table:
```
select count(*) from original_report_voice;
```