- HCatalog是让Hive的元数据存储为基于Hadoop的其他工具所公用。它为MapReduce和Pig提供了连接器,用户可以从Hive数据仓库中读取和写入数据。
- MR从InputFormat读取数据,然后分片成不同的map任务并行处理,公共一个RecordReader从输入数据源中读取记录转换成键值对来供map任务进行处理。
- Hcatalog 提供HCatInputFormat类来供MapReduce用户从Hive的数据仓库中读取数据 。其允许用户只读需要的表分区和字段,并且提供一种列表格式来展示 。
- 初始化HCatInputFormat时,首先要读取指定的表,通过创建InputJobInfo类,然后指定数据库、表和分区过滤条件就可以达到这个目的。
//数据库名称
private final String databaseName;
//表名
private final String tableName;
//表信息
private HCatTableInfo tableInfo;
//分区过滤条件
private String filter;
//全部分区
transient private List<PartInfo> partitions;
/** implementation specific job properties */
private Properties properties;
HCatSchema baseSchema = HCatBaseInputFormat.getOutputSchema(context);
List<HCatFieldSchema> fields = new List<HCatFieldSchema>(2);
fields.add(baseSchema.get("user"));
fields.add(baseSchema.get("url"));
HCatBaseInputFormat.setOutputSchema(job, new HCatSchema(fields));
/** The db and table names. */
private final String databaseName;
private final String tableName;
/** The table info provided by user. */
private HCatTableInfo tableInfo;
/** The output schema. This is given to us by user. This wont contain any
* partition columns ,even if user has specified them.
* */
private HCatSchema outputSchema;
/** The location of the partition being written */
private String location;
/** The root location of custom dynamic partitions being written */
private String customDynamicRoot;
/** The relative path of custom dynamic partitions being written */
private String customDynamicPath;
//用户期望创建的分区名称
/** The partition values to publish to, if used for output*/
private Map<String, String> partitionValues;
private List<Integer> posOfPartCols;
private List<Integer> posOfDynPartCols;
private Properties properties;
private int maxDynamicPartitions;
/** List of keys for which values were not specified at write setup time, to be infered at write time */
private List<String> dynamicPartitioningKeys;
private boolean harRequested;
- 进入HIVE_HOME/hcatalog/bin中
hcat -e "create table test(name string);"


- HcatInputFormat通过和Hive元数据存储进行同学来获取要读取的表和分区的信息。包括获取表的模式,也包括获取每个分区的模式。