# Paimon

> Paimon source connector

## Description

Used to read data from `Apache Paimon`.

## Key Features

- [x] [batch](../../concept/connector-v2-features.md)
- [x] [stream](../../concept/connector-v2-features.md)
- [ ] [exactly-once](../../concept/connector-v2-features.md)
- [ ] [column projection](../../concept/connector-v2-features.md)
- [ ] [parallelism](../../concept/connector-v2-features.md)
- [ ] [support user-defined split](../../concept/connector-v2-features.md)

## Options

| name                    | type   | required | default value |
|-------------------------|--------|----------|---------------|
| warehouse               | String | yes      | -             |
| catalog_type            | String | no       | filesystem    |
| catalog_uri             | String | no       | -             |
| database                | String | yes      | -             |
| table                   | String | yes      | -             |
| hdfs_site_path          | String | no       | -             |
| query                   | String | no       | -             |
| paimon.hadoop.conf      | Map    | no       | -             |
| paimon.hadoop.conf-path | String | no       | -             |

### warehouse [string]

Paimon warehouse path.

### catalog_type [string]

Paimon catalog type, supports `filesystem` and `hive`.

### catalog_uri [string]

Catalog uri of Paimon, required only when `catalog_type` is `hive`.

### database [string]

The database you want to access.

### table [string]

The table you want to access.

### hdfs_site_path [string]

The path of the `hdfs-site.xml` file.

### query [string]

The filter condition used when reading the table, e.g. `select * from st_test where id > 100`. If not specified, all rows are read.

Currently, `where` supports `<, <=, >, >=, =, !=, or, and, is null, is not null`; other operators are not supported yet.

Due to Paimon limitations, `Having`, `Group By` and `Order By` are not currently supported; `projection` and `limit` will be supported in future versions.

Note: when the field after `where` is a string or boolean, its value must be wrapped in single quotes, otherwise an error is thrown, e.g. `name='abc'` or `tag='true'` (see the sketch after the list below).

The field data types currently supported by `where` are:

* string
* boolean
* tinyint
* smallint
* int
* bigint
* float
* double
* date
* timestamp
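
For instance, a minimal sketch combining these quoting rules (the column names `name`, `tag` and `id` are hypothetical):

```hocon
source {
  Paimon {
    warehouse = "/tmp/paimon"
    database = "default"
    table = "st_test"
    # string and boolean literals are single-quoted; numeric literals are not
    query = "select * from st_test where name = 'abc' and tag = 'true' and id > 100"
  }
}
```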

### paimon.hadoop.conf [string]

Properties in hadoop conf.

### paimon.hadoop.conf-path [string]

The loading path for the 'core-site.xml', 'hdfs-site.xml' and 'hive-site.xml' files.
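
For example, a sketch pointing the connector at a local Hadoop configuration directory (the directory path here is hypothetical):

```hocon
source {
  Paimon {
    warehouse = "hdfs:///tmp/paimon"
    database = "default"
    table = "st_test"
    # directory containing core-site.xml, hdfs-site.xml and hive-site.xml
    paimon.hadoop.conf-path = "/home/hadoop/conf"
  }
}
```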

## Filesystems

The Paimon connector supports reading data from multiple file systems. Currently, the supported file systems are `hdfs` and `s3`.
If you use the `s3` filesystem, you can configure the `fs.s3a.access-key`, `fs.s3a.secret-key`, `fs.s3a.endpoint`, `fs.s3a.path.style.access` and `fs.s3a.aws.credentials.provider` properties in `paimon.hadoop.conf`, and the warehouse path should start with `s3a://` (see the S3 example below).

## Examples

### Simple example

```hocon
source {
  Paimon {
    warehouse = "/tmp/paimon"
    database = "default"
    table = "st_test"
  }
}
```

### Filter example

```hocon
source {
  Paimon {
    warehouse = "/tmp/paimon"
    database = "full_type"
    table = "st_test"
    query = "select c_boolean, c_tinyint from st_test where c_boolean = 'true' and c_tinyint > 116 and c_smallint = 15987 or c_decimal = '2924137191386439303744.39292213'"
  }
}
```

### S3 example

```hocon
env {
  execution.parallelism = 1
  job.mode = "BATCH"
}

source {
  Paimon {
    warehouse = "s3a://test/"
    database = "seatunnel_namespace11"
    table = "st_test"
    paimon.hadoop.conf = {
      fs.s3a.access-key=G52pnxg67819khOZ9ezX
      fs.s3a.secret-key=SHJuAQqHsLrgZWikvMa3lJf5T0NfM5LMFliJh9HF
      fs.s3a.endpoint="http://minio4:9000"
      fs.s3a.path.style.access=true
      fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider
    }
  }
}

sink {
  Console {}
}
```

### Hadoop conf example

```hocon
source {
  Paimon {
    catalog_name = "seatunnel_test"
    warehouse = "hdfs:///tmp/paimon"
    database = "seatunnel_namespace1"
    table = "st_test"
    query = "select * from st_test where pk_id is not null and pk_id < 3"
    paimon.hadoop.conf = {
      fs.defaultFS = "hdfs://nameservice1"
      dfs.nameservices = "nameservice1"
      dfs.ha.namenodes.nameservice1 = "nn1,nn2"
      dfs.namenode.rpc-address.nameservice1.nn1 = "hadoop03:8020"
      dfs.namenode.rpc-address.nameservice1.nn2 = "hadoop04:8020"
      dfs.client.failover.proxy.provider.nameservice1 = "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
      dfs.client.use.datanode.hostname = "true"
    }
  }
}
```

### Hive catalog example

```hocon
source {
  Paimon {
    catalog_name = "seatunnel_test"
    catalog_type = "hive"
    catalog_uri = "thrift://hadoop04:9083"
    warehouse = "hdfs:///tmp/seatunnel"
    database = "seatunnel_test"
    table = "st_test3"
    paimon.hadoop.conf = {
      fs.defaultFS = "hdfs://nameservice1"
      dfs.nameservices = "nameservice1"
      dfs.ha.namenodes.nameservice1 = "nn1,nn2"
      dfs.namenode.rpc-address.nameservice1.nn1 = "hadoop03:8020"
      dfs.namenode.rpc-address.nameservice1.nn2 = "hadoop04:8020"
      dfs.client.failover.proxy.provider.nameservice1 = "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
      dfs.client.use.datanode.hostname = "true"
    }
  }
}
```

## Changelog

To read the changelog of a Paimon table, first set the `changelog-producer` option on the Paimon source table, then read it with a SeaTunnel streaming job.
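
`changelog-producer` is a Paimon table option (common values include `input`, `lookup` and `full-compaction`). As a sketch, assuming the table is created by a SeaTunnel Paimon sink and that the sink passes `paimon.table.write-props` through to table creation, it could be set like this; for an existing table, set the option through Paimon itself instead:

```hocon
sink {
  Paimon {
    warehouse = "/tmp/paimon"
    database = "full_type"
    table = "st_test"
    paimon.table.primary-keys = "c_tinyint"
    # Paimon table option: generate the changelog directly from input records
    paimon.table.write-props = {
      changelog-producer = "input"
    }
  }
}
```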

### Note

Currently, batch reads always read the latest snapshot. To get more complete changelog data, use streaming read instead, and start the streaming read job before data is written to the Paimon table. To guarantee ordering, the parallelism of the streaming read task should be set to 1.

### Streaming read example

```hocon
env {
  parallelism = 1
  job.mode = "Streaming"
}

source {
  Paimon {
    warehouse = "/tmp/paimon"
    database = "full_type"
    table = "st_test"
  }
}

sink {
  Paimon {
    warehouse = "/tmp/paimon"
    database = "full_type"
    table = "st_test_sink"
    paimon.table.primary-keys = "c_tinyint"
  }
}
```