[DCOS-39050] Added files for Hive Docker image #392
Changes from 12 commits
c497250
edb729f
fdb6c45
85cd11b
2585dff
81719af
d2839f9
8ab718b
16e032d
1bfcff9
4457238
c42d82b
585571e
7552c5c
fed868e
27a0a4d
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,61 @@ | ||
# Cloudera Hadoop and Hive Docker Image with Kerberos | ||
|
||
|
||
This is a Hadoop Docker image running CDH5 versions of Hadoop and Hive, all in one container. There is a separate Kerberos image in which Hadoop and Hive use Kerberos for authentication. Adapted from https://github.com/tilakpatidar/cdh5_hive_postgres and based on Ubuntu 16.04 (xenial). | ||
|
||
Postgres is also installed so that Hive can use it for its Metastore backend and run in remote mode. | ||
|
||
## Current Version | ||
* Hadoop 2.6.0 | ||
* Hive 1.1.0 | ||
|
||
## Dependencies | ||
The Kerberos image assumes that a KDC has been launched by the dcos-commons kdc.py script. | ||
|
||
## Build the image | ||
|
||
### Build the Hadoop + Hive image: | ||
``` | ||
cd hadoop-hive | ||
docker build -t cdh5-hive . | ||
``` | ||
|
||
### Build the Kerberized Hadoop + Hive image: | ||
First, autogenerate the Hadoop config files. | ||
``` | ||
cd ../kerberos | ||
scripts/generate_configs.sh | ||
``` | ||
|
||
Then build the image: | ||
``` | ||
docker build -t cdh5-hive-kerberos . | ||
``` | ||
|
||
## Run the Hive image interactively | ||
``` | ||
docker run -it cdh5-hive:latest /etc/hive-bootstrap.sh -bash | ||
``` | ||
|
||
## Run the Kerberized Hive image in DC/OS | ||
First, deploy a KDC via the dcos-commons kdc.py utility. See [the kdc README](https://github.com/mesosphere/dcos-commons/tree/master/tools/kdc) for details. | ||
|
||
From the dcos-commons repo: | ||
``` | ||
PYTHONPATH=testing ./tools/kdc/kdc.py deploy principals.txt | ||
``` | ||
|
||
At a minimum, `principals.txt` should include the following principals (for the Hadoop container hostname, pick any private agent in the cluster): | ||
|
||
``` | ||
hdfs/<hostname of Hadoop container>@LOCAL | ||
HTTP/<hostname of Hadoop container>@LOCAL | ||
yarn/<hostname of Hadoop container>@LOCAL | ||
hive/<hostname of Hadoop container>@LOCAL | ||
``` | ||
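The four principals above differ only in the service prefix, so the file can be generated instead of typed by hand. A minimal sketch, assuming a hypothetical helper named `gen_principals` (it is not part of dcos-commons) and the `LOCAL` realm shown above:

```shell
# Hypothetical helper, not part of dcos-commons: print one principal per
# service for a given host, in the <service>/<host>@<realm> form listed above.
gen_principals() {
    host="$1"
    realm="${2:-LOCAL}"   # realm defaults to LOCAL, as in the list above
    for svc in hdfs HTTP yarn hive; do
        printf '%s/%s@%s\n' "$svc" "$host" "$realm"
    done
}
```

Something like `gen_principals <hostname of Hadoop container> > principals.txt` would then produce the minimal file; any further principals still have to be added by hand.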
|
||
Deploy the Kerberized Hadoop / Hive container via Marathon. (Update the Marathon config's `constraint` field first with the host selected above.) | ||
|
||
``` | ||
dcos marathon app add kerberos/marathon/hdfs-hive-kerberos.json | ||
``` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,149 @@ | ||
FROM ubuntu:16.04 | ||
|
||
USER root | ||
|
||
ENV JAVA_HOME /usr/lib/jvm/java-8-oracle | ||
Review comment: Considering that this is also being set elsewhere in this Dockerfile, I would suggest moving the envvars closer to where they are being used. This makes it easier to reason about what is being done and to recognise unused variables. For example, |
||
ENV HADOOP_VERSION 2.6.0 | ||
ENV CDH_VERSION 5 | ||
ENV CDH_EXACT_VERSION 5.11.0 | ||
ENV HADOOP_HOME /usr/local/hadoop | ||
ENV HADOOP_PREFIX /usr/local/hadoop | ||
ENV HADOOP_CONF_DIR /usr/local/hadoop/etc/hadoop | ||
ENV HIVE_HOME /usr/local/hive | ||
ENV HIVE_CONF /usr/local/hive/conf | ||
ENV HIVE_VERSION 1.1.0 | ||
ENV POSTGRES_VERSION 9.5 | ||
ENV POSTGRESQL_MAIN /var/lib/postgresql/9.5/main/ | ||
ENV POSTGRESQL_CONFIG_FILE /var/lib/postgresql/9.5/main/postgresql.conf | ||
ENV POSTGRESQL_BIN /usr/lib/postgresql/9.5/bin/postgres | ||
ENV PGPASSWORD hive | ||
|
||
# install dev tools | ||
RUN apt-get update && \ | ||
apt-get install -y curl wget tar openssh-server openssh-client rsync python-software-properties apt-file apache2 && \ | ||
rm -rf /var/lib/apt/lists/* | ||
|
||
# for running sshd in ubuntu trusty. https://github.com/docker/docker/issues/5704 | ||
RUN mkdir /var/run/sshd | ||
RUN echo 'root:secretpasswd' | chpasswd | ||
RUN sed -i 's/PermitRootLogin without-password/PermitRootLogin yes/' /etc/ssh/sshd_config | ||
|
||
# passwordless ssh | ||
RUN yes | ssh-keygen -q -N "" -t dsa -f /etc/ssh/ssh_host_dsa_key | ||
RUN yes | ssh-keygen -q -N "" -t rsa -f /etc/ssh/ssh_host_rsa_key | ||
RUN yes | ssh-keygen -q -N "" -t rsa -f /root/.ssh/id_rsa | ||
RUN cp /root/.ssh/id_rsa.pub /root/.ssh/authorized_keys | ||
|
||
# fix the 254 error code | ||
RUN sed -i "/^[^#]*UsePAM/ s/.*/#&/" /etc/ssh/sshd_config | ||
RUN echo "UsePAM no" >> /etc/ssh/sshd_config | ||
RUN echo "Port 2122" >> /etc/ssh/sshd_config | ||
RUN /usr/sbin/sshd | ||
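The `sed` expression above prefixes every uncommented `UsePAM` line with `#`, after which `UsePAM no` is appended. A small sketch of the same expression applied to sample input rather than to `/etc/ssh/sshd_config`; the function name `comment_out_usepam` is made up for illustration:

```shell
# Illustrative only: run the Dockerfile's sed expression on stdin instead of
# editing /etc/ssh/sshd_config in place.
comment_out_usepam() {
    sed '/^[^#]*UsePAM/ s/.*/#&/'
}

# An active UsePAM line gains a leading "#"; an already commented line and
# unrelated lines pass through unchanged.
printf 'Port 22\nUsePAM yes\n#UsePAM no\n' | comment_out_usepam
```

Lines that already start with `#` are skipped because `[^#]*` cannot match past the comment character.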
|
||
# ssh client config | ||
ADD conf/ssh_config /root/.ssh/config | ||
RUN chmod 600 /root/.ssh/config | ||
RUN chown root:root /root/.ssh/config | ||
|
||
EXPOSE 22 | ||
|
||
# oracle jdk 8 | ||
RUN apt-get update && \ | ||
Review comment: We could also pull in the Java archive that we use in all our applications, but this isn't a blocker. |
||
apt-get install -y software-properties-common && \ | ||
add-apt-repository ppa:webupd8team/java && \ | ||
apt-get update && \ | ||
# to accept license agreement automatically | ||
echo debconf shared/accepted-oracle-license-v1-1 select true | debconf-set-selections && \ | ||
echo debconf shared/accepted-oracle-license-v1-1 seen true | debconf-set-selections && \ | ||
apt-get install -y oracle-java8-installer && \ | ||
rm -rf /var/lib/apt/lists/* | ||
|
||
# java env setup | ||
ENV JAVA_HOME /usr/lib/jvm/java-8-oracle | ||
Review comment: This is set on https://github.com/mesosphere/spark-build/pull/392/files#diff-0aa25f6cadbb637eae9df102b049a59dR5 as well. Rather just set it in one place.
Reply: 👍 deleted the first instance |
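Duplicate `ENV` keys like the `JAVA_HOME` pair discussed here can be spotted mechanically. A rough sketch, assuming a hypothetical helper `dup_env_keys` and single-key `ENV` lines in either the `ENV KEY value` or `ENV KEY=value` form:

```shell
# Hypothetical check: list ENV keys that a Dockerfile sets more than once.
# Only the first key on each ENV line is considered.
dup_env_keys() {
    awk '$1 == "ENV" {
             split($2, parts, "=")      # drop "=value" if present
             count[parts[1]]++
         }
         END { for (k in count) if (count[k] > 1) print k }' "$1" | sort
}
```

Run as `dup_env_keys Dockerfile`. On this Dockerfile it would also report `PATH`, which is extended on purpose in several places, so the output is a list of candidates to review rather than a list of errors.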
||
ENV PATH $PATH:$JAVA_HOME/bin | ||
|
||
# download cdh hadoop | ||
RUN curl -L http://archive.cloudera.com/cdh${CDH_VERSION}/cdh/${CDH_VERSION}/hadoop-${HADOOP_VERSION}-cdh${CDH_EXACT_VERSION}.tar.gz \ | ||
| tar -xzC /usr/local && \ | ||
cd /usr/local && \ | ||
ln -s ./hadoop-${HADOOP_VERSION}-cdh${CDH_EXACT_VERSION} hadoop | ||
|
||
# need to define JAVA_HOME inside hadoop-env.sh | ||
RUN sed -i '/^export JAVA_HOME/ s:.*:export JAVA_HOME=/usr/lib/jvm/java-8-oracle\n:' $HADOOP_PREFIX/etc/hadoop/hadoop-env.sh | ||
|
||
# pseudo distributed configurations of hadoop | ||
ADD templates/core-site.xml.template $HADOOP_PREFIX/etc/hadoop/core-site.xml.template | ||
ADD templates/hdfs-site.xml.template $HADOOP_PREFIX/etc/hadoop/hdfs-site.xml.template | ||
ADD conf/mapred-site.xml $HADOOP_PREFIX/etc/hadoop/mapred-site.xml | ||
ADD templates/yarn-site.xml.template $HADOOP_PREFIX/etc/hadoop/yarn-site.xml.template | ||
|
||
# add and set permissions for bootstrap script | ||
ADD scripts/hadoop-bootstrap.sh /etc/hadoop-bootstrap.sh | ||
RUN chown root:root /etc/hadoop-bootstrap.sh | ||
RUN chmod 700 /etc/hadoop-bootstrap.sh | ||
|
||
RUN chmod +x /usr/local/hadoop/etc/hadoop/*-env.sh | ||
|
||
# add hadoop to path | ||
ENV PATH $PATH:$HADOOP_HOME:$HADOOP_HOME/bin | ||
|
||
#for exposed ports refer | ||
#https://www.cloudera.com/documentation/enterprise/5-4-x/topics/cdh_ig_ports_cdh5.html | ||
EXPOSE 50010 50020 50070 50075 50090 8020 9000 10020 19888 8030 8031 8032 8033 8040 8042 8088 | ||
|
||
# download cdh hive | ||
RUN curl -L http://archive.cloudera.com/cdh${CDH_VERSION}/cdh/${CDH_VERSION}/hive-1.1.0-cdh${CDH_EXACT_VERSION}.tar.gz \ | ||
Review comment: Should the
Reply: fixed |
||
| tar -xzC /usr/local && \ | ||
cd /usr/local && \ | ||
mv hive-1.1.0-cdh${CDH_EXACT_VERSION} hive | ||
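The Hadoop download takes its version from `HADOOP_VERSION`, while the Hive download and the `mv` spell out `1.1.0` even though `HIVE_VERSION` is defined earlier. A sketch of how both URLs follow one pattern; the function name `cdh_tarball_url` is hypothetical and its defaults simply mirror the `ENV` values in this Dockerfile:

```shell
# Sketch: build the archive URL the same way the two curl lines do.
# The function name is hypothetical; defaults mirror the ENV values above.
cdh_tarball_url() {
    component="$1"   # e.g. hadoop or hive
    version="$2"     # e.g. the value of HADOOP_VERSION or HIVE_VERSION
    cdh="${CDH_VERSION:-5}"
    exact="${CDH_EXACT_VERSION:-5.11.0}"
    printf 'http://archive.cloudera.com/cdh%s/cdh/%s/%s-%s-cdh%s.tar.gz\n' \
        "$cdh" "$cdh" "$component" "$version" "$exact"
}

cdh_tarball_url hive 1.1.0
```

The sketch only formats the string; it does not check whether the archive still serves that path.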
|
||
# add hive to path | ||
ENV PATH $PATH:$HIVE_HOME/bin | ||
|
||
# add postgresql jdbc jar to classpath | ||
RUN ln -s /usr/share/java/postgresql-jdbc4.jar $HIVE_HOME/lib/postgresql-jdbc4.jar | ||
Review comment: Should this not be moved to AFTER the postgres install below?
Reply: Looks like it was copied from the parent project: https://github.com/tilakpatidar/cdh5_hive_postgres/blob/master/hive_pg/Dockerfile#L31 If I move it, |
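On the ordering question: `ln -s` does not require its target to exist, so the link created here stays dangling until the jar appears, presumably when `libpostgresql-jdbc-java` is installed further down (that the package provides `/usr/share/java/postgresql-jdbc4.jar` is an assumption, not something this diff shows). A throwaway demonstration in a temporary directory, with made-up file names:

```shell
# Demonstration in a temp directory: a symlink can be created before its
# target exists, and it resolves once the target appears.
demo_dangling_symlink() {
    dir="$(mktemp -d)"
    ln -s "$dir/driver.jar" "$dir/link.jar"       # target does not exist yet
    [ -e "$dir/link.jar" ] || echo "dangling"
    : > "$dir/driver.jar"                         # target created afterwards
    [ -e "$dir/link.jar" ] && echo "resolved"
    rm -rf "$dir"
}
```

So the current order can work as long as nothing reads the link between the two steps; moving the `ln -s` after the install would mainly make the intent easier to follow.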
||
|
||
# to configure postgres as hive metastore backend | ||
RUN sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt/ `lsb_release -cs`-pgdg main" >> /etc/apt/sources.list.d/pgdg.list' | ||
RUN wget -q https://www.postgresql.org/media/keys/ACCC4CF8.asc -O - | apt-key add - | ||
RUN apt-get update -y && \ | ||
apt-get -yq install vim postgresql-9.5 libpostgresql-jdbc-java && \ | ||
rm -rf /var/lib/apt/lists/* | ||
|
||
USER postgres | ||
# initialize hive metastore db | ||
# create metastore db, hive user and assign privileges | ||
RUN cd $HIVE_HOME/scripts/metastore/upgrade/postgres/ &&\ | ||
Review comment: Nit
Reply: fixed |
||
/etc/init.d/postgresql start &&\ | ||
psql --command "CREATE DATABASE metastore;" &&\ | ||
psql --command "CREATE USER hive WITH PASSWORD 'hive';" && \ | ||
psql --command "ALTER USER hive WITH SUPERUSER;" && \ | ||
psql --command "GRANT ALL PRIVILEGES ON DATABASE metastore TO hive;" && \ | ||
psql -U hive -d metastore -h localhost -f hive-schema-${HIVE_VERSION}.postgres.sql | ||
|
||
|
||
# revert back to default user | ||
USER root | ||
|
||
# disable ssl in postgres.conf | ||
ADD conf/postgresql.conf $POSTGRESQL_MAIN | ||
RUN echo $POSTGRESQL_MAIN | ||
Review comment: I don't find the
Reply: removed |
||
RUN echo $POSTGRESQL_CONFIG_FILE | ||
RUN chown postgres:postgres $POSTGRESQL_CONFIG_FILE | ||
RUN sed -i -e 's/peer/md5/g' /etc/postgresql/$POSTGRES_VERSION/main/pg_hba.conf | ||
|
||
# copy config, sql, data files to /opt/files | ||
RUN mkdir /opt/files | ||
RUN echo $HIVE_CONF | ||
ADD templates/hive-site.xml.template /opt/files/ | ||
ADD templates/hive-site.xml.template $HIVE_CONF/hive-site.xml.template | ||
|
||
# set permissions for hive bootstrap file | ||
ADD scripts/hive-bootstrap.sh /etc/hive-bootstrap.sh | ||
RUN chown root:root /etc/hive-bootstrap.sh | ||
RUN chmod 700 /etc/hive-bootstrap.sh | ||
|
||
EXPOSE 10000 10001 10002 10003 9083 50111 5432 | ||
|
||
# run bootstrap script | ||
CMD ["/etc/hive-bootstrap.sh", "-d"] |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,6 @@ | ||
<configuration> | ||
<property> | ||
<name>mapreduce.framework.name</name> | ||
<value>yarn</value> | ||
</property> | ||
</configuration> |
Review comment: We can handle this in a follow-up, but should we consider using the 18.04 LTS image?
Reply: I remember an email thread not long ago asking about DC/OS on 18.04, and it seemed like Mesos still had some issues to sort out. Let's hold off on this for now.