This repository was archived by the owner on Dec 4, 2024. It is now read-only.

[DCOS-39050] Added files for Hive Docker image #392

Open
wants to merge 16 commits into
base: master
Changes from 12 commits
61 changes: 61 additions & 0 deletions tools/hive/README.md
@@ -0,0 +1,61 @@
# Cloudera Hadoop and Hive Docker Image with Kerberos


This is a Docker image running the CDH5 versions of Hadoop and Hive in a single container. A separate Kerberos image runs Hadoop and Hive with Kerberos authentication. The image is adapted from https://github.com/tilakpatidar/cdh5_hive_postgres and based on Ubuntu 16.04 (xenial).

Postgres is also installed so that Hive can use it for its Metastore backend and run in remote mode.

## Current Version
* Hadoop 2.6.0
* Hive 1.1.0

## Dependencies
The Kerberos image assumes that a KDC has been launched by the dcos-commons kdc.py script.

## Build the image

### Build the Hadoop + Hive image:
```
cd hadoop-hive
docker build -t cdh5-hive .
```

### Build the Kerberized Hadoop + Hive image:
First, autogenerate the Hadoop config files.
```
cd ../kerberos
scripts/generate_configs.sh
```

Then build the image:
```
docker build -t cdh5-hive-kerberos .
```

## Run the Hive image interactively
```
docker run -it cdh5-hive:latest /etc/hive-bootstrap.sh -bash
```

## Run the Kerberized Hive image in DC/OS
First, deploy a KDC via the dcos-commons kdc.py utility. See [the kdc README](https://github.com/mesosphere/dcos-commons/tree/master/tools/kdc) for details.

From the dcos-commons repo:
```
PYTHONPATH=testing ./tools/kdc/kdc.py deploy principals.txt
```

At a minimum, `principals.txt` should include the following principals (for the Hadoop container hostname, pick any private agent in the cluster):

```
hdfs/<hostname of Hadoop container>@LOCAL
HTTP/<hostname of Hadoop container>@LOCAL
yarn/<hostname of Hadoop container>@LOCAL
hive/<hostname of Hadoop container>@LOCAL
```
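The four entries differ only in the service name, so the file can be generated instead of typed by hand. This is a minimal sketch, assuming a POSIX shell; the function name `make_principals`, the `HADOOP_HOST` variable and the example hostname `10.0.0.5` are hypothetical, and the `LOCAL` realm is taken from the list above.

```shell
#!/bin/sh
# Print the four principals listed above for one agent hostname.
# make_principals, HADOOP_HOST and the default hostname are illustrative only.
make_principals() {
    host="$1"
    for service in hdfs HTTP yarn hive; do
        printf '%s/%s@LOCAL\n' "$service" "$host"
    done
}

# Example: print the principals for a hypothetical private agent.
make_principals "${HADOOP_HOST:-10.0.0.5}"
```

Redirecting the output (for example `> principals.txt`) would produce a file in the format shown above.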

Deploy the Kerberized Hadoop / Hive container via Marathon. (Update the Marathon config's `constraint` field first with the host selected above.)

```
dcos marathon app add kerberos/marathon/hdfs-hive-kerberos.json
```
149 changes: 149 additions & 0 deletions tools/hive/hadoop-hive/Dockerfile
@@ -0,0 +1,149 @@
FROM ubuntu:16.04
Contributor

We can handle this in a follow-up, but should we consider using the 18.04 LTS image?

Contributor

I remember an email thread not long ago asking about DC/OS on 18.04, and it seemed like Mesos still had some issues to sort out.

Let's hold off on this for now.


USER root

ENV JAVA_HOME /usr/lib/jvm/java-8-oracle
Contributor

Considering that this is also being set elsewhere in this Dockerfile, I would suggest moving the envvars closer to where they are being used. This makes it easier to reason about what is being done and recognise unused variables. For example, POSTGRES_VERSION only seems to be used to reconstruct what is effectively POSTGRES_MAIN here: https://github.com/mesosphere/spark-build/pull/392/files#diff-0aa25f6cadbb637eae9df102b049a59dR133
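To illustrate the point about POSTGRES_VERSION: the three PostgreSQL paths in this Dockerfile differ only in the version segment, so they could be derived from a single value. The sketch below uses plain shell rather than Dockerfile syntax, and the lowercase variable names are hypothetical stand-ins for the ENV entries.

```shell
#!/bin/sh
# Derive the PostgreSQL paths from one version value instead of repeating "9.5".
# Lowercase names are illustrative stand-ins for the Dockerfile's ENV entries.
postgres_version="9.5"
postgresql_main="/var/lib/postgresql/${postgres_version}/main/"
postgresql_config_file="${postgresql_main}postgresql.conf"
postgresql_bin="/usr/lib/postgresql/${postgres_version}/bin/postgres"

echo "$postgresql_main"
echo "$postgresql_config_file"
echo "$postgresql_bin"
```

With this shape, a version bump touches one line, and an unused variable is easier to spot.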

ENV HADOOP_VERSION 2.6.0
ENV CDH_VERSION 5
ENV CDH_EXACT_VERSION 5.11.0
ENV HADOOP_HOME /usr/local/hadoop
ENV HADOOP_PREFIX /usr/local/hadoop
ENV HADOOP_CONF_DIR /usr/local/hadoop/etc/hadoop
ENV HIVE_HOME /usr/local/hive
ENV HIVE_CONF /usr/local/hive/conf
ENV HIVE_VERSION 1.1.0
ENV POSTGRES_VERSION 9.5
ENV POSTGRESQL_MAIN /var/lib/postgresql/9.5/main/
ENV POSTGRESQL_CONFIG_FILE /var/lib/postgresql/9.5/main/postgresql.conf
ENV POSTGRESQL_BIN /usr/lib/postgresql/9.5/bin/postgres
ENV PGPASSWORD hive

# install dev tools
RUN apt-get update && \
apt-get install -y curl wget tar openssh-server openssh-client rsync python-software-properties apt-file apache2 && \
rm -rf /var/lib/apt/lists/*

# for running sshd in ubuntu trusty. https://github.com/docker/docker/issues/5704
RUN mkdir /var/run/sshd
RUN echo 'root:secretpasswd' | chpasswd
RUN sed -i 's/PermitRootLogin without-password/PermitRootLogin yes/' /etc/ssh/sshd_config

# passwordless ssh
RUN yes | ssh-keygen -q -N "" -t dsa -f /etc/ssh/ssh_host_dsa_key
RUN yes | ssh-keygen -q -N "" -t rsa -f /etc/ssh/ssh_host_rsa_key
RUN yes | ssh-keygen -q -N "" -t rsa -f /root/.ssh/id_rsa
RUN cp /root/.ssh/id_rsa.pub /root/.ssh/authorized_keys

# fix the 254 error code
RUN sed -i "/^[^#]*UsePAM/ s/.*/#&/" /etc/ssh/sshd_config
RUN echo "UsePAM no" >> /etc/ssh/sshd_config
RUN echo "Port 2122" >> /etc/ssh/sshd_config
RUN /usr/sbin/sshd

# ssh client config
ADD conf/ssh_config /root/.ssh/config
RUN chmod 600 /root/.ssh/config
RUN chown root:root /root/.ssh/config

EXPOSE 22

# oracle jdk 8
RUN apt-get update && \
Contributor

We could also pull in the java archive that we use in all our applications, but this isn't a blocker.

apt-get install -y software-properties-common && \
add-apt-repository ppa:webupd8team/java && \
apt-get update && \
# to accept license agreement automatically
echo debconf shared/accepted-oracle-license-v1-1 select true | debconf-set-selections && \
echo debconf shared/accepted-oracle-license-v1-1 seen true | debconf-set-selections && \
apt-get install -y oracle-java8-installer && \
rm -rf /var/lib/apt/lists/*

# java env setup
ENV JAVA_HOME /usr/lib/jvm/java-8-oracle
Contributor

👍 deleted the first instance

ENV PATH $PATH:$JAVA_HOME/bin

# download cdh hadoop
RUN curl -L http://archive.cloudera.com/cdh${CDH_VERSION}/cdh/${CDH_VERSION}/hadoop-${HADOOP_VERSION}-cdh${CDH_EXACT_VERSION}.tar.gz \
| tar -xzC /usr/local && \
cd /usr/local && \
ln -s ./hadoop-${HADOOP_VERSION}-cdh${CDH_EXACT_VERSION} hadoop

# need to define JAVA_HOME inside hadoop-env.sh
RUN sed -i '/^export JAVA_HOME/ s:.*:export JAVA_HOME=/usr/lib/jvm/java-8-oracle\n:' $HADOOP_PREFIX/etc/hadoop/hadoop-env.sh

# pseudo distributed configurations of hadoop
ADD templates/core-site.xml.template $HADOOP_PREFIX/etc/hadoop/core-site.xml.template
ADD templates/hdfs-site.xml.template $HADOOP_PREFIX/etc/hadoop/hdfs-site.xml.template
ADD conf/mapred-site.xml $HADOOP_PREFIX/etc/hadoop/mapred-site.xml
ADD templates/yarn-site.xml.template $HADOOP_PREFIX/etc/hadoop/yarn-site.xml.template

# add and set permissions for bootstrap script
ADD scripts/hadoop-bootstrap.sh /etc/hadoop-bootstrap.sh
RUN chown root:root /etc/hadoop-bootstrap.sh
RUN chmod 700 /etc/hadoop-bootstrap.sh

RUN chmod +x /usr/local/hadoop/etc/hadoop/*-env.sh

# add hadoop to path
ENV PATH $PATH:$HADOOP_HOME:$HADOOP_HOME/bin

# for exposed ports, refer to
# https://www.cloudera.com/documentation/enterprise/5-4-x/topics/cdh_ig_ports_cdh5.html
EXPOSE 50010 50020 50070 50075 50090 8020 9000 10020 19888 8030 8031 8032 8033 8040 8042 8088

# download cdh hive
RUN curl -L http://archive.cloudera.com/cdh${CDH_VERSION}/cdh/${CDH_VERSION}/hive-1.1.0-cdh${CDH_EXACT_VERSION}.tar.gz \
Contributor

Should the 1.1.0 be ${HIVE_VERSION}?

Contributor

fixed
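For reference, a sketch of how the archive name expands once the Hive version is parameterized. It only builds and prints the strings (nothing is downloaded); the uppercase values are copied from the ENV lines in this Dockerfile, and the lowercase names `hive_archive` and `hive_url` are hypothetical.

```shell
#!/bin/sh
# Values copied from the ENV lines in this Dockerfile.
CDH_VERSION=5
CDH_EXACT_VERSION=5.11.0
HIVE_VERSION=1.1.0

# Archive name and URL with the Hive version taken from HIVE_VERSION
# rather than a hard-coded 1.1.0. hive_archive and hive_url are illustrative.
hive_archive="hive-${HIVE_VERSION}-cdh${CDH_EXACT_VERSION}"
hive_url="http://archive.cloudera.com/cdh${CDH_VERSION}/cdh/${CDH_VERSION}/${hive_archive}.tar.gz"

echo "$hive_archive"
echo "$hive_url"
```

The same `${HIVE_VERSION}` substitution would apply to the directory name in the later `mv` step.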

| tar -xzC /usr/local && \
cd /usr/local && \
mv hive-1.1.0-cdh${CDH_EXACT_VERSION} hive

# add hive to path
ENV PATH $PATH:$HIVE_HOME/bin

# add postgresql jdbc jar to classpath
RUN ln -s /usr/share/java/postgresql-jdbc4.jar $HIVE_HOME/lib/postgresql-jdbc4.jar
Contributor

Should this not be moved to AFTER the postgres install below?

Contributor

Looks like it was copied from the parent project: https://github.com/tilakpatidar/cdh5_hive_postgres/blob/master/hive_pg/Dockerfile#L31

If I move it, `docker build` runs fine, but I'd have to test it against the Hive integration PR to know everything works.
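On why the build passes in either order: `ln -s` does not check that its target exists, so the link is simply dangling until the jar appears (assuming the later PostgreSQL install step is what provides it). A small sketch of that behavior, using a temporary directory and stand-in file names, both hypothetical:

```shell
#!/bin/sh
# Show that `ln -s` succeeds when the target does not exist yet.
workdir="$(mktemp -d)"
target="$workdir/postgresql-jdbc4.jar"   # stand-in for the JDBC jar
link="$workdir/hive-lib-link.jar"        # stand-in for the link under $HIVE_HOME/lib

ln -s "$target" "$link"                  # target is missing, so the link dangles

before="dangling"
if [ -e "$link" ]; then before="resolves"; fi

: > "$target"                            # stand-in for the package install

after="dangling"
if [ -e "$link" ]; then after="resolves"; fi

echo "before install: $before"
echo "after install: $after"

rm -rf "$workdir"
```

So the ordering affects readability more than build success; whether Hive finds the driver at runtime still needs the integration test mentioned above.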


# to configure postgres as hive metastore backend
RUN sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt/ `lsb_release -cs`-pgdg main" >> /etc/apt/sources.list.d/pgdg.list'
RUN wget -q https://www.postgresql.org/media/keys/ACCC4CF8.asc -O - | apt-key add -
RUN apt-get update -y && \
apt-get -yq install vim postgresql-9.5 libpostgresql-jdbc-java && \
rm -rf /var/lib/apt/lists/*

USER postgres
# initialize hive metastore db
# create metastore db, hive user and assign privileges
RUN cd $HIVE_HOME/scripts/metastore/upgrade/postgres/ &&\
Contributor

Nit: `&&\` => `&& \`

Contributor

fixed

/etc/init.d/postgresql start &&\
psql --command "CREATE DATABASE metastore;" &&\
psql --command "CREATE USER hive WITH PASSWORD 'hive';" && \
psql --command "ALTER USER hive WITH SUPERUSER;" && \
psql --command "GRANT ALL PRIVILEGES ON DATABASE metastore TO hive;" && \
psql -U hive -d metastore -h localhost -f hive-schema-${HIVE_VERSION}.postgres.sql


# revert back to default user
USER root

# disable ssl in postgres.conf
ADD conf/postgresql.conf $POSTGRESQL_MAIN
RUN echo $POSTGRESQL_MAIN
Contributor

I don't find the echo commands really useful as we are only outputting envvars that are set to constant values in this Dockerfile. I think they were echoed before because they were implemented as build arguments and not envvars.

Contributor

removed

RUN echo $POSTGRESQL_CONFIG_FILE
RUN chown postgres:postgres $POSTGRESQL_CONFIG_FILE
RUN sed -i -e 's/peer/md5/g' /etc/postgresql/$POSTGRES_VERSION/main/pg_hba.conf

# copy config, sql, data files to /opt/files
RUN mkdir /opt/files
RUN echo $HIVE_CONF
ADD templates/hive-site.xml.template /opt/files/
ADD templates/hive-site.xml.template $HIVE_CONF/hive-site.xml.template

# set permissions for hive bootstrap file
ADD scripts/hive-bootstrap.sh /etc/hive-bootstrap.sh
RUN chown root:root /etc/hive-bootstrap.sh
RUN chmod 700 /etc/hive-bootstrap.sh

EXPOSE 10000 10001 10002 10003 9083 50111 5432

# run bootstrap script
CMD ["/etc/hive-bootstrap.sh", "-d"]
6 changes: 6 additions & 0 deletions tools/hive/hadoop-hive/conf/mapred-site.xml
@@ -0,0 +1,6 @@
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>