How to create multi-tenancy for
interactive data analysis with
JupyterHub & LDAP
Spark Cluster + Jupyter + LDAP
Introduction
With this presentation you should be able to create an architecture for a framework for
interactive data analysis, using a Cloudera Spark cluster with Kerberos, a Jupyter
machine with JupyterHub, and authentication via LDAP.
Architecture
This architecture enables the following:
● Transparent data-science development
● User Impersonation
● Authentication via LDAP
● Upgrades on the cluster won’t affect the developments.
● Controlled access to data and resources via Kerberos/Sentry.
● Several coding APIs (Scala, R, Python, PySpark, etc.).
● Two layers of security with Kerberos & LDAP
Architecture
Pre-Assumptions
1. Cluster hostname: cm1.localdomain Jupyter hostname: cm3.localdomain
2. Cluster Python version: 3.7.1
3. Cluster Manager: Cloudera Manager 5.12.2
4. Service Yarn & PIP Installed
5. Cluster Authentication Pre-Installed: Kerberos
a. Kerberos Realm DOMAIN.COM
6. Chosen IDE: Jupyter
7. JupyterHub Machine Authentication Not-Installed: Kerberos
8. AD Machine Installed with hostname: ad.localdomain
9. Java 1.8 installed in Both Machines
10. Cluster Spark version 2.2.0
Anaconda
Download and installation
su - root
wget https://repo.continuum.io/archive/Anaconda3-2018.12-Linux-x86_64.sh
chmod +x Anaconda3-2018.12-Linux-x86_64.sh
./Anaconda3-2018.12-Linux-x86_64.sh
Note 1: Change with your hostname and domain in the highlighted field.
Note 2: The SudoSpawner package requires Anaconda to be installed as the root user!
Note 3: JupyterHub requires Python 3.x, therefore Anaconda 3 is installed.
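Silent installation (optional)
The rest of this guide assumes Anaconda ends up under /opt/anaconda3. As a minimal sketch (same installer, using its batch-mode flags), the install can also be run unattended:
bash Anaconda3-2018.12-Linux-x86_64.sh -b -p /opt/anaconda3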
Anaconda
Path environment variables
export PATH=/opt/anaconda3/bin:$PATH
Java environment variables
export JAVA_HOME=/usr/java/jdk1.8.0_181-amd64/;
Spark environment variables
export SPARK_HOME=/opt/spark;
export SPARK_MASTER_IP=10.191.38.83;
Yarn environment variables
export YARN_CONF_DIR=/etc/hadoop/conf
Python environment variables
export PYTHONPATH=/opt/spark-2.2.0/python:/opt/spark-2.2.0/python/lib/py4j-0.10.4-src.zip;
export PYTHONSTARTUP=/opt/spark-2.2.0/python/pyspark/shell.py;
export PYSPARK_PYTHON=/usr/src/Python-3.7.1/python;
Note: Change with your values in the highlighted field.
Hadoop environment variables
export HADOOP_HOME=/etc/hadoop/conf;
export HADOOP_CONF_DIR=/etc/hadoop/conf;
Hive environment variables
export HIVE_HOME=/etc/hadoop/conf;
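Persist environment variables (optional)
The exports above only last for the current shell session. A minimal sketch for making them permanent, assuming a hypothetical /etc/profile.d/jupyter-env.sh file and the same paths as above:
cat >> /etc/profile.d/jupyter-env.sh <<'EOF'
export PATH=/opt/anaconda3/bin:$PATH
export JAVA_HOME=/usr/java/jdk1.8.0_181-amd64
export SPARK_HOME=/opt/spark
export YARN_CONF_DIR=/etc/hadoop/conf
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HIVE_HOME=/etc/hadoop/conf
EOF
source /etc/profile.d/jupyter-env.sh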
Anaconda
Validate installation
anaconda-navigator
Update Conda (Only if needed)
conda update -n base -c defaults conda
Start Jupyter Notebook (If non root)
jupyter-notebook --ip='10.111.22.333' --port 9001 --debug > /opt/anaconda3/log.txt 2>&1
Start Jupyter Notebook (if root)
jupyter-notebook --ip='10.111.22.333' --port 9001 --debug --allow-root > /opt/anaconda3/log.txt 2>&1
Note: it’s only necessary to change the highlighted values, e.g. your IP.
Jupyter or JupyterHub?
JupyterHub is a multi-user notebook server that:
● Manages authentication.
● Spawns single-user notebooks on demand.
● Gives each user a complete notebook server.
How to choose?
JupyterHub
Install JupyterHub Package (with Http-Proxy)
conda install -c conda-forge jupyterhub
Validate Installation
jupyterhub -h
Start JupyterHub Server
jupyterhub --ip='10.111.22.333' --port 9001 --debug > /opt/anaconda3/log.txt 2>&1
Note: it’s only necessary to change the highlighted values, e.g. your IP.
JupyterHub With LDAP
Install Simple LDAP Authenticator Plugin for JupyterHub
conda install -c conda-forge jupyterhub-ldapauthenticator
Install SudoSpawner
conda install -c conda-forge sudospawner
Install Package LDAP to be able to Create Users Locally
pip install jupyterhub-ldapcreateusers
Generate JupyterHub Config File
jupyterhub --generate-config
Note 1: it’s only necessary to change the highlighted values, e.g. your IP.
Note 2: SudoSpawner enables JupyterHub to spawn single-user servers without being root.
JupyterHub With LDAP
Configure JupyterHub Config File
nano /opt/anaconda3/jupyterhub_config.py
import os
import pwd
import subprocess
# Function to Create User Home
def create_dir_hook(spawner):
    if not os.path.exists(os.path.join('/home/', spawner.user.name)):
        subprocess.call(["sudo", "/sbin/mkhomedir_helper", spawner.user.name])
c.Spawner.pre_spawn_hook = create_dir_hook
c.JupyterHub.authenticator_class = 'ldapcreateusers.LocalLDAPCreateUsers'
c.LocalLDAPCreateUsers.server_address = 'ad.localdomain'
c.LocalLDAPCreateUsers.server_port = 3268
c.LocalLDAPCreateUsers.use_ssl = False
c.LocalLDAPCreateUsers.lookup_dn = True
# Instructions to define the LDAP search - does not take possible group users into account
c.LocalLDAPCreateUsers.bind_dn_template = ['CN={username},DC=ad,DC=localdomain']
c.LocalLDAPCreateUsers.user_search_base = 'DC=ad,DC=localdomain'
JupyterHub With LDAP
c.LocalLDAPCreateUsers.lookup_dn_search_user = 'admin'
c.LocalLDAPCreateUsers.lookup_dn_search_password = 'passWord'
c.LocalLDAPCreateUsers.lookup_dn_user_dn_attribute = 'CN'
c.LocalLDAPCreateUsers.user_attribute = 'sAMAccountName'
c.LocalLDAPCreateUsers.escape_userdn = False
c.JupyterHub.hub_ip = '10.111.22.333'
c.JupyterHub.port = 9001
# Instructions Required to Add User Home
c.LocalAuthenticator.add_user_cmd = ['useradd', '-m']
c.LocalLDAPCreateUsers.create_system_users = True
c.Spawner.debug = True
c.Spawner.default_url = 'tree/home/{username}'
c.Spawner.notebook_dir = '/'
c.PAMAuthenticator.open_sessions = True
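Validate LDAP connectivity (optional)
Before starting the Hub, it can help to confirm the values above directly against the AD machine. A minimal sketch with the standard ldapsearch client, assuming the same bind user and search base and an example account tpsimoes:
ldapsearch -x -H ldap://ad.localdomain:3268 -D "CN=admin,DC=ad,DC=localdomain" -W -b "DC=ad,DC=localdomain" "(sAMAccountName=tpsimoes)"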
Start JupyterHub Server With Config File
jupyterhub -f /opt/anaconda3/jupyterhub_config.py --debug
Note: it’s only necessary to change the highlighted values, e.g. your IP.
JupyterHub with LDAP + ProxyUser
As a reminder, to have ProxyUser working you will need Java 1.8 and the same Spark version on both machines (Cluster and
JupyterHub); this example uses 2.2.0.
[Cluster] Confirm Cluster Spark & Hadoop Version
spark-shell
hadoop version
[JupyterHub] Download Spark & Create Symbolic link
cd /tmp/
wget https://archive.apache.org/dist/spark/spark-2.2.0/spark-2.2.0-bin-hadoop2.6.tgz
tar zxvf spark-2.2.0-bin-hadoop2.6.tgz
mv spark-2.2.0-bin-hadoop2.6 /opt/spark-2.2.0
ln -s /opt/spark-2.2.0 /opt/spark
Note: change with your Spark and Hadoop version in the highlighted field.
JupyterHub with LDAP + ProxyUser
[Cluster] Copy Hadoop/Hive/Spark Config files
cd /etc/spark2/conf.cloudera.spark2_on_yarn/
scp * root@10.111.22.333:/etc/hadoop/conf/
[Cluster] HDFS ProxyUser
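In the original deck this step is a Cloudera Manager screenshot. As a sketch, the usual approach is to add the Hadoop proxyuser properties for the Hub's service user (assumed here to be jupyter, matching the keytab used below) to core-site.xml - in Cloudera Manager via the cluster-wide core-site.xml safety valve - and then redeploy the client configuration:
<property><name>hadoop.proxyuser.jupyter.hosts</name><value>*</value></property>
<property><name>hadoop.proxyuser.jupyter.groups</name><value>*</value></property>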
Note: change with your IP and directories in the highlighted field.
[JupyterHub] Create hadoop config files directory
mkdir -p /etc/hadoop/conf/
ln -s /etc/hadoop/conf/ conf.cloudera.yarn
[JupyterHub] Create spark-events directory
mkdir /tmp/spark-events
chown spark:spark /tmp/spark-events
chmod 777 /tmp/spark-events
[JupyterHub] Test Spark 2
spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn \
--num-executors 1 --driver-memory 512m --executor-memory 512m \
--executor-cores 1 --deploy-mode cluster \
--proxy-user tpsimoes --keytab /root/jupyter.keytab \
--conf spark.eventLog.enabled=true \
/opt/spark-2.2.0/examples/jars/spark-examples_2.11-2.2.0.jar 10;
JupyterHub with LDAP + ProxyUser
Check available kernel specs
jupyter kernelspec list
Install PySpark Kernel
conda install -c conda-forge pyspark
Confirm kernel installation
jupyter kernelspec list
Edit PySpark kernel
nano /opt/anaconda3/share/jupyter/kernels/pyspark/kernel.json
{"argv":
["/opt/anaconda3/share/jupyter/kernels/pyspark/python.sh", "-f", "{connection_file}"],
"display_name": "PySpark (Spark 2.2.0)", "language":"python" }
Create PySpark Script
cd /opt/anaconda3/share/jupyter/kernels/pyspark;
touch python.sh;
chmod a+x python.sh;
JupyterHub with LDAP + ProxyUser
The python.sh script was created because of limitations in the JupyterHub kernel configuration, which cannot obtain the
Kerberos credentials on its own, and because the LDAP package does not allow proxyUser the way Zeppelin does. Therefore,
with this architecture solution you are able to:
● Add a new layer of security that requires the IDE keytab
● Enable the usage of proxyUser via the Spark flag --proxy-user ${KERNEL_USERNAME}
Edit PySpark Script
touch /opt/anaconda3/share/jupyter/kernels/pyspark/python.sh;
nano /opt/anaconda3/share/jupyter/kernels/pyspark/python.sh;
#!/usr/bin/env bash
# setup environment variable, etc.
PROXY_USER="$(whoami)"
export JAVA_HOME=/usr/java/jdk1.8.0_181-amd64
export SPARK_HOME=/opt/spark
export SPARK_MASTER_IP=10.111.22.333
export HADOOP_HOME=/etc/hadoop/conf
JupyterHub with LDAP + ProxyUser
Edit PySpark Script
export YARN_CONF_DIR=/etc/hadoop/conf
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HIVE_HOME=/etc/hadoop/conf
export PYTHONPATH=/opt/spark-2.2.0/python:/opt/spark-2.2.0/python/lib/py4j-0.10.4-src.zip
export PYTHONSTARTUP=/opt/spark-2.2.0/python/pyspark/shell.py
export PYSPARK_PYTHON=/usr/src/Python-3.7.1/python
export PYSPARK_SUBMIT_ARGS="-v --master yarn --deploy-mode client --conf
spark.serializer=org.apache.spark.serializer.KryoSerializer --num-executors 2 --driver-memory 1024m --executor-memory 1024m
--executor-cores 2 --proxy-user "${PROXY_USER}" --keytab /tmp/jupyter.keytab pyspark-shell"
# Kinit user/keytab defined for the ProxyUser on the Cluster/HDFS
kinit -kt /tmp/jupyter.keytab jupyter/cm1.localdomain@DOMAIN.COM
# run the ipykernel
exec /opt/anaconda3/bin/python -m ipykernel "$@"
Note: change with your IP and directories in the highlighted field.
Interact with JupyterHub
Login
http://10.111.22.333:9001/hub/login
Notebook Kernel
To use JupyterLab without it being the default interface, you just have to
swap “tree” with “lab” in your browser URL:
http://10.111.22.333:9001/user/tpsimoes/lab
JupyterLab
JupyterLab is the next-generation web-based
interface for Jupyter.
Install JupyterLab
conda install -c conda-forge jupyterlab
Install JupyterLab Launcher
conda install -c conda-forge jupyterlab_launcher
JupyterLab
To use the JupyterLab interface as the default on JupyterHub, additional changes are required.
● Change the JupyterHub Config File
● Additional extensions (for the Hub Menu)
● Create config file for JupyterLab
Edit JupyterHub Config File
nano /opt/anaconda3/jupyterhub_config.py
...
# Change the values on this Flags
c.Spawner.default_url = '/lab'
c.Spawner.notebook_dir = '/home/{username}'
# Add this Flag
c.Spawner.cmd = ['jupyter-labhub']
JupyterLab
Install jupyterlab-hub extension
jupyter labextension install @jupyterlab/hub-extension
Create JupyterLab Config File
cd /opt/anaconda3/share/jupyter/lab/settings/
nano page_config.json
{
"hub_prefix": "/jupyter"
}
JupyterLab
The final architecture:
R, Hive and Impala on JupyterHub
In this section the focus is on R, Hive, Impala and a Kerberized kernel.
The R kernel requires libs on both machines (Cluster and Jupyter).
[Cluster & Jupyter] Install R Libs
yum install -y openssl-devel openssl libcurl-devel libssh2-devel
[Jupyter] Create SymLinks for R libs
ln -s /opt/anaconda3/lib/libssl.so.1.0.0 /usr/lib64/libssl.so.1.0.0;
ln -s /opt/anaconda3/lib/libcrypto.so.1.0.0 /usr/lib64/libcrypto.so.1.0.0;
[Cluster & Jupyter] To use SparkR
devtools::install_github('apache/spark@v2.2.0', subdir='R/pkg')
Note: Change with your values in the highlighted field.
[Cluster & Jupyter] Start R & Install Packages
R
install.packages('git2r')
install.packages('devtools')
install.packages('repr')
install.packages('IRdisplay')
install.packages('crayon')
install.packages('pbdZMQ')
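The packages above are the IRkernel dependencies; the deck stops there, so as a minimal sketch (assuming the IRkernel package) the R kernel itself still has to be installed and registered with Jupyter:
install.packages('IRkernel')
IRkernel::installspec(user = FALSE)   # register the R kernel system-wide so JupyterHub users can see it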
R, Hive and Impala on JupyterHub
To interact with Hive metadata and use its syntax directly, my recommendation is the HiveQL kernel.
Install Developer Toolset Libs
yum install cyrus-sasl-devel.x86_64 cyrus-sasl-gssapi.x86_64 cyrus-sasl-sql.x86_64 cyrus-sasl-plain.x86_64 gcc-c++
Install Python + Hive interface (SQLAlchemy interface for Hive)
pip install pyhive
Install HiveQL Kernel
pip install --upgrade hiveqlKernel
jupyter hiveql install
Confirm HiveQL Kernel installation
jupyter kernelspec list
R, Hive and Impala on JupyterHub
Edit HiveQL Kernel
cd /usr/local/share/jupyter/kernels/hiveql
nano kernel.json
{"argv":
["/usr/local/share/jupyter/kernels/hiveql/hiveql.sh", "-f", "{connection_file}"],
"display_name": "HiveQL", "language": "hiveql", "name": "hiveql"}
Create and Edit HiveQL script
touch /usr/local/share/jupyter/kernels/hiveql/hiveql.sh;
nano /usr/local/share/jupyter/kernels/hiveql/hiveql.sh;
#!/usr/bin/env bash
# setup environment variable, etc.
PROXY_USER="$(whoami)"
R, Hive and Impala on JupyterHub
Edit HiveQL script
export JAVA_HOME=/usr/java/jdk1.8.0_181-amd64
export SPARK_HOME=/opt/spark
export HADOOP_HOME=/etc/hadoop/conf
export YARN_CONF_DIR=/etc/hadoop/conf
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HIVE_HOME=/etc/hadoop/conf
export PYTHONPATH=/opt/spark-2.2.0/python:/opt/spark-2.2.0/python/lib/py4j-0.10.4-src.zip
export PYTHONSTARTUP=/opt/spark-2.2.0/python/pyspark/shell.py
export PYSPARK_PYTHON=/usr/src/Python-3.7.1/python
export HIVE_AUX_JARS_PATH=/etc/hadoop/postgresql-9.0-801.jdbc4.jar
export HADOOP_CLIENT_OPTS="-Xmx2147483648 -XX:MaxPermSize=512M -Djava.net.preferIPv4Stack=true"
# Kinit user/keytab defined for the ProxyUser on the Cluster/HDFS
kinit -kt /tmp/jupyter.keytab jupyter/cm1.localdomain@DOMAIN.COM
# run the hiveql kernel
exec /opt/anaconda3/bin/python -m hiveql "$@"
Note 1: change with your IP, directories and versions in the highlighted field.
Note 2: add your users' keytabs to a chosen directory so that it is possible to run with proxyUser.
R, Hive and Impala on JupyterHub
To interact with Impala metadata, my recommendation is Impyla, but there’s a catch: it needs a specific version of the
thrift_sasl lib, which breaks the HiveQL kernel, because hiveqlkernel 1.0.13 requires thrift-sasl==0.3.*.
Install Developer Toolset Libs
yum install cyrus-sasl-devel.x86_64 cyrus-sasl-gssapi.x86_64 cyrus-sasl-sql.x86_64 cyrus-sasl-plain.x86_64 gcc-c++
Install additional Libs for Impyla
pip install thrift_sasl==0.2.1; pip install sasl;
Install ipython-sql
conda install -c conda-forge ipython-sql
Install impyla
pip install impyla==0.15a1
Note: an alpha version of impyla was installed due to an incompatibility with Python versions 3.7 and above.
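Because of the thrift_sasl conflict above, a later slide assumes Impyla lives in its own environment; a minimal sketch of such an isolated environment (environment name is illustrative), which keeps the thrift_sasl pin away from the HiveQL kernel:
conda create -n impyla-env python=3.7 -y
source activate impyla-env
pip install thrift_sasl==0.2.1 sasl impyla==0.15a1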
R, Hive and Impala on JupyterHub
If you need access to Hive & Impala metadata, you can use Python + Hive with a kerberized custom kernel.
Install Jaydebeapi package
conda install -c conda-forge jaydebeapi
Create Python Kerberized Kernel
mkdir -p /usr/share/jupyter/kernels/pythonKerb
cd /usr/share/jupyter/kernels/pythonKerb
touch kernel.json
touch pythonKerb.sh
chmod a+x /usr/share/jupyter/kernels/pythonKerb/pythonKerb.sh
Note: Change with your values in the highlighted field.
Edit Kerberized Kernel
nano /usr/share/jupyter/kernels/pythonKerb/kernel.json
{"argv":
["/usr/share/jupyter/kernels/pythonKerb/pythonKerb.sh", "-f", "{connection_file}"],
"display_name": "PythonKerberized", "language": "python",
"name": "pythonKerb"}
Edit Kerberized Kernel script
nano /usr/share/jupyter/kernels/pythonKerb/pythonKerb.sh
R, Hive and Impala on JupyterHub
Edit Kerberized Kernel script
PROXY_USER="$(whoami)"
export JAVA_HOME=/usr/java/jdk1.8.0_181-amd64
export SPARK_HOME=/opt/spark
export HADOOP_HOME=/etc/hadoop/conf
export YARN_CONF_DIR=/etc/hadoop/conf
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HIVE_HOME=/etc/hadoop/conf
export PYTHONPATH=/opt/spark-2.2.0/python:/opt/spark-2.2.0/python/lib/py4j-0.10.4-src.zip
export PYTHONSTARTUP=/opt/spark-2.2.0/python/pyspark/shell.py
export PYSPARK_PYTHON=/usr/src/Python-3.7.1/python
export HIVE_AUX_JARS_PATH=/etc/hadoop/postgresql-9.0-801.jdbc4.jar
export HADOOP_CLIENT_OPTS="-Xmx2147483648 -XX:MaxPermSize=512M -Djava.net.preferIPv4Stack=true"
export CLASSPATH=$CLASSPATH:`hadoop classpath`:/etc/hadoop/*:/tmp/*
export PYTHONPATH=$PYTHONPATH:/opt/anaconda3/lib/python3.7/site-packages/jaydebeapi
# Kinit user/keytab defined for the ProxyUser on the Cluster/HDFS
kinit -kt /tmp/${PROXY_USER}.keytab ${PROXY_USER}@DOMAIN.COM
# run the ipykernel
exec /opt/anaconda3/bin/python -m ipykernel_launcher "$@"
R, Hive and Impala on JupyterHub
Assuming that you don't have Impyla installed, or if you do, that you have created a separate environment for it:
HiveQL is the best kernel for accessing Hive metadata and it is supported.
Install Developer Toolset Libs
yum install cyrus-sasl-devel.x86_64 cyrus-sasl-gssapi.x86_64 cyrus-sasl-sql.x86_64 cyrus-sasl-plain.x86_64 gcc-c++
Install Hive interface & HiveQL Kernel
pip install pyhive; pip install --upgrade hiveqlKernel;
Jupyter Install Kernel
jupyter hiveql install
Check kernel installation
jupyter kernelspec list
R, Hive and Impala on JupyterHub
To access a kerberized cluster you will need a Kerberos ticket in cache, so the solution is the following:
Edit Kerberized Kernel
nano /usr/local/share/jupyter/kernels/hiveql/kernel.json
{"argv":
["/usr/local/share/jupyter/kernels/hiveql/hiveql.sh", "-f", "{connection_file}"],
"display_name": "HiveQL", "language": "hiveql", "name": "hiveql"}
Edit Kerberized Kernel script
touch /usr/local/share/jupyter/kernels/hiveql/hiveql.sh
nano /usr/local/share/jupyter/kernels/hiveql/hiveql.sh
Note: Change with your values in the highlighted field.
R, Hive and Impala on JupyterHub
Edit Kerberized Kernel script
PROXY_USER="$(whoami)"
export JAVA_HOME=/usr/java/jdk1.8.0_181-amd64
export SPARK_HOME=/opt/spark
export HADOOP_HOME=/etc/hadoop/conf
export YARN_CONF_DIR=/etc/hadoop/conf
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HIVE_HOME=/etc/hadoop/conf
export PYTHONPATH=/opt/spark-2.2.0/python:/opt/spark-2.2.0/python/lib/py4j-0.10.4-src.zip
export PYTHONSTARTUP=/opt/spark-2.2.0/python/pyspark/shell.py
export PYSPARK_PYTHON=/usr/src/Python-3.7.1/python
export HIVE_AUX_JARS_PATH=/etc/hadoop/postgresql-9.0-801.jdbc4.jar
export HADOOP_CLIENT_OPTS="-Xmx2147483648 -XX:MaxPermSize=512M -Djava.net.preferIPv4Stack=true"
# Kinit user/keytab defined for the ProxyUser on the Cluster/HDFS
kinit -kt /tmp/${PROXY_USER}.keytab ${PROXY_USER}@DOMAIN.COM
# run the hiveql kernel
exec /opt/anaconda3/bin/python -m hiveql "$@"
Note: Change with your values in the highlighted field.
Interact with JupyterHub Kernels
The following information serves as a knowledge base on how to interact with the previously configured kernels against a
kerberized cluster.
[HiveQL] Create Connection
$$ url=hive://hive@cm1.localdomain:10000/
$$ connect_args={"auth": "KERBEROS","kerberos_service_name": "hive"}
$$ pool_size=5
$$ max_overflow=10
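Once these connection parameters are set, following cells can contain HiveQL directly; for example (database and table names are illustrative):
USE db;
SHOW TABLES;
SELECT * FROM my_table LIMIT 10;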
[Impyla] Create Connection
from impala.dbapi import connect
conn = connect(host='cm1.localdomain', port=21050, kerberos_service_name='impala', auth_mechanism='GSSAPI')
Note: Change with your values in the highlighted field.
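From the Impyla connection object, the standard Python DB-API cursor can then be used to run queries; a minimal sketch (table name is illustrative):
cursor = conn.cursor()
cursor.execute('SELECT * FROM db.my_table LIMIT 10')
for row in cursor.fetchall():
    print(row)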
Interact with JupyterHub Kernels
[Impyla] Create Connection via SQLMagic
%load_ext sql
%config SqlMagic.autocommit=False
%sql impala://tpsimoes:welcome1@cm1.localdomain:21050/db?kerberos_service_name=impala&auth_mechanism=GSSAPI
[Python] Create Connection
import jaydebeapi
import pandas as pd
conn_hive = jaydebeapi.connect("org.apache.hive.jdbc.HiveDriver", "jdbc:hive2://cm1.localdomain:10000/db;AuthMech=1;KrbRealm=DOMAIN.COM;KrbHostFQDN=cm1.localdomain;KrbServiceName=hive;KrbAuthType=2")
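With the pandas import above, query results can be pulled into a DataFrame through the same DB-API cursor; a minimal sketch (query is illustrative):
curs = conn_hive.cursor()
curs.execute("SELECT * FROM db.my_table LIMIT 10")
df = pd.DataFrame(curs.fetchall(), columns=[c[0] for c in curs.description])
df.head()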
[Python] Kinit Keytab
import subprocess
result = subprocess.run(['kinit', '-kt', '/tmp/tpsimoes.keytab', 'tpsimoes/cm1.localdomain@DOMAIN.COM'], stdout=subprocess.PIPE)
result.stdout
Note: Change with your values in the highlighted field.
Thanks
Big Data Engineer
Tiago Simões