Install Apache Zeppelin and connect it to AWS Athena for data exploration, visualization and collaboration

Introduction

Apache Zeppelin is a web-based, open-source notebook system that enables data-driven, interactive data analytics and collaborative documents with SQL. At the Imperva Research Group we use it daily to query data from our Threat Research Data Lake with the AWS Athena query engine.

Zeppelin and Athena give our researchers and data scientists great power. Here are our main use cases:

  • Data exploration and quick visualization of data.
  • Dashboards for common use cases.
  • Sharing queries and visualizations for collaboration.

Here is how the data gets from S3 to the web client:

Flow from S3 to client

Recently we wanted to install a new, secure Zeppelin server on an EC2 instance and connect it to Athena. Since the installation was not as smooth as we expected, and we had to combine multiple documentation sources and try many options to make it work, we decided to share our script and document what we did for others.

Read more if you want to quickly install Zeppelin and connect it to Athena. You will be able to use it and benefit from it right away.

Resources

Before you begin you will need an EC2 instance with:

  • SSH access 
  • Docker installed
  • Port 443 open (for HTTPS access)

Access to the open ports on the instance should be restricted. You can also add authentication and authorization on top of the network restrictions; this can be done in many ways and is out of scope for this post.
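
One possible way, assuming you manage access with a security group and the AWS CLI (the group ID and CIDR below are placeholders), is:

# allow HTTPS only from a trusted network range (placeholder values)
aws ec2 revoke-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 443 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 443 --cidr 198.51.100.0/24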

Access rights

For security reasons we wanted to avoid storing credentials, so we used an instance role. The instance role should have the following access rights (a rough policy sketch follows the list):

  • S3 read access to all of the data you want to query with Athena
  • Athena and Glue access
  • S3 write access to:
    • A temp location where Athena writes query results. Example: s3://my-bucket/temp
    • The S3 folder where Zeppelin writes notebooks. Example: s3://my-bucket/zeppelin-notebooks
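
As a starting point, here is a rough IAM policy sketch (the bucket name and prefixes are placeholders; the Athena and Glue actions are intentionally broad and should be scoped down to your needs):

{
  "Version": "2012-10-17",
  "Statement": [
    { "Effect": "Allow", "Action": ["athena:*", "glue:Get*", "glue:BatchGet*"], "Resource": "*" },
    { "Effect": "Allow", "Action": ["s3:ListBucket", "s3:GetBucketLocation", "s3:GetObject"],
      "Resource": ["arn:aws:s3:::my-bucket", "arn:aws:s3:::my-bucket/*"] },
    { "Effect": "Allow", "Action": ["s3:PutObject", "s3:DeleteObject"],
      "Resource": ["arn:aws:s3:::my-bucket/temp/*", "arn:aws:s3:::my-bucket/zeppelin-notebooks/*"] }
  ]
}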

Once you have the instance ready and the role set, you are ready to bring up Zeppelin.

Zeppelin installation

SSL configuration

A keystore file is needed; you can create one by following the instructions in the Zeppelin documentation: Apache Zeppelin Configuration.
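
If a self-signed certificate is enough for your setup, the keystore can be generated with keytool, for example (the alias, passwords and hostname are placeholders; for production, import a CA-signed certificate as the documentation describes):

keytool -genkeypair -alias zeppelin -keyalg RSA -keysize 2048 -validity 365 \
        -dname "CN=your-zeppelin-host" -keystore keystore -storepass my-password -keypass my-password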

To create an obfuscated password Java is needed, so I created a container with Java, downloaded the Jetty util jar and ran the obfuscation command:

[root@host]# docker run --rm -it openjdk:alpine /bin/ash
/ # wget https://repo1.maven.org/maven2/org/eclipse/jetty/jetty-util/9.4.27.v20200227/jetty-util-9.4.27.v20200227.jar
Connecting to repo1.maven.org (151.101.200.209:443)
jetty-util-9.4.27.v2 100% |*********************************************************************************************|   538k  0:00:00 ETA
/ # java -cp jetty-util-9.4.27.v20200227.jar org.eclipse.jetty.util.security.Password my-password
2020-12-24 09:08:53.753:INFO::main: Logging initialized @166ms to org.eclipse.jetty.util.log.StdErrLog
my-password
OBF:1uh41zly1jy51yt41v1p1ym71v2p1yti1k1t1zlk1ugm
MD5:bedc452793b9d8fb2de4b26256759777
/ # exit
[root@host]# 

Once you have the obfuscated password and the keystore file, you can create the Zeppelin container.

Container creation

The following bash script reads the obfuscated password and spins up a new Zeppelin container:

echo -n "Enter the obfuscated password (including the OBF: prefix): "
read -s obf_password

docker run --name zep -d -p 443:8443 \
        -v /your-path/keystore:/zeppelin/keystore:ro \
        -e ZEPPELIN_ADDR=0.0.0.0 \
        -e ZEPPELIN_NOTEBOOK_STORAGE=org.apache.zeppelin.notebook.repo.S3NotebookRepo \
        -e ZEPPELIN_NOTEBOOK_S3_BUCKET=bucket-to-save-notebooks \
        -e ZEPPELIN_NOTEBOOK_S3_USER=zeppelin-notebooks \
        -e ZEPPELIN_SSL=true \
        -e ZEPPELIN_SSL_KEYSTORE_PATH=/zeppelin/keystore \
        -e ZEPPELIN_SSL_KEYSTORE_PASSWORD=$obf_password \
        -e HOME=/zeppelin \
        -e ZEPPELIN_HELIUM_REGISTRY=helium,https://s3.amazonaws.com/helium-package/helium.json \
        apache/zeppelin:0.9.0

Update the script with your keystore path, the bucket for saving notebooks, and the folder for saving notebooks (the default is “zeppelin-notebooks”).

After running the script, Zeppelin will be running with SSL. You can verify it by browsing to the instance. Next you have to configure interpreters and other components using the Zeppelin REST API.
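
For example, from the instance itself you can check that the REST API responds (--insecure is only needed with a self-signed certificate):

curl --insecure https://localhost/api/version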

Athena Interpreter(s)

The following script adds interpreters to Zeppelin. It first downloads the Athena JDBC jar and copies it to the container, and then configures the interpreters according to the template and parameters.

Different interpreters are needed for different purposes, for example running Athena in different regions.

Save the template below as a file called athena.json, and update the script with your buckets, regions and interpreter names:

# download jar and copy it to container
wget https://s3.amazonaws.com/athena-downloads/drivers/JDBC/SimbaAthenaJDBC-2.0.16.1000/AthenaJDBC42.jar
mv AthenaJDBC42.jar AthenaJDBC.jar
docker cp AthenaJDBC.jar zep:/usr/local/

# create interpreter(s)
sed -e s/'${INT_SUFFIX}'/_us/ -e s/'${REGION}'/us-east-1/ -e s/'${BUCKET}'/my-first-bucket/ athena.json > tmp_athena.json
curl --header "Content-Type: application/json" -X POST --insecure --data @tmp_athena.json  https://localhost/api/interpreter/setting/

sed -e s/'${INT_SUFFIX}'/_eu/ -e s/'${REGION}'/eu-west-1/ -e s/'${BUCKET}'/my-second-bucket/ athena.json > tmp_athena.json
curl --header "Content-Type: application/json" -X POST --insecure --data @tmp_athena.json  https://localhost/api/interpreter/setting/

Interpreter configuration template

The template is a JSON file with the minimal configuration required by the driver. The bash script replaces the placeholders with the actual parameters:

{
    "group": "jdbc",
    "id": "athena${INT_SUFFIX}",
    "name": "athena${INT_SUFFIX}",
    "option": {
        "perNote": "shared",
        "perUser": "shared"
    },
    "properties": {
        "default.aws_credentials_provider_class": {
            "name": "default.aws_credentials_provider_class",
            "type": "string",
            "value": "com.simba.athena.amazonaws.auth.InstanceProfileCredentialsProvider"
        },
        "default.driver": {
            "name": "default.driver",
            "type": "string",
            "value": "com.simba.athena.jdbc.Driver"
        },
        "default.s3_staging_dir": {
            "name": "default.s3_staging_dir",
            "type": "string",
            "value": "s3://${BUCKET}/temp/zeppelin/"
        },
        "default.url": {
            "name": "default.url",
            "type": "string",
            "value": "jdbc:awsathena://athena.${REGION}.amazonaws.com:443/"
        }
    },
    "dependencies": [
        {
            "groupArtifactVersion": "/usr/local/AthenaJDBC.jar"
        }
    ]
}

The default.s3_staging_dir parameter value must be an S3 folder under a bucket in the same region where you query Athena, and the instance role must have write permissions to it.

Now you can go back to the UI, create a new notebook and try to query Athena. You can do so by choosing the interpreter and running a simple SQL query. Here is an example:

Zeppelin Query Example
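
In text form, a paragraph like the following should work; %athena_us matches the interpreter name created above, and SHOW DATABASES is just a simple query to confirm connectivity:

%athena_us
SHOW DATABASES;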

The notebook is stored in S3, under the folder you chose in the container creation section. You can verify the object exists and look at it.
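
For example, with the bucket and folder used in the container creation script, a listing along these lines should show the saved note.json objects:

aws s3 ls s3://bucket-to-save-notebooks/zeppelin-notebooks/ --recursive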

Helium components

Helium components are an optional set of visualization components you can enable from the Zeppelin UI or through the API. Here is a small script to enable some of them automatically:

for comp_name in 'zeppelin-number' 'ultimate-heatmap-chart' 'ultimate-scatter-chart' 'ultimate-range-chart' \
                 'ultimate-column-chart' 'ultimate-pie-chart' 'ultimate-dual-column-chart' 'ultimate-line-chart'
do
  curl --insecure -X "POST" https://localhost/api/helium/enable/$comp_name
done

Now you have more visualizations to use, like a bar chart with more options than the basic one, or a heat map.

Upgrade

If you upgrade from an older version to Zeppelin 0.9, you will need to upgrade your notebooks. The upgrade guide can help you.

To migrate your notebooks, copy the old notebook data to a subfolder called notebook under your notebooks folder, for example /zeppelin-notebooks/notebook/. You can do it from the S3 console.
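
If you prefer the CLI over the console, a copy along these lines should work (the bucket and folder names match the earlier examples; --exclude keeps the new notebook/ subfolder out of the copy):

aws s3 cp s3://bucket-to-save-notebooks/zeppelin-notebooks/ \
          s3://bucket-to-save-notebooks/zeppelin-notebooks/notebook/ \
          --recursive --exclude "notebook/*"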

Then run the following command:

docker exec -it zep /bin/upgrade-note.sh -d

The -d option deletes the old notebooks (they are no longer needed in S3). You can go to the Zeppelin main page, click the notebook refresh button, and try to open one of your notebooks.

Summary

Apache Zeppelin is a great tool for data exploration, visualization and collaboration. Together with Athena, it made our data lake more accessible, encouraged collaboration, and improved productivity.

If you are already using Athena you can get there too. Once you have it up and running, the next thing to do is improve your SQL skills and get the best out of it.