Apache Spark - Try it using Docker!




INTRODUCTION


Apache Spark is an open-source tool for data analysis and transformation. It is built around the RDD (Resilient Distributed Dataset), a read-only dataset distributed over a cluster of machines and maintained in a fault-tolerant way.

It achieves high performance for both batch and streaming data by using:

  • a DAG scheduler
  • a query optimizer
  • a physical execution engine


DOCKER-SPARK


Docker is a tool to create isolated systems (containers) on a host computer.

I use Docker every day to create and simulate different scenarios. Today I will explain how to create an Apache Spark master node and worker nodes.

docker-compose is a tool to create an environment consisting of a set of containers.

docker-spark is big-data-europe's GitHub repository. Using its images, we can create an environment with the following docker-compose.yaml file:


version: "3.3"

services:
  spark-master:
    image: bde2020/spark-master:2.4.0-hadoop2.7
    container_name: spark-master
    ports:
      - "8080:8080"
      - "7077:7077"
    environment:
      - INIT_DAEMON_STEP=setup_spark
  spark-worker-1:
    image: bde2020/spark-worker:2.4.0-hadoop2.7
    container_name: spark-worker-1
    depends_on:
      - spark-master
    ports:
      - "8081:8080"
    environment:
      - "SPARK_MASTER=spark://spark-master:7077"
  spark-worker-2:
    image: bde2020/spark-worker:2.4.0-hadoop2.7
    container_name: spark-worker-2
    depends_on:
      - spark-master
    ports:
      - "8082:8080"
    environment:
      - "SPARK_MASTER=spark://spark-master:7077"

Now we can create a cluster on our machine with a single command:
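
This is a minimal sketch of that step, assuming the file above is saved as docker-compose.yaml in the current directory:

# start the master and the two workers in the background
docker-compose up -d

# the three containers should now be listed as "Up"
docker-compose ps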




Now my Spark cluster is running!

A LITTLE DEEPER

Apache Spark uses a master-slave architecture; in our cluster there are:
  1. one master node
    • this node exposes the WebUI (see below) and manages the jobs running on the workers
  2. two workers
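
A quick sanity check of this layout, assuming the port mappings from the compose file above (the exact log message may vary between Spark versions):

# master WebUI: http://localhost:8080 (registered workers, running applications)
# worker WebUIs: http://localhost:8081 and http://localhost:8082
docker logs spark-master | grep -i "registering worker"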

RUN AN EXAMPLE

Now we are able to submit a new job to the Spark cluster or open a spark-shell connected to it (see below):
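
Here is a sketch of both options; I am assuming the Spark installation inside the bde2020 images lives under /spark and that the bundled examples jar is named as below (both may differ in other image versions):

# interactive shell connected to the cluster
docker exec -it spark-master /spark/bin/spark-shell --master spark://spark-master:7077

# or submit the bundled SparkPi example as a batch job
docker exec -it spark-master /spark/bin/spark-submit \
  --master spark://spark-master:7077 \
  --class org.apache.spark.examples.SparkPi \
  /spark/examples/jars/spark-examples_2.11-2.4.0.jar 100

Once the job completes, it shows up in the master WebUI under "Completed Applications".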


It is so easy to use Spark with Docker, but...

WARNING

Older Java 8 versions are not able to retrieve the right resource information (CPU and memory limits) inside a container; read this post to fix it (external resource).
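
I will not duplicate the linked post here, but as a rough sketch of the kind of workaround it describes: recent Java 8 updates can be told to respect the container's cgroup limits with two experimental JVM flags (newer Java versions do this by default), and Spark lets you pass them to the driver and the executors:

# cgroup-awareness flags for Java 8 (default behaviour in newer JVMs)
JVM_OPTS="-XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForJVM"

docker exec -it spark-master /spark/bin/spark-submit \
  --master spark://spark-master:7077 \
  --driver-java-options "$JVM_OPTS" \
  --conf "spark.executor.extraJavaOptions=$JVM_OPTS" \
  --class org.apache.spark.examples.SparkPi \
  /spark/examples/jars/spark-examples_2.11-2.4.0.jar 100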

PRODUCTION?


This guide describes a local cluster used for test or development purposes. But, I am sure, you are asking: "Can I use this configuration in a real production cluster?"

Docker has a built-in cluster manager called Swarm that links different machines together. It is stable and easy to use!
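
As a rough sketch of the Swarm workflow (the compose file needs small adjustments for swarm mode, for example container_name is ignored by docker stack deploy):

# on the manager machine
docker swarm init

# on every worker machine, using the token printed by the previous command
docker swarm join --token <token> <manager-ip>:2377

# back on the manager: deploy the compose file as a stack
docker stack deploy -c docker-compose.yaml spark
docker stack services spark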

YES: Docker Swarm and Apache Spark can be used in a real production cluster!

Goodbye and have fun with Apache Spark!
