Apache Spark - Try it using docker!

INTRODUCTION
Apache Spark is an open-source tool for data analysis and transformation. It uses RDDs (Resilient Distributed Datasets): read-only datasets distributed over a cluster of machines and maintained in a fault-tolerant way.

It achieves high performance for both batch and streaming data. It uses:

- a DAG scheduler
- a query optimizer
- a physical execution engine
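
To make RDDs and the DAG scheduler concrete, here is a minimal PySpark sketch (assuming pyspark is installed; names like rdd-example are illustrative). Transformations such as filter are lazy and only extend the DAG; an action such as count triggers the scheduler to actually execute the job.

from pyspark import SparkContext

# Connect to a local in-process Spark instance
# (a cluster master URL such as spark://host:7077 works the same way).
sc = SparkContext("local[*]", "rdd-example")

# Create an RDD: Spark partitions the data across the available workers.
numbers = sc.parallelize(range(1, 1001))

# A transformation is lazy: this only adds a node to the DAG.
evens = numbers.filter(lambda n: n % 2 == 0)

# An action triggers the DAG scheduler and the physical execution engine.
print(evens.count())  # prints 500

sc.stop()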

DOCKER-SPARK
Docker is a tool that creates isolated environments (containers) on a host computer.
I use Docker every day to build and simulate different scenarios. Today I will explain how to create Apache Spark master and worker nodes.
docker-compose is a tool to create an environment consisting of a set of containers.
docker-spark is big-data-europe's GitHub repository. We can create the environment using a docker-compose.yaml file:

version: "3.3"

services:
  spark-master:
    image: bde2020/spark-master:2.4.0-hadoop2.7
    container_name: spark-master
    ports:
      - "8080:8080"
      - "7077:7077"
    environment:
      - INIT_DAEMON_STEP…
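
The file above is truncated. A complete sketch, modeled on the big-data-europe/docker-spark repository, could look like this (the INIT_DAEMON_STEP value and the worker service details are assumptions based on that repository's README):

version: "3.3"

services:
  spark-master:
    image: bde2020/spark-master:2.4.0-hadoop2.7
    container_name: spark-master
    ports:
      - "8080:8080"   # master web UI
      - "7077:7077"   # port that workers and drivers connect to
    environment:
      - INIT_DAEMON_STEP=setup_spark   # assumed value, taken from the repository
  spark-worker-1:
    image: bde2020/spark-worker:2.4.0-hadoop2.7
    container_name: spark-worker-1
    depends_on:
      - spark-master
    ports:
      - "8081:8081"   # worker web UI
    environment:
      - "SPARK_MASTER=spark://spark-master:7077"

With this file in place, docker-compose up -d starts both containers, and the master web UI should be reachable at http://localhost:8080.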