

Apache Spark - Try it using Docker!

INTRODUCTION

Apache Spark is an open-source tool for data analysis and transformation. It is built around the RDD (Resilient Distributed Dataset), a read-only dataset distributed over a cluster of machines and maintained in a fault-tolerant way. Spark achieves high performance on both batch and streaming data. It uses:

- a DAG scheduler
- a query optimizer
- a physical execution engine

DOCKER-SPARK

Docker is a tool to create isolated systems on a host computer. I use Docker every day to create and simulate different scenarios. Today I will explain how to create an Apache Spark master node and a worker node.

docker-compose is a tool to create an environment made up of a set of containers. docker-spark is big-data-europe's GitHub repository. We can create an environment using a docker-compose.yaml file:

version: "3.3"
services:
  spark-master:
    image: bde2020/spark-master:2.4.0-hadoop2.7
    container_name: spark-master
    ports:
      - "8080:8080"   # Spark master web UI
      - "7077:7077"   # Spark master port, used by workers to connect
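
The file above only defines the master. To get the master-and-worker setup described in this post, a worker service can be added under services: in the same file. The following is a minimal sketch based on the pattern used in the big-data-europe/docker-spark repository; the service name spark-worker-1 and the port mapping are assumptions taken from that repository's examples:

  spark-worker-1:
    image: bde2020/spark-worker:2.4.0-hadoop2.7   # worker image matching the master version
    container_name: spark-worker-1                # assumed name, any unique name works
    depends_on:
      - spark-master                              # start the master first
    ports:
      - "8081:8081"                               # worker web UI
    environment:
      - "SPARK_MASTER=spark://spark-master:7077"  # points the worker at the master service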
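With both services in place, the environment can be started with docker-compose up -d. The master web UI should then be reachable at http://localhost:8080, where the registered worker should appear, and the worker UI at http://localhost:8081.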