Apache Spark - Try it using docker!

INTRODUCTION
Apache Spark is an open-source tool for data analysis and transformation. It uses RDDs (Resilient Distributed Datasets): read-only datasets distributed over a cluster of machines and maintained in a fault-tolerant way.

It achieves high performance for both batch and streaming data. It uses:

- a DAG scheduler
- a query optimizer
- a physical execution engine
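
To make RDDs and the DAG scheduler concrete, here is a minimal PySpark sketch (assuming pyspark is installed; names like rdd-example are illustrative). Transformations such as filter are lazy and only extend the DAG; an action such as count triggers the scheduler to actually execute the job.

from pyspark import SparkContext

# Connect to a local in-process Spark instance
# (a cluster master URL such as spark://host:7077 works the same way).
sc = SparkContext("local[*]", "rdd-example")

# Create an RDD: Spark partitions the data across the available workers.
numbers = sc.parallelize(range(1, 1001))

# A transformation is lazy: this only adds a node to the DAG.
evens = numbers.filter(lambda n: n % 2 == 0)

# An action triggers the DAG scheduler and the physical execution engine.
print(evens.count())  # prints 500

sc.stop()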

DOCKER-SPARK
Docker is a tool that creates isolated environments (containers) on a host computer.
I use Docker every day to build and simulate different scenarios. Today I will explain how to create Apache Spark master and worker nodes.
docker-compose is a tool to create an environment consisting of a set of containers.
docker-spark is big-data-europe's GitHub repository. We can create the environment using a docker-compose.yaml file:

version: "3.3"

services:
  spark-master:
    image: bde2020/spark-master:2.4.0-hadoop2.7
    container_name: spark-master
    ports:
      - "8080:8080"
      - "7077:7077"
    environment:
      - INIT_DAEMON_STEP…
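
The file above is truncated. A complete sketch, modeled on the big-data-europe/docker-spark repository, could look like this (the INIT_DAEMON_STEP value and the worker service details are assumptions based on that repository's README):

version: "3.3"

services:
  spark-master:
    image: bde2020/spark-master:2.4.0-hadoop2.7
    container_name: spark-master
    ports:
      - "8080:8080"   # master web UI
      - "7077:7077"   # port that workers and drivers connect to
    environment:
      - INIT_DAEMON_STEP=setup_spark   # assumed value, taken from the repository
  spark-worker-1:
    image: bde2020/spark-worker:2.4.0-hadoop2.7
    container_name: spark-worker-1
    depends_on:
      - spark-master
    ports:
      - "8081:8081"   # worker web UI
    environment:
      - "SPARK_MASTER=spark://spark-master:7077"

With this file in place, docker-compose up -d starts both containers, and the master web UI should be reachable at http://localhost:8080.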