Apache Spark, Big data, Spark installation, Map Reduce, SparkContext, Matei Zaharia, Scala, Machine Learning, Batch, Streaming, framework, Fault tolerance
The document is a presentation about using Apache Spark for Big Data.
[...] Let's start the spark-shell to check that the installation is OK. In a command window or terminal opened in Spark's bin directory, type and execute the following statement: ./spark-shell (or spark-shell.cmd for Windows users).
Spark installation
• TODO Windows :
• HADOOP_HOME = \spark-3.0.1\hadoop-2.7.1
• PATH = %SPARK_HOME%\bin
• HADOOP_HOME = SPARK_HOME
Spark vs. MapReduce
• MapReduce is the combination of two operations:
(diagram: Source Data → Map → Data → Map → Data → Map → Reduce → Data)
• MapReduce can only be used with HDFS
• Intermediate results are written to and read from HDFS
Spark vs. [...]
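Once the shell is up, a quick smoke test can confirm that everything works (a minimal sketch; sc is the SparkContext that spark-shell creates automatically):

sc.version                          // displays the Spark version, e.g. "3.0.1"
sc.parallelize(1 to 5).count()      // distributes a small collection and returns 5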
[...] myRDD.map(line => line.split(","))
  .map(t => t(1))
  .filter(t => t != "survived")   // skip the header row
  .map(t => t.toInt)
  .reduce((a, b) => a + b)
RDD
• Exercise
• Calculate the sum of the integers from 1 to 100 using Spark
• Exercise
• Read the file wolves.csv and display it in the console
• Create a case class Wolf and turn the RDD[String] into an RDD[Wolf]
• Display the RDD in the console.
• Exercise
Given the text of "Robur le conquérant" by Jules Verne:
1. read from the text file and create a first RDD
2. [...]
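A possible sketch for the first two exercises (the wolves.csv column layout — name, age, weight — is an assumption, not given in the slides):

// Exercise: sum of the integers from 1 to 100
val sum = sc.parallelize(1 to 100).reduce((a, b) => a + b)   // 5050

// Exercise: read wolves.csv, then turn the RDD[String] into an RDD[Wolf]
case class Wolf(name: String, age: Int, weight: Double)      // hypothetical columns
val lines = sc.textFile("wolves.csv")
lines.collect().foreach(println)                             // display the raw file
val wolves = lines.map(_.split(","))
                  .map(t => Wolf(t(0), t(1).toInt, t(2).toDouble))
wolves.collect().foreach(println)                            // display the RDD[Wolf]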
[...]
RDD
• Reminder: MapReduce uses iterative writes on HDFS, plus replication
• Spark introduces a new type of distributed object: the RDD
• Distribution and replication have a cost: the network infrastructure is sophisticated, with high bandwidth
Spark APIs
RDD
• Creation
• Structure
• Distribution & partitions
• Transformations
• Actions
• Examples of transformations and actions
RDD
• SparkContext creates an RDD from
• A data source (file, database, Kafka, ...)
• A collection (parallelized)
• A transformation
RDD
• From a file data source: using the textFile("/path/to/myFile") method
(diagram: HDFS → RDD)
RDD
• textFile method:
local:
val rdd1 = sc.textFile("/my/local/file.txt")
rdd1: spark.RDD[String] = ...
Hadoop / HDFS:
val rdd2 = sc.textFile("hdfs://my/.../file.txt")
rdd2: spark.RDD[String] = ...
[...]
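A minimal sketch of the three creation paths listed above (the paths and values are placeholders):

// 1. From a data source (here, a text file)
val fromFile = sc.textFile("/path/to/myFile")

// 2. From a collection, parallelized across the cluster
val fromCollection = sc.parallelize(Seq(1, 2, 3, 4, 5))

// 3. From a transformation of an existing RDD
val fromTransformation = fromCollection.map(x => x * 2)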
[...] val rdd2 = rdd1.filter(x => x [...]
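For context, a complete filter call might look like this (the predicate here is a hypothetical example; the slide's own predicate is elided):

val rdd1 = sc.textFile("/my/local/file.txt")
val rdd2 = rdd1.filter(x => x.nonEmpty)   // keep only the non-empty lines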
[...] create a transformation pipeline to count the total number of words in this text. [...]
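One way to build that pipeline (a sketch; the file name robur.txt is an assumption):

val words = sc.textFile("robur.txt")
              .flatMap(line => line.split("\\s+"))   // split each line into words
              .filter(word => word.nonEmpty)         // drop empty tokens
val totalWords = words.count()                       // total number of words in the text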