Apache Spark, Big data, Spark installation, Map Reduce, SparkContext, Matei Zaharia, Scala, Machine Learning, Batch, Streaming, framework, Fault tolerance
The document is a presentation about using Apache Spark for Big Data.
[...] Let's start the spark-shell to check that the installation is OK. In a command window or terminal opened in Spark's bin directory, type and execute the following statement: ./spark-shell (or spark-shell.cmd for Windows users).
Spark installation
• TODO Windows :
• HADOOP_HOME = \spark-3.0.1\hadoop-2.7.1
• PATH = %SPARK_HOME%\bin
• HADOOP_HOME = SPARK_HOME
Spark vs. MapReduce
• MapReduce is the combination of two operations:
(diagram: Source Data → Map → Data → Map → Data → Map → Reduce → Data)
• MapReduce can only be used with HDFS
• Intermediate results are written to and read from HDFS
Spark vs. [...]
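Once the shell is up, a quick smoke test can confirm that everything works (a minimal sketch; sc is the SparkContext that spark-shell creates automatically):

sc.version                          // displays the Spark version, e.g. "3.0.1"
sc.parallelize(1 to 5).count()      // distributes a small collection and returns 5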
[...] myRDD.map(line => line.split(","))
  .map(t => t(1))
  .filter(t => t != "survived")   // skip the header row
  .map(t => t.toInt)
  .reduce((a, b) => a + b)
RDD
• Exercise
• Calculate the sum of the integers from 1 to 100 using Spark
• Exercise
• Read the file wolves.csv and display it in the console
• Create a case class Wolf and turn the RDD[String] into an RDD[Wolf]
• Display the RDD in the console.
• Exercise
Given the text of "Robur le conquérant" by Jules Verne:
1. read from the text file and create a first RDD
2. [...]
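A possible sketch for the first two exercises (the wolves.csv column layout — name, age, weight — is an assumption, not given in the slides):

// Exercise: sum of the integers from 1 to 100
val sum = sc.parallelize(1 to 100).reduce((a, b) => a + b)   // 5050

// Exercise: read wolves.csv, then turn the RDD[String] into an RDD[Wolf]
case class Wolf(name: String, age: Int, weight: Double)      // hypothetical columns
val lines = sc.textFile("wolves.csv")
lines.collect().foreach(println)                             // display the raw file
val wolves = lines.map(_.split(","))
                  .map(t => Wolf(t(0), t(1).toInt, t(2).toDouble))
wolves.collect().foreach(println)                            // display the RDD[Wolf]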
[...]
RDD
• Reminder: MapReduce uses iterative writes on HDFS, plus replication
• Spark introduces a new type of distributed object: the RDD
• Distribution and replication have a cost: the network infrastructure is sophisticated, with high bandwidth
Spark APIs
RDD
• Creation
• Structure
• Distribution & partitions
• Transformations
• Actions
• Examples of transformations and actions
RDD
• SparkContext creates an RDD from
• A data source (file, database, Kafka, ...)
• A collection (parallelized)
• A transformation
RDD
• From a file data source: using the textFile("/path/to/myFile") method
(diagram: HDFS → RDD)
RDD
• textFile method:
local:
val rdd1 = sc.textFile("/my/local/file.txt")
rdd1: spark.RDD[String] = ...
Hadoop / HDFS:
val rdd2 = sc.textFile("hdfs://my/.../file.txt")
rdd2: spark.RDD[String] = ...
[...]
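A minimal sketch of the three creation paths listed above (the paths and values are placeholders):

// 1. From a data source (here, a text file)
val fromFile = sc.textFile("/path/to/myFile")

// 2. From a collection, parallelized across the cluster
val fromCollection = sc.parallelize(Seq(1, 2, 3, 4, 5))

// 3. From a transformation of an existing RDD
val fromTransformation = fromCollection.map(x => x * 2)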
[...] val rdd2 = rdd1.filter(x => x [...]
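For context, a complete filter call might look like this (the predicate here is a hypothetical example; the slide's own predicate is elided):

val rdd1 = sc.textFile("/my/local/file.txt")
val rdd2 = rdd1.filter(x => x.nonEmpty)   // keep only the non-empty lines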
[...] create a transformation pipeline to count the total number of words in this text. [...]
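One way to build that pipeline (a sketch; the file name robur.txt is an assumption):

val words = sc.textFile("robur.txt")
              .flatMap(line => line.split("\\s+"))   // split each line into words
              .filter(word => word.nonEmpty)         // drop empty tokens
val totalWords = words.count()                       // total number of words in the text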