SparkContext

This is the main entry point to Spark functionality. (The examples below use the Python API, PySpark; the same operations exist in the Scala API.)
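A minimal sketch of constructing one in PySpark (the master URL and app name below are placeholders, not from these notes):

from pyspark import SparkContext

# "local[*]" runs Spark locally with one worker thread per core
sc = SparkContext(master="local[*]", appName="notes-demo")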

RDD Creation

# Turn a local Python collection into an RDD
sc.parallelize([1, 2, 3])

# Load text files from the local FS, HDFS, or S3
sc.textFile("file.txt")
sc.textFile("directory/*.txt")
sc.textFile("hdfs://namenode:9000/path/file")

Basic Transformations

nums = sc.parallelize([1, 2, 3])

# pass each element through a function
squares = nums.map(lambda x: x*x)

# keep elements passing a predicate
even = squares.filter(lambda x: x % 2 == 0)

# like map, but each element can produce zero or more output values
nums.flatMap(lambda x: range(x))
# => [0, 0, 1, 0, 1, 2]
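Note that transformations like these are lazy: Spark only records the lineage, and nothing executes until an action runs. A small sketch:

# no computation happens here, just the plan
doubled = nums.map(lambda x: x * 2)
# the action below triggers the actual job
doubled.collect() # => [2, 4, 6]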

Basic Actions

nums = sc.parallelize([1, 2, 3])

# retrieve RDD contents as local collection
nums.collect() # => [1, 2, 3]

# return the first k elements
nums.take(2) # => [1, 2]

# count
nums.count() # => 3

# reduce
nums.reduce(lambda x, y: x + y) # => 6

# note: nums.sum() does the same thing (the Scala API returns a Double)
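A one-liner for the shortcut above:

nums.sum() # => 6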

# to save to file (writes one part file per partition, not a single file):
nums.saveAsTextFile("hdfs://file.txt")
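If a single output file is needed, one common approach is to coalesce to one partition first (a sketch; the output path below is illustrative):

# merge to one partition so only one part file is written
nums.coalesce(1).saveAsTextFile("hdfs://namenode:9000/out")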

Key-Value Pairs

Spark’s distributed reduce transformations can operate on RDDs of key-value pairs.

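A small sketch of these pair operations (the pets dataset here is illustrative):

pets = sc.parallelize([("cat", 1), ("dog", 1), ("cat", 2)])

# combine values with the same key
pets.reduceByKey(lambda x, y: x + y) # => [("cat", 3), ("dog", 1)]

# group values by key (values come back as iterables)
pets.groupByKey() # => [("cat", [1, 2]), ("dog", [1])]

# sort the RDD by key
pets.sortByKey() # => [("cat", 1), ("cat", 2), ("dog", 1)]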

e.g. Word Count in Scala Spark (note: in Scala you can just use _ placeholders instead of named lambda parameters; see the sketch below)
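A PySpark version of word count (the input path "hamlet.txt" is hypothetical), with the Scala _ shorthand shown in a comment:

# split lines into words, pair each word with 1, then sum per word
lines = sc.textFile("hamlet.txt")
counts = lines.flatMap(lambda line: line.split(" ")) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda x, y: x + y)

# Scala equivalent, using _ placeholders:
#   lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)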