SparkContext
This is the main entry point to Spark functionality. In the interactive shell it is already created for you as sc.
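In a standalone Python script you create the context yourself; a minimal sketch (the master URL "local[*]" and the app name are just illustrative values):
from pyspark import SparkContext
sc = SparkContext("local[*]", "MyApp")  # master URL and app name are placeholders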
RDD Creation
# Turn Python collection into RDD
sc.parallelize([1, 2, 3])
# Load text file from local FS, HDFS, or S3
sc.textFile("file.txt")
sc.textFile("directory/*.txt")
sc.textFile("hdfs://namenode:9000/path/file")
Basic Transformations
nums = sc.parallelize([1, 2, 3])
# pass each element through a function
squares = nums.map(lambda x: x*x)
even = squares.filter(lambda x: x % 2 == 0)
# like map, but each element can produce zero or more output values
nums.flatMap(lambda x: range(x))
# => [0, 0, 1, 0, 1, 2]
Basic Actions
nums = sc.parallelize([1, 2, 3])
# retrieve RDD contents as local collection
nums.collect() # => [1, 2, 3]
# return first k elements
nums.take(2) # => [1, 2]
# count
nums.count() # => 3
# reduce
nums.reduce(lambda x, y: x + y) # => 6
# note: there is also a .sum() action (returns a double)
# save to a text file:
nums.saveAsTextFile("hdfs://file.txt") # writes one part-file per partition, so this generates more than one file
Key-Value Pairs
Spark’s distributed reduce transformations can operate on RDDs of key-value pairs.
In Scala, the elements of a pair are accessed as pair._1, pair._2, etc., but pairs can also be unpacked via pattern matching or case lambdas, e.g. in Word Count in Scala Spark (where the _ placeholder can often be used).
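The notes reference Word Count in Scala Spark; the sketch below shows the same idea in PySpark, matching the Python examples above. The input path "file.txt" is only a placeholder.
lines = sc.textFile("file.txt")  # placeholder input path
words = lines.flatMap(lambda line: line.split())  # split each line into words
pairs = words.map(lambda word: (word, 1))  # make (word, 1) key-value pairs
counts = pairs.reduceByKey(lambda a, b: a + b)  # sum the counts for each key
counts.collect()  # => e.g. [("spark", 2), ("hello", 1), ...]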