pyspark.SparkContext.newAPIHadoopRDD

SparkContext.newAPIHadoopRDD(inputFormatClass, keyClass, valueClass, keyConverter=None, valueConverter=None, conf=None, batchSize=0)

Read a ‘new API’ Hadoop InputFormat with arbitrary key and value classes from an arbitrary Hadoop configuration, which is passed in as a Python dict. The dict is converted into a Configuration in Java. The mechanism is the same as for SparkContext.sequenceFile().

Parameters

inputFormatClass : str
    fully qualified classname of Hadoop InputFormat (e.g. “org.apache.hadoop.mapreduce.lib.input.TextInputFormat”)
keyClass : str
    fully qualified classname of key Writable class (e.g. “org.apache.hadoop.io.Text”)
valueClass : str
    fully qualified classname of value Writable class (e.g. “org.apache.hadoop.io.LongWritable”)
keyConverter : str, optional
    fully qualified name of a function returning key WritableConverter (None by default)
valueConverter : str, optional
    fully qualified name of a function returning value WritableConverter (None by default)
conf : dict, optional
    Hadoop configuration, passed in as a dict (None by default)
batchSize : int, optional
    the number of Python objects represented as a single Java object (default 0, choose batchSize automatically)
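
Examples

The following is a minimal sketch of how the parameters above might fit together, not an excerpt from the Spark sources. It assumes a plain-text input read with TextInputFormat, which produces LongWritable keys (byte offsets) and Text values (lines), and it assumes the input location is supplied through the Hadoop property mapreduce.input.fileinputformat.inputdir. The helper names build_text_input_conf and read_text_records are hypothetical and exist only for illustration.

```python
def build_text_input_conf(input_path):
    """Build the Hadoop configuration dict passed as `conf`.

    Assumption: the input format is a FileInputFormat subclass, which reads
    its input location from this Hadoop property.
    """
    return {"mapreduce.input.fileinputformat.inputdir": input_path}


def read_text_records(sc, input_path):
    """Return an RDD of (byte offset, line) pairs read via TextInputFormat.

    `sc` is expected to be an existing SparkContext. The key and value
    classes must match what the chosen InputFormat actually produces.
    """
    return sc.newAPIHadoopRDD(
        inputFormatClass="org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
        keyClass="org.apache.hadoop.io.LongWritable",
        valueClass="org.apache.hadoop.io.Text",
        conf=build_text_input_conf(input_path),
    )
```

With a live SparkContext sc, a call such as read_text_records(sc, "hdfs:///path/to/input") (the path is a placeholder) would return an RDD whose elements are (offset, line) pairs. keyConverter, valueConverter, and batchSize are left at their defaults here.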
