Cloudera CCD-410 Exam Practice Questions (P. 2)
Question #6
Assuming default settings, which best describes the order of data provided to a reducer's reduce method:
- A. The keys given to a reducer aren't in a predictable order, but the values associated with those keys always are.
- B. Both the keys and values passed to a reducer always appear in sorted order.
- C. Neither keys nor values are in any predictable order.
- D. The keys given to a reducer are in sorted order, but the values associated with each key are in no predictable order.
Correct Answer:
D
Reducer has 3 primary phases:
1. Shuffle
The Reducer copies the sorted output from each Mapper across the network using HTTP.
2. Sort
The framework merge-sorts Reducer inputs by key (since different Mappers may have output the same key).
The shuffle and sort phases occur simultaneously; i.e., while map outputs are being fetched they are merged.
Secondary Sort -
To achieve a secondary sort on the values returned by the value iterator, the application should extend the key with the secondary key and define a grouping comparator. The keys will be sorted using the entire key, but will be grouped using the grouping comparator to decide which keys and values are sent in the same call to reduce.
3. Reduce
In this phase the reduce(Object, Iterable, Context) method is called for each <key, (collection of values)> in the sorted inputs.
The output of the reduce task is typically written to a RecordWriter via TaskInputOutputContext.write(Object, Object).
The output of the Reducer is not re-sorted.
Reference: org.apache.hadoop.mapreduce, Class Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
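For illustration only (not part of the quoted reference): a minimal sketch of the grouping-comparator idea, assuming a hypothetical CompositeKey class (natural key plus secondary key) with a getNaturalKey() accessor, plus assumed CompositeKeyComparator and NaturalKeyPartitioner classes, using the new org.apache.hadoop.mapreduce API; imports are omitted as in the other snippets on this page:

// Hypothetical sketch: values reach reduce() ordered by the secondary key because
// sorting uses the full composite key, while grouping looks only at the natural key.
public class NaturalKeyGroupingComparator extends WritableComparator {
    protected NaturalKeyGroupingComparator() {
        super(CompositeKey.class, true);    // CompositeKey is an assumed WritableComparable
    }
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        // Group on the natural key only, so all of its values share one reduce() call.
        return ((CompositeKey) a).getNaturalKey().compareTo(((CompositeKey) b).getNaturalKey());
    }
}

// Driver wiring (job is an org.apache.hadoop.mapreduce.Job):
job.setPartitionerClass(NaturalKeyPartitioner.class);        // partition on the natural key
job.setSortComparatorClass(CompositeKeyComparator.class);    // sort on the entire composite key
job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class);

With the default settings assumed in the question (no such secondary sort), only the keys are sorted; the values for each key arrive in no guaranteed order, which is why answer D is correct.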
Question #7
You wrote a map function that throws a runtime exception when it encounters a control character in input data. The input supplied to your mapper contains twelve such characters in total, spread across five file splits. The first four file splits each have two control characters and the last split has four control characters.
Identify the number of failed task attempts you can expect when you run the job with mapred.max.map.attempts set to 4:
- A. You will have forty-eight failed task attempts
- B. You will have seventeen failed task attempts
- C. You will have five failed task attempts
- D. You will have twelve failed task attempts
- E. You will have twenty failed task attempts
Correct Answer:
E
Every attempt on a failing split encounters a control character and throws, and each map task is retried up to mapred.max.map.attempts (4) times, so there will be four failed task attempts for each of the five file splits: 5 splits × 4 attempts = 20 failed task attempts.
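As a side note (not from the original explanation), the retry limit named in the question is an ordinary job property; a minimal sketch of setting it with the old mapred API used elsewhere on this page, where MyJob is a placeholder driver class:

JobConf conf = new JobConf(MyJob.class);      // MyJob is a placeholder
conf.setInt("mapred.max.map.attempts", 4);    // property named in the question
conf.setMaxMapAttempts(4);                    // equivalent typed setter on JobConf
JobClient.runJob(conf);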

Question #8
You want to populate an associative array in order to perform a map-side join. You've decided to put this information in a text file, place that file into the
DistributedCache and read it in your Mapper before any records are processed.
Identify which method in the Mapper you should use to implement code for reading the file and populating the associative array.
- A. combine
- B. map
- C. init
- D. configure
Correct Answer:
D
See 3) below.
Here is an illustrative example of how to use the DistributedCache:
// Setting up the cache for the application
1. Copy the requisite files to the FileSystem:
$ bin/hadoop fs -copyFromLocal lookup.dat /myapp/lookup.dat
$ bin/hadoop fs -copyFromLocal map.zip /myapp/map.zip
$ bin/hadoop fs -copyFromLocal mylib.jar /myapp/mylib.jar
$ bin/hadoop fs -copyFromLocal mytar.tar /myapp/mytar.tar
$ bin/hadoop fs -copyFromLocal mytgz.tgz /myapp/mytgz.tgz
$ bin/hadoop fs -copyFromLocal mytargz.tar.gz /myapp/mytargz.tar.gz
2. Setup the application's JobConf:
JobConf job = new JobConf();
DistributedCache.addCacheFile(new URI("/myapp/lookup.dat#lookup.dat"), job);
DistributedCache.addCacheArchive(new URI("/myapp/map.zip", job);
DistributedCache.addFileToClassPath(new Path("/myapp/mylib.jar"), job);
DistributedCache.addCacheArchive(new URI("/myapp/mytar.tar", job);
DistributedCache.addCacheArchive(new URI("/myapp/mytgz.tgz", job);
DistributedCache.addCacheArchive(new URI("/myapp/mytargz.tar.gz", job);
3. Use the cached files in the Mapper or Reducer:
public static class MapClass extends MapReduceBase
implements Mapper<K, V, K, V> {
private Path[] localArchives;
private Path[] localFiles;
public void configure(JobConf job) {
// Get the cached archives/files
localArchives = DistributedCache.getLocalCacheArchives(job);
localFiles = DistributedCache.getLocalCacheFiles(job);
}
public void map(K key, V value,
OutputCollector<K, V> output, Reporter reporter)
throws IOException {
// Use data from the cached archives/files here
// ...
// ...
output.collect(k, v);
}
}
Reference: org.apache.hadoop.filecache , Class DistributedCache
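To tie this back to the question, here is a minimal sketch (not from the reference) of a configure() method that reads the cached lookup file into an associative array for a map-side join. The "key,value" line layout of lookup.dat and the assumption that it is the only cached file are illustrative; the usual java.io, java.util, org.apache.hadoop.io and org.apache.hadoop.mapred imports are omitted as in the example above:

public static class JoinMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    // Associative array populated once per task, before any map() calls.
    private final Map<String, String> lookup = new HashMap<String, String>();

    public void configure(JobConf job) {
        try {
            // Locate the file added to the DistributedCache (assumed to be the only one).
            Path[] cacheFiles = DistributedCache.getLocalCacheFiles(job);
            BufferedReader reader = new BufferedReader(new FileReader(cacheFiles[0].toString()));
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split(",", 2);   // assumed "key,value" layout
                if (parts.length == 2) {
                    lookup.put(parts[0], parts[1]);
                }
            }
            reader.close();
        } catch (IOException e) {
            throw new RuntimeException("Could not read the cached lookup file", e);
        }
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        String[] fields = value.toString().split(",", 2);
        if (fields.length == 2) {
            String joined = lookup.get(fields[0]);     // map-side join against the cached table
            if (joined != null) {
                output.collect(new Text(fields[0]), new Text(fields[1] + "," + joined));
            }
        }
    }
}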
Question #9
You've written a MapReduce job that will process 500 million input records and generate 500 million key-value pairs. The data is not uniformly distributed. Your
MapReduce job will create a significant amount of intermediate data that it needs to transfer between mappers and reducers, which is a potential bottleneck. A custom implementation of which interface is most likely to reduce the amount of intermediate data transferred across the network?
- A. Partitioner
- B. OutputFormat
- C. WritableComparable
- D. Writable
- E. InputFormat
- F. Combiner
Correct Answer:
F
Combiners are used to increase the efficiency of a MapReduce program. They aggregate intermediate map output locally, on each node that runs a mapper, and so reduce the amount of data that needs to be transferred across the network to the reducers. You can use your reducer code as a combiner if the operation performed is commutative and associative.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, What are combiners? When should I use a combiner in my MapReduce Job?
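A minimal sketch (a word-count-style job with assumed WordCount, TokenizerMapper and SumReducer classes, new org.apache.hadoop.mapreduce API, imports omitted) of wiring the reducer in as a combiner, which is valid here because summing is commutative and associative:

Configuration conf = new Configuration();
Job job = new Job(conf, "word count");       // job driver; class names below are placeholders
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(SumReducer.class);      // combiner: local pre-aggregation of map output
job.setReducerClass(SumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
System.exit(job.waitForCompletion(true) ? 0 : 1);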
Question #10
Can you use MapReduce to perform a relational join on two large tables sharing a key? Assume that the two tables are formatted as comma-separated files in
HDFS.
- A. Yes.
- B. Yes, but only if one of the tables fits into memory.
- C. Yes, so long as both tables fit into memory.
- D. No, MapReduce cannot perform relational operations.
- E. No, but it can be done with either Pig or Hive.
Correct Answer:
A
Note:
* Join algorithms in MapReduce:
A) Reduce-side join
B) Map-side join
C) In-memory join (with striped and memcached variants)
* Which join to use?
In terms of efficiency: in-memory join > map-side join > reduce-side join.
Limitations of each: the in-memory join is bounded by available memory, the map-side join requires matching sort order and partitioning of the inputs, and the reduce-side join is general purpose.
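For illustration (not part of the original note), a minimal reduce-side join sketch for the two comma-separated tables: each mapper tags its output with the source table, and the reducer combines tagged records that share the join key. File names, column layout and class names are assumed, one side of the join is buffered in memory per key, and imports are omitted as in the other snippets on this page:

// Assumed inputs: customers*.csv and orders*.csv, each with the join key in the first column.
public static class TaggingMapper extends Mapper<LongWritable, Text, Text, Text> {
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String record = line.toString();
        int comma = record.indexOf(',');
        if (comma < 0) {
            return;                                   // skip malformed lines
        }
        // Tag each record with its source file so the reducer can tell the tables apart.
        String tag = ((FileSplit) context.getInputSplit()).getPath().getName()
                .startsWith("customers") ? "C" : "O";
        context.write(new Text(record.substring(0, comma)),
                new Text(tag + "\t" + record.substring(comma + 1)));
    }
}

public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        List<String> customers = new ArrayList<String>();
        List<String> orders = new ArrayList<String>();
        for (Text v : values) {
            String[] parts = v.toString().split("\t", 2);
            if (parts[0].equals("C")) {
                customers.add(parts[1]);
            } else {
                orders.add(parts[1]);
            }
        }
        // Emit the cross product of matching rows: this is the relational join on the key.
        for (String c : customers) {
            for (String o : orders) {
                context.write(key, new Text(c + "," + o));
            }
        }
    }
}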