Cloudera CCD-410 Exam Practice Questions (P. 5)
Question #21
What is the disadvantage of using multiple reducers with the default HashPartitioner and distributing your workload across your cluster?
- A. You will not be able to compress the intermediate data.
- B. You will no longer be able to take advantage of a Combiner.
- C. By using multiple reducers with the default HashPartitioner, output files may not be in globally sorted order.
- D. There are no concerns with this approach. It is always advisable to use multiple reducers.
Correct Answer:
C
Multiple reducers and total ordering
If your sort job runs with multiple reducers (either because mapreduce.job.reduces in mapred-site.xml has been set to a number larger than 1, or because you've used the -r option to specify the number of reducers on the command line), then by default Hadoop will use the HashPartitioner to distribute records across the reducers. Use of the HashPartitioner means that you can't concatenate your output files to create a single sorted output file; to do that you would need total ordering.
Reference: Sorting text files with MapReduce
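For context, the default HashPartitioner simply hashes each key modulo the number of reducers, so records are spread evenly but with no global ordering across the output files. A minimal Java sketch mirroring that behavior:

import org.apache.hadoop.mapreduce.Partitioner;

// Sketch of what the default HashPartitioner does: each key is hashed and
// reduced modulo the number of reducers, so consecutive keys typically land
// in different output files and each file is only sorted within itself.
public class HashLikePartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

To get a single globally sorted result with multiple reducers you would instead need a total-order partitioning scheme (for example, Hadoop's TotalOrderPartitioner).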
Question #22
Given a directory of files with the following structure: line number, tab character, string:
Example:
1 abialkjfjkaoasdfjksdlkjhqweroij
2 kadfjhuwqounahagtnbvaswslmnbfgy
3 kjfteiomndscxeqalkzhtopedkfsikj
You want to send each line as one record to your Mapper. Which InputFormat should you use to complete the line: conf.setInputFormat(____.class); ?
- A. SequenceFileAsTextInputFormat
- B. SequenceFileInputFormat
- C. KeyValueFileInputFormat
- D. BDBInputFormat
Correct Answer:
B
Note:
The output format for your first MR job should be SequenceFileOutputFormat - this will store the key/value pairs output by the reducer in a binary format that can then be read back in your second MR job using SequenceFileInputFormat.
Reference: How to parse CustomWritable from text in Hadoop
http://stackoverflow.com/questions/9721754/how-to-parse-customwritable-from-text-in-hadoop
(see answer 1 and then see the comment #1 for it)
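A minimal sketch of wiring an InputFormat into the job with the old mapred API, in the spirit of the blank in the question (MyJob is a placeholder driver class):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;

// Hypothetical driver fragment: the chosen InputFormat class is plugged into
// the JobConf, completing conf.setInputFormat(____.class);
JobConf conf = new JobConf(MyJob.class);
conf.setInputFormat(SequenceFileInputFormat.class);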
Question #23
You need to perform statistical analysis in your MapReduce job and would like to call methods in the Apache Commons Math library, which is distributed as a 1.3 megabyte Java archive (JAR) file. Which is the best way to make this library available to your MapReduce job at runtime?
- A. Have your system administrator copy the JAR to all nodes in the cluster and set its location in the HADOOP_CLASSPATH environment variable before you submit your job.
- B. Have your system administrator place the JAR file on a Web server accessible to all cluster nodes and then set the HTTP_JAR_URL environment variable to its location.
- C. When submitting the job on the command line, specify the -libjars option followed by the JAR file path.
- D. Package your code and the Apache Commons Math library into a zip file named JobJar.zip.
Correct Answer:
C
The usage of the hadoop jar command is:
Usage: hadoop jar <jar> [mainClass] args...
If you want the commons-math3.jar to be available to all the tasks, you can do either of the following:
1. Copy the JAR file into the $HADOOP_HOME/lib directory, or
2. Use the generic option -libjars.
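For example, a submission using the generic -libjars option might look like the following (the JAR name, driver class, and paths are placeholders):
hadoop jar myjob.jar com.example.MyDriver -libjars /path/to/commons-math3.jar /input /output
Note that -libjars is handled by GenericOptionsParser, so it is only honored when the driver is run through ToolRunner (or parses the generic options itself), and it must appear before the job's own arguments.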
Question #24
The Hadoop framework provides a mechanism for coping with machine issues such as faulty configuration or impending hardware failure. MapReduce detects that one or more machines are performing poorly and starts additional copies of a map or reduce task. All the copies run simultaneously, and the results of whichever task finishes first are used. This is called:
- A. Combine
- B. IdentityMapper
- C. IdentityReducer
- D. Default Partitioner
- E. Speculative Execution
Correct Answer:
E
Speculative execution: One problem with the Hadoop system is that by dividing the tasks across many nodes, it is possible for a few slow nodes to rate-limit the rest of the program. For example, if one node has a slow disk controller, it may be reading its input at only 10% of the speed of all the other nodes. So when 99 map tasks are already complete, the system is still waiting for the final map task to check in, which takes much longer than all the others.
By forcing tasks to run in isolation from one another, individual tasks do not know where their inputs come from. Tasks trust the Hadoop platform to deliver the appropriate input. Therefore, the same input can be processed multiple times in parallel, to exploit differences in machine capabilities. As most of the tasks in a job are coming to a close, the Hadoop platform will schedule redundant copies of the remaining tasks across several nodes which do not have other work to perform.
This process is known as speculative execution. When tasks complete, they announce this fact to the JobTracker. Whichever copy of a task finishes first becomes the definitive copy. If other copies were executing speculatively, Hadoop tells the TaskTrackers to abandon those tasks and discard their outputs. The Reducers then receive their inputs from whichever Mapper completed successfully first.
Reference: Apache Hadoop, Module 4: MapReduce
Note:
* Hadoop uses "speculative execution." The same task may be started on multiple boxes. The first one to finish wins, and the other copies are killed.
Failed tasks are tasks that error out.
* There are a few reasons Hadoop may kill tasks on its own: a) the task does not report progress within the timeout (default is 10 minutes); b) the FairScheduler or CapacityScheduler needs the slot for some other pool (FairScheduler) or queue (CapacityScheduler); c) speculative execution means the results of a task are no longer needed because it has already completed elsewhere.
Reference: Difference failed tasks vs killed tasks
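Speculative execution can also be controlled per job. A minimal sketch using the classic (MRv1) JobConf API of the JobTracker/TaskTracker era described above; MyJob is a placeholder, and newer releases expose the same switches as mapreduce.map.speculative and mapreduce.reduce.speculative:

import org.apache.hadoop.mapred.JobConf;

// Hypothetical driver fragment: toggling speculative execution per task type.
JobConf conf = new JobConf(MyJob.class);
conf.setMapSpeculativeExecution(true);     // mapred.map.tasks.speculative.execution
conf.setReduceSpeculativeExecution(false); // mapred.reduce.tasks.speculative.execution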
Question #25
For each intermediate key, each reducer task can emit:
- A. As many final key-value pairs as desired. There are no restrictions on the types of those key-value pairs (i.e., they can be heterogeneous).
- B. As many final key-value pairs as desired, but they must have the same type as the intermediate key-value pairs.
- C. As many final key-value pairs as desired, as long as all the keys have the same type and all the values have the same type.
- D. One final key-value pair per value associated with the key; no restrictions on the type.
- E. One final key-value pair per key; no restrictions on the type.
Correct Answer:
E
Reducer reduces a set of intermediate values which share a key to a smaller set of values.
Reducing lets you aggregate values together. A reducer function receives an iterator of input values from an input list. It then combines these values together, returning a single output value.
Reference: Hadoop Map-Reduce Tutorial; Yahoo! Hadoop Tutorial, Module 4: MapReduce
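A minimal reducer sketch in that spirit, summing the values for each key and emitting one output pair per key (the class name and types are illustrative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative reducer: aggregates all values that share a key and writes a
// single key-value pair for that key, as described in the explanation above.
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}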