Cloudera CCD-410 Exam Practice Questions (P. 3)
Question #11
You have just executed a MapReduce job. Where is intermediate data written to after being emitted from the Mapper's map method?
- A. Intermediate data is streamed across the network from the Mapper to the Reducer and is never written to disk.
- B. Into in-memory buffers on the TaskTracker node running the Mapper that spill over and are written into HDFS.
- C. Into in-memory buffers that spill over to the local file system of the TaskTracker node running the Mapper.
- D. Into in-memory buffers that spill over to the local file system (outside HDFS) of the TaskTracker node running the Reducer.
- E. Into in-memory buffers on the TaskTracker node running the Reducer that spill over and are written into HDFS.
Correct Answer:
C
The mapper output (intermediate data) is stored on the local file system (not HDFS) of each individual mapper node. This is typically a temporary directory location that can be configured by the Hadoop administrator. The intermediate data is cleaned up after the Hadoop job completes.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, Where is the Mapper output (intermediate key-value data) stored?
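To see where a particular cluster puts this intermediate data, you can read the relevant property from the job configuration. Below is a minimal sketch, assuming a classic MRv1 (TaskTracker/JobTracker) setup where the property is mapred.local.dir; the class name is illustrative and property names differ slightly across Hadoop versions.

import org.apache.hadoop.mapred.JobConf;

public class ShowSpillDirs {
    public static void main(String[] args) {
        // JobConf pulls in mapred-site.xml in addition to core-site.xml.
        JobConf conf = new JobConf();
        // mapred.local.dir lists the local (non-HDFS) directories the
        // TaskTracker uses for map-side spill files.
        System.out.println(conf.get("mapred.local.dir"));
    }
}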
Question #12
You want to understand more about how users browse your public website, such as which pages they visit prior to placing an order. You have a farm of 200 web servers hosting your website. How will you gather this data for your analysis?
- A. Ingest the server web logs into HDFS using Flume.
- B. Write a MapReduce job, with the web servers for mappers, and the Hadoop cluster nodes for reducers.
- C. Import all users' clicks from your OLTP databases into Hadoop, using Sqoop.
- D. Channel these clickstreams into Hadoop using Hadoop Streaming.
- E. Sample the weblogs from the web servers, copying them into Hadoop using curl.
Correct Answer:
B
Hadoop MapReduce for Parsing Weblogs
Here are the steps for parsing a log file using Hadoop MapReduce:
Load the log files into HDFS using this Hadoop command:
hadoop fs -put <local file path of weblogs> <hadoop HDFS location>
The opencsv 2.3 library (opencsv-2.3.jar) is used for parsing the log records.
Below is the Mapper program for parsing the log file from the HDFS location.

public static class ParseMapper extends Mapper<Object, Text, NullWritable, Text> {

    private Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the log line on spaces, treating double-quoted fields as single tokens.
        CSVParser parse = new CSVParser(' ', '\"');
        String[] sp = parse.parseLine(value.toString());
        int spSize = sp.length;
        StringBuffer rec = new StringBuffer();
        for (int i = 0; i < spSize; i++) {
            rec.append(sp[i]);
            if (i != (spSize - 1))
                rec.append(",");
        }
        word.set(rec.toString());
        // Emit the comma-joined record with a null key.
        context.write(NullWritable.get(), word);
    }
}
The command below runs the Hadoop-based log parse. The MapReduce program is attached in this article; you can add extra parsing methods to the class. Be sure to build a new JAR after any change and copy it to the machine from which the job is submitted to the cluster.
hadoop jar <path of logparse jar> <hadoop HDFS logfile path> <output path of parsed log file>
The output file is stored in the HDFS location, and the output file name starts with "part-".
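For completeness, here is a hedged sketch of a driver that wires the ParseMapper above into a map-only job; the class name LogParseDriver and the assumption that ParseMapper is declared as a nested class inside it are illustrative, not taken from the article.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LogParseDriver {

    // ... the ParseMapper class shown above would be declared here ...

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "weblog parse");
        job.setJarByClass(LogParseDriver.class);
        job.setMapperClass(ParseMapper.class);        // the mapper shown above
        job.setNumReduceTasks(0);                     // map-only: parsed records go straight to HDFS
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS log-file path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output path ("part-*" files)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}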
Question #13
MapReduce v2 (MRv2/YARN) is designed to address which two issues?
- A. Single point of failure in the NameNode.
- B. Resource pressure on the JobTracker.
- C. HDFS latency.
- D. Ability to run frameworks other than MapReduce, such as MPI.
- E. Reduce complexity of the MapReduce APIs.
- F. Standardize on a single MapReduce API.
Correct Answer:
BD
YARN (Yet Another Resource Negotiator), as an aspect of Hadoop, has two major kinds of benefits:
* (D) The ability to use programming frameworks other than MapReduce.
/ MPI (Message Passing Interface) was mentioned as a paradigmatic example of a MapReduce alternative
* Scalability, no matter what programming framework you use.
Note:
* The fundamental idea of MRv2 is to split up the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. The idea is to have a global ResourceManager (RM) and per-application ApplicationMaster (AM). An application is either a single job in the classical sense of Map-Reduce jobs or a DAG of jobs.
* (B) The central goal of YARN is to clearly separate two things that are unfortunately smushed together in current Hadoop, specifically in (mainly) JobTracker:
/ Monitoring the status of the cluster with respect to which nodes have which resources available. Under YARN, this will be global.
/ Managing the parallel execution of any specific job. Under YARN, this will be done separately for each job.
The current Hadoop MapReduce system is fairly scalable: Yahoo! runs 5,000 Hadoop jobs, truly concurrently, on a single cluster, for a total of 1.5 to 2 million jobs per cluster per month. Still, YARN will remove scalability bottlenecks.
Reference: Apache Hadoop YARN - Concepts & Applications
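As a small, hedged illustration of that separation (not from the referenced article): the same MapReduce driver code can be pointed at either resource-management layer purely through configuration; the ResourceManager host below is a placeholder, and the "classic" value applies to the CDH4-era Hadoop releases this exam covers.

import org.apache.hadoop.conf.Configuration;

public class FrameworkSelectionSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // "classic" hands the job to the MRv1 JobTracker, "yarn" to the
        // MRv2 ResourceManager, and "local" runs it in-process.
        conf.set("mapreduce.framework.name", "yarn");
        // Placeholder ResourceManager address (default port 8032).
        conf.set("yarn.resourcemanager.address", "rmhost:8032");
        System.out.println("framework = " + conf.get("mapreduce.framework.name"));
    }
}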
Question #14
You need to run the same job many times with minor variations. Rather than hardcoding all job configuration options in your driver code, you've decided to have your Driver subclass org.apache.hadoop.conf.Configured and implement the org.apache.hadoop.util.Tool interface.
Identify which invocation correctly passes mapred.job.name with a value of Example to Hadoop.
- A. hadoop "mapred.job.name=Example" MyDriver input output
- B. hadoop MyDriver mapred.job.name=Example input output
- C. hadoop MyDriver -D mapred.job.name=Example input output
- D. hadoop setproperty mapred.job.name=Example MyDriver input output
- E. hadoop setproperty ("mapred.job.name=Example") MyDriver input output
Correct Answer:
C
Configure the property using the -D key=value notation:
-D mapred.job.name='My Job'
You can list a whole bunch of options by calling the streaming jar with just the -info argument
Reference: Python hadoop streaming : Setting a job name
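For reference, a hedged sketch of what such a driver might look like: ToolRunner feeds the command line through GenericOptionsParser, which strips the -D option and merges mapred.job.name=Example into the configuration before run() is called (the job-building code itself is omitted here).

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // Generic options such as -D key=value have already been parsed and
        // merged into getConf(); args now holds only the remaining arguments
        // (here, the input and output paths).
        System.out.println("job name = " + getConf().get("mapred.job.name"));
        // ... build and submit the job using getConf() ...
        return 0;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new MyDriver(), args));
    }
}

Packaged into a JAR, it would be launched along the lines of option C, for example: hadoop jar mydriver.jar MyDriver -D mapred.job.name=Example input output.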
Question #15
You are developing a MapReduce job for sales reporting. The mapper will process input keys representing the year (IntWritable) and input values representing product identifiers (Text).
Identify what determines the data types used by the Mapper for a given job.
- A. The key and value types specified in the JobConf.setMapInputKeyClass and JobConf.setMapInputValuesClass methods
- B. The data types specified in the HADOOP_MAP_DATATYPES environment variable
- C. The mapper-specification.xml file submitted with the job determines the mapper's input key and value types.
- D. The InputFormat used by the job determines the mapper's input key and value types.
Correct Answer:
D
The input types fed to the mapper are controlled by the InputFormat used. The default input format, "TextInputFormat," will load data in as (LongWritable, Text) pairs. The long value is the byte offset of the line in the file. The Text object holds the string contents of the line of the file.
Note: The data types emitted by the reducer are identified by setOutputKeyClass() and setOutputValueClass(). By default, it is assumed that these are the output types of the mapper as well. If this is not the case, the setMapOutputKeyClass() and setMapOutputValueClass() methods of the JobConf class will override these.
Reference: Yahoo! Hadoop Tutorial, THE DRIVER METHOD
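To make the relationship concrete, here is a hedged driver sketch for the scenario in the question; the class names SalesReportDriver and SalesMapper are illustrative, as is the assumption that the input is a SequenceFile whose (IntWritable, Text) records match the year/product-identifier types described above.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SalesReportDriver {

    // The InputFormat chosen in the driver fixes the mapper's input types: a
    // SequenceFile of (IntWritable year, Text productId) records means the
    // Mapper's first two type parameters must be IntWritable and Text.
    public static class SalesMapper extends Mapper<IntWritable, Text, IntWritable, Text> {
        @Override
        protected void map(IntWritable year, Text productId, Context context)
                throws IOException, InterruptedException {
            context.write(year, productId);   // illustrative pass-through
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "sales report");
        job.setJarByClass(SalesReportDriver.class);
        job.setInputFormatClass(SequenceFileInputFormat.class); // determines the mapper's INPUT key/value types
        job.setMapperClass(SalesMapper.class);
        job.setMapOutputKeyClass(IntWritable.class);            // mapper OUTPUT types (needed when they
        job.setMapOutputValueClass(Text.class);                 //   differ from the final output types)
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}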