Databricks Certified Data Engineer Professional Exam Practice Questions (P. 4)
Question #16
A table is registered with the following code:

Both users and orders are Delta Lake tables. Which statement describes the results of querying recent_orders?
- A. All logic will execute at query time and return the result of joining the valid versions of the source tables at the time the query finishes.
- B. All logic will execute when the table is defined and store the result of joining tables to the DBFS; this stored data will be returned when the table is queried. (Most Voted)
- C. Results will be computed and cached when the table is defined; these cached results will incrementally update as new records are inserted into source tables.
- D. All logic will execute at query time and return the result of joining the valid versions of the source tables at the time the query began.
- E. The versions of each source table will be stored in the table transaction log; query results will be saved to DBFS with each query.
Correct Answer: D
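The registration code is not reproduced above, but answer D matches the behavior of registering recent_orders as a view: only the query plan is stored, nothing is computed until the view is queried, and Delta Lake's snapshot isolation pins each query to the source-table versions that were current when the query began. A toy pure-Python sketch of this query-time (lazy) evaluation, with plain lists standing in for the Delta tables (all names are illustrative):

```python
# Stand-ins for the Delta tables; in Spark these would be `users` and `orders`.
users = [{"user_id": 1, "name": "Ada"}]
orders = [{"user_id": 1, "total": 10}]

def recent_orders():
    """A 'view': re-evaluates the join against the current source data
    each time it is called, rather than storing a materialized result."""
    return [
        {**u, **o}
        for u in users
        for o in orders
        if u["user_id"] == o["user_id"]
    ]

first = recent_orders()           # sees the single existing order

orders.append({"user_id": 1, "total": 25})
second = recent_orders()          # re-executes: now sees both orders

print(len(first), len(second))    # 1 2
```

Because nothing was materialized at definition time, the second call reflects the updated source data, which is why B and C are wrong and query-time evaluation (D) is correct.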
Question #17
A production workload incrementally applies updates from an external Change Data Capture feed to a Delta Lake table as an always-on Structured Stream job. When data was initially migrated for this table, OPTIMIZE was executed and most data files were resized to 1 GB. Auto Optimize and Auto Compaction were both turned on for the streaming production job. Recent review of data files shows that most data files are under 64 MB, although each partition in the table contains at least 1 GB of data and the total table size is over 10 TB.
Which of the following likely explains these smaller file sizes?
- A. Databricks has autotuned to a smaller target file size to reduce duration of MERGE operations (Most Voted)
- B. Z-order indices calculated on the table are preventing file compaction
- C. Bloom filter indices calculated on the table are preventing file compaction
- D. Databricks has autotuned to a smaller target file size based on the overall size of data in the table
- E. Databricks has autotuned to a smaller target file size based on the amount of data in each partition
Correct Answer: A

Databricks can autotune the target file size based on the table's workload: when a table is subject to frequent MERGE (rewrite-heavy) operations, it selects a smaller target file size, because Parquet data files are immutable and a MERGE must rewrite every file it touches in full. Smaller files reduce this write amplification and shorten MERGE duration. Observing files well under the original 1 GB after enabling Auto Optimize and Auto Compaction on a merge-heavy streaming job is therefore expected behavior, not a misconfiguration.
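The write-amplification argument can be made concrete with back-of-the-envelope arithmetic (the file sizes and update counts below are illustrative, not the exact values Databricks autotunes to): a MERGE that updates even one row in a file must rewrite that entire file, so bytes rewritten per updated row scale with file size.

```python
def mb_rewritten(rows_touched: int, file_size_mb: int) -> int:
    """Worst case: each touched row lives in a different data file, and a
    MERGE must rewrite every touched file in full (data files are immutable)."""
    return rows_touched * file_size_mb

# 1,000 scattered row updates in one microbatch of the CDC stream:
big = mb_rewritten(1_000, 1024)   # ~1 TB rewritten with 1 GB target files
small = mb_rewritten(1_000, 64)   # ~64 GB rewritten with 64 MB target files

print(big // small)  # 16
```

Under these assumptions the smaller target cuts rewrite volume sixteenfold per microbatch, which is the effect the autotuning in answer A is exploiting.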
Question #18
Which statement regarding stream-static joins and static Delta tables is correct?
- A. Each microbatch of a stream-static join will use the most recent version of the static Delta table as of each microbatch. (Most Voted)
- B. Each microbatch of a stream-static join will use the most recent version of the static Delta table as of the job's initialization.
- C. The checkpoint directory will be used to track state information for the unique keys present in the join.
- D. Stream-static joins cannot use static Delta tables because of consistency issues.
- E. The checkpoint directory will be used to track updates to the static Delta table.
Correct Answer: A

In a stream-static join, each microbatch is joined against the latest available version of the static Delta table at the time that microbatch is processed; the static side is not pinned to the version that existed when the job started. The join is also stateless with respect to the static table, so the checkpoint directory tracks streaming progress, not static-table keys or updates. This means changes to the static table are naturally picked up by subsequent microbatches, which is exactly what answer A describes.
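The semantics behind answer A can be sketched in plain Python, with a mutable dict standing in for the static Delta table and a function standing in for microbatch processing (all names are illustrative):

```python
# Stand-in for the static Delta table: a lookup that can change between batches.
static_dim = {1: "bronze"}

def process_microbatch(batch):
    """Each microbatch joins against the static table as it exists *now*,
    mirroring how a stream-static join reads the latest static snapshot
    per microbatch rather than the snapshot at job start."""
    return [(event, static_dim.get(key)) for event, key in batch]

out1 = process_microbatch([("e1", 1)])   # [('e1', 'bronze')]

static_dim[1] = "silver"                  # static table updated between batches
out2 = process_microbatch([("e2", 1)])   # [('e2', 'silver')]
```

If the join were pinned at initialization (answer B), the second batch would still see "bronze"; instead it sees the updated value.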
Question #19
A junior data engineer has been asked to develop a streaming data pipeline with a grouped aggregation using DataFrame df. The pipeline needs to calculate the average humidity and average temperature for each non-overlapping five-minute interval. Events are recorded once per minute per device.
Streaming DataFrame df has the following schema:
"device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT"
Code block:

Choose the response that correctly fills in the blank within the code block to complete this task.
- A. to_interval("event_time", "5 minutes").alias("time")
- B. window("event_time", "5 minutes").alias("time") (Most Voted)
- C. "event_time"
- D. window("event_time", "10 minutes").alias("time")
- E. lag("event_time", "10 minutes").alias("time")
Correct Answer: B
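The correct call, window("event_time", "5 minutes"), assigns each event to a non-overlapping (tumbling) five-minute bucket, and the grouped aggregation then averages temp and humidity per bucket. The bucket assignment itself is simple floor arithmetic, sketched here in plain Python with epoch-second timestamps for brevity (the data values are illustrative):

```python
from collections import defaultdict

WINDOW = 5 * 60  # five minutes, in seconds

def window_start(epoch_seconds: int) -> int:
    """Floor a timestamp to the start of its tumbling five-minute window,
    analogous to what window("event_time", "5 minutes") computes per row."""
    return epoch_seconds - (epoch_seconds % WINDOW)

# One reading per minute: minutes 0-4 share a window, minute 5 starts the next.
events = [(t * 60, 20.0 + t) for t in range(6)]  # (event_time, temp)

buckets = defaultdict(list)
for ts, temp in events:
    buckets[window_start(ts)].append(temp)

averages = {w: sum(v) / len(v) for w, v in buckets.items()}
print(averages)  # {0: 22.0, 300: 25.0}
```

A ten-minute window (answer D) would merge both buckets into one, violating the five-minute requirement, and lag/to_interval do no bucketing at all.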
Question #20
A data architect has designed a system in which two Structured Streaming jobs will concurrently write to a single bronze Delta table. Each job is subscribing to a different topic from an Apache Kafka source, but they will write data with the same schema. To keep the directory structure simple, a data engineer has decided to nest a checkpoint directory to be shared by both streams.
The proposed directory structure is displayed below:

Which statement describes whether this checkpoint directory structure is valid for the given scenario and why?
- A. No; Delta Lake manages streaming checkpoints in the transaction log.
- B. Yes; both of the streams can share a single checkpoint directory.
- C. No; only one stream can write to a Delta Lake table.
- D. Yes; Delta Lake supports infinite concurrent writers.
- E. No; each of the streams needs to have its own checkpoint directory. (Most Voted)
Correct Answer: E
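The reason sharing fails is that a checkpoint records one query's progress through its own source (e.g. its Kafka offsets); two queries writing progress to the same location clobber each other. A toy sketch of the failure, with a dict standing in for a grossly simplified checkpoint offset log (real checkpoints hold more than this):

```python
# Stand-in for a checkpoint directory: maps batch id -> committed source offset.
# Each Structured Streaming query needs its own, because it records *that*
# query's position in *its* source topic.
def commit(checkpoint: dict, batch_id: int, offset: int) -> None:
    checkpoint[batch_id] = offset

shared = {}                 # one checkpoint nested under both streams
commit(shared, 0, 100)      # stream A commits its offset for batch 0
commit(shared, 0, 7)        # stream B overwrites it with its own offset

# On restart, stream A would resume from offset 7: replayed or lost data.
print(shared[0])  # 7
```

With separate checkpoint directories, each stream's committed offsets survive independently, which is why answer E is correct even though Delta Lake itself happily accepts multiple concurrent writers to one table.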