Google Professional Data Engineer Exam Practice Questions (P. 2)
Question #11
You are designing a basket abandonment system for an ecommerce company. The system will send a message to a user based on these rules:
- No interaction by the user on the site for 1 hour
- Has added more than $30 worth of products to the basket
- Has not completed a transaction
You use Google Cloud Dataflow to process the data and decide if a message should be sent. How should you design the pipeline?
- A. Use a fixed-time window with a duration of 60 minutes.
- B. Use a sliding time window with a duration of 60 minutes.
- C. Use a session window with a gap time duration of 60 minutes. (Most Voted)
- D. Use a global window with a time-based trigger with a delay of 60 minutes.
Correct Answer: C
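
A session window is keyed per user and closes once no new event arrives within the gap duration, which maps directly to the 60-minute no-interaction rule. Below is a minimal Apache Beam (Python SDK) sketch of this design; the Pub/Sub topic, event field names, and the abandonment check itself are assumptions made for illustration, not part of the original question.

```python
import json

import apache_beam as beam
from apache_beam import window
from apache_beam.options.pipeline_options import PipelineOptions


def parse_event(message_bytes):
    """Decode a Pub/Sub message into a (user_id, event) pair (fields assumed)."""
    event = json.loads(message_bytes.decode("utf-8"))
    return event["user_id"], event


def check_abandonment(user_id, events):
    """Emit a reminder when a closed session meets the abandonment rules."""
    events = list(events)
    basket_total = sum(e.get("amount", 0) for e in events if e.get("type") == "add_to_basket")
    purchased = any(e.get("type") == "purchase" for e in events)
    if basket_total > 30 and not purchased:
        yield {"user_id": user_id, "action": "send_reminder"}


def run():
    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as p:
        (
            p
            # Topic name is a placeholder.
            | "ReadEvents" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/site-events")
            | "ParseToKV" >> beam.Map(parse_event)
            # A session closes after 60 minutes with no events for that user.
            | "SessionWindow" >> beam.WindowInto(window.Sessions(60 * 60))
            | "GroupPerUser" >> beam.GroupByKey()
            | "Decide" >> beam.FlatMapTuple(check_abandonment)
        )


if __name__ == "__main__":
    run()
```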
Question #12
Your company handles data processing for a number of different clients. Each client prefers to use their own suite of analytics tools, with some allowing direct query access via Google BigQuery. You need to secure the data so that clients cannot see each other's data. You want to ensure appropriate access to the data.
Which three steps should you take? (Choose three.)
- A. Load data into different partitions.
- B. Load data into a different dataset for each client. (Most Voted)
- C. Put each client's BigQuery dataset into a different table.
- D. Restrict a client's dataset to approved users. (Most Voted)
- E. Only allow a service account to access the datasets.
- F. Use the appropriate identity and access management (IAM) roles for each client's users. (Most Voted)
Correct Answer: BDF

For securing client data in a multi-client BigQuery setup, the recommended strategy is to load each client's data into a separate dataset, restrict each dataset to that client's approved users, and use appropriate IAM roles to manage permissions at a fine-grained level. This isolates each client's data while granting access tailored to individual client requirements, preventing clients from seeing each other's data and aligning with best practices for data security and privacy on cloud platforms.
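
A minimal sketch of the per-client dataset pattern using the google-cloud-bigquery Python client; the project, dataset, and Google Group names are placeholders, and granting predefined IAM roles (option F) could equally be done through project- or dataset-level IAM bindings.

```python
from google.cloud import bigquery
from google.cloud.bigquery import AccessEntry

client = bigquery.Client(project="my-analytics-project")  # placeholder project


def grant_client_access(dataset_id: str, client_group: str) -> None:
    """Give one client's group read access to its own dataset only (options B + D)."""
    dataset = client.get_dataset(dataset_id)  # e.g. "my-analytics-project.client_acme"
    entries = list(dataset.access_entries)
    entries.append(
        AccessEntry(role="READER", entity_type="groupByEmail", entity_id=client_group)
    )
    dataset.access_entries = entries
    client.update_dataset(dataset, ["access_entries"])


# Placeholder dataset and group per client.
grant_client_access("my-analytics-project.client_acme", "acme-analysts@example.com")
```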
Question #13
You want to process payment transactions in a point-of-sale application that will run on Google Cloud Platform. Your user base could grow exponentially, but you do not want to manage infrastructure scaling.
Which Google database service should you use?
- A. Cloud SQL
- B. BigQuery
- C. Cloud Bigtable
- D. Cloud Datastore (Most Voted)
Correct Answer: A

Cloud SQL is presented as the best choice for a point-of-sale application primarily because it can handle the complex, multi-row transactions that payment processing requires, such as atomically updating related records like inventory and sales data. While Cloud SQL does require some manual scaling of compute capacity, it supports automatic growth of storage capacity, which helps absorb potentially exponential increases in data without extensive manual infrastructure management.
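
As a sketch of the multi-item atomicity the explanation refers to, the snippet below wraps an inventory update and a sales insert in a single transaction. It assumes a generic DB-API connection to a PostgreSQL-flavoured Cloud SQL instance (for example obtained via the Cloud SQL Python Connector); the table and column names are illustrative.

```python
def record_sale(conn, item_id: int, quantity: int, amount: float) -> None:
    """Atomically decrement inventory and record the sale, or do neither.

    `conn` is assumed to be a DB-API connection to a PostgreSQL-style
    Cloud SQL instance; table and column names are illustrative.
    """
    cur = conn.cursor()
    try:
        cur.execute(
            "UPDATE inventory SET stock = stock - %s WHERE item_id = %s",
            (quantity, item_id),
        )
        cur.execute(
            "INSERT INTO sales (item_id, quantity, amount) VALUES (%s, %s, %s)",
            (item_id, quantity, amount),
        )
        conn.commit()    # both statements become visible together
    except Exception:
        conn.rollback()  # neither statement takes effect
        raise
    finally:
        cur.close()
```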
Question #14
You want to use a database of information about tissue samples to classify future tissue samples as either normal or mutated. You are evaluating an unsupervised anomaly detection method for classifying the tissue samples. Which two characteristics support this method? (Choose two.)
- A. There are very few occurrences of mutations relative to normal samples. (Most Voted)
- B. There are roughly equal occurrences of both normal and mutated samples in the database.
- C. You expect future mutations to have different features from the mutated samples in the database. (Most Voted)
- D. You expect future mutations to have similar features to the mutated samples in the database.
- E. You already have labels for which samples are mutated and which are normal in the database.
Correct Answer: BC

Option C, which expects future mutations to have different features from the mutated samples already in the database, may seem at odds with traditional anomaly detection, where anomalies are flagged as deviations from a learned "normal" pattern. In the context of rapidly evolving data such as mutations, however, anticipating novel characteristics fits the unsupervised nature of anomaly detection, which must adapt to new, unseen anomalies rather than rely on known anomaly labels. That said, there is a strong argument that future mutations will resemble existing ones, which would also make option D a practical choice in this scenario.
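
A minimal unsupervised anomaly-detection sketch using scikit-learn's IsolationForest, assuming the tissue samples have already been converted to a numeric feature matrix; the synthetic data, feature count, and contamination rate are illustrative only.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Illustrative feature matrix: rows are tissue samples, columns are numeric
# measurements. No labels are used -- this is unsupervised.
rng = np.random.default_rng(0)
normal_samples = rng.normal(loc=0.0, scale=1.0, size=(990, 8))
rare_samples = rng.normal(loc=4.0, scale=1.0, size=(10, 8))  # few mutations (cf. option A)
X = np.vstack([normal_samples, rare_samples])

# contamination is the assumed fraction of anomalies in the data.
model = IsolationForest(contamination=0.01, random_state=0)
model.fit(X)

# predict() returns +1 for inliers ("normal") and -1 for outliers ("mutated").
predictions = model.predict(X)
print("flagged as anomalous:", int((predictions == -1).sum()))
```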
Question #15
You need to store and analyze social media postings in Google BigQuery at a rate of 10,000 messages per minute in near real-time. You initially design the application to use streaming inserts for individual postings. Your application also performs data aggregations right after the streaming inserts. You discover that the queries after streaming inserts do not exhibit strong consistency, and reports from the queries might miss in-flight data. How can you adjust your application design?
- A. Re-write the application to load accumulated data every 2 minutes.
- B. Convert the streaming insert code to batch load for individual messages. (Most Voted)
- C. Load the original message to Google Cloud SQL, and export the table every hour to BigQuery via streaming inserts.
- D. Estimate the average latency for data availability after streaming inserts, and always run queries after waiting twice as long.
Correct Answer: D

To mitigate the consistency issues observed when running aggregation queries immediately after streaming inserts in BigQuery, the approach suggested here is to estimate the average latency for streamed data to become available for querying, and then schedule queries to run after a delay of roughly twice that latency. This gives in-flight data time to be written and become queryable, reducing the likelihood of missing or incomplete rows in the generated reports. Although this is an estimate rather than a technical guarantee, it addresses the issue within near real-time processing requirements without significant architectural changes, making it the most apt way to preserve near real-time analysis while improving data completeness.
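
A minimal sketch with the google-cloud-bigquery Python client: a streaming insert followed by an aggregation query delayed by twice an assumed average availability latency. The table ID, row fields, and latency figure are placeholders to be replaced with measured values.

```python
import time
from google.cloud import bigquery

client = bigquery.Client()
TABLE_ID = "my-project.social.postings"  # placeholder table
ASSUMED_AVG_LATENCY_SECONDS = 5          # measured/estimated separately

rows = [{"user": "alice", "message": "hello", "posted_at": "2024-01-01T00:00:00Z"}]
errors = client.insert_rows_json(TABLE_ID, rows)  # streaming insert
if errors:
    raise RuntimeError(f"streaming insert failed: {errors}")

# Wait roughly twice the average latency before aggregating,
# so in-flight rows have time to become queryable.
time.sleep(2 * ASSUMED_AVG_LATENCY_SECONDS)

query = f"SELECT COUNT(*) AS postings FROM `{TABLE_ID}`"
result = client.query(query).result()
for row in result:
    print("postings so far:", row.postings)
```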
Question #16
Your startup has never implemented a formal security policy. Currently, everyone in the company has access to the datasets stored in Google BigQuery. Teams have freedom to use the service as they see fit, and they have not documented their use cases. You have been asked to secure the data warehouse. You need to discover what everyone is doing. What should you do first?
- A. Use Google Stackdriver Audit Logs to review data access. (Most Voted)
- B. Get the identity and access management (IAM) policy of each table.
- C. Use Stackdriver Monitoring to see the usage of BigQuery query slots.
- D. Use the Google Cloud Billing API to see what account the warehouse is being billed to.
Correct Answer: A

To effectively secure your data warehouse in Google BigQuery, the first step is to audit data access, which is best accomplished using Google Stackdriver Audit Logs. These logs show who accesses your data, when, and what operations they perform, providing essential insight into current usage patterns and potential security gaps. That foundational knowledge is needed before implementing more targeted access controls. Note that while reviewing IAM policies per table provides detailed access-control information, it usually follows an audit of actual access to pinpoint specific needs.
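
A minimal sketch using the google-cloud-logging Python client to list recent BigQuery Data Access audit log entries; the project ID is a placeholder, and the filter reflects the standard Cloud Audit Logs log name, which should be confirmed for your environment.

```python
from google.cloud import logging as cloud_logging

client = cloud_logging.Client(project="my-project")  # placeholder project

# Data Access audit logs record who read or queried BigQuery data.
log_filter = (
    'logName="projects/my-project/logs/cloudaudit.googleapis.com%2Fdata_access" '
    'AND protoPayload.serviceName="bigquery.googleapis.com"'
)

for i, entry in enumerate(client.list_entries(filter_=log_filter)):
    if i >= 20:  # inspect a small sample first
        break
    print(entry.timestamp, entry.log_name)
    print(entry.payload)  # contains principalEmail, methodName, queried resources, etc.
```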
Question #17
Your company is migrating their 30-node Apache Hadoop cluster to the cloud. They want to re-use Hadoop jobs they have already created and minimize the management of the cluster as much as possible. They also want to be able to persist data beyond the life of the cluster. What should you do?
- A. Create a Google Cloud Dataflow job to process the data.
- B. Create a Google Cloud Dataproc cluster that uses persistent disks for HDFS.
- C. Create a Hadoop cluster on Google Compute Engine that uses persistent disks.
- D. Create a Cloud Dataproc cluster that uses the Google Cloud Storage connector. (Most Voted)
- E. Create a Hadoop cluster on Google Compute Engine that uses Local SSD disks.
Correct Answer: D

Choosing Cloud Dataproc with the Google Cloud Storage connector, as suggested in option D, aligns with the need to reuse existing Hadoop jobs while minimizing management overhead. Dataproc is a fully managed service that runs existing Hadoop and Spark jobs with minimal changes. Using Google Cloud Storage through the connector provides persistent storage for data beyond the lifespan of the cluster, along with better scalability and cost-efficiency than persistent disks. This makes it an optimal solution for cloud migrations, ensuring that your data remains secure and easily accessible even after cluster termination.
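
Because Dataproc clusters include the Cloud Storage connector, existing jobs can usually keep their code and simply reference gs:// paths instead of hdfs:// paths. A minimal PySpark sketch, with bucket and path names assumed for illustration:

```python
from pyspark.sql import SparkSession

# On Dataproc, the Cloud Storage connector lets Hadoop/Spark jobs read and
# write gs:// paths directly, so data outlives the cluster.
spark = SparkSession.builder.appName("hdfs-to-gcs-example").getOrCreate()

# Previously: spark.read.text("hdfs:///data/events/*")
events = spark.read.text("gs://my-company-datalake/events/*")  # placeholder bucket

counts = events.groupBy("value").count()
counts.write.mode("overwrite").parquet("gs://my-company-datalake/output/event_counts")

spark.stop()
```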
Question #18
Business owners at your company have given you a database of bank transactions. Each row contains the user ID, transaction type, transaction location, and transaction amount. They ask you to investigate what type of machine learning can be applied to the data. Which three machine learning applications can you use? (Choose three.)
- A. Supervised learning to determine which transactions are most likely to be fraudulent.
- B. Unsupervised learning to determine which transactions are most likely to be fraudulent. (Most Voted)
- C. Clustering to divide the transactions into N categories based on feature similarity. (Most Voted)
- D. Supervised learning to predict the location of a transaction. (Most Voted)
- E. Reinforcement learning to predict the location of a transaction.
- F. Unsupervised learning to predict the location of a transaction.
Correct Answer: BCE

For the provided bank transaction data, unsupervised learning (option B) can identify potentially fraudulent activity by detecting unusual patterns without pre-labeled outcomes. Clustering (option C) is also applicable, since it can group transactions by similarity in features such as amount or location, providing useful insight into transaction categories. Reinforcement learning to predict transaction locations (option E), although unconventional, could be framed as learning a policy for transaction routing or fraud prevention if the problem is treated as a dynamic system with rewards. Hence, the chosen applications are B, C, and E.
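
A minimal clustering sketch (option C) using scikit-learn's KMeans, assuming the transaction rows have already been encoded as numeric features; the synthetic data and the number of clusters are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Illustrative features per transaction: amount, encoded type, encoded location.
rng = np.random.default_rng(0)
transactions = np.column_stack([
    rng.lognormal(mean=3.0, sigma=1.0, size=1000),  # transaction amount
    rng.integers(0, 4, size=1000),                  # transaction type (label-encoded)
    rng.integers(0, 50, size=1000),                 # transaction location (label-encoded)
])

# Scale features so no single column dominates the distance metric.
X = StandardScaler().fit_transform(transactions)

# Divide transactions into N categories based on feature similarity.
N = 5
kmeans = KMeans(n_clusters=N, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
print("transactions per cluster:", np.bincount(labels))
```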
Question #19
Your company's on-premises Apache Hadoop servers are approaching end-of-life, and IT has decided to migrate the cluster to Google Cloud Dataproc. A like-for-like migration of the cluster would require 50 TB of Google Persistent Disk per node. The CIO is concerned about the cost of using that much block storage. You want to minimize the storage cost of the migration. What should you do?
- A. Put the data into Google Cloud Storage. (Most Voted)
- B. Use preemptible virtual machines (VMs) for the Cloud Dataproc cluster.
- C. Tune the Cloud Dataproc cluster so that there is just enough disk for all data.
- D. Migrate some of the cold data into Google Cloud Storage, and keep only the hot data in Persistent Disk.
Correct Answer: A

Opting for Google Cloud Storage, as indicated in option A, aligns well with Google Cloud Dataproc's architecture, which is engineered to integrate seamlessly with GCS. This approach not only reduces costs associated with large persistent disk requirements but also enhances durability and flexibility by allowing data sharing across various clusters. Moreover, it addresses the CIO's concern about the expense linked to block storage, thereby presenting a solution that is both economically and technically advantageous.
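
As a small illustration of option A, the sketch below lands files in Cloud Storage with the google-cloud-storage Python client; the project, bucket, and file names are placeholders. A real 50 TB HDFS migration would more typically use a distributed copy (for example Hadoop DistCp writing to gs://) rather than per-file uploads.

```python
from google.cloud import storage

# Small-scale illustration only: land exported files in Cloud Storage so the
# data persists independently of any Dataproc cluster's disks.
client = storage.Client(project="my-project")   # placeholder project
bucket = client.bucket("my-company-datalake")   # placeholder bucket


def upload_file(local_path: str, gcs_path: str) -> None:
    """Upload one local file to the bucket; data outlives the cluster."""
    blob = bucket.blob(gcs_path)
    blob.upload_from_filename(local_path)
    print(f"uploaded {local_path} to gs://{bucket.name}/{gcs_path}")


upload_file("exports/events-2024-01-01.avro", "raw/events/events-2024-01-01.avro")
```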
Question #20
You work for a car manufacturer and have set up a data pipeline using Google Cloud Pub/Sub to capture anomalous sensor events. You are using a push subscription in Cloud Pub/Sub that calls a custom HTTPS endpoint that you have created to take action on these anomalous events as they occur. Your custom HTTPS endpoint keeps getting an inordinate amount of duplicate messages. What is the most likely cause of these duplicate messages?
- A. The message body for the sensor event is too large.
- B. Your custom endpoint has an out-of-date SSL certificate.
- C. The Cloud Pub/Sub topic has too many messages published to it.
- D. Your custom endpoint is not acknowledging messages within the acknowledgement deadline. (Most Voted)
Correct Answer: B