Google Professional Data Engineer Exam Practice Questions (P. 3)
- Full Access (319 questions)
- Six months of Premium Access
- Access to one million comments
- Seamless ChatGPT Integration
- Ability to download PDF files
- Anki Flashcard files for revision
- No Captcha & No AdSense
- Advanced Exam Configuration
Question #21
Your company uses a proprietary system to send inventory data every 6 hours to a data ingestion service in the cloud. Transmitted data includes a payload of several fields and the timestamp of the transmission. If there are any concerns about a transmission, the system re-transmits the data. How should you deduplicate the data most efficiency?
- AAssign global unique identifiers (GUID) to each data entry.
- BCompute the hash value of each data entry, and compare it with all historical data.
- CStore each data entry as the primary key in a separate database and apply an index.
- DMaintain a database table to store the hash value and other metadata for each data entry.
Correct Answer:
D
D
send
light_mode
delete
Question #22
Your company has hired a new data scientist who wants to perform complicated analyses across very large datasets stored in Google Cloud Storage and in a
Cassandra cluster on Google Compute Engine. The scientist primarily wants to create labelled data sets for machine learning projects, along with some visualization tasks. She reports that her laptop is not powerful enough to perform her tasks and it is slowing her down. You want to help her perform her tasks.
What should you do?
Cassandra cluster on Google Compute Engine. The scientist primarily wants to create labelled data sets for machine learning projects, along with some visualization tasks. She reports that her laptop is not powerful enough to perform her tasks and it is slowing her down. You want to help her perform her tasks.
What should you do?
- ARun a local version of Jupiter on the laptop.
- BGrant the user access to Google Cloud Shell.
- CHost a visualization tool on a VM on Google Compute Engine.
- DDeploy Google Cloud Datalab to a virtual machine (VM) on Google Compute Engine.Most Voted
Correct Answer:
B
B
send
light_mode
delete
Question #23
You are deploying 10,000 new Internet of Things devices to collect temperature data in your warehouses globally. You need to process, store and analyze these very large datasets in real time. What should you do?
- ASend the data to Google Cloud Datastore and then export to BigQuery.
- BSend the data to Google Cloud Pub/Sub, stream Cloud Pub/Sub to Google Cloud Dataflow, and store the data in Google BigQuery.Most Voted
- CSend the data to Cloud Storage and then spin up an Apache Hadoop cluster as needed in Google Cloud Dataproc whenever analysis is required.
- DExport logs in batch to Google Cloud Storage and then spin up a Google Cloud SQL instance, import the data from Cloud Storage, and run an analysis as needed.
Correct Answer:
B
B
send
light_mode
delete
Question #24
You have spent a few days loading data from comma-separated values (CSV) files into the Google BigQuery table CLICK_STREAM. The column DT stores the epoch time of click events. For convenience, you chose a simple schema where every field is treated as the STRING type. Now, you want to compute web session durations of users who visit your site, and you want to change its data type to the TIMESTAMP. You want to minimize the migration effort without making future queries computationally expensive. What should you do?
- ADelete the table CLICK_STREAM, and then re-create it such that the column DT is of the TIMESTAMP type. Reload the data.
- BAdd a column TS of the TIMESTAMP type to the table CLICK_STREAM, and populate the numeric values from the column TS for each row. Reference the column TS instead of the column DT from now on.
- CCreate a view CLICK_STREAM_V, where strings from the column DT are cast into TIMESTAMP values. Reference the view CLICK_STREAM_V instead of the table CLICK_STREAM from now on.
- DAdd two columns to the table CLICK STREAM: TS of the TIMESTAMP type and IS_NEW of the BOOLEAN type. Reload all data in append mode. For each appended row, set the value of IS_NEW to true. For future queries, reference the column TS instead of the column DT, with the WHERE clause ensuring that the value of IS_NEW must be true.
- EConstruct a query to return every row of the table CLICK_STREAM, while using the built-in function to cast strings from the column DT into TIMESTAMP values. Run the query into a destination table NEW_CLICK_STREAM, in which the column TS is the TIMESTAMP type. Reference the table NEW_CLICK_STREAM instead of the table CLICK_STREAM from now on. In the future, new data is loaded into the table NEW_CLICK_STREAM.Most Voted
Correct Answer:
D
D
send
light_mode
delete
Question #25
You want to use Google Stackdriver Logging to monitor Google BigQuery usage. You need an instant notification to be sent to your monitoring tool when new data is appended to a certain table using an insert job, but you do not want to receive notifications for other tables. What should you do?
- AMake a call to the Stackdriver API to list all logs, and apply an advanced filter.
- BIn the Stackdriver logging admin interface, and enable a log sink export to BigQuery.
- CIn the Stackdriver logging admin interface, enable a log sink export to Google Cloud Pub/Sub, and subscribe to the topic from your monitoring tool.
- DUsing the Stackdriver API, create a project sink with advanced log filter to export to Pub/Sub, and subscribe to the topic from your monitoring tool.Most Voted
Correct Answer:
B
B
send
light_mode
delete
Question #26
You are working on a sensitive project involving private user data. You have set up a project on Google Cloud Platform to house your work internally. An external consultant is going to assist with coding a complex transformation in a Google Cloud Dataflow pipeline for your project. How should you maintain users' privacy?
- AGrant the consultant the Viewer role on the project.
- BGrant the consultant the Cloud Dataflow Developer role on the project.Most Voted
- CCreate a service account and allow the consultant to log on with it.
- DCreate an anonymized sample of the data for the consultant to work with in a different project.
Correct Answer:
C
C
send
light_mode
delete
Question #27
You are building a model to predict whether or not it will rain on a given day. You have thousands of input features and want to see if you can improve training speed by removing some features while having a minimum effect on model accuracy. What can you do?
- AEliminate features that are highly correlated to the output labels.
- BCombine highly co-dependent features into one representative feature.Most Voted
- CInstead of feeding in each feature individually, average their values in batches of 3.
- DRemove the features that have null values for more than 50% of the training records.
Correct Answer:
B
B
send
light_mode
delete
Question #28
Your company is performing data preprocessing for a learning algorithm in Google Cloud Dataflow. Numerous data logs are being are being generated during this step, and the team wants to analyze them. Due to the dynamic nature of the campaign, the data is growing exponentially every hour.
The data scientists have written the following code to read the data for a new key features in the logs.

You want to improve the performance of this data read. What should you do?
The data scientists have written the following code to read the data for a new key features in the logs.

You want to improve the performance of this data read. What should you do?
- ASpecify the TableReference object in the code.
- BUse .fromQuery operation to read specific fields from the table.Most Voted
- CUse of both the Google BigQuery TableSchema and TableFieldSchema classes.
- DCall a transform that returns TableRow objects, where each element in the PCollection represents a single row in the table.
Correct Answer:
D
D
send
light_mode
delete
Question #29
Your company is streaming real-time sensor data from their factory floor into Bigtable and they have noticed extremely poor performance. How should the row key be redesigned to improve Bigtable performance on queries that populate real-time dashboards?
- AUse a row key of the form <timestamp>.
- BUse a row key of the form <sensorid>.
- CUse a row key of the form <timestamp>#<sensorid>.
- DUse a row key of the form >#<sensorid>#<timestamp>.Most Voted
Correct Answer:
D
D
send
light_mode
delete
Question #30
Your company's customer and order databases are often under heavy load. This makes performing analytics against them difficult without harming operations.
The databases are in a MySQL cluster, with nightly backups taken using mysqldump. You want to perform analytics with minimal impact on operations. What should you do?
The databases are in a MySQL cluster, with nightly backups taken using mysqldump. You want to perform analytics with minimal impact on operations. What should you do?
- AAdd a node to the MySQL cluster and build an OLAP cube there.
- BUse an ETL tool to load the data from MySQL into Google BigQuery.Most Voted
- CConnect an on-premises Apache Hadoop cluster to MySQL and perform ETL.
- DMount the backups to Google Cloud SQL, and then process the data using Google Cloud Dataproc.
Correct Answer:
C
C
send
light_mode
delete
All Pages