Amazon AWS Certified Data Engineer - Associate DEA-C01 Exam Practice Questions (P. 3)
Question #11
A data engineer needs Amazon Athena queries to finish faster. The data engineer notices that all the files the Athena queries use are currently stored in uncompressed .csv format. The data engineer also notices that users perform most queries by selecting a specific column.
Which solution will MOST speed up the Athena query performance?
- A. Change the data format from .csv to JSON format. Apply Snappy compression.
- B. Compress the .csv files by using Snappy compression.
- C. Change the data format from .csv to Apache Parquet. Apply Snappy compression. (Most Voted)
- D. Compress the .csv files by using gzip compression.
Correct Answer: C

Switching from .csv to Apache Parquet and applying Snappy compression notably improves Athena query speed, particularly for queries that select specific columns, because Parquet is a columnar storage format. Athena reads only the column chunks a query references (column pruning) and can skip row groups through predicate pushdown, which sharply reduces the amount of data scanned compared with uncompressed CSV.
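As a concrete illustration of the conversion, the sketch below uses boto3 to run an Athena CTAS statement that rewrites an existing CSV-backed table as Snappy-compressed Parquet. The table, database, and S3 paths are hypothetical placeholders.

```python
# Hypothetical example: convert a CSV-backed Athena table to Snappy-compressed Parquet
# with a CTAS (CREATE TABLE AS SELECT) statement submitted through boto3.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

ctas_query = """
CREATE TABLE sales_parquet
WITH (
    format = 'PARQUET',
    write_compression = 'SNAPPY',
    external_location = 's3://example-bucket/sales-parquet/'
) AS
SELECT * FROM sales_csv;
"""

response = athena.start_query_execution(
    QueryString=ctas_query,
    QueryExecutionContext={"Database": "example_db"},
    ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
)
print(response["QueryExecutionId"])
```

Once queries target the Parquet table, Athena scans only the columns each query references, which is what makes the column-selective workload in this question so much faster.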
Question #12
A manufacturing company collects sensor data from its factory floor to monitor and enhance operational efficiency. The company uses Amazon Kinesis Data Streams to publish the data that the sensors collect to a data stream. Then Amazon Kinesis Data Firehose writes the data to an Amazon S3 bucket.
The company needs to display a real-time view of operational efficiency on a large screen in the manufacturing facility.
Which solution will meet these requirements with the LOWEST latency?
- A. Use Amazon Managed Service for Apache Flink (previously known as Amazon Kinesis Data Analytics) to process the sensor data. Use a connector for Apache Flink to write data to an Amazon Timestream database. Use the Timestream database as a source to create a Grafana dashboard. (Most Voted)
- B. Configure the S3 bucket to send a notification to an AWS Lambda function when any new object is created. Use the Lambda function to publish the data to Amazon Aurora. Use Aurora as a source to create an Amazon QuickSight dashboard.
- C. Use Amazon Managed Service for Apache Flink (previously known as Amazon Kinesis Data Analytics) to process the sensor data. Create a new Data Firehose delivery stream to publish data directly to an Amazon Timestream database. Use the Timestream database as a source to create an Amazon QuickSight dashboard.
- D. Use AWS Glue bookmarks to read sensor data from the S3 bucket in real time. Publish the data to an Amazon Timestream database. Use the Timestream database as a source to create a Grafana dashboard.
Correct Answer: A

Option A delivers the lowest latency. Amazon Managed Service for Apache Flink processes records directly from the Kinesis data stream as they arrive, and the Apache Flink connector for Timestream writes the results to Amazon Timestream with minimal buffering; a Grafana dashboard backed by Timestream can then refresh continuously for the factory-floor display. The other options all add delay: AWS Glue job bookmarks are a batch ETL feature and cannot read S3 data in real time, the S3 notification path through Lambda and Aurora adds per-object processing hops, and a Data Firehose delivery stream buffers records before delivery and does not offer Timestream as a native destination.
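To show the shape of the write path in option A, the sketch below uses boto3 to put a single record into Timestream. In the actual solution the Apache Flink connector for Timestream performs these writes continuously; the database, table, dimension, and measure names here are hypothetical.

```python
# Hypothetical example: the kind of record a Flink sink would write to Amazon Timestream
# for each processed sensor reading.
import time
import boto3

timestream = boto3.client("timestream-write", region_name="us-east-1")

record = {
    "Dimensions": [{"Name": "machine_id", "Value": "press-07"}],
    "MeasureName": "units_per_minute",
    "MeasureValue": "42.5",
    "MeasureValueType": "DOUBLE",
    "Time": str(int(time.time() * 1000)),  # epoch milliseconds
}

timestream.write_records(
    DatabaseName="factory_metrics",   # hypothetical Timestream database
    TableName="efficiency",           # hypothetical Timestream table
    Records=[record],
)
```

Grafana then queries the Timestream table on a short refresh interval, so the wall display stays within seconds of the live sensor data.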
Question #13
A company stores daily records of the financial performance of investment portfolios in .csv format in an Amazon S3 bucket. A data engineer uses AWS Glue crawlers to crawl the S3 data.
The data engineer must make the S3 data accessible daily in the AWS Glue Data Catalog.
Which solution will meet these requirements?
- A. Create an IAM role that includes the AmazonS3FullAccess policy. Associate the role with the crawler. Specify the S3 bucket path of the source data as the crawler's data store. Create a daily schedule to run the crawler. Configure the output destination to a new path in the existing S3 bucket.
- B. Create an IAM role that includes the AWSGlueServiceRole policy. Associate the role with the crawler. Specify the S3 bucket path of the source data as the crawler's data store. Create a daily schedule to run the crawler. Specify a database name for the output. (Most Voted)
- C. Create an IAM role that includes the AmazonS3FullAccess policy. Associate the role with the crawler. Specify the S3 bucket path of the source data as the crawler's data store. Allocate data processing units (DPUs) to run the crawler every day. Specify a database name for the output.
- D. Create an IAM role that includes the AWSGlueServiceRole policy. Associate the role with the crawler. Specify the S3 bucket path of the source data as the crawler's data store. Allocate data processing units (DPUs) to run the crawler every day. Configure the output destination to a new path in the existing S3 bucket.
Correct Answer: B

Option B uses the AWSGlueServiceRole managed policy, which is designed specifically for AWS Glue operations and grants the necessary permissions without excessive privileges. Associating this role with the crawler and scheduling the crawler to run daily keeps the Data Catalog current, and specifying a database name for the output registers the crawled S3 data as tables in that Data Catalog database, where it can be queried and managed. The other options fall short: AmazonS3FullAccess grants broader permissions than the crawler needs, crawlers are scheduled rather than run by allocating data processing units (DPUs), and a crawler's output is a Data Catalog database, not a new S3 path.
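A minimal boto3 sketch of option B follows, with a hypothetical role ARN, bucket path, and database name: the crawler uses a role that carries the AWSGlueServiceRole managed policy, targets the S3 source path, writes its output to a named Data Catalog database, and runs on a daily cron schedule.

```python
# Hypothetical example: create a scheduled Glue crawler that catalogs daily CSV data.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_crawler(
    Name="daily-portfolio-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # role with AWSGlueServiceRole attached
    DatabaseName="portfolio_db",                            # Data Catalog database for the output tables
    Targets={"S3Targets": [{"Path": "s3://example-bucket/portfolio-csv/"}]},
    Schedule="cron(0 1 * * ? *)",                           # run daily at 01:00 UTC
)
```

The daily cron expression keeps the Data Catalog tables in sync with each day's newly delivered CSV objects.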
Question #14
A company loads transaction data for each day into Amazon Redshift tables at the end of each day. The company wants to have the ability to track which tables have been loaded and which tables still need to be loaded.
A data engineer wants to store the load statuses of Redshift tables in an Amazon DynamoDB table. The data engineer creates an AWS Lambda function to publish the details of the load statuses to DynamoDB.
How should the data engineer invoke the Lambda function to write load statuses to the DynamoDB table?
- A. Use a second Lambda function to invoke the first Lambda function based on Amazon CloudWatch events.
- B. Use the Amazon Redshift Data API to publish an event to Amazon EventBridge. Configure an EventBridge rule to invoke the Lambda function. (Most Voted)
- C. Use the Amazon Redshift Data API to publish a message to an Amazon Simple Queue Service (Amazon SQS) queue. Configure the SQS queue to invoke the Lambda function.
- D. Use a second Lambda function to invoke the first Lambda function based on AWS CloudTrail events.
Correct Answer: B

Option B is the most direct integration. The Amazon Redshift Data API can publish an event to Amazon EventBridge when a statement finishes, so the completion of a table load can trigger an EventBridge rule that invokes the Lambda function, which then writes the load status to the DynamoDB table. This avoids chaining a second Lambda function or polling an intermediary queue, and it reacts to each load as soon as it completes, keeping the load-status tracking current.
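The sketch below outlines option B with boto3 and hypothetical names. The event source and detail-type in the rule reflect the Redshift Data API's EventBridge integration as I understand it and should be verified against current documentation; the Lambda function also needs a resource-based permission that allows EventBridge to invoke it.

```python
# Hypothetical example: run a Redshift load through the Data API with WithEvent=True,
# then route the resulting EventBridge event to the status-writing Lambda function.
import json
import boto3

redshift_data = boto3.client("redshift-data")
events = boto3.client("events")

# 1. Issue the load; WithEvent=True asks the Data API to publish an event to
#    EventBridge when the statement finishes.
redshift_data.execute_statement(
    WorkgroupName="example-workgroup",  # or ClusterIdentifier/DbUser for a provisioned cluster
    Database="dev",
    Sql="COPY sales FROM 's3://example-bucket/sales/' IAM_ROLE default CSV;",
    WithEvent=True,
)

# 2. Rule that matches Data API status-change events and targets the Lambda function
#    (assumed event pattern values).
events.put_rule(
    Name="redshift-load-status",
    EventPattern=json.dumps({
        "source": ["aws.redshift-data"],
        "detail-type": ["Redshift Data Statement Status Change"],
    }),
)
events.put_targets(
    Rule="redshift-load-status",
    Targets=[{
        "Id": "load-status-lambda",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:WriteLoadStatus",
    }],
)
```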
Question #15
A data engineer needs to securely transfer 5 TB of data from an on-premises data center to an Amazon S3 bucket. Approximately 5% of the data changes every day. Updates to the data need to be regularly propagated to the S3 bucket. The data includes files that are in multiple formats. The data engineer needs to automate the transfer process and must schedule the process to run periodically.
Which AWS service should the data engineer use to transfer the data in the MOST operationally efficient way?
- A. AWS DataSync (Most Voted)
- B. AWS Glue
- C. AWS Direct Connect
- D. Amazon S3 Transfer Acceleration
Correct Answer: A

AWS DataSync is purpose-built for this requirement. It automates and schedules transfers between on-premises storage and Amazon S3, performs incremental transfers so only the roughly 5 percent of data that changes each day is copied, handles files in any format, and encrypts data in transit. AWS Direct Connect provides only a dedicated network link and does not itself transfer, schedule, or track changes to files; AWS Glue is an ETL service rather than a file transfer service; and Amazon S3 Transfer Acceleration speeds up long-distance uploads but offers no scheduling or change detection.
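A minimal boto3 sketch of the DataSync approach follows, with hypothetical location ARNs: the task copies only changed files from the on-premises location to S3 and runs on a daily schedule.

```python
# Hypothetical example: a scheduled DataSync task that transfers only changed files
# from an on-premises location to an S3 location.
import boto3

datasync = boto3.client("datasync", region_name="us-east-1")

task = datasync.create_task(
    SourceLocationArn="arn:aws:datasync:us-east-1:123456789012:location/loc-0f1a2b3c4d5e6f7a8",   # on-premises location
    DestinationLocationArn="arn:aws:datasync:us-east-1:123456789012:location/loc-9a8b7c6d5e4f3a2", # S3 bucket location
    Name="daily-onprem-to-s3",
    Options={
        "TransferMode": "CHANGED",               # copy only data that differs from the destination
        "VerifyMode": "ONLY_FILES_TRANSFERRED",  # verify integrity of what was transferred
    },
    Schedule={"ScheduleExpression": "cron(0 2 * * ? *)"},  # run daily at 02:00 UTC
)
print(task["TaskArn"])
```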