Latest Google Professional-Data-Engineer Dumps PDF Questions Answers 2025

Google Professional Data Engineer Exam Questions and Answers

Question 1

Your company is running their first dynamic campaign, serving different offers by analyzing real-time data during the holiday season. The data scientists are collecting terabytes of data that rapidly grows every hour during their 30-day campaign. They are using Google Cloud Dataflow to preprocess the data and collect the feature (signals) data that is needed for the machine learning model in Google Cloud Bigtable. The team is observing suboptimal performance with reads and writes of their initial load of 10 TB of data. They want to improve this performance while minimizing cost. What should they do?

Options:

Redefine the schema by evenly distributing reads and writes across the row space of the table.

The performance issue should be resolved over time as the size of the Bigtable cluster is increased.

Redesign the schema to use a single row key to identify values that need to be updated frequently in the cluster.

Redesign the schema to use row keys based on numeric IDs that increase sequentially per user viewing the offers.

Question 2

Your company uses a proprietary system to send inventory data every 6 hours to a data ingestion service in the cloud. Transmitted data includes a payload of several fields and the timestamp of the transmission. If there are any concerns about a transmission, the system re-transmits the data. How should you deduplicate the data most efficiently?

Options:

Assign globally unique identifiers (GUIDs) to each data entry.

Compute the hash value of each data entry, and compare it with all historical data.

Store each data entry as the primary key in a separate database and apply an index.

Maintain a database table to store the hash value and other metadata for each data entry.
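
To make the hash-based options concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not a reference solution: the field names (`sku`, `quantity`) and the class name `HashIndex` are hypothetical, an in-memory dict stands in for the database table, and the transmission timestamp is left out of the hash so that a re-transmission of the same payload yields the same digest. It also assumes that genuinely distinct transmissions carry different payloads.

```python
import hashlib
import json


def entry_hash(payload: dict) -> str:
    """Return a SHA-256 digest of the payload fields (timestamp excluded)."""
    # Sorting keys makes the digest independent of field order.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class HashIndex:
    """Hypothetical stand-in for a table keyed by hash value plus metadata."""

    def __init__(self) -> None:
        self._seen: dict[str, dict] = {}

    def ingest(self, payload: dict, sent_at: str) -> bool:
        """Record the entry; return False if the same payload was seen before."""
        digest = entry_hash(payload)
        if digest in self._seen:
            return False  # duplicate: a single keyed lookup, no full scan
        self._seen[digest] = {"first_sent_at": sent_at}
        return True
```

The design point the sketch shows is that a keyed lookup on the digest avoids comparing each new entry against all historical data.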

Question 3

Business owners at your company have given you a database of bank transactions. Each row contains the user ID, transaction type, transaction location, and transaction amount. They ask you to investigate what type of machine learning can be applied to the data. Which three machine learning applications can you use? (Choose three.)

Options:

Supervised learning to determine which transactions are most likely to be fraudulent.

Unsupervised learning to determine which transactions are most likely to be fraudulent.

Clustering to divide the transactions into N categories based on feature similarity.

Supervised learning to predict the location of a transaction.

Reinforcement learning to predict the location of a transaction.

Unsupervised learning to predict the location of a transaction.

Question 4

You create an important report for your large team in Google Data Studio 360. The report uses Google BigQuery as its data source. You notice that visualizations are not showing data that is less than 1 hour old. What should you do?

Options:

Disable caching by editing the report settings.

Disable caching in BigQuery by editing table details.

Refresh your browser tab showing the visualizations.

Clear your browser history for the past hour, then reload the tab showing the visualizations.

Question 5

Your company is performing data preprocessing for a learning algorithm in Google Cloud Dataflow. Numerous data logs are being generated during this step, and the team wants to analyze them. Due to the dynamic nature of the campaign, the data is growing exponentially every hour.

The data scientists have written the following code to read the data for new key features in the logs.

BigQueryIO.Read
    .named("ReadLogData")
    .from("clouddataflow-readonly:samples.log_data")

You want to improve the performance of this data read. What should you do?

Options:

Specify the TableReference object in the code.

Use .fromQuery operation to read specific fields from the table.

Use both the Google BigQuery TableSchema and TableFieldSchema classes.

Call a transform that returns TableRow objects, where each element in the PCollection represents a single row in the table.

Question 6

Your company has hired a new data scientist who wants to perform complicated analyses across very large datasets stored in Google Cloud Storage and in a Cassandra cluster on Google Compute Engine. The scientist primarily wants to create labelled data sets for machine learning projects, along with some visualization tasks. She reports that her laptop is not powerful enough to perform her tasks and it is slowing her down. You want to help her perform her tasks. What should you do?

Options:

Run a local version of Jupyter on the laptop.

Grant the user access to Google Cloud Shell.

Host a visualization tool on a VM on Google Compute Engine.

Deploy Google Cloud Datalab to a virtual machine (VM) on Google Compute Engine.

Question 7

Your company’s customer and order databases are often under heavy load. This makes performing analytics against them difficult without harming operations. The databases are in a MySQL cluster, with nightly backups taken using mysqldump. You want to perform analytics with minimal impact on operations. What should you do?

Options:

Add a node to the MySQL cluster and build an OLAP cube there.

Use an ETL tool to load the data from MySQL into Google BigQuery.

Connect an on-premises Apache Hadoop cluster to MySQL and perform ETL.

Mount the backups to Google Cloud SQL, and then process the data using Google Cloud Dataproc.

Question 8

You want to use a database of information about tissue samples to classify future tissue samples as either normal or mutated. You are evaluating an unsupervised anomaly detection method for classifying the tissue samples. Which two characteristics support this method? (Choose two.)

Options:

There are very few occurrences of mutations relative to normal samples.

There are roughly equal occurrences of both normal and mutated samples in the database.

You expect future mutations to have different features from the mutated samples in the database.

You expect future mutations to have similar features to the mutated samples in the database.

You already have labels for which samples are mutated and which are normal in the database.

Question 9

Your company is in a highly regulated industry. One of your requirements is to ensure individual users have access only to the minimum amount of information required to do their jobs. You want to enforce this requirement with Google BigQuery. Which three approaches can you take? (Choose three.)

Options:

Disable writes to certain tables.

Restrict access to tables by role.

Ensure that the data is encrypted at all times.

Restrict BigQuery API access to approved users.

Segregate data across multiple tables or databases.

Use Google Stackdriver Audit Logging to determine policy violations.

Question 10

You want to use Google Stackdriver Logging to monitor Google BigQuery usage. You need an instant notification to be sent to your monitoring tool when new data is appended to a certain table using an insert job, but you do not want to receive notifications for other tables. What should you do?

Options:

Make a call to the Stackdriver API to list all logs, and apply an advanced filter.

In the Stackdriver logging admin interface, enable a log sink export to BigQuery.

In the Stackdriver logging admin interface, enable a log sink export to Google Cloud Pub/Sub, and subscribe to the topic from your monitoring tool.

Using the Stackdriver API, create a project sink with an advanced log filter to export to Pub/Sub, and subscribe to the topic from your monitoring tool.

Question 11

Your startup has never implemented a formal security policy. Currently, everyone in the company has access to the datasets stored in Google BigQuery. Teams have freedom to use the service as they see fit, and they have not documented their use cases. You have been asked to secure the data warehouse. You need to discover what everyone is doing. What should you do first?

Options:

Use Google Stackdriver Audit Logs to review data access.

Get the identity and access management (IAM) policy of each table.

Use Stackdriver Monitoring to see the usage of BigQuery query slots.

Use the Google Cloud Billing API to see what account the warehouse is being billed to.

Question 12

You are designing a basket abandonment system for an ecommerce company. The system will send a message to a user based on these rules:

No interaction by the user on the site for 1 hour

Has added more than $30 worth of products to the basket

Has not completed a transaction

You use Google Cloud Dataflow to process the data and decide if a message should be sent. How should you design the pipeline?

Options:

Use a fixed-time window with a duration of 60 minutes.

Use a sliding time window with a duration of 60 minutes.

Use a session window with a gap time duration of 60 minutes.

Use a global window with a time-based trigger with a delay of 60 minutes.
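
To compare the windowing options, the gap-based grouping that a session window performs can be sketched in plain Python. This is a simplified illustration, not Dataflow or Apache Beam code; the function name `sessionize` and the sample timestamps in the usage note are hypothetical.

```python
from datetime import datetime, timedelta


def sessionize(timestamps, gap=timedelta(minutes=60)):
    """Group event times into sessions that are closed by `gap` of inactivity."""
    sessions = []
    for ts in sorted(timestamps):
        # An event joins the current session only if it arrives before the
        # gap since the previous event has fully elapsed.
        if sessions and ts - sessions[-1][-1] < gap:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions
```

For example, events for one user at 10:00, 10:20, 11:10, and 13:00 on the same day fall into two sessions, because only the last event follows more than 60 minutes of inactivity. Unlike fixed or sliding windows, the boundaries here depend on each user's own activity rather than on the clock.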

Question 13

You are creating a model to predict housing prices. Due to budget constraints, you must run it on a single resource-constrained virtual machine. Which learning algorithm should you use?

Options:

Linear regression

Logistic classification

Recurrent neural network

Feedforward neural network

Question 14

You designed a database for patient records as a pilot project to cover a few hundred patients in three clinics. Your design used a single database table to represent all patients and their visits, and you used self-joins to generate reports. The server resource utilization was at 50%. Since then, the scope of the project has expanded. The database must now store 100 times more patient records. You can no longer run the reports, because they either take too long or they encounter errors with insufficient compute resources. How should you adjust the database design?

Options:

Add capacity (memory and disk space) to the database server by the order of 200.

Shard the tables into smaller ones based on date ranges, and only generate reports with prespecified date ranges.

Normalize the master patient-record table into the patient table and the visits table, and create other necessary tables to avoid self-join.

Partition the table into smaller tables, with one for each clinic. Run queries against the smaller table pairs, and use unions for consolidated reports.

Question 15

You are building a model to make clothing recommendations. You know a user’s fashion preference is likely to change over time, so you build a data pipeline to stream new data back to the model as it becomes available. How should you use this data to train the model?

Options:

Continuously retrain the model on just the new data.

Continuously retrain the model on a combination of existing data and the new data.

Train on the existing data while using the new data as your test set.

Train on the new data while using the existing data as your test set.

Question 16

You work for a car manufacturer and have set up a data pipeline using Google Cloud Pub/Sub to capture anomalous sensor events. You are using a push subscription in Cloud Pub/Sub that calls a custom HTTPS endpoint that you have created to take action on these anomalous events as they occur. Your custom HTTPS endpoint keeps getting an inordinate number of duplicate messages. What is the most likely cause of these duplicate messages?

Options:

The message body for the sensor event is too large.

Your custom endpoint has an out-of-date SSL certificate.

The Cloud Pub/Sub topic has too many messages published to it.

Your custom endpoint is not acknowledging messages within the acknowledgement deadline.

Question 17

Your software uses a simple JSON format for all messages. These messages are published to Google Cloud Pub/Sub, then processed with Google Cloud Dataflow to create a real-time dashboard for the CFO. During testing, you notice that some messages are missing in the dashboard. You check the logs, and all messages are being published to Cloud Pub/Sub successfully. What should you do next?

Options:

Check the dashboard application to see if it is not displaying correctly.

Run a fixed dataset through the Cloud Dataflow pipeline and analyze the output.

Use Google Stackdriver Monitoring on Cloud Pub/Sub to find the missing messages.

Switch Cloud Dataflow to pull messages from Cloud Pub/Sub instead of Cloud Pub/Sub pushing messages to Cloud Dataflow.

Question 18

You need to store and analyze social media postings in Google BigQuery at a rate of 10,000 messages per minute in near real-time. Initially, you design the application to use streaming inserts for individual postings. Your application also performs data aggregations right after the streaming inserts. You discover that the queries after streaming inserts do not exhibit strong consistency, and reports from the queries might miss in-flight data. How can you adjust your application design?

Options:

Re-write the application to load accumulated data every 2 minutes.

Convert the streaming insert code to batch load for individual messages.

Load the original message to Google Cloud SQL, and export the table every hour to BigQuery via streaming inserts.

Estimate the average latency for data availability after streaming inserts, and always run queries after waiting twice as long.

Question 19

Your company is using wildcard tables to query data across multiple tables with similar names. The SQL statement is currently failing with the following error:

# Syntax error : Expected end of statement but got "-" at [4:11]
SELECT age
FROM
  bigquery-public-data.noaa_gsod.gsod
WHERE
  age != 99
  AND _TABLE_SUFFIX = '1929'
ORDER BY
  age DESC

Which table name will make the SQL statement work correctly?

Options:

'bigquery-public-data.noaa_gsod.gsod'

bigquery-public-data.noaa_gsod.gsod*

'bigquery-public-data.noaa_gsod.gsod'*

`bigquery-public-data.noaa_gsod.gsod*`

Question 20

Your company handles data processing for a number of different clients. Each client prefers to use their own suite of analytics tools, with some allowing direct query access via Google BigQuery. You need to secure the data so that clients cannot see each other’s data. You want to ensure appropriate access to the data. Which three steps should you take? (Choose three.)

Options:

Load data into different partitions.

Load data into a different dataset for each client.

Put each client’s BigQuery dataset into a different table.

Restrict a client’s dataset to approved users.

Only allow a service account to access the datasets.

Use the appropriate identity and access management (IAM) roles for each client’s users.

Question 21

What are two methods that can be used to denormalize tables in BigQuery?

Options:

1) Split table into multiple tables; 2) Use a partitioned table

1) Join tables into one table; 2) Use nested repeated fields

1) Use a partitioned table; 2) Join tables into one table

1) Use nested repeated fields; 2) Use a partitioned table

Question 22

Which of the following IAM roles does your Compute Engine account require to be able to run pipeline jobs?

Options:

dataflow.worker

dataflow.compute

dataflow.developer

dataflow.viewer

Question 23

Suppose you have a table that includes a nested column called "city" inside a column called "person", but when you try to submit the following query in BigQuery, it gives you an error.

SELECT person FROM `project1.example.table1` WHERE city = "London"

How would you correct the error?

Options:

Add ", UNNEST(person)" before the WHERE clause.

Change "person" to "person.city".

Change "person" to "city.person".

Add ", UNNEST(city)" before the WHERE clause.

Question 24

How would you query specific partitions in a BigQuery table?

Options:

Use the DAY column in the WHERE clause

Use the EXTRACT(DAY) clause

Use the __PARTITIONTIME pseudo-column in the WHERE clause

Use DATE BETWEEN in the WHERE clause

Question 25

Which Java SDK class can you use to run your Dataflow programs locally?

Options:

LocalRunner

DirectPipelineRunner

MachineRunner

LocalPipelineRunner

Question 26

What are the minimum permissions needed for a service account used with Google Dataproc?

Options:

Execute to Google Cloud Storage; write to Google Cloud Logging

Write to Google Cloud Storage; read to Google Cloud Logging

Execute to Google Cloud Storage; execute to Google Cloud Logging

Read and write to Google Cloud Storage; write to Google Cloud Logging

Question 27

How can you get a neural network to learn about relationships between categories in a categorical feature?

Options:

Create a multi-hot column

Create a one-hot column

Create a hash bucket

Create an embedding column

Question 28

From which of these sources can you not load data into BigQuery?

Options:

File upload

Google Drive

Google Cloud Storage

Google Cloud SQL

Question 29

To give a user read permission for only the first three columns of a table, which access control method would you use?

Options:

Primitive role

Predefined role

Authorized view

It's not possible to give access to only the first three columns of a table.

Question 30

Which of these is not a supported method of putting data into a partitioned table?

Options:

If you have existing data in a separate file for each day, then create a partitioned table and upload each file into the appropriate partition.

Run a query to get the records for a specific day from an existing table and for the destination table, specify a partitioned table ending with the day in the format "$YYYYMMDD".

Create a partitioned table and stream new records to it every day.

Use ORDER BY to put a table's rows into chronological order and then change the table's type to "Partitioned".

Question 31

What are all of the BigQuery operations that Google charges for?

Options:

Storage, queries, and streaming inserts

Storage, queries, and loading data from a file

Storage, queries, and exporting data

Queries and streaming inserts

Question 32

To securely transfer web traffic data from your computer's web browser to the Cloud Dataproc cluster, you should use a(n) _____.

Options:

VPN connection

Special browser

SSH tunnel

FTP connection

Question 33

Which of the following is not possible using primitive roles?

Options:

Give a user viewer access to BigQuery and owner access to Google Compute Engine instances.

Give UserA owner access and UserB editor access for all datasets in a project.

Give a user access to view all datasets in a project, but not run queries on them.

Give GroupA owner access and GroupB editor access for all datasets in a project.

Question 34

The YARN ResourceManager and the HDFS NameNode interfaces are available on a Cloud Dataproc cluster ____.

Options:

application node

conditional node

master node

worker node

Question 35

Which is the preferred method to use to avoid hotspotting in time series data in Bigtable?

Options:

Field promotion

Randomization

Salting

Hashing
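
As background for the terms in these options, the sketch below contrasts three ways of building a row key for time series data. It is a plain-Python illustration, not Bigtable client code: the function names, the `#` separator, the `device_id` field, and the bucket count are all hypothetical, and the salting variant shown (a hash of the timestamp modulo a small number, prefixed to the key) is only one common form.

```python
import zlib


def timestamp_first_key(device_id: str, ts: int) -> str:
    """Timestamp leads the key, so consecutive writes sort next to each other."""
    return f"{ts}#{device_id}"


def field_promoted_key(device_id: str, ts: int) -> str:
    """Field promotion: a field from the data moves in front of the timestamp."""
    return f"{device_id}#{ts}"


def salted_key(device_id: str, ts: int, buckets: int = 4) -> str:
    """Salting: a calculated prefix spreads consecutive timestamps over buckets."""
    salt = zlib.crc32(str(ts).encode("utf-8")) % buckets
    return f"{salt}#{ts}#{device_id}"
```

With field promotion, rows for one device stay contiguous and can be read with a single prefix scan. With salting, reading a time range means scanning every bucket prefix.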

Question 36

If a dataset contains rows with individual people and columns for year of birth, country, and income, how many of the columns are continuous and how many are categorical?

Options:

1 continuous and 2 categorical

3 categorical

3 continuous

2 continuous and 1 categorical

Question 37

If you're running a performance test that depends upon Cloud Bigtable, all the choices except one below are recommended steps. Which is NOT a recommended step to follow?

Options:

Do not use a production instance.

Run your test for at least 10 minutes.

Before you test, run a heavy pre-test for several minutes.

Use at least 300 GB of data.

Question 38

You are developing a software application using Google's Dataflow SDK, and want to use conditionals, for loops, and other complex programming structures to create a branching pipeline. Which component will be used for the data processing operation?

Options:

PCollection

Transform

Pipeline

Sink API

Question 39

You want to use a BigQuery table as a data sink. In which writing mode(s) can you use BigQuery as a sink?

Options:

Both batch and streaming

BigQuery cannot be used as a sink

Only batch

Only streaming

Question 40

Which of the following statements about the Wide & Deep Learning model are true? (Select 2 answers.)

Options:

The wide model is used for memorization, while the deep model is used for generalization.

A good use for the wide and deep model is a recommender system.

The wide model is used for generalization, while the deep model is used for memorization.

A good use for the wide and deep model is a small-scale linear regression problem.

Question 41

You work for a shipping company that has distribution centers where packages move on delivery lines to route them properly. The company wants to add cameras to the delivery lines to detect and track any visual damage to the packages in transit. You need to create a way to automate the detection of damaged packages and flag them for human review in real time while the packages are in transit. Which solution should you choose?

Options:

Use BigQuery machine learning to be able to train the model at scale, so you can analyze the packages in batches.

Train an AutoML model on your corpus of images, and build an API around that model to integrate with the package tracking applications.

Use the Cloud Vision API to detect for damage, and raise an alert through Cloud Functions. Integrate the package tracking applications with this function.

Use TensorFlow to create a model that is trained on your corpus of images. Create a Python notebook in Cloud Datalab that uses this model so you can analyze for damaged packages.

Question 42

You have a BigQuery dataset named "customers". All tables will be tagged by using a Data Catalog tag template named "gdpr". The template contains one mandatory field, "has_sensitive_data", with a boolean value. All employees must be able to do a simple search and find tables in the dataset that have either true or false in the "has_sensitive_data" field. However, only the Human Resources (HR) group should be able to see the data inside the tables for which "has_sensitive_data" is true. You give the all employees group the bigquery.metadataViewer and bigquery.connectionUser roles on the dataset. You want to minimize configuration overhead. What should you do next?

Options:

Create the "gdpr" tag template with private visibility. Assign the bigquery.dataViewer role to the HR group on the tables that contain sensitive data.

Create the "gdpr" tag template with private visibility. Assign the datacatalog.tagTemplateViewer role on this tag to the all employees group, and assign the bigquery.dataViewer role to the HR group on the tables that contain sensitive data.

Create the "gdpr" tag template with public visibility. Assign the bigquery.dataViewer role to the HR group on the tables that contain sensitive data.

Create the "gdpr" tag template with public visibility. Assign the datacatalog.tagTemplateViewer role on this tag to the all employees group, and assign the bigquery.dataViewer role to the HR group on the tables that contain sensitive data.

Question 43

You want to encrypt the customer data stored in BigQuery. You need to implement per-user crypto-deletion on data stored in your tables. You want to adopt native features in Google Cloud to avoid custom solutions. What should you do?

Options:

Create a customer-managed encryption key (CMEK) in Cloud KMS. Associate the key to the table while creating the table.

Create a customer-managed encryption key (CMEK) in Cloud KMS. Use the key to encrypt data before storing in BigQuery.

Implement Authenticated Encryption with Associated Data (AEAD) BigQuery functions while storing your data in BigQuery.

Encrypt your data during ingestion by using a cryptographic library supported by your ETL pipeline.

Question 44

Data Analysts in your company have the Cloud IAM Owner role assigned to them in their projects to allow them to work with multiple GCP products in their projects. Your organization requires that all BigQuery data access logs be retained for 6 months. You need to ensure that only audit personnel in your company can access the data access logs for all projects. What should you do?

Options:

Enable data access logs in each Data Analyst’s project. Restrict access to Stackdriver Logging via Cloud IAM roles.

Export the data access logs via a project-level export sink to a Cloud Storage bucket in the Data Analysts’ projects. Restrict access to the Cloud Storage bucket.

Export the data access logs via a project-level export sink to a Cloud Storage bucket in a newly created project for audit logs. Restrict access to the project with the exported logs.

Export the data access logs via an aggregated export sink to a Cloud Storage bucket in a newly created project for audit logs. Restrict access to the project that contains the exported logs.

Question 45

You use a dataset in BigQuery for analysis. You want to provide third-party companies with access to the same dataset. You need to keep the costs of data sharing low and ensure that the data is current. Which solution should you choose?

Options:

Create an authorized view on the BigQuery table to control data access, and provide third-party companies with access to that view.

Use Cloud Scheduler to export the data on a regular basis to Cloud Storage, and provide third-party companies with access to the bucket.

Create a separate dataset in BigQuery that contains the relevant data to share, and provide third-party companies with access to the new dataset.

Create a Cloud Dataflow job that reads the data in frequent time intervals, and writes it to the relevant BigQuery dataset or Cloud Storage bucket for third-party companies to use.

Question 46

You are training a spam classifier. You notice that you are overfitting the training data. Which three actions can you take to resolve this problem? (Choose three.)

Options:

Get more training examples

Reduce the number of training examples

Use a smaller set of features

Use a larger set of features

Increase the regularization parameters

Decrease the regularization parameters

Question 47

You are designing a system that requires an ACID-compliant database. You must ensure that the system requires minimal human intervention in case of a failure. What should you do?

Options:

Configure a Cloud SQL for MySQL instance with point-in-time recovery enabled.

Configure a Cloud SQL for PostgreSQL instance with high availability enabled.

Configure a Bigtable instance with more than one cluster.

Configure a BigQuery table with a multi-region configuration.

Question 48

You need to set access to BigQuery for different departments within your company. Your solution should comply with the following requirements:

Each department should have access only to their data.

Each department will have one or more leads who need to be able to create and update tables and provide them to their team.

Each department has data analysts who need to be able to query but not modify data.

How should you set access to the data in BigQuery?

Options:

Create a dataset for each department. Assign the department leads the role of OWNER, and assign the data analysts the role of WRITER on their dataset.

Create a dataset for each department. Assign the department leads the role of WRITER, and assign the data analysts the role of READER on their dataset.

Create a table for each department. Assign the department leads the role of Owner, and assign the data analysts the role of Editor on the project the table is in.

Create a table for each department. Assign the department leads the role of Editor, and assign the data analysts the role of Viewer on the project the table is in.

Question 49

A TensorFlow machine learning model on Compute Engine virtual machines (n2-standard-32) takes two days to complete training. The model has custom TensorFlow operations that must run partially on a CPU. You want to reduce the training time in a cost-effective manner. What should you do?

Options:

Change the VM type to n2-highmem-32

Change the VM type to e2-standard-32

Train the model using a VM with a GPU hardware accelerator

Train the model using a VM with a TPU hardware accelerator

Question 50

You are developing a model to identify the factors that lead to sales conversions for your customers. You have completed processing your data. You want to continue through the model development lifecycle. What should you do next?

Options:

Use your model to run predictions on fresh customer input data.

Test and evaluate your model on your curated data to determine how well the model performs.

Monitor your model performance, and make any adjustments needed.

Delineate what data will be used for testing and what will be used for training the model.

Question 51

You have a variety of files in Cloud Storage that your data science team wants to use in their models. Currently, users do not have a method to explore, cleanse, and validate the data in Cloud Storage. You are looking for a low-code solution that can be used by your data science team to quickly cleanse and explore data within Cloud Storage. What should you do?

Options:

Load the data into BigQuery and use SQL to transform the data as necessary. Provide the data science team access to staging tables to explore the raw data.

Provide the data science team access to Dataflow to create a pipeline to prepare and validate the raw data and load data into BigQuery for data exploration.

Provide the data science team access to Dataprep to prepare, validate, and explore the data within Cloud Storage.

Create an external table in BigQuery and use SQL to transform the data as necessary. Provide the data science team access to the external tables to explore the raw data.

Question 52

You have a petabyte of analytics data and need to design a storage and processing platform for it. You must be able to perform data warehouse-style analytics on the data in Google Cloud and expose the dataset as files for batch analysis tools in other cloud providers. What should you do?

Options:

Store and process the entire dataset in BigQuery.

Store and process the entire dataset in Cloud Bigtable.

Store the full dataset in BigQuery, and store a compressed copy of the data in a Cloud Storage bucket.

Store the warm data as files in Cloud Storage, and store the active data in BigQuery. Keep this ratio as 80% warm and 20% active.

Question 53

You want to migrate an on-premises Hadoop system to Cloud Dataproc. Hive is the primary tool in use, and the data format is Optimized Row Columnar (ORC). All ORC files have been successfully copied to a Cloud Storage bucket. You need to replicate some data to the cluster’s local Hadoop Distributed File System (HDFS) to maximize performance. What are two ways to start using Hive in Cloud Dataproc? (Choose two.)

Options:

Run the gsutil utility to transfer all ORC files from the Cloud Storage bucket to HDFS. Mount the Hive tables locally.

Run the gsutil utility to transfer all ORC files from the Cloud Storage bucket to any node of the Dataproc cluster. Mount the Hive tables locally.

Run the gsutil utility to transfer all ORC files from the Cloud Storage bucket to the master node of the Dataproc cluster. Then run the Hadoop utility to copy them to HDFS. Mount the Hive tables from HDFS.

Leverage Cloud Storage connector for Hadoop to mount the ORC files as external Hive tables. Replicate external Hive tables to the native ones.

Load the ORC files into BigQuery. Leverage BigQuery connector for Hadoop to mount the BigQuery tables as external Hive tables. Replicate external Hive tables to the native ones.

Question 54

You are a head of BI at a large enterprise company with multiple business units that each have different priorities and budgets. You use on-demand pricing for BigQuery with a quota of 2K concurrent on-demand slots per project. Users at your organization sometimes don’t get slots to execute their query and you need to correct this. You’d like to avoid introducing new projects to your account.

What should you do?

Options:

Convert your batch BQ queries into interactive BQ queries.

Create an additional project to overcome the 2K on-demand per-project quota.

Switch to flat-rate pricing and establish a hierarchical priority model for your projects.

Increase the amount of concurrent slots per project at the Quotas page at the Cloud Console.

Question 55

You want to build a managed Hadoop system as your data lake. The data transformation process is composed of a series of Hadoop jobs executed in sequence. To accomplish the design of separating storage from compute, you decided to use the Cloud Storage connector to store all input data, output data, and intermediary data. However, you noticed that one Hadoop job runs very slowly with Cloud Dataproc, when compared with the on-premises bare-metal Hadoop environment (8-core nodes with 100-GB RAM). Analysis shows that this particular Hadoop job is disk I/O intensive. You want to resolve the issue. What should you do?

Options:

Allocate sufficient memory to the Hadoop cluster, so that the intermediary data of that particular Hadoop job can be held in memory

Allocate sufficient persistent disk space to the Hadoop cluster, and store the intermediate data of that particular Hadoop job on native HDFS

Allocate more CPU cores of the virtual machine instances of the Hadoop cluster so that the networking bandwidth for each instance can scale up

Allocate additional network interface card (NIC), and configure link aggregation in the operating system to use the combined throughput when working with Cloud Storage

Question 56

You store historic data in Cloud Storage. You need to perform analytics on the historic data. You want to use a solution to detect invalid data entries and perform data transformations that will not require programming or knowledge of SQL.

What should you do?

Options:

Use Cloud Dataflow with Beam to detect errors and perform transformations.

Use Cloud Dataprep with recipes to detect errors and perform transformations.

Use Cloud Dataproc with a Hadoop job to detect errors and perform transformations.

Use federated tables in BigQuery with queries to detect errors and perform transformations.

Question 57

You need to modernize your existing on-premises data strategy. Your organization currently uses:

• Apache Hadoop clusters for processing multiple large data sets, including on-premises Hadoop Distributed File System (HDFS) for data replication.

• Apache Airflow to orchestrate hundreds of ETL pipelines with thousands of job steps.

You need to set up a new architecture in Google Cloud that can handle your Hadoop workloads and requires minimal changes to your existing orchestration processes. What should you do?

Options:

Use Dataproc to migrate Hadoop clusters to Google Cloud, and Cloud Storage to handle any HDFS use cases. Convert your ETL pipelines to Dataflow.

Use Bigtable for your large workloads, with connections to Cloud Storage to handle any HDFS use cases. Orchestrate your pipelines with Cloud Composer.

Use Dataproc to migrate your Hadoop clusters to Google Cloud, and Cloud Storage to handle any HDFS use cases. Use Cloud Data Fusion to visually design and deploy your ETL pipelines.

Use Dataproc to migrate Hadoop clusters to Google Cloud, and Cloud Storage to handle any HDFS use cases. Orchestrate your pipelines with Cloud Composer.

Answer:

Explanation:

Dataproc is a fully managed service that allows you to run Apache Hadoop and Spark workloads on Google Cloud. It is compatible with the open source ecosystem, so you can migrate your existing Hadoop clusters to Dataproc with minimal changes. Cloud Storage is a scalable, durable, and cost-effective object storage service that can replace HDFS for storing and accessing data. Cloud Storage offers interoperability with Hadoop through connectors, so you can use it as a data source or sink for your Dataproc jobs. Cloud Composer is a fully managed service that allows you to create, schedule, and monitor workflows using Apache Airflow. It is integrated with Google Cloud services, such as Dataproc, BigQuery, Dataflow, and Pub/Sub, so you can orchestrate your ETL pipelines across different platforms. Cloud Composer is compatible with your existing Airflow code, so you can migrate your existing orchestration processes to Cloud Composer with minimal changes.

The other options are not as suitable as Dataproc and Cloud Composer for this use case, because they either require more changes to your existing code, or do not meet your requirements. Dataflow is a fully managed service that allows you to create and run scalable data processing pipelines using Apache Beam. However, Dataflow is not compatible with your existing Hadoop code, so you would need to rewrite your ETL pipelines using Beam. Bigtable is a fully managed NoSQL database service that can handle large and complex data sets. However, Bigtable is not compatible with your existing Hadoop code, so you would need to rewrite your queries and applications using Bigtable APIs. Cloud Data Fusion is a fully managed service that allows you to visually design and deploy data integration pipelines using a graphical interface. However, Cloud Data Fusion is not compatible with your existing Airflow code, so you would need to recreate your orchestration processes using the Cloud Data Fusion UI.

References:

Dataproc overview

Cloud Storage connector for Hadoop

Cloud Composer overview

Question 58

You are building a real-time prediction engine that streams files, which may contain PII (personally identifiable information) data, into Cloud Storage and eventually into BigQuery. You want to ensure that the sensitive data is masked but still maintains referential integrity, because names and emails are often used as join keys. How should you use the Cloud Data Loss Prevention API (DLP API) to ensure that the PII data is not accessible by unauthorized individuals?

Options:

Create a pseudonym by replacing the PII data with cryptographic tokens, and store the non-tokenized data in a locked-down bucket.

Redact all PII data, and store a version of the unredacted data in a locked-down bucket.

Scan every table in BigQuery, and mask the data it finds that has PII.

Create a pseudonym by replacing PII data with a cryptographic format-preserving token.
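
The referential-integrity requirement in this question can be illustrated with a small sketch. It does not call the DLP API, and the token it produces is not format-preserving: HMAC-SHA256 from the Python standard library is used only as a stand-in for a deterministic cryptographic token. The key, the table contents, and the function name pseudonymize are hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret key; a real deployment would keep it in a key management service.
SECRET_KEY = b"example-key-not-for-production"


def pseudonymize(value: str) -> str:
    """Return a deterministic token: the same input always yields the same token."""
    normalized = value.strip().lower()
    digest = hmac.new(SECRET_KEY, normalized.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


# Two hypothetical tables that share an email column as the join key.
orders = [{"email": "ana@example.com", "order_id": 1}]
profiles = [{"email": "Ana@example.com ", "segment": "gold"}]

masked_orders = [{**row, "email": pseudonymize(row["email"])} for row in orders]
masked_profiles = [{**row, "email": pseudonymize(row["email"])} for row in profiles]

# The masked values still match each other, so the join survives masking.
joined = [
    (o["order_id"], p["segment"])
    for o in masked_orders
    for p in masked_profiles
    if o["email"] == p["email"]
]
print(joined)
```

Because the token depends only on the key and the input, the same email maps to the same token in every table, which is what keeps joins working. Plain redaction or a random replacement would not have that property.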

Question 59

You need to deploy additional dependencies to all nodes of a Cloud Dataproc cluster at startup using an existing initialization action. Company security policies require that Cloud Dataproc nodes do not have access to the Internet so public initialization actions cannot fetch resources. What should you do?

Options:

Deploy the Cloud SQL Proxy on the Cloud Dataproc master

Use an SSH tunnel to give the Cloud Dataproc cluster access to the Internet

Copy all dependencies to a Cloud Storage bucket within your VPC security perimeter

Use Resource Manager to add the service account used by the Cloud Dataproc cluster to the Network User role

Question 60

Your company maintains a hybrid deployment with GCP, where analytics are performed on your anonymized customer data. The data are imported to Cloud Storage from your data center through parallel uploads to a data transfer server running on GCP. Management informs you that the daily transfers take too long and have asked you to fix the problem. You want to maximize transfer speeds. Which action should you take?

Options:

Increase the CPU size on your server.

Increase the size of the Google Persistent Disk on your server.

Increase your network bandwidth from your datacenter to GCP.

Increase your network bandwidth from Compute Engine to Cloud Storage.

Question 61

MJTelco needs you to create a schema in Google Bigtable that will allow for the historical analysis of the last 2 years of records. Each record that comes in is sent every 15 minutes, and contains a unique identifier of the device and a data record. The most common query is for all the data for a given device for a given day. Which schema should you use?

Options:

Rowkey: date#device_id; Column data: data_point

Rowkey: date; Column data: device_id, data_point

Rowkey: device_id; Column data: date, data_point

Rowkey: data_point; Column data: device_id, date

Rowkey: date#data_point; Column data: device_id

Question 62

MJTelco’s Google Cloud Dataflow pipeline is now ready to start receiving data from the 50,000 installations. You want to allow Cloud Dataflow to scale its compute power up as required. Which Cloud Dataflow pipeline configuration setting should you update?

Options:

The zone

The number of workers

The disk size per worker

The maximum number of workers

Question 63

Given the record streams MJTelco is interested in ingesting per day, they are concerned about the cost of Google BigQuery increasing. MJTelco asks you to provide a design solution. They require a single large data table called tracking_table. Additionally, they want to minimize the cost of daily queries while performing fine-grained analysis of each day’s events. They also want to use streaming ingestion. What should you do?

Options:

Create a table called tracking_table and include a DATE column.

Create a partitioned table called tracking_table and include a TIMESTAMP column.

Create sharded tables for each day following the pattern tracking_table_YYYYMMDD.

Create a table called tracking_table with a TIMESTAMP column to represent the day.

Question 64

You need to compose visualizations for operations teams with the following requirements:

Which approach meets the requirements?

Options:

Load the data into Google Sheets, use formulas to calculate a metric, and use filters/sorting to show only suboptimal links in a table.

Load the data into Google BigQuery tables, write Google Apps Script that queries the data, calculates the metric, and shows only suboptimal rows in a table in Google Sheets.

Load the data into Google Cloud Datastore tables, write a Google App Engine Application that queries all rows, applies a function to derive the metric, and then renders results in a table using the Google charts and visualization API.

Load the data into Google BigQuery tables, write a Google Data Studio 360 report that connects to your data, calculates a metric, and then uses a filter expression to show only suboptimal rows in a table.

Question 65

You need to compose visualizations for operations teams with the following requirements:

Telemetry must include data from all 50,000 installations for the most recent 6 weeks (sampling once every minute)

The report must not be more than 3 hours delayed from live data.

The actionable report should only show suboptimal links.

Most suboptimal links should be sorted to the top.

Suboptimal links can be grouped and filtered by regional geography.

User response time to load the report must be <5 seconds.

You create a data source to store the last 6 weeks of data, and create visualizations that allow viewers to see multiple date ranges, distinct geographic regions, and unique installation types. You always show the latest data without any changes to your visualizations. You want to avoid creating and updating new visualizations each month. What should you do?

Options:

Look through the current data and compose a series of charts and tables, one for each possible combination of criteria.

Look through the current data and compose a small set of generalized charts and tables bound to criteria filters that allow value selection.

Export the data to a spreadsheet, compose a series of charts and tables, one for each possible combination of criteria, and spread them across multiple tabs.

Load the data into relational database tables, write a Google App Engine application that queries all rows, summarizes the data across each criteria, and then renders results using the Google Charts and visualization API.

Question 66

You create a new report for your large team in Google Data Studio 360. The report uses Google BigQuery as its data source. It is company policy to ensure employees can view only the data associated with their region, so you create and populate a table for each region. You need to enforce the regional access policy to the data.

Which two actions should you take? (Choose two.)

Options:

Ensure all the tables are included in a global dataset.

Ensure each table is included in a dataset for a region.

Adjust the settings for each table to allow a related region-based security group view access.

Adjust the settings for each view to allow a related region-based security group view access.

Adjust the settings for each dataset to allow a related region-based security group view access.

Question 67

MJTelco is building a custom interface to share data. They have these requirements:

They need to do aggregations over their petabyte-scale datasets.

They need to scan specific time range rows with a very fast response time (milliseconds).

Which combination of Google Cloud Platform products should you recommend?

Options:

Cloud Datastore and Cloud Bigtable

Cloud Bigtable and Cloud SQL

BigQuery and Cloud Bigtable

BigQuery and Cloud Storage

Question 68

Flowlogistic’s CEO wants to gain rapid insight into their customer base so his sales team can be better informed in the field. This team is not very technical, so they’ve purchased a visualization tool to simplify the creation of BigQuery reports. However, they’ve been overwhelmed by all the data in the table, and are spending a lot of money on queries trying to find the data they need. You want to solve their problem in the most cost-effective way. What should you do?

Options:

Export the data into a Google Sheet for visualization.

Create an additional table with only the necessary columns.

Create a view on the table to present to the visualization tool.

Create identity and access management (IAM) roles on the appropriate columns, so only they appear in a query.

Question 69

Flowlogistic is rolling out their real-time inventory tracking system. The tracking devices will all send package-tracking messages, which will now go to a single Google Cloud Pub/Sub topic instead of the Apache Kafka cluster. A subscriber application will then process the messages for real-time reporting and store them in Google BigQuery for historical analysis. You want to ensure the package data can be analyzed over time.

Which approach should you take?

Options:

Attach the timestamp on each message in the Cloud Pub/Sub subscriber application as they are received.

Attach the timestamp and Package ID on the outbound message from each publisher device as they are sent to Cloud Pub/Sub.

Use the NOW() function in BigQuery to record the event’s time.

Use the automatically generated timestamp from Cloud Pub/Sub to order the data.

Question 70

Flowlogistic wants to use Google BigQuery as their primary analysis system, but they still have Apache Hadoop and Spark workloads that they cannot move to BigQuery. Flowlogistic does not know how to store the data that is common to both workloads. What should they do?

Options:

Store the common data in BigQuery as partitioned tables.

Store the common data in BigQuery and expose authorized views.

Store the common data encoded as Avro in Google Cloud Storage.

Store the common data in the HDFS storage for a Google Cloud Dataproc cluster.

Question 71

Flowlogistic’s management has determined that the current Apache Kafka servers cannot handle the data volume for their real-time inventory tracking system. You need to build a new system on Google Cloud Platform (GCP) that will feed the proprietary tracking software. The system must be able to ingest data from a variety of global sources, process and query in real-time, and store the data reliably. Which combination of GCP products should you choose?

Options:

Cloud Pub/Sub, Cloud Dataflow, and Cloud Storage

Cloud Pub/Sub, Cloud Dataflow, and Local SSD

Cloud Pub/Sub, Cloud SQL, and Cloud Storage

Cloud Load Balancing, Cloud Dataflow, and Cloud Storage

Question 72

Your company has recently grown rapidly and is now ingesting data at a significantly higher rate than it was previously. You manage the daily batch MapReduce analytics jobs in Apache Hadoop. However, the recent increase in data has meant the batch jobs are falling behind. You were asked to recommend ways the development team could increase the responsiveness of the analytics without increasing costs. What should you recommend they do?

Options:

Rewrite the job in Pig.

Rewrite the job in Apache Spark.

Increase the size of the Hadoop cluster.

Decrease the size of the Hadoop cluster but also rewrite the job in Hive.

Question 73

Your company is loading comma-separated values (CSV) files into Google BigQuery. The data is fully imported successfully; however, the imported data does not match the source file byte for byte. What is the most likely cause of this problem?

Options:

The CSV data loaded in BigQuery is not flagged as CSV.

The CSV data has invalid rows that were skipped on import.

The CSV data loaded in BigQuery is not using BigQuery’s default encoding.

The CSV data has not gone through an ETL phase before loading into BigQuery.

Question 74

Your company produces 20,000 files every hour. Each data file is formatted as a comma-separated values (CSV) file that is less than 4 KB. All files must be ingested on Google Cloud Platform before they can be processed. Your company site has a 200 ms latency to Google Cloud, and your Internet connection bandwidth is limited to 50 Mbps. You currently deploy a secure FTP (SFTP) server on a virtual machine in Google Compute Engine as the data ingestion point. A local SFTP client runs on a dedicated machine to transmit the CSV files as is. The goal is to make reports with data from the previous day available to the executives by 10:00 a.m. each day. This design is barely able to keep up with the current volume, even though the bandwidth utilization is rather low.

You are told that due to seasonality, your company expects the number of files to double for the next three months. Which two actions should you take? (Choose two.)

Options:

Introduce data compression for each file to increase the rate of file transfer.

Contact your internet service provider (ISP) to increase your maximum bandwidth to at least 100 Mbps.

Redesign the data ingestion process to use gsutil tool to send the CSV files to a storage bucket in parallel.

Assemble 1,000 files into a tape archive (TAR) file. Transmit the TAR files instead, and disassemble the CSV files in the cloud upon receiving them.

Create an S3-compatible storage endpoint in your network, and use Google Cloud Storage Transfer Service to transfer on-premises data to the designated storage bucket.
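
The figures in this scenario can be checked with a short back-of-the-envelope calculation. The sketch below assumes that each file is a full 4 KB (4,096 bytes), that files are sent strictly one after another, and that each file costs exactly one 200 ms round trip. The last point is a simplifying assumption; a real SFTP transfer typically needs several round trips per file.

```python
FILES_PER_HOUR = 20_000
FILE_SIZE_BYTES = 4 * 1024      # upper bound: each file is under 4 KB
LINK_MBPS = 50                  # stated Internet bandwidth
LATENCY_MS = 200                # stated latency to Google Cloud
ROUND_TRIPS_PER_FILE = 1        # assumption: one round trip per file

# Bandwidth actually needed to move one hour of files within one hour.
bits_per_hour = FILES_PER_HOUR * FILE_SIZE_BYTES * 8
required_mbps = bits_per_hour / 3600 / 1_000_000
utilization = required_mbps / LINK_MBPS

# Time spent purely waiting on latency when files are sent one at a time.
sequential_seconds = FILES_PER_HOUR * ROUND_TRIPS_PER_FILE * LATENCY_MS / 1000

print(f"required bandwidth: {required_mbps:.3f} Mbps")
print(f"link utilization:   {utilization:.2%}")
print(f"latency cost of one hour of files: {sequential_seconds:.0f} s")
```

Under these assumptions the payload needs well under 1 Mbps of the 50 Mbps link, while latency alone costs about 4,000 seconds for every 3,600 seconds of incoming files. This is consistent with the scenario's statement that the design barely keeps up even though bandwidth utilization is low.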

Question 75

You work for a large fast food restaurant chain with over 400,000 employees. You store employee information in Google BigQuery in a Users table consisting of a FirstName field and a LastName field. A member of IT is building an application and asks you to modify the schema and data in BigQuery so the application can query a FullName field consisting of the value of the FirstName field concatenated with a space, followed by the value of the LastName field for each employee. How can you make that data available while minimizing cost?

Options:

Create a view in BigQuery that concatenates the FirstName and LastName field values to produce the FullName.

Add a new column called FullName to the Users table. Run an UPDATE statement that updates the FullName column for each user with the concatenation of the FirstName and LastName values.

Create a Google Cloud Dataflow job that queries BigQuery for the entire Users table, concatenates the FirstName value and LastName value for each user, and loads the proper values for FirstName, LastName, and FullName into a new table in BigQuery.

Use BigQuery to export the data for the table to a CSV file. Create a Google Cloud Dataproc job to process the CSV file and output a new CSV file containing the proper values for FirstName, LastName and FullName. Run a BigQuery load job to load the new CSV file into BigQuery.
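
For readers unfamiliar with views, the first option above can be sketched as follows. SQLite from the Python standard library stands in for BigQuery here, so the SQL dialect and the cost model differ. The sample rows and the view name UsersWithFullName are hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for the BigQuery dataset.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Users (FirstName TEXT, LastName TEXT)")
conn.executemany(
    "INSERT INTO Users VALUES (?, ?)",
    [("Ada", "Lovelace"), ("Alan", "Turing")],
)

# The view stores only the query; FullName is computed when the view is read.
conn.execute(
    """
    CREATE VIEW UsersWithFullName AS
    SELECT FirstName, LastName, FirstName || ' ' || LastName AS FullName
    FROM Users
    """
)

full_names = [
    row[0]
    for row in conn.execute(
        "SELECT FullName FROM UsersWithFullName ORDER BY FullName"
    )
]
print(full_names)
```

A view leaves the underlying table untouched and adds no stored copy of the data, whereas the other options rewrite or reload the table.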

Question 76

You are designing the database schema for a machine learning-based food ordering service that will predict what users want to eat. Here is some of the information you need to store:

The user profile: What the user likes and doesn’t like to eat

The user account information: Name, address, preferred meal times

The order information: When orders are made, from where, to whom

The database will be used to store all the transactional data of the product. You want to optimize the data schema. Which Google Cloud Platform product should you use?

Options:

BigQuery

Cloud SQL

Cloud Bigtable

Cloud Datastore

Question 77

You are deploying a new storage system for your mobile application, which is a media streaming service. You decide the best fit is Google Cloud Datastore. You have entities with multiple properties, some of which can take on multiple values. For example, in the entity ‘Movie’ the property ‘actors’ and the property ‘tags’ have multiple values but the property ‘date released’ does not. A typical query would ask for all movies with actor=<actorname> ordered by date_released or all movies with tag=Comedy ordered by date_released. How should you avoid a combinatorial explosion in the number of indexes?

Options:

Option A

Option B

Option C

Option D

Question 78

You work for an economic consulting firm that helps companies identify economic trends as they happen. As part of your analysis, you use Google BigQuery to correlate customer data with the average prices of the 100 most common goods sold, including bread, gasoline, milk, and others. The average prices of these goods are updated every 30 minutes. You want to make sure this data stays up to date so you can combine it with other data in BigQuery as cheaply as possible. What should you do?

Options:

Load the data every 30 minutes into a new partitioned table in BigQuery.

Store and update the data in a regional Google Cloud Storage bucket and create a federated data source in BigQuery

Store the data in Google Cloud Datastore. Use Google Cloud Dataflow to query BigQuery and combine the data programmatically with the data stored in Cloud Datastore

Store the data in a file in a regional Google Cloud Storage bucket. Use Cloud Dataflow to query BigQuery and combine the data programmatically with the data stored in Google Cloud Storage.

Question 79

You are choosing a NoSQL database to handle telemetry data submitted from millions of Internet-of-Things (IoT) devices. The volume of data is growing at 100 TB per year, and each data entry has about 100 attributes. The data processing pipeline does not require atomicity, consistency, isolation, and durability (ACID). However, high availability and low latency are required.

You need to analyze the data by querying against individual fields. Which three databases meet your requirements? (Choose three.)

Options:

Redis

HBase

MySQL

MongoDB

Cassandra

HDFS with Hive

Question 80

You work for a manufacturing plant that batches application log files together into a single log file once a day at 2:00 AM. You have written a Google Cloud Dataflow job to process that log file. You need to make sure the log file is processed once per day as inexpensively as possible. What should you do?

Options:

Change the processing job to use Google Cloud Dataproc instead.

Manually start the Cloud Dataflow job each morning when you get into the office.

Create a cron job with Google App Engine Cron Service to run the Cloud Dataflow job.

Configure the Cloud Dataflow job as a streaming job so that it processes the log data immediately.

Exam Detail

Vendor: Google

Certification: Google Cloud Certified

Exam Code: Professional-Data-Engineer

Exam Name: Google Professional Data Engineer Exam

Last Update: May 6, 2026
