Winter Sale - Limited Time 65% Discount Offer - Ends in 0d 00h 00m 00s - Coupon code: top65certs

Free and Premium Google Professional-Data-Engineer Dumps Questions Answers

Google Professional Data Engineer Exam Questions and Answers

Question 1

Flowlogistic is rolling out their real-time inventory tracking system. The tracking devices will all send package-tracking messages, which will now go to a single Google Cloud Pub/Sub topic instead of the Apache Kafka cluster. A subscriber application will then process the messages for real-time reporting and store them in Google BigQuery for historical analysis. You want to ensure the package data can be analyzed over time.

Which approach should you take?

Options:

A.

Attach the timestamp on each message in the Cloud Pub/Sub subscriber application as they are received.

B.

Attach the timestamp and Package ID on the outbound message from each publisher device as they are sent to Clod Pub/Sub.

C.

Use the NOW () function in BigQuery to record the event’s time.

D.

Use the automatically generated timestamp from Cloud Pub/Sub to order the data.

Buy Now
Question 2

Flowlogistic wants to use Google BigQuery as their primary analysis system, but they still have Apache Hadoop and Spark workloads that they cannot move to BigQuery. Flowlogistic does not know how to store the data that is common to both workloads. What should they do?

Options:

A.

Store the common data in BigQuery as partitioned tables.

B.

Store the common data in BigQuery and expose authorized views.

C.

Store the common data encoded as Avro in Google Cloud Storage.

D.

Store he common data in the HDFS storage for a Google Cloud Dataproc cluster.

Question 3

Flowlogistic’s CEO wants to gain rapid insight into their customer base so his sales team can be better informed in the field. This team is not very technical, so they’ve purchased a visualization tool to simplify the creation of BigQuery reports. However, they’ve been overwhelmed by all thedata in the table, and are spending a lot of money on queries trying to find the data they need. You want to solve their problem in the most cost-effective way. What should you do?

Options:

A.

Export the data into a Google Sheet for virtualization.

B.

Create an additional table with only the necessary columns.

C.

Create a view on the table to present to the virtualization tool.

D.

Create identity and access management (IAM) roles on the appropriate columns, so only they appear in a query.

Question 4

Flowlogistic’s management has determined that the current Apache Kafka servers cannot handle the data volume for their real-time inventory tracking system. You need to build a new system on Google Cloud Platform (GCP) that will feed the proprietary tracking software. The system must be able to ingest data from a variety of global sources, process and query in real-time, and store the data reliably. Which combination of GCP products should you choose?

Options:

A.

Cloud Pub/Sub, Cloud Dataflow, and Cloud Storage

B.

Cloud Pub/Sub, Cloud Dataflow, and Local SSD

C.

Cloud Pub/Sub, Cloud SQL, and Cloud Storage

D.

Cloud Load Balancing, Cloud Dataflow, and Cloud Storage

Question 5

Given the record streams MJTelco is interested in ingesting per day, they are concerned about the cost of Google BigQuery increasing. MJTelco asks you to provide a design solution. They require a single large data table called tracking_table. Additionally, they want to minimize the cost of daily queries while performing fine-grained analysis of each day’s events. They also want to use streaming ingestion. What should you do?

Options:

A.

Create a table called tracking_table and include a DATE column.

B.

Create a partitioned table called tracking_table and include a TIMESTAMP column.

C.

Create sharded tables for each day following the pattern tracking_table_YYYYMMDD.

D.

Create a table called tracking_table with a TIMESTAMP column to represent the day.

Question 6

MJTelco’s Google Cloud Dataflow pipeline is now ready to start receiving data from the 50,000 installations. You want to allow Cloud Dataflow to scale its compute power up as required. Which Cloud Dataflow pipeline configuration setting should you update?

Options:

A.

The zone

B.

The number of workers

C.

The disk size per worker

D.

The maximum number of workers

Question 7

You need to compose visualizations for operations teams with the following requirements:

Which approach meets the requirements?

Options:

A.

Load the data into Google Sheets, use formulas to calculate a metric, and use filters/sorting to show only suboptimal links in a table.

B.

Load the data into Google BigQuery tables, write Google Apps Script that queries the data, calculates the metric, and shows only suboptimal rows in a table in Google Sheets.

C.

Load the data into Google Cloud Datastore tables, write a Google App Engine Application that queries all rows, applies a function to derive the metric, and then renders results in a table using the Google charts and visualization API.

D.

Load the data into Google BigQuery tables, write a Google Data Studio 360 report that connects to your data, calculates a metric, and then uses a filter expression to show only suboptimal rows in a table.

Question 8

MJTelco is building a custom interface to share data. They have these requirements:

They need to do aggregations over their petabyte-scale datasets.

They need to scan specific time range rows with a very fast response time (milliseconds).

Which combination of Google Cloud Platform products should you recommend?

Options:

A.

Cloud Datastore and Cloud Bigtable

B.

Cloud Bigtable and Cloud SQL

C.

BigQuery and Cloud Bigtable

D.

BigQuery and Cloud Storage

Question 9

MJTelco needs you to create a schema in Google Bigtable that will allow for the historical analysis of the last 2 years of records. Each record that comes in is sent every 15 minutes, and contains a unique identifier of the device and a data record. The most common query is for all the data for a given device for a given day. Which schema should you use?

Options:

A.

Rowkey: date#device_idColumn data: data_point

B.

Rowkey: dateColumn data: device_id, data_point

C.

Rowkey: device_idColumn data: date, data_point

D.

Rowkey: data_pointColumn data: device_id, date

E.

Rowkey: date#data_pointColumn data: device_id

Question 10

You need to compose visualization for operations teams with the following requirements:

Telemetry must include data from all 50,000 installations for the most recent 6 weeks (sampling once every minute)

The report must not be more than 3 hours delayed from live data.

The actionable report should only show suboptimal links.

Most suboptimal links should be sorted to the top.

Suboptimal links can be grouped and filtered by regional geography.

User response time to load the report must be <5 seconds.

You create a data source to store the last 6 weeks of data, and create visualizations that allow viewers to see multiple date ranges, distinct geographic regions, and unique installation types. You always show the latest data without any changes to your visualizations. You want to avoid creating and updating new visualizations each month. What should you do?

Options:

A.

Look through the current data and compose a series of charts and tables, one for each possiblecombination of criteria.

B.

Look through the current data and compose a small set of generalized charts and tables bound to criteria filters that allow value selection.

C.

Export the data to a spreadsheet, compose a series of charts and tables, one for each possiblecombination of criteria, and spread them across multiple tabs.

D.

Load the data into relational database tables, write a Google App Engine application that queries all rows, summarizes the data across each criteria, and then renders results using the Google Charts and visualization API.

Question 11

You create a new report for your large team in Google Data Studio 360. The report uses Google BigQuery as its data source. It is company policy to ensure employees can view only the data associated with their region, so you create and populate a table for each region. You need to enforce the regional access policy to the data.

Which two actions should you take? (Choose two.)

Options:

A.

Ensure all the tables are included in global dataset.

B.

Ensure each table is included in a dataset for a region.

C.

Adjust the settings for each table to allow a related region-based security group view access.

D.

Adjust the settings for each view to allow a related region-based security group view access.

E.

Adjust the settings for each dataset to allow a related region-based security group view access.

Question 12

You work for an economic consulting firm that helps companies identify economic trends as they happen. As part of your analysis, you use Google BigQuery to correlate customer data with the average prices of the 100 most common goods sold, including bread, gasoline, milk, and others. The average prices of these goods are updated every 30 minutes. You want to make sure this data stays up to date so you can combine it with other data in BigQuery as cheaply as possible. What should you do?

Options:

A.

Load the data every 30 minutes into a new partitioned table in BigQuery.

B.

Store and update the data in a regional Google Cloud Storage bucket and create a federated data source in BigQuery

C.

Store the data in Google Cloud Datastore. Use Google Cloud Dataflow to query BigQuery and combine the data programmatically with the data stored in Cloud Datastore

D.

Store the data in a file in a regional Google Cloud Storage bucket. Use Cloud Dataflow to query BigQuery and combine the data programmatically with the data stored in Google Cloud Storage.

Question 13

You are designing the database schema for a machine learning-based food ordering service that will predict what users want to eat. Here is some of the information you need to store:

The user profile: What the user likes and doesn’t like to eat

The user account information: Name, address, preferred meal times

The order information: When orders are made, from where, to whom

The database will be used to store all the transactional data of the product. You want to optimize the data schema. Which Google Cloud Platform product should you use?

Options:

A.

BigQuery

B.

Cloud SQL

C.

Cloud Bigtable

D.

Cloud Datastore

Question 14

You work for a large fast food restaurant chain with over 400,000 employees. You store employee information in Google BigQuery in a Users table consisting of a FirstName field and a LastName field. A member of IT is building an application and asks you to modify the schema and data in BigQuery so the application can query a FullName field consisting of the value of the FirstName field concatenated with a space, followed by the value of the LastName field for each employee. How can you make that data available while minimizing cost?

Options:

A.

Create a view in BigQuery that concatenates the FirstName and LastName field values to produce the FullName.

B.

Add a new column called FullName to the Users table. Run an UPDATE statement that updates the FullName column for each user with the concatenation of the FirstName and LastName values.

C.

Create a Google Cloud Dataflow job that queries BigQuery for the entire Users table, concatenates the FirstName value and LastName value for each user, and loads the proper values for FirstName, LastName, and FullName into a new table in BigQuery.

D.

Use BigQuery to export the data for the table to a CSV file. Create a Google Cloud Dataproc job to process the CSV file and output a new CSV file containing the proper values for FirstName, LastName and FullName. Run a BigQuery load job to load the new CSV file into BigQuery.

Question 15

You are deploying a new storage system for your mobile application, which is a media streaming service. You decide the best fit is Google Cloud Datastore. You have entities with multiple properties, some of which can take on multiple values. For example, in the entity ‘Movie’ the property ‘actors’ and the property ‘tags’ have multiple values but the property ‘date released’ does not. A typical query would ask for all movies with actor=<actorname> ordered by date_released or all movies with tag=Comedy ordered by date_released. How should you avoid a combinatorial explosion in the number of indexes?

Options:

A.

Option A

B.

Option B.

C.

Option C

D.

Option D

Question 16

Your company has recently grown rapidly and now ingesting data at a significantly higher rate than it was previously. You manage the daily batch MapReduce analytics jobs in Apache Hadoop. However, the recent increase in data has meant the batch jobs are falling behind. You were asked to recommend ways the development team could increase the responsiveness of the analytics without increasing costs. What should you recommend they do?

Options:

A.

Rewrite the job in Pig.

B.

Rewrite the job in Apache Spark.

C.

Increase the size of the Hadoop cluster.

D.

Decrease the size of the Hadoop cluster but also rewrite the job in Hive.

Question 17

You are choosing a NoSQL database to handle telemetry data submitted from millions of Internet-of-Things (IoT) devices. The volume of data is growing at 100 TB per year, and each data entry has about 100 attributes. The data processing pipeline does not require atomicity, consistency, isolation, and durability (ACID). However, high availability and low latency are required.

You need to analyze the data by querying against individual fields. Which three databases meet your requirements? (Choose three.)

Options:

A.

Redis

B.

HBase

C.

MySQL

D.

MongoDB

E.

Cassandra

F.

HDFS with Hive

Question 18

Your company produces 20,000 files every hour. Each data file is formatted as a comma separated values (CSV) file that is less than 4 KB. All files must be ingested on Google Cloud Platform before they can be processed. Your company site has a 200 ms latency to Google Cloud, and your Internet connection bandwidth is limited as 50 Mbps. You currently deploy a secure FTP (SFTP) server on a virtual machine in Google Compute Engine as the data ingestion point. A local SFTP client runs on a dedicated machine to transmit the CSV files as is. The goal is to make reports with data from the previous day available to the executives by 10:00 a.m. each day. This design is barely able to keep up with the current volume, even though the bandwidth utilization is rather low.

You are told that due to seasonality, your company expects the number of files to double for the next three months. Which two actions should you take? (choose two.)

Options:

A.

Introduce data compression for each file to increase the rate file of file transfer.

B.

Contact your internet service provider (ISP) to increase your maximum bandwidth to at least 100 Mbps.

C.

Redesign the data ingestion process to use gsutil tool to send the CSV files to a storage bucket in parallel.

D.

Assemble 1,000 files into a tape archive (TAR) file. Transmit the TAR files instead, and disassemble the CSV files in the cloud upon receiving them.

E.

Create an S3-compatible storage endpoint in your network, and use Google Cloud Storage Transfer Service to transfer on-premices data to the designated storage bucket.

Question 19

You work for a manufacturing plant that batches application log files together into a single log file once a day at 2:00 AM. You have written a Google Cloud Dataflow job to process that log file. You need to make sure the log file in processed once per day as inexpensively as possible. What should you do?

Options:

A.

Change the processing job to use Google Cloud Dataproc instead.

B.

Manually start the Cloud Dataflow job each morning when you get into the office.

C.

Create a cron job with Google App Engine Cron Service to run the Cloud Dataflow job.

D.

Configure the Cloud Dataflow job as a streaming job so that it processes the log data immediately.

Question 20

Your company is loading comma-separated values (CSV) files into Google BigQuery. The data is fully imported successfully; however, the imported data is not matching byte-to-byte to the source file. What is the most likely cause of this problem?

Options:

A.

The CSV data loaded in BigQuery is not flagged as CSV.

B.

The CSV data has invalid rows that were skipped on import.

C.

The CSV data loaded in BigQuery is not using BigQuery’s default encoding.

D.

The CSV data has not gone through an ETL phase before loading into BigQuery.

Question 21

You want to use Google Stackdriver Logging to monitor Google BigQuery usage. You need an instant notification to be sent to your monitoring tool when new data is appended to a certain table using an insert job, but you do not want to receive notifications for other tables. What should you do?

Options:

A.

Make a call to the Stackdriver API to list all logs, and apply an advanced filter.

B.

In the Stackdriver logging admin interface, and enable a log sink export to BigQuery.

C.

In the Stackdriver logging admin interface, enable a log sink export to Google Cloud Pub/Sub, and subscribe to the topic from your monitoring tool.

D.

Using the Stackdriver API, create a project sink with advanced log filter to export to Pub/Sub, and subscribe to the topic from your monitoring tool.

Question 22

You are designing a basket abandonment system for an ecommerce company. The system will send a message to a user based on these rules:

No interaction by the user on the site for 1 hour

Has added more than $30 worth of products to the basket

Has not completed a transaction

You use Google Cloud Dataflow to process the data and decide if a message should be sent. How should you design the pipeline?

Options:

A.

Use a fixed-time window with a duration of 60 minutes.

B.

Use a sliding time window with a duration of 60 minutes.

C.

Use a session window with a gap time duration of 60 minutes.

D.

Use a global window with a time based trigger with a delay of 60 minutes.

Question 23

Your company is streaming real-time sensor data from their factory floor into Bigtable and they have noticed extremely poor performance. How should the row key be redesigned to improve Bigtable performance on queries that populate real-time dashboards?

Options:

A.

Use a row key of the form .

B.

Use a row key of the form .

C.

Use a row key of the form #.

D.

Use a row key of the form >##.

Question 24

Your company’s customer and order databases are often under heavy load. This makes performing analytics against them difficult without harming operations. The databases are in a MySQL cluster, with nightly backups taken using mysqldump. You want to perform analytics with minimal impact on operations. What should you do?

Options:

A.

Add a node to the MySQL cluster and build an OLAP cube there.

B.

Use an ETL tool to load the data from MySQL into Google BigQuery.

C.

Connect an on-premises Apache Hadoop cluster to MySQL and perform ETL.

D.

Mount the backups to Google Cloud SQL, and then process the data using Google Cloud Dataproc.

Question 25

Your company uses a proprietary system to send inventory data every 6 hours to a data ingestion service in the cloud. Transmitted data includes a payload of several fields and the timestamp of the transmission. If there are any concerns about a transmission, the system re-transmits the data. How should you deduplicate the data most efficiency?

Options:

A.

Assign global unique identifiers (GUID) to each data entry.

B.

Compute the hash value of each data entry, and compare it with all historical data.

C.

Store each data entry as the primary key in a separate database and apply an index.

D.

Maintain a database table to store the hash value and other metadata for each data entry.

Question 26

Your company is migrating their 30-node Apache Hadoop cluster to the cloud. They want to re-use Hadoop jobs they have already created and minimize the management of the cluster as much as possible. They also want to be able to persist data beyond the life of the cluster. What should you do?

Options:

A.

Create a Google Cloud Dataflow job to process the data.

B.

Create a Google Cloud Dataproc cluster that uses persistent disks for HDFS.

C.

Create a Hadoop cluster on Google Compute Engine that uses persistent disks.

D.

Create a Cloud Dataproc cluster that uses the Google Cloud Storage connector.

E.

Create a Hadoop cluster on Google Compute Engine that uses Local SSD disks.

Question 27

You designed a database for patient records as a pilot project to cover a few hundred patients in three clinics. Your design used a single database table to represent all patients and their visits, and you used self-joins to generate reports. The server resource utilization was at 50%. Since then, the scope of the project has expanded. The database must now store 100 times more patientrecords. You can no longer run the reports, because they either take too long or they encounter errors with insufficient compute resources. How should you adjust the database design?

Options:

A.

Add capacity (memory and disk space) to the database server by the order of 200.

B.

Shard the tables into smaller ones based on date ranges, and only generate reports with prespecified date ranges.

C.

Normalize the master patient-record table into the patient table and the visits table, and create other necessary tables to avoid self-join.

D.

Partition the table into smaller tables, with one for each clinic. Run queries against the smaller table pairs, and use unions for consolidated reports.

Question 28

You are deploying 10,000 new Internet of Things devices to collect temperature data in your warehouses globally. You need to process, store and analyze these very large datasets in real time. What should you do?

Options:

A.

Send the data to Google Cloud Datastore and then export to BigQuery.

B.

Send the data to Google Cloud Pub/Sub, stream Cloud Pub/Sub to Google Cloud Dataflow, and store the data in Google BigQuery.

C.

Send the data to Cloud Storage and then spin up an Apache Hadoop cluster as needed in Google Cloud Dataproc whenever analysis is required.

D.

Export logs in batch to Google Cloud Storage and then spin up a Google Cloud SQL instance, import the data from Cloud Storage, and run an analysis as needed.

Question 29

You want to process payment transactions in a point-of-sale application that will run on Google Cloud Platform. Your user base could grow exponentially, but you do not want to manage infrastructure scaling.

Which Google database service should you use?

Options:

A.

Cloud SQL

B.

BigQuery

C.

Cloud Bigtable

D.

Cloud Datastore

Question 30

Your company is using WHILECARD tables to query data across multiple tables with similar names. The SQL statement is currently failing with the following error:

# Syntax error : Expected end of statement but got “-“ at [4:11]

SELECT age

FROM

bigquery-public-data.noaa_gsod.gsod

WHERE

age != 99

AND_TABLE_SUFFIX = ‘1929’

ORDER BY

age DESC

Which table name will make the SQL statement work correctly?

Options:

A.

‘bigquery-public-data.noaa_gsod.gsod‘

B.

bigquery-public-data.noaa_gsod.gsod*

C.

‘bigquery-public-data.noaa_gsod.gsod’*

D.

‘bigquery-public-data.noaa_gsod.gsod*`

Question 31

You have spent a few days loading data from comma-separated values (CSV) files into the Google BigQuery table CLICK_STREAM. The column DT stores the epoch time of click events. For convenience, you chose a simple schema where every field is treated as the STRING type. Now, you want to compute web session durations of users who visit your site, and you want to change its data type to the TIMESTAMP. You want to minimize the migration effort without making future queries computationally expensive. What should you do?

Options:

A.

Delete the table CLICK_STREAM, and then re-create it such that the column DT is of the TIMESTAMP type. Reload the data.

B.

Add a column TS of the TIMESTAMP type to the table CLICK_STREAM, and populate the numeric values from the column TS for each row. Reference the column TS instead of the column DT from now on.

C.

Create a view CLICK_STREAM_V, where strings from the column DT are cast into TIMESTAMP values. Reference the view CLICK_STREAM_V instead of the table CLICK_STREAM from now on.

D.

Add two columns to the table CLICK STREAM: TS of the TIMESTAMP type and IS_NEW of the BOOLEAN type. Reload all data in append mode. For each appended row, set the value of IS_NEW to true. For future queries, reference the column TS instead of the column DT, with the WHERE clause ensuring that the value of IS_NEW must be true.

E.

Construct a query to return every row of the table CLICK_STREAM, while using the built-in function to cast strings from the column DT into TIMESTAMP values. Run the query into a destination table NEW_CLICK_STREAM, in which the column TS is the TIMESTAMP type. Reference the table NEW_CLICK_STREAM instead of the table CLICK_STREAM from now on. In the future, new data is loaded into the table NEW_CLICK_STREAM.

Question 32

Your company’s on-premises Apache Hadoop servers are approaching end-of-life, and IT has decided to migrate the cluster to Google Cloud Dataproc. A like-for-like migration of the cluster would require 50 TB of Google Persistent Disk per node. The CIO is concerned about the cost of using that much block storage. You want to minimize the storage cost of the migration. What should you do?

Options:

A.

Put the data into Google Cloud Storage.

B.

Use preemptible virtual machines (VMs) for the Cloud Dataproc cluster.

C.

Tune the Cloud Dataproc cluster so that there is just enough disk for all data.

D.

Migrate some of the cold data into Google Cloud Storage, and keep only the hot data in Persistent Disk.

Question 33

Your company has hired a new data scientist who wants to perform complicated analyses across very large datasets stored in Google Cloud Storage and in a Cassandra cluster on Google Compute Engine. The scientist primarily wants to create labelled data sets for machine learning projects, along with some visualization tasks. She reports that her laptop is not powerful enough to perform her tasks and it is slowing her down. You want to help her perform her tasks. What should you do?

Options:

A.

Run a local version of Jupiter on the laptop.

B.

Grant the user access to Google Cloud Shell.

C.

Host a visualization tool on a VM on Google Compute Engine.

D.

Deploy Google Cloud Datalab to a virtual machine (VM) on Google Compute Engine.

Question 34

You create an important report for your large team in Google Data Studio 360. The report uses Google BigQuery as its data source. You notice that visualizations are not showing data that is less than 1 hour old. What should you do?

Options:

A.

Disable caching by editing the report settings.

B.

Disable caching in BigQuery by editing table details.

C.

Refresh your browser tab showing the visualizations.

D.

Clear your browser history for the past hour then reload the tab showing the virtualizations.

Question 35

An external customer provides you with a daily dump of data from their database. The data flows into Google Cloud Storage GCS as comma-separated values (CSV) files. You want to analyze this data in Google BigQuery, but the data could have rows that are formatted incorrectly or corrupted. How should you build this pipeline?

Options:

A.

Use federated data sources, and check data in the SQL query.

B.

Enable BigQuery monitoring in Google Stackdriver and create an alert.

C.

Import the data into BigQuery using the gcloud CLI and set max_bad_records to 0.

D.

Run a Google Cloud Dataflow batch pipeline to import the data into BigQuery, and push errors to another dead-letter table for analysis.

Question 36

Your company is running their first dynamic campaign, serving different offers by analyzing real-time data during the holiday season. The data scientists are collecting terabytes of data that rapidly grows every hour during their 30-day campaign. They are using Google Cloud Dataflow to preprocess the data and collect the feature (signals) data that is needed for the machine learning model in Google Cloud Bigtable. The team is observing suboptimal performance with reads and writes of their initial load of 10 TB of data. They want to improve this performance while minimizing cost. What should they do?

Options:

A.

Redefine the schema by evenly distributing reads and writes across the row space of the table.

B.

The performance issue should be resolved over time as the site of the BigDate cluster is increased.

C.

Redesign the schema to use a single row key to identify values that need to be updated frequently in the cluster.

D.

Redesign the schema to use row keys based on numeric IDs that increase sequentially per user viewing the offers.

Question 37

Your startup has never implemented a formal security policy. Currently, everyone in the company has access to the datasets stored in Google BigQuery. Teams have freedom to use the service as they see fit, and they have not documented their use cases. You have been asked to secure the data warehouse. You need to discover what everyone is doing. What should you do first?

Options:

A.

Use Google Stackdriver Audit Logs to review data access.

B.

Get the identity and access management IIAM) policy of each table

C.

Use Stackdriver Monitoring to see the usage of BigQuery query slots.

D.

Use the Google Cloud Billing API to see what account the warehouse is being billed to.

Question 38

You are working on a sensitive project involving private user data. You have set up a project on Google Cloud Platform to house your work internally. An external consultant is going to assist with coding a complex transformation in a Google Cloud Dataflow pipeline for your project. How should you maintain users’ privacy?

Options:

A.

Grant the consultant the Viewer role on the project.

B.

Grant the consultant the Cloud Dataflow Developer role on the project.

C.

Create a service account and allow the consultant to log on with it.

D.

Create an anonymized sample of the data for the consultant to work with in a different project.

Question 39

You are building new real-time data warehouse for your company and will use Google BigQuery streaming inserts. There is no guarantee that data will only be sent in once but you do have a unique ID for each row of data and an event timestamp. You want to ensure that duplicates are not included while interactively querying data. Which query type should you use?

Options:

A.

Include ORDER BY DESK on timestamp column and LIMIT to 1.

B.

Use GROUP BY on the unique ID column and timestamp column and SUM on the values.

C.

Use the LAG window function with PARTITION by unique ID along with WHERE LAG IS NOT NULL.

D.

Use the ROW_NUMBER window function with PARTITION by unique ID along with WHERE row equals 1.

Question 40

Business owners at your company have given you a database of bank transactions. Each row contains the user ID, transaction type, transaction location, and transaction amount. They ask you to investigate what type of machine learning can be applied to the data. Which three machine learning applications can you use? (Choose three.)

Options:

A.

Supervised learning to determine which transactions are most likely to be fraudulent.

B.

Unsupervised learning to determine which transactions are most likely to be fraudulent.

C.

Clustering to divide the transactions into N categories based on feature similarity.

D.

Supervised learning to predict the location of a transaction.

E.

Reinforcement learning to predict the location of a transaction.

F.

Unsupervised learning to predict the location of a transaction.

Question 41

Your organization stores employee information in a BigQuery dataset. Your human resources (HR) admin team requires full access to the data, but the HR analyst team needs to conduct salary analysis without being able to access personally identifiable information (PII). You want to ensure that users have the correct level of access for their role managed through Dataplex, while reducing data duplication. What should you do?

Options:

A.

Create an authorized view to limit data access based on a user role.

B.

Create and assign policy tags based on user role to the PII columns in BigQuery.

C.

Create a new dataset for salary analysis and use data masking to obfuscate all fields related to an individual.

D.

Create a new dataset and use Cloud Data Loss Prevention (Cloud DLP) to mask PII in the BigQuery table.

Question 42

You are building a teal-lime prediction engine that streams files, which may contain Pll (personal identifiable information) data, into Cloud Storage and eventually into BigQuery You want to ensure that the sensitive data is masked but still maintains referential Integrity, because names and emails are often used as join keys How should you use the Cloud Data Loss Prevention API (DLP API) to ensure that the Pll data is not accessible by unauthorized individuals?

Options:

A.

Create a pseudonym by replacing the Pll data with cryptogenic tokens, and store the non-tokenized data in a locked-down button.

B.

Redact all Pll data, and store a version of the unredacted data in a locked-down bucket

C.

Scan every table in BigQuery, and mask the data it finds that has Pll

D.

Create a pseudonym by replacing Pll data with a cryptographic format-preserving token

Question 43

You are designing a real-time system for a ride hailing app that identifies areas with high demand for rides to effectively reroute available drivers to meet the demand. The system ingests data from multiple sources to Pub/Sub. processes the data, and stores the results for visualization and analysis in real-time dashboards. The data sources include driver location updates every 5 seconds and app-based booking events from riders. The data processing involves real-time aggregation of supply and demand data for the last 30 seconds, every 2 seconds, and storing the results in a low-latency system for visualization. What should you do?

Options:

A.

Group the data by using a tumbling window in a Dataflow pipeline, and write the aggregated data to Memorystore

B.

Group the data by using a hopping window in a Dataflow pipeline, and write the aggregated data to Memorystore

C.

Group the data by using a session window in a Dataflow pipeline, and write the aggregated data to BigQuery.

D.

Group the data by using a hopping window in a Dataflow pipeline, and write the aggregated data to BigQuery.

Question 44

You are on the data governance team and are implementing security requirements to deploy resources. You need to ensure that resources are limited to only the europe-west 3 region You want to follow Google-recommended practices What should you do?

Options:

A.

Deploy resources with Terraform and implement a variable validation rule to ensure that the region is set to the europe-west3 region for all resources.

B.

Set the constraints/gcp. resourceLocations organization policy constraint to in:eu-locations.

C.

Create a Cloud Function to monitor all resources created and automatically destroy the ones created outside the europe-west3 region.

D.

Set the constraints/gcp. resourceLocations organization policy constraint to in: europe-west3-locations.

Question 45

You have a petabyte of analytics data and need to design a storage and processing platform for it. You must be able to perform data warehouse-style analytics on the data in Google Cloud and expose the dataset as files for batch analysis tools in other cloud providers. What should you do?

Options:

A.

Store and process the entire dataset in BigQuery.

B.

Store and process the entire dataset in Cloud Bigtable.

C.

Store the full dataset in BigQuery, and store a compressed copy of the data in a Cloud Storage bucket.

D.

Store the warm data as files in Cloud Storage, and store theactive data inBigQuery. Keep this ratio as 80% warm and 20% active.

Question 46

You need to look at BigQuery data from a specific table multiple times a day. The underlying table you are querying is several petabytes in size, but you want to filter your data and provide simple aggregations to downstream users. You want to run queries faster and get up-to-date insights quicker. What should you do?

Options:

A.

Run a scheduled query to pull the necessary data at specific intervals daily.

B.

Create a materialized view based off of the query being run.

C.

Use a cached query to accelerate time to results.

D.

Limit the query columns being pulled in the final result.

Question 47

You are deploying a batch pipeline in Dataflow. This pipeline reads data from Cloud Storage, transforms the data, and then writes the data into BigQuory. The security team has enabled anorganizational constraint in Google Cloud, requiring all Compute Engine instances to use only internal IP addresses and no external IP addresses. What should you do?

Options:

A.

Ensure that the firewall rules allow access to Cloud Storage and BigQuery. Use Dataflow with only internal IPs.

B.

Ensure that your workers have network tags to access Cloud Storage and BigQuery. Use Dataflow with only internal IP addresses.

C.

Create a VPC Service Controls perimeter that contains the VPC network and add Dataflow. Cloud Storage, and BigQuery as allowedservices in the perimeter. Use Dataflow with only internal IP addresses.

D.

Ensure that Private Google Access is enabled in the subnetwork. Use Dataflow with only internal IP addresses.

Question 48

You are working on a niche product in the image recognition domain. Your team has developed a model that is dominated by custom C++ TensorFlow ops your team has implemented. These ops are used inside your main training loop and are performing bulky matrix multiplications. It currently takes up to several days to train a model. You want to decrease this time significantly and keep the cost low by using an accelerator on Google Cloud. What should you do?

Options:

A.

Use Cloud TPUs without any additional adjustment to your code.

B.

Use Cloud TPUs after implementing GPU kernel support for your customs ops.

C.

Use Cloud GPUs after implementing GPU kernel support for your customs ops.

D.

Stay on CPUs, and increase the size of the cluster you’re training your model on.

Question 49

You are configuring networking for a Dataflow job. The data pipeline uses custom container images with the libraries that are required for the transformation logic preinstalled. The data pipeline reads the data from Cloud Storage and writes the data to BigQuery. You need to ensure cost-effective and secure communication between the pipeline and Google APIs and services. What should you do?

Options:

A.

Leave external IP addresses assigned to worker VMs while enforcing firewall rules.

B.

Disable external IP addresses and establish a Private Service Connect endpoint IP address.

C.

Disable external IP addresses from worker VMs and enable Private Google Access.

D.

Enable Cloud NAT to provide outbound internet connectivity while enforcing firewall rules.

Question 50

Your organization has two Google Cloud projects, project A and project B. In project A, you have a Pub/Sub topic that receives data from confidential sources. Only the resources in project A should be able to access the data in that topic. You want to ensure that project B and any future project cannot access data in the project A topic. What should you do?

Options:

A.

Configure VPC Service Controls in the organization with a perimeter around the VPC of project A.

B.

Add firewall rules in project A so only traffic from the VPC in project A is permitted.

C.

Configure VPC Service Controls in the organization with a perimeter around project A.

D.

Use Identity and Access Management conditions to ensure that only users and service accounts in project A can access resources in project.

Question 51

You are developing an application on Google Cloud that will automatically generate subject labels for users’ blog posts. You are under competitive pressure to add this feature quickly, and you have no additional developer resources. No one on your team has experience with machine learning. What should you do?

Options:

A.

Call the Cloud Natural Language API from your application. Process the generated Entity Analysis aslabels.

B.

Call the Cloud Natural Language API from your application. Process the generated Sentiment Analysis as labels.

C.

Build and train a text classification model using TensorFlow. Deploy the model using Cloud MachineLearning Engine. Call the model from your application and process the results as labels.

D.

Build and train a text classification model using TensorFlow. Deploy the model using a Kubernetes Engine cluster. Call the model from your application and process the results as labels.

Question 52

Different teams in your organization store customer and performance data in BigOuery. Each team needs to keep full control of their collected data, be able to query data within their projects, and be able to exchange their data with other teams. You need to implement an organization-wide solution, while minimizing operational tasks and costs. What should you do?

Options:

A.

Create a BigQuery scheduled query to replicate all customer data into team projects.

B.

Enable each team to create materialized views of the data they need to access in their projects.

C.

Ask each team to publish their data in Analytics Hub. Direct the other teams to subscribe to them.

D.

Ask each team to create authorized views of their data. Grant the biquery. jobUser role to each team.

Question 53

You designed a data warehouse in BigQuery to analyze sales data. You want a self-serving, low-maintenance, and cost-effective solution to share the sales dataset to other business units in your organization. What should you do?

Options:

A.

Enable the other business units' projects to access the authorized views of the sales dataset.

B.

Use the BigQuery Data Transfer Service to create a schedule that copies the sales dataset to the other business units’ projects.

C.

Create an Analytics Hub private exchange, and publish the sales dataset.

D.

Create and share views with the users in the other business units.

Question 54

You use BigQuery as your centralized analytics platform. New data is loaded every day, and an ETL pipeline modifies the original data and prepares it for the final users. This ETL pipeline is regularly modified and can generate errors, but sometimes the errors are detected only after 2 weeks. You need to provide a method to recover from these errors, and your backups should be optimized for storage costs. How should you organize your data in BigQuery and store your backups?

Options:

A.

Organize your data in a single table, export, and compress and store the BigQuery data in Cloud Storage.

B.

Organize your data in separate tables for each month, and export, compress, and store the data in Cloud Storage.

C.

Organize your data in separate tables for each month, and duplicate your data on a separate dataset in BigQuery.

D.

Organize your data in separate tables for each month, and use snapshot decorators to restore the table to a time prior to the corruption.

Question 55

You currently have transactional data stored on-premises in a PostgreSQL database. To modernize your data environment, you want to run transactional workloads and support analytics needs with a single database. You need to move to Google Cloud without changing database management systems, and minimize cost and complexity. What should you do?

Options:

A.

Migrate your workloads to AlloyDB for PostgreSQL.

B.

Migrate to BigQuery to optimize analytics.

C.

Migrate and modernize your database with Cloud Spanner.

D.

Migrate your PostgreSQL database to Cloud SQL for PostgreSQL.

Question 56

Your United States-based company has created an application for assessing and responding to user actions. The primary table’s data volume grows by 250,000 records per second. Many third parties use your application’s APIs to build the functionality into their own frontend applications. Your application’s APIs should comply with the following requirements:

Single global endpoint

ANSI SQL support

Consistent access to the most up-to-date data

What should you do?

Options:

A.

Implement BigQuery with no region selected for storage or processing.

B.

Implement Cloud Spanner with the leader in North America and read-only replicas in Asia and Europe.

C.

Implement Cloud SQL for PostgreSQL with the master in Norht America and read replicas in Asia and Europe.

D.

Implement Cloud Bigtable with the primary cluster in North America and secondary clusters in Asia and Europe.

Question 57

You are designing a messaging system by using Pub/Sub to process clickstream data with an event-driven consumer app that relies on a push subscription. You need to configure the messaging system that is reliable enough to handle temporary downtime of the consumer app. You also need the messaging system to store the input messages that cannot be consumed by the subscriber. The system needs to retry failed messages gradually, avoiding overloading the consumer app, and store the failed messages after a maximum of 10 retries in a topic. How should you configure the Pub/Sub subscription?

Options:

A.

Increase the acknowledgement deadline to 10 minutes.

B.

Use immediate redelivery as the subscription retry policy, and configure dead lettering to a different topic with maximum delivery attempts set to 10.

C.

Use exponential backoff as the subscription retry policy, and configure dead lettering to the same source topic with maximum delivery attempts set to 10.

D.

Use exponential backoff as the subscription retry policy, and configure dead lettering to a different topic with maximum delivery attempts set to 10.

Question 58

You are architecting a data transformation solution for BigQuery. Your developers are proficient with SOL and want to use the ELT development technique. In addition, your developers need an intuitive coding environment and the ability to manage SQL as code. You need to identify a solution for your developers to build these pipelines. What should you do?

Options:

A.

Use Cloud Composer to load data and run SQL pipelines by using the BigQuery job operators.

B.

Use Dataflow jobs to read data from Pub/Sub, transform the data, and load the data to BigQuery.

C.

Use Dataform to build, manage, and schedule SQL pipelines.

D.

Use Data Fusion to build and execute ETL pipelines

Question 59

You want to rebuild your batch pipeline for structured data on Google Cloud You are using PySpark to conduct data transformations at scale, but your pipelines are taking over twelve hours to run. To expedite development and pipeline run time, you want to use a serverless tool and SQL syntax You have already moved your raw data into Cloud Storage How should you build the pipeline on Google Cloud while meeting speed and processing requirements?

Options:

A.

Convert your PySpark commands into SparkSQL queries to transform the data; and then run your pipeline on Dataproc to write the data into BigQuery

B.

Ingest your data into Cloud SQL, convert your PySpark commands into SparkSQL queries to transform the data, and then use federated queries from BigQuery for machine learning.

C.

Ingest your data into BigQuery from Cloud Storage, convert your PySpark commands into BigQuery SQLqueries to transform the data, and then write the transformations to a new table

D.

Use Apache Beam Python SDK to build the transformation pipelines, and write the data into BigQuery

Question 60

You are a BigQuery admin supporting a team of data consumers who run ad hoc queries and downstream reporting in tools such as Looker. All data and users are combined under a single organizational project. You recently noticed some slowness in query results and want to troubleshoot where the slowdowns are occurring. You think that there might be some job queuing or slot contention occurring as users run jobs, which slows down access to results. You need to investigate the query job information and determine where performance is being affected. What should you do?

Options:

A.

Use Cloud Monitoring to view BigQuery metrics and set up alerts that let you know when a certain percentage of slots were used.

B.

Use slot reservations for your project to ensure that you have enough query processing capacity and are able to allocate available slots to the slower queries.

C.

Use Cloud Logging to determine if any users or downstream consumers are changing or deleting access grants on tagged resources.

D.

Use available administrative resource charts to determine how slots are being used and how jobs are performing over time. Run a query on the INFORMATION_SCHEMA to review query performance.

Question 61

Which of these are examples of a value in a sparse vector? (Select 2 answers.)

Options:

A.

[0, 5, 0, 0, 0, 0]

B.

[0, 0, 0, 1, 0, 0, 1]

C.

[0, 1]

D.

[1, 0, 0, 0, 0, 0, 0]

Question 62

Which action can a Cloud Dataproc Viewer perform?

Options:

A.

Submit a job.

B.

Create a cluster.

C.

Delete a cluster.

D.

List the jobs.

Question 63

Which of the following statements about Legacy SQL and Standard SQL is not true?

Options:

A.

Standard SQL is the preferred query language for BigQuery.

B.

If you write a query in Legacy SQL, it might generate an error if you try to run it with Standard SQL.

C.

One difference between the two query languages is how you specify fully-qualified table names (i.e. table names that include their associated project name).

D.

You need to set a query language for each dataset and the default is Standard SQL.

Question 64

Which TensorFlow function can you use to configure a categorical column if you don't know all of the possible values for that column?

Options:

A.

categorical_column_with_vocabulary_list

B.

categorical_column_with_hash_bucket

C.

categorical_column_with_unknown_values

D.

sparse_column_with_keys

Question 65

Which of these is not a supported method of putting data into a partitioned table?

Options:

A.

If you have existing data in a separate file for each day, then create a partitioned table and upload each file into the appropriate partition.

B.

Run a query to get the records for a specific day from an existing table and for the destination table, specify a partitioned table ending with the day in the format "$YYYYMMDD".

C.

Create a partitioned table and stream new records to it every day.

D.

Use ORDER BY to put a table's rows into chronological order and then change the table's type to "Partitioned".

Question 66

Why do you need to split a machine learning dataset into training data and test data?

Options:

A.

So you can try two different sets of features

B.

To make sure your model is generalized for more than just the training data

C.

To allow you to create unit tests in your code

D.

So you can use one dataset for a wide model and one for a deep model

Question 67

When creating a new Cloud Dataproc cluster with the projects.regions.clusters.create operation, these four values are required: project, region, name, and ____.

Options:

A.

zone

B.

node

C.

label

D.

type

Question 68

Which methods can be used to reduce the number of rows processed by BigQuery?

Options:

A.

Splitting tables into multiple tables; putting data in partitions

B.

Splitting tables into multiple tables; putting data in partitions; using the LIMIT clause

C.

Putting data in partitions; using the LIMIT clause

D.

Splitting tables into multiple tables; using the LIMIT clause

Question 69

You want to use a BigQuery table as a data sink. In which writing mode(s) can you use BigQuery as a sink?

Options:

A.

Both batch and streaming

B.

BigQuery cannot be used as a sink

C.

Only batch

D.

Only streaming

Question 70

Which row keys are likely to cause a disproportionate number of reads and/or writes on a particular node in a Bigtable cluster (select 2 answers)?

Options:

A.

A sequential numeric ID

B.

A timestamp followed by a stock symbol

C.

A non-sequential numeric ID

D.

A stock symbol followed by a timestamp

Question 71

You have a job that you want to cancel. It is a streaming pipeline, and you want to ensure that any data that is in-flight is processed and written to the output. Which of the following commands can you use on the Dataflow monitoring console to stop the pipeline job?

Options:

A.

Cancel

B.

Drain

C.

Stop

D.

Finish

Question 72

When running a pipeline that has a BigQuery source, on your local machine, you continue to get permission denied errors. What could be the reason for that?

Options:

A.

Your gcloud does not have access to the BigQuery resources

B.

BigQuery cannot be accessed from local machines

C.

You are missing gcloud on your machine

D.

Pipelines cannot be run locally

Question 73

What are two of the characteristics of using online prediction rather than batch prediction?

Options:

A.

It is optimized to handle a high volume of data instances in a job and to run more complex models.

B.

Predictions are returned in the response message.

C.

Predictions are written to output files in a Cloud Storage location that you specify.

D.

It is optimized to minimize the latency of serving predictions.

Question 74

What is the HBase Shell for Cloud Bigtable?

Options:

A.

The HBase shell is a GUI based interface that performs administrative tasks, such as creating and deleting tables.

B.

The HBase shell is a command-line tool that performs administrative tasks, such as creating and deleting tables.

C.

The HBase shell is a hypervisor based shell that performs administrative tasks, such as creating and deleting new virtualized instances.

D.

The HBase shell is a command-line tool that performs only user account management functions to grant access to Cloud Bigtable instances.

Question 75

When a Cloud Bigtable node fails, ____ is lost.

Options:

A.

all data

B.

no data

C.

the last transaction

D.

the time dimension

Question 76

Google Cloud Bigtable indexes a single value in each row. This value is called the _______.

Options:

A.

primary key

B.

unique key

C.

row key

D.

master key

Question 77

Does Dataflow process batch data pipelines or streaming data pipelines?

Options:

A.

Only Batch Data Pipelines

B.

Both Batch and Streaming Data Pipelines

C.

Only Streaming Data Pipelines

D.

None of the above

Question 78

Which is not a valid reason for poor Cloud Bigtable performance?

Options:

A.

The workload isn't appropriate for Cloud Bigtable.

B.

The table's schema is not designed correctly.

C.

The Cloud Bigtable cluster has too many nodes.

D.

There are issues with the network connection.

Question 79

Which of the following are examples of hyperparameters? (Select 2 answers.)

Options:

A.

Number of hidden layers

B.

Number of nodes in each hidden layer

C.

Biases

D.

Weights

Question 80

Cloud Dataproc charges you only for what you really use with _____ billing.

Options:

A.

month-by-month

B.

minute-by-minute

C.

week-by-week

D.

hour-by-hour