Databricks Certified Associate Developer for Apache Spark 3.5 – Python Questions and Answers

Question 37

A data analyst wants to add a column named date that is derived from a timestamp column.

Options:

A.

dates_df.withColumn("date", f.unix_timestamp("timestamp")).show()

B.

dates_df.withColumn("date", f.to_date("timestamp")).show()

C.

dates_df.withColumn("date", f.date_format("timestamp", "yyyy-MM-dd")).show()

D.

dates_df.withColumn("date", f.from_unixtime("timestamp")).show()
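The four options in Question 37 differ mainly in the type of value they produce. The sketch below is only an analogy in plain Python using the standard datetime module; it does not call any Spark API, and the variable names and the sample timestamp are hypothetical. In PySpark, unix_timestamp returns epoch seconds as a number, to_date returns a date value, date_format returns a formatted string, and from_unixtime expects epoch seconds as input and returns a string.

```python
from datetime import date, datetime, timezone

# Hypothetical sample value standing in for one row of the timestamp column.
ts = datetime(2024, 3, 15, 13, 45, 30, tzinfo=timezone.utc)

# Analogous to unix_timestamp: an integer count of seconds since the epoch.
as_epoch_seconds = int(ts.timestamp())

# Analogous to to_date: a true date value with the time of day removed.
as_date = ts.date()

# Analogous to date_format with "yyyy-MM-dd": a string that only looks like a date.
as_text = ts.strftime("%Y-%m-%d")

# Analogous to from_unixtime: epoch seconds in, formatted timestamp string out.
from_epoch_text = datetime.fromtimestamp(
    as_epoch_seconds, tz=timezone.utc
).strftime("%Y-%m-%d %H:%M:%S")

print(type(as_epoch_seconds).__name__, type(as_date).__name__, type(as_text).__name__)
```

The date value and the formatted string print the same way, but only the date value supports date arithmetic and date comparisons directly.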

Question 38

What is the relationship between jobs, stages, and tasks during execution in Apache Spark?

Options:

A.

A job contains multiple stages, and each stage contains multiple tasks.

B.

A job contains multiple tasks, and each task contains multiple stages.

C.

A stage contains multiple jobs, and each job contains multiple tasks.

D.

A stage contains multiple tasks, and each task contains multiple jobs.
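The nesting asked about in Question 38 can be pictured with a small data model. This is a plain-Python sketch, not Spark's internal scheduler: the Job, Stage, and Task classes and the build_job helper are hypothetical names. It assumes the usual Spark behaviour that an action triggers a job, the job is split into stages at shuffle boundaries, and each stage runs one task per partition.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Task:
    partition_id: int  # one task processes one partition


@dataclass
class Stage:
    stage_id: int
    tasks: List[Task] = field(default_factory=list)


@dataclass
class Job:
    job_id: int
    stages: List[Stage] = field(default_factory=list)


def build_job(job_id: int, partitions_per_stage: List[int]) -> Job:
    """Build a job with one stage per entry and one task per partition."""
    stages = [
        Stage(stage_id, [Task(p) for p in range(num_partitions)])
        for stage_id, num_partitions in enumerate(partitions_per_stage)
    ]
    return Job(job_id, stages)


# Hypothetical example: a stage over 4 partitions, a shuffle, then a stage over 2.
job = build_job(0, [4, 2])
total_tasks = sum(len(stage.tasks) for stage in job.stages)
print(len(job.stages), total_tasks)
```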

Question 39

You have:

DataFrame A: 128 GB of transactions

DataFrame B: 1 GB user lookup table

Which strategy is correct for broadcasting?

Options:

A.

DataFrame B should be broadcasted because it is smaller and will eliminate the need for shuffling itself

B.

DataFrame B should be broadcasted because it is smaller and will eliminate the need for shuffling DataFrame A

C.

DataFrame A should be broadcasted because it is larger and will eliminate the need for shuffling DataFrame B

D.

DataFrame A should be broadcasted because it is smaller and will eliminate the need for shuffling itself
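The idea behind a broadcast join in Question 39 can be sketched in plain Python. This is a simplified model rather than Spark code: broadcast_join and the sample rows are hypothetical. The small lookup table is turned into an in-memory map that every partition of the large table can read, so each partition of the large table is joined where it already sits. Note that in real Spark a 1 GB table is well above the default spark.sql.autoBroadcastJoinThreshold of 10 MB, so a broadcast would normally have to be requested explicitly, for example with the broadcast() hint from pyspark.sql.functions.

```python
from typing import Dict, List, Tuple

Transaction = Tuple[int, float]      # (user_id, amount)
Joined = Tuple[int, float, str]      # (user_id, amount, user_name)


def broadcast_join(
    large_partitions: List[List[Transaction]],
    small_rows: List[Tuple[int, str]],
) -> List[List[Joined]]:
    """Inner-join every partition of the large side against a copied lookup map."""
    lookup: Dict[int, str] = dict(small_rows)  # built once from the small side
    result = []
    for partition in large_partitions:
        # Rows stay in their original partition; only the lookup map is shared.
        result.append(
            [(uid, amount, lookup[uid]) for uid, amount in partition if uid in lookup]
        )
    return result


# Hypothetical data: two partitions of transactions and a tiny user table.
transactions = [[(1, 9.5), (2, 20.0)], [(1, 3.25), (3, 7.0)]]
users = [(1, "ana"), (2, "bo")]

joined = broadcast_join(transactions, users)
print(joined)
```

The output keeps the same number of partitions as the large input, which is the point of the technique: the large side is never repartitioned by the join key.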

Question 40

A data engineer observes that the upstream streaming source feeds the event table frequently and sends duplicate records. Upon analyzing the current production table, the data engineer found that the time difference in the event_timestamp column of the duplicate records is, at most, 30 minutes.

To remove the duplicates, the engineer adds the code:

df = df.withWatermark("event_timestamp", "30 minutes")

What is the result?

Options:

A.

It removes all duplicates regardless of when they arrive.

B.

It accepts watermarks in seconds and the code results in an error.

C.

It removes duplicates that arrive within the 30-minute window specified by the watermark.

D.

It is not able to handle deduplication in this scenario.
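For Question 40, it may help to see what a watermark bounds when it is combined with deduplication. In Structured Streaming, withWatermark on its own only declares how late data may arrive; it is usually paired with dropDuplicates or, in Spark 3.5, dropDuplicatesWithinWatermark. The sketch below is a simplified plain-Python model and not Spark's implementation: dedupe_with_watermark, the event ids, and the timestamps are hypothetical, and the watermark here advances per record, whereas Spark advances it between micro-batches.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

Event = Tuple[str, datetime]  # (event_id, event_timestamp)


def dedupe_with_watermark(
    events: List[Event], delay: timedelta = timedelta(minutes=30)
) -> List[Event]:
    """Drop repeated event ids while their state is still inside the watermark."""
    seen: Dict[str, datetime] = {}
    max_event_time = None
    kept = []
    for event_id, event_time in events:
        if max_event_time is None or event_time > max_event_time:
            max_event_time = event_time
        watermark = max_event_time - delay
        # State older than the watermark is discarded to keep memory bounded.
        seen = {k: t for k, t in seen.items() if t >= watermark}
        if event_time < watermark:
            continue  # record arrived later than the watermark allows
        if event_id in seen:
            continue  # duplicate of an id still held in state
        seen[event_id] = event_time
        kept.append((event_id, event_time))
    return kept


def at(hour: int, minute: int) -> datetime:
    return datetime(2024, 1, 1, hour, minute)


events = [("a", at(10, 0)), ("a", at(10, 20)), ("b", at(10, 5)),
          ("c", at(11, 30)), ("a", at(11, 10))]
kept = dedupe_with_watermark(events)
print([event_id for event_id, _ in kept])
```

In this model the second "a" is dropped because it falls within 30 minutes of the first, while the last "a" is kept because the earlier state had already been discarded once the watermark moved past it.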