Databricks Certified Associate Developer for Apache Spark 3.5 – Python Questions and Answers

Question 29

Which feature of Spark Connect should be considered when designing an application that requires remote interaction with a Spark cluster?

Options:

A.

It provides a way to run Spark applications remotely in any programming language

B.

It can be used to interact with any remote cluster using the REST API

C.

It allows for remote execution of Spark jobs

D.

It is primarily used for data ingestion into Spark from external sources
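
For background: Spark Connect (introduced in Spark 3.4) decouples the client from the cluster through a gRPC-based protocol, letting thin clients build DataFrame operations locally and execute them remotely. A minimal sketch of a PySpark client attaching to a remote Spark Connect endpoint (the hostname is a placeholder; 15002 is the default Spark Connect port):

from pyspark.sql import SparkSession

# "sc://" is the Spark Connect URI scheme; the host shown is hypothetical.
spark = SparkSession.builder.remote("sc://spark-cluster.example.com:15002").getOrCreate()

# Operations are built client-side and executed on the remote cluster.
spark.range(10).selectExpr("id * 2 AS doubled").show()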

Question 30

A data scientist is working on a large dataset in Apache Spark using PySpark. The data scientist has a DataFrame df with columns user_id, product_id, and purchase_amount and needs to perform some operations on this data efficiently.

Which sequence of operations results in transformations that require a shuffle followed by transformations that do not?

Options:

A.

df.filter(df.purchase_amount > 100).groupBy("user_id").sum("purchase_amount")

B.

df.withColumn("discount", df.purchase_amount * 0.1).select("discount")

C.

df.withColumn("purchase_date", current_date()).where("total_purchase > 50")

D.

df.groupBy("user_id").agg(sum("purchase_amount").alias("total_purchase")).repartition(10)
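
To verify which options shuffle, inspect the physical plan: wide operations such as groupBy and repartition appear as Exchange nodes, while narrow ones (filter, select, withColumn) do not. A small sketch using a toy stand-in for the question's DataFrame:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Toy stand-in with the question's schema.
df = spark.createDataFrame(
    [(1, 101, 120.0), (1, 102, 40.0), (2, 101, 60.0)],
    ["user_id", "product_id", "purchase_amount"],
)

# Narrow transformations only: no Exchange node in the plan.
df.withColumn("discount", df.purchase_amount * 0.1).select("discount").explain()

# groupBy/agg is wide: the plan contains an Exchange (a shuffle).
df.groupBy("user_id").agg(F.sum("purchase_amount").alias("total_purchase")).explain()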

Question 31

A developer is working on a Spark application that processes a large dataset using SQL queries. Despite having a large cluster, the developer notices that the job underutilizes the available resources: executors remain idle most of the time, and the logs reveal that the number of tasks per stage is very low. The developer suspects this is causing suboptimal cluster performance.

Which action should the developer take to improve cluster utilization?

Options:

A.

Increase the value of spark.sql.shuffle.partitions

B.

Reduce the value of spark.sql.shuffle.partitions

C.

Enable dynamic resource allocation to scale resources as needed

D.

Increase the size of the dataset to create more partitions
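
For reference: in Spark SQL, the task count of shuffle stages is controlled by spark.sql.shuffle.partitions (default 200), so too low a value leaves executor cores idle. A sketch of raising it (the value 1000 is illustrative; size it to the cluster's total cores):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Each shuffle stage will now produce 1000 tasks instead of the default 200.
spark.conf.set("spark.sql.shuffle.partitions", "1000")

# With AQE on (default since Spark 3.2), oversized partition counts are
# coalesced automatically after each shuffle, so erring high is safe.
spark.conf.set("spark.sql.adaptive.enabled", "true")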

Question 32

A data engineer has written the following code to join two DataFrames df1 and df2:

df1 = spark.read.csv("sales_data.csv", header=True)
df2 = spark.read.csv("product_data.csv", header=True)
df_joined = df1.join(df2, df1.product_id == df2.product_id)

The DataFrame df1 contains ~10 GB of sales data, and df2 contains ~8 MB of product data.

Which join strategy will Spark use?

Options:

A.

Shuffle join, as the size difference between df1 and df2 is too large for a broadcast join to work efficiently.

B.

Shuffle join, because AQE is not enabled, and Spark uses a static query plan.

C.

Shuffle join because no broadcast hints were provided.

D.

Broadcast join, as df2 is smaller than the default broadcast threshold.
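
For reference: Spark plans a broadcast hash join automatically when one side's estimated size is below spark.sql.autoBroadcastJoinThreshold (10 MB by default), so the ~8 MB df2 qualifies without any hint. A sketch that checks the threshold and makes the broadcast explicit (file paths taken from the question):

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

# Default threshold is 10485760 bytes (10 MB); df2 at ~8 MB falls under it.
print(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))

df1 = spark.read.csv("sales_data.csv", header=True)
df2 = spark.read.csv("product_data.csv", header=True)

# Spark would broadcast df2 on its own; the hint just makes the choice explicit.
df_joined = df1.join(broadcast(df2), df1.product_id == df2.product_id)
df_joined.explain()  # plan shows BroadcastHashJoin rather than SortMergeJoin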