

Databricks Certified Associate Developer for Apache Spark 3.5 – Python Questions and Answers

Question 33

A developer is working with a pandas DataFrame containing user behavior data from a web application.

Which approach should be used for executing a groupBy operation in parallel across all workers in Apache Spark 3.5?

Options:

A.

Use the applyInPandas API:

df.groupby("user_id").applyInPandas(mean_func, schema="user_id long, value double").show()

B.

Use the mapInPandas API:

df.mapInPandas(mean_func, schema="user_id long, value double").show()

C.

Use a regular Spark UDF:

from pyspark.sql.functions import mean

df.groupBy("user_id").agg(mean("value")).show()

D.

Use a Pandas UDF:

@pandas_udf("double")

def mean_func(value: pd.Series) -> float:

return value.mean()

df.groupby("user_id").agg(mean_func(df["value"])).show()

Question 34

A data scientist is working with a massive dataset that exceeds the memory capacity of a single machine. The data scientist is considering using Apache Spark™ instead of traditional single-machine tools such as standard Python scripts.

Which two advantages does Apache Spark™ offer over a typical single-machine approach in this scenario? (Choose 2 answers)

Options:

A.

It can distribute data processing tasks across a cluster of machines, enabling horizontal scalability.

B.

It requires specialized hardware to run, making it unsuitable for commodity hardware clusters.

C.

It processes data solely on disk storage, reducing the need for memory resources.

D.

It eliminates the need to write any code, automatically handling all data processing.

E.

It has built-in fault tolerance, allowing it to recover seamlessly from node failures during computation.

Question 35

A data engineer is investigating a Spark cluster that is experiencing underutilization during scheduled batch jobs.

After checking the Spark logs, they notice that tasks are frequently killed due to timeout errors and that the logs contain several warnings about insufficient resources.

Which action should the engineer take to resolve the underutilization issue?

Options:

A.

Set the spark.network.timeout property to allow tasks more time to complete without being killed.

B.

Increase the executor memory allocation in the Spark configuration.

C.

Reduce the size of the data partitions to improve task scheduling.

D.

Increase the number of executor instances to handle more concurrent tasks.
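For context, the properties named in these options are standard Spark configuration settings that are normally supplied when the application is launched. A minimal sketch follows; the application name and all values are illustrative assumptions, not tuning recommendations:

from pyspark.sql import SparkSession

# Placeholder values for illustration only.
spark = (
    SparkSession.builder
    .appName("scheduled-batch-job")
    .config("spark.executor.instances", "8")    # number of executors
    .config("spark.executor.memory", "8g")      # memory per executor
    .config("spark.network.timeout", "300s")    # network/heartbeat timeout
    .getOrCreate()
)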

Question 36

A developer is running Spark SQL queries and notices underutilization of resources. Executors are idle, and the number of tasks per stage is low.

What should the developer do to improve cluster utilization?

Options:

A.

Increase the value of spark.sql.shuffle.partitions

B.

Reduce the value of spark.sql.shuffle.partitions

C.

Increase the size of the dataset to create more partitions

D.

Enable dynamic resource allocation to scale resources as needed
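For context, a minimal sketch of how the settings named in these options are typically adjusted; the partition count shown is an illustrative assumption, not a recommendation:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# spark.sql.shuffle.partitions controls how many tasks each shuffle stage
# produces and can be changed at runtime for subsequent queries.
spark.conf.set("spark.sql.shuffle.partitions", "200")

# Dynamic allocation is an application-level setting supplied at launch, e.g.:
#   spark-submit --conf spark.dynamicAllocation.enabled=true ...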