
Databricks Certified Associate Developer for Apache Spark 3.5 – Python Questions and Answers

Question 1

A data scientist at an e-commerce company is working with user data obtained from its subscriber database and has stored the data in a DataFrame df_user.

Before further processing, the data scientist wants to create another DataFrame df_user_non_pii and store only the non-PII columns.

The PII columns in df_user are name, email, and birthdate.

Which code snippet can be used to meet this requirement?

Options:

A.

df_user_non_pii = df_user.drop("name", "email", "birthdate")

B.

df_user_non_pii = df_user.dropFields("name", "email", "birthdate")

C.

df_user_non_pii = df_user.select("name", "email", "birthdate")

D.

df_user_non_pii = df_user.remove("name", "email", "birthdate")
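
For reference, a minimal sketch (the session and sample data are hypothetical) showing how DataFrame.drop() returns a new DataFrame without the named columns while leaving all other columns intact:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df_user = spark.createDataFrame(
    [(1, "Ana", "ana@example.com", "1990-01-01", "NL")],
    ["user_id", "name", "email", "birthdate", "country"],  # assumed column layout
)
df_user_non_pii = df_user.drop("name", "email", "birthdate")  # only user_id and country remain
df_user_non_pii.show()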

Question 2

A Spark developer is developing a Spark application to monitor task performance across a cluster.

One requirement is to track the maximum processing time for tasks on each worker node and consolidate this information on the driver for further analysis.

Which technique should the developer use?

Options:

A.

Broadcast a variable to share the maximum time among workers.

B.

Configure the Spark UI to automatically collect maximum times.

C.

Use an RDD action like reduce() to compute the maximum time.

D.

Use an accumulator to record the maximum time on the driver.
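
As background, a hedged sketch (an existing SparkContext sc and an RDD of task durations, durations_rdd, are assumed) of a custom accumulator that aggregates a per-worker maximum back to the driver:

from pyspark.accumulators import AccumulatorParam

class MaxAccumulatorParam(AccumulatorParam):
    # merge partial results by keeping the larger value
    def zero(self, initial_value):
        return initial_value
    def addInPlace(self, v1, v2):
        return max(v1, v2)

max_task_time = sc.accumulator(0.0, MaxAccumulatorParam())
durations_rdd.foreach(lambda t: max_task_time.add(t))  # updates happen on the workers
print(max_task_time.value)                             # the value is only readable on the driver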

Question 3

A data engineer has noticed that upgrading the Spark version in their applications from Spark 3.0 to Spark 3.5 has improved the runtime of some scheduled Spark applications.

Looking further, the data engineer realizes that Adaptive Query Execution (AQE) is now enabled.

Which operation does AQE perform to automatically improve the Spark application's performance?

Options:

A.

Dynamically switching join strategies

B.

Collecting persistent table statistics and storing them in the metastore for future use

C.

Improving the performance of single-stage Spark jobs

D.

Optimizing the layout of Delta files on disk

Question 4

What is the behavior of the function date_sub(start, days) if a negative value is passed into the days parameter?

Options:

A.

The number of days specified will be added to the start date.

B.

An error message of an invalid parameter will be returned.

C.

The same start date will be returned.

D.

The number of days specified will be removed from the start date.
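
A quick hedged check (dates invented, an active SparkSession spark assumed): passing a negative value to date_sub moves the date forward, i.e. the days are effectively added.

from pyspark.sql import functions as F

df = spark.createDataFrame([("2024-03-10",)], ["start"])
df.select(F.date_sub(F.to_date("start"), -5).alias("result")).show()
# result: 2024-03-15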

Question 5

An MLOps engineer is building a Pandas UDF that applies a language model that translates English strings into Spanish. The initial code is loading the model on every call to the UDF, which is hurting the performance of the data pipeline.

The initial code is:

def in_spanish_inner(df: pd.Series) -> pd.Series:
    model = get_translation_model(target_lang='es')
    return df.apply(model)

in_spanish = sf.pandas_udf(in_spanish_inner, StringType())

How can the MLOps engineer change this code to reduce how many times the language model is loaded?

Options:

A.

Convert the Pandas UDF to a PySpark UDF

B.

Convert the Pandas UDF from a Series → Series UDF to a Series → Scalar UDF

C.

Run the in_spanish_inner() function in a mapInPandas() function call

D.

Convert the Pandas UDF from a Series → Series UDF to an Iterator[Series] → Iterator[Series] UDF
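
For illustration, a hedged sketch of the Iterator[pd.Series] -> Iterator[pd.Series] variant (the DataFrame and column names are assumed; get_translation_model comes from the question): the model is loaded once per stream of batches handled by a task, rather than on every batch.

from typing import Iterator
import pandas as pd
import pyspark.sql.functions as sf
from pyspark.sql.types import StringType

@sf.pandas_udf(StringType())
def in_spanish(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    model = get_translation_model(target_lang='es')  # loaded once, reused for every batch below
    for batch in batches:
        yield batch.apply(model)

df.withColumn("spanish_text", in_spanish("english_text"))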

Question 6

A data engineer is working on the DataFrame:

(Referring to the table image: it has columns Id, Name, count, and timestamp.)

Which code fragment should the engineer use to extract the unique values in the Name column into an alphabetically ordered list?

Options:

A.

df.select("Name").orderBy(df["Name"].asc())

B.

df.select("Name").distinct().orderBy(df["Name"])

C.

df.select("Name").distinct()

D.

df.select("Name").distinct().orderBy(df["Name"].desc())

Question 7

In the code block below, aggDF contains aggregations on a streaming DataFrame:

aggDF.writeStream \
    .format("console") \
    .outputMode("???") \
    .start()

Which output mode at line 3 ensures that the entire result table is written to the console during each trigger execution?

Options:

A.

AGGREGATE

B.

COMPLETE

C.

REPLACE

D.

APPEND

Question 8

A data engineer is working on the DataFrame df1 and wants the Name with the highest count to appear first (descending order by count), followed by the next highest, and so on.

The DataFrame has columns:

id | Name    | count | timestamp
---|---------|-------|----------
1  | USA     | 10    |
2  | India   | 20    |
3  | England | 50    |
4  | India   | 50    |
5  | France  | 20    |
6  | India   | 10    |
7  | USA     | 30    |
8  | USA     | 40    |

Which code fragment should the engineer use to sort the data in the Name and count columns?

Options:

A.

df1.orderBy(col("count").desc(), col("Name").asc())

B.

df1.sort("Name", "count")

C.

df1.orderBy("Name", "count")

D.

df1.orderBy(col("Name").desc(), col("count").asc())

Question 9

A data engineer is building a Structured Streaming pipeline and wants it to recover from failures or intentional shutdowns by continuing where it left off.

How can this be achieved?

Options:

A.

By configuring the option recoveryLocation during SparkSession initialization.

B.

By configuring the option checkpointLocation during readStream.

C.

By configuring the option checkpointLocation during writeStream.

D.

By configuring the option recoveryLocation during writeStream.
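
A minimal hedged sketch (sink and checkpoint paths invented) showing where the checkpoint location is declared on the write side so the query can resume from where it left off after a restart:

query = (streaming_df.writeStream
         .format("parquet")
         .option("path", "/tmp/demo/output")
         .option("checkpointLocation", "/tmp/demo/checkpoint")  # offsets and state are persisted here
         .outputMode("append")
         .start())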

Question 10

Given this code:

.withWatermark("event_time", "10 minutes")
.groupBy(window("event_time", "15 minutes"))
.count()

What happens to data that arrives after the watermark threshold?

Options:

A.

Records that arrive later than the watermark threshold (10 minutes) will automatically be included in the aggregation if they fall within the 15-minute window.

B.

Any data arriving more than 10 minutes after the watermark threshold will be ignored and not included in the aggregation.

C.

Data arriving more than 10 minutes after the latest watermark will still be included in the aggregation but will be placed into the next window.

D.

The watermark ensures that late data arriving within 10 minutes of the latest event_time will be processed and included in the windowed aggregation.

Question 11

What is the benefit of using Pandas API on Spark for data transformations?

Options:

A.

It executes queries faster by using all the available cores in the cluster, while also providing pandas' rich set of features.

B.

It is available only with Python, thereby reducing the learning curve.

C.

It runs on a single node only, utilizing memory efficiently.

D.

It computes results immediately using eager execution.
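
For context, a short sketch of the Pandas API on Spark (pyspark.pandas): the syntax mirrors pandas, but execution is distributed across the cluster's cores.

import pyspark.pandas as ps

psdf = ps.DataFrame({"user_id": [1, 2, 3], "amount": [10.0, 20.0, 30.0]})
print(psdf.groupby("user_id")["amount"].sum())  # pandas-style API, executed by Spark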

Question 12

A data engineer is streaming data from Kafka and requires:

Minimal latency

Exactly-once processing guarantees

Which trigger mode should be used?

Options:

A.

.trigger(processingTime='1 second')

B.

.trigger(continuous=True)

C.

.trigger(continuous='1 second')

D.

.trigger(availableNow=True)

Question 13

A data analyst is working on employees_df and needs to add a new column containing a 10% tax calculated on the salary.

Additionally, the DataFrame contains the column age, which is not needed.

Which code fragment adds the tax column and removes the age column?

Options:

A.

employees_df = employees_df.withColumn("tax", col("salary") * 0.1).drop("age")

B.

employees_df = employees_df.withColumn("tax", lit(0.1)).drop("age")

C.

employees_df = employees_df.dropField("age").withColumn("tax", col("salary") * 0.1)

D.

employees_df = employees_df.withColumn("tax", col("salary") + 0.1).drop("age")

Question 14

Which command overwrites an existing JSON file when writing a DataFrame?

Options:

A.

df.write.mode("overwrite").json("path/to/file")

B.

df.write.overwrite.json("path/to/file")

C.

df.write.json("path/to/file", overwrite=True)

D.

df.write.format("json").save("path/to/file", mode="overwrite")

Question 15

Given the schema:

event_ts TIMESTAMP,
sensor_id STRING,
metric_value LONG,
ingest_ts TIMESTAMP,
source_file_path STRING

The goal is to deduplicate the records based on the columns event_ts, sensor_id, and metric_value. Which approach meets this requirement?

Options:

A.

dropDuplicates on all columns (wrong criteria)

B.

dropDuplicates with no arguments (removes based on all columns)

C.

groupBy without aggregation (invalid use)

D.

dropDuplicates on the exact matching fields
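
A hedged sketch of deduplicating on just the three key columns from the schema above (the DataFrame name is assumed):

deduped_df = df.dropDuplicates(["event_ts", "sensor_id", "metric_value"])
# rows that differ only in ingest_ts or source_file_path collapse into one record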

Question 16

A DataFrame df has columns name, age, and salary. The developer needs to sort the DataFrame by age in ascending order and salary in descending order.

Which code snippet meets the requirement of the developer?

Options:

A.

df.orderBy(col("age").asc(), col("salary").asc()).show()

B.

df.sort("age", "salary", ascending=[True, True]).show()

C.

df.sort("age", "salary", ascending=[False, True]).show()

D.

df.orderBy("age", "salary", ascending=[True, False]).show()

Question 17

A data engineer needs to join multiple DataFrames and has written the following code:

from pyspark.sql.functions import broadcast

data1 = [(1, "A"), (2, "B")]
data2 = [(1, "X"), (2, "Y")]
data3 = [(1, "M"), (2, "N")]

df1 = spark.createDataFrame(data1, ["id", "val1"])
df2 = spark.createDataFrame(data2, ["id", "val2"])
df3 = spark.createDataFrame(data3, ["id", "val3"])

df_joined = df1.join(broadcast(df2), "id", "inner") \
    .join(broadcast(df3), "id", "inner")

What will be the output of this code?

Options:

A.

The code will work correctly and perform two broadcast joins simultaneously to join df1 with df2, and then the result with df3.

B.

The code will fail because only one broadcast join can be performed at a time.

C.

The code will fail because the second join condition (df2.id == df3.id) is incorrect.

D.

The code will result in an error because broadcast() must be called before the joins, not inline.

Question 18

A developer is creating a Spark application that performs multiple DataFrame transformations and actions. The developer wants to maintain optimal performance by properly managing the SparkSession.

How should the developer handle the SparkSession throughout the application?

Options:

A.

Use a single SparkSession instance for the entire application.

B.

Avoid using a SparkSession and rely on SparkContext only.

C.

Create a new SparkSession instance before each transformation.

D.

Stop and restart the SparkSession after each action.
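
A common pattern, sketched here with a hypothetical application name: create (or reuse) one session at startup with getOrCreate() and stop it once at the very end.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("dataframe-pipeline")  # hypothetical name
         .getOrCreate())                 # returns the active session if one already exists
# ... all transformations and actions share this single `spark` instance ...
spark.stop()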

Question 19

A data engineer is working with Spark SQL and has a large JSON file stored at /data/input.json.

The file contains records with varying schemas, and the engineer wants to create an external table in Spark SQL that:

    Reads directly from /data/input.json.

    Infers the schema automatically.

    Merges differing schemas.

Which code snippet should the engineer use?

Options:

A.

CREATE EXTERNAL TABLE users
USING json
OPTIONS (path '/data/input.json', mergeSchema 'true');

B.

CREATE TABLE users
USING json
OPTIONS (path '/data/input.json');

C.

CREATE EXTERNAL TABLE users
USING json
OPTIONS (path '/data/input.json', inferSchema 'true');

D.

CREATE EXTERNAL TABLE users
USING json
OPTIONS (path '/data/input.json', mergeAll 'true');

Question 20

A data engineer is working on a Streaming DataFrame (streaming_df) with the following streaming data:

id | name       | count | timestamp
1  | Delhi      | 20    | 2024-09-19T10:11
1  | Delhi      | 50    | 2024-09-19T10:12
2  | London     | 50    | 2024-09-19T10:15
3  | Paris      | 30    | 2024-09-19T10:18
3  | Paris      | 20    | 2024-09-19T10:20
4  | Washington | 10    | 2024-09-19T10:22

Which operation is supported with streaming_df?

Options:

A.

streaming_df.count()

B.

streaming_df.filter("count < 30")

C.

streaming_df.select(countDistinct("name"))

D.

streaming_df.show()

Question 21

Which components of Apache Spark's architecture are responsible for carrying out tasks assigned to them?

Options:

A.

Driver Nodes

B.

Executors

C.

CPU Cores

D.

Worker Nodes

Question 22

Which code should be used to display the schema of the Parquet file stored in the location events.parquet?

Options:

A.

spark.sql("SELECT * FROM events.parquet").show()

B.

spark.read.format("parquet").load("events.parquet").show()

C.

spark.read.parquet("events.parquet").printSchema()

D.

spark.sql("SELECT schema FROM events.parquet").show()

Question 23

A data engineer is working on a num_df DataFrame and has a Python UDF defined as:

def cube_func(val):
    return val * val * val

Which code fragment registers and uses this UDF as a Spark SQL function to work with the DataFrame num_df?

Options:

A.

spark.udf.register("cube_func", cube_func)

num_df.selectExpr("cube_func(num)").show()

B.

num_df.select(cube_func("num")).show()

C.

spark.createDataFrame(cube_func("num")).show()

D.

num_df.register("cube_func").select("num").show()

Question 24

A developer notices that all the post-shuffle partitions in a dataset are smaller than the value set for spark.sql.adaptive.maxShuffledHashJoinLocalMapThreshold.

Which type of join will Adaptive Query Execution (AQE) choose in this case?

Options:

A.

A Cartesian join

B.

A shuffled hash join

C.

A broadcast nested loop join

D.

A sort-merge join

Question 25

What is the risk associated with converting a large Pandas API on Spark DataFrame back to a pandas DataFrame?

Options:

A.

The conversion will automatically distribute the data across worker nodes

B.

The operation will fail if the Pandas DataFrame exceeds 1000 rows

C.

Data will be lost during conversion

D.

The operation will load all data into the driver's memory, potentially causing memory overflow
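
For context, a hedged sketch of the conversion in question (path invented): to_pandas() collects the entire distributed dataset onto the driver, so it is only safe when the result fits in driver memory.

import pyspark.pandas as ps

psdf = ps.read_parquet("/tmp/demo/large_dataset")  # distributed pandas-on-Spark DataFrame
pdf = psdf.to_pandas()  # pulls every row onto the driver; can exhaust driver memory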

Question 26

A data engineer has been asked to produce a Parquet table which is overwritten every day with the latest data. The downstream consumer of this Parquet table has a hard requirement that the data in this table is produced with all records sorted by the market_time field.

Which line of Spark code will produce a Parquet table that meets these requirements?

Options:

A.

final_df \
    .sort("market_time") \
    .write \
    .format("parquet") \
    .mode("overwrite") \
    .saveAsTable("output.market_events")

B.

final_df \
    .orderBy("market_time") \
    .write \
    .format("parquet") \
    .mode("overwrite") \
    .saveAsTable("output.market_events")

C.

final_df \
    .sort("market_time") \
    .coalesce(1) \
    .write \
    .format("parquet") \
    .mode("overwrite") \
    .saveAsTable("output.market_events")

D.

final_df \
    .sortWithinPartitions("market_time") \
    .write \
    .format("parquet") \
    .mode("overwrite") \
    .saveAsTable("output.market_events")

Question 27

A data scientist is working with a Spark DataFrame called customerDF that contains customer information. The DataFrame has a column named email with customer email addresses. The data scientist needs to split this column into username and domain parts.

Which code snippet splits the email column into username and domain columns?

Options:

A.

customerDF.select(
    col("email").substr(0, 5).alias("username"),
    col("email").substr(-5).alias("domain")
)

B.

customerDF.withColumn("username", split(col("email"), "@").getItem(0)) \
    .withColumn("domain", split(col("email"), "@").getItem(1))

C.

customerDF.withColumn("username", substring_index(col("email"), "@", 1)) \
    .withColumn("domain", substring_index(col("email"), "@", -1))

D.

customerDF.select(
    regexp_replace(col("email"), "@", "").alias("username"),
    regexp_replace(col("email"), "@", "").alias("domain")
)

Question 28

A data engineer needs to add all the rows from one table to all the rows from another, but not all the columns in the first table exist in the second table.

The error message is:

AnalysisException: UNION can only be performed on tables with the same number of columns.

The existing code is:

au_df.union(nz_df)

The DataFrame au_df has one extra column that does not exist in the DataFrame nz_df, but otherwise both DataFrames have the same column names and data types.

What should the data engineer fix in the code to ensure the combined DataFrame can be produced as expected?

Options:

A.

df = au_df.unionByName(nz_df, allowMissingColumns=True)

B.

df = au_df.unionAll(nz_df)

C.

df = au_df.unionByName(nz_df, allowMissingColumns=False)

D.

df = au_df.union(nz_df, allowMissingColumns=True)

Question 29

Which feature of Spark Connect is considered when designing an application to enable remote interaction with the Spark cluster?

Options:

A.

It provides a way to run Spark applications remotely in any programming language

B.

It can be used to interact with any remote cluster using the REST API

C.

It allows for remote execution of Spark jobs

D.

It is primarily used for data ingestion into Spark from external sources
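
As background, a minimal Spark Connect sketch (host and port invented): the client builds a remote session, and DataFrame operations are sent to and executed on the remote cluster.

from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://spark-server.example.com:15002").getOrCreate()
spark.range(5).show()  # the logical plan is executed remotely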

Question 30

A data scientist is working on a large dataset in Apache Spark using PySpark. The data scientist has a DataFrame df with columns user_id, product_id, and purchase_amount and needs to perform some operations on this data efficiently.

Which sequence of operations results in transformations that require a shuffle followed by transformations that do not?

Options:

A.

df.filter(df.purchase_amount > 100).groupBy("user_id").sum("purchase_amount")

B.

df.withColumn("discount", df.purchase_amount * 0.1).select("discount")

C.

df.withColumn("purchase_date", current_date()).where("total_purchase > 50")

D.

df.groupBy("user_id").agg(sum("purchase_amount").alias("total_purchase")).repartition(10)

Question 31

A developer is working on a Spark application that processes a large dataset using SQL queries. Despite having a large cluster, the developer notices that the job is underutilizing the available resources. Executors remain idle for most of the time, and logs reveal that the number of tasks per stage is very low. The developer suspects that this is causing suboptimal cluster performance.

Which action should the developer take to improve cluster utilization?

Options:

A.

Increase the value of spark.sql.shuffle.partitions

B.

Reduce the value of spark.sql.shuffle.partitions

C.

Enable dynamic resource allocation to scale resources as needed

D.

Increase the size of the dataset to create more partitions
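
For reference, a hedged one-line config sketch (the value 400 is only an example): raising spark.sql.shuffle.partitions increases the number of tasks produced by shuffle stages, which can help keep idle executors busy.

spark.conf.set("spark.sql.shuffle.partitions", "400")  # default is 200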

Question 32

A data engineer has written the following code to join two DataFrames df1 and df2:

df1 = spark.read.csv("sales_data.csv")

df2 = spark.read.csv("product_data.csv")

df_joined = df1.join(df2, df1.product_id == df2.product_id)

The DataFrame df1 contains ~10 GB of sales data, and df2 contains ~8 MB of product data.

Which join strategy will Spark use?

Options:

A.

Shuffle join, as the size difference between df1 and df2 is too large for a broadcast join to work efficiently.

B.

Shuffle join, because AQE is not enabled, and Spark uses a static query plan.

C.

Shuffle join because no broadcast hints were provided.

D.

Broadcast join, as df2 is smaller than the default broadcast threshold.

Question 33

A developer is working with a pandas DataFrame containing user behavior data from a web application.

Which approach should be used for executing a groupBy operation in parallel across all workers in Apache Spark 3.5?

Options:

A.

Use the applyInPandas API:

df.groupby("user_id").applyInPandas(mean_func, schema="user_id long, value double").show()

B.

Use the mapInPandas API:

df.mapInPandas(mean_func, schema="user_id long, value double").show()

C.

Use a regular Spark UDF:

from pyspark.sql.functions import mean
df.groupBy("user_id").agg(mean("value")).show()

D.

Use a Pandas UDF:

@pandas_udf("double")
def mean_func(value: pd.Series) -> float:
    return value.mean()

df.groupby("user_id").agg(mean_func(df["value"])).show()

Question 34

A data scientist is working with a massive dataset that exceeds the memory capacity of a single machine. The data scientist is considering using Apache Spark™ instead of a traditional single-machine approach such as standard Python scripts.

Which two advantages does Apache Spark™ offer over a single-machine approach in this scenario? (Choose 2 answers)

Options:

A.

It can distribute data processing tasks across a cluster of machines, enabling horizontal scalability.

B.

It requires specialized hardware to run, making it unsuitable for commodity hardware clusters.

C.

It processes data solely on disk storage, reducing the need for memory resources.

D.

It eliminates the need to write any code, automatically handling all data processing.

E.

It has built-in fault tolerance, allowing it to recover seamlessly from node failures during computation.

Question 35

A data engineer is investigating a Spark cluster that is experiencing underutilization during scheduled batch jobs.

After checking the Spark logs, they noticed that tasks are often getting killed due to timeout errors, and there are several warnings about insufficient resources in the logs.

Which action should the engineer take to resolve the underutilization issue?

Options:

A.

Set the spark.network.timeout property to allow tasks more time to complete without being killed.

B.

Increase the executor memory allocation in the Spark configuration.

C.

Reduce the size of the data partitions to improve task scheduling.

D.

Increase the number of executor instances to handle more concurrent tasks.

Question 36

A developer is running Spark SQL queries and notices underutilization of resources. Executors are idle, and the number of tasks per stage is low.

What should the developer do to improve cluster utilization?

Options:

A.

Increase the value of spark.sql.shuffle.partitions

B.

Reduce the value of spark.sql.shuffle.partitions

C.

Increase the size of the dataset to create more partitions

D.

Enable dynamic resource allocation to scale resources as needed

Question 37

A data analyst wants to add a column named date, derived from an existing timestamp column. Which code fragment correctly creates this column?

Options:

A.

dates_df.withColumn("date", f.unix_timestamp("timestamp")).show()

B.

dates_df.withColumn("date", f.to_date("timestamp")).show()

C.

dates_df.withColumn("date", f.date_format("timestamp", "yyyy-MM-dd")).show()

D.

dates_df.withColumn("date", f.from_unixtime("timestamp")).show()

Question 38

What is the relationship between jobs, stages, and tasks during execution in Apache Spark?

Options:

A.

A job contains multiple stages, and each stage contains multiple tasks.

B.

A job contains multiple tasks, and each task contains multiple stages.

C.

A stage contains multiple jobs, and each job contains multiple tasks.

D.

A stage contains multiple tasks, and each task contains multiple jobs.

Question 39

You have:

DataFrame A: 128 GB of transactions

DataFrame B: 1 GB user lookup table

Which strategy is correct for broadcasting?

Options:

A.

DataFrame B should be broadcasted because it is smaller and will eliminate the need for shuffling itself

B.

DataFrame B should be broadcasted because it is smaller and will eliminate the need for shuffling DataFrame A

C.

DataFrame A should be broadcasted because it is larger and will eliminate the need for shuffling DataFrame B

D.

DataFrame A should be broadcasted because it is smaller and will eliminate the need for shuffling itself
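
A hedged sketch (DataFrame names and join key assumed) of broadcasting the small lookup table so the large transactions DataFrame can be joined without shuffling it:

from pyspark.sql.functions import broadcast

joined = transactions_df.join(broadcast(users_df), "user_id")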

Question 40

A data engineer observes that the upstream streaming source feeds the event table frequently and sends duplicate records. Upon analyzing the current production table, the data engineer found that the time difference in the event_timestamp column of the duplicate records is, at most, 30 minutes.

To remove the duplicates, the engineer adds the code:

df = df.withWatermark("event_timestamp", "30 minutes")

What is the result?

Options:

A.

It removes all duplicates regardless of when they arrive.

B.

It accepts watermarks in seconds and the code results in an error.

C.

It removes duplicates that arrive within the 30-minute window specified by the watermark.

D.

It is not able to handle deduplication in this scenario.
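
For context, a hedged sketch of how a watermark is typically paired with dropDuplicates() for streaming deduplication (event_id is a hypothetical key column):

deduped = (df
           .withWatermark("event_timestamp", "30 minutes")
           .dropDuplicates(["event_id", "event_timestamp"]))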