Databricks Certification Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Exam Dumps

Databricks Certified Associate Developer for Apache Spark 3.5 – Python Questions and Answers

Question 17


A data engineer needs to join multiple DataFrames and has written the following code:

from pyspark.sql.functions import broadcast

data1 = [(1, "A"), (2, "B")]
data2 = [(1, "X"), (2, "Y")]
data3 = [(1, "M"), (2, "N")]

df1 = spark.createDataFrame(data1, ["id", "val1"])
df2 = spark.createDataFrame(data2, ["id", "val2"])
df3 = spark.createDataFrame(data3, ["id", "val3"])

df_joined = df1.join(broadcast(df2), "id", "inner") \
    .join(broadcast(df3), "id", "inner")

What will be the output of this code?

Options:

A.

The code will work correctly and perform two broadcast joins simultaneously to join df1 with df2, and then the result with df3.

B.

The code will fail because only one broadcast join can be performed at a time.

C.

The code will fail because the second join condition (df2.id == df3.id) is incorrect.

D.

The code will result in an error because broadcast() must be called before the joins, not inline.
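For reference, the join chain from the question runs as written; a minimal runnable sketch (the explicit SparkSession setup and app name are additions for self-containment):

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join-demo").getOrCreate()

df1 = spark.createDataFrame([(1, "A"), (2, "B")], ["id", "val1"])
df2 = spark.createDataFrame([(1, "X"), (2, "Y")], ["id", "val2"])
df3 = spark.createDataFrame([(1, "M"), (2, "N")], ["id", "val3"])

# Chaining several broadcast() hints is allowed; each hinted DataFrame
# is shipped to the executors for its own join stage.
df_joined = (
    df1.join(broadcast(df2), "id", "inner")
       .join(broadcast(df3), "id", "inner")
)

df_joined.show()
# Expected rows (order may vary):
# id=1, val1=A, val2=X, val3=M
# id=2, val1=B, val2=Y, val3=N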

Question 18


A developer is creating a Spark application that performs multiple DataFrame transformations and actions. The developer wants to maintain optimal performance by properly managing the SparkSession.

How should the developer handle the SparkSession throughout the application?

Options:

A.

Use a single SparkSession instance for the entire application.

B.

Avoid using a SparkSession and rely on SparkContext only.

C.

Create a new SparkSession instance before each transformation.

D.

Stop and restart the SparkSession after each action.
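As background, the usual pattern is one session per application, created once via the builder and reused everywhere; a minimal sketch (the app name and helper function are illustrative):

from pyspark.sql import SparkSession

# getOrCreate() returns the already-active session if one exists, so
# repeated calls never spin up a second SparkSession.
spark = SparkSession.builder.appName("my-app").getOrCreate()

def load_events(path):
    # Every read, transformation, and action reuses the same session.
    return spark.read.parquet(path)

# ... run all transformations and actions against `spark` ...

# Stop the session exactly once, when the application finishes.
spark.stop()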

Question 19


A data engineer is working with Spark SQL and has a large JSON file stored at /data/input.json.

The file contains records with varying schemas, and the engineer wants to create an external table in Spark SQL that:

    Reads directly from /data/input.json.

    Infers the schema automatically.

    Merges differing schemas.

Which code snippet should the engineer use?

Options:

A.

CREATE EXTERNAL TABLE users
USING json
OPTIONS (path '/data/input.json', mergeSchema 'true');

B.

CREATE TABLE users
USING json
OPTIONS (path '/data/input.json');

C.

CREATE EXTERNAL TABLE users
USING json
OPTIONS (path '/data/input.json', inferSchema 'true');

D.

CREATE EXTERNAL TABLE users
USING json
OPTIONS (path '/data/input.json', mergeAll 'true');
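As background, Spark's JSON reader infers a schema by scanning the records and unioning the fields it finds, so per-record schema differences are merged during inference; a minimal sketch registering a table over the path (this baseline uses the plain USING json form, since whether the EXTERNAL keyword and the extra options are accepted depends on the Spark version and metastore configuration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-table-demo").getOrCreate()

# The table reads the JSON files in place (no copy); Spark infers the
# schema by scanning the records under the given path.
spark.sql("""
    CREATE TABLE IF NOT EXISTS users
    USING json
    OPTIONS (path '/data/input.json')
""")

spark.sql("SELECT * FROM users LIMIT 5").show()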

Question 20


A data engineer is working on a streaming DataFrame (streaming_df) containing the following data:

id | name       | count | timestamp
---|------------|-------|------------------
1  | Delhi      | 20    | 2024-09-19T10:11
1  | Delhi      | 50    | 2024-09-19T10:12
2  | London     | 50    | 2024-09-19T10:15
3  | Paris      | 30    | 2024-09-19T10:18
3  | Paris      | 20    | 2024-09-19T10:20
4  | Washington | 10    | 2024-09-19T10:22

Which operation is supported with streaming_df?

Options:

A.

streaming_df.count()

B.

streaming_df.filter("count < 30")

C.

streaming_df.select(countDistinct("name"))

D.

streaming_df.show()
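To probe these options hands-on, a minimal sketch against the built-in rate source (the source and the renamed column are illustrative stand-ins for the question's stream):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("streaming-ops-demo").getOrCreate()

# The rate source emits (timestamp, value); rename value to mirror the
# question's `count` column.
streaming_df = (
    spark.readStream.format("rate").load()
         .withColumnRenamed("value", "count")
)

# Stateless transformations such as filter() are supported on streams.
filtered = streaming_df.filter("count < 30")

# Eager actions like count() or show() raise AnalysisException on a
# streaming DataFrame; results must flow through writeStream instead.
query = (
    filtered.writeStream.format("console")
            .outputMode("append")
            .start()
)
query.awaitTermination(10)  # let the demo run briefly
query.stop()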