Latest Databricks Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 Dumps PDF Questions Answers 2025

Databricks Certified Associate Developer for Apache Spark 3.0 Exam Questions and Answers

Question 1

Which of the following code blocks returns a one-column DataFrame for which every row contains an array of all integer numbers from 0 up to and including the number given in column predError of DataFrame transactionsDf, and null if predError is null?

Sample of DataFrame transactionsDf:

+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            1|        3|    4|     25|        1|null|
|            2|        6|    7|      2|        2|null|
|            3|        3| null|     25|        3|null|
|            4|     null| null|      3|        2|null|
|            5|     null| null|   null|        2|null|
|            6|        3|    2|     25|        2|null|
+-------------+---------+-----+-------+---------+----+

Options:

def count_to_target(target):
    if target is None:
        return
    result = [range(target)]
    return result
count_to_target_udf = udf(count_to_target, ArrayType[IntegerType])
transactionsDf.select(count_to_target_udf(col('predError')))

def count_to_target(target):
    if target is None:
        return
    result = list(range(target))
    return result
transactionsDf.select(count_to_target(col('predError')))

def count_to_target(target):
    if target is None:
        return
    result = list(range(target))
    return result
count_to_target_udf = udf(count_to_target, ArrayType(IntegerType()))
transactionsDf.select(count_to_target_udf('predError'))

(Correct)

def count_to_target(target):
    result = list(range(target))
    return result
count_to_target_udf = udf(count_to_target, ArrayType(IntegerType()))
df = transactionsDf.select(count_to_target_udf('predError'))

def count_to_target(target):
    if target is None:
        return
    result = list(range(target))
    return result
count_to_target_udf = udf(count_to_target)
transactionsDf.select(count_to_target_udf('predError'))


Question 2

Which of the following DataFrame operators is never classified as a wide transformation?

Options:

DataFrame.sort()

DataFrame.aggregate()

DataFrame.repartition()

DataFrame.select()

DataFrame.join()

Answer: DataFrame.select()

Explanation:

After working through the practice tests, you probably have a good feeling for which operations classify as wide and which as narrow transformations. If you are unsure, experiment in Spark and inspect the execution plan via DataFrame.[operation, for example sort()].explain(). If the plan involves repartitioning, the operation counts as a wide transformation.

DataFrame.select()

Correct! A wide transformation includes a shuffle, meaning that a single input partition can contribute to more than one output partition. This is expensive and causes traffic across the cluster. With the select() operation, however, you tell Spark to perform an operation on a specific slice of each partition. For this, Spark does not need to exchange data across partitions; each partition can be worked on independently. Thus, you do not cause a wide transformation.

DataFrame.repartition()

Incorrect. When you repartition a DataFrame, you redefine partition boundaries. Data will flow across your cluster and end up in different partitions after the repartitioning is completed. This is known as a shuffle and, in turn, is classified as a wide transformation.

DataFrame.aggregate()

No. When you aggregate, you may compare and summarize data across partitions. In the process, data are exchanged across the cluster, and newly formed output partitions depend on one or more input partitions. This is a typical characteristic of a shuffle, meaning that the aggregate operation may classify as a wide transformation.

DataFrame.join()

Wrong. Joining multiple DataFrames usually means that large amounts of data are exchanged across the cluster, as new partitions are formed. This is a shuffle and therefore DataFrame.join() counts as a wide transformation.

DataFrame.sort()

False. When sorting, Spark needs to compare many rows across all partitions to each other. This is an expensive operation, since data is exchanged across the cluster and new partitions are formed as data is reordered. This process classifies as a shuffle and, as a result, DataFrame.sort() counts as a wide transformation.
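The difference between narrow and wide operations can be sketched in plain Python, without Spark. The helper names narrow_select and wide_sort and the toy partitions are hypothetical; they only mimic the idea that a narrow operation touches each partition independently, while a global sort needs rows from every partition.

```python
# Toy model: a "DataFrame" is a list of partitions, each a list of rows.
partitions = [[5, 1], [4, 2], [6, 3]]


def narrow_select(parts, fn):
    """Narrow: each output partition depends on exactly one input partition."""
    return [[fn(row) for row in part] for part in parts]


def wide_sort(parts):
    """Wide: every output partition may need rows from all input partitions."""
    all_rows = sorted(row for part in parts for row in part)  # the "shuffle"
    size = -(-len(all_rows) // len(parts))  # ceiling division
    return [all_rows[i:i + size] for i in range(0, len(all_rows), size)]


doubled = narrow_select(partitions, lambda x: x * 2)
sorted_parts = wide_sort(partitions)
print(doubled)       # [[10, 2], [8, 4], [12, 6]]
print(sorted_parts)  # [[1, 2], [3, 4], [5, 6]]
```

In the sorted result, the first output partition holds rows that originally lived in two different input partitions, which is the data movement a shuffle stands for.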

More info: Understanding Apache Spark Shuffle | Philipp Brunenberg

Question 3

Which of the following code blocks stores DataFrame itemsDf in executor memory and, if insufficient memory is available, serializes it and saves it to disk?

Options:

itemsDf.persist(StorageLevel.MEMORY_ONLY)

itemsDf.cache(StorageLevel.MEMORY_AND_DISK)

itemsDf.store()

itemsDf.cache()

itemsDf.write.option('destination', 'memory').save()

Question 4

The code block shown below should show information about the data type that column storeId of DataFrame transactionsDf contains. Choose the answer that correctly fills the blanks in the code block to accomplish this.

Code block:

transactionsDf.__1__(__2__).__3__

Options:

1. select

2. "storeId"

3. print_schema()

1. limit

2. 1

3. columns

1. select

2. "storeId"

3. printSchema()

1. limit

2. "storeId"

3. printSchema()

1. select

2. storeId

3. dtypes

Question 5

Which of the following code blocks returns a DataFrame with a single column in which all items in column attributes of DataFrame itemsDf are listed that contain the letter i?

Sample of DataFrame itemsDf:

1.+------+----------------------------------+-----------------------------+-------------------+

3.+------+----------------------------------+-----------------------------+-------------------+

7.+------+----------------------------------+-----------------------------+-------------------+

Options:

itemsDf.select(explode("attributes").alias("attributes_exploded")).filter(attributes_exploded.contains("i"))

itemsDf.explode(attributes).alias("attributes_exploded").filter(col("attributes_exploded").contains("i"))

itemsDf.select(explode("attributes")).filter("attributes_exploded".contains("i"))

itemsDf.select(explode("attributes").alias("attributes_exploded")).filter(col("attributes_exploded").contains("i"))

itemsDf.select(col("attributes").explode().alias("attributes_exploded")).filter(col("attributes_exploded").contains("i"))

Question 6

The code block displayed below contains an error. The code block below is intended to add a column itemNameElements to DataFrame itemsDf that includes an array of all words in column itemName. Find the error.

Sample of DataFrame itemsDf:

+------+----------------------------------+-------------------+
|itemId|itemName                          |supplier           |
+------+----------------------------------+-------------------+
|1     |Thick Coat for Walking in the Snow|Sports Company Inc.|
|2     |Elegant Outdoors Summer Dress     |YetiX              |
|3     |Outdoors Backpack                 |Sports Company Inc.|
+------+----------------------------------+-------------------+

Code block:

itemsDf.withColumnRenamed("itemNameElements", split("itemName"))

Options:

All column names need to be wrapped in the col() operator.

Operator withColumnRenamed needs to be replaced with operator withColumn and a second argument "," needs to be passed to the split method.

Operator withColumnRenamed needs to be replaced with operator withColumn and the split method needs to be replaced by the splitString method.

Operator withColumnRenamed needs to be replaced with operator withColumn and a second argument " " needs to be passed to the split method.

The expressions "itemNameElements" and split("itemName") need to be swapped.

Question 7

Which of the following code blocks stores a part of the data in DataFrame itemsDf on executors?

Options:

itemsDf.cache().count()

itemsDf.cache(eager=True)

cache(itemsDf)

itemsDf.cache().filter()

itemsDf.rdd.storeCopy()

Question 8

The code block shown below should return a column that indicates through boolean values whether rows in DataFrame transactionsDf have values greater than or equal to 20 and less than or equal to 30 in column storeId and have the value 2 in column productId. Choose the answer that correctly fills the blanks in the code block to accomplish this.

transactionsDf.__1__((__2__.__3__) __4__ (__5__))

Options:

1. select

2. col("storeId")

3. between(20, 30)

4. and

5. col("productId")==2

1. where

2. col("storeId")

3. geq(20).leq(30)

4. &

5. col("productId")==2

1. select

2. "storeId"

3. between(20, 30)

4. &&

5. col("productId")==2

1. select

2. col("storeId")

3. between(20, 30)

4. &&

5. col("productId")=2

1. select

2. col("storeId")

3. between(20, 30)

4. &

5. col("productId")==2

Question 9

Which of the following code blocks reads in the JSON file stored at filePath as a DataFrame?

Options:

spark.read.json(filePath)

spark.read.path(filePath, source="json")

spark.read().path(filePath)

spark.read().json(filePath)

spark.read.path(filePath)

Question 10

The code block shown below should return a new 2-column DataFrame that shows one attribute from column attributes per row next to the associated itemName, for all suppliers in column supplier whose name includes Sports. Choose the answer that correctly fills the blanks in the code block to accomplish this.

Sample of DataFrame itemsDf:

1.+------+----------------------------------+-----------------------------+-------------------+

3.+------+----------------------------------+-----------------------------+-------------------+

7.+------+----------------------------------+-----------------------------+-------------------+

Code block:

itemsDf.__1__(__2__).select(__3__, __4__)

Options:

1. filter

2. col("supplier").isin("Sports")

3. "itemName"

4. explode(col("attributes"))

1. where

2. col("supplier").contains("Sports")

3. "itemName"

4. "attributes"

1. where

2. col(supplier).contains("Sports")

3. explode(attributes)

4. itemName

1. where

2. "Sports".isin(col("Supplier"))

3. "itemName"

4. array_explode("attributes")

1. filter

2. col("supplier").contains("Sports")

3. "itemName"

4. explode("attributes")

Answer: itemsDf.filter(col("supplier").contains("Sports")).select("itemName", explode("attributes"))

Explanation:

Output of correct code block:

+----------------------------------+------+
|itemName                          |col   |
+----------------------------------+------+
|Thick Coat for Walking in the Snow|blue  |
|Thick Coat for Walking in the Snow|winter|
|Thick Coat for Walking in the Snow|cozy  |
|Outdoors Backpack                 |green |
|Outdoors Backpack                 |summer|
|Outdoors Backpack                 |travel|
+----------------------------------+------+

The key to solving this question is knowing about Spark's explode operator. Using this operator, you can extract values from arrays into single rows. The following guidance steps through the answers systematically from the first to the last gap. Note that there are many ways to solve the gap questions and filter out wrong answers. You do not always have to start filtering from the first gap, but can also exclude some answers based on obvious problems you see with them.

The answers to the first gap present you with two options: filter and where. These two are actually synonyms in PySpark, so using either of those is fine. The answer options to this gap therefore do not help us in selecting the right answer.

The second gap is more interesting. One answer option includes "Sports".isin(col("Supplier")). This construct does not work, since Python's string does not have an isin method. Another option contains col(supplier). Here, Python will try to interpret supplier as a variable. We have not set this variable, so this is not a viable answer. Then, you are left with answer options that include col("supplier").contains("Sports") and col("supplier").isin("Sports"). The question states that we are looking for suppliers whose name includes Sports, so we have to go for the contains operator here.

We would use the isin operator if we wanted to filter for supplier names that match any entry in a list of supplier names.

Finally, we are left with two answers that fill the third gap both with "itemName" and the fourth gap either with explode("attributes") or "attributes". While both are correct Spark syntax, only explode("attributes") will help us achieve our goal. Specifically, the question asks for one attribute from column attributes per row, and this is what the explode() operator does.

One answer option also includes array_explode() which is not a valid operator in PySpark.
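The reasoning above can be mimicked in plain Python, without Spark. This is only a sketch of the semantics: a substring test stands in for contains, a list-membership test stands in for isin, and a nested comprehension stands in for explode. The attributes of the second item are not shown in the source, so the value used for it is a hypothetical placeholder.

```python
items = [
    {"itemName": "Thick Coat for Walking in the Snow",
     "attributes": ["blue", "winter", "cozy"],
     "supplier": "Sports Company Inc."},
    {"itemName": "Elegant Outdoors Summer Dress",
     "attributes": ["placeholder"],  # hypothetical: not given in the source
     "supplier": "YetiX"},
    {"itemName": "Outdoors Backpack",
     "attributes": ["green", "summer", "travel"],
     "supplier": "Sports Company Inc."},
]

# contains("Sports"): substring match on the supplier name.
matching = [row for row in items if "Sports" in row["supplier"]]

# isin("Sports"): exact match against a list of values, so nothing matches.
exact = [row for row in items if row["supplier"] in ["Sports"]]

# explode: one output row per array element, next to the item name.
exploded = [(row["itemName"], attr)
            for row in matching
            for attr in row["attributes"]]

for name, attr in exploded:
    print(name, attr)
```

The six printed pairs correspond to the six rows of the output shown above, while the isin-style filter returns no rows at all.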

More info: pyspark.sql.functions.explode — PySpark 3.1.2 documentation


Question 11

Which of the following code blocks reads in parquet file /FileStore/imports.parquet as a DataFrame?

Options:

spark.mode("parquet").read("/FileStore/imports.parquet")

spark.read.path("/FileStore/imports.parquet", source="parquet")

spark.read().parquet("/FileStore/imports.parquet")

spark.read.parquet("/FileStore/imports.parquet")

spark.read().format('parquet').open("/FileStore/imports.parquet")

Question 12

Which of the following code blocks produces the following output, given DataFrame transactionsDf?

Output:

root
 |-- transactionId: integer (nullable = true)
 |-- predError: integer (nullable = true)
 |-- value: integer (nullable = true)
 |-- storeId: integer (nullable = true)
 |-- productId: integer (nullable = true)
 |-- f: integer (nullable = true)

DataFrame transactionsDf:

+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            1|        3|    4|     25|        1|null|
|            2|        6|    7|      2|        2|null|
|            3|        3| null|     25|        3|null|
+-------------+---------+-----+-------+---------+----+

Options:

transactionsDf.schema.print()

transactionsDf.rdd.printSchema()

transactionsDf.rdd.formatSchema()

transactionsDf.printSchema()

print(transactionsDf.schema)

Question 13

Which of the following statements about data skew is incorrect?

Options:

Spark will not automatically optimize skew joins by default.

Broadcast joins are a viable way to increase join performance for skewed data over sort-merge joins.

In skewed DataFrames, the largest and the smallest partition consume very different amounts of memory.

To mitigate skew, Spark automatically disregards null values in keys when joining.

Salting can resolve data skew.

Answer: To mitigate skew, Spark automatically disregards null values in keys when joining.

Explanation:

To mitigate skew, Spark automatically disregards null values in keys when joining.

This statement is incorrect, and thus the correct answer to the question. Joining keys that contain null values is of particular concern with regard to data skew.

In real-world applications, a table may contain a great number of records that do not have a value assigned to the column used as a join key. During the join, the data is at risk of being heavily skewed. This is because all records with a null-value join key are then evaluated as a single large partition, standing in stark contrast to the potentially diverse key values (and therefore small partitions) of the non-null-key records.

Spark specifically does not handle this automatically. However, there are several strategies to mitigate this problem, such as temporarily setting aside the records with null keys and merging them back later (see last link below).
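The idea of setting null-key records aside and merging them back later can be sketched in plain Python. The sketch assumes left outer join semantics, in which a null key never matches anything; the sample rows and variable names are hypothetical.

```python
# Left side: (join_key, payload); None stands in for a null join key.
left = [(1, "a"), (None, "b"), (2, "c"), (None, "d")]
# Right side as a lookup table keyed by the join key.
right = {1: "x", 2: "y"}

# Step 1: split off the rows whose key is null.
with_key = [row for row in left if row[0] is not None]
null_key = [row for row in left if row[0] is None]

# Step 2: join only the rows that can actually match.
joined = [(key, payload, right.get(key)) for key, payload in with_key]

# Step 3: merge the null-key rows back, unmatched.
result = joined + [(key, payload, None) for key, payload in null_key]
print(result)
```

Only the rows with real keys take part in the join, so the null-key rows never pile up in a single oversized partition.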

In skewed DataFrames, the largest and the smallest partition consume very different amounts of memory.

This statement is correct. In fact, having very different partition sizes is the very definition of skew. Skew can degrade Spark performance because the largest partition occupies a single executor for a long time. This blocks a Spark job and is an inefficient use of resources, since other executors that processed smaller partitions need to idle until the large partition is processed.

Salting can resolve data skew.

This statement is correct. The purpose of salting is to provide Spark with an opportunity to repartition data into partitions of similar size, based on a salted partitioning key.

A salted partitioning key typically is a column that consists of uniformly distributed random numbers. The number of unique entries in the partitioning key column should match your desired number of partitions. After repartitioning by the salted key, all partitions should have roughly the same size.
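A rough sketch of the effect, in plain Python rather than Spark: the partition_of function is a hypothetical stand-in for a hash partitioner, the key distribution is made up for illustration, and the salted case is simplified so that the salt value is used directly as the partition index.

```python
import random
import zlib
from collections import Counter

NUM_PARTITIONS = 4

# Skewed data: 9,000 rows share one key, 1,000 rows have distinct keys.
keys = ["hot"] * 9000 + [f"key{i}" for i in range(1000)]


def partition_of(value):
    """Deterministic stand-in for a hash partitioner (an assumption)."""
    return zlib.crc32(str(value).encode()) % NUM_PARTITIONS


# Partitioning by the original key puts every "hot" row into one partition.
by_key = Counter(partition_of(k) for k in keys)

# Salt: one uniformly distributed random integer per row, with as many
# distinct values as the desired number of partitions.
rng = random.Random(42)
salts = [rng.randrange(NUM_PARTITIONS) for _ in keys]
by_salt = Counter(salts)  # simplified: partition index = salt value

print("rows per partition by key: ", sorted(by_key.values()))
print("rows per partition by salt:", sorted(by_salt.values()))
```

Partitioned by key, one partition holds at least the 9,000 hot rows; partitioned by salt, the four partitions each hold roughly a quarter of the data.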

Spark does not automatically optimize skew joins by default.

This statement is correct. Automatic skew join optimization is a feature of Adaptive Query Execution (AQE). In Spark 3.0, AQE is disabled by default. To enable it, Spark's spark.sql.adaptive.enabled configuration option needs to be set to true instead of leaving it at the default false.

To automatically optimize skew joins, Spark's spark.sql.adaptive.skewJoin.enabled option also needs to be set to true, which it is by default.

When skew join optimization is enabled, Spark recognizes skew joins and optimizes them by splitting the bigger partitions into smaller partitions, which leads to performance increases.

Broadcast joins are a viable way to increase join performance for skewed data over sort-merge joins.

This statement is correct. Broadcast joins can indeed help increase join performance for skewed data, under some conditions. One of the DataFrames to be joined needs to be small enough to fit into each executor's memory, alongside a partition from the other DataFrame. If this is the case, a broadcast join increases join performance over a sort-merge join.

The reason is that a sort-merge join with skewed data involves excessive shuffling. During shuffling, data is sent around the cluster, ultimately slowing down the Spark application. For skewed data, the amount of data, and thus the slowdown, is particularly big.

Broadcast joins, however, help reduce the amount of shuffled data. The smaller table is directly stored on all executors, eliminating a great amount of network traffic, ultimately increasing join performance relative to the sort-merge join.

It is worth noting that for optimizing skew join behavior it may make sense to manually adjust Spark's spark.sql.autoBroadcastJoinThreshold configuration property if the smaller DataFrame is bigger than the 10 MB set by default.
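The mechanics of a broadcast join can be approximated in plain Python. The function broadcast_join and the sample tables are hypothetical; the sketch only illustrates that each partition of the large table is joined locally against its own full copy of the small table, so the large table's rows never move.

```python
# Large table, already split into partitions that live on different executors.
large_partitions = [
    [(1, "t1"), (2, "t2")],
    [(1, "t3"), (3, "t4")],
]
# Small table: small enough to be copied to every executor.
small_table = {1: "itemA", 2: "itemB"}


def broadcast_join(partitions, lookup):
    """Inner-join each partition locally against a copy of the small table."""
    out = []
    for part in partitions:
        local_copy = dict(lookup)  # the "broadcast": one copy per partition
        out.append([(key, value, local_copy[key])
                    for key, value in part
                    if key in local_copy])
    return out


joined = broadcast_join(large_partitions, small_table)
print(joined)  # [[(1, 't1', 'itemA'), (2, 't2', 'itemB')], [(1, 't3', 'itemA')]]
```

The output keeps the partitioning of the large table, which reflects that no rows of the large table had to be exchanged between partitions.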

More info:

- Performance Tuning - Spark 3.0.0 Documentation

- Data Skew and Garbage Collection to Improve Spark Performance

- Section 1.2 - Joins on Skewed Data • GitBook

Question 14

The code block shown below should write DataFrame transactionsDf as a parquet file to path storeDir, using brotli compression and replacing any previously existing file. Choose the answer that correctly fills the blanks in the code block to accomplish this.

transactionsDf.__1__.format("parquet").__2__(__3__).option(__4__, "brotli").__5__(storeDir)

Options:

1. save

2. mode

3. "ignore"

4. "compression"

5. path

1. store

2. with

3. "replacement"

4. "compression"

5. path

1. write

2. mode

3. "overwrite"

4. "compression"

5. save

(Correct)

1. save

2. mode

3. "replace"

4. "compression"

5. path

1. write

2. mode

3. "overwrite"

4. compression

5. parquet

Question 15

Which of the following statements about DAGs is correct?

Options:

DAGs help direct how Spark executors process tasks, but are a limitation to the proper execution of a query when an executor fails.

DAG stands for "Directing Acyclic Graph".

Spark strategically hides DAGs from developers, since the high degree of automation in Spark means that developers never need to consider DAG layouts.

In contrast to transformations, DAGs are never lazily executed.

DAGs can be decomposed into tasks that are executed in parallel.

Question 16

Which of the following code blocks returns a DataFrame where columns predError and productId are removed from DataFrame transactionsDf?

Sample of DataFrame transactionsDf:

+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|f   |
+-------------+---------+-----+-------+---------+----+
|1            |3        |4    |25     |1        |null|
|2            |6        |7    |2      |2        |null|
|3            |3        |null |25     |3        |null|
+-------------+---------+-----+-------+---------+----+

Options:

transactionsDf.withColumnRemoved("predError", "productId")

transactionsDf.drop(["predError", "productId", "associateId"])

transactionsDf.drop("predError", "productId", "associateId")

transactionsDf.dropColumns("predError", "productId", "associateId")

transactionsDf.drop(col("predError", "productId"))

Question 17

Which of the following code blocks creates a new DataFrame with two columns season and wind_speed_ms where column season is of data type string and column wind_speed_ms is of data type double?

Options:

spark.DataFrame({"season": ["winter","summer"], "wind_speed_ms": [4.5, 7.5]})

spark.createDataFrame([("summer", 4.5), ("winter", 7.5)], ["season", "wind_speed_ms"])

from pyspark.sql import types as T
spark.createDataFrame((("summer", 4.5), ("winter", 7.5)), T.StructType([T.StructField("season", T.CharType()), T.StructField("season", T.DoubleType())]))

spark.newDataFrame([("summer", 4.5), ("winter", 7.5)], ["season", "wind_speed_ms"])

spark.createDataFrame({"season": ["winter","summer"], "wind_speed_ms": [4.5, 7.5]})

Answer: spark.createDataFrame([("summer", 4.5), ("winter", 7.5)], ["season", "wind_speed_ms"])

Explanation:

spark.createDataFrame([("summer", 4.5), ("winter", 7.5)], ["season", "wind_speed_ms"])

Correct. This command uses the Spark Session's createDataFrame method to create a new DataFrame. Notice how rows, columns, and column names are passed in here: The rows are specified as a Python list. Every entry in the list is a new row. Each row is a Python tuple (for example ("summer", 4.5)), and every entry in the tuple is the value for one column.

The column names are specified as the second argument to createDataFrame(). The documentation (link below) shows that "when schema is a list of column names, the type of each column will be inferred from data" (the first argument). Since values 4.5 and 7.5 are both Python floats, Spark will correctly infer the double type for column wind_speed_ms. Given that all values in column "season" are strings, Spark will infer the string type for that column.

Find out more about SparkSession.createDataFrame() via the link below.

spark.newDataFrame([("summer", 4.5), ("winter", 7.5)], ["season", "wind_speed_ms"])

No, the SparkSession does not have a newDataFrame method.

from pyspark.sql import types as T

spark.createDataFrame((("summer", 4.5), ("winter", 7.5)), T.StructType([T.StructField("season", T.CharType()), T.StructField("season", T.DoubleType())]))

No. pyspark.sql.types does not have a CharType type. See link below for available data types in Spark.

spark.createDataFrame({"season": ["winter","summer"], "wind_speed_ms": [4.5, 7.5]})

No, this is not correct Spark syntax. If you have considered this option to be correct, you may have some experience with Python's pandas package, in which this would be correct syntax. To create a Spark DataFrame from a pandas DataFrame, you can simply use spark.createDataFrame(pandasDf) where pandasDf is the pandas DataFrame.

Find out more about Spark syntax options using the examples in the documentation for SparkSession.createDataFrame linked below.

spark.DataFrame({"season": ["winter","summer"], "wind_speed_ms": [4.5, 7.5]})

No, the Spark Session (indicated by spark in the code above) does not have a DataFrame method.

More info: pyspark.sql.SparkSession.createDataFrame — PySpark 3.1.1 documentation and Data Types - Spark 3.1.2 Documentation


Question 18

The code block displayed below contains an error. The code block should produce a DataFrame with color as the only column and three rows with color values of red, blue, and green, respectively. Find the error.

Code block:

spark.createDataFrame([("red",), ("blue",), ("green",)], "color")

Options:

Instead of calling spark.createDataFrame, just DataFrame should be called.

The commas in the tuples with the colors should be eliminated.

The colors red, blue, and green should be expressed as a simple Python list, and not a list of tuples.

Instead of color, a data type should be specified.

The "color" expression needs to be wrapped in brackets, so it reads ["color"].

Question 19

The code block displayed below contains multiple errors. The code block should return a DataFrame that contains only columns transactionId, predError, value and storeId of DataFrame transactionsDf. Find the errors.

Code block:

transactionsDf.select([col(productId), col(f)])

Sample of transactionsDf:

+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            1|        3|    4|     25|        1|null|
|            2|        6|    7|      2|        2|null|
|            3|        3| null|     25|        3|null|
+-------------+---------+-----+-------+---------+----+

Options:

The column names should be listed directly as arguments to the operator and not as a list.

The select operator should be replaced by a drop operator, the column names should be listed directly as arguments to the operator and not as a list, and all column names should be expressed as strings without being wrapped in a col() operator.

The select operator should be replaced by a drop operator.

The column names should be listed directly as arguments to the operator and not as a list and following the pattern of how column names are expressed in the code block, columns productId and f should be replaced by transactionId, predError, value and storeId.

The select operator should be replaced by a drop operator, the column names should be listed directly as arguments to the operator and not as a list, and all col() operators should be removed.

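Answer: The select operator should be replaced by a drop operator, the column names should be listed directly as arguments to the operator and not as a list, and all column names should be expressed as strings without being wrapped in a col() operator.

Explanation: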

Correct code block: transactionsDf.drop("productId", "f")

This question requires a lot of thinking to get right. To solve it, you may take advantage of the digital notepad that is provided to you during the test. You have probably seen that the code block includes multiple errors. In the test, you are usually confronted with a code block that contains only a single error. However, since you are practicing here, this challenging multi-error question will make it easier for you to deal with single-error questions in the real exam.

The select operator should be replaced by a drop operator, the column names should be listed directly as arguments to the operator and not as a list, and all column names should be expressed as strings without being wrapped in a col() operator.

Correct! Here, you need to figure out the many things that are wrong with the initial code block. While the question can be solved by using a select statement, a drop statement, given the answer options, is the correct one. Then, you can read in the documentation that drop does not take a list as an argument, but just the column names that should be dropped. Finally, the column names should be expressed as strings and not as Python variable names as in the original code block.

The column names should be listed directly as arguments to the operator and not as a list.

Incorrect. While this is a good first step and part of the correct solution (see above), this modification is insufficient to solve the question.

The column names should be listed directly as arguments to the operator and not as a list and following the pattern of how column names are expressed in the code block, columns productId and f should be replaced by transactionId, predError, value and storeId.

Wrong. If you use the same pattern as in the original code block (col(productId), col(f)), you are still making a mistake. col(productId) will trigger Python to look up a variable named productId instead of telling Spark to use the column productId. For that, you need to express the column name as a string.

The select operator should be replaced by a drop operator, the column names should be listed directly as arguments to the operator and not as a list, and all col() operators should be removed.

No. This still leaves you with Python trying to interpret the column names as Python variables (see above).

The select operator should be replaced by a drop operator.

Wrong, this is not enough to solve the question. If you do this, you will still face problems since you are passing a Python list to drop and the column names are still interpreted as Python variables (see above).
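The point about Python variables versus strings can be reproduced without Spark. In the sketch below, col is a hypothetical stand-in for pyspark.sql.functions.col that merely returns a label; the lookup failure happens in Python before the function is ever called, which is why the same error would appear with the real function.

```python
def col(name):
    """Hypothetical stand-in for pyspark.sql.functions.col."""
    return f"Column<{name}>"


# A string literal names the column directly.
ok = col("productId")

# A bare identifier is looked up as a Python variable, which is undefined here.
try:
    col(productId)  # noqa: F821 (intentionally undefined)
    error = None
except NameError as exc:
    error = str(exc)

print(ok)
print(error)
```

The second call never reaches col at all: Python raises a NameError while evaluating the argument.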

More info: pyspark.sql.DataFrame.drop — PySpark 3.1.2 documentation


Question 20

Which of the following code blocks returns a DataFrame that has all columns of DataFrame transactionsDf and an additional column predErrorSquared which is the squared value of column predError in DataFrame transactionsDf?

Options:

transactionsDf.withColumn("predError", pow(col("predErrorSquared"), 2))

transactionsDf.withColumnRenamed("predErrorSquared", pow(predError, 2))

transactionsDf.withColumn("predErrorSquared", pow(col("predError"), lit(2)))

transactionsDf.withColumn("predErrorSquared", pow(predError, lit(2)))

transactionsDf.withColumn("predErrorSquared", "predError"**2)

Question 21

The code block displayed below contains an error. The code block should count the number of rows that have a predError of either 3 or 6. Find the error.

Code block:

transactionsDf.filter(col('predError').in([3, 6])).count()

Options:

The number of rows cannot be determined with the count() operator.

Instead of filter, the select method should be used.

The method used on column predError is incorrect.

Instead of a list, the values need to be passed as single arguments to the in operator.

Numbers 3 and 6 need to be passed as string variables.

Question 22

The code block displayed below contains an error. The code block should return a DataFrame where all entries in column supplier contain the letter combination et in this order. Find the error.

Code block:

itemsDf.filter(Column('supplier').isin('et'))

Options:

The Column operator should be replaced by the col operator and instead of isin, contains should be used.

The expression inside the filter parenthesis is malformed and should be replaced by isin('et', 'supplier').

Instead of isin, it should be checked whether column supplier contains the letters et, so isin should be replaced with contains. In addition, the column should be accessed using col['supplier'].

The expression only returns a single column and filter should be replaced by select.

Question 23

The code block shown below should return a DataFrame with only columns from DataFrame transactionsDf for which there is a corresponding transactionId in DataFrame itemsDf. DataFrame itemsDf is very small and much smaller than DataFrame transactionsDf. The query should be executed in an optimized way. Choose the answer that correctly fills the blanks in the code block to accomplish this.

__1__.__2__(__3__, __4__, __5__)

Options:

1. transactionsDf

2. join

3. broadcast(itemsDf)

4. transactionsDf.transactionId==itemsDf.transactionId

5. "outer"

1. transactionsDf

2. join

3. itemsDf

4. transactionsDf.transactionId==itemsDf.transactionId

5. "anti"

1. transactionsDf

2. join

3. broadcast(itemsDf)

4. "transactionId"

5. "left_semi"

1. itemsDf

2. broadcast

3. transactionsDf

4. "transactionId"

5. "left_semi"

1. itemsDf

2. join

3. broadcast(transactionsDf)

4. "transactionId"

5. "left_semi"

Question 24

In which order should the code blocks shown below be run in order to create a table of all values in column attributes next to the respective values in column supplier in DataFrame itemsDf?

1. itemsDf.createOrReplaceView("itemsDf")

2. spark.sql("FROM itemsDf SELECT 'supplier', explode('Attributes')")

3. spark.sql("FROM itemsDf SELECT supplier, explode(attributes)")

4. itemsDf.createOrReplaceTempView("itemsDf")

Options:

4, 3

1, 3

4, 2

1, 2

Question 25

Which of the following code blocks shuffles DataFrame transactionsDf, which has 8 partitions, so that it has 10 partitions?

Options:

transactionsDf.repartition(transactionsDf.getNumPartitions()+2)

transactionsDf.repartition(transactionsDf.rdd.getNumPartitions()+2)

transactionsDf.coalesce(10)

transactionsDf.coalesce(transactionsDf.getNumPartitions()+2)

transactionsDf.repartition(transactionsDf._partitions+2)

Question 26

Which of the following code blocks returns a DataFrame with approximately 1,000 rows from the 10,000-row DataFrame itemsDf, without any duplicates, returning the same rows even if the code block is run twice?

Options:

itemsDf.sampleBy("row", fractions={0: 0.1}, seed=82371)

itemsDf.sample(fraction=0.1, seed=87238)

itemsDf.sample(fraction=1000, seed=98263)

itemsDf.sample(withReplacement=True, fraction=0.1, seed=23536)

itemsDf.sample(fraction=0.1)

Answer:

Explanation:

itemsDf.sample(fraction=0.1, seed=87238)

Correct. If itemsDf has 10,000 rows, this code block returns about 1,000 rows. The count is approximate because DataFrame.sample() does not guarantee an exact number of rows. To ensure you are not returning duplicates, you should leave the withReplacement parameter at False, which is the default. Since the question specifies that the same rows should be returned even if the code block is run twice, you need to specify a seed. The number passed in the seed does not matter as long as it is an integer.
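
To make the roles of the fraction and the seed concrete, here is a minimal plain-Python sketch. It is not Spark's implementation: the helper name bernoulli_sample is hypothetical, and keeping each row by an independent coin flip is a simplifying assumption about how fraction-based sampling without replacement behaves.

```python
import random


def bernoulli_sample(rows, fraction, seed):
    """Keep each row independently with probability `fraction` (hypothetical helper)."""
    # A dedicated generator means the seed alone determines which rows are kept.
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


rows = list(range(10_000))
first = bernoulli_sample(rows, fraction=0.1, seed=87238)
second = bernoulli_sample(rows, fraction=0.1, seed=87238)

print(len(first))                     # close to 1,000, but not guaranteed to be exactly 1,000
print(first == second)                # True: same seed, same rows
print(len(set(first)) == len(first))  # True: each row is considered once, so no duplicates
```

Each row is visited exactly once, which is why duplicates cannot occur, and the row count varies around fraction times the input size instead of hitting it exactly.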

itemsDf.sample(withReplacement=True, fraction=0.1, seed=23536)

Incorrect. While this code block fulfills almost all requirements, it may return duplicates. This is because withReplacement is set to True.

Here is how to understand what replacement means: Imagine you have a bucket of 10,000 numbered balls and you need to take 1,000 balls at random from the bucket (similar to the problem in the question). If you took those balls with replacement, you would take a ball, note its number, and put it back into the bucket, so the next time you take a ball there would be a chance of drawing the exact same ball again. If you took the balls without replacement, you would leave each ball outside the bucket and not put it back in as you take the next 999 balls.
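
The bucket analogy can be sketched with Python's standard random module. This mirrors the analogy only, not how Spark samples internally, and the variable names are made up for illustration.

```python
import random

rng = random.Random(23536)
balls = list(range(10_000))  # the bucket of numbered balls

# Without replacement: a drawn ball stays outside the bucket.
without_replacement = rng.sample(balls, k=1_000)

# With replacement: every ball goes back before the next draw.
with_replacement = rng.choices(balls, k=1_000)

print(len(set(without_replacement)))  # always 1000 distinct balls
print(len(set(with_replacement)))     # 1000 or fewer, because repeats are possible
```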

itemsDf.sample(fraction=1000, seed=98263)

Wrong. The fraction parameter needs to have a value between 0 and 1. In this case, it should be 0.1, since 1,000/10,000 = 0.1.

itemsDf.sampleBy("row", fractions={0: 0.1}, seed=82371)

No, DataFrame.sampleBy() is meant for stratified sampling. This means that based on the values in a column in a DataFrame, you can draw a certain fraction of rows containing those values from the DataFrame (more details linked below). In the scenario at hand, sampleBy is not the right operator to use because you do not have any information about any column that the sampling should depend on.
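
As a rough illustration of what stratified sampling means, the following plain-Python sketch applies a separate keep probability per value of a key column. The helper sample_by, the category column, and the data are all hypothetical and chosen only for this example. Treating values missing from the fractions mapping as probability 0 follows the behavior documented for DataFrame.sampleBy.

```python
import random


def sample_by(rows, key, fractions, seed):
    """Keep each row with the probability assigned to its stratum (hypothetical helper).

    Strata that are missing from `fractions` get probability 0 and are dropped.
    """
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fractions.get(row[key], 0.0)]


# Made-up data: half of the rows fall in category 0, half in category 1.
rows = [{"itemId": i, "category": i % 2} for i in range(10_000)]
sampled = sample_by(rows, "category", fractions={0: 0.1}, seed=82371)

print({row["category"] for row in sampled})  # {0}: category 1 has no fraction, so it is dropped
```

The sketch shows why this operator does not fit the question: the result depends on how rows are distributed across the key column, and rows outside the listed strata never appear.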

itemsDf.sample(fraction=0.1)

Incorrect. This code block checks all the boxes except that it does not ensure that when you run it a second time, the exact same rows will be returned. In order to achieve this, you would have to specify a seed.

More info:

- pyspark.sql.DataFrame.sample — PySpark 3.1.2 documentation

- pyspark.sql.DataFrame.sampleBy — PySpark 3.1.2 documentation

- Types of Samplings in PySpark 3. The explanations of the sampling… | by Pinar Ersoy | Towards Data Science

Question 27

Which of the following describes the conversion of a computational query into an execution plan in Spark?

Options:

Spark uses the catalog to resolve the optimized logical plan.

The catalog assigns specific resources to the optimized memory plan.

The executed physical plan depends on a cost optimization from a previous stage.

Depending on whether DataFrame API or SQL API are used, the physical plan may differ.

The catalog assigns specific resources to the physical plan.

Exam Detail

Vendor: Databricks

Certification: Databricks Certification

Exam Code: Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0

Exam Name: Databricks Certified Associate Developer for Apache Spark 3.0 Exam

Last Update: Jun 15, 2025

Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 Question Answers
