
Databricks Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 Dumps

Databricks Certified Associate Developer for Apache Spark 3.0 Exam Questions and Answers

Question 1

Which of the following code blocks returns a one-column DataFrame for which every row contains an array of all integer numbers from 0 up to and including the number given in column predError of DataFrame transactionsDf, and null if predError is null?

Sample of DataFrame transactionsDf:

+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            1|        3|    4|     25|        1|null|
|            2|        6|    7|      2|        2|null|
|            3|        3| null|     25|        3|null|
|            4|     null| null|      3|        2|null|
|            5|     null| null|   null|        2|null|
|            6|        3|    2|     25|        2|null|
+-------------+---------+-----+-------+---------+----+

Options:

A.

def count_to_target(target):
    if target is None:
        return

    result = [range(target)]
    return result

count_to_target_udf = udf(count_to_target, ArrayType[IntegerType])

transactionsDf.select(count_to_target_udf(col('predError')))

B.

def count_to_target(target):
    if target is None:
        return

    result = list(range(target))
    return result

transactionsDf.select(count_to_target(col('predError')))

C.

def count_to_target(target):
    if target is None:
        return

    result = list(range(target))
    return result

count_to_target_udf = udf(count_to_target, ArrayType(IntegerType()))

transactionsDf.select(count_to_target_udf('predError'))

(Correct)

D.

def count_to_target(target):
    result = list(range(target))
    return result

count_to_target_udf = udf(count_to_target, ArrayType(IntegerType()))

df = transactionsDf.select(count_to_target_udf('predError'))

E.

def count_to_target(target):
    if target is None:
        return

    result = list(range(target))
    return result

count_to_target_udf = udf(count_to_target)

transactionsDf.select(count_to_target_udf('predError'))
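
As a side note on the Python logic these options share, the sketch below runs the helper as a plain function, so it can be checked without a Spark session. Wrapping it with udf(count_to_target, ArrayType(IntegerType())), as option C does, is left out here. One detail worth seeing directly is that range(target) stops at target - 1. The second function, count_through_target, is a hypothetical variant that does not appear in any option and only illustrates an inclusive upper bound.

```python
from typing import List, Optional


def count_to_target(target: Optional[int]) -> Optional[List[int]]:
    """Function body shared by options B, C and E: None in, None out."""
    if target is None:
        return None
    # range(target) yields 0 .. target - 1, so target itself is excluded.
    return list(range(target))


def count_through_target(target: Optional[int]) -> Optional[List[int]]:
    """Hypothetical variant (not in the options) with an inclusive bound."""
    if target is None:
        return None
    return list(range(target + 1))


print(count_to_target(3))       # [0, 1, 2]
print(count_through_target(3))  # [0, 1, 2, 3]
print(count_to_target(None))    # None
```

Option D omits the None check, so in plain Python the call range(None) would raise a TypeError for rows where predError is null.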

Question 2

Which of the following DataFrame operators is never classified as a wide transformation?

Options:

A.

DataFrame.sort()

B.

DataFrame.aggregate()

C.

DataFrame.repartition()

D.

DataFrame.select()

E.

DataFrame.join()

Question 3

Which of the following code blocks stores DataFrame itemsDf in executor memory and, if insufficient memory is available, serializes it and saves it to disk?

Options:

A.

itemsDf.persist(StorageLevel.MEMORY_ONLY)

B.

itemsDf.cache(StorageLevel.MEMORY_AND_DISK)

C.

itemsDf.store()

D.

itemsDf.cache()

E.

itemsDf.write.option('destination', 'memory').save()

Question 4

The code block shown below should show information about the data type that column storeId of DataFrame transactionsDf contains. Choose the answer that correctly fills the blanks in the code block to accomplish this.

Code block:

transactionsDf.__1__(__2__).__3__

Options:

A.

1. select

2. "storeId"

3. print_schema()

B.

1. limit

2. 1

3. columns

C.

1. select

2. "storeId"

3. printSchema()

D.

1. limit

2. "storeId"

3. printSchema()

E.

1. select

2. storeId

3. dtypes

Question 5

Which of the following code blocks returns a DataFrame with a single column in which all items in column attributes of DataFrame itemsDf are listed that contain the letter i?

Sample of DataFrame itemsDf:

+------+----------------------------------+-----------------------------+-------------------+
|itemId|itemName                          |attributes                   |supplier           |
+------+----------------------------------+-----------------------------+-------------------+
|1     |Thick Coat for Walking in the Snow|[blue, winter, cozy]         |Sports Company Inc.|
|2     |Elegant Outdoors Summer Dress     |[red, summer, fresh, cooling]|YetiX              |
|3     |Outdoors Backpack                 |[green, summer, travel]      |Sports Company Inc.|
+------+----------------------------------+-----------------------------+-------------------+

Options:

A.

itemsDf.select(explode("attributes").alias("attributes_exploded")).filter(attributes_exploded.contains("i"))

B.

itemsDf.explode(attributes).alias("attributes_exploded").filter(col("attributes_exploded").contains("i"))

C.

itemsDf.select(explode("attributes")).filter("attributes_exploded".contains("i"))

D.

itemsDf.select(explode("attributes").alias("attributes_exploded")).filter(col("attributes_exploded").contains("i"))

E.

itemsDf.select(col("attributes").explode().alias("attributes_exploded")).filter(col("attributes_exploded").contains("i"))

Question 6

The code block displayed below contains an error. The code block below is intended to add a column itemNameElements to DataFrame itemsDf that includes an array of all words in column itemName. Find the error.

Sample of DataFrame itemsDf:

+------+----------------------------------+-------------------+
|itemId|itemName                          |supplier           |
+------+----------------------------------+-------------------+
|1     |Thick Coat for Walking in the Snow|Sports Company Inc.|
|2     |Elegant Outdoors Summer Dress     |YetiX              |
|3     |Outdoors Backpack                 |Sports Company Inc.|
+------+----------------------------------+-------------------+

Code block:

itemsDf.withColumnRenamed("itemNameElements", split("itemName"))


Options:

A.

All column names need to be wrapped in the col() operator.

B.

Operator withColumnRenamed needs to be replaced with operator withColumn and a second argument "," needs to be passed to the split method.

C.

Operator withColumnRenamed needs to be replaced with operator withColumn and the split method needs to be replaced by the splitString method.

D.

Operator withColumnRenamed needs to be replaced with operator withColumn and a second argument " " needs to be passed to the split method.

E.

The expressions "itemNameElements" and split("itemName") need to be swapped.

Question 7

Which of the following code blocks stores a part of the data in DataFrame itemsDf on executors?

Options:

A.

itemsDf.cache().count()

B.

itemsDf.cache(eager=True)

C.

cache(itemsDf)

D.

itemsDf.cache().filter()

E.

itemsDf.rdd.storeCopy()

Question 8

The code block shown below should return a column that indicates through boolean variables whether rows in DataFrame transactionsDf have values greater or equal to 20 and smaller or equal to 30 in column storeId and have the value 2 in column productId. Choose the answer that correctly fills the blanks in the code block to accomplish this.

transactionsDf.__1__((__2__.__3__) __4__ (__5__))

Options:

A.

1. select

2. col("storeId")

3. between(20, 30)

4. and

5. col("productId")==2

B.

1. where

2. col("storeId")

3. geq(20).leq(30)

4. &

5. col("productId")==2

C.

1. select

2. "storeId"

3. between(20, 30)

4. &&

5. col("productId")==2

D.

1. select

2. col("storeId")

3. between(20, 30)

4. &&

5. col("productId")=2

E.

1. select

2. col("storeId")

3. between(20, 30)

4. &

5. col("productId")==2

Question 9

Which of the following code blocks reads in the JSON file stored at filePath as a DataFrame?

Options:

A.

spark.read.json(filePath)

B.

spark.read.path(filePath, source="json")

C.

spark.read().path(filePath)

D.

spark.read().json(filePath)

E.

spark.read.path(filePath)

Question 10

The code block shown below should return a new 2-column DataFrame that shows one attribute from column attributes per row next to the associated itemName, for all suppliers in column supplier whose name includes Sports. Choose the answer that correctly fills the blanks in the code block to accomplish this.

Sample of DataFrame itemsDf:

+------+----------------------------------+-----------------------------+-------------------+
|itemId|itemName                          |attributes                   |supplier           |
+------+----------------------------------+-----------------------------+-------------------+
|1     |Thick Coat for Walking in the Snow|[blue, winter, cozy]         |Sports Company Inc.|
|2     |Elegant Outdoors Summer Dress     |[red, summer, fresh, cooling]|YetiX              |
|3     |Outdoors Backpack                 |[green, summer, travel]      |Sports Company Inc.|
+------+----------------------------------+-----------------------------+-------------------+

Code block:

itemsDf.__1__(__2__).select(__3__, __4__)

Options:

A.

1. filter

2. col("supplier").isin("Sports")

3. "itemName"

4. explode(col("attributes"))

B.

1. where

2. col("supplier").contains("Sports")

3. "itemName"

4. "attributes"

C.

1. where

2. col(supplier).contains("Sports")

3. explode(attributes)

4. itemName

D.

1. where

2. "Sports".isin(col("Supplier"))

3. "itemName"

4. array_explode("attributes")

E.

1. filter

2. col("supplier").contains("Sports")

3. "itemName"

4. explode("attributes")

Question 11

Which of the following code blocks reads in parquet file /FileStore/imports.parquet as a DataFrame?

Options:

A.

spark.mode("parquet").read("/FileStore/imports.parquet")

B.

spark.read.path("/FileStore/imports.parquet", source="parquet")

C.

spark.read().parquet("/FileStore/imports.parquet")

D.

spark.read.parquet("/FileStore/imports.parquet")

E.

spark.read().format('parquet').open("/FileStore/imports.parquet")

Question 12

Which of the following code blocks produces the following output, given DataFrame transactionsDf?

Output:

root
 |-- transactionId: integer (nullable = true)
 |-- predError: integer (nullable = true)
 |-- value: integer (nullable = true)
 |-- storeId: integer (nullable = true)
 |-- productId: integer (nullable = true)
 |-- f: integer (nullable = true)

DataFrame transactionsDf:

+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            1|        3|    4|     25|        1|null|
|            2|        6|    7|      2|        2|null|
|            3|        3| null|     25|        3|null|
+-------------+---------+-----+-------+---------+----+

Options:

A.

transactionsDf.schema.print()

B.

transactionsDf.rdd.printSchema()

C.

transactionsDf.rdd.formatSchema()

D.

transactionsDf.printSchema()

E.

print(transactionsDf.schema)

Question 13

Which of the following statements about data skew is incorrect?

Options:

A.

Spark will not automatically optimize skew joins by default.

B.

Broadcast joins are a viable way to increase join performance for skewed data over sort-merge joins.

C.

In skewed DataFrames, the largest and the smallest partition consume very different amounts of memory.

D.

To mitigate skew, Spark automatically disregards null values in keys when joining.

E.

Salting can resolve data skew.
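
To make the term salting from option E concrete, here is a rough plain-Python model of the idea. It is not Spark code: a hot key is split into several salted keys so that a hash partitioner spreads its rows over more partitions. The partition count, the number of salt values, the key names and the crc32-based partitioner are all assumptions chosen for illustration.

```python
import random
import zlib
from collections import Counter

NUM_PARTITIONS = 8  # assumed partition count
NUM_SALTS = 8       # assumed number of distinct salt values


def partition_of(key: str) -> int:
    """Toy hash partitioner; crc32 is used only because it is deterministic."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def partition_sizes(keys):
    """Count how many rows land in each partition."""
    return Counter(partition_of(k) for k in keys)


# Skewed input: one hot key with 9,000 rows plus 1,000 rows over 100 other keys.
rng = random.Random(0)
keys = ["hot"] * 9000 + [f"key{i % 100}" for i in range(1000)]

# Salting: append a random suffix so the hot key becomes several distinct keys.
salted = [f"{k}_{rng.randrange(NUM_SALTS)}" for k in keys]

largest_before = max(partition_sizes(keys).values())
largest_after = max(partition_sizes(salted).values())
print(largest_before, largest_after)
```

In a real join the other side would also need matching salted keys, which this model leaves out.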

Question 14

The code block shown below should write DataFrame transactionsDf as a parquet file to path storeDir, using brotli compression and replacing any previously existing file. Choose the answer that correctly fills the blanks in the code block to accomplish this.

transactionsDf.__1__.format("parquet").__2__(__3__).option(__4__, "brotli").__5__(storeDir)

Options:

A.

1. save

2. mode

3. "ignore"

4. "compression"

5. path

B.

1. store

2. with

3. "replacement"

4. "compression"

5. path

C.

1. write

2. mode

3. "overwrite"

4. "compression"

5. save

(Correct)

D.

1. save

2. mode

3. "replace"

4. "compression"

5. path

E.

1. write

2. mode

3. "overwrite"

4. compression

5. parquet

Question 15

Which of the following statements about DAGs is correct?

Options:

A.

DAGs help direct how Spark executors process tasks, but are a limitation to the proper execution of a query when an executor fails.

B.

DAG stands for "Directing Acyclic Graph".

C.

Spark strategically hides DAGs from developers, since the high degree of automation in Spark means that developers never need to consider DAG layouts.

D.

In contrast to transformations, DAGs are never lazily executed.

E.

DAGs can be decomposed into tasks that are executed in parallel.

Question 16

Which of the following code blocks returns a DataFrame where columns predError and productId are removed from DataFrame transactionsDf?

Sample of DataFrame transactionsDf:

+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|f   |
+-------------+---------+-----+-------+---------+----+
|1            |3        |4    |25     |1        |null|
|2            |6        |7    |2      |2        |null|
|3            |3        |null |25     |3        |null|
+-------------+---------+-----+-------+---------+----+

Options:

A.

transactionsDf.withColumnRemoved("predError", "productId")

B.

transactionsDf.drop(["predError", "productId", "associateId"])

C.

transactionsDf.drop("predError", "productId", "associateId")

D.

transactionsDf.dropColumns("predError", "productId", "associateId")

E.

transactionsDf.drop(col("predError", "productId"))

Question 17

Which of the following code blocks creates a new DataFrame with two columns season and wind_speed_ms where column season is of data type string and column wind_speed_ms is of data type double?

Options:

A.

spark.DataFrame({"season": ["winter","summer"], "wind_speed_ms": [4.5, 7.5]})

B.

spark.createDataFrame([("summer", 4.5), ("winter", 7.5)], ["season", "wind_speed_ms"])

C.

from pyspark.sql import types as T
spark.createDataFrame((("summer", 4.5), ("winter", 7.5)), T.StructType([T.StructField("season", T.CharType()), T.StructField("season", T.DoubleType())]))

D.

spark.newDataFrame([("summer", 4.5), ("winter", 7.5)], ["season", "wind_speed_ms"])

E.

spark.createDataFrame({"season": ["winter","summer"], "wind_speed_ms": [4.5, 7.5]})

Question 18

The code block displayed below contains an error. The code block should produce a DataFrame with color as the only column and three rows with color values of red, blue, and green, respectively. Find the error.

Code block:

spark.createDataFrame([("red",), ("blue",), ("green",)], "color")

Options:

A.

Instead of calling spark.createDataFrame, just DataFrame should be called.

B.

The commas in the tuples with the colors should be eliminated.

C.

The colors red, blue, and green should be expressed as a simple Python list, and not a list of tuples.

D.

Instead of color, a data type should be specified.

E.

The "color" expression needs to be wrapped in brackets, so it reads ["color"].

Question 19

The code block displayed below contains multiple errors. The code block should return a DataFrame that contains only columns transactionId, predError, value and storeId of DataFrame transactionsDf. Find the errors.

Code block:

transactionsDf.select([col(productId), col(f)])

Sample of transactionsDf:

+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            1|        3|    4|     25|        1|null|
|            2|        6|    7|      2|        2|null|
|            3|        3| null|     25|        3|null|
+-------------+---------+-----+-------+---------+----+

Options:

A.

The column names should be listed directly as arguments to the operator and not as a list.

B.

The select operator should be replaced by a drop operator, the column names should be listed directly as arguments to the operator and not as a list, and all column names should be expressed as strings without being wrapped in a col() operator.

C.

The select operator should be replaced by a drop operator.

D.

The column names should be listed directly as arguments to the operator and not as a list and, following the pattern of how column names are expressed in the code block, columns productId and f should be replaced by transactionId, predError, value and storeId.

E.

The select operator should be replaced by a drop operator, the column names should be listed directly as arguments to the operator and not as a list, and all col() operators should be removed.

Question 20

Which of the following code blocks returns a DataFrame that has all columns of DataFrame transactionsDf and an additional column predErrorSquared which is the squared value of column predError in DataFrame transactionsDf?

Options:

A.

transactionsDf.withColumn("predError", pow(col("predErrorSquared"), 2))

B.

transactionsDf.withColumnRenamed("predErrorSquared", pow(predError, 2))

C.

transactionsDf.withColumn("predErrorSquared", pow(col("predError"), lit(2)))

D.

transactionsDf.withColumn("predErrorSquared", pow(predError, lit(2)))

E.

transactionsDf.withColumn("predErrorSquared", "predError"**2)

Question 21

The code block displayed below contains an error. The code block should count the number of rows that have a predError of either 3 or 6. Find the error.

Code block:

transactionsDf.filter(col('predError').in([3, 6])).count()

Options:

A.

The number of rows cannot be determined with the count() operator.

B.

Instead of filter, the select method should be used.

C.

The method used on column predError is incorrect.

D.

Instead of a list, the values need to be passed as single arguments to the in operator.

E.

Numbers 3 and 6 need to be passed as string variables.

Question 22

The code block displayed below contains an error. The code block should return a DataFrame where all entries in column supplier contain the letter combination et in this order. Find the error.

Code block:

itemsDf.filter(Column('supplier').isin('et'))

Options:

A.

The Column operator should be replaced by the col operator and instead of isin, contains should be used.

B.

The expression inside the filter parenthesis is malformed and should be replaced by isin('et', 'supplier').

C.

Instead of isin, it should be checked whether column supplier contains the letters et, so isin should be replaced with contains. In addition, the column should be accessed using col['supplier'].

D.

The expression only returns a single column and filter should be replaced by select.

Question 23

The code block shown below should return a DataFrame with only columns from DataFrame transactionsDf for which there is a corresponding transactionId in DataFrame itemsDf. DataFrame itemsDf is very small and much smaller than DataFrame transactionsDf. The query should be executed in an optimized way. Choose the answer that correctly fills the blanks in the code block to accomplish this.

__1__.__2__(__3__, __4__, __5__)

Options:

A.

1. transactionsDf

2. join

3. broadcast(itemsDf)

4. transactionsDf.transactionId==itemsDf.transactionId

5. "outer"

B.

1. transactionsDf

2. join

3. itemsDf

4. transactionsDf.transactionId==itemsDf.transactionId

5. "anti"

C.

1. transactionsDf

2. join

3. broadcast(itemsDf)

4. "transactionId"

5. "left_semi"

D.

1. itemsDf

2. broadcast

3. transactionsDf

4. "transactionId"

5. "left_semi"

E.

1. itemsDf

2. join

3. broadcast(transactionsDf)

4. "transactionId"

5. "left_semi"
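
To make the term left semi join from the options concrete, here is a plain-Python model of its semantics. It is not Spark code. The keys of the small side are collected into a set once, loosely analogous to what broadcasting a small DataFrame makes possible, and the large side is then filtered against that set. The function name left_semi_join and all sample rows are invented for illustration.

```python
from typing import Dict, Iterable, List

Row = Dict[str, object]


def left_semi_join(left: Iterable[Row], right: Iterable[Row], key: str) -> List[Row]:
    """Keep left rows whose key occurs on the right; right columns are never added."""
    # Build the lookup from the small side once, much like shipping a
    # copy of the small table to every worker.
    right_keys = {row[key] for row in right}
    return [row for row in left if row[key] in right_keys]


# Invented sample rows for illustration.
transactions = [
    {"transactionId": 1, "value": 4},
    {"transactionId": 2, "value": 7},
    {"transactionId": 3, "value": None},
]
items = [
    {"transactionId": 1, "itemName": "coat"},
    {"transactionId": 1, "itemName": "dress"},
    {"transactionId": 3, "itemName": "backpack"},
]

result = left_semi_join(transactions, items, "transactionId")
print(result)
```

Transaction 1 matches two item rows but appears once in the result, which is the main difference from an inner join.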

Question 24

In which order should the code blocks shown below be run in order to create a table of all values in column attributes next to the respective values in column supplier in DataFrame itemsDf?

1. itemsDf.createOrReplaceView("itemsDf")

2. spark.sql("FROM itemsDf SELECT 'supplier', explode('Attributes')")

3. spark.sql("FROM itemsDf SELECT supplier, explode(attributes)")

4. itemsDf.createOrReplaceTempView("itemsDf")

Options:

A.

4, 3

B.

1, 3

C.

2

D.

4, 2

E.

1, 2

Question 25

Which of the following code blocks shuffles DataFrame transactionsDf, which has 8 partitions, so that it has 10 partitions?

Options:

A.

transactionsDf.repartition(transactionsDf.getNumPartitions()+2)

B.

transactionsDf.repartition(transactionsDf.rdd.getNumPartitions()+2)

C.

transactionsDf.coalesce(10)

D.

transactionsDf.coalesce(transactionsDf.getNumPartitions()+2)

E.

transactionsDf.repartition(transactionsDf._partitions+2)

Question 26

Which of the following code blocks returns a DataFrame with approximately 1,000 rows from the 10,000-row DataFrame itemsDf, without any duplicates, returning the same rows even if the code block is run twice?

Options:

A.

itemsDf.sampleBy("row", fractions={0: 0.1}, seed=82371)

B.

itemsDf.sample(fraction=0.1, seed=87238)

C.

itemsDf.sample(fraction=1000, seed=98263)

D.

itemsDf.sample(withReplacement=True, fraction=0.1, seed=23536)

E.

itemsDf.sample(fraction=0.1)
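
The options above differ in withReplacement, fraction and seed. The following plain-Python model, which is not Spark's implementation, sketches what sampling without replacement with a fixed seed implies: each row is kept independently with probability fraction, so the result size is only approximately fraction times the row count, no row can appear twice, and the same seed reproduces the same rows. The function name bernoulli_sample is hypothetical.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def bernoulli_sample(rows: Sequence[T], fraction: float, seed: int) -> List[T]:
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return [row for row in rows if rng.random() < fraction]


rows = list(range(10_000))
first = bernoulli_sample(rows, fraction=0.1, seed=87238)
second = bernoulli_sample(rows, fraction=0.1, seed=87238)

print(len(first))                     # roughly 1,000
print(first == second)                # True: same seed, same rows
print(len(set(first)) == len(first))  # True: no row is drawn twice
```

Without a seed, two runs would generally return different rows, and sampling with replacement could return the same row more than once.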

Question 27

Which of the following describes the conversion of a computational query into an execution plan in Spark?

Options:

A.

Spark uses the catalog to resolve the optimized logical plan.

B.

The catalog assigns specific resources to the optimized memory plan.

C.

The executed physical plan depends on a cost optimization from a previous stage.

D.

Depending on whether DataFrame API or SQL API are used, the physical plan may differ.

E.

The catalog assigns specific resources to the physical plan.