What is the risk associated with converting a large Pandas API on Spark DataFrame back to a Pandas DataFrame?
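For context, a minimal sketch of the conversion being asked about, assuming an illustrative pandas-on-Spark DataFrame named psdf; to_pandas() pulls the entire distributed dataset onto the driver, so the main risk is exhausting driver memory:

import pyspark.pandas as ps

psdf = ps.range(10_000_000)   # distributed pandas API on Spark DataFrame (size is illustrative)
pdf = psdf.to_pandas()        # collects every partition onto the driver; a large dataset can cause a driver out-of-memory error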
A data engineer has been asked to produce a Parquet table which is overwritten every day with the latest data. The downstream consumer of this Parquet table has a hard requirement that the data in this table is produced with all records sorted by the market_time field.
Which line of Spark code will produce a Parquet table that meets these requirements?
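A hedged sketch of one line that could satisfy the requirement, assuming the source DataFrame is called df and using a hypothetical source path and table name; orderBy applies a global sort on market_time before the daily overwrite:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/path/to/source")   # hypothetical source data

(df.orderBy("market_time")        # global sort on the required field
   .write
   .mode("overwrite")             # replaced every day with the latest data
   .format("parquet")
   .saveAsTable("market_data"))   # illustrative table name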
A data scientist is working with a Spark DataFrame called customerDF that contains customer information. The DataFrame has a column named email with customer email addresses. The data scientist needs to split this column into username and domain parts.
Which code snippet splits the email column into username and domain columns?
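One plausible answer, sketched with pyspark.sql.functions.split and assuming the new columns are simply named username and domain:

from pyspark.sql.functions import split, col

customerDF = (customerDF
    .withColumn("username", split(col("email"), "@").getItem(0))   # text before the @
    .withColumn("domain", split(col("email"), "@").getItem(1)))    # text after the @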
A data engineer needs to combine all the rows from one table with all the rows from another, but not all the columns in the first table exist in the second table.
The error message is:
AnalysisException: UNION can only be performed on tables with the same number of columns.
The existing code is:
au_df.union(nz_df)
The DataFrame au_df has one extra column that does not exist in the DataFrame nz_df, but otherwise both DataFrames have the same column names and data types.
What should the data engineer fix in the code to ensure the combined DataFrame can be produced as expected?
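A hedged sketch of one common fix, assuming Spark 3.1 or later, where unionByName accepts allowMissingColumns; the column missing from nz_df is filled with nulls for its rows:

combined_df = au_df.unionByName(nz_df, allowMissingColumns=True)  # matches columns by name; extra au_df column becomes null for nz_df rows
# alternatively, drop the extra column from au_df (or add it to nz_df with lit(None)) before calling union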