NCP-AII Premium Exam Questions

NVIDIA AI Infrastructure Questions and Answers

Question 29

You are leading a project to enhance the energy efficiency of a data center that heavily relies on AI workloads. NVIDIA suggests moving beyond traditional metrics like Power Usage Effectiveness (PUE) to better capture the efficiency of modern data centers. Which strategy should you prioritize?

Options:

Use Power Usage Effectiveness as the primary metric while supplementing it with additional measures of useful work done per unit of energy.

Use watts used as the primary measure of efficiency, as it accurately reflects the power input at any given time.

Develop benchmarks tailored to specific workloads, such as MLPerf for AI applications, to better understand energy use in real-world scenarios.

Focus on integrating kilowatt-hours into existing metrics to better reflect the actual energy used for productive work.
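For context on the metrics named in this question, the following is a minimal sketch of how PUE differs from a measure of useful work per unit of energy. PUE is total facility energy divided by IT equipment energy, so it says nothing about how much work the IT equipment actually completes. The function names and all numbers below are hypothetical and chosen only for illustration.

```python
def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """Power Usage Effectiveness: total facility energy / IT equipment energy."""
    if it_equipment_kwh <= 0:
        raise ValueError("IT equipment energy must be positive")
    return total_facility_kwh / it_equipment_kwh


def work_per_kwh(units_of_work: float, energy_kwh: float) -> float:
    """Useful work delivered per kilowatt-hour (for example, samples per kWh)."""
    if energy_kwh <= 0:
        raise ValueError("energy must be positive")
    return units_of_work / energy_kwh


# Hypothetical facilities: identical energy draw, different amounts of work done.
facility_a = {"total_kwh": 1500.0, "it_kwh": 1000.0, "samples": 2_000_000}
facility_b = {"total_kwh": 1500.0, "it_kwh": 1000.0, "samples": 6_000_000}

for name, f in (("A", facility_a), ("B", facility_b)):
    print(
        name,
        "PUE:", pue(f["total_kwh"], f["it_kwh"]),
        "samples/kWh:", work_per_kwh(f["samples"], f["total_kwh"]),
    )
```

Both hypothetical facilities report the same PUE of 1.5, yet one completes three times as much work for the same energy, which is the gap a workload-level measurement is meant to expose.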

Question 30

During cluster deployment, the UFM Cable Validation Tool reports "Wrong-neighbor" errors on multiple InfiniBand links. What is the most efficient way to resolve this issue?

Options:

Reboot all leaf switches to force LLDP rediscovery.

Replace all affected cables with higher-grade OM5 fiber optics.

Verify LLDP data against topology files and remediate.

Disable FEC on all switches to bypass neighbor validation.
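To illustrate what a wrong-neighbor condition means, here is a rough sketch that compares an expected cabling plan with the neighbors actually observed on each port. It does not reproduce the UFM Cable Validation Tool or its file formats: the data structures, device names, and the `find_wrong_neighbors` helper are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Port = Tuple[str, str]  # (device name, port label); hypothetical representation


def find_wrong_neighbors(
    expected: Dict[Port, Port],
    observed: Dict[Port, Port],
) -> List[Tuple[Port, Port, Optional[Port]]]:
    """Return (local port, expected neighbor, observed neighbor) for each mismatch."""
    mismatches = []
    for local, want in sorted(expected.items()):
        got = observed.get(local)  # None means no neighbor was reported
        if got != want:
            mismatches.append((local, want, got))
    return mismatches


# Hypothetical cabling plan and discovery result: two uplinks are swapped.
expected = {
    ("leaf01", "1"): ("spine01", "1"),
    ("leaf01", "2"): ("spine02", "1"),
    ("leaf02", "1"): ("spine01", "2"),
}
observed = {
    ("leaf01", "1"): ("spine02", "1"),
    ("leaf01", "2"): ("spine01", "1"),
    ("leaf02", "1"): ("spine01", "2"),
}

for local, want, got in find_wrong_neighbors(expected, observed):
    print(f"{local}: expected {want}, observed {got}")
```

In this made-up data the two `leaf01` uplinks are crossed, so the report points at specific cables to move rather than at the switches or the cable type.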

Question 31

You are training a deep neural network using NCCL to coordinate communication across four GPUs in a single node. During early performance testing, you notice inconsistent scaling and longer-than-expected training times, even though all GPUs are being used. Which strategy would most effectively improve NCCL efficiency and collective operation performance in this setting?

Options:

Adjust the batch size so that each GPU receives an equal-sized portion of the batch, ensuring all GPUs process similar workloads and communication is evenly distributed.

Assign the largest possible workload to the first GPU to maximize its utilization, and allow the remaining GPUs to process smaller or variable batch sizes as needed.

Disable automatic load balancing so that the deep learning framework can dynamically assign samples to any GPU available during each iteration.

Increase the communication frequency between GPUs while allowing workloads to be unevenly split, so synchronization is more frequent and model updates happen faster.
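As background for this scenario: collective operations such as all-reduce synchronize every participating GPU, so a step can finish no faster than its slowest rank. The sketch below only shows the arithmetic of dividing a global batch across ranks. It does not call NCCL or any deep learning framework, and both helper names are hypothetical.

```python
from typing import List


def per_gpu_batch_sizes(global_batch: int, num_gpus: int) -> List[int]:
    """Split a global batch across GPUs so that sizes differ by at most one sample."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    base, remainder = divmod(global_batch, num_gpus)
    # The first `remainder` ranks take one extra sample each.
    return [base + (1 if rank < remainder else 0) for rank in range(num_gpus)]


def largest_evenly_divisible_batch(global_batch: int, num_gpus: int) -> int:
    """Largest batch size not exceeding global_batch that divides evenly."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    return global_batch - (global_batch % num_gpus)


print(per_gpu_batch_sizes(256, 4))
print(per_gpu_batch_sizes(258, 4))
print(largest_evenly_divisible_batch(258, 4))
```

A global batch of 256 divides into four equal shards, while 258 leaves two ranks with one extra sample each; rounding the batch down to 256 removes that imbalance.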

Question 32

An InfiniBand server stops working, and a system administrator runs the "ibstat" command, which returns the following output:

CA 'mlx5_1'
    CA type: MT4115
    Number of ports: 2
    Firmware version: 10.20.1010
    Hardware version: 0
    Node GUID: 0x0002c90300002f78
    System image GUID: 0x0002c90300002f7b
    Port 1:
        State: Initializing
        Physical state: LinkUp
        Rate: 100
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x0251086a
        Port GUID: 0x0002c90300002f79
        Link layer: InfiniBand

What is the cause of the issue?

Options:

The HCA port is faulty.

There is no running SM in the fabric.

The neighboring switch port is faulty.

The cable is disconnected.
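When reading output like this, it can help to pull out the per-port fields that distinguish the logical link state from the physical one. The following is a small sketch of a parser for text in the layout shown in the question. The `parse_ibstat_ports` helper is hypothetical, and it reads a pasted string rather than invoking `ibstat` itself.

```python
import re
from typing import Dict, Optional


def parse_ibstat_ports(text: str) -> Dict[int, Dict[str, str]]:
    """Collect the 'key: value' fields listed under each 'Port N:' heading."""
    ports: Dict[int, Dict[str, str]] = {}
    current: Optional[int] = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        heading = re.fullmatch(r"Port (\d+):", line)
        if heading:
            current = int(heading.group(1))
            ports[current] = {}
            continue
        if current is not None and ":" in line:
            key, _, value = line.partition(":")
            ports[current][key.strip()] = value.strip()
    return ports


# Abbreviated sample in the same layout as the output in the question.
sample = """
CA 'mlx5_1'
    Number of ports: 2
    Port 1:
        State: Initializing
        Physical state: LinkUp
        Base lid: 0
        SM lid: 0
        Port GUID: 0x0002c90300002f79
        Link layer: InfiniBand
"""

ports = parse_ibstat_ports(sample)
for number, fields in ports.items():
    summary = {k: fields.get(k) for k in ("State", "Physical state", "Base lid", "SM lid")}
    print(number, summary)
```

The four fields in the summary are the ones the answer options hinge on: "Physical state" describes the cable and port hardware, while "State", "Base lid", and "SM lid" describe whether the fabric has configured the port.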
