
Free and Premium NVIDIA NCP-AII Dumps Questions Answers

Page: 1 / 5
Total 71 questions

NVIDIA AI Infrastructure Questions and Answers

Question 1

For an NVIDIA Enterprise AI Factory with 256 GPUs, which storage solution characteristic is most critical to validate during scaling tests?

Options:

A.

Consistent per-node throughput >8 GiB/s.

B.

Single-node write performance during idle clusters.

C.

RAID rebuild times under disk failure.

D.

Maximum 4K random read IOPS exceeding 1 million.

Question 2

A financial services firm is deploying an AI model for fraud detection that requires rapid inference and data retrieval across multiple sites. Which feature should their storage system prioritize?

Options:

A.

Multi-protocol data access with low latency.

B.

High capacity with moderate speed.

C.

Tape backup systems.

D.

Low-cost HDD solutions.

Question 3

You are following the official steps to install the NVIDIA Container Toolkit using a package manager on Ubuntu. After importing the NVIDIA package repository and GPG key, what is the next action?

Options:

A.

Reboot the host system to apply the repository changes and proceed.

B.

Install the nvidia-container-toolkit package using your package manager.

C.

Format the disk to clear any existing NVIDIA-related dependencies first.

D.

Download the CUDA toolkit installer from NVIDIA's official website.

Question 4

A system administrator is installing a GPU into a server and needs to avoid damaging the device. What item should be used?

Options:

A.

Anti-ESD strap

B.

Gloves

C.

Protective film

D.

Electric screwdriver

Question 5

Why is it important to provide a large and high-performance local cache (using SSDs configured as RAID-0) for deep learning workloads on DGX systems?

Options:

A.

Local SSD cache allows users to increase the number of NFS threads on the server without impacting storage reliability.

B.

Using local SSD cache in RAID-0 enables direct GPU access to files without host CPU involvement, further boosting performance.

C.

Local SSD cache in RAID-0 is necessary to provide redundancy in case one of the drives fails during long training runs.

D.

A local SSD cache in RAID-0 ensures that most training data is read only once from the network, significantly reducing NFS traffic.

Question 6

A cluster administrator needs to validate transceiver firmware versions across 200 ports using UFM. Which GUI-based method provides a consolidated view?

Options:

A.

Navigate to "Devices" > select a switch > "Cables" tab to see ASIC firmware and transceiver versions.

B.

Use "Topology" view to visually inspect cable icons.

C.

Run mlxlink -d lid- -m on each port manually.

D.

Export all switch logs and grep for "FW Version".

Question 7

To validate bisectional bandwidth across two racks in a Spectrum-X Ethernet fabric, which NCCL test configuration isolates East-West traffic?

Options:

A.

NCCL_TESTS_SPLIT="OR 0x7" ./all_reduce_perf -g 8

B.

Run without splits and analyze per-rack averages.

C.

NCCL_TESTS_SPLIT="MOD 2" ./all_reduce_perf -g 8

D.

NCCL_TESTS_SPLIT="DIV 8" ./all_reduce_perf -g 1
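
As a rough illustration of how the split expressions in the options above partition ranks, the sketch below assumes that a group "color" is derived by applying the operator to each rank (`rank <op> value`), which is how the nccl-tests README describes `NCCL_TESTS_SPLIT`. The helper name `split_groups` and the 16-rank layout (two nodes of 8 GPUs) are hypothetical and chosen only for illustration; this is not the nccl-tests implementation.

```python
from collections import defaultdict

# Assumed semantics: color = rank <op> value; ranks sharing a color form one group.
OPS = {
    "AND": lambda rank, value: rank & value,
    "OR": lambda rank, value: rank | value,
    "MOD": lambda rank, value: rank % value,
    "DIV": lambda rank, value: rank // value,
}


def split_groups(spec, n_ranks):
    """Return the rank groups produced by a split spec such as "MOD 8"."""
    op, raw_value = spec.split()
    value = int(raw_value, 0)  # base 0 accepts both "8" and "0x7"
    groups = defaultdict(list)
    for rank in range(n_ranks):
        groups[OPS[op](rank, value)].append(rank)
    return sorted(groups.values())


if __name__ == "__main__":
    # Hypothetical layout: ranks 0-7 on node 0, ranks 8-15 on node 1.
    for spec in ("OR 0x7", "DIV 8", "MOD 2", "MOD 8"):
        print(spec, split_groups(spec, 16))
```

Under these assumptions, "OR 0x7" and "DIV 8" keep every group inside a single 8-GPU node, while "MOD 8" places one GPU from each node in every group, so its traffic can only travel over the network.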

Question 8

After ClusterKit reports "GPU-Host latency exceeds threshold," which NVIDIA diagnostic tool should be used to isolate hardware faults?

Options:

A.

Re-run ClusterKit with --stress=gpu -Y 60 to extend test duration

B.

nvidia-smi topo -m to inspect GPU topology connections

C.

DCGM diagnostics: dcgmi diag -r 2

D.

ib_write_bw to measure InfiniBand bandwidth between nodes

Question 9

A customer is designing an AI Factory for enterprise-scale deployments and wants to ensure redundancy and load balancing for the management and storage networks. Which feature should be implemented on the Ethernet switches?

Options:

A.

Implement redundant switches with spanning tree protocol.

B.

MLAG for bonded interfaces across redundant switches.

C.

Use only one switch for all management and storage traffic.

D.

Disable VLANs and use unmanaged switches.

Question 10

ClusterKit's NCCL bandwidth test shows 350 GB/s on a 400G InfiniBand fabric. How should this result be interpreted?

Options:

A.

Optimal performance, indicating healthy fabric and GPUDirect RDMA.

B.

Suboptimal performance; requires FEC tuning to reach 380+ GB/s.

C.

Critical failure; expected is >390 GB/s for HDR InfiniBand.

D.

Inconclusive; rerun with --stress=cpu to validate.
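
When comparing a figure reported in GB/s against a link speed quoted in Gb/s, it helps to put both in the same unit first. The sketch below is a minimal unit conversion, not a ClusterKit feature; the function names are hypothetical, and the assumption of 8 fabric links per node is made up for illustration and should be replaced with the actual node configuration.

```python
def gbit_to_gbyte_per_s(gbit_per_s):
    """Convert gigabits per second to gigabytes per second (8 bits per byte)."""
    return gbit_per_s / 8.0


def node_line_rate_gbyte(link_gbit_per_s, links_per_node):
    """Aggregate line rate of one node in GB/s, ignoring protocol overhead."""
    return gbit_to_gbyte_per_s(link_gbit_per_s) * links_per_node


def fraction_of_line_rate(measured_gbyte_per_s, link_gbit_per_s, links_per_node):
    """Measured bandwidth as a fraction of the aggregate line rate."""
    return measured_gbyte_per_s / node_line_rate_gbyte(link_gbit_per_s, links_per_node)


if __name__ == "__main__":
    # Assumption for illustration only: 8 links of 400 Gb/s per node.
    print(node_line_rate_gbyte(400, 8))
    print(fraction_of_line_rate(350, 400, 8))
```

A single 400 Gb/s link carries at most 50 GB/s, so a measured value in the hundreds of GB/s only makes sense as an aggregate over several links.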

Question 11

After upgrading to HPL-AI 2.0 on a DGX A100 cluster, a 2x performance gain is observed. Which optimization is primarily responsible for this improvement?

Options:

A.

Reduction of problem size (N) to accelerate computation.

B.

MPI-aware GPU communication that reduces CPU bottlenecks and GPU idle time.

C.

Doubling of GPU clock speeds through firmware updates and relevant configuration.

D.

Automatic NVLink bandwidth doubling via driver updates.

Question 12

An administrator needs to verify HA functionality after configuring BCM (Bright Cluster Manager). Which command confirms the active head node and failover readiness?

Options:

A.

cmsh status to check HA status and active/standby roles.

B.

nvsm show health to validate GPU status on both head nodes.

C.

systemctl restart cmdaemon to force a failover test.

D.

ping to test basic connectivity.

Question 13

A system administrator receives an alert about a potential hardware fault on an NVIDIA DGX A100. The GPU performance seems degraded, and the system fans are operating loudly. What step should be recommended to identify and troubleshoot the hardware fault?

Options:

A.

Run a deep learning workload to stress test the GPUs and check whether the issue persists.

B.

Check the NVIDIA System Management Interface (nvidia-smi) for GPU status and temperatures.

C.

Perform a power drain, restart the DGX, and check whether the performance degradation resolves.

D.

Increase the fan speed to maximum and check whether the performance improves.

Question 14

During East-West fabric validation on a 64-GPU cluster, an engineer runs all_reduce_perf and observes an algorithm bandwidth of 350 GB/s and bus bandwidth of 656 GB/s. What does this indicate about the fabric performance?

Options:

A.

Inconclusive; rerun with point-to-point tests.

B.

Optimal performance; bus bandwidth near theoretical peak for NDR InfiniBand.

C.

Critical failure; bus bandwidth exceeds hardware capabilities.

D.

Suboptimal performance; algorithm bandwidth should match bus bandwidth.
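
The relationship between the two figures reported by all_reduce_perf can be sketched as follows. The nccl-tests performance notes define bus bandwidth for all-reduce as algorithm bandwidth multiplied by 2(n-1)/n, where n is the number of ranks; the function names here are hypothetical, and the rank counts are illustrative inputs rather than values taken from any real run.

```python
def allreduce_bus_factor(n_ranks):
    """Correction factor 2*(n-1)/n that maps algbw to busbw for all-reduce."""
    if n_ranks < 2:
        raise ValueError("all-reduce needs at least 2 ranks")
    return 2.0 * (n_ranks - 1) / n_ranks


def allreduce_bus_bandwidth(algbw_gbyte_per_s, n_ranks):
    """Bus bandwidth in GB/s for a given algorithm bandwidth and rank count."""
    return algbw_gbyte_per_s * allreduce_bus_factor(n_ranks)


if __name__ == "__main__":
    for n in (2, 16, 64):
        print(n, allreduce_bus_factor(n), allreduce_bus_bandwidth(350.0, n))
```

The factor approaches 2 as the rank count grows, so a bus bandwidth well above the algorithm bandwidth is expected for all-reduce. For example, a factor of 1.875 (16 ranks) maps 350 GB/s to about 656 GB/s.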

Question 15

A team is installing the NVIDIA Run:ai control plane on a Kubernetes cluster. Which two (2) options are most critical to validate before proceeding? (Pick the 2 correct responses below)

Options:

A.

Helm is installed on the installer machine.

B.

Ensure Kubernetes is running on the cluster.

C.

All cluster nodes have NVIDIA GPUs installed.

D.

NTP is disabled to simplify time synchronization.

Question 16

You are a network administrator responsible for configuring an East-West (E/W) Spectrum-X fabric using SuperNICs. The BlueField-3 devices in your network should be set to NIC mode with RoCE enabled to optimize data flow between servers. You have access to the Spectrum-X management tools and the necessary documentation. Which of the following steps and commands are necessary to configure the BlueField-3 devices in NIC mode for the E/W Spectrum-X fabric? (Pick the 2 correct responses below)

Options:

A.

Use the command sudo mlxconfig -d /dev/mst/ set LINK_TYPE_P1=2 to enable Ethernet on the BlueField-3 devices.

B.

Use the command sudo mlxconfig -d /dev/mst/ set DISABLE_SPECTRUM_X=1 to reduce overhead.

C.

Use the command sudo mlxconfig -d /dev/mst/ set INTERNAL_CPU_OFFLOAD_ENGINE=1 to configure the SuperNIC to operate in NIC mode.

D.

Use the command sudo mlxconfig -d /dev/mst/ set DPU_MODE=1 to set up the BlueField-3 devices in DPU mode.

Question 17

One of the nodes in a cluster is not running as fast as the others and the system administrator needs to check the status of the GPUs on that system. What command should be used?

Options:

A.

lspci | grep NVIDIA

B.

nvidia-smi

C.

nvidia-gpu-status

D.

iblinkinfo

Question 18

A systems administrator is preparing a new DGX server for deployment. What is the most secure approach to configuring the BMC port during initial setup?

Options:

A.

Enable remote access to the BMC over the internet using the default admin credentials for initial troubleshooting.

B.

Connect the BMC port directly to the production network and retain default admin credentials for convenience.

C.

Leave the BMC port disconnected until after the operating system is fully configured and in production.

D.

Connect the BMC port to a dedicated and firewalled network and change the default admin credentials.

Question 19

A system engineer needs to set the vGPU scheduling behavior for all GPUs to share the scheduling equally with the default time slice length. What command should be used?

Options:

A.

esxcli system module parameters set -m nvidia -p "NVreg_RegistryDwords=RmPVMRL=0x01"

B.

esxcli graphics module parameters set -m nvidia -p "NVreg_RegistryDwords=RmPVMRL=0x01"

C.

esxcli system module parameters set -m nvidia -p "NVreg_RegistryDwords=FRL=0x01"

D.

esxcli system module parameters set -m nvidia -p "NVreg_RegistryDwords=RmPVMRL=0x00"

Question 20

After configuring NGC CLI with ngc config set, a user receives "Authentication failed" errors when pulling containers. What step was most likely omitted?

Options:

A.

Installing the CLI with apt-get instead of manual extraction.

B.

Entering the API key during ngc config set or storing it in ~/.ngc/config.

C.

Setting --format_type=json to enable API interactions.

D.

Running sudo systemctl restart docker after configuration.

Question 21

An engineer needs to validate 400G DAC cable signal integrity in a DGX cluster. Which CVT metric best identifies marginal cables needing replacement?

Options:

A.

Lane power variance < 3dB across all transceivers.

B.

Transceiver model matching QSFP-DD specifications.

C.

Temperature fluctuations > 5°C during validation.

D.

Effective BER > 1.5E-254 during a <6-hour monitoring window.
