Free and Premium NVIDIA NCP-AII Dumps Questions Answers

Page: 1 / 9
Total 123 questions

NVIDIA AI Infrastructure Questions and Answers

Question 1

A user encounters "permission denied" errors when running GPU-accelerated containers on a Secure Boot-enabled system. What resolves this?

Options:

A.

Enroll the MOK and sign NVIDIA kernel modules.

B.

Reinstall Docker without the NVIDIA runtime.

C.

Disable SELinux to relax unnecessary security policies.

D.

Run Docker with sudo for elevated privileges.

Question 2

A system administrator needs to configure a BlueField DPU and enable RShim on the baseboard management controller (BMC). Which command should be executed?

Options:

A.

ipmitool raw 0x32 0x6a 1

B.

systemctl restart rshim

C.

systemctl enable bmc-rshim.service

D.

scp <path_to_bfb> root@<bmc_ip>:/dev/rshim0/boot

Question 3

You are responsible for ensuring interoperability between AI applications deployed across a diverse IT landscape, including an on-premises data center equipped with NVIDIA GPUs and multiple cloud platforms from different vendors. These environments need to support complex AI workflows that involve large-scale data processing, real-time analytics, and machine learning model training. To maintain consistent performance and flexibility, which strategy should you prioritize?

Options:

A.

Choose one vendor and standardize on one storage solution across all environments to simplify management and improve interoperability.

B.

Implement a multi-cloud strategy that uses only native storage solutions in each cloud platform while relying on middleware to ensure interoperability and data consistency.

C.

Ensure that all environments use compatible storage protocols and APIs, such as NFS or S3, to facilitate data exchange and integration across platforms.

D.

Focus only on increasing network bandwidth between locations to reduce latency and improve data transfer speeds.

Question 4

What information does the 'ibnodes' command display?

Options:

A.

All hosts & switches

B.

All host & server names

C.

All server names

D.

All channel adapters

Question 5

A systems administrator is preparing a new DGX server for deployment. What is the most secure approach to configuring the BMC port during initial setup?

Options:

A.

Enable remote access to the BMC over the internet using the default admin credentials for initial troubleshooting.

B.

Connect the BMC port directly to the production network and retain default admin credentials for convenience.

C.

Leave the BMC port disconnected until after the operating system is fully configured and in production.

D.

Connect the BMC port to a dedicated and firewalled network and change the default admin credentials.

Question 6

You are expanding a DGX-based deep learning cluster to train on large, high-resolution images that cannot fit into local cache. Multiple nodes will access this data concurrently and require high performance. Which storage and networking solution best meets these requirements?

Options:

A.

Increase the SSD RAID-0 local cache size in each node so it can absorb most training data, making network storage type and speed less important for performance.

B.

Implement a standard NFS server on a 10GbE network because the cluster can access the export and job performance will not be impacted.

C.

Deploy a high-performance parallel file system across InfiniBand or 40/100GbE, ensuring at least 3 GB/s per node and scalable aggregate bandwidth for all cluster workloads.

D.

Recommend general-purpose object storage for all training data because it is optimized for deep learning workloads and distributed data access at any scale.

Question 7

You are tasked with updating both NVIDIA GPU drivers and DOCA drivers on a set of servers used for AI workloads. The environment previously had an older driver stack and custom kernel modules. What is the most important step to successfully upgrade the drivers without causing conflicts?

Options:

A.

Update the GPU driver leaving the DOCA and OFED drivers unchanged as long as they are detecting the hardware properly.

B.

Validate the driver version post-install since the fresh install will overwrite the legacy drivers.

C.

Keep the older driver running alongside the new version in case you need to roll back the upgrade.

D.

Uninstall all existing GPU and DOCA-related drivers and associated kernel modules before the new install.

Question 8

ClusterKit’s NCCL bandwidth test shows 350 GB/s on a 400G InfiniBand fabric. How should this result be interpreted?

Options:

A.

Critical failure; expected is greater than 390 GB/s for HDR InfiniBand.

B.

Suboptimal performance; requires FEC tuning to reach 380+ GB/s.

C.

Optimal performance, indicating healthy fabric and GPUDirect RDMA.

D.

Inconclusive; rerun with --stress=cpu to validate.
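
One way to put a figure like this in context is to convert the link rate from gigabits to gigabytes per second and compare it with the measurement. The sketch below is only a worked unit conversion: the count of eight 400G ports per node is an assumption made for illustration and is not stated in the question, and how ClusterKit aggregates its result is not covered here.

```python
def gbit_to_gbyte(gbit_per_s: float) -> float:
    """Convert a rate in Gb/s to GB/s (8 bits per byte)."""
    return gbit_per_s / 8.0

PORT_RATE_GBIT = 400.0        # "400G" link rate from the question
ASSUMED_PORTS_PER_NODE = 8    # assumption for illustration, not from the question
MEASURED_GBYTE = 350.0        # measured figure from the question

per_port = gbit_to_gbyte(PORT_RATE_GBIT)
line_rate = per_port * ASSUMED_PORTS_PER_NODE
fraction = MEASURED_GBYTE / line_rate

print(f"per-port line rate: {per_port:.1f} GB/s")
print(f"aggregate line rate: {line_rate:.1f} GB/s")
print(f"measured / line rate: {fraction:.1%}")
```

Under that assumption the measurement can be read as a fraction of theoretical line rate, which protocol overhead keeps below 100% in practice.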

Question 9

During a 72-hour HPL burn-in test on a DGX H100 cluster, one node shows a 15% performance drop after 48 hours. What are the two most likely causes and diagnostic steps?

Pick the 2 correct responses below.

Options:

A.

MPI configuration error; rerun with --cpu-affinity adjustments.

B.

Network packet loss; analyze ibdiagnet reports.

C.

Thermal throttling due to cooling issues; check nvidia-smi dmon.

D.

Memory corruption; reboot the node and reduce problem size N.

Question 10

An InfiniBand administrator needs to run performance benchmarks on new devices added to the fabric. What tool should be used to check the latency?

Options:

A.

tcpdump

B.

ib_write_lat

C.

ibdiagnet

D.

perfmon

Question 11

A system engineer needs to set the vGPU scheduling behavior for all GPUs to share the scheduling equally with the default time slice length. What command should be used?

Options:

A.

esxcli system module parameters set -m nvidia -p "NVreg_RegistryDwords=RmPVMRL=0x01"

B.

esxcli graphics module parameters set -m nvidia -p "NVreg_RegistryDwords=RmPVMRL=0x01"

C.

esxcli system module parameters set -m nvidia -p "NVreg_RegistryDwords=FRL=0x01"

D.

esxcli system module parameters set -m nvidia -p "NVreg_RegistryDwords=RmPVMRL=0x00"

Question 12

During a DGX cluster deployment, what is the most effective way to verify the health and integrity of the local RAID storage array?

Options:

A.

Run a read/write benchmark utility, such as FIO, across the RAID array, looking for expected speed and latency metrics as proof of storage integrity.

B.

Verify that all configured RAID volumes are mounted and available in the operating system, and that disk utilization levels are within recommended limits.

C.

Use the mdadm --examine and mdadm --detail commands to review the RAID array’s status, checking for drive failures, array consistency, and error events.

Question 13

After running a 24-hour stress test on a DGX node, the administrator should verify which two key metrics to ensure system stability?

Options:

A.

Average CPU usage > 80% and Docker container uptime.

B.

No thermal throttling events and consistent GPU utilization > 95% throughout the test.

C.

SSD write endurance and RAM capacity.

D.

Total energy consumption and NVLink bandwidth.

Question 14

A system administrator needs to install a container toolkit and successfully run the following commands:

sudo apt-get update

sudo apt-get install -y nvidia-container-toolkit

sudo nvidia-ctk runtime configure --runtime docker

What step should be taken next to finish the installation?

Options:

A.

dpkg -i doca-host-repo-ubuntu<version>_amd64.deb

B.

apt-get install cuda-drivers

C.

systemctl restart docker

D.

apt-get remove nvidia-container-toolkit

Question 15

What is the purpose of using NCCL in verifying East-West fabric in an NVIDIA AI Factory?

Pick the 2 correct responses below.

Options:

A.

To measure the storage network performance.

B.

To measure the latency between GPUs.

C.

To measure the power consumption of GPUs.

D.

To measure bandwidth between GPUs.

Question 16

A cluster administrator needs to validate transceiver firmware versions across 200 ports using UFM. Which GUI-based method provides a consolidated view?

Options:

A.

Navigate to "Devices" > select a switch > "Cables" tab to see ASIC firmware and transceiver versions.

B.

Use "Topology" view to visually inspect cable icons.

C.

Run mlxlink -d lid-<LID> -m on each port manually.

D.

Export all switch logs and grep for "FW Version".

Question 17

When configuring an out-of-core HPL burn-in for a 40B matrix on 8x H100 nodes, which environment variable prevents GPU out-of-memory errors while reserving space for drivers?

Options:

A.

export HPL_OOC_SAFE_SIZE=4.0

B.

export HPL_OOC_MODE=0

C.

export HPL_OOC_NUM_STREAMS=8

D.

export HPL_OOC_MAX_GPU_MEM=90

Question 18

An enterprise is deploying an AI Factory using NVIDIA DGX BasePOD architecture. The infrastructure team must ensure high availability and efficient data transfer between compute nodes. Which network topology should they implement for the InfiniBand fabric?

Options:

A.

Simple ring topology connecting all nodes in a loop.

B.

Fat-Tree topology with rail-optimized design.

C.

Single flat Ethernet network for all traffic.

D.

Star topology with all nodes connected to a single central switch.

Question 19

During a 48-hour NeMo question-answering model burn-in test, GPU memory errors occur when processing large datasets. Which configuration strategy prevents Out-of-Memory (OOM) errors while maintaining processing efficiency?

Options:

A.

Set blocksize="1GB" for data loading and enable RMM asynchronous allocation.

B.

Switch from FP16 to FP32 precision for numerical stability.

C.

Disable add_filename for Parquet files to reduce metadata.

D.

Increase files_per_partition to 1000 for larger batch processing.

Question 20

A network engineer is tasked with configuring the management, storage, and compute networks for a new DGX BasePOD deployment. Which statement best describes the network segmentation required for optimal operation?

Options:

A.

A single VLAN for all types of network traffic.

B.

Two networks: one for management and one for compute.

C.

Four networks: compute, storage, out-of-band, and management.

Question 21

An infrastructure engineer in an AI factory has successfully replaced a power supply unit on an NVIDIA DGX H100. After installation, both the IN and OUT LEDs on the new power supply illuminate solid green. Which NVSM CLI command should the engineer use to quickly verify the overall system status and ensure it is operating as expected?

Options:

A.

nvsm show power

B.

nvsm show powermode

C.

nvsm show health

D.

nvsm show alerts

Question 22

Refer to the output:

~ $ sudo nvsm show healthinfo

Timestamp: Sat Dec 16 16:26:32 2017 -0800

Version: 17.12-5

Checks

BIOS Revision [5.11].........................

DGX Serial Number [YSY72800016]..................

Verify installed DIMM memory sticks........................Healthy

...[output truncated]

Verify Ethernet controllers...........................Healthy

Verify installed GPU's..............................Unhealthy

Checking output of 'lspci' for expected GPU's

Missing GPU at PCI address '07:00.0'

Verify installed InfiniBand controllers....................Healthy

Verify PCIe switches..................................Healthy

...[output truncated]

What insights can a system administrator gain regarding the DGX system's health?

Options:

A.

A GPU tray upgrade failed.

B.

A GPU is missing on the DGX system.

C.

A GPU driver upgrade has failed.

D.

The system has passed the hardware health check successfully.
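
Reports in this dotted "check....status" layout can also be scanned programmatically. The following is a minimal sketch, assuming the layout shown in the excerpt above; the function name `unhealthy_checks` and the regular expression are illustrative and not part of NVSM, and the sample string is a cleaned-up copy of the lines from the question.

```python
import re

def unhealthy_checks(report: str) -> list[str]:
    """Return the names of 'Verify ...' checks whose status is not 'Healthy'."""
    # A check line is: name, a run of three or more dots, then a one-word status.
    pattern = re.compile(r"^\s*(Verify .+?)\.{3,}\s*(\w+)\s*$")
    failed = []
    for line in report.splitlines():
        match = pattern.match(line)
        if match and match.group(2) != "Healthy":
            failed.append(match.group(1).strip())
    return failed

sample = """\
Verify installed DIMM memory sticks........................Healthy
Verify Ethernet controllers...........................Healthy
Verify installed GPU's..............................Unhealthy
Missing GPU at PCI address '07:00.0'
Verify installed InfiniBand controllers....................Healthy
Verify PCIe switches..................................Healthy
"""

print(unhealthy_checks(sample))
```

Detail lines such as the PCI address do not match the pattern, so only the check names themselves are returned.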

Question 23

You are preparing a Spectrum-based NVIDIA switch for integration into a production AI cluster. To confirm that all modules are running approved firmware versions, you must use the appropriate command from the switch CLI. Which step most accurately meets best practices for ensuring firmware version consistency and cluster compliance?

Options:

A.

Use the show version command to check the overall system version and confirm all modules are updated if the system version matches the documentation.

B.

Use the show interfaces status command to verify all ports are up, and proceed with integration if no interface errors are shown.

C.

Use the show asic-version command to review firmware versions for all modules, then compare these against the documented approved versions.

D.

Use the show inventory command to display component details and serial numbers before proceeding, as this output will include all firmware versions for review.

Question 24

After Spectrum-X fabric deployment, NCCL tests show intermittent latency spikes. Which network condition most severely impacts East-West bandwidth?

Options:

A.

Multiple transceiver firmware mismatches.

B.

400G port utilization at 70% on several nodes during tests.

C.

Jitter below 5 ps with consistent latency.

D.

Packet loss greater than 0.001% causing NCCL pipeline stalls.

Question 25

After updating BlueField-3 DPU BMC firmware via Redfish, the engineer observes “TaskState: Running” but no progress after 15 minutes. How should they track the update’s completion status?

Options:

A.

Check /var/log/messages on the DPU operating system for update logs.

B.

Query the DPU BMC with the Task ID of the installation process.

C.

Power cycle the DPU immediately to force a rollback.

D.

Run bfrec --status on the DPU to view flash progress.
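
For background, the Redfish standard exposes long-running operations as task resources under the TaskService, each carrying fields such as TaskState and PercentComplete. The sketch below shows how such a resource might be addressed and summarized; the host name, the task ID, and the example JSON body are hypothetical, and no network request is made. The set of states treated as terminal is a choice made for this sketch.

```python
import json

# TaskState values treated as terminal in this sketch.
TERMINAL_STATES = {"Completed", "Killed", "Exception", "Cancelled"}

def task_url(bmc_host: str, task_id: str) -> str:
    """Build the standard Redfish URL for a single task resource."""
    return f"https://{bmc_host}/redfish/v1/TaskService/Tasks/{task_id}"

def summarize_task(body: str) -> dict:
    """Extract the progress-related fields from a task resource's JSON body."""
    task = json.loads(body)
    state = task.get("TaskState", "Unknown")
    return {
        "state": state,
        "percent": task.get("PercentComplete"),
        "status": task.get("TaskStatus"),
        "finished": state in TERMINAL_STATES,
    }

# Hypothetical response body, for illustration only.
example = '{"Id": "0", "TaskState": "Running", "PercentComplete": 20, "TaskStatus": "OK"}'

print(task_url("bmc.example.com", "0"))
print(summarize_task(example))
```

In a real environment the body would come from an authenticated HTTPS GET against the BMC, repeated until the state becomes terminal.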

Question 26

A customer has just completed the first boot of their DGX system and is prompted to create an administrative user. What is the correct approach for setting up this user to ensure secure BMC and GRUB access?

Options:

A.

Create a unique, strong, lower-case username and password that will be used for both BMC and GRUB access, avoiding default or weak credentials.

B.

Create separate usernames for BMC and GRUB to maximize flexibility.

C.

Skip the creation of a new user and retain the default admin account for BMC and GRUB access.

D.

Use “sysadmin” as the username and a simple password for ease of management.

Question 27

A single-node stress test fails during the PCIe bandwidth validation phase. Which troubleshooting step is recommended first?

Options:

A.

Reduce PCIe Gen4 speed to Gen3 speed in BIOS settings.

B.

Reseat the GPU, then rerun the test.

C.

Disable NVLink in BIOS to isolate PCIe performance.

D.

Reinstall NVIDIA drivers using apt-get install nvidia-driver-550.

Question 28

During cluster validation, the Cable Validation Tool (CVT) reports "Underperforming (BER)" for an InfiniBand link. Which BER thresholds indicate a critical signal quality issue requiring cable replacement?

Options:

A.

Rx power variance > 3dB between lanes

B.

Effective BER > 0 during the first 125 minutes of link operation

C.

Raw BER > 1e-12 or Effective BER > 1.5E-254 for < 6hr measurements

D.

Temperature > 85°C on transceiver module

Question 29

You are leading a project to enhance the energy efficiency of a data center that heavily relies on AI workloads. NVIDIA suggests moving beyond traditional metrics like Power Usage Effectiveness (PUE) to better capture the efficiency of modern data centers. Which strategy should you prioritize?

Options:

A.

Use Power Usage Effectiveness as the primary metric while supplementing it with additional measures of useful work done per unit of energy.

B.

Use watts used as the primary measure of efficiency, as it accurately reflects the power input at any given time.

C.

Develop benchmarks tailored to specific workloads, such as MLPerf for AI applications, to better understand energy use in real-world scenarios.

D.

Focus on integrating kilowatt-hours into existing metrics to better reflect the actual energy used for productive work.

Question 30

During cluster deployment, the UFM Cable Validation Tool reports "Wrong-neighbor" errors on multiple InfiniBand links. What is the most efficient way to resolve this issue?

Options:

A.

Reboot all leaf switches to force LLDP rediscovery.

B.

Replace all affected cables with higher-grade OM5 fiber optics.

C.

Verify LLDP data against topology files and remediate.

D.

Disable FEC on all switches to bypass neighbor validation.

Question 31

You are training a deep neural network using NCCL to coordinate communication across four GPUs in a single node. During early performance testing, you notice inconsistent scaling and longer-than-expected training times, even though all GPUs are being used. Which strategy would most effectively improve NCCL efficiency and collective operation performance in this setting?

Options:

A.

Adjust the batch size so that each GPU receives an equal-sized portion of the batch, ensuring all GPUs process similar workloads and communication is evenly distributed.

B.

Assign the largest possible workload to the first GPU to maximize its utilization, and allow the remaining GPUs to process smaller or variable batch sizes as needed.

C.

Disable automatic load balancing so that the deep learning framework can dynamically assign samples to any GPU available during each iteration.

D.

Increase the communication frequency between GPUs while allowing workloads to be unevenly split, so synchronization is more frequent and model updates happen faster.

Question 32

An InfiniBand server stops working, and a system administrator runs the "ibstat" command, which provides the following output:

CA 'mlx5_1'

CA type: MT4115

Number of ports: 2

Firmware version: 10.20.1010

Hardware version: 0

Node GUID: 0x0002c90300002f78

System image GUID: 0x0002c90300002f7b

Port 1:

State: Initializing

Physical state: Linkup

Rate: 100

Base lid: 0

LMC: 0

SM lid: 0

Capability mask: 0x0251086a

Port GUID: 0x0002c90300002f79

Link layer: InfiniBand

What is the cause of the issue?

Options:

A.

The HCA port is faulty.

B.

There is no running SM in the fabric.

C.

The neighboring switch port is faulty.

D.

The cable is disconnected.
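
When many hosts need checking, the per-port fields of ibstat output can be pulled into a dictionary so that the logical state, physical state, and LID values are easy to compare side by side. This is a minimal sketch, assuming the "key: value" layout shown in the question; `parse_ibstat_ports` is an illustrative helper, not part of any InfiniBand tool, and the sample reproduces only the port section of the output above.

```python
def parse_ibstat_ports(text: str) -> dict[str, dict[str, str]]:
    """Map each 'Port N' header to a dict of the key/value lines beneath it."""
    ports: dict[str, dict[str, str]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        # A header looks like "Port 1:"; "Port GUID: ..." does not end with ':'.
        if line.startswith("Port ") and line.endswith(":"):
            current = line[:-1]
            ports[current] = {}
        elif current is not None and ":" in line:
            key, _, value = line.partition(":")
            ports[current][key.strip()] = value.strip()
    return ports

sample = """\
Port 1:
State: Initializing
Physical state: Linkup
Rate: 100
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x0251086a
Port GUID: 0x0002c90300002f79
Link layer: InfiniBand
"""

ports = parse_ibstat_ports(sample)
for field in ("State", "Physical state", "Base lid", "SM lid"):
    print(f"{field}: {ports['Port 1'][field]}")
```

Lines that appear before the first port header, such as the CA type and firmware version, are ignored by this sketch.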

Question 33

A healthcare organization is deploying an AI system to analyze patient data for predictive diagnostics. The system must comply with strict data protection regulations such as HIPAA, ensuring that sensitive information remains confidential and secure. Considering the need for robust security measures, which combination of strategies should the organization prioritize to protect against data breaches and ensure regulatory compliance?

Options:

A.

Deploy data masking to obscure sensitive data during processing and use role-based access control (RBAC) to limit data access based on user roles.

B.

Use tokenization to replace sensitive data with non-sensitive tokens and employ multi-factor authentication (MFA) for system access.

C.

Implement symmetric encryption for all data at rest and rely solely on password-based access controls.

D.

Rely on asymmetric encryption for all communications and use data deduplication to minimize storage costs without additional security measures.

Question 34

An administrator needs to add additional GPUs to an existing server. What are the server requirements to check before installing new GPUs?

Options:

A.

Sufficient networking, water-cooled racks, adequate rack power, sufficient storage, and rack space.

B.

Sufficient storage, sufficient networking, adequate rack power, and compatible hardware.

C.

Sufficient CPU capacity, PCIe slot allocation, sufficient cooling in the data center, and rack space.

D.

Sufficient cooling in the data center, adequate rack power, compatible hardware, and PCIe slot allocation.

Question 35

After NCCL burn-in reports "transport retry count exceeded," which corrective action addresses the underlying fabric issue?

Options:

A.

Switch from Ring to Tree algorithms via NCCL_ALGO=TREE

B.

Reduce message size to decrease network utilization

C.

Increase NCCL_IB_TIMEOUT to tolerate longer latencies

D.

Inspect InfiniBand link quality metrics (BER, symbol errors) and replace faulty cables

Question 36

Why is it important to provide a large and high-performance local cache (using SSDs configured as RAID-0) for deep learning workloads on DGX systems?

Options:

A.

Local SSD cache allows users to increase the number of NFS threads on the server without impacting storage reliability.

B.

Using local SSD cache in RAID-0 enables direct GPU access to files without host CPU involvement, further boosting performance.

C.

Local SSD cache in RAID-0 is necessary to provide redundancy in case one of the drives fails during long training runs.

D.

A local SSD cache in RAID-0 ensures that most training data is read only once from the network, significantly reducing NFS traffic.
