A user encounters " permission denied " errors when running GPU-accelerated containers on a Secure Boot-enabled system. What resolves this?
A system administrator needs to configure a BlueField DPU and enable RShim on the baseboard management controller (BMC). Which command should be executed?
You are responsible for ensuring interoperability between AI applications deployed across a diverse IT landscape, including an on-premises data center equipped with NVIDIA GPUs and multiple cloud platforms from different vendors. These environments need to support complex AI workflows that involve large-scale data processing, real-time analytics, and machine learning model training. To maintain consistent performance and flexibility, which strategy should you prioritize?
What information does the ' ibnodes ' command display?
A systems administrator is preparing a new DGX server for deployment. What is the most secure approach to configuring the BMC port during initial setup?
You are expanding a DGX-based deep learning cluster to train on large, high-resolution images that cannot fit into local cache. Multiple nodes will access this data concurrently and require high performance. Which storage and networking solution best meets these requirements?
Your tasked with updating both NVIDIA GPU drivers and DOCA drivers on a set of servers used for AI workloads. The environment previously had an older driver stack and custom kernel modules. What is the most important step to successfully upgrade the drivers without causing conflicts?
ClusterKit’s NCCL bandwidth test shows 350 GB/s on a 400G InfiniBand fabric. How should this result be interpreted?
During a 72-hour HPL burn-in test on a DGX H100 cluster, one node shows a 15% performance drop after 48 hours. What are the two most likely causes and diagnostic steps?
Pick the 2 correct responses below.
An InfiniBand administrator needs to run performance benchmarks on new devices added to the fabric. What tool should be used to check the latency?
A system engineer needs to set the vGPU scheduling behavior for all GPUs to share the scheduling equally with the default time slice length. What command should be used?
During a DGX cluster deployment, what is the most effective way to verify the health and integrity of the local RAID storage array?
After running a 24-hour stress test on a DGX node, the administrator should verify which two key metrics to ensure system stability?
A system administrator needs to install a container toolkit and successfully run the following commands:
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime docker
What step should be taken next to finish the installation?
What is the purpose of using NCCL in verifying East-West fabric in an NVIDIA AI Factory?
Pick the 2 correct responses below.
A cluster administrator needs to validate transceiver firmware versions across 200 ports using UFM. Which GUI-based method provides a consolidated view?
When configuring an out-of-core HPL burn-in for a 40B matrix on 8x H100 nodes, which environment variable prevents GPU out-of-memory errors while reserving space for drivers?
An enterprise is deploying an AI Factory using NVIDIA DGX BasePOD architecture. The infrastructure team must ensure high availability and efficient data transfer between compute nodes. Which network topology should they implement for the InfiniBand fabric?
During a 48-hour NeMo question-answering model burn-in test, GPU memory errors occur when processing large datasets. Which configuration strategy prevents Out-of-Memory (OOM) errors while maintaining processing efficiency?
A network engineer is tasked with configuring the management, storage, and compute networks for a new DGX BasePOD deployment. Which statement best describes the network segmentation required for optimal operation?
An infrastructure engineer in an AI factory has successfully replaced a power supply unit on an NVIDIA DGX H100. After installation, both the IN and OUT LEDs on the new power supply illuminate solid green. Which NVSM CLI command should the engineer use to quickly verify the overall system status and ensure it is operating as expected?
Refer to the output:
~ $ sudo nvsm show healthinfo
—Timestamp: Sat Dec 16 16:26:32 2017 -0800
Version: 17.12-5
Checks—BIOS Revision [5.11].........................
DGX Serial Number [YSY72800016)..................
Verify installed DIMM memory sticks........................Healthy
...[output truncated)
Verify Ethernet controllers...........................Healthy
Verify installed GPU ' s..............................Unhealthy
Checking output of ' lspci ' for expected GPU ' s
Missing GPU at PCI address ' 07:00.0 '
Verify installed InfiniBand controllers....................Healthy
Verify PCIe switches..................................Healthy
...[output truncated)
What insights can a system administrator gain regarding the DGX system ' s health?
You are preparing a Spectrum-based NVIDIA switch for integration into a production AI cluster. To confirm that all modules are running approved firmware versions, you must use the appropriate command from the switch CLI. Which step most accurately meets best practices for ensuring firmware version consistency and cluster compliance?
After Spectrum-X fabric deployment, NCCL tests show intermittent latency spikes. Which network condition most severely impacts East-West bandwidth?
After updating BlueField-3 DPU BMC firmware via Redfish, the engineer observes “TaskState: Running” but no progress after 15 minutes. How should they track the update’s completion status?
A customer has just completed the first boot of their DGX system and is prompted to create an administrative user. What is the correct approach for setting up this user to ensure secure BMC and GRUB access?
A single-node stress test fails during the PCIe bandwidth validation phase. Which troubleshooting step is recommended first?
During cluster validation, the Cable Validation Tool (CVT) reports " Underperforming (BER) " for an InfiniBand link. Which BER thresholds indicate a critical signal quality issue requiring cable replacement?
You are leading a project to enhance the energy efficiency of a data center that heavily relies on AI workloads. NVIDIA suggests moving beyond traditional metrics like Power Usage Effectiveness (PUE) to better capture the efficiency of modern data centers. Which strategy should you prioritize?
During cluster deployment, the UFM Cable Validation Tool reports " Wrong-neighbor " errors on multiple InfiniBand links. What is the most efficient way to resolve this issue?
You are training a deep neural network using NCCL to coordinate communication across four GPUs in a single node. During early performance testing, you notice inconsistent scaling and longer-than-expected training times, even though all GPUs are being used. Which strategy would most effectively improve NCCL efficiency and collective operation performance in this setting?
An InfiniBand server stops working, and a system administrator runs the " ibstat " command that provides the following output:
CA ' mlx5_1 '
CA type: MT4115
Number of ports: 2
Firmware version: 10.20.1010
Hardware version: 0
Node GUID: 0x0002c90300002f78
System image GUID: 0x0002c90300002f7b
Port 1:
State: Initializing
Physical state: Linkup
Rate: 100
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x0251086a
Port GUID: 0x0002c90300002f79
Link layer: InfiniBand
What is the cause of the issue?
A healthcare organization is deploying an AI system to analyze patient data for predictive diagnostics. The system must comply with strict data protection regulations such as HIPAA, ensuring that sensitive information remains confidential and secure. Considering the need for robust security measures, which combination of strategies should the organization prioritize to protect against data breaches and ensure regulatory compliance?
An administrator needs to add additional GPUs to an existing server. What are the server requirements to check before installing new GPUs?
After NCCL burn-in reports " transport retry count exceeded, " which corrective action addresses the underlying fabric issue?
Why is it important to provide a large and high-performance local cache (using SSDs configured as RAID-0) for deep learning workloads on DGX systems?