A system administrator needs to lower latency for an AI application by utilizing GPUDirect Storage.
Which two bottlenecks does this approach avoid? (Choose two.)
You are managing a deep learning workload on a Slurm cluster with multiple GPU nodes, but you notice that jobs requesting multiple GPUs are waiting for long periods even though there are available resources on some nodes.
How would you optimize job scheduling for multi-GPU workloads?
An instance of the NVIDIA Fabric Manager service is running on an HGX system with KVM. A system administrator is troubleshooting NVLink partitioning.
By default, what is the GPU polling subsystem set to?
A GPU administrator needs to virtualize AI/ML training in an HGX environment.
How can the NVIDIA Fabric Manager be used to meet this demand?
An administrator is troubleshooting issues with an NVIDIA Unified Fabric Manager Enterprise (UFM) installation and notices that the UFM server is unable to communicate with InfiniBand switches.
What step should be taken to address the issue?
A Fleet Command system administrator wants to create an organization user that will have the following rights:
For Locations - read only
For Applications - read/write/admin
For Deployments - read/write/admin
For Dashboards - read only
What role should the system administrator assign to this user?
You have noticed that users can access all GPUs on a node even when they request only one GPU in their job script using --gres=gpu:1. This is causing resource contention and inefficient GPU usage.
What configuration change would you make to restrict users’ access to only their allocated GPUs?
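Device isolation of this kind is typically configured in Slurm's cgroup.conf, where ConstrainDevices is one of the documented options. The sketch below only checks whether such an option is set to yes in a config text; the helper name cgroup_option_enabled and the sample text are hypothetical, and the check is a study aid rather than a statement of how any particular cluster is configured.

```python
def cgroup_option_enabled(conf_text, key):
    """Return True if `key` is set to yes in cgroup.conf-style text.

    Hypothetical helper: handles simple KEY=VALUE lines and '#' comments only.
    """
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if "=" not in line:
            continue
        name, value = (part.strip() for part in line.split("=", 1))
        if name.lower() == key.lower():  # Slurm option names are case-insensitive
            return value.lower() == "yes"
    return False


# Sample text invented for illustration; not taken from a real cluster.
sample = """
ConstrainCores=yes
ConstrainDevices=yes   # limit jobs to their allocated GRES devices
"""
print(cgroup_option_enabled(sample, "ConstrainDevices"))
```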
An administrator is troubleshooting a bottleneck in a deep learning runtime and needs consistent data feed rates to the GPUs.
Which storage metric should be used?
Your organization is running multiple AI models on a single A100 GPU using MIG in a multi-tenant environment. One of the tenants reports a performance issue, but you notice that other tenants are unaffected.
What feature of MIG ensures that one tenant's workload does not impact others?
What is the primary purpose of assigning a provisioning role to a node in NVIDIA Base Command Manager (BCM)?
An administrator requires full access to the NGC Base Command Platform CLI.
Which command should be used to accomplish this action?
A system administrator needs to scale a Kubernetes Job to 4 replicas.
What command should be used?
A system administrator is experiencing issues with Docker containers failing to start due to volume mounting problems. They suspect the issue is related to incorrect file permissions on shared volumes between the host and containers.
How should the administrator troubleshoot this issue?
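Permission problems on bind-mounted volumes usually come down to the owner, group, and mode of the host path compared with the user the container runs as. The sketch below shows that comparison using only the Python standard library; describe_permissions and uid_can_write are hypothetical helper names, and the check looks at classic mode bits only (it ignores ACLs, SELinux labels, and root privileges).

```python
import os
import stat


def describe_permissions(path):
    """Return numeric owner, group, and symbolic mode for a host path."""
    info = os.stat(path)
    return {
        "uid": info.st_uid,
        "gid": info.st_gid,
        "mode": stat.filemode(info.st_mode),  # same style as `ls -l`
    }


def uid_can_write(path, uid, gids=()):
    """Rough write check from mode bits only (assumption: no ACLs, not root)."""
    info = os.stat(path)
    if info.st_uid == uid:
        return bool(info.st_mode & stat.S_IWUSR)
    if info.st_gid in gids:
        return bool(info.st_mode & stat.S_IWGRP)
    return bool(info.st_mode & stat.S_IWOTH)
```

Calling describe_permissions on the host directory and comparing the result with the UID and GID used inside the container is a quick first check before changing ownership or mount options.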
An administrator is troubleshooting issues with NVIDIA GPUDirect Storage and must ensure optimal data transfer performance.
What step should be taken first?
You are deploying AI applications at the edge and want to ensure they continue running even if one of the servers at an edge location fails.
How can you configure NVIDIA Fleet Command to achieve this?
A new researcher needs access to GPU resources but should not have permission to modify cluster settings or manage other users.
What role should you assign them in Run:ai?
If a Magnum IO-enabled application experiences delays during the ETL phase, what troubleshooting step should be taken?
A data scientist training a deep learning model notices slower-than-expected training times and asks a system administrator to investigate. The system administrator suspects that disk I/O is the cause.
What command should be used?
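On Linux, tools such as iostat derive per-device throughput from the counters in /proc/diskstats. The sketch below shows the underlying arithmetic for two samples of the same device line; the function names and the sample lines are hypothetical, and it assumes the standard field order with 512-byte sector counts.

```python
SECTOR_BYTES = 512  # /proc/diskstats reports sectors in 512-byte units


def parse_diskstats_line(line):
    """Return (device, sectors_read, sectors_written) from one /proc/diskstats line."""
    fields = line.split()
    # Field order: major, minor, name, reads completed, reads merged,
    # sectors read, ms reading, writes completed, writes merged,
    # sectors written, ...
    return fields[2], int(fields[5]), int(fields[9])


def read_write_mib_per_s(before, after, interval_s):
    """Compute read and write throughput in MiB/s between two samples."""
    _, r0, w0 = parse_diskstats_line(before)
    _, r1, w1 = parse_diskstats_line(after)
    scale = SECTOR_BYTES / (1024 * 1024) / interval_s
    return (r1 - r0) * scale, (w1 - w0) * scale


# Sample lines invented for illustration, taken 2 seconds apart.
t0 = "   8       0 sda 100 0 2048 50 10 0 4096 20 0 60 70"
t1 = "   8       0 sda 200 0 6144 90 20 0 8192 40 0 100 120"
print(read_write_mib_per_s(t0, t1, 2.0))
```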
A system administrator wants to run these two commands in Base Command Manager:
main showprofile
device status apc01
What command should the system administrator use from the management node system shell?
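The BCM cluster management shell, cmsh, accepts a -c option that takes a semicolon-separated command string, which allows several commands to be run non-interactively from the system shell. The sketch below builds such an invocation, assuming the two commands in the question are `main showprofile` and `device status apc01`; build_cmsh_command is a hypothetical helper, and the script only prints the command line rather than running it.

```python
import shlex


def build_cmsh_command(commands):
    """Join cmsh commands with ';' for a single `cmsh -c` invocation."""
    script = "; ".join(c.strip() for c in commands if c.strip())
    return ["cmsh", "-c", script]


argv = build_cmsh_command(["main showprofile", "device status apc01"])
# Print a shell-quoted form that could be pasted into a terminal.
print(shlex.join(argv))
```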