An AI Engineer is analyzing a production agentic AI system’s compliance with responsible AI standards.
Which evaluation approaches effectively identify potential safety vulnerabilities and ethical risks in multi-agent workflows? (Choose two.)
A medical diagnostics company is deploying an agentic AI system to assist radiologists in analyzing medical imaging. The system must provide AI-generated preliminary diagnoses and allow radiologists to review, modify, and approve all recommendations before they inform patient treatment decisions. Human expertise should remain central, and detailed records of human interventions and decision rationales must be maintained.
Which approach would best balance human oversight with AI support in a safety-critical setting?
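The scenario above hinges on auditable human-in-the-loop review: every AI recommendation is approved or modified by a radiologist, and the intervention and its rationale are recorded. A minimal sketch of such an audit record might look like the following (the schema, field names, and case data are hypothetical, purely for illustration):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ReviewRecord:
    """One radiologist review of an AI preliminary diagnosis (hypothetical schema)."""
    case_id: str
    ai_diagnosis: str       # the AI's preliminary finding
    final_diagnosis: str    # what the radiologist approved
    modified: bool          # True if the human changed the AI output
    rationale: str          # why the human intervened (or approved)
    reviewer: str
    reviewed_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

# Hypothetical example: the radiologist overrides the AI's finding
# and records the clinical reasoning behind the change.
record = ReviewRecord(
    case_id="IMG-1042",
    ai_diagnosis="possible nodule, right upper lobe",
    final_diagnosis="benign granuloma",
    modified=True,
    rationale="Prior CT shows a stable calcified lesion over 3 years.",
    reviewer="dr_lee",
)
```

Keeping records like this makes the human decision trail queryable later, which is what "detailed records of human interventions and decision rationales" requires in practice.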
A development team is building a customer support agent that interacts with users via chat. The agent must reliably fetch information from external databases, handle occasional API failures without crashing, and improve its responses by learning from user feedback over time.
Which of the following tasks is most critical when enhancing an AI agent to handle real-world interactions and improve over time?
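A core piece of the requirement above is surviving occasional API failures without crashing. One common pattern is retrying external calls with exponential backoff; a minimal sketch (the function names and the simulated flaky lookup are illustrative assumptions, not a specific library API):

```python
import time

def fetch_with_retry(fetch, retries=3, base_delay=0.5):
    """Call an external-API fetch function, retrying on failure with
    exponential backoff so transient errors don't crash the agent."""
    for attempt in range(retries):
        try:
            return fetch()
        except Exception:
            if attempt == retries - 1:
                raise  # give up after the final attempt
            time.sleep(base_delay * (2 ** attempt))

# Hypothetical flaky database call that fails twice, then succeeds.
calls = {"n": 0}
def flaky_lookup():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient API failure")
    return {"order_id": 42, "status": "shipped"}

result = fetch_with_retry(flaky_lookup, base_delay=0.01)
print(result)
```

In a production agent this would typically be combined with a fallback response to the user when all retries are exhausted, rather than propagating the exception.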
You’re evaluating a RAG pipeline by scoring its responses to synthetic questions against reference answers. You’ve collected a large set of similarity scores.
What’s the primary benefit of aggregating these scores into a single metric (e.g., average similarity)?
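The aggregation the question refers to can be sketched in a few lines: collapsing many per-question similarity scores into one summary number gives a single metric to track across pipeline versions (the scores below are made up for illustration):

```python
# Hypothetical per-question similarity scores (0-1 scale) from comparing
# RAG responses against reference answers.
scores = [0.91, 0.72, 0.88, 0.35, 0.79, 0.94, 0.61]

# A single aggregate is easy to compare run-to-run; a regression in the
# pipeline shows up as a drop in this one number.
avg_similarity = sum(scores) / len(scores)
print(f"Average similarity: {avg_similarity:.3f}")
```

The trade-off is that an average hides the distribution, so outliers like the 0.35 above are often worth inspecting separately (e.g. alongside a minimum or a low percentile).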