Introduction
An Azure Machine Learning (Azure ML) Workspace is a centralized platform within Azure Machine Learning Services that enables data scientists and developers to efficiently manage machine learning (ML) projects. It acts as a collaborative environment for building, training, deploying, and monitoring ML models while ensuring security and scalability.
Use the OpsRamp Azure Public Cloud Integration to discover Machine Learning Services Workspaces and collect metrics for them.
Setup
To set up the Azure integration and discover the Azure Machine Learning Services Workspaces resources, do the following:
- Create an Azure Integration if one is not already available in your installed integrations. For more information on how to install the Azure Integration, refer to Install Azure Integration.
- Create a discovery profile. For more information on how to create a discovery profile, refer to Create Discovery Profile.
- Select Machine Learning Services Workspaces under the Filter Criteria in the Edit Discovery Profile page.
- Save the discovery profile to make it available in the list of Discovery Profiles.
- Run a scan to discover the resources at any time, independent of the predefined schedule.
- Once the scan is completed, you can view the Machine Learning Services Workspaces resources under Infrastructure > Resources > Microsoft Azure category.
Event support
OpsRamp supports Azure events for Machine Learning Services Workspaces. Configure Azure Events in the OpsRamp Azure integration discovery profile.
See Process Azure Events for more information on how to configure Azure events.
Supported metrics
| OpsRamp Metric | Azure Metric | Metric Display Name | Unit | Aggregation Type | Description |
|---|---|---|---|---|---|
| azure_ml_services_workspaces_Agents | Agents | Agents | Count | Average | Number of events for AI Agents in this workspace |
| azure_ml_services_workspaces_IndexedFiles | IndexedFiles | IndexedFiles | Count | Average | Number of files indexed for file search in this workspace |
| azure_ml_services_workspaces_Messages | Messages | Messages | Count | Average | Number of events for AI Agent messages in this workspace |
| azure_ml_services_workspaces_Runs | Runs | Runs | Count | Average | Number of runs by AI Agents in this workspace |
| azure_ml_services_workspaces_Threads | Threads | Threads | Count | Average | Number of events for AI Agent threads in this workspace |
| azure_ml_services_workspaces_Tokens | Tokens | Tokens | Count | Average | Count of tokens by AI Agents in this workspace |
| azure_ml_services_workspaces_ToolCalls | ToolCalls | ToolCalls | Count | Average | Tool calls made by AI Agents in this workspace |
| azure_ml_services_workspaces_Model_Deploy_Failed | Model Deploy Failed | Model Deploy Failed | Count | Total | Number of model deployments that failed in this workspace |
| azure_ml_services_workspaces_Model_Deploy_Started | Model Deploy Started | Model Deploy Started | Count | Total | Number of model deployments started in this workspace |
| azure_ml_services_workspaces_Model_Deploy_Succeeded | Model Deploy Succeeded | Model Deploy Succeeded | Count | Total | Number of model deployments that succeeded in this workspace |
| azure_ml_services_workspaces_Model_Register_Failed | Model Register Failed | Model Register Failed | Count | Total | Number of model registrations that failed in this workspace |
| azure_ml_services_workspaces_Model_Register_Succeeded | Model Register Succeeded | Model Register Succeeded | Count | Total | Number of model registrations that succeeded in this workspace |
| azure_ml_services_workspaces_Active_Cores | Active Cores | Active Cores | Count | Average | Number of active cores |
| azure_ml_services_workspaces_Active_Nodes | Active Nodes | Active Nodes | Count | Average | Number of active nodes. These are the nodes that are actively running a job |
| azure_ml_services_workspaces_Idle_Cores | Idle Cores | Idle Cores | Count | Average | Number of idle cores |
| azure_ml_services_workspaces_Idle_Nodes | Idle Nodes | Idle Nodes | Count | Average | Number of idle nodes. Idle nodes are the nodes that are not running any jobs but can accept a new job if available |
| azure_ml_services_workspaces_Leaving_Cores | Leaving Cores | Leaving Cores | Count | Average | Number of leaving cores |
| azure_ml_services_workspaces_Leaving_Nodes | Leaving Nodes | Leaving Nodes | Count | Average | Number of leaving nodes. Leaving nodes are the nodes which just finished processing a job and will go to Idle state |
| azure_ml_services_workspaces_Preempted_Cores | Preempted Cores | Preempted Cores | Count | Average | Number of preempted cores |
| azure_ml_services_workspaces_Preempted_Nodes | Preempted Nodes | Preempted Nodes | Count | Average | Number of preempted nodes. These nodes are the low priority nodes which are taken away from the available node pool |
| azure_ml_services_workspaces_Quota_Utilization_Percentage | Quota Utilization Percentage | Quota Utilization Percentage | Count | Average | Percent of quota utilized |
| azure_ml_services_workspaces_Total_Cores | Total Cores | Total Cores | Count | Average | Number of total cores |
| azure_ml_services_workspaces_Total_Nodes | Total Nodes | Total Nodes | Count | Average | Number of total nodes. This total includes Active Nodes, Idle Nodes, Unusable Nodes, Preempted Nodes, and Leaving Nodes |
| azure_ml_services_workspaces_Unusable_Cores | Unusable Cores | Unusable Cores | Count | Average | Number of unusable cores |
| azure_ml_services_workspaces_Unusable_Nodes | Unusable Nodes | Unusable Nodes | Count | Average | Number of unusable nodes. Unusable nodes are not functional due to some unresolvable issue. Azure will recycle these nodes |
| azure_ml_services_workspaces_CpuCapacityMillicores | CpuCapacityMillicores | CpuCapacityMillicores | Count | Average | Maximum capacity of a CPU node in millicores. Capacity is aggregated in one minute intervals |
| azure_ml_services_workspaces_CpuMemoryCapacityMegabytes | CpuMemoryCapacityMegabytes | CpuMemoryCapacityMegabytes | Count | Average | Maximum memory capacity of a CPU node in megabytes. Capacity is aggregated in one minute intervals |
| azure_ml_services_workspaces_CpuMemoryUtilizationMegabytes | CpuMemoryUtilizationMegabytes | CpuMemoryUtilizationMegabytes | Count | Average | Memory utilization of a CPU node in megabytes. Utilization is aggregated in one minute intervals |
| azure_ml_services_workspaces_CpuMemoryUtilizationPercentage | CpuMemoryUtilizationPercentage | CpuMemoryUtilizationPercentage | Count | Average | Memory utilization percentage of a CPU node. Utilization is aggregated in one minute intervals |
| azure_ml_services_workspaces_CpuUtilization | CpuUtilization | CpuUtilization | Count | Average | Percentage of utilization on a CPU node. Utilization is reported at one minute intervals |
| azure_ml_services_workspaces_CpuUtilizationMillicores | CpuUtilizationMillicores | CpuUtilizationMillicores | Count | Average | Utilization of a CPU node in millicores. Utilization is aggregated in one minute intervals |
| azure_ml_services_workspaces_CpuUtilizationPercentage | CpuUtilizationPercentage | CpuUtilizationPercentage | Count | Average | Utilization percentage of a CPU node. Utilization is aggregated in one minute intervals |
| azure_ml_services_workspaces_DiskAvailMegabytes | DiskAvailMegabytes | DiskAvailMegabytes | Count | Average | Available disk space in megabytes. Metrics are aggregated in one minute intervals |
| azure_ml_services_workspaces_DiskReadMegabytes | DiskReadMegabytes | DiskReadMegabytes | Count | Average | Data read from disk in megabytes. Metrics are aggregated in one minute intervals |
| azure_ml_services_workspaces_DiskUsedMegabytes | DiskUsedMegabytes | DiskUsedMegabytes | Count | Average | Used disk space in megabytes. Metrics are aggregated in one minute intervals |
| azure_ml_services_workspaces_DiskWriteMegabytes | DiskWriteMegabytes | DiskWriteMegabytes | Count | Average | Data written into disk in megabytes. Metrics are aggregated in one minute intervals |
| azure_ml_services_workspaces_GpuCapacityMilliGPUs | GpuCapacityMilliGPUs | GpuCapacityMilliGPUs | Count | Average | Maximum capacity of a GPU device in milli-GPUs. Capacity is aggregated in one minute intervals |
| azure_ml_services_workspaces_GpuEnergyJoules | GpuEnergyJoules | GpuEnergyJoules | Count | Average | Interval energy in Joules on a GPU node. Energy is reported at one minute intervals |
| azure_ml_services_workspaces_GpuMemoryCapacityMegabytes | GpuMemoryCapacityMegabytes | GpuMemoryCapacityMegabytes | Count | Average | Maximum memory capacity of a GPU device in megabytes. Capacity is aggregated in one minute intervals |
| azure_ml_services_workspaces_GpuMemoryUtilization | GpuMemoryUtilization | GpuMemoryUtilization | Count | Average | Percentage of memory utilization on a GPU node. Utilization is reported at one minute intervals |
| azure_ml_services_workspaces_GpuMemoryUtilizationMegabytes | GpuMemoryUtilizationMegabytes | GpuMemoryUtilizationMegabytes | Count | Average | Memory utilization of a GPU device in megabytes. Utilization is aggregated in one minute intervals |
| azure_ml_services_workspaces_GpuMemoryUtilizationPercentage | GpuMemoryUtilizationPercentage | GpuMemoryUtilizationPercentage | Count | Average | Memory utilization percentage of a GPU device. Utilization is aggregated in one minute intervals |
| azure_ml_services_workspaces_GpuUtilization | GpuUtilization | GpuUtilization | Count | Average | Percentage of utilization on a GPU node. Utilization is reported at one minute intervals |
| azure_ml_services_workspaces_GpuUtilizationMilliGPUs | GpuUtilizationMilliGPUs | GpuUtilizationMilliGPUs | Count | Average | Utilization of a GPU device in milli-GPUs. Utilization is aggregated in one minute intervals |
| azure_ml_services_workspaces_GpuUtilizationPercentage | GpuUtilizationPercentage | GpuUtilizationPercentage | Count | Average | Utilization percentage of a GPU device. Utilization is aggregated in one minute intervals |
| azure_ml_services_workspaces_IBReceiveMegabytes | IBReceiveMegabytes | IBReceiveMegabytes | Count | Average | Network data received over InfiniBand in megabytes. Metrics are aggregated in one minute intervals |
| azure_ml_services_workspaces_IBTransmitMegabytes | IBTransmitMegabytes | IBTransmitMegabytes | Count | Average | Network data sent over InfiniBand in megabytes. Metrics are aggregated in one minute intervals |
| azure_ml_services_workspaces_NetworkInputMegabytes | NetworkInputMegabytes | NetworkInputMegabytes | Count | Average | Network data received in megabytes. Metrics are aggregated in one minute intervals |
| azure_ml_services_workspaces_NetworkOutputMegabytes | NetworkOutputMegabytes | NetworkOutputMegabytes | Count | Average | Network data sent in megabytes. Metrics are aggregated in one minute intervals |
| azure_ml_services_workspaces_StorageAPIFailureCount | StorageAPIFailureCount | StorageAPIFailureCount | Count | Average | Azure Blob Storage API calls failure count |
| azure_ml_services_workspaces_StorageAPISuccessCount | StorageAPISuccessCount | StorageAPISuccessCount | Count | Average | Azure Blob Storage API calls success count |
| azure_ml_services_workspaces_Cancel_Requested_Runs | Cancel Requested Runs | Cancel Requested Runs | Count | Total | Number of runs where cancel was requested for this workspace. Count is updated when a cancellation request has been received for a run |
| azure_ml_services_workspaces_Cancelled_Runs | Cancelled Runs | Cancelled Runs | Count | Total | Number of runs cancelled for this workspace. Count is updated when a run is successfully cancelled |
| azure_ml_services_workspaces_Completed_Runs | Completed Runs | Completed Runs | Count | Total | Number of runs completed successfully for this workspace. Count is updated when a run has completed and output has been collected |
| azure_ml_services_workspaces_Errors | Errors | Errors | Count | Total | Number of run errors in this workspace. Count is updated whenever a run encounters an error |
| azure_ml_services_workspaces_Failed_Runs | Failed Runs | Failed Runs | Count | Total | Number of runs failed for this workspace. Count is updated when a run fails |
| azure_ml_services_workspaces_Finalizing_Runs | Finalizing Runs | Finalizing Runs | Count | Total | Number of runs that entered the finalizing state for this workspace. Count is updated when a run has completed but output collection is still in progress |
| azure_ml_services_workspaces_Not_Responding_Runs | Not Responding Runs | Not Responding Runs | Count | Total | Number of runs not responding for this workspace. Count is updated when a run enters Not Responding state |
| azure_ml_services_workspaces_Not_Started_Runs | Not Started Runs | Not Started Runs | Count | Total | Number of runs in Not Started state for this workspace. Count is updated when a request is received to create a run but run information has not yet been populated |
| azure_ml_services_workspaces_Preparing_Runs | Preparing Runs | Preparing Runs | Count | Total | Number of runs that are preparing for this workspace. Count is updated when a run enters Preparing state while the run environment is being prepared |
| azure_ml_services_workspaces_Provisioning_Runs | Provisioning Runs | Provisioning Runs | Count | Total | Number of runs that are provisioning for this workspace. Count is updated when a run is waiting on compute target creation or provisioning |
| azure_ml_services_workspaces_Queued_Runs | Queued Runs | Queued Runs | Count | Total | Number of runs that are queued for this workspace. Count is updated when a run is queued in the compute target. This can occur when waiting for required compute nodes to be ready |
| azure_ml_services_workspaces_Started_Runs | Started Runs | Started Runs | Count | Total | Number of runs running for this workspace. Count is updated when a run starts running on required resources |
| azure_ml_services_workspaces_Starting_Runs | Starting Runs | Starting Runs | Count | Total | Number of runs started for this workspace. Count is updated after a request to create a run is received and run information, such as the Run Id, has been populated |
| azure_ml_services_workspaces_Warnings | Warnings | Warnings | Count | Total | Number of run warnings in this workspace. Count is updated whenever a run encounters a warning |
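
The OpsRamp metric names in the table above appear to follow a simple pattern: the prefix `azure_ml_services_workspaces_` followed by the Azure metric name, with spaces replaced by underscores. The Python sketch below illustrates that pattern for use in scripts or dashboards that need to map between the two names. The function names are hypothetical, and the mapping is inferred from the table rather than taken from an OpsRamp API, so verify it against the table for any metric you depend on.

```python
# Prefix shared by every OpsRamp metric name in the table above.
PREFIX = "azure_ml_services_workspaces_"


def opsramp_metric_name(azure_metric: str) -> str:
    """Build the OpsRamp metric name from an Azure metric name.

    Assumption (inferred from the table): spaces become underscores
    and the common prefix is prepended. Case is left unchanged.
    """
    return PREFIX + azure_metric.strip().replace(" ", "_")


def azure_metric_name(opsramp_metric: str) -> str:
    """Recover the Azure metric name from an OpsRamp metric name.

    This works for the names in the table because none of the Azure
    metric names listed there contain underscores themselves.
    """
    if not opsramp_metric.startswith(PREFIX):
        raise ValueError(f"unexpected metric name: {opsramp_metric!r}")
    return opsramp_metric[len(PREFIX):].replace("_", " ")


if __name__ == "__main__":
    for name in ("Model Deploy Failed", "CpuUtilization", "Queued Runs"):
        print(name, "->", opsramp_metric_name(name))
```

For example, `opsramp_metric_name("Model Deploy Failed")` produces `azure_ml_services_workspaces_Model_Deploy_Failed`, which matches the corresponding row in the table.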