
NVIDIA GPU integration

Our NVIDIA GPU integration lets you monitor the status of your GPUs. This integration uses our infrastructure agent with the Flex integration, which lets us access NVIDIA's SMI utility.

NVIDIA GPUs dashboard

After you set up our NVIDIA GPU integration, we give you a dashboard for your GPU metrics.

After installation, you get a pre-built dashboard with these key GPU metrics:

  • GPU utilization
  • ECC error counts
  • Active compute processes
  • Clock and performance states
  • Temperature and fan speed
  • Dynamic and static information about each supported device

Install the infrastructure agent

To capture data with New Relic, install our infrastructure agent. The agent collects and ingests data so you can keep track of your GPUs' performance.

You can install the infrastructure agent two different ways:

Configure Flex integration for NVIDIA GPUs

Flex comes bundled with the New Relic infrastructure agent and can integrate with NVIDIA SMI (nvidia-smi), a command-line utility for monitoring NVIDIA GPU devices.

Important

nvidia-smi ships pre-installed with NVIDIA GPU display drivers on Linux and Windows Server.
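
Before configuring Flex, it can help to confirm that nvidia-smi is actually reachable on the host. The following is a minimal Python sketch, not part of the integration itself. The helper names `find_nvidia_smi` and `list_gpus` are hypothetical, and the sketch assumes the binary is on the `PATH` of the user running it.

```python
import shutil
import subprocess


def find_nvidia_smi(binary="nvidia-smi"):
    """Return the full path to the binary if it is on PATH, else None."""
    return shutil.which(binary)


def list_gpus(binary_path):
    """Ask nvidia-smi for each GPU's name and driver version, one CSV line per GPU."""
    result = subprocess.run(
        [binary_path, "--query-gpu=name,driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    path = find_nvidia_smi()
    if path is None:
        print("nvidia-smi not found on PATH; install the NVIDIA driver first")
    else:
        try:
            for gpu in list_gpus(path):
                print(gpu)
        except subprocess.CalledProcessError as err:
            print(f"nvidia-smi exited with status {err.returncode}")
```

If the binary lives somewhere else, pass its location to `list_gpus` directly and use the same location in the `run` line of the Flex configuration.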

Follow these steps to configure Flex:

  1. Create a file named nvidia-smi-gpu-monitoring.yml in this path:

    $ sudo touch /etc/newrelic-infra/integrations.d/nvidia-smi-gpu-monitoring.yml

    You can also download the file from the git repository.

  2. Update the nvidia-smi-gpu-monitoring.yml file with the integration config:

    ---
    integrations:
      - name: nri-flex
        # interval: 30s
        config:
          name: NvidiaSMI
          variable_store:
            metrics:
              "name,driver_version,count,serial,pci.bus_id,pci.domain,pci.bus,\
              pci.device_id,pci.sub_device_id,pcie.link.gen.current,pcie.link.gen.max,\
              pcie.link.width.current,pcie.link.width.max,index,display_mode,display_active,\
              persistence_mode,accounting.mode,accounting.buffer_size,driver_model.current,\
              driver_model.pending,vbios_version,inforom.img,inforom.oem,inforom.ecc,inforom.pwr,\
              gom.current,gom.pending,fan.speed,pstate,clocks_throttle_reasons.supported,\
              clocks_throttle_reasons.gpu_idle,clocks_throttle_reasons.applications_clocks_setting,\
              clocks_throttle_reasons.sw_power_cap,clocks_throttle_reasons.hw_slowdown,clocks_throttle_reasons.hw_thermal_slowdown,\
              clocks_throttle_reasons.hw_power_brake_slowdown,clocks_throttle_reasons.sw_thermal_slowdown,\
              clocks_throttle_reasons.sync_boost,memory.total,memory.used,memory.free,compute_mode,\
              utilization.gpu,utilization.memory,encoder.stats.sessionCount,encoder.stats.averageFps,\
              encoder.stats.averageLatency,ecc.mode.current,ecc.mode.pending,ecc.errors.corrected.volatile.device_memory,\
              ecc.errors.corrected.volatile.dram,ecc.errors.corrected.volatile.register_file,ecc.errors.corrected.volatile.l1_cache,\
              ecc.errors.corrected.volatile.l2_cache,ecc.errors.corrected.volatile.texture_memory,ecc.errors.corrected.volatile.cbu,\
              ecc.errors.corrected.volatile.sram,ecc.errors.corrected.volatile.total,ecc.errors.corrected.aggregate.device_memory,\
              ecc.errors.corrected.aggregate.dram,ecc.errors.corrected.aggregate.register_file,ecc.errors.corrected.aggregate.l1_cache,\
              ecc.errors.corrected.aggregate.l2_cache,ecc.errors.corrected.aggregate.texture_memory,ecc.errors.corrected.aggregate.cbu,\
              ecc.errors.corrected.aggregate.sram,ecc.errors.corrected.aggregate.total,ecc.errors.uncorrected.volatile.device_memory,\
              ecc.errors.uncorrected.volatile.dram,ecc.errors.uncorrected.volatile.register_file,ecc.errors.uncorrected.volatile.l1_cache,\
              ecc.errors.uncorrected.volatile.l2_cache,ecc.errors.uncorrected.volatile.texture_memory,ecc.errors.uncorrected.volatile.cbu,\
              ecc.errors.uncorrected.volatile.sram,ecc.errors.uncorrected.volatile.total,ecc.errors.uncorrected.aggregate.device_memory,\
              ecc.errors.uncorrected.aggregate.dram,ecc.errors.uncorrected.aggregate.register_file,ecc.errors.uncorrected.aggregate.l1_cache,\
              ecc.errors.uncorrected.aggregate.l2_cache,ecc.errors.uncorrected.aggregate.texture_memory,ecc.errors.uncorrected.aggregate.cbu,\
              ecc.errors.uncorrected.aggregate.sram,ecc.errors.uncorrected.aggregate.total,retired_pages.single_bit_ecc.count,\
              retired_pages.double_bit.count,retired_pages.pending,temperature.gpu,temperature.memory,power.management,power.draw,\
              power.limit,enforced.power.limit,power.default_limit,power.min_limit,power.max_limit,clocks.current.graphics,clocks.current.sm,\
              clocks.current.memory,clocks.current.video,clocks.applications.graphics,clocks.applications.memory,\
              clocks.default_applications.graphics,clocks.default_applications.memory,clocks.max.graphics,clocks.max.sm,clocks.max.memory,\
              mig.mode.current,mig.mode.pending"
          apis:
            - name: NvidiaGpu
              commands:
                - run: nvidia-smi --query-gpu=${var:metrics} --format=csv # update this if you have an alternate path
                  output: csv
              rename_keys:
                " ": ""
                "\\[MiB\\]": ".MiB"
                "\\[%\\]": ".percent"
                "\\[W\\]": ".watts"
                "\\[MHz\\]": ".MHz"
              value_parser:
                "clocks|power|fan|memory|temp|util|ecc|stats|gom|mig|count|pcie": '\d*\.?\d+'
                '.': '\[N\/A\]|N\/A|Not Active|Disabled|Enabled|Default'
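
To make the `rename_keys` and `value_parser` rules in the configuration easier to follow, here is a rough Python sketch of the kind of transformation they describe. It is an illustration only, not Flex's implementation: the way rules are ordered and combined here is an assumption, the function names are hypothetical, and the sample CSV is made-up data shaped like `nvidia-smi --format=csv` output.

```python
import csv
import io
import re

# Made-up sample shaped like `nvidia-smi --query-gpu=... --format=csv` output.
SAMPLE = """name, memory.total [MiB], utilization.gpu [%], power.draw [W], fan.speed [%]
Example GPU, 15360 MiB, 12 %, 27.50 W, [N/A]
"""

# Same patterns as rename_keys in the YAML (after YAML unescaping).
RENAME_KEYS = {
    r" ": "",
    r"\[MiB\]": ".MiB",
    r"\[%\]": ".percent",
    r"\[W\]": ".watts",
    r"\[MHz\]": ".MHz",
}

# Same patterns as value_parser: key regex -> regex extracted from the value.
VALUE_PARSER = {
    r"clocks|power|fan|memory|temp|util|ecc|stats|gom|mig|count|pcie": r"\d*\.?\d+",
    r".": r"\[N\/A\]|N\/A|Not Active|Disabled|Enabled|Default",
}


def rename_key(key):
    """Apply every rename rule to a column header."""
    for pattern, replacement in RENAME_KEYS.items():
        key = re.sub(pattern, replacement, key)
    return key


def parse_value(key, value):
    """Return the first extracted match from a rule whose key regex matches.

    Assumption: rules are tried in order and the raw value is kept
    when no rule produces a match.
    """
    for key_pattern, value_pattern in VALUE_PARSER.items():
        if re.search(key_pattern, key):
            match = re.search(value_pattern, value)
            if match:
                return match.group(0)
    return value


def parse_sample(text):
    """Turn CSV text into one dict per GPU row, with renamed keys and parsed values."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    header = [rename_key(h.strip()) for h in next(reader)]
    return [
        {k: parse_value(k, v.strip()) for k, v in zip(header, row)}
        for row in reader
        if row
    ]


if __name__ == "__main__":
    for row in parse_sample(SAMPLE):
        print(row)
```

In this sketch a header such as `memory.total [MiB]` becomes `memory.total.MiB`, and a value such as `15360 MiB` is reduced to `15360`. Values stay as strings here; how Flex types the resulting attributes is not covered by this illustration.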

Confirm GPU metrics are being ingested

The infrastructure agent automatically detects and runs the Flex configuration, so you don't need to restart the agent. To confirm that metrics are being ingested, run this NRQL query:

SELECT * FROM NvidiaGpuSample

Monitor your application

You can use our pre-built dashboard template to monitor your GPU metrics. Follow these steps:

  1. Go to one.newrelic.com and click on Dashboards.

  2. Click on the Import dashboard tab.

  3. Copy the file content (.json) from the NVIDIA GPU dashboard.

  4. Select the target account where the dashboard needs to be imported.

  5. Click on Import dashboard to confirm the action.

    Your NVIDIA GPU Monitoring dashboard is considered a custom dashboard and can be found in the Dashboards UI. For docs on using and editing dashboards, see our dashboard docs.

    Here is a NRQL query to view all the telemetry available:

    SELECT * FROM NvidiaGpuSample

What's next?

You can adapt the Flex configuration to include or exclude information available from the NVIDIA SMI utility.
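
One way to adapt the configuration is to shorten the comma-separated `metrics` value to only the fields you need. The Python sketch below builds such a value and the matching nvidia-smi command line. The helper name `build_query_command` and the chosen subset are hypothetical; every field name in it is taken from the metrics list in the configuration above.

```python
# Hypothetical subset; each field appears in the full metrics list in the config.
FIELDS = [
    "name",
    "index",
    "utilization.gpu",
    "memory.used",
    "memory.total",
    "temperature.gpu",
    "power.draw",
]


def build_metrics_value(fields):
    """Join field names with commas, dropping duplicates but keeping order."""
    return ",".join(dict.fromkeys(fields))


def build_query_command(fields, binary="nvidia-smi"):
    """Return the command line that the Flex `run` entry would execute."""
    return f"{binary} --query-gpu={build_metrics_value(fields)} --format=csv"


if __name__ == "__main__":
    # Paste the first line into `metrics:`; run the second to preview the output.
    print(build_metrics_value(FIELDS))
    print(build_query_command(FIELDS))
```

Running the printed command by hand before editing the YAML file is a quick way to check that your driver version supports every field you selected.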

To learn more about building NRQL queries and generating dashboards, check out these docs:

  • Introduction to the query builder to create basic and advanced queries.
  • Introduction to dashboards to customize your dashboard and carry out different actions.
  • Manage your dashboard to adjust your display mode, or to add more content to your dashboard.
