AI and Systems Software Intern, At Scale AI - Fall 2026
At a glance
AI-generated TL;DR
As an intern on NVIDIA’s AI and Systems Software team for datacenter applications, you will do system-level debugging and reliability analysis in large-scale compute clusters. Day to day, you will investigate failures, analyze logs and telemetry to find the root causes of job failures, and track reliability metrics such as MTBF and MTBI. You will work closely with specialists in operating systems, container technologies, GPU compute, and systems to optimize performance and develop new solutions. The ideal candidate is pursuing a degree in Computer Science or a related field and is proficient in Python and Bash scripting for automation. Strong debugging skills and exposure to HPC environments are essential, along with familiarity with server architecture and monitoring tools such as Prometheus and Grafana. The role offers the chance to work on cutting-edge technologies and contribute to infrastructure improvements as part of a collaborative team.
What you'll do
- Investigate and triage failures within large-scale compute clusters to distinguish between software glitches, configuration errors, and hardware faults.
- Analyze logs and telemetry data to correlate job failures with system-level issues for root cause identification.
- Track, calculate, and report on reliability metrics like MTBF and MTBI to drive infrastructure improvements.
- Assist in analyzing large-scale workload issues to identify application and infrastructure improvement opportunities.
- Document debugging methodologies and assist the team in making informed engineering decisions based on data.
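For readers unfamiliar with the reliability metrics named above, here is a minimal sketch of how MTBF (mean time between failures) might be computed from failure timestamps. It assumes the simplest definition, observed time divided by the number of failures, and treats the whole window as operating time with no repair downtime subtracted. The function name `mean_time_between` and the sample data are hypothetical and not taken from the posting. The same calculation would apply to MTBI, read here as mean time between interruptions, by passing job-interruption timestamps instead.

```python
from datetime import datetime, timedelta
from typing import Optional, Sequence


def mean_time_between(
    events: Sequence[datetime],
    window_start: datetime,
    window_end: datetime,
) -> Optional[timedelta]:
    """Observed time divided by the number of events inside the window."""
    if window_end <= window_start:
        raise ValueError("window_end must be after window_start")
    # Count only events that fall inside [window_start, window_end).
    count = sum(1 for t in events if window_start <= t < window_end)
    if count == 0:
        return None  # no events observed, so the metric is undefined here
    return (window_end - window_start) / count


# Hypothetical data: four hardware failures in a 30-day window.
start = datetime(2026, 9, 1)
end = start + timedelta(days=30)
failures = [start + timedelta(days=d) for d in (3, 11, 19, 27)]

mtbf = mean_time_between(failures, start, end)
mtbf_hours = mtbf.total_seconds() / 3600  # 720 h observed / 4 failures
print(f"MTBF: {mtbf_hours:.1f} h")
```

Returning `None` for a window with no events avoids a division by zero and makes it explicit that the metric cannot be estimated from that window alone.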
What we're looking for
- Pursuing a BS, MS, or PhD in Computer Science, Engineering, or a related field.
- Proficient in Python and Bash/Shell scripting for automation.
- Strong debugging skills in complex distributed systems.
- Experience with HPC environments and cluster managers such as Slurm or Kubernetes.
- Familiarity with server architecture and hardware diagnostics.
Employer
About Nvidia
Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units, powering gaming, professional visualization, data centers, and artificial intelligence workloads.
Industry: Semiconductors & AI Computing