SDE-III DevOps
AiDASH
Software Engineering
Bengaluru, Karnataka, India
About AiDASH
AiDASH is an enterprise AI company and the leading provider of vegetation risk intelligence for electric utilities. Powered by proprietary VegetationAI™ technology, AiDASH delivers a unified remote grid inspection and monitoring platform that uses a SatelliteFirst approach to identify and address vegetation and other threats to the grid. With a prevention-first strategy to mitigate wildfire risk and minimize storm impacts, AiDASH helps more than 140 utilities reduce costs, improve reliability, and lower liability across their networks. AiDASH exists to safeguard critical utility infrastructure and secure the future of humanAIty™. Learn more at www.aidash.com.
We are a Series C growth company backed by leading investors, including Shell Ventures, National Grid Partners, G2 Venture Partners, Duke Energy, Edison International, Lightrock, and Marubeni, among others. We have been recognized by Forbes two years in a row as one of “America’s Best Startup Employers.” We are also proud to be one of the few software companies in Time Magazine’s “America’s Top GreenTech Companies 2024”. Deloitte Technology Fast 500™ recently ranked us No. 12 among San Francisco Bay Area companies and No. 59 overall in its selection of the top 500 for 2024.
Join us in Securing Tomorrow!
The role
We are looking for an SDE-III DevOps who takes end-to-end ownership of the infrastructure that runs our satellite-imagery and ML-inference platform. This is a senior, hands-on individual-contributor role: you will set the bar across the DevOps function, influence technical direction, and lead by example, without managing a team.
What sets this role apart at AiDASH is what runs on the infrastructure: a globally deployed platform that ingests petabytes of satellite imagery, runs millions of inference requests per day across CPU and GPU fleets, and serves utilities, transportation, and construction customers on tight SLOs. You will build the systems that underpin all of that, and you will do it with AI as a first-class tool, not an afterthought.
You will work closely with developers, QA, security, and product teams to design systems that are reliable, secure, and easy to operate, while extending an already-mature DevSecOps program.
What you will do
- Own scalable, secure infrastructure across AWS, Azure, or GCP — using AI coding assistants to accelerate IaC authoring, policy validation, and cost reviews.
- Architect and maintain CI/CD pipelines that support rapid, safe deployments, with AI-assisted failure triage, smart test selection, and automated release notes.
- Lead container orchestration on Kubernetes for production satellite-data and ML-inference workloads — including GPU scheduling, autoscaling, and model-serving infrastructure.
- Own observability standards (SLIs, SLOs, error budgets, alerting) and be accountable for keeping platform availability at our committed SLO targets.
- Implement automation using Terraform, Ansible, or equivalent, with AI pair-programming as a normal part of the workflow — not a side experiment.
- Define and harden security, secrets management, and access-control practices in partnership with the DevSecOps function.
- Establish disaster-recovery strategies and backups for critical systems, and prove them with regular game-days.
- Build or extend internal tooling that uses LLMs to make engineers faster — log triage, runbook drafting, alert summarisation, ChatOps for routine ops tasks.
- Participate in a follow-the-sun on-call rotation with the global DevOps team.
- Raise the bar on engineering standards, including how the team adopts and governs AI tooling in infrastructure workflows.
What we are looking for
- 6+ years in DevOps, infrastructure engineering, or SRE, with proven ownership of production systems at meaningful scale.
- Deep experience with at least one major cloud (AWS, Azure, or GCP), and a working grasp of what the others do differently.
- Strong infrastructure-as-code (Terraform or equivalent) and CI/CD pipeline experience (Jenkins, GitLab CI, GitHub Actions, or similar).
- Production Kubernetes — beyond “I have used Helm.” You can debug a stuck pod, design an autoscaling strategy, and reason about cost.
- Strong scripting in Python, Bash, or equivalent.
- Hands-on with at least one observability stack (Prometheus, Grafana, ELK, or equivalent) and able to define meaningful SLOs.
- Shipped real work using AI coding assistants — IaC, debugging, incident triage, internal tooling — and can speak to where they helped and where they got in your way.
- Built or extended at least one internal tool using LLM APIs (or are clearly keen to). A hacky prototype counts.
- Comfortable in a fast-moving, engineering-driven environment, and good at influencing without authority.
Nice to have
- Production experience with ML infrastructure — model serving (Triton, KServe, TorchServe, or similar), GPU workload management, feature stores, or data-pipeline orchestration (Airflow, Argo, or equivalent).
- Familiarity with compliance frameworks relevant to critical infrastructure (SOC 2, ISO 27001, NERC, or similar).
- Certifications such as AWS DevOps Engineer, CKA / CKAD, or equivalent.
- Experience with serverless platforms (Lambda, Cloud Functions, etc.).
- Exposure to managing and tuning production databases (PostgreSQL, MySQL, or NoSQL).
Read our Privacy Policy here: https://www.aidash.com/policy/privacy-policy/