Hardware Engineer, GPU Infrastructure

Posted 21 Days Ago
Remote
1-3 Years Experience
Cloud • Information Technology • Machine Learning
We empower creators and innovators with access to the GPU resources they need to work more efficiently.
The Role
CoreWeave is looking for a Hardware Engineer specializing in GPU and PCIe troubleshooting to join their team. Responsibilities include troubleshooting failures, collaborating with vendors, tracking RMAs, and optimizing server hardware infrastructure.
Summary Generated by Built In

CoreWeave is a specialized cloud provider, delivering a massive scale of GPU compute resources on top of the industry’s fastest and most flexible infrastructure. CoreWeave builds cloud solutions for compute-intensive use cases (VFX and rendering, machine learning and AI, batch processing, and Pixel Streaming) that are up to 35 times faster and 80% less expensive than the large, generalized public clouds. Learn more at www.coreweave.com.

CoreWeave is seeking a highly skilled and motivated Infrastructure/Hardware Engineer, focusing on GPU and PCIe troubleshooting, to join our Hardware Engineering team, reporting to the Director of Compute Architecture. In this role, you will play a crucial part in the design, development, troubleshooting, and optimization of our server hardware infrastructure. You will collaborate closely with cross-functional teams, external vendors, and stakeholders to ensure the successful delivery of highly performant and reliable hardware solutions.

Responsibilities:

  • Troubleshoot complex GPU and PCIe related failures.
  • Partner with external vendors on failure analysis.
  • Track component RMAs.
  • Develop and maintain hardware/firmware management services.
  • Automate all aspects of the server hardware lifecycle.
  • Serve as the senior point of contact for hardware escalation and troubleshooting.
  • Collaborate with cross-functional teams to define hardware requirements, specifications, and system architecture.
  • Create and maintain accurate documentation of hardware designs, specifications, test procedures, and results.
  • Analyze and optimize the performance of hardware systems, identify bottlenecks, and propose improvements for enhanced efficiency.
  • Establish processes for internal hardware testing, deployment, and performance optimization.

The ideal candidate will have at least 2 years of professional experience with the following:

  • Prior experience supporting and troubleshooting data center class GPUs (preferably A100 or newer).
  • Proficiency in Ansible and Python, and experience programmatically interacting with server BMCs using IPMI or Redfish (preferably Redfish).
  • Experience using, integrating, and automating data center class GPU diagnostics and troubleshooting tools.
  • In-depth knowledge of server hardware, components, and management technologies, particularly GPUs and PCIe devices.
  • Proven ability to stay updated with the latest industry technologies and trends.
  • Previous experience collaborating with hardware vendors.
  • Strong passion for automation, with a commitment to automating processes comprehensively.
  • Excellent documentation skills and attention to detail.
  • Strong analytical and problem-solving abilities.

Our compensation reflects the cost of labor across several US geographic markets. The base pay for this position ranges from $160,000/year in our lowest geographic market up to $210,000/year in our highest geographic market. Pay is based on a number of factors including market location and may vary depending on job-related knowledge, skills, and experience.  

Hybrid Workplace

Successful candidates will be expected to attend onboarding training at our NJ Headquarters for up to 2 weeks within their first month of employment, with subsequent quarterly travel of one week's duration.

If you reside within a 30-mile radius of our New Jersey, New York, Philadelphia, Sunnyvale, or Bellevue offices, we're excited for you to join us at the office at least three times a week, recognizing the significance we place on fostering connections, collaboration, and creativity within our office culture. Our commitment to operating as a hybrid workplace underscores our dedication to enabling our employees to tailor their work-life balance to their individual preferences.

Why CoreWeave?

At CoreWeave, we work hard, have fun, and move fast!  We’re in an exciting stage of hyper-growth that you will not want to miss out on. We’re not afraid of a little chaos, and we’re constantly learning. Our team cares deeply about how we build our product and how we work together, which is represented through our core values: 

  • Be Curious at your Core
  • Act like an Owner
  • Empower Employees
  • Deliver Best In-Class Client Experience 
  • Achieve More Together

We support and encourage an entrepreneurial outlook and independent thinking. We foster an environment that encourages collaboration and provides the opportunity to develop innovative solutions to complex problems. As we get set for takeoff, the growth opportunities within the organization are constantly expanding. You will be surrounded by some of the best talent in the industry, who will want to learn from you, too. Come join us!

Benefits

We offer a competitive salary and benefits, including:

  • Medical, dental and vision insurance - 100% paid for the employee
  • Company-paid life insurance
  • Voluntary supplemental life insurance 
  • Short and long-term disability insurance 
  • Flexible Spending Account
  • Tuition Reimbursement 
  • Mental Wellness Benefits through Spring Health 
  • Family-Forming support provided by Carrot
  • Paid Parental Leave 
  • Flexible, full-service childcare support with Kinside
  • 401(k) with a generous employer match
  • Flexible PTO
  • Catered lunch each day in our offices
  • A casual work environment
  • Work culture focused on innovative disruption

CoreWeave is an equal opportunity employer, committed to fostering an inclusive and supportive workplace. All qualified applicants and candidates will receive consideration for employment without regard to race, color, religion, sex, disability, age, sexual orientation, gender identity, national origin, veteran status, or genetic information.

As part of this commitment and consistent with the Americans with Disabilities Act (ADA), CoreWeave will ensure that qualified applicants and candidates with disabilities are provided reasonable accommodations for the hiring process, unless such accommodation would cause an undue hardship. If reasonable accommodation is needed, please contact: [email protected]


Top Skills

Ansible, Python

The Company
Roseland, NJ
600 Employees
Hybrid Workplace
Year Founded: 2017

Why Work With Us

At CoreWeave, we work hard, have fun, and move fast! Today we are a small, growing team of intelligent, genuine people who value different perspectives and approaches to solving complex problems. We foster an environment that champions collaboration and prioritizes innovative solutions. Here, you are surrounded by the best.

CoreWeave Offices

Hybrid Workspace

Employees engage in a combination of remote and on-site work.

Typical time on-site: Flexible
Roseland, NJ
