What You Will Do
The High Performance Computing (HPC) Division provides high performance computing systems services to the Laboratory. Our work starts with the early phases of acquisition, development, and production readiness of HPC platforms, and continues through the maintenance and operation of these systems and the facilities in which they are housed. HPC Division also manages the network, parallel file systems, storage, and visualization infrastructure associated with the HPC platforms. The Division directly supports the Laboratory's HPC user base and aids, at multiple levels, in the effective use of HPC resources to generate science. Additionally, we support selected research activities that we deem important to our mission.
Working closely with other groups in HPC, the HPC Design Group is responsible for the research, development, and deployment of the full environment of future supercomputing at LANL. The Los Alamos National Laboratory (LANL) High Performance Computing (HPC) Design Group is searching for a strong contributor to help drive the future of supercomputing, artificial intelligence (AI), and machine learning (ML). The Laboratory is preparing for a recently announced 2023 "first of its kind" system utilizing NVIDIA Grace processors and GPGPUs (see https://www.lanl.gov/discover/news-release-archive/2021/April/0412-nvidia.php) that will provide 10x performance improvements in training giant AI models. If you have the drive, intellect, and skill sets discussed below, we look forward to your application.
This position focuses on supporting applications as they transition traditional simulation workflows to integrate aspects of artificial intelligence as well as supporting dedicated AI/ML applications. The successful candidate will understand both application development and HPC system software to assist in the resolution of issues encountered by AI/ML applications, in the optimization of such applications for better performance, and in helping port LANL applications to these new leading edge HPC systems.
This position will be filled at either the Scientist 2 or Scientist 3 level, depending on the skills of the selected candidate. Additional job responsibilities (outlined below) will be assigned if the candidate is hired at the higher level.
Scientist 2 $94,100 - $155,700
The successful candidate will perform tasks including but not limited to:
Advise application developers on how best to take advantage of new HPC system features while integrating AI/ML workflows into applications. Expected features of interest include hardware accelerators, high-bandwidth memory, system topology, vendor-specific AI/ML frameworks, and containers.
Assist in the resolution of issues with production application codes running on HPC platforms, drawing on knowledge of compilers, MPI, code profiling, debugging, benchmarking, and related techniques.
Assist with efforts to tune applications for better performance and throughput, drawing on profiling, scripting, MPI, interconnect, and related skills.
Contribute to technical journal papers, reports, presentations, and concept papers on application usage of large simulation and data analytics resources.
Coordinate with production system administrators in resolving problems encountered by applications running on HPC systems using parallel debugging and Linux skills and knowledge.
Develop excellent working relationships with application code teams, HPC tools and user support teams, systems design, integration, and production support teams, and infrastructure-related teams.
Represent the aims and needs of LANL applications on HPC/AI/ML systems at internal venues.
Scientist 3 $113,100 - $190,900
In addition to what was outlined at the lower level, the Scientist 3 will be required to:
Advise application developers on integrating AI/ML workflows into existing applications as well as dedicated AI/ML applications, and on how best to take advantage of new system features while doing so.
Lead resolution of issues with production application codes and workflows running in HPC environments.
Be an expert resource on a subset of the LANL applications code base.
Lead efforts to tune LANL applications for better performance and throughput.
Assist the primary interface for applications when integrating new leading-edge systems (both LANL and LLNL systems) for LANL codes.
Author technical journal papers, reports, presentations, and concept papers on application usage of large computational resources.
Represent the aims and needs of LANL applications on HPC/AI/ML systems at internal and national venues.
What You Need
Minimum Job Requirements:
Demonstrated experience with applying ML/AI techniques.
Strong interpersonal and communication skills.
Advanced knowledge and proven ability in formulating and presenting results to technical audiences and readerships.
Strong C or C++ programming skills.
Experience writing, running, tuning, and debugging codes employing the Message Passing Interface (MPI) in a multi-node context.
Experience using software engineering tools such as Git, Autotools, and CMake.
Experience using one or more common scripting languages such as Python, Julia, Bash, or R.
Experience using the Linux command line and standard tools (awk, sed, vim/nano, ssh, environment variables).
Additional Job Requirements for Scientist 3:
In addition to the requirements outlined above, qualification at the Scientist 3 level requires:
Authorship of technical journal papers, reports, presentations, and concept papers on application usage of large simulation resources.
Experience working with large-scale software projects written in C and/or C++.
Experience leading code reviews and software organization and development efforts.
Experience organizing large multi-organization meetings.
Track record providing deep problem resolution assistance in a parallel computational context.
Education/Experience at Scientist 2 level: Position requires a Bachelor's degree in a STEM field from an accredited college or university, and 4 years of related experience.
Education/Experience at Scientist 3 level: Position requires a Master's degree in a STEM field from an accredited college or university and 6 years of relevant experience or an equivalent combination of education and experience directly related to the position.
Desired Qualifications:
Experience using PyTorch, TensorFlow, or other AI/ML frameworks.
Experience with application-specific or closed-source AI/ML technologies (RAPIDS, GPT-3, AlphaFold).
Experience with standard AI/ML data sets and benchmarks (MLPerf, MNIST).
Experience applying AI/ML techniques to scientific problems.
Knowledge of memory technologies such as HBM2 or GPU memory hierarchy and their effects on application performance.
Experience using parallel debuggers and profilers (TAU, HPCToolkit, GDB, Nsight Systems, Nsight Compute, TotalView, Cachegrind, VTune).
Experience using MPI in combination with other parallel programming models on heterogeneous systems (MPI + threads + GPU).
Experience with structured data storage solutions (SQL, Redis, HDF5, JSON).
Knowledge and experience with large production applications and their migration to new system architectures.
Demonstrated experience using, tuning, and debugging HPC systems and/or AI/ML applications at scale.
Basic understanding of and experience with containerization software (Docker, Singularity, Charliecloud) and remote image repositories.
Demonstrated experience with large scale systems integration.
Location: This position will be located in Los Alamos, NM.
COVID Vaccine:
The COVID vaccine is mandatory for all Laboratory employees, on-site contractors, and on-site subcontractors unless granted an accommodation under applicable state or federal law. This requirement will apply to those working on-site, those teleworking, and all new hires.
Position commitment: Regular appointment employees are required to serve a period of continuous service in their current position in order to be eligible to apply for posted jobs throughout the Laboratory. If an employee has not served the time required, they may only apply for Laboratory jobs with the documented approval of their Division Leader. The position commitment for this position is 1 year.
Note to Applicants:
For full consideration, please submit a comprehensive cover letter that addresses each key requirement of the position.
Where You Will Work
Located in beautiful northern New Mexico, Los Alamos National Laboratory (LANL) is a multidisciplinary research institution engaged in strategic science on behalf of national security. Our generous benefits package includes:
PPO or High Deductible medical insurance with the same large nationwide network
Dental and vision insurance
Free basic life and disability insurance
Paid childbirth and parental leave
Award-winning 401(k) (6% matching plus 3.5% annually)
Learning opportunities and tuition assistance
Flexible schedules and time off (paid sick, vacation, and holidays)
Onsite gyms and wellness programs
Extensive relocation packages (outside a 50-mile radius)
Additional Details
Directive 206.2 - Employment with Triad requires a favorable decision by NNSA indicating employee is suitable under NNSA Supplemental Directive 206.2. Please note that this requirement applies only to citizens of the United States. Foreign nationals are subject to a similar requirement under DOE Order 142.3A.
Clearance: Q (Position will be cleared to this level). Applicants selected will be subject to a Federal background investigation and must meet eligibility requirements* for access to classified matter. This position requires a Q clearance which requires US Citizenship except in extremely rare circumstances. Dependent upon position, additional authorization to access nuclear weapons information may be required that may or may not be available to dual citizens depending upon the circumstances.
Eligibility requirements: To obtain a clearance, an individual must be at least 18 years of age; U.S. citizenship is required except in very limited circumstances. See DOE Order 472.2 for additional information.
New-Employment Drug Test: The Laboratory requires successful applicants to complete a new-employment drug test and maintains a substance abuse policy that includes random drug testing.
Regular position: Term status Laboratory employees applying for regular-status positions are converted to regular status.
Internal Applicants: Regular appointment employees who have served the required period of continuous service in their current position are eligible to apply for posted jobs throughout the Laboratory. If an employee has not served the required period of continuous service, they may only apply for Laboratory jobs with the documented approval of their Division Leader. Please refer to Policy P701 for applicant eligibility requirements.
Equal Opportunity: Los Alamos National Laboratory is an equal opportunity employer and supports a diverse and inclusive workforce. All employment practices are based on qualification and merit, without regard to race, color, national origin, ancestry, religion, age, sex, gender identity, sexual orientation or preference, marital status or spousal affiliation, physical or mental disability, medical conditions, pregnancy, status as a protected veteran, genetic information, or citizenship within the limits imposed by federal laws and regulations. The Laboratory is also committed to making our workplace accessible to individuals with disabilities and will provide reasonable accommodations, upon request, for individuals to participate in the application and hiring process. To request such an accommodation, please send an email to [email protected] or call 1-505-665-4444 option 1.
Employment Status: Full Time