The ACCRE cluster team, which helps manage the >10,000-core Linux cluster and the 1.2 PB (usable) cluster GPFS filesystem, is seeking a senior systems administrator. This systems administrator will be responsible for the following projects: (1) maintaining, administering, and improving the current cluster environment, (2) exploring the use of virtualization technologies at ACCRE to offer new or improved services to users, and (3) researching, testing, configuring, and managing a cluster optimized for Big Data analysis.
Computing is emerging as a third paradigm for discovery, complementing theory and experiment. The Advanced Computing Center for Research and Education (ACCRE) is being built and operated by Vanderbilt faculty. Its mission is to allow Vanderbilt researchers to define, benefit from, and explore HPC capabilities.
The center manages a Linux cluster of more than 10,000 processors, comprising multiple computer architectures and over 12 PB of disk storage.
Duties and Responsibilities
Manage ACCRE's network and provide support:
Provide guidance on network design and implementation for ACCRE needs.
Set up / configure network hardware including switches, routers, and firewalls.
Provide operational and troubleshooting support for ACCRE networks.
Configure DHCP, DNS, firewalls, and general network security.
Set up and administer a new cluster optimized for Big Data analysis:
Install, configure, and maintain tools like Hadoop/HDFS, YARN, and Spark across the cluster for ease of access and use.
Install, configure, and maintain high-level tools like Pig, Hive, and Mahout to ease programming effort for researchers.
Work with ACCRE staff to provide a cluster environment that maintains the conventions and standards of the current ACCRE high-performance computing (HPC) cluster.
Help make decisions about the hardware required to meet the needs of researchers using the Big Data cluster.
Provide system administration for the ACCRE compute cluster:
Set up / configure cluster hardware including gateways, compute nodes, and cluster management infrastructure.
Install operating system and related utility software.
Monitor the status of the cluster using tools such as Nagios, including customizing the tools for ACCRE-specific needs.
Compile / install application software packages needed by researchers.
Assist with the administration of the cluster job scheduler, including modifying user limits, creating / modifying / deleting node reservations, and diagnosing issues with the job scheduler.
Assist with ACCRE's GPFS, a large-scale distributed data storage system.
Support DORS, a collaboration with the Center for Structural Biology and Vanderbilt IT on the deployment of GPFS on a DataDirect Networks system.
Other project responsibilities:
Respond to help desk tickets to solve user problems and to educate users on cluster usage.
On a rotating basis, serve as the on-call person for evening and weekend hours, for example on a rotating 4-week schedule or every other week in a Level 2 support rotation.
Occasionally work nights and weekends when needed for scheduled or unscheduled downtimes.
Compile documentation in a timely manner for all ACCRE projects and tasks, both for new projects and for changes to ongoing projects.
Physically move and lift hardware when needed.
Self-driven, inquisitive, and productive troubleshooting abilities.
Strong ability to work individually and in a team environment.
Commitment to continuous improvement.
Ability to adapt rapidly to both the current environment and a changing one.
Strong ability to share knowledge coherently with others.
Strong interpersonal skills.
Ability to work independently and make critical decisions.
Ability to gain knowledge/skills from both others in the group and independently.
Willingness to work outside of one's comfort zone effectively.
Familiarity and experience with disk storage hardware (construction and deployment): SAN, NAS, JBOD, RAID (all levels), and RAID controllers preferred.
Networking skills: strongly qualified candidates will possess a CCNA certification or equivalent professional experience.
DHCP and DNS configuration and management.
Computer/network security: firewall configuration (iptables and transparent bridging firewalls), analyzing packet dumps (tcpdump), virus detection, protection, and elimination, and intrusive and non-intrusive monitoring.
Networking protocols (TCP/IP, UDP, Ethernet, etc.), LAN/WAN topologies, and high-speed networking experience (40 and/or 100 Gbps).
Advanced routing protocols including BGP and OSPF.
Advanced knowledge of networking design, implementation, and support.
Knowledge of national and international high-speed research networks.
Profile of an Ideal Candidate
Bachelor's degree required; strongly preferred to be in computer science or computer engineering.
Five years of experience in system administration of UNIX-based operating systems required; ten years preferred.
Five years' experience with programming/scripting.
Two years of networking experience or CCNA certification.
Familiarity and experience with software-defined networking (SDN) is preferred.
One to two years or more of experience with virtualization and Big Data tools.
About Vanderbilt University
Vanderbilt University, located in Nashville, Tennessee, is a top-15 private research university offering a full range of undergraduate, graduate and professional degrees. Vanderbilt is situated on a 330-acre campus near the thriving city center, serving more than 12,000 students and employing almost 7,000 faculty and staff. The university's students, staff, and faculty frequently cite Nashville and the surrounding area as one of the many perks of being a part of the Vanderbilt community. Vanderbilt University is a place where your expertise will be valued, your knowledge expanded, and your abilities challenged. It is a place where your diversity is sought and celebrated. Vanderbilt was recently named as one of "America's Best Large Employers" and the top employer in Tennessee and the Nashville metropolitan area in 2019 (Forbes).
About Vanderbilt Benefits
Vanderbilt University offers a competitive, flexible benefits package including health, dental, vision, life, accidental death & dismemberment, disability insurance, paid time off, and a 403(b) retirement plan with employer match. Vanderbilt offers tuition assistance to employees, spouses, and dependent children.
Commitment to Equity, Diversity and Inclusion
Vanderbilt University is committed to achieving the goal of a diverse and inclusive academic community of faculty, staff, and students. We seek individuals who are committed to this goal and our campus values.
Internal Number: 1901879
About Vanderbilt University
Vanderbilt University is a center for scholarly research, informed and creative teaching, and service to the community and society at large. Vanderbilt will uphold the highest standards and be a leader in the quest for new knowledge through scholarship, the dissemination of knowledge through teaching and outreach, and the creative experimentation of ideas and concepts. In pursuit of these goals, Vanderbilt values most highly intellectual freedom that supports open inquiry, equality, compassion, and excellence in all endeavors.