ID
#44408935
Job type
Permanent
Salary
TBD
Source
First American Financial Corporation
Date
2022-07-26
Deadline
2022-09-24

Remote Lead Site Reliability Engineer

Carson City, Nevada, 89701

Vacancy expired!

Company Summary

Join a team that puts its People First! As a member of the First American family of companies, First American Trust is a federal savings bank that has provided banking, wealth management, and trust solutions on a national, full-service basis for more than five decades. Since 1889, First American (NYSE: FAF) has held an unwavering belief in its people. They are passionate about what they do, and we are equally passionate about fostering an environment where all feel welcome, supported, and empowered to be innovative and reach their full potential. Our inclusive, people-first culture has earned our company numerous accolades, including being named to the Fortune 100 Best Companies to Work For® list for seven consecutive years. We have also earned awards as a best place to work for women, diversity and LGBTQ+ employees, and have been included on more than 50 regional best places to work lists. First American will always strive to be a great place to work, for all. For more information, please visit www.careers.firstam.com.

Job Summary

Remote Candidates Welcome

At First American Trust, we are working on new technologies supporting the banking and investment industries. We are currently looking for a Lead Site Reliability Engineer to help us establish an SRE practice. You will work with development teams to implement automated solutions, using technologies such as Kubernetes, Terraform, and GitLab, that automatically build, test, integrate, and deploy scalable and secure applications in the cloud. As a Lead Site Reliability Engineer, you will be a key part of the software development and product teams and will be responsible for stabilizing software and platform infrastructure deployment. The ideal candidate applies strong engineering experience and an innate drive to improve existing systems and processes, with the creativity to develop novel solutions to evolving challenges. The role is responsible for the availability, reliability, integrity, and efficient operation of critical platform services and applications, ensuring they meet the requirements of internal and external users. This is achieved by monitoring, maintaining, automating, and developing solutions that focus on uninterrupted delivery of applications throughout the software lifecycle.

Responsibilities

Architect and design monitoring and recovery solutions to provide optimum delivery and resilience.
Efficiently handle live production incidents, debug and troubleshoot application and cloud infrastructure issues, and follow and implement SRE best practices.
Measure and monitor application performance, take steps to improve overall application performance and stability, and follow through with implementation.
Advise internal and partner groups on establishing Service Level Objectives, Service Level Indicators, and Error Budgets to ensure better reliability.
Build end-to-end "monitoring infrastructure" (Logging, Metrics, Tracking) through automation and work closely with the other team members to provide the right tooling to measure the reliability of our systems.
Collaborate with development and product teams to ensure availability and reliability of the applications and infrastructure.
Maintain effective knowledge base and runbooks to bring faster resolution to production issues.
Lead technical evaluations and proof-of-concept programs to evaluate and implement new technologies and tools.
Play a principal role in evaluating products that provide SRE services for cost appropriateness and technical feasibility.
Lead cross-functional technical teams on complex infrastructure projects and associated service offerings.
Serve as a principal contributor to the design, development, and implementation of automated departmental workflow processes.
Serve as an escalation point for Systems Administrators, Engineers, and other technology teams in the resolution of server and system problems.
Apply expert analytical and problem-solving skills to troubleshoot cloud infrastructure problems across a wide array of technical disciplines.
Perform duties outside of normal work hours as business needs require.

Job Qualifications

Bachelor's degree in Computer Science
7+ years of hands-on experience in an application and technical support role in a live production environment, following Development, DevOps, and SRE best practices.
4+ years of hands-on experience with monitoring tools such as Splunk, AppDynamics, ELK, Azure Monitor, Microsoft SCOM, etc.
Knowledge of Azure Service Bus, Azure Kubernetes Service, Azure APIM, Azure Functions, and GitLab preferred.
Proven ability to design and execute zero-downtime cloud deployments (blue-green, rolling, or canary) for workloads such as AKS clusters, Azure Functions, etc.
Experience with containerization and pipeline administration.
Experience with automation using PowerShell, Python, Terraform, Azure CLI / ARM templates, Bash scripting or similar technologies preferred.
A high level of intellectual curiosity and general professionalism is necessary.
Experience with Scrum/Agile is a plus.

For candidates located in Colorado, the salary range is $94,798 - $152,900.

First American invests in its employees' development and well-being, empowers them to provide superior customer service and encourages them to serve the communities where they live and work. First American is committed to diversity and inclusion. We are an equal opportunity employer.

Based on eligibility, First American offers a comprehensive benefits package including medical, dental, vision, 401(k), PTO/paid sick leave, and other great benefits like an employee stock purchase plan.
