Site Reliability Engineer Resume Examples

site reliability engineer

  • Escalate issues as needed to the product development or service engineering team per documented procedures, while establishing a contingency plan to prevent intermittent service disruption.
  • Document and detail areas of improvement to bolster architecture, design, technical requirements and service specifications.
  • Present architecture, design, and technical choices to internal audiences.
  • Design and deploy metrics, monitoring, and logging systems on AWS / Infra systems to understand system performance and isolate bottlenecks.
  • Help drive efforts to improve triage time and bring down MTTR (Mean Time to Repair), and provide follow-up support to mitigate similar incidents in the future.
  • Proactively monitor availability and performance of the SAP ARIBA cloud products using the required toolset.
  • Respond effectively to monitoring alerts, incident tickets, email requests, and requests from other channels coming in to the Site Reliability Engineering team.

senior site reliability engineer

  • Design and DevOps implementation of a multi-tenant Kubernetes cluster running a set of open ecosystem tools (Calico, nginx-ingress, Fluentd, Prometheus, kube2iam, LDAP auth, etc.).
  • Authoring of configuration management procedures, workflows, and playbooks.
  • Design and execution of management procedures and configuration standards.
  • Implementations based on DevOps tools (Ansible, Terraform, AWS API).
  • Implementation of e2e tests for infrastructure CI and CD processes based on the pytest framework.

site reliability engineer

  • Ensured production service availability with maximum uptime for Adobe Campaign. Developed tools to improve production system uptime and meet the product SLA.
  • Automated production deployment using Ansible.
  • Infrastructure automation and orchestration.
  • Automation of daily ad hoc manual processes.
  • Experience in development, deployment and scaling systems across DC and Cloud infrastructure.
  • Production system monitoring, incident management, and server capacity management.
  • Troubleshoot operational and application issues and fix them within the SLA.

site reliability engineer

  • Query Analyzer: sniffs packets (using pcap) on the Ethernet interface, decodes each packet using the MySQL client-server protocol, calculates the checksum of the query, and sends aggregated data to the centralized server.
  • Built a UI on top of the above data to present meaningful information.
  • Wrote control scripts for new processes.
  • CVE tracking and security-related fixes in infrastructure.
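
The checksum-and-aggregate step of the Query Analyzer bullet above could look roughly like the sketch below. It is illustrative only: packet capture and MySQL protocol decoding are left out, and the normalization rules, the choice of MD5, and the names `normalize`, `checksum`, and `aggregate` are assumptions, not details taken from the resume.

```python
import hashlib
import re
from collections import Counter

# Assumed normalization rules: replace quoted strings and numeric literals
# with "?" so that queries differing only in their values share a checksum.
_STRING = re.compile(r"'(?:[^'\\]|\\.)*'|\"(?:[^\"\\]|\\.)*\"")
_NUMBER = re.compile(r"\b\d+(?:\.\d+)?\b")
_SPACE = re.compile(r"\s+")


def normalize(query: str) -> str:
    """Strip literals, collapse whitespace, and lowercase the query text."""
    q = _STRING.sub("?", query)
    q = _NUMBER.sub("?", q)
    return _SPACE.sub(" ", q).strip().lower()


def checksum(query: str) -> str:
    """Return an MD5 hex digest of the normalized query (hash choice is an assumption)."""
    return hashlib.md5(normalize(query).encode("utf-8")).hexdigest()


def aggregate(queries) -> Counter:
    """Count how often each query checksum occurs in an iterable of SQL strings."""
    counts = Counter()
    for q in queries:
        counts[checksum(q)] += 1
    return counts
```

Grouping by checksum rather than by raw text keeps the aggregated payload small, since `SELECT ... WHERE id = 1` and `SELECT ... WHERE id = 42` collapse into one entry.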

site reliability engineer

  • Built from scratch a web application for infrastructure inventory management using the LAMP stack.
  • Developed microservices in Golang for ETL jobs and data collection.
  • Created a web application using Django and JavaScript/jQuery, with D3.js for graphs and ag-grid for reports.
  • Performed data analyses and reported key insights using Python modules like NumPy, Pandas, Matplotlib, …
  • Maintained the project's Service Level Agreements (SLAs) with respect to key Service Level Indicators (SLIs).
  • Created CI/CD pipeline setup and followed Test Driven Development (TDD). 

sr. site reliability engineer

  • Work closely with business, engineering, and third-party financial institutions to discuss process design, planning, and implementation.
  • Create new and refactor old business-critical batch jobs in a centralized platform for proper dependency linking, management and alerting.
  • Configure infrastructure and application monitoring across a multi-OS environment.
  • Support and maintain F5 LTMs and work closely with the network team on traffic load-balancing topologies.
  • Manage GPOs and user and group policies using Windows Server.
  • Assist in efforts to migrate application/server farms to new co-locations.
  • Work with engineering on new application build-out initiatives and assist in deploying their applications to production.

site reliability engineer

  • Coordinate and collaborate with different engineering and operations teams to ensure the successful completion of a high heat product launch.
  • Define new processes and workflows to improve the efficiency of deliverables in this continuously evolving domain.
  • Ensure the different AWS services utilized by underlying domains are scaled out as scheduled.
  • Analyze the consumer incidents reported during the purchase of the high heat product and work with engineering teams to fix them.
  • Increase the observability of underlying services using tools like SignalFx, Splunk, New Relic, etc.
  • Identify the SLIs for the different microservices and ensure that they stay within SLA and meet their SLOs.

site reliability engineer

  • Develop the CI/CD roadmap and operations processes within the team.
  • Help scale our platform to more than 200x customers.
  • Improve the deployment process within AWS (e.g., cross-region automated deployment).
  • AWS services administration: IAM, VPC, Route 53, EC2, S3, CodeBuild, CodeDeploy, RDS, CloudWatch.
  • VCS: Stash.
  • Automation tools: Puppet, Ansible.
  • Docker, Vagrant.

site reliability engineer

  • SME for Business Operations, Risk and Investor Servicing processes and automation.
  • Responsible for proper documentation of job workflows, dependencies and impact.
  • Participate in a 24/7 Site Reliability on-call schedule for production issues.
  • Scripting: Python

site reliability engineer

  • Assist with development and implementation of SRE solutions for large-scale distributed web applications across multiple tiers.
  • Perform proactive daily system monitoring, including reviewing system and application logs, as well as responding to, triaging, troubleshooting, and remediating incidents.
  • Build monitoring and automation tools; develop dashboards using an APM tool.
  • Experience with UNIX/Linux systems administration, networking concepts, and database query languages (SQL).
  • Responsible for setting up the ELK (Elasticsearch, Logstash, Kibana) platform and parsing unstructured logs into structured JSON using regular expressions.
  • Act as top-tier on-call support for uptime-critical business applications to maintain availability and minimize downtime during outage scenarios.
  • Experience developing applications in Java and JDBC with tools like Eclipse as part of automation work.
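
As a rough illustration of the log-parsing bullet above, the sketch below turns an unstructured log line into structured JSON with a regular expression. In an ELK pipeline this step is usually handled inside Logstash (for example with a grok filter); Python is used here only to show the idea. The log layout, the field names, and the helpers `parse_line` and `to_json` are hypothetical.

```python
import json
import re

# Hypothetical log layout: "<date> <time> <LEVEL> [<service>] <message>"
LOG_PATTERN = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"\[(?P<service>[^\]]+)\] "
    r"(?P<message>.*)$"
)


def parse_line(line: str):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    return match.groupdict()


def to_json(line: str) -> str:
    """Serialize a log line as JSON, flagging lines that could not be parsed."""
    record = parse_line(line)
    if record is None:
        # Keep the raw text so unparsed lines remain searchable downstream.
        record = {"message": line.strip(), "parse_error": True}
    return json.dumps(record)
```

Keeping unmatched lines with a `parse_error` flag, rather than dropping them, makes gaps in the pattern visible once the records are indexed.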