Site Reliability Engineer
IronNet Cybersecurity Inc
Fulton, MD or McLean, VA
Job Description
As a Site Reliability Engineer at IronNet, you will be a technology leader in a dynamic team continuously improving the reliability and scalability of our AWS and on-prem deployment platforms, ensuring the highest levels of uptime and performance.
Responsibilities
- Troubleshoot issues across IronNet’s large-scale, complex product stack, including hardware, software, application, and network. Perform deep dives into both systemic and latent reliability issues; partner with software and systems engineers across the organization to produce and roll out fixes.
- Lead post-sales site surveys and implement scalable, robust solutions that meet the needs of IronNet customers.
- Drive standardization, enhanced monitoring, troubleshooting best practices, and automation efforts across the organization.
- Recommend and make improvements to our existing production and staging environments.
Technical Requirements
- Fundamental knowledge of operating systems, networking, and distributed systems
- Expert-level Linux/UNIX systems administration, management, and performance tuning
- Deep understanding of: Ethernet, VLAN, IPv4/IPv6, ARP, DHCP, DNS, and TCP. Comfortable configuring DNS, DHCP, and LAN/WAN technologies.
- Demonstrable knowledge of TCP/IP, HTTP, web application security, and experience supporting multi-tier web application architectures
- Expert-level knowledge of at least one public or private cloud technology, such as AWS or OpenStack
- Familiarity with OS container technology: Docker, LXC, namespaces/cgroups
- Solid understanding of system, application, and service design, including messaging protocols and behavior, caching strategies, and software design practices
- Practical, solid knowledge of shell scripting and at least one higher-level language (Python or Ruby preferred). Ability to develop clean, tested, and maintainable automation and other tools using one or more of Python, Ruby, Perl, or Go
- Experience with distributed compute (e.g., Spark or Hadoop), storage (relational databases such as Postgres or MySQL; horizontally scalable non-relational databases such as HBase, Riak, or Cassandra), and search infrastructure (such as Elasticsearch or Solr/Lucene)
Other Requirements
- Minimum of 7 years managing services in an internet-scale *nix environment
- Previous experience in application operations (a.k.a. "site reliability engineering" or "production engineering") or in a large-scale 24/7 production environment as a software engineer, systems administrator, operations engineer, release engineer, or similar role
- Comfortable interfacing with customers as well as with engineering, product, and security teams. Excellent written communication, interpersonal, and documentation skills
- Must be adaptable and have the ability to prioritize tasks and work independently
- Excellent troubleshooting, diagnostic, and analytical skills
- B.S. in computer science or similar field desirable
- Must have the ability to travel up to 50% and provide after-hours support to customer environments