Changes from 3 commits
87 changes: 87 additions & 0 deletions src/handbook/peopleops/job-descriptions/devops-engineer.md
@@ -0,0 +1,87 @@
---
navTitle: DevOps Engineer
navGroup: Job Descriptions
---

# DevOps Engineer
Member

Now this is mostly a SysOps role, where DevOps would be a Developer + Operations. Has the Job Description just changed to be the same as old school sysops?

Contributor Author

This is almost entirely based on the job description we posted when we hired previously.


## Job Description

The DevOps Engineer at FlowFuse plays a critical role in building and maintaining the infrastructure that powers FlowFuse, ensuring reliability, scalability, and security for our customers' industrial automation and IIoT solutions. Additionally, a DevOps Engineer at FlowFuse should always be practicing our [Iterative Improvement](../../company/values.md#🔁-iterative-improvement) value, seeking areas for automation and refinement in our own internal practices and processes.

This role combines deep technical expertise with a customer-focused mindset to ensure reliable infrastructure and automation. It also depends on close collaboration with our engineering teams to create robust, automated systems that enable them to deliver value quickly and reliably.

As a DevOps Engineer, you'll be responsible for building and managing our cloud infrastructure, automating deployment processes, and building tools that improve both developer productivity and customer experience. You'll work closely with our engineering teams to understand their needs and translate them into ideas on how automation can help them. This position requires a balance of technical depth, automation expertise, and strong collaboration skills to support both FlowFuse customers and internal engineering teams.

Key Responsibilities:

* **Infrastructure Management & Automation**: Design, implement, and maintain AWS-based infrastructure using Infrastructure as Code (IaC) principles. Build and maintain CI/CD pipelines that enable rapid, reliable deployments while maintaining high security standards.
* **Platform Reliability & Monitoring**: Keep your finger on the pulse of the platform's health at all times, and implement comprehensive monitoring, logging, and alerting systems to ensure platform stability and performance. Develop automated incident response procedures and conduct root cause analysis for production issues.
* **Developer Experience & Tooling**: Build and maintain development tools, automation scripts, and internal services that improve developer productivity and reduce friction in the development process. This should be a key focus of the role: you should amplify those around you by removing friction with automation, and always seek to identify areas for refinement in our own internal practices and processes.
* **Security & Compliance**: Implement security best practices across all infrastructure components, ensuring compliance with industry standards, security policies, and customer requirements.
* **Customer-Focused Operations**: Respond to incidents quickly and understand the needs of our customers. Iterate on onboarding processes for FlowFuse Self-Hosted and Dedicated to reduce friction and improve the customer experience.
* **Collaboration & Knowledge Sharing**: Work closely with engineering teams to understand their infrastructure needs, provide technical guidance, and share knowledge through documentation and mentoring. Ensure we learn from incidents when they happen: document our learnings and implement changes to prevent similar incidents from happening again.

## Skills

The DevOps Engineer skill set includes:

### Must Have

* **Cloud Infrastructure Expertise**: 4-6 years of hands-on experience with AWS services including EC2, EKS, RDS, S3, CloudFront, and IAM. Strong understanding of cloud architecture patterns and best practices.
Member

Do we want to allow Azure or other clouds? Most of the tools are about the same I think?

Contributor Author

I had Azure/GCP as "Nice to have". Whilst I'm sure there is overlap, we are an AWS shop, and I know that having a grasp of AWS costing is an absolute art, so it feels like AWS should be "must have" and the others "nice to haves".

* **Kubernetes Proficiency**: Experience with Kubernetes API, container orchestration, and managing containerized applications in production environments.
* **Node.js & JavaScript**: Solid experience with Node.js development and JavaScript ecosystem, enabling effective collaboration with our engineering teams.
* **CI/CD & Automation**: Proven experience building and maintaining CI/CD pipelines using tools like GitHub Actions, Jenkins, or similar platforms.
* **Infrastructure as Code**: Experience with Terraform, CloudFormation, or similar IaC tools for managing cloud infrastructure.
* **Monitoring & Observability**: Experience implementing monitoring solutions using tools like Prometheus, Grafana, ELK stack, or similar observability platforms.
* **Linux System Administration**: Strong Linux skills including shell scripting, system configuration, and troubleshooting.
* **Git & Version Control**: Proficiency with Git workflows and collaborative development practices.

### Nice to Have

* **Observability Tools**: Experience deploying and managing observability tools like DataDog, Sentry, or similar APM and monitoring solutions.
* **Database Management**: Experience with PostgreSQL, MySQL, or other database systems including backup, recovery, and performance optimization.
Member

So Git is a must have, and PG a nice to have? That's surprising in my mind

Contributor Author

Fair, will pull it up into "Must Have"

* **Security Best Practices**: Knowledge of security frameworks, vulnerability management, and compliance requirements (SOC 2, ISO 27001).
* **Multi-cloud Experience**: Experience with other cloud providers (Azure, GCP) or hybrid cloud environments.
* **Industrial/IIoT Background**: Understanding of industrial automation protocols, edge computing, or IoT device management.
* **Python/Go Development**: Additional programming language experience for building internal tools and automation scripts.
Member

Why not use Node.JS for this? For internal tools I don't think it's worth adding a new language?

Contributor Author

Yeah, we can scope this to NodeJS, but also, why be opinionated for internal tooling? If we're building utility apps and automations, I'd be hoping people use the best tools available, rather than forcing it into NodeJS?

* **Team Leadership**: Experience mentoring junior engineers or leading infrastructure initiatives in larger teams.

## 90-Day Plan

* **Week 1-4: Foundation & FlowFuse Immersion**
    * **Infrastructure Assessment**: Conduct a comprehensive review of existing AWS infrastructure, CI/CD pipelines, and monitoring systems
    * **Team Integration**: Meet with engineering teams to understand their workflows, pain points, and infrastructure needs
    * **Documentation Review**: Study existing infrastructure documentation and incident response procedures
    * **Install FlowFuse**: Install FlowFuse in a variety of environments, and provide feedback on the experience and areas of improvement
    * **Tool Familiarization**: Get hands-on experience with FlowFuse's current toolchain and development processes
    * **Initial Improvements**: Implement quick wins to improve developer experience, system reliability and onboarding experience for FlowFuse
Member

This is not a really ambitious plan. With a developer I expect the first week to have a PR merged, and pick up the pace from there. The first 5 bullet points are studying and getting to know items. I fear we hire someone that just studies for weeks without starting to iterate on the infra from day one.

I suspect that anyone can find a broken window in their first week: https://en.wikipedia.org/wiki/Broken_windows_theory

Impact from day 2 onwards I'd say.

Contributor Author

Sensible, though I'd argue in that case many of our other job descriptions we already have published need to be more aggressive to align to this too.


* **Week 5-8: Infrastructure Enhancement & Automation**
    * **CI/CD Optimization**: Enhance existing deployment pipelines with better testing, security scanning, and rollback capabilities
Member

Currently the release seems very manual, with multiple engineers in a room together each time. Consider moving this up to week 1-4.

Contributor Author

Happy to move it up. Release is only 2 people (because one is running the scripts, and the second has to approve PRs opened before they can be merged). It generally takes 45 minutes of active work (not counting the 20 minutes in the middle where tests run).

    * **Monitoring Implementation**: Deploy comprehensive monitoring and alerting for critical systems and customer-facing services
Member

We have the foundation already, I'm not sure what the task is here.

    * **Automation Development**: Build scripts and tools to automate common operational tasks and reduce manual intervention
    * **Performance Optimization**: Establish performance benchmarks and implement optimizations to improve response times and resource utilization
    * **Security Hardening**: Tackle security issues as they arise, and implement additional security measures and compliance controls across the infrastructure
    * **Knowledge Sharing**: Begin documenting processes and sharing knowledge with the engineering team
Member

"Knowledge is Power" + contributions in the first week imply that this should be done in the first weeks.


* **Week 9-13: Strategic Impact & Innovation**
    * **Infrastructure Scaling**: Design and implement solutions to support FlowFuse's growth and increasing customer demands
    * **Disaster Recovery**: Implement comprehensive backup and disaster recovery procedures
    * **Cost Optimization**: Analyze cloud costs and implement strategies to optimize spending while maintaining performance
    * **Advanced Monitoring**: Deploy advanced observability tools and create bespoke dashboards for better system visibility
    * **Process Improvement**: Lead initiatives to improve operational processes and reduce mean time to recovery (MTTR)

## Hiring Plan

1. **Initial Screening**: Review resumes and cover letters to assess technical qualifications and experience alignment with FlowFuse's needs.

2. **Technical Interview (Infrastructure & Automation)**: Video interview focusing on AWS expertise, Kubernetes knowledge, CI/CD experience, and problem-solving approach to infrastructure challenges.

3. **System Design Interview**: Present candidates with real-world scenarios involving scaling challenges, incident response, or infrastructure optimization to assess their architectural thinking.

4. **STAR Interview**: Behavioral interview focusing on past situations, tasks, actions, and results to understand problem-solving abilities and value alignment.

5. **Final Interview**: A final interview with key stakeholders or other members of the leadership team.

6. **Offer**: Extend an offer to the selected candidate.