Handbook: Job Description - DevOps Engineer #3870
base: main
Changes from 3 commits
5feed0d
ae9fa98
684418a
b23d942
811e97c
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,87 @@ | ||
--- | ||
navTitle: DevOps Engineer | ||
navGroup: Job Descriptions | ||
--- | ||
|
||
# DevOps Engineer | ||
|
||
## Job Description | ||
|
||
The DevOps Engineer at FlowFuse plays a critical role in building and maintaining the infrastructure that powers FlowFuse, ensuring reliability, scalability, and security for our customers' industrial automation and IIoT solutions. Additionally, a DevOps Engineer at FlowFuse should always be practicing our [Iterative Improvement](../../company/values.md#🔁-iterative-improvement) value, seeking areas for automation and refinement in our own internal practices and processes. | ||
|
||
This role combines deep technical expertise with a customer-focused mindset to ensure reliable infrastructure and automation. It also involves close collaboration with our engineering teams to create robust, automated systems that enable them to deliver value quickly and reliably. | ||
|
||
As a DevOps Engineer, you'll be responsible for building and managing our cloud infrastructure, automating deployment processes, and building tools that improve both developer productivity and customer experience. You'll work closely with our engineering teams to understand their needs and translate them into ideas on how automation can help them. This position requires a balance of technical depth, automation expertise, and strong collaboration skills to support both FlowFuse customers and internal engineering teams. | ||
|
||
Key Responsibilities: | ||
|
||
* **Infrastructure Management & Automation**: Design, implement, and maintain AWS-based infrastructure using Infrastructure as Code (IaC) principles. Build and maintain CI/CD pipelines that enable rapid, reliable deployments while maintaining high security standards. | ||
* **Platform Reliability & Monitoring**: You should have your finger on the pulse of the health of the platform at all times, and be able to implement comprehensive monitoring, logging, and alerting systems to ensure platform stability and performance. Develop automated incident response procedures and conduct root cause analysis for production issues. | ||
* **Developer Experience & Tooling**: Build and maintain development tools, automation scripts, and internal services that improve developer productivity and reduce friction in the development process. This should be a key focus of the role: you should be an amplifier of those around you by removing friction with automation, and you should always be seeking areas for refinement in our own internal practices and processes. | ||
* **Security & Compliance**: Implement security best practices across all infrastructure components, ensuring compliance with industry standards, security policies, and customer requirements. | ||
* **Customer-Focused Operations**: Respond to incidents quickly and understand the needs of our customers, iterating on onboarding processes for FlowFuse Self-Hosted and Dedicated to reduce friction and improve the customer experience. | ||
* **Collaboration & Knowledge Sharing**: Work closely with engineering teams to understand their infrastructure needs, provide technical guidance, and share knowledge through documentation and mentoring. Ensure we learn from incidents when they happen: document our learnings and implement changes to prevent similar incidents from happening again. | ||
|
||
## Skills | ||
|
||
The DevOps Engineer skill set includes: | ||
|
||
### Must Have | ||
|
||
* **Cloud Infrastructure Expertise**: 4-6 years of hands-on experience with AWS services including EC2, EKS, RDS, S3, CloudFront, and IAM. Strong understanding of cloud architecture patterns and best practices. | ||
Do we want to allow Azure or other clouds? Most of the tools are about the same I think?
I had Azure/GCP as "Nice to have", whilst I'm sure there is overlap, it feels like given we are an AWS shop, and I know having a grasp of AWS costing is an absolute art, that having AWS should be "must have", the others are "nice to haves" |
||
* **Kubernetes Proficiency**: Experience with Kubernetes API, container orchestration, and managing containerized applications in production environments. | ||
* **Node.js & JavaScript**: Solid experience with Node.js development and JavaScript ecosystem, enabling effective collaboration with our engineering teams. | ||
joepavitt marked this conversation as resolved.
|
||
* **CI/CD & Automation**: Proven experience building and maintaining CI/CD pipelines using tools like GitHub Actions, Jenkins, or similar platforms. | ||
* **Infrastructure as Code**: Experience with Terraform, CloudFormation, or similar IaC tools for managing cloud infrastructure. | ||
* **Monitoring & Observability**: Experience implementing monitoring solutions using tools like Prometheus, Grafana, ELK stack, or similar observability platforms. | ||
* **Linux System Administration**: Strong Linux skills including shell scripting, system configuration, and troubleshooting. | ||
* **Git & Version Control**: Proficiency with Git workflows and collaborative development practices. | ||
|
||
### Nice to Have | ||
|
||
* **Observability Tools**: Experience deploying and managing observability tools like Datadog, Sentry, or similar APM and monitoring solutions. | ||
joepavitt marked this conversation as resolved.
|
||
* **Database Management**: Experience with PostgreSQL, MySQL, or other database systems including backup, recovery, and performance optimization. | ||
So Git is a must have, and PG a nice to have? That's surprising in my mind
Fair, will pull it up into "Must Have" |
||
* **Security Best Practices**: Knowledge of security frameworks, vulnerability management, and compliance requirements (SOC 2, ISO 27001). | ||
* **Multi-cloud Experience**: Experience with other cloud providers (Azure, GCP) or hybrid cloud environments. | ||
* **Industrial/IIoT Background**: Understanding of industrial automation protocols, edge computing, or IoT device management. | ||
* **Python/Go Development**: Additional programming language experience for building internal tools and automation scripts. | ||
Why not use Node.JS for this? For internal tools I don't think it's worth adding a new language?
Yeah, we can scope this to NodeJS, but also, why be opinionated for internal tooling? If we're building utility apps and automations, I'd be hoping people use the best tools available, rather than forcing it into NodeJS? |
||
* **Team Leadership**: Experience mentoring junior engineers or leading infrastructure initiatives in larger teams. | ||
|
||
## 90-Day Plan | ||
|
||
* **Week 1-4: Foundation & FlowFuse Immersion** | ||
    * **Infrastructure Assessment**: Conduct a comprehensive review of existing AWS infrastructure, CI/CD pipelines, and monitoring systems | ||
    * **Team Integration**: Meet with engineering teams to understand their workflows, pain points, and infrastructure needs | ||
    * **Documentation Review**: Study existing infrastructure documentation and incident response procedures | ||
    * **Install FlowFuse**: Install FlowFuse in a variety of environments, and provide feedback on the experience and areas of improvement | ||
    * **Tool Familiarization**: Get hands-on experience with FlowFuse's current toolchain and development processes | ||
    * **Initial Improvements**: Implement quick wins to improve developer experience, system reliability and onboarding experience for FlowFuse | ||
This is not a really ambitious plan. With a developer I expect the first week to have a PR merged, and pick up the pace from there. The first 5 bullet points are studying and getting to know items. I fear we hire someone that just studies for weeks without starting to iterate on the infra from day one. I suspect that anyone can find a broken window in their first week: https://en.wikipedia.org/wiki/Broken_windows_theory Impact from day 2 onwards I'd say.
Sensible, though I'd argue in that case many of our other job descriptions we already have published need to be more aggressive to align to this too. |
||
|
||
* **Week 5-8: Infrastructure Enhancement & Automation** | ||
    * **CI/CD Optimization**: Enhance existing deployment pipelines with better testing, security scanning, and rollback capabilities | ||
Currently the release seems very manual, with multiple engineers in a room together each time. Consider moving this up to week 1-4.
Happy to move it up. Release is only 2 people (because one is running the scripts, and the second has to approve PRs opened before they can be merged). Generally taking 45 minutes of active work (not accounting for 20 mins in the middle where tests run) |
||
    * **Monitoring Implementation**: Deploy comprehensive monitoring and alerting for critical systems and customer-facing services | ||
We have the foundation already, I'm not sure what the task is here. |
||
    * **Automation Development**: Build scripts and tools to automate common operational tasks and reduce manual intervention | ||
    * **Performance Optimization**: Establish performance benchmarks and implement optimizations to improve response times and resource utilization | ||
    * **Security Hardening**: Tackle security issues as they arise, and implement additional security measures and compliance controls across the infrastructure | ||
    * **Knowledge Sharing**: Begin documenting processes and sharing knowledge with the engineering team | ||
"Knowledge is Power" + Contributions on the first week imply that this should be done the first weeks. |
||
|
||
* **Week 9-13: Strategic Impact & Innovation** | ||
    * **Infrastructure Scaling**: Design and implement solutions to support FlowFuse's growth and increasing customer demands | ||
    * **Disaster Recovery**: Implement comprehensive backup and disaster recovery procedures | ||
    * **Cost Optimization**: Analyze cloud costs and implement strategies to optimize spending while maintaining performance | ||
    * **Advanced Monitoring**: Deploy advanced observability tools and create bespoke dashboards for better system visibility | ||
    * **Process Improvement**: Lead initiatives to improve operational processes and reduce mean time to recovery (MTTR) | ||
|
||
## Hiring Plan | ||
|
||
1. **Initial Screening**: Review resumes and cover letters to assess technical qualifications and experience alignment with FlowFuse's needs. | ||
|
||
2. **Technical Interview (Infrastructure & Automation)**: Video interview focusing on AWS expertise, Kubernetes knowledge, CI/CD experience, and problem-solving approach to infrastructure challenges. | ||
|
||
3. **System Design Interview**: Present candidates with real-world scenarios involving scaling challenges, incident response, or infrastructure optimization to assess their architectural thinking. | ||
|
||
4. **STAR Interview**: Behavioral interview focusing on past situations, tasks, actions, and results to understand problem-solving abilities and value alignment. | ||
|
||
5. **Final Interview**: A final interview with key stakeholders or other members of the leadership team. | ||
|
||
6. **Offer**: Extend an offer to the selected candidate. |
Now this is mostly a SysOps role, where DevOps would be a Developer + Operations. Has the Job Description just changed to be the same as old school sysops?
This is almost entirely based on the job description we posted when we hired previously.