At Canva, our mission is to democratize design and empower creativity and visual expression for every person, on every platform. Since our launch in April 2013, our user base has grown rapidly to over 15 million users working in more than 100 languages, making us one of the world's fastest-growing technology companies.
As engineers, we're fortunate and proud to be able to say that our work is used by millions of people all over the world. So for us, reliability must be a feature, not a "nice to have": it is crucial that we ensure the Canva experience is uninterrupted for the growing millions using the platform. We're doubling down on this commitment by building a new team within our broader infrastructure group.
This is an opportunity to be one of the founding members of Canva's Chaos Engineering team. The Chaos Engineering team is responsible for ensuring that all of the resiliency measures that have been developed and implemented are working as expected. When they don't work, we're responsible for working with other engineering teams across the business to investigate and remediate the issues. Sometimes the solution may be a straightforward and pragmatic fix. Other times, it may require the ground-up development of new tools to ensure the issue is resolved going forward.
- As an individual contributor, design and implement tools and libraries that service teams can use to improve the reliability of their services. For example: adding a long-awaited feature to our circuit breaker library.
- Conduct and automate chaos experiments to identify scenarios where cascading failures may occur, and to verify that the reliability measures we introduce to prevent them work as expected. For example: what happens when a newly introduced service goes down? Does the fallback for a rare failure actually work?
- Work with product engineering teams to ensure that reliability best practices and tools are rolled out in every service across the whole organization. It’s not enough to create a new throttling library; we want to make sure that it’s successfully used in every service.
- Investigate production incidents in depth, then apply the learnings to the code base
- Research, develop, and justify the best choices, in the form of design docs, for tools and processes that will shape the future of reliability at Canva
- Promote creative and conceptual problem-solving approaches, as opposed to framework- or library-heavy patchwork
- Propose new approaches and solutions to ensure we future-proof Canva’s distributed cloud infrastructure as we scale
- Participate in design meetings, hiring interviews, and code reviews
Required Skills & Experience
- At least five (5) years of commercial experience working as a reliability/chaos engineer in a large, distributed, cloud-based environment – any of the usual suspects (AWS, Google Cloud, Azure) is fine!
- Willingness to work in Java, since our services and libraries are primarily written in Java 11
- Disciplined coding practices and experience with code reviews and pull requests
- Strong communication and team collaboration skills, both written and verbal. As a reliability engineer, you will need to share knowledge and to communicate and coordinate changes across multiple service teams.
- Solid understanding of resiliency techniques and patterns – load balancing, throttling, back pressure, circuit breaking, etc. – the good stuff
Perks & Benefits
- Competitive salary, plus equity options
- Flexible daily working hours; we value work-life balance
- In-house chefs who cook delicious breakfast and lunch for us each day
- Onsite gym and yoga benefits
- Generous parental (including secondary) leave policy
- Pet-friendly offices
- Sponsored social clubs and team events
- Relocation budget for interstate or overseas candidates who legally qualify for visa sponsorship