InfoQ reached out to MacRae to get a sense of Ticketmaster’s challenges and successes adopting DevOps, as well as what compliance and operations look like in a global, multi-time zone, 24/7 availability organization with 14 different core ticketing products. We also discuss Ticketmaster’s tech maturity model tool and how it can be used to promote positive change rather than blame.
InfoQ: Would you briefly introduce yourself, what you do and what brought you to DevOps?
Connon MacRae: I lead the Technical Operations team for International. This includes our Product and Application Support, Systems & Networking and Datacentres. Our main focus is to try and enable product and engineering teams to be more autonomous and own their product. Ultimately, my interest is the same as my colleagues in the technology and product team. We must be able to make more frequent but sustainable change in our ecosystem in order to build the best ticketing experience for clients and fans. It is a business imperative. My role has been to facilitate this.
InfoQ: How did DevOps get started at Ticketmaster? What were the first steps taken and why?
MacRae: It was here many years ago, and it has enabled us to deliver new functionality and change at the necessary pace. Today we’re working to create a model for Product, Project & Technology to all work closer together.
InfoQ: Where is Ticketmaster at now in your DevOps transformation?
MacRae: It depends; some teams are very evolved already. If they are in AWS, then they already have a lot of great tooling to make us more DevOps friendly. Many Engineering teams are already very autonomous but might still be missing operational understanding. It depends on the technology stack that teams are working with. A few groups have an operational layer still because of the scope and importance of uptime, but we’ve tried to move those groups under the same leadership or align them better so that there is less conflict and the goals can be better aligned. There is no one size fits all.
InfoQ: What are the main challenges still ahead in that transformation?
MacRae: Culture is certainly one. Empathy is probably the most important thing to bring people together and understand what drives conflict. Often a team “don’t know what they don’t know”, so we still see the pattern of teams coming late with ‘blockers’. It’s hard because Technology is full of a lot of smart people that aren’t always aware of their blind spots. Technical operations has to view itself as a consultancy; we want people to work with us. It can be frustrating to watch teams learning but ultimately, it’s important to allow some freedom for teams to make lots of small mistakes. I have a hard time myself seeing this - coming from an Ops background - but it’s important if you want to build a culture of trust and learning.
InfoQ: In your talk you mentioned how the introduction of regulations like SOX and PCI 15 years ago led to a siloed engineering structure - how did you evolve to a more collaborative model while staying compliant?
MacRae: We’re still moving so this is very much a work in progress. The tooling is critical. If you have a solid, well tested pipeline with code reviews which includes infrastructure code, then you are already ticking a lot of the boxes and can iterate faster. This means you can be more secure by responding faster to issues. Sharing ownership of DEV/QA with Operations and Dev teams means any concerns on security or performance happen faster, and you expose Operations teams to the challenges faced by Engineering when environments are different. The tool chain now available means it’s easier to share and these are significant improvements for compliance, particularly if automation means little to no production access. Why would you need it if logging and instrumentation give you all the insight you need? In a container world the notion of RDP or SSH to systems doesn’t make sense anymore unless you’re dealing with state and data where things can get a little more complex.
InfoQ: Ticketmaster is available worldwide. How do you structure your on-call rotation to be able to cope with 24h worldwide availability?
MacRae: This is actually very complex and was much simpler in the old model. There used to be a number of teams that operated globally. We’re moving to autonomous teams that build, run and own their own products which means we need to rely on a central tool to manage our service catalogue. This is key to how we run and trigger our incident management process. Having a good tool to handle escalations and improve visibility is also key. Other tools that have been instrumental are Splunk, Grafana, OpenTSDB and the popular ELK stack. We’ve also started to use tools like Prometheus, as well as some commercial tools.
InfoQ: Ticketmaster has expanded globally by acquiring existing ticketing solutions in different countries. How do you ensure those solutions can evolve in the local market yet maintain some consistency/shared knowledge in the engineering department?
MacRae: Different products support different functions. Sometimes there is some overlap, and that’s okay. Occasionally we can quickly deprecate the technology and look to involve the existing teams in support of our core stack. Our central code base is shared and we are moving towards decoupled, discrete services that are not market driven, but capability driven. This means we’re at a place where we can iterate and evolve our products much faster. One thing that’s very important is that we try and promote open collaboration. We have a global Git instance and this is where teams can collaborate and make PRs to other teams’ projects.
InfoQ: You also mentioned you are moving to a platform model. Could you explain how you define what a platform is at Ticketmaster and what changes are being implemented both from technical and process perspectives?
MacRae: We’re setting a low bar for standardisation because we have such different tech stacks. Terraform & GitLab are right down there. Chef or Docker and then some internal libraries depending on which stack you are running. Identifying the key areas of waste or pain points for teams is the focus on shared services but this is still quite new. It can be tricky to identify the right degree of friction for an engineering team though and it is a constant discussion on what is appropriate. We’re very keen on containers and Kubernetes as they put much improved abstractions and capabilities in place, but they require significant investment and some focus too.
InfoQ: You also mentioned having heterogeneous environments and solutions. How do you cope with the operational side in terms of team structure and skillsets?
MacRae: Typically, those splits are based on the products that offer capabilities, by aligning the right people with the right skills against those products. Over time we already see that the OS becomes much thinner and there is less required to run the application. We do have some stacks that use a complete mix and honestly that can be a challenge though it is very healthy. However, we don’t put Cygwin on Windows anymore. That’s a good thing.
InfoQ: Does Ticketmaster promote the "you build it, you run it" mantra internally? If yes, what's the role of traditional operations in this new approach at Ticketmaster?
MacRae: Absolutely, but different teams are in different states of ownership. It really depends on their maturity and the age of the tech. There are three models that we have:
- Embed team members in those groups
- Offer consultation during design, new projects and onboarding to production
- Bring the operationally minded team members into Platform groups to offer internal services
InfoQ: Has Microsoft's shift towards embracing Linux and open source helped to simplify your infrastructure/platform management? In what ways?
MacRae: Automation, testing and repeatability. It really does come down to using tools like Git, Chef, Packer and Terraform to enable ‘infrastructure as code’ as templates, tools or consultation. It has also taken the sting out of the childish OS conversations – which can sometimes be fun but ultimately not very constructive. We can now really look at the right tool for the right job.
InfoQ: Could you briefly explain Ticketmaster's tech maturity model and what benefits it has brought?
MacRae: Teams assess themselves against a set of capabilities such as testing, instrumentation, DR capabilities etc. The architecture team members review these with the engineering team and we enter the data. They can then work on prioritising any issues that are raised with their Product team. It’s intended as a positive tool, not a stick to punish teams. Making the results open internally is important because it promotes honesty too – you are judged by your peers on your honesty, not your results. It’s really just enabling us to have data that helps us make better decisions.
Story by Manuel Pais (this article was originally posted at InfoQ.com)
All photos of Connon MacRae at the conference courtesy of WinOps Facebook page