Next Gen DevOps Transformation Framework: A case study of Playfish circa 2010 pt. 2

Recap

Last week I used Playfish, as it was in 2010, as a case study to show how the Next Gen DevOps Transformation Framework can be used to assess the capability of an organisation. We looked at Playfish’s Build & Integration capability, which the framework classified at level 0, ad hoc and 3rd Party Component Management which the framework classified at level 4.

This week

We’ll take a look at what the Next Gen DevOps Transformation Framework recommends to help Playfish improve its Build & Integration capability and contrast that with what we actually did. We’ll also look at why we made the decisions we made back then and talk a little about where that took us. Finally I’ll end with some recommendations that I hope will help organisations avoid some of the (spring-loaded, trapped and spiked) pitfalls we fell into.

Build & Integration Level 0 Project Scope

Each level within each capability of the Next Gen DevOps Transformation Framework has a project scope designed to improve an organisation’s capability, encourage collaboration between engineers and enable additional benefits for little additional work.

The project scope for Build & Integration Level 0 is:

  1. Create a process for integrating new 3rd party components.
  2. Give one group clear ownership and budget for the performance, capability and reliability of the build system.

These two projects don’t initially seem to be related but it’s very hard to do one without the other.

Modern product development demands that developers use many 3rd party components. Sometimes these are frameworks like Spring; more often than not they’re libraries like JUnit or modules like the Request module for Node.js.

Having a process for integrating new 3rd party components ensures that all engineers know the component is available. It provides an opportunity to debate the relative merits of alternatives and reduces unnecessary work. It also, crucially, provides the opportunity to ensure that 3rd party components are integrated in a robust way. It’s occasionally necessary to use immature components; if there’s a chance these may be unavailable when needed then they need to be staged in a reliable repository, or an alternative needs to be maintained. Creating a process for integrating 3rd party components ensures these issues are brought out into the open and can be addressed.
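As a sketch of what that staging might look like in a Java shop (the repository id and URL below are hypothetical placeholders, not Playfish’s actual setup), a Maven project can resolve every dependency through an internal repository manager that caches 3rd party artefacts, so a build never depends on an upstream server still being online:

```xml
<!-- pom.xml fragment: route dependency resolution through an internal
     repository manager so 3rd party artefacts remain available even if
     the upstream source disappears. The id and URL are illustrative. -->
<repositories>
  <repository>
    <id>internal-mirror</id>
    <url>https://repo.example.internal/maven/releases</url>
  </repository>
</repositories>
<dependencies>
  <!-- Pin an exact version so every team builds against the same artefact -->
  <dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>4.8.2</version>
    <scope>test</scope>
  </dependency>
</dependencies>
```

The pinned version is the part the integration process debates and records; the internal mirror is the part that makes the “component unavailable when needed” risk go away.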

Having talked about a process for integrating 3rd party components, an organisation is then in a great place to decide who should be responsible for the capability and reliability of the build system. Giving ownership of the build system to a group with the skills needed to manage and support it enables strategic improvement of the build capability. Only so much can be achieved by engineers, no matter how talented and committed they are, without funding, time and strategic planning.

How Playfish improved its build capability

I don’t know how Playfish built software before I joined, but in August 2010 all the developers were using an instance of Bamboo hosted on JIRA Studio. JIRA Studio is a software-as-a-service implementation of a variety of Atlassian products. I haven’t used it for a while, but back in 2010 it was a single server hosting whatever Atlassian components you configured. Some Playfish developers had set up Bamboo as an experiment and by August 2010 it had become the unofficial standard for building software. I say unofficial because Operations didn’t know it had become the standard until the thing broke.

Playfish Operations managed the deployment of software back then, and that meant copying war files from an artefact repository, updating some config and occasionally running SQL scripts on databases. The processes that built these artefacts were owned by each development team. The development teams had a good community and had all chosen to adopt Bamboo.

Pause for context…

Let’s take a moment to look at the context surrounding this situation because we’re at that point where this situation could have degenerated into one of the classic us vs. them situations that are so common between operations and development.

When I was hired at Playfish I was told by the CEO, the Studio Head and the Engineering Director that I had two missions:

  1. Mature Playfish’s approach to operations
  2. Remove the blocks that slowed down development.

Playfish Operations were the only group on-call. They were pushing very hard to prevent development teams from requesting deployments on Friday afternoons and then running out of the building to get drunk.

I realised straight away that the development teams needed to manage their own deployments and take responsibility for their own work. That meant demystifying system configuration so that everyone understood the infrastructure and could take responsibility for their code on the live systems. I also knew that educating a diverse group of 50-odd developers was not a “quick-win”.

This may explain why I didn’t want operations to take ownership of the build system even though that’s exactly what some of my engineers wanted me to do. Operations weren’t building software at the level of complexity that required a build system back then. Operations code, and some of it was quite complex, wasn’t even in source control, so if we’d taken control of the build system we’d have been just another bunch of bureaucrats managing a system we had no vested interest in.

…Resume

When the build system reached capacity and everyone looked to operations to resolve it, I played innocent. Build system? What build system? Do any of you guys know anything about a build system? It worked, to a certain extent: some of the more senior developers took ownership of it, started stripping out old projects, and performance improved.

While this was happening I was addressing the education issues. Everyone in the development organisation was telling me that deployments were so difficult because there were significant differences between the development, test and live environments. Meanwhile the operations engineers were swearing blind that there were no significant differences. Obviously there will always be some differences, even if they’re just different access control criteria. As the Playfish Operations team were the only group responsible for performance and resilience, they were very protective about who could make changes to the infrastructure and configuration. That made them unwilling to share access to the configuration files, which in turn led to suspicion and doubt among the development teams. This is inevitable when you create silos and prevent developers from taking responsibility for their environments.

To resolve this I took it on myself to document the configuration of the environments and highlight everywhere there were differences. This was a great induction exercise for me (I was still in my first couple of weeks). I discovered that there were no significant differences; all the differences were little things like host names and access criteria. Deployment problems were now treated completely differently: just lifting the veil on configuration changed the entire problem, and we then found the real root cause.

The problem was configuration, but it wasn’t system configuration, it was software configuration. There was very little control over the Java properties, and there were frequent duplications with different key names and default values. This made application behaviour difficult to predict when the software was deployed in different environments. There followed a years-long initiative to ban default values and to identify and remove duplicate properties. This took a long time because there were so many games and services and each was managed independently.
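To make that failure mode concrete, here’s a minimal sketch (the key names, defaults and values are invented for illustration, not taken from Playfish’s code) of how two duplicate properties with different hard-coded defaults behave inside one deployment:

```java
import java.util.Properties;

public class ConfigDrift {
    // Two teams read what is meant to be the same setting under
    // different keys, each with its own hard-coded default.
    static int timeoutFromGameCode(Properties p) {
        return Integer.parseInt(p.getProperty("db.timeout", "30"));
    }

    static int timeoutFromServiceCode(Properties p) {
        return Integer.parseInt(p.getProperty("database.timeoutSeconds", "60"));
    }

    public static void main(String[] args) {
        Properties live = new Properties();
        // Only one of the two keys is overridden in this environment.
        live.setProperty("db.timeout", "10");

        // The same logical setting now has two values in one deployment:
        System.out.println(timeoutFromGameCode(live));    // 10 - picked up the override
        System.out.println(timeoutFromServiceCode(live)); // 60 - silently fell back to its default
    }
}
```

The override is applied in one place and silently ignored in the other, which is exactly why banning default values and de-duplicating keys was worth a multi-year effort.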

Conclusion

I won’t head into the subject of integration for now as that’s a whole other rabbit hole, and we have enough to contrast the approach we took at Playfish with the recommendation made in the framework.

The build system at Playfish had no clear ownership. Passionate and committed senior developers did a good job of maintaining the build system’s performance but there was no single group who could lead a strategic discussion. That meant there was no-one who could look into extending automated build into automated testing and no one to consider extending build into integration.

This in turn meant that build and integration were always separate activities at Playfish. That had a knock-on effect on how we approached configuration management and significantly extended the complexity and timescales of automating game and service deployment.

The Next Gen DevOps Transformation Framework supports a product-centric approach to DevOps. In the case of Build & Integration it recommends an organisation treats its build process and systems as an internal product. This means it needs strategic ownership, vision, roadmaps, budget and appropriate engineering. At Playfish we never treated build that way; we assumed that build was just a part of software development, which it is, but we never invested in it sufficiently to reap all the available rewards.
