Framework

Next Gen DevOps Transformation Framework: A case study of Playfish circa 2010 pt. 2

Recap

Last week I used Playfish, as it was in 2010, as a case study to show how the Next Gen DevOps Transformation Framework can be used to assess the capability of an organisation. We looked at Playfish’s Build & Integration capability, which the framework classified at level 0 (ad hoc), and its 3rd Party Component Management capability, which the framework classified at level 4.

This week

We’ll take a look at what the Next Gen DevOps Transformation Framework recommends to help Playfish improve its Build & Integration capabilities and contrast that with what we actually did. We’ll also look at why we made the decisions we made back then and talk a little about where that took us. Finally I’ll end with some recommendations that I hope will help organisations avoid some of the (spring-loaded, trapped and spiked) pitfalls we fell into.

Build & Integration Level 0 Project Scope

Each level within each capability of the Next Gen DevOps Transformation Framework has a project scope designed to improve an organisation’s capability, encourage collaboration between engineers and enable additional benefits for little additional work.

The project scope for Build & Integration Level 0 is:

Create a process for integrating new 3rd party components.
Give one group clear ownership and budget for the performance, capability and reliability of the build system.

These two projects don’t initially seem to be related but it’s very hard to do one without the other.

Modern product development demands that developers use many 3rd party components. Sometimes these are frameworks like Spring; more often than not they’re libraries like JUnit or modules like the Request module for Node.js.

Having a process for integrating new 3rd party components ensures that all engineers know the component is available. It provides an opportunity to debate the relative merits of alternatives and reduces unnecessary work. It also, crucially, provides the opportunity to ensure that 3rd party components are integrated in a robust way. It’s occasionally necessary to use immature components. If there’s a chance that these may be unavailable when needed then they need to be staged in a reliable repository or an alternative needs to be maintained. Creating a process for integrating 3rd party components ensures these issues are brought out into the open and can be addressed.
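As a rough illustration of what such a process might check, here is a minimal Python sketch. It assumes a hypothetical register of approved components, each with an owning team and a flag saying whether a copy is staged in a repository the organisation controls. The field names, team names and component names are all invented for the example; none of this is taken from the framework or from Playfish.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """One entry in a hypothetical register of 3rd party components."""
    name: str
    owner: str      # team accountable for the component (assumed field)
    mirrored: bool  # True if a copy is staged in a repository we control


def find_problems(required, register):
    """Return {component name: reason} for every required component that is
    either missing from the register or not staged in a reliable repository."""
    problems = {}
    for name in required:
        entry = register.get(name)
        if entry is None:
            problems[name] = "not registered"
        elif not entry.mirrored:
            problems[name] = "not staged in a reliable repository"
    return problems


# Illustrative data only.
register = {
    "spring": Component("spring", "engineering", True),
    "junit": Component("junit", "engineering", True),
    "shiny-new-lib": Component("shiny-new-lib", "game-team-a", False),
}

problems = find_problems(["spring", "shiny-new-lib", "mystery-lib"], register)
for name, reason in problems.items():
    print(f"{name}: {reason}")
```

A check like this could run at the start of a build so that an unregistered or unstaged component is discussed before it becomes a dependency everyone relies on.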

Having talked about a process for integrating 3rd party components an organisation is then in a great place to decide who should be responsible for the capability and reliability of the build system. Giving ownership of the build system to a group with the skills needed to manage and support it enables strategic improvement of the build capability. Only so much can be achieved by engineers, no matter how talented and committed they are, without funding, time and strategic planning.

How Playfish improved its build capability

I don’t know how Playfish built software before I joined but in August 2010 all the developers were using an instance of Bamboo hosted on JIRA Studio. JIRA Studio is a software-as-a-service implementation of a variety of Atlassian products. I haven’t used it for a while but back in 2010 it was a single server hosting whatever Atlassian components you configured. Some Playfish developers had set up Bamboo as an experiment and by August 2010 it had become the unofficial standard for building software. I say unofficial because Operations didn’t know that it had become the standard until the thing broke.

Playfish Operations managed the deployment of software back then and that meant copying some war files from an artefact repository, updating some config and occasionally running SQL scripts on databases. The processes that built these artefacts were owned by each development team. The development teams had a good community and had all chosen to adopt Bamboo.

Pause for context…

Let’s take a moment to look at the context surrounding this situation because we’re at that point where this situation could have degenerated into one of the classic us vs. them situations that are so common between operations and development.

When I was hired at Playfish I was told by the CEO, the Studio Head and the Engineering Director that I had two missions:

  1. Mature Playfish’s approach to operations
  2. Remove the blocks that slowed down development.

Playfish Operations were the only group on-call. They were pushing very hard to prevent development teams from requesting deployments on Friday afternoons and then running out of the building to get drunk.

I realised straight away that the development teams needed to manage their own deployments and take responsibility for their own work. That meant demystifying system configuration so that everyone understood the infrastructure and could take responsibility for their code on the live systems. I also knew that educating a diverse group of 50-odd developers was not a “quick-win”.

This may explain why I didn’t want operations to take ownership of the build system even though that’s exactly what some of my engineers wanted me to do. Operations weren’t building software at the level of complexity that required a build system back then. Operations code, and some of it was quite complex, wasn’t even in source-control, so if we’d taken control of the build system we’d have been just another bunch of bureaucrats managing a system we had no vested interest in.

…Resume

When the build system reached capacity and everyone looked to operations to resolve it I played innocent. Build system? What build system? Do any of you guys know anything about a build system? It worked, to a certain extent, and some of the more senior developers took ownership of it and started stripping out old projects and performance improved.

While this was happening I was addressing the education issues. Everyone in the development organisation was telling me that the reason deployments were so difficult was that there were significant differences between the development, test and live environments. Meanwhile the operations engineers were swearing blind that there were no significant differences. Obviously there will always be differences; sometimes these are just different access control criteria, but there are differences. As the Playfish Operations team were the only group responsible for performance and resilience they were very protective about who could make changes to the infrastructure and configuration. This in turn led them to be unwilling to share access to the configuration files and that led to suspicion and doubt among the development teams. This is inevitable when you create silos and prevent developers from taking responsibility for their environments.

To resolve this I took it on myself to document the configuration of the environments and highlight everywhere there were differences. This was a great induction exercise for me (I was still in my first couple of weeks). I discovered that there were no significant differences; all the differences were little things like host names and access criteria. Deployment problems were now treated completely differently. Just lifting the veil on configuration changed the entire problem. We then found the real root cause. The problem was configuration; it just wasn’t system configuration, it was software configuration. There was very little control over the Java properties and there were frequent duplications with different key names and default values. This made application behaviour difficult to predict when the software was deployed in different environments. There was then a years-long initiative to ban default values and to try to identify and remove duplicate properties. This took a long time because there were so many games and services and each was managed independently.
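To make those two kinds of problem concrete, here is a small Python sketch of the sort of check that can surface them. It is not the tooling we used at Playfish. It assumes a simplified subset of the Java .properties format (plain key=value lines, with # or ! comments), and the property names, values and the "same key after normalisation" heuristic for spotting duplicates are all invented for illustration.

```python
def parse_properties(text):
    """Parse simple key=value lines. This handles only a subset of the real
    .properties format (no ':' separators, no line continuations)."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def diff_environments(envs):
    """Return {key: {env: value}} for every key whose value differs between
    environments. A key missing from an environment shows up as None."""
    all_keys = set().union(*(props.keys() for props in envs.values()))
    diffs = {}
    for key in sorted(all_keys):
        values = {env: props.get(key) for env, props in envs.items()}
        if len(set(values.values())) > 1:
            diffs[key] = values
    return diffs


def likely_duplicates(props):
    """Group keys that become identical once case and punctuation are
    ignored, e.g. 'cache.timeout' and 'cacheTimeout' (a crude heuristic)."""
    groups = {}
    for key in props:
        normalised = "".join(ch for ch in key.lower() if ch.isalnum())
        groups.setdefault(normalised, []).append(key)
    return [sorted(keys) for keys in groups.values() if len(keys) > 1]


# Hypothetical property files for two environments.
test_text = "db.host=test-db\ncache.timeout=30\ncacheTimeout=60\n"
live_text = "# live settings\ndb.host=live-db\ncache.timeout=30\n"

envs = {"test": parse_properties(test_text), "live": parse_properties(live_text)}
diffs = diff_environments(envs)
duplicates = likely_duplicates(envs["test"])
```

In this made-up example the host name difference is the harmless kind, while the second timeout key that exists only in test is the kind of software configuration drift that made behaviour hard to predict.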

Conclusion

I won’t head into the subject of integration for now as that’s a whole other rabbit hole and we have enough now to contrast the approach we took at Playfish with the recommendation made in the framework.

The build system at Playfish had no clear ownership. Passionate and committed senior developers did a good job of maintaining the build system’s performance but there was no single group who could lead a strategic discussion. That meant there was no-one who could look into extending automated build into automated testing and no one to consider extending build into integration.

This in turn meant that build and integration were always separate activities at Playfish. This had a knock-on effect on how we approached configuration management and significantly extended the complexity and timescales of automating game and service deployment.

The Next Gen DevOps Transformation Framework supports a product-centric approach to DevOps. In the case of Build & Integration it recommends that an organisation treats its build process and systems as an internal product. This means it needs strategic ownership, vision, roadmaps, budget and appropriate engineering. At Playfish we never treated build that way. We assumed that build was just a part of software development, which it is, but never sufficiently invested in it to reap all the available rewards.

Next Gen DevOps Transformation Framework: A case study of Playfish circa 2010

Introduction

This is going to be part one in a two part series. In this article I’m going to run a case study capability assessment using my newly published Next Gen DevOps Transformation Framework: https://github.com/grjsmith/NGDO-Transformation-Framework.

I’m going to use Playfish (as it was in August 2010 when I joined) as my target organisation. The reason for this is that Playfish no longer exists and there’s almost no one left at EA who would remember me so I’m not likely to be sued if I’m a bit too honest. I’m only going to review two capabilities, one that scores low and one that scores high, otherwise this blog post will turn into a 20 page TL;DR fest.

Next week I’ll publish the second part of this series where I’ll discuss the recommendations the framework makes to help Playfish level its capabilities up, contrast that with what we actually did and talk about some of the reasons we made the choices we did and where those choices led us.

But first a little Context

Playfish made Facebook games; that was just one of many things that made Playfish remarkable. Playfish had extraordinary vision, diversity and capability but, coming from AOL, the thing that most stood out for me was that all Playfish technology was underpinned by Platform-As-A-Service or Software-As-A-Service solutions.

Playfish’s games were really just complex web services. The game the player interacts with is a big Flash client that renders in the player’s browser. The game client records the player’s actions and sends them to the game server. The server then plays these interactions back against its rule set, first to see if they are legal actions and then to record them in a database of game state. The game is asynchronous in that the client allows the player to do what they want and then validates the instructions. This meant that Playfish game sessions could survive quite variable ISP performance, allowing Playfish to be successful in many countries around the world.
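The replay-and-validate idea can be sketched in a few lines of Python. This is a toy illustration and not Playfish’s game server: the price table, the buy and sell actions, the in-memory state standing in for the database and the decision to stop at the first illegal action are all assumptions made for the example.

```python
from dataclasses import dataclass, field

# Hypothetical server-side price list. The client only names the item,
# so it cannot choose its own prices.
PRICES = {"seed": 5, "fence": 20}


@dataclass
class GameState:
    """Stand-in for the player's record in the game state database."""
    coins: int
    items: list = field(default_factory=list)


def is_legal(state, action):
    """Check one recorded action against the (toy) rule set."""
    kind, item = action
    if item not in PRICES:
        return False
    if kind == "buy":
        return state.coins >= PRICES[item]
    if kind == "sell":
        return item in state.items
    return False


def apply_action(state, action):
    """Record a legal action in the game state."""
    kind, item = action
    if kind == "buy":
        state.coins -= PRICES[item]
        state.items.append(item)
    else:  # "sell", already validated by is_legal
        state.coins += PRICES[item]
        state.items.remove(item)


def replay(state, actions):
    """Replay the client's recorded actions in order and return how many
    were accepted. Stopping at the first illegal action is an assumption."""
    accepted = 0
    for action in actions:
        if not is_legal(state, action):
            break
        apply_action(state, action)
        accepted += 1
    return accepted


state = GameState(coins=10)
batch = [("buy", "seed"), ("sell", "seed"), ("buy", "fence"), ("buy", "seed")]
accepted = replay(state, batch)
```

Because the client sends a batch of actions after the fact, a slow or flaky connection only delays validation; it doesn’t interrupt play.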

The game server then interacts with a bunch of other services to provide CRM, Billing, item delivery and other services. So we have a game client, game service, databases storing reference and game state data and a bunch of additional services providing back-office functions.

From a technical organisation perspective Playfish was split into 3 notional entities:

The games teams were in the Studio organisation. The games teams were cross-functional teams comprised of Producers, product-managers, client developers, artists and server developers all working together to produce a weekly game update.

The Engineering team developed the back-office services. There were two teams who split their time between something like 10 different services. These teams weren’t pressured to provide weekly releases; rather they developed additional features as needed.

Finally the Ops team managed the hosting of the games and services and all the other 3rd party services Playfish was dependent on like Google docs, JIRA Studio and a few other smaller services.

External to these organisations there were also marketing, data analysis, finance, payments and customer service teams.

Just before we dive in let me remind you that we’re assessing a company that released its first game in December 2007. When I joined it was really only 18 months old and had grown to around 100 people so this was not a mature organisation.

Without further ado let’s dive into the framework and assess Playfish’s DevOps capabilities as they were in August 2010.

Build & Integration

We’ll choose to look at Build & Integration first as this was the first major problem I was asked to resolve when I took on the Operations Director role at Playfish.

Build and Integration: ad-hoc, Capability level 0, description:

Continuous build ad hoc and sporadically successful.

This seems like an accurate description of the Playfish I remember from 2010.

Capability level 0 observed behaviours are:

Only revenue generating applications are subject to automated build and test.

Automated builds fail frequently.

No clear ownership of build system capability, reliability and performance.

No clear ownership of 3rd-party components.

This isn’t a strict description of the state of Playfish’s build and integration capabilities.

The games and the back-office services all had automated build pipelines but there was only very limited automated testing. The individual game and service builds were fairly reliable but that was due in part to the lack of any sophisticated automated testing. The games teams struggled to ensure functionality with the back-office components. Playfish had only recently transitioned to multiple back-office services when I joined and was still suffering some of the transitional pains. There was no clear ownership of the build system. Some of the senior developers had set one up and begun running it as an experiment but pretty soon everyone was using it. It was considered production but no-one had taken ownership of it. Hence when it under-performed everyone looked at everyone else. 3rd party components were well understood and well owned at Playfish. Things could have been a little more formal but at Playfish’s level of maturity it wasn’t strictly necessary.

Let’s take a look at the level 1 build and integration capabilities before we settle on a rating. Description:

Continuous build is reliable for revenue generating applications.

Observed behaviours:

Automated build and test activities are reliable in the build environment.

Deployment of applications to production environment is unreliable.

Software and test engineers concerned that system configuration is the root cause.

Automated build and test were not reliable. Deployment was unreliable and everyone was concerned that system configuration was to blame for the unreliability of deployments.

So in August 2010 Playfish’s Next Gen DevOps Transformation Framework Build & Integration capability was level 0.

Next we’ll look at 3rd Party Component management. I’m choosing this because Playfish was founded on the principles of using PAAS and SAAS solutions where possible so it should score highly but I suspect it will be interesting to see how.

3rd Party Component Management

Capability level 0 description:

An unknown number of 3rd party components, services and tools are in use.

This isn’t true. Playfish didn’t really have enough legacy to lose track of its 3rd party components.

Capability level 0 behaviours are:

No clear ownership, budget or roadmap for each service, product or tool.

Notification of impacts are makeshift and motley causing regular interruptions and impacts to productivity.

All but one 3rd party provided service had clearly defined owners. Playfish had clear owners for the relationship with Facebook and the only tool in dispute was the automated build system. There were a variety of 3rd party libraries in use and these were never used from source so they never caused any surprises. While there were no clear owners for all of these libraries all the teams kept an eye on their development and there were regular emails about updates and changes.

There were no formal roadmaps for every product and tool but their use was constantly discussed.

So it doesn’t seem that Playfish was at level 0.

Capability level 1 description:

A trusted list of all 3rd party provided services, products and tools is available.

There was definitely no documented list of all the 3rd party services, products and tools so it may be that Playfish should be considered at level 0 but let’s apply some common sense (required when using any framework) and take a look at the observed behaviours.

Capability level 1 observed behaviour:

Informed debate of the relative merits of each 3rd party component can now occur.

Outages still cause incidents and upgrades are still a surprise.

There was regular informed debate about the relative merits of almost all the 3rd party services, products and tools. No planned maintenance of 3rd party services, products or tools caused outages.

So while Playfish didn’t have a trusted list of all 3rd party provided services, products and tools they didn’t experience the problems that might be expected. This was due to the fact that it was a very young organisation with very little legacy and a very active and engaged workforce. If we don’t see the expected observed behaviour let’s move on to level 2.

Description for Capability level 2:

All 3rd party services, products and tools have a service owner.

While there was no list it was well understood who owned all but one of the 3rd party services, products and tools.

Capability level 2 observed behaviour:

Incidents caused by 3rd party services are now escalated to the provider and within the organisation.

There is organisation wide communication about the quality of each service and major milestones such as outages or upgrades are no longer a surprise.

There is no way to fully assess the potential impact of replacing a 3rd party component.

Incidents caused by 3rd party services, products and tools were well managed. There was organisation wide communication about the quality of 3rd party components, and 3rd party upgrades and planned outages did not cause surprises. The 3rd party components in use were very well understood and debates about replacing 3rd party components were common. We even used Amazon’s cloud services carefully to ensure we could switch to other cloud providers should a better one emerge. We once deployed the entire stack and a game to OpenStack and it ran with minimal work (although this was much later). The use of different components was frequently debated and it wasn’t uncommon for us to use multiple alternative components on different infrastructure within the same game or service to see real-world performance differences first-hand.

So while Playfish didn’t meet the description of Capability Level 2 its behaviour exceeded that predicted in the observed behaviours.

Let’s take a look at Capability level 3:

Strategic roadmaps and budgets are available for all 3rd party services, products and tools.

There definitely weren’t roadmaps and budgets allocated for any of the 3rd party services. To be fair when I joined Playfish it didn’t really operate budgets.

Capability level 3 observed behaviour

Curated discussions supported by captured data take place regularly about the performance, capability and quality of each 3rd party service, product and tool. These discussions lead to published conclusions and actions.

Again Playfish’s management of 3rd party components doesn’t match the description but the observed behaviour does. Numerous experiments were performed assessing the performance of existing components in new circumstances or comparing new components in existing circumstances. Debates were common and occasionally resolved into experiments. Tactical decisions were made based on data gathered during these experiments.

Let’s move on to capability level 4:

Continuous Improvement

There was a degree of continuous improvement at Playfish but let’s take a look at the observed behaviours before we draw a conclusion:

3rd party components will be either active open-source projects that the organisation contributes to or they will be supplied by engaged, responsible and responsive partners

This description fairly accurately matches Playfish’s experience.

So in August 2010 Playfish’s 3rd Party Component Management capability was level 4.

It should be understood that Playfish was a business set up around the idea that 3rd party services, products and tools would be used as far as possible. It should also be remembered that at this stage the company was about 18 months old hence the behaviours were good even though it hadn’t taken any of the steps necessary to ensure good behaviour.

Conclusion

Using the Next Gen DevOps Transformation Framework to assess the DevOps capabilities of an organisation is a very simple exercise. With sufficient context it can be done in a few hours. If you want someone external to run the process it will take a little longer as they will have to observe your processes in action.

Look out for next week’s article when I’ll examine what the framework recommends to improve Playfish’s Build & Integration capabilities and contrast that with what we actually did.

Next Gen DevOps pioneer launches highly anticipated framework


Transformation Framework offers a structured approach to move to DevOps

London, UK – July 28, 2015 – Grant Smith, pioneer of the Next Gen DevOps (NGDO) movement, has launched his Next Gen DevOps Transformation Framework on Github. It comes after the success of Grant’s book, Next Gen DevOps: Creating the DevOps Organisation, released last year.

The NGDO Transformation Framework offers a structured approach to a business-wide transition to DevOps. Underscoring the nature and complexity of moving to DevOps, the framework enables organisations to choose which challenge to tackle first, outlining the benefits at each stage of the transformation.

The NGDO Transformation Framework is designed to help organisations assess their DevOps capabilities and prioritise projects to deliver improvements to existing functions – as well as deliver new ones.

Additionally, the Framework offers the ability to execute a series of projects that can individually progress an organisation’s capabilities, delivering real-world value. The projects are also designed so that they can be used together to enable additional capabilities with minimal extra work.

The Next Gen DevOps movement merges behaviour-driven development, infrastructure-as-code, automated testing, monitoring and continuous integration into a single coherent process.

Armed with the lessons taken from the Agile software development movement, combined with the latest in Software-as-a-Service (SaaS) solutions, cloud computing and automated testing, NGDO is Grant Smith’s vision for the biggest evolution of business IT yet.

“There is no better time than now to transition to the new ways of thinking outlined in my NGDO Framework,” Grant Smith says. “Using this framework, businesses can increase their efficiency and responsiveness and enable more agile ways of working.”

The Next Gen DevOps Transformation Framework is open source and licensed under the Creative Commons CC0 1.0 Universal (CC0 1.0) licence.

Notes to Editors

Grant has created and led high performance operations teams in some of the largest and fastest growing companies in the UK and is at the forefront of the DevOps movement.

He has driven real collaboration between operations and development teams by implementing infrastructure-as-code and driving system integration from continuous build systems.

Grant has delivered cloud-based game platforms enjoyed by millions of players per day and websites serving a billion page views per month. Most recently, he delivered a high performance, scalable Internet of Things (IoT) platform for British Gas. Grant is a frequent speaker at conferences and events.

More of Grant’s work can be viewed at nextgendevops.com.

Next Gen DevOps: Creating The DevOps Organisation is available for Kindle on Amazon now. http://amzn.to/1DDvD6i

Book Cover Image Link: http://imgur.com/O0GauId

PR Contact: Grant Smith, Grant@nextgendevops.com

DevOps Framework: The last teaser


This week’s update is another short one as we’ve just come back from an awesome holiday in the Lake District. I’d had an intense few weeks of writing and interviewing and then to top it all off I contracted some sort of flu or virus thing. Kylie’s contract had just come to an end and so we decided to take a short break before she started her next one. As you can see we had a great time and the weather was wonderful!

I love the Lake District. There’s something about the fresh air and the countryside that really refreshes me. If anyone in Cumbria is considering a DevOps transformation please feel free to get in touch :).

I’ve returned ready to get the framework into a first draft state and published on Github so you can all get your teeth into it. On that note Kylie has finished her initial assessment and she’s provided some initial feedback that I’ll be working on this week.

Finally I’m participating in a panel discussing the business need for DevOps at Computing Summit’s DevOps 2015 event next week, July 8th. I hope I’ll see some of you there. If you’re not interested in seeing me speak you should definitely hear what my friend Phil Hendren of Mind Candy fame has to say. While I’ve been leading DevOps initiatives and teams for the last 6 years Phil’s been engineering the solutions that make it happen. You can find Phil’s blog here; it’s a great source of practical DevOps advice and tools.

DevOps Transformation Framework Update: Almost there…

I know that some of my readers are impatient to get their hands on my new project: the Next Gen DevOps Transformation Framework. I’m as eager to release it as they are to play with it but it isn’t quite ready for its public debut yet. I have created a project in Github where I will release it but at the moment it just hosts a readme. If you want to keep an eye on the project you can find it here: https://github.com/grjsmith/NGDO-Transformation-Framework

With the framework I’m attempting something quite challenging. I don’t mean writing the framework; anyone who’s seen DevOps in action and read around the subject a little could create a framework for a DevOps transition. The challenge is presenting it for two audiences with two very different perspectives. I’m trying to create something detailed enough that engineers and managers can argue about it and contribute to it and yet something simple enough that budget holders can understand at-a-glance what they might be getting their organisations into.

Two months ago when I announced I had started work on the framework I mentioned that one of my goals was to create something that would have helped me while I was at EA. I failed to convince the central operations group to adopt DevOps. I was able to convince them of the benefits of DevOps but I couldn’t present them with an approach that they could plan and budget for. The framework successfully presents the scale and complexity of a DevOps transformation but I’m struggling with presenting the evolving benefits of the DevOps approach as the organisation progresses through the framework.

That’s where my secret weapon comes in. My girlfriend Kylie just happens to be a very talented Business Analyst. One of the things Kylie excels at is presenting complex data simply and effectively. Kylie’s working on the framework now and we hope to have the first version ready for release by the end of June.

Next Gen DevOps Transformation Framework Update… Oh dear

Since I started working on the framework I’ve had the idea that I wanted there to be a visualisation of the entire framework.

I have a visual imagination. I’m most comfortable considering the big picture and I wanted to give people, like myself, an at-a-glance view of the framework. I also thought it might help people understand the scale of the framework and help place their organisation within it.

I’ve had four major attempts at this so far and I think I’m getting there. Unfortunately the Scrabble people might sue me if I go with this version :):

NGDOTF Capability Web v0.2

Next Gen DevOps Transformation Framework: Project Update

I have completed a very rough first draft of all the individual elements of my DevOps Transformation Framework. The framework intends to mature nine functions required to build and manage online services. They are:

  • Build & Integration
  • 3rd Party Component Management
  • Feedback Loops
  • Software Engineering
  • Test Engineering
  • Project Management
  • Incident Management
  • Change Management
  • Budgeting

The framework takes each one of these functions through a series of steps (projects) which improve the capability and productivity of each function. Yesterday I submitted them to the first test process. I wanted to make sure that, while each step might refer to other steps preceding it or assume capabilities in other functions at the same level, no step assumed capabilities only available at later levels. I did this in a distinctly old school way. I printed them all out, cut them up and laid out the starting point, or capability level 0 as I’m calling it at the moment. This effectively defines the prerequisites for starting a DevOps transformation. I’ll publish more on that later. I then randomly selected capabilities from the next level (level 1 funnily enough) and made sure nothing was assumed that wasn’t already present and continued until I had laid out all the capabilities for each function.
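The paper-and-scissors check could also be expressed in code. The Python sketch below assumes each step is identified by a (function, level) pair and lists the capabilities it takes for granted. The dependencies in the example are invented purely to show the check working and are not taken from the framework.

```python
def find_forward_dependencies(steps):
    """steps maps (function, level) to the list of (function, level)
    capabilities that step assumes. Return every (step, assumed) pair
    where the assumed capability sits at a later level than the step."""
    violations = []
    for step, assumed_capabilities in steps.items():
        _, level = step
        for assumed in assumed_capabilities:
            _, assumed_level = assumed
            if assumed_level > level:
                violations.append((step, assumed))
    return violations


# Invented dependencies, for illustration only.
steps = {
    ("Build & Integration", 1): [("3rd Party Component Management", 1)],
    ("Test Engineering", 1): [("Build & Integration", 2)],
    ("Incident Management", 2): [("Feedback Loops", 1)],
}

violations = find_forward_dependencies(steps)
for step, assumed in violations:
    print(f"{step} assumes {assumed}, which is only available at a later level")
```

Same-level and earlier-level assumptions pass, matching the rule described above; only the step that reaches forward to a later level is flagged.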

I still have a long way to go yet. I need to confirm how far each function can mature before it has dependencies on other functions or what the impact might be if capabilities are introduced in one function before another. The next big task is to present one view showing the complete matrix of functions and capabilities and describe the advantages that result when they combine. For example: what advantages are available to an organisation when support and maintenance tasks result in use cases or user stories AND basic automated tests are available to be executed by anyone in the organisation? I’m getting close to needing some help to get the framework into a fit state for public review (and hopefully contribution). As such it needs a little peer review first. If you’re interested in working with me on creating a framework to help businesses transition to DevOps please drop me a line at grant@nextgendevops.com or tweet me @grjsmith or @nextgendevops or, if you really must, message me on LinkedIn!