Next Gen DevOps Transformation Framework: A case study of Playfish circa 2010

Introduction

This is going to be part one of a two-part series. In this article I’m going to run a case study capability assessment using my newly published Next Gen DevOps Transformation Framework: https://github.com/grjsmith/NGDO-Transformation-Framework.

I’m going to use Playfish (as it was in August 2010 when I joined) as my target organisation. The reason for this is that Playfish no longer exists and there’s almost no one left at EA who would remember me, so I’m not likely to be sued if I’m a bit too honest. I’m only going to review two capabilities, one that scores low and one that scores high; otherwise this blog post will turn into a 20-page TL;DR fest.

Next week I’ll publish the second part of this series, where I’ll discuss the recommendations the framework makes to help Playfish level up its capabilities, contrast that with what we actually did, and talk about some of the reasons we made the choices we did and where those choices led us.

But first, a little context

Playfish made Facebook games, but that was just one of many things that made the company remarkable. Playfish had extraordinary vision, diversity and capability, but coming from AOL the thing that stood out most for me was that all Playfish technology was underpinned by Platform-as-a-Service or Software-as-a-Service solutions.

Playfish’s games were really just complex web services. The game the player interacts with is a big Flash client that renders in the player’s browser. The game client records the player’s actions and sends them to the game server. The server then plays these interactions back against its rule set, first to check that they are legal actions and then to record them in a database of game state. The game is asynchronous in that the client lets the player do what they want and the server validates the instructions afterwards. This meant that Playfish game sessions could survive quite variable ISP performance, allowing Playfish to be successful in many countries around the world.
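As a concrete illustration of that validation loop, here’s a minimal sketch in Python. The action names, rules and state shape are all invented for the example (Playfish’s real server code isn’t something I can reproduce here); the point is only the shape of the flow: the server replays the client’s recorded actions against its rule set and commits them to game state only if every action is legal.

```python
# Hypothetical sketch of asynchronous action validation: the client sends a
# batch of recorded actions; the server replays them against its rule set
# before committing them to the stored game state.

GAME_RULES = {
    # Each rule decides whether an action is legal given the current state.
    "plant_crop": lambda state, a: state["coins"] >= a["cost"],
    "harvest":    lambda state, a: a["plot"] in state["planted_plots"],
}

def apply_action(state, action):
    """Mutate the game state for a single, already-validated action."""
    if action["type"] == "plant_crop":
        state["coins"] -= action["cost"]
        state["planted_plots"].add(action["plot"])
    elif action["type"] == "harvest":
        state["planted_plots"].remove(action["plot"])
        state["coins"] += action["reward"]

def replay_actions(state, actions):
    """Replay the client's recorded actions; reject the batch if any
    action breaks the rules -- the client is never trusted."""
    for action in actions:
        is_legal = GAME_RULES[action["type"]]
        if not is_legal(state, action):
            return False  # illegal action: discard the whole batch
        apply_action(state, action)
    return True  # batch accepted; state would now be persisted

state = {"coins": 100, "planted_plots": set()}
batch = [
    {"type": "plant_crop", "plot": 1, "cost": 40},
    {"type": "harvest", "plot": 1, "reward": 60},
]
assert replay_actions(state, batch)
```

Because the client batches actions and the server validates after the fact, a slow or flaky connection only delays the replay; it doesn’t break the session.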

The game server then interacts with a bunch of other services to provide CRM, billing, item delivery and so on. So we have a game client, a game service, databases storing reference and game-state data, and a collection of additional services providing back-office functions.

From a technical organisation perspective Playfish was split into three notional entities:

The games teams sat in the Studio organisation. These were cross-functional teams of producers, product managers, client developers, artists and server developers, all working together to produce a weekly game update.

The Engineering team developed the back-office services: two teams split their time across something like ten different services. They weren’t pressured to provide weekly releases; rather, they developed additional features as needed.

Finally, the Ops team managed the hosting of the games and services and all the other 3rd party services Playfish was dependent on, like Google Docs, JIRA Studio and a few other smaller services.

External to these entities there were also marketing, data analysis, finance, payments and customer service teams.

Just before we dive in, let me remind you that we’re assessing a company that released its first game in December 2007. When I joined it was not yet three years old and had grown to around 100 people, so this was not a mature organisation.

Without further ado, let’s dive into the framework and assess Playfish’s DevOps capabilities as they were in August 2010.

Build & Integration

We’ll look at Build & Integration first, as this was the first major problem I was asked to resolve when I took on the Operations Director role at Playfish.

Build and Integration: ad-hoc, Capability level 0, description:

Continuous build ad hoc and sporadically successful.

This seems like an accurate description of the Playfish I remember from 2010.

Capability level 0 observed behaviours are:

Only revenue generating applications are subject to automated build and test.

Automated builds fail frequently.

No clear ownership of build system capability, reliability and performance.

No clear ownership of 3rd-party components.

This isn’t an exact description of the state of Playfish’s build and integration capabilities.

The games and the back-office services all had automated build pipelines, but there was only very limited automated testing. The individual game and service builds were fairly reliable, though that was due in part to the lack of any sophisticated automated testing. The games teams struggled to ensure functionality with the back-office components: Playfish had only recently transitioned to multiple back-office services when I joined and was still suffering some of the transitional pains.

There was no clear ownership of the build system. Some of the senior developers had set one up and begun running it as an experiment, but pretty soon everyone was using it; it was considered production, yet no one had taken ownership of it. Hence, when it under-performed, everyone looked at everyone else.

3rd party components were well understood and well owned at Playfish. Things could have been a little more formal, but at Playfish’s level of maturity that wasn’t strictly necessary.

Let’s take a look at the level 1 build and integration capabilities before we settle on a rating. Description:

Continuous build is reliable for revenue generating applications.

Observed behaviours:

Automated build and test activities are reliable in the build environment.

Deployment of applications to production environment is unreliable.

Software and test engineers concerned that system configuration is the root cause.

Automated build and test were not reliable, deployment was unreliable, and everyone was concerned that system configuration was to blame for the unreliability of deployments.

So in August 2010 Playfish’s Next Gen DevOps Transformation Framework Build & Integration capability was level 0.

Next we’ll look at 3rd Party Component Management. I’m choosing this because Playfish was founded on the principle of using PaaS and SaaS solutions where possible, so it should score highly, but I suspect it will be interesting to see how.

3rd Party Component Management

Capability level 0 description:

An unknown number of 3rd party components, services and tools are in use.

This isn’t true: Playfish didn’t really have enough legacy to lose track of its 3rd party components.

Capability level 0 behaviours are:

No clear ownership, budget or roadmap for each service, product or tool.

Notification of impacts are makeshift and motley causing regular interruptions and impacts to productivity.

All but one 3rd party provided service had a clearly defined owner. Playfish had clear owners for the relationship with Facebook, and the only tool in dispute was the automated build system. There were a variety of 3rd party libraries in use; these were never used from source, so they never caused any surprises. While not all of these libraries had clear owners, all the teams kept an eye on their development and there were regular emails about updates and changes.

Not every product and tool had a formal roadmap, but their use was constantly discussed.

So it doesn’t seem that Playfish was at level 0.

Capability level 1 description:

A trusted list of all 3rd party provided services, products and tools is available.

There was definitely no documented list of all the 3rd party services, products and tools, so it may be that Playfish should be considered at level 0, but let’s apply some common sense (required when using any framework) and take a look at the observed behaviours.

Capability level 1 observed behaviour:

Informed debate of the relative merits of each 3rd party component can now occur.

Outages still cause incidents and upgrades are still a surprise.

There was regular, informed debate about the relative merits of almost all the 3rd party services, products and tools. Planned maintenance of 3rd party services, products and tools never caused outages.

So while Playfish didn’t have a trusted list of all 3rd party provided services, products and tools, it didn’t experience the problems that might be expected. This was because it was a very young organisation with very little legacy and a very active, engaged workforce. Since we don’t see the expected observed behaviours, let’s move on to level 2.

Description for Capability level 2:

All 3rd party services, products and tools have a service owner.

While there was no list, it was well understood who owned all but one of the 3rd party services, products and tools.

Capability level 2 observed behaviour:

Incidents caused by 3rd party services are now escalated to the provider and within the organisation.

There is organisation wide communication about the quality of each service and major milestones such as outages or upgrades are no longer a surprise.

There is no way to fully assess the potential impact of replacing a 3rd party component.

Incidents caused by 3rd party services, products and tools were well managed. There was organisation-wide communication about the quality of 3rd party components, and 3rd party upgrades and planned outages did not cause surprises. The 3rd party components in use were very well understood, and debates about replacing them were common. We even used Amazon’s cloud services carefully to ensure we could switch to other cloud providers should a better one emerge. We once deployed the entire stack and a game to OpenStack and it ran with minimal work (although this was much later). The use of different components was frequently debated, and it wasn’t uncommon for us to run alternative components on different infrastructure within the same game or service to see real-world performance differences first-hand.
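To give a flavour of how that kind of portability is usually achieved (this is my illustration, not Playfish’s actual code), the trick is to code against a thin internal interface rather than a provider’s SDK, so switching clouds means swapping one implementation:

```python
# Hypothetical sketch: the application depends on this interface, never on
# a provider SDK directly, so an S3-backed or Swift/OpenStack-backed
# implementation can be swapped in without touching the call sites.
from abc import ABC, abstractmethod

class ObjectStore(ABC):
    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...

class InMemoryStore(ObjectStore):
    """Stand-in backend used here so the example runs anywhere."""
    def __init__(self):
        self._blobs = {}

    def put(self, key, data):
        self._blobs[key] = data

    def get(self, key):
        return self._blobs[key]

def save_game_state(store: ObjectStore, player_id: str, blob: bytes):
    # Call sites only ever see the interface.
    store.put(f"game-state/{player_id}", blob)

store = InMemoryStore()
save_game_state(store, "player-42", b"{}")
assert store.get("game-state/player-42") == b"{}"
```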

So while Playfish didn’t meet the description of capability level 2, its behaviour exceeded the predicted observed behaviours.

Let’s take a look at Capability level 3:

Strategic roadmaps and budgets are available for all 3rd party services, products and tools.

There definitely weren’t roadmaps and budgets allocated for any of the 3rd party services. To be fair, when I joined, Playfish didn’t really operate budgets.

Capability level 3 observed behaviour:

Curated discussions supported by captured data take place regularly about the performance, capability and quality of each 3rd party service, product and tool. These discussions lead to published conclusions and actions.

Again, Playfish’s management of 3rd party components doesn’t match the description, but the observed behaviour does. Numerous experiments were performed, assessing the performance of existing components in new circumstances or comparing new components in existing circumstances. Debates were common and occasionally resolved into experiments. Tactical decisions were made based on data gathered during these experiments.

Let’s move on to capability level 4:

Continuous Improvement

There was a degree of continuous improvement at Playfish, but let’s take a look at the observed behaviours before we draw a conclusion:

3rd party components will be either active open-source projects that the organisation contributes to or they will be supplied by engaged, responsible and responsive partners

This fairly accurately matches Playfish’s experience.

So in August 2010 Playfish’s 3rd Party Component Management capability was level 4.

It should be understood that Playfish was a business set up around the idea that 3rd party services, products and tools would be used as far as possible. It should also be remembered that at this stage the company was still very young, hence the behaviours were good even though it hadn’t taken any of the formal steps necessary to ensure good behaviour.

Conclusion

Using the Next Gen DevOps Transformation Framework to assess an organisation’s DevOps capabilities is a very simple exercise. With sufficient context it can be done in a few hours. If you want someone external to run the process it will take a little longer, as they will have to observe your processes in action.

Look out for next week’s article when I’ll examine what the framework recommends to improve Playfish’s Build & Integration capabilities and contrast that with what we actually did.

Next Gen DevOps pioneer launches highly anticipated framework


Transformation Framework offers a structured approach to move to DevOps

London, UK – July 28, 2015 – Grant Smith, pioneer of the Next Gen DevOps (NGDO) movement, has launched his Next Gen DevOps Transformation Framework on GitHub. It comes after the success of Grant’s book, Next Gen DevOps: Creating the DevOps Organisation, released last year.

The NGDO Transformation Framework offers a structured approach to a business-wide transition to DevOps. Underscoring the nature and complexity of moving to DevOps, the framework enables organisations to choose which challenge to tackle first, outlining the benefits at each stage of the transformation.

The NGDO Transformation Framework is designed to help organisations assess their DevOps capabilities and prioritise projects to deliver improvements to existing functions – as well as deliver new ones.

Additionally, the Framework offers the ability to execute a series of projects that can individually progress an organisation’s capabilities, delivering real-world value. The projects are also designed so that they can be used together to enable additional capabilities with minimal extra work.

The Next Gen DevOps movement merges behaviour-driven development, infrastructure-as-code, automated testing, monitoring and continuous integration into a single coherent process.

Armed with the lessons taken from the Agile software development movement, combined with the latest in Software-as-a-Service (SaaS) solutions, cloud computing and automated testing, NGDO is Grant Smith’s vision for the biggest evolution of business IT yet.

“There is no better time than now to transition to the new ways of thinking outlined in my NGDO Framework,” Grant Smith says. “Using this framework, businesses can increase their efficiency and responsiveness and enable more agile ways of working.”

The Next Gen DevOps Transformation Framework is open source and licensed under the Creative Commons CC0 1.0 Universal (CC0 1.0) licence.

Notes to Editors

Grant has created and led high performance operations teams in some of the largest and fastest growing companies in the UK and is at the forefront of the DevOps movement.

He has driven real collaboration between operations and development teams by implementing infrastructure-as-code and driving system integration from continuous build systems.

Grant has delivered cloud-based game platforms enjoyed by millions of players per day and websites serving a billion page views per month. Most recently, he delivered a high performance, scalable Internet of Things (IoT) platform for British Gas. Grant is a frequent speaker at conferences and events.

More of Grant’s work can be viewed at nextgendevops.com.

Next Gen DevOps: Creating The DevOps Organisation is available for Kindle on Amazon now. http://amzn.to/1DDvD6i

Book Cover Image Link: http://imgur.com/O0GauId

PR Contact Grant Smith Grant@nextgendevops.com

The First Blue/Green Production Deployment circa 2005

Just a short post from me this week as I focus on publishing the Next Gen DevOps Transformation Framework. My friend and former colleague Phil Hendren, now of Mind Candy fame, has just published an article about our first experience with Continuous Delivery back at AOL in 2004/5. This story has been told many times by some of the amazing people we were privileged to work with on that project, but they have usually told it from a software engineering perspective. This is the first time the tale has been told from an operations perspective. There are some interesting nuances to this story; I won’t share them now because for now you should just go and read Phil’s article. However, I’d like to leave you with one thought before you click away: six weeks after we went live with the new system and the new deployment mechanism Phil writes about, we stopped getting bugs in the live environment. Six weeks after that, everyone stopped caring about when the new update would be deployed. If we could achieve that a decade ago, imagine what we can do now.
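Phil’s article describes the real mechanism; for readers who haven’t met the pattern, here’s a rough, invented sketch of the blue/green idea itself: run two identical production environments, deploy to the idle one, verify it, then flip traffic in a single step so that rollback is just flipping back.

```python
# Toy sketch of a blue/green cutover; the health check is a placeholder.

environments = {
    "blue":  {"version": "1.0"},
    "green": {"version": "1.0"},
}
live = "blue"  # which environment currently takes traffic

def run_smoke_tests(env: str) -> bool:
    return True  # placeholder for real post-deploy verification

def deploy(new_version: str) -> str:
    """Deploy to the idle environment and cut over only if it's healthy."""
    global live
    idle = "green" if live == "blue" else "blue"
    environments[idle]["version"] = new_version
    if run_smoke_tests(idle):
        live = idle  # atomic cutover; the old environment stays warm for rollback
    return live

assert deploy("2.0") == "green"
assert environments["green"]["version"] == "2.0"
```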

Photo courtesy of Kylie, who delights in taking close-up photos while we’re on holiday that annoy the hell out of me but make good featured images on blog posts 🙂

Computing Summit DevOps 2015

I attended Computing’s DevOps 2015 summit yesterday (Wednesday 8th July at Hilton Tower Bridge). Let me start by saying that if you have a chance to attend one of Computing’s conferences, do so. It was incredibly well organised, the material they presented to kick-start the day was really interesting and all the presenters and panellists had insightful and useful experiences to share. There were a few threads that seemed to run throughout the day, and some of them took me by surprise. I’m going to highlight a few of the things that struck me from yesterday while they’re still fresh in my mind, but I also encourage you to search Twitter for #ctgsummit to see the stuff we were tweeting about during the day.

“attendees from small businesses to banks, from insurance companies to retailers have accepted DevOps as essential”

I was on a panel discussing the business case for DevOps. What impressed me the most was that we didn’t really discuss the case for DevOps at all. It was clear the attendees all understand that DevOps is essential. We spent most of our time talking about how DevOps can work in heavily regulated environments, how to make a compelling case for DevOps to non-technical executives and how to rapidly show the value of taking a DevOps approach. It’s been obvious to me, and many others, for some time that DevOps is the next step in technology’s evolution. Last year was frustrating because of all the talk about DevOps just being for unicorns and the distraction of enterprise DevOps and bi-modal IT. While I’m sure we haven’t heard the last of all that, attendees from small businesses to banks, from insurance companies to retailers seem to have accepted DevOps as essential for their future. Now we can have much more interesting discussions about how we approach DevOps transformations.

“Some of the most successful modern businesses started their DevOps initiatives ten years ago.”

That leads me to something that irked me yesterday. I wasn’t really in a place to object, as the comment was raised during a later panel event. There’s still the impression that the best way to approach a DevOps project is to start with a small initiative. I’m sure that’s the easiest way to do it, and I’m sure it requires the least investment and hence the least convincing of executives, but I don’t think it’s the best approach any more. Some of the most successful modern businesses started their DevOps initiatives ten years ago. Think Netflix, Etsy and Amazon Web Services; these companies are now giants. They also talked extensively about the methods they were using at the time and continue to do so. The next generation of successful companies learned from those lessons, and now the fin-tech industry is poised to do to Wall St. and The City what Netflix did to the cable companies. If you start with a small DevOps initiative now and it takes you two to three years to complete your transformation, you might well be surprised by who your competition is when you complete your journey. You might find yourself a decade behind companies you’ve never heard of who are eating your market share.

“the biggest barrier or advantage to DevOps is senior management buy-in.”

Another consistent thread that ran through yesterday’s event, from Computing’s own primary research presented at the start of the day through every panel and presentation, was that the biggest barrier or advantage to DevOps is senior management buy-in. I know from my own experience that without senior management buy-in a DevOps approach is limited in what it can achieve. Lack of senior management buy-in held us back from our ultimate potential at Playfish, and its presence is one of the reasons Hive has been so successful. If your business is considering tackling a small DevOps project because you’re struggling to get senior management buy-in, I suggest an alternative approach: there is an absolute wealth of documentation out there stating very clearly that responsive technology departments make their organisations more successful, and another wealth of documentation demonstrating that DevOps makes IT departments more responsive. Puppet and Thoughtworks produce the State of DevOps Report, which makes those points very clearly. Harvard Business School and Oracle have a report that makes a similar point. Computing’s primary research certainly suggests it, judging by the data I saw yesterday.

“the product-centric approach to DevOps was another popular theme.”

I was very pleasantly surprised yesterday to see that the product-centric approach to DevOps was another popular theme. This is particularly gratifying to me because I wrote a whole book about why that’s a great approach and how to implement it in technology organisations. It’s been a hot topic with various DevOps luminaries in recent months, but it’s clear that many organisations are making it work. Boyan Dimitrov told an amazing story about Hailo’s product-centric DevOps transformation. What I find most remarkable about Hailo’s story is that they understand that a product isn’t just something that directly generates revenue: their back-end services are also products in their own right. There was a lot more to the day than this short summary, but I wanted to highlight the particular themes that really stood out for me. If you’re considering a DevOps transformation and don’t know how to get started, feel free to get in touch.

Dune, Arrakis… DevOps

Dune, by Frank Herbert, is an amazing book and an equally amazing David Lynch movie. It’s also a phenomenal management book and contains some valuable wisdom that we can apply to DevOps.

In this article I’m going to pull out various quotes and thoughts from the characters in Dune and share with you some of the lessons I’ve taken from them.

One of the most powerful lessons I learned from Dune comes about halfway through the book. Just before a pivotal moment in the story, the lead character, Paul Atreides, reflects:

“Everything around him moved smoothly in the ancient routine that required no order.

‘Give as few orders as possible,’ his father had told him … once … long ago. ‘Once you’ve given orders on a subject, you must always give orders on that subject.’”

As a leader this advice has been invaluable to me. A team that understands the values of its leader and the behaviour he or she expects from them far outperforms a team given constant instruction and supervision.

Not only does this hold true for managing teams, it also holds for DevOps. When infrastructure is considered as individual machines that need care and maintenance, it encourages us to give constant orders to them: permit this person to issue this command, archive this log at this time, execute this command, and so on. Considering infrastructure as a system composed of identical nodes encourages us instead to define behaviours that the system as a whole conforms to. This is modern configuration management in a nutshell. Any nodes that don’t comply are simply replaced when they display behaviour out of the ordinary. This makes it easier to manage all the different environments the same way, making application behaviour more predictable. That reduces the time spent diagnosing why problems occur in one environment and not another, which in turn reduces the time needed to implement new features.
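Here’s a minimal sketch of that idea; the service names and node interface are invented for illustration rather than taken from any particular configuration management tool. The desired behaviour is declared once, every node is converged towards it, and nodes that can’t comply are replaced rather than repaired by hand:

```python
# Illustrative sketch of declarative, whole-system configuration management.

DESIRED_SERVICES = {"game-server", "metrics-agent"}  # hypothetical names

class Node:
    def __init__(self, name, running_services, healthy=True):
        self.name = name
        self.running_services = set(running_services)
        self.healthy = healthy

    def start(self, service):
        self.running_services.add(service)

def converge(node):
    """Drive one node towards the desired state rather than issuing ad-hoc orders."""
    for service in DESIRED_SERVICES - node.running_services:
        node.start(service)

def enforce(fleet):
    """Converge every node; replace any node that cannot be made compliant."""
    for i, node in enumerate(fleet):
        if not node.healthy:
            fleet[i] = Node(node.name, DESIRED_SERVICES)  # replace, don't repair
        else:
            converge(node)

fleet = [Node("web-1", {"game-server"}), Node("web-2", set(), healthy=False)]
enforce(fleet)
assert all(n.running_services >= DESIRED_SERVICES for n in fleet)
```

The point isn’t the toy code, it’s the inversion: you describe the desired state once and let the system hold itself there, instead of issuing orders node by node.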

What Paul’s father, Duke Leto Atreides, knew was that people whose loyalty he had earned respected his values and wanted to work according to them. What I’ve learned is that systems whose behaviour is managed are more reliable and require considerably less maintenance than systems that are poked and prodded at.

Dune even goes so far as to offer us DevOps axioms:

Earlier in the book Paul is being tested by one of his mother’s teachers and she tells him:

“In politics, the tripod is the most unstable of all structures.”

She’s referring to the major structures of government that make up the world of Dune.

However, there’s a lesson for us in technology here too. The three great competing forces in a typical technology department are Development, Operations and Testing. Just as in Dune, there can be no stable and productive arrangement of these three groups.

Development are incentivised to change and grow capability. Testing are at the whim of Development but are often underfunded and unloved. Operations are incentivised by stability and predictability, and often take a cautious position that plays well to the needs of a test team always looking for more time. If Testing receives funding, they form a power bloc with Development that besieges Operations. So there can never be a stable, productive accord between these three groups… while they remain three divided groups.

The lesson for us here is not to create these three structures in the first place, or to break down the silos that constrain them if we have them. By creating product-centric service teams composed of engineers from all the essential disciplines, the team will have all the skills they need to build, launch, manage and support their service. This new type of team is motivated by the performance of their service, not merely its infrastructure, its code or its conformance to predefined criteria.

There’s also some great practical advice for troubleshooting scaling problems. Early in the book Paul quotes the first law of Mentat. In Dune, Mentats are human computers capable of processing data at incredible rates. The first law of Mentat states:

“A process cannot be understood by stopping it. Understanding must move with the flow of the process, must join it and flow with it.”

Consider the data most technical teams have for troubleshooting: system metrics, log entries, transaction checkpoints. All static data capturing a single point in time, and from this data we’re supposed to understand how processes are behaving. Now consider what happens when we collect, aggregate and trend this data over time and review states prior to, during and after incidents. We begin to truly understand process behaviour and the behaviour of the various systems those processes operate within. To be fair, the Application Performance Management tools took this as their goal from the outset, but those tools have only been around for a few years and many organisations are still not investing properly in their monitoring services.
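As a toy illustration of the difference (the metric and window size are invented), a single sample tells you almost nothing, but a rolling window lets you compare recent behaviour with past behaviour and watch the process move:

```python
# Sketch: trend a metric over a rolling window instead of reading one snapshot.
from collections import deque
from statistics import mean

WINDOW = 60  # keep the last 60 samples
samples = deque(maxlen=WINDOW)

def record(value: float) -> None:
    """Append the latest sample; old ones age out automatically."""
    samples.append(value)

def trend() -> float:
    """Compare the recent half of the window against the older half.
    A ratio far from 1.0 means behaviour is changing -- something a
    single point-in-time reading could never show."""
    if len(samples) < WINDOW:
        return 1.0
    half = WINDOW // 2
    older = mean(list(samples)[:half])
    recent = mean(list(samples)[half:])
    return recent / older if older else float("inf")

# e.g. feed in request latencies as they arrive:
for latency_ms in [20, 22, 21, 23] * 15:
    record(latency_ms)
print(f"trend ratio: {trend():.2f}")  # ~1.0 means steady-state
```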

I’d like to leave you with one final message from Dune:

“Survival is the ability to swim in strange water.”

The water we swim in these days has become very strange. Applications have become services, services are composed of micro-services, and these services might run on platform services or infrastructure services that are themselves dependent on other services. We need DevOps if we’re going to survive in these strange waters.

Buy Dune on Amazon and while you’re at it you may as well buy my book too.

Featured image 77354890.jpg by Clarita on Morguefile.

DevOps Framework: The last teaser


This week’s update is another short one: we’ve just come back from an awesome holiday in the Lake District. I’d had an intense few weeks of writing and interviewing, and then to top it all off I contracted some sort of flu or virus thing. Kylie’s contract had just come to an end, so we decided to take a short break before she started her next one. As you can see we had a great time, and the weather was wonderful!

I love the Lake District. There’s something about the fresh air and the countryside that really refreshes me. If anyone in Cumbria is considering a DevOps transformation, please feel free to get in touch :).

I’ve returned ready to get the framework into a first-draft state and published on GitHub so you can all get your teeth into it. On that note, Kylie has finished her initial assessment and provided some initial feedback that I’ll be working on this week.

Finally, I’m participating in a panel discussing the business need for DevOps at Computing Summit’s DevOps 2015 event next week, July 8th. I hope I’ll see some of you there. If you’re not interested in seeing me speak, you should definitely hear what my friend Phil Hendren of Mind Candy fame has to say. While I’ve been leading DevOps initiatives and teams for the last six years, Phil’s been engineering the solutions that make it happen. You can find Phil’s blog here; it’s a great source of practical DevOps advice and tools.

A great example of DevOps in action!

I’m late bringing this to you as I’ve been ill for a few days, so please accept my apologies. Last week I read a great piece that makes a really strong case for DevOps. If you think businesses can continue to thrive while they isolate their software, operations, network and test engineers in separate teams, assigned to projects by resource coordinators, then you need to read this article.

I won’t write too much more here because I really just want you to read Ms. MacVittie’s article. I do want to close by saying that I’ve heard a lot about the dangers of optimising too early; I’ve also seen many examples of what happens when you optimise too late. I think cross-functional, product-centric teams can help maintain the balance between feature development and optimisation, whether that’s service optimisation as discussed in Ms. MacVittie’s article or refactoring: https://devcentral.f5.com/articles/microservices-and-http2

* Featured image: cotswold-stone-wall.jpg by thesuccess, found on Morguefile.com