Recently, I had the opportunity to come on board a complex enterprise application project for one of Canada’s biggest financial and insurance firms. I was brought in as a consultant halfway through one of their significant new enterprise application development projects. The project was struggling and had significant problems that needed to be overcome. This article describes how driving change and creating a new paradigm got our enterprise software moving in the right direction, enabled accurate data-driven decisions, significantly reduced technical debt and regression, and gave business owners confidence that we were proceeding in the right direction.
I was part of a team of external consultants (like me), developers, business analysts, architects, domain experts, business owners, testers, and managers, both onshore and offshore, plus external vendors, which included one of the largest software companies in the world and one of the largest offshore contracting companies based in India, which had provided over 50 people of its own to work on the project. Suffice it to say, we had a very large project team working in multiple locations around the world.
There were some very significant problems inherent to the application that needed to be worked through. I worked with the teams and introduced a set of initiatives that significantly reduced our technical debt and regression rates, improved design, and created stable software. I took on other responsibilities as well, including ensuring that non-functional requirements such as performance were met and signed off by the business owners, which we achieved, but that is out of the scope of this article. Here, I want to discuss how we went from a system riddled with bugs, technical debt, and seemingly unending regression, to a stable and production-ready application.
I started by reviewing the entire application. I reviewed the backlog of outstanding issues and the entire code base, both manually and with automated tools. I reviewed the existing architecture and looked for deviations between the intended architecture and the actual implementation, documenting any problem areas I saw and opportunities for improvement. We looked at the data and were able to target the areas which needed the most care. Because the application was still under initial development, we needed to carefully manage this new development while refactoring problem areas in tandem. New development took precedence, so we made sure that the refactoring we did do was focused on our biggest problem areas.
In addition to leading multiple team areas, I worked directly with all team members on implementation, best practices, pair programming, and code reviews. I have always liked code reviews, as they allow entire teams to review code, come up with better approaches, and help resolve problems. They reinforced developers working together, built the team, and encouraged collaboration on innovative solutions to tough challenges. Often, a code review would take the form of a pair-programming session, so we had several people, sometimes in the same building and sometimes on other sides of the world, looking at the same code and working together to solve the problem, while other team members observed and provided input as necessary.
We Didn’t Have What We Needed
In a well-functioning DevOps environment, refactoring should be second nature, and bad code should get replaced with good code over time. However, larger refactoring exercises, rewrites, and restructuring of large areas of an application are a tougher pill to swallow, with inconsistent results. Often, developers and other technical people look at a large area of an application that appears sloppy and recommend a large refactoring or a rewrite. Sometimes it’s justifiable, and sometimes it’s not – either way, it’s sometimes done regardless, even when it shouldn’t be. To really justify any of the work we do, we really need to understand the value (and ROI) of what we are planning to do. I helped to instil this idea within my team to ensure that the team had this same collective understanding. I certainly didn’t want to promote waste or any other action by the team that couldn’t be measured. Questions I asked myself were: “What does the data tell us about where our biggest problem areas are?” “What are our QA testers telling us? Which areas are we seeing the most regression in?” Unfortunately, the data that we had didn’t fully answer these questions. So, without the data, any large rewrite or refactoring exercise might improve the code, might end up being of neutral benefit, or could introduce significant new risk.
Throughout the process of working with multiple projects, offshore and onshore teams, directors, business owners, analysts, development team members and more, I focused on teamwork and accountability, and helping ensure the teams I was involved with were working together and functioning as a team. I introduced agile practices like pair programming and retrospectives, architectural reviews, design enhancements, and solving business problems using technology and code.
We needed more, though. When dealing with millions of lines of code spanning multiple domains, applications, service architectures, and systems, we had a lot of work to do before we could bring the application to market, and the business owners needed to understand our progress. From their perspective, our application had historically been very unstable, the project was late, and we were nowhere close to bringing it to market.
The business stakeholders wanted more transparency into the number of stability issues we were seeing, including the number of crashes. One of our senior team members came up with an idea to capture the number of errors in the application over time by looking at all of our application log files and W3C log files on multiple sites on a daily basis and parsing them, thereby counting the number of exceptions or crashes we had across the entire enterprise system. We had a development consultant build the initial iteration of this approach, and it was a good first attempt. However, it wasn’t completely accurate, had some false positives, and required manually updating Excel spreadsheets to track the results – too much manual effort. What it did provide was some initial insight into stability issues, and business stakeholders were happy that we at least had some data to report on. But the data didn’t tell us an accurate enough story. We weren’t where we needed to be.
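The spirit of that first iteration can be sketched in a few lines. This is a minimal illustration, not the actual tool: the directory layout, file extension, and the assumption that error lines carry an `ERROR` token followed by a .NET-style exception type are all hypothetical stand-ins for the real log formats we parsed.

```python
import re
from collections import Counter
from pathlib import Path

# Hypothetical log line shape: "... ERROR System.NullReferenceException: ..."
ERROR_PATTERN = re.compile(r"\bERROR\b\s+(?P<exc>[\w.]+?Exception)")

def count_exceptions(log_dir: str) -> Counter:
    """Tally exception types across every .log file under log_dir."""
    counts = Counter()
    for log_file in Path(log_dir).rglob("*.log"):
        with open(log_file, encoding="utf-8", errors="replace") as fh:
            for line in fh:
                match = ERROR_PATTERN.search(line)
                if match:
                    counts[match.group("exc")] += 1
    return counts
```

A daily run of something like this produced the counts that went into the spreadsheets – which is exactly why it was fragile: one regex over free-form log text invites false positives, and nothing ties a counted exception back to a defect.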
Driving Change and Creating a New Paradigm
I created a new initiative to provide more insight into where our application’s biggest problems were. Even though business stakeholders only wanted a daily count of the number of crashes or stability issues, I saw this as a huge opportunity to take things to the next level.
I created a strategy and drove change by creating a new paradigm: make the best decisions, or better stated, the best data-driven decisions, and validate all of our efforts to show that they have been successful. There are some proprietary details that I won’t get into, but effectively I architected, and led a team to implement, a new framework within our existing systems that would track all of the raw data we would need to understand where problems and crashes were happening in our application, what the details of those problems were, which variables and application states led to the problem, which inputs or outputs led to the problem, and so on. About 20 different software projects within the enterprise application were retrofitted to use this new framework.
My goals with this framework and the new processes around it were multi-faceted.
- Provide confidence to business stakeholders that we were meeting our goals and improving the product, through a series of KPIs
- Provide unprecedented insight into our biggest software problems, allowing us to use the data to plan and get business buy-in for new initiatives
- Allow our development teams to quickly diagnose and analyze defects by providing an unprecedented amount of new metadata about defects and crashes
- Tie a particular defect logged by our QA team to the actual crash that happened, including all of the metadata around that crash
- Allow QA and development teams to understand when a crash was net new, or related to another crash or recently logged defect
- Understand which types of crashes happened most often, and act on it
- Understand which areas (projects, services, domains, layers, or modules) of our application had the most problems and were therefore the most important to redesign or spend time hardening, if necessary
- Provide an unprecedented level of analytical reporting to be used by QA, developers, leads, business stakeholders, and the rest of the project team
- Create reporting to be used as a guide for managing our enterprise software’s defects, technical debt, and refactoring exercises
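The proprietary details stay out of scope, but the core idea of the framework – capture a structured crash record with its context, and derive a stable identifier for grouping – can be sketched. Everything here is illustrative: the field names, the `capture` helper, and the choice of hashing the exception type plus stack trace into a fingerprint are assumptions, not the actual implementation.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
import hashlib
import traceback

@dataclass
class CrashRecord:
    """One captured failure, with enough context to triage it later."""
    exception_type: str
    message: str
    stack_trace: str
    module: str                                     # project/service/layer that raised it
    app_state: dict = field(default_factory=dict)   # relevant variables and inputs
    timestamp: str = ""
    fingerprint: str = ""                           # stable id used to group related crashes

    def __post_init__(self):
        self.timestamp = self.timestamp or datetime.now(timezone.utc).isoformat()
        # Fingerprint on type + stack trace so repeats of the same root cause collide.
        raw = f"{self.exception_type}|{self.stack_trace}"
        self.fingerprint = hashlib.sha1(raw.encode()).hexdigest()[:12]

def capture(exc: Exception, module: str, app_state: dict) -> CrashRecord:
    """Hypothetical hook called from an application's error handlers."""
    return CrashRecord(
        exception_type=type(exc).__name__,
        message=str(exc),
        stack_trace="".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),
        module=module,
        app_state=app_state,
    )
```

The important design point is the fingerprint: once every crash carries a stable identifier derived from its cause, duplicate detection, regression detection, and per-module reporting all become simple queries over the collected records.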
After I presented the plan, everyone was on board with the idea. Over time, it evolved into the most critical component of managing our enterprise application’s defects and technical debt.
The reporting components were used by most team members on a regular or daily basis. The unprecedented insight into application issues significantly reduced our MTTR. Before, we’d often have offshore and onshore developers working on defects without communicating well, and we’d have multiple defects with the same root cause being worked on in parallel by different teams. This was redundant and counterproductive. In fact, once we started collecting all of this data, a quick audit of our previous defects showed that we could have up to 5 separate defects logged for different issues that shared the same root cause. Each of these defects had comments back and forth between developers and business analysts looking for a solution. So much waste. With the right data and the ability to report on it and consume it, we eliminated this waste, further contributing to significantly better MTTR and significantly reduced lead analysis time.
We weeded out duplicate issues easily, and we could recognize regression issues just as easily. The data would tell us which defects were related to which piece of code, and from that piece of code we could determine all of the other related defects we had in the system and report on the whole chain of defects logged by QA that had the same or similar root cause. We could also now track how often we were seeing net-new issues versus regression issues.
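The duplicate and regression detection described above can be sketched as two small functions over the collected crash records. This is an illustrative simplification: it assumes each record carries the kind of root-cause fingerprint discussed earlier, and the set names are hypothetical.

```python
from collections import defaultdict

def group_by_root_cause(crashes):
    """Group crash records by fingerprint; each group is one root cause."""
    groups = defaultdict(list)
    for crash in crashes:
        groups[crash["fingerprint"]].append(crash)
    return groups

def classify(crash, known_fingerprints, resolved_fingerprints):
    """Label an incoming crash as a regression, a duplicate, or net new."""
    fp = crash["fingerprint"]
    if fp in resolved_fingerprints:
        return "regression"   # a previously fixed root cause has resurfaced
    if fp in known_fingerprints:
        return "duplicate"    # already being worked on; don't assign it twice
    return "net new"
```

With this kind of classification, the "5 separate defects, one root cause" situation from the audit collapses into a single group, and a reappearing fingerprint after a fix is flagged as regression instead of being logged as a fresh defect.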
Confidence to Business Owners
We provided confidence to business owners. The business could now see, on demand and through customized analytical reporting, where we were with our software. Is our regression rate going up or down? Are we on a downward trend with application crashes? How fast is the trend moving toward zero? I regularly provided analytics and reporting to the business, including managers, business owners, and directors, showing the improvements we were making and demonstrating the downward trajectory. We now had data that allowed us to put accurate projections together in order to come up with milestone dates and launch dates with a level of predictability. Prior to this, there was little predictability, because we had very few accurate metrics and KPIs that allowed us to determine where we were versus where we needed to be.
A simple example: just think about your regression rate. When you are opening more defects than you are closing, and you have no way to know the significance of those defects, how can you possibly say that you will be able to launch in 6 months, or in 12 months? You can’t. Instead, what happens in many organizations is that business stakeholders talk to the technical and business owners, come up with a plan, ballpark it, and say, “OK, I think we can have this delivered in 6 months, and phase two can go to market 12 months after that.” Then they have 6 months to figure out how they will get there. However, they’ll get close to the 6-month mark, be nowhere near where they need to be, and then say: we need to delay another 6 months. Then another 3 months. And so on. Some organizations will just increase manpower or ask the development team to work overtime, as if overtime will compensate for bad practices or help you figure things out when you have absolutely no grasp of the scope of your problems.
However, since we were now collecting data, putting together useful and accurate KPIs, and keeping the team accountable for meeting KPI targets, we could show a downward trajectory. When you are creating X new defects per day but closing twice as many, for example, and you have your regression under control and insight into the significance of each defect or type of defect, you can make genuinely sound judgements about the current and future state of your enterprise software. You know which defects, and which percentage of the overall defects, are significant enough to be fixed for production, and which can be deferred, and you know from the historical trend that you can expect to hit an inflection point in X months, based on intelligent projections, where the software will be stable enough and crash-free according to specifications mandated by the business team. It also gives insight into what would be needed in terms of people or resources to accelerate development to the point where we are ready to launch.
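The arithmetic behind that projection is simple, and worth making explicit. A minimal sketch, assuming a steady daily open rate and close rate (the real projections used trend lines over the historical data, not single constants):

```python
def days_to_zero(open_defects: int, opened_per_day: float, closed_per_day: float):
    """Project how many days until the open-defect backlog reaches zero,
    assuming the observed daily open/close rates hold."""
    net_burn = closed_per_day - opened_per_day
    if net_burn <= 0:
        return None  # backlog flat or growing: no credible launch projection exists
    return open_defects / net_burn

# e.g. 300 open defects, opening 10/day, closing 20/day -> 30 days to zero
```

The `None` branch is the whole point of the earlier example: when you open defects faster than you close them, no amount of ballparking produces a defensible launch date, and the data makes that visible to everyone.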
DevOps and QA Improvements
Throughout this process, our QA team also improved. The data was showing us that significant issues were being found in different environments that somehow the QA team didn’t find in their testing. How could this be? We relayed this information to the QA team and worked with them to introduce additional test coverage in the areas of the system that the data showed were the most fragile. We were also able to determine that the QA team was, in fact, experiencing some bugs and crashes that they did not record in our defect management system. We worked with the QA team to understand why, and at first they were defensive (why are you questioning us?), but after working through the scenarios with them closely, we identified problems in their own processes: they were not adequately testing some areas, and they were not recording some of the crashes and bugs they experienced because they didn’t know they were “supposed to”. It’s not a blame game or a finger-pointing exercise. The data and analytics we now had truly allowed us to collaborate and improve the SDLC processes throughout this whole enterprise application, and we all ended up better off because of it.
Justifying Effort and Measuring Results
Remember at the beginning of the article, I said we were able to look at the code and identify big problems with the application that we should consider large refactorings or rewrites for? Well, we didn’t have the data at the beginning to justify the effort. We could do it, and it might work out or might be wasted effort – we didn’t know. Now we had the actual data. I was able to drive these initiatives, compile all of the data, and identify, with hard data to back it up, the key areas where we had the most fragile code – the key modules, services, and components where we were seeing the most problems. We could see that we received null reference exceptions quite often, and we were able to search extensively across the application to identify and fix potential problems before they occurred. We could now rewrite and reintroduce the data adapter and data translation components which were causing significant problems, and we had data to back it up, which we could use to support our initiative and get business buy-in to move forward with the development work. Because of the amount of data captured, we could also measure our success and present it as a win to business stakeholders. We could now confidently state things like, “after our significant and targeted code hardening and rewrite exercises, we are now seeing 90% fewer problems in the targeted areas”. The data gave us the insight needed to accurately measure and validate the ROI.
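Identifying the most fragile areas amounts to an aggregation over the crash data. A rough sketch of the kind of ranking the reports were built on, assuming each crash record carries a module name and a root-cause fingerprint (both hypothetical field names here):

```python
from collections import defaultdict

def rank_fragile_areas(crashes, top_n=5):
    """Rank application areas by crash volume and distinct root causes,
    so refactoring effort targets the modules the data says are worst."""
    by_module = defaultdict(lambda: {"crashes": 0, "root_causes": set()})
    for crash in crashes:
        entry = by_module[crash["module"]]
        entry["crashes"] += 1
        entry["root_causes"].add(crash["fingerprint"])
    ranked = sorted(
        by_module.items(),
        key=lambda kv: (kv[1]["crashes"], len(kv[1]["root_causes"])),
        reverse=True,
    )
    return [(module, stats["crashes"], len(stats["root_causes"]))
            for module, stats in ranked[:top_n]]
```

Counting distinct root causes alongside raw crash volume matters: a module with one noisy bug repeated a thousand times is a very different refactoring candidate from a module with dozens of independent failure modes.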
It turned out that some of the areas we initially wanted to redesign didn’t actually have significant problems. Although the code and architecture in those areas certainly could have been rewritten and improved, upon analyzing the data, we had very few issues there whose root cause was related to sloppy code or architecture. Fortunately, the data showed us that if we had gone forward with a significant refactoring in these areas, it would have brought almost zero benefit to the application, delayed our launch by months, and introduced new risk and potentially new defects and crashes to work through.
Capturing the right data and using it to our advantage greatly improved the prospects of our enterprise application, and provided insight and confidence to management that we were improving and on the right trajectory to launch into the market. Because many people on the project may not have been used to making data-driven decisions to drive enterprise software delivery, there was some pushback at times from business owners, project managers, and program managers that I had to mitigate. Often, I’d have to field important questions, and deservedly so, from people much higher up the chain of command (people who sign my contract and approve my invoices), including: “How do you know users are experiencing these problems?” “We are seeing more problems in this application than we anticipated; how do we know you are not reporting on false positives?” “How do we know these problems you are reporting are significant?” “How do you know your team is working on the most important issues?” “How can you be confident that we will be ready for production in X months?” “Why are you reporting that we have X problems in certain areas when our QA team did not find any problems like this?”
Navigating Through the Pushback
OK, so important initiatives will always have some pushback, but fortunately, I was able to navigate this pushback effectively and work directly with the business and technical teams so everyone understood the value and was on the same page as to how important the data analytics would be for delivery. I was able to anticipate the concerns that business leaders would have, and I was prepared to answer their questions honestly, as well as provide real-life examples, scenarios, and data which demonstrated their concerns and how we overcame those concerns using data analytics, reporting, and metrics. I used the data to back everything up, and also provided all of the raw data, so that anyone else could use it to come to their own conclusions or build upon existing conclusions if they wanted to.
Here are some of the responses I was able to use to mitigate the pushback, move the data-driven decision approach forward, and get to the point where we could successfully launch our enterprise product into the market.
- We weren’t reporting false positives in our headline numbers, because we were able to trace each reported crash or defect directly to an error the user would actually experience. We also used this to increase QA coverage, which has shown a dramatic decrease in new issues found in the SIT and UAT environments.
- Some false positives do show up in the raw data, and we are able to identify them. In some cases we have fixed the code that produced the false positive, and in other cases we exclude them from our reporting.
- We know that these problems are significant due to the business areas they affect, and also due to the large number of specific types of issues that we see over and over in the same areas. We have the data to show how often we experience problems in each of these areas.
- Our team is working on the most important issues. Part of our new triage process is to use all of the data we have collected to determine the most significant defects for the sprint. We have a report that ranks defects by the number of repeat occurrences, the application domain, and the application areas most significant to the business.
- We can anticipate when we will be ready because we have charts that show a downward trend. We now have control of the situation, and can project when we will hit the inflection point and meet our business’s stability and regression requirements. Here are the charts that show the trajectory, and the raw data is available as well for you to look at.
- Our QA team didn’t find many of the issues we have been reporting on, for a variety of reasons. We are already working directly with the QA team; we have identified gaps in their test coverage, and we have determined there were specific types of problems they didn’t know they were supposed to log. We have helped correct this and introduced new processes for defect logging and new areas for test coverage.
- Here are examples where the data we now have significantly reduced the amount of time required to fix defects and allowed us to focus on delivering the right solution. We now know when different defects are related, and we’ve been able to use this to our advantage.
Overall, by driving change and creating a new paradigm for how we approach defects and technical debt within our enterprise applications, project team members and business stakeholders at all levels had unprecedented levels of data, charts, and reporting that inspired confidence in the business that we were on the right trajectory. Our QA teams and test coverage improved, and duplicate defects and redundant work were eliminated. The development team was finding the root cause of new defects faster and closing them faster, reducing the time to resolve an issue by more than 50%. KPIs were introduced, and we could use the same data we had been capturing as metrics to validate that we were meeting them.
It’s important to always drive change within an organization when you believe the value to delivery and execution will be significant. Consultants and leaders always need to adapt and adjust to project scenarios. It’s not a one-size-fits-all approach, but if you are focused on delivering the right software on time, you will need a way to measure that delivery, create a measurable trajectory to get there, measure and improve, and, very importantly, provide agreeable KPIs to the business owners and make sure you hit them – this will instil the confidence needed to get continued buy-in, and ultimately lead to a successful launch and execution.