Introduction – Technical Platform
This whitepaper is based on my personal experience working on a technical platform program in the retail banking division of one of the world’s largest banks, and my involvement in the project that introduced cloud technology to the customer.
By outlining the context, strategy, the decisions that were made and reasons for making them, I hope that my experience will be helpful to others involved in similar projects at other large organisations.
This document contains no proprietary information that would disclose the identity of the customer. The challenges described here are not unique to this organisation and are quite common, especially to other large FS organisations.
In 2015, the customer set about transforming its digital approach to its Retail Banking and Wealth Management (RBWM) business. Separate funding was allocated and considerable architectural freedom was given to RBWM IT to determine what changes were needed. At the same time, digital transformation was occurring in other parts of the business: product managers were hired from technology companies rather than banks, and small cross-functional teams focused on specific product lines were formed, with the objective of rapidly delivering software.
One immediate challenge that RBWM digital ran into was that there wasn’t an “environment” that would enable rapid development of anything. The bank was still approaching most of its projects as large monolithic endeavours, in which six-month delivery cycles for a first release were considered the norm. Infrastructure and environments were owned by the IT operations team, and provisioning for the most part was not automated. Almost every project had to start with designing its environment needs from scratch, and it was typical to wait for three months for the first test environments to be delivered. Infrastructure was not immutable, and it was common for configuration issues to bleed in when code moved through environments or into production.
Enter the Technical Platform
The main objective for creating the technical platform was to establish an “environment” for delivering outcomes quickly and efficiently. The platform’s goals were to:
- Create an ecosystem for the rapid development and deployment of applications
- Create an ecosystem more conducive to microservices-style development, naturally moving away from the monolithic style
- Foster an environment where multiple development teams can co-exist
- Eliminate the need to build physical infrastructure every time a new project is started, thus shortening the delivery time required
- Shift to a true DevOps culture
With these goals in mind, cloud technology was the obvious solution to address the following needs:
- Agile development
- Microservices architecture
- Continuous delivery
- Infrastructure on demand: cloud enables the procurement of infrastructure as needed, reducing costs to cover only what is required
- Infrastructure as code: cloud offers the ability to treat infrastructure as code with version controls, leading to immutable infrastructure and repeatability
- Transparency: unlike the hidden costs associated with internal environments, using cloud enables all costs to be known and reported on
- No over-provisioning: cloud has the capability to scale up and down with demand
- Adaptable infrastructure: cloud treats servers as “cattle” rather than “pets” with the ability to discard and rebuild infrastructure at will
It quickly became apparent to the team building out the technical platform that implementing an external cloud in the bank would be a significant challenge (more on those challenges later). The team concluded that the external cloud was associated with a very high degree of risk, and if we went down that route straight away, there would be significant non-technical challenges to resolve that would overwhelm and totally stall the project. We sought an alternative approach that would allow us to meet the above objectives while cautiously exploring external cloud options.
Eventually we settled on the following “two-pronged” approach:
- Internal Platform
- Build the technical platform internally, using VMware Tanzu Application Service as a hosting option
- Build an internal CI/CD pipeline based on all the industry-standard tools
- Start exploring the use of external cloud in parallel, without impact to the delivery of the internal platform
In order for the projects not to diverge too much, the following two guiding principles were adopted:
- Whatever is built internally has to be available in the external cloud as well, so that we avoid re-architecture and rebuild later on.
- The opposite: whatever was built externally had to resemble what was built internally (I will explain why later).
Any team implementing an external cloud at a financial services organisation (FSO) can expect to run into the following challenges, which can roughly be split into three main categories: (a) Legal & Compliance, (b) Organisational, (c) Technical.
Legal and Compliance
- Regulators are on the fence when it comes to banks adopting cloud technology. For example, the FCA (the UK’s Financial Conduct Authority) issued a paper in 2015 that aimed to clarify its position on external cloud, but it lacked specifics and boiled down to a simple statement: banks are not allowed to transfer risk to the cloud provider, and if something goes wrong with the provider’s infrastructure, the regulator will go after the financial institution and not the provider. In my opinion, this paper didn’t go far enough. It declined to regulate, license or certify cloud providers, or to treat them as essential infrastructure that exists for the common good, leaving it to each financial institution to sort out its contracts with such providers based on its own risk appetite. Instead of creating a level playing field, it actually tilted the scale towards smaller FinTech players and newcomer banks, which have larger risk appetites and regularly fly under the regulator’s radar, and against larger established institutions that are under a lot more scrutiny.
- In this bank, the relationship with a cloud provider would come under the category of “third-party vendor / outsourcing.” The internal frameworks, rules, procedures and instruction manuals created to deal with third-party vendors have not been updated recently, and so are not easily applied to a third-party vendor that provides cloud infrastructure.
- Contractual arrangements between IT suppliers and banks are not easy to negotiate to begin with, and contracts with cloud providers are even more difficult because a framework for risk analysis often doesn’t exist.
This leads to two types of issues: contracts that are too weak because the organisation underestimates the risk, or a negotiation process that takes too long or fails because the risk is overestimated and the demands for protection exceed what the provider is willing to commit to.
- Regulators require banks to perform onsite security reviews of their suppliers. Some cloud providers, like Amazon Web Services (AWS), are extremely secretive about their locations (for the right reasons) and as a general rule do not allow onsite inspections. The quality of these inspections can be a point of contention, and often they are no more substantial than a tick in a box, but there is no point arguing about it if regulations and auditors require banks to do them.
This goes back to a point I made before: if regulators certified cloud providers, this issue would go away, but since they don’t, it is left to each financial institution to certify its provider. This was a big point of contention between the customer and AWS, but at the time of writing, AWS had softened its position and agreed to allow onsite inspections.
Organisational
- In this organisation, as in many others, IT is organised in silos. All infrastructure is owned and operated by IT operations (ITO), which then recharges that infrastructure back to the businesses based on very complicated apportionment metrics. In adopting external cloud, the question of ownership had to be resolved: the software delivery organisation couldn’t drive the project on its own and required ITO support.
- Many individuals at the bank didn’t take the project seriously at first, and there was a lot of resistance from several people that had to be involved. The team realised very quickly that we needed to secure very strong senior sponsorship within the bank to get the project off the ground.
- When it comes to security, AWS and other IaaS providers have what is called a “shared responsibility model.” The cloud provider is responsible for everything up to the hypervisor, and the customer is responsible for everything “above” it (e.g. the OS, OS patching, and the security of the virtual network and OS). This concept was initially very hard to explain, but it was very important and had to be “driven home” to the customer. On the other hand, it was a good way to win over the hearts and minds of the ITO team: they were still responsible for supporting the estate, but they didn’t have to worry about hardware anymore, and got great tools to do the support!
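The split described in the last point can be pictured as an ordered stack of layers with a single boundary. The following is a minimal sketch rather than anything AWS provides: the layer names and the `responsible_party` helper are illustrative assumptions, and the boundary is placed at the hypervisor as described above.

```python
# Illustrative model of the IaaS "shared responsibility model".
# Layer names are assumptions chosen for this sketch, ordered bottom-up.
LAYERS = [
    "physical facilities",
    "hardware",
    "hypervisor",
    "guest operating system",
    "os patching",
    "virtual network configuration",
    "applications and data",
]

# Everything up to and including this layer belongs to the provider.
PROVIDER_BOUNDARY = "hypervisor"


def responsible_party(layer: str) -> str:
    """Return 'provider' or 'customer' for a layer of the stack."""
    index = LAYERS.index(layer)  # raises ValueError for unknown layers
    boundary = LAYERS.index(PROVIDER_BOUNDARY)
    return "provider" if index <= boundary else "customer"


if __name__ == "__main__":
    for layer in LAYERS:
        print(f"{layer:32} -> {responsible_party(layer)}")
```

Writing the boundary down explicitly, even in this toy form, makes it easier to show an operations team which duties stay with them and which move to the provider.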
Technical
There were not very many technical challenges, and to a large degree they were manifestations of the non-technical issues described above. For example, the approved ways of connecting to third-party providers had not been designed with cloud providers in mind. The problem was easy to solve technically, but required a change to internal procedures first.
The customer didn’t go through a full RFP process with different vendors; our analysis at the time showed that AWS was by far the leader of the pack. Our thinking was that if we could work out the issues of adopting external cloud with AWS, the same method could be applied to other vendors, and because cloud is a pay-as-you-go service, the lock-in wasn’t a big issue. It is possible that in the future, the customer will use more than one provider at the same time for different projects.
It is important to note that the decision was made to initially consider AWS only for IaaS, and to avoid higher-level services, because each such service creates more “lock-in.” These services would be looked at individually by the architecture community, which would decide whether each was the right one to use or whether the same result would be better achieved with a BYO solution.
AWS Adoption Strategy
From the very beginning, we took the approach of being very open about what we were doing. Our thinking was that if we ran into fierce opposition from any corner, it was better to encounter it early on so that we could deal with it quickly, or if it was insurmountable, stop before we wasted too much effort.
We also wanted to avoid even the possibility of criticism that what we were doing was a “skunkworks” project and that we were putting the bank at risk. We knew from the beginning that even the slightest mistake (even if it didn’t actually put the bank at risk) could set us back years, as there would be no desire on the customer’s side to restart this effort for some time.
Finally, we wanted to ensure that whatever we were doing, the bank was always completely covered from a legal standpoint.
At this bank (and likely in numerous other large organisations) there was a governing body called the External Hosting Committee. Representatives from the legal, compliance, architecture, networks and security departments were charged with reviewing and either approving or rejecting requests from various projects to host things externally outside the bank.
For the most part, the EHC had to deal with requests to some form of SaaS for different niche purposes, from the outsourcing of benefits processing to HR or performance management systems. Never before had they dealt with a request for a full-blown IaaS; nevertheless, we elected to go to them and seek formal approval.
We anticipated that if we said that our goal was to use AWS for production straight away, we would spend months and months in internal approvals and discussions before we could even start exploring and learning, so we split our submission into two parts. We submitted one request for approval to use AWS for development and testing, and a separate one for production, clearly stating that the first one was the priority.
We also clearly stated that under no circumstances would real customer or employee data be used in development and test environments built in the external cloud.
This was very important to emphasise. Since the bank’s primary concern in using any external provider was the accidental exposure of customer data, it was crucial to make very clear the distinction between dev/test and production. This allowed for a very clear separation of risks and concerns, and faster approvals for the former while discussions continued on the latter.
As anticipated, the approval for use of AWS for Development and Test was granted without too many issues.
Before we started anything in AWS – even created the first account – we negotiated and signed the Enterprise Agreement. Again, the idea was to protect the customer, but also to discover early on any issues we might run into so that we could immediately tackle them – or if we found them insurmountable, stop.
We focused on the legal terms of the agreement and not the pricing. AWS pricing is very transparent and public. It is possible to negotiate private pricing, but we agreed that would only make sense if we knew what the volumes would be. Until then, it was pointless to spend time on that discussion.
Without turning this paper into a legal brief (which I am not qualified to write), I would only say that many of the legal points are not new to FSOs, which routinely outsource some of their operations to various third parties. Cloud just adds an additional element to this. I recommend approaching these points the same way you would approach a legal agreement with any third-party vendor where your customer data may reside or be processed. And while the responsibility for security is shared, there is no denying that fundamentally the security of the cloud infrastructure is the responsibility of the provider. When we use security groups to control access to instances, we trust that the groups will work as specified and always do what we configure them to do. This also applies to encryption services or other services we may choose to use for security purposes.
In the scope of this project, we determined that these legal items were irrelevant for IaaS used for development and test only, provided there was no customer data involved. They would become very important, however, once the customer considered using AWS for production.
Together with AWS, we decided to proceed in stages, mirroring our overall approach.
It was agreed by both parties to sign an agreement for development and test only, using a more or less boilerplate agreement, and we obtained internal sign-off from the CIO that very clearly spelled out the condition that no project at the bank was allowed to use AWS for production until a new agreement had been signed.
It was also agreed that we would continue negotiating the terms in question to get the agreement to “production level,” but would do so at a slower pace, without time pressures.
This approach allowed us to begin moving the project forward, creating AWS accounts and setting up and testing cloud infrastructure, while gaining valuable experience in the process. The alternative would have been waiting for a perfect agreement to be signed before we could have done anything, which would have set us back a few months.
Last but not least, some people believe that AWS is not known for negotiating and adopts a “take it or leave it” approach. I found this belief to be false. Amazon is commercially driven like most other organisations, and will generally negotiate if it sees a business case for doing so, which is what happened in this instance. It is worth remembering that its standard agreement caters to a very broad group of customers, from small businesses and individual accounts to large corporations.
In parallel to getting the approvals and negotiating the agreement, we conducted a number of informal education sessions with the help of AWS, which focused on topics such as security, network design and operations. We also arranged AWS Technical Essentials training for a large number of people at the bank, because we realised that it was very important that everyone who might be involved, even to a small degree, should speak the same technical language and be able to grasp the principles.
As I mentioned previously, we felt that it was very important to run this project very formally, with clearly articulated objectives and with very clear sponsorship from senior leadership.
We drafted a Terms of Reference (ToR) for the project entitled “AWS Adoption for Development and Test” that spelled out the project’s objectives, deliverables, approach and initial costs. We identified who the stakeholders should be, and sent the ToR to them for sign-off.
In addition, we identified a few other transformational programs taking place within the bank whose teams were also interested in AWS, and partnered with them.
List of stakeholders:
- Global Head of Infrastructure
- Global Head of IT Security
- Director of RBWM Digital Channels IT
- Directors of the two other interested programs
- Head of Platform and Frameworks for RBWM Digital
- Head of Virtualisation Architecture
The Group CTO accepted the role of primary stakeholder, which proved to be very important for the success of the project.
Project objectives were summarised as follows:
- The virtual infrastructure delivered as a result of this project must be production quality. This means that, even though it is only being built for development and test, no shortcuts could be taken that would have reduced the quality or security posture “just because it was dev.” Although taking such shortcuts would allow us to deliver something faster, it wouldn’t prove much, and especially wouldn’t prove that AWS could be used for production in the future.
- Virtual infrastructure should be secure:
- The same guiding principles and FIMs that govern how data is handled and protected on internal infrastructure should apply in the cloud.
- The least-privilege access model that is followed for internal infrastructure needs to be followed in the cloud.
- The virtual network must be protected from intrusions to the highest possible standard.
- Developers should have maximum flexibility to deploy their applications at speed, with the ability to provision hosting infrastructure on-demand, but without compromising security or stability.
- Developers need the ability to deploy applications, but shouldn’t need to worry about the configuration of the underlying infrastructure. Set up of such infrastructure should be fully automated and based on a number of pre-defined “templates” that developers can choose from (AWS Service Catalog is a possible solution to this).
- Development and test infrastructure in the cloud should match internal production infrastructure. Therefore, for RBWM Digital, VMware Tanzu Application Service should be implemented as the PaaS in AWS.
- Implement Adobe AEM and potentially other content hosting solutions.
- Treat infrastructure as code. All of the infrastructure provisioning should be completely scripted and version controlled. It should be possible to roll out the whole infrastructure, or a portion (module) of it, on demand (a combination of CloudFormation, Ansible and Terraform was being looked at in the project).
- There should be an ability to monitor the infrastructure and integrate with tools already used to monitor on-prem cloud.
- Developers and administrators should be able to access AWS seamlessly, using their LAN credentials and two-factor devices (this implies integration with the company Active Directory; the AWS solution for this is AWS AD Connector).
- Applications running in AWS will require connectivity to the customer’s network to communicate with systems hosted internally. Likewise, access will be required to the virtual network from the bank.
- Infrastructure needs to be automatically scalable and recoverable with human intervention eliminated or kept to an absolute minimum.
- Sensitive data must be encrypted both at rest and in transit (Here is a good whitepaper on AWS security that touches on encryption. Generally it is important to remember that the customer is responsible for implementing encryption, and special attention must be given when using databases, especially AWS non-native ones).
- There must be a robust set of reports that would allow the customer to be aware of the costs at all times, and understand how the costs must be internally allocated between various projects / programs / teams that are consuming the infrastructure.
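The cost reporting objective can be read as a tag-based chargeback report. The sketch below assumes that every billing line item carries a cost and, ideally, a `project` tag. The record layout, tag name and figures are hypothetical and do not reflect the format of any AWS billing export.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical billing line items; field names and amounts are made up.
LINE_ITEMS = [
    {"service": "EC2", "cost": "120.50", "tags": {"project": "mobile-app"}},
    {"service": "S3", "cost": "10.25", "tags": {"project": "mobile-app"}},
    {"service": "EC2", "cost": "80.00", "tags": {"project": "web-portal"}},
    {"service": "EC2", "cost": "5.75", "tags": {}},  # untagged resource
]


def allocate_costs(items, tag_key="project", untagged="UNALLOCATED"):
    """Sum costs per tag value; untagged spend is reported separately."""
    totals = defaultdict(Decimal)  # Decimal() starts each total at 0
    for item in items:
        owner = item["tags"].get(tag_key, untagged)
        totals[owner] += Decimal(item["cost"])
    return dict(totals)


if __name__ == "__main__":
    for owner, total in sorted(allocate_costs(LINE_ITEMS).items()):
        print(f"{owner:12} {total:>8}")
```

Reporting untagged spend as its own line, rather than dropping it, keeps the total reconcilable with the invoice and makes tagging gaps visible to the teams consuming the infrastructure.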
A Sample Plan of Activities
The following list of high-level tasks was the result of initial brainstorming sessions between the customer and AWS. These discussions later evolved into a detailed project plan. The list covers all the activities that had to take place, but the tasks are not listed in a precise sequential order.
- Complete connectivity design for VPN connectivity between AWS and the customer
- Set up VPN connectivity between AWS and customer
- Set up account structure for billing and IAM
- Design and implement IAM structure
- Complete Direct Connect link connectivity design for production
- Set up Direct Connect between the Bank datacenter(s) and AWS region(s)
- Design VPC
- Connect Active Directory with AWS using AWS AD Connector
- Build customer-specific hardened AMI image(s) for EC2 instances
- Build development and test VPC
- Set up Adobe in dev/test VPC
- Set up VMware Tanzu Application Service in AWS dev/test VPC
- Set up API Manager, API gateway
- Set up webservers in DMZ VPC
- Extend internal CD Pipeline to deploy to VMware Tanzu Application Service in AWS and to other components in AWS
- Extend the CD pipeline to call AWS APIs to automatically provision environments (outside of PaaS)
- Implement system for managing privileged access to servers
- Conduct penetration testing and security review
- Script the creation of the web servers and other components using CloudFormation
- Create templates for the whole infrastructure
- Practice re-creating the infrastructure
- Implement AWS Service Catalog
- Conduct security testing, security and operations review
- Re-create infrastructure in other regions where the customer operates
- Validate integration with existing monitoring and alerting tools or identify changes
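The scripting and template tasks in this list can be illustrated with a small generator that produces a CloudFormation template from code. This is a minimal sketch and not the project's actual templates: the CIDR range, subnet names and the `build_vpc_template` helper are assumptions made for illustration, and only the standard `AWS::EC2::VPC` and `AWS::EC2::Subnet` resource types are used.

```python
import ipaddress
import json


def build_vpc_template(vpc_cidr, subnet_prefix, subnet_names):
    """Return a CloudFormation template (as a dict) with one VPC and subnets.

    Subnets are carved out of vpc_cidr in order, one per name.
    """
    network = ipaddress.ip_network(vpc_cidr)
    blocks = network.subnets(new_prefix=subnet_prefix)
    resources = {
        "DevTestVpc": {
            "Type": "AWS::EC2::VPC",
            "Properties": {"CidrBlock": str(network), "EnableDnsSupport": True},
        }
    }
    for name, block in zip(subnet_names, blocks):
        resources[f"{name}Subnet"] = {
            "Type": "AWS::EC2::Subnet",
            "Properties": {
                "VpcId": {"Ref": "DevTestVpc"},
                "CidrBlock": str(block),
            },
        }
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Sketch: dev/test VPC generated from code",
        "Resources": resources,
    }


if __name__ == "__main__":
    template = build_vpc_template("10.20.0.0/16", 24, ["Web", "App", "Data"])
    # Sorted keys keep the output stable, so version-control diffs stay small.
    print(json.dumps(template, indent=2, sort_keys=True))
```

Because the same inputs always produce the same template, the output can be committed to version control and used to practise re-creating the infrastructure, including in another region with a different address range.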
Notes on the actual project execution
Due to the lack of internal knowledge, the customer hired a number of consultants directly from AWS Professional Services and created a team that was a mix of internal engineers and AWS consultants. VMware Tanzu Application Service consultants were also involved in assisting the team, but the internal virtualisation team did most of the work.
At the request of one of the stakeholders, the team was asked to demonstrate that VMware Tanzu Application Service can be installed in the AWS environment quickly. The team took on the challenge and created a first VPC with VMware Tanzu Application Service installed in about one week. The idea was to then keep adding to it.
Setting up an initial VPN connection to AWS proved to be a bigger challenge, in no small part because of the number of discussions it generated. One of the issues was that the team responsible for network security had standardised all of their operations around a firewall and management tools from a specific vendor. Simply using AWS access control lists to manage access to and from subnets didn’t satisfy their requirements. Fortunately, the firewall vendor had a virtual FW solution that was specifically designed to be deployed in AWS, and that could be managed from the existing FW management tools as just another on-prem firewall.
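For readers unfamiliar with subnet-level access control lists, the sketch below models the ordered, first-match evaluation that AWS network ACLs perform: rules are checked in ascending rule-number order, the first matching rule decides, and traffic that matches nothing is denied. It is a simplified model that ignores protocols and outbound rules, and the `AclRule` type, rule numbers and addresses are hypothetical.

```python
import ipaddress
from typing import NamedTuple


class AclRule(NamedTuple):
    number: int      # lower numbers are evaluated first
    cidr: str        # source range the rule applies to
    port_from: int
    port_to: int
    allow: bool


def evaluate(rules, source_ip, port):
    """Return True if inbound traffic is allowed, False otherwise."""
    address = ipaddress.ip_address(source_ip)
    for rule in sorted(rules, key=lambda r: r.number):
        in_range = address in ipaddress.ip_network(rule.cidr)
        if in_range and rule.port_from <= port <= rule.port_to:
            return rule.allow  # first match wins
    return False  # nothing matched: implicit deny


# Hypothetical inbound rules: HTTPS from an internal range, deny the rest.
INBOUND = [
    AclRule(100, "10.0.0.0/8", 443, 443, True),
    AclRule(200, "0.0.0.0/0", 0, 65535, False),
]

if __name__ == "__main__":
    print(evaluate(INBOUND, "10.1.2.3", 443))
    print(evaluate(INBOUND, "203.0.113.9", 443))
```

A rule list like this is easy to script, but it is managed separately from an existing firewall estate, which is one reason a team standardised on a single vendor's tooling may prefer a virtual firewall it can manage alongside its on-prem devices.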
Throughout this process, our team discovered an amazing thing – while generally, there was a fairly substantial lack of knowledge about AWS amongst the different teams that were involved, almost always someone would come out and say “actually, I have experience and already thought about how to do this, it’s great that you are doing it, can I participate?” This happened both with the virtualisation team and with networks.