In this post I’m describing the work we did to build a Big Data platform for one of our Financial Services clients – a major household name in the UK. Being able to handle and store large quantities of data is essential if you want to make Big Data work for you.
The problem - the needs of regulation
Let’s say you’re a bank and you process payments. You need to be able to hold 7 years’ worth of data on the payments you’ve sent and received. This might not sound too bad, but it can add up to some huge quantities of data!
I can remember when my mum used to go to the bank at the start of the week and get cash (by writing a cheque to “cash” – is this still even a thing?). The rest of the week she’d spend the cash, so the bank had one transaction even though that money would be spent in lots of places.
With contactless and mobile payments we’re moving to ever higher numbers of electronic payments for smaller and smaller average values. This is massively ramping up the data that banks have to store.
Let’s have a look at some example numbers (not exact, but they at least give you the order of magnitude):
- Average number of payments per day: 10 million (some days are busier than others)
- Number of data points per payment: 10
- Years to retain data: 7
- Total data points: 255 billion
That’s a lot of data points!
The other things we needed to be sure we could handle were:
- Average data per payment: 10 KB
- Peak messaging rate: 10,000 messages per second
- Total data store size: > 200 terabytes
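A quick sanity check shows how these headline figures hang together (a rough calculation using the approximate numbers above, ignoring leap years):

```python
# Rough sizing check using the approximate figures from the post.
payments_per_day = 10_000_000
data_points_per_payment = 10
days_retained = 7 * 365  # 7 years

total_data_points = payments_per_day * data_points_per_payment * days_retained
print(f"Total data points: ~{total_data_points // 10**9} billion")

bytes_per_payment = 10 * 1024  # 10 KB average per payment
total_store_bytes = payments_per_day * days_retained * bytes_per_payment
print(f"Total store size: ~{total_store_bytes / 10**12:.0f} TB")  # comfortably > 200 TB
```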
Finding the right technology
We worked with the client’s technical teams to evaluate and choose the right technology. What we were looking for was a combination of messaging and storage.
The messaging solution needed to be able to handle high peak loads and to free up the source systems as quickly as possible. For this we proposed Apache Kafka right from the start.
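To give a flavour of what this looks like in practice, here is a minimal sketch of serialising a payment event and handing it to Kafka. The event fields and topic name are illustrative assumptions, not the client’s actual schema:

```python
import json
from datetime import datetime, timezone

def build_payment_event(payment_id: str, amount_pence: int,
                        sender: str, receiver: str) -> bytes:
    """Serialise a payment into a compact JSON message for the payments topic.
    Field names here are hypothetical, for illustration only."""
    event = {
        "paymentId": payment_id,
        "amountPence": amount_pence,
        "sender": sender,
        "receiver": receiver,
        "capturedAt": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(event).encode("utf-8")

# With a broker available, publishing might look like this (kafka-python):
# from kafka import KafkaProducer
# producer = KafkaProducer(bootstrap_servers="broker:9092", acks="all")
# producer.send("payments", value=build_payment_event("p-001", 1250, "A", "B"))
# producer.flush()
```

The key design point is that the producer hands the message off and returns immediately, which is what frees up the source systems under peak load.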
The storage needed to be scalable and resilient, able to continue operating across data centres and tolerate the loss of any individual computer. The size and nature of the data indicated that a NoSQL solution would be most appropriate. We considered Cassandra for its write speed, but eventually chose MongoDB instead: some of our other requirements, such as partial updates and repeatable reads, made it the more appropriate choice.
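Partial updates were one of the deciding factors, so it is worth showing what they look like. MongoDB’s `$set` operator updates a single field without rewriting the whole document; the collection and field names below are illustrative assumptions:

```python
def partial_update_spec(payment_id: str, new_status: str) -> tuple[dict, dict]:
    """Build the filter and update documents for a MongoDB partial update.
    Only the named field is touched; the rest of the document is untouched."""
    filter_doc = {"paymentId": payment_id}
    update_doc = {"$set": {"status": new_status}}
    return filter_doc, update_doc

# Against a live cluster, with pymongo:
# from pymongo import MongoClient
# client = MongoClient("mongodb://cluster-host:27017", replicaSet="rs0")
# f, u = partial_update_spec("p-001", "SETTLED")
# client.payments_db.payments.update_one(f, u)
```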
Building the application
The diagram above shows the basic concept of the application – it’s not that hard!
What this hides is the number of issues that needed to be overcome in order to get the solution designed, built and working.
Some of the things we had to overcome were:
- Interoperability of Windows and Linux, especially passing security credentials.
- Running a hybrid solution – source systems on-premise and the data platform in the Cloud. Managing the infrastructure, security and latency issues around this.
- Regulatory impact of putting sensitive data in the Cloud, and having the appropriate controls in place.
- Helping the client put in place the appropriate DevOps processes to deliver the infrastructure and application from code.
In the end, the actual data platform was the easiest part! Cloud technology makes the storage of large quantities of data highly practicable. The hard part was integrating this with the legacy environment of a large organisation.
We needed to test the solution with representative data. We were glad to be able to use historical data that the client had already amassed in other systems.
We worked with the client to extract the data from the existing repositories, reshape it into the right format for the application and feed the data into the messaging system.
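The reshape step is conceptually simple: take a record in the legacy format and emit it in the message format the platform expects. A minimal sketch, assuming a pipe-delimited legacy extract (the field layout and names here are invented for illustration):

```python
import json

def reshape_legacy_record(line: str) -> bytes:
    """Turn one pipe-delimited legacy record into a JSON message
    ready to feed into the messaging system. Layout is hypothetical."""
    payment_id, date, amount_pence, sender, receiver = line.strip().split("|")
    message = {
        "paymentId": payment_id,
        "date": date,
        "amountPence": int(amount_pence),
        "sender": sender,
        "receiver": receiver,
    }
    return json.dumps(message).encode("utf-8")
```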
By testing in this way we were able to load 7 years’ worth of data into the data platform in 2 weeks! This gave us the confidence to sign off the design as fit for purpose and leave the client’s technical teams to implement the final rollout.
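It’s worth noting what that load rate implies. Using the approximate volumes above, replaying 7 years of payments in 2 weeks means a sustained ingest rate of roughly 21,000 messages per second – well above the 10,000 per second peak the live system needed to handle:

```python
# Approximate sustained ingest rate for the historical load.
payments_per_day = 10_000_000
days_of_history = 7 * 365
load_window_seconds = 14 * 24 * 3600  # 2 weeks

total_messages = payments_per_day * days_of_history
sustained_rate = total_messages / load_window_seconds
print(f"Sustained ingest rate: {sustained_rate:,.0f} messages/second")
```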