By Bharani Subramaniam, Head of Technology – ThoughtWorks
By Mageswaran Muthukumar, Lead Consultant, Data Engineering – ThoughtWorks
By Nagesh Subrahmanyam, Partner Solutions Architect – AWS

In the digital economy, customers expect immediate, relevant, and frictionless experiences when they want to find, do, or buy something.

To meet these expectations, businesses need to deliver hyper-personalization for every one of their customers. This demands a comprehensive view of customers’ behavioral patterns across direct and indirect touchpoints.

Such a solution requires an elastic infrastructure, as you can’t predict the amount of compute and storage a solution will need up front. Moreover, a regulated industry like banking puts additional constraints on dealing with personally identifiable information (PII).

This post describes how ThoughtWorks, a global technology consultancy company, handled infrastructure and regulatory challenges and delivered a cloud-native, hyper-personalized data platform on top of Amazon EMR for a leading private sector bank in India.

ThoughtWorks is an AWS Competency Partner and leading global technology consultancy that enables enterprises and technology disruptors to thrive as modern digital businesses.

What is Hyper-Personalization?

Digital platforms are disrupting traditional business models across many industries. Though a number of factors make a digital platform successful, we have observed that the strategy to sustain and promote usage of the platform is one of the most critical.

Traditionally, businesses focused most of their marketing efforts on acquiring new customers, but in the digital era they also need to continuously engage existing customers.

To achieve continuous engagement, some organizations have a dedicated “growth team” while others treat this as a part of their general marketing efforts. No matter how a business is set up, you need a good infrastructure to derive insights and make data-driven decisions.

Years of research point to the fact that customers value contextual and personalized content over generic ads. Before we go further, let’s unpack the different levels of personalization and explore why hyper-personalization is so effective.

Assume your business has one million consumers and your marketing department wants to promote 10 offers, each with 10 variants, to keep these customers engaged.

Though there can be many ways to achieve this, we can broadly categorize communication mechanisms into three distinct categories, as shown in the levels of personalization chart below: generic messaging, segmentation, and hyper-personalization. The effectiveness of communication increases as the focus of personalization increases.


Figure 1 – Levels of personalization.

The following sections describe the levels of personalization.

Generic Messaging

In this approach, you would simply send all 100 offers (10 offers with 10 variants) to every one of your million customers. You may overlay the customer name or gender to make it look personalized, but this kind of generic messaging is mostly ineffective and there’s a high probability your customers would treat it as spam.

Segmentation


In this approach, you’d leverage the knowledge gained about your customers through their purchase history and preferences to effectively cluster or segment them into distinct personas. You can also filter the offers according to each segment. Segmentation leads to a better conversion rate as it’s more targeted than generic messaging.

Hyper-Personalization


With hyper-personalization, you would go a step further than just segmenting your user base. You don’t send the offers at random intervals, but rather watch for signals from customer behavior and promote valid offers that address the needs of your customer in a timely manner.

Hyper-personalization is far more effective than other approaches, but it does require sufficient data to infer behavioral patterns and the ability to process them in near real-time.

Limitations with On-Premises Infrastructure

When we analyzed the volume and nature of the transactional data required to create the behavioral model of our customers, we arrived at a few key insights:

  • Daily volume of raw input data across all sources is approximately 1 TB.
  • Volume of the output (customer model and valid offers) is approximately 10 GB, a small fraction of the input volume.
  • When we compute the customer model with thousands of attributes across different stages in our data pipeline, the data volume expands up to 5x the input volume for the cluster.
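
The sizing arithmetic behind these figures can be sketched as follows. This is a back-of-the-envelope calculation only: it assumes decimal units (1 TB = 1,000 GB) and takes the approximate 1 TB input, 10 GB output, and 5x expansion figures at face value. The variable names are illustrative and not part of the original solution.

```python
# Back-of-the-envelope sizing from the approximate figures quoted above.
# Assumption: decimal units (1 TB = 1,000 GB).
GB_PER_TB = 1_000

daily_input_gb = 1 * GB_PER_TB   # ~1 TB of raw input per day
daily_output_gb = 10             # ~10 GB of customer model and valid offers
peak_expansion_factor = 5        # intermediate data grows up to 5x the input

# Largest volume the cluster has to hold while computing the model.
peak_intermediate_gb = daily_input_gb * peak_expansion_factor
# How much the data contracts between the peak and the final output.
peak_to_output_ratio = peak_intermediate_gb / daily_output_gb

print(f"Peak intermediate volume: {peak_intermediate_gb / GB_PER_TB:.0f} TB")
print(f"Peak volume relative to output: {peak_to_output_ratio:.0f}x")
```

Under these assumptions, the pipeline briefly needs capacity for several terabytes of intermediate data and then contracts to a few gigabytes, which is why fixed on-premises capacity is hard to size for this workload.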

With this insight, it was clear we couldn’t depend only on on-premises data center capacity; we needed a hybrid cloud model. For security reasons, the data preparation and sanitization steps are executed in on-premises data centers.

ThoughtWorks decided to leverage AWS for the big data pipeline, and this hybrid cloud approach enabled them to focus on quickly testing business hypotheses rather than operating the big data infrastructure.


Figure 2 – Expand and contract nature of data.

Hybrid Cloud Approach Using Amazon EMR

The following table describes our approach at a high level.

| No. | Stage | Where | Description |
|---|---|---|---|
| 1 | Integrate with system of records (SoRs) | On-premises | This layer is responsible for integrating with hundreds of internal systems via file, Confluent Kafka, or API interfaces. |
| 2 | PII scrubber | On-premises | Once we have the data, we scrub all PII details which aren’t required for the data platform running on the cloud. |
| 3 | Tokenization and encryption | On-premises | Tokenize the record to protect the remaining PII. Encrypt the reverse lookup map and store it in an RDBMS in the on-premises data center. |
| 4 | Big data processing | AWS Cloud | We leverage Apache Spark in Amazon EMR clusters to compute thousands of metrics to effectively model the customer behavior. |
| 5 | De-tokenize and delivery | On-premises | At this stage, we observe patterns in user behavior and trigger the communication process based on the model we have computed from the cloud. De-tokenization happens just-in-time before communication delivery. |

To illustrate the five stages of hyper-personalization as described in the table, let’s consider a fictional user named Saanvi Sarkar and see how the model gets progressively built as we go through each stage.

Integration with System of Records (SoRs)

We carefully mapped our requirements with hundreds of data sources from within and outside the bank. Details on such integrations and mapping of data sources are beyond the scope of this post.

In this stage, we’d have all transactional details of Saanvi Sarkar from their existing banking account. We also have access to credit and other scoring details from regulatory sources.

Personally Identifiable Information (PII) Scrubber

The first stage in the privacy-preserving layer of the solution is the PII scrubber. Depending on the nature of the attributes, we apply a few strategies to protect the end user’s privacy:

  • Any PII attribute that’s not required in later stages of processing is scrubbed, meaning it’s completely dropped from our dataset.
  • Some attributes, like age, are transformed from absolute values to ranges without any provision for reverse lookup. For example, 23 years becomes 20-25 years.
  • A few other attributes, like external references, are partially masked.
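
The three strategies above (drop, generalize, partially mask) can be sketched in a few lines of Python. This is a minimal illustration, not the scrubber used in the project: the field names, the five-year bucket width, and the choice to keep the last four characters of a reference are all assumptions made for the example.

```python
from typing import Any

# Hypothetical field names; the real attribute lists are not given in this post.
DROP_FIELDS = {"home_address"}
AGE_BUCKET_WIDTH = 5  # assumed width, chosen to match the 23 -> 20-25 example


def bucket_age(age: int, width: int = AGE_BUCKET_WIDTH) -> str:
    """Replace an exact age with a range; the exact value is not retained."""
    lower = (age // width) * width
    return f"{lower}-{lower + width}"


def mask_reference(value: str, visible: int = 4) -> str:
    """Keep only the last `visible` characters of an external reference."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


def scrub(record: dict[str, Any]) -> dict[str, Any]:
    """Apply the three strategies: drop, generalize, and partially mask."""
    scrubbed = {k: v for k, v in record.items() if k not in DROP_FIELDS}
    if "age" in scrubbed:
        scrubbed["age_range"] = bucket_age(scrubbed.pop("age"))
    if "external_ref" in scrubbed:
        scrubbed["external_ref"] = mask_reference(scrubbed["external_ref"])
    return scrubbed


# Illustrative record with made-up values.
raw = {
    "customer_id": "C-1001",
    "home_address": "12 Example Street",
    "age": 23,
    "external_ref": "REF12345678",
    "avg_balance": 300000,
}
print(scrub(raw))
```

Because the exact age is popped from the record and only the range is stored, there is nothing left to reverse, which mirrors the "no provision for reverse lookup" requirement.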

At this stage, Saanvi Sarkar’s details look like this:

<Key Identifier> <Necessary PII Data> <Non-PII Data>

Tokenization and Encryption

In the next stage, we tokenize the key identifier so that even if there is a breach it will be difficult for bad actors to make sense of the records.

  • <Key Identifier> is tokenized and substituted with random values. The reverse lookup is stored in a secured system in the on-premises data center.
  • <PII Data> like email and phone numbers are removed from the record, and encrypted and stored separately for later use during offer delivery.

After tokenization and encryption, the output record of Saanvi Sarkar will look like <Token> <Non-PII Data>. At this stage, the data is anonymized and safe for processing in the cloud.
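
As a rough sketch of the tokenization step, the following Python replaces the key identifier with a random token and keeps the reverse lookup in a separate store. It is illustrative only: `TokenVault`, `tokenize_record`, and the field names are hypothetical, the vault is an in-memory stand-in for the secured on-premises system, and encryption of the lookup map and of the separated PII is deliberately left out.

```python
import secrets

PII_FIELDS = {"email", "phone"}  # hypothetical field names


class TokenVault:
    """In-memory stand-in for the secured on-premises reverse-lookup store."""

    def __init__(self) -> None:
        self._token_by_id: dict[str, str] = {}
        self._id_by_token: dict[str, str] = {}

    def tokenize(self, key_identifier: str) -> str:
        """Return a stable random token for a key identifier."""
        if key_identifier not in self._token_by_id:
            token = secrets.token_hex(16)  # random, not derived from the identifier
            self._token_by_id[key_identifier] = token
            self._id_by_token[token] = key_identifier
        return self._token_by_id[key_identifier]

    def detokenize(self, token: str) -> str:
        """Reverse lookup; intended to run only on-premises."""
        return self._id_by_token[token]


def tokenize_record(record: dict, vault: TokenVault) -> tuple[dict, dict]:
    """Split a record into a cloud-safe part and a PII part kept on-premises."""
    token = vault.tokenize(record["customer_id"])
    pii = {k: v for k, v in record.items() if k in PII_FIELDS}
    cloud_safe = {
        k: v for k, v in record.items()
        if k not in PII_FIELDS and k != "customer_id"
    }
    cloud_safe["token"] = token
    return cloud_safe, pii


# Illustrative record with made-up values.
vault = TokenVault()
record = {"customer_id": "C-1001", "email": "saanvi@example.com", "age_range": "20-25"}
cloud_safe, pii = tokenize_record(record, vault)
print(cloud_safe)  # contains the token and non-PII data only
```

Using a random token rather than a hash of the identifier means the token reveals nothing without access to the vault, at the cost of having to store and protect the lookup map.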

Big Data Processing in Amazon EMR

The bulk of the data processing happens in this stage. Cleansed data that’s safe for computation in the cloud is fed into data pipelines in Amazon EMR Spark clusters.

We have a mix of batch and streaming workloads as described in the architecture diagram in Figure 3 below. The entire pipeline can be broken down into two stages: modelling customer behavior, and the matching engine.

Modelling Customer Behavior

Data arrives either via Kafka topics or via files from Amazon Simple Storage Service (Amazon S3). We utilize Apache Airflow to schedule jobs in Amazon EMR clusters.

Since we have hundreds of data sources, we have to support both continuous streams of updates and batches arriving in hourly, daily, or even weekly intervals.

  • Apache Airflow orchestrates all the periodic batch jobs. Once triggered, these jobs create ephemeral Amazon EMR Spark clusters to incrementally compute the customer model.
  • For streaming use cases, we utilize Spark Streaming in Amazon EMR to process real-time continuous updates from Kafka topics.
  • To support ad hoc exploration, we use AWS Glue to effectively catalog files in S3 and create necessary schema for Amazon Athena.

We can now infer that Saanvi Sarkar regularly deposits ₹50,000 within the first two weeks of every month and that their average balance never went below ₹300,000. We also know they travel extensively by two-wheeler, based on fuel expenses.
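
To make the idea of incrementally computing a customer model concrete, here is a small plain-Python sketch that folds transactions into running attributes one at a time. It is not the Spark code that runs on Amazon EMR: the attribute names, transaction categories, dates, and amounts are invented for illustration, and it tracks the lowest observed balance as a simplification of the average-balance attribute mentioned above.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class CustomerModel:
    """A tiny, hypothetical subset of the thousands of attributes."""
    early_month_deposits: dict[str, float] = field(default_factory=dict)
    lowest_balance: float = float("inf")
    fuel_spend: float = 0.0


def apply_transaction(model: CustomerModel, txn_date: date, amount: float,
                      category: str, balance_after: float) -> CustomerModel:
    """Fold one transaction into the running model (incremental update)."""
    if category == "deposit" and txn_date.day <= 14:
        # Deposits made in the first two weeks, keyed by month.
        month = txn_date.strftime("%Y-%m")
        model.early_month_deposits[month] = (
            model.early_month_deposits.get(month, 0.0) + amount
        )
    elif category == "fuel":
        model.fuel_spend += amount
    model.lowest_balance = min(model.lowest_balance, balance_after)
    return model


# Made-up transactions: (date, amount, category, balance after the transaction).
transactions = [
    (date(2024, 1, 5), 50_000.0, "deposit", 350_000.0),
    (date(2024, 1, 20), 1_200.0, "fuel", 348_800.0),
    (date(2024, 2, 3), 50_000.0, "deposit", 398_800.0),
]

model = CustomerModel()
for txn in transactions:
    apply_transaction(model, *txn)
print(model)
```

Because each update only needs the previous model and the new transaction, the same logic can be applied to a nightly batch or to a stream of events.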


Figure 3 – Hyper-personalization architecture.

Matching Engine

We wanted a flexible way to express the changing needs of the marketing team, so we created a custom domain-specific language (DSL) that’s easy to understand yet expressive enough to capture those needs.

The DSL takes the customer model, the inventory of offers along with their variants, and the real-time behavioral data to match suitable offers.

  • Each offer and its variants are tagged with potential eligibility criteria, like what channels of communication are applicable and what conditions must be met in the customer model.
  • The DSL also takes customers’ preferences into consideration, for example, whether they have opted out of a channel or certain types of promotions, or prefer certain days of the week.
  • We designed the DSL to be flexible and extensible enough to express different constraints and chain actions together. An example of this is sending a special offer if the customer has positively responded to the previous one.
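
The DSL itself is not shown in this post, so the following Python is only a hypothetical stand-in for the kind of matching it expresses: each offer carries its applicable channels and an eligibility condition on the customer model, and customer opt-outs are respected. The `Offer` class, the attribute names, and the thresholds are assumptions made for the example; chaining of actions is not covered.

```python
from dataclasses import dataclass
from typing import Callable

Customer = dict  # customer model keyed by attribute name (hypothetical shape)


@dataclass(frozen=True)
class Offer:
    name: str
    channels: frozenset[str]                 # channels this offer may use
    is_eligible: Callable[[Customer], bool]  # condition on the customer model


def match_offers(customer: Customer, offers: list[Offer]) -> list[tuple[str, str]]:
    """Return (offer name, channel) pairs that satisfy eligibility and preferences."""
    opted_out = customer.get("opted_out_channels", set())
    matches = []
    for offer in offers:
        if not offer.is_eligible(customer):
            continue
        # Drop channels the customer has opted out of.
        for channel in sorted(offer.channels - opted_out):
            matches.append((offer.name, channel))
    return matches


# Illustrative offer and customer; thresholds are invented.
car_loan = Offer(
    name="car_loan",
    channels=frozenset({"email", "sms"}),
    is_eligible=lambda c: c["avg_balance"] >= 300_000 and c["fuel_spend"] > 0,
)
customer = {
    "avg_balance": 300_000,
    "fuel_spend": 1_200.0,
    "opted_out_channels": {"sms"},
}
print(match_offers(customer, [car_loan]))
```

Keeping eligibility as data attached to each offer, rather than hard-coding it in the pipeline, is what lets a campaign team add offers and variants without changing the underlying jobs.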

Users from the campaign management team continuously come up with offers and variants and capture the matching logic in the DSL. Once executed, the DSL takes care of creating necessary jobs in Airflow, which in turn orchestrates the matching pipelines in Amazon EMR Spark clusters.

In this example, there is a high probability that Saanvi Sarkar is looking to buy a car. Given this context, the matching engine will pick the car loan offer with an attractive interest rate based on Saanvi Sarkar’s credit history.

De-Tokenize and Deliver

The matching engine communicates the match as <Token> <Offer> to the on-premises data center. Since the data in the cloud is tokenized for protecting customer privacy, we need to de-tokenize before delivering the offer.

  • For the given <Token>, we retrieve the actual customer identifier. This is handled securely by the tokenization system. Output of this stage looks like this: <Key Identifier> <Offer>.
  • Once we know the <Key Identifier> of the customer, we need to retrieve additional information based on the channel, such as the email address or phone number to send the offer to. Since these are PII, they were encrypted during the tokenization stage, so we have to decrypt them. Output of this stage looks like this: <Key Identifier> <Necessary PII Data> <Offer>.
  • With <Key Identifier> <Necessary PII Data> <Offer> we send the offer to appropriate delivery systems via push notifications, email gateway, or SMS providers.
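
The three steps above can be sketched as a single function that turns a <Token> <Offer> match into a delivery request. This is a simplified illustration: the two dictionaries stand in for the on-premises tokenization system and the PII store, decryption is omitted (the sketch assumes the contact details have already been decrypted), and all names and values are hypothetical.

```python
from typing import NamedTuple


class Delivery(NamedTuple):
    key_identifier: str
    channel: str
    address: str
    offer: str


def prepare_delivery(token: str, offer: str, channel: str,
                     token_to_id: dict[str, str],
                     contacts_by_id: dict[str, dict[str, str]]) -> Delivery:
    """Resolve <Token> <Offer> into <Key Identifier> <Necessary PII Data> <Offer>."""
    # Step 1: reverse lookup of the token (tokenization system, on-premises).
    key_identifier = token_to_id[token]
    # Step 2: fetch only the contact detail that the chosen channel needs.
    address = contacts_by_id[key_identifier][channel]
    # Step 3: the result is what gets handed to the delivery system.
    return Delivery(key_identifier, channel, address, offer)


# Made-up lookup data for illustration.
token_to_id = {"tok-9f2c": "C-1001"}
contacts_by_id = {"C-1001": {"email": "saanvi@example.com", "sms": "+91-00000-00000"}}

delivery = prepare_delivery("tok-9f2c", "car_loan", "email", token_to_id, contacts_by_id)
print(delivery)
```

Resolving the token and the contact detail only at this point keeps identifiable data out of the cloud pipeline and limits the PII retrieved to what the selected channel requires.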

The matching engine predicted Saanvi Sarkar’s need to own a car and came up with a relevant car loan offer, with an interest rate customized based on credit history. This approach significantly transformed the bank’s customer relationships.

Instead of designing a group of offers and sending them to millions at a time, the bank offers immediate, relevant, and frictionless experiences to its diverse customers.

Conclusion


Digitally savvy customers want to feel delighted, and they expect organizations to add value to their lives with relevant and timely information. ThoughtWorks and AWS have seen hyper-personalization help businesses shift their focus from static promotions to addressing the dynamic needs of their customers.

From our experience, some key factors to consider while implementing hyper-personalization initiatives are:

  • Come up with a strategy to keep customers engaged. If users are not active, the benefits of a digital platform erode over time.
  • Constantly revisit the number of data sources required to effectively build the customer profile. Too much accuracy in personalization may have a negative effect, as some customers may find your solution invasive of their privacy.
  • Devise a flexible mechanism to capture the changing needs of the marketing team. ThoughtWorks built a custom domain-specific language (DSL) to declaratively capture the details of the offer without exposing the complexity of the underlying implementation.
  • Understand the regulatory implications before designing the architecture. Some parts may still need to run in an on-premises data center.
  • Leverage the cloud for scaling big data infrastructure. ThoughtWorks observed a 5x increase in data volume during transformations and used Amazon EMR for scaling ad hoc experiments and running the production workload.

If you are interested in knowing more about hyper-personalization, here’s a good collection of articles addressing the dynamic micro-moments of customers.

If you’re interested in the implementation details of streaming processing using Apache Spark and Apache Kafka, learn more in this AWS Big Data blog post.


