By Aditya Ghosh, Head of Data Engineering and Governance, VFS – Verisk
By Mandar Paranjpe, Delivery Practice Manager – AWS
By Mahendar Gajula, Sr. Consultant, Big Data and Analytics – AWS
By Dan Thomas, Principal Customer Practice Manager – AWS

Verisk-AWS-Partners
Verisk
Connect with Verisk-1

Verisk is an AWS ISV Partner with a variety of analytic solutions that run on Amazon Web Services (AWS). Verisk’s solutions address insurance underwriting and claims, fraud, regulatory compliance, natural resources, catastrophes, economic forecasting, geopolitical risks, as well as environmental, social, and governance matters.

More than 70% of the Fortune 100 relies on the company’s advanced technologies to manage risks, make better decisions, and improve operating efficiency.

Verisk’s Argus solution provides predictive analytics and decision-support solutions to customers in the insurance, broader financial services, energy, and related industries.

Data is the lifeblood of insurance, and the insights that Argus provides is an important part of many insurers’ toolkits. Ironically, there is so much data available today that insurers and their technology partners can struggle to digest and make the most use of it. This post explores how Verisk leverages AWS to improve delivery of its Argus solution to the market.

Verisk Argus experienced challenges accelerating the migration of petabytes of data stored in SQL databases to AWS. An optimal migration mechanism needed to be put in place to accomplish this effectively.

Verisk Argus was interested in accelerating the migration to reduce both SQL Server and data center costs. The outcome of a successful migration of data would enable Verisk Argus to develop new products by taking advantage of AWS services.

Upon completion of the migration of data to AWS, Verisk Argus could expand their compute and storage capabilities in the cloud, thereby allowing increased business agility and reduced operational costs. To meet these goals, a joint team was formed by Verisk Argus and AWS Professional Services.

In this post, we’ll describe how the joint team developed a solution to migrate petabytes of data to Amazon Simple Storage Service (Amazon S3). Multiple AWS services can support data migrations, so the team conducted a series of proof of concepts (POCs) to determine the proper migration mechanism that balanced both cost and schedule.

Migration POCs Considered

The following table briefly describes five PoCs that were considered, along with pros and cons of each approach.

POC POC Overview Pros Cons
CSV to AWS Snowball
  • SQL bulk copy program (BCP) creates a CSV file that’s copied to Snowball
  • Snowball is imported to S3
  • CSV file in S3 is converted and validated using EMR/Spark
  • Optimal cost compared to the other options
  • BCP is slower than Spark conversion
  • No robust validation before loading data to Snowball
On-premises Spark/Hadoop environment
  • Customer has an on-premises Hadoop/Spark environment
  • Hadoop/Spark extracts, converts, and validates data
  • Validated data is copied to Snowball
  • Snowball is imported to S3
  • Fast and efficient process
  • Reduces the number of required Snowball devices
  • Robust validation before data leaves the data center
  • Assumes customers have an on-premises Hadoop environment
  • Building an on-premises Hadoop environment is time-consuming and expensive
Leverage AWS Snowball to copy SQL backup and restore in AWS
  • Copy backups to Snowball
  • Snowball is imported to S3
  • Instantiate EC2 instances using customer-supplied SQL licenses to restore backups
  • Instantiate EMR/Spark to extract, convert, and validate data
  • Return extracted data to S3
  • Existing backups can typically be leveraged
  • Majority of the work occurs in AWS, removing impediments of using on-premises infrastructure
  • Flexible, as the resources can scale up and down based on demand
  • One-time cost of spinning up EC2 and EMR instances during the migration process
Leverage AWS Direct Connect
  • Send backups to S3 through AWS Direct Connect
  • Instantiate EMR/Spark in AWS, to extract, convert, and validate data
  • Return extracted data to S3
  • Majority of the work occurs in AWS, removing impediments of using on-premises infrastructure
  • Flexible, as the resources can scale up and down based on demand
  • Migration duration would be extended as other business teams leveraged AWS Direct Connect
Use AWS Snowball to copy from local DMS and restore in AWS
  • Use local DMS to write data into Snowball
  • Snowball is imported to S3
  • CSV in S3 is converted and validated using EMR/Spark
  • Return extracted data to S3
  • Uses Snowball for bulk copy
  • DMS produced a copy output that could not be leveraged in S3
  • Copying of large tables using DMS required larger servers for scalability, which resulted in a higher cost

Migration Pattern

Based on the PoC results, the team determined the primary migration mechanism would be to leverage AWS Snowball to copy SQL backup and restore in AWS.

With this approach, the SQL backups were copied from on-premises servers to AWS Snowball Edge. The backups were then restored onto Amazon Elastic Compute Cloud (Amazon EC2) SQL Servers from S3.

An extraction was performed followed by validation through an Amazon EMR/Spark ingestion framework. Apache Spark is a unified analytics engine for large-scale data processing and running Apache Spark on EMR helps scale the processing easily. The custom code to perform data extraction from the SQL Server and validation was written in Python with the PySpark package.

This approach met the desired business outcomes; however, a secondary migration mechanism was also utilized. This involved the use of on-premises SQL Servers and was supported by the extraction and validation EMR framework for larger tables.

Solution Architecture

To move the on-premises SQL Server databases to the data lake, every table had to be dealt with individually. To migrate a SQL table to the data lake, the data had to be extracted from the SQL server, transformed to Apache Parquet, and transferred to the designated location in the data lake.

When the data transfer is complete, it must be validated—for example, the transferred Parquet file must be compared to the source SQL Server table to ensure data consistency. An ideal solution would perform both operations in sequence for each table scoped for migration.

To supplement this approach, the team built a custom extraction and validation framework on EMR with Spark. This framework was called the Data Validation Engine (DAVE), as shown in Figure 1. The automated solution would help migrate tables in bulk and also perform extraction and validation.
.
Figure 1 – Data migration and validation framework (DAVE).

Figure 1 – Data migration and validation framework (DAVE).

Migration Approach

The migration approach involved the following steps:

  1. The team copied large SQL Server backups to a shared backup server at the Verisk Argus data center capable of populating multiple Snowballs.
  2. This backup server mounted to four Linux copy servers with a configuration of eight CPUs and 32 GB RAM. Each performed high-throughput copies to the connected Snowball Edge devices according to Verisk Argus data classification rules.

    Note: Achieving the best throughput while copying to Snowball Edge is critical. Refer to this blog post for data migration best practices with AWS Snowball Edge.

  3. After the data is copied (using Amazon S3 sync or cp commands), the Snowball devices are shipped to the AWS data center and imported to the designated S3 bucket.
  4. The Snowball data is imported to the S3 bucket. The data is restored through the SQL server on EC2 utilizing AWS Secrets Manager. Data is also archived to S3 Glacier Deep Archive per data classification tags.
  5. Upon restoring the data through the SQL server on EC2, multiple table migrations are scheduled on EMR to migrate the data to the data lake landing bucket. Conversion rules are utilized through custom code on the EMR/Spark framework (see Figure 1).
  6. All successfully migrated datasets are transferred to the data lake. Amazon Aurora PostgreSQL relational database service is used to manage the inventory of on-premises database/tables to be migrated in bulk. It also stores the migration metrics, including success failure, velocity, table sizes, and row counts, and can monitor the migration progress.
  7. The DAVE framework is leveraged for the secondary migration mechanism utilizing Verisk Argus’s Direct Connect (10 Gbps) tunnel. This uses the DMS and EMR Spark framework and connects to on-premises SQL server clone servers.
  8. A regular triage process has been established to continually evaluate tables that fail extraction and/or validation. Improvements are captured into the code based on lessons learned and feedback. This code will continue to evolve as data experiences issues (for example, junk characters or data conversion issues in text fields).

Key Lessons

Lessons learned during this migration process included the following:

  • Use a size-based, data usage-based migration approach. For example, use Amazon EMR for low-to-medium volume data objects where transformations are required (data type translations and Parquet conversion for data lake, for example) vs. Snowball for archival-only data migration requirements.
  • There’s no universal migration approach, because every customer is different. Identify migration patterns and leverage PoCs to confirm your primary migration pattern.
  • For the AWS Snowball copy process, use multiple backup copy servers with dedicated storage to increase the performance and speed of the copy process.
  • Ensure the data type translations from the SQL Server to S3 Parquet is planned and verified consistently throughout the migration.
  • Keep file formats consistent when data is ingested from multiple flows to ensure all AWS Cloud resources work consistently. This refers to DMS creating Parquet 2.0 and the custom framework created Parquet 3.0.
  • Keep field names consistent (lower or upper); AWS crawlers are case-sensitive when extracting similar datasets using multiple migration methods.
  • Define an approach for EMR leveraging reserved and Spot instances to control costs.
  • Remove special characters (such as %, (, ), +) from the field and table names to ensure the Spark process can read the data successfully.
  • Ensure the memory allocation of master node for Spark is properly configured to ensure optimal operation of the executor nodes. Upgrading the master node to 256 GB from 32 GB produced significant performance gains (nearly 50%) in a 40-node cluster with minimal interruptions in data migration.
  • Develop a thorough triage process to perform root cause analysis of every error and continuously upgrade the code to remediate issues.
  • Embrace an Agile mindset and continue to adapt to support data needs.

Cost Optimization

Given the large on-premises data volumes, the team worked closely to classify the data for archival needs vs. data lake consumption needs. The joint team methodically optimized costs across all components of the infrastructure, including the following measures:

  • Amazon S3: Lifecycle policies moved data from S3 to Amazon S3 Glacier after a specified duration. Glacier provides a six-fold cost reduction over S3. Eventually, data will be transferred from Glacier to S3 Glacier Deep Archive to provide a four-fold reduction over Glacier costs.
  • Amazon EMR: Leveraged a higher ratio of Spot instances compared to reserve instances as the migration jobs can easily restart for historical datasets.

Refer to this blog post for more information about leveraging Spot instance for EMR.

Key Metrics for Migration

In support of the AWS services, DAVE accelerated data migrations by 50% and successfully scheduled and automated data migrations in bulk, thereby migrating hundreds of large tables more than 400 GB in size.

Verisk Argus was able to successfully migrate multiple petabytes of data to AWS over a 20-week time period leveraging two Snowballs every two weeks. In addition, Verisk Argus successfully restored more than 300 TB inclusive of over 350K data objects in the data lake for consumption and analytics needs.

Below are some key metrics for this migration:

  • Snowball migration: 17 Snowball devices were used to migrate 1.4 PB of compressed SQL Server data, translating to about 4.3 of uncompressed on-premises SQL server data.
  • EMR data migration velocity: Leveraging the following configuration, 15 TB per day was processed over AWS Direct Connect. Migration throughput is relative and depends on the Direct Connect configuration and source databases capacity; use the PoC phase to derive optimal configuration for better throughput.
    • 10 Gbps Direct Connect (our project was allowed to use a maximum of 5 Gbps).
    • 2 on-premises clone servers with 256 GB RAM SQL Servers.
    • 4+ EMR clusters with 40 nodes (r4xlarge) utilizing the custom Spark ingestion framework (DAVE) were installed.

Conclusion

In this post, we described how Verisk Argus migrated petabytes of SQL Server data from on-premises to Amazon S3 and Amazon S3 Glacier using AWS Snowball Edge and the Amazon EMR custom Spark ingestion framework.

Through this migration framework, Verisk Argus is positioned to save millions of dollars by moving to AWS from their data center, as well as providing incremental capabilities to their users by having data in the cloud.

Acknowledgements

The authors would like to thank the following members of the Verisk and AWS teams for their close collaboration and guidance through the various stages of this post’s journey:

  • Andrew Hoffman, Dishant Shah, Jack Nicholas, Chinmay Juneja, David Disla – Verisk
  • Nitin Srivastava, Shand Glen, Abhiram Chitipirlla, Dan Gibbar – AWS ProServe

.
Verisk-APN-Blog-CTA-1
.


Verisk – AWS Partner Spotlight

Verisk is an AWS ISV Partner that provides data analytic insights to customers in insurance, energy and specialized markets, and financial services.

Contact Verisk | Partner Overview | AWS Marketplace

*Already worked with Verisk? Rate the Partner

*To review an AWS Partner, you must be a customer that has worked with them directly on a project.