By Jake Vernon, Sr. Sales Engineer – Snowflake
By Bosco Albuquerque, Sr. Partner Solutions Architect – AWS
By Brian Warwick, Sr. Partner Solutions Architect – AWS HCLS
Healthcare customers use Snowflake to store all types of clinical data in a single source of truth. This data can include unstructured data (from documents, imaging, audio, and video) alongside structured (electronic health records, claims) and semi-structured (HL7/FHIR) data.
While there are many estimates as to the amount of unstructured data that exists within an average healthcare organization, most agree that over half of all data assets are unstructured. This unstructured data contains a treasure of clinical value, but it’s “locked” within a body of text.
One method for gaining insights from this data is to use Amazon Comprehend Medical, which is a HIPAA-eligible natural language processing (NLP) service that uses machine learning (ML) to extract health data from medical text.
In this post, we will show you how to take data from Snowflake and push your documents through Amazon Comprehend Medical, allowing you to extract valuable information from them.
From billing codes, diagnosis, diagnostics, social determinants of health, and more, we can then return that data in semi-structured (JSON) format to be “reunited” with its structured counterparts in Snowflake to be joined in harmony within Snowflake.
For demonstrating how to use Amazon Comprehend Medical with Snowflake, which has AWS Competencies in both Machine Learning and Data and Analytics, we will be building the architecture shown in Figure 1.
Please note that Amazon Comprehend Medical is an API that we’ll be calling and there is nothing for us to host. We’ll be using AWS Lambda to make the calls to Amazon Comprehend Medical, and the results of Amazon Comprehend Medical will be JSON that will be loaded back into Snowflake.
Figure 1 – Architecture overview of Snowflake with Amazon Comprehend Medical.
For this walkthrough, you should have the following prerequisites:
- AWS account.
- Snowflake deployment with access to create tables.
- Knowledge of Snowflake and AWS services.
- Apache PDFBox, an open source Java tool for working with PDF documents.
- You should configure the proper security constraints both on Snowflake RBAC and AWS policies for accessing services within your AWS account.
Step 1: Snowflake Basic Setup
First, you will need access to a database in Snowflake. You can use either an existing database of yours, or follow instructions on creating a new database.
Step 2: Add a Snowflake Sequence and Table
Next, create a sequence that we’ll use later. Run the following code in Snowflake, and make sure to use your database name and schema in the code below:
CREATE OR REPLACE SEQUENCE <YOUR DATABASE>.<YOUR SCHEMA>.COUNT_SEQ_DOC_ID;
Add a table so we can store document information. Run the following command in Snowflake to do so. Make sure to use your database name and schema in the code below:
Step 3: Create a Snowflake Stage for Storing Documents
We now need to create a stage for storing documents in Snowflake. We’ll create an internal stage that is managed by Snowflake for our documents. Run the following command:
Step 4: Hosting a Java JAR File in Snowflake
We need to create a stage to host a Java JAR File in Snowflake. For creating a Snowflake managed internal stage, run the command:
create or replace stage jars_stage_internal;
Step 5: Configuring PDFBox for Working with PDF Documents
We used the open source Apache PDFBox for working with PDF documents. Download the 2.0.24.JAR from Apache PDFBox, and then upload it to the JAR stage you created above. You can follow instructions on uploading to the stage using this documentation.
Next, run the following command to create a new function in Snowflake:
Step 7: Create a Lambda Function to Call Amazon Comprehend Medical
Now, we need to create a new Lambda function to use Amazon Comprehend Medical. Follow the Snowflake documentation regarding creating Lambda functions.
Once you have a Lambda function created, add the below code to it:
Step 8: Amazon API Gateway Setup
We now need to create an Amazon API Gateway to call your Lambda function. Follow Snowflake’s documentation to create an API Gateway. You’ll want to create a POST method that forwards the request to the Lambda function created above.
You also need to make sure to set up the proper AWS Identity and Access Management (IAM) permissions to allow the api_key and api_secret to access the API Gateway. Also, make sure to use Lambda Proxy as the API integration so that Amazon API Gateway properly passes the request from Snowflake to Lambda.
Step 9: Create a Snowflake Function
Finally, we need to create an external function in Snowflake. Run the below commands in Snowflake to do so:
Copy the API_AWS_IAM_USER_ARN and API_AWS_EXTERNAL_ID from the output of the describe command. Go back to IAM and update the “Trust Relationship” for the role you created above with this information.
Lastly, run the following command in Snowflake:
Step 10: Testing
Now let’s test our function using a Snowflake worksheet. Run the below command in Snowflake:
select medical_doc_ner('Mary takes advil 3x a day due to lower leg pain. Mary has a history of nerve damage bilaterally.');
This will run the text through our configured setup above. Our output should be a JSON response with extracted entities that looks like the following:
Figure 2 – Example output of extracted entities.
Step 11: Creating a Stored Procedure in Snowflake
We’ll wrap this all together so we can easily extract the contents using stored procedures in Snowflake. This allows us to get the contents back into Snowflake.
In Snowflake, insert extracted text from PDF document into our table with this command:
Extract entities from a document and update the current row with extracted entities using this command:
Step 12: Final Validation
Finally, you can try out the stored procedures and have the full pipeline run. Run the below commands in Snowflake to see it all work together:
Note the document ID in the DOCUMENT_ID column. Also note that the DOCUMENT_ENTITIES column is NULL.
Figure 3 – Caption goes here.
Run the same SELECT statement from above to check that the DOCUMENT_ENTITIES column has been populated, as seen below:
Figure 4 – Caption goes here.
Click on the cell with heading DOCUMENT_ENTITIES and you should see the entities extracted as below, and then you can scroll through the entire list:
Figure 5 – Caption goes here.
You should now have extracted entities using Amazon Comprehend Medical. This full pipeline allows you to take information from Snowflake, run them through Amazon Comprehend Medical, and store the results back inside Snowflake.
A healthcare provider in the Northeast United States plans to use Snowflake and Amazon Comprehend Medical to analyze all of their unstructured data such as doctors’ notes, dictations, nurses’ reports, radiology reports, etc. across their patient base.
The goal will to bring value from this important data given that it is currently a challenge to organize and work with in an unstructured format. They plan to provide a quick summarization of the patient’s medical history without having to scan over all of their unstructured documents.
The Snowflake Data Cloud allows healthcare and life sciences (HCLS) organizations to centralize all of their data in a single and secure location to unlock deeper insights with faster analytics. This enables improved patient outcomes, quality member experiences, and streamlined operational inefficiencies.
This post has shown how Snowflake customers can also leverage Snowflake data with other services, like Amazon Comprehend Medical, to do even more with unstructured data.
To learn more, check out the HCLS Snowflake Cloud for more information on how Snowflake is helping enable healthcare organizations.
Snowflake – AWS Partner Spotlight
Snowflake is an AWS Competency Partner that has reinvented the data warehouse, building a new enterprise-class SQL data warehouse designed from the ground up for the cloud and today’s data.
*Already worked with Snowflake? Rate the Partner
*To review an AWS Partner, you must be a customer that has worked with them directly on a project.