By Jake Vernon, Sr. Sales Engineer – Snowflake
By Bosco Albuquerque, Sr. Partner Solutions Architect – AWS
By Brian Warwick, Sr. Partner Solutions Architect – AWS HCLS

Connect with Snowflake-2

Healthcare customers use Snowflake to store all types of clinical data in a single source of truth. This data can include unstructured data (from documents, imaging, audio, and video) alongside structured (electronic health records, claims) and semi-structured (HL7/FHIR) data.

While there are many estimates as to the amount of unstructured data that exists within an average healthcare organization, most agree that over half of all data assets are unstructured. This unstructured data contains a treasure of clinical value, but it’s “locked” within a body of text.

One method for gaining insights from this data is to use Amazon Comprehend Medical, which is a HIPAA-eligible natural language processing (NLP) service that uses machine learning (ML) to extract health data from medical text.

In this post, we will show you how to take data from Snowflake and push your documents through Amazon Comprehend Medical, allowing you to extract valuable information from them.

From billing codes, diagnosis, diagnostics, social determinants of health, and more, we can then return that data in semi-structured (JSON) format to be “reunited” with its structured counterparts in Snowflake to be joined in harmony within Snowflake.

Solution Overview

For demonstrating how to use Amazon Comprehend Medical with Snowflake, which has AWS Competencies in both Machine Learning and Data and Analytics, we will be building the architecture shown in Figure 1.

Please note that Amazon Comprehend Medical is an API that we’ll be calling and there is nothing for us to host. We’ll be using AWS Lambda to make the calls to Amazon Comprehend Medical, and the results of Amazon Comprehend Medical will be JSON that will be loaded back into Snowflake.


Figure 1 – Architecture overview of Snowflake with Amazon Comprehend Medical.


For this walkthrough, you should have the following prerequisites:

  • AWS account.
  • Snowflake deployment with access to create tables.
  • Knowledge of Snowflake and AWS services.
  • Apache PDFBox, an open source Java tool for working with PDF documents.
  • You should configure the proper security constraints both on Snowflake RBAC and AWS policies for accessing services within your AWS account.

Step 1: Snowflake Basic Setup

First, you will need access to a database in Snowflake. You can use either an existing database of yours, or follow instructions on creating a new database.

Step 2: Add a Snowflake Sequence and Table

Next, create a sequence that we’ll use later. Run the following code in Snowflake, and make sure to use your database name and schema in the code below:


Add a table so we can store document information. Run the following command in Snowflake to do so. Make sure to use your database name and schema in the code below:


Step 3: Create a Snowflake Stage for Storing Documents

We now need to create a stage for storing documents in Snowflake. We’ll create an internal stage that is managed by Snowflake for our documents. Run the following command:

Create or replace stage medical_pdf_docs_internal_stage
Directory = (enable = true)
Encryption = (type = 'SNOWFLAKE_SSE');

Step 4: Hosting a Java JAR File in Snowflake

We need to create a stage to host a Java JAR File in Snowflake. For creating a Snowflake managed internal stage, run the command:

create or replace stage jars_stage_internal;

Step 5: Configuring PDFBox for Working with PDF Documents

We used the open source Apache PDFBox for working with PDF documents. Download the 2.0.24.JAR from Apache PDFBox, and then upload it to the JAR stage you created above. You can follow instructions on uploading to the stage using this documentation.

Next, run the following command to create a new function in Snowflake:

create or replace function read_pdf(file string)
returns string
language java
imports = ('@jars_stage_internal/pdfbox-app-2.0.24.jar')
HANDLER = 'PdfParser.ReadFile'
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.PDFTextStripperByArea; import;
import; public class PdfParser { public static String ReadFile(InputStream stream) throws IOException { try (PDDocument document = PDDocument.load(stream)) { document.getClass(); if (!document.isEncrypted()) { PDFTextStripperByArea stripper = new PDFTextStripperByArea(); stripper.setSortByPosition(true); PDFTextStripper tStripper = new PDFTextStripper(); String pdfFileInText = tStripper.getText(document); return pdfFileInText; } } return null; }

Step 7: Create a Lambda Function to Call Amazon Comprehend Medical

Now, we need to create a new Lambda function to use Amazon Comprehend Medical. Follow the Snowflake documentation regarding creating Lambda functions.

Once you have a Lambda function created, add the below code to it:

import json
import boto3 medComprehendClient = boto3.client(service_name='comprehendmedical', region_name='us-east-1') def lambda_handler(event, context): extracted = [] body = json.loads(event["body"]) status_code = 200 for row in body["data"]: try: entities_extracted = medComprehendClient.detect_entities( Text = row[1] ) except Exception as e: print(e); entities_extracted = "ERROR" # status code of 400 implies an error status_code = 400 extracted.append([row[0], entities_extracted["Entities"]]) json_compatible_string_to_return = json.dumps({"data" : extracted}) # return data according to Snowflake's specified result format return { 'statusCode': status_code, 'body': json_compatible_string_to_return }

Step 8: Amazon API Gateway Setup

We now need to create an Amazon API Gateway to call your Lambda function. Follow Snowflake’s documentation to create an API Gateway. You’ll want to create a POST method that forwards the request to the Lambda function created above.

You also need to make sure to set up the proper AWS Identity and Access Management (IAM) permissions to allow the api_key and api_secret to access the API Gateway. Also, make sure to use Lambda Proxy as the API integration so that Amazon API Gateway properly passes the request from Snowflake to Lambda.

Step 9: Create a Snowflake Function

Finally, we need to create an external function in Snowflake. Run the below commands in Snowflake to do so:

-- NER External Function
select system$get_snowflake_platform_info();
use role accountadmin;
use database <YOUR DATABASE>; -- create API integration
create api integration awsComprehendMedicalIntegration api_provider=aws_api_gateway api_aws_role_arn='<arn_created_to_allow_snowflake_resources>' api_allowed_prefixes=('<aws_api_gateway_url>') enabled=true; desc integration awsComprehendMedicalIntegration;

Copy the API_AWS_IAM_USER_ARN and API_AWS_EXTERNAL_ID from the output of the describe command. Go back to IAM and update the “Trust Relationship” for the role you created above with this information.

Lastly, run the following command in Snowflake:

ls @<stage_created_for_docs>; -- Create the external function medical comprehend
create or replace external function medical_doc_ner(input string) returns variant api_integration =awsComprehendMedicalIntegration as '<aws_api_gateway_endpoint_with_resource>';

Step 10: Testing

Now let’s test our function using a Snowflake worksheet. Run the below command in Snowflake:

select medical_doc_ner('Mary takes advil 3x a day due to lower leg pain.  Mary has a history of nerve damage bilaterally.');

This will run the text through our configured setup above. Our output should be a JSON response with extracted entities that looks like the following:


Figure 2 – Example output of extracted entities.

Step 11: Creating a Stored Procedure in Snowflake

We’ll wrap this all together so we can easily extract the contents using stored procedures in Snowflake. This allows us to get the contents back into Snowflake.

In Snowflake, insert extracted text from PDF document into our table with this command:

CREATE or replace PROCEDURE docParseInsert(DOC_NAME STRING) RETURNS string LANGUAGE javascript AS $$ var rs = snowflake.execute( { sqlText: `INSERT INTO document_contents (document_extracted_contents, document_path,doc_text_extracted) (SELECT read_pdf('@medical_pdf_internal_stage/' || '${DOC_NAME}'), '${DOC_NAME}', true);` } ); return 'success'; $$;

Extract entities from a document and update the current row with extracted entities using this command:

CREATE or replace PROCEDURE docEntityInsert(DOC_ID FLOAT) RETURNS string LANGUAGE javascript AS $$ var rs = snowflake.execute( { sqlText: `update document_contents set document_entities = medical_doc_ner(document_extracted_contents), doc_entities_extracted = true where document_id = ${DOC_ID};` } ); return 'success'; $$;

Step 12: Final Validation

Finally, you can try out the stored procedures and have the full pipeline run. Run the below commands in Snowflake to see it all work together:

call docParseInsert('SampleWriteUp_editv7.pdf'); //Use your own document that is stored in the stage. You can use Snowsql to upload a pdf document from your desktop to the Snowflake internal stage specified above. select * from dev_db.public.document_contents;

Note the document ID in the DOCUMENT_ID column. Also note that the DOCUMENT_ENTITIES column is NULL.


Figure 3 – Caption goes here.

Run the following command, with your DOCUMENT_ID to populate the DOCUMENT_ENTITIES column: Call docEntityInsert(<DOCUMENT_ID>);

Run the same SELECT statement from above to check that the DOCUMENT_ENTITIES column has been populated, as seen below:


Figure 4 – Caption goes here.

Click on the cell with heading DOCUMENT_ENTITIES and you should see the entities extracted as below, and then you can scroll through the entire list:


Figure 5 – Caption goes here.

You should now have extracted entities using Amazon Comprehend Medical. This full pipeline allows you to take information from Snowflake, run them through Amazon Comprehend Medical, and store the results back inside Snowflake.

Real-World Example

A healthcare provider in the Northeast United States plans to use Snowflake and Amazon Comprehend Medical to analyze all of their unstructured data such as doctors’ notes, dictations, nurses’ reports, radiology reports, etc. across their patient base.

The goal will to bring value from this important data given that it is currently a challenge to organize and work with in an unstructured format. They plan to provide a quick summarization of the patient’s medical history without having to scan over all of their unstructured documents.


The Snowflake Data Cloud allows healthcare and life sciences (HCLS) organizations to centralize all of their data in a single and secure location to unlock deeper insights with faster analytics. This enables improved patient outcomes, quality member experiences, and streamlined operational inefficiencies.

This post has shown how Snowflake customers can also leverage Snowflake data with other services, like Amazon Comprehend Medical, to do even more with unstructured data.

To learn more, check out the HCLS Snowflake Cloud for more information on how Snowflake is helping enable healthcare organizations.


Snowflake – AWS Partner Spotlight

Snowflake is an AWS Competency Partner that has reinvented the data warehouse, building a new enterprise-class SQL data warehouse designed from the ground up for the cloud and today’s data.

Contact Snowflake | Partner Overview | AWS Marketplace

*Already worked with Snowflake? Rate the Partner

*To review an AWS Partner, you must be a customer that has worked with them directly on a project.