Artificial intelligence (AI) makes it possible to automate operations that were unthinkable until recently.

In many cases, you need to understand the content of your image archive, or the dialogue in an audio or video file, to intercept content deemed harmful, dangerous, or out of place for a given situation or audience. This post illustrates some methods for extracting this critical information using AI services on Amazon Web Services (AWS).

In some circumstances, it is necessary to analyze and understand content that can be offensive, harmful to your brand, or inappropriate.

In this post, we show you how such content can be detected using Amazon Transcribe and Amazon Rekognition.

Launched at AWS re:Invent 2017, Amazon Transcribe is an automatic speech recognition (ASR) service that makes it easy for AWS customers to add speech-to-text capabilities to their applications.

The media and entertainment industry often uses Amazon Transcribe to convert media content into accessible and searchable text files. Use cases include generating subtitles or transcripts, moderating content, and more. Amazon Transcribe is also used by operations teams for quality control—for example, checking that audio and video are in sync using the time stamps present in the extracted text.

Amazon Transcribe makes it easier to filter unwanted content automatically and programmatically.

You can mask or remove words you don’t want to appear in your transcription results with vocabulary filtering. For example, you can use vocabulary filtering to prevent the display of offensive or profane terms. This feature lets you generate captions of a TV show or transcripts of conferences that are appropriate for all audiences.

Vocabulary filtering is available for both real-time streaming and batch processing.

Prerequisites

To get started, you will need:

  • An AWS account
  • The AWS Command Line Interface (AWS CLI) installed and configured
  • An Amazon S3 bucket to store your media files

Mask unwanted words in audio files

In this section, we analyze an audio file: we extract its text (transcription) and mask some words that we consider inappropriate for our workload. In our example, the words to mask are listed in a text file that we will upload to AWS.

We will analyze an audio file extracted from an AWS video titled “What is AWS” from the official AWS YouTube channel.

You can download the audio file here: AUDIO FILE (right-click and save file as).

Now we have our audio file: whatisaws.mp3

Next, we upload the audio clip to a bucket in Amazon Simple Storage Service (Amazon S3), object storage built to store and retrieve any amount of data from anywhere:

$ aws s3 cp whatisaws.mp3 s3://change-to-yourbucket/

We then start a transcription job on the uploaded file using the StartTranscriptionJob API, which transcribes speech to text asynchronously, so the job runs in the background.

Here is how you can start the job using the AWS CLI:

$ aws transcribe start-transcription-job --transcription-job-name testwords --language-code en-US --media MediaFileUri=s3://change-to-yourbucket/whatisaws.mp3
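The same job can also be started from code. Here is a minimal Python sketch of the equivalent boto3 call; the bucket name is a placeholder, and the client call is commented out because it requires AWS credentials:

```python
# Build the arguments for the StartTranscriptionJob API call.
# "change-to-yourbucket" is a placeholder; replace it with your bucket.

def start_job_args(job_name, media_uri, language_code="en-US"):
    """Keyword arguments for transcribe.start_transcription_job."""
    return {
        "TranscriptionJobName": job_name,
        "LanguageCode": language_code,
        "Media": {"MediaFileUri": media_uri},
    }

args = start_job_args("testwords", "s3://change-to-yourbucket/whatisaws.mp3")

# With credentials configured, you would start the job like this:
# import boto3
# transcribe = boto3.client("transcribe")
# transcribe.start_transcription_job(**args)

print(args["Media"]["MediaFileUri"])  # s3://change-to-yourbucket/whatisaws.mp3
```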

Now that we have started the asynchronous job, we need to check whether it is still running or has completed. For this check, we use the GetTranscriptionJob API.

This API returns information about a transcription job. To see the status of the job, check the TranscriptionJobStatus field. If the status is COMPLETED, the job is finished, and you can find the results at the location specified in the TranscriptFileUri field.

We use the following command to check the job status:

$ aws transcribe get-transcription-job --transcription-job-name testwords
{
    "TranscriptionJob": {
        "TranscriptionJobName": "testwords",
        "TranscriptionJobStatus": "COMPLETED",
        "LanguageCode": "en-US",
        "MediaSampleRateHertz": 44100,
        "MediaFormat": "mp3",
        "Media": {
            "MediaFileUri": "s3://change-to-yourbucket/whatisaws.mp3"
        },
        "Transcript": {
            "TranscriptFileUri": "https://s3.eu-west-1.amazonaws.com/aws-transcribe-eu-west-1-prod/xxxxxxxxxxxxx/testwords/9332f738-196d-4e83-ba41-4c9014b0ad9b/asrOutp..........

When the job completes, you can see the TranscriptionJobStatus has status COMPLETED.

You can download our transcription file from the signed URL in the TranscriptFileUri field.
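If you script this check, the response can be parsed with a small helper. The Python sketch below uses a sample dictionary that mirrors the response shape shown above; the URI value is a made-up placeholder:

```python
def job_result(response):
    """Return (status, transcript_uri); the URI is None until COMPLETED."""
    job = response["TranscriptionJob"]
    status = job["TranscriptionJobStatus"]
    if status == "COMPLETED":
        return status, job["Transcript"]["TranscriptFileUri"]
    return status, None

# Sample modeled on the GetTranscriptionJob response shown above.
sample = {
    "TranscriptionJob": {
        "TranscriptionJobName": "testwords",
        "TranscriptionJobStatus": "COMPLETED",
        "Transcript": {"TranscriptFileUri": "https://example.com/asrOutput.json"},
    }
}

print(job_result(sample))  # ('COMPLETED', 'https://example.com/asrOutput.json')
```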

If we download it, we can see the entire transcription:

So far, we have described how to transcribe audio content and obtain a complete transcript. Now we will see how to instruct Amazon Transcribe to take the words we define as inappropriate in a vocabulary and flag, remove, or, in our case, mask them with asterisks (“***”).

We have two options to define a vocabulary filter: with a text file in an Amazon S3 bucket or with a list of words defined in the command line.

You can download a simple text file here.
The text file contains these three words:

  • AWS
  • infrastructure
  • applications

Now, we can create a vocabulary filter with the CreateVocabularyFilter API by specifying a name, a language code, and the uniform resource identifier (URI) of the previously uploaded text file:

$ aws transcribe create-vocabulary-filter --vocabulary-filter-name vocabularyfiltertest --language-code en-US --vocabulary-filter-file-uri s3://change-to-yourbucket/vocabulary-filter-example.txt

In this case, we want to filter the words “AWS,” “infrastructure,” and “applications.”

Remember that words in a vocabulary filter aren’t case sensitive (for example, “AWS” and “aws” are considered the same). Amazon Transcribe filters only words that exactly match words in the filter, and it doesn’t filter words that are contained within other words.
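To make these matching rules concrete, here is a small Python illustration of the same semantics: case-insensitive, whole-word matches, masked as “***”. This is only a local sketch, not Amazon Transcribe's implementation:

```python
import re

def mask_words(text, filter_words):
    """Mask whole-word, case-insensitive matches of filter_words with ***."""
    masked = text
    for word in filter_words:
        # \b anchors ensure only exact whole-word matches are masked,
        # so words contained inside other words are left alone.
        pattern = r"\b" + re.escape(word) + r"\b"
        masked = re.sub(pattern, "***", masked, flags=re.IGNORECASE)
    return masked

print(mask_words("AWS runs your applications on its infrastructure.",
                 ["aws", "infrastructure", "applications"]))
# *** runs your *** on its ***.
```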

We can now launch our job with the vocabulary filter, using the mask method:

$ aws transcribe start-transcription-job --transcription-job-name testwords2 --language-code en-US --media MediaFileUri=s3://change-to-yourbucket/whatisaws.mp3 --settings VocabularyFilterName=vocabularyfiltertest,VocabularyFilterMethod=mask

After a while, we can check the job progress:

$ aws transcribe get-transcription-job --transcription-job-name testwords2
{
    "TranscriptionJob": {
        "TranscriptionJobName": "testwords2",
        "TranscriptionJobStatus": "COMPLETED",
        "LanguageCode": "en-US",
        "MediaSampleRateHertz": 44100,
        "MediaFormat": "mp3",
        "Media": {
            "MediaFileUri": "s3://change-to-yourbucket/whatisaws.mp3"
        },
        "Transcript": {
            "TranscriptFileUri": "https://s3.eu-west-1.amazonaws.com/aws-transcribe-eu-west-1-prod/xxxxxxxxxxxxx/testwords2/9332f738-196d-4e83-ba41-4c9014b0ad9b/asrOutp..........

To automate the entire process, you can insert a notification into your workflow using Amazon Simple Notification Service (Amazon SNS), a fully managed messaging service, or react to an event from Amazon EventBridge, a serverless event bus that makes it easier to build event-driven applications at scale.
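For example, an EventBridge rule can match transcription jobs as they finish. The event pattern below is a sketch based on the Transcribe Job State Change event; adapt the statuses to your workflow:

```json
{
  "source": ["aws.transcribe"],
  "detail-type": ["Transcribe Job State Change"],
  "detail": {
    "TranscriptionJobStatus": ["COMPLETED", "FAILED"]
  }
}
```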

You can download the result of the transcription at the URI specified in the TranscriptFileUri field. By inspecting the file, you can see how Amazon Transcribe has masked the unwanted words specified in the vocabulary filter:

That’s the entire process. You can use the same option with the AWS Management Console, which provides everything you need to manage your AWS accounts, or use AWS SDKs, which take the complexity out of coding by providing language-specific APIs for AWS services.

Detect inappropriate content in a photo using Amazon Rekognition

You can use Amazon Rekognition, which automates image and video analysis with machine learning, to detect content in your images and videos that your target audience may consider inappropriate. You can use Amazon Rekognition moderation APIs in social media, broadcast media, advertising, and ecommerce situations to create a safer user experience, helping to increase your confidence in your content.

You can use the DetectModerationLabels API operation to detect inappropriate or offensive content in images. Additionally, you can use the Amazon Rekognition Video API to detect inappropriate content asynchronously by using the StartContentModeration and GetContentModeration operations.

In this section, we show how you can moderate smoking in a photo of a model on the street.

The following is a photo example:

Man smoking a cigarette

Download the file from HERE and save the photo as “smoking-photo.jpg.”

First, we need to upload a photo to an Amazon S3 bucket:

$ aws s3 cp smoking-photo.jpg s3://change-to-yourbucket/

Now, call the API to detect the content:

$ aws rekognition detect-moderation-labels \
--image "S3Object={Bucket=change-to-yourbucket,Name=smoking-photo.jpg}"

Look at the response to find the following:

{
    "ModerationLabels": [
        {
            "Confidence": 51.32129669189453,
            "Name": "Smoking",
            "ParentName": "Tobacco"
        },
        {
            "Confidence": 51.32129669189453,
            "Name": "Tobacco",
            "ParentName": ""
        }
    ],
    "ModerationModelVersion": "5.0"
}

We can see that Amazon Rekognition has detected “Smoking” and “Tobacco,” each with a specific confidence score.

A confidence score is a number between 0 and 100 that indicates the probability that a given prediction is correct. In the photo example, the object detection process returned a confidence score of about 51 for the label “Tobacco.”

Applications that are very sensitive to detection errors (false positives) should discard results associated with confidence scores below a certain threshold. The optimum threshold depends on the application. In many cases, you will get the best user experience by setting minimum confidence values higher than the default value.
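The thresholding logic can be as simple as the Python sketch below, applied to the response shown earlier. The threshold of 60 is an arbitrary example; note that DetectModerationLabels also accepts a server-side MinConfidence parameter:

```python
# Sample modeled on the DetectModerationLabels response shown above.
response = {
    "ModerationLabels": [
        {"Confidence": 51.32129669189453, "Name": "Smoking", "ParentName": "Tobacco"},
        {"Confidence": 51.32129669189453, "Name": "Tobacco", "ParentName": ""},
    ],
    "ModerationModelVersion": "5.0",
}

def labels_above(response, min_confidence):
    """Keep only label names whose confidence meets the threshold."""
    return [label["Name"] for label in response["ModerationLabels"]
            if label["Confidence"] >= min_confidence]

print(labels_above(response, 50))  # ['Smoking', 'Tobacco']
print(labels_above(response, 60))  # []
```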

In this way, we can find explicit content in a photo or video.

As of 2021, Amazon Rekognition can detect explicit content like nudity, sexual activity, other suggestive content, violence, drugs, alcohol, tobacco use (like our example), and more.

For a complete list, see https://docs.aws.amazon.com/rekognition/latest/dg/moderation.html.

Conclusion

In this post, we demonstrated how to find inappropriate content within audio files, videos, and images. Using AI services such as Amazon Transcribe and Amazon Rekognition, you can analyze content without training any models. You can feed the results into verification workflows, automatically discard inappropriate content, or alert your audience by providing suitable disclaimers.