This post was contributed by Thomas Ngue Minkeng, Nathalie Au, Marc Bouffard, and Pierre-Marie Airiau from Ogury.
Ogury, the Personified Advertising company, is using open source machine learning (ML) on AWS to deliver a planned 300,000 inferences per second under 10-ms latency. Ogury’s breakthrough advertising engine delivers precision, sustainability, and privacy protection within one technology stack and is built and optimized for mobile. Advertisers working with Ogury benefit from fully visible impactful advertisements, future-proof targeting, and unwavering protection. Publishers enjoy the rewards of a respectful user experience, incremental revenues, and premium demand with Ogury’s solutions. Founded in 2014, Ogury is a global organization with 350+ employees, including 100 engineers across 11 countries. Ogury has been using the AWS cloud for more than 6 years and already presented their experience in a past AWS case study.
To serve the right advertisement at the right time, Ogury is considering using ML to estimate the click probabilities associated with advertising campaigns. While it is straightforward to research and develop standalone ML models on small datasets, it is more challenging to deploy ML features in an existing large-scale production application. In particular, Ogury highlights the following challenges:
- The latency budget for scoring is tight: An ad bid request must be answered within tens of milliseconds, including less than 10 ms for campaign scoring, which can involve evaluating up to 50 campaigns.
- Scale is an additional challenge: Thousands of campaign scoring requests occur per second, and the ML inference solution must be able to scale to and sustain this amount of traffic.
- Integration must also be possible with the rest of the ad serving stack. In particular, Ogury needs the ML inference to be callable in a Node.js application.
The rest of this post describes the technologies and architecture that enable Ogury to serve hundreds of thousands of predictions per second with their ad scoring model.
Model representation and runtime with ONNX
To model ad conversion patterns, Ogury engineers train a logistic regression model with Apache Spark on Amazon EMR. Apache Spark is a broadly adopted open source framework for distributed batch and streaming computation. At the time of this writing, however, Spark has limited support for synchronous, real-time inference on small payloads. For example, it has neither a native inference compiler nor a model server, and it often needs to be complemented with specialized third-party components, such as MLeap for model serialization and Spring Boot for request handling.
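For context, a logistic regression scores an example by applying the sigmoid function to a weighted sum of its features. The TypeScript sketch below shows that computation for a single campaign. The names `LogisticModel` and `scoreOne`, as well as the weights, bias, and feature values, are made-up placeholders for illustration and are not taken from Ogury's model.

```typescript
// Hypothetical container for model parameters; real values would come
// from the trained model.
interface LogisticModel {
  weights: number[];
  bias: number;
}

// Numerically stable sigmoid: avoids overflow in Math.exp for large |z|.
function sigmoid(z: number): number {
  if (z >= 0) {
    return 1 / (1 + Math.exp(-z));
  }
  const e = Math.exp(z);
  return e / (1 + e);
}

// Click probability for one feature vector: sigmoid(w . x + b).
function scoreOne(model: LogisticModel, features: number[]): number {
  if (features.length !== model.weights.length) {
    throw new Error("feature vector length does not match model weights");
  }
  let z = model.bias;
  for (let i = 0; i < model.weights.length; i++) {
    z += (model.weights[i] ?? 0) * (features[i] ?? 0);
  }
  return sigmoid(z);
}

// Placeholder parameters and features, chosen only for illustration.
const exampleModel: LogisticModel = { weights: [0.8, -1.2, 0.3], bias: -2.0 };
const clickProbability = scoreOne(exampleModel, [1.0, 0.5, 2.0]);
console.log(clickProbability.toFixed(4));
```

Because the computation is this small, per-call overhead around the model (serialization, network hops, request handling) can easily dominate the time spent on the arithmetic itself.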
Architecture diagram of the open source ML inference solution
To reduce orchestration overhead and let ONNX Runtime exploit parallelism where possible, the team tested client-side batching: Instead of scoring 50 campaigns one at a time in a loop, the client submits a single input matrix that holds all 50 campaigns. This optimization reduces inference latency by 75%, from 40 ms to 10 ms at P95.
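The idea behind client-side batching can be sketched as follows. This is a minimal illustration, not Ogury's code: `toBatch` packs per-campaign feature vectors into one row-major matrix, and `runModel` is a hypothetical stand-in for the inference runtime that simply counts how often it is called. The features and parameters are synthetic. The sketch shows the reduction in runtime calls, not the latency gain, which depends on the real runtime and hardware.

```typescript
// A batch laid out the way tensor runtimes commonly expect it:
// row-major values plus a [rows, cols] shape.
interface Batch {
  data: Float32Array;
  dims: [number, number];
}

// Pack per-campaign feature vectors into one row-major matrix.
function toBatch(rows: number[][]): Batch {
  const cols = rows[0]?.length ?? 0;
  const data = new Float32Array(rows.length * cols);
  rows.forEach((row, r) => {
    if (row.length !== cols) {
      throw new Error(`row ${r} has ${row.length} features, expected ${cols}`);
    }
    data.set(row, r * cols);
  });
  return { data, dims: [rows.length, cols] };
}

// Hypothetical stand-in for the inference runtime: a logistic regression
// that counts how many times it is invoked.
let modelCalls = 0;
const weights = [0.8, -1.2, 0.3]; // placeholder values
const bias = -2.0; // placeholder value

function runModel(batch: Batch): number[] {
  modelCalls += 1;
  const [nRows, nCols] = batch.dims;
  const scores: number[] = [];
  for (let r = 0; r < nRows; r++) {
    let z = bias;
    for (let c = 0; c < nCols; c++) {
      z += (weights[c] ?? 0) * (batch.data[r * nCols + c] ?? 0);
    }
    scores.push(1 / (1 + Math.exp(-z)));
  }
  return scores;
}

// 50 campaigns with 3 synthetic features each.
const campaigns: number[][] = Array.from({ length: 50 }, (_, i) => [i / 50, 1 - i / 50, 0.5]);

// Loop: one runtime call per campaign.
modelCalls = 0;
const loopScores = campaigns.map((row) => runModel(toBatch([row]))[0] ?? NaN);
const loopCalls = modelCalls;

// Client-side batching: one runtime call for all campaigns.
modelCalls = 0;
const batchScores = runModel(toBatch(campaigns));
const batchCalls = modelCalls;

console.log({ loopCalls, batchCalls });
```

Both paths produce the same scores, but the batched path crosses the boundary between application and runtime once instead of 50 times, which is where the per-call overhead is paid.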
Feature fetching with a Node object
Burning the model in the client application
In real-time inference use cases, it is common to architect ML models as standalone representational state transfer (REST) services, with their own compute fleets and web application programming interfaces (APIs). Machine learning models served as microservices have the benefit of enabling modular architectures with a clean separation of concerns. In this specific use case, however, the Ogury team prioritizes ML inference latency and is willing to give up some modularity to minimize it. The team deploys the model in the same container as the consuming application. Consequently, the traditional ML web serving overhead (network travel time, serialization, and request handling) is eliminated entirely, which saves several milliseconds.
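The in-process pattern can be sketched as below. All names (`loadScorer`, `handleBidRequest`, `BidRequest`) and the model parameters are hypothetical and stand in for the real application and model loading code, which this post does not show. The point of the sketch is only that the model is loaded once at startup and then called as an ordinary function from the request handler.

```typescript
// Hypothetical in-process scorer: the model lives in the same process as
// the request handler, so scoring is a plain function call.
type Scorer = (featureRows: number[][]) => number[];

interface BidRequest {
  campaignIds: string[];
  featureRows: number[][]; // one row per campaign, same order as campaignIds
}

interface ScoredCampaign {
  campaignId: string;
  score: number;
}

// Stand-in for loading a serialized model at application startup.
function loadScorer(coefficients: number[], intercept: number): Scorer {
  return (featureRows) =>
    featureRows.map((row) => {
      const z = row.reduce((acc, x, i) => acc + x * (coefficients[i] ?? 0), intercept);
      return 1 / (1 + Math.exp(-z));
    });
}

// Loaded once, then reused for every request (placeholder parameters).
const scorer = loadScorer([0.8, -1.2, 0.3], -2.0);

// No network hop, payload serialization, or separate web server:
// the handler passes the whole batch to the model directly.
function handleBidRequest(request: BidRequest): ScoredCampaign[] {
  const scores = scorer(request.featureRows);
  return request.campaignIds.map((campaignId, i) => ({
    campaignId,
    score: scores[i] ?? NaN,
  }));
}

const scored = handleBidRequest({
  campaignIds: ["campaign-a", "campaign-b"],
  featureRows: [
    [1.0, 0.5, 2.0],
    [0.0, 0.0, 0.0],
  ],
});
console.log(scored);
```

The trade-off is the one described above: the model can no longer be deployed, scaled, or versioned independently of the application that consumes it.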
Deployment on AWS and next steps
The ML model will receive several thousand batches per second, representing hundreds of thousands of predictions per second. To scale to such a magnitude while keeping infrastructure management overhead low, Ogury is using a self-managed Kubernetes cluster on Amazon Elastic Compute Cloud (Amazon EC2), on which over 300 Kubernetes pods will be deployed at peak. This setup achieves satisfactory performance. As a next step, Ogury will iterate on model science to increase the business value delivered to their customers.
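A back-of-the-envelope calculation shows how these figures relate. The sketch below uses the numbers quoted in this post (300,000 planned inferences per second, up to 50 campaigns per batch, over 300 pods at peak) and makes two simplifying assumptions that are not from the source: every batch is full, and traffic is spread evenly across exactly 300 pods.

```typescript
// Figures quoted in this post.
const targetPredictionsPerSecond = 300000;
const maxCampaignsPerBatch = 50;
const podsAtPeak = 300; // the post says "over 300"; 300 is used as a round figure

// Assumptions: every batch is full and load is spread evenly across pods.
const batchesPerSecond = targetPredictionsPerSecond / maxCampaignsPerBatch;
const batchesPerPodPerSecond = batchesPerSecond / podsAtPeak;
const predictionsPerPodPerSecond = targetPredictionsPerSecond / podsAtPeak;

console.log(`batches per second: ${batchesPerSecond}`);
console.log(`batches per pod per second: ${batchesPerPodPerSecond}`);
console.log(`predictions per pod per second: ${predictionsPerPodPerSecond}`);
```

Under these assumptions, the target corresponds to 6,000 batches per second overall, or about 20 batches and 1,000 predictions per pod per second. Because batches hold up to 50 campaigns, smaller batches would raise the batch count for the same number of predictions.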
Advertising is rich with ML use cases, and Ogury is evaluating AWS services for several other projects. If you are also interested in running ML projects on AWS or deploying open source technology at scale on AWS, please take a look at our AWS Machine Learning and AWS Open Source pages.
The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.