- Inference means using a trained model to predict outputs on real-world data
- There are two major ways to get inferences:
- Online (real-time) inference - returns outputs in real time
- Batch inference - does not return outputs in real time
Online Inference
- This is for making predictions one at a time, on demand, and getting a response almost instantly
- How it works: A single piece of data is sent to a live model endpoint (like a Vertex AI Endpoint), and a single prediction is returned immediately.
- Primary Goal: Low latency
- Use case: User-facing or time-critical applications.
- ex: Real-time product recommendations.
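The request/response shape above can be sketched in a few lines. This is a minimal local sketch, not a real serving API: `score` is a toy stand-in for a deployed model, and `handle_request` stands in for the endpoint that would, in production, receive each request over the network.

```python
def score(features):
    # Toy stand-in for a deployed model: tiers a customer by spend.
    return "premium" if features["monthly_spend"] > 100 else "basic"

def handle_request(features):
    # Online inference: one instance in, one prediction out, immediately.
    return {"prediction": score(features)}

# Each user-facing request is served on demand, as it arrives:
result = handle_request({"monthly_spend": 250})
```

The key point is the shape: one instance per call, and the caller blocks only for as long as a single prediction takes.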
Batch Inference
- This is for processing a large volume of data all at once when an immediate response isn’t necessary.
- How it works: You collect a large “batch” of data (e.g. thousands of images, a CSV file with all of yesterday’s sales data) and run it through the model in a single job. The job can take minutes or hours to complete.
- Primary Goal: High throughput (processing a large volume of data efficiently in a single run).
- Use case: Backend analytical jobs that typically run on a schedule.
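A batch job, by contrast, scores every record in one pass. This is a minimal sketch under assumed names: `score` is a toy stand-in model, and the CSV mirrors the "yesterday's sales data" example above.

```python
import csv
import io

def score(features):
    # Toy stand-in for a deployed model: tiers a customer by spend.
    return "premium" if float(features["monthly_spend"]) > 100 else "basic"

def run_batch_job(csv_text):
    # Batch inference: read the whole file and score every row in one job.
    rows = csv.DictReader(io.StringIO(csv_text))
    return [{**row, "prediction": score(row)} for row in rows]

# e.g. yesterday's sales data collected into one file:
data = "customer_id,monthly_spend\nc1,250\nc2,40\nc3,180\n"
results = run_batch_job(data)
```

No single row gets an instant answer; the job optimizes for getting through all rows efficiently, which is why such jobs typically run on a schedule.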
Online vs Batch Inference
| | Online Inference | Batch Inference |
| --- | --- | --- |
| Response | Immediate, per request | Delayed, per job (minutes to hours) |
| Primary goal | Low latency | High throughput |
| Use case | User-facing, time-critical applications | Scheduled backend analytical jobs |