- Inference means using a trained model to predict outputs on real-world data
- There are two major ways to get inferences:
- Online (real-time) inference - returns outputs in real time
- Batch inference - does not return outputs in real time
Online Inference
- This is for making predictions one at a time, on demand, and getting a response almost instantly
- How it works: A single piece of data is sent to a live model endpoint (like a Vertex AI Endpoint), and a single prediction is returned immediately.
- Primary Goal: Low latency
- Use case: User-facing or time-critical applications.
- ex: Real-time product recommendations.
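The request/response shape above can be sketched in a few lines. This is a minimal local sketch, not a real serving API: `score` is a toy stand-in for a deployed model, and `handle_request` stands in for the endpoint that would, in production, receive each request over the network.

```python
def score(features):
    # Toy stand-in for a deployed model: tiers a customer by spend.
    return "premium" if features["monthly_spend"] > 100 else "basic"

def handle_request(features):
    # Online inference: one instance in, one prediction out, immediately.
    return {"prediction": score(features)}

# Each user-facing request is served on demand, as it arrives:
result = handle_request({"monthly_spend": 250})
```

The key point is the shape: one instance per call, and the caller blocks only for as long as a single prediction takes.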
Batch Inference
- This is for processing a large volume of data all at once when an immediate response isn’t necessary.
- How it works: You collect a large “batch” of data (e.g. thousands of images, a CSV file with all of yesterday’s sales data) and run it through the model in a single job. The job can take minutes or hours to complete.
- Primary Goal: High throughput (processing a large volume of data efficiently in a single run).
- Use case: Backend analytical jobs that typically run on a schedule.
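A batch job, by contrast, scores every record in one pass. This is a minimal sketch under assumed names: `score` is a toy stand-in model, and the CSV mirrors the "yesterday's sales data" example above.

```python
import csv
import io

def score(features):
    # Toy stand-in for a deployed model: tiers a customer by spend.
    return "premium" if float(features["monthly_spend"]) > 100 else "basic"

def run_batch_job(csv_text):
    # Batch inference: read the whole file and score every row in one job.
    rows = csv.DictReader(io.StringIO(csv_text))
    return [{**row, "prediction": score(row)} for row in rows]

# e.g. yesterday's sales data collected into one file:
data = "customer_id,monthly_spend\nc1,250\nc2,40\nc3,180\n"
results = run_batch_job(data)
```

No single row gets an instant answer; the job optimizes for getting through all rows efficiently, which is why such jobs typically run on a schedule.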
Online vs Batch Inference
| | Online Inference | Batch Inference |
| --- | --- | --- |
| Response | Immediate, per request | Delayed, per job (minutes to hours) |
| Primary goal | Low latency | High throughput |
| Use case | User-facing, time-critical applications | Scheduled backend analytical jobs |