Google Cloud earlier announced the launch of its Cloud Run GPU service, which lets users attach NVIDIA L4 GPUs to Cloud Run services. With automatic scaling and elastic deployment, the service is aimed mainly at workloads such as AI inference.
Because there is no need to apply for GPU quota or reserve capacity in advance, the number of GPUs is adjusted automatically to match computing demand. GPUs are therefore not left idle when unused, avoiding unnecessary cost, while automated deployment increases flexibility and reduces management overhead.
The service is billed by the second and automatically scales to zero when not in use. A cold start, including GPU and driver initialization, completes in roughly 5 seconds. Taking inference with Gemma 3's 4-billion-parameter model as an example, it takes only about 19 seconds from a cold start to generating the first token.
Developers can enable GPU-accelerated computing for a service either by adding the relevant flags when deploying the application or by toggling the GPU option in the Cloud Run console.
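As a rough sketch of the command-line path, a deployment might look like the following. The service name, project, and image URL are placeholders, and the flag names reflect the `gcloud beta run deploy` surface as commonly documented; they may differ by gcloud version:

```shell
# Deploy a Cloud Run service with one NVIDIA L4 GPU attached.
# "my-inference-service" and the image path are hypothetical placeholders.
gcloud beta run deploy my-inference-service \
  --image=us-docker.pkg.dev/my-project/my-repo/inference:latest \
  --region=us-central1 \
  --gpu=1 \
  --gpu-type=nvidia-l4 \
  --no-cpu-throttling \
  --max-instances=4
```

GPU services generally require the always-allocated CPU mode (hence `--no-cpu-throttling`), and `--max-instances` caps how far the automatic scaling described above can expand.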
Since the service is elastically provisioned, Google Cloud also touts its reliability, stating that users and enterprises can deploy across multiple regions according to operational needs. In addition, they can turn off zonal redundancy to adjust the overall available computing capacity and cost.
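The zonal-redundancy opt-out mentioned above is, as an assumption about the CLI surface, set per service at deploy time; a minimal sketch with placeholder names:

```shell
# Turn off zonal redundancy for a GPU-backed service (sketch; the
# --no-gpu-zonal-redundancy flag name follows the gcloud beta surface
# and may change between releases).
gcloud beta run deploy my-inference-service \
  --image=us-docker.pkg.dev/my-project/my-repo/inference:latest \
  --region=europe-west4 \
  --gpu=1 \
  --gpu-type=nvidia-l4 \
  --no-gpu-zonal-redundancy
```

Multi-region deployment is simply repeating the deploy with a different `--region` value for each region the service should run in.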
Currently, Cloud Run GPU services are available in multiple Google Cloud regions in the United States, Europe, and Asia.