Deploying AI to the Cloud

Ilya Kaminsky
Published in The Academy · Jan 11, 2023 · 5 min read


This is a tale of the hurdles I had to overcome in my pursuit of deploying an artificial-intelligence-based program to the cloud for general use. But first, a quick disclaimer and a TL;DR.

Disclaimer: while I have used AI in the past — voice assistants, code completion, image generation — I’m new to developing it. So, if you come across any glaring errors or obvious shortcomings, kindly share your thoughts and suggestions in a response below.

TL;DR: in my experiments with high-quality text-to-speech conversion, I found that generating the voice samples locally was roughly sixty times faster than doing so in a GPU-enabled serverless computing environment. Therefore, I don’t recommend pursuing the same outcome at this time. Instead, consider using this article as a guide to what to expect when you’re ready to deploy your own AI creation to the cloud.

“A tortoise on a flat road looking far into the distance. 4K 8K HDR photorealistic” – generated with Stable Diffusion v1.5 via DreamStudio Beta

Let’s Start With a Few Basic Concepts

  • Serverless Framework — an architecture that lets developers deploy code to the cloud in a way that can be consumed by any other app (mobile, desktop, web, etc.) while only being charged for its runtime, which is usually measured in seconds. For this post, I’ll be using Banana, which currently costs roughly one-twentieth of a US cent ($0.00052) per second of GPU use.
  • Local Testing — one or more physical machines dedicated to debugging, running, testing, and validating the application before it is published, saving time and money by catching incomplete models and buggy code before deployment.
  • Docker — a containerization platform that lets developers package their code in a portable manner so that it can run on all kinds of servers with different configurations of CPUs, GPUs, and RAM, making it easy to scale the app as needed.
  • Tortoise TTS — an open-source text-to-speech (“TTS”) program that can run on Windows or in a Google Colab notebook. It ships with more than twenty pre-trained voices with high-quality prose and intonation, and it supports training additional ones. Tortoise can also understand prompts such as [I am really sad] and use that information to modify the tone of voice accordingly (source); a minimal usage sketch follows this list.
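
To make that last point concrete, here is a minimal usage sketch based on Tortoise’s documented Python API. The voice name, preset, and output path are arbitrary choices of mine, not anything this project requires.

import torchaudio
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_voice

# Load one of the bundled voices and synthesize a line whose bracketed
# prompt steers the emotion but is not spoken aloud.
tts = TextToSpeech()
voice_samples, conditioning_latents = load_voice("tom")
gen = tts.tts_with_preset(
    "[I am really sad,] Please feed me.",
    voice_samples=voice_samples,
    conditioning_latents=conditioning_latents,
    preset="fast",
)
torchaudio.save("sad_tom.wav", gen.squeeze(0).cpu(), 24000)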

Implementation

Banana’s Serverless Framework is pretty straightforward. It offers a REST API, as well as SDKs in Python, Go, and Node, that let developers communicate with their creations. The call to Banana can include a collection of model inputs: the parameters, prompts, and properties used to produce and fine-tune the model’s outputs.

In my case, I wanted to start by providing it with just the text it needs to convert to audio. But per Tortoise’s own API, it can be expanded to accept inputs such as voices and their presets (existing or newly trained), as well as multiple parameters that adjust the final output.
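
For illustration, a client-side call might look roughly like the snippet below. The model-input keys (text, voice, preset) are simply the ones I chose for this project, and the run call reflects Banana’s Python SDK (banana_dev) as I understand it at the time of writing, so treat the exact signature as an assumption.

import banana_dev as banana

# Placeholder credentials; the model inputs mirror the schema of my POST / handler.
api_key = "YOUR_API_KEY"
model_key = "YOUR_MODEL_KEY"

model_inputs = {
    "text": "Then came the night of the first falling star.",
    "voice": "train_dotrice",  # one of Tortoise's bundled voices
    "preset": "fast",          # trades some quality for speed
}

out = banana.run(api_key, model_key, model_inputs)
print(out)  # JSON containing the base-64 encoded audio described below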


Because the output is an audio file, there are basically two ways to play it back. The first follows the example of existing text-to-image generative AI programs, where the output is encoded as a base-64 string; here, instead of an image, it encodes an audio sample. The result can then be played back with the <audio controls autoplay src="data:audio/wav;base64,Base+64+string+goes+here" /> HTML element. The other way is to store the generated audio file on some server (e.g., AWS S3, SoundCloud, Google Drive, etc.) and return a path to its location.

Due to the experimental nature of this project, I chose to return the response as a base-64 string, mainly because I did not want to add the complexity of hosting the resulting files elsewhere in the cloud. At this stage, it’s important to point out one of the shortcomings of the Serverless Framework: it’s mostly ephemeral. With the common exception of the logs it produces in real time, it leaves behind very few artifacts (if any) of the kind normally used in other types of software development projects.
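
As a rough sketch (the helper name is mine), the encoding step boils down to reading the generated WAV file and wrapping it in a data URI that the <audio> element can play directly:

import base64

def wav_to_data_uri(path: str) -> str:
    # Read the generated WAV bytes and return a data URI suitable for the
    # src attribute of an <audio> element, so no file hosting is needed.
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return f"data:audio/wav;base64,{encoded}"

# Example: f'<audio controls autoplay src="{wav_to_data_uri("generated.wav")}" />'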

The next challenge I faced was downloading all of the pre-trained models from Hugging Face, which come out to 5.5GB. While Banana offers a convenient download.py script for this exact purpose, I noticed that going down that path made my development efforts considerably slower, because the Docker containers needed to download these models every time they were created. So instead, I put together a new Docker image called deps-for-tts, based on the PyTorch CUDA 11.3 image that I borrowed from Banana’s existing model repos, with all of the models cached inside it.
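
Conceptually, the build-time caching boils down to a script like the one below, invoked from a RUN step in the Dockerfile so the weights land in an image layer. This is my own sketch of the idea rather than the actual contents of Banana’s download.py.

# download.py (sketch): run once while building the Docker image so the
# ~5.5GB of pre-trained Tortoise models are cached in the image instead of
# being re-downloaded every time a container is created.
from tortoise.api import TextToSpeech

if __name__ == "__main__":
    # Constructing TextToSpeech pulls the model weights from Hugging Face
    # into the local cache if they are not already present.
    TextToSpeech()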

With the image and framework in tow, it was a relatively straightforward matter of creating a server and exposing a couple of predefined routes: one GET /healthcheck to ensure that everything is running smoothly and that the GPU is ready, and another POST / that takes the model inputs and passes them to the application (a rough sketch follows the list below). Given that my code is open source, I see no reason to go into the technical details of its implementation. Instead, I encourage you to take a look at the following files to see how everything comes together:

  • server.py — the aforementioned REST server and route handler that follows the Serverless Framework best practices
  • app.py — the application handler that calls the AI program, waits for its output, and encodes it for consumption on the frontend
  • tortoise/api.py — (unmodified) the entrypoint to Tortoise
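
For a sense of the layout, here is a minimal Flask-style sketch of those two routes. It is illustrative only: the real server.py follows Banana’s Serverless Framework template, which differs in the framework used and in details such as warm-up logic and handler names.

import base64

from flask import Flask, jsonify, request

server = Flask(__name__)

def run_inference(model_inputs: dict) -> dict:
    # Stand-in for app.py: call Tortoise with the model inputs, read the
    # resulting WAV bytes, and return them as a base-64 string.
    wav_bytes = b""  # in the real handler, this comes from Tortoise
    return {"audio": base64.b64encode(wav_bytes).decode("ascii")}

@server.get("/healthcheck")
def healthcheck():
    # Confirm the server is up; the real handler also checks GPU readiness.
    return jsonify({"state": "healthy", "gpu": True})

@server.post("/")
def inference():
    model_inputs = request.get_json(force=True)
    return jsonify(run_inference(model_inputs))

if __name__ == "__main__":
    server.run(host="0.0.0.0", port=8000)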

Results

Below is a comparison of the logs between my local development machine and what took place on Banana. My testing was done on an i5-3330 CPU with 16GB of DDR3 RAM and, as the main component, an Nvidia GeForce RTX 3060 graphics card that I bought specifically for this purpose.

Local

Generating autoregressive samples..
100%|█████████| 12/12 [01:04<00:00, 5.41s/it]
Computing best candidates using CLVP
100%|█████████| 12/12 [00:05<00:00, 2.08it/s]
Transforming autoregressive outputs into audio..
100%|█████████| 80/80 [00:12<00:00, 6.56it/s]

Banana

Generating autoregressive samples..
33%|███▎ | 32/96 [2:26:21<5:20:02, 300.04s/it]
67%|██████▋ | 64/96 [5:05:15<2:48:35, 316.11s/it]
100%|██████████| 96/96 [7:31:45<00:00, 282.35s/it]
Computing best candidates using CLVP

Transforming autoregressive outputs into audio..

Of note…

  1. Compare the number of seconds per iteration (s/it); fewer is better
  2. No output was provided on Banana for the second and third steps (computing via CLVP and transforming the autoregressive outputs)
  3. Instead of showing every step from 0–100%, I truncated Banana’s output for brevity
  4. 7½ hours of GPU time at roughly 0.05¢ per second comes out to about $14 (27,000 seconds × $0.00052 ≈ $14)

I’m not comfortable including the actual audio samples that Tortoise generated, since I didn’t do anything to improve or modify the underlying architecture of that project. But if you’re interested in hearing what Tortoise produces, check out the project’s own examples folder.

Next Steps

Now that I have an understanding of how to deploy AI-powered programs to the cloud, I want to focus on developing and optimizing the machine learning algorithms themselves. My plan is to go through the “Practical Deep Learning for Coders 2022” course. Consider following The Academy for more insights on this topic.
