
Integrate vLLM inference on macOS/iOS with Alamofire and Apple Foundation

Welcome back to our series on using vLLM inference within macOS and iOS applications. This post demonstrates how to establish communication with vLLM using Apple Foundation technology and the Alamofire open source project for lower-level HTTP interactions. Communication with the vLLM server takes place over an OpenAI-compatible Chat Completions endpoint.

Why use the OpenAI API to communicate with vLLM?

The OpenAI API specification, stewarded by OpenAI (the creators of ChatGPT and earlier GPT models), establishes a RESTful standard for interacting with inference servers. Its official documentation defines the endpoints, request/response formats, and overall interaction architecture.

In a previous article, I showed how to use the SwiftOpenAI and MacPaw/OpenAI open source projects to communicate with a vLLM OpenAI-compatible endpoint. Those projects abstracted away the intricacies of the underlying OpenAI REST calls, allowing me to focus solely on their communication abstractions and error handling.

This article, however, will delve into the complete code required for an HTTP REST call without an abstraction layer. It covers:

  • Data encoding for structuring the HTTP REST call
  • Handling network, HTTP, and OpenAI errors
  • Processing HTTP streaming results via server-sent events (SSE)
  • Decoding HTTP response data for application use

The OpenAI API documentation is invaluable

Because I had to determine on my own how to invoke vLLM through its OpenAI-compatible endpoints, I started with a thorough review of the OpenAI API documentation. My immediate focus was the chat completions API endpoint, so I went directly to that section of the documentation.

To initiate an inference request, I first needed to ascertain several key pieces of information:

  • Endpoint URL: The default path is v1/chat/completions.
  • HTTP method: This was straightforward, confirmed as POST.
  • HTTP headers: The required headers were Authorization for the API key and Content-Type set to application/json.
  • HTTP request body: This needed to be in JSON format.

Of these, the HTTP request body is the most complex to work with. However, for a simple chatbot that only handles streaming text, the required values are minimal (a sample request body follows the list):

  • model (String, required): Specifies the vLLM model to be used for the chat request.
  • messages (Array of Objects, required): Represents the prompt sent to the model for inference. Each object must include a role (system, user, or assistant) and content (the message text). Given that the sample code does not support prompt engineering or chat history, only a single user role object with prompt content is needed.
  • stream (Boolean, required for streaming): This is the crucial parameter that distinguishes a streaming request from a non-streaming one. Setting it to true instructs the OpenAI API to deliver the response in chunks as an SSE stream rather than waiting for the complete response generation.
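
For example, a minimal streaming request body for a single prompt looks like the following; the model name is illustrative and depends on what your vLLM server is serving:

    {
      "model": "meta-llama/Llama-3.1-8B-Instruct",
      "messages": [
        { "role": "user", "content": "What is vLLM?" }
      ],
      "stream": true
    }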

After understanding the HTTP request, I focused on the response format, which varies for streaming versus non-streaming requests. As the sample code uses a streaming request, vLLM responds in the SSE format, which is characterized by the following (a sample stream appears after the list):

  • Event stream: The response is a continuous flow of events rather than a single JSON object. Each event can have a distinct format indicating its type.
  • data: prefix: Responses containing actual data events are prefixed with data: and followed by a JSON string representing an OpenAI ChatCompletionChunk object. Multiple chunks can be returned to represent a full inference response.
  • [DONE] marker: The stream concludes with a special data: [DONE] message, signifying the completion of the response and the absence of further JSON data.
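
Trimmed to its essentials, a streamed response on the wire looks something like this (identifiers and content are illustrative):

    data: {"id":"chatcmpl-123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

    data: {"id":"chatcmpl-123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}

    data: {"id":"chatcmpl-123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

    data: [DONE]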

Good information to know, but where are the actual inference results? They are carried in the JSON string that follows each data: prefix in the SSE-formatted response. Each of these JSON strings is a ChatCompletionChunk object containing the following key fields:

  • id: A unique identifier for the entire completion request. This ID will be the same for all chunks belonging to the same streamed response.
  • object: Will always be chat.completion.chunk for streaming responses.
  • choices: An array containing the generated content.
    • index: The index of the choice (useful if you requested multiple choices, but usually 0 for single responses).
    • delta: This is the most critical object. It contains the actual incremental data.
      • role (optional): Appears only in the very first chunk of a streamed response for a choice. It will specify the role of the message being generated (e.g., assistant). Subsequent chunks for the same message will not have this field.
      • content (optional): This is the actual text content chunk. You concatenate these content strings from successive delta objects to build the complete response. It might be missing if the chunk is just sending a role or tool_calls without any immediate text.
      • tool_calls (optional): If you are using function calling, this field will appear in chunks to stream tool calls.
    • finish_reason (optional): This field appears only in the final chunk for a given choice. It indicates why the model stopped generating content. Common values include:
      • stop: The model generated a natural stopping point.
      • length: The model reached the max_tokens limit.
      • content_filter: Content was flagged by OpenAI’s moderation system.
      • tool_calls: The model decided to call a tool.

For each chunk returned (except the last), the inference text to show the user is in the choices.first!.delta.content field of the OpenAIStreamResponse struct defined in the sample source code, which is populated by decoding the ChatCompletionChunk JSON string. The value of that field is appended to the llmResponse variable.
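
As a rough sketch of the shape such a Decodable type takes, a minimal definition might look like the following; the actual OpenAIStreamResponse in the sample code may differ in naming and detail:

    // Minimal sketch of a Decodable type for a chat.completion.chunk payload.
    // The sample code's OpenAIStreamResponse may define more fields and different names.
    struct StreamChunk: Decodable {
        struct Choice: Decodable {
            struct Delta: Decodable {
                let role: String?
                let content: String?
            }
            let index: Int
            let delta: Delta
            let finishReason: String?

            enum CodingKeys: String, CodingKey {
                case index, delta
                case finishReason = "finish_reason"
            }
        }
        let id: String
        let object: String
        let choices: [Choice]
    }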

What else is there to consider? Errors! Specifically, errors that can occur at the URL level and HTTP errors that can be thrown during communication. More on that soon.

How the sample code works

The code using Apple Foundation technology (FoundationChatViewModel.swift) and the code using the Alamofire project (AlamoFireChatViewModel.swift) follow a fairly similar flow, so I'll compare the two in parallel when presenting the vLLM inference call.

Note 

Be sure to have the sample code available. You'll want to reference the sample code from GitHub and set up Xcode to use it. Instructions for this are included in the article appendix.

The sendMessage method, common to both implementations, initiates an inference request. It performs basic error checking before delegating the actual call to the sendLLMRequest method, which handles the specific requirements of Foundation and Alamofire.

The initial step in the inference flow involves constructing the inference request for vLLM. For both implementations, it is crucial to validate the provided URL for the connection and ensure the HTTP request header and body are correctly configured.

The Foundation implementation encapsulates the HTTP request header and body setup within prepareURLRequest. This function returns a URLRequest to initiate a URL session upon success; otherwise, it throws a foundationAIInferenceError.

Conversely, the Alamofire implementation integrates these activities directly into the sendLLMrequest method. Should any error checks fail during this initial phase, an alamoFireAIInferenceError is thrown.
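
For orientation, the request construction in either implementation amounts to roughly the following sketch; the function name, parameters, and error type here are placeholders rather than the sample code's exact definitions:

    import Foundation

    // Placeholder error type; the sample code throws foundationAIInferenceError instead.
    enum ChatRequestError: Error { case invalidURL }

    // Rough sketch of building the chat completions request with Foundation.
    // serverAddress and apiKey are assumed inputs; body is the JSON-encoded request body.
    func makeChatRequest(serverAddress: String, apiKey: String, body: Data) throws -> URLRequest {
        guard let url = URL(string: serverAddress)?.appendingPathComponent("v1/chat/completions") else {
            throw ChatRequestError.invalidURL
        }
        var request = URLRequest(url: url)
        request.httpMethod = "POST"
        request.setValue("Bearer \(apiKey)", forHTTPHeaderField: "Authorization")
        request.setValue("application/json", forHTTPHeaderField: "Content-Type")
        request.httpBody = body
        return request
    }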

The next step involves making an HTTP call to the vLLM server. Alamofire streamlines this process, combining the HTTP request header and body setup with the request execution into a single method: streamRequest. The method returns a DataStreamRequest instance on which several actions are immediately taken (a sketch of the call follows the list):

  1. Validation: The HTTP call’s success is validated by a status return code of 200. Any errors will surface during the asynchronous data stream processing.
  2. Resource management: The DataStreamRequest is closed regardless of how the sendLLMrequest method concludes. This prevents memory leaks within the underlying Alamofire implementation, even across numerous inference requests.
  3. Error handling: An alamoFireAIInferenceError.apiError is thrown if an Alamofire API error occurs, terminating the inference request.
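
A minimal sketch of what that Alamofire call can look like follows; the function name, callback, and error handling are illustrative rather than the sample code's exact implementation:

    import Alamofire

    // Rough sketch of issuing the streaming call with Alamofire.
    // urlRequest is assumed to be a fully configured POST request for v1/chat/completions.
    func streamChat(urlRequest: URLRequest, onText: @escaping (String) -> Void) -> DataStreamRequest {
        let request = AF.streamRequest(urlRequest)
            .validate(statusCode: 200..<201)              // treat anything other than 200 as an error
            .responseStreamString { stream in
                switch stream.event {
                case .stream(let result):
                    if case .success(let text) = result {
                        onText(text)                       // raw SSE text; parsed later in the flow
                    }
                case .complete(let completion):
                    if let error = completion.error {
                        print("stream failed: \(error)")   // sample code throws alamoFireAIInferenceError here
                    }
                }
            }
        return request
    }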

In the Foundation implementation, the setupURLSession method manages the HTTP request by creating a URLSession for network data transfer. For our basic streaming REST call, the convenient URLSession.shared singleton can be used. Some might question this, since one limitation of the shared session is that it cannot obtain data incrementally. However, because the stream data is returned in chunks adhering to the SSE protocol (part of the HTML Living Standard), that restriction doesn't apply.
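
A minimal sketch of that Foundation flow, assuming the URLRequest built earlier, looks like this; the sample code's setupURLSession adds its own error types and chunk processing:

    import Foundation

    // Rough sketch of streaming the response with URLSession's async bytes API.
    // request is the URLRequest built earlier; handleLine stands in for the SSE processing shown later.
    func streamChat(request: URLRequest, handleLine: (String) -> Void) async throws {
        let (bytes, response) = try await URLSession.shared.bytes(for: request)
        guard let http = response as? HTTPURLResponse, http.statusCode == 200 else {
            throw URLError(.badServerResponse)   // the sample code throws a foundationAIInferenceError here
        }
        for try await line in bytes.lines {      // each line is one SSE event line
            handleLine(line)
        }
    }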

With the HTTP request initiated, I can now proceed with result handling and error verification, starting with HTTP error checking.

For the Alamofire implementation, the code processes streaming strings from the vLLM server that may contain either error information or data to process. Alamofire was a bit confusing here due to a lack of documentation and examples. I implemented basic error checking, but I'm not entirely confident it covers every case. The error checking process involves these steps:

  1. Check whether dataStreamRequest.error contains a non-nil value, and throw an error if it does. (I never observed this code executing.)
  2. Check whether stringStream.completion is non-nil and contains a completion error. If it does, throw an error.
  3. Check whether the returned string starts with an integer value, and whether that value is one of the known OpenAI error codes. If so, throw an error.

Error checking was more straightforward for the Foundation implementation because errors and response data could be accessed independently. Potential errors manifest in two stages: initially during the network connection established in setupURLSession, and subsequently as HTTP errors within the sendLLMrequest method. The latter are captured within the for try await line in bytes.lines block of code. In this block, the vLLM server returns JSON-encoded strings containing error information, which are then captured, decoded, and thrown as a foundationAIInferenceError.apiError.

Once network and HTTP errors are addressed, both implementations proceed to process the SSE data, which is returned as a string. OpenAI chat completions format these SSE strings to denote various return values. For simplicity, the code only needs to process strings that begin with data: and do not include the value [DONE]. All other strings can be disregarded.
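
A minimal sketch of that filtering logic, written as a standalone helper rather than the sample code's exact implementation:

    import Foundation

    // Rough sketch of the SSE line filtering both implementations perform:
    // only data: payloads other than [DONE] are returned for decoding.
    func extractChunkJSON(from line: String) -> String? {
        let prefix = "data: "
        guard line.hasPrefix(prefix) else { return nil }      // ignore non-data events
        let payload = String(line.dropFirst(prefix.count)).trimmingCharacters(in: .whitespaces)
        return payload == "[DONE]" ? nil : payload            // [DONE] marks the end of the stream
    }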

If a valid string is available for processing, both implementations handle the chat completion chunk within their respective processChatCompletionChunk methods. The initial step in these methods is to check for any errors returned by OpenAI, and if an error is present, it is converted to an error struct and thrown.
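
The error check amounts to attempting to decode an OpenAI-style error envelope before treating the chunk as content. A rough sketch follows, with field names taken from the OpenAI error format rather than the sample code's own structs:

    import Foundation

    // Rough sketch of checking a chunk for an OpenAI-style error envelope before decoding content.
    // Field names follow the OpenAI error format; the sample code's error structs may differ.
    struct APIErrorEnvelope: Decodable {
        struct APIError: Decodable {
            let message: String
            let type: String?
            let code: String?
        }
        let error: APIError
    }

    // Returns the error message if the payload is an error envelope, or nil if it is a normal chunk.
    func apiErrorMessage(in chunkData: Data) -> String? {
        (try? JSONDecoder().decode(APIErrorEnvelope.self, from: chunkData))?.error.message
    }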

Error codes vary between Foundation and Alamofire implementations

The Foundation code provides the specific HTTP error code for OpenAI API errors, which is then included in any thrown errors. However, with Alamofire, I could not find a method to access the HTTP error code, so a generic -1 error code is used instead.

The final step involves managing a successful return. In both scenarios, the text or chunk received from vLLM is added to the llmResponse variable. This action subsequently triggers an update to the SwiftUI view. As more text chunks are processed similarly, the SwiftUI updates create the illusion of the vLLM response being typed in real-time, when in fact, the code is simply handling numerous small text segments.
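
Conceptually, the view model publishes llmResponse and SwiftUI re-renders the view each time it grows. A minimal sketch of that pattern (not the sample code's actual view models) looks like this:

    import SwiftUI

    // Minimal sketch of how appending chunks to a published property drives the SwiftUI view.
    // The sample code's FoundationChatViewModel and AlamoFireChatViewModel are more involved.
    final class ChatViewModel: ObservableObject {
        @Published var llmResponse = ""

        func append(chunk: String) {
            llmResponse += chunk          // each appended chunk triggers a view update
        }
    }

    struct ChatView: View {
        @StateObject private var viewModel = ChatViewModel()

        var body: some View {
            ScrollView {
                Text(viewModel.llmResponse)   // re-renders as chunks arrive, giving a "typing" effect
            }
        }
    }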

Lower-level coding observations

Directly engaging with the OpenAI API to understand its detailed workings and then implementing those details myself in code was a surprisingly enjoyable and educational experience. I recognize that diving into such details might seem unnecessary when purpose-built API wrapper projects like SwiftOpenAI and MacPaw/OpenAI offer significant advantages:

  • Ease of use: Both projects simplify integration with intuitive APIs, drastically reducing coding time for OpenAI calls.
  • Active maintenance: Both projects are regularly updated, ensuring compatibility with evolving API specifications and rigorous testing of released code.

Despite the added effort, creating lower-level code with Apple Foundation technology offered distinct benefits:

  • No external dependencies: This allows for the use of the latest Apple technologies and adherence to best practices, such as Swift structured concurrency.
  • Full control: Every aspect of the server connection, including HTTP requests (headers, timeouts, retries) and response parsing, is fully manageable.
  • Robust error handling: This approach provides comprehensive error handling capabilities.
  • Faster execution: Optimized code limited in scope to the chatbot sample led to faster execution, in my experience.

Choosing a path for using the OpenAI API with Swift

When integrating the OpenAI API with Swift, I’m at a crossroads. While I value the granular control offered by low-level networking and error handling, the complexity and frequent updates of the OpenAI API present a significant challenge.

Therefore, for applications requiring extensive use of the OpenAI API, I lean towards the SwiftOpenAI project. In my opinion, its user-friendliness surpasses that of MacPaw/OpenAI, and it boasts an active and responsive community.

Conversely, for applications that only need a limited subset of the OpenAI API, I prefer to leverage Apple Foundation technology. Despite Alamofire’s popularity, I haven’t found its simplification benefits compelling, especially since I don’t perceive Apple Foundation technology as particularly difficult to work with.

Summary

This article explored two distinct methods for integrating vLLM inference into macOS and iOS applications via an OpenAI-compatible endpoint: the granular control offered by Apple Foundation and the more streamlined experience provided by the Alamofire project. Both approaches successfully facilitate communication with vLLM, but they present different trade-offs regarding complexity, control, and external dependencies.

Unlike a previous article that focused on SwiftOpenAI and MacPaw/OpenAI, this piece demonstrated the use of lower-level code for communicating with an OpenAI-compatible endpoint. Ultimately, the decision between employing a low-level networking approach with Apple Foundation or an abstraction layer hinges on your project’s specific requirements and your comfort with managing HTTP complexities. 

Apple Foundation is a robust and rewarding option for simpler applications or those demanding maximum control and minimal dependencies. Conversely, for more intricate applications that heavily interact with the OpenAI API and prioritize rapid development, well-maintained API wrappers offer considerable benefits. 

Regardless of the chosen method, a thorough understanding of the underlying OpenAI API specification is crucial for effective and efficient integration of large language models into your applications.

Ready to build your own vLLM-powered macOS and iOS applications?

Get started by cloning the GitHub repository containing sample code. Experiment with Llama Stack, SwiftOpenAI, MacPaw/OpenAI, Alamofire, and Apple Foundation sample code to grasp the fundamentals of integrating powerful AI inference into your projects.

For assistance with cloning the sample code, building the developer documentation, or running the application, refer to the instructions provided in the article appendix.

Now go have some coding fun!

Appendix

Get the sample source code set up

To support this article's learning objectives, sample source code is available in a GitHub repository. To get started with that code, you'll need to clone it to a macOS computer with Xcode 16.4 or later installed. Xcode will be used to review, edit, and compile the code.

Watch the following interactive tutorial on how to clone the repository, and refer to the text describing the steps that follow it.

Here are the steps to follow to clone the repository using Xcode so you can start using the project (a command-line alternative follows the list):

  1. Make sure you have set your source code control preferences in Xcode.
  2. Select the Clone Git Repository option in the Xcode startup dialog, or the Clone… option under the Integrate menu in the main menu bar.
  3. In the dialog box that appears, enter https://github.com/RichNasz/vLLMSwiftApp.git into the search bar at the top of the dialog box.
  4. In the file dialog that appears, choose the folder you want to clone the repository into.
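
If you prefer the command line, you can clone the same repository in Terminal and then open the project in Xcode:

    git clone https://github.com/RichNasz/vLLMSwiftApp.git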

Build the project and developer documentation

Once the cloning process is complete, a new Xcode project will open and immediately start building the project. Stop that build using the Product -> Stop menu option. Once that is done, check the following items to make sure you get a clean build:

  1. During your first build of the project, a dialog box asking you to enable and trust the OpenAPIGenerator extension will pop up. Select the option to Trust & Enable when prompted. If you don’t do this, code for the SDK can’t be generated, and the code won’t work.
  2. You need to have your Apple developer account set up in Xcode so that projects you work on can be signed and used on physical devices. You can create a free Apple Developer account, or use an existing account.
    1. First, make sure your developer account is set via Xcode Settings… -> Accounts.
    2. Select the vLLMSwiftApp in the project navigator, and then set the values in the Signing & Capabilities section of the vLLMSwiftApp target.
      1. Select the check box for Automatically manage signing.
      2. Set the team to your personal team.
      3. Choose a unique bundle identifier for your build.

Once you have verified the critical items identified above, go ahead and clean the build folder using the Product -> Clean Build Folder… menu option. Then start a new build using the Product -> Build menu option. The initial build will take a while to complete since all package dependencies (such as SwiftOpenAI and MacPaw/OpenAI) must be downloaded before a full build starts. Provided the build completes without error, you can now run the sample chatbot.

In addition to building the source code, build the developer documentation by selecting the Product -> Build Documentation menu item in Xcode. This build takes a minute or two to complete, and you can open the generated Developer Documentation using the Xcode Help -> Developer Documentation menu item. Once the documentation is open, select the vLLMSwiftApp item, and then the vLLM Swift Application (vLLMSwiftApp) item in the sidebar on the left of the help screen. We’ll use the generated documentation to help simplify the code review process.

Run the application

Provided the build completes without error you can now run the sample chatbot by selecting the Product -> Run menu item. If you want to see a quick tour of the application, watch the interactive demo, and then run your own local copy.
