Simplify Image Resolution In Remote Agent Server: A Guide

Nov 7, 2025 by Admin

Hey guys! Let's dive into how we can make the image resolution logic in our remote agent server example way simpler. Currently, the 04_convo_with_api_sandboxed_server.py example has some pretty complex stuff going on to figure out which agent-server Docker image to use. But don't worry, we're going to break it down and make it much easier to manage. This guide will walk you through the current challenges and how we plan to simplify things once the Runtime API supports customizable image pull policies. So, let's get started!

The Current Complexity

Right now, the example includes a few key complexities that we want to address. Here’s a quick rundown:

get_latest_commit_sha() Function: This function queries the GitHub API to fetch the latest commit SHA. It's a bit of a detour just to get the right image.
Conditional Logic for GITHUB_SHA: The code checks for the GITHUB_SHA environment variable (common in CI environments) and falls back to fetching from the main branch if it’s not set. This adds extra layers of decision-making.
Manual Image Tag Construction: The image tag is built by hand from the short commit SHA plus a hardcoded -python-amd64 suffix. This means we’re handling details that could be automated.

Digging into the Code

Let's take a closer look at the problematic code block. This snippet, found in examples/02_remote_agent_server/04_convo_with_api_sandboxed_server.py (lines 49-79), shows the current image resolution logic:

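# Not part of lines 49-79: imports and a logger the excerpt needs to run on its own.
import logging
import os

import requests

logger = logging.getLogger(__name__)


def get_latest_commit_sha(
    repo: str = "OpenHands/software-agent-sdk", branch: str = "main"
) -> str: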
    """
    Return the full SHA of the latest commit on `branch` for the given GitHub repo.
    Respects an optional GITHUB_TOKEN to avoid rate limits.
    """
    url = f"https://api.github.com/repos/{repo}/commits/{branch}"
    headers = {}
    token = os.getenv("GITHUB_TOKEN") or os.getenv("GH_TOKEN")
    if token:
        headers["Authorization"] = f"Bearer {token}"

    resp = requests.get(url, headers=headers, timeout=20)
    if resp.status_code != 200:
        raise RuntimeError(f"GitHub API error {resp.status_code}: {resp.text}")
    data = resp.json()
    sha = data.get("sha")
    if not sha:
        raise RuntimeError("Could not find commit SHA in GitHub response")
    logger.info(f"Latest commit on {repo} branch={branch} is {sha}")
    return sha


# If GITHUB_SHA is set (e.g. running in CI of a PR), use that to ensure consistency
# Otherwise, get the latest commit SHA from main branch (images are built on main)
server_image_sha = os.getenv("GITHUB_SHA") or get_latest_commit_sha(
    "OpenHands/software-agent-sdk", "main"
)
server_image = f"ghcr.io/openhands/agent-server:{server_image_sha[:7]}-python-amd64"

This code first defines a function, get_latest_commit_sha(), which fetches the latest commit SHA from a GitHub repository. It constructs a URL to the GitHub API, sends a request with a 20-second timeout, and parses the JSON response to extract the SHA. It also sends a GitHub token (GITHUB_TOKEN or GH_TOKEN) when one is set, which helps avoid rate limits; a nice touch, but it adds complexity. If the response has a non-200 status code or contains no SHA, the function raises a RuntimeError. Error handling is crucial, but it also increases the amount of code we need to manage.

The subsequent lines use this function to determine which server image to use. The code first checks the GITHUB_SHA environment variable, which is often set in CI environments (GitHub Actions sets it by default). If the variable is present, its value is used to ensure consistency across builds. If not, the code calls get_latest_commit_sha() to fetch the latest SHA from the main branch. Finally, it builds the server image name from the first seven characters of the SHA and a hardcoded -python-amd64 suffix (variant plus architecture). Building the image name by hand adds another layer of complexity and potential for errors.
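To make the resolution order easier to see, here is a minimal sketch that isolates it in a small pure function, so it can be exercised without touching the network. This is not code from the example: resolve_server_image and its env and fetch_latest_sha parameters are hypothetical names introduced for illustration. Only the registry path and the tag suffix are copied from the snippet above.

```python
from typing import Callable, Mapping

IMAGE_REPO = "ghcr.io/openhands/agent-server"  # copied from the snippet above
TAG_SUFFIX = "python-amd64"  # variant and architecture, hardcoded as in the example


def resolve_server_image(
    env: Mapping[str, str],
    fetch_latest_sha: Callable[[], str],
) -> str:
    """Hypothetical helper: prefer GITHUB_SHA from `env`, else call the fallback."""
    # An empty GITHUB_SHA counts as unset, matching `os.getenv(...) or ...`.
    sha = env.get("GITHUB_SHA") or fetch_latest_sha()
    # The image tag uses only the first seven characters of the SHA.
    return f"{IMAGE_REPO}:{sha[:7]}-{TAG_SUFFIX}"


# With GITHUB_SHA set, the fallback is never called; without it, the fallback supplies the SHA.
ci_image = resolve_server_image({"GITHUB_SHA": "0123456789abcdef"}, lambda: "unused")
local_image = resolve_server_image({}, lambda: "fedcba9876543210")
print(ci_image)
print(local_image)
```

Passing the environment and the fetcher as arguments is purely a choice made for this sketch; the example itself reads os.environ and calls the GitHub API directly.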

Why Is This Complex?

Fetching the latest commit SHA and constructing the image tag manually adds significant complexity. For developers trying to understand the example, this logic can be a major detour. It's not immediately clear why all this is necessary, and it shifts the focus away from the core functionality of the agent server.

Additionally, relying on the GitHub API to fetch the latest commit SHA introduces a dependency on an external service. This can make the example more brittle and prone to failures due to rate limits, network issues, or changes in the GitHub API. Using environment variables like GITHUB_SHA is a good practice for CI environments, but the fallback to fetching from the main branch adds complexity that we can avoid.

The manual construction of the image tag is also a potential source of errors. Hardcoding the amd64 architecture in the -python-amd64 suffix makes the example less flexible: on an arm64 host, for instance, the image would need emulation to run. It also requires developers to understand the naming conventions used for the Docker images, which can be a barrier to entry.
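To illustrate what the hardcoded architecture costs, here is a rough sketch of what the example would need if it kept building tags by hand but wanted to match the host. It uses only the standard library's platform.machine(). The docker_arch and image_tag helpers are hypothetical, and it is an assumption that images for other architectures follow the same -python-<arch> naming pattern.

```python
import platform
from typing import Optional

# Map common platform.machine() values to Docker-style architecture names.
_ARCH_ALIASES = {
    "x86_64": "amd64",
    "amd64": "amd64",
    "aarch64": "arm64",
    "arm64": "arm64",
}


def docker_arch(machine: Optional[str] = None) -> str:
    """Return a Docker-style architecture name for `machine` (default: this host)."""
    name = (machine or platform.machine()).lower()
    try:
        return _ARCH_ALIASES[name]
    except KeyError:
        raise ValueError(f"Unsupported architecture: {name!r}") from None


def image_tag(short_sha: str, machine: Optional[str] = None) -> str:
    # Assumption: other architectures are published as "<sha>-python-<arch>".
    return f"{short_sha}-python-{docker_arch(machine)}"


print(image_tag("abc1234", "x86_64"))
print(image_tag("abc1234", "aarch64"))
```

This is exactly the kind of extra bookkeeping the example should not have to carry, which is part of the case for moving image selection out of the example altogether.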

The Proposed Simplification

So, how can we make this better? The key is leveraging the Runtime API's customizable image pull policies. Once this feature is available (tracked in https://github.com/OpenHands/runtime-api/pull/356), we can dramatically simplify the example.

The plan is to:

Remove get_latest_commit_sha(): Say goodbye to this function! We won't need it anymore.
Use a Simpler Image Tag: Instead of constructing tags based on commit SHAs, we can use stable tags like latest or version-based tags. This makes things much cleaner.
Let the Runtime API Handle Image Pulling: The Runtime API will take care of pulling images with the appropriate policies, so we don't have to manage this logic in the example.
Reduce Cognitive Load: By simplifying the image resolution logic, we make the example easier to understand and focus on the important parts.

Benefits of Simplification

Simplifying the image resolution logic offers several key benefits:

Reduced Complexity: The most obvious benefit is a simpler codebase. By removing the get_latest_commit_sha() function and the logic for constructing image tags, we can significantly reduce the complexity of the example. This makes it easier for developers to understand and modify the code.
Improved Readability: A simpler codebase is also more readable. When developers don't have to wade through complex image resolution logic, they can focus on the core functionality of the agent server. This makes the example more accessible to newcomers and reduces the barrier to entry.
Increased Maintainability: Simpler code is easier to maintain. When the image resolution logic is handled by the Runtime API, we don't have to keep the example up to date with changes in the image build process, such as a new tag format. This reduces the maintenance burden and makes the example more robust.
Better Focus on Core Functionality: By removing the distraction of complex image resolution logic, we can focus on demonstrating the core functionality of the agent server. This makes the example more effective as a learning tool and showcases the capabilities of the software-agent-sdk.
Reduced Dependency on External Services: By using stable image tags and letting the Runtime API handle image pulling, we reduce the dependency on external services like the GitHub API. This makes the example more resilient to network issues and rate limits.

The Simplified Approach: A Closer Look

Let’s break down how the simplified approach will work. Instead of dynamically fetching the latest commit SHA and constructing the image tag, we'll use a stable tag, such as latest or a version-based tag (e.g., v1.0.0). This immediately simplifies the image name, making it easier to read and understand.

With the Runtime API handling image pulling, we can rely on its configuration to ensure the correct image is pulled. This means we no longer need to include logic in the example to check for environment variables or interact with the GitHub API. The Runtime API can be configured to always pull the latest image, pull if not present locally, or use other policies as needed. This moves the complexity out of the example code and into the Runtime API's configuration, where it belongs.

This approach significantly reduces the amount of code needed in the example, making it cleaner and more focused. Developers can then concentrate on the core concepts of the agent server without getting bogged down in image management details.
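As a rough picture of where this is heading, the sketch below shows what the image setup in the example might shrink to. Everything in it is hypothetical: the pull-policy feature has not landed yet, so ImagePullPolicy, ServerImageConfig, the payload field names, and the latest tag are placeholders chosen for illustration, not the Runtime API's actual schema or a published image tag.

```python
from dataclasses import dataclass
from enum import Enum


class ImagePullPolicy(str, Enum):
    """Hypothetical policy names, modeled on the options described above."""

    ALWAYS = "always"  # pull the image every time before starting
    IF_NOT_PRESENT = "if_not_present"  # pull only when the image is missing locally


@dataclass(frozen=True)
class ServerImageConfig:
    """Hypothetical settings the example would hand to the Runtime API client."""

    image: str
    pull_policy: ImagePullPolicy = ImagePullPolicy.ALWAYS

    def to_payload(self) -> dict:
        # Field names are placeholders, not the real Runtime API schema.
        return {"image": self.image, "image_pull_policy": self.pull_policy.value}


# Illustrative stable tag; the tag that will actually be used is not settled.
config = ServerImageConfig(image="ghcr.io/openhands/agent-server:latest")
print(config.to_payload())
```

Pairing a moving tag such as latest with an always-pull policy is what would let the example drop the commit SHA lookup: freshness becomes the runtime's responsibility rather than the example's.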

What Needs to Happen?

To make this simplification a reality, we have a few tasks to complete:

Wait for runtime-api#356 to be merged: This PR needs to be finalized and integrated into the Runtime API. This is the linchpin of our simplification strategy.
Update Runtime API Client: We need to update the client to support image pull policy configuration. This will allow us to specify how images should be pulled.
Simplify the Example: The fun part! We’ll remove the complex image resolution logic from the example.
Consider Moving Utility Functions: If we still have any image utility functions lingering, we might move them to a shared utility module for other examples to use.

Timeline and Dependencies

The timeline for these tasks is largely dependent on the progress of runtime-api#356. Once this pull request is merged, we can move forward with updating the Runtime API client and simplifying the example. We'll be closely monitoring the progress of the pull request and coordinating our efforts accordingly.

The key dependency here is the Runtime API. Without the ability to configure image pull policies, we can't effectively simplify the example. This is why waiting for the merge of runtime-api#356 is the critical first step. Once we have that in place, the remaining tasks should be relatively straightforward.

Collaboration and Communication

Collaboration and communication will be essential throughout this process. We'll need to work closely with the Runtime API team to ensure that the image pull policy configuration is implemented in a way that meets our needs. We'll also need to keep the community informed of our progress and any potential changes to the example.

We plan to use GitHub issues and pull requests to track our progress and facilitate discussion. We'll also provide regular updates in our community channels, such as our Slack workspace and mailing list. This will help ensure that everyone is aware of the changes and has the opportunity to provide feedback.

References

For more context, you can check out these links:

Runtime API PR: https://github.com/OpenHands/runtime-api/pull/356
Related PR: https://github.com/OpenHands/software-agent-sdk/pull/1090
Comment Thread: https://github.com/OpenHands/software-agent-sdk/pull/1090#discussion_r1872685471

Conclusion

Simplifying the image resolution logic in the remote agent server example is a big win for clarity and maintainability. By leveraging the Runtime API's image pull policies, we can make the example easier to understand and focus on the core functionality. This will benefit developers who are new to the software-agent-sdk and make the example more robust and easier to maintain.

We're excited about this simplification and the improvements it will bring. Stay tuned for updates as we work through these tasks! And remember, keeping things simple often leads to the best results. Let's make this example shine!