This guide covers best practices for minimizing latency and maximizing throughput when using the Infercom Inference Service.

Connection pooling

The single most impactful optimization is reusing your client instance across requests. This enables HTTP connection pooling, which skips the TCP and TLS handshake on subsequent calls and can cut network overhead by up to 50% on consecutive requests.

How it works

When you create a new client for every request, each call must establish a fresh TCP connection and negotiate TLS — adding several tens of milliseconds depending on your location and network conditions. By reusing the client, the underlying connection stays open and subsequent requests skip this setup entirely.
from sambanova import SambaNova

# Create the client once
client = SambaNova(
    base_url="https://api.infercom.ai/v1",
    api_key="your-infercom-api-key"
)

# Reuse it for all requests
response_1 = client.chat.completions.create(
    model="Meta-Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Hello"}]
)

response_2 = client.chat.completions.create(
    model="Meta-Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Follow-up question"}]
)
Avoid creating a new client inside loops or request handlers. This forces a new TCP+TLS handshake on every call.
# Avoid this pattern
for message in messages:
    client = SambaNova(base_url="https://api.infercom.ai/v1", api_key="...")
    response = client.chat.completions.create(...)  # New connection each time
Both the SambaNova and OpenAI Python SDKs use httpx under the hood, which automatically manages a connection pool when you reuse the client. httpx's default pool maintains up to 20 keep-alive connections.

Response performance metadata

Every API response includes detailed performance metrics in the usage object. Use these to measure and optimize your application.

Available metrics

Field | Description
----- | -----------
time_to_first_token | Server-side time until the first token is generated (seconds)
total_latency | Server-side total processing time (seconds)
completion_tokens_after_first_per_sec | Output throughput after the first token (tokens/sec)
completion_tokens_per_sec | Overall output throughput, including TTFT (tokens/sec)
total_tokens_per_sec | Combined input + output throughput (tokens/sec)
prompt_tokens | Number of input tokens processed
completion_tokens | Number of output tokens generated

Example response

{
  "usage": {
    "prompt_tokens": 37,
    "completion_tokens": 9,
    "total_tokens": 46,
    "time_to_first_token": 0.092,
    "total_latency": 0.126,
    "completion_tokens_after_first_per_sec": 232.25,
    "completion_tokens_per_sec": 71.16,
    "total_tokens_per_sec": 363.71
  }
}
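The throughput fields can be recomputed from the token counts and latencies. The formulas below are inferred from the field definitions above, not taken from official documentation; applying them to the example response gives values within about 1% of the reported ones, since the latencies in the JSON are rounded.

```python
def derive_throughput(prompt_tokens, completion_tokens,
                      time_to_first_token, total_latency):
    """Recompute the throughput fields from token counts and latencies."""
    decode_time = total_latency - time_to_first_token  # time spent after the first token
    return {
        # tokens generated after the first one, over the decode phase
        "completion_tokens_after_first_per_sec": (completion_tokens - 1) / decode_time,
        # all output tokens over the full server-side latency (includes TTFT)
        "completion_tokens_per_sec": completion_tokens / total_latency,
        # input + output tokens over the full server-side latency
        "total_tokens_per_sec": (prompt_tokens + completion_tokens) / total_latency,
    }

# Values from the example response above
metrics = derive_throughput(37, 9, 0.092, 0.126)
```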

Measuring client vs server latency

To understand how much time is spent on the network versus inference, compare your client-side elapsed time with the server-reported total_latency:
import time
from openai import OpenAI

client = OpenAI(
    base_url="https://api.infercom.ai/v1",
    api_key="your-infercom-api-key"
)

start = time.perf_counter()
response = client.chat.completions.create(
    model="Meta-Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=50
)
client_elapsed = time.perf_counter() - start

server_latency = response.usage.total_latency  # seconds
network_overhead = client_elapsed - server_latency

print(f"Server inference: {server_latency*1000:.0f}ms")
print(f"Client total:     {client_elapsed*1000:.0f}ms")
print(f"Network overhead: {network_overhead*1000:.0f}ms")
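A single measurement is noisy, and the first request over a pooled connection pays the handshake cost that later ones skip. Aggregating several measurements makes this visible. A sketch with synthetic numbers; the two input lists stand in for values collected by running the loop above repeatedly:

```python
import statistics

def overhead_summary(client_times, server_latencies):
    """Summarize network overhead (client elapsed minus server latency), in ms."""
    overheads_ms = [(c - s) * 1000 for c, s in zip(client_times, server_latencies)]
    return {
        "first_ms": overheads_ms[0],                       # usually includes TCP+TLS handshake
        "median_ms": statistics.median(overheads_ms[1:]),  # steady-state overhead
    }

# Synthetic values: the first request pays the handshake, the rest reuse the connection
summary = overhead_summary(
    client_times=[0.240, 0.160, 0.155, 0.158],
    server_latencies=[0.120, 0.120, 0.118, 0.121],
)
```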
A typical network overhead from within the EU is 50–150ms, depending on your proximity to the datacenter and whether connection pooling is active.

Streaming for interactive applications

For chatbots and interactive use cases, streaming delivers the first token to the user faster and provides a more responsive experience.
from sambanova import SambaNova

client = SambaNova(
    base_url="https://api.infercom.ai/v1",
    api_key="your-infercom-api-key"
)

stream = client.chat.completions.create(
    model="Meta-Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    stream=True
)

for chunk in stream:
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)
Streaming does not reduce total processing time, but it significantly improves perceived latency by delivering output as it is generated.
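Streaming also lets you measure time-to-first-token from the client's perspective. A minimal sketch using a simulated stream of text chunks in place of a live response; in real code, each chunk's text would come from chunk.choices[0].delta.content:

```python
import time

def consume_stream(chunks):
    """Collect streamed text and record client-side time to first token."""
    start = time.perf_counter()
    ttft = None
    parts = []
    for text in chunks:
        if text:
            if ttft is None:
                ttft = time.perf_counter() - start
            parts.append(text)
    return ttft, "".join(parts)

def fake_stream():
    # Stand-in for a live stream; each sleep simulates generation time
    for token in ["Quantum ", "computing ", "uses ", "qubits."]:
        time.sleep(0.01)
        yield token

ttft, text = consume_stream(fake_stream())
print(f"Client-side TTFT: {ttft*1000:.0f}ms")
```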

Performance best practices

Practice | Impact | Details
-------- | ------ | -------
Reuse client instances | Avoids repeated TCP/TLS handshakes | Saves several tens of ms per request
Use streaming for UI | Faster perceived response | First token arrives sooner
Set max_tokens appropriately | Avoids unnecessary generation | Don't leave it at the default if you need short responses
Choose the right model | Varies | Smaller models have lower TTFT and higher throughput
Deploy close to the EU datacenter | Lower network round-trip time | Infercom runs in Germany (Equinix Munich)