This guide covers best practices for minimizing latency and maximizing throughput when using the Infercom Inference Service.

Connection pooling

The single most impactful optimization is reusing your client instance across requests. This enables HTTP connection pooling, which skips the TCP and TLS handshake on subsequent calls and can cut network overhead by up to 50% on consecutive requests.

How it works

When you create a new client for every request, each call must establish a fresh TCP connection and negotiate TLS — adding several tens of milliseconds depending on your location and network conditions. By reusing the client, the underlying connection stays open and subsequent requests skip this setup entirely.
from sambanova import SambaNova

# Create the client once
client = SambaNova(
    base_url="https://api.infercom.ai/v1",
    api_key="your-infercom-api-key"
)

# Reuse it for all requests
response_1 = client.chat.completions.create(
    model="Meta-Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Hello"}]
)

response_2 = client.chat.completions.create(
    model="Meta-Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Follow-up question"}]
)
Avoid creating a new client inside loops or request handlers. This forces a new TCP+TLS handshake on every call.
# Avoid this pattern
for message in messages:
    client = SambaNova(base_url="https://api.infercom.ai/v1", api_key="...")
    response = client.chat.completions.create(...)  # New connection each time
Both the SambaNova and OpenAI Python SDKs use httpx under the hood, which automatically manages a connection pool when you reuse the client. httpx's default pool maintains up to 20 keep-alive connections.

Response performance metadata

Every API response includes detailed performance metrics in the usage object. Use these to measure and optimize your application.

Available metrics

Field | Description
----- | -----------
time_to_first_token | Server-side time until the first token is generated (seconds)
total_latency | Server-side total processing time (seconds)
completion_tokens_after_first_per_sec | Output throughput after the first token (tokens/sec)
completion_tokens_per_sec | Overall output throughput, including TTFT (tokens/sec)
total_tokens_per_sec | Combined input + output throughput (tokens/sec)
prompt_tokens | Number of input tokens processed
completion_tokens | Number of output tokens generated

Example response

{
  "usage": {
    "prompt_tokens": 37,
    "completion_tokens": 9,
    "total_tokens": 46,
    "time_to_first_token": 0.092,
    "total_latency": 0.126,
    "completion_tokens_after_first_per_sec": 232.25,
    "completion_tokens_per_sec": 71.16,
    "total_tokens_per_sec": 363.71
  }
}
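The throughput fields can be recomputed from the token counts and latencies. The formulas below are inferred from the field definitions above, not taken from official documentation; applying them to the example response gives values within about 1% of the reported ones, since the latencies in the JSON are rounded.

```python
def derive_throughput(prompt_tokens, completion_tokens,
                      time_to_first_token, total_latency):
    """Recompute the throughput fields from token counts and latencies."""
    decode_time = total_latency - time_to_first_token  # time spent after the first token
    return {
        # tokens generated after the first one, over the decode phase
        "completion_tokens_after_first_per_sec": (completion_tokens - 1) / decode_time,
        # all output tokens over the full server-side latency (includes TTFT)
        "completion_tokens_per_sec": completion_tokens / total_latency,
        # input + output tokens over the full server-side latency
        "total_tokens_per_sec": (prompt_tokens + completion_tokens) / total_latency,
    }

# Values from the example response above
metrics = derive_throughput(37, 9, 0.092, 0.126)
```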

Measuring client vs server latency

To understand how much time is spent on the network versus inference, compare your client-side elapsed time with the server-reported total_latency:
import time
from openai import OpenAI

client = OpenAI(
    base_url="https://api.infercom.ai/v1",
    api_key="your-infercom-api-key"
)

start = time.perf_counter()
response = client.chat.completions.create(
    model="Meta-Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=50
)
client_elapsed = time.perf_counter() - start

server_latency = response.usage.total_latency  # seconds
network_overhead = client_elapsed - server_latency

print(f"Server inference: {server_latency*1000:.0f}ms")
print(f"Client total:     {client_elapsed*1000:.0f}ms")
print(f"Network overhead: {network_overhead*1000:.0f}ms")
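A single measurement is noisy, and the first request over a pooled connection pays the handshake cost that later ones skip. Aggregating several measurements makes this visible. A sketch with synthetic numbers; the two input lists stand in for values collected by running the loop above repeatedly:

```python
import statistics

def overhead_summary(client_times, server_latencies):
    """Summarize network overhead (client elapsed minus server latency), in ms."""
    overheads_ms = [(c - s) * 1000 for c, s in zip(client_times, server_latencies)]
    return {
        "first_ms": overheads_ms[0],                       # usually includes TCP+TLS handshake
        "median_ms": statistics.median(overheads_ms[1:]),  # steady-state overhead
    }

# Synthetic values: the first request pays the handshake, the rest reuse the connection
summary = overhead_summary(
    client_times=[0.240, 0.160, 0.155, 0.158],
    server_latencies=[0.120, 0.120, 0.118, 0.121],
)
```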
A typical network overhead from within the EU is 50–150ms, depending on your proximity to the datacenter and whether connection pooling is active.

Streaming for interactive applications

For chatbots and interactive use cases, streaming delivers the first token to the user faster and provides a more responsive experience.
from sambanova import SambaNova

client = SambaNova(
    base_url="https://api.infercom.ai/v1",
    api_key="your-infercom-api-key"
)

stream = client.chat.completions.create(
    model="Meta-Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    stream=True
)

for chunk in stream:
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)
Streaming does not reduce total processing time, but it significantly improves perceived latency by delivering output as it is generated.
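Streaming also lets you measure time-to-first-token from the client's perspective. A minimal sketch using a simulated stream of text chunks in place of a live response; in real code, each chunk's text would come from chunk.choices[0].delta.content:

```python
import time

def consume_stream(chunks):
    """Collect streamed text and record client-side time to first token."""
    start = time.perf_counter()
    ttft = None
    parts = []
    for text in chunks:
        if text:
            if ttft is None:
                ttft = time.perf_counter() - start
            parts.append(text)
    return ttft, "".join(parts)

def fake_stream():
    # Stand-in for a live stream; each sleep simulates generation time
    for token in ["Quantum ", "computing ", "uses ", "qubits."]:
        time.sleep(0.01)
        yield token

ttft, text = consume_stream(fake_stream())
print(f"Client-side TTFT: {ttft*1000:.0f}ms")
```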

Performance best practices

Practice | Impact | Details
-------- | ------ | -------
Reuse client instances | Avoids repeated TCP/TLS handshakes | Saves several tens of ms per request
Use streaming for UI | Faster perceived response | First token arrives sooner
Set max_tokens appropriately | Avoids unnecessary generation | Don't leave it at the default if you need short responses
Choose the right model | Varies | Smaller models have lower TTFT and higher throughput
Deploy close to the EU datacenter | Lower network round-trip time | Infercom runs in Germany (Equinix Munich)