> ## Documentation Index
> Fetch the complete documentation index at: https://wb-21fd5541-mintlify-9ec128a6.mintlify.site/llms.txt
> Use this file to discover all available pages before exploring further.

> Evaluate model checkpoints or hosted API models within W&B and analyze the results using automatically generated leaderboards.

# LLM Evaluation Jobs

[LLM Evaluation Jobs](/models/launch) is a benchmarking framework for evaluating an LLM model's performance, using infrastructure managed by CoreWeave. Choose from a comprehensive suite of modern, industry-standard [model evaluation benchmarks](/models/launch/evaluations), then view, analyze, and share the results using automatic leaderboards and charts in W\&B Models. LLM Evaluation Jobs removes the complexity of deploying and maintaining GPU infrastructure yourself.

<Note>
  LLM Evaluation Jobs is in **Preview** for [W\&B Multi-tenant Cloud](/platform/hosting/hosting-options/multi_tenant_cloud). Compute is free during the preview period. [Learn more](/models/launch#pricing)
</Note>

## How it works

Evaluate a model checkpoint or a publicly accessible hosted OpenAI-compatible model in just a few steps:

1. Set up an evaluation job in W\&B Models. Define its benchmarks and configuration, such as whether to generate a leaderboard.
2. Launch the evaluation job.
3. View and analyze the results and leaderboard.

Each time you launch an evaluation job with the same destination project, the project's leaderboard updates automatically.

<Frame>
  <img src="https://mintcdn.com/wb-21fd5541-mintlify-9ec128a6/n8ammsXXZSklHHXd/images/models/llm-evaluation-jobs/model-checkpoint-leaderboard-example.png?fit=max&auto=format&n=n8ammsXXZSklHHXd&q=85&s=43bddcfadc099f705e4d735addaf3164" alt="Example evaluation job leaderboard" width="3354" height="1552" data-path="images/models/llm-evaluation-jobs/model-checkpoint-leaderboard-example.png" />
</Frame>

## Next steps

* Browse the [Evaluation benchmark catalog](/models/launch/evaluations)
* [Evaluate a model checkpoint](/models/launch/evaluate-model-checkpoint)
* [Evaluate an API-hosted model](/models/launch/evaluate-hosted-model)

## More details

### Pricing

LLM Evaluation Jobs evaluates a model checkpoint or hosted API against popular benchmarks on fully-managed CoreWeave compute, with no infrastructure to manage. You pay only for resources consumed, not idle time. Pricing has two components: compute and storage. Compute is free during public preview, and we will announce pricing at general availability. Stored results include metrics and per-example traces saved in Models runs. Storage is billed monthly based on data volume. During the preview period, LLM Evaluation Jobs is available for Multi-tenant Cloud only. See the [Pricing](https://wandb.ai/pricing) page for details.

### Job limits

An individual evaluation job has these limits:

* The maximum size for a model to evaluate is 86 GB, including context.
* Each job is limited to two GPUs.

### Requirements

* To evaluate a model checkpoint, the model weights must be packaged as a VLLM-compatible artifact. See [Example: Prepare a model](/models/launch/evaluate-model-checkpoint#example-prepare-a-model) for details and example code.
* To evaluate an OpenAI-compatible model, it must be accessible at a public URL, and an organization or team admin must configure a team secret with the API key for authentication.
* Certain benchmarks use OpenAI models for scoring. To run these benchmarks, an organization or team admin must configure team secrets with the required API keys. See the [Evaluation benchmark catalog](/models/launch/evaluations) to determine whether a benchmark has this requirement.
* Certain benchmarks require access to gated datasets in Hugging Face. To run one of these benchmarks, an organization or team admin must request access to the gated dataset in Hugging Face, generate a Hugging Face user access token, and configure it as a team secret. See the [Evaluation benchmark catalog](/models/launch/evaluations) to determine whether a benchmark has this requirement.

For more details and instructions for meeting these requirements, see:

* [Evaluate a model checkpoint](/models/launch/evaluate-model-checkpoint)
* [Evaluate a hosted API model](/models/launch/evaluate-hosted-model)
