Amazon launches Bedrock: AI Model Evaluation with Human Benchmarking

NISHANT TIWARI 30 Nov, 2023

Amazon Bedrock now lets developers assess, compare, and choose the foundation models (FMs) best suited to their specific needs. The Model Evaluation feature, now in preview, gives developers a range of evaluation tools, with both automatic and human benchmarking options.

The Power of Model Evaluation

Model evaluations play a pivotal role at every stage of development. Developers can use the Model Evaluation feature throughout the process of building generative artificial intelligence (AI) applications: experimenting with different models in the platform’s playground environment, speeding up iteration with automatic evaluations, and ensuring quality through human reviews during the launch phase.

Automatic Model Evaluation Made Simple

With automatic model evaluation, developers can bring their own data or use curated datasets together with predefined metrics such as accuracy, robustness, and toxicity. This removes the need to design and run custom model evaluation benchmarks. Models can be evaluated for specific tasks such as content summarization, question answering, text classification, and text generation, which saves time for developers comparing candidates.
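
To make the idea of a predefined metric concrete, here is a minimal Python sketch of an accuracy check over a small question-answering dataset. It is only an illustration of what an automatic evaluation measures, not Amazon Bedrock's API: the function exact_match_accuracy, the stand-in toy_model, and the sample prompts are all hypothetical.

```python
from typing import Callable, Iterable, Tuple


def exact_match_accuracy(
    model: Callable[[str], str],
    dataset: Iterable[Tuple[str, str]],
) -> float:
    """Return the fraction of prompts whose answer matches the reference.

    Comparison ignores case and surrounding whitespace. This is one simple
    way to define accuracy; a real evaluation service may define it differently.
    """
    examples = list(dataset)
    if not examples:
        raise ValueError("dataset must contain at least one example")
    correct = sum(
        1
        for prompt, expected in examples
        if model(prompt).strip().lower() == expected.strip().lower()
    )
    return correct / len(examples)


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for a foundation model call."""
    canned_answers = {"Capital of France?": "Paris", "2 + 2 = ?": "4"}
    return canned_answers.get(prompt, "unknown")


# Hypothetical dataset of (prompt, reference answer) pairs.
dataset = [
    ("Capital of France?", "paris"),
    ("2 + 2 = ?", "4"),
    ("Largest planet?", "Jupiter"),
]

score = exact_match_accuracy(toy_model, dataset)
print(f"accuracy: {score:.2f}")
```

Swapping toy_model for a function that calls a real model would let the same loop compare candidates on the same data, which is the kind of work the feature is meant to take off developers' hands.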

Human Model Evaluation for Custom Metrics

Amazon Bedrock also offers a human evaluation workflow for subjective metrics such as friendliness and style. Developers can define custom metrics and use their own datasets with just a few clicks. They can choose between using internal teams as reviewers and opting for an AWS-managed team. This approach removes much of the effort traditionally associated with building and managing human evaluation workflows.

Crucial Details to Consider

During the preview phase, Amazon Bedrock allows the evaluation and comparison of text-based large language models (LLMs). Developers can select one model for each automatic evaluation job and up to two models for each human evaluation job using their own teams. Additionally, for human evaluation through an AWS-managed team, custom project requirements can be specified.
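
The per-job model limits described above can be expressed as a small pre-flight check. The Python sketch below is a hypothetical helper, not part of any AWS SDK: the EvaluationJob class, the job kind names, and the model identifiers are assumptions made for illustration, and only the limits themselves (one model per automatic job, up to two per human job with your own team) come from the preview details.

```python
from dataclasses import dataclass
from typing import List

# Preview limits as reported above; the key names are hypothetical labels.
MAX_MODELS_PER_JOB = {
    "automatic": 1,       # one model per automatic evaluation job
    "human_own_team": 2,  # up to two models per human job with your own team
}


@dataclass
class EvaluationJob:
    kind: str
    model_ids: List[str]


def validate_job(job: EvaluationJob) -> None:
    """Raise ValueError if the job breaks the preview model limits."""
    if job.kind not in MAX_MODELS_PER_JOB:
        raise ValueError(f"unknown job kind: {job.kind!r}")
    if not job.model_ids:
        raise ValueError("at least one model is required")
    limit = MAX_MODELS_PER_JOB[job.kind]
    if len(job.model_ids) > limit:
        raise ValueError(
            f"{job.kind} jobs accept at most {limit} model(s), "
            f"got {len(job.model_ids)}"
        )


# Hypothetical model identifiers, used only to exercise the check.
validate_job(EvaluationJob("human_own_team", ["model-a", "model-b"]))  # passes
try:
    validate_job(EvaluationJob("automatic", ["model-a", "model-b"]))
except ValueError as err:
    print(f"rejected: {err}")
```

Comparing more than one model with automatic evaluation would, under these limits, mean running one job per model and comparing the results afterwards.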

Pricing is a crucial consideration. During the preview phase, AWS charges only for the model inference required for evaluations, with no additional fees for the human or automatic evaluations themselves. Amazon Bedrock Pricing provides a full breakdown of the associated costs.

Our Say

Amazon Bedrock’s Model Evaluation gives developers a more systematic way to choose among foundation models. It combines automatic and human evaluation options, simplified workflows, and preview pricing limited to inference costs. As the preview progresses, the industry will be watching how much it changes the way teams select models. Developers, buckle up: the future of model selection is here.