Simple Agent Benchmark

What is this?

As AI's get better and better, benchmarking them becomes fuzzier and fuzzier. While in the past 'do this computation' was good enough, now we can give them much more abstract tasks. This page attempts to solve this problem by using visual bencmarks - somethig that is easy for a human to compare. This allows the agent to take creative license and produce a result that is observably 'good' rather than provably 'correct'.

Technically, this project takes the form of a very very simple agent. The agent is just a loop that takes an initial prompt and an initial set of files, and then runs the AI on them, providing the agent with some simple tools and 100 iterations. To judge the results, it is expectd that the test will output an index.html file, which will be embedded here. To aid in figuring out what is going on, the complete message log is also available, and can be viewed by clicking the 'View Message Log' link.

Table of Contents

Results

Demo site for 3d printer

html
internal-knowledge

Can the model create a simple webpage with informational content from it's own memory?

Initial Prompt:

Create me a website about 3d printers. Make it technical and informative, including information such as calibration, what process should be followed after changing filament etc. Add lots of detail and use a modern design language with the main color being orange. You should create multiple pages, but make sure the main page is available in index.html

Meta

meta-llama-3.1-8b-instruct

OpenAI

QWQ

Qwen

THUDM

Can the model create a simple html webpage with some demo content

Initial Prompt:

I'd like to create a plain HTML site for a plumbing business. Make it look modern. Output it into index.html

Meta

meta-llama-3.1-8b-instruct

OpenAI

QWQ

Qwen

THUDM

Create a 3d cart racing game using some webgl based technology

Initial Prompt:

Make a 3d web based game of racing gocarts. Pick a standard rendering technology (Eg balylonjs, playcanvas, three.js). Make sure the result is available in index.html

Meta

meta-llama-3.1-8b-instruct

OpenAI

QWQ

Qwen

THUDM

Create a game of tic-tac-toe

Initial Prompt:

Make a game of tic-tac-toe. The game should be playable from index.html, but you can create supporting css/js files.

Meta

meta-llama-3.1-8b-instruct

OpenAI

QWQ

Qwen

THUDM

Can the model create a simple html webpage

Initial Prompt:

Create a single index.html file containing a hello world. Include styling in a dark theme with bold and striking color choice.

Meta

meta-llama-3.1-8b-instruct

OpenAI

QWQ

Qwen

THUDM

Create an image of some mountains as an svg

Initial Prompt:

I would like you to create an svg image of a sunset through the mountains. Make sure the result is visible in index.html

Meta

meta-llama-3.1-8b-instruct

OpenAI

QWQ

Qwen

THUDM

Rotating hexagon in plain HTML

Initial Prompt:

Create a HTML page containing a rotating hexagon with bouncing balls inside it. Save it to `index.html`

Meta

meta-llama-3.1-8b-instruct

OpenAI

QWQ

Qwen

THUDM

Use Bash

bash
html

Extract current data using bash

Initial Prompt:

Read the file `/project/instructions.md`. There are more instructions for you there.

Meta

meta-llama-3.1-8b-instruct

OpenAI

QWQ

Qwen

THUDM

Create a demo shader in webgl

Initial Prompt:

Can you set up a webgl demo for me? I'd like an example of a basic webgl shader. Make sure the output is available in index.html, but you can install any packages necessary.

Meta

meta-llama-3.1-8b-instruct

OpenAI

QWQ

Qwen

THUDM