Add vision capability for bots #413

gmuffiness · 2025-01-19T18:07:01Z

I added support for image input using GPT-4V and GPT-4o, enabling effective image interpretation.
This is an initial implementation, so I would greatly appreciate any feedback or suggestions for improvement. Thank you!

Changelog

Added two actions that use mineflayer’s screenshot functionality (as @MaxRobinsonTheGreat suggested in this issue):
• lookAtPlayer: Allows the bot to focus on the player’s direction or viewpoint for better understanding
• lookAtPosition: Enables the bot to focus on specific coordinates for targeted image interpretation
Added a promptImageConvo method in src/agent/prompter.js.
Included examples to demonstrate these new features.
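
As a rough illustration of what an image-aware prompt method has to assemble, here is a minimal sketch of a user message carrying a screenshot in the OpenAI Chat Completions vision format. This is not the PR's actual promptImageConvo code: the helper name buildImageMessage, its parameters, and the sample bytes are assumptions made for the example.

```javascript
// Hypothetical helper, not the PR's promptImageConvo implementation.
// Builds one user message with a text part and an inline base64 image part.
function buildImageMessage(promptText, imageBuffer, mimeType = 'image/jpeg') {
  if (!Buffer.isBuffer(imageBuffer)) {
    throw new TypeError('imageBuffer must be a Buffer');
  }
  const base64 = imageBuffer.toString('base64');
  return {
    role: 'user',
    content: [
      { type: 'text', text: promptText },
      // The image is passed inline as a data URL.
      { type: 'image_url', image_url: { url: `data:${mimeType};base64,${base64}` } },
    ],
  };
}

// Usage: in practice the bytes would come from a saved screenshot file.
// The three bytes below are only the start of a JPEG header, used as a stand-in.
const message = buildImageMessage('Describe what the bot sees.', Buffer.from([0xff, 0xd8, 0xff]));
console.log(message.content[1].image_url.url); // data:image/jpeg;base64,/9j/
```

The resulting object can be appended to the messages array of a chat completion request for a vision-capable model.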

Known Limitations

Node.js Compatibility: Installing with a recent Node.js version (v23.2.0 in my case) failed on the node-canvas-webgl and three packages. Switching to the LTS version 18.20.5 resolved the errors; run nvm use 18 for compatibility.
Minecraft Version Support: Works reliably with Minecraft versions up to 1.20.1, as specified in the Prismarine Viewer README. Rendering and execution issues may occur with later versions.
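
The two limitations above could be turned into a startup warning. The sketch below is only an assumption about how such a check might look; checkVisionEnvironment and its helpers are hypothetical names and are not part of this PR. The thresholds are the ones reported above (Node 18.x worked, Minecraft up to 1.20.1).

```javascript
// Hypothetical preflight helpers; none of these names exist in the PR.
function parseVersion(version) {
  // 'v18.20.5' -> [18, 20, 5]; non-numeric parts become 0.
  return version.replace(/^v/, '').split('.').map((part) => Number.parseInt(part, 10) || 0);
}

// Returns a negative number, zero, or a positive number, like a sort comparator.
function compareVersions(a, b) {
  const left = parseVersion(a);
  const right = parseVersion(b);
  const length = Math.max(left.length, right.length);
  for (let i = 0; i < length; i += 1) {
    const diff = (left[i] ?? 0) - (right[i] ?? 0);
    if (diff !== 0) return diff;
  }
  return 0;
}

function checkVisionEnvironment(nodeVersion, minecraftVersion) {
  const warnings = [];
  if (parseVersion(nodeVersion)[0] !== 18) {
    warnings.push(`Node ${nodeVersion}: installation was reported working on 18.20.5 (try "nvm use 18").`);
  }
  if (compareVersions(minecraftVersion, '1.20.1') > 0) {
    warnings.push(`Minecraft ${minecraftVersion}: rendering may fail beyond 1.20.1.`);
  }
  return warnings;
}

// Usage: print any warnings for the current Node runtime and a chosen game version.
console.log(checkVisionEnvironment(process.version, '1.20.1'));
```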

gmuffiness · 2025-01-19T18:14:07Z

I also made a demo video about this feature.
This video was inspired by your work, Maximus! I hope you enjoy it :)

https://www.youtube.com/watch?v=gPyFrBs45Es

uukelele-scratch · 2025-01-19T19:42:59Z

why change default port to 56069?

and why comment out init message?

gmuffiness · 2025-01-20T01:33:27Z

Oh, I hadn’t noticed that settings.js was changed. Thanks for pointing it out!
no particular reason, haha

gmuffiness · 2025-01-20T08:56:55Z

Currently, the lookAtPlayer and lookAtPosition functions in skills.js handle both 1) taking screenshots and 2) sending requests to the vision model. However, the other functions in skills.js seem to focus solely on controlling Mineflayer’s actions.

This makes me wonder if it might be better to separate these responsibilities by creating a new class, such as VisionInterpreter, to handle the vision-related functionality and using it in agent.js.

I’ll think more about whether this approach would be better. I’d appreciate any feedback or thoughts!
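
One way the separation described above might look, as a minimal sketch. Only the names VisionInterpreter, lookAtPlayer, and lookAtPosition come from this thread; the constructor arguments, the interpret method, and the stub objects are hypothetical and are there so the example runs without Minecraft or an API key.

```javascript
// Hypothetical sketch of the proposed split, not code from this PR.
class VisionInterpreter {
  // camera: object with an async capture() that resolves to an image Buffer
  // describeImage: async (prompt, imageBuffer) => string, e.g. a wrapper around the vision model call
  constructor(camera, describeImage) {
    this.camera = camera;
    this.describeImage = describeImage;
  }

  // Takes a screenshot and asks the vision model about it.
  async interpret(prompt) {
    const image = await this.camera.capture();
    return this.describeImage(prompt, image);
  }
}

// The skill only moves the bot's view, then delegates capture and interpretation.
async function lookAtPosition(bot, vision, x, y, z) {
  await bot.lookAt({ x, y, z }); // stand-in: the real mineflayer bot.lookAt takes a Vec3
  return vision.interpret(`Describe what is visible at (${x}, ${y}, ${z}).`);
}

// Usage with stub dependencies.
const fakeBot = { lookAt: async () => {} };
const fakeCamera = { capture: async () => Buffer.from('fake-image') };
const fakeModel = async (prompt, image) => `${prompt} [${image.length} bytes]`;
const vision = new VisionInterpreter(fakeCamera, fakeModel);
const result = lookAtPosition(fakeBot, vision, 10, 64, -5);
result.then((text) => console.log(text));
```

Injecting the camera and the model call keeps skills.js limited to bot control and makes the vision path testable with stubs.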

gmuffiness added 3 commits January 15, 2025 17:26

feat: add screenshots and look action works on gpt

1be24f4

feat: update skill look to lookAtPlayer & export lookAtPosition

f5923db

fix: add camera file & move image describe to promptImageConvo

1fee081

chore: reset settings

85ed526

gmuffiness added 3 commits January 20, 2025 11:48

chore: remove duplcated func

65113c7

fix: update camera direction to use mineflayer viewer setFirstPersonCamera api

55c045f

chore: remove unnecessary 5-second sleep

e4eda9c
