Add vision capability for bots #413

Draft: wants to merge 7 commits into main

gmuffiness

I added support for image input with GPT-4V and GPT-4o, so that bots can interpret screenshots of what they are looking at.
This is an initial implementation, so I would greatly appreciate any feedback or suggestions for improvement. Thank you!

Changelog

  • Added two actions that leverage mineflayer’s screenshot functionality (as @MaxRobinsonTheGreat suggested in this issue):
    • lookAtPlayer: Allows the bot to focus on the player’s direction or viewpoint for better understanding
    • lookAtPosition: Enables the bot to focus on specific coordinates for targeted image interpretation
  • Added a promptImageConvo method in src/agent/prompter.js.
  • Included examples to demonstrate these new features.
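
As a rough illustration of what an image-aware prompt can look like, here is a minimal sketch that packs a screenshot into an OpenAI-style chat message list. The helper name buildImageMessages and its parameters are hypothetical; this is not the promptImageConvo implementation from the PR, only the general message shape that vision-capable chat models accept.

```javascript
// Hypothetical helper, NOT the PR's promptImageConvo.
// Pairs a text prompt with a base64-encoded screenshot in the
// OpenAI chat format (text part + image_url part with a data URL).
function buildImageMessages(systemPrompt, userText, imageBuffer, mimeType = 'image/jpeg') {
    // Encode the raw image bytes as a data URL so no file hosting is needed.
    const dataUrl = `data:${mimeType};base64,${imageBuffer.toString('base64')}`;
    return [
        { role: 'system', content: systemPrompt },
        {
            role: 'user',
            content: [
                { type: 'text', text: userText },
                { type: 'image_url', image_url: { url: dataUrl } },
            ],
        },
    ];
}
```

The returned array could then be passed as the messages field of a chat completion request; how the PR actually wires this into the prompter may differ.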

Known Limitations

  • Node.js compatibility: With a relatively new Node.js version (in my case, v23.2.0), installing the node-canvas-webgl and three packages failed with errors. Switching to an LTS release (v18.20.5) resolved these issues. Run nvm use 18 for compatibility.
  • Minecraft version support: Works reliably with Minecraft versions up to 1.20.1, as specified in the Prismarine Viewer README. Rendering and execution issues may occur with versions beyond 1.20.1.
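
The two limitations above could be turned into a startup warning. The sketch below is only an assumption about how such a check might look: checkVisionSupport, compareVersions, and parseVersion are hypothetical names, and the thresholds (Node.js major version 18, Minecraft up to 1.20.1) are taken directly from the notes above rather than from wider testing.

```javascript
// Hypothetical preflight check; thresholds come from the limitations listed above.
function parseVersion(v) {
    // "v18.20.5" -> [18, 20, 5]; non-numeric parts become 0.
    return v.replace(/^v/, '').split('.').map((n) => parseInt(n, 10) || 0);
}

// Comparator-style result: negative if a < b, zero if equal, positive if a > b.
function compareVersions(a, b) {
    const pa = parseVersion(a);
    const pb = parseVersion(b);
    for (let i = 0; i < Math.max(pa.length, pb.length); i++) {
        const diff = (pa[i] || 0) - (pb[i] || 0);
        if (diff !== 0) return diff;
    }
    return 0;
}

// Returns a list of human-readable warnings (empty when nothing looks risky).
function checkVisionSupport(nodeVersion, minecraftVersion) {
    const warnings = [];
    if (parseVersion(nodeVersion)[0] !== 18) {
        warnings.push(`Node.js ${nodeVersion}: only Node.js 18 was reported to work with vision.`);
    }
    if (compareVersions(minecraftVersion, '1.20.1') > 0) {
        warnings.push(`Minecraft ${minecraftVersion}: rendering may fail beyond 1.20.1.`);
    }
    return warnings;
}
```

A caller might pass process.version together with the configured Minecraft version and log whatever warnings come back. Other Node.js LTS releases may well work; they are flagged here only because they were not reported as tested.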

@gmuffiness
Author

I also made a demo video about this feature.
This video was inspired by your work, Maximus! I hope you enjoy it :)

https://www.youtube.com/watch?v=gPyFrBs45Es

@uukelele-scratch
Contributor

uukelele-scratch commented Jan 19, 2025

why change default port to 56069?

and why comment out init message?

@gmuffiness
Author

Oh, I hadn’t noticed that settings.js was changed. Thanks for pointing it out!
No particular reason, haha.

@gmuffiness
Author

Currently, the lookAtPlayer and lookAtPosition functions in skills.js handle both 1) taking screenshots and 2) sending requests to the vision model. However, the other functions in skills.js seem to focus solely on controlling Mineflayer’s actions.

This makes me wonder whether it would be better to separate these responsibilities by creating a new class, such as VisionInterpreter, that handles the vision-related functionality and is used from agent.js.

I’ll think more about whether this approach would be better. I’d appreciate any feedback or thoughts!
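
To make the proposed split concrete, here is a minimal sketch of one possible shape. Only the name VisionInterpreter comes from the comment above; the camera and visionModel objects, their capture and describe methods, and the lookAndDescribe helper are all hypothetical stand-ins, injected so the class does not depend on Mineflayer directly.

```javascript
// Hypothetical sketch of the proposed separation of concerns.
class VisionInterpreter {
    // camera: assumed object with async capture() returning an image Buffer.
    // visionModel: assumed object with async describe(prompt, imageBuffer) returning text.
    constructor(camera, visionModel) {
        this.camera = camera;
        this.visionModel = visionModel;
    }

    // Takes a screenshot and asks the vision model about it.
    async describeView(prompt) {
        const image = await this.camera.capture();
        return this.visionModel.describe(prompt, image);
    }
}

// Movement stays in a skills-style function (turnBot); interpretation lives
// in the interpreter. turnBot is any async function that only rotates the bot.
async function lookAndDescribe(turnBot, interpreter, prompt) {
    await turnBot();
    return interpreter.describeView(prompt);
}
```

With this arrangement, lookAtPlayer and lookAtPosition in skills.js would only need to turn the bot, and the agent would own a single interpreter instance. Whether this fits the existing agent.js structure is an open question.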

2 participants