-
@rwjdk - Using code here would be ideal; use AI only where it gives you the most value. GPT-4o mini is a good model for this and has a very low cost per million tokens. I would recommend using code to count, and using the board's APIs to summarize (if they support this) and to do the updates in bulk (if they support this). Make sure you are using strongly typed plugins so the LLM is not guessing. Hopefully this helps.
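To make the "count in code, not in the LLM" suggestion concrete, here is a minimal sketch of a strongly typed Semantic Kernel plugin. The `BoardTask` record, its field names, and `CountOverdueTasks` are all illustrative assumptions, not a real board API; the point is that the number is computed deterministically and the LLM only decides when to call the function:

```csharp
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Linq;
using Microsoft.SemanticKernel;

// Hypothetical task shape; your real board API's model will differ.
public record BoardTask(string Id, string Title, DateTime? DueDate);

public class BoardStatsPlugin
{
    private readonly IReadOnlyList<BoardTask> _tasks; // loaded from the board API

    public BoardStatsPlugin(IReadOnlyList<BoardTask> tasks) => _tasks = tasks;

    // Counting happens in code, so the result is always exact;
    // the LLM never sees (or miscounts) the raw task list.
    [KernelFunction, Description("Returns the exact number of tasks overdue by at least the given number of days.")]
    public int CountOverdueTasks(
        [Description("Minimum number of days a task must be overdue")] int minDaysOverdue = 1)
        => _tasks.Count(t => t.DueDate is { } due
                             && (DateTime.UtcNow.Date - due.Date).Days >= minDaysOverdue);
}
```

With this shape, "how many tasks are more than 3 days overdue?" becomes a single cheap function call instead of feeding 500 tasks into the context.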
-
Before I begin this "rant", know that it is not aimed at the SK team or the product (both have been amazing to work with 🙌 )... Also, it might just be me not knowing enough, as I've only been working with SK / AI for about a month now...
... But I want to ask for people's opinions on whether it is possible to build anything meaningful for the scenario below given the current state of LLMs, from two perspectives: cost and accuracy.
My Scenario (Using C# and GPT-4o / GPT-4o mini)
I've been working on building a chatbot with a plugin against a Kanban Board (aka a glorified Todo-list). The Board has ~500 Tasks on it and 8 Columns (Todo, In Progress, Done, etc.). Each task can have a title, a due date, user assignments, and Tags.
The goal is to build a plugin that can report stats back (like how many tasks are overdue) and execute commands on the board (create new tasks, add/remove labels, move tasks to different columns), and combine these via auto function calling.
An advanced example would be to be able to ask the AI "Please tag all tasks that are more than 3 days overdue with the tag 'urgent' and move them to the top of their columns"
Issue 1: Accuracy
In the above example there are multiple "skills" the AI needs to master... Among a set of tasks, it needs to find the ones matching specific criteria and then issue a set of commands with the right parameters (add tag + change task position/priority).
None of this is difficult to do code-wise (just a plugin to get all tasks and a set of methods for the commands), and getting an LLM to do the individual steps is doable, but with mixed results...
Example 1: Most of the time the LLM can count, but at times you can feed it a list of 25 tasks and it will report back that there are, for example, 30 tasks (I tried to indicate via prompting that it should double-check all counting).
Example 2: When updating a task, one needs to provide the string-based ID of the task plus the updated data... Again, this works most of the time, but sometimes things break down and the LLM provides the name of the task instead of its ID for the taskId parameter.
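One mitigation for the wrong-parameter problem in Example 2 is to be defensive in code: state in the parameter's description that an ID (not a title) is expected, validate the value, and optionally resolve a title back to an ID when the model slips. A rough sketch under assumed names (`BoardTask`, `AddTag`, and the fields are illustrative, not a real API):

```csharp
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Linq;
using Microsoft.SemanticKernel;

// Hypothetical task shape; your real board model will differ.
public record BoardTask(string Id, string Title, List<string> Tags);

public class BoardCommandsPlugin
{
    private readonly IReadOnlyList<BoardTask> _tasks;

    public BoardCommandsPlugin(IReadOnlyList<BoardTask> tasks) => _tasks = tasks;

    [KernelFunction, Description("Adds a tag to a task. 'taskId' must be the task's ID, never its title.")]
    public string AddTag(
        [Description("The exact ID of the task, never the task title")] string taskId,
        [Description("The tag to add")] string tag)
    {
        var task = _tasks.FirstOrDefault(t => t.Id == taskId)
            // Defensive fallback: if the model passed a title anyway, resolve it in code.
            ?? _tasks.FirstOrDefault(t => t.Title.Equals(taskId, StringComparison.OrdinalIgnoreCase));

        if (task is null)
            return $"Error: no task with ID '{taskId}'. List the tasks first and use the 'Id' field.";

        task.Tags.Add(tag); // in a real plugin this would call the board's update API
        return $"Tagged task {task.Id} with '{tag}'.";
    }
}
```

Returning a descriptive error string (instead of throwing) also gives the model a chance to self-correct on the next function call.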
I've tried both giving huge amounts of metadata/hints as instruction and giving just the bare bones (to avoid it getting confused), but all with mixed results.
Issue 2: The Cost/token usage
Cost is of course always relative, but in my case I have the model capped at 20K tokens/min, and I hit the limit constantly during single-user testing, because every time I ask the LLM something it needs the entire list of tasks (so it can check, for example, whether a task is overdue), and it easily spends all the tokens (especially if you do not clear the chat history). I've tried adding pre-filter parameters to the plugin that the LLM can use to limit the result set, but more often than not the users' questions do not contain enough context to leverage the pre-filters, or the LLM again gets them wrong 😔.
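For reference, this is roughly what such pre-filter parameters can look like as optional, strongly typed arguments, so that an unfiltered call still works but a filtered one returns far fewer tokens (a sketch; the `BoardTask` shape and the filter names are assumptions, not the actual plugin):

```csharp
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Linq;
using Microsoft.SemanticKernel;

// Hypothetical task shape; your real board model will differ.
public record BoardTask(string Id, string Title, string Column, DateTime? DueDate);

public class BoardQueryPlugin
{
    private readonly IReadOnlyList<BoardTask> _tasks;

    public BoardQueryPlugin(IReadOnlyList<BoardTask> tasks) => _tasks = tasks;

    [KernelFunction, Description("Returns tasks matching the filters. Pass filters whenever possible to keep the result small.")]
    public List<BoardTask> GetTasks(
        [Description("Only return tasks overdue by at least this many days; omit for no due-date filter")] int? minDaysOverdue = null,
        [Description("Only return tasks in this column (e.g. 'Todo'); omit for all columns")] string? column = null)
    {
        IEnumerable<BoardTask> result = _tasks;

        if (minDaysOverdue is { } days)
            result = result.Where(t => t.DueDate is { } due
                && (DateTime.UtcNow.Date - due.Date).Days >= days);

        if (!string.IsNullOrEmpty(column))
            result = result.Where(t => t.Column.Equals(column, StringComparison.OrdinalIgnoreCase));

        return result.ToList();
    }
}
```

Making both filters optional means the model can still call `GetTasks()` when the user's question gives it nothing to filter on, but a question like "what's overdue in Todo?" only pulls the matching slice into the context.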
Am I doing it wrong?
So, to summarize: I have a system that works at times but "breaks down and does something wrong" often enough for users to lose faith and interest, and we lose a lot of money running it due to the high token count (which I could live with if the accuracy, aka user value, was there).
... Or is it just me not being good at prompt engineering and tweaking the models? Do I simply have too high expectations of what is possible in the current state of affairs? Or are we all just stuck doing cool demos and party tricks while we prepare/learn for a future where these LLMs can bring value in precision-work scenarios like mine/others?