-
@rwjdk - Using code here would be ideal; use AI only where it gives you the most value. GPT-4o mini is a good model for this and has a very low cost per million tokens. I would recommend using code to count, and using the board's APIs to summarize (if they support this) and to do the updates in bulk (if they support this). Make sure you are using strongly typed plugins so the LLM is not guessing. Hopefully this helps.
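To make the "count in code, not in the LLM" suggestion concrete, here is a minimal sketch of a strongly typed Semantic Kernel plugin. The `BoardTask` record, its field names, and `CountOverdueTasks` are all illustrative assumptions, not a real board API; the point is that the number is computed deterministically and the LLM only decides when to call the function:

```csharp
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Linq;
using Microsoft.SemanticKernel;

// Hypothetical task shape; your real board API's model will differ.
public record BoardTask(string Id, string Title, DateTime? DueDate);

public class BoardStatsPlugin
{
    private readonly IReadOnlyList<BoardTask> _tasks; // loaded from the board API

    public BoardStatsPlugin(IReadOnlyList<BoardTask> tasks) => _tasks = tasks;

    // Counting happens in code, so the result is always exact;
    // the LLM never sees (or miscounts) the raw task list.
    [KernelFunction, Description("Returns the exact number of tasks overdue by at least the given number of days.")]
    public int CountOverdueTasks(
        [Description("Minimum number of days a task must be overdue")] int minDaysOverdue = 1)
        => _tasks.Count(t => t.DueDate is { } due
                             && (DateTime.UtcNow.Date - due.Date).Days >= minDaysOverdue);
}
```

With this shape, "how many tasks are more than 3 days overdue?" becomes a single cheap function call instead of feeding 500 tasks into the context.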
-
Before I begin this "rant", know that it is not aimed at the SK team or the product (both have been amazing to work with 🙌 )... Also, it might just be me not knowing enough, as I've only been working with SK / AI for about a month now...
... But I want to ask for people's opinions on whether it is possible to build anything meaningful for the scenario below given the current state of LLMs, from two perspectives: cost and accuracy.
My Scenario (Using C# and GPT-4o / GPT-4o mini)
I've been working on building a chatbot with a plugin against a Kanban Board (aka a glorified Todo-list). The Board has ~500 Tasks on it and 8 Columns (Todo, In Progress, Done, etc.). Each task can have a title, a due date, user assignments, and Tags.
The goal is to build a plugin that can report stats back (like how many tasks are overdue) and execute commands on the board (create new tasks, add/remove labels, move tasks to different columns), and combine these via auto function calling.
An advanced example would be to be able to ask the AI "Please tag all tasks that are more than 3 days overdue with the tag 'urgent' and move them to the top of their columns"
Issue 1: Accuracy
In the above example there are multiple "skills" the AI needs to master... Among a set of tasks, it needs to find the ones matching specific criteria and then issue a set of commands with the right parameters (add tag + change task position/priority).
None of this is difficult to do code-wise (just a plugin to get all tasks and a set of methods for the commands), and getting an LLM to do the individual steps is doable, but with mixed results...
Example 1: Most of the time the LLM can count, but at times you can feed it a list of 25 tasks and it will report back that there are, for example, 30 tasks (I tried to indicate via prompting that it should double-check all counting).
Example 2: When updating a task, one needs to provide the string-based ID of the task plus the updated data... Again, this works most of the time, but sometimes things break down and the LLM provides the name of the task instead of its ID for the taskId parameter.
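One mitigation for the wrong-parameter problem in Example 2 is to be defensive in code: state in the parameter's description that an ID (not a title) is expected, validate the value, and optionally resolve a title back to an ID when the model slips. A rough sketch under assumed names (`BoardTask`, `AddTag`, and the fields are illustrative, not a real API):

```csharp
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Linq;
using Microsoft.SemanticKernel;

// Hypothetical task shape; your real board model will differ.
public record BoardTask(string Id, string Title, List<string> Tags);

public class BoardCommandsPlugin
{
    private readonly IReadOnlyList<BoardTask> _tasks;

    public BoardCommandsPlugin(IReadOnlyList<BoardTask> tasks) => _tasks = tasks;

    [KernelFunction, Description("Adds a tag to a task. 'taskId' must be the task's ID, never its title.")]
    public string AddTag(
        [Description("The exact ID of the task, never the task title")] string taskId,
        [Description("The tag to add")] string tag)
    {
        var task = _tasks.FirstOrDefault(t => t.Id == taskId)
            // Defensive fallback: if the model passed a title anyway, resolve it in code.
            ?? _tasks.FirstOrDefault(t => t.Title.Equals(taskId, StringComparison.OrdinalIgnoreCase));

        if (task is null)
            return $"Error: no task with ID '{taskId}'. List the tasks first and use the 'Id' field.";

        task.Tags.Add(tag); // in a real plugin this would call the board's update API
        return $"Tagged task {task.Id} with '{tag}'.";
    }
}
```

Returning a descriptive error string (instead of throwing) also gives the model a chance to self-correct on the next function call.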
I've tried both giving huge amounts of metadata/hints as instruction and giving just the bare bones (to avoid it getting confused), but all with mixed results.
Issue 2: The Cost/token usage
Cost is of course always relative, but in my case I have the model capped at 20K tokens/min, and I hit the limit constantly during single-user testing, because every time I ask the LLM something it needs the entire list of tasks (so it can check, for example, whether a task is overdue), and it easily spends all the tokens (especially if you do not clear the chat history). I've tried adding pre-filter parameters to the plugin that the LLM can use to limit the result set, but more often than not the users' questions do not contain enough context to leverage the pre-filters, or the LLM again gets them wrong 😔.
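For reference, this is roughly what such pre-filter parameters can look like as optional, strongly typed arguments, so that an unfiltered call still works but a filtered one returns far fewer tokens (a sketch; the `BoardTask` shape and the filter names are assumptions, not the actual plugin):

```csharp
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Linq;
using Microsoft.SemanticKernel;

// Hypothetical task shape; your real board model will differ.
public record BoardTask(string Id, string Title, string Column, DateTime? DueDate);

public class BoardQueryPlugin
{
    private readonly IReadOnlyList<BoardTask> _tasks;

    public BoardQueryPlugin(IReadOnlyList<BoardTask> tasks) => _tasks = tasks;

    [KernelFunction, Description("Returns tasks matching the filters. Pass filters whenever possible to keep the result small.")]
    public List<BoardTask> GetTasks(
        [Description("Only return tasks overdue by at least this many days; omit for no due-date filter")] int? minDaysOverdue = null,
        [Description("Only return tasks in this column (e.g. 'Todo'); omit for all columns")] string? column = null)
    {
        IEnumerable<BoardTask> result = _tasks;

        if (minDaysOverdue is { } days)
            result = result.Where(t => t.DueDate is { } due
                && (DateTime.UtcNow.Date - due.Date).Days >= days);

        if (!string.IsNullOrEmpty(column))
            result = result.Where(t => t.Column.Equals(column, StringComparison.OrdinalIgnoreCase));

        return result.ToList();
    }
}
```

Making both filters optional means the model can still call `GetTasks()` when the user's question gives it nothing to filter on, but a question like "what's overdue in Todo?" only pulls the matching slice into the context.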
Am I doing it wrong?
So, to summarize: I have a system that works at times but "breaks down and does something wrong" often enough for users to lose faith and interest, and we lose a lot of money running it due to the high token count (which I could live with if the accuracy, aka user value, was there).
... Or is it just me not being good at prompt engineering and tweaking the models? Do I simply have too high expectations of what is possible in the current state of affairs? Or are we all just stuck doing cool demos and party tricks while we prepare/learn for a future where these LLMs can bring value in precision-work scenarios like mine/others?