GPTs and Assistants for Semantic Kernel #3393
Replies: 16 comments 20 replies
-
Will the AssistantKernel be a feature released with SK v1.0?
-
It would be cool, and fairly simple, to show how you can create your own flexible backend with Semantic Kernel and make it deployable as a GPT (exposing only endpoints for the kernel's input and output). Seamless integration of all the new API functionality (especially JSON mode) would be nice too. Is that coming to Azure soon?
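A minimal sketch of that endpoint idea, assuming an ASP.NET Core minimal-API project and the SK v1 OpenAI connector. The `/ask` route, the request shape, and the model id are illustrative, not an official sample:

```csharp
// Minimal ASP.NET Core endpoint wrapping a Semantic Kernel prompt invocation.
// The /ask route and request/response shapes are illustrative placeholders.
using Microsoft.SemanticKernel;

var webBuilder = WebApplication.CreateBuilder(args);
var app = webBuilder.Build();

var kernel = Kernel.CreateBuilder()
    .AddOpenAIChatCompletion(
        modelId: "gpt-4-1106-preview",
        apiKey: Environment.GetEnvironmentVariable("OPENAI_API_KEY")!)
    .Build();

// Single input/output endpoint that a custom GPT (or any client) could call as an action.
app.MapPost("/ask", async (AskRequest request) =>
{
    var result = await kernel.InvokePromptAsync(request.Prompt);
    return Results.Ok(new { answer = result.ToString() });
});

app.Run();

record AskRequest(string Prompt);
```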
-
I'd love to see how the responsibility is divided between GPTs and plugins. For example, if I train a GPT to "properly format and validate JSON" but I also have a plugin with formatting abilities, which one will the planner use? How do I indicate that I want the GPT to handle it (no extra tokens and execution/validation steps, since it can be done in a single pass)?
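One hedged way to influence that today is through function metadata: planners and automatic function calling select functions largely by name and description, so a narrowly worded description biases when a local plugin gets picked, and simply not registering the plugin leaves the work to the GPT. A hypothetical example:

```csharp
// Hypothetical plugin whose descriptions are written narrowly so a planner or
// automatic function calling only selects it for the intended task.
using System.ComponentModel;
using Microsoft.SemanticKernel;

public sealed class JsonFormattingPlugin
{
    [KernelFunction, Description("Validates and pretty-prints a JSON document. " +
        "Use only when the user explicitly asks to format or validate raw JSON text.")]
    public string FormatJson([Description("The raw JSON string to format")] string json)
    {
        using var doc = System.Text.Json.JsonDocument.Parse(json);
        return System.Text.Json.JsonSerializer.Serialize(
            doc.RootElement,
            new System.Text.Json.JsonSerializerOptions { WriteIndented = true });
    }
}
```

Registering it with `kernel.ImportPluginFromType<JsonFormattingPlugin>();` makes it a candidate; omitting the registration leaves formatting and validation to the GPT side.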
-
Would love to see a GPT Store-ready, deployable GPT chat sample (without RAG) that demonstrates SK. I'm seeing a lot of use cases that fit this pattern very well. This sample has some of the same concepts found in the 04-DynamicRag v1 sample, but adds image generation and simplifies things a bit by eliminating the coded plugin (Math.cs).
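A rough sketch of the shape such a sample might take, combining chat with on-demand image generation and no coded plugin. It assumes the SK v1 OpenAI chat and text-to-image connectors; the text-to-image service is marked experimental, so a real project may need to opt in by suppressing the corresponding warning, and the "draw" trigger and model id are illustrative:

```csharp
// Sketch of a chat turn that also generates an image on request.
// Assumes the OpenAI text-to-image connector (experimental at the time of writing).
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;
using Microsoft.SemanticKernel.TextToImage;

string apiKey = Environment.GetEnvironmentVariable("OPENAI_API_KEY")!;
string userInput = Console.ReadLine() ?? string.Empty;

var kernel = Kernel.CreateBuilder()
    .AddOpenAIChatCompletion("gpt-4-1106-preview", apiKey)
    .AddOpenAITextToImage(apiKey)          // DALL-E backed image generation
    .Build();

var chat = kernel.GetRequiredService<IChatCompletionService>();
var images = kernel.GetRequiredService<ITextToImageService>();

var history = new ChatHistory("You are a helpful chat assistant.");
history.AddUserMessage(userInput);

if (userInput.StartsWith("draw", StringComparison.OrdinalIgnoreCase))
{
    // Returns a URL (or payload, depending on connector version) for the generated image.
    string imageUrl = await images.GenerateImageAsync(userInput, 1024, 1024);
    Console.WriteLine(imageUrl);
}
else
{
    var reply = await chat.GetChatMessageContentAsync(history, kernel: kernel);
    Console.WriteLine(reply.Content);
}
```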
-
I'd love to see an updated version of Chat Copilot with these features.
-
How can users be allowed to use natural language to ask for business metrics? I want to be able to provide schema information about the tables and convert requests into SQL that can be used to query the data tables. Can it do joins and GROUP BYs?
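A hedged sketch of one way to do this with a prompt function that grounds the model in a schema description; the schema, prompt wording, and question are illustrative, and generated SQL should always be reviewed or validated before being executed:

```csharp
// Sketch: natural-language question -> SQL, grounding the model with a schema description.
// The schema string and prompt wording are illustrative; review and parameterize generated
// SQL before running it against real data.
using Microsoft.SemanticKernel;

var kernel = Kernel.CreateBuilder()
    .AddOpenAIChatCompletion("gpt-4-1106-preview",
        Environment.GetEnvironmentVariable("OPENAI_API_KEY")!)
    .Build();

const string schema = """
    Orders(OrderId INT, CustomerId INT, OrderDate DATE, Total DECIMAL)
    Customers(CustomerId INT, Name NVARCHAR(100), Region NVARCHAR(50))
    """;

var nlToSql = kernel.CreateFunctionFromPrompt("""
    You translate business questions into T-SQL for the schema below.
    Use JOINs and GROUP BY where needed. Return only the SQL statement.

    Schema:
    {{$schema}}

    Question: {{$question}}
    """);

var sql = await kernel.InvokeAsync(nlToSql, new KernelArguments
{
    ["schema"] = schema,
    ["question"] = "What was total revenue per region last quarter?"
});

Console.WriteLine(sql);
```

Joins and GROUP BYs come from the model itself; reliability tends to improve the more schema detail and example queries are included in the prompt.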
-
Share your GPTs on https://gptspedia.io and apply to be GPT of the week! ❤️
-
Gents, may I propose a 180° U-turn approach. Currently everybody puts agents at the centre. Honestly, the user (and their proxies) does not, and actually shouldn't, care which agent or sub-function is chosen. So why not put the task/query at the centre instead, by turning the "thread" into a "queue"? The original task and its sub-completions would be picked up by listening agents and their sub-functions that think they can best contribute to the main problem, or at least to a sub-problem, until one agent reaches the final conclusion or assembles the final answer from the sub-results. PS: I am by nature Python-centric (and I presume the majority of the community is too) ... please put Python first, as it has the most leverage/scaling factor in the community.
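A hypothetical sketch of that queue-centric shape, written in C# only for consistency with the other examples in this thread (the same structure applies in Python); it is not a Semantic Kernel API, and the types and method names are invented for illustration:

```csharp
// Hypothetical sketch of the "queue in the centre" idea: agents pull tasks they can
// handle instead of a central router choosing an agent. Not a Semantic Kernel API.
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public record WorkItem(string Task, string? Result = null);

public interface IListeningAgent
{
    bool CanHandle(WorkItem item);          // self-selection instead of central routing
    Task<WorkItem> HandleAsync(WorkItem item);
}

public sealed class TaskQueueOrchestrator
{
    private readonly ConcurrentQueue<WorkItem> _queue = new();
    private readonly IReadOnlyList<IListeningAgent> _agents;

    public TaskQueueOrchestrator(IEnumerable<IListeningAgent> agents) => _agents = agents.ToList();

    public void Post(string task) => _queue.Enqueue(new WorkItem(task));

    // Keep cycling items through willing agents until one produces a final result.
    public async Task<string?> RunAsync()
    {
        while (_queue.TryDequeue(out var item))
        {
            var agent = _agents.FirstOrDefault(a => a.CanHandle(item));
            if (agent is null) continue;                 // nobody volunteered for this item
            var outcome = await agent.HandleAsync(item);
            if (outcome.Result is not null) return outcome.Result;   // final answer assembled
            _queue.Enqueue(outcome);                     // partial progress goes back on the queue
        }
        return null;
    }
}
```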
-
Personally, when I was getting started I found this course particularly helpful: https://www.deeplearning.ai/short-courses/microsoft-semantic-kernel/
-
I would like to find out whether I can use Semantic Kernel with all of its powers (now including assistants) to build a DDD (Domain-Driven Design) modeling tool that uses GPT's NLP capabilities to extract entities, properties, and their relationships (ERD cardinalities: NavigationProperties represent 1-n and NavigationConnections represent n-n). You can see how one such entity can be serialized to a .json structure here: https://github.com/IsmailEsadKilic/Vehman2/blob/main/aspnet-core/.suite/entities/Vehicle.json I would also like to visualize these ERD relationships with a JavaScript visualization library like GoJS while I am discovering the entities and the relationships among them. See this page for visualization: https://gojs.net/latest/samples/entityRelationship.html I would like to be able to say something like "Category and Product have a NavigationConnection relationship," and the engine should infer the rest (based on some predefined configuration, such as a standard string length of 100). I would like to see how to connect an external dynamic visualization to the in-memory structure I am developing. I found that in Python there is a way to use the Pydantic library, Instructor, and Marvin to force the GPT engine to return a structured JSON response: https://medium.com/@jxnlco/bridging-language-model-with-python-with-instructor-pydantic-and-openais-function-calling-f32fb1cdb401 https://www.askmarvin.ai
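A hedged C# analogue of the Pydantic/Instructor approach: ask for JSON (optionally via the connector's JSON-mode setting, if your version supports it) and deserialize into typed records. The record shapes are illustrative and not the entity schema from the linked file; the deserialized objects could then feed a GoJS view:

```csharp
// Sketch: extract DDD entities/relationships as structured JSON and deserialize them
// into typed records (a rough C# analogue of the Pydantic/Instructor approach).
// The record shapes are illustrative placeholders.
using System.Collections.Generic;
using System.Text.Json;
using System.Threading.Tasks;
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.Connectors.OpenAI;

public record EntityProperty(string Name, string Type, int? MaxLength);
public record EntityModel(string Name, List<EntityProperty> Properties,
                          List<string> NavigationProperties,    // 1-n
                          List<string> NavigationConnections);  // n-n

public static class EntityExtractor
{
    public static async Task<EntityModel?> ExtractAsync(Kernel kernel, string requirementText)
    {
        // ResponseFormat = "json_object" assumes the connector exposes OpenAI's JSON mode;
        // if it isn't available, the prompt instruction alone usually suffices.
        var settings = new OpenAIPromptExecutionSettings { ResponseFormat = "json_object" };

        var result = await kernel.InvokePromptAsync(
            """
            Extract the domain entity described below as JSON with fields:
            name, properties (name, type, maxLength), navigationProperties, navigationConnections.
            Default string length is 100 unless stated otherwise.

            {{$input}}
            """,
            new KernelArguments(settings) { ["input"] = requirementText });

        return JsonSerializer.Deserialize<EntityModel>(result.ToString(),
            new JsonSerializerOptions { PropertyNameCaseInsensitive = true });
    }
}
```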
-
Hey Matthew, very nice project! Will the assistant package be available for Python?
-
Kudos to the Semantic Kernel team! Keep up the great work. I like that the kernel was made service agnostic in v1. Does this apply to the AssistantKernel, too? Is it designed with the same service-agnostic mindset, or does it serve as a specialized layer tailor-made for the OpenAI assistant?
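For illustration, this is what the v1 service-agnostic pattern looks like for chat completion: the downstream code is identical and only the connector registration changes (the Azure endpoint below is a placeholder). Whether the assistant work follows the same pattern is exactly the open question:

```csharp
// Illustration of the v1 service-agnostic pattern: calling code is identical,
// only the connector registration changes.
using Microsoft.SemanticKernel;

// OpenAI-backed kernel
var openAiKernel = Kernel.CreateBuilder()
    .AddOpenAIChatCompletion("gpt-4", Environment.GetEnvironmentVariable("OPENAI_API_KEY")!)
    .Build();

// Azure OpenAI-backed kernel: same downstream code, different connector
var azureKernel = Kernel.CreateBuilder()
    .AddAzureOpenAIChatCompletion(
        deploymentName: "gpt-4",
        endpoint: "https://my-resource.openai.azure.com/",   // placeholder endpoint
        apiKey: Environment.GetEnvironmentVariable("AZURE_OPENAI_KEY")!)
    .Build();

var answer = await openAiKernel.InvokePromptAsync("Summarize Semantic Kernel in one sentence.");
Console.WriteLine(answer);
```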
-
I was wondering whether the Assistants implementation currently has, or will in the future have, support for uploading files to assistants and threads (as opposed to templated assistants as discussed in a previous blog post). I could be wrong, but one reason this would be useful is that when using files instead of stuffing content into the instructions, OpenAI returns annotations that indicate which file it pulled its answer from via RAG. One of the reasons I wanted to try assistants is to lean on OpenAI's RAG implementation, and citations/annotations are important for grounding and user fact-checking. Also, file upload could handle non-text file types such as PDFs without the programmer needing to pre-parse them into text.
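In case it helps while that question is open: a sketch of the underlying REST call for uploading a file with purpose "assistants", done directly with HttpClient since it's unclear whether the experimental SK agents package surfaces this yet. The endpoint and field names follow the OpenAI Assistants beta at the time of writing and may change; the file name is a placeholder:

```csharp
// Sketch: upload a file with purpose "assistants" via the OpenAI REST API directly.
// Endpoint/field names follow the Assistants beta at the time of writing and may change.
using System;
using System.IO;
using System.Net.Http;
using System.Net.Http.Headers;

var http = new HttpClient { BaseAddress = new Uri("https://api.openai.com/") };
http.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue(
    "Bearer", Environment.GetEnvironmentVariable("OPENAI_API_KEY"));

using var form = new MultipartFormDataContent
{
    { new StringContent("assistants"), "purpose" },
    { new StreamContent(File.OpenRead("manual.pdf")), "file", "manual.pdf" }  // placeholder file
};

// The response JSON contains the file id ("file-...") to attach to an assistant or thread;
// annotations in later responses can then cite this file.
var response = await http.PostAsync("v1/files", form);
Console.WriteLine(await response.Content.ReadAsStringAsync());
```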
-
Regarding generalized, context-aware voice command for the Start menu, desktop OS, file manager, and apps: LLM preprocessing for synonym building, offline domain-specific NLP command and query, responsive hot words, fast mission-critical execution, voice and visual cues, two-way coms, and multiple hot words.
I'm not sure this is the right place for this topic, and I haven't had time to thoroughly evaluate Semantic Kernel, but I want to put it here because you are building something relevant to a use case I think is very important: it could revolutionize the way we command and query our PC apps and OSes, at least on desktop. My hand is half-paralyzed from computer overuse after twenty years working on bloated UI for Autodesk, custom game-level-design IDEs, and visualization tools for audio scientists and physicists, and when I ask GPT to clean up my dictation it strips out too many of the points I'm trying to make, so apologies if this is long. I've already gone through the Voice Access community feedback, but I'd really rather explain this to a senior product designer and hope consensus forms around it; it's not that hard. I'm telling you because Semantic Kernel might provide a lot of the infrastructure needed to implement it, or something similar. Sorry if this is TL;DR or off-topic; I'll start with what I think might have to go into Semantic Kernel, and the rest is detail about the use case I'm envisioning.
- Domain reduction for low-latency, offline voice recognition, as in Voice Access. Take the genuinely impressive real-time offline Voice Access feature in the Windows 11 Insider builds and make it context-aware, so you don't have to say everything exactly and don't have to switch between dictation and command modes; it should know. Hot words should also apply to completions when a command is only partially spoken. The current design mostly lays Voice Access over the existing UI with its teardrop labels and simulated clicks, but often you don't want to click anything; you just want to activate a function in the active application's domain.
- Preprocessing by LLMs to build synonym dictionaries after scraping the object models of File Explorer and every application's UI: settings, value enums, types, and whatever test/automation APIs are exposed out of process or via plugin, the same surfaces automated test systems already use.
- Hooks that let the system respond with natural voice, audio, or visuals, plus a user dossier of learned preferences. An assistant should honor things like "never speak to me out loud", "don't call me that name", "don't speak until spoken to until further notice", or "don't warn me unless I'm about to do something that could cause data loss or can't be undone; otherwise just do it".
One experiment I tried: I talked to ChatGPT-4 with DALL-E and OCR, and the best it could do was take Voice Access's context-unaware wrong guess at what I wanted plus a screenshot of the menu, and then tell me what I really meant. That worked, but as a post-process it is slow. I would love a way to hook in something that limits the guesses of voice-to-text while expanding synonyms, as a preprocess rather than a post-process. That experiment was geared more toward Whisper, which isn't really real-time completion or an offline DSP pipeline, so I'm not sure exactly where this belongs, but I'm running it by you.
Ideally this gives precise, mission-critical app command and also query. If I ask "is HDR on?", I shouldn't have to ask Bing where the HDR switches are and then click them; I don't have hands. It should just answer yes or no. If I've said "never speak to me out loud until further notice", it should show me the setting on my projector and take me there, and then I can say "turn HDR off". Another great example: "what's my WiFi password?" In Windows today that takes you through old, hard-to-find, duplicated settings pages; instead it could say "let me verify who you are", ask me to get closer to the camera, and then tell me. And note that an in-process plugin actually makes you less safe if it damages the process.
So what I'm proposing is a really generalized assistant that someone at Microsoft could build, even closed source, using all the pieces you already have. As soon as an app opens a file or a menu, the assistant starts doing completion against that context. NATO phonetic hot words (alpha, bravo, charlie, delta, foxtrot) fit well here, and Voice Access has started to integrate them; there are several reasons they fit: disabled people have often served in the armed services, the alphabet is standard on two-way radio, maritime, and international coms, and it is much less likely to be misinterpreted, so it can safely gate actions with consequences. Personalization should be conversational too: "from now on, whenever I say dev studio, use the 2022 preview version." After the initial LLM dictionary build, I should be able to adjust it by voice, and probably visualize it, but I can't see end users typing JSON files or coding any of this.
These assistants don't need to be numerous or perfect. I could picture a team of about five people at Microsoft nailing together a UI straight out of Star Trek: with everything Microsoft has already spent billions on, this would be a few months of work for five or six very good people, one or two data scientists, some DSP experts, and someone with deep knowledge of Windows automation, testing, COM interfacing, multiple languages, Silk, Clang, and Semantic Kernel.
The flow: I say "Windows, Start, launch dev studio"; it shows me a letter; I say "alpha" and it launches. Now I'm in a menu context, so the only valid commands are open, close, exit, and so on, and that short list is fed to the recognizer. Semantic Kernel could supply that list so the non-LLM component, the DSP or voice-to-text engine, can drastically limit its guess space. That uses far fewer cycles and leaves the system more responsive to interruption by multiple hot-word listeners, so you can say "no, scratch that" the moment something is misunderstood; as soon as you say something relevant it puts up a suggestion, you hit the hot word, and it acts immediately.
This can be greatly generalized. It doesn't have to be an in-process plugin; it could be a Semantic Kernel plugin, but it could also run separately, and the scraping only needs to be done once. It works for the OS and for the apps, and it effectively builds a model of every app the system has scraped. It could insert a small microphone icon and a one-line command field into existing apps, or encourage developers to add one, or float it near the active app roughly where the old ribbon's File menu used to be; it doesn't have to be done specially for every app, and it could work on macOS and Linux too, giving a near-universal UI. One reason there were so many plane crashes in World War II is that every airplane's levers behaved differently (flaps up on one was flaps down on another), leading to user error and unrecoverable crashes. With this, you could sit at a friend's Mac without worrying that a misplaced drag-and-drop you expected to merge would instead do a replace. The way the current feature is implemented is sad because it drives the mouse and clicks buttons; we want to talk directly to the application object model the UI is bound to.
Another thing that could be folded in is the code and feature search in Visual Studio 2022, which combines the discoverability of visual UIs with the directness of the command line, without having to remember all the switches. Unfortunately it has been split into two separate search buttons, and that's when I stopped using it; one extra hunt-and-peck or click is enough to make a feature not worth using. If a query matches both code and features, show both, put markers on them, and let me choose. If I don't have any text open it's obviously a feature rather than code, and the feature domain is much smaller than the space of all possible code matches.
Voice-driven, hands-free, mission-critical, context-aware: a large language model plus domain-specific scraping, UI text spying, UI indexing, and OCR on some UI. This isn't just for disabled people; it's for everybody. I've been submitting feedback, but I wanted to run it by people more technical who can think through the complexities and provide any hooks or APIs in this kernel if relevant. These aren't in-process plugins at all; they would operate as out-of-process test drivers, which I'd expect to be safer than loading lots of plugins into the process. File open, View, all the windows, the commands: anything reachable via automation or automated testing works. Going through the UI is unfortunate but not a big problem where there is no COM API, and C/Clang parameter types and the other scraping methods we see in Silk can cover the rest.
The LLM preprocessing can happen periodically, starting when the system is first set up, walking the existing UI and building the synonym and completion dictionaries. If it's successful, most people will stop caring how messy the UI is because they'll access features by saying what they want; if it isn't, they'll go back to the way they were, so there's no harm in it. Initial discovery can come from browsing the UI or from saying what you think you saw, with hot-word labels placed next to elements the way Voice Access already shows its little teardrops (1, 2, 3 or A, B, C, D), so you can say "alpha bravo". I think they did add this, but mostly for dictation, where you have to say "select this" and then start; sitting with users, they see a feature and just say it. When data is about to be uploaded there should be a short beep and a second to say "no, don't", and it should warn you if something is not safe for work or unusual for you. I'm not sure the mission-critical command execution I was hoping for is there yet.
Once the synonym and completion dictionaries are built as a preprocessing step by an LLM, the amazing low-latency offline Voice Access engine from the Windows 11 Insider builds has much less work to do at runtime: instead of many possibilities, the File menu has maybe five states (open, close, exit, and so on), and indicators appear as soon as you say anything that sounds like one of them. That makes the already low latency even lower, and the chance of a mistake or a non-response nearly nil. You should be able to keep talking and step all over yourself until you see what you want on screen; right now you have to stop talking and wait until it resets, or say "scratch that" or "escape", which is a mess you rage-quit. Even partially done, this would be a huge improvement.
On hardware: I'm using an SM58, a cardioid dynamic mic that tends to pick up only the voice and blocks out everything else very well. Either ship a really good dynamic mic, headset, or tie mic, but not a condenser, because condensers pick up high frequencies, and don't stack another digital band-pass filter on top when the voice-to-text is already applying convolution filtering at the DSP level; two filters intuitively seems like a bad idea. The UI for all this could just be a Windows logo with a microphone icon and a one-line command field, with some yellow circles that glow to indicate it's getting a signal.
It needs multiple hot words, and it has to work 100% of the time or not at all; I do not want to touch, or even always look at, my computer, and with context awareness there's no reason to. If I say "launch dev studio" and then go to File, there are five things on that menu; there is no way it can guess wrong about what I'm trying to say. The product feels as though it was designed without knowing that for every user interface element there is a property in the app's object model or DOM, accessible via COM, readable in English, typed, usually bound directly to the UI, and already used by testing rigs and plugins.
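Purely as a toy illustration of the "domain reduction" idea described above, and not any Semantic Kernel, Windows Voice Access, or UI Automation API: a hypothetical matcher that restricts recognition candidates to the commands exposed by the current UI context, expanded with synonyms an LLM could precompute offline.

```csharp
// Toy illustration of "domain reduction": restrict command matching to whatever the
// current UI context exposes, using a synonym table an LLM could precompute offline.
// Entirely hypothetical; not a Semantic Kernel, Voice Access, or UI Automation API.
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class ContextCommandMatcher
{
    // context -> canonical commands (e.g. scraped from an app's automation tree)
    private readonly Dictionary<string, string[]> _contextCommands = new()
    {
        ["FileMenu"] = new[] { "open", "close", "save", "export", "exit" }
    };

    // canonical command -> LLM-precomputed synonyms
    private readonly Dictionary<string, string[]> _synonyms = new()
    {
        ["exit"] = new[] { "quit", "leave", "shut it down" },
        ["open"] = new[] { "load", "bring up" }
    };

    // With only a handful of candidates per context, even a partial or noisy transcript
    // can be resolved with cheap string matching instead of a full-vocabulary model.
    public string? Resolve(string context, string heard)
    {
        if (!_contextCommands.TryGetValue(context, out var commands)) return null;
        return commands.FirstOrDefault(c =>
            heard.Contains(c, StringComparison.OrdinalIgnoreCase) ||
            (_synonyms.TryGetValue(c, out var alts) &&
             alts.Any(a => heard.Contains(a, StringComparison.OrdinalIgnoreCase))));
    }
}
```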
-
I am experimenting with the Experimental Agents package and can't seem to figure out how to upload a file in a particular user message. In my use case, the assistant should read and answer questions about a file uploaded in a user message (preferably not seeded into OpenAI storage for retrieval). How would I go about doing that? Even if I do upload the file into OpenAI storage for retrieval, how do I reference the file id in my user message? Basically mimicking this:
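Until per-message files are exposed by the package (if they aren't already), one hedged workaround is the raw REST call that adds a message with file ids to an existing thread. The `file_ids` field and the `OpenAI-Beta` header follow the Assistants v1 beta at the time of writing and may change; the ids below are placeholders from a prior upload and thread creation:

```csharp
// Sketch: add a user message referencing an already-uploaded file id to a thread
// via the OpenAI REST API directly. The "file_ids" field and beta header follow
// the Assistants v1 beta and may change.
using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Text.Json;

var http = new HttpClient { BaseAddress = new Uri("https://api.openai.com/") };
http.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue(
    "Bearer", Environment.GetEnvironmentVariable("OPENAI_API_KEY"));
http.DefaultRequestHeaders.Add("OpenAI-Beta", "assistants=v1");

var body = JsonSerializer.Serialize(new
{
    role = "user",
    content = "Please answer questions using the attached document.",
    file_ids = new[] { "file-abc123" }     // placeholder id from a prior /v1/files upload
});

string threadId = "thread_abc123";         // placeholder thread id
var response = await http.PostAsync(
    $"v1/threads/{threadId}/messages",
    new StringContent(body, Encoding.UTF8, "application/json"));

Console.WriteLine(await response.Content.ReadAsStringAsync());
```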
-
The threads must be severely injured (Ref: 3:47 in the video)
-
Share your thoughts about how you'd like to see GPTs integrated into Semantic Kernel.