MASSIVE Step Allowing AI Agents To Control Computers (macOS, Windows, Linux)

  • Published: 29 Jun 2024
  • OSWorld gives agents the ability to fully control computers, including macOS, Windows, and Linux. By giving agents a language for describing actions in a computer environment, OSWorld can benchmark agent performance like never before (see the action-space sketch after the links below).
    Try Deepchecks LLM Evaluation For Free: bit.ly/3SVtxLJ
    Be sure to check out Pinecone for all your Vector DB needs: www.pinecone.io/
    Join My Newsletter for Regular AI Updates 👇🏼
    www.matthewberman.com
    Need AI Consulting? 📈
    forwardfuture.ai/
    My Links 🔗
    👉🏻 Subscribe: / @matthew_berman
    👉🏻 Twitter: / matthewberman
    👉🏻 Discord: / discord
    👉🏻 Patreon: / matthewberman
    👉🏻 Instagram: / matthewberman_ai
    👉🏻 Threads: www.threads.net/@matthewberma...
    Media/Sponsorship Inquiries ✅
    bit.ly/44TC45V
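    For context, "a language to describe actions" here means the agent emits executable code as its actions; the OSWorld paper describes a pyautogui-based action space. A minimal sketch of a single action step, with made-up coordinates and a simplified exec-based harness (OSWorld's real harness runs such code inside a sandboxed VM):

      # Sketch: the model returns pyautogui code as a string; the
      # environment executes it. Coordinates/filenames are illustrative.
      import textwrap

      action = textwrap.dedent("""
          import pyautogui
          pyautogui.click(812, 530)   # click the button seen in the screenshot
          pyautogui.write("cat.png")  # type into the focused text field
      """)
      exec(action)  # OSWorld runs such code inside the VM, not on the host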

Comments • 284

  • @donrosenthal5864
    @donrosenthal5864 2 months ago +56

    OSWorld project video? Yes, please!!!

    • @reidelliot1972
      @reidelliot1972 2 months ago +4

      Yes, tutorial please! Please elaborate more on the relationship to CrewAI-like frameworks and potential implications for the rumored YAML endpoints!

    • @user-wz3qe3vw6h
      @user-wz3qe3vw6h 1 month ago +3

      @@reidelliot1972 Yes Matthew, pls!

  • @KCM25NJL
    @KCM25NJL 2 months ago +55

    It's great and all, but I kinda think one of two things will end up happening:
    1. An AI layer will become a standard for interoperability as part of the OSI and App Dev Stacks
    2. A whole new OS will be developed that serves this very purpose.
    I suspect we may start with 1 and end up with 2 in the longer term.

    • @theterminaldave
      @theterminaldave 2 months ago +5

      When I was helping to write test steps for an automated software-testing app, I basically had to open up the developer tools and get the name of the object that needed to be interacted with: the HTML code/name for a particular button, or a certain drop-down textbox.
      I don't understand the whole "lay a grid over the screen and guess the coordinates" approach. That's just the user interface; the computer uses all the code in the background. I don't get why the AI isn't navigating by looking at the underlying code for the page instead of the graphical output of the page. (See the Selenium sketch after this thread.)

    • @DaveEtchells
      @DaveEtchells 2 months ago

      @@theterminaldave Interesting point. I'd say, though, that the point is to have the AI interact with the UI based on what a human would see. On a related note, there have been tools for software regression testing dating back many years that let you interact with UI elements, but it was a PITA to write the scripts for them, and they were very fragile in that tiny changes could send them off the rails.

    • @Daniel-jm8we
      @Daniel-jm8we 2 months ago +1

      @@theterminaldave Would the AI always have access to the code?

    • @ich3601
      @ich3601 2 months ago

      @@Daniel-jm8we Almost. When using RPA tools you're scanning the HTML, the OS events, or the application events. It would be great if an AI would eat this stuff, because nowadays RPA tools are very sensitive to changes.

    • @theterminaldave
      @theterminaldave 2 months ago

      @@Daniel-jm8we Open any webpage, press F12, and click on the Inspector tab; that's the code I'm referring to.
      It's basically the code for the graphical interface, so yes, the AI would always have "access", because if you don't have access, it's not appearing on the page.
      After you open the inspector, click on any line and hit delete, and it will disappear from the page. If you hit refresh, it will come back.
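      The approach this thread describes, driving the page through its underlying code instead of guessing screen coordinates, is roughly what browser-automation tooling does today. A minimal Selenium sketch (the URL and element IDs are hypothetical, for illustration only):

        # Navigate by DOM identity, not by pixel positions.
        from selenium import webdriver
        from selenium.webdriver.common.by import By

        driver = webdriver.Firefox()
        driver.get("https://example.com/login")

        # The same identifiers a tester reads out of the inspector:
        driver.find_element(By.ID, "username").send_keys("demo")
        driver.find_element(By.CSS_SELECTOR, "button[type='submit']").click()
        driver.quit()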

  • @haroldpierre1726
    @haroldpierre1726 2 months ago +29

    It would be helpful to have a catalog of pre-built open source AI agents that can be easily downloaded and used for specific tasks. My brain shuts off trying to follow video tutorials on programming my own AI agent from scratch.

  • @jamesheller2707
    @jamesheller2707 2 months ago +67

    Please make more videos testing and running this yourself 🙏🏼, it'll be great

    • @reidelliot1972
      @reidelliot1972 2 months ago +1

      Yes, tutorial please! Please elaborate more on the relationship to CrewAI-like frameworks and potential implications for the rumored YAML endpoints!

  • @Carnivore69
    @Carnivore69 2 months ago +66

    User: What happens between the steps in these Ikea instructions?
    Agent: A fuckton of swearing!
    User: Test passed.

  • @ScottzPlaylists
    @ScottzPlaylists 2 months ago +15

    Yes Please 👍 Need lots of OSWorld Videos ❗❗❗
    We need a video-tutorial-watching AI that creates a training-set item for OSWorld on how to do X by watching a video on how to do X (and fills in missing details not shown). 🤯🤯🤯🤯❗❗❗❗

    • @AGIBreakout
      @AGIBreakout 2 months ago +8

      Great Idea!!!!

    • @CryptoMetalMoney
      @CryptoMetalMoney 2 months ago +7

      YT tutorial videos would be a huge ready-to-go dataset... Great idea

    • @CryptoMetalMoney
      @CryptoMetalMoney 2 months ago +5

      Continuous learning will be huge in the future, and using computers will be a big part of that.

    • @NWONewsGod
      @NWONewsGod 2 months ago +5

      YT is a treasure trove for more advanced forms of AI training, and even for training now.

  • @BlankBrain
    @BlankBrain 2 months ago +5

    The most difficult part of making something like OSWorld is security. When you open your OS to computer manipulation, it's a lot easier for computers to manipulate it.

  • @threepe0
    @threepe0 2 months ago +4

    Really look forward to your videos. You’ve helped me get the gist of developments as they come out and determine which technologies are useful and worth spending my time on, and which ones I am equipped to handle for my personal use-cases.
    I have and will continue to recommend your channel to friends and co-workers.
    Seriously man when I see your name, I click. Thank you for continuing to do what you do.

  • @alanhoeffler9629
    @alanhoeffler9629 2 months ago +2

    This was a good video showing what has to be done to make LLMs agentic using computer OSes. It showed me two things. The first was why autonomous cars are so hard to set up: the auto system has to not only know what the “rules of the road” are, what the automobile’s driving characteristics are, and how to make the car do what it needs to do, but it also has to correctly parse, at high speed, a situation it has never encountered before, decide what the correct action to take is, and pull off executing it in real time. The second is that a system that can do that well is way closer to AGI than any LLM.

  • @jimbo2112
    @jimbo2112 2 months ago +2

    Yes please! Tutorial on this would be great. I see agents as being a driving force behind vast amounts of commercial AI adoption. Companies want greater efficiency and agents are the tools to bring this.

  • @justjosh1400
    @justjosh1400 2 months ago +1

    Can't wait for the tutorial. Wanted to say thanks for the videos Matthew.

  • @reidelliot1972
    @reidelliot1972 2 months ago +1

    Yes, tutorial please! Please elaborate more on the relationship to CrewAI-like frameworks and potential implications for the rumored YAML endpoints!

  • @AhmedMagdy-ly3ng
    @AhmedMagdy-ly3ng 1 month ago +1

    I will be more than happy to see you testing it on real-world examples, not complex tasks but just everyday tasks, like summarizing a bunch of PDFs or doing research, and things like that.
    And I also need to say that I really appreciate your work ❤

  • @pvanukoff
    @pvanukoff 2 months ago +51

    It won't be long before we have Star Trek-style computers, where we just say "computer ... do x, y and z for me".

    • @theterminaldave
      @theterminaldave 2 months ago +2

      That's the goal. Agentic AI.

    • @ericspecullaas2841
      @ericspecullaas2841 2 months ago +2

      You can do that now. Although food replicators and holodecks are still far off.

    • @shooteru
      @shooteru 2 months ago +6

      Working on it, many of us

    • @JBulsa
      @JBulsa 2 months ago

      2 - 9 years

    • @tomaszzielinski4521
      @tomaszzielinski4521 2 months ago

      Who do you mean by "we"?

  • @ACTION-PLAY-SAFARI
    @ACTION-PLAY-SAFARI 2 months ago +2

    Always awesome and informative videos Matt, love it brother. And I feel that much smarter after watching them. Keep up the awesome work!

  • @iwatchyoutube9610
    @iwatchyoutube9610 2 months ago +5

    I was waiting for your own test the whole video. Git'r done son!

  • @marshallodom1388
    @marshallodom1388 2 months ago +7

    Computer! Computer?
    [Handed a mouse, he speaks into it]
    Hello, computer.
    The Dr. says just use the keyboard.
    Keyboard. How quaint.

  • @PhoebusG
    @PhoebusG 2 months ago +1

    Yes, def set it up; that would be a good video. Keep up the cool videos :)

  • @BThunder30
    @BThunder30 2 months ago +1

    This is amazing. I think you need a team to help you set it up fast. We want to see a demo!

  • @rupertllavore1731
    @rupertllavore1731 2 months ago

    NICE to see you getting brand deals! May your channel keep getting more brand deals!

  • @darwinboor1300
    @darwinboor1300 2 months ago

    Thanks Matt.
    The change-the-background task is like an Optimus real-world task. Using the mouse requires a collection of basic motion skills (e.g. move in XY, click right/left, scroll up/down, etc.). Moving and activating the mouse on a screen are simple subtasks necessary to build actual real-world tasks (on the PC, these basic skills, subtasks, and more can be accomplished using AutoHotkey). The reactive sequence of mouse subtasks (including motions) is the equivalent of FSD navigating from location A to B in the real world, or of Optimus stepping through a set of real-world subtasks to complete a real-world task. The advantage for a change-the-background task AI is the paucity of edge cases that make real-world tasks so difficult for Optimus and for FSD. All three AI systems need to evaluate the real-world changes they evoke before executing the next subtask. Optimus and FSD repeatedly face infinite real-world variations between subtasks. These variations are introduced by independent external agents (cars, animals, fallen trees, etc.). The change-the-background task AI will mostly face changes due to software upgrades and different starting states. Most computer issues can be resolved by deeper searches on the web. AutoHotkey can programmatically solve simple issues (hiding open windows). Having an AI to navigate the process would fundamentally change the ability to execute complex computer tasks based upon simple sequences of verbal commands.
    Here is an example: convert the most recent Matt Berman YouTube video to mp4, then extract unique screenshots to a PowerPoint file and the YouTube transcript without timestamps to a text file. The filename for each file is MB1.
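    A minimal sketch of that subtask decomposition in code, with pyautogui standing in for AutoHotkey (all coordinates are illustrative):

      import time
      import pyautogui

      def move_and_click(x, y, button="left"):
          """One basic motion skill: move in XY, then click."""
          pyautogui.moveTo(x, y, duration=0.3)
          pyautogui.click(button=button)
          time.sleep(0.5)  # let the UI settle before the next subtask

      move_and_click(960, 540, button="right")  # open the desktop context menu
      move_and_click(1010, 700)                 # choose "Change background..."
      pyautogui.scroll(-3)                      # scroll the wallpaper list
      move_and_click(400, 450)                  # pick a thumbnail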

  • @DefaultFlame
    @DefaultFlame 1 month ago +1

    Nice! I'd love to see you test it out.

  • @nqnam12345
    @nqnam12345 2 months ago +1

    Great! Pls more on this topic

  • @wardehaj
    @wardehaj 2 months ago

    Great explanation video. Thanks a lot!

  • @arinco3817
    @arinco3817 2 months ago

    This is really interesting. I've been thinking for ages about how to go from a vision LLM to action. It's a bit like someone sitting in front of your computer and describing what they want to happen.

  • @tigs9573
    @tigs9573 2 months ago

    Yes, I would like to learn more about OSWorld. Keep up the great content!

  • @EduardoJGaido
    @EduardoJGaido 2 months ago

    Great video!

  • @LauraMedinaGuzman
    @LauraMedinaGuzman 2 months ago

    Amazing! I want to try it for Revit, a software for architecture. Actually, I did try something that worked! However, I truly need more knowledge, so your help is very, very appreciated! Thanks!

  • @ThinkAI1st
    @ThinkAI1st 2 months ago

    You are a very good teacher…so keep teaching.

  • @scottwatschke4192
    @scottwatschke4192 2 months ago

    Very interesting. I would love a testing video.

  • @dilfill
    @dilfill 2 months ago

    Would love to see you test this out doing a few different tasks! Also curious if this could run someone's social media etc.

  • @2106chrissas
    @2106chrissas 2 months ago

    Great project!
    It would be interesting to have a video on RAG and the programs available for RAG (for example, H2OGPT).

  • @yugowatari2935
    @yugowatari2935 2 months ago

    Yes, please do a tutorial on OSWorld. Have been waiting for this for some time.

  • @CharlesFinneyAdventure
    @CharlesFinneyAdventure 2 months ago

    I would love to watch you set up OSWorld on your own machine, test it out, and create a tutorial from it.

  • @AGI-Bingo
    @AGI-Bingo 2 months ago +1

    A new golden age of open source is upon us ❤

  • @marcfruchtman9473
    @marcfruchtman9473 2 months ago

    Thanks for the video! Yes, this seems like it will be very useful.

  • @adtiamzon3663
    @adtiamzon3663 1 month ago

    Good start. Excellent. 🤫 🌞👏👏

  • @moses5407
    @moses5407 2 months ago

    Great presentation! Too bad the accuracy levels are currently so low but this seems to be a framework that can self-grade and, hopefully, self-adjust for improvement.

  • @yenielmercado5468
    @yenielmercado5468 2 months ago

    Excited for the coming Humane AI Pin agents feature.

  • @nangld
    @nangld 2 months ago +8

    A 20% success rate is a super impressive start. As soon as they iterate on that and train a proper model, it will reach 99%, leading to all office workers getting fired.

    • @andrada25m46
      @andrada25m46 2 months ago

      Yeah, prolly not.
      I use AI at work; I’m one of the few who do. A lot of data is confidential and extra security measures are needed; something like this breaches contractual agreements, since the AI provider would have access to the data.
      Not to mention proprietary apps running in containers, which the AI wouldn’t be able to navigate.

    • @marcussturup1314
      @marcussturup1314 2 months ago +5

      @@andrada25m46 Local LLMs could fix the data-access issue. (See the sketch after this thread.)

    • @WolfeByteLabs
      @WolfeByteLabs 2 months ago +1

      This.

    • @stefano94103
      @stefano94103 2 months ago

      @@andrada25m46 All of the big players (Microsoft, IBM, Google) have enterprise software that is data-privacy compliant. The price varies with the solution. The only problem with the enterprise LLMs is that they do not move at the speed of other models, for obvious reasons. But open source or enterprise is the way to go if your company has compliance requirements.

    • @greenleaf44
      @greenleaf44 2 months ago +1

      ​@@marcussturup1314 I feel like people underestimate how possible it is for large businesses to run their own inference
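      A minimal sketch of that local-inference idea: confidential text never leaves the machine if the model is served locally. This assumes Ollama running on its default localhost port, with the model already pulled (the model name is just an example):

        import json
        import urllib.request

        # Ollama's local REST endpoint; no data leaves this machine.
        payload = {"model": "llama3",
                   "prompt": "Summarize this confidential report: ...",
                   "stream": False}
        req = urllib.request.Request(
            "http://localhost:11434/api/generate",
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            print(json.loads(resp.read())["response"])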

  • @alpineparrot1057
    @alpineparrot1057 2 months ago

    I enjoy your content Matt. You put me on to LM Studio, then Ollama, then CrewAI. CrewAI has excellent use cases, so thank you so much. Could you please do some more stuff with CrewAI? (I have mine set up in the one-file approach, but am not too sure how to set it up with multiple files and calling to and from; I'm not too familiar with Python. ChatGPT is an excellent help, but it still only goes so far.)

  • @gotemlearning
    @gotemlearning 1 month ago

    great vid!

  • @monnef
    @monnef 2 months ago

    Very nice project. I would find it interesting to see success rates on different OSes (or, in the case of Linux, even different DEs/WMs). Also GUI vs CLI: I can imagine that on some tasks the CLI would be king, while on others it could fail miserably. Still, it could be useful to see for which use cases different OSes, or GUI vs CLI, are better and might be worth utilizing an AI for.

  • @OSWALD569
    @OSWALD569 2 months ago

    For performing actions on desktops, a macro recorder is available and suitable.

  • @gatesv1326
    @gatesv1326 2 months ago

    Very similar to RPA (Robotic Process Automation), which I’ve been developing for 10 years now. Nothing new, but being able to do this with a typed or vocal prompt is what’s going to be interesting when it gets as good as a human can do (which is what RPA has been successful at for a long time), also bearing in mind that RPA licences are expensive.

  • @joe_limon
    @joe_limon 2 months ago +11

    How long until I have a locally run agentic system that can install all future improved agentic systems and/or GitHub projects autonomously?

    • @fullcrum2089
      @fullcrum2089 2 months ago +2

      With this, a person's ideas, dreams and personalities can become immortal.

    • @nickdisney3D
      @nickdisney3D 2 months ago

      I'd share my repo, but I think YouTube comments delete it automatically.

    • @electiangelus
      @electiangelus 2 months ago

      Already there. I'm actually past this.

    • @fullcrum2089
      @fullcrum2089 2 months ago

      @@nickdisney3D Yes, I can't see it; just share the repo path/name.

    • @electiangelus
      @electiangelus 2 months ago

      @@fullcrum2089 Apotheosis was thinking that 6 months ago.

  • @ThomasEWalker
    @ThomasEWalker 2 months ago

    Cool - This is moving SO fast! I think we will get AIs with the ability to recognize what is on the screen more directly, much like a self-driving car sees the world. This would become 'go click the button that does X', without screenshots. I bet that happens this year. Real world agents with AGI for a Christmas present!

  • @christopheboucher127
    @christopheboucher127 2 months ago

    Of course we want to see more about that ;) thx 4 all

  • @Treewun2
    @Treewun2 2 months ago

    Please do a series on Fine Tuning open source models!

  • @settlece
    @settlece 2 months ago

    I would definitely like to see more OSWorld.
    Thanks for bringing this exciting news to us!

  • @dreamphoenix
    @dreamphoenix 2 months ago

    Awesome, thank you.

  • @beckettrj
    @beckettrj 2 months ago

    OSWorld project videos please! This could be a series of videos.
    I could see this helping me do my job five times faster! A helpdesk support tool to check and update an XYZ application user account, then email the user letting them know we have updated their account and that they should be able to log in. Complicated processes, such as opening a VPN connection and checking Active Directory account settings, then logging into administrative program(s) to search for and open the user's account to check their settings. The user account settings in Active Directory must match the user login settings in the application(s). Email the findings and let them know what was altered or changed, etc.

  • @BelaKomoroczy
    @BelaKomoroczy 2 months ago

    Yes, test it out, go deeper, it is a very interesting project!

  • @galaxymariosuper
    @galaxymariosuper 2 months ago

    16:40 Think of temperature as maneuverability: the higher it is, the more flexible the system, which is basically a closed-loop control system at this point.

  • @buggi666
    @buggi666 1 month ago

    Soooo we basically arrived at reinforcement learning using LLMs? That sounds so awesome!

  • @ma77yg
    @ma77yg 2 months ago

    Would be interesting to have a tutorial on this setup.

  • @systemlord001
    @systemlord001 2 months ago

    I think temp is set to 1 because if it fails and makes another attempt, it will take different approaches. When temp is set to lower values, it might not get to a working solution because the attempted methods are not divergent enough to contain a valid solution.
    But I think having an LLM fine-tuned on datasets generated by humans in the format of OSWorld (the tree, screenshots, etc.) could improve the success rate.
    If I am not mistaken, this is what Rabbit R1 was doing. It's basically teach mode, but with more examples than just the one you give it.

  • @francoislanctot2423
    @francoislanctot2423 2 months ago

    Thanks! Yes, please install it and show us the procedure. I think it is going to be useful for a lot of people.

  • @ayreonate
    @ayreonate 2 months ago

    I think they set the temp @ 1.0 to test how hard it will hallucinate if given more creative freedom, then added it to the presentation just to show off

  • @Maisonier
    @Maisonier 2 months ago

    This is great! I'm going to wait for a Linux distro that has these agents built-in to automatically configure Wi-Fi, printers, drivers, or even VMs with Windows (for specific programs that don't work in Wine).

  • @ScottSummerill
    @ScottSummerill 2 months ago

    Actually your video, specifically the table, convinced me that agents, at least in this iteration, are not all that spectacular. They will likely get there, but right now it's a lot of hype.

  • @youjirogaming1m4daysago
    @youjirogaming1m4daysago 2 months ago

    Taking a screenshot and guessing is an impractical implementation. For desktop agents to truly work, we would have to create new APIs that directly alter the desktop state, and the best operating system to do this on right now is Linux; but if macOS and Windows also provide them, I think it is possible for agents to make a significant impact.
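    On Linux, pieces of such a direct API already exist. For example, on a GNOME desktop the wallpaper can be set through gsettings instead of pixel-level clicking (a sketch; the image path is illustrative):

      import subprocess

      # Alter desktop state through a settings API instead of the GUI.
      subprocess.run(
          ["gsettings", "set", "org.gnome.desktop.background",
           "picture-uri", "file:///home/user/Pictures/cat.png"],
          check=True)  # raises if the schema/key is unavailable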

  • @ThomasTomiczek
    @ThomasTomiczek 1 month ago

    I think a lot of the current problems are training: if GPT-5 is trained on videos from YouTube, and that includes a lot of videos of people USING THE COMPUTER, the AI may be more prepared for this.

  • @jamalnuh8565
    @jamalnuh8565 1 month ago

    Always update us like this, especially on the new research papers.

  • @DailyTuna
    @DailyTuna 2 months ago

    I think as this evolves, it's time for somebody to create a Linux system that would work directly with this. You need an operating system catering directly to the agents.

  • @MeinDeutschkurs
    @MeinDeutschkurs 2 months ago +1

    Temperature of 0.1 could lead to “I cannot click, I’m just an LLM.”

  • @DonDeCaire
    @DonDeCaire 2 months ago

    This is why simulated data is so important: if you can replicate REAL-world environments, you can test an infinite number of environmental conditions an infinite number of times.

  • @DamielBE
    @DamielBE 2 months ago

    hopefully one day we'll get agents like the Muses in Eclipse Phase or the Alt-Me in Peter F Hamilton's Salvation trilogy

  • @Justin-1111
    @Justin-1111 2 months ago

    Let's see it!

  • @dafunkyzee
    @dafunkyzee 1 month ago +1

    I strongly feel this is completely the wrong way of going about using agents. I respect that the project is basically "use what we've got": we have Windows, we have macOS... so now we want an agent to figure out how to use these interfaces to get things done. But that idea is wrong, because the OS is designed as a human interface to the machine. I'm working on an AI-based OS where the agents work directly with the kernel to get shnizz done. Still, hats off to the team for trying this round of experimentation to see the limits and capabilities of agents in their current form.

  • @waqaskhan-uw3pf
    @waqaskhan-uw3pf 2 months ago +1

    Please make a video about Romo AI (super AI tools in one place) and Learnex AI (world's first fully AI-powered education platform). My favorite AI tools.

  • @mshonle
    @mshonle 2 months ago

    16:38 It depends on the specific formula used for the temperature setting, so a 1 here is by no means the maximum. The use of top-p implies nucleus sampling is being used, which prevents the most improbable completions from even being considered. They are looking for a wider sampling to establish a baseline, and setting the temperature too low would create more repetitive results (repeats across different runs, and also repeating the same phrase in a single run until the context is full) and thus would be too easy to dismiss as a strawman.
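    For readers unfamiliar with the knobs being discussed, a toy sketch of temperature scaling followed by top-p (nucleus) filtering over a made-up next-token distribution:

      import numpy as np

      def sample(logits, temperature=1.0, top_p=0.9, rng=None):
          rng = rng or np.random.default_rng()
          # Temperature rescales logits: <1 sharpens, >1 flattens.
          probs = np.exp(logits / temperature)
          probs /= probs.sum()
          # Nucleus filter: keep the smallest set of tokens whose
          # cumulative probability reaches top_p; drop the long tail.
          order = np.argsort(probs)[::-1]
          cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
          keep = order[:cutoff]
          return rng.choice(keep, p=probs[keep] / probs[keep].sum())

      print(sample(np.array([2.0, 1.0, 0.5, -1.0])))  # toy logits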

  • @timduck8506
    @timduck8506 2 months ago

    Are we able to program new actions, or create new connections, like what we can already do with macros?

  • @ayreonate
    @ayreonate 2 months ago

    Maybe the LLMs are vastly better at the daily and professional tasks because that's what's widely available online, aka their training data, while workflow-based tasks don't have that many resources. Case in point: the example they used (viewing photos of receipts and logging them in a spreadsheet) won't have the same amount of online resources as daily or professional tasks.

  • @davidhoracek6758
    @davidhoracek6758 2 months ago

    This only needs to work once and you basically have the universal installer. Soon you'll just tell a computer "make the latest stablediffusion (or whatever) work on my computer, including all the hardware-specific optimizations that apply to my specific system". Then it just needs to bootstrap in the newest interaction AI for my OS, have a little conversation with the system, try promising settings and, if they fail, come up with others, and (importantly) update the weights of the remote installer system based on the successes and errors of this particular interaction.

  • @luxaeterna00
    @luxaeterna00 1 month ago

    Any link to the presentation? Thanks!

  • @xxxxxx89xxxx30
    @xxxxxx89xxxx30 2 months ago

    Interesting take, but again, trying to go too general. I am curious whether there is a team working on a real "AI OS": not using screenshots and these half-solutions, but actually having predefined built-in functions that control the device through code and track the progress in the same way, to do the "grounding" step.

  • @camilordofficial
    @camilordofficial 2 months ago

    This video was great, thanks. Could this work with IoT-like devices?

  • @DaveEtchells
    @DaveEtchells 2 months ago

    I guess this is interesting, but I don’t understand why I should be so excited about it over Open Interpreter.
    The need to have predefined accessibility for the apps seems very limiting and a purely transitional step.
    In the relatively near term, AI agents will just interact directly with UI elements, figuring out what they need to do based on what they see on the screen. In the case of mainstream apps, they’ll know the general operation from their training, so will have little to deduce in specific instances, just as you can ask ChatGPT how to do things in Excel, etc.
    Longer term there may be direct hooks for AIs built in, but I don’t know to what extent that’ll make sense, as inference costs plummet.

  • @minissoft
    @minissoft 2 months ago

    Please do a test. Thanks!

  • @tanuj.mp4
    @tanuj.mp4 2 months ago +1

    Please create an OSWorld Tutorial

  • @ktolis
    @ktolis 2 months ago

    Will be interesting to see ReALM getting benchmarked.

  • @andreluistomaz3930
    @andreluistomaz3930 2 months ago

    Ty!

  • @interchainme
    @interchainme 2 months ago

    Feels like a talk from the future o_O

  • @AetherTunes
    @AetherTunes 2 months ago

    I've always wondered if you could incorporate vision for LLMs into something like ShadowPlay.

  • @roharbaconmoo
    @roharbaconmoo 2 months ago

    Does anything change for your video with their addition of memory sharing?

  • @Copa20777
    @Copa20777 2 months ago +1

    Thank you for your journalism Matthew.. we ❤ you bro, from Africa. Blessed Sunday everyone.. sipping my coffee on this one

  • @kevinehsani3358
    @kevinehsani3358 2 months ago

    Can a multimodal model scroll up or down on a screen and see more than just what is displayed? Can it actually read the text in a cmd terminal and then act on it, instead of us copying and pasting the reply into an input context?

  • @oratilemoagi9764
    @oratilemoagi9764 2 months ago +1

    Did you see Apple's new open-source LLM?

  • @user-lb5cp5mw4u
    @user-lb5cp5mw4u 2 months ago

    Often, restricting a model to output code only reduces the accuracy, especially on complex tasks. It's worth trying to allow it to print a chain of thought (even better if there is a self-critical inner-dialogue loop) and then output the final code piece.
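    A minimal sketch of that idea: prompt the model to reason first and end with one fenced code block, then have the harness extract only that block (the prompt wording is invented for illustration):

      import re

      PROMPT = ("Think step by step, critique your own plan, "
                "then finish with ONE fenced python code block.")

      def extract_final_code(reply: str) -> str:
          # Keep the chain of thought for the model, discard it for the
          # executor: take only the last fenced block in the reply.
          blocks = re.findall(r"```(?:python)?\n(.*?)```", reply, re.DOTALL)
          if not blocks:
              raise ValueError("model produced no code block")
          return blocks[-1]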

  • @japneetsingh5015
    @japneetsingh5015 2 months ago

    I am already waiting for a Linux where I could enter commands in natural language and the LLM generates a set of possible true commands, and I just have to choose one or make a minor change.

  • @cmelgarejo
    @cmelgarejo 2 months ago

    MASSIVE agents, noice

  • @canadiannomad2330
    @canadiannomad2330 2 months ago

    In Linux there is the X server. I've been thinking it would be neat to plug a system into the X server backend and have an LLM communicate with that directly... It somewhat bypasses most visual interpretation, except what is actually rendered as graphics.
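    A minimal sketch of that idea using python-xlib: read the window tree straight from the X server, so a model gets structured window names and geometry instead of raw pixels (assumes a running X session and the python-xlib package):

      from Xlib import display

      d = display.Display()
      root = d.screen().root

      # Structured state an LLM could consume instead of a screenshot:
      # window name plus on-screen geometry for each top-level window.
      for win in root.query_tree().children:
          name = win.get_wm_name()
          if not name:
              continue
          g = win.get_geometry()
          print(f"{name!r} at ({g.x}, {g.y}) size {g.width}x{g.height}")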

  • @mikezooper
    @mikezooper 2 months ago

    Copilot on Windows already allows control of the OS. For example, you can ask it to switch to night mode and it will.

    • @slomnim
      @slomnim 2 months ago

      That's pretty simple compared to where this project is going. Maybe soon, yeah, Microsoft will have Copilot do some of this stuff, but so far this seems like the first real attempt.

  • @WilsonCely
    @WilsonCely 2 months ago

    Please do it! A tutorial for OSWorld!

  • @paketisa4330
    @paketisa4330 2 months ago

    Considering a project where a person documents daily experiences, thoughts, feelings, and personal history in a diary specifically for a future AGI's learning: do you think such a personalised dataset could enhance an AGI's ability to understand and interact with individuals on a deeper level? And lastly, is it feasible to expect an AGI to become a close, personal companion based on this method, or would it somehow be redundant, useless data? Thank you for the answer.

  • @sergedeh
    @sergedeh 2 months ago

    The next level is an AI as the gateway to the whole OS.
    I am working on it with AndyAi.
    Using the mouse to get the AI to control the system is really the hardest way to do it...

  • @Daniel-jm8we
    @Daniel-jm8we 2 months ago

    It's more advanced than the starship Enterprise. They have to use people to push buttons.

  • @SimenStaabyKnudsen
    @SimenStaabyKnudsen 2 months ago

    Yes! Make a tutorial of it! :D

  • @johnkintree763
    @johnkintree763 2 months ago

    I want the digital agent in my phone to download my monthly invoice from the electric utility, and merge that and other data I want recorded publicly into a decentralized graph representation, maintained in collaboration with digital agents running on other personal devices, to create a shared world model for planning collective action.