Thanks for the good content.
Answering your question about whether I agree or not:
The real problem is not in open source Small Large Language Models (SLLMs) like Phi or Llama, etc.
The real problem is in the people behind building such SLLMs, i.e. the way they think and the way they create such SLLMs. They want an SLLM that can make everything and do anything (a multi-task SLLM), and the result is a poor, weak SLLM, as you see.
What they should try to do is build an SLLM that specializes in a single task.
1- They should not try to make an SLLM that can talk more than one language (for example, either English or German).
2- They must concentrate on training such a single-lang-SLLM on reasoning using that single language.
3- They should make such a single-lang-SLLM specialized in a single task or single domain of tasks. For example:
Imagine an SLLM trained only in the English language and only in programming using Python and only Python. I expect that such an SLLM will be a super Python programmer that may be better than, or at least equal in Python programming capability to, closed source huge LLMs.
Specialization leads to creativity. Even among humans, you cannot find one who is good at medicine and engineering at the same time. You cannot find a person who can say "my native languages are English and German".
Reaching such a level of good open source SLLM is a must for privacy concerns and for reducing the real cost of running LLMs.
For multi-tasks, we can use agent frameworks that use many SLLMs to do multi-domain tasks (query decomposition + router + plan-and-execute).
AI and LLMS are going to control our life, we should not give such companies all the power to control our life.
Again, thanks for the good content.🌹🌹🌹
I really like the idea of routing to a lot of small different, specialized models and agree that this might be the future. "AI and LLMS are going to control our life, we should not give such companies all the power to control our life." -> well, at the end the Open source models also come from Meta, Microsoft etc. I have not heard of any "crowd funded" model yet
I agree with the sentiment in general… but it’ll still cost a lot to serve all these LLMs in a production grade setup, to the point that it isn’t worth it. Plus you will probably end up sending data to the cloud (which is simply putting your trust in another provider).
@@codingcrashcourses8533 When saying Open Source SLLM, I mean Fully Open Source (Even the weights). This should be controlled by laws (I think you know the EU law concerning the browser COOKIES). True Open Source == We do not care from where it came, since if the devil inside the SLLM builder moves we can then easily avoid it and have good acceptable alternatives. We need to reach what I call "Safe Private General Personal AI"
@@bsarel To run SLLMs you do not need high resources: (The Router): Every SLLM will work alone and will consume around 8~10 GB of RAM/VRAM which is available on consumer devices. For personal usage, one does not need to run multiple SLLMS simultaneously (sequential running can be applied very easily). The general term used for this in the meantime is, as I remember, "On Edge".
Do we know how big gpt 4o-mini is though? For all we know it could be 100b, that would still be much smaller than 4o. We have no idea if they're even operating 4o mini at a profit or if they're running on investor money and waiting for the competition to disappear. OpenAI has never published the size or compared it to a model with known size
Yes, we don't know. But most (if not all) open source models are trained by companies who sit on tons of cash, so in my opinion there is a difference here in comparison to, let's say, Amazon and small online stores.
I developed a system that can process a prompt multiple times. That way, if the local large language model gets the code wrong, it can easily be run again. From there, it is a simple process of saving "good" code and rejecting "bad" code. My point is that like humans, code generating LLMs do not need to be correct on the first try. Also, I am using an open source local LLM tuned for writing code (like Deepseek Coder). Moreover, a lot can be said for practicing coding as a prompt engineer. It is possible to draft pseudocode so that it is easy for the LLM to get the code right. And, I must add that using OOP may not be the best solution when working with LLMs. The benefits of writing OOP are diminished when using an LLM as a development tool.
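The generate, check, retry loop described in this comment can be sketched roughly as follows. This is a minimal illustration, not the commenter's actual system: `generate_until_good`, `is_acceptable`, `check_add`, and the stubbed `fake_llm` are hypothetical names, and the acceptance check (parse, execute, run one assertion) is an assumed stand-in for whatever quality gates a real setup would use.

```python
import ast
from typing import Callable, Optional


def is_acceptable(code: str, check: Callable[[dict], bool]) -> bool:
    """Accept code only if it parses, executes, and passes the supplied check."""
    try:
        ast.parse(code)            # reject syntactically invalid drafts
        namespace: dict = {}
        exec(code, namespace)      # NOTE: untrusted code belongs in a sandbox
        return bool(check(namespace))
    except Exception:
        return False


def generate_until_good(
    generate: Callable[[str, int], str],
    prompt: str,
    check: Callable[[dict], bool],
    max_attempts: int = 5,
) -> Optional[str]:
    """Re-run the same prompt until a draft passes, or give up."""
    for attempt in range(max_attempts):
        candidate = generate(prompt, attempt)
        if is_acceptable(candidate, check):
            return candidate       # save the "good" code
    return None                    # every draft was rejected


def fake_llm(prompt: str, attempt: int) -> str:
    """Hypothetical stand-in for a local model: two bad drafts, then a good one."""
    drafts = [
        "def add(a, b) return a + b",          # syntax error
        "def add(a, b):\n    return a - b",    # runs, but wrong result
        "def add(a, b):\n    return a + b",    # correct
    ]
    return drafts[min(attempt, len(drafts) - 1)]


def check_add(namespace: dict) -> bool:
    """Assumed acceptance test for the prompt 'write add'."""
    return namespace["add"](2, 3) == 5


result = generate_until_good(fake_llm, "write add", check_add)
print("accepted" if result is not None else "all drafts rejected")
```

The design choice here is that the loop never trusts a single draft: a draft is kept only when an automated check passes, so the quality of the result depends on how strong that check is.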
@@clray123 Please review software quality assurance. 1. How do you handle quality assurance for code written by humans? 2. Can those same techniques and tools be used to do quality assurance for code generated by the LLM? 3. Are you able to list various automated tools and techniques for validating code? If you replace the LLM with a junior developer in your mind, I am confident that you'll quickly see how to write the procedures needed to detect the difference between good and bad code. Of course, you could take the position that quality assurance tools and techniques are ineffective or do not exist. Human devs often overlook the rest of reality (which includes hundreds of quality assurance tools and techniques) because they only think about what LLMs can do.
@@caLLLendar 1. Primarily by reviewing code to ensure the algorithms behave as specified. Sometimes by formal verification. The cherry on top is end-to-end tests. 2. No. Except for tests, which are the weakest and least effective. 3. Yes. None of which apply to verifying the behavior of AI models. There is an entire new field of research called interpretability, with few useful revelations when it comes to testing neural networks. I suspect you work in a domain in which shallow testing is the only available tool and where bugs don't matter much, maybe frontend development.
@@clray123 No. I can see that I was not clear enough and you're having a difficult time understanding what I mean. I appreciate the discussion here. Hopefully, it will help me communicate better in the future. As I mentioned in my previous comment, "Human devs often overlook the rest of reality (which includes hundreds of quality assurance tools and techniques) because they only think about what LLMs can do." I can see that you're still stuck on the LLM. Just replace the LLM with a junior developer. I'm not sure why you believe "tests are the weakest and least effective". Can you elaborate on that? After reading your comment a few times, I am guessing that you're thinking I am developing the LLMs or neural networks. I'm not. My domains are fintech and proptech (but lately I spend most of my time developing an automated system related to quality assurance). In these domains, precision and accuracy are very important. I'd love to figure out where the miscommunication is. Nonetheless, the bottom line is that the position you seem to be taking is that you are unable to use various tools and techniques to do "high quality" quality assurance. If a junior dev or LLM writes code, you're primarily "reviewing code" rather than primarily relying on automation. In contrast, I am implementing hundreds of tools and techniques related to quality assurance. Of course, the tools I am referring to have been written by thousands of senior developers, and competitors charge thousands of dollars for the service. Are you able to develop such a system? I hope so (since several systems exist). Do you believe that you are skilled enough to assure the quality of code . . . automatically? If not, what specific things does the software need to do that you would find too difficult to test?
@@caLLLendar Oh, I see, you suggest an LLM as a replacement for a clueless junior dev whose code must be reviewed just as stringently. The problem with this reasoning is that you do not hire a junior dev to do work that you also have to bother a senior dev to review. Instead, you let the senior dev write (and review) correct code. The only tasks which you hire junior devs for are the kind of low stakes tasks that I mentioned previously. But they are not really the pain point. First, there's plenty of junior devs to choose from. Second, those junior devs, unlike a faulty uninterpretable LLM, will one day become senior devs, so hiring them is actually a future investment (if you manage to keep them working for your company). Your mention of "hundreds of tools and techniques related to quality assurance" sounds like bullshit. After 20+ years in business I happen to know how business critical enterprise software dev works, and trust me, it is not done with junior devs compensated for by hundreds of magical tools that somehow assure correctness automatically. Primarily, it is done by hiring a few people who have a clue about and can explain to each other what they are doing (unlike junior devs and LLMs). The most critical algorithms are verified using model checking, but this primarily works in embedded systems where the state space to automatically analyze is constrained and where little IO or dependencies on external, uncontrolled code or hardware resources exist. The rest is primarily quality-assured by expert reviews, but you don't want to waste your expert review time on finding trivial bugs or on educating the juniors (that is what schools and training on the job are for; but here again the difference is that you can reasonably train a junior dev while you can't train an LLM with equally predictable results). So no, we will not have our senior devs spend time on reviewing LLM bugs; we just won't use LLMs for such tasks, just like we don't use junior devs at present.
Are the actual concerns related to data breaches (lack of security) or data privacy (not wanting to share the data with the host provider, e.g. OpenAI)?
PS: I have purchased all your courses and I'm wondering what the latest/best method is for managing our PGVector database:
- Is the SQLRecordManager (indexing API) still the preferred method for production, or should we write our own manager?
- The community_postgres is deprecated and ported to langchain_postgres, but it requires psycopg3, which so far works fine. Any tips and tricks are appreciated!
Cheers man,
First of all, thank you very much for being such a loyal supporter :). The issues are related to both, but there are more regulatory rules that force us to work that way. Otherwise BaFin will make us pay millions. Regarding SQLRecordManager: The new and better way seems to be UPSERT, which is not yet implemented for a lot of vectorstores. The issue with psycopg3 is not that it might not work, but probably the migration from 2 to 3. I did not try it yet, since currently it still works and priorities are different.
@@codingcrashcourses8533 Appreciated. The more I learn, the more questions, ha! I have a specific question regarding multi-tenancy: We're aiming to support multiple tenants, B2B SaaS, efficiently while ensuring proper data isolation. I'm considering two main approaches for implementing multi-tenancy:
- Using a separate collection (PGVector collection_name) for each tenant
- Using a shared collection with tenant_id in the metadata and filtering queries
I'm leaning towards the metadata approach for its scalability and management benefits, but I have some concerns:
a) Will filtering by tenant_id in metadata provide sufficient performance as the dataset grows?
b) Are there any security implications I should be aware of when using shared collections?
c) How might this choice impact future scalability, especially if we need to support hundreds or thousands of tenants?
Additionally, do you see any significant advantages to using both approaches simultaneously (separate collections AND tenant_id in metadata)?
Given our B2B context and potential for growth, which approach would you recommend, and are there any best practices or pitfalls I should keep in mind when implementing this with PGVector?
PS: Any tips and tricks for B2B/B2C deployment would be awesome.
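To make the shared-collection option from this question concrete, here is a minimal in-memory sketch of the idea: every read and write goes through a wrapper that pins the tenant_id, so application code cannot forget the filter. It deliberately does not use the PGVector or langchain_postgres API; `Doc`, `TenantScopedStore`, and the keyword match standing in for similarity search are all hypothetical. A real store would pass the same tenant_id condition into its metadata filter.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Doc:
    text: str
    metadata: dict = field(default_factory=dict)


class TenantScopedStore:
    """Hypothetical wrapper around one shared collection, pinned to one tenant."""

    def __init__(self, shared_docs: list, tenant_id: str):
        self._docs = shared_docs        # the shared collection (stand-in: a list)
        self._tenant_id = tenant_id

    def add(self, text: str, metadata: Optional[dict] = None) -> None:
        meta = dict(metadata or {})
        # Always overwrite tenant_id so a caller cannot write into another tenant.
        meta["tenant_id"] = self._tenant_id
        self._docs.append(Doc(text, meta))

    def search(self, keyword: str) -> list:
        # Keyword match is a placeholder for vector similarity search;
        # the tenant_id condition is the part that matters here.
        return [
            d for d in self._docs
            if d.metadata.get("tenant_id") == self._tenant_id
            and keyword.lower() in d.text.lower()
        ]


shared = []
acme = TenantScopedStore(shared, "acme")
globex = TenantScopedStore(shared, "globex")
acme.add("Invoice policy for Acme")
globex.add("Invoice policy for Globex")
print([d.text for d in acme.search("invoice")])
```

With this shape, the isolation guarantee lives in one place (the wrapper) rather than in every query, which is the main security concern with shared collections raised in point b).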
Quite obviously. But the point is that there are no large open source models which could reasonably compete, the reason being the economies of scale (many users offsetting hardware costs by sharing the same instance of the model).
8:30 I have a question about that. How is this supposed to protect the privacy of my data? The graphic looks like you're trying to sell me TLS as a solution to the problem. You're not trying to sell me TLS as a solution to the problem, are you?
No, TLS is not enough. One requirement is that the service is not accessible from the internet and, more importantly, that the servers are hosted inside the EU. Sending this kind of data to US servers is not allowed.
@@codingcrashcourses8533 So that is only related to your specific case with financial data? For me private data means something else. If I don't want OpenAI to see my private data, then I don't want any provider in the EU to see it either. In your video, it sounded to me as if your conclusion is that you don't need a local model (means a model that runs on hardware that I control) for private data. This is obviously wrong. But to be honest I don't fully understand what you are showing. What is webapp-1? Edit: Ah, now I see the headline. So you still are using OpenAI but use something that is nothing more than a VPN. Sorry, but that has nothing to do with privacy.
A compelling reason to try open source models is that sooner or later, the prices of OpenAI will have to increase, and increase drastically. This is almost the freemium phase of OpenAI as a company. But what do I know; at this time in history, being a meme company is a viable path. I don't really know what OpenAI is anyway; it's not a traditional C corp or LLC. And not really a non-profit.
Hi Markus quick question. Have you tested any opensource models that are finetuned for coding? What is your take on that and do you have any recommendations? Thanks
@@codingcrashcourses8533 No I don't. They are out there so I figured you might know. I'm going to use your test on some of them and see how that turns out..
@@codingcrashcourses8533 I don't recommend open source models for coding at all 💀 I ran deepseek coder and it is very underwhelming. The aider benchmarks were misleading too
@@codingcrashcourses8533 I read deepseek-coder and wizardcoder were pretty good. And mistral seemed to be a decent all-rounder. But this was before llama 3 and all I hear now is how llama 3 is killing all the other local llms (dunno if it was original 3 or new 3.1). But it's a general purpose model, so dunno if it's still able to beat the coding fine-tuned ones. And Aider seems like a promising way to improve them?
Thank you for the video. About data safety, it's written on OpenAI's API website that "As of March 1, 2023, data sent to the OpenAI API will not be used to train or improve OpenAI models (unless you explicitly opt in). One advantage to opting in is that the models may get better at your use case over time." So does it mean that the data we send through the API is completely safe or not? I thought it was, but if that's the case, what's the point of using Azure + OpenAI private endpoint? Again, thanks, good quality video, straight to the point!
"written on openai's API website".. and you do believe and trust that!!! Is there any way to make sure that this is true!!!🧐 It is like trusting the crowd that they will not take a look at you, nor take a picture of you, while you are naked. 🤯
Well, the issue is that these servers are normally hosted in the US. In Germany, we are only allowed to use APIs which are hosted in a few areas in Europe, like France or Germany. So yes, we still need to use these private endpoints. Private endpoints also guarantee that the requests can't be intercepted by bad persons :)
@@codingcrashcourses8533 Except Microsoft is probably the baddest "person" in the world. But ok, they probably already host your other shit, so you've lost control a long time ago. (As has Germany in general. If Microsoft wanted to turn out the lights here, they probably could.)
Even if you adjust for "all" of the hallucinations and crazy outputs, unlike with regular software, you can never prove it won't produce a hallucination and crazy output on a case you did not test explicitly. Which is the exact reason why AI models are a crap proposition for most applications and mostly good for entertainment (where wrong outputs don't matter).
@@clray123 agreed and that's why I'm using it for text to SQL only leaving the analytics to humans or standard etl pipelines. Even then, all SQL executions need to be reviewed
@@42svb58 What is the point of generating code if you need to carefully review it anyway. Finding subtle bugs in someone else's code is not much easier than writing correct code yourself.
I think people are way too naive when they let a company like OpenAI handle sensitive data. **So if it's something important**, it's either local open source or nothing, if you know what you are doing. (And I'm not talking about "data of your customers"; I'm talking about important scientific knowledge or technical data where the information itself is important. You can possibly get user data way more easily today.)
I disagree on this point. Microsoft would probably lose billions of dollars if they left any backdoors and used the data to train their models. The risk is just not worth it. If trust is gone, their cloud business is gone and maybe even the whole company.
@@codingcrashcourses8533 You clearly don't need any backdoor with those terms of service you happily use. Things like "we **can**, even with humans, now look over every piece of data you share with our chat programs in case of possible harmful use". This means, let's say, you share some critical technical data with the bot, etc. Now I wrote a detailed description of how this loophole leads to you losing your critical data without any possibility for the law to intervene (and I realized it would be a bad idea haha). But let's just say this section alone, which is in every terms of use involving LLMs, means they can do it without problems (please check it yourself if you don't believe me).
@@codingcrashcourses8533 Microsoft has been releasing and selling bad code for decades. That is part of the reason open source has succeeded as much as it has. One cannot simply "disagree" that processing sensitive data locally is likely more secure than sending it to a 3rd party company.
I don't agree. Especially for text classification, information extraction and small scale local agents, they are quite good. They just can't yet be used well for reasoning and informative conversations, i.e. how the mainstream currently interacts with these models. At the moment they are meant for tinkerers and developers, but they can surely be used to build useful tools, and I'm hopeful that they will get even better in the future.
Yes you are right, but another question would be if you need an LLM for classification at all, since latency is still relatively high. But classification is often just part of a larger "chain", so you might end up with the need for another LLM.
With small open source models, there is indeed a gap in performance, I would like to see some real experiments to close that gap (with relevant data). For EU use cases, the private endpoint is part of the solution, the other is data residency, so in practice, the list of az openai models available (hosted in EU) is limited.
I don't have enough knowledge about various LLMs to say if I agree or disagree here. Certainly, you make a good case for your use cases. What I would find very interesting is to create a benchmark out of tasks like these, because they are much more interesting than many I have seen. It is pretty obvious to me that the existing, accepted benchmarks out there are mostly trash and really don't tell you anything about the various models.
Unfortunately I have to agree. Especially when you experiment with real-life use cases, i.e. a quite large context window. The larger the context window, the less sense it makes to run it locally.
interesting video! here are some of my thoughts. first, i dont think the base models are supposed to be run as is. you should at least try the instruct and fine tuned models for each specific task. that also goes for the proprietary chat gpt where i would have liked to see copilot at the coding question instead. also i want to mention that depending on the local machine, an 8B model might be unnecessarily limiting. On modern nvidia gpus something like the 16B deepseek-coder-v2 should run easily. lastly, i dont think you fully understood the privacy argument but thats a different topic. thanks for giving insight into your view on the topic!
Well, very good explanation. However, the privacy part was really not very well explained. Suddenly a third party, Azure + OpenAI with a private endpoint, is introduced without explaining how it works under the hood. So you just enter your client’s financial data from the bank where you work into this “endpoint” and then yolo? This was by far the biggest argument for using language models on site without sending your data… 😅 maybe I missed it?
You are right, I did not explain how and why this is allowed. That's actually quite a lot (you can even get dedicated hardware with so-called PTUs) and would be enough for its own video. And no, we cannot send users' financial data to this endpoint. But I would also not be allowed to send this kind of data to a local model. I am not allowed to touch this kind of data at all. Just because you own that kind of data does not mean you are allowed to do anything you want with it :).
@@codingcrashcourses8533 Very interesting. Because you probably have a privacy policy in Germany, and you would be obligated to explain what data you collect and process and for what purposes. Do you explain that you transfer data to this endpoint, and for what purposes? I would be very interested to see if you could make a vid just about this! That would be a very good foundation for the argument that you make!
You can do a lot with open models, even locally. Saying they are trash is like saying a great car is trash because it isn't a Lambo. I also don't think it is wise to support "open"AI. Furthermore, I think it is important to highlight how unbelievably critical distributed architecture, choice and decentralization is going to be for the future. All the same I like your channel just disagree with such a hard statement.
Thank you for the respectful discussion in the comments. To be honest, I sweated a little bit before uploading this video due to the general climate on the internet these days. Awesome community
@@codingcrashcourses8533 Actually your thumbnail was a bit provocative. But I am glad your audience is mature.
@@sammathew535 Well, YouTube still favors click- and some sort of rage-bait. Less controversial videos will follow. Looking forward to seeing the responses to my RAG Tier list video :-D. They will probably be similar to the responses here.
I think for these open source models to be useful, they are meant to be fine tuned for specific tasks. I'm sure companies do this and even add RAG to their work flows
I use the local models for RAG. The quality still lacks though. But does improve
Companies do. A lot.
Do you have examples for good results with fine tuning and a real world usecase? Would be interested
@@codingcrashcourses8533 I worked on (sort of) a sentiment analysis project, and I compared gpt-4o, gpt-4o-mini and phi-3 (before 3.5 was released).
The output is basically structured (json), so it's not really a text generation task
gpt-4o = 91% accuracy
gpt-4o-mini = 89% accuracy
phi-3 mini = 65% accuracy BEFORE finetuning
phi-3 mini = 91% accuracy AFTER finetuning. I think we can get even better with a better training.
Still not amazing (you can get better with more traditional NLP models), but it shows that finetuning can make open source SLM as good as closed source LLM.
Your german-english accent just makes listening to you even better. Things have a more serious tone to it. Fantastic content!
haha glad you like it. Guess people either like it or really hate it ;-)
The LHS, RHS part is quite understandable for those from countries who do Math in English, like here in India. And I imagine GPT4O does some localisation to suit your geography
Thanks for that Information
For me in Sweden as well.
Although I do know quite a lot of English and have a strong interest in math.
Point being I might not be the best reference of a typical Swede.
Being exposed to English way of speaking through youtube-videos with math.
I hope you realize that you're running a quantized version of the open source models? You only mentioned their 4 bit version, which is heavily quantized. You could use at least the 8 or 16 bit versions of those models and might come up with better results. You're comparing quantized models to a GPT model that is unquantized.
That was actually a great spot. I see they are all 4 bit versions from 1:00 in the video.
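For a sense of why the 4 bit versions are the usual choice on local hardware, the weight memory at each precision can be estimated with simple arithmetic. This is a rough sketch, not a measurement: the 8-billion-parameter size is an assumed example, `weight_memory_gb` is a hypothetical helper, and real quantized formats add some overhead on top (as do the KV cache and activations).

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 8e9  # assumed example: an 8B-parameter model

for bits in (4, 8, 16):
    print(f"{bits:>2} bit: ~{weight_memory_gb(N_PARAMS, bits):.0f} GB")
```

Under these assumptions the weights take roughly 4 GB, 8 GB, and 16 GB respectively, which is why moving from 4 bit to 16 bit can push a model off a consumer GPU.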
Using Azure OpenAI is much better in terms of quality of the generated content. On top of this, you get enterprise contract that your data will not be used for improving OAI models. The hardest part with OSS models is the cost for hosting them. I recall one of my colleagues ran an experiment to host an OSS LLM that resulted in a bill of 2k euros a month. Our monthly cost for running OAI models is under 200 euros a month.
OSS LLMs/SLMs can be used in settings where data should not leave the network. For example, a factory setting. Also, SLMs can be fine-tuned to improve their performance if someone decides to go down the OSS path.
But for managing the models and quality, Azure OpenAI is a clear winner.
I doubt it. Do I understand you right that he hosted one single instance of an open source LLM? How much idle time? How much inference time? How many tokens per second? How much RAM/VRAM usage? 2k euros in Germany means 10k watts 24/7.
@@Anzeigenhatemeister-jo5be it belonged to NC family of VM. These are costly ones on Azure.
@@saurabhjain507 Which model? How many tokens per second? How many requests per time?
@@saurabhjain507 I run quite big models right here under my desk. Even if I ran inference 24/7, it would cause only 100 euros per month on the electricity bill.
@@Anzeigenhatemeister-jo5be most large organizations will run cloud hosted models on the big 3 providers that require VM computing classes. These are often hefty requirements and shutting them down isn't too significant in terms of cost savings. I control most of my own projects right below the org admin level so usage is at my discretion... In moderation. Most organizations have policies blocking themselves from experimenting and/or cost optimization
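The electricity figures quoted in this thread can be sanity-checked with a short calculation. This is a sketch under stated assumptions: the price of 0.28 EUR/kWh and the 30-day month are assumed values, not from the thread, and `watts_for_monthly_bill` is a hypothetical helper. Note that it only checks the electricity equivalence; the 2k euro bill in the thread was for a cloud VM, which is priced differently.

```python
HOURS_PER_MONTH = 24 * 30  # assumption: 30-day month, constant load


def watts_for_monthly_bill(eur_per_month: float, eur_per_kwh: float) -> float:
    """Continuous power draw (in watts) that produces the given monthly bill."""
    kwh = eur_per_month / eur_per_kwh
    return kwh / HOURS_PER_MONTH * 1000


PRICE_EUR_PER_KWH = 0.28  # assumed electricity price, not from the thread

for bill in (2000, 100):
    watts = watts_for_monthly_bill(bill, PRICE_EUR_PER_KWH)
    print(f"{bill} EUR/month ~ {watts / 1000:.1f} kW continuous")
```

At the assumed price, 2000 euros a month corresponds to roughly 10 kW around the clock, and 100 euros a month to roughly 0.5 kW, which matches the order of magnitude both commenters describe.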
In the video you are using the 4 bit quantized models, which have the lowest accuracy. See 1:00.
I think frontier models are great for rapidly prototyping and iterating over an idea for a workflow. However, once you have settled on a workflow, it is often the case that the frontier model is overkill. For example, people are using LLMs in cases where even spaCy is sufficient. Saying lower powered models “suck” would be like claiming that a hashmap data structure “sucks” because GPT4o could handle those cases too. You may be in a situation where “cost doesn’t matter, I’ll spend an extra $20 to get it done quicker” but you need to appreciate enterprise cases where, when you get tens of thousands of requests (per second, per hour, or even per day) suddenly saving $0.15 makes the difference between making or losing money.
This is indeed a problem: people simply use LLMs, and frontier models at that, for things they could easily do with spaCy or, at times, even with TF-IDF scores from traditional NLP!
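To make the TF-IDF point concrete, here is a minimal sketch of a nearest-neighbour text classifier built on TF-IDF weights and cosine similarity, using only the standard library. The tiny training set and the labels are invented for illustration; a real project would more likely use an established library and far more data.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def build_idf(docs):
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))  # count each term once per document
    # Smoothed IDF: the +1 terms keep weights positive even for a term
    # that occurs in every document.
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}


def tfidf(text, idf):
    counts = Counter(tokenize(text))
    # Terms never seen in the training documents are dropped.
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(text, labeled_docs, idf):
    """Return the label of the most similar training document."""
    query = tfidf(text, idf)
    best_label, _ = max(
        ((label, cosine(query, tfidf(doc, idf))) for doc, label in labeled_docs),
        key=lambda pair: pair[1],
    )
    return best_label


# Invented toy data, purely for illustration.
TRAIN = [
    ("great product works perfectly", "positive"),
    ("love it excellent quality", "positive"),
    ("terrible broke after one day", "negative"),
    ("awful waste of money", "negative"),
]
IDF = build_idf([doc for doc, _ in TRAIN])

print(classify("excellent quality product", TRAIN, IDF))
print(classify("broke after a day terrible", TRAIN, IDF))
```

No model weights, GPU, or API call is involved, which is the cost argument being made in the comments above.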
Flux for pictures, Claude for coding, and GPT for translations, I guess. But video generation is still junk anyway, and an LLM OS is still missing.
Could it be the system prompts too?
Meaning you need to fine tune and optimize for what you are doing.
Would be interesting to see how much better Claude is compared to the rest of the responses, since Claude is up there, if not the best, nowadays. Also, I wonder if providing these models with the same system prompt as Claude, to do things step by step, would make them any better. That is, are the open source models not as good because there aren't good system prompts, or is it the model itself, or are some of the proprietary model service providers routing use cases to different LLMs and using a more agentic model behind the scenes?
Thanks for the good content.
Answering your question about whether I agree or not:
The real problem is not in open source Small Large Language Models (SLLMs) like Phi, Llama, etc.
The real problem is in the people behind building such SLLMs, i.e. the way they think and the way they create such SLLMs. They want an SLLM that can make everything and do anything (a multi-task SLLM), and the result is a poor, weak SLLM, as you see.
What they should try to do is to build an SLLM that specializes in a single task.
1- They should not try to make an SLLM that can speak more than one language (for example, either English or German).
2- They must concentrate on training such a single-lang-SLLM on reasoning in that single language.
3- They should make such a single-lang-SLLM specialized in a single task or a single domain of tasks. For example:
Imagine an SLLM trained only in the English language and only in programming using Python and only Python. I expect that such an SLLM would be a super Python programmer that may be better than, or at least equal in Python programming capability to, closed source huge LLMs.
Specialization leads to creativity.
Even humans, you cannot find a human who is good in medicine and engineering at the same time. You cannot find a person who can say my native languages are English and German.
Reaching such a level of good open source SLLM is a must for privacy concerns and for reducing the real cost of running LLMs.
For multi-tasks, we can use agent frameworks that use many SLLMs to do multi-domain tasks.(query decomposition +router + plan-and-execute).
AI and LLMs are going to control our lives; we should not give such companies all the power to control our lives.
Again, Thanks for the good content.🌹🌹🌹
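The "query decomposition + router + plan-and-execute" idea from the comment above can be sketched in a few lines. This is only an illustration under loud assumptions: the specialists are stub functions standing in for small fine-tuned models, the decomposition is a naive split on the word "then", and the routing is plain keyword matching. None of these names come from an existing agent framework.

```python
from typing import Callable, Dict, List


# Stub "specialists": in a real system each would wrap a small fine-tuned model.
def python_specialist(task: str) -> str:
    return f"[python-model] {task}"


def german_specialist(task: str) -> str:
    return f"[german-model] {task}"


def fallback_generalist(task: str) -> str:
    return f"[general-model] {task}"


# Hypothetical routing table: keyword -> specialist.
ROUTES: Dict[str, Callable[[str], str]] = {
    "python": python_specialist,
    "german": german_specialist,
}


def decompose(query: str) -> List[str]:
    """Naive query decomposition: split the request on ' then '."""
    return [part.strip() for part in query.split(" then ") if part.strip()]


def route(task: str) -> Callable[[str], str]:
    """Pick the first specialist whose keyword appears in the task."""
    lowered = task.lower()
    for keyword, specialist in ROUTES.items():
        if keyword in lowered:
            return specialist
    return fallback_generalist


def plan_and_execute(query: str) -> List[str]:
    # Sub-tasks run one after another; with real models this would allow
    # loading a single specialist at a time instead of all of them at once.
    return [route(task)(task) for task in decompose(query)]


results = plan_and_execute(
    "write a Python function then translate the docstring to German then summarize"
)
for line in results:
    print(line)
```

In practice the router itself is often a small classifier or model rather than a keyword table, but the control flow stays the same.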
I really like the idea of routing to a lot of small different, specialized models and agree that this might be the future.
"AI and LLMS are going to control our life, we should not give such companies all the power to control our life." -> well, at the end the Open source models also come from Meta, Microsoft etc.
I have not heard of any "crowd funded" model yet
I agree with the sentiment in general… but it’ll still cost a lot to serve all these LLMs in a production-grade setup, to the point that it isn’t worth it. Plus you will probably end up sending data to the cloud (which is simply putting your trust in another provider).
@@codingcrashcourses8533 When saying Open Source SLLM, I mean Fully Open Source (Even the weights). This should be controlled by laws (I think you know the EU law concerning the browser COOKIES).
True Open Source == We do not care from where it came, since if the devil inside the SLLM builder moves we can then easily avoid it and have good acceptable alternatives. We need to reach what I call "Safe Private General Personal AI"
@@bsarel To run SLLMs you do not need high resources:
(The Router): Every SLLM will work alone and will consume around 8~10 GB of RAM/VRAM, which is available on consumer devices. For personal usage, one does not need to run multiple SLLMs simultaneously (sequential running can be applied very easily). The general term used for this these days is, as I remember, "On Edge".
@@HassanAllaham any example for such an LLM?
Do we know how big gpt 4o-mini is though? For all we know it could be 100b, that would still be much smaller than 4o. We have no idea if they're even operating 4o mini at a profit or if they're running on investor money and waiting for the competition to disappear. OpenAI has never published the size or compared it to a model with known size
Yes, we don't know. But most (if not all) open source models are trained by companies that sit on tons of cash, so in my opinion there is a difference here compared to, let's say, Amazon and small online stores.
I developed a system that can process a prompt multiple times.
That way, if the local large language model gets the code wrong, it can easily be run again.
From there, it is a simple process of saving "good" code and rejecting "bad" code.
My point is that like humans, code generating LLMs do not need to be correct on the first try.
Also, I am using an open source local LLM tuned for writing code (like Deepseek Coder).
Moreover, a lot can be said for practicing coding as a prompt engineer.
It is possible to draft pseudocode so that it is easy for the LLM to get the code right.
And, I must add that using OOP may not be the best solution when working with LLMs.
The benefits of writing OOP are diminished when using an LLM as a development tool.
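The "run it again, keep the good code, reject the bad code" loop described above could look roughly like this. It is a sketch, not the commenter's actual system: `fake_llm` is a hypothetical stand-in for a local code model, and the automated gate here is just "compiles and passes one small test". Executing generated code with `exec` is unsafe outside a sandbox, so treat that part as illustration only.

```python
from typing import Callable, Optional


def passes_checks(source: str) -> bool:
    """Automated gate: the candidate must compile, define add(), and pass a test."""
    try:
        namespace: dict = {}
        # WARNING: exec on generated code should only ever run in a sandbox.
        exec(compile(source, "<candidate>", "exec"), namespace)
        return namespace["add"](2, 3) == 5
    except Exception:
        # Syntax errors, missing functions, and runtime errors all count as "bad".
        return False


def generate_until_good(
    generate: Callable[[str], str], prompt: str, max_attempts: int = 5
) -> Optional[str]:
    """Re-run the generator until a candidate passes the checks or attempts run out."""
    for _ in range(max_attempts):
        candidate = generate(prompt)
        if passes_checks(candidate):
            return candidate
    return None


# Hypothetical stand-in for a local code model: two bad answers, then a good one.
_fake_outputs = iter([
    "def add(a, b) return a + b",        # syntax error
    "def add(a, b):\n    return a - b",  # compiles, wrong behavior
    "def add(a, b):\n    return a + b",  # correct
])


def fake_llm(prompt: str) -> str:
    return next(_fake_outputs)


good = generate_until_good(fake_llm, "Write add(a, b) that returns the sum.")
print(good)
```

The quality of the result depends entirely on how strong the gate is, which is the point debated in the replies below: a single unit test is a weak check, while linters, type checkers, and larger test suites can be added to the same `passes_checks` step.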
Lol and which model do you use to distinguish "good code" from "bad code"? And what if this classifier is as bad as the generator?
@@clray123
Please review software quality assurance.
1. How do you handle quality assurance for code written by humans?
2. Can those same techniques and tools be used to do quality assurance for code generated by the LLM?
3. Are you able to list various automated tools and techniques for validating code?
If you replace the LLM with a junior developer in your mind, I am confident that you'll quickly see how to write the procedures needed to detect the difference between good and bad code.
Of course, you could take the position that quality assurance tools and techniques are ineffective or do not exist.
Human devs often overlook the rest of reality (which includes hundreds of quality assurance tools and techniques) because they only think about what LLMs can do.
@@caLLLendar 1. Primarily by reviewing code to ensure the algorithms behave as specified. Sometimes by formal verification. The cherry on top is end-to-end tests.
2. No. Except for tests, which are the weakest and least effective.
3. Yes. None of which apply to verifying the behavior of AI models. There is an entire new field of research called interpretability, with few useful revelations when it comes to testing neural networks.
I suspect you work in a domain in which shallow testing is the only available tool and where bugs don't matter much, maybe frontend development.
@@clray123
No.
I can see that I was not clear enough and you're having a difficult time understanding what I mean. I appreciate the discussion here. Hopefully, it will help me communicate better in the future.
As I mentioned in my previous comment,
"Human devs often overlook the rest of reality (which includes hundreds of quality assurance tools and techniques) because they only think about what LLMs can do."
I can see that you're still stuck on the LLM. Just replace the LLM with a junior developer.
I'm not sure why you believe "tests are the weakest and least effective".
Can you elaborate on that?
After reading your comment a few times, I am guessing that you're thinking I am developing the LLMs or neural networks.
I'm not.
My domains are fintech and proptech, (but lately I spend most of my time developing an automated system related to quality assurance). In these domains, precision and accuracy are very important.
I'd love to figure out where the miscommunication is. Nonetheless, the bottom line is the position you seem to be taking is that you are unable to use various tools and techniques to do "high quality" quality assurance.
If a junior dev or LLM writes code, you're primarily "reviewing code" rather than primarily relying on automation.
In contrast, I am implementing hundreds of tools and techniques related to quality assurance. Of course, the tools I am referring to have been written by thousands of senior developers and competitors charge thousands of dollars for the service.
Are you able to develop such a system?
I hope so (since several systems exist).
Do you believe that you are skilled enough to assure the quality of code . . . automatically?
If not, what specific things does the software need to do that you would find too difficult to test?
@@caLLLendar Oh, I see, you suggest LLM as a replacement for a clueless junior dev whose code must be reviewed just as stringently. The problem with this reasoning is that you do not hire a junior dev to do work that you also have to bother senior dev to review. Instead, you let the senior dev write (and review) correct code. The only tasks which you hire junior devs for are the kind of low stakes tasks that I mentioned previously.
But they are not really the pain point. First, there are plenty of junior devs to choose from. Second, those junior devs, unlike a faulty uninterpretable LLM, will one day become senior devs, so hiring them is actually a future investment (if you manage to keep them working for your company).
Your mention of "hundreds of tools and techniques related to quality assurance" sounds like bullshit. After 20+ years in business I happen to know how business critical enterprise software dev works and trust me it is not done with junior devs compensated for by hundreds of magical tools that somehow assure correctness automatically. Primarily, it is done by hiring a few people who have a clue about and can explain to each other what they are doing (unlike junior devs and LLMs).
The most critical algorithms are verified using model checking, but this primarily works in embedded systems where the state space to automatically analyze is constrained and where little IO or dependencies on external, uncontrolled code or hardware resources exist.
The rest is primarily quality-assured by expert reviews, but you don't want to waste your expert review time on finding trivial bugs or on educating the juniors (that is what schools and training on the job are for; but here again the difference is that you can reasonably train a junior dev, while you can't train an LLM with equally predictable results). So no, we will not have our senior devs spend time on reviewing LLM bugs; we just won't use LLMs for such tasks, just like we don't use junior devs at present.
Are the actual concerns related to data breach (lack of security) or data privacy (not wanting to share the data with the host provider, e.g. OpenAI)?
ps. I have purchased all your courses and I'm wondering what the latest/best method is for managing our PGVector database:
- is the SQLRecordManager (indexing API) still the preferred method for production or should we write our own manager?
- the community_postgres is deprecated and ported to langchain_postgres, but it requires psycopg3, which so far works fine, but any tips and tricks are appreciated! cheers man,
First of all, thank you very much for being such a loyal supporter :). The issues are related to both, but there are more regulatory rules that force us to work that way. Otherwise BaFin will make us pay millions.
Regarding SQLRecordManager: The new and better way seems to be UPSERT, which is not yet implemented for a lot of vectorstores.
The issue with psycopg3 is not that it might not work, but probably the migration from 2 to 3. I did not try it yet, since currently it still works and priorities are different.
@@codingcrashcourses8533 Appreciated. The more I learn the more questions ha! I have a specific question regarding multi-tenancy: We're aiming to support multiple tenants, B2B SaaS, efficiently while ensuring proper data isolation. I'm considering two main approaches for implementing multi-tenancy:
Using a separate collection (PGVector collection_name) for each tenant
Using a shared collection with tenant_id in the metadata and filtering queries
I'm leaning towards the metadata approach for its scalability and management benefits, but I have some concerns:
a) Will filtering by tenant_id in metadata provide sufficient performance as the dataset grows?
b) Are there any security implications I should be aware of when using shared collections?
c) How might this choice impact future scalability, especially if we need to support hundreds or thousands of tenants?
Additionally, do you see any significant advantages to using both approaches simultaneously (separate collections AND tenant_id in metadata)?
Given our B2B context and potential for growth, which approach would you recommend, and are there any best practices or pitfalls I should keep in mind when implementing this with PGVector?
ps. any tips and tricks for B2B/B2C deployment would be awesome.
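The shared-collection option with `tenant_id` in the metadata can be illustrated with a small sketch. Everything here is a hypothetical in-memory stand-in, not the `langchain_postgres` or PGVector API: `SharedCollection` plays the role of one shared collection, a substring match stands in for similarity search, and `TenantScopedStore` shows one way to address the security concern in (b), namely enforcing the tenant filter in a single wrapper so that individual call sites cannot forget it.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Document:
    text: str
    metadata: Dict[str, str]


@dataclass
class SharedCollection:
    """Hypothetical in-memory stand-in for one shared vector collection."""

    docs: List[Document] = field(default_factory=list)

    def add(self, doc: Document) -> None:
        self.docs.append(doc)

    def search(self, query: str, metadata_filter: Dict[str, str]) -> List[Document]:
        # Substring match stands in for similarity search; a document is
        # returned only if it also matches every key in the metadata filter.
        return [
            d
            for d in self.docs
            if all(d.metadata.get(k) == v for k, v in metadata_filter.items())
            and query.lower() in d.text.lower()
        ]


class TenantScopedStore:
    """Wrapper that always injects tenant_id, so callers cannot omit the filter."""

    def __init__(self, collection: SharedCollection, tenant_id: str) -> None:
        self._collection = collection
        self._tenant_id = tenant_id

    def add(self, text: str) -> None:
        self._collection.add(Document(text, {"tenant_id": self._tenant_id}))

    def search(self, query: str) -> List[Document]:
        return self._collection.search(query, {"tenant_id": self._tenant_id})


shared = SharedCollection()
acme = TenantScopedStore(shared, "acme")
globex = TenantScopedStore(shared, "globex")
acme.add("Acme pricing sheet 2024")
globex.add("Globex pricing sheet 2024")

print([d.text for d in acme.search("pricing")])
```

The sketch says nothing about performance at scale (question a); that depends on how the real store indexes and applies metadata filters and would need to be measured.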
You should do a comparison with the new Claude, which is more comparable to GPT
That is also proprietary
But what about the number of parameters? Could that be a reason why gpt-4o-mini works well?
Quite obviously. But the point is that there are no large open source models which could reasonably compete, the reason being the economies of scale (many users offsetting hardware costs by sharing the same instance of the model).
8:30 I have a question about that. How is this supposed to protect the privacy of my data? The graphic looks like you're trying to sell me TLS as a solution to the problem. You're not trying to sell me TLS as a solution to the problem, are you?
No, TLS is not enough. One requirement is that the service is not accessible from the internet and, more importantly, that the servers are hosted inside the EU. Sending this kind of data to US servers is not allowed.
@@codingcrashcourses8533 So that is only related to your specific case with financial data? For me private data means something else. If I don't want OpenAI to see my private data, then I don't want any provider in the EU to see it either. In your video, it sounded to me as if your conclusion is that you don't need a local model (means a model that runs on hardware that I control) for private data. This is obviously wrong. But to be honest I don't fully understand what you are showing. What is webapp-1?
Edit:
Ah, now I see the headline. So you still are using OpenAI but use something that is nothing more than a VPN. Sorry, but that has nothing to do with privacy.
A compelling reason to try open source models is that sooner or later, the prices of OpenAI will have to increase, and increase drastically. This is almost the freemium phase of OpenAI as a company. But what do I know; at this time in history, being a meme company is a viable path. I don't really know what OpenAI is anyway; it's not a traditional C corp or LLC. And not really a non-profit.
Hi Markus, quick question. Have you tested any open source models that are fine-tuned for coding? What is your take on that and do you have any recommendations? Thanks
No, not yet. Any specific recommendations? I have a ChatGPT Subscription and I am very happy with GPT-4o
@@codingcrashcourses8533 No I don't. They are out there so I figured you might know. I'm going to use your test on some of them and see how that turns out..
@@codingcrashcourses8533 I don't recommend open source models for coding at all 💀 I ran deepseek coder and it is very underwhelming. The aider benchmarks were misleading too
@@codingcrashcourses8533 I read deepseek-coder and wizardcoder were pretty good. And mistral seemed to be a decent all-rounder. But this was before llama 3 and all I hear now is how llama 3 is killing all the other local llms (dunno if it was original 3 or new 3.1). But it's a general purpose model, so dunno if it's still able to beat the coding fine-tuned ones. And Aider seems like a promising way to improve them?
Great take. Do you do any side work for projects / offer your paid services?
Thank you. Sorry, but I don't :)
Thank you for the video
About the data safety, it's written on OpenAI's API website that "As of March 1, 2023, data sent to the OpenAI API will not be used to train or improve OpenAI models (unless you explicitly opt in). One advantage to opting in is that the models may get better at your use case over time."
So does it mean that the data we send through the API is completely safe or not? I thought it was, but if that's the case, what's the point of using Azure + OpenAI private endpoint?
Again, thanks for the good quality video, straight to the point!
"written on openai's API website".. and you do believe and trust that!!!
Is there any way to make sure that this is true!!!🧐
It is like trusting the crowd that they will not take a look at you nor they will take a shot of you while you are naked. 🤯
Well, the issue is that these servers are normally hosted in the US. For Germany, we are only allowed to use APIs which are hosted in a few areas in Europe, like France or Germany. So yes, we still need to use these private endpoints. Private endpoints also guarantee that the requests can't be intercepted by bad persons :)
@@codingcrashcourses8533 Except Microsoft is probably the baddest "person" in the world. But ok, they probably already host your other shit, so you've lost control a long time ago. (As has Germany in general. If Microsoft wanted to turn out the lights here, they probably could.)
I'm writing a text-to-sql using 70b models and it is insanely difficult and time consuming to adjust for all of the hallucinations and crazy outputs.
Even if you adjust for "all" of the hallucinations and crazy outputs, unlike with regular software, you can never prove it won't produce a hallucination and crazy output on a case you did not test explicitly. Which is the exact reason why AI models are a crap proposition for most applications and mostly good for entertainment (where wrong outputs don't matter).
@@clray123 agreed and that's why I'm using it for text to SQL only leaving the analytics to humans or standard etl pipelines. Even then, all SQL executions need to be reviewed
@@42svb58 What is the point of generating code if you need to carefully review it anyway. Finding subtle bugs in someone else's code is not much easier than writing correct code yourself.
Amazing!
thank you :), but I would already do some things differently due to feedback
Small models are made to be fine-tuned further for specific use cases. They will not perform a wide variety of tasks.
I think people are way too naive to let a company like OpenAI handle sensitive data. **So if it's something important**, it's either local open source or nothing, if you know what you are doing. (And I'm not talking about "data of your customers"; I'm talking about important scientific knowledge or technical data where the information itself is important. You can possibly get user data far more easily today.)
I disagree on this point. Microsoft would probably lose billions of dollars if they left any backdoors and used the data to train their models. The risk is just not worth it. If trust is gone, their cloud business is gone and maybe even the whole company.
@@codingcrashcourses8533 You clearly don't need any backdoor with the terms of service you happily accept. Things like "we **can**, even with humans, look over any data you share with our chat programs in case of possible harmful use". This means, let's say, you share some critical technical data with the bot. I had written a detailed description of how this loophole leads to you losing your critical data without any possibility for the law to intervene (and then realized posting it would be a bad idea, haha). But let's just say this section alone, which is in every terms of use involving LLMs, means they can do it without problems (please check it yourself if you don't believe me).
@@codingcrashcourses8533 Microsoft has been releasing and selling bad code for decades. That is part of the reason open source has succeeded as much as it has.
One cannot simply "disagree" that processing sensitive data locally is likely more secure than sending it to a 3rd party company.
I don't agree. Especially for text classification, information extraction, and small-scale local agents, they are quite good. They just can't yet be used well for reasoning and informative conversations, i.e. the way the mainstream currently interacts with these models.
At the moment they are meant for tinkerers and developers, but they can surely be used to build useful tools, and I'm hopeful that they will get even better in the future.
Yes you are right, but another question would be if you need an LLM for classification at all, since latency is still relatively high. But classification is often just part of a larger "chain", so you might end up with the need for another LLM.
Thanks for your useful content
All of MS's compute goes to OpenAI. Phi-3.5 shows how committed to open source they are
I would also like to know what the business value is behind these models.
With small open source models, there is indeed a gap in performance; I would like to see some real experiments to close that gap (with relevant data). For EU use cases, the private endpoint is part of the solution; the other part is data residency, so in practice the list of Azure OpenAI models available (hosted in the EU) is limited.
Yes, we are always a bit behind, ~1-2 months.
I don't have enough knowledge about various LLMs to say if I agree or disagree here. Certainly, you make a good case for your use cases. What I would find very interesting is to create a benchmark out of tasks like these, because they are much more interesting than many I have seen. It is pretty obvious to me that the existing, accepted benchmarks out there are mostly trash and really don't tell you anything about the various models.
Unfortunately I have to agree. Especially when you experiment with real-life use cases, i.e. a quite large context window. The larger the context window, the less sense it makes to run it locally.
I agree with you❤
I agree that Azure with OpenAI is the best option.
I mean, that we can experiment with OpenAI credentials in dev, staging environment, and then deploy on Azure OpenAI for prod
interesting video! here are some of my thoughts. first, i don't think the base models are supposed to be run as is. you should at least try the instruct and fine-tuned models for each specific task. that also goes for the proprietary chat gpt, where i would have liked to see copilot on the coding question instead. also i want to mention that depending on the local machine, an 8B model might be unnecessarily limiting. On modern nvidia gpus something like the 16B deepseek-coder-v2 should run easily. lastly, i don't think you fully understood the privacy argument, but that's a different topic. thanks for giving insight into your view on the topic!
ask gpt4o how many r in strawberry & it says 2 but llama3.1 says 3. case closed
ok that changed my mind
Almost didn't watch this because of the click bait title
Only a little bit. But for my test with 3.5, it's actually close to reality ^^
Well, very good explanation. However, the privacy part was really not very well explained. Suddenly a third party with Azure + OpenAI with private endpoint is introduced without explaining how it works under the hood.
So you just enter your client’s financial data from the bank where you work into this “endpoint” and then yolo?
This was by far the biggest argument to use language models on site without sending your data… 😅 maybe I missed it?
You are right, I did not explain how and why this is allowed.
That's actually quite a lot to cover; you can even get dedicated hardware with so-called PTUs, and it would be enough for its own video.
And no, we cannot send users' financial data to this endpoint. But I would also not be allowed to send this kind of data to a local model. I am not allowed to touch this kind of data at all. Just because you own that kind of data does not mean you are allowed to do anything you want with it :).
@@codingcrashcourses8533 Very interesting. Because you probably have a privacy policy in Germany, and you would be obligated to explain what data you collect and process for what purposes. Do you explain that you transfer data to this endpoint, and for what purposes? I would be very interested to see if you could make a vid just about this! That would be a very good foundation for the argument that you make!
agree :)
finally someone is saying what needs to be said
I hope somebody also says how 8:30 protects the privacy of my data.
@@Anzeigenhatemeister-jo5be The answer is not at all, but it satisfies the German regulations (made up with much input from Microsoft, without doubt).
You can do a lot with open models, even locally. Saying they are trash is like saying a great car is trash because it isn't a Lambo. I also don't think it is wise to support "open"AI. Furthermore, I think it is important to highlight how unbelievably critical distributed architecture, choice and decentralization are going to be for the future. All the same, I like your channel; I just disagree with such a hard statement.
Please understand that some kind of polarisation helps to increase my reach ;-). I am currently experimenting with Llama 3.2 and Agents
llama 3.1 405b is better, compare apples to apples, not 3.5b to chatgpt 😭
You did not watch the video but just wrote an angry comment, right?
@@codingcrashcourses8533 forgot to remove it 💀