Open-Source Vision AI - SURPRISING Results! (Phi3 Vision vs LLaMA 3 Vision vs GPT4o)

  • Published Jun 1, 2024
  • Phi3 Vision, LLaMA 3 Vision, and GPT4o Vision are all put to the test!
    Be sure to check out Pinecone for all your Vector DB needs: www.pinecone.io/
    Join My Newsletter for Regular AI Updates 👇🏼
    www.matthewberman.com
    Need AI Consulting? 📈
    forwardfuture.ai/
    My Links 🔗
    👉🏻 Subscribe: / @matthew_berman
    👉🏻 Twitter: / matthewberman
    👉🏻 Discord: / discord
    👉🏻 Patreon: / matthewberman
    👉🏻 Instagram: / matthewberman_ai
    👉🏻 Threads: www.threads.net/@matthewberma...
    👉🏻 LinkedIn: / forward-future-ai
    Media/Sponsorship Inquiries ✅
    bit.ly/44TC45V
    Disclosures:
    I'm an investor in LMStudio
  • Science & Technology

Comments • 308

  • @citaman
    @citaman months ago +44

    For future tests:
    1 - Ask an unrelated question about an image - [Image of a car] Tell me what's wrong with my bicycle
    2 - Gradually zoom out of a big chunk of text in an image to see how many words the model can read
    3 - A dense detection task: describe each element of the image in JSON format with a predefined structure (see the sketch below)
    4 - If possible, multiple frames from a video to get a glimpse of action understanding

    • @jakobpcoder
      @jakobpcoder months ago +1

      Including object position in x/y or u/v space (0 - 1 per axis), maybe even bounding boxes.
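
      A minimal sketch of what such a predefined structure could look like, combining the dense-detection idea above with the bounding-box suggestion. The field names and prompt wording are invented for illustration, not taken from the video:

      ```python
      # Hypothetical target structure for a dense-detection prompt (all field names invented).
      import json

      detection_schema = {
          "objects": [
              {
                  "label": "short class name, e.g. 'bicycle'",
                  "attributes": ["color", "state", "notable details"],
                  # Normalized coordinates, 0-1 per axis, as suggested above.
                  "bbox": {"x": 0.0, "y": 0.0, "width": 0.0, "height": 0.0},
              }
          ]
      }

      prompt = (
          "List every object in this image as JSON matching this structure exactly:\n"
          + json.dumps(detection_schema, indent=2)
      )
      print(prompt)
      ```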

  • @murraymacdonald4959
    @murraymacdonald4959 months ago +56

    For future vision tests consider things like:
    1) Finding Objects - Where is Waldo in this picture?
    2) Counting Objects - How many bicycles are there in this picture?
    3) Identifying Abnormal Objects - How many eggs in this box are broken?
    4) Identifying Partially Obscured Objects - Imagine a hand holding cards - What cards are in this poker hand?
    5) Identifying Misplaced Objects - Which of these dishes is upside down?

    • @CuratedCountenance
      @CuratedCountenance months ago

      What about adding GIFs? GPT-4o is also capable of analyzing moving GIFs... Not sure about the others, but that adds another dynamic to the tests.

    • @stevenhamerlinck6832
      @stevenhamerlinck6832 months ago

      There are quite a number of these mentioned projects readily available on GitHub.

    • @neomatrix2669
      @neomatrix2669 months ago

      It's impressive how open-source image-to-text models are eventually doing so much better than proprietary and paid ones. 😲

  • @philipashane
    @philipashane months ago +18

    On the question of the size of the Photos app, GPT noted that 133 GB is larger than the max size of your phone’s storage and thus indicates that it’s possibly using cloud storage and isn’t the actual amount used by Photos on your phone. That was a really perceptive answer, so bonus points to GPT for that 😊 and perhaps that discrepancy is why the other AI seemed to be ignoring the Photos app.

  • @bertobertoberto3
    @bertobertoberto3 months ago +45

    For the CAPTCHA, GPT-4o is clearly the winner. It understands what you mean given the context and doesn't just repeat all the letters it sees in the image.

    • @MistahX99
      @MistahX99 months ago

      That's exactly what I came to say

    • @mrdevolver7999
      @mrdevolver7999 months ago +6

      The question was "what letters are found in this image?", the question wasn't "what letters are found in CAPTCHA field?" Therefore, Phi-3 vision model answered the actual question. GPT4o simply assumed that the task is to break the captcha by reading it. Sometimes less means more, in this case assuming less about the user's intentions would yield better results.

    • @thripnixe
      @thripnixe months ago

      It doesn't matter

    • @mikezooper
      @mikezooper months ago +1

      @@mrdevolver7999 Your cognitive assumption is that it's right. A random number generator could answer "1 + 1" as 2 by pure chance. Therefore, we don't know if a right answer was a fluke.

    • @hamsturinn
      @hamsturinn months ago +3

      Imagine how annoyed you would be, if you ask someone what the letters are and they say 'CAPTCHA'.

  • @Baleur
    @Baleur months ago +71

    8:40 is up for interpretation.
    "Photos" isn't really a standalone "app" per se, and it's not the app itself that is taking up the space; it's the individual jpeg photos, which would take up the same amount of space even if you somehow didn't have the "Photos" app installed anymore.
    If a person asked ME that same question, I'd also answer WhatsApp, since that's something you can tangibly uninstall.
    If they asked "what is taking up most space?" the correct answer is "your photos". But if the question is "what APP is taking up most space?", it's WhatsApp.

    • @Ginto_O
      @Ginto_O months ago +7

      agree

    • @gamingtech276
      @gamingtech276 months ago +1

      Also with the context of "my space", it's undeniably not photos

    • @michakietyka2807
      @michakietyka2807 months ago +4

      It's wrong for an LLM not to at least mention a different possibility if the question seems ambiguous.

  • @CuratedCountenance
    @CuratedCountenance months ago +6

    Pro-tip: Try uploading a photograph you've taken or a work of art into GPT-4o and ask it to behave like an art critic (works great vanilla, but even better with custom instructions).
    GPT-4o's ability to dissect the minutiae of photography is absolutely wild... even to the point of giving suggestions for improvement.
    I wonder how long it is until photographers realize what kind of a tool they have available here. I just get a kick out of posting photographs and art and asking for critiques and ratings. It's so, so good.

    • @starblaiz1986
      @starblaiz1986 months ago +1

      Oh man, that's actually wild and has a ton of use cases! "Hey GPT, what do you think of this tshirt design I just made for my POD business?" --> *proceed to incorporate the suggestions it has to make a better product* 😮

  • @fabiankliebhan
    @fabiankliebhan months ago +10

    I think if you just consider output quality, GPT-4o is the best.
    But if you also take into account the speed, and the fact that Phi-3-vision is local and open-source, Phi-3-vision is the most impressive one.

  • @socialexperiment8267
    @socialexperiment8267 months ago +3

    Really good tests. Thanks a lot.

  • @MrKrzysiek9991
    @MrKrzysiek9991 months ago +3

    The LLaMA vision model was probably fine-tuned for providing verbose descriptions of images. There are other fine-tuned models that focus on OCR or image labelling.

  • @iamachs
    @iamachs months ago

    This is really awesome quality content mate, really love your work 😊🚀👍

  • @yotubecreators47
    @yotubecreators47 months ago

    Nice video. These kinds of videos I immediately add to my playlist forever.

  • @JoaquinTorroba
    @JoaquinTorroba months ago

    Very very useful Matt, thanks!

  • @photorealm
    @photorealm months ago

    Great test, excellent pace , A+ 🤟

  • @donmiguel4848
    @donmiguel4848 months ago +5

    Photos and Apps seem to be distinguished in the storage section of the iPhone. So if you ask about the largest apps, the LLM ignores the photos...

  • @truepilgrimm
    @truepilgrimm months ago +10

    I love your work. I never miss an episode. I love how you test the LLMs.

  • @superjaykramer
    @superjaykramer months ago

    your channel is great, actually useful information, cheers

  • @sebastiancabrera6009
    @sebastiancabrera6009 months ago +28

    Phi3-Vision is awesome

    • @joshs6230
      @joshs6230 months ago

      Please test your meta glasses

  • @EvanLefavor
    @EvanLefavor months ago +3

    LM Studio is infamously bad for vision.
    In order to get it to work you have to follow these rules:
    1. Start a new chat for each photo question.
    2. Reboot LM Studio for every photo question.
    It's tedious, but otherwise it can start hallucinating after the initial question.

    • @ayushmishra5861
      @ayushmishra5861 months ago

      What is LM Studio? Is it safe to assume it's an interface and UI on top of Ollama?

    • @thegooddoctor6719
      @thegooddoctor6719 months ago

      @@ayushmishra5861 LM Studio is a standalone app - it's easy to use and powerful

    • @EvanLefavor
      @EvanLefavor months ago

      @@ayushmishra5861 It's a one-click-install GUI for working with any LLM locally, with full customization. I would say it is the best tool available right now.

  • @MarkDurbin
    @MarkDurbin months ago +1

    I have tried dropping a simple software architecture diagram on them and asking them to extract the entities, hierarchy and connections into a json file, which usually works quite well.

  • @tepafray
    @tepafray months ago +3

    I have to wonder, in the case of the AI messing up which app takes the most space, whether it recognises your photos as separate from the actual app.

  • @zyzzyva303
    @zyzzyva303 months ago

    Cool. Nice hoodie Matthew.

  • @AlexLuthore
    @AlexLuthore months ago +2

    GPT-4o doing the analyzing on the CSV prompt wasn't calling up Python to look at the image, but actually using Python to generate a CSV output of the image, since you asked it to turn the image data into CSV.

  • @smoothemusic4386
    @smoothemusic4386 months ago

    I tried searching for the model in lm studio and couldn’t find it! Where can I download it from?

  • @kgnet8831
    @kgnet8831 months ago +2

    Object counting and the comparison of sizes are always good tests...

  • @MikeKleinsteuber
    @MikeKleinsteuber months ago +4

    Nothing to do with this episode in particular, but one important question that no one appears to be asking Sam Altman and all the other AI CEOs is when our AI will become proactive rather than simply reactive. That will be the next big game changer.

    • @ronilevarez901
      @ronilevarez901 months ago +4

      Probably never. I'd say that's the first step humans could take toward surrendering their control over the world, and the last step humanity would take as the dominant species too.

    • @paulmichaelfreedman8334
      @paulmichaelfreedman8334 months ago +1

      @@ronilevarez901 I think input -> output will remain the base function for some time, and that more sophisticated models will be able to accept long-term commands, like constantly checking your calendar and reminding you of dates without having asked for a reminder.

    • @marcusmadumo7361
      @marcusmadumo7361 months ago

      I think it is because of the risk that it may pose, but it is worth a try... I would say initiative rather than proactivity. However, I get your point.

    • @ronilevarez901
      @ronilevarez901 months ago +1

      @@marcusmadumo7361 Technically, initiative in AI is called "Agency".

  • @gagarinchief
    @gagarinchief months ago +3

    Where are the links to the models?

  • @henrylawson430
    @henrylawson430 months ago +2

    Ask them to interpret maps. For example, is there a park nearby? What is the name of the park? How about a school etc. This is useful for real estate.

  • @luan176
    @luan176 months ago +2

    Do you have any plans or a tutorial link for running Phi-3 Vision locally?

  • @jimk8325
    @jimk8325 months ago

    Thank you for your tests.
    Is there a particular HF model you're using? I tested a couple with LM Studio and they didn't work 🤔

  • @nyyotam4057
    @nyyotam4057 months ago +2

    Matt, instability with temp -> 0 is one sign we have an MoE.

  • @akhil5665
    @akhil5665 months ago +1

    You can never judge the vision capability of a model merely based on description or detection; it should also be able to localise objects with good precision, which is where most models fail.

  • @enochleffingwell842
    @enochleffingwell842 months ago

    What is the name of the bright circle around your mouse?

  • @topmandan
    @topmandan months ago

    Great channel, love your work. FYI - to read a QR code, the QR code must have white space around the outside. This allows the model to pick up on the 3 big position marker squares.
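
    That white margin is the QR "quiet zone", and it is easy to control when generating test images. A minimal sketch, assuming the Python qrcode package (the URL is just a placeholder), where border sets the quiet-zone width in modules:

    ```python
    # Sketch: generate a QR test image with an explicit quiet zone (assumes `pip install qrcode[pil]`).
    import qrcode

    qr = qrcode.QRCode(box_size=10, border=4)  # border = quiet-zone width in modules; 4 is the spec's recommended minimum
    qr.add_data("https://www.matthewberman.com")
    qr.make(fit=True)
    qr.make_image(fill_color="black", back_color="white").save("qr_with_quiet_zone.png")
    ```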

  • @toadlguy
    @toadlguy months ago +3

    That is a pretty odd (incorrect) title. GPT4o is NOT Open Source. Maybe you meant Open Source models compared to GPT4o?

    • @bornach
      @bornach months ago

      Are any of them open source in the truest sense of being able to modify both the code and the weights and biases of the model to alter its behaviour in a very specific and directed way?

  • @Spreadshotstudios
    @Spreadshotstudios months ago

    What laptop are you using where you are able to load 14 GB of model weights into VRAM without quantizing?

  • @MrNezlee
    @MrNezlee months ago

    GPT-4o seems to have the best understanding of 3D physical space, including direction, coordinates, mass, speed, collision, risk avoidance, obstacles, etc.

  • @tomtom_videos
    @tomtom_videos months ago +1

    Awesome video! I was wondering how Phi-3-Vision fares compared to other vision-capable LLMs. I watched your video while I was working on my own Phi-3-Vision tests using Web UI screenshots (my hope is that it could be used for automated Web UI testing). However, Phi-3 turned out to be horrible at Web UI testing (you can see the video from my tests on my YouTube channel, if you are interested). It's nice to see that it fares much better with normal photos! Thanks for making this video - it saved me some time on testing it myself :)

  • @brianWreaves
    @brianWreaves months ago

    RE: Vision test suggestions
    1) Use an image of a CAPTCHA requiring the user to use a slider to move a puzzle piece and ask for instructions on how to solve it
    2) Use an optical illusion image and ask a relevant question
    3) Use an image and ask it to create 3 questions about the image that you can use to test someone's vision logic

  • @ronbridegroom8428
    @ronbridegroom8428 months ago

    Good stuff. Thanks. Results are very prompt dependent

  • @josephfox514
    @josephfox514 months ago

    I'm wondering if it can read blueprints for both mechanical and electrical schematics and then find the part. I can give you examples if needed.

  • @emil8367
    @emil8367 months ago

    Thanks, curious how it handles written text.

  • @stewiex
    @stewiex months ago

    I think you need to adjust your System Prompt for Llama to ask for concise answers and for it to ensure it accurately responds to the intended action of the prompt.

  • @mesapysch
    @mesapysch months ago

    I am a data annotator. One suggestion would be to use a "Where's Waldo" image and have the bot not only find Waldo, but also describe to the user how to find him. I would be curious to know how they navigate an image.

  • @davidberry8463
    @davidberry8463 months ago

    Phi3 did the best. Another note... I used GPT-4o to write a lesson plan for my engineering courses. Instead of working hard, I uploaded several pieces of data in different formats. In one portion of the prompt, I uploaded a PDF, a Word DOC, and an XLS. I asked it to provide a response and then output in XLS, PDF and CSV formats. The first time through, GPT-4o worked perfectly. The second time through, I had to start over. Kinda weird, but all in all the models are getting really good. It also helps that I'm an engineering teacher that loves to watch Matthew Berman.

  • @HarrisonBorbarrison
    @HarrisonBorbarrison months ago

    I test vision models by showing them an obscure character from Super Paper Mario. They don’t usually get it correct and it’s probably not the best way to test them.

  • @mxguy2438
    @mxguy2438 months ago

    It would be interesting to have each of the models assess both their own answers and competing models' answers to see if they accurately answered the question.

  • @mrpro7737
    @mrpro7737 months ago

    GPT-4 initially struggled but eventually transformed into a Terminator ! 😂

  • @mbrochh82
    @mbrochh82 months ago

    Matthew, please test handwriting recognition as well when testing image models! This would be a massive use case, useful to anyone who likes to use pen and paper for personal journalling.

  • @StephenGroenewald
    @StephenGroenewald months ago

    Thanks and today I had to use vision to troubleshoot router lights. Maybe use photos of real life objects to see how well it can help.

  • @mduthwala439
    @mduthwala439 months ago

    Nice video. Try testing a video clip and try testing smaller 7B vision LLMs.

  • @ollimacp
    @ollimacp months ago +1

    Maybe try out a (slightly) tilted scanned version of a printed Excel document.
    One could try adding noise or other means of disturbing the image, then test those disturbed images against the models and see which model handles them best.

  • @LHSgoatman
    @LHSgoatman months ago

    Ask questions about charts: line charts, bar charts, etc. Ask questions whose answers are not immediately obvious from the data but can be inferred. Ask it to convert from one chart type to another.

  • @JanJeronimus
    @JanJeronimus months ago

    Some ideas for tests:
    1) When providing an image of a small maze, can it describe the shortest route?
    2) What if you provide a map of a maze and a 2D image of a vehicle as seen from above? The maze has some paths where the vehicle can't pass, as these paths are too narrow.
    3) Can it calculate the area of a geometric figure when you provide an image with the dimensions?
    4) What if you provide an irregular figure and ask for the area?

  • @justindressler5992
    @justindressler5992 months ago

    Thanks for this video, I was planning on downloading both models. You saved me some time. You should try some abliterated models.

  • @perer005
    @perer005 months ago

    Think you need to update that AI test soon! :D

  • @theresalwaysanotherway3996
    @theresalwaysanotherway3996 months ago +1

    Keep in mind that had you run Phi-3 Vision locally (it is the smallest and easiest to run at only ~4B parameters, and it's open weight), it might have performed better on the identify-a-person questions, as it seems Azure blurs the faces in every image that you upload, similar to Copilot.
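
    For anyone who wants to try that, a minimal local-inference sketch, assuming the Hugging Face transformers library and the microsoft/Phi-3-vision-128k-instruct checkpoint; the image placeholder and processor call follow the model card's pattern and may differ between versions:

    ```python
    # Hedged sketch: run Phi-3 Vision locally with Hugging Face transformers (assumes a CUDA GPU).
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoProcessor

    model_id = "microsoft/Phi-3-vision-128k-instruct"
    model = AutoModelForCausalLM.from_pretrained(
        model_id, device_map="cuda", torch_dtype="auto", trust_remote_code=True
    )
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

    image = Image.open("screenshot.png")  # any local test image
    messages = [{"role": "user", "content": "<|image_1|>\nWhich app is using the most storage?"}]
    prompt = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

    inputs = processor(prompt, [image], return_tensors="pt").to("cuda")
    output_ids = model.generate(**inputs, max_new_tokens=256)
    answer = processor.batch_decode(output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0]
    print(answer)
    ```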

  • @Dave-nz5jf
    @Dave-nz5jf months ago

    GPT-4o is getting hammered these days; I wonder if that has any effect on the results it spits out sometimes. Really great new set of tests!

  • @user-ei4ol4dw7i
    @user-ei4ol4dw7i months ago +2

    Hello Matthew. Do the following test: there are many photos in one directory. Will these LLMs be able to sort the photos into folders depending on their subject matter? For example, photos of nature, photos of a house, photos of animals.

  • @wardehaj
    @wardehaj months ago

    Another prompt idea: ask what ingredients are on the pizza shown in an image.

  • @mountainmonkey15
    @mountainmonkey15 months ago

    Anyone know of a tutorial for the Local Server feature in LM Studio?

  • @kristianlavigne8270
    @kristianlavigne8270 months ago

    For future tests: give it a flowchart and have it explain the flow and convert it into code.

  • @Interloper12
    @Interloper12 months ago

    Well we explicitly know GPT4o was trained on the whole of Reddit, so it passing the digging meme is no surprise.

  • @wawoai
    @wawoai months ago

    Hey Matthew, great video. I would test the models on identifying elements in a web page to see if they are useful for building browsing agents. Thank you.

  • @yevhendyachenko1384
    @yevhendyachenko1384 months ago

    We definitely need some other way to test prompts for the visual models. I would use a long explanatory system message telling the local LLM to scan and convert the picture to text, and then use that text to "read" the user prompt alongside the graphical embedding.

  • @dab9590
    @dab9590 months ago

    For Bing/Copilot, faces are automatically blurred before the image is shown to the model. It might be the same with Phi-3 on Azure. Just something to keep in mind.

  • @dhrumil5977
    @dhrumil5977 months ago

    Is it possible to do full OCR using these models??? 🤔

  • @nekomatic
    @nekomatic months ago

    Can you make a video where these models are challenged with understanding charts, graphs, diagrams, etc., and how well they extract meaningful data from these kinds of images? Are these models capable of understanding, e.g., a UML-style activity diagram?

  • @mikezooper
    @mikezooper months ago

    I’m envious of Matthew’s AI hair stylist. Only joking Matthew. Love your videos 😊

  •  months ago

    Can the models produce a specific JSON shape? For instance, with the iPhone screenshot, can we generate a JSON with the app name, GB of storage, percentage taken, percentage left, etc.? Thanks for the content, very cool, I always watch your videos.
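
    For what it's worth, one hypothetical shape such a structured response could take (field names and placeholder values are invented, not taken from the video):

    ```python
    # Hypothetical response shape for the iPhone storage screenshot; all values are placeholders.
    expected_shape = {
        "total_storage_gb": 0.0,
        "used_storage_gb": 0.0,
        "apps": [
            {"name": "WhatsApp", "storage_gb": 0.0, "percent_of_total": 0.0},
        ],
    }
    ```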

  • @matthiasschmitt644
    @matthiasschmitt644 months ago

    Hello Matthew,
    I would be interested to see how well the models transcribe handwritten pages. I think this is an interesting use case for digitizing your own notes. Please add such a case to your vision test. Thanks!

  • @Viewable11
    @Viewable11 months ago

    The image descriptions from LLaVA are perfect input for an image generation model. LLaVA's output sounds very much like the results from the "/desc" command of MidJourney. I am pretty sure that if you put the output of LLaVA into MidJourney as input, you would get nearly exactly the same image.

  • @fabriai
    @fabriai months ago

    It would be interesting to see if they can explain some software architectural diagrams. Some entity relationship diagrams and such.

  • @Oscaragious
    @Oscaragious months ago

    I think for whatever reason, LM Studio isn't sending the text with the image to the model. If you send an image, it's going to always ignore any text prompt you send along with it. Not sure if that's for every model, but you should probably consider sending the image, and then sending the text separately.

  • @MarcWilson1000
    @MarcWilson1000 months ago

    Hi Matthew
    I think a great test, which none of the vision models are yet great at, is to convert a bitmap graph to data.
    E.g. a stacked bar graph, 3 or 4 series, 2 or more categories.
    It would be a life changing productivity hack!!
    Great channel.

  • @passiveftp
    @passiveftp months ago

    Here's one for you to use - can any of them say what the X,Y pixel position of something is within the image? like an icon

  • @Redhookguy2000
    @Redhookguy2000 months ago +1

    Photo folders are not an app. They're part of the iOS operating system.

  • @TiagoTiagoT
    @TiagoTiagoT months ago

    I'm not sure that QR code is fully standard compliant. I had to do some clean-up in GIMP to get my phone to be able to read it. Mainly, it seems the two biggest issues are: first, the way the 3 big squares are colored (changing the 3 layers around, from originally dark gray, black, dark gray, to black, white, and black, made it look more normal); and second, something that is still kinda off with inversion: QR codes are supposed to work both normal and inverted, but I could only make my phone read it after I de-inverted it (so that the 3 big solid squares are black instead of white, and the 3 surrounding layers are the opposite of what I previously described), the more common look. I'm not sure what's wrong with it that makes it unrecognizable in the inverted form. I'm not quite sure how much of that can be attributed to failures of the QR code generator, failures of the more common QR code reader libraries, or even ambiguities of the standard itself; I've only got a very superficial familiarity with the standard.

  • @gideonvongalambos7221
    @gideonvongalambos7221 months ago

    Ask if it can measure distances between points in a photographic image, not in pixels but in the actual length of the object, for example the side of a building from different angles; see if it can give a reasonable estimate of its size, if not an actually accurate figure. I'm also interested in whether they could convert an image of some object into a CAD file, but I suspect they weren't trained like that.

  • @kranstopher
    @kranstopher months ago

    I've actually had pretty good luck with Llama 3 Dolphin. I tried using the LLaVA variant and I came up with kind of the same results.

  • @ComicBookPage
    @ComicBookPage months ago

    I'm curious how these vision models would do with the interaction of the text and images on a comic book page

  • @punk3900
    @punk3900 months ago

    LM is wonderful, indeed!

  • @s2turbine
    @s2turbine months ago +1

    I mean, a QR Code can have anything. OpenAI is likely not letting you do it to combat jailbreaks.

  • @gpsx
    @gpsx months ago

    GPT-4o did very well on the iPhone questions. I wonder if it had some specific training related to OpenAI's partnership with Apple.

  • @jayeifler8812
    @jayeifler8812 months ago

    So video generation completes the media capabilities of LLMs. Once that's freely and openly available LLMs will be used even more.

  • @neilomalley9887
    @neilomalley9887 months ago

    Would be good if you presented a scoreboard at the end based on your testing tactics

  • @dumbkid0
    @dumbkid0 months ago

    Actual use cases where I constantly get wrong answers from the vision models:
    1. Screenshot of an Outlook calendar; get the vision model to suggest a one-hour opening for a new meeting between 9-5 M-F.
    2. Screenshot of a bunch of workers in a factory; get the vision model to identify how many workers are not wearing hardhats (and also the total number of workers in the picture).
    3. Simply upload two identical pictures and get the vision model to tell the difference between them.
    4. Ask if a picture is computer generated (use a DALL-E 3 generated picture or one of those computer-generated portraits).

  • @Krisdomain
    @Krisdomain months ago

    11:03 My phone's QR code scanner cannot detect it either

  • @mikekearl2416
    @mikekearl2416 months ago

    I would like to see, for the vision tests, a form that is filled out with both text and handwriting. Tell it to create an organized and well-formed JSON object using the data found on the filled-in form.

  • @arinco3817
    @arinco3817 months ago

    The UI recognition stuff is interesting from a designing-agents point of view.
    How about estimating the coords of a certain point? Useful for mouse clicking.

  • @EarthwormJeff
    @EarthwormJeff months ago

    Hi, could you test a chessboard description? Why not ask what the best next move would be? Thanks.

  • @gamingthunder6305
    @gamingthunder6305 months ago

    Here is what I found: if you copy the vision model file into a folder with an uncensored model and rename the vision model file to match the model, it will load and work much, much better. I tested it with the Dolphin 2.9 model.
    Thanks for sharing, Matthew.

  • @rghughes
    @rghughes months ago

    For the GPT-4o testing, I think there may be some personalization settings that may be affecting the results. The responses seem too succinct to me.

  • @AlfredNutile
    @AlfredNutile months ago

    Maybe try to use them for reading a webpage and scraping the right data from it

  • @gh0stgl1tch
    @gh0stgl1tch months ago

    How can we run phi3-vision locally?

  • @thanksfernuthin
    @thanksfernuthin months ago

    You can ask the AI not to editorialize and to give a clear description. It works.
    And for one answer, you thought it should say the Photos app. It might not have considered Photos an app, just the storage used for your photos.

  • @elgodric
    @elgodric months ago

    What's the GPU RAM required for these models?

  • @cstuart1
    @cstuart1 months ago +1

    What was the temperature set at for Llama?

    • @ayushmishra5861
      @ayushmishra5861 months ago

      What is temperature in AI context, and what difference does it make?

  • @taveiraadriano
    @taveiraadriano months ago

    Can you try uploading a graph picture (such as a CPI YoY line graph) and asking the models how they understand it? At what average pace is it moving? In which date intervals is it growing or shrinking?

  • @SoCalGuitarist
    @SoCalGuitarist months ago +1

    Both models feature the same problems most of the small parameter vision models suffer from, too much fluff and useless AI jargon, negative hits ("there are no other people or animals in this image"), useless summaries and issues with accurate OCR. They're not horrible, but when you're trying to work with them in production, the warts show up quickly. I've fine tuned multiple different families, the only one that gets close to GPT4.5 Turbo performance was LLaVANext Vicuña 13B. Solid reading skills, good awareness of what's actually happening in a scene (comprehension), less AI jargon and fluff, and in my testing, most accurate out of 5 or 6 different model families I've tried including Idefics 1/2, cog-vim, llama3 llava, llava 7B/13b/32B, Moondream, Phi, BLIP (yuck), and a few others I've dredged up on HF.
    Now with GPT4o, the best got waaaay better. Accuracy rate is in the high 96 - 98% range (Vic 13B hits around 90%, rest are in the mid 70's or lower), detailed JSON output, and 1/4 the cost of GPT4 API. Before lots of folks reply with how great Phi-3 is for their RP chats, I'm using it for production vision feature analysis where it has to fill in a whole bunch of fields per input image, hence JSON mode.

  • @MaximilianPs
    @MaximilianPs months ago

    I've used Claude to get help with Unity and I sent him a lot of screenshots, and he was amazingly good!
    GPT and Copilot failed miserably instead 😅