What If Someone Steals GPT-4?

  • Published Jun 13, 2024
  • Links:
    - The Asianometry Newsletter: www.asianometry.com
    - Patreon: / asianometry
    - Threads: www.threads.net/@asianometry
    - Twitter: / asianometry

Comments • 398

  • @emuevalrandomised9129 6 months ago +469

    Honestly, it would be a very curious idea to see how the model would behave in the absence of all the limiting systems.

    • @100c0c 6 months ago +74

      From what I've read, not as good as you'd assume. Just more erratic and wrong...

    • @quickknowledge4873 6 months ago +39

      @@100c0c mind sharing what you read specifically? Very interested in coming up with my own conclusion on this.

    • @amandahugankiss4110 6 months ago

      endless child porn
      that seems to be the goal of all of this

    • @nobafan7515 6 months ago +16

      @@100c0c What's weird is I've been hearing the main one is already making more errors because users feed it incorrect data.

    • @obsidianjane4413 6 months ago

      It will just do any dumb sht the meat puppets tell it to.

  • @mikebarushok5361 6 months ago +147

    A very good friend of mine did some recent work upgrading storage for the research division of a very large pharmaceutical corporation.
    Their security protocols were good, but also inflexible, creating motivation to work around restrictions that slowed the upgrade down to a near standstill.
    The financial incentives, combined with a sense of hubris, resulted in several major security controls being temporarily bypassed in ways that weren't fully auditable.
    If an insider was waiting for the moment when exfiltration of very expensive and proprietary data and software was possible, then they got their chance.
    Security is always in tension with getting work done and there's no such thing as perfect security.

    • @fxsrider 6 months ago +4

      Even on my level, typing my password every time I wake up my computer gets on my nerves. Encrypted files are fun as well. I have removed security numerous times only to swing the other way worrying about malware etc. This is on my personal PC.
      I worked for decades at an aerospace company that had sign-in and log-on requirements that were super annoying to repeat many times a day. Then it seemed I had to change my password all the time; everyone had to, every 3 months or so. To the point that I had rolled through the entire alphabet as the last character and was well into the upper case when I retired.

    • @mikebarushok5361 6 months ago +2

      @@fxsrider I know that same frustration with frequently having to change passwords at aerospace companies, having worked for a couple of them myself. It was an open secret at one of them that everyone left post-it notes with their most recent password under the keyboard.

    • @craigslist6988 6 months ago +5

      As an engineer I've never once seen a company that wasn't compromised by China. China has a lot of people trying, and small US companies are easy fodder. People act like best security practices simply existing somewhere makes the tech world safe... but if you graphed population vs. competency in IT, it would look like wealth in the US: almost all of the high competency is in a very small number of people, and the other 99% are abysmal. It's hard to be smart enough about security now; there are so many attack vectors, and corporations see security as an expensive cost guarding against a low-probability, high-punishment risk, so they justify not paying for it. And honestly, the money needed to compete for the few people who are actually very competent might not be worth it to the company.

  • @michaelpoblete1415 6 months ago +48

    Llama 2 is now almost at the level of GPT-3.5 even without any breaches, and Llama 3 might be at the level of GPT-4. In that case, since the Llama series is open source, the question of what would happen if GPT-4 were stolen might become moot and academic: anyone can just download open-source Llama, which in the near future might reach GPT-4's level.

    • @ebx100 6 months ago +1

      Well, Llama is only sort of open source. If you commercialize it, you pay.

    • @michaelpoblete1415 6 months ago +6

      @@ebx100 This video's topic is the ramifications of GPT-4 getting stolen. With a stolen model, you don't even have the option to pay for it; you go straight to jail.

    • @96nico1 6 months ago

      Yeah, I had the same thought.

    • @joaosturza 5 months ago

      @@ebx100 It doesn't prevent people from commercializing it covertly. Proving it would require showing that a certain work was done by a specific AI, something we currently cannot do.

  • @nixietubes 6 months ago +37

    Common Crawl doesn't provide data only for machine learning; it's for research of all sorts. And the 45 TB number is inaccurate: the dataset is measured in petabytes.

  • @Nik.leonard 6 months ago +30

    This already happened in the image-generation space when the NovelAI model got leaked from a badly secured GitHub account, then downloaded and used as a (somewhat) foundational model for a lot of anime image-generation models.

  • @asdkant 6 months ago +12

    Small correction: SSH is used for remotely operating (Unix and Linux) machines; for API and web traffic it's more common to use TLS (also colloquially called SSL, though technically SSL is the older protocol).
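To make the SSH/TLS distinction above concrete, here's a minimal sketch using only Python's standard library. It shows the defaults a TLS client context uses for web/API traffic; nothing here is specific to any particular service:

```python
import ssl

# TLS (colloquially still called "SSL") secures web and API traffic;
# SSH is a separate protocol for remote shells. A default TLS client
# context already enforces certificate validation and hostname checks.
ctx = ssl.create_default_context()

print(ctx.check_hostname)  # certificate hostname checking is on by default
print(ctx.verify_mode)     # CERT_REQUIRED: the peer must present a valid cert
```

Passing this context to `ssl.SSLContext.wrap_socket` (or to an HTTP client) is what upgrades a plain TCP connection to an encrypted one.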

  • @nexusyang4832 6 months ago +7

    It's just a matter of time before we see a "Folding at home" equivalent project that can train a single model in a distributed, decentralized way. Then it isn't about theft, but about what can be done with such a tool....

  • @dingodog5677 6 months ago +5

    If AI is based on what's on the internet, it's going to be the dumbest thing around. Garbage in, garbage out. It'll probably become sentient and commit suicide from depression.

  • @sangomasmith 6 months ago +35

    It is darkly hilarious to watch AI companies spend enormous effort and resources to fend off the theft of their models, when the models themselves were built off of stolen and public-domain data.

    • @makisekurisu4674 6 months ago +5

      Hence stealing stolen goods is perfectly fair.

    • @relix3267 5 months ago +2

      not exactly

    • @vidal9747 4 months ago +3

      There is "public" in public domain... You can argue it is wrong to train on non-public-domain data.

  • @magfal 6 months ago +5

    0:44 I don't know how successful OpenAI would be in enforcing the proprietary nature of their model if it leaked.
    It's built upon mountains of stolen and misappropriated data after all.

  • @insom_anim 6 months ago +8

    I think the AI companies are probably more afraid of an open source competitor that makes all of these protections irrelevant. There's no need to steal something built on publicly accessible information with enough time and effort.

  • @TheOwlGuy777 6 months ago +1

    I work next door to a movie studio. Our own IT department monitors all traffic in the area and there are multiple mobile piracy attempts a week.

  • @moth.monster 6 months ago +21

    What people think large language models are: Skynet, HAL-9000
    What large language models really are: Your keyboard's predictive text if it read the entirety of Reddit

    • @SalivatingSteve 6 months ago

      This x1000. The fear mongering over AI is way overblown. The models are useless without new human-created data to feed into the system. My CS professor pointed out that if people stop posting on Stack Overflow or Quora because they're now using ChatGPT instead, then it will just regurgitate old info and get outdated very quickly. It turns into this weird bootstrap-paradox feedback loop where "knowledge" effectively stagnates.

    • @guilhermealveslopes 4 months ago

      The entirety of Reddit plus lots of other sources.

  • @LimabeanStudios 6 months ago +17

    The effectiveness of generating training data from existing public models has been really impressive. The open-source community has been embracing it, for obvious reasons, with some real results. Right now, fine-tuning on generated data is where it's most used.

  • @florianhofmann7553 6 months ago +44

    So ChatGPT pulls all these answers out of only one TB of data? Sounds like the most efficient data compression we've ever created.

    • @tardonator 6 months ago +31

      It's lossy.

    • @Greyboy666 6 months ago +19

      1 TB of /parameters/, trained on 45 TB of text. That's an absolutely staggering amount of information for what it can manage.

    • @dtibor5903 6 months ago +25

      LLMs don't store the training data like a database; they remember it more the way humans do. It's lossy, it has gaps, it has mistakes.

    • @Geolaminar 6 months ago

      That's because AIs don't store their answers. I don't know how many times it has to be explained that AIs are not lookup tables. They're not compression, lossy or otherwise; that's made up by the NoAI crowd to pretend a generative AI can't produce original work. It was literally never true: compression doesn't let you retrieve something that wasn't in the original dataset.

    • @gorak9000 6 months ago +4

      They must be using Hooli Nucleus
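The figures tossed around in this thread invite a quick back-of-envelope check. A sketch, assuming the ~1 TB of weights and ~45 TB of text quoted above and 2-byte (fp16) parameters; none of these are confirmed numbers for GPT-4:

```python
weights_bytes = 1e12   # ~1 TB of model weights (figure quoted in the thread)
bytes_per_param = 2    # assuming fp16 storage
n_params = weights_bytes / bytes_per_param   # ~500 billion parameters

training_bytes = 45e12  # ~45 TB of training text (figure quoted in the thread)

# Roughly how much source text "backs" each parameter? About 90 bytes:
# far too little for verbatim storage, which is why recall is lossy.
print(training_bytes / n_params)  # 90.0
```

Under these assumptions each parameter corresponds to only a few dozen bytes of text, which is consistent with the replies: the model generalizes, it does not archive.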

  • @RandomPerson-bv3ww 6 months ago +9

    As usual with these questions, it's not if but when.

  • @bbirda1287 6 months ago +8

    You have to remember he mentions state actors many times during the presentation, so a lot of the hardware/software/resource limitations for anonymous hackers don't really apply. State actors can easily have servers to store petabytes of information and multiple high-speed connections for the download.

    • @aspuzling 6 months ago +2

      I think the reason data has to be exfiltrated slowly is that it probably sits behind hardware that limits the speed of any outgoing network connection.

    • @SalivatingSteve 6 months ago +4

      @@aspuzling It has to be done stealthily, with lots of connections masked to look like normal traffic, because trying to download a massive amount of data to a single user would raise red flags.
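The slow-exfiltration point above can be put in numbers. A hedged back-of-envelope for moving a hypothetical 1 TB model at various outbound rates; the rates are illustrative, not measurements of any real network:

```python
model_bytes = 1e12  # hypothetical 1 TB of weights

for rate_mbps in (10, 100, 1000):
    # bytes -> bits, divided by the line rate in bits per second
    seconds = model_bytes * 8 / (rate_mbps * 1e6)
    print(f"{rate_mbps:>4} Mbit/s: {seconds / 3600:.1f} hours")
```

At a throttled 10 Mbit/s the transfer runs for over 200 hours, which is exactly the kind of sustained outbound flow that egress monitoring exists to catch; hence the need to split and disguise it.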

  • @dr.eldontyrell-rosen926 6 months ago +4

    "Malicious capabilities?" please define.

    • @retard1582 months ago

      Generation of spam so convincing it will fool 90% of laymen. Help with the creation of fake bank-login landing pages and fake shopping sites. All kinds of things are possible:
      voice spoofing, fake news generation, propaganda creation.

  • @aniksamiurrahman6365 6 months ago +9

    For LLMs to be truly embedded all around people's lives, they need to be open sourced. There are many important things that could be done with GPT-4: automating corporate paperwork, aiding peer review of scientific research, summarizing and investigating documents, etc. What Microsoft is doing will never deliver these. The closed-source nature also ensures that there can't be anything better than what they've got, essentially inhibiting any proper growth and application.

  • @isbestlizard 6 months ago +90

    What if someone steals the collective writing of humanity, every book, news article, and Reddit post ever written, and uses it to train a model they then consider a proprietary trade secret? Can you really "steal" something that was already stolen and hoarded?

    • @dr.eldontyrell-rosen926 6 months ago +23

      They hope to build these institutions, amass huge investments and valuations, and then cash out when regulations really hit.

    • @TwistersSK8 6 months ago +22

      When you read a book and acquire new knowledge, are you stealing the knowledge from the author of the book?

    • @stevengill1736 6 months ago +2

      Apparently the use of synthetic data is supposed to avoid DRM or copyright issues as well as speed up processing, but I had to look up synthetic data:
      en.wikipedia.org/wiki/Synthetic_data

    • @howwitty 6 months ago +14

      @@TwistersSK8 Uhhh... not the same as a machine "reading" the book. Isn't that obvious? Pirates made a similar argument that copying digital files isn't theft because the owner still has the original copy. Maybe you should try stealing this book?

    • @EpitomeLocke 6 months ago +10

      @@TwistersSK8 lmao, are you seriously equating a human and an AI model?

  • @joaosturza 5 months ago +2

    The companies would immediately be massively sued if the training data leaked, as it would give every party whose work is in it grounds to sue. It's an unwinnable battle: hundreds, potentially tens of thousands, of IP holders would sue OpenAI.

  • @theobserver9131 6 months ago +3

    Not being an IT guy, I'm a little bit confused. I thought that OpenAI meant open-source code, which I thought means that anyone can copy, use, and even modify it?

  • @raylopez99 6 months ago +51

    The biggest risk of GPT "theft" is simply an employee walking out the door with the knowledge of GPT. In California you cannot stop an employee from using what they remember; you can stop them from taking files with them, however. It's a delicate balance, but in general "information wants to be free" and it's hard to keep stuff proprietary. At its core, GPT is matrix multiplication, which cannot be copyrighted per se.

    • @raylopez99 6 months ago +5

      Also, non-compete agreements have to be reasonable, and in California they are generally not enforced except in specific circumstances.

    • @dtibor5903 6 months ago +5

      Absolutely true, but recreating the same training data costs a lot.

    • @vvvv4651 6 months ago +4

      Nobody can remember 1 TB of data out the door, buddy 😂. True though.

    • @dtibor5903 6 months ago +9

      @@vvvv4651 What's more important is how the training data was organized, structured, and formatted, and the training methods. If that information were really that secret, other LLMs would be far, far behind.

    • @theobserver9131 6 months ago

      @@vvvv4651 There are a few special people who remember absolutely everything they see. They're usually fairly challenged cognitively, but they can remember a whole phone book just by reading it once. Have you ever heard of Rain Man?
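The "at its core, GPT is matrix multiplication" claim from this thread can be illustrated with a toy, dependency-free sketch: a single dense layer is just activations multiplied by a weight matrix (real models add many such layers, nonlinearities, and attention):

```python
# Toy illustration: one dense layer = activations @ weights.
def matmul(a, b):
    # (m x k) times (k x n) -> (m x n)
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

x = [[1.0, 1.0]]          # a 1 x 2 "activation" vector
w = [[2.0, 0.0],
     [0.0, 3.0]]          # a 2 x 2 weight matrix (the "stealable" part)

print(matmul(x, w))       # [[2.0, 3.0]]
```

The weights `w` are what a thief would exfiltrate; the multiplication itself is public knowledge, which is why the weights, not the math, are the trade secret.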

  • @damien2198 6 months ago +4

    It's going to be nice when we're able to run these huge models distributed, trained, and inferred on "Folding@home"-style systems, uncensored.

  • @AmericanDiscord 6 months ago +3

    The data is available and there are open source models with close to equivalent performance. The problem is the cost curve for more advanced queries. The leaders in AI will likely be determined by access to efficient hardware, not anything else. Worrying about protecting weights, while it shouldn't be ignored, is the wrong direction.

    • @SalivatingSteve 6 months ago

      This is why the USA has put restrictions on certain GPU & chip exports to China.

    • @AmericanDiscord 6 months ago

      @@SalivatingSteve I don't think improvements to current hardware architectures are going to get AI past the coming hardware wall. You are going to be looking at something different.

  • @cbuchner1 6 months ago +7

    A verbatim copy of those 1 TB of weights would not be valuable for very long, as I'm sure OpenAI is continually updating and refining them and already has the next big thing in the pipeline. It would just be a momentary snapshot with a fixed knowledge cutoff.

    • @joaosturza 5 months ago

      The training data, however, is so precious it would warrant a massive ransom: its public release would see every IP holder suing the company, especially since in several jurisdictions you are required to defend your copyright against violations, and not suing OpenAI might eventually be interpreted by a judge as not caring whether your work appears in any AI.

  • @okman9684 6 months ago +8

    Imagine downloading the full version of GPT-4 over your internet connection.

    • @florin604 6 months ago

      😅

    • @romanowskis1at 6 months ago +5

      Easy with fiber to the home; I think it should take a few hours to save it all to an SSD.

    • @michaelpoblete1415 6 months ago +12

      The problem is what hardware to run it on.

  • @AlexDubois 6 months ago +3

    Data at rest is only encrypted for the layers below the encryption process; if encryption is done by the OS, clients of the OS see the data in the clear, so which layer does the encryption matters. For encryption of data in use, Intel SGX is a very common way to secure cloud payloads; however, an application vulnerability in the code running in SGX negates SGX's security properties. This is why languages such as Rust should be used, and the number of lines running inside the enclave needs to be limited as much as possible to shrink the attack surface. A man-in-the-process attack on such an enclave is very hard to detect.

  • @aleattorium 6 months ago +4

    9:30 - also worth researching the Okta and Microsoft Azure hacks via their ticketing and support systems.

  • @Charles-Darwin 6 months ago +10

    I would think Quora is a massive source of conversational Q&A made available to the dataset, unfettered: Adam D'Angelo is basically a senior board member at both companies.
    Also, what OpenAI did in going live with such a simple interface was a 100% stroke of genius. I firmly believe this format provided not only training but a very solid baseline of what humanity cares about within the data set (otherwise there is just way too much data to model on). This bootstrapped a "scope" to start from and trained errors out based on whether users accepted the result of a query. It's probably some of the secret sauce behind why they can iterate so fast: the end user.

    • @SalivatingSteve 6 months ago

      Exactly, the project narrows the scope on its own as it trains out errors.

    • @aapje 6 months ago +3

      Quora is extremely low quality data, though, for the most part.

  • @obsidianjane4413 6 months ago +9

    Meh.
    The LLM datasets are less important than the algorithms that build them. GPT is just a chatbot. A big, good training set is valuable for its functionality and the cost it took to build. Lots of datasets are being built these days; they are going to be like cryptos: the first one was valuable, but then everyone made one and the value of all of them dropped.
    Chatbots are good at "talking", in that they can predict what a human would say based on the keywords in the prompt. But the model does not "know" or "think" anything; most of them are dumb. Their best utility is in making serendipitous connections between concepts and ideas from masses of data.

    • @isbestlizard 6 months ago +1

      What do you think a human mind is, but lots of chatbots talking with each other, supervising each other's output, correcting, analysing, reviewing, rating, and amending in a way that creates the epiphenomenon of intelligence?

    • @obsidianjane4413 6 months ago +7

      @@isbestlizard That is not what the human mind is any more than it is a computer, or any other poor metaphor used before.

  • @EyesOfByes 6 months ago +2

    13:13 Glad I'm not the only one thinking that was Sam

  • @behindyou702 6 months ago +3

    Love the way you present your research, can’t believe I wasn’t subscribed!

  • @johnmoore8599 6 months ago +21

    Tavis Ormandy found Zenbleed, where the CPU was exposing data from the system. I think hardware vulnerability security testing is in its infancy, and he's one pioneer approaching it with software.

    • @SurmaSampo 6 months ago

      Tavis is a rockstar in the field!

    • @honor9lite1337 6 months ago +1

      @@SurmaSampo Is he still at Google?

  • @monad_tcp 6 months ago +9

    I would say that if it happened, it would be overall a good thing.
    It's too powerful a thing to be in the hands of a few people.
    I don't believe anyone has the magical ethics to be able to decide for, or "protect", humanity from any bad outcome.
    Rather the other way around: in trying to do good without the input of the rest of humanity, they are sure to end up doing evil.

  • @johnbrooks7350 6 months ago +93

    It's crazy to me that these models are so huge. I do wish many of them would be released entirely to the public. Even with the risks, I think open source and open development lead to the best long-term outcome for everyone.

    • @Fs3i 6 months ago +18

      Llama-2 is the biggest open source model. It’s very mid.

    • @H0mework 6 months ago +7

      @@Fs3i Goliath-120B is based on llama and I heard it's very good.

    • @magfal 6 months ago +1

      @@Fs3i It's not open source; it's relatively permissively licensed.

    • @henrytep8884 6 months ago +3

      Yes, let's give everyone nuclear weapons... NO, WE DON'T DO THAT.

    • @johnbrooks7350 6 months ago +25

      @@henrytep8884 homie…. So only give private companies nuclear weapons??? What the hell is this ancap logic

  • @jjj8317 6 months ago +14

    The goal is to build things in America, Canada, Europe, etc. by said people. The thing is, Chinese Canadians are also Canadian, and Chinese Americans are also American. It is not possible to ignore the issues that arise from people who have links to, or are literally part of, the Chinese state in the aforementioned countries.
    Also, there is nothing wrong with being proud of your roots, even proud of a direct association with the People's Republic of China. You just don't want Chinese nationalists actively managing a data center when there are other people who are perfectly capable.
    I think people who can't differentiate the PRC and Chinese people are an issue, just as it is true that companies dealing in critical tech should be aware of people who have links to other states.

    • @stefanstankovic4781 6 months ago +4

      I'd rather not have any nationalist actively manage a data center, thank you very much.
      ...assuming we're using the term "nationalist" in a fanatical/irrational sense here.

    • @bruceli9094 6 months ago +1

      I think the future is India, though. They currently have the world's biggest population.

    • @SalivatingSteve 6 months ago

      I think tech companies who pull the H1-B visa scam to save a few bucks on payroll are especially at risk of IP theft from foreign actors.

    • @jjj8317 6 months ago

      @@bruceli9094 A bit of the same issue: a huge nationalism problem that puts Indian or Sikh values over Canada or America. In Canada there are riots where these two groups beat the hell out of each other; there have been assassinations and terrorist attacks. You have to prioritize the needs of the country above everything. I can tell you as an immigrant that some of the people who move to North America are a testament to bad screening practices. In Canada there have been cases of Chinese nationals who were somehow allowed to work in defense programs and handed frigate blueprints and signaling codes to the Chinese state. In the case of the UK, a man who worked in their nuclear program stole blueprints and recreated the bomb in Pakistan. So the idea that it doesn't matter whether a person is loyal to the country is ridiculous.

    • @jjj8317 6 months ago

      @@stefanstankovic4781 You want to ensure that your tech companies and data centers filter out people who have direct ties to foreign states. Canada has suffered a lot of security breakdowns due to a lack of oversight and security clearance. It is very simple: you don't have to like American or Western doctrine, but as long as you are Western, you will be targeted, so you don't want people whose entire goal is to disrupt the environment you work and live in to control your data.

  • @damien2198 6 months ago +21

    That's why OpenAI is planning to have their own hardware: whoever controls the hardware controls the model (which would only be able to run on that specific hardware).

    • @nekogami87 6 months ago +3

      Pretty sure they don't? The CEO opened a new company and used the OpenAI name to sell it to investors, but I'm pretty sure that new entity has nothing to do with OpenAI (and is fully for-profit).

    • @sumansaha295 6 months ago +2

      Unless they are running their models on quantum computers, it makes no difference. At the end of the day it's still just matrix multiplication in a specific order.

    • @dtibor5903 6 months ago +2

      @@sumansaha295 Matrix multiplications do not need quantum computers.

  • @GungaLaGunga 5 months ago

    Basically as the compression gets better, all of human knowledge can be copy pasted onto any device in seconds.

  • @Quast 6 months ago +2

    8:25 Finally we know what John Doe looks like!

  • @whothefoxcares 6 months ago +3

    Someone like 3lonMu$k could teach machines that greed is good.

  • @ikuona 6 months ago +1

    Just copy it onto a floppy disk and run away. Easy.

  • @nwalsh3 6 months ago +6

    While I refuse to call things like ChatGPT "AI", I can't deny that the security and usage scenarios fascinate me to no small degree. Partly because of my work background in security, but also because of how these text generators are being used, with little regard for what people type into them.
    When companies actively have to go out on their internal communication channels and say "don't put personal or business data into [insert system here]", then you know that access controls, usage policies, and filters on people are basically non-existent.
    Some years back MS did a video on how the various security layers in their datacentres are supposed to work (or was it AWS?). A good watch but, as with all things, a bit rosy. I worked at a company that had what they called a "secure facility". It was in fact so secure that when a cleaner went to clean one of the server rooms, they yanked out a cable to run their machine... and 3/4 of the servers just stopped responding. Very secure indeed.

    • @SurmaSampo 6 months ago

      Cleaners are the natural predators of data centers.

    • @SalivatingSteve 6 months ago

      The janitor unplugging a critical server sounds like my ISP Charter Spectrum.

    • @nwalsh3 6 months ago +1

      @@SalivatingSteve It wasn't just the server... it was a section of server racks that went. :D
      AND it was not an isolated incident either.

    • @NATANOJ1 6 months ago +1

      I worked in several IT offices; there was always someone with a similar story where a cleaner just pulled a plug in the server room in order to clean.

  • @binkwillans5138 6 months ago +1

    Open the pod bay doors, HAL.

  • @szaszm_ 6 months ago +1

    I wonder whether NN model parameters fall under copyright law, and if not, whether there's anything protecting them from copying. They're not really art, and it's not clear whether they're even a human creation.

  • @av_oid 6 months ago +2

    Steals? Isn’t it OPEN AI? Or should it be called ClosedAI?

  • @yeshwantpande2238 6 months ago +2

    You mean to say it hasn't yet been stolen by traditional thieves? And will GPT-4 help steal itself?

    • @glennac 6 months ago

      “Isn’t it ironic?” - Morissette

  • @svankensen 6 months ago +3

    Great video as always, but... you didn't answer the main question in the title. You covered how it would happen, not what the consequences could be.

  • @Kyzyl_Tuva 6 months ago

    Fantastic video. Really appreciate your channel.

  • @hermannyung7689 6 months ago +1

    The only way to prevent the model from being stolen is to keep pushing new and better models.

  • @Nathan-ko8um 6 months ago +1

    gpt-5: the girthening

  • @scarvalho1 5 months ago

    I love this video. Excellent and interesting title, and very good research.

  • @thomasmuller6131 5 months ago

    It sounds like sooner or later everyone will have their own personal LLM, and there will be no money to be made in providing the service itself.

  • @Dissimulate 6 months ago

    The most humorous part of that deer picture was the word humorous in the caption.

  • @user-cd4bx6uq1y 3 months ago +1

    0:10 btw that Andrej guy is pretty controversial

  • @VEC7ORlt 6 months ago +2

    What will happen? Nothing, nothing at all. The world will not implode, the internet will be fine, and LLMs will give the same half-assed answers as before. Maybe some stock numbers will fall and poor CEO heads will roll, but I'm fine with that.

  • @fffUUUUUU 6 months ago +2

    Yeah, someone * cough cough * China Iran Russia

  • @marcfruchtman9473 6 months ago +1

    A little bit misleading... obviously "nothing" happens. It's like asking what happens if an actor steals the open-source script of a public play. There are so many open-source near-equivalents to GPT-4 now, and the data is simply out there to be scraped, without having to do any hacking at all.

  • @vvvv4651 6 months ago +1

    Haha, this popped up in my feed right after I was fantasizing about possibly leaked no-limits GPT models. Well done.

  • @JoseLopez-hp5oo 6 months ago +2

    Secure multi-party computation allows sensitive data to be processed in secret without revealing the plaintext, but this is more for protecting things like medical data for research. To protect a language model or some other complex business logic, it's best not to put the code in the hands of the attacker and to use glovebox/API methods to interact with the sensitive IP without revealing it.
    Everything is so easy to hack; all your XPUs belong to me!

  • @redo1122
    @redo1122 5 months ago

    This sounds like you want to present a plan to someone

  • @Lopson13
    @Lopson13 6 months ago

    excellent video, would love to see more security videos from you!

  • @lilhaxxor
    @lilhaxxor 6 months ago +1

    TLDR: Databases with user and business information are far more valuable.
    I honestly doubt anything will happen. You need a whole infrastructure and competent staff to make use of these large models. Stealing them is completely pointless. You can't even really do ransomware with them (although, as mentioned, personal data might be in the training set, there are ways to alter such data enough to remove personally identifiable information). There is honestly nothing to worry about here in my opinion.

  • @Manbemanbe
    @Manbemanbe 6 months ago

    Good to see SBF taking that Home Ec class from prison there at 13:15 . You gotta stay busy, that's the key.

  • @nekoill
    @nekoill 6 months ago

    Whoever knows better please correct me, but I'm pretty sure the source code of the model, most likely alongside the dataset (but probably on different storage devices, both physically and virtually), is stored somewhere on a machine that isn't connected to the web at large, if connected to any kind of network at all. That doesn't eliminate the risk of data being stolen, but you'd need to be physically present at the storage site, fairly close to the computer (like *really* close), with a SATA cable shaped in a way that would allow it to serve as an antenna, or something like that. I expect OpenAI to take at least that kind of precaution, but who knows, dumb screwups happen in IT just as well.

    • @maht0x
      @maht0x 6 months ago +2

      there is no "source code" of the model; the model is the output of the training program, which takes PB of text as its input plus RLHF (reinforcement learning from human feedback) feedback (this bit was missed out of the description and is arguably the hardest to replicate). Search for OpenAI's "Learning from human preferences" paper

    • @nekoill
      @nekoill 6 months ago +1

      @@maht0x yeah, sounds like it. Thank you for correction. My familiarity with ML/NNs is superficial, I know a couple of high-level concepts and a very coarse approximation of how it works under the hood.

  • @MO_AIMUSIC
    @MO_AIMUSIC 6 months ago

    Well, considering how big the file is, stealing the parameters would be impossible to go unnoticed. And even if it were possible, it would require physically moving the storage rather than transferring it over the internet.

  • @buzzlightyear3715
    @buzzlightyear3715 6 months ago

    "The time has come." It would be surprise a number of nation states havn't been stealing the LLM today😂

  • @stachowi
    @stachowi 6 months ago

    This channel is unbelievably awesome

  • @flioink
    @flioink 6 months ago +1

    That's totally happening in the near future!

  • @lobotomizedamericans
    @lobotomizedamericans 6 months ago +1

    I'd fucking *love* to have a personal GPT4 or 5 with all BS ethical guard rails removed.

  • @astk5214
    @astk5214 6 months ago +1

    I think i would love for open-source unix skynet

  • @joelcarson4602
    @joelcarson4602 6 months ago +1

    And your interface for the model is not going to parse the model's parameters using a Commodore 64 either. You will need some serious silicon to really make use of it.

  • @g00rb4u
    @g00rb4u 6 months ago

    Get that hacker @01:02 a space heater so he doesn't have to wear his hoodie indoors!

  • @Steven_Edwards
    @Steven_Edwards 6 months ago +8

    There are so many open source LLMs trained on public resources that it is a moot point. Proprietary will never be able to keep up with open source as far as rate of improvement goes.
    When I last checked there were something like a dozen different LLMs, most of them coming out of China but plenty coming out of other places in the world. They've all been trained on different data sets, and many are up to GPT-3.5 equivalence, reached exponentially faster than it took OpenAI to get to the same level.
    Honestly the big bottleneck is the same for everyone, and that is inference. Processing prompts is an expensive proposition. I've seen it used with home systems of up to 1 PB of compute, with GPUs that still are not performant enough to be realtime.
    As of right now only the largest online services and state actors can afford inference that performs reasonably; that is the only thing preventing true democratization of AI at this point.
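The inference bottleneck described above is easy to sanity-check with back-of-envelope weight-memory arithmetic (the model sizes and precisions below are illustrative assumptions, not figures from the video):

```python
# Bytes needed just to hold the weights for inference, ignoring the
# KV cache and activations (model sizes/precisions are illustrative).
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    return n_params * bytes_per_param / 1e9

for n_params, precision, bpp in [(175e9, "fp16", 2),
                                 (175e9, "int4", 0.5),
                                 (7e9, "fp16", 2)]:
    print(f"{n_params / 1e9:.0f}B @ {precision}: "
          f"{weight_memory_gb(n_params, bpp):.0f} GB")
```

A 175B-parameter model at fp16 needs roughly 350 GB just for weights, which is why serving it in real time is out of reach for home hardware while a quantized 7B model fits on a single consumer GPU.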

  • @MostlyPennyCat
    @MostlyPennyCat 6 months ago

    I wonder if you could ask gpt to steal itself for you.

  • @lashlarue7924
    @lashlarue7924 6 months ago +7

    8:45 Look, it isn't that we here in the US don't appreciate the contributions of Chinese nationals (and others too) to our infrastructure projects. We do. The issue is that if you have family, real estate, or other ties to China, or if you LIE about those ties, then you are susceptible to being manipulated, blackmailed, or otherwise vulnerable to coercion by regimes that can snap their fingers and send your parents or children into a gulag. That's why you guys get your clearances held up. It's not that we don't like you guys, it's that we have to face the cold hard facts about what happens when someone gets their arm twisted by the Ministry of State Security.

  • @ronaldmarcks1842
    @ronaldmarcks1842 6 months ago

    Yan Xu has created a somewhat misleading graphic. For both GPT-2 and GPT-3, the architecture doesn't involve separate *decoders* in the way that some other neural network architectures do (like the Transformer model, which has distinct encoder and decoder components). Instead, GPT-2 and GPT-3 are based on the Transformer architecture, but they use only the decoder part of the original Transformer model. What Yan probably refers to are not decoders but *layers*:
    GPT-2 has four versions with the largest having 48 layers.
    GPT-3 is much larger, with its largest version having 175 billion parameters across 96 layers.
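The quoted layer and parameter counts can be cross-checked with the rough rule that each Transformer decoder layer carries about 12 * d_model^2 weights (a common approximation; d_model = 12288 is the published GPT-3 width):

```python
# Rough rule of thumb: a Transformer decoder layer holds about
# 12 * d_model^2 weights (4 attention projections + the MLP block);
# embeddings add a comparatively small amount on top.
def approx_params(n_layers: int, d_model: int) -> int:
    return 12 * n_layers * d_model ** 2

# Published GPT-3 configuration: 96 layers, d_model = 12288.
print(f"{approx_params(96, 12288) / 1e9:.0f}B")  # ~174B, close to the quoted 175B
```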

  • @GavinM161
    @GavinM161 months ago

    Hasn't IBM been doing the encryption at 'line speed' for years with their mainframes?

  • @Narwaro
    @Narwaro 6 months ago +11

    I have yet to see any positive impacts of any of this stuff. I'm kinda deep into the state of the art of research in this field and it's really not that impressive. The only thing I can see is that it replaces many stupid people in useless job positions, which is yet to be seen whether positive or negative.

  • @nahimgudfam
    @nahimgudfam 6 months ago

    OpenAI's value is in their industry partnerships, not in their subpar LLM product.

  • @johnkraft7461
    @johnkraft7461 6 months ago

    Remember what happened with the Bomb when only one guy had it? Strangely, the use of the Bomb stopped when the other guy got one too! Probably a good argument for open source from here on.

  • @Urgelt
    @Urgelt 6 months ago +2

    Purely open source models are not far behind Chat-GPT, and are advancing rapidly.
    We are approaching a tipping point: AI that is able to goal-seek and self-optimize, at which point curation of training data will no longer be much of an obstacle. AI will do it.
    The cat is almost out of the bag. It's probably too late to contain it.
    One obstacle remains: compute cycles. Training requires a lot of them. But advances are coming there, too - more compact models and better, cheaper chips tailored for training.
    AI is moving at blinding speed now. Anything proprietary you could steal will soon be obsolete - and even open source models will quickly surpass what was stolen.
    AI will fall into hands we might prefer not get it. No security protocols could prevent it, I'm thinking.
    What happens next, I can't even begin to guess.

  • @wrathofgrothendieck
    @wrathofgrothendieck 6 months ago +1

    Just don’t forget to steal the 40k computer chips that run the model…

  • @coolinmac
    @coolinmac 6 months ago

    Great video as usual!

  • @staninjapan07
    @staninjapan07 6 months ago

    Fascinating, thanks.

  • @jcdenton7914
    @jcdenton7914 6 months ago +1

    Ignore this, I am doing research and my own comment will show at the top when I revisit this.
    13:53 Model Leeching: An Extraction Attack Targeting LLMs
    attacked a small LLM
    14:39 Membership Inference Attacks on Machine Learning: A Survey
    14:50 Reconstructing Training Data from Trained Neural Networks
    Goes on to how extracting training data can lead to copyright lawsuits
    Insider threats
    16:10 "Two Former Twitter Employees and a Saudi National Charged as Acting as Illegal Agents of Saudi Arabia"
    URL not shown.
    16:58 Verizon 2023 Data Breach Investigations Report
    Not sure if useful but it's recent

  • @adamgibbons4262
    @adamgibbons4262 5 months ago

    If all chips had a unique identifier value, then couldn't you encode data so it could only be executed on a specific set of chips? Then you could simply forget about all the headaches of theft. Data would then be "secure once, execute multiple times" (on a set list of CPUs).
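The per-chip binding idea in this comment can be sketched as deriving a device-specific key from the chip's unique ID (a toy illustration only: `master`, `CHIP-0001`, and the HMAC keystream are hypothetical stand-ins; a real design would use AES-GCM with keys fused into hardware and never exposed to software):

```python
import hashlib
import hmac
import os

def device_key(device_id: bytes, master_secret: bytes) -> bytes:
    """Derive a per-device key from a chip's unique ID (toy KDF)."""
    return hmac.new(master_secret, device_id, hashlib.sha256).digest()

def xor_stream(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Encrypt/decrypt with an HMAC-based keystream (illustration only)."""
    out = bytearray()
    counter = 0
    while len(out) < len(data):
        block = hmac.new(key, nonce + counter.to_bytes(8, "big"),
                         hashlib.sha256).digest()
        out.extend(block)
        counter += 1
    return bytes(b ^ k for b, k in zip(data, out))

master = os.urandom(32)      # provisioned at manufacture (assumption)
allowed_id = b"CHIP-0001"    # the one chip allowed to run the model
nonce = os.urandom(16)
weights = b"model parameters..."

ciphertext = xor_stream(device_key(allowed_id, master), nonce, weights)

# Only a chip that derives the same key recovers the plaintext:
assert xor_stream(device_key(allowed_id, master), nonce, ciphertext) == weights
assert xor_stream(device_key(b"CHIP-9999", master), nonce, ciphertext) != weights
```

The hard part in practice is exactly what the video discusses: keeping the derived key and the decrypted weights out of reach of whoever controls the machine, which is what hardware enclaves and memory encryption attempt.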

  • @benjaminlynch9958
    @benjaminlynch9958 6 months ago +1

    I’m not terribly worried about any of these models being stolen or otherwise made non-proprietary by malicious actors. State of the art models only remain state of the art for a few months. We went from GPT 1 to GPT 4 in just 5 years. We went from DALL-E to DALL-E 3 in 33 months.
    Worst case scenario is that the stolen ‘foundational’ model becomes obsolete in 12-18 months, and likely much sooner unless it’s stolen immediately after being released. And that assumes that competing models don’t surpass it either.

    • @Valkyrie9000
      @Valkyrie9000 6 months ago

      Which is exactly why nobody steals Lamborghinis older than 6 months old. They'll just build a faster/better one. /s

  • @luxuriousturnip181
    @luxuriousturnip181 6 months ago

    If it is theoretically cheaper to steal the data than to reproduce it or create something able to compete with it, then the security of the data is a matter of when, not if. We should all be asking when this will happen; an even more troubling question is whether that "when" has already passed.

  • @Bluelagoonstudios
    @Bluelagoonstudios 6 months ago

    It happened already: researchers could extract training data from GPT by making it repeat a word many times, and it spat out that data, even personal details from whoever wrote the data in the LLM. OpenAI has closed the door by now, declaring this against OpenAI's terms. But is that solid enough? A lot of research has to be done to truly close off that one.

  • @JordanLynn
    @JordanLynn 6 months ago

    I'm surprised Meta's (Facebook) LLaMA isn't mentioned; their model was literally leaked onto the internet, so starting with LLaMA 2 Meta just releases it to the public. It's all over huggingface.

  • @MostlyPennyCat
    @MostlyPennyCat 6 months ago

    Maybe it's cyber thieves complaining it's too slow so they _don't_ encrypt memory! 😮

  • @grizwoldphantasia5005
    @grizwoldphantasia5005 6 months ago +15

    FWIW, I think the problem of stealing intellectual property is overblown, because if you rely on copying someone else's work, you have fewer resources to develop your own knowledge in the field, you are always one or two generations behind, and you don't know what to copy until the market decides what is successful. A business which relies on copying will never develop the institutional knowledge of all the hard work which is never published and can't be copied. A business which wants to do both has to put a lot more resources into the redundant efforts.
    A State-sponsored business might look like it has solved the money problem, but money is not resources, it is only access to resources, and States can only print money, not resources. The more inefficient a State-sponsored business is, the higher the opportunity cost, the fewer other fields can be investigated or exploited. It's one reason I do not fear CCP expats stealing proprietary IP; it weakens the CCP overall. The more they focus on copying freer market leaders, the more fields they fall behind in.

    • @greatquux
      @greatquux 6 months ago +3

      This is a good point and one he has brought up in some other videos on computing history.

    • @bilalbaig8586
      @bilalbaig8586 6 months ago +9

      Copying is a viable strategy when you are significantly behind the market. It allows you to keep pace with fewer resources. It might not be something China would be satisfied with, but other players with fewer resources, like North Korea or Iran, would definitely find value in it.

    • @durschfalltv7505
      @durschfalltv7505 6 months ago

      IP is evil anyway.

    • @obsidianjane4413
      @obsidianjane4413 6 months ago +3

      Except most development is based upon prior work. When you have a bot that can churn through a million patents and papers, it can put A, B, and Z together better than any human, or even collection of humans, can.
      The intellectual theft problem isn't in the stealing of the LLM, it's the theft of the documents or works by the company that builds the training model. It's common to pay for research papers and for books, etc. The claim is that they are scraping the internet for these documents without compensation or paying royalties.
      Yeah, the CCP being able to develop a 5th gen fighter aircraft really weakened them. More insidious is that authoritarian states like the PRC have institutionalized IP theft. They do this by forcing expats to spy, extorting them with implied threats to family and themselves. Chinese nationals really are a security threat to other countries and companies. That isn't sinophobia, it's just reality.

    • @SpaghetteMan
      @SpaghetteMan 6 months ago

      @@obsidianjane4413 then you'd be stuck in the same quandary as the folks at the Manhattan Project when they were looking for "Jewish Communist Spies", and never suspected that the German-born Englishman Klaus Fuchs was the Soviet Spy after all.
      "Intellectual theft" is just a politician's word for "Corporate espionage" or "headhunting for skilled experts". Only idiots cut off their own nose to spite their face; there are plenty of ways businesses and industries insulate themselves from IP theft without kicking out highly capable workers from their potential hiring pool.

  • @simonreij6668
    @simonreij6668 6 months ago

    "just as chonk" i have a man crush on you

  • @MyILoveMinecraft
    @MyILoveMinecraft 6 months ago +1

    Honestly, with the importance of AI and the significant advantage of those with full access to AI compared to those without, NOTHING about AI should be proprietary.
    OpenAI especially still pisses me off. AI was promised to be open source. Now we are further from that than ever (despite much of the foundations actually being created as open source code).

  • @simonreij6668
    @simonreij6668 6 months ago

    thank you so much

  • @Kyzyl_Tuva
    @Kyzyl_Tuva 6 months ago

    Thanks!

  • @Game_Hero
    @Game_Hero 6 months ago

    8:26 Woah there! Did that AI successfully put text, actual meaningful correct text, in a generated image???

    • @Veylon
      @Veylon 6 months ago

      Dall-E actually does okay at that sometimes these days. Hands even have five fingers most of the time and are rarely backwards.

  • @Sebastian-gf2fk
    @Sebastian-gf2fk 6 months ago

    If it happens, the modern internet will die. If it happens...

  • @SalivatingSteve
    @SalivatingSteve 6 months ago

    I would split up their model among machines based on subject areas of knowledge. Each server running its own “department” at what I’m dubbing ChatGPT University 🎓