Dealing With Big Data - Computerphile

  • Published 11 Feb 2025
  • Big Data sounds like a buzz word, and is hard to quantify, but the problems with large data sets are very real. Dr Isaac Triguero explains some of the challenges.
    / computerphile
    / computer_phile
    This video was filmed and edited by Sean Riley.
    Computer Science at the University of Nottingham: bit.ly/nottsco...
    Computerphile is a sister project to Brady Haran's Numberphile. More at www.bradyharan.com

Comments • 173

  • @griof
    @griof 3 years ago +419

    Developer: we use a 3 GB database to plot some dashboards with statistical information about customer behavior.
    Marketing team: we use big data, machine learning and artificial intelligence to analyze and predict customer actions at any given time.

    • @NeinStein
      @NeinStein 3 years ago +65

      But do you use blockchain?

    • @Dontcaredidntask-q9m
      @Dontcaredidntask-q9m 3 years ago +5

      Hahaha this is so true

    • @klontjespap
      @klontjespap 3 years ago

      exactly this.

    • @laurendoe168
      @laurendoe168 3 years ago +9

      Not quite.... "Marketing team: we use big data, machine learning and artificial intelligence to MANIPULATE customer actions at any given time."

    • @johnjamesbaldridge867
      @johnjamesbaldridge867 3 years ago +2

      @@NeinStein With homomorphic encryption, no less!

  • @letsburn00
    @letsburn00 3 years ago +104

    I never realised just how much information there was to store until I tried downloading half a decade of satellite images from a single satellite at a fairly low resolution. It was a quarter of a terabyte per channel, and the satellite was producing over a dozen channels.
    Then I had to process it....

    • @isaactriguero3155
      @isaactriguero3155 3 years ago +1

      What did you use to process it?

    • @letsburn00
      @letsburn00 3 years ago +18

      @@isaactriguero3155 Python. I started with CV2 to convert to NumPy arrays, then worked with NumPy. But it was taking forever until I learnt about Numba. Numba plus pure-typed NumPy arrays is astonishingly effective compared to pure Python. I'll never look back now that I've got used to using Numba.
      I need my work to integrate with TensorFlow too, so Python works well with that.
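
      A rough sketch of the speedup pattern described above (a minimal example; the arrays here are synthetic stand-ins for CV2-decoded satellite channels):

        import numpy as np
        from numba import njit

        @njit  # compiled to machine code on first call
        def band_mean_diff(a, b):
            # explicit loops are cheap under Numba, unlike in pure Python
            total = 0.0
            for i in range(a.shape[0]):
                for j in range(a.shape[1]):
                    total += a[i, j] - b[i, j]
            return total / a.size

        ch1 = np.random.rand(2048, 2048)  # stand-ins for two channels
        ch2 = np.random.rand(2048, 2048)
        print(band_mean_diff(ch1, ch2))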

    • @gherbihicham8506
      @gherbihicham8506 3 years ago +1

      @@letsburn00 Yeah, that's still not big data, since you are presumably running Python on a single node.
      If the data is coming in in real time and needs to be processed instantly, then you'll need streaming tools like Apache Kafka. To be stored and analysed/mined, it needs special storage and processing engines like Hadoop, NoSQL stores and Spark; rarely do you use traditional RDBMSs as stores unless they are special enterprise-level appliances like Teradata, Greenplum or Oracle appliances, etc.
      Data processed using traditional methods, i.e. single-node machines and traditional programming-language libraries, is not a big data problem. Many people confuse this because they think big volumes of data are big data.
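
      For contrast, a minimal sketch of what a cluster-scale version of that per-channel processing might look like in PySpark (the path and column names are hypothetical):

        from pyspark.sql import SparkSession

        # use master("local[*]") to try this without a cluster
        spark = (SparkSession.builder
                 .master("local[*]")
                 .appName("channel-stats")
                 .getOrCreate())

        # per-channel pixel statistics; Spark splits the work across executors
        df = spark.read.parquet("hdfs:///satellite/channels/")
        df.groupBy("channel").avg("value").show()

        spark.stop()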

    • @letsburn00
      @letsburn00 3 years ago +5

      @@gherbihicham8506 Oh, I know it's not big data. "The Pile" is big data. This is just a tiny corner of the information that is available. But it was a bit of an interesting perspective for me.

    • @thekakan
      @thekakan 3 years ago

      @@NoNameAtAll2 I switch back and forth between Python, R and Julia. I love all three!
      Julia is the fastest, but R and Python have far better support and (usually) an easier time in development. When you need the best compute power, Julia it is! It's quite amazing

  • @RealityIsNot
    @RealityIsNot 3 years ago +149

    The problem with the term big data is that it went from technical jargon to marketing jargon... and marketing departments don't care what a word means... they create their own meaning 😀. Other examples include AI and ML.

    • @kilimanjarocruz660
      @kilimanjarocruz660 3 years ago +4

      100% this!

    • @Piktogrammdd1234
      @Piktogrammdd1234 3 years ago +2

      Cyber!

    • @landsgevaer
      @landsgevaer 3 years ago +3

      ML seems pretty well defined to me...?
      For AI and BD, I agree.

    • @danceswithdirt7197
      @danceswithdirt7197 3 years ago +3

      Big data is two words. ;)

    • @nverwer
      @nverwer 3 years ago +4

      More examples: exponential, agile, ...

  • @SlackWi
    @SlackWi 3 years ago +29

    I work in bioinformatics and I would totally agree that 'big data' is anything I have to run on our university cluster

    • @urooj09
      @urooj09 3 years ago +1

      @you- tube Well, you have to study biology a bit. In my college, bioinformatics students take at least one semester of biology and then take courses on biology depending on what they want to code.

  • @Jamiered18
    @Jamiered18 3 years ago +119

    It's interesting, because at my company we deal with petabytes of data. Yet, I'm not sure you could call that "big data", because it's not complex and it doesn't require multiple nodes to process.

    • @ktan8
      @ktan8 3 years ago +6

      But you'll probably need multiple nodes to store petabytes of data?

    • @jackgilcrest4632
      @jackgilcrest4632 3 years ago +2

      @@ktan8 maybe only for redundancy

    • @Beyondthecustomary
      @Beyondthecustomary 3 years ago +2

      @@ktan8 large amounts of data are often stored in RAID for speed and redundancy.

    • @mattcelder
      @mattcelder 3 years ago +7

      That's why big data isn't the same thing as "large volume". "Large" is subjective and largely dependent on your point in time. 30 years ago, you could've said "my company deals with gigabytes of data" and that would've sounded ridiculously huge, like petabytes do today. But today we wouldn't call gigabytes big data. For the same reason, we wouldn't call petabytes "big data" unless there's more to it than sheer volume.

    • @AyushPoddar
      @AyushPoddar 3 years ago +1

      @@Beyondthecustomary Not necessarily. Most of the large data I've seen being stored (think PB) is kept in distributed storage like HDFS, which came out of Google's GFS. RAID would provide redundancy and fault tolerance, but there is no HD that I know of that can store a single PB file, and it would surely not be inexpensive, as the "I" in RAID suggests.

  • @kevinhayes6057
    @kevinhayes6057 3 years ago +25

    "Big data" is talked about everywhere now. Really great to hear an explanation of its fundamentals.

    • @AboveEmAllProduction
      @AboveEmAllProduction 3 years ago

      More like 10 years ago it was talked about a lot

    • @codycast
      @codycast 3 years ago

      @@AboveEmAllProduction no, gender studies

  • @iammaxhailme
    @iammaxhailme 3 years ago +47

    I used to work in computational chemistry... I had to use large GPU-driven compute clusters to do my simulations, but I wouldn't call it big data. I'd call it "big calculator that crunches molecular dynamics for a week and then pops out a 2 MB results .txt file" lol

    • @igorsmihailovs52
      @igorsmihailovs52 3 years ago

      Did you use network storage for MD? I was surprised to hear in this video that it is specific to big data. I am doing CC now, but QC rather than MD.

    • @iammaxhailme
      @iammaxhailme 3 years ago

      @@igorsmihailovs52 Not really. SSH into the massive GPU compute cluster, start the simulation, SCP the result files (which were a few gigs at most) back to my own PC. Rinse and repeat.

    • @KilgoreTroutAsf
      @KilgoreTroutAsf 3 years ago +1

      Coordinates are usually saved only once every few hundred steps, with intermediate configurations being highly redundant and easy to reconstruct from the nearest snapshot.
      Because of that, MD files are typically not very large.

  • @sagnikbhattacharya1202
    @sagnikbhattacharya1202 3 years ago +4

    5:10 "If you're using Windows, that's your own mistake" truer words have never been spoken

  • @sandraviknander7898
    @sandraviknander7898 3 years ago +12

    Freaky! I had this exact need for data locality on our cluster for the first time in my work this week.

  • @mokopa
    @mokopa 3 years ago +30

    "If you're using Windows, that's your own mistake" INSTANT LIKE + FAVORITE

  • @leahshitindi8365
    @leahshitindi8365 1 year ago

    We had a three-hour lecture with Isaac last month. It was very interesting.

  • @gubbin909
    @gubbin909 3 years ago +17

    Would love to see some future videos on Apache Spark!

    • @recklessroges
      @recklessroges 3 years ago

      Yes. There is so much more to talk about on this topic. I'd like to hear about Ceph and Tahoe-LAFS.

    • @isaactriguero3155
      @isaactriguero3155 3 years ago +3

      I am working on it :-)

    • @thisisneeraj7133
      @thisisneeraj7133 3 years ago +1

      *Apache Hadoop enters the chat*

    • @albertosimeoni7215
      @albertosimeoni7215 3 years ago

      Worth spending some words on Apache Druid too.

  • @NoEgg4u
    @NoEgg4u 3 years ago +22

    @3:23 "...the digital universe was estimated to be 44 zettabytes", and half of that is adult videos.

    • @Sharp931
      @Sharp931 3 years ago

      *doubt*

    • @Phroggster
      @Phroggster 3 years ago +2

      @@Sharp931 You're right, it's probably more like two-thirds.

    • @G5rry
      @G5rry 3 years ago +3

      The other half is cats

  • @lightspiritblix1423
    @lightspiritblix1423 3 years ago +13

    I'm actually studying these concepts at college, this video could not have come at a more convenient time!

  • @shiolei
    @shiolei 3 years ago +3

    Awesome simple explanation and diagrams. Loved this breakdown!

  • @nandafprado
    @nandafprado 3 years ago +8

    "If you are using Windows, that is your own mistake" ...well, that is the hard truth for data scientists lol

  • @evilsqirrel
    @evilsqirrel 3 years ago

    As someone who works more on the practical side of this field, it really is a huge problem to solve. I work with data sets where we feed in multiple terabytes per day, and making sure the infrastructure stays healthy is a huge undertaking. It's cool to see it broken down in a digestible manner like this.

  • @BlueyMcPhluey
    @BlueyMcPhluey 3 years ago +2

    thanks for this, understanding how to deal with big data is one elective I didn't have time for in my degree

  • @chsyank
    @chsyank 3 years ago +1

    Interesting video. I worked on and designed big data, building large databases for litigation in the early 1980s... that was big at the time. Then a few years later, creating big data for shopping analysis. The key is that big data is big for the years that you are working on it and not afterwards, as storage and processing get bigger and faster. I think that while analysis and reporting are important (otherwise there is no value to the data), designing and building proper ingestion and storage is just as important. My two cents from over 30 years of building big data.

  • @quanta8382
    @quanta8382 3 years ago +10

    Take a drink every time they say data for the ultimate experience

    • @seancooper8918
      @seancooper8918 3 years ago +2

      We call this approach "Dealing With Big Drinking".

  • @GloriousSimplicity
    @GloriousSimplicity 3 years ago +3

    The industry is moving away from having long-term storage on compute nodes. Since data storage needs grow at a different rate than compute needs, the trend is to have a storage cluster and a compute cluster. This means that applications start a bit slower as the data must be transferred from the storage cluster to the compute cluster. However it allows for more efficient spending on commodity hardware.
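
    As a hedged sketch of that storage/compute split (the bucket and object names here are made up), a compute node typically pulls its working set from an object-storage tier at startup:

      import boto3  # assumes S3-style object storage and configured credentials

      s3 = boto3.client("s3")
      # copy the dataset from the storage cluster to local disk on the compute node
      s3.download_file("example-storage-cluster", "datasets/events.parquet",
                       "/tmp/events.parquet")
      # ...process /tmp/events.parquet locally from here on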

  • @nikhilPUD01
    @nikhilPUD01 3 years ago +13

    In a few years: "Super big data."

    • @recklessroges
      @recklessroges 3 years ago

      Probably not, as technology expands at a similar rate and the problem space doesn't change now that the cluster has replaced the previous "mainframe" (single computer) approach.

    • @Abrifq
      @Abrifq 3 years ago

      hungolomghononoloughongous data structures :3

  • @Veptis
    @Veptis 3 years ago

    At my university there is a master's programme in data science and artificial intelligence.
    It's something I might go into after finishing my bachelor's in computational linguistics. However, I do need to take additional maths courses, which I haven't looked into yet.
    Apparently the supercomputer at the university has the largest memory in all of Europe: 8 TB per node.

  • @lookinforanick
    @lookinforanick 3 years ago +1

    Never seen a numberphile video with so much innuendo 🤣

  • @Skrat2k
    @Skrat2k 3 years ago +12

    Big data - any data set that crashes Excel 😂

    • @godfather7339
      @godfather7339 3 years ago

      Nah, Excel crashes at like 1 million rows; that's not much actually...

    • @mathewsjoy8464
      @mathewsjoy8464 3 years ago

      @@godfather7339 actually it is

    • @godfather7339
      @godfather7339 3 years ago

      @@mathewsjoy8464 trust me, it's not, it's not at all.

    • @mathewsjoy8464
      @mathewsjoy8464 3 years ago

      @@godfather7339 Well, you clearly don't know anything; the expert in the video even said we can't define how big or small data needs to be to be big data

    • @godfather7339
      @godfather7339 3 years ago

      @@mathewsjoy8464 I know what he defined, and I also know, PRACTICALLY, THAT 1 MILLION ROWS IS NOTHING.

  • @Pedritox0953
    @Pedritox0953 1 year ago

    Great video!

  • @jaffarbh
    @jaffarbh 2 years ago

    One handy trick is to reduce the number of "reductions" in a map-reduce task. In other words, more training, less validation. The downside is that this could mean the training converges more slowly.
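
    If the job is expressed in Spark (one possible reading of the trick above; the data and key names are illustrative), the number of reduce-side tasks is an explicit knob:

      from pyspark import SparkContext

      # "local[*]" runs on this machine; point at your cluster in real use
      sc = SparkContext("local[*]", "fewer-reductions")
      pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)] * 1000)

      # numPartitions controls how many reduce tasks run; fewer partitions
      # means fewer, larger reductions per pass
      totals = pairs.reduceByKey(lambda x, y: x + y, numPartitions=4)
      print(totals.collect())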

  • @Georgesbarsukov
    @Georgesbarsukov 3 years ago +1

    I prefer the strategy where I make everything super memory-efficient and then go do something else while it runs for a long time

  • @advanceringnewholder
    @advanceringnewholder 3 years ago +1

    Based on what I've watched up to 2:50, big data is the Tony Stark of data

  • @kellymoses8566
    @kellymoses8566 6 months ago

    If you had the money and the need, you could fill a 1U server with 32 × 61.44 TB E1.L SSDs and then fill a rack with 40 of them for a total of 78,643 TB of raw storage. Subtract 10% for redundancy and add 2× for dedupe/compression and you get 141,000 TB usable in one rack, or an exabyte in 7 racks.
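
    A quick check of that arithmetic (the 10% redundancy and 2× dedupe/compression factors are the commenter's own assumptions):

      raw_tb = 32 * 61.44 * 40        # 32 SSDs per 1U server, 40 servers per rack
      usable_tb = raw_tb * 0.9 * 2    # minus 10% redundancy, times 2x dedupe
      print(raw_tb)                   # 78643.2 TB raw
      print(usable_tb)                # ~141557, i.e. ~141,000 TB usable per rack
      print(1_000_000 / usable_tb)    # ~7.1 racks per exabyte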

  • @rickysmyth
    @rickysmyth 3 years ago

    Have a drink every time he says DATE-AH

  • @LupinoArts
    @LupinoArts 3 years ago +4

    "Big Data" Did you mean games by Paradox Interactive?

  • @myothersoul1953
    @myothersoul1953 3 years ago

    It's not the size of your data set that matters, nor how many computers you use or the statistics you apply; what matters is how useful the knowledge you extract is.

  • @quintrankid8045
    @quintrankid8045 3 years ago +2

    How many people miss the days of Fortran overlays? Anyone?

  • @bluegizmo1983
    @bluegizmo1983 3 years ago +1

    How many more buzz words are you gonna cram into this interview? Big Data ✔️, Artificial Intelligence ✔️, Machine Learning ✔️.

  • @Ascania
    @Ascania 3 years ago +2

    Big Data is the concerted effort to prove "correlation does not imply causation" wrong.

  • @sabriath
    @sabriath 3 years ago

    Well, you went over scaling up and scaling out, but you missed scaling in. A big file that you are scanning through doesn't need to be loaded into memory in its entirety; you can work through it methodically in chunks. If you take that process and scale it out across the cluster, you end up with an automated way of manipulating data. Scale the allocation code across the RAID and you have automatic storage containment. Together, they mean you don't have to worry about scale in any direction; it's all managed in the background for you.
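
    A minimal sketch of that scale-in style of processing (the file path and chunk size are arbitrary): counting lines in a file far larger than RAM while holding only one chunk in memory at a time:

      CHUNK = 64 * 1024 * 1024  # 64 MiB per read, independent of total file size

      line_count = 0
      with open("/data/huge.log", "rb") as f:
          while True:
              chunk = f.read(CHUNK)  # only this chunk is resident in memory
              if not chunk:
                  break
              line_count += chunk.count(b"\n")
      print(line_count)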

  • @laurendoe168
    @laurendoe168 3 years ago +2

    I think the prefix after "yotta" should be "lotta" LOL

  • @glieb
    @glieb 3 years ago

    VQGAN + CLIP image synthesis video in the works, I hope? Consider this a suggestion.

  • @MaksReveck
    @MaksReveck 3 years ago

    I think we can all agree that when you have to start using Spark instead of pandas to process your datasets, and save them as partitions rather than plain CSVs, then it's big data

  • @Sprajt
    @Sprajt 3 years ago +9

    Who buys more RAM when you can just download it? smh

    • @_BWKC
      @_BWKC 3 years ago +2

      Softram logic XD

  • @kees-janhermans910
    @kees-janhermans910 3 years ago

    'Scale out'? What happened to 'parallel processing'?

    • @malisa71
      @malisa71 3 years ago

      Didn't the meaning change a few years ago?
      Parallel processing is when you work on the same problem, or part of it, at the same time.
      Horizontal scaling is when you can add nodes that don't need to work on the same problem at the same time; only the results are merged at the end.
      But the meaning is probably industry-dependent.

  • @kellymoses8566
    @kellymoses8566 6 months ago

    Big data is what it takes a full rack of servers to store.

  • @dAntony1
    @dAntony1 3 years ago +1

    As an American, I can hear both his Spanish and UK accents when he speaks. Sometimes in the same sentence.

    • @leovalenzuela8368
      @leovalenzuela8368 3 years ago +2

      Haha I was just going to post that! It is fascinating hearing him slip back and forth between his native and adopted accents.

    • @isaactriguero3155
      @isaactriguero3155 3 years ago +2

      haha, this is very interesting! I don't think anyone here in the UK would hear my 'UK' accent haha

  • @forthrightgambitia1032
    @forthrightgambitia1032 3 years ago +3

    "Everyone is talking about big data" - was this video recorded 5 years ago?

    • @malisa71
      @malisa71 3 years ago

      Why? If you work in this industry you will hear about it a few times a month

    • @forthrightgambitia1032
      @forthrightgambitia1032 3 years ago

      @@malisa71 I haven't heard anyone at work say it unironically for years. Maybe you're stuck working in some snake-oil consultancy, though.

    • @malisa71
      @malisa71 3 years ago

      @@forthrightgambitia1032 This "consultancy" has been around for almost 100 years and is one of the top companies. I will gladly stay with them.

    • @forthrightgambitia1032
      @forthrightgambitia1032 3 years ago

      @@malisa71 Defensive, much?

    • @malisa71
      @malisa71 3 years ago

      @@forthrightgambitia1032 How is anything I wrote defensive?

  • @_..---
    @_..--- 3 years ago

    44 zettabytes? Seems like the term big data doesn't do it justice anymore

  • @Goejii
    @Goejii 3 years ago +1

    44 ZB in total, so ~5 TB per person?

    • @busterdafydd3096
      @busterdafydd3096 3 years ago +1

      Yea. We will all interact with about 5 TB of data in our lifetime, if you think about it deeply

    • @ornessarhithfaeron3576
      @ornessarhithfaeron3576 3 years ago +1

      Me with a 4TB HDD: 👁️👄👁️

    • @EmrysCorbin
      @EmrysCorbin 3 years ago +1

      Yeeeeeah, 15 of those TB are on this current PC and it still seems kinda limited.

  • @serversurfer6169
    @serversurfer6169 3 years ago +2

    I totally thought this video was gonna be about regulating Google and AWS… 🤓🤔😜

  • @danceswithdirt7197
    @danceswithdirt7197 3 years ago

    So he's just building up to talking about data striping right (I'm at 13:30 right now)? Is that it or am I missing something crucial?

    • @G5rry
      @G5rry 3 years ago

      Commenting on a video part-way through to ask a question. Do you expect an answer faster than just watching the video to the end first?

    • @danceswithdirt7197
      @danceswithdirt7197 3 years ago

      @@G5rry No, I was predicting what the video was going to be about. I was mostly correct; I guess the two main concepts of the video were data striping and data locality.

  • @joeeeee8738
    @joeeeee8738 3 years ago

    I have worked with Redshift and then with Snowflake. Snowflake solved the problems Redshift had by storing all the data efficiently in central storage instead of on each machine. The paradigm is actually backwards now, as storage is cheap (the network is still the bottleneck)

  • @yashsvidixit7169
    @yashsvidixit7169 3 years ago

    Didn't know Marc Márquez did big data as well

  • @unl0ck998
    @unl0ck998 3 years ago +2

    That Spanish accent *swoon*

  • @klaesregis7487
    @klaesregis7487 2 years ago

    16 GB makes him a lucky guy? That's like the bare minimum for a developer these days. I want 64 GB for my next upgrade in a year or so.

  • @advanceringnewholder
    @advanceringnewholder 3 years ago +2

    Weather data is big data, isn't it?

    • @VACatholic
      @VACatholic 3 years ago

      No, it's tiny. There isn't that much of it (most weather data is highly localized and incredibly recent)

  • @guilherme5094
    @guilherme5094 3 years ago

    Nice.

  • @RizwanAli-jy9ub
    @RizwanAli-jy9ub 3 years ago

    We should store information and less data

  • @Thinkingfeed
    @Thinkingfeed 3 years ago

    Apache Spark rulez!

  • @khronos142
    @khronos142 3 years ago +1

    "smart data"

  • @KidNickles
    @KidNickles 3 years ago

    Do a video on RAID storage! With all this talk about big data and storage, I would love some videos on RAID 5 and parity drives!

  • @treyquattro
    @treyquattro 3 years ago

    so I'm screaming "Map-Reduce" (well, OK, internally screaming) and at the very end of the video we get there. What a tease!

    • @isaactriguero3155
      @isaactriguero3155 3 years ago

      There is another video explaining MapReduce! And I am planning to do some live coding videos in Python
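
      In the meantime, a self-contained word-count sketch of the MapReduce idea in plain Python (no cluster; the shuffle step is simulated with a dict):

        from collections import defaultdict

        docs = ["big data is big", "data is everywhere"]

        # map: emit (word, 1) pairs from every document
        mapped = [(word, 1) for doc in docs for word in doc.split()]

        # shuffle: group emitted values by key, as the framework would between phases
        groups = defaultdict(list)
        for word, count in mapped:
            groups[word].append(count)

        # reduce: collapse each group to a single count
        result = {word: sum(counts) for word, counts in groups.items()}
        print(result)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}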

  • @maschwab63
    @maschwab63 3 years ago

    If you need 200+ servers, just run it on an IBM Z server as a plain-jane compute task all by itself.

    • @malisa71
      @malisa71 3 years ago

      Did you look at the pricing of IBM Z?
      My company is actively working on moving to LUW, and we are not small.

  • @thekakan
    @thekakan 3 years ago +2

    Big data is data we don't know what we can do with _yet_ 😉
    ~~lemme have my fun~~
    6:08 when can we buy Computerphile GPUs? 🥺

  • @phunkym8
    @phunkym8 3 years ago

    The address of the visitation of Concepción Zarzal

  • @yfs9035
    @yfs9035 3 years ago

    Where'd the British guy go? What did you do with him?! Who is this guy?! Sorry, I haven't even watched the video yet.

  • @shemmo
    @shemmo 3 years ago

    He uses the word data so much that I only hear data 🤔🤣

    • @isaactriguero3155
      @isaactriguero3155 3 years ago +2

      hahah, funny how repetitive one can become when doing this kind of video! hehe, sorry :-)

    • @shemmo
      @shemmo 3 years ago

      True, true :) but I like his explanation

  • @AboveEmAllProduction
    @AboveEmAllProduction 3 years ago

    Do a hit every time he says Right

  • @grainfrizz
    @grainfrizz 3 years ago

    Rust is great

  • @COPKALA
    @COPKALA 3 years ago

    NICE: "if you use Windows, it's your own mistake"!!

  • @Random2
    @Random2 3 years ago +6

    Ehm... it is very weird that scale in/out and scale up/down are being discussed in terms of big data, when those concepts are completely independent of and predate the concept of big data as a whole...
    All in all, after watching the entire video, this might be one of the least well-delineated videos on the entire channel. It mixes parts of different concepts together as if they all came from big data, or are all related to big data, while at the same time failing to address the historical origins of big data and map/reduce. Definitely below average for Computerphile.

  • @DorthLousPerso
    @DorthLousPerso 3 years ago +2

    "1 gig of data". Look at my job as a dev. Look at my entertainment: games on Steam and videos. Yeaaaahhh....

  • @AudioPervert1
    @AudioPervert1 3 years ago

    Not everyone is talking about big data 😭😭😭😂😂😂 These big data dudes never speak of the pollution, contamination and carbon generated by their marvellous technology. Big data could do nothing about the pandemic, for example...

    • @isaactriguero3155
      @isaactriguero3155 3 years ago +1

      Well, I briefly mentioned the problem of sustainable big data, and I might be able to put together a video about this. You're right that not many people seem to care much about the amount of resources a big data solution may use! This is where we should be going in research: trying to develop cost-effective AI, which uses big data technology only when strictly needed, and when it is useful.

  • @lucaspelegrino1
    @lucaspelegrino1 3 years ago

    I want to see some Kafka

  • @AxeJamie
    @AxeJamie 3 years ago

    I want to know what the largest data set is...

    • @recklessroges
      @recklessroges 3 years ago +1

      Depends on how you define the set. The LHC has one of the largest data bursts, but the entire Internet could be considered a single distributed cluster...

    • @quintrankid8045
      @quintrankid8045 3 years ago

      largest amount of data in bits= (number of atoms in the universe - number of atoms required to keep you alive) / number of atoms required to store and process each bit(*)
      (*) Assumes that all atoms are equally useful for storing and processing data and keeping you alive. Also assumes that all the data needs to persist. Number of atoms required to keep you alive may vary by individual and requirements for food, entertainment and socialization. All calculations require integer results. Please consult with a quantum mechanic before proceeding.

  • @austinskylines
    @austinskylines 3 years ago +1

    ipfs

    • @drdca8263
      @drdca8263 3 years ago

      Do you use it? I think it's cool, but currently it competes a bit with my too-large number of tabs, and since I don't get much use from actively running it, I generally don't keep it running.
      I guess that's maybe just because I haven't put in the work to find a use that fits my use cases?

  • @JimLeonard
    @JimLeonard 3 years ago

    Nearly two million subscribers, but still can't afford a tripod.

  • @DominicGiles
    @DominicGiles 3 years ago +2

    There's data.... That's it...

  • @jeffatturbofish
    @jeffatturbofish 3 years ago

    Here is my biggest problem with all of the definitions of 'big data' that require multiple computers: what if it only requires multiple computers because the person 'analyzing' it doesn't know how to deal with large data efficiently? Quality of data? I will just use SQL/SSIS to cleanse the data. I normally deal with data in the multiple-TB range on either my laptop [not a typical laptop: 64 GB of RAM] or my workstation [again, perhaps not a normal computer, with 7 hard drives, mostly SSD, 128 GB of RAM and a whole lot of cores] and can build an OLAP from the OLTP in minutes, then run more code doing some deeper analysis in a few minutes more. If it takes more than 30 minutes, I know that I screwed something up. If you have to run it on multiple servers, maybe you also messed something up. Python is great for the little stuff [less than 1 GB], and so is R, but for big data you need to work with something that can handle it. I have 'data scientist' friends with degrees from MIT who couldn't handle simple SQL and would freak out if they had more than a couple of MB of data to work with. Meanwhile, I would handle TBs of data in less time with SQL, SSIS, OLAP, MDX.
    Yeah, those are the dreaded Microsoft words.

    • @albertosimeoni7215
      @albertosimeoni7215 3 years ago

      In an enterprise environment you have other problems to handle... availability, achieved with redundancy of VMs and disks over the network (which adds huge latency)...
      SSIS is considered a toy in big enterprises; others use ODI or BODS (SAP), which are more robust... the natural evolution of SSIS, sold as "cloud" and "big data", is Azure Data Factory... but its cost is the highest of every competitor... (you pay for every task you run rather than for the time the machine is on)

  • @lowpasslife
    @lowpasslife 3 years ago +5

    Cute accent

  • @llortaton2834
    @llortaton2834 3 years ago

    He still misses dad to this day

  • @NeThZOR
    @NeThZOR 3 years ago +2

    420 views... I see what you did there

  • @pdr.
    @pdr. 3 years ago

    This video felt more like marketing than education, sorry. Surely you just use whatever solution is appropriate for your problem, right? Get that hammer out of your hand before fixing the squeaky door.

  • @kevinbatdorf
    @kevinbatdorf 3 years ago

    What? Buying more memory is cheaper than buying more computers… which just means you're throwing more memory and CPU at it. I think you meant that the alternative is writing a slower algorithm that uses less memory. Also, buying more memory is often cheaper than the labor cost of refactoring, especially when it comes to distributed systems. And why the Windows hate? I don't use Windows but still cringed there a bit

    • @malisa71
      @malisa71 3 years ago

      Time is money and nobody wants to wait for results. The solution is to make fast and efficient programs with proper memory utilisation.
      Almost no serious institution uses Windows for such tasks; maybe on the client side, but not on a node or server.

  • @syntaxerorr
    @syntaxerorr 3 years ago

    DoN'T UsE WinDOws....Linux: Let me introduce the OOM killer.

  • @yukisetsuna1325
    @yukisetsuna1325 3 years ago +10

    first

    • @darraghtate440
      @darraghtate440 3 years ago +11

      The bards shall sing of this victory in the annals of time.

  • @vzr314
    @vzr314 3 years ago

    No. Everyone is talking about COVID. And I listened to him until he mentioned COVID in the first few minutes. Enough of the broken English anyway