Common mistakes in big data models

แชร์
ฝัง
  • เผยแพร่เมื่อ 31 ธ.ค. 2024

ความคิดเห็น • 65

  • @Nelmistro
    @Nelmistro 2 ปีที่แล้ว +44

    To see all 4 of these people in one video is kinda like seeing the Avengers of Big Data. I've definitely learned a lot from these guys!

  • @jkschola1
    @jkschola1 ปีที่แล้ว +3

    I've been tuning into both @SQLBI and Guy in a Cube's videos to enhance my Power BI Best Practices knowledge - what an epic crossover! 👏 Kudos to both of you for your valuable contributions! 👏

  • @kostaskolotouros4345
    @kostaskolotouros4345 2 ปีที่แล้ว +1

    The Big Four of Power BI in the same panel! Thank you guys!

  • @Furryfattal
    @Furryfattal 2 ปีที่แล้ว +1

    Great talk gentlemen. What kinda put the bow on the whole thing for me was the impromptu demo of how "follow up" meetings are formed at the very end of the segment. That really hit home, so thank you again!

  • @clausm73
    @clausm73 3 ปีที่แล้ว +6

    This is one of the best Power BI videos ever! Why? Because a lot of concepts are being touched and should lead the spectator to investigate further.

  • @Pooja-pd1ee
    @Pooja-pd1ee 3 ปีที่แล้ว +11

    Loved the entire session. There's so much you are doing for the bi community. Keep up the good work, guys.

  • @Don_Modern_Ancestor
    @Don_Modern_Ancestor 3 ปีที่แล้ว +21

    I've taken courses and consumed SQLbi and Guy in a Cube videos which are really good, but as a group you guys were great! We're starting strategy meetings for data governance, administration and modeling for a large enterprise system and appreciate your insights.

  • @shujashakir9952
    @shujashakir9952 2 ปีที่แล้ว +2

    All masters together. Great!

  • @bbbradl
    @bbbradl 2 ปีที่แล้ว +1

    Thanks! I hope you have more content on very large dataset issues & solutions!
    My take-away is "don't waste" and "Import whenever possible".
    In our case, the modeling team is using Direct Query to Views that sometimes have 50+ joins, that are non-Scalar (use ranges) that themselves are joined to others in the model.
    I'm just the front-end guy, but I've offered them the following suggestions:
    1. Make all keys Integer.
    2. Use in-memory Materialled Views for the most complex queries.
    3. On SSAS, bring the most common tables into memory.
    4. Place values most used as filters and parameters out of the Snowflakes (like Employee Status Code, Unit Code) and into their parent dimension tables, because they're always referenced and relationships are costly.
    However, the restrictions on Live Connect to an SSAS model (using the Analysis Services connector) are many. I struggle daily with how I can get by without being able to create columns or tables, how to join other datasets or models when needed, or in case I need to bypass their model and go straight to the OLTP, then how to preserve Row Level Security.
    A series would be A W E S O M E!!!

  • @sushilgulati
    @sushilgulati ปีที่แล้ว +1

    Thanks for this great conversation guys !

  • @joannageorge6628
    @joannageorge6628 3 ปีที่แล้ว +2

    Wait a minute, so you have ALL my favorite people in 1 video? I cant handle this :)

  • @happyheart9431
    @happyheart9431 ปีที่แล้ว +1

    Billion Thanks for sharing such a valuable knowledge

  • @rrestituti
    @rrestituti 2 ปีที่แล้ว +1

    You are freaking intelligent!!!!!
    I love your channels.

  • @anhphan9141
    @anhphan9141 3 ปีที่แล้ว +2

    OMG my favorite Power BI People in 1 video. Love it!

  • @micahdail
    @micahdail 3 ปีที่แล้ว +4

    Yes, 4 billion. ;-) 14:18 .
    I do want to add that to Patrick's point it is a balance of what's in your model and what you need. 4 billion worked very well regardless, but after internal discussions we decided to leave the less-frequently-used grains in direct query and free up some P2 space for other models. So we are back down to slim 1.5B on that one model which is enabling us to deploy more models to.

  • @ThanhNhandl
    @ThanhNhandl 3 ปีที่แล้ว +2

    i love the way they discussed most

  • @jjjj8058
    @jjjj8058 2 ปีที่แล้ว +1

    great discussion. My issue with import over DQ is that you lose the centrally managed RLS applied at the source.

  • @wexwexexort
    @wexwexexort 3 ปีที่แล้ว +1

    Fantastic session, thanks!

  • @AnandDwivedi
    @AnandDwivedi 3 ปีที่แล้ว +4

    Marco Alberto expecting a video on XMLA endpoints inc refresh

  • @FedericoLov
    @FedericoLov 2 ปีที่แล้ว

    Patrick! even if the load happens once a day stakeholders don't want to introduce any additional lag.
    The problem is yes it runs once a day but we don't know at what time exactly so setting up a refresh schedule becomes difficult.
    Also, partitions management in powerbi is still behind what we used to have in SSAS and incremental refresh is only available on premium.

  • @Ajuneja99
    @Ajuneja99 2 ปีที่แล้ว +1

    Thanks for the great content. You guys talked about direct query being last resort in scenarios like latest data required, or client not willing to store data to Power BI. What I have seen with a client is having database side RLS as one of the reason of not using the import mode. Any thoughts around that or any other content I can go though related to this? I'm thinking now if there is any workaround to that.

  • @reynaldomalave
    @reynaldomalave 3 ปีที่แล้ว +1

    Thank you guys!

  • @tejatarigopula3266
    @tejatarigopula3266 3 ปีที่แล้ว +1

    Thank you for sharing real-time solutions for real world problems. Adam was referring to dictionary on a column. Can you please elaborate on this. Thanks

    • @SQLBI
      @SQLBI  3 ปีที่แล้ว +1

      Read this: www.sqlbi.com/articles/measuring-the-dictionary-size-of-a-column-correctly/

  • @hmhkh
    @hmhkh 3 ปีที่แล้ว +2

    Hi Marco, Alberto, Adam and Patrick I am fan and follower for four of you, thanks for the informative discussion, we are telecom operator and we have tremendous amount of data from all our nodes, we may need a consultation from your side to optimize our data models.
    Our backend is SSAS tabular and our front end is Tableau.

    • @SQLBI
      @SQLBI  3 ปีที่แล้ว +1

      You can contact us on www.sqlbi.com/consulting/

  • @mikar636
    @mikar636 3 ปีที่แล้ว +1

    Amazing!!! Thanks for sharing!!

  • @sandeepbarge4699
    @sandeepbarge4699 5 หลายเดือนก่อน

    With the introduction of Direct Lake connection mode with data in Lakehouse, would you recommend Direct Lake or Import? Which one is more efficient and faster? Assume that I am using Fabric Lakehouse as a source.

    • @SQLBI
      @SQLBI  5 หลายเดือนก่อน

      It depends. Read: www.sqlbi.com/blog/marco/2024/04/06/direct-lake-vs-import-mode-in-power-bi/

  • @tarekdemiati9991
    @tarekdemiati9991 2 ปีที่แล้ว

    It will be very interesting to have another discussion after the "new"Datamart feature will have been implemented in production with a few enterprise customers.
    It seems like the DataMart could be an amazing tool for handling multiple large data sources.

  • @anilkumar-rg9jk
    @anilkumar-rg9jk 2 ปีที่แล้ว +1

    Do more on dax guys like this

  • @artigasmoises
    @artigasmoises 3 ปีที่แล้ว +2

    Excelente video, saludos.

  • @AnandDwivedi
    @AnandDwivedi 3 ปีที่แล้ว +1

    amazing Discussion

  • @kananparmar9656
    @kananparmar9656 2 ปีที่แล้ว +1

    Hi Adam Saxton, Patrick LeBlanc, Alberto Ferrari, and Marco Russo I have been new to Power Bi where I'm trying to fetch the on- premise data into Azure cloud and then to visualized through Power Bi.
    I'm struggling here what azure architure to develop for Azure to fetch data from POS and what Azure components is to be used to get the project going and how this Data Modeling and relationship knowledge I have got can be implemented.

  • @hojatalaii8360
    @hojatalaii8360 11 หลายเดือนก่อน

    How to calculate max consecutive negative number in a column by DAX?

  • @NeumsFor9
    @NeumsFor9 3 ปีที่แล้ว

    Chris Adamson's Mastering DWH Aggregates is a good reference for solving these issues, but also Spark is not a bad alternative in some cases.

  • @zecosta9579
    @zecosta9579 3 ปีที่แล้ว

    What was that term brought up a couple times, "filter dags" (e.g. around the 49:00 mark)? Have not heard of it before and I don't think I even understood the word correctly.
    Many thanks for the great video folks, gave me a lot to think about when developing my next reports :)

    • @SQLBI
      @SQLBI  3 ปีที่แล้ว +1

      It's "filtered aggs" meaning "filtered aggregations" - it's a specific optimization technique for SQL Server (materialized view, filtered indexes, ...) that could improve performance in a DirectQuery scenario.

  • @franciscoclaudio4818
    @franciscoclaudio4818 3 ปีที่แล้ว

    Here in Brazil, analysts use Pentaho a lot in the processing of big data. Do you indicate a better tool?

  • @bojanjovanovic756
    @bojanjovanovic756 3 ปีที่แล้ว +1

    Thank you very much ! Your contribution is priceless !

  • @bhaveshpatelaus
    @bhaveshpatelaus 3 ปีที่แล้ว

    We are talking about using Direct Query for large tables. However we need to remember that No more than 1 million rows are supported in the query. The visual is going to throw an error if it has to query 1 million rows. Need more settings to control how DQ behaves with source.

    • @SQLBI
      @SQLBI  3 ปีที่แล้ว

      The limitation is related to the temporary tables moved to the formula engine. You can have 10 billion rows in DirectQuery, but you have to be aware of the limitation depending on the queries. Very generally speaking, dimension size could be an issue in a limited relationship, but there are many other elements in play.
      But yes, this limit is definitely something we'd like to customize. It is customizable in Analysis Services, not in Power BI (yet).

  • @rerangelt
    @rerangelt 3 ปีที่แล้ว +3

    This was a deluxe video.

  • @terryliu3635
    @terryliu3635 3 ปีที่แล้ว +2

    I like the part “no cached” mode instead of “direct query”. Lol.

  • @Karenshow
    @Karenshow 9 หลายเดือนก่อน

    what about telematics data, that gets big pretty fast.

  • @paradeen2730
    @paradeen2730 3 ปีที่แล้ว

    I heard you talken about text search in large dimensions. At least the filter visuals I have tested couldn’t fold a SQL query supporting SQL full text index search. Is this your experience to?

    • @SQLBI
      @SQLBI  3 ปีที่แล้ว

      Yes, the best we've seen is using LIKE. However, *depending on the hardware*, the same search could be faster on a relational database.

  • @natnaelstesfagiorgis3077
    @natnaelstesfagiorgis3077 ปีที่แล้ว

    Advice on connecting with SQL

  • @anselmolopez8584
    @anselmolopez8584 3 ปีที่แล้ว +1

    congratulation on the knowlage you shere, i need it and i will folow you...from Spain

  • @bhaveshpatelaus
    @bhaveshpatelaus 3 ปีที่แล้ว

    As per the Analysis Services Documentation, The cardinality of the columns should not be greater than 2 billion rows. I think this is also applied for Power Bi. Imagine having ticket number in a fact table that you need to display in your report and have more than 2 billion unique values.

    • @SQLBI
      @SQLBI  3 ปีที่แล้ว +1

      The cardinality can be bigger, the issue is mainly related to performance when the column is used in a relationship. Another scenario is DISTINCTCOUNT, where the cardinality has a big role in performance. Approximate algorithms can be very useful there, take a look at this article: dax.tips/2019/12/20/dax-approx-distinct-count/
      Absolute numbers also depend on the hardware, over time the "limit" can change because of hardware improvements. In 2012 1 million was a high cardinality, today 4 million rows could have reasonable performance on dedicated hardware (I would stay below 2m rows on cloud services, though).

  • @shafa7668
    @shafa7668 2 ปีที่แล้ว +1

    I think the issue is business requirements change over time and it is expected that the existing model should be able to answer the future questions. That's why is it safe to keep as many columns as available. So if you have narrow fact tables there is a risk that in future your model will not serve the purpose.

  • @netflixaccount3210
    @netflixaccount3210 3 ปีที่แล้ว +5

    I laughed so hard at this comment of Kevin Hunt:
    ​Marco did you take all of Alberto's books?

  • @richard-social8125
    @richard-social8125 3 ปีที่แล้ว

    Summary?

  • @NeumsFor9
    @NeumsFor9 3 ปีที่แล้ว +1

    In companies with high turnover, it can be easier to start over as opposed to trying to decompose what other teams implemented. If you can't decompose it in a week, chances are you could be better off starting from scratch. The big companies are actually the worst offenders.

  • @silviab5579
    @silviab5579 3 ปีที่แล้ว

    ❤❤❤❤

  • @VladMZ
    @VladMZ 3 ปีที่แล้ว

    Oracle's OBIEE enforce Star Schemas religiously. You simply think thrice before adding another attribute to your fact table. Let alone you CANNOT create standalone fact tables without any dimensions.

  • @timianalytics7150
    @timianalytics7150 ปีที่แล้ว +1

    What's that on Patrick's chest😂

  • @rcb3921
    @rcb3921 3 ปีที่แล้ว +4

    "We have a small youtube channel"

  • @gondebas
    @gondebas 3 ปีที่แล้ว

    Wooow! Greetings! I got a question for you, big guys, on how do you welcome calculated columns in a fact table. Whenever the fact table gets fat enough, say 1 GB, the star schema does not shine anymore. What works for me in terms of performance and speeding up reports is to add calculated columns to the fact table. It is efficient not only if there is any complexity in the measure. I can notice the speed jump by staying away from such a simple function as SUMX to calculate Sales Value. Adding the additional calculated column to FactTable increases the total size of the report negligibly, so it is a relatively low cost. Even replacing measure SUMX with a pair of the calculated column of sales value and then using a measure with a simple SUM function increases the speed approximately five times. If the measure contains any grain of complexity, say the IF function, then the speed increase manifold. Am I missing something here?

    • @SQLBI
      @SQLBI  3 ปีที่แล้ว

      Details are important. If you have an IF, the calculated column is faster, but it would be better to compute the column in the data source or in Power Query, because a calculated column in a large table is not a good idea for other reasons (lower compression, higher memory consumption at refresh time, longer processing time).
      If you have a simple expression that can be pushed to the storage engine (like Quantity * Price using two columns of the same table), usually you don't see any performance advantage at query time from the calculated column.

  • @timewithsopy
    @timewithsopy 5 หลายเดือนก่อน

    I'm dealing with a view in our Data Warehouse that has 57 columns and 225 million rows of records. I also have another table with 3 billion rows of records. My refreshes drains the life energy out of me because they take so long. I know it's not a good idea to have everyhting in a single view/table. I'm here trying to find something to show the company that what we have is bad approach. Anybody out there can give me a summary and clean insight that I can pass on to our data warehouse team?

    • @SQLBI
      @SQLBI  5 หลายเดือนก่อน

      Implement an incremental refresh: you build your house every day when you go camping, but you don't bring the furniture. Your historical data is like the content of the furniture, you would not relocate every day, right?

  • @lanasedayanoch
    @lanasedayanoch 3 ปีที่แล้ว

    PATRICK, could you be more specific? which BI tools encourage flat tables? How come you have learnt that only last week - you've been in the game long enough. Also, you say you have LEARNT that they "ALL HAVE RUN INTO THE SAME PERFORMANCE PROBLEMS" . That implies hell of a lot of empirical evidence, all within "last week", we all know how many datapoints one must have for each of those to make a valid conclusion. Evidence please. This is too far-fetched a statement.