For the whole talk I was waiting for the speaker to spring the realization on his audience that there were no trades on Wednesday, but unfortunately it never happened. There were no observed trades on Wednesday, so the sum for Wednesday is 0 and would bring down the average. The set of days has to be an independent input, as we aren't guaranteed to have trades every day. We would likely also need the full set of exchanges, as there may be an exchange that is infrequently used.
This has highlighted why a simple question turns into a complex task. I would argue that even the results shown here answer a different question than the one asked, as they do not take into account the null results (days where no apples were sold).
The question actually answered in the example was: for days on which a seller sold at least one apple, what was the daily average of apples sold for each seller?
This is where you need to be aware of the business practices happening in the real world. (Does the apple shop open only 3 days a week, or 5/6/7? Does it open on holidays, or do you just want an average over a complete time period: month/quarter/year?) The main observation is that you can't rely on the original dataset to give you your 'collapsing key'. Sorry if this was obvious.
Not at all obvious. Very good points @giles Langdon. Business rules like "Saturday and Sunday are no-sales days" may imply those days don't contribute to N. Also, a Monday-to-Friday day with zero sales volume should trigger N += 1 with Volume = 0 (i.e. it pulls down the average).
On multi-dimensional data sets it can also be important to consider things like "Bob only makes sales on the days he is at work", i.e. a user scoring 0 sales on a day may or may not contribute to N depending on whether they showed up or not...
His concepts of "Collapsing Key", "Grouping Key" and "Observation Key" are personal terms (I have not encountered them before), and he describes how they are useful.
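A minimal pandas sketch of the point about zero-volume days, using made-up numbers (a hypothetical seller "Jane" with 16 apples over 3 observed days, and an assumed Monday-to-Friday working week):

```python
import pandas as pd

# Hypothetical sales table: only days with at least one sale appear,
# so there is no Wednesday (or Friday) row at all.
sales = pd.DataFrame({
    "seller": ["Jane", "Jane", "Jane"],
    "day": ["Mon", "Tue", "Thu"],
    "apples": [6, 4, 6],
})

# Naive average: only observed days contribute to N, so N = 3.
naive = sales.groupby("seller")["apples"].mean()  # 16 / 3

# Business-rule-aware average: every working day contributes to N,
# with volume 0 on days without sales, so N = 5.
working_days = ["Mon", "Tue", "Wed", "Thu", "Fri"]
per_day = sales.set_index("day")["apples"].reindex(working_days, fill_value=0)
adjusted = per_day.mean()  # 16 / 5
```

The key move is `reindex` over an independently supplied day list, exactly because the dataset itself cannot tell you which days existed.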
@@wkxue3826 I get ya bro. You might use something like
if isNaN(salesman.day.sales) then 0 else 1 as #salesday.total
i.e. make a hashtable (a temporary totals-by-day table). I'm liking Python these days for its dataframe concept...
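The pseudocode above, sketched as runnable Python (the data and names here are hypothetical, with NaN standing in for days with no recorded sales):

```python
import math

# The "temporary totals by day" hashtable from the comment above.
sales_by_day = {"Mon": 6.0, "Tue": 4.0, "Wed": float("nan"),
                "Thu": 6.0, "Fri": float("nan")}

# The if/then/else: 0 if the day's sales are NaN, else 1.
salesday_flags = {day: 0 if math.isnan(v) else 1
                  for day, v in sales_by_day.items()}

# Number of days the salesman actually made sales.
total_sales_days = sum(salesday_flags.values())
```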
Nice example of how analysis can go wrong if we do not get the basics right. So what basically goes wrong here?
The observational units were misidentified.
In this case the question is about an aggregate of the rows in the table, not on the individual rows themselves.
The solution is straightforward. Construct a table of aggregate data first, then answer the question using the new table.
To do that we do not need a magic formula with three subtly different keys.
Just spend some time identifying the units you are trying to process.
Use aggregation, selection, joining, etc. to get a table that has the desired observational unit in each row.
Use this table to answer the question posed.
Hint: be sure you understand the concept of observational unit (IMHO, what was missing here is a clear understanding of this concept)
en.wikipedia.org/wiki/Unit_of_observation
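A sketch of that workflow in pandas, with made-up data: first build a table whose rows are the observational unit the question is about, then answer the question against that table.

```python
import pandas as pd

# Raw rows: one row per individual sale, which is NOT the
# observational unit the question asks about.
raw = pd.DataFrame({
    "seller": ["Jane", "Jane", "Jane", "Bob"],
    "day":    ["Mon",  "Mon",  "Tue",  "Mon"],
    "apples": [10, 2, 4, 7],
})

# Step 1: aggregate to the (seller, day) observational unit.
per_day = raw.groupby(["seller", "day"], as_index=False)["apples"].sum()

# Step 2: answer the question on the new table:
# average daily apples sold per seller.
answer = per_day.groupby("seller")["apples"].mean()
```

No special key terminology is needed; each step just moves the table to the observational unit the next step requires. (Zero-sales days would still need to be joined in, as the earlier comments point out.)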
A decent talk if you aren't familiar with SQL GROUP BY. For people comfortable with the concept, it's not very valuable.
This surely could have been conveyed in 5 mins.. brevity people!!
Jane sold 16 apples in 3 days. Why is her average daily amount 16?
Besides, he presents it nicely, giving the conceptual names "collapsing key" and "grouping key" to help understand the basics. Honestly, though, I think it's a really basic concept even for an entry-level analyst...
totally Wendao - but let's not criticise
Who would have thought...
Enjoyed the talk - well executed and concise...
Very good, thanks for sharing. A lil' more Pandas and it coulda' been great :)
Thanks a lot Alex.
SQL/Pandas formula:
Inner_CollapsingKey - Outer_GroupingKey = Implicit_ObservationKey
collapsingKey == primaryKey (default)
Amazing talk.
hardly...
I'm no expert, but I feel like the meth kicked in @17:30.
Also, it seems like you're unnecessarily reinventing 'df.set_index'.
Amazing talk. These seemingly simple concepts rapidly get complicated in higher dimensions.