More in this series 👇
- Introduction to TDA: th-cam.com/video/fpL5fMmJHqk/w-d-xo.html
- Mapper: th-cam.com/video/NlMrvCYlOOQ/w-d-xo.html
Fascinating video on applying persistent homology to market analysis! I noticed two potential data leakage issues that could affect the results:
In the initial data preparation (around line 29), the log-returns are calculated using future prices: r = np.log(np.divide(P[1:],P[:len(P)-1])). This means each return uses the next day's price, which wouldn't be known at the time.
In the Wasserstein distance calculation loop (around line 46), the second time window r[i+w+1:i+(2*w)+1] uses future data that wouldn't be available at the prediction time.
To fix these, you could calculate returns using only past data and adjust the second window to r[i+1:i+w+1]. These changes would ensure the analysis only uses information available at each point in time.
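Here's a minimal sketch of that fix (P is a stand-in price series and w an assumed window length; the video's actual data and parameters may differ):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in daily price series; in the video, P comes from real market data.
P = 100 * np.exp(np.cumsum(0.01 * rng.standard_normal(500)))

# Log-returns from consecutive prices. r[t] = log(P[t+1] / P[t]) is the
# return realized over day t -> day t+1, so it should be aligned with the
# *later* day to avoid look-ahead.
r = np.log(P[1:] / P[:-1])

w = 20  # window length (an assumed value)
for i in range(len(r) - w - 1):
    win_a = r[i : i + w]          # first window
    win_b = r[i + 1 : i + w + 1]  # suggested second window: shifted one
                                  # step, using no future data
    # ... build a persistence diagram for each window and compare them
    # with the Wasserstein distance, as in the video ...
```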
Great work exploring these advanced techniques! Looking forward to seeing more.
Thanks for the notes! I'll need to revisit this work to confirm there are no leaks as well as apply it to other time series.
Best video on persistent homology for a newbie like me
For real
Glad it was helpful!
Very helpful and simple to understand, thanks a lot Shaw!!🙏🙏🙏🙏
Love it! Great way to explain it sensei Shawhin!
Thanks appreciate it! I’m glad it was clear
Excellent video. Well organized, presented, and explained.
As a mathematician, I have one small nitpick. At 2:58, you use the torus/coffee mug as an example of H_1 = 1. In most mathematical settings, a mathematician referring to a torus as a space would really mean the *surface* of a donut. The surface of a donut, a torus, is actually very different from a simple cycle: H_1(T) has rank 2, and H_2(T) has rank 1. The filled-in donut is sometimes called a "solid torus" (which does have H_1 of rank 1), but solid tori usually do not come up much.
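For reference, the groups in question (integer coefficients):

```latex
% Torus surface T^2:
H_0(T^2) \cong \mathbb{Z}, \quad H_1(T^2) \cong \mathbb{Z}^2, \quad H_2(T^2) \cong \mathbb{Z}
% Solid torus (the filled donut, which deformation-retracts to a circle):
H_0 \cong \mathbb{Z}, \quad H_1 \cong \mathbb{Z}, \quad H_2 \cong 0
```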
Honestly, this is a very common mishap that is probably us mathematicians' fault for using the "donut coffee mug" meme too often and calling them tori without getting into the weeds of what is meant.
Thanks for pointing that out. I'm clearly not a mathematician 😂, but I see why it's important to distinguish between a "solid torus" and a torus.
Friendly intro. I'd recommend having a practical, fully worked example that has proven useful in hand as well.
Glad it was accessible. This took me a long time to wrap my head around 😅
Love all your videos on data, so well explained and makes complex topics so simple to understand! Thanks so much!!! 🔥
Thanks for watching! Glad they were helpful 😁
Yay!! Love this series
Thanks for watching!
Great video series. Nice job.
Thanks, glad you like them!
Great content. Fascinating.
You did a great job explaining a complex topic👏👏👏👏
Thanks, glad it helped!
Fascinating and beautiful, thanks. But there will always be some pattern in data. The question is whether past structures help predict future ones. And would we understand why some past structure leads real-world changes? Practically all indicators in technical analysis rely on some kind of causal hypothesis that would help explain their predictive success (if such there be...).
This is an important question and where I feel the "art of data science" comes in. Data science requires a unique blend of technical understanding and domain expertise to distinguish which patterns are signal (i.e. helpful) vs noise (i.e. not helpful).
Thanks Shawhin!
Thank you for watching!
Thanks! Great video.
When you've computed the homology groups (H0, H1, H2) of all data points over increasing values of ε (a ball around each point) at 7:03, how do you plot them in the persistence diagram at 8:52?
Does each point represent a connected component (H0), loop (H1), or enclosed surface (H2)? If it's on the diagonal, does that mean it is born and dies relatively soon after? And in the code example, are you only considering one homology group, so either H0, H1, or H2 (12:10)?
Hi Tessa, thanks for the questions.
1) In the persistence diagram, each point represents a different "hole". Connected components are blue points, loops are orange, and voids are green. This is shown in the legend in the bottom right-hand corner of the plot.
2) Yes, exactly, so those features are typically regarded as noise.
3) For the example, I am actually considering H0, H1, and H2. We can see this from the input argument maxdim=2 in Rips() (first line at 12:10). For more details, check out the class definition here: github.com/scikit-tda/ripser.py/blob/master/ripser/ripser.py
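Here's a minimal, self-contained sketch of that setup (the noisy-circle data is made up for illustration, and persim's plot_diagrams stands in for however the video renders the plot; only the Rips(maxdim=2) usage mirrors the code at 12:10):

```python
import numpy as np
from ripser import Rips
from persim import plot_diagrams

rng = np.random.default_rng(0)
# Toy point cloud: a noisy circle, so we expect one prominent H1 feature.
theta = 2 * np.pi * rng.random(100)
X = np.column_stack([np.cos(theta), np.sin(theta)])
X += 0.05 * rng.standard_normal((100, 2))

rips = Rips(maxdim=2)           # maxdim=2 -> compute H0, H1, and H2
dgms = rips.fit_transform(X)    # dgms[0] = H0, dgms[1] = H1, dgms[2] = H2
plot_diagrams(dgms, show=True)  # one color per dimension, with a legend
```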
Hope that helps!
@ShawhinTalebi Hey! Yes, this really helps, thanks a lot!
It's interesting to see how topology/homology gets practical use ;-)
Thanks for the video. Could you draw any parallels between mapper and PH diagrams? For example, would cover in mapper be similar to ε?
Perhaps I haven't thought about it deeply enough, but from my experience it feels like the two approaches have very little (if any) overlap.
You need to write an article on Medium about persistent homology.
It exists! medium.datadriveninvestor.com/persistent-homology-f22789d753c4?sk=c0925c51c31f5136abf362829c755146
I love it @ShawhinTalebi
Great video!
But I don't understand the part where you mention ignoring the blue topological feature at 8:25. Can you rephrase that?
Good question. The blue features represent the birth/death of connected components (which are basically clusters in this example). That blue point at the top-left corner of the plot represents the cluster of all data points, which doesn't tell us much about the data's shape.
Thank you for the video. I have a question: when you plot the persistence diagram, what are the units on the birth and death axes?
Good question! The units depend on your dataset because the axes have units of "distance" in the N-dimensional space defined by your variables. A trivial example: if we have 3 variables, say, x, y, and z positions in a 3D grid in units of cm, then the axes would also have units of cm. However, if the 3 variables were something like weight, height, and top speed (like a car), then the units wouldn't be very meaningful.
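To make that concrete, here's a tiny sketch (hypothetical points, using the ripser function directly): the birth/death values are filtration radii, i.e., distances in whatever units your coordinates carry.

```python
import numpy as np
from ripser import ripser

# Three points on a 2-D grid measured in cm; pairwise distances are 3, 4, 5 cm.
X = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])

dgms = ripser(X)['dgms']
print(dgms[0])  # H0 deaths at 3.0 and 4.0 (i.e., in cm), plus one at infinity
```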
@ShawhinTalebi Thank you!
really helpful
Glad it helped :)
Thanks.
My pleasure!
Thanks!
Is there any meaning behind using days as points, instead of using the series of prices itself as points? It seems kinda weird that, when calculating persistent homology, we're growing radii of balls centered around points in time.
Good question. While it does depend strongly on context, applying persistent homology to a raw time series may not be informative. A simple thing one can do, however, is increase the dimensionality of the data through a time delay embedding.
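Here's a minimal sketch of what I mean (the dim and tau values are arbitrary picks for illustration; choosing them well is its own topic):

```python
import numpy as np

def time_delay_embedding(x, dim=2, tau=1):
    """Map a 1-D series to dim-dimensional delay vectors:
    point i is (x[i], x[i + tau], ..., x[i + (dim - 1) * tau])."""
    n = len(x) - (dim - 1) * tau
    return np.column_stack([x[k * tau : k * tau + n] for k in range(dim)])

# Example: a noisy periodic series embeds to (roughly) a loop in 2-D,
# which persistent homology would pick up as a prominent H1 feature.
t = np.linspace(0, 6 * np.pi, 300)
x = np.sin(t) + 0.05 * np.random.default_rng(0).standard_normal(300)
X = time_delay_embedding(x, dim=2, tau=25)  # tau is about a quarter period here
```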
Is there a paper or a resource for the last example?
The example here is unpublished work, but it was inspired by this paper: arxiv.org/abs/1703.04385
Nice nice nice
🙌🙌
Thank you for that great video! One quick question: how many data points, at minimum, should we have for the analysis to give a reliable result?
I am not aware of any formal technique for determining sample size (persistent homology is more art than science), so it probably varies from use case to use case.
@ShawhinTalebi Thank you for your comment!