Pat, I wanted to thank you for your selfless service that you do to help others. I really have benefitted from your kindness.
My pleasure and many thanks for your generous comment!
This is why we love your videos: we also learn different approaches and get into the right mode and mindset. I've had the pleasure of watching a lot of your videos and used R for about two years, not very long, and now that I've changed jobs I use it much less. But simple things in Excel take me too long... so in the end I import the data into R and carry on as I always have. Thanks to your videos I still haven't lost too much dexterity, which unfortunately fades quickly when you stand still.
Thank you so much!🤓
Fascinating! I saw the note about how the weirdness goes away with bigger N, but I was surprised by how bad the results were there. All I can think is that there's a huge overhead for the actual getting of indices, relative to using the indices to extract the data. I don't care how much more performant "x == n1 | x == n2 | ..." is, I'm not giving up "x %in% c(n1, n2, ...)"!
Yeah, remember it's all about context and application. I use %in% all the time for analyses where speed doesn't matter. 99% of the time it takes longer to save a TIFF than to filter rows from a data frame 🤓
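For anyone curious, here is a minimal sketch (not from the video; the values and vector size are invented) of how the two filtering idioms could be compared:

```r
# Hypothetical benchmark of %in% vs. a chain of == comparisons.
# The data are made up; only the idioms match the discussion above.
library(bench)

x <- sample(1:100, 1e6, replace = TRUE)
targets <- c(3, 17, 42)

bench::mark(
  in_operator = x[x %in% targets],
  or_chain    = x[x == 3 | x == 17 | x == 42]
)
```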
You might be interested in trying the single-vector function with a map or for loop and running through the desired kmers. You might find that just iterating over, or parallelizing, a single vector read is the most performant.
I tried map/sapply in an earlier episode to build a vector; it was pretty slow relative to the other options
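As an illustration of the looping approach being discussed (the sequence, k, and variable names are hypothetical, not the episode's code):

```r
# Extract k-mers from a single sequence: one substr() call per position via
# sapply vs. a single vectorized substring() call. Data are made up.
dna <- paste(sample(c("A", "C", "G", "T"), 1e4, replace = TRUE), collapse = "")
k <- 8
starts <- seq_len(nchar(dna) - k + 1)

kmers_loop <- sapply(starts, function(i) substr(dna, i, i + k - 1))
kmers_vec <- substring(dna, starts, starts + k - 1)

identical(kmers_loop, kmers_vec)  # TRUE; the vectorized call avoids the per-element loop
```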
What is the effect of the JIT on these comparisons?
Not sure what you mean by JIT?
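If the question refers to R's byte-code JIT compiler (an assumption about what was meant), a quick sketch of how its effect could be tested:

```r
# Toggle R's byte-code JIT around a benchmark; levels are documented in
# ?compiler::enableJIT (3 is the default in recent R versions).
library(compiler)

old <- enableJIT(0)   # disable the JIT, keeping the previous level
# ... re-run the timing comparisons here ...
enableJIT(old)        # restore the original JIT level
```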
Why not use Arrow to read the data out of memory?
I haven't tried arrow, but in the next episode (Thursday, 2024-05-02) I'll try duckdb with duckplyr - it's pretty slick
Just tried arrow - it's about 3x slower than duckdb with the filter function on a table with 1e7 rows and 3 columns. Check back on Thursday afternoon and I'll post the updated timings with arrow included. Thanks for asking!
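Here is a rough sketch of that comparison; the column names and filter values are invented, and timings will vary by machine:

```r
# Filter the same data frame through arrow and through duckdb.
# Requires dbplyr to be installed for the duckdb/dplyr pipeline.
library(arrow)
library(duckdb)
library(DBI)
library(dplyr)

df <- data.frame(x = sample(1e3, 1e7, replace = TRUE),
                 y = rnorm(1e7),
                 z = rnorm(1e7))

# arrow: filter an Arrow Table with dplyr verbs
df |>
  arrow::as_arrow_table() |>
  filter(x %in% c(1, 2, 3)) |>
  collect()

# duckdb: register the data frame and filter through dplyr/dbplyr
con <- dbConnect(duckdb::duckdb())
duckdb::duckdb_register(con, "df", df)
tbl(con, "df") |> filter(x %in% c(1, 2, 3)) |> collect()
dbDisconnect(con, shutdown = TRUE)
```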
@Riffomonas try saving your data out as Parquet using partitions for better performance
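A small sketch of the partitioned-Parquet suggestion (the path, partition column, and data are hypothetical):

```r
# Write the table as a partitioned Parquet dataset, then query it lazily so
# only the partitions matching the filter are scanned.
library(arrow)
library(dplyr)

df <- data.frame(x = sample(10, 1e6, replace = TRUE), y = rnorm(1e6))
write_dataset(df, "data/partitioned", format = "parquet", partitioning = "x")

open_dataset("data/partitioned") |>
  filter(x %in% c(1, 2, 3)) |>
  collect()
```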
@Riffomonas and I really enjoy your videos