probabl
Joined Feb 6, 2024
This is the official Probabl YouTube channel where we feature humans teaching and learning about machine learning, data science, and open source. More often than not, we will discuss scikit-learn, as well as a plethora of other tools and libraries to help data scientists, data engineers and data owners extract the most value out of their data.
Why the MinHashEncoder is great for boosted trees
Boosted tree models don't support sparse matrices, which might make you think they have trouble encoding text data. There are, however, encoding techniques that can work great without resorting to sparse methods. The MinHash encoder is one such technique and this video explains why it is a great choice for many pipelines.
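To make the idea concrete, here is a minimal sketch of min-hash encoding over character n-grams using only the standard library. It is an illustration under stated assumptions, not the actual MinHashEncoder implementation: the function names (char_ngrams, seeded_hash, minhash_encode), the choice of blake2b as the hash, and the default sizes are all hypothetical.

```python
import hashlib


def char_ngrams(text, n=3):
    """Return the set of character n-grams of `text` (the whole string if it is short)."""
    if len(text) <= n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def seeded_hash(ngram, seed):
    """Deterministic 32-bit hash of an n-gram; blake2b is a stand-in hash function."""
    digest = hashlib.blake2b(
        ngram.encode("utf-8"),
        digest_size=4,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_encode(text, n_components=8, n=3):
    """Encode a string as a dense vector: one minimum hash value per seed, scaled to [0, 1)."""
    grams = char_ngrams(text, n)
    return [
        min(seeded_hash(g, seed) for g in grams) / 2**32
        for seed in range(n_components)
    ]


short = minhash_encode("london")
longer = minhash_encode("london uk")
print(short)
print(longer)
```

Every n-gram of "london" also appears in "london uk", so each component of the longer string's vector is less than or equal to the matching component of the shorter one. The output is a small dense array, and a tree's threshold splits can pick up on this kind of substring overlap without any sparse matrix.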
00:00 Introduction
01:03 Hashing text
05:16 Plotting hashes
06:22 Code demo
09:46 Pipelines
The notebooks for the dirty category series can be found here:
github.com/probabl-ai/youtube-appendix/tree/main/15-dirtycat
Website: probabl.ai/
LinkedIn: www.linkedin.com/company/probabl
Twitter: x.com/probabl_ai
Discord: discord.probabl.ai
We also host a podcast called Sample Space, which you can find on your favourite podcast player. All the links can be found here:
rss.com/podcasts/sample-space/
Views: 678
Videos
How the HashingVectorizer works
379 views, 1 day ago
You can use the CountVectorizer in scikit-learn to encode text to a sparse array that a machine learning model can use. This functionality is great, but it can result in *huge* widths. An alternative to this is the HashingVectorizer, which we discuss in this video. The notebooks for the dirty category series can be found here: github.com/probabl-ai/youtube-appendix/tree/main/15-dirtycat Website...
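The hashing trick mentioned in this description can be sketched in a few lines of standard-library Python. This is a simplified illustration, not scikit-learn's HashingVectorizer: the function name hashing_vectorize, the use of md5, and the tiny n_features value are assumptions made for the example.

```python
import hashlib
import re


def hashing_vectorize(text, n_features=16):
    """Map a text to a fixed-width count vector by hashing each token to a column index."""
    counts = [0] * n_features
    # Tokens are runs of two or more word characters, lower-cased.
    for token in re.findall(r"\w\w+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "little") % n_features
        counts[index] += 1  # different tokens may collide in the same column
    return counts


vector = hashing_vectorize("the cat sat on the mat")
print(len(vector), sum(vector))
```

The width is fixed by n_features no matter how many distinct tokens appear, which is what keeps the output from growing with the vocabulary. The price is that no vocabulary is stored, so columns cannot be mapped back to tokens and collisions are possible.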
You want to be in control of your own Copilot with Ty Dunn - founder of Continue.dev
251 views, 14 days ago
There are many LLMs that you can use for programming these days. Some of them even go into your IDE, like Cursor or GitHub Copilot. But what if you want to tweak these LLMs to do what you want? Instead of being stuck with the tools that a vendor gives you, the goal of Continue.dev is to allow you to customise this yourself. In this podcast we talk to Ty Dunn, co-founder of the project to learn m...
What it is like to maintain the scikit-learn docs with David Arturo Amor Quiroz, docs maintainer
341 views, 21 days ago
Scikit-learn's documentation pages are celebrated. But not everyone is aware that the project actually has somebody on payroll to take care of it. In this episode we talk to Arturo about stories from the scikit-learn documentation. In particular, the docs have a recommender that few folks are aware of. People just assume that it is manually curated, but there are a few base scikit-learn tools u...
Sqlite can totally do embeddings now with Alex Garcia, creator of sqlite-vec
1.1K views, 1 month ago
How to rethink the notebook with Akshay Agrawal, co-creator of Marimo
743 views, 1 month ago
Feature engineering for overlapping categories
733 views, 2 months ago
You're always (always!) dealing with many (many!) tables - with Madelon Hulsebos
814 views, 2 months ago
How Narwhals has many end users ... that never use it directly. - Marco Gorelli
589 views, 3 months ago
More flexible models via sample weights
784 views, 3 months ago
Why ridge regression typically beats linear regression
1.4K views, 3 months ago
Understanding how the KernelDensityEstimator works
808 views, 4 months ago
Pragmatic data science checklists with Peter Bull
923 views, 4 months ago
Don't worry too much about missing data
947 views, 4 months ago
Model safety, that's a pickle! with Adrin Jalali - scikit-learn maintainer
420 views, 4 months ago
Boosting vs. semi-supervised learning
2.4K views, 5 months ago
Benchmarking boosted trees against overfitting
549 views, 5 months ago
Moving towards KDearestNeighbors with Leland McInnes - creator of UMAP
1.2K views, 5 months ago
Talk like a DataFrame, run like SQL with Phillip Cloud - core-committer on Ibis
541 views, 6 months ago