Preparing FineWeb - A Finely Cleaned Common Crawl Dataset

  • Published 16 Sep 2024

Comments • 15

  • @unshadowlabs
    @unshadowlabs 3 months ago +4

    Fantastic video! Could you please consider doing a video on a PDF pipeline? Taking a few hundred or thousand textbooks in PDF or EPUB format and doing the extraction, filtering, and deduplication? Something in the vein of the "Textbooks Are All You Need" paper. I have been following your excellent previous videos on how to do this to create Q&A pairs, but I'm curious whether there might be an updated process or pipeline that could use some of these same techniques. Also, would it be more beneficial to continue training on just the text, or to fine-tune on the text after it has been processed into Q&A pairs?

    • @TrelisResearch
      @TrelisResearch  3 months ago

      Good idea, yeah I think a vid on textbooks could make good sense. I'll add it to my list.

  • @PoornimaDevi-yx9oh
    @PoornimaDevi-yx9oh 3 months ago

    That was a great roundup of the new FineWeb data filtering flow! Thanks for the cool presentation.

  • @mjpc
    @mjpc 3 months ago

    Fantastic video

  • @KopikoArepo
    @KopikoArepo 3 months ago

    I love humanity. Thanks as always❤

  • @dmitriikhizbullin1902
    @dmitriikhizbullin1902 3 months ago

    Amazing walkthrough, highly appreciated

  • @thanartchamnanyantarakij9950
    @thanartchamnanyantarakij9950 3 months ago +1

    great content!

  • @unclecode
    @unclecode 3 months ago

    1. Regarding the "deduplicated" data across all the dumps, if I'm not wrong, they compute the aggregate score by averaging various quality metrics such as perplexity, token overlap, and the proportion of low-quality text. Given the nature of an "average", it shouldn't be surprising to see higher values. But what if they used the "median"? In my opinion, focusing on one dump, or removing all duplication across all dumps, should work better. I hope the HF team wasn't motivated to just hit a big number of 15T by coming up with a custom score to justify it.
    2. The idea of training a classifier rather than repeatedly using an LLM for scoring is always fascinating, although I expected a smaller classifier than a 12-layer BERT. But the idea is beautiful. Imagine creating an image classifier for a concept that doesn't exist: you could use a diffusion model to generate enough samples and then train a vision classifier! This opens the door to many interesting projects, like a "Black Hole" image classifier trained on synthetic data generated by a diffusion model.
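The mean-versus-median point above can be illustrated with a toy example; the scores and their values are made up for illustration and are not FineWeb's actual metrics or aggregation:

```python
# A few extreme per-document quality scores (0 = worst, 1 = best)
# drag the mean noticeably while barely moving the median.
from statistics import mean, median

scores = [0.82, 0.79, 0.85, 0.80, 0.05, 0.03, 0.81, 0.84]

print(f"mean:   {mean(scores):.3f}")    # pulled down by the two outliers
print(f"median: {median(scores):.3f}")  # robust to them
```

The gap between the two numbers is exactly why a median-based aggregate could tell a different story about the same sample.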

    • @TrelisResearch
      @TrelisResearch  3 months ago +1

      1. I like the thrust of your point on "averaging". Something else to consider: all of the graphs of the aggregate score only go up to a few hundred billion tokens, so they are a small sample of the datasets being measured. Statistically, then, the aggregate is measured on a random sample - one that is likely to have little duplication within it. HF devote an entire section of the blog to outlining this issue/behaviour.
      2. Yeah, I def expected a smaller classifier too!
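The "distill LLM judgments into a cheap classifier" idea from point 2 can be sketched roughly as follows. The texts, labels, and scikit-learn model here are illustrative stand-ins, not FineWeb-Edu's actual BERT-style setup or its real annotation prompt:

```python
# Sketch: an LLM judge (not shown) has already labeled a handful of
# documents for educational quality; we fit a lightweight model on
# those labels so future scoring needs no LLM calls at all.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "photosynthesis converts light energy into chemical energy",
    "click here to win a free prize now",
    "the derivative measures the instantaneous rate of change",
    "buy cheap followers instant delivery guaranteed",
]
llm_labels = [1, 0, 1, 0]  # 1 = educational, per the hypothetical LLM judge

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, llm_labels)

# The cheap classifier now scores new documents without the LLM.
print(clf.predict(["mitochondria are the powerhouse of the cell"]))
```

The same pattern is what makes the synthetic-data idea work: any teacher that can label samples (an LLM, a diffusion model plus captions) can bootstrap a small, fast student classifier.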

  • @pin65371
    @pin65371 3 months ago +1

    Correct me if I'm wrong, but couldn't you add extra points if it goes above a certain level of education and do more training after? You could break it down into, let's say, "grade 1-7", "grade 8-12", and then "college/university". The first round of training would be done at the lower grade level, the second round at the medium level, and then a third at the higher level. You could even split the higher levels into multiple models that are more specialized, which could maybe prevent hallucinations? With an agent system, as long as the agents know which model to use, you could have lots of agents that do very specific tasks.

    • @TrelisResearch
      @TrelisResearch  3 months ago +1

      You certainly could do that, whether it would improve the final model is the question.
      Knowledge is transformed and stored in transformers in ways that are not immediately interpretable, so it's not obvious that taking this kind of approach would lead to a smoother updating of weights. To some degree, taking such an approach could result in the model forgetting how to articulate ideas at a primary-school level of understanding.
      At the same time, just focusing on mid-high grades strikes me as quite a simple strategy with room for improvement.

    • @jonatan01i
      @jonatan01i 3 months ago

      @@TrelisResearch it certainly works like this in human brains, learning lower concepts first and then building on top of already existing concepts.
      We could train a grade detector as well and feed the model data from the lowest grades up through the higher grades, while still keeping some lower-grade material mixed in, or something like that.

    • @pin65371
      @pin65371 3 months ago

      @@TrelisResearch yeah, I guess you risk just chasing that balance. It also really depends what the ultimate goal is. If you are building a system for the average person, then you want it to be able to communicate at a normal level of intelligence. If you are building a system that works with people who are already specialists in their field, you can get away with it maybe "forgetting" how to communicate with the average person. You could get around that a bit by having different models, though. Something like GPT-4o would be able to take the technical side of things and "dumb it down". The smaller, more specialized model would be used just when you need to ensure accuracy. You just want the model to be 100% focused on a very specific field of work. Correct me if I'm wrong though. I'm still just trying to learn how all of this works, but from what I have read that seems like a feasible strategy. I do know that before going through all that effort you should ensure you have your prompting on point though. I've found the longer I've been using these tools, the better I've gotten at prompting, which has improved the results I get. It's actually been sorta fun seeing how that works.
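The grade-banded curriculum discussed in this thread could be sketched like this. The bucket boundaries, grade scores, and replay fraction are all hypothetical, and this is only the data-ordering side of the idea, not a real training loop:

```python
# Bucket documents by a (hypothetical) grade-level score, then yield
# training rounds from lowest to highest band, mixing a fraction of
# earlier material back in so later rounds don't "forget" simpler text.
import random

def bucket_by_grade(docs_with_scores):
    """Split (text, grade_score) pairs into three curriculum phases."""
    buckets = {"grade_1_7": [], "grade_8_12": [], "college": []}
    for text, grade in docs_with_scores:
        if grade <= 7:
            buckets["grade_1_7"].append(text)
        elif grade <= 12:
            buckets["grade_8_12"].append(text)
        else:
            buckets["college"].append(text)
    return buckets

def curriculum_rounds(buckets, replay_frac=0.2, seed=0):
    """Yield (phase, batch) per round, replaying some earlier docs."""
    rng = random.Random(seed)
    seen = []
    for phase in ("grade_1_7", "grade_8_12", "college"):
        replay = rng.sample(seen, k=int(len(seen) * replay_frac)) if seen else []
        yield phase, buckets[phase] + replay
        seen.extend(buckets[phase])

docs = [("counting to ten", 2), ("algebra basics", 9), ("measure theory", 16)]
for phase, batch in curriculum_rounds(bucket_by_grade(docs)):
    print(phase, len(batch))
```

The `replay_frac` knob is one way to address the "forgetting how to write at a primary-school level" concern raised in the replies above.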