Timestamps:
00:00:00 Basic Preprocessing
00:00:35 Case-folding and its tradeoffs
00:02:40 Stop word removal (tradeoffs and how it can go wrong)
00:04:40 Stemming (tradeoffs and things to watch out for)
00:06:28 Lemmatization and its advantages over stemming
00:07:52 DEMO: basic processing with spaCy
00:10:37 Basic preprocessing recap
Great concise intro, I see you getting big in the future. Keep up with the work.
Concise and easily understandable. Thanks a lot for the series.
This is the best NLP series I have ever watched
This content is simple and easy to understand.
Well done, thanks!
I have a bunch of reviews (about 20 million) on places like restaurants, cafes, pet groomers, cleaners and other services.
Now I have to categorize them into service categories like food, pet grooming, cleaning, etc. A heavy model like BERT is taking up a lot of time and resources.
The data is not labelled by service, so I was thinking about clustering and using food / no food as the only classes, kind of like aspect-based classification.
I also have one more question: with so many reviews (around 20 million), how do I analyze and clean my data? In some places the punctuation is wrong, some have too many spaces, etc. It is not possible to see all the errors in the reviews.
In that case, how should I preprocess the data?
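One possible starting point, offered as a rough sketch rather than a definitive answer: since nobody can read 20 million reviews by eye, you could first count how often each kind of formatting problem occurs, then apply rule-based fixes while streaming the reviews one at a time. The issue names, the regex patterns, and the helper functions `audit`, `normalize` and `clean_stream` below are all hypothetical choices for illustration; the real patterns would depend on what the audit turns up in your data.

```python
import re
from collections import Counter
from typing import Iterable, Iterator

# Hypothetical patterns for the problems mentioned in the question.
_WHITESPACE = re.compile(r"\s+")
_SPACE_BEFORE_PUNCT = re.compile(r"\s+([,.!?;:])")
# Periods are left out on purpose so that ellipses ("...") survive.
_REPEATED_PUNCT = re.compile(r"([!?,;:])\1+")

ISSUES = {
    "extra_whitespace": re.compile(r"\s{2,}"),
    "space_before_punct": _SPACE_BEFORE_PUNCT,
    "repeated_punct": _REPEATED_PUNCT,
}


def audit(reviews: Iterable[str]) -> Counter:
    """Count how many reviews show each issue, without reading them by hand."""
    counts = Counter()
    for review in reviews:
        for name, pattern in ISSUES.items():
            if pattern.search(review):
                counts[name] += 1
    return counts


def normalize(text: str) -> str:
    """Apply the rule-based fixes to a single review."""
    text = _WHITESPACE.sub(" ", text).strip()    # collapse runs of whitespace
    text = _SPACE_BEFORE_PUNCT.sub(r"\1", text)  # "food ," becomes "food,"
    text = _REPEATED_PUNCT.sub(r"\1", text)      # "!!" becomes "!"
    return text


def clean_stream(reviews: Iterable[str]) -> Iterator[str]:
    """Yield cleaned reviews one at a time so the full set never sits in memory."""
    for review in reviews:
        yield normalize(review)


if __name__ == "__main__":
    sample = ["Great   food ,  loved it !!", "Fine."]
    print(audit(sample))
    print(list(clean_stream(sample)))
```

Running the audit on a random sample of a few thousand reviews is usually enough to see which problems are common before cleaning the full set. Because `clean_stream` is a generator, it can be fed directly from a file iterator.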