Just seeing this wonderful tutorial now. I subscribed to the page, and I hope Tom is still very much available. Thanks for this and hope to see more posts.
Thank you Tom! Your instruction and explanation of the code and the logic behind the functions are so clear and easy to follow. It is very helpful!
Very informative. Kindly upload a video on aspect-based sentiment analysis in R.
Thank you Tom!
This is an excellent screencast of the incredible possibilities with tidytext!
Excellent video. I was enthralled for the entire duration. You've also given me some ideas for something I'm working on.
Starting to learn text mining this semester at school! Found this video really useful and interesting!
Glad to hear!!!!
This is a channel where you can always pick up useful skills. Thank you so much.
Great video sir. Thanks for the walkthrough! I will be applying some of this to a project I am working on.
Thanks for the great intro Tom. Though I have to say the interpretation of the word relationships sounded a bit like good old tarot reading :). Cheers!
I get a vector of 3.2 GB (I have 3000 clean texts), and I cannot allocate the vector. It happens during the correlation calculation step. Any advice on memory allocation when working with heavy data?
What kind of text data do you have? That sounds pretty big!
@@tomhenry-datasciencewithr6047 just regular html documents that I cleaned, so it's really more like 3000 paragraphs about some company filings. I wanted to see relationships between words like covid and whatever they correlated with. I followed your code to the letter. I did succeed in the end, but ended up with so many nodes even after filtering away all word pairs that did not contain covid, filtering away words not used in more than x documents, and dropping correlations less than 0.3. Maybe I just need some practice.
It's a really cool plot though, I subscribed for more videos
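For anyone hitting the same memory wall: the correlation matrix grows with the square of the vocabulary, so dropping rare words before pairwise_cor() shrinks it dramatically. A minimal sketch, assuming a tidy table with one row per (doc_id, word) pair (the names here are stand-ins):

library(dplyr)
library(widyr)

word_correlations <- tidy_words %>%
  add_count(word, name = "word_total") %>%
  filter(word_total >= 20) %>%   # keep only words appearing in 20+ document-word rows
  pairwise_cor(item = word, feature = doc_id) %>%
  filter(correlation >= 0.3)

Raising that minimum count is usually the single biggest lever on memory use.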
Liked and subscribed. Fantastic tutorial and explanation!!!
Thank you so much! The code and the demo are really helpful :)
Nice tutorial! But what if I am extracting my data from PDF files? Is there a way to convert them and then perform the analysis?
That's a bit more tricky! What kind of pdf files are you trying to analyze?
There are ways to convert pdf files to text and then analyze the text, although you might need to use another tool to do it. For example, on Mac, there is a command called "pdftotext" which you can run in the Terminal shell. Once the pdf files are converted to text files, you could load them in and analyze that text.
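If you'd rather stay in R, the pdftools package can also handle the conversion. A minimal sketch (the file name is hypothetical):

library(pdftools)
library(tibble)

pages <- pdf_text("filings.pdf")   # one character string per page
pdf_df <- tibble(page = seq_along(pages), text = pages)

From there the tibble can go through unnest_tokens() like any other text.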
This is great content on text mining in R. I also have a channel that discusses text mining in R on data from the web, PDF documents and data frames.
I am unable to do this in ggraph. Can you tell me how I can plot a histogram of the word counts using ggplot?
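You don't need ggraph for that part; an ordinary ggplot2 bar chart of the counts works. A minimal sketch, assuming a tidy table with one row per word occurrence (parsed_words is a stand-in name):

library(dplyr)
library(ggplot2)

parsed_words %>%
  count(word, sort = TRUE) %>%
  slice_max(n, n = 20) %>%   # keep the 20 most frequent words
  ggplot(aes(x = n, y = reorder(word, n))) +
  geom_col() +
  labs(x = "count", y = NULL)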
Thank you so much for this tutorial. When I tried to build the graph for positive correlations, it says: Error in FUN(X[[i]], ...) : object 'correlation' not found. What do you think the problem is? (NB: all the other chunks run without errors.)
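In case others hit this: the correlation column only exists after pairwise_cor() has created it, so this message usually means the plotting code references correlation on an object that never went through (or lost) that step. Keeping the filter before the graph conversion avoids it. A sketch, with word_correlations standing in for the pairwise_cor() output:

library(dplyr)
library(tidygraph)
library(ggraph)

word_correlations %>%
  filter(correlation >= 0.3) %>%   # filter before building the graph
  as_tbl_graph() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = correlation)) +
  geom_node_point() +
  geom_node_text(aes(label = name), repel = TRUE)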
This is great content on text mining in R. I also have a channel that discusses text mining & Sentiment Analysis in R on data from the web, PDF documents and data frames.
Please post the link to the dataset's website; it cannot be read in your presentation. Regards.
Thank you for this amazing tutorial!
Glad you like it!
How do I apply your code to a text file? The text file is purely a story, and it's not advisable to convert it to a CSV file. What should I do then?
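You don't need a CSV; the lines of the story can go straight into a data frame and through unnest_tokens(). A minimal sketch (the file name is hypothetical):

library(readr)
library(dplyr)
library(tibble)
library(tidytext)

story_lines <- read_lines("story.txt")

story_words <- tibble(line = seq_along(story_lines), text = story_lines) %>%
  unnest_tokens(output = word, input = text)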
*P.S.:* If you'd like more text-analysis-related content in the next few weeks, click *[like]* 👍 and *[subscribe]* 🔔!
Here's the R Markdown code if you want to join in: gist.github.com/larsentom/369c4227dced0aac8c78f2d192fc68bd 📊
It looks like it's been moved/deleted. Do you have an updated link?
Thank you for the inspiring video!
Thank you for explaining each line.
Hi, I am having trouble installing tinytext; it seems this package is no longer available, which makes many of the functions difficult to run. Do you know if there is a new name for this package?
Hi Jonathan - have you tried
install.packages("tidytext")
vs
install.packages("tinytext")
('D' vs 'N' in 'tidy')?
I think that should work - if not let me know :)
I didn't realize this was a thing... "R"... Though I thought the whole video was great, at the end it seems like there could have been some better kind of formulation which would offer better insight into the reviews. Word-pairing is great, but the pairs were all "out of context", and due to the "chaining", it leaves one to assume that there is actually a connection of 3 or more words when there may not be.
For instance, you saw "bugs" and "fishing"... Bug = a thing you fish with, or a programming issue with fishing? (I assume the former.)
I see "bombing, review, click"... That could have been "review bombing", or "bombing review"... Were there bombs in the game? Was it "... bombing. Review ..." or "... review. Bombing ..."? I am sure that there were no reviews that had "bombing review click" or "click review bombing".
I have an issue with the "tainted results". You threw away valuable "review words". I say tainted, or "corrupted", because you removed them, which now "pairs" possibly unrelated words. Also, periods... You don't constrain "pairings" to "sentences". You are getting cross-contamination of thoughts, creating pairings that truly don't exist.
Then there is the matter of "word association" and "similarity" and "depluralizing" that should be done. I saw "player" and "players", which are textually the same content. Also a pairing of "reviews negative", but no "review negative", only "review bombing" and no "reviews bombing"; "island" and "islands" were also oddly isolated.
Word association... "Nintendo switch", "Nintendo game", "Nintendo switch game", "Nintendo console game". That contaminates "switch" and "game" and "console" with other relevant pairings having nothing to do with a "game console" branded "Nintendo", specifically the "Switch" model. (Also the removal of the game's title from the reviews, which contaminates pairs related to "animal" and "crossing" and "horizon/s".) I noticed a lot of foreign words in there too. De, en, el, es, se... Perhaps a LOT was missed, since those were quite commonly found, but those reviewers were surely not reviewing in English, and pairings of foreign words, even if translated, would not always be the same; word order often differs between languages. Thus the "word association" is needed, which identifies the subjects and related words, half of which you threw away. "Good game", with "good" being one of those common words you surely had in the list, possibly found a hundred times. Good, like/d, love/d, enjoy/ed.
Missing critical triplets and notable phrases too, I assume... "well worth the money" and "not worth the money": you just saw "worth money" as a pairing. "Waste of my time" and "no time to waste, get it now", as "waste time" and "time waste". I guess my mind just works differently.
I feel that you were on the right track in the isolation of good/bad, but the pairing doesn't seem to be a good metric for anything other than "game content confirmation". By the reviews, the text suggests that it is a game that involves fishing, customization/crafting of things, multiplayer, animals; it works on the Nintendo Switch; there are islands in it. (Compared to the game developer's description, it could "confirm" game content.)
Perhaps a better metric for good and bad would be the isolation of words NOT found in both. Seeing "not fun" as a pairing in a horrible review is expected. However, if you see "good value" or "worth ... money", then it's not so bad.
You are asking exactly the questions that would go into a more detailed analysis!
There is a useful function called SnowballC::wordStem() which reduces words down to a common 'stem.' For example, it produces: "then there is the matter of word associ and similar and deplur that should be done i saw player and player textual the same content also a pair of review negat but no review negat"
For a more rigorous look into this subject see Julia Silge and David Robinson's excellent 'Tidy Text Mining in R' which is free online: www.tidytextmining.com
Text analysis is always imperfect (and will remain so forever, I suspect), but it can yield good insights when applied to a large dataset, provided a human is in the loop!
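For reference, stemming slots into the tidy pipeline as a single mutate(). A minimal sketch, assuming a tidy table with a word column (parsed_words is a stand-in name):

library(dplyr)
library(SnowballC)

stemmed_words <- parsed_words %>%
  mutate(word = wordStem(word))   # "players" and "player" both become "player"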
@@tomhenry-datasciencewithr6047 Something like this would be PERFECT for what I am doing, but it is honestly above me in complexity, at the moment. I was looking for a formulated way to form a sense of textual hierarchy. One which could be used to help new entries "find a logical category level", where it belongs with other similar associated words.
In a basic sense...
object => transportation => vehicle => automobile => car => gasoline_engine => hatch_back => ford => mustang => 1985 => cherry_red_paint
So a "truck", which, by similar associations, would be at the level of "car".
transportation => vehicle, as opposed to skates (accessory vs something you drive) or a ski-lift (not drivable)
vehicle => automobile, as opposed to a bicycle or skateboard (non-motorized vehicles)
etc...
The hierarchy being assumed by known relations, and/or by simple volume of appearance and order. People tend to say, "my 1985 mustang", classed in reverse, "specific => generic". However, by volume, 1985 appears less than mustang and ford appears more than that. Continuing up the chain to automobiles, vehicles and transportation, which has progressively more and more "objects" that they are identified with.
While knowing the relation, without needing to know the specifics of any one car... ford can be aligned with lexus, chevy, mazda, etc., because of the similar preceding and following groupings.
Why turn the entire language into a form of "word tree of origins"? Partly for use with AI classification of image contents: knowing that cars have rims, wheels, paint, body styles, manufacturers. Partly for extracting and isolating the correlated subject matter and "emotions/opinions" within descriptions. Partly for assisting the extraction and isolation of "finer details", such as the rarer descriptions of the types of tires, ground effects, antenna types, rim types.
I could go on, but my primary goal in my knowledge-quest was the things just mentioned, as a whole, extending to the final purpose of being used as a guide for helping others "classify image contents" with more valuable information that AI can use for digestion. (All in relation to text2image and image2image AI-created art, which is assisted by "human textual prompts". They have a rough system in place, but it is hardly extensive or adequate enough to be used with any form of accuracy or repeatability.)
P.S. Looking into this further, because of this video you posted. (Yet another language to consume my brain-cells.)
Thank you so much for this tutorial. I tried to use a correlation >= 2 but it says "object 'correlation' not found."
I think it won't allow me to plot any number higher than 1. Do you know if there's a way to fix that? My data set is kinda big and the plot gets confusing with a low correlation.
Hi Miguel. Were you able to sort this out?
Otherwise if you post the code chunk that is failing we can figure it out together.
Am I being slow, or is it simply that a correlation coefficient cannot exceed 1, i.e. it ranges between -1 and 1?
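That's right, it can't exceed 1. To declutter a large plot without an impossible threshold, one option is to keep only the strongest N pairs instead of raising the cutoff past 1. A sketch, with word_correlations standing in for the pairwise_cor() output:

library(dplyr)

word_correlations %>%
  slice_max(correlation, n = 100)   # keep only the 100 strongest word pairs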
Hi Tom, Great video!
When I tried to run "pairwise_cor", R returned the following error:
"Error in validObject(r) :
invalid class “dgTMatrix” object: length(Dimnames[2]) differs from Dim[2] which is 11489"
Any suggestions?
I suspect this is because you might have an older version of 'widyr' installed.
Perhaps try the steps at github.com/dgrtwo/widyr and install the most recent version from GitHub, and then restart R and see if it works.
If not, we can figure out what is going on!
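For the record, the GitHub install is a two-liner, assuming the remotes package:

install.packages("remotes")
remotes::install_github("dgrtwo/widyr")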
A very interesting video! Thank you, Australia, all the way from France!
Great content mate! Following your code, I tried to create a positive_word_correlations dataset but I can't seem to get it. Any advice?
positive_word_correlations %
semi_join(users_who_mention_word, by = "word") %>%
pairwise_cor(item = word, feature = user_name) %>%
filter(correlation >= 0.2) %>%
filter(grade >= 5)
Hi Ben -- what error are you receiving?
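One thing that stands out in the snippet: pairwise_cor() returns only item1, item2 and correlation columns, so filter(grade >= 5) after it will fail. Assuming grade is a column of the original review table, that filter needs to come before the correlation step. Roughly (review_words is a stand-in for your source table):

library(dplyr)
library(widyr)

positive_word_correlations <- review_words %>%
  filter(grade >= 5) %>%   # keep positive reviews first
  semi_join(users_who_mention_word, by = "word") %>%
  pairwise_cor(item = word, feature = user_name) %>%
  filter(correlation >= 0.2)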
Would you help show how to do this process with a CSV file?
I would like to see this kind of approach as well
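If the CSV has one document per row with a text column, read_csv() gets you straight to the tokenizing step. A sketch with hypothetical file and column names:

library(readr)
library(dplyr)
library(tidytext)

reviews <- read_csv("reviews.csv")

review_words <- reviews %>%
  unnest_tokens(output = word, input = text)   # assumes the text column is named 'text'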
Anyone know what this error means / how to address it? Here's my input and the error that follows.
My input:
> parsed_words %
+   unnest_tokens(output = word, input = text) %>%
+   anti_join(stop_words, by = "word") %>%
+   filter(str_detect(word, "[:alpha:]")) %>%
+   distinct()
Error: Must extract column with a single valid subscript.
x Subscript `var` has the wrong type `function`.
ℹ It must be numeric or character.
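For anyone else hitting this: that vctrs error usually means the name given to input = doesn't match any column in the data frame, so R falls back to a function of that name (base R has a text() function, for instance). Checking names(your_data) and renaming usually fixes it. A sketch with hypothetical names (my_data and review_text are stand-ins):

library(dplyr)
library(stringr)
library(tidytext)

parsed_words <- my_data %>%            # hypothetical data frame
  rename(text = review_text) %>%       # hypothetical: whatever the text column is actually called
  unnest_tokens(output = word, input = text) %>%
  anti_join(stop_words, by = "word") %>%
  filter(str_detect(word, "[:alpha:]")) %>%
  distinct()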
Apparently the ggraph package is not getting installed.
What error message do you see when you try to install ggraph?
install.packages("ggraph")
@@tomhenry-datasciencewithr6047 ERROR: dependency ‘igraph’ is not available for package ‘graphlayouts’
* removing ‘/opt/homebrew/lib/R/4.1/site-library/graphlayouts’
Warning in install.packages :
installation of package ‘graphlayouts’ had non-zero exit status
ERROR: dependencies ‘igraph’, ‘tidygraph’, ‘graphlayouts’ are not available for package ‘ggraph’
* removing ‘/opt/homebrew/lib/R/4.1/site-library/ggraph’
Warning in install.packages :
installation of package ‘ggraph’ had non-zero exit status
@@divyangirathore4156 Have you tried installing those other packages first? (e.g. install.packages("igraph") and so on)
or you can try
install.packages('ggraph', dependencies = TRUE)
one more thing - are you trying to install on a server or other shared location .... or is this just on your personal computer?
@@tomhenry-datasciencewithr6047 It is on my personal computer, and the above command still did not work. Is there any chance you can tell me how we can plot the same data you plotted using ggplot? Any references would be helpful.
Excellent video
👏🏽👏🏽👏🏽
If you want your lessons to be applicable, you have to include a section that teaches us how to convert the data into whatever format you are using. I am using txt format and I am unable to replicate anything in this video.
Hi Elton! Do you have an example of what your text data looks like in the txt file?
@@tomhenry-datasciencewithr6047 Hi Tom. It looks like this.
Asin Rating Reviews
B085234 4.0 out of 5 stars All in all, if you weigh quality more you should probably pay 50-100 bucks more for laptop with similar
B092453 3.0 out of 5 stars It is light weight. I liked it. However, probably because of the software installed, I couldnt install the apps I was
Excellent. Also, is the data stored in a tab separated format, or in a comma separated format, or is it in Excel, or is it in some other format?
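If it's tab-separated, something like this should get it into a data frame (assuming the first row holds the column names, as in your sample):

library(readr)

reviews <- read_tsv("reviews.txt")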
@@tomhenry-datasciencewithr6047 I have already solved the issue. Thanks Tom! If your user reviews had a lot of non-English characters, how would you resolve that?
The *purpose* of generating text networks was not clear from this video. The exercise didn’t seem to generate any particular insight. Is this a problem with text networks themselves, or content selection?
I think your results for negative reviews are a bit unclear because you include those with review = 0 in this group. I suppose these are just unrated.
This is great. However, at step 4 (unnest_tokens), is anyone else getting this error?
could not find function "%>%"
Hi Sean, have you loaded the tidyverse packages? library(tidyverse)
That is likely the reason :)
I got the same error, loaded all the packages again, and it worked.