You are making great videos. I just discovered your channel and have watched so many of them.
Glad you like them! Really appreciate it.
Unique content. Love it. Thanks.
Glad you enjoy it!
Awesome! Thanks a lot for this gem❤
Great work, mate. I've wanted to look at SERPs with NLP since 2019, and now you've shown us how to do it. You also showed that NLP, or Google, isn't perfect; they don't know how these all really work once deployed.
Hello
Thank you for your very good content.
I have a problem: when I run the code I get this error:
Traceback (most recent call last)
How can I fix it?
Love your content
Thanks!
Please make one more detailed video on this topic
I'll be doing more on this topic!
Do you know of a notebook that does this exact same thing but with Google NLP? I have an API. Thanks.
After extracting words, it would be useful to check them against the Google Knowledge Graph (to see which of the topics are entities). I would also use a stop word or common word list to clean the final table (top 25 terms).
Great feedback!
I agree, the stop words make it a little messy. I do like keeping in the prepositions. BERT takes those into account when understanding intent.
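Both ideas from this exchange (a stop word list to clean the top-terms table, while keeping the prepositions) can be tried together. The snippet below is a minimal sketch, not the notebook's code: `STOP_WORDS` and `PREPOSITIONS` are tiny hand-picked samples, and `top_terms` is a hypothetical helper name.

```python
import re
from collections import Counter

# Tiny illustrative lists (assumptions); swap in a full stop word list in practice.
STOP_WORDS = {"the", "a", "an", "and", "is", "are", "it", "of", "for", "in", "to"}
PREPOSITIONS = {"of", "for", "in", "to", "with", "near", "without"}


def top_terms(text, n=25, keep=PREPOSITIONS):
    """Return the n most frequent terms, dropping stop words except those in `keep`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    blocked = STOP_WORDS - set(keep)  # stop words that are not protected
    return Counter(t for t in tokens if t not in blocked).most_common(n)


sample = "The best pizza in Austin is the pizza near the river, and it is cheap."
print(top_terms(sample, n=3))
```

Passing `keep=()` drops the prepositions as well, and a stop word set for another language can be swapped in the same way.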
Hello, the tool isn't working for me. Getting this - ERROR: Failed building wheel for tokenizers
Please try now. Code updated and it should all work.
Quick question: how can you scrape results for only the USA or a particular country?
It is asking for access if I open the link mentioned in the description.
Sorry about that. Here you go colab.research.google.com/drive/1PI6JBn06i3xNUdEuHZ9xKPG3oSRi1AUm?usp=sharing
Do we need to add our own NLP API to this? I've been using this but lately I've been getting a lot of errors even when restarting runtime etc.
You should be good. SpaCy is open.
If you stop runtime, you'll need to rerun the cells.
@@smamarketing I might be having a caching issue then, will try another browser. But those are real Google NLP entities and not some other custom-trained NLP classifications, right?
Great video! Which languages are supported for entities, except EN?
Unfortunately, spaCy isn't extracting all entities from the text; there are hundreds of entities it's missing. It's only pulling known entities that have either a Wikipedia page or a Knowledge Graph entry. However, "other" (as in other entity types) plays an important role in content, and how those entities are used increases the salience score of the focus known entity. Is there a way to get spaCy to show ALL entity types, including "other"?
Never mind, I just realized that spaCy's NLP is not the same as Google's NLP. So the question is: is spaCy's NLP similar to Google's, and if not, why use it for SEO then?
All of these models are slightly different. To use the Google API, you'll need to set up an account with Google Cloud Console and pay for usage. It may not be perfect, but it's a good start.
All of these NLP tools have their pros and cons and SpaCy does a decent job. You can train it to make it better.
I also like TextRazor. It seems to do a good job, but it's also paid.
@@smamarketing We're using Google's NLP because that's how BERT reads content, i.e. Google NLP for SEO content analysis. I already have a Google NLP API all set up and ready to use. I was asking how hard it would be to convert the spaCy notebook that you shared over to Google's NLP, using its API... I recently found a Colab notebook that uses Google's NLP API to grab text/content from URLs (that you input) and outputs entities, text classification, salience, etc., but it's broken, and I don't know Python well enough to figure out the problem... Cells 1 through 6 fire off just fine, but when it gets to the 7th cell it throws an error and won't proceed. If I could get this notebook working it would be a pretty powerful tool.
@@smamarketing Also, "paid" is literally pennies though. You get 5000 free API calls a month anyway, and you'd have to be making a pretty large number of calls to go over that... I am not. Even if I did go over 5000 API calls a month, it would be near impossible for me to incur anything over a $5 a month bill.
@@SpiritTracker7 Send me the link to the notebook and I'll see what I can do!
As far as Google using the console NLP library for BERT, it's a little more complex than that.
BERT is one type of NLP, and the out-of-the-box NLP most people use in the console isn't BERT. BERT is open source and you can use a number of the models for NLP research. See this post for more info: ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html
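For anyone wanting to swap spaCy for Google's Natural Language API, the entity step could look roughly like the sketch below. It only builds the JSON body for the REST `documents:analyzeEntities` method and ranks the entities in a response by salience; it doesn't send anything, so it runs without a key. The helper names and the sample response are made up for illustration, and the broken notebook mentioned above may be wired differently.

```python
import json

# REST endpoint for entity analysis; calling it requires an API key or OAuth token.
ENDPOINT = "https://language.googleapis.com/v1/documents:analyzeEntities"


def build_entity_request(text):
    """Build the JSON body for an analyzeEntities call on plain text."""
    return json.dumps({
        "document": {"type": "PLAIN_TEXT", "content": text},
        "encodingType": "UTF8",
    })


def rank_entities(response):
    """Return (name, type, salience) tuples sorted by salience, highest first."""
    entities = response.get("entities", [])
    rows = [(e["name"], e["type"], e.get("salience", 0.0)) for e in entities]
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Hand-written response in the documented shape (NOT real API output).
sample_response = {
    "entities": [
        {"name": "SEO", "type": "OTHER", "salience": 0.21},
        {"name": "Google", "type": "ORGANIZATION", "salience": 0.62},
    ]
}

print(build_entity_request("Google uses NLP to understand content."))
print(rank_entities(sample_response))
```

Entities of type `OTHER` come back in the same list as the known ones, so they can be kept or filtered after ranking.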
Hi!
How can we change the language of the query?
You can set the language in SpaCy following this: spacy.io/api/language
Is this still working? I tried using the template, but it's giving me errors. Is that me messing up, or has this not been updated in a while?
My bad, I was too quick on the gun. Got it. Pressed one script too early.
@@WillemNout1 Let me know if you have any questions!
Thank you - very interesting. Is there a way I can get UK Google results, rather than US please?
Change the Google URL to the UK one.
@@smamarketing Would you mind explaining how to do this, please?
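A minimal sketch of what "change the Google URL to the UK one" could mean in practice, assuming the notebook requests a plain Google search URL (the notebook's own code isn't shown here, so `google_search_url` is a hypothetical helper). It swaps the domain to `google.co.uk` and adds the commonly used `gl` (country) and `hl` (language) query parameters; Google may still adjust results based on where the request comes from.

```python
from urllib.parse import urlencode


def google_search_url(query, domain="www.google.co.uk", country="uk", language="en", num=10):
    """Build a Google search URL for a given country domain and locale."""
    params = {
        "q": query,
        "gl": country,   # country to geolocate results to
        "hl": language,  # interface language
        "num": num,      # number of results requested
    }
    return f"https://{domain}/search?{urlencode(params)}"


print(google_search_url("seo agency"))
# US results instead:
print(google_search_url("seo agency", domain="www.google.com", country="us"))
```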
ModuleNotFoundError: No module named 'sklearn.feature_extraction.stop_words'
Did you run every cell in order?
I just updated it. Working on a few more tweaks and it should be fixed later today!
@Ryan Working on fixing it.
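That ModuleNotFoundError usually points to a newer scikit-learn: the `sklearn.feature_extraction.stop_words` module was removed in later releases, while `ENGLISH_STOP_WORDS` is still importable from `sklearn.feature_extraction.text`. A sketch of the changed import follows; the fallback set is only there so the snippet runs without scikit-learn installed and is not the real list.

```python
try:
    # Current location of the list in scikit-learn.
    from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
except ImportError:
    # Fallback so this sketch still runs without scikit-learn installed
    # (a tiny sample, not the real list).
    ENGLISH_STOP_WORDS = frozenset({"the", "and", "of", "to", "in"})

print(len(ENGLISH_STOP_WORDS), "the" in ENGLISH_STOP_WORDS)
```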
There is a major issue: the top 10 results do not include featured snippets or the 2nd position. Please fix it so people get accurate data. Thanks.
This will only pull organic results.
This is a free colab that others can use to explore.
As of June 2023, this install step
```
!pip install "transformers == 3.3.0"
```
fails with a "can't build wheels" error.
Please try now. Code updated and it should all work.
Looks like Trafilatura is coming up with problems
"NameError Traceback (most recent call last)
in ()
----> 1 pd.set_option('display.max_colwidth', None) # make sure output is not truncated (cols width)
2 pd.set_option("display.max_rows", 100) # make sure output is not truncated (rows)"
Hi mate, SpaCy did not work for me. It stops at this cell: ### Scraping results with Trafilatura ###.
Just made an update. SpaCy made a few changes. Try now and let me know!
@@smamarketing It works now. Thank you.
Hi.
I run all the cells and when I get to the visualization part, I get this output:
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
in ()
5 width_in_pixels=900,
6 minimum_term_frequency=3,
----> 7 term_significance = st.LogOddsRatioUninformativeDirichletPrior())
8 open("SERP-Visualization_top3.html", 'wb').write(html.encode('utf-8'))
9 display(HTML(html))
2 frames
/usr/local/lib/python3.7/dist-packages/scattertext/ScatterChart.py in to_dict(self, category, category_name, not_category_name, scores, transform, title_case_names, not_categories, neutral_categories, extra_categories, background_scorer, use_offsets, **kwargs)
274
275 all_categories = self.term_doc_matrix.get_categories()
--> 276 assert category in all_categories
277
278 if not_categories is None:
I'll look into this!
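The failing line in the traceback is `assert category in all_categories`, which suggests the `category` value passed to the plot call doesn't match any category label in the corpus (a typo, different casing, or a label that got filtered out). A quick check before plotting could look like the sketch below; `resolve_category` and the labels are hypothetical, and with scattertext the available labels would come from `corpus.get_categories()`.

```python
def resolve_category(requested, available):
    """Return the label from `available` matching `requested`, ignoring case and spaces."""
    lookup = {str(label).strip().lower(): label for label in available}
    key = str(requested).strip().lower()
    if key not in lookup:
        raise ValueError(
            f"Category {requested!r} not found; "
            f"available categories: {sorted(map(str, available))}"
        )
    return lookup[key]


# Hypothetical labels standing in for corpus.get_categories().
labels = ["Top 3", "Other"]
print(resolve_category("top 3", labels))
```

If this raises, the error message lists the labels that actually exist, which is usually enough to spot the mismatch.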
Thanks for a great and interesting video. Can you change the stop word list to another language in this line of code?
import scattertext as st
from sklearn.feature_extraction.stop_words import ENGLISH_STOP_WORDS
I believe you can. Here is a list of supported languages compiled by advertools: advertools.readthedocs.io/en/master/advertools.stopwords.html