
NLP-Powered SEO: Using Machine Learning to Understand Search Intent

by Florius

Search engines don’t just match keywords; they try to understand what the user actually wants. For example, imagine someone types: “best resistor for LED”. A basic keyword-matching system would look for pages containing:

  • “best”
  • “resistor”
  • “LED”

But a search engine that understands intent knows the user is likely looking for “How to calculate the right resistor value for an LED”. This project explores how we can use Google Autocomplete and “People Also Ask” (PAA) questions to uncover what people are really searching for. By combining natural language processing (NLP), clustering algorithms, and visualization techniques, we map out user questions into structured topic clusters. The goal is to help websites show up in search results by using real search data to understand what people want to know.

Step 1: Keyword Expansion and PAA Data Collection

The process begins by expanding a handful of seed words via Google Autocomplete. The common A–Z expansion technique appends each letter of the alphabet to a seed keyword to fetch suggested queries [1]. For example, the seed “nanoelectronics” would generate predictions like “nanoelectronics a…”, “nanoelectronics b…”, etc. This reveals variants that people have searched for, such as “nanoelectronics application” or “nanoelectronics book”. The next step is to perform a search for each keyword and retrieve the People Also Ask (PAA) box [2]. These PAA boxes are specific to the query entered:

Figure 1. Flowchart of Google Autocomplete and PAA extraction. A single seed keyword like “spintronics” is expanded using the A–Z autocomplete method to generate multiple long-tail keyword variations (e.g. “spintronic applications”, “spintronic capacitor”). These variations are then used to extract relevant “People Also Ask” (PAA) questions from Google (e.g. “What can spintronics be used for?”, “Is spintronics used in quantum computing?”), providing insight into user intent and content opportunities.

The PAA box typically shows a few (normally four) questions. However, clicking on one question reveals an answer and often triggers additional questions to appear. Each new click yields more specific questions on that topic [3]. Being able to click these automatically and scrape the questions is important, because many valuable long-tail questions only surface after expanding the PAA list. It is good practice to go at least one, and preferably more, levels down to collect plenty of contentful questions [4]. Be mindful of Google’s anti-scraping measures to avoid getting your IP blocked [5]. For the web scraping of PAAs, I used Selenium, an open-source framework for automating web browsers.

Obtaining the A–Z keywords was not a problem and can be done within a few minutes using the undocumented Google Suggest (autocomplete) API. However, obtaining PAAs required the script to actually open a browser, click on the PAA queries, and record all questions. I started with 3 seed keywords, each expanded 26 times, once per letter of the alphabet. By scraping all of Google’s autocomplete suggestions I obtained about 700 keywords. In the end, by searching for each keyword individually, I got about 7,200 PAA questions.
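The A–Z expansion can be sketched as below. Note the caveats: the Google Suggest endpoint is undocumented and unofficial, so the URL and response format may change and heavy use may be rate-limited, and the third seed (“memristor”) is purely illustrative.

```python
import json
import string
import urllib.parse
import urllib.request

def az_variants(seed):
    """Generate the 26 A-Z expansion queries for one seed keyword."""
    return [f"{seed} {letter}" for letter in string.ascii_lowercase]

def fetch_suggestions(query):
    """Fetch autocomplete predictions from the undocumented Google Suggest
    endpoint. The 'firefox' client returns JSON: [query, [suggestions...]]."""
    params = urllib.parse.urlencode({"client": "firefox", "q": query})
    url = f"https://suggestqueries.google.com/complete/search?{params}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)[1]

# Build all expansion queries for the seeds (no network traffic yet;
# "memristor" is just an illustrative third seed).
seeds = ["nanoelectronics", "spintronics", "memristor"]
queries = [q for seed in seeds for q in az_variants(seed)]
print(len(queries))  # 3 seeds x 26 letters = 78 expansion queries
```

Each entry in `queries` would then be passed to `fetch_suggestions` (with a polite delay between calls) to collect the roughly 700 keywords mentioned above.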

Step 2: Structuring and Preparing the Data for Analysis

With a dataset of about 7,200 PAA questions, the next step is to structure it for clustering. Compile all the PAA questions into a dataset (a .csv file is normally fine) along with their source keyword. Often multiple keywords will produce duplicate questions. For clustering it is important to delete the duplicate PAAs so each unique question is only clustered once. However, it is still useful to keep a record of which keywords yielded each question; this can later help interpret clusters.
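A minimal sketch of this deduplication-with-provenance step (the rows here are illustrative, not my actual scraped data):

```python
import csv
from collections import defaultdict

# Raw rows as scraped: (source keyword, PAA question). Illustrative data.
rows = [
    ("nanoelectronics applications",
     "What are the three major applications of nanotechnology?"),
    ("nanoelectronics book",
     "What are the three major applications of nanotechnology?"),
    ("spintronics applications",
     "What can spintronics be used for?"),
]

# Map each unique question to the set of seed keywords that surfaced it.
sources = defaultdict(set)
for keyword, question in rows:
    sources[question].add(keyword)

# Write one row per unique question, keeping its source keywords for later
# cluster interpretation.
with open("paa_questions.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["question", "source_keywords"])
    for question, keywords in sources.items():
        writer.writerow([question, "; ".join(sorted(keywords))])

questions = list(sources)  # deduplicated list used for clustering
print(len(questions))  # 2 unique questions from 3 scraped rows
```

The deduplicated `questions` list is what gets embedded in Step 3, while the CSV keeps the question-to-keyword mapping for interpreting clusters later.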

One key consideration I had is whether to include the original keyword text when clustering the questions. In other words, should the question “What are the three major applications of nanotechnology?” be clustered on its own, or augmented with context like “[nanoelectronics applications] What are the three major applications of nanotechnology?” before embedding? I did not find any research on this point, but in my view we want to cluster by similarity of the questions themselves. Including the keyword would bias the representation: questions from the same seed might cluster together even if, semantically, they align better with queries from another seed. Moreover, the PAA questions normally include the topic context already, as the example above shows.

In summary, prepare your data as a list of unique PAA questions (associated with one or more seed keywords). 

Step 3: Transforming Questions into Semantic Embeddings

To perform clustering on actual words, we need to convert each question into a numerical vector. Short questions can be a problem: they contain few words, and with some of the older models the lexical overlap between questions can be low even when they are on the same topic. Questions like “How to start a blog” and “Ways to launch a website” have no content words in common, so a TF-IDF model would not see them as similar [6]. In contrast, modern sentence embeddings (such as Sentence-BERT) capture the meaning beyond the exact wording. They are trained to encode the contextual meaning of entire sentences into high-dimensional vectors. As a result, questions on similar topics end up close together in vector space, even if they do not share the same words [6].

I recommend using a sentence embedding model for PAA questions. Libraries like Sentence-Transformers in Python make this very easy [7]. I use the model ‘BAAI/bge-large-en-v1.5’, which transforms each question into an embedding vector, so you end up with a matrix of vectors. Going directly into clustering might give you a lot of noise due to the high dimensionality. That is why, at this stage, you may apply a dimensionality reduction technique (e.g. UMAP or PCA) to compress the vectors for easier clustering. Even though the clustering algorithms can handle the full dimensionality, reduction does help lower the noise: in my dataset, the noise fraction went from 60% down to below 20%. The critical outcome is that each question is now represented in a way that captures its meaning in context, which sets the stage for effective clustering.

				
from sentence_transformers import SentenceTransformer
import umap

# === STEP 2: EMBED ===
model = SentenceTransformer('BAAI/bge-large-en-v1.5')
embeddings = model.encode(questions, convert_to_numpy=True, show_progress_bar=True)

# === STEP 3: UMAP REDUCTION ===
umap_model = umap.UMAP(n_neighbors=15, min_dist=0.0, n_components=10, random_state=42)
reduced_embeddings = umap_model.fit_transform(embeddings)

Step 4: Clustering the Questions to Identify Intent Groups

The next step is to cluster these vectors to find groups of related questions. The choice of clustering algorithm is important. In a previous project I used K-Means clustering, because I knew in advance how many clusters I needed. In this scenario, however, density-based clustering methods like DBSCAN or HDBSCAN are more suitable; here is why.

K-Means vs. DBSCAN/HDBSCAN: K-Means requires specifying k (the number of clusters) beforehand, and assumes that clusters are roughly spherical and evenly sized. This is not practical for PAA questions, where clusters can vary in size and some questions might not cleanly fit any cluster at all. By contrast, HDBSCAN (Hierarchical DBSCAN) automatically finds an appropriate number of clusters and can leave out, as noise, questions that don’t strongly belong to any cluster. The latter is useful in SEO, since some very unique questions might not warrant grouping into a broader topic cluster [7].

For my case, HDBSCAN is a highly effective choice. It was designed to work with high-dimensional embeddings like those from NLP models, and its advantages align with SEO needs: “It automatically determines the optimal number of clusters, handles clusters of any shape or size, and excludes noise/outliers instead of forcing bad groupings.” This approach has been demonstrated to be successful in creating keyword clusters from search queries.

				
import hdbscan

# === STEP 4: HDBSCAN CLUSTERING ===
clusterer = hdbscan.HDBSCAN(
    min_cluster_size=5,
    metric='euclidean',
    cluster_selection_method='eom'
)
labels = clusterer.fit_predict(reduced_embeddings)
			

In practice, you feed the matrix of question vectors into HDBSCAN (with appropriate parameters such as the minimum cluster size). The algorithm outputs a cluster label for each question (or a noise label for the outliers). An alternative is DBSCAN; however, it requires tuning a distance epsilon parameter and is a bit more complicated to set up correctly.
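Once you have the labels (HDBSCAN marks noise as -1), a quick way to inspect cluster sizes and the noise share, shown here on a made-up label array rather than my real output:

```python
from collections import Counter

# Hypothetical HDBSCAN output: one label per question, -1 marks noise.
labels = [0, 0, 0, 1, 1, -1, 2, 2, 2, 2, -1, 1]

sizes = Counter(labels)
noise = sizes.pop(-1, 0)           # questions HDBSCAN left unclustered
noise_share = noise / len(labels)

print(f"{len(sizes)} clusters, noise share {noise_share:.0%}")
for cluster_id, size in sizes.most_common():
    print(f"cluster {cluster_id}: {size} questions")
```

This is how I tracked the noise fraction dropping after UMAP reduction: rerun the clustering and compare `noise_share` with and without the reduction step.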

Step 5: Interpreting Clusters and Applying Insights to SEO

In this final step, I interpret and label the clusters meaningfully, then integrate the findings into my SEO content strategy.

The first thing to do here is to assign a descriptive label or theme to each cluster. This can be done manually or with NLP assistance. For the manual route, you can look for common keywords. For an automated approach, you can use either extractive or generative methods. Extractive methods include algorithms like c-TF-IDF or KeyBERT that pull representative words and phrases out of each cluster. Generative methods involve sending a language model (like ChatGPT) the list of questions in a cluster and asking it to suggest a concise topic name or summary [8]. As I have access to a generative API, this was an easy choice.
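For readers without API access, the extractive route is easy to sketch. Below is a minimal pure-Python take on the c-TF-IDF idea (score each term by its frequency inside a cluster, discounted by how common it is across all clusters); the clusters and the tiny stopword list are illustrative, not my production setup:

```python
import math
import re
from collections import Counter

# Illustrative clusters of PAA questions.
clusters = {
    0: ["What can spintronics be used for?",
        "Is spintronics used in quantum computing?"],
    1: ["What could a nano robot be used for?",
        "Why can't we make nanobots?"],
}

STOPWORDS = {"what", "can", "could", "be", "used", "for", "is", "in", "a", "we", "why"}

def tokens(text):
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]

# Term frequency per cluster (treat each cluster as one "class document").
class_tf = {c: Counter(t for q in qs for t in tokens(q)) for c, qs in clusters.items()}
# Frequency of each term across all clusters combined.
total_tf = Counter()
for tf in class_tf.values():
    total_tf.update(tf)
avg_words = sum(total_tf.values()) / len(clusters)

# c-TF-IDF-style score: in-cluster frequency weighted by cross-cluster rarity.
labels = {}
for c, tf in class_tf.items():
    scores = {t: f * math.log(1 + avg_words / total_tf[t]) for t, f in tf.items()}
    labels[c] = max(scores, key=scores.get)

print(labels)
```

On real data you would keep the top few terms per cluster rather than a single one, and use a proper stopword list.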

Figure 2. Example cluster “Nanobots and nanomachines applications”: the cluster title connects to related People Also Ask questions (such as “What could a nano robot be used for?” and “Why can’t we make nanobots?”), which in turn branch into associated keyword suggestions from Google (such as “nanoelectronics applications”).

Now that we have clusters of related questions and their theme, we can look at practical SEO applications:

  • Build FAQ and Q&A Sections: PAA clusters are perfectly suited for Frequently Asked Questions (FAQ) content. The questions in a cluster can be used to create an FAQ section on a page, or a standalone FAQ page. In fact, one recommended use of PAA data is to build FAQ pages that directly address those questions [9].
  • Identify Content Gaps and New Content Opportunities: By clustering all the questions, you create a map of the subtopics within your niche that users care about. Comparing this map against your own content shows whether there is anything you haven’t covered on your site yet. These content gaps help ensure you don’t overlook subjects that users are asking about [3].
  • Create Topic “Pillars”: Your original seed word might be a broad topic, while the clusters form groups of subtopic questions. You can use this to structure your site’s content hierarchy [10].
  • Optimize for Featured Snippets and PAA Visibility: Clustering PAA questions also helps you prioritize which questions to answer in your content for maximum SEO impact. By answering them in 40–50 words under a proper heading, you have a good chance that Google will use your answer directly in the PAA box. However, Google prefers your answer to be in context (within a full article) rather than in an FAQ tacked onto the end of the page [3].
In summary, by generating PAA data and clustering it, you gain a map of user intents. The steps I showed reflect best practice in NLP and SEO, and they result in groups of questions that highlight what real users want to know.

References

[1] Apify. (2024). How to scrape keywords from the Google search bar (Google autocomplete suggestions). Apify Blog. https://blog.apify.com/how-to-scrape-google-autocomplete-suggestions/

[2] van der Meij, W. (2023, April 19). How to scrape People Also Ask questions on Google? Umbrellum. Retrieved from https://umbrellum.com/guides/content/people-also-ask/scraping-people-also-ask

[3] Malik, K. (2025). Dominating People Also Ask (PAA). Answer Engine Journal. https://answerenginejournal.com/guide/visibility/people-also-ask/

[4] Hooda, K. (2024). People also ask: What is it and how to use the PAA tool. Keywords Everywhere Blog. https://keywordseverywhere.com/blog/people-also-ask/

[5] Datajournal. (2024, November). How to scrape Google’s “People Also Ask” using Python. Medium. https://medium.com/@datajournal/scrape-google-people-also-ask-cefb9489c647

[6] Kumar, A. (2023). Text clustering: Key steps & algorithms. Vitalflux. https://vitalflux.com/text-clustering-key-steps-algorithms-examples/

[7] Jawaid, M. (2025, May 12). Clustering multilingual search terms with HDBSCAN and embeddings [Post]. LinkedIn. https://www.linkedin.com/posts/maryam-jawaid_nlp-hdbscan-clustering-activity-7350788791902617602-7abo/

[8] Sia AI. (2023). Labeling text clusters with keywords. Medium. https://sia-ai.medium.com/labeling-text-clusters-with-keywords-b5b5b6c1a89e

[9] Elliott, D. (2018). Scraping “People Also Ask” boxes for SEO and content research. Builtvisible. https://builtvisible.com/scraping-people-also-ask-boxes-for-seo-and-content-research/

[10] Bortoluzzi, N. (2023). How to easily cluster People Also Asked keywords. SEO Lynx Blog. https://seo-lynx.co.uk/blog/cluster-paa-queries/
