
NLP-Powered SEO: Using Machine Learning to Understand Search Intent

by Florius

Search engines don’t just match keywords; they try to understand what the user actually wants. For example, imagine someone types: “best resistor for LED”. A basic keyword-matching system would look for pages containing:

  • “best”
  • “resistor”
  • “LED”

But a search engine that understands intent knows the user is likely looking for “How to calculate the right resistor value for an LED”. This project explores how we can use Google Autocomplete and “People Also Ask” (PAA) questions to uncover what people are really searching for. By combining natural language processing (NLP), clustering algorithms, and visualization techniques, we map out user questions into structured topic clusters. The goal is to help websites show up in search results by using real search data to understand what people want to know.

Step 1: Keyword Expansion and PAA Data Collection

The process begins by expanding a handful of seed words via Google Autocomplete. The common A–Z expansion technique appends each letter of the alphabet to a seed keyword to fetch suggested queries [1]. For example, the seed “nanoelectronics” would generate predictions like “nanoelectronics a…”, “nanoelectronics b…”, etc. This reveals variants that people have searched for, such as “nanoelectronics application” or “nanoelectronics book”. The next step is to perform a search for each keyword and retrieve the People Also Ask (PAA) box [2]. These PAA boxes are specific to the query entered:

Figure 1. Flowchart of Google Autocomplete and PAA extraction. A single seed keyword like “spintronics” is expanded using the A–Z autocomplete method to generate multiple long-tail keyword variations (e.g. “spintronic applications”, “spintronic capacitor”). These variations are then used to extract relevant “People Also Ask” (PAA) questions from Google (e.g. “What can spintronics be used for?”, “Is spintronics used in quantum computing?”), providing insight into user intent and content opportunities.

The PAA box typically shows a few (normally four) questions. However, clicking on one question reveals an answer and often triggers additional questions to appear. Each new click yields more specific questions on that topic [3]. Being able to click these automatically and scrape the questions is important, because many valuable long-tail questions only surface after expanding the PAA list. It is good practice to go at least one, and preferably more, levels down to collect plenty of contentful questions [4]. Be mindful of Google’s anti-scraping measures to avoid getting your IP blocked [5]. For the web scraping of PAAs, I used Selenium, an open-source framework for automating web browsers.

Obtaining the A–Z keywords was not a problem and can be done within a few minutes using the undocumented Google Suggest (autocomplete) API. However, obtaining PAAs required the script to actually open a browser, click on the PAA queries, and record all questions. I started with 3 seed keywords, each expanded 26 times, once per letter of the alphabet. By scraping all of Google’s autocomplete suggestions I obtained about 700 keywords. In the end, by searching for each keyword individually, I got about 7,200 PAA questions.
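The A–Z expansion can be sketched as below. Note the caveats: the Google Suggest endpoint is undocumented and unofficial, so the URL and response format may change and heavy use may be rate-limited, and the third seed (“memristor”) is purely illustrative.

```python
import json
import string
import urllib.parse
import urllib.request

def az_variants(seed):
    """Generate the 26 A-Z expansion queries for one seed keyword."""
    return [f"{seed} {letter}" for letter in string.ascii_lowercase]

def fetch_suggestions(query):
    """Fetch autocomplete predictions from the undocumented Google Suggest
    endpoint. The 'firefox' client returns JSON: [query, [suggestions...]]."""
    params = urllib.parse.urlencode({"client": "firefox", "q": query})
    url = f"https://suggestqueries.google.com/complete/search?{params}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)[1]

# Build all expansion queries for the seeds (no network traffic yet;
# "memristor" is just an illustrative third seed).
seeds = ["nanoelectronics", "spintronics", "memristor"]
queries = [q for seed in seeds for q in az_variants(seed)]
print(len(queries))  # 3 seeds x 26 letters = 78 expansion queries
```

Each entry in `queries` would then be passed to `fetch_suggestions` (with a polite delay between calls) to collect the roughly 700 keywords mentioned above.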

Step 2: Structuring and Preparing the Data for Analysis

With a dataset of about 7,200 PAA questions, the next step is to structure it for clustering. Compile all the PAA questions into a dataset (a .csv file is normally fine) along with their source keyword. Often multiple keywords will produce duplicate questions. For clustering it is important to delete the duplicate PAAs so each unique question is only clustered once. However, it is still useful to keep a record of which keywords yielded each question; this can later help interpret clusters.
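A minimal sketch of this deduplication-with-provenance step (the rows here are illustrative, not my actual scraped data):

```python
import csv
from collections import defaultdict

# Raw rows as scraped: (source keyword, PAA question). Illustrative data.
rows = [
    ("nanoelectronics applications",
     "What are the three major applications of nanotechnology?"),
    ("nanoelectronics book",
     "What are the three major applications of nanotechnology?"),
    ("spintronics applications",
     "What can spintronics be used for?"),
]

# Map each unique question to the set of seed keywords that surfaced it.
sources = defaultdict(set)
for keyword, question in rows:
    sources[question].add(keyword)

# Write one row per unique question, keeping its source keywords for later
# cluster interpretation.
with open("paa_questions.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["question", "source_keywords"])
    for question, keywords in sources.items():
        writer.writerow([question, "; ".join(sorted(keywords))])

questions = list(sources)  # deduplicated list used for clustering
print(len(questions))  # 2 unique questions from 3 scraped rows
```

The deduplicated `questions` list is what gets embedded in Step 3, while the CSV keeps the question-to-keyword mapping for interpreting clusters later.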

One key consideration I had is whether to include the original keyword text when clustering the questions. In other words, should the question “What are the three major applications of nanotechnology?” be clustered on its own, or augmented with context like “[nanoelectronics applications] What are the three major applications of nanotechnology?” before embedding? I did not find any research on this point, but in my view we want to cluster by similarity of the questions themselves. Including the keyword would bias the representation: questions from the same seed might cluster together even if, semantically, they align better with queries from another seed. Moreover, the PAA questions normally include the topic context already, as the example above shows.

In summary, prepare your data as a list of unique PAA questions (associated with one or more seed keywords). 

Step 3: Transforming Questions into Semantic Embeddings

To perform clustering on actual words, we need to convert each question into a numerical vector. Short questions can be a problem: they contain few words, and with some of the older models the lexical overlap between questions can be low even when they are on the same topic. Questions like “How to start a blog” and “Ways to launch a website” have no content words in common, so a TF-IDF model would not see them as similar [6]. In contrast, modern sentence embeddings (such as Sentence-BERT) capture the meaning beyond the exact wording. They are trained to encode the contextual meaning of entire sentences into high-dimensional vectors. As a result, questions on similar topics end up close together in vector space, even if they do not share the same words [6].

I recommend using a sentence embedding model for PAA questions. Libraries like Sentence-Transformers in Python make this very easy [7]. I use the model ‘BAAI/bge-large-en-v1.5’, which transforms each question into an embedding vector, so you end up with a matrix of vectors. Going directly into clustering might give you a lot of noise due to the high dimensionality. That is why, at this stage, you may apply a dimensionality reduction technique (e.g. UMAP or PCA) to compress the vectors for easier clustering. Even though the clustering algorithms can handle the full dimensionality, reduction does help lower the noise: in my dataset, the noise fraction went from 60% down to below 20%. The critical outcome is that each question is now represented in a way that captures its meaning in context, which sets the stage for effective clustering.

				
from sentence_transformers import SentenceTransformer
import umap

# === STEP 2: EMBED ===
model = SentenceTransformer('BAAI/bge-large-en-v1.5')
embeddings = model.encode(questions, convert_to_numpy=True, show_progress_bar=True)

# === STEP 3: UMAP REDUCTION ===
umap_model = umap.UMAP(n_neighbors=15, min_dist=0.0, n_components=10, random_state=42)
reduced_embeddings = umap_model.fit_transform(embeddings)

Step 4: Clustering the Questions to Identify Intent Groups

The next step is to cluster these vectors to find groups of related questions. The choice of clustering algorithm is important. In a previous project I used K-Means clustering, because I knew in advance how many clusters I needed. In this scenario, however, density-based clustering methods like DBSCAN or HDBSCAN are more suitable; here is why.

K-Means vs. DBSCAN/HDBSCAN: K-Means requires specifying k (the number of clusters) beforehand, and assumes that clusters are roughly spherical and evenly sized. This is not practical for PAA questions, where clusters can vary in size and some questions might not cleanly fit any cluster at all. By contrast, HDBSCAN (Hierarchical DBSCAN) automatically finds an appropriate number of clusters and can leave out, as noise, questions that don’t strongly belong to any cluster. The latter is useful in SEO, since some very unique questions might not warrant grouping into a broader topic cluster [7].

For my case, HDBSCAN is a highly effective choice. It was designed to work with high-dimensional embeddings like those from NLP models, and its advantages align with SEO needs: “It automatically determines the optimal number of clusters, handles clusters of any shape or size, and excludes noise/outliers instead of forcing bad groupings.” This approach has been demonstrated to be successful in creating keyword clusters from search queries.

				
import hdbscan

# === STEP 4: HDBSCAN CLUSTERING ===
clusterer = hdbscan.HDBSCAN(
    min_cluster_size=5,
    metric='euclidean',
    cluster_selection_method='eom'
)
labels = clusterer.fit_predict(reduced_embeddings)
			

In practice, you feed the matrix of question vectors into HDBSCAN (with appropriate parameters such as the minimum cluster size). The algorithm outputs a cluster label for each question (or a noise label for the outliers). An alternative is DBSCAN; however, it requires tuning a distance epsilon parameter and is a bit more complicated to set up correctly.
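Once you have the labels (HDBSCAN marks noise as -1), a quick way to inspect cluster sizes and the noise share, shown here on a made-up label array rather than my real output:

```python
from collections import Counter

# Hypothetical HDBSCAN output: one label per question, -1 marks noise.
labels = [0, 0, 0, 1, 1, -1, 2, 2, 2, 2, -1, 1]

sizes = Counter(labels)
noise = sizes.pop(-1, 0)           # questions HDBSCAN left unclustered
noise_share = noise / len(labels)

print(f"{len(sizes)} clusters, noise share {noise_share:.0%}")
for cluster_id, size in sizes.most_common():
    print(f"cluster {cluster_id}: {size} questions")
```

This is how I tracked the noise fraction dropping after UMAP reduction: rerun the clustering and compare `noise_share` with and without the reduction step.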

Step 5: Interpreting Clusters and Applying Insights to SEO

In this final step, I interpret and label the clusters meaningfully, then integrate the findings into my SEO content strategy.

The first thing to do here is to assign a descriptive label or theme to each cluster. This can be done manually or with NLP assistance. For the manual route, you can look for common keywords. For an automated approach, you can use either extractive or generative methods. Extractive methods include algorithms like c-TF-IDF or KeyBERT that pull representative words and phrases out of each cluster. Generative methods involve sending a language model (like ChatGPT) the list of questions in a cluster and asking it to suggest a concise topic name or summary [8]. As I have access to a generative API, this was an easy choice.
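For readers without API access, the extractive route is easy to sketch. Below is a minimal pure-Python take on the c-TF-IDF idea (score each term by its frequency inside a cluster, discounted by how common it is across all clusters); the clusters and the tiny stopword list are illustrative, not my production setup:

```python
import math
import re
from collections import Counter

# Illustrative clusters of PAA questions.
clusters = {
    0: ["What can spintronics be used for?",
        "Is spintronics used in quantum computing?"],
    1: ["What could a nano robot be used for?",
        "Why can't we make nanobots?"],
}

STOPWORDS = {"what", "can", "could", "be", "used", "for", "is", "in", "a", "we", "why"}

def tokens(text):
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]

# Term frequency per cluster (treat each cluster as one "class document").
class_tf = {c: Counter(t for q in qs for t in tokens(q)) for c, qs in clusters.items()}
# Frequency of each term across all clusters combined.
total_tf = Counter()
for tf in class_tf.values():
    total_tf.update(tf)
avg_words = sum(total_tf.values()) / len(clusters)

# c-TF-IDF-style score: in-cluster frequency weighted by cross-cluster rarity.
labels = {}
for c, tf in class_tf.items():
    scores = {t: f * math.log(1 + avg_words / total_tf[t]) for t, f in tf.items()}
    labels[c] = max(scores, key=scores.get)

print(labels)
```

On real data you would keep the top few terms per cluster rather than a single one, and use a proper stopword list.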

Figure 2. Example cluster “Nanobots and nanomachines applications”: the cluster title connects to related People Also Ask questions (such as “What could a nano robot be used for?” and “Why can’t we make nanobots?”), which in turn branch into associated keyword suggestions from Google (such as “nanoelectronics applications”).

Now that we have clusters of related questions and their theme, we can look at practical SEO applications:

  • Build FAQ and Q&A Sections: PAA clusters are perfectly suited for Frequently Asked Questions (FAQ) content. The questions in a cluster can be used to create an FAQ section on a page, or a standalone FAQ page. In fact, one recommended use of PAA data is to build FAQ pages that directly address those questions [9].
  • Identify Content Gaps and New Content Opportunities: By clustering all the questions, you create a map of the subtopics within your niche that users care about. Comparing this map against your own content shows whether there is anything you haven’t covered on your site yet. These content gaps help ensure you don’t overlook subjects that users are asking about [3].
  • Create Topic “Pillars”: Your original seed word might be a broad topic, while the clusters form groups of subtopic questions. You can use this to structure your site’s content hierarchy [10].
  • Optimize for Featured Snippets and PAA Visibility: Clustering PAA questions also helps you prioritize which questions to answer in your content for maximum SEO impact. By answering them in 40–50 words under a proper heading, you have a good chance that Google will use your answer directly in the PAA box. However, Google prefers your answer to be in context (within a full article) rather than in an FAQ tacked onto the end of the page [3].
In summary, by generating PAA data and clustering it, you gain a map of user intents. The steps I showed reflect best practice in NLP and SEO, and they result in groups of questions that highlight what real users want to know.

References

[1] Apify. (2024). How to scrape keywords from the Google search bar (Google autocomplete suggestions). Apify Blog. https://blog.apify.com/how-to-scrape-google-autocomplete-suggestions/

[2] van der Meij, W. (2023, April 19). How to scrape People Also Ask questions on Google? Umbrellum. Retrieved from https://umbrellum.com/guides/content/people-also-ask/scraping-people-also-ask

[3] Malik, K. (2025). Dominating People Also Ask (PAA). Answer Engine Journal. https://answerenginejournal.com/guide/visibility/people-also-ask/

[4] Hooda, K. (2024). People also ask: What is it and how to use the PAA tool. Keywords Everywhere Blog. https://keywordseverywhere.com/blog/people-also-ask/

[5] Datajournal. (2024, November). How to scrape Google’s “People Also Ask” using Python. Medium. https://medium.com/@datajournal/scrape-google-people-also-ask-cefb9489c647

[6] Kumar, A. (2023). Text clustering: Key steps & algorithms. Vitalflux. https://vitalflux.com/text-clustering-key-steps-algorithms-examples/

[7] Jawaid, M. (2025, May 12). Clustering multilingual search terms with HDBSCAN and embeddings [Post]. LinkedIn. https://www.linkedin.com/posts/maryam-jawaid_nlp-hdbscan-clustering-activity-7350788791902617602-7abo/

[8] Sia AI. (2023). Labeling text clusters with keywords. Medium. https://sia-ai.medium.com/labeling-text-clusters-with-keywords-b5b5b6c1a89e

[9] Elliott, D. (2018). Scraping “People Also Ask” boxes for SEO and content research. Builtvisible. https://builtvisible.com/scraping-people-also-ask-boxes-for-seo-and-content-research/

[10] Bortoluzzi, N. (2023). How to easily cluster People Also Asked keywords. SEO Lynx Blog. https://seo-lynx.co.uk/blog/cluster-paa-queries/
