Have we reached the stage where agentic AI gives you superpowers? We have moved on from chatbots to AI agents that can carry out tasks for you. The most powerful way to use an agentic AI is to Augment your Intelligence. I set myself a goal of completing a substantial project on a Saturday afternoon.
Even the free versions of AIs will do amazing things, but if you want to create something useful, you need to have a vision of what you’d like to achieve. A good starting place is to consider your own personal areas of expertise. I decided to use the 81 blogs on this web site as my subject matter. My vision was to generate the 3D semantic map that appears at the top of this page.
Make a plan
AI tools: I decided to use the Copilot agent in VSCode to perform web scraping, coding and data processing tasks. Gemini provided some useful pointers along the way. I also wanted to experiment with running a language model locally.
Data: Scrape all the blogs from my web site.
Processing: Create a semantic representation of each blog in the form of an embedding. Reduce the dimensionality of the embeddings. Look for clusters. Identify common semantic characteristics of the members of each cluster.
Presentation: Turn it into a 3D plot, highlighting the clusters and the progression of time.
Scraping
I asked Copilot to create a uv Python environment in VSCode. To scrape all my blogs, Copilot first found the sitemap.xml, and then I asked it to create a JSON file with the page name, date and text of each blog. After a little tidying up, this was all done in about half an hour.
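The first step of that scrape can be sketched as follows. This is a minimal illustration, not the code Copilot wrote: the function name `parse_sitemap`, the record keys and the sample XML are all assumptions, and it only covers reading page addresses and dates from a sitemap that follows the standard sitemaps.org layout. Fetching each page and extracting its text would come afterwards.

```python
import json
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text: str) -> list[dict]:
    """Return one record per <url> entry with its address and last-modified date."""
    root = ET.fromstring(xml_text)
    records = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = url.findtext("sm:lastmod", default="", namespaces=SITEMAP_NS).strip()
        if loc:  # skip malformed entries that have no address
            records.append({"url": loc, "date": lastmod})
    return records


# Made-up sample sitemap, used only to show the shape of the output.
SAMPLE = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/first-post/</loc>"
    "<lastmod>2024-01-05</lastmod></url>"
    "<url><loc>https://example.com/second-post/</loc></url>"
    "</urlset>"
)

records = parse_sitemap(SAMPLE)
print(json.dumps(records, indent=2))
```

An entry without a `lastmod` element simply gets an empty date, which is the kind of gap that needs tidying up by hand.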
Semantic embeddings
I turned to Gemini for suggestions of the best publicly-available text embedding model on HuggingFace that would run on my MacBook Pro. It proposed a Qwen2-7B model due to its large context window and strong performance. I had already downloaded LM Studio, so it was simple (or so I thought) to download the model and set it running on a local server. However, after numerous attempts, Copilot could not get a response from the server, even though LM Studio confirmed it was running. Eventually Gemini suggested that the Qwen2 model was running as a chatbot rather than as an embedding model. I eventually found the “Override Domain Type” setting in the LM Studio models tab. Once I switched it to Embedding, Bingo! Everything worked.
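For readers who want to try this, here is a rough sketch of how a script might ask a local LM Studio server for an embedding. It assumes the server is listening at its default address and exposes an OpenAI-style `/v1/embeddings` endpoint; the model name is a placeholder, and the helper names are my own rather than anything from the project code.

```python
import json
import urllib.request

# Assumption: LM Studio's default local server address; change it if yours differs.
BASE_URL = "http://localhost:1234/v1"
# Placeholder: use the identifier LM Studio shows for the loaded embedding model.
MODEL_NAME = "my-local-embedding-model"


def build_embedding_request(text: str, model: str = MODEL_NAME) -> urllib.request.Request:
    """Build a POST request in the OpenAI-style embeddings format."""
    payload = json.dumps({"model": model, "input": text}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_embedding_response(body: bytes) -> list[float]:
    """Extract the vector from an OpenAI-style embeddings response."""
    return json.loads(body)["data"][0]["embedding"]


def embed(text: str) -> list[float]:
    """Send the text to the local server and return its embedding vector."""
    with urllib.request.urlopen(build_embedding_request(text), timeout=60) as resp:
        return parse_embedding_response(resp.read())
```

Calling `embed("some blog text")` only works while the server is running with the model loaded as an embedding model, which is exactly the setting that caught me out.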
Copilot’s Python code successfully created text embeddings for all the blogs. My intuition was that blogs on similar topics would be closer to each other in embedding space, but with an embedding dimension of 3584 there was no way to view this directly. Fortunately, along with recommending a text embedding model, Gemini had recommended UMAP as the gold standard for visualising complex data sets while preserving clusters, in preference to alternatives such as Principal Component Analysis, and Plotly for visualisation.
Copilot wrote scripts to collapse the dimensions to three and to display the results in a 3D Plotly chart. This all ran perfectly. Upon inspection, I could see distinct groupings, so I asked Copilot to identify four clusters. It decided to use K-means, which was fine by me, but this unsupervised method doesn’t explain why the points in the same cluster are close to each other. So I went back to Copilot and instructed it to return to the original 3584-dimensional embeddings and identify, for each of the four clusters, what aspects of the text explain why the blogs have been grouped together.
Copilot chugged away for a while and came back with a very nice characterisation of each group. It is instructive to look at the code to find out how it did this. The code shifted back into the full embedding space and identified the five blogs closest to the centroid of each cluster. It appeared to use snippets of the first 220 characters of the top five blogs to identify the themes. I think this worked because each blog opens with a summary of the topic. Copilot produced the short labels for the clusters that appear on the chart.
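As an illustration of that approach, the sketch below finds the members nearest each cluster centroid in the full embedding space and trims text to a 220-character snippet. The function names are hypothetical, and using Euclidean distance to the mean of each cluster's members is my assumption; the actual script may measure closeness differently.

```python
import numpy as np


def closest_to_centroids(embeddings: np.ndarray, labels: np.ndarray, top_n: int = 5) -> dict:
    """For each cluster, return the indices of the top_n members nearest its centroid."""
    result = {}
    for cluster in np.unique(labels):
        members = np.flatnonzero(labels == cluster)
        # Centroid computed in the full embedding space, not the 3D projection.
        centroid = embeddings[members].mean(axis=0)
        distances = np.linalg.norm(embeddings[members] - centroid, axis=1)
        nearest = np.argsort(distances)[:top_n]
        result[int(cluster)] = members[nearest].tolist()
    return result


def snippet(text: str, length: int = 220) -> str:
    """Collapse whitespace and keep the opening characters of a blog."""
    return " ".join(text.split())[:length]
```

The snippets from the nearest blogs in each cluster can then be read side by side, or handed to a language model, to come up with a short label.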
I also asked Copilot to attempt to find a semantic interpretation of the three axes that resulted from the UMAP projection.
Communicating the result
An interactive 3D chart provided an intuitive visualisation of the results. I asked Copilot to set the marker shape according to cluster and to colour the points according to date to see whether themes have changed over time.
Science4Performance
The blogs on this web site are characterised by four themes:
- Performance Science
- Cycling Data & Tech
- Strava and Race Performance
- Technical Science and Modelling
The themes vary in nature along three axes:
- general performance science v applied cycling/Strava themes
- degree of analytical / scientific depth
- technical modeling / data-science emphasis
GitHub repository for this project: https://github.com/science4performance/Science4PerformanceWebsite






















