Unearthing the soul of climate innovation: introducing ALMa

[Image: An Undaunted collage showing an academic researcher in a lab coat, surrounded by a cityscape and an orange sky, representing climate innovation]

Dr César Quilodrán-Casas, Advanced Research Fellow in Machine Learning, is part of Undaunted’s Climate Solutions Catalyst (CSC) team. The CSC is building a novel tool with the power to unearth untapped academic research that has huge potential to tackle climate change if it can be supported to commercialise. In this blog, César takes a deep dive into the team’s tool, ALMa, exploring what the CSC has been up to in the project’s first year and what’s on the horizon.

My team at the CSC and I believe that many of the technologies the UK needs to tackle climate change already exist. Not as polished products, but as ideas and findings quietly published in scientific journals. The challenge is not a lack of innovation, but a lack of visibility. What if, instead of waiting for researchers to propose commercial applications of their findings, we could find the most promising work ourselves? What if we could spot the idea before its author even thinks about spinning it out?

A proactive approach is at the core of the CSC. As part of this fantastic team, I have led the development of ALMa, a tool we built to identify high-potential climate innovation hiding in plain sight. ALMa stands for Accelerating climate innovation using Language Model based analysis. The acronym might be a bit of a stretch, but there is a reason for it. “Alma” means “soul” in Spanish, and what we are looking for is the soul of climate innovation – research that could grow into something transformative if the academics behind it can translate their ideas.

Instead of starting with an open call, we flipped the model upside down. We wanted to identify researchers before they even realised their idea could be commercialised. Think Minority Report, but instead of pre-crime we are looking for pre-commercial science; and instead of Tom Cruise, the academic community has the CSC team, some open databases, and a few thousand lines of code!

Which climate challenge?

We decided to focus our proof of concept on green catalysis. This choice came out of a collaborative effort with my colleague Dr Chris Waite, who helped us zoom in on this “underappreciated” climate challenge. Arguably, catalysis is involved in over 90 per cent of chemical manufacturing processes. It plays a critical role in areas like hydrogen production. However, it is rarely discussed in the same breath as solar or wind.

The chemical sector still depends heavily on fossil carbon, as most high-value chemicals are derived from a handful of fossil raw materials. Decarbonising this system means designing new catalysts and processes to create the same outputs from renewable or waste inputs. It is a deep, technical problem, and we suspected that there might already be research addressing it that had not been connected to its innovation potential.

That’s a lot of academic papers! How did you read that many?

From the beginning, we knew that we wanted to use machine learning to scan the literature, because in the face of reading thousands of abstracts, we are Undaunted. However, we did not go straight to a Large Language Model (LLM). Instead, we built a hybrid workflow using classic Natural Language Processing (NLP) and Machine Learning (ML) methods first. Why? Because an LLM-only workflow is powerful but not very flexible. Every time we wanted to change how we framed a question or filtered a paper, we would have had to rerun the entire pipeline – not ideal when you are in the middle of a discovery process.

We also wanted to reduce opacity. Using smaller and modular models gave us more control over what was happening and allowed us to iterate quickly. We needed to make mistakes, but we wanted to make them early and cheaply.

To train ALMa, we needed a reference dataset. We are not catalysis experts, but with backgrounds in biology and physics, we were comfortable reading technical literature. We compiled an initial set of high-quality papers from the REF impact case studies and from the UK Catalysis Hub. We added to this by downloading around 3,000 abstracts from a scientific database, using keywords and topic filters.
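
For the curious, here is a rough sketch of what that kind of download can look like, using OpenAlex as a stand-in scholarly database. The database, endpoint, keywords and filters below are illustrative assumptions rather than the exact setup we used:

```python
import requests

# Illustrative only: query an open scholarly database (here, OpenAlex) for
# green-catalysis abstracts. The keywords and filters are placeholders.
BASE_URL = "https://api.openalex.org/works"

def fetch_abstracts(query, n=200):
    params = {
        "search": query,                                # keyword search
        "filter": "from_publication_date:2010-01-01",   # topic/date filters go here
        "per-page": n,
    }
    works = requests.get(BASE_URL, params=params, timeout=30).json()["results"]
    papers = []
    for w in works:
        inv = w.get("abstract_inverted_index")
        if not inv:
            continue  # skip records without an abstract
        # OpenAlex stores abstracts as an inverted index; rebuild the plain text.
        positions = {pos: word for word, idxs in inv.items() for pos in idxs}
        abstract = " ".join(positions[i] for i in sorted(positions))
        papers.append({"title": w["title"], "abstract": abstract})
    return papers

papers = fetch_abstracts("green catalysis hydrogen production")
print(len(papers), "abstracts fetched")
```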

Each abstract was classified into one of three categories:

  • Positive: original research proposing a deep tech solution relevant to green catalysis
  • Negative: irrelevant or lacking a clear technical innovation
  • Background: reviews, perspectives, roadmaps and so on

Manual labelling helped us to define the edges of these categories, but doing this for thousands of papers was not sustainable. Thus, we trained a simple LLM to help us label the rest. Of course, it made some errors. Some abstracts looked promising based on keywords, but clearly read like reviews. We corrected those manually.
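
To give a flavour of that step, here is a minimal sketch of LLM-assisted labelling, assuming an OpenAI-style chat API. The model name, prompt wording and whether the model was prompted or fine-tuned are all simplified away here; treat it as an illustration, not our exact configuration:

```python
from openai import OpenAI  # assumes an OpenAI-style chat API; our actual setup differed

client = OpenAI()
LABELS = {"positive", "negative", "background"}

def label_abstract(title, abstract):
    """Ask the model to place an abstract into one of our three categories."""
    prompt = (
        "Classify this paper for a green-catalysis innovation search. "
        "Answer with exactly one word:\n"
        "positive - original research proposing a deep tech solution relevant to green catalysis\n"
        "negative - irrelevant, or lacking a clear technical innovation\n"
        "background - review, perspective, roadmap or similar\n\n"
        f"Title: {title}\nAbstract: {abstract}"
    )
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    label = reply.choices[0].message.content.strip().lower()
    return label if label in LABELS else "background"  # conservative fallback

print(label_abstract("A review of CO2 hydrogenation catalysts", "We survey recent advances..."))
```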

For this initial experiment, we limited our scope to 13 UK universities. We downloaded 10,000 abstracts from these institutions going back to 2010, but only ~7,000 passed our filters. We kept only those where the first, last, or corresponding author was based at one of the target universities. This was important: if we were going to contact researchers later, we wanted to speak to the ones most involved in the work, not a minor contributor.
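
A toy version of that author filter might look like the snippet below. The authorship fields mirror OpenAlex-style records and are an illustrative assumption, not our actual schema:

```python
# Hypothetical filter: keep a paper only if its first, last, or corresponding
# author sits at one of our target universities.
TARGET_UNIS = {"Imperial College London", "University of Oxford"}  # illustrative subset of the 13

papers = [  # toy record standing in for the downloaded metadata
    {
        "title": "A new catalyst for low-temperature ammonia synthesis",
        "authorships": [
            {"author_position": "first", "is_corresponding": True,
             "institutions": [{"display_name": "Imperial College London"}]},
            {"author_position": "last", "is_corresponding": False,
             "institutions": [{"display_name": "ETH Zurich"}]},
        ],
    },
]

def key_author_at_target(authorships):
    """True if the first, last, or corresponding author is at a target university."""
    for a in authorships:
        is_key = a.get("author_position") in ("first", "last") or a.get("is_corresponding", False)
        if not is_key:
            continue
        institutions = {inst["display_name"] for inst in a.get("institutions", [])}
        if institutions & TARGET_UNIS:
            return True
    return False

filtered = [p for p in papers if key_author_at_target(p["authorships"])]
print(len(filtered), "papers kept")
```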

We used an embedding model to convert each paper’s title and abstract pair into a vector embedding. These embeddings capture the semantic meaning of the paper, making them suitable for classification. We then fine-tuned the embeddings on our labelled dataset and trained a simple but powerful classification model on top of them. This classifier predicted whether a paper was positive, negative, or background, and also whether it had been cited in a patent.
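
As a rough sketch of this stage, the snippet below embeds title and abstract pairs with a sentence-transformers model and fits a lightweight scikit-learn classifier. The specific models, the fine-tuning step and the patent-citation target are simplified away, so this is an illustration rather than ALMa’s actual pipeline:

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the labelled reference set.
labelled_papers = [
    {"title": "Novel Ni catalyst for green ammonia", "abstract": "...", "label": "positive"},
    {"title": "A review of catalytic CO2 conversion", "abstract": "...", "label": "background"},
    {"title": "Urban traffic flow modelling", "abstract": "...", "label": "negative"},
]

# Placeholder embedding model; any sentence encoder would do for this sketch.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def to_text(p):
    # Combine title and abstract into one string before embedding.
    return f"{p['title']} [SEP] {p['abstract']}"

X = encoder.encode([to_text(p) for p in labelled_papers])
y = [p["label"] for p in labelled_papers]

# Lightweight classifier on top of the embeddings.
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Scoring unseen papers is then a single encode + predict pass.
unseen = [{"title": "Photocatalytic water splitting at scale", "abstract": "..."}]
print(clf.predict(encoder.encode([to_text(p) for p in unseen])))
```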

Once trained, this pipeline could score thousands of unseen papers in seconds, which matters when you are working at scale. If we ever wanted to analyse a million papers, an LLM-only approach would be slow and prohibitively expensive. This setup gave us a flexible and efficient filtering layer before handing anything off to the LLM for deeper analysis.

The positive papers were then double-checked by an LLM, but with a very specific prompt (there is a sketch of it after this list). We asked structured questions about each abstract, including:

  • What is the technology readiness level (TRL)?
  • What is the climate mitigation potential?
  • Could this research be useful to UK industry?
  • Is it at a stage where it could be disclosed to a tech transfer office?
  • Would it benefit from support in market discovery or IP development?
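
A simplified sketch of how such a structured prompt can be wired up is below, again assuming an OpenAI-style chat API with JSON output. The model name and exact question wording are placeholders:

```python
import json
from openai import OpenAI  # assumes an OpenAI-style chat API; our actual model and prompt differed

client = OpenAI()

QUESTIONS = """Answer in JSON with these keys for the abstract below:
- trl: estimated technology readiness level (integer 1-9)
- mitigation_potential: one sentence on the climate mitigation potential
- uk_industry_relevance: "yes" or "no", could this research be useful to UK industry?
- tto_ready: "yes" or "no", is it at a stage where it could be disclosed to a tech transfer office?
- support_needed: "yes" or "no", would it benefit from support in market discovery or IP development?
- lay_summary: a two-sentence non-specialist summary
"""

def assess_abstract(title, abstract):
    """Ask every abstract the same structured questions so answers are comparable."""
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": f"{QUESTIONS}\nTitle: {title}\nAbstract: {abstract}"}],
        response_format={"type": "json_object"},  # force machine-readable output
        temperature=0,
    )
    return json.loads(reply.choices[0].message.content)
```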

We also asked for a non-specialist summary of each abstract, to help us navigate the technical language and evaluate the real potential of the proposed solution. This allowed us to compare the model’s answers across papers and apply an algorithmic scoring method to prioritise them. At the same time, Chris did his own manual review. Out of the top 160 papers selected by each method, 83 overlapped. Those 83 became our shortlist.
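
To make the idea concrete, here is a toy example of turning structured answers into a priority score and intersecting the result with a manual shortlist. The weights and scoring rule are invented for illustration; the post does not describe our actual method:

```python
# Dummy structured answers keyed by paper ID; assessments stand in for the LLM output.
assessments = {
    "p1": {"trl": 4, "uk_industry_relevance": "yes", "tto_ready": "yes", "support_needed": "yes"},
    "p2": {"trl": 2, "uk_industry_relevance": "no", "tto_ready": "no", "support_needed": "yes"},
}
manual_top = {"p1"}  # papers picked by hand in the parallel manual review

def priority_score(a):
    # Invented weights: reward UK relevance, disclosure readiness and a workable TRL.
    score = 2 * (a["uk_industry_relevance"] == "yes")
    score += 2 * (a["tto_ready"] == "yes")
    score += 1 * (a["support_needed"] == "yes")
    score += 1 * (3 <= a["trl"] <= 6)  # early-stage but past pure concept
    return score

ranked = sorted(assessments.items(), key=lambda kv: priority_score(kv[1]), reverse=True)
alma_top = {paper_id for paper_id, _ in ranked[:160]}
shortlist = alma_top & manual_top  # the 83-paper shortlist came from this kind of overlap
print(shortlist)
```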

From datapoints to real faces

The next step was to track down the researchers behind those papers. Some were no longer in academia. Others had changed institutions. We used Perplexity and other tools to find updated affiliations, filtering out the most senior professors and identifying early-career researchers who might benefit most from the CSC’s bespoke support.

From our 83 shortlisted papers, we found 50 researchers who were contactable and whose work showed high potential. We sent them cold emails to ask if they might be interested in the support that the CSC can offer. Did any of them reply? That’s a question for the next CSC blog post…!

Why did we build ALMa this way?

To put that in perspective, reading 10,000 abstracts at an average of 200 words each would mean reading about 2 million words. At a reading speed of 250 words per minute, that would take over 130 hours. ALMa does not get tired or distracted and it does not run out of steam – but it could still run out of API credits. That is why we designed it to use expensive resources only where necessary, so that scaling to millions of papers is achievable in the short term.
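
For anyone who wants to check the arithmetic:

```python
abstracts = 10_000
words_per_abstract = 200      # average abstract length
reading_speed = 250           # words per minute

total_words = abstracts * words_per_abstract    # 2,000,000 words
hours = total_words / reading_speed / 60        # ~133 hours
print(f"{total_words:,} words is roughly {hours:.0f} hours of reading")
```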

This is still an early experiment, but it makes us wonder: what other areas of research might benefit from this approach? What else is hiding in the literature, waiting for someone to ask the right question?

Our “summer of testing”

The CSC team is currently doing a “summer of testing” to validate our search tool. If you are interested or want to hear more about it, please email us at csc_undaunted@imperial.ac.uk or message me directly at c.quilodran@imperial.ac.uk.

Sign up to the Undaunted newsletter to stay in the loop on upcoming climate innovation news, events and opportunities.
