When AI Generates Ideas: Measuring the Performance of Language Models

[Image: A brain intertwined with digital circuits, symbolizing the fusion of artificial intelligence and human cognition]

Generative artificial intelligence is gradually becoming an integral part of ideation, design, and innovation processes. From designers to engineers, including strategy teams, large language models (LLMs) are increasingly being used to explore new avenues, organize ideas, and expand the range of possible options. But one question remains: Are these systems truly effective from a creative standpoint, and more importantly, how can this be evaluated?

This is where ÉTS Professor Romain Rampa comes in. Working alongside colleagues such as Florian Carichon of Mila (the Quebec Artificial Intelligence Institute), he is leading a research project on the creative performance of LLMs. His goal: to go beyond current, often limited approaches and provide an evaluation that is more rigorous and more representative of real-world creative challenges.

A creative feat, perhaps overestimated

In recent years, a consensus seems to be emerging: large language models may be capable of matching, or even surpassing, human performance on certain creative tasks. But according to Romain Rampa, PhD, this conclusion rests largely on biased evaluation methods.

Most existing studies rely on very specific tests, such as generating alternative uses for an object or producing words that are very different from one another. These exercises primarily measure divergent thinking, that is, the ability to generate a large number of varied ideas.

However, these tasks have several limitations. For one thing, they are often already included in the models’ training data, which can skew the results. For another, they cover only a limited aspect of creativity. Designing a product, developing a strategy, or addressing an ambiguous problem involves much more than simply generating a large number of diverse ideas.

A new benchmark for more realistic scenarios

To overcome these limitations, the research team developed a new framework for evaluating ideation tasks, inspired by real-world design scenarios.

This benchmark covers a wide range of challenges: product design, service development, solving complex, even ethically ambiguous, problems, and strategy formulation. The goal is to better understand the contexts in which humans and AI actually generate ideas.

Six models were tested (GPT-4, Claude Sonnet 4.5, Llama 3, Qwen 3.5, Grok, and Gemini) on over 14,000 tasks each. In total, nearly three million ideas were generated, creating an unprecedented database for analyzing the performance of these systems.

Measuring creative performance

To evaluate these results, the researchers focus on four main dimensions:

  • Fluency: the number of ideas generated
  • Variety: the diversity of the categories explored
  • Originality: the novelty of the ideas, measured by their rarity and their distance from a reference query
  • Relevance: how well the ideas fit the task, measured with metrics such as perplexity

This approach provides a multifaceted evaluation that better reflects the quality and diversity of the proposals.
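
To make these dimensions concrete, here is a minimal sketch of how fluency, variety, and originality might be scored over a batch of ideas using off-the-shelf sentence embeddings; relevance via perplexity would additionally require a language model. This is an illustration under assumptions, not the team's published method.

```python
# Sketch: scoring a batch of generated ideas on fluency, variety, and
# originality with sentence embeddings. Hypothetical illustration only,
# not the research team's actual implementation.
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def score_ideas(ideas: list[str], brief: str) -> dict:
    idea_vecs = encoder.encode(ideas, normalize_embeddings=True)
    brief_vec = encoder.encode([brief], normalize_embeddings=True)[0]

    # Fluency: how many ideas were produced.
    fluency = len(ideas)

    # Variety: mean pairwise dissimilarity between ideas (higher = more diverse).
    sims = idea_vecs @ idea_vecs.T
    n = len(ideas)
    variety = float(1.0 - (sims.sum() - n) / (n * (n - 1)))

    # Originality: mean distance from the reference brief (higher = more novel).
    originality = float(np.mean(1.0 - idea_vecs @ brief_vec))

    return {"fluency": fluency, "variety": variety, "originality": originality}

print(score_ideas(
    ["a foldable helmet that fits in a pocket",
     "a helmet that doubles as a bike lock"],
    "Design a safer bicycle helmet",
))
```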

[Photo: ÉTS Professor Romain Rampa]

Are creativity techniques transferable?

Another key component of the project is testing whether traditional creativity and design methods improve the performance of large language models. Methods such as brainstorming, design thinking, TRIZ, and C-K theory were translated into prompts so their effect on idea generation could be measured.
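
As a rough illustration, such methods might be operationalized as prompt templates along the following lines; these are hypothetical prompts, not the ones used in the study.

```python
# Hypothetical prompt templates for classic creativity methods;
# illustrative only, not the prompts used in the study.
METHOD_PROMPTS = {
    "brainstorming": (
        "Generate as many ideas as possible for: {brief}. "
        "Defer judgment, welcome wild ideas, and build on earlier ones."
    ),
    "triz": (
        "Identify the core contradiction in: {brief}. Then apply TRIZ "
        "inventive principles (e.g., segmentation, inversion) to resolve "
        "it and propose concrete solutions."
    ),
    "ck_theory": (
        "For: {brief}, first state what is known (K-space), then propose "
        "concepts that deliberately depart from that knowledge (C-space)."
    ),
}

def build_prompt(method: str, brief: str) -> str:
    return METHOD_PROMPTS[method].format(brief=brief)

print(build_prompt("triz", "Reduce plastic waste in food packaging"))
```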

The researchers also experimented with methods tailored to the models themselves, such as category-negation generation, which iteratively and explicitly instructs the system to move away from the solution categories it has already explored.
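
A minimal sketch of how such an iterative loop might look, with `generate` standing in as a hypothetical placeholder for any chat-completion API call; the team's actual protocol is not published here.

```python
# Sketch of category-negation generation: each round, the model is told
# which solution categories it has already used and asked to avoid them.
# `generate` is a hypothetical stand-in for any chat-completion API call.

def generate(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def category_negation(brief: str, rounds: int = 3) -> list[str]:
    ideas: list[str] = []
    seen: list[str] = []
    for _ in range(rounds):
        avoid = ", ".join(seen) if seen else "none yet"
        prompt = (
            f"Propose one idea for: {brief}.\n"
            f"Solution categories already explored (do NOT reuse them): {avoid}.\n"
            "End your answer with 'CATEGORY: <one-word label>'."
        )
        reply = generate(prompt)
        ideas.append(reply)
        # Record the self-declared category so the next round negates it.
        if "CATEGORY:" in reply:
            seen.append(reply.rsplit("CATEGORY:", 1)[1].strip())
    return ideas
```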

The goal is to determine whether tools designed to structure the generation of ideas in humans are effective when applied to language models, or whether tools specific to these systems should be developed.

Insightful results

Preliminary results highlight several trends.

First, performance varies by model. Some, like Grok, stand out for their more abundant and diverse output of ideas, with a higher level of originality, though sometimes at the expense of relevance. Other models take a more conservative approach.

Next, the way prompts are formulated plays a crucial role. Simple prompts that explicitly encourage the model to think outside the box often yield better results than complex, multi-step methodologies. Including unexpected or unusual elements in the brief also encourages more original ideas.
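
As a rough illustration of the two styles (hypothetical wording, not the study's prompts):

```python
# Hypothetical examples of the two prompt styles compared above.
SIMPLE_PROMPT = (
    "Design a new coffee mug. Think outside the box and propose ideas "
    "nobody has seen before."
)
MULTI_STEP_PROMPT = (
    "Step 1: list user needs for a coffee mug. Step 2: cluster them. "
    "Step 3: generate one idea per cluster. Step 4: refine the best three."
)
# And an unexpected element added to the brief to push originality:
UNUSUAL_BRIEF = "Design a coffee mug inspired by deep-sea creatures."
```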

Finally, some degree of standardization among the models becomes apparent: when faced with the same problem, they tend to produce similar responses. This suggests that systems often explore solutions that are already well-represented in their training data.
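
One simple way to quantify this convergence, sketched below, is the mean pairwise similarity between the answers different models give to the same brief; this again assumes sentence embeddings and is only an illustration, not the team's metric.

```python
# Sketch: quantifying cross-model standardization as the mean pairwise
# cosine similarity between the answers different models give to the
# same brief. Illustration only; assumes sentence embeddings.
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def cross_model_similarity(answers_by_model: dict[str, str]) -> float:
    vecs = encoder.encode(list(answers_by_model.values()),
                          normalize_embeddings=True)
    sims = vecs @ vecs.T
    n = len(vecs)
    # Mean of the off-diagonal entries: 1.0 would mean identical answers.
    return float((sims.sum() - n) / (n * (n - 1)))
```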

Furthermore, like humans, the models seem to perform better at generating product-related ideas than at process or organizational innovations, which require more abstract framing.

Toward better use of LLMs in ideation

Beyond simply comparing different models, this research focuses on better understanding how to effectively integrate LLMs into ideation processes.

The team hopes to develop other projects aimed at a better understanding of the interactions between humans and AI: How do ideas evolve when enriched by successive contributions? What respective roles can humans and systems play in a collaborative process?

The expected benefits are both scientific and practical. On the research side, the project contributes a new benchmark for evaluating the creative performance of language models. For organizations, it could translate into practical recommendations on how to phrase prompts and which methods are best suited to a given context.

A methodology designed to evolve

The framework developed by the research team also paves the way for future developments. It could incorporate other aspects, such as ethical issues related to ideation. Beyond quantity and originality lie the questions of the relevance and accountability of the proposed solutions.

By seeking a better characterization of language models' performance in design contexts, this research helps shape an emerging field at the intersection of artificial intelligence, design, and innovation management.

One thing is certain: as these tools become more integrated into professional practices, understanding their strengths and limitations becomes essential to using them effectively.