Prevayl’s Human Data Challenge

1 April 2024

The Challenge

In the ever-evolving landscape of artificial intelligence, the training of Generative Pre-trained Transformers (GPTs) stands at the forefront of innovation. Developing a model of such complexity, capable of generating human-like text, requires a vast amount of data that has been meticulously curated to refine its understanding of language and context.

Typically, GPT models are trained on datasets comprising billions of words, drawn from a diverse array of sources ranging from literature and scientific journals to internet forums and news articles. It’s widely reported, though not officially confirmed, that GPT-4 has around 1.7 trillion parameters. Whilst assembling training data at this scale is by no means trivial, it is worth noting that the internet has provided a vast and readily available source of text and image data that has been used to train many of the models on the market.

However, the journey from raw data to sophisticated language comprehension is not without its challenges. Structuring the data in a coherent and meaningful manner is paramount, ensuring that the model can extract and synthesise relevant information effectively.

When the focus shifts to domains as intricate as human physiological conditions, the hurdles intensify. The nuances of medical terminology, the complexity of interrelated biological processes, the limited availability of labelled data, and the ethical considerations surrounding patient data all present formidable obstacles in the pursuit of training GPTs to comprehend and generate insights in this domain. Despite these challenges, the potential for GPTs to revolutionise healthcare and scientific research is immense, offering a glimpse into a future where artificial intelligence augments our understanding of the human condition in unprecedented ways.

Our Approach

At Prevayl, we’re setting our sights high, fully aware of the hurdles ahead, especially the challenge of accessing the extensive multi-modal personal health data we need. But we have a clear plan to tackle this.

To kick things off, we’ll use publicly available datasets for our initial proof of concept. Resources like PhysioNet, OpenNeuro, and The Cancer Imaging Archive (TCIA) provide the rich data we need to train AI algorithms. These open-access repositories facilitate widespread innovation and collaboration, enabling researchers and developers to access a wealth of information that can lead to breakthroughs in diagnostics, treatment, and patient care.

Our starting point is heart disease and the circulatory system. Our team’s background in wearable technology gives us an edge in analysing ECG data, allowing us to draw on our subject matter expertise. From there, we’ll systematically address different disease categories, prioritising them by their potential for impact. This approach not only deepens our understanding but also hastens the development of practical applications. We’re not just interested in showcasing our model’s technical capabilities; we’re aiming to solve tangible health problems.

The journey continues as we plan to access large, multi-modal datasets from projects like the UK Biobank through our academic partnerships. This step reflects our belief in collaboration. By pooling resources and knowledge, we can achieve meaningful outcomes that benefit everyone involved. This is how we, at Prevayl, envision contributing to the broader medical AI field: by working together towards shared goals.