340 - Comparing Top Large Language Models for Python Code Generation

Published: 31 July 2024
Channel: DigitalSreeni

A video walkthrough of an experiment testing various large language models at generating Python code for scientific image analysis.


The task: segment nuclei in a multichannel image (proprietary format), measure the mean intensities of the other channels within the segmented nuclei regions, calculate intensity ratios, and report the results in a CSV file.
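For reference, here is a minimal sketch of the measurement half of the task. It assumes the proprietary image has already been read into a (channels, y, x) NumPy array and that a nuclei label image is available from whatever segmentation step is used; the channel indices, column names, and output filename are illustrative, not taken from any model's output:

    import pandas as pd
    from skimage.measure import regionprops_table

    def measure_channel_ratios(labels, img, ch_a=1, ch_b=2, out_csv="nuclei_ratios.csv"):
        """Per-nucleus mean intensities of two channels, plus their ratio, saved to CSV."""
        # Mean intensity of each channel within every labeled nucleus
        props_a = regionprops_table(labels, intensity_image=img[ch_a],
                                    properties=("label", "mean_intensity"))
        props_b = regionprops_table(labels, intensity_image=img[ch_b],
                                    properties=("label", "mean_intensity"))
        df = pd.DataFrame({
            "label": props_a["label"],
            "mean_ch_a": props_a["mean_intensity"],
            "mean_ch_b": props_b["mean_intensity"],
        })
        df["ratio_a_over_b"] = df["mean_ch_a"] / df["mean_ch_b"]
        df.to_csv(out_csv, index=False)
        return df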

All the well-hyped models are tested:
- Claude 3.5 Sonnet
- ChatGPT-4o
- Meta AI Llama 3.1 405-B
- Google Gemini 1.5 Flash
- Microsoft Copilot (how, in God's name, do I find the underlying model name?)

All generated code can be found here:

A summary of my findings:
- When I didn't specify how to segment, all of the models needed a bit of hand-holding. Claude did the best here.
- When I asked for StarDist segmentation, ChatGPT-4o nailed it on the first try, with minimal code.
- Claude and Meta AI weren't far behind, each needing just one small tweak (normalizing pixel values; see the sketch after this list).
- Gemini and Copilot... well, I'm super disappointed with their performance. I didn't manage to get their code to run at all, even after many prompts.
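For context, a StarDist-based solution really can be just a few lines. The sketch below shows the general shape of such minimal code, including the pixel-normalization tweak mentioned above; the pretrained model name and normalization percentiles are common defaults, not necessarily what any model generated:

    from csbdeep.utils import normalize
    from stardist.models import StarDist2D

    # Pretrained 2D fluorescence model shipped with StarDist
    model = StarDist2D.from_pretrained("2D_versatile_fluo")

    def segment_nuclei(nuclei_channel):
        """Return a label image of nuclei from a single-channel 2D array."""
        # Percentile normalization is the small tweak noted above:
        # the pretrained model expects input scaled roughly to [0, 1]
        normalized = normalize(nuclei_channel, 1, 99.8)
        labels, _ = model.predict_instances(normalized)
        return labels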

With StarDist-based segmentation, the code generated by ChatGPT, Claude, and Meta AI produced statistically identical results for the intensity measurements.
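One way to back such a claim is a paired test on the per-nucleus values from each model's CSV. A sketch, assuming the nuclei are matched row by row and reusing the illustrative filenames and column names from the earlier snippet:

    import pandas as pd
    from scipy import stats

    # Hypothetical output files from two of the models' pipelines
    a = pd.read_csv("chatgpt_nuclei_ratios.csv")
    b = pd.read_csv("claude_nuclei_ratios.csv")

    # Paired t-test on the per-nucleus ratios
    t, p = stats.ttest_rel(a["ratio_a_over_b"], b["ratio_a_over_b"])
    print(f"paired t-test: t = {t:.3f}, p = {p:.3f}")  # a high p-value means no detectable difference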

While AI is making rapid progress, the need for detailed prompting to obtain reliable results underscores the continued importance of domain expertise and basic coding skills.
