AI versus Human Generated Text to Structured Data
January 21, 2025

Two years ago, I sent a semi-panicked email to friends in the GOAT (Gathering for Open Ag Tech) community saying “AI is coming and it’ll change everything!!!”

The response was muted, and life moved on.

As time moved on, a great deal of over-hyping happened, but real use cases in agriculture also emerged. Digital Green began using RAG to deliver recommendations to farmers in India, for example. In Our Sci’s world, I hoped that we could use AI to solve our biggest problem: collecting data.

No one likes data collection. Digital surveys, interviews, email chains, paper forms, spreadsheets… it all basically sucks. People (especially producers!) don’t like providing it, and organizations don’t like managing it.

There is no perfect answer to this problem – there’s no technology (yet) to magically extract information from people’s brains, though larger systems like JD Ops do help avoid the need for human data entry altogether by GPS-tracking tractors and implements, quantifying fertilizer use, and so on. But not everyone has, wants, or needs this, and not all information can be tracked via tractor GPS.

Can AI help us collect data with less pain?

Funded through the National Agricultural Producers Data Cooperative (NAPDC) out of the University of Nebraska–Lincoln, and in partnership with Pasa, Fibershed, OpenTEAM, Mad Agriculture, Momentum Ag, and CAFF, along with some GOATs who meet bi-weekly to talk about AI uses in ag (like Good Agriculture, Ag Informatics Lab, Qlever, and others), we decided to identify a useful test case and spend a couple of months seeing what we could learn. A summary of our findings is below.

Summary of Findings

We used OpenAI’s o1 model in ‘Structured Output’ mode to test a large language model’s ability to convert raw transcripts of a conversation between a producer and a technical assistance provider about in-field trials into structured data (“a database”) following 3 defined JSON schemas. We tested 20 different prompting strategies (“recipes”) on 3 different test transcripts, and scored them against a human-generated ‘gold standard’ answer using 2 different scorers (Rose and Adie).
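
To give a sense of the approach, here is a minimal sketch of calling Structured Output mode with a small JSON schema. The schema, prompt wording, and model identifier below are simplified placeholders of ours, not the actual 3 schemas or 20 recipes used in the test.

```python
"""Minimal sketch of transcript-to-structured-data extraction with
OpenAI's Structured Output mode. Schema, prompt, and model name are
illustrative placeholders, not the project's real recipes."""
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# A deliberately tiny example schema; the real schemas were much larger.
TRIAL_SCHEMA = {
    "type": "object",
    "properties": {
        "field_name": {"type": "string"},
        "crop": {"type": "string"},
        "treatment": {"type": "string"},
        "planting_date": {"type": "string", "description": "ISO 8601 date, or empty if not stated"},
    },
    "required": ["field_name", "crop", "treatment", "planting_date"],
    "additionalProperties": False,
}

def extract_trial_record(transcript: str, model: str = "o1") -> dict:
    """Ask the model to fill the schema from a raw interview transcript."""
    response = client.chat.completions.create(
        model=model,  # assumed identifier; any structured-output-capable model works
        messages=[{
            "role": "user",
            "content": (
                "Extract the in-field trial details from this interview transcript. "
                "Leave a field empty rather than guessing.\n\n" + transcript
            ),
        }],
        response_format={
            "type": "json_schema",
            "json_schema": {"name": "trial_record", "strict": True, "schema": TRIAL_SCHEMA},
        },
    )
    return json.loads(response.choices[0].message.content)
```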

For the highest-quality interview, average accuracy scores across the 3 databases were 72%, 78%, and 81% correct answers, with the best-performing recipe on that same interview yielding 87%, 88%, and 87.5% correct answers, respectively.
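
For context on what a “percent correct” score means here, a simple field-level comparison against the gold standard might look like the sketch below. The actual scoring was done by two human reviewers, whose judgment an exact-match check cannot fully capture; this only shows the shape of the metric.

```python
# Sketch of a field-level accuracy check against a human 'gold standard'
# record. Exact-match only; the human scorers in the study applied their
# own judgment to partial or near matches.
def field_accuracy(ai_record: dict, gold_record: dict) -> float:
    """Fraction of gold-standard fields the AI answered correctly (exact match)."""
    if not gold_record:
        return 0.0
    matches = sum(
        str(ai_record.get(field, "")).strip().lower() == str(value).strip().lower()
        for field, value in gold_record.items()
    )
    return matches / len(gold_record)

# Two of three fields match, so this prints roughly 0.67.
print(field_accuracy(
    {"crop": "corn", "treatment": "compost", "planting_date": "2024-05-01"},
    {"crop": "corn", "treatment": "compost extract", "planting_date": "2024-05-01"},
))
```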

After analyzing the quality of the AI-generated results, our main takeaways were:

  1. the quality of the interview was the main driver of the quality of the structured output produced by the AI;
  2. more complex prompting (pre-prompts, heavily structured prompts, XML tagging, etc.) did not yield clearly better results;
  3. more context and access to history (field names, interview participant names and roles, etc.) would probably reduce AI hallucinations caused by missing information;
  4. better-defined schemas with fewer open-ended questions would probably yield more accurate results and fewer hallucinations (see the schema sketch after this list).
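
To make takeaway 4 concrete, the sketch below contrasts an open-ended field with a more tightly defined one. The option values are invented for illustration; the point is that a constrained list plus an explicit “unknown” escape hatch gives the model less room to hallucinate.

```python
# Illustration of takeaway 4 (option values are invented): replacing a
# free-text field with a constrained set of choices, including an explicit
# "unknown", narrows what the model can return and makes errors easier to spot.
OPEN_ENDED_TREATMENT = {
    "treatment": {"type": "string", "description": "Describe the treatment applied"},
}

BETTER_DEFINED_TREATMENT = {
    "treatment": {
        "type": "string",
        "enum": ["compost", "cover crop", "reduced tillage", "control", "unknown"],
        "description": "Pick the closest option; use 'unknown' if the transcript does not say",
    },
}
```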

Overall, we feel that with appropriate adjustments based on our learnings, consistent performance of 90% or higher is possible. This meets or exceeds the minimum performance levels (ranging from 60–90%) identified by the NAPDC interview group during the needs assessment.

The Gitlab issue containing complete discussion and results can be found here.

———————

Interested in reading more? Click here to read the full report.

If you like this, want to test other things, want to contribute, or have done work yourself that you’d like to share, let us know! We’re always interested in chatting, sharing, or collaborating.