Summary: A walkthrough of the Our-Sci data pipeline, from field to prediction outputs. This is part of our journey to transform our tool suite from a specialized solution for a few use cases into a generally usable continuous data merging tool, resulting in an immediately useful database.
UPDATE November 5, 2021: We now have a video version of this article.
The journey begins when a real-world object, such as a piece of produce, is selected for evaluation by the Bionutrient Institute. For our purposes, we might say that information exists as observable qualities of real objects, and it comes to us from multiple sources.
These data points are compiled using SurveyStack tools, and then split into two significant categories:
Measurements
Physical dimensions
Subjective look and taste
Spectroscopic readings (XRF, UV/Visible, and NIR)
Wet chemistry analyses
Metadata
These are pieces of information that are relevant for distinguishing data points from each other, even though they are not directly measurable on the object itself:
Source location
Variety
Agricultural practices in the field
These two forms of information are registered in multiple surveys. The past year’s database is assembled from twenty-two different surveys, and a single piece of data can involve information from as many as ten of these.
Software
Through SurveyStack, users can design almost any kind of survey, with inputs that include advanced features such as hardware integration.
Surveys correspond to distinct moments of data collection within both the agricultural and lab processes.
As a data point travels through our system, we continually check the quality of the gathered information using several dashboards that give data managers insight into the current state of the process.
Information is stored in a MongoDB database and is easily accessible through several means, including an automated query generator.
Once a data point has been registered in any survey, our data pipeline starts its work: it seeks out information about that particular point across all the other surveys and forms a single, comprehensive entry synthesizing everything we know about it.
The goal of this stage is to transform the available data into readily useful information, in a way that is transparent enough to support independent research. This is part of a Continuous Integration process that runs three times a day, which provides a glimpse into the current state of data collection at any given moment.
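To give a rough sense of what that merge step looks like in code, here is a minimal Python sketch, not our actual implementation: it assumes each survey has already been exported as a table keyed by a shared sample identifier, and the file names and the sample_id column are hypothetical.

```python
import pandas as pd
from functools import reduce

# Hypothetical per-survey exports; the real pipeline reads from MongoDB
# and uses its own survey names and identifiers.
survey_files = ["field_intake.csv", "lab_dimensions.csv", "spectroscopy.csv"]
tables = [pd.read_csv(path) for path in survey_files]

# Outer-join every survey on the shared sample identifier, so each sample
# ends up as a single row combining everything known about it.
merged = reduce(
    lambda left, right: pd.merge(left, right, on="sample_id", how="outer"),
    tables,
)

merged.to_csv("merged_dataset.csv", index=False)
```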
Organized Information Database
During merging, data is organized into clearly named columns, in tables with single-value cells. These can be opened in Excel and are ready for more advanced data science work, such as analyses done in Python or R.
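For example, because each cell holds a single value, the merged table loads straight into a dataframe with no reshaping (the file name below is a placeholder, carried over from the sketch above):

```python
import pandas as pd

# Placeholder name for the merged export described above
df = pd.read_csv("merged_dataset.csv")

print(df.shape)        # one row per sample, one clearly named column per value
print(df.describe())   # quick numeric summary, no reshaping required
```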
Merging Briefs
These contain information such as the number of merged surveys, the completeness status of each point, and the number of errors (orphan entries, ambiguous naming, etc.).
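A simplified version of such a brief can be derived from the merged table itself; the sketch below uses hypothetical column names and an arbitrary completeness threshold to flag possible orphan entries:

```python
import pandas as pd

df = pd.read_csv("merged_dataset.csv")  # output of the merge sketch above

# Completeness: fraction of non-empty fields in each sample's record
completeness = df.notna().mean(axis=1)

brief = pd.DataFrame({
    "sample_id": df["sample_id"],          # hypothetical identifier column
    "completeness": completeness.round(2),
    # Hypothetical rule: a record with very few filled fields probably
    # came from a single survey and has no matches elsewhere (an orphan).
    "possible_orphan": completeness < 0.25,
})

print("samples merged:", len(brief))
print("possible orphans:", int(brief["possible_orphan"].sum()))
```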
We are in the process of transforming this software from a specialized solution fitted to our particular needs into a more generally usable continuous data merging tool. When paired with SurveyStack, this will allow any organization to design a complex data collection scheme and get a finished, coherent, and immediately useful database as a result, just by setting parameter functions that define how the data is arranged.
Data analysis is perhaps the most complex stage, and it takes our data point through a host of different tasks and uses.
Cleaning
Different kinds of errors can occur in the process of collecting and analyzing data. Finding and removing these errors is critical for producing dependable insights.
The most typical error case involves outliers: values so far outside the rest of a set that they cannot meaningfully be compared with it. Exceptional cases do exist and such values can occasionally be real, but they are mostly caused by mistakes or failures, and they are not useful for making predictions about the general data set. These points are flagged and omitted from most processes.
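The exact criteria differ from case to case, but a simple interquartile-range rule gives the flavor of how outliers can be flagged rather than deleted; the measurement column name below is hypothetical:

```python
import pandas as pd

def flag_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Mark values lying more than k * IQR beyond the quartiles."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

df = pd.read_csv("merged_dataset.csv")
# Hypothetical measurement column; outliers are marked rather than removed,
# so they can be omitted from most analyses but still inspected later.
df["antioxidants_outlier"] = flag_outliers(df["antioxidants"])
clean = df[~df["antioxidants_outlier"]]
```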
Plotting
We design several kinds of visualizations, which are useful for getting a generalized, global view of the whole population, or for testing hypotheses.
Researchers can get important insights by looking at well-designed plots, which enable them to focus their attention on the relevant details faster.
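As one illustration of the kind of plot involved, a distribution comparison across crops can be built in a few lines; the column names here are placeholders, not our actual schema:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("merged_dataset.csv")

# Hypothetical columns: one measured attribute, grouped by crop type
df.boxplot(column="antioxidants", by="crop", rot=45)
plt.suptitle("")  # drop pandas' automatic super-title
plt.title("Antioxidant levels by crop")
plt.ylabel("antioxidants")
plt.tight_layout()
plt.show()
```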
Modelling
One of our main tasks is modelling, which for us involves using accessible techniques to reliably predict information that is otherwise hard or expensive to measure. Models that predict a response variable from different sets of predictor variables are trained and compared side by side, using a range of modelling algorithms built on different mathematical foundations, each with its own strengths.
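To make that concrete, here is a minimal sketch of how two algorithms with very different mathematical foundations might be compared on the same prediction task using scikit-learn; the spectral and target columns are assumptions for illustration, not our actual model inputs:

```python
import pandas as pd
from sklearn.cross_decomposition import PLSRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

df = pd.read_csv("merged_dataset.csv").dropna(subset=["antioxidants"])

# Hypothetical setup: predict a lab-measured value from UV/Vis spectral columns
spectral_cols = [c for c in df.columns if c.startswith("uvvis_")]
X = df[spectral_cols].fillna(df[spectral_cols].mean())
y = df["antioxidants"]

models = {
    "PLS regression": PLSRegression(n_components=10),
    "Random forest": RandomForestRegressor(n_estimators=300, random_state=0),
}

# Cross-validated R^2 gives a comparable score across very different algorithms
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name}: mean R^2 = {scores.mean():.2f}")
```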
Hypothesis Testing
Hypothesis testing involves using statistical criteria to decide whether the results observed in an experiment support the hypothesis motivating the research at hand.
Our focus so far has been comparing the effects of different farm practices on crop attributes. To do this, we compare groups of samples in specific categories, such as no-till, no-spray, irrigated, organic, or greenhouse-grown, to samples that do not fall into those categories.
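As a simplified illustration, and assuming hypothetical column names, a comparison like "no-till versus everything else" could be run with a non-parametric two-group test:

```python
import pandas as pd
from scipy import stats

df = pd.read_csv("merged_dataset.csv")

# Hypothetical columns: a boolean practice flag and a measured attribute
no_till = df.loc[df["no_till"] == True, "antioxidants"].dropna()
others = df.loc[df["no_till"] == False, "antioxidants"].dropna()

# Mann-Whitney U compares the two groups without assuming normality
stat, p_value = stats.mannwhitneyu(no_till, others, alternative="two-sided")
print(f"U = {stat:.0f}, p = {p_value:.3f}")
```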
Dashboards
Data dashboards allow us to share the information we’ve gathered back to the community. In turn, insights from diverse user perspectives can enrich our interpretations and questions, giving us better insights and sparking new areas of exploration.
We’ve tried to make this as interactive and useful for independent research as possible. The main idea behind the data explorer dashboard is that it allows users to test their own hypotheses by visually comparing different granular selections of points.
You can see it for yourself here!
In-Field Predictions
We’ve just recently reached a stage in which our models are mature enough to deploy. This is the moment when the idea of getting expensive information from cheap information crystallizes. Using the handheld UV/Vis scanner, a user can get estimates of the nutritional density of several crops (currently more than ten!) with a known level of precision. This provides results in under a minute that would otherwise require sending the samples to a lab for time-intensive testing.
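Under the hood, that deployment step amounts to loading a trained model and applying it to a freshly captured spectrum; the file names, model format, and wavelength layout in this sketch are assumptions for illustration:

```python
import numpy as np
import joblib

# Hypothetical: a previously trained and saved calibration model
model = joblib.load("carrot_antioxidants_model.joblib")

# Hypothetical: one UV/Vis spectrum from the handheld scanner, arranged
# in the same wavelength order used when the model was trained
spectrum = np.loadtxt("scan_0042.csv", delimiter=",").reshape(1, -1)

estimate = float(model.predict(spectrum).ravel()[0])
print(f"Estimated value: {estimate:.1f}")
```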
We are currently developing new models that will deliver useful predictions, from determining soil carbon to further details of a crop’s nutritional value.
At this stage of the process, we might say that the original data point is still directly involved: it’s what informs the model, allowing us to better understand new observations in the real world. What makes this system so unique and exciting is its integration: moving from research to prediction on the same platform, following a single data point all the way through. Thank you for coming with us on this journey! And if you have a question or idea to bring to this pipeline, please contact us. We’d love to give it a try.