Blueberry production classification


Context

Blueberries are a profitable fruit exported from South American countries such as Peru, Argentina, and Chile to markets worldwide. Being able to estimate how many kilograms or tons of blueberries a farm will produce is critical for attracting customers and meeting delivery deadlines. Wrong estimates can cause significant economic losses and damage the farm's reputation, making future contracts harder to close.

Figure 1. Weekly yield per plant for two blueberry cultivars. Figure taken from M. E. Cortes, P. A. Mesa, C. M. Grijalba, and M. M. Perez. (2016). Yield and Fruit Quality of the Blueberry Cultivars Biloxi and Sharpblue in Guasca, Colombia.

Agronomists usually consider many variables to determine the weekly yield (Fig. 1). Irrigation, plant nutrition, soil quality, NDVI, and weather are among the most common. However, as crop fields grow in size, manually calculating the yield per sector becomes harder, and the calculations tend to be biased or mistaken.

To make this process easier, agronomists need a tool that helps them make better decisions and gives them a complete picture of the production of their fields. With artificial intelligence, we can predict the production class as low, medium, or high.

Task description

In this challenge, you will classify blueberry production using data from different sources. The dataset provides information on blueberry growth characterization, fertilizers, irrigation, and weather stations from 2021 to 2022. Three classes are defined: low, medium, and high production. The dataset includes the following variables:

  • WEEK: Week of the current year
  • H20: Amount of water provided to the crop (m³)
  • NDVI: Normalized Difference Vegetation Index, a measure of how green the crop is
  • EVO: Water evaporation index
  • LR: Irrigation sheet
  • N: Nitrogen
  • P: Phosphorus
  • K: Potassium
  • TEMPERATURE: Temperature of the crop field (°C)
  • HUMIDITY: Relative humidity of the crop field (%)
  • WIND_VELOCITY: Velocity of wind in the crop field (km/s)
  • RAIN: Rainfall (mm)
  • SOLAR_RADIATION: Solar radiation (W/m²)
  • STEMS_PER_M: Number of stems per meter
  • TOTAL_BUDS: Number of buds per blueberry plant
  • BUDS_PER_BUNCHES: Number of buds per bunch
  • MATURATION_LEVEL: Level of blueberry maturation (%)
  • BRIX_DEGRESS: Average Brix level in blueberries (%)
  • PERCENTAGE_OF_BIG_BUDS: % of relatively large buds
  • PERCENTAGE_OF_MEDIUM_BUDS: % of relatively medium buds
  • PERCENTAGE_OF_SMALL_BUDS: % of relatively small buds
  • ABORTED_PERCENTAGE: % of aborted blueberries in a harvest

The primary goal is to build a robust machine learning model that accurately predicts the production class (low, medium, or high) from the variables listed above.
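As a starting point for the goal above, the following is a minimal baseline sketch and not the organizers' method. It assumes scikit-learn and NumPy are installed, that the three classes are encoded as the integers 0, 1, and 2, and it picks a random forest only as one reasonable default. The function name cross_validated_accuracy is hypothetical, and the data is synthetic so the example runs without the competition files.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score


def cross_validated_accuracy(X, y, n_splits=5, seed=0):
    """Return the mean stratified k-fold accuracy of a random forest baseline."""
    model = RandomForestClassifier(n_estimators=200, random_state=seed)
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = cross_val_score(model, X, y, cv=folds, scoring="accuracy")
    return float(scores.mean())


# Synthetic stand-in for the real data (assumption): 120 rows, 22 feature
# columns, and three classes encoded as 0, 1, 2.
rng = np.random.default_rng(0)
y_demo = np.repeat([0, 1, 2], 40)
X_demo = rng.normal(size=(120, 22)) + y_demo[:, None]  # class-dependent shift

demo_score = cross_validated_accuracy(X_demo, y_demo)
print(f"mean CV accuracy on synthetic data: {demo_score:.3f}")
```

With the real data, X would hold the 22 listed variables and y the production class. Stratified folds keep the class proportions similar in every split, which matters if one production class is rare.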


Dataset

https://www.kaggle.com/competitions/blueberry-production-classification-competition/data

Files

  • train.csv – the training set
  • test.csv – the test set
  • sample_submission.csv – a sample submission file in the correct format
  • metaData.csv – supplemental information about the data

Columns

  • WEEK: Week of the current year
  • H20: Amount of water provided to the crop (m³)
  • NDVI: Normalized Difference Vegetation Index, a measure of how green the crop is
  • EVO: Water evaporation index
  • LR: Irrigation sheet
  • N: Nitrogen
  • P: Phosphorus
  • K: Potassium
  • TEMPERATURE: Temperature of the crop field (°C)
  • HUMIDITY: Relative humidity of the crop field (%)
  • WIND_VELOCITY: Velocity of wind in the crop field (km/s)
  • RAIN: Rainfall (mm)
  • SOLAR_RADIATION: Solar radiation (W/m²)
  • STEMS_PER_M: Number of stems per meter
  • TOTAL_BUDS: Number of buds per blueberry plant
  • BUDS_PER_BUNCHES: Number of buds per bunch
  • MATURATION_LEVEL: Level of blueberry maturation (%)
  • BRIX_DEGRESS: Average Brix level in blueberries (%)
  • PERCENTAGE_OF_BIG_BUDS: % of relatively large buds
  • PERCENTAGE_OF_MEDIUM_BUDS: % of relatively medium buds
  • PERCENTAGE_OF_SMALL_BUDS: % of relatively small buds
  • ABORTED_PERCENTAGE: % of aborted blueberries in a harvest
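
Before modeling, it can help to confirm that a file actually contains the columns listed above. The sketch below uses only the Python standard library. The helper name missing_columns is hypothetical, and the in-memory header stands in for a real file such as train.csv. Whether the real files also carry ID and TARGET columns alongside these is an assumption here.

```python
import csv
import io

# Column names exactly as spelled in the list above.
EXPECTED_COLUMNS = [
    "WEEK", "H20", "NDVI", "EVO", "LR", "N", "P", "K",
    "TEMPERATURE", "HUMIDITY", "WIND_VELOCITY", "RAIN", "SOLAR_RADIATION",
    "STEMS_PER_M", "TOTAL_BUDS", "BUDS_PER_BUNCHES", "MATURATION_LEVEL",
    "BRIX_DEGRESS", "PERCENTAGE_OF_BIG_BUDS", "PERCENTAGE_OF_MEDIUM_BUDS",
    "PERCENTAGE_OF_SMALL_BUDS", "ABORTED_PERCENTAGE",
]


def missing_columns(csv_file):
    """Return the expected columns that are absent from the CSV header."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []  # fieldnames is None for an empty file
    return [name for name in EXPECTED_COLUMNS if name not in header]


# Demo on an in-memory header that lacks most columns (illustrative only).
demo_file = io.StringIO("ID,WEEK,H20\n1,12,3.5\n")
demo_missing = missing_columns(demo_file)
print(f"{len(demo_missing)} expected columns are missing")
```

For a file on disk, the same helper can be called inside `with open("train.csv", newline="") as f:`.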

Evaluation method

Accuracy

Accuracy measures how often a model’s predictions match the actual labels. It is commonly used in classification tasks and is defined as the proportion of correctly predicted labels out of the total number of instances.

Formally, accuracy is expressed as:

$$ \text{Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\!\left(\hat{y}_i = y_i\right) $$

where:

\[ \begin{aligned} N &\text{ is the total number of samples,} \\ y_i &\text{ is the true label for the } i\text{th sample,} \\ \hat{y}_i &\text{ is the predicted label for the } i\text{th sample,} \\ \mathbf{1}(\hat{y}_i = y_i) &\text{ is an indicator function that returns 1 if } \hat{y}_i = y_i,\text{ otherwise 0.} \end{aligned} \]
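
The definition above, the proportion of correctly predicted labels, can be sketched in plain Python. The function name accuracy and the example labels are illustrative. The integer encoding of the classes is an assumption.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of positions where the prediction equals the label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("accuracy is undefined for zero samples")
    # Sum of the indicator 1(y_hat_i == y_i), divided by N.
    correct = sum(1 for truth, guess in zip(y_true, y_pred) if truth == guess)
    return correct / len(y_true)


# Three of the four illustrative predictions match the true labels.
example_accuracy = accuracy([0, 1, 2, 2], [0, 1, 1, 2])
print(example_accuracy)  # 0.75
```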

Submission Files

For each ID in the test set, you must predict the production class for the TARGET variable. The file should contain a header and have the following format:

ID,TARGET
2,0
5,0
6,0
etc.
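
A file in the ID,TARGET layout shown above can be produced with the standard csv module. This is a minimal sketch: write_submission is a hypothetical helper, and the IDs and predictions are the illustrative values from the sample.

```python
import csv
import io


def write_submission(ids, predictions, out_file):
    """Write a header row followed by one ID,TARGET row per test sample."""
    if len(ids) != len(predictions):
        raise ValueError("ids and predictions must have the same length")
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow(["ID", "TARGET"])
    for row_id, label in zip(ids, predictions):
        writer.writerow([row_id, label])


# Demo: write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
write_submission([2, 5, 6], [0, 0, 0], buffer)
submission_text = buffer.getvalue()
print(submission_text)
```

To write a real file, pass a handle opened with `open("submission.csv", "w", newline="")` in place of the buffer.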


Citation

Erick Fiestas S. Blueberry production classification competition. https://kaggle.com/competitions/blueberry-production-classification-competition, 2025. Kaggle.