For this project, we are interested in identifying the best machine learning approach for predicting climate change in the places where people live.
We have chosen three models. As applied here, all three are trained on labelled data, which makes them supervised learning models:
- K-Nearest Neighbour (KNN)
- Decision Tree
- Artificial Neural Network
The data is obtained from the European Climate Assessment & Data Set (ECA&D) project. The dataset consists of weather observations from 18 European weather stations, spanning the late 1800s to 2022. Recordings exist for almost every day and include values such as temperature, wind speed, snow, and global radiation.
Machine learning (ML) algorithms are a set of techniques that allow computers to learn from data and make predictions or decisions without being explicitly programmed for a specific task. ML can be highly beneficial for analysing and predicting weather patterns, for example forecasting temperature, humidity, wind speed, or rainfall. ML algorithms can also be used to detect unusual events or patterns such as heatwaves or unseasonal rains. For this project, we are using three ML algorithms:
- K-Nearest Neighbour (KNN): a simple, versatile, and widely used algorithm for both classification and regression tasks. It is a supervised learning algorithm, meaning it relies on labelled training data to make predictions. A new observation is assigned the label that is most common among its k closest training examples.
- Decision Tree: also used for classification and regression tasks. It works by repeatedly splitting the data into subsets based on the values of the input features, which makes it highly interpretable and effective for a range of practical applications.
- Artificial Neural Network (ANN): used for a wide range of machine learning tasks, including classification and regression as well as more complex problems such as image recognition, natural language processing, and time-series forecasting.
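To make the KNN description above concrete, here is a minimal pure-Python sketch of majority-vote classification. The feature pairs (temperature, radiation) and the 0/1 labels are made-up illustrative values, not taken from the ECA&D data, and `knn_predict` is a hypothetical helper rather than the code used in this project.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Return the majority label among the k training points closest to query."""
    # Pair every training point's Euclidean distance to the query with its label.
    distances = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    # Keep the k nearest neighbours (smallest distances first).
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the neighbours' labels.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: (temperature in degrees C, radiation) -> label 0 or 1.
train_X = [(2.0, 0.3), (4.0, 0.5), (5.0, 0.4), (21.0, 2.5), (24.0, 2.9), (26.0, 3.1)]
train_y = [0, 0, 0, 1, 1, 1]

warm_day = knn_predict(train_X, train_y, (23.0, 2.7))
cold_day = knn_predict(train_X, train_y, (3.0, 0.4))
print(warm_day, cold_day)
```

In practice the features would normally be scaled first, because KNN's distance calculation is dominated by whichever feature has the largest numeric range.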
To address ethical concerns, we have to consider any bias that may impact how the analysis is conducted and the results thereof. Bias in machine learning can affect model performance, accuracy, fairness, and overall generalisability.
Some biases observed in this project include:
- Collection Bias: The data was collected from 18 weather stations. However, according to ECA&D there are a total of 23755 weather stations across Europe, so these 18 stations may not be a representative sample.
- Temporal Bias: The data covers a very long period (1800s to 2022), so some of the older data is likely no longer relevant and could distort the models' results.
- Location Bias: The data has been collected only from European weather stations, so the models may not be able to predict weather patterns in other parts of the world, where climates differ.
We ran the data through a KNN model, which yielded a mean accuracy score of 88.15% across the 15 weather stations in the table. Sonnblick shows an accuracy score of 100%, but all 5738 of its observations fall into a single class (there are no positive cases), so this score is trivially achievable and is not evidence of predictive skill. It may also be a sign of overfitting, i.e. the model has over-adapted to the training data and captures even random fluctuations. Excluding Sonnblick, Valentia has the best accuracy score at 95.83%, well above the mean of 88%.
Station | True Negative | True Positive | Misclassified (Actual Negative) | Misclassified (Actual Positive) | Accuracy |
---|---|---|---|---|---|
BASEL | 3907 | 935 | 465 | 431 | 84.38% |
BELGRADE | 3238 | 1502 | 460 | 538 | 82.61% |
BUDAPEST | 3416 | 1432 | 406 | 484 | 84.49% |
DEBILT | 4346 | 732 | 369 | 291 | 88.50% |
DUSSELDORF | 4167 | 800 | 431 | 340 | 86.56% |
HEATHROW | 4161 | 754 | 414 | 409 | 85.66% |
KASSEL | 4563 | 607 | 316 | 252 | 90.10% |
LJUBLJANA | 3726 | 1133 | 410 | 469 | 84.68% |
MAASTRICHT | 4249 | 819 | 357 | 313 | 88.32% |
MADRID | 2735 | 2257 | 313 | 433 | 87.00% |
MUNCHEN | 4222 | 766 | 426 | 324 | 86.93% |
OSLO | 4624 | 507 | 352 | 255 | 89.42% |
SONNBLICK | 5738 | 0 | 0 | 0 | 100.00% |
STOCKHOLM | 4449 | 588 | 384 | 317 | 87.78% |
VALENTIA | 5391 | 108 | 168 | 71 | 95.83% |
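The accuracy column can be reproduced from the counts in the table. The sketch below assumes that the first two count columns hold the correctly classified observations and the last two hold the misclassified ones. That reading is an assumption, but it is consistent with every row: each row sums to 5738 and the first two columns divided by that total give the listed accuracy.

```python
# Counts per station, copied from the table above, in column order:
# (correct negatives, correct positives, misclassified, misclassified).
counts = {
    "BASEL": (3907, 935, 465, 431),
    "BELGRADE": (3238, 1502, 460, 538),
    "BUDAPEST": (3416, 1432, 406, 484),
    "DEBILT": (4346, 732, 369, 291),
    "DUSSELDORF": (4167, 800, 431, 340),
    "HEATHROW": (4161, 754, 414, 409),
    "KASSEL": (4563, 607, 316, 252),
    "LJUBLJANA": (3726, 1133, 410, 469),
    "MAASTRICHT": (4249, 819, 357, 313),
    "MADRID": (2735, 2257, 313, 433),
    "MUNCHEN": (4222, 766, 426, 324),
    "OSLO": (4624, 507, 352, 255),
    "SONNBLICK": (5738, 0, 0, 0),
    "STOCKHOLM": (4449, 588, 384, 317),
    "VALENTIA": (5391, 108, 168, 71),
}


def accuracy(row):
    """Share of observations that were classified correctly."""
    correct_neg, correct_pos, wrong_a, wrong_b = row
    total = correct_neg + correct_pos + wrong_a + wrong_b
    return (correct_neg + correct_pos) / total


per_station = {name: accuracy(row) for name, row in counts.items()}
mean_accuracy = sum(per_station.values()) / len(per_station)

for name, acc in per_station.items():
    print(f"{name:<12}{acc:.2%}")
print(f"{'MEAN':<12}{mean_accuracy:.2%}")
```

Because accuracy only counts correct predictions, it says little when one class dominates, as the Sonnblick row illustrates.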
The decision tree recognises patterns in the data and uses them to split the data into progressively smaller subsets. The decision tree we created is deep and complex, so it is likely overfitting. Pruning the tree would reduce its complexity and may improve its predictive accuracy on unseen data.
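One common way to prune is cost-complexity pruning. The sketch below uses scikit-learn's `DecisionTreeClassifier` with its `ccp_alpha` parameter on synthetic data from `make_classification`. The dataset and the value `ccp_alpha=0.005` are illustrative assumptions and do not reflect the settings used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 10 features, some label noise via flip_y.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=4, flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# An unrestricted tree keeps splitting until its leaves are pure.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# ccp_alpha > 0 removes branches whose impurity reduction does not justify
# the extra leaves (minimal cost-complexity pruning).
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.005, random_state=0).fit(
    X_train, y_train
)

for name, tree in [("full", full_tree), ("pruned", pruned_tree)]:
    print(
        f"{name:<7} depth={tree.get_depth():<3} leaves={tree.get_n_leaves():<4} "
        f"train={tree.score(X_train, y_train):.3f} "
        f"test={tree.score(X_test, y_test):.3f}"
    )
```

A suitable `ccp_alpha` is usually chosen by cross-validation. Limiting `max_depth` or raising `min_samples_leaf` are simpler alternatives that also restrain the tree.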
For the first run of the ANN, we obtained accuracy scores of 50.02% and 49.93% on the training and test data, respectively, which is close to chance level if the target is binary. After changing the number of hidden layers and the number of iterations the model runs through, the scores improved to 88.49% (training) and 59.72% (test). The large gap between training and test accuracy suggests that the tuned network is overfitting the training data.
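The effect of changing the hidden layers and iteration count can be explored by comparing training and test accuracy side by side. The sketch below uses scikit-learn's `MLPClassifier` on synthetic data. The two configurations, the feature scaling step, and the dataset are assumptions for illustration and are not the settings behind the scores reported above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption), split into training and test sets.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Two hypothetical configurations: a small network and a larger one.
configs = {
    "small": {"hidden_layer_sizes": (5,), "max_iter": 200},
    "larger": {"hidden_layer_sizes": (50, 50), "max_iter": 1000},
}

results = {}
for name, params in configs.items():
    # Scaling the inputs generally helps neural networks converge.
    model = make_pipeline(StandardScaler(), MLPClassifier(random_state=0, **params))
    model.fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    results[name] = (train_acc, test_acc)
    # A large positive gap between the two scores points to overfitting.
    print(f"{name:<7} train={train_acc:.3f} test={test_acc:.3f} "
          f"gap={train_acc - test_acc:+.3f}")
```

If the gap stays large, options such as a larger `alpha` (L2 penalty) or `early_stopping=True` in `MLPClassifier` can be tried to reduce it.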