This project is a semester project, completed with three other group members, with the goal of using big data principles to process the data intake from the Copernicus Sentinel-2 satellite. Ingested data is processed through an NDWI (Normalized Difference Water Index) image filter to understand how beaches and shorelines are changing. The results are shown in an Android app that relies on a backend written in ASP.NET.
The remainder of this page will focus on my contribution to the project.
The infrastructure is designed to provide a reliable and secure platform for real-time processing of satellite imagery of shorelines worldwide. It consists of three servers, located in:
These three nodes make up the cluster. Each server is unmanaged and runs a clean Ubuntu 20.04 installation.
To ensure secure communication and seamless coordination between the servers, they are placed within the same subnet. The cluster uses Docker Swarm to deploy and manage services across the nodes. Docker Swarm lets a manager node distribute services intelligently and automatically handles load balancing, fault tolerance, and service placement constraints.
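Forming such a swarm is mostly a one-off step. As a rough sketch of what it involves, using the Docker SDK for Python rather than the docker CLI (the subnet address is a placeholder, not the cluster's real one):

```python
import docker

client = docker.from_env()

# On the manager node: initialize the swarm, advertising the address the
# node has on the shared subnet (placeholder IP).
client.swarm.init(advertise_addr="10.0.0.1")

# Read the worker join token; each of the other two servers would run
#   client.swarm.join(remote_addrs=["10.0.0.1:2377"], join_token=token)
# against its own local Docker daemon to enter the cluster.
token = client.swarm.attrs["JoinTokens"]["Worker"]
print("Worker join token:", token)
```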
Deploying code is quite straightforward with Docker Swarm. All relevant code is containerized and pushed to our Docker registry, and the image is then used as a distributed service within the cluster. The entire cluster, except for the database, is managed by Docker Swarm.
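As a hedged sketch of that workflow, again via the Docker SDK for Python (the registry address, image name, replica count, constraint, and network name are placeholders, not the project's actual values):

```python
import docker
from docker.types import ServiceMode

client = docker.from_env()

# Build the service image and push it to the (placeholder) private registry.
client.images.build(path=".", tag="registry.example.com/ecobeach/ndwi-scraper:latest")
client.images.push("registry.example.com/ecobeach/ndwi-scraper", tag="latest")

# Create a replicated Swarm service from the pushed image. Swarm schedules
# the replicas across the nodes and load-balances between them; placement
# constraints (here: worker nodes only) are enforced the same way.
client.services.create(
    image="registry.example.com/ecobeach/ndwi-scraper:latest",
    name="ndwi-scraper",
    mode=ServiceMode("replicated", replicas=3),
    constraints=["node.role == worker"],
    networks=["ecobeach-net"],  # assumed overlay network name
)
```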
To manage intercommunication between the growing number of services in the EcoBeach cluster, Kafka is used. Kafka enables a publish-subscribe event pattern and provides a message stream with multiple topics that services can publish to or subscribe to. The cluster runs three Kafka brokers, one on each server, and the Helsinki node runs ZooKeeper to synchronize incoming jobs and keep track of Kafka activity. Running a broker on each node introduces redundancy and better distribution of messages. Kafka also provides Kafka Connect, which integrates external services via plugins. Communication between brokers and services is limited to internal networking for increased security.
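A minimal sketch of the publish-subscribe pattern, assuming the kafka-python client; the broker addresses, topic, keys, and the handler are placeholders for illustration only:

```python
from kafka import KafkaProducer, KafkaConsumer

# The three brokers, reachable only on the cluster's internal network
# (placeholder addresses and topic name).
BROKERS = ["10.0.0.1:9092", "10.0.0.2:9092", "10.0.0.3:9092"]
TOPIC = "ndwi-images"

# A service publishing to a topic; the message is replicated across brokers
# according to the topic's replication factor.
producer = KafkaProducer(bootstrap_servers=BROKERS)
producer.send(TOPIC, key=b"region:baltic", value=b"<grayscale image bytes>")
producer.flush()

# Another service subscribing to the same topic as part of a consumer group,
# so replicas of that service share the topic's partitions between them.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKERS,
    group_id="image-analyzer",
    auto_offset_reset="earliest",
)
for message in consumer:
    handle_image(message.key, message.value)  # hypothetical handler
```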
Kafka, Hadoop HDFS, Apache Spark, and the EcoBeach codebase (NDWI Scraper, Image Analyzer, and REST API) are all managed through Docker Swarm. The NDWI scraper downloads satellite imagery, converts it into grayscale images, and publishes them to Kafka for further processing. The services are highly available and replicated across the cluster for increased data intake and throughput. Replicated services are responsible for different regions, enabling granular control and the temporary disabling of regions when the cluster is under strain.
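Conceptually, the NDWI step is a simple band calculation. A minimal sketch, assuming the green and NIR bands (Sentinel-2 B03 and B08) have already been downloaded as separate rasters and that rasterio is available to read them; paths and scaling choices are illustrative, not the scraper's actual implementation:

```python
import numpy as np
import rasterio  # assumed reader for the downloaded Sentinel-2 band rasters

def ndwi_to_grayscale(green_path: str, nir_path: str) -> np.ndarray:
    """Compute NDWI from the green (B03) and NIR (B08) bands and rescale it
    to an 8-bit grayscale image. File paths are placeholders."""
    with rasterio.open(green_path) as g, rasterio.open(nir_path) as n:
        green = g.read(1).astype("float32")
        nir = n.read(1).astype("float32")

    # NDWI (McFeeters): (Green - NIR) / (Green + NIR), in [-1, 1].
    # Water pixels tend toward +1, land and vegetation toward -1.
    ndwi = (green - nir) / np.maximum(green + nir, 1e-6)

    # Map [-1, 1] to [0, 255] so the result can be stored and published
    # as a plain grayscale image.
    return ((ndwi + 1.0) * 127.5).astype("uint8")
```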
The use of Copernicus for data ingestion was a requirement for this project. Unfortunately, from a big data perspective, the data made available by Copernicus is inherently problematic, as it is not refreshed as often as in many other big data use cases. As a result, many services stay dormant while waiting for new data to become available.
Ultimately, the project was still fun and challenging, and it gave me good insight into some of the significant pitfalls people encounter when setting up a cluster that has to scale well and handle a large data intake.