From Data Chaos to Data Insights with Google Cloud and GitLab CI: A Cutting-Edge Solution

Gursimar Singh
Google Cloud - Community
Apr 30, 2023

Let’s look at a streamlined, effective approach that can help us extract valuable insights from our data and take the chaos out of manual data deployment and analysis.

With Google Cloud, you get access to a robust infrastructure and a suite of useful tools for handling your data with ease. Google Cloud offers a comprehensive set of services for storing, processing, and analyzing data, from cloud storage to computing instances.

Meanwhile, GitLab CI provides a convenient way to automate your deployments. It makes it simple to build, test, and deploy code to the cloud while taking advantage of version control and collaboration.

Combine the power of Google Cloud and GitLab CI to streamline your deployment process and glean actionable insights from your data.

Goal

To automate the deployment of a Dockerized application using GitLab CI and the following infrastructure and tools: Google Compute Engine (GCE), Docker, Terraform, and Python.

Task

  1. Regularly (e.g., daily or hourly) download one or more datasets from a free data provider, for example https://github.com/CSSEGISandData/COVID-19 or https://openweathermap.org/current
  2. Store the downloaded dataset on cloud storage. From every downloaded dataset, extract some specific data (e.g. data relevant to Czechia or Prague).
  3. Display all extracted data using a single HTML page served from the cloud storage. A simple table.

Outline of the steps you need to follow:

  1. Determine the dataset you wish to fetch periodically from a free data provider. For example, the COVID-19 data is available at https://github.com/CSSEGISandData/COVID-19
  2. Write a Python script that downloads the dataset from the data provider using the requests or urllib library. You can use the argparse or click library to pass arguments such as the frequency of downloading, the destination folder, and the specific data to extract. For example, you can run `python download.py --frequency daily --folder covid_data --country Czechia`
  3. Using the Google Cloud Storage library, write a Python script that reads the downloaded dataset from your cloud storage and renders it as a table on an HTML page. The dataset can be manipulated with the pandas or csv library, and the HTML page can be generated with the jinja2 library (or plain string formatting); for instance, `python generate.py --folder covid_data --output covid.html` would do the trick.
  4. Create a Dockerfile that includes the Python scripts you want to run, their dependencies, and the base image. You may use python:3 as your base image and pip to install the necessary libraries. Arguments can also be passed to your scripts via environment variables. For example, you can use `ENV FREQUENCY=daily` and `CMD python download.py --frequency $FREQUENCY --folder covid_data --country Czechia` (shell form, so the variable is expanded at run time).
  5. Use the `docker build` and `docker run` commands to build and test your Docker image on your local machine. The same image must download the dataset, save it to cloud storage, and generate the HTML page.
  6. Upload your Docker image to a registry such as Docker Hub or Google Container Registry using the `docker push` command. Before you can push, you’ll need to tag your image for the registry and authenticate with it.
  7. Create a Terraform configuration file that specifies the cloud resources your application will use. To run your Docker image you will need a compute instance and a service account with access to your cloud storage bucket. You will also need to pass the Docker image’s environment variables using input variables or outputs of other resources; for instance, you can use `variable "frequency" { default = "daily" }` and `resource "google_compute_instance" "app" { metadata = { frequency = var.frequency } }`
  8. Initialize and apply your Terraform configuration using the `terraform init` and `terraform apply` commands. Make sure your cloud resources are created and configured correctly.
  9. Create a GitLab CI configuration file (.gitlab-ci.yml) that defines the stages and jobs for automating the deployment of your application to your cloud environment. You will need to use the docker-in-docker service or the kaniko executor to build and push your Docker image from GitLab CI. You will also need to install the Terraform CLI and authenticate with your cloud provider using service account credentials or environment variables. You can use GitLab CI/CD variables (masked or protected where appropriate) to store sensitive information such as registry or cloud credentials. For example, you can define `variables: {REGISTRY_USER: $REGISTRY_USER, REGISTRY_PASSWORD: $REGISTRY_PASSWORD}` and run `echo "$REGISTRY_PASSWORD" | docker login -u "$REGISTRY_USER" --password-stdin` in a job's script.
  10. Verify that your pipeline can successfully build your Docker image, push it, and apply your Terraform configuration. Don’t forget to double-check that the dataset is being downloaded on schedule and that the HTML page in cloud storage is being updated. You can verify the health of your Docker image by inspecting the logs of your compute instance.

Now, let’s jump into the code:

Code snippet for download.py with frequency daily, folder covid_data, and country Czechia:

# Import libraries
import requests
import argparse
import os
import datetime

# Define arguments
parser = argparse.ArgumentParser(description="Download COVID-19 data")
parser.add_argument("--frequency", type=str, choices=["daily", "hourly"], default="daily", help="How often to download the data")
parser.add_argument("--folder", type=str, default="covid_data", help="The destination folder to store the data")
parser.add_argument("--country", type=str, default="Czechia", help="The country to extract the data for")
args = parser.parse_args()

# Define constants
DATA_URL = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"
BUCKET_NAME = "your-bucket-name" # Change this to your cloud storage bucket name

# Create the destination folder if it does not exist
if not os.path.exists(args.folder):
    os.makedirs(args.folder)

# Download the data from the data provider
response = requests.get(DATA_URL)
if response.status_code == 200:
    # Save the data to a local file with a timestamp
    timestamp = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
    filename = f"{args.folder}/covid_{timestamp}.csv"
    with open(filename, "w") as f:
        f.write(response.text)
    # Upload the file to your cloud storage bucket using the gsutil command
    os.system(f"gsutil cp {filename} gs://{BUCKET_NAME}/{args.folder}/")
    # Delete the local file
    os.remove(filename)
else:
    # Handle the error
    print(f"Failed to download the data: {response.status_code}")

Code snippet for generate.py with folder covid_data and output covid.html:

# Import libraries
import pandas as pd
import argparse
import os
from jinja2 import Template

# Define arguments
parser = argparse.ArgumentParser(description="Generate HTML page from COVID-19 data")
parser.add_argument("--folder", type=str, default="covid_data", help="The source folder to read the data from")
parser.add_argument("--output", type=str, default="covid.html", help="The output file name for the HTML page")
args = parser.parse_args()

# Define constants
BUCKET_NAME = "your-bucket-name" # Change this to your cloud storage bucket name
TEMPLATE = """
<html>
<head>
<title>COVID-19 Data</title>
</head>
<body>
<h1>COVID-19 Data</h1>
<table border="1">
<tr>
<th>Date</th>
<th>Country</th>
<th>Cases</th>
</tr>
{% for row in data %}
<tr>
<td>{{ row.date }}</td>
<td>{{ row.country }}</td>
<td>{{ row.cases }}</td>
</tr>
{% endfor %}
</table>
</body>
</html>
"""

# Find the latest file in the source folder using the gsutil command
files = os.popen(f"gsutil ls gs://{BUCKET_NAME}/{args.folder}/").read().split()
latest_file = files[-1]

# Copy it down and work with the local copy
local_file = os.path.join(args.folder, os.path.basename(latest_file))
os.makedirs(args.folder, exist_ok=True)
os.system(f"gsutil cp {latest_file} {args.folder}/")

# Load the data using pandas and filter by country
df = pd.read_csv(local_file)
df = df[df["Country/Region"] == args.country]

# Drop the metadata columns, then transpose so each row is a date with its case count
df = df.drop(columns=["Province/State", "Country/Region", "Lat", "Long"])
df = df.T
df.reset_index(inplace=True)
df.columns = ["date", "cases"]
df["country"] = args.country

# Convert the data to a list of dictionaries
data = df.to_dict(orient="records")

# Render the template with the data and save to the output file
template = Template(TEMPLATE)
html = template.render(data=data)
with open(args.output, "w") as f:
    f.write(html)

# Upload the output file to your cloud storage bucket using the gsutil command
os.system(f"gsutil cp {args.output} gs://{BUCKET_NAME}/{args.output}")

# Delete the local files
os.remove(local_file)
os.remove(args.output)
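
As a side note, if you would rather not maintain the HTML template by hand, pandas can render the table for you. Here is a minimal sketch that assumes the same filtered DataFrame df and args as above:

# Alternative: let pandas render the HTML table directly
table_html = df[["date", "country", "cases"]].to_html(index=False)
page = f"<html><head><title>COVID-19 Data</title></head><body><h1>COVID-19 Data</h1>{table_html}</body></html>"
with open(args.output, "w") as f:
    f.write(page)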

Dockerfile that can run your Python scripts:

# Use python:3 as the base image
FROM python:3

# Set the working directory to /app
WORKDIR /app

# Copy the requirements.txt file to /app
COPY requirements.txt /app

# Install the dependencies using pip
RUN pip install -r requirements.txt

# Copy the Python scripts to /app
COPY download.py /app
COPY generate.py /app

# Set the environment variables for the arguments
ENV FREQUENCY=daily
ENV FOLDER=covid_data
ENV COUNTRY=Czechia
ENV OUTPUT=covid.html

# Run both scripts as the default command (shell form, so the environment variables are expanded at run time)
CMD python download.py --frequency $FREQUENCY --folder $FOLDER --country $COUNTRY && \
    python generate.py --folder $FOLDER --output $OUTPUT

Terraform configuration file for Google Cloud with a service account and a compute instance:

# Specify the provider and the project ID
provider "google" {
  project = "your-project-id" # Change this to your Google Cloud project ID
}

# Create a service account for accessing the cloud storage bucket
resource "google_service_account" "app" {
  account_id   = "app"
  display_name = "App Service Account"
}

# Create a cloud storage bucket and grant the service account read/write access
resource "google_storage_bucket" "bucket" {
  name          = "your-bucket-name" # Change this to your cloud storage bucket name
  location      = "US"
  force_destroy = true
}

resource "google_storage_bucket_iam_member" "app" {
  bucket = google_storage_bucket.bucket.name
  role   = "roles/storage.objectAdmin"
  member = "serviceAccount:${google_service_account.app.email}"
}

# Create a compute instance and assign the service account
resource "google_compute_instance" "app" {
  name         = "app"
  machine_type = "f1-micro"
  zone         = "us-central1-a"

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-11"
    }
  }

  network_interface {
    network = "default"
    access_config {
      // Ephemeral IP
    }
  }

  service_account {
    email  = google_service_account.app.email
    scopes = ["cloud-platform"]
  }

  # Set the environment variables for the docker image as instance metadata
  metadata = {
    frequency = var.frequency # Use input variable for frequency
    folder    = var.folder    # Use input variable for folder
    country   = var.country   # Use input variable for country
    output    = var.output    # Use input variable for output
  }

  # Run the docker image using a startup script; the input variables are interpolated
  # into the script by Terraform before it runs on the instance
  metadata_startup_script = <<-EOT
    # Install docker
    apt-get update
    apt-get install -y docker.io

    # Pull the docker image from the registry
    docker pull your-registry-name/your-image-name:your-image-tag # Change this to your docker image name and tag

    # Run the docker image with the environment variables
    docker run -e FREQUENCY=${var.frequency} -e FOLDER=${var.folder} -e COUNTRY=${var.country} -e OUTPUT=${var.output} your-registry-name/your-image-name:your-image-tag # Change this to your docker image name and tag
  EOT
}

# Define input variables for the environment variables of the docker image
variable "frequency" {
  type        = string
  description = "How often to download the data"
  default     = "daily"
}

variable "folder" {
  type        = string
  description = "The destination folder to store the data"
  default     = "covid_data"
}

variable "country" {
  type        = string
  description = "The country to extract the data for"
  default     = "Czechia"
}

variable "output" {
  type        = string
  description = "The output file name for the HTML page"
  default     = "covid.html"
}

The next step after writing the Terraform configuration file is to initialize and apply it using the `terraform init` and `terraform apply` commands. This will create and configure the cloud resources you defined in your file.

Here is an example of how to do that:

  1. Open a terminal and navigate to the folder where your Terraform configuration file is located.
  2. Run `terraform init` to initialize Terraform and download the required plugins for your provider.
  3. Run `terraform apply` to review and confirm the changes that Terraform will make to your infrastructure.
  4. Enter `yes` when prompted to apply the changes.
  5. Wait for Terraform to create and configure the cloud resources.
  6. Check the output of Terraform to see the details of the resources that were created.

The next step after applying the Terraform configuration is to create a GitLab repository and push your source code, including your Python scripts, Dockerfile, and Terraform configuration file. This stores your code in a version control system and allows you to collaborate with your team.

Here’s how you can do that:

  1. Go to https://gitlab.com/ and sign in or sign up for an account.
  2. Click on the New project button and choose a name and visibility level for your project.
  3. Copy the URL of your project and go back to your terminal.
  4. Navigate to the folder where your source code is located.
  5. Run `git init` to initialize a local Git repository.
  6. Run `git add .` to stage all your files for commit.
  7. Run `git commit -m "Initial commit"` to commit your files with a message.
  8. Run `git remote add origin <your-project-url>` to add your GitLab project as a remote repository.
  9. Run `git push -u origin master` to push your files to your GitLab project.

The next step after pushing your source code to your Gitlab project is to create a Gitlab CI configuration file (.gitlab-ci.yml) that defines the stages and jobs for automating the deployment of your application to your cloud environment. This will enable you to use Gitlab CI to build and push your docker image and apply your Terraform configuration whenever you make changes to your code.

  1. Create a file named .gitlab-ci.yml in the root folder of your project.
  2. Write the following content in the file:
# Define the stages of the pipeline
stages:
  - build
  - deploy

# Define the variables for the registry and the bucket
variables:
  REGISTRY_NAME: your-registry-name # Change this to your registry name
  REGISTRY_USER: $REGISTRY_USER # Use a GitLab CI/CD variable for the registry user
  REGISTRY_PASSWORD: $REGISTRY_PASSWORD # Use a GitLab CI/CD variable for the registry password
  IMAGE_NAME: your-image-name # Change this to your image name
  IMAGE_TAG: latest # Use latest as the image tag
  BUCKET_NAME: your-bucket-name # Change this to your bucket name

# Define the build job that builds and pushes the docker image
build:
  stage: build
  image: docker:19.03.12 # Use the docker image to run the job
  services:
    - docker:19.03.12-dind # Use the docker-in-docker service
  script:
    - docker login -u $REGISTRY_USER -p $REGISTRY_PASSWORD $REGISTRY_NAME # Log in to the registry
    - docker build -t $REGISTRY_NAME/$IMAGE_NAME:$IMAGE_TAG . # Build the docker image
    - docker push $REGISTRY_NAME/$IMAGE_NAME:$IMAGE_TAG # Push the docker image

# Define the deploy job that applies the Terraform configuration
deploy:
  stage: deploy
  image:
    name: hashicorp/terraform:1.0.11 # Use the terraform image to run the job
    entrypoint: [""] # Override the image entrypoint so the script commands run in a shell

  before_script:
    - rm -rf .terraform # Remove any existing terraform files
    # Authenticate with Google Cloud using the service account key stored as a GitLab CI/CD variable.
    # Note: gcloud and gsutil are not bundled with the terraform image, so install the Cloud SDK
    # here or use an image that includes both tools.
    - echo "$GCLOUD_SERVICE_KEY" > /tmp/gcloud-key.json
    - gcloud auth activate-service-account --key-file /tmp/gcloud-key.json
    - export GOOGLE_APPLICATION_CREDENTIALS=/tmp/gcloud-key.json # Let the Terraform Google provider use the same key
    - gsutil cp gs://$BUCKET_NAME/terraform.tfstate terraform.tfstate || true # Copy the terraform state file from the bucket if it exists

  script:
    - terraform init # Initialize terraform
    - terraform apply -auto-approve # Apply the terraform configuration without prompting for confirmation

  after_script:
    - gsutil cp terraform.tfstate gs://$BUCKET_NAME/terraform.tfstate # Copy the terraform state file back to the bucket

3. Save and commit the file, then push it to your GitLab project.

The next step after creating the Gitlab CI configuration file is to test your Gitlab CI pipeline by creating a merge request or pushing a commit to your repository. This will trigger the pipeline to run and execute the build and deploy jobs.

Here’s how you can do that:

  1. Go to your Gitlab project and click on the Merge requests tab.
  2. Click on the New merge request button and choose a source branch and a target branch for your merge request.
  3. Click on the Compare branches and continue button and fill in the title and description for your merge request.
  4. Click on the Submit merge request button and wait for the pipeline to start.
  5. Click on the CI/CD tab and monitor the progress of the pipeline.
  6. Check the logs of each job and see if there are any errors or warnings.
  7. If the pipeline succeeds, go to your cloud console and verify that your resources are created and configured correctly. If the pipeline fails, go back to your code and fix any issues and push again.

The next step after testing your Gitlab CI pipeline is to check your cloud storage and verify that your dataset is downloaded regularly and your HTML page is updated accordingly. This will confirm that your application is working as expected.

Here’s how you can do that:

  1. Go to your cloud console and navigate to the cloud storage section.
  2. Click on the name of your bucket and see the list of files in it.
  3. Look for the files with the prefix covid_data and see the timestamps of each file.
  4. Verify that the files are downloaded according to the frequency you specified (daily or hourly).
  5. Download one of the files and open it with a spreadsheet program.
  6. Verify that the file contains the COVID-19 data for the country you specified.
  7. Go back to the cloud storage section, and look for the file with the name covid.html
  8. Download the file and open it with a web browser.
  9. Verify that the file contains an HTML page with a table that displays the COVID-19 data for the country you specified.

Conclusion

We’ve finally come to the end of this article. I hope you’ve enjoyed it and have learned something new.

I’m always open to suggestions and discussions on LinkedIn. Hit me up with direct messages.

If you’ve enjoyed my writing and want to keep me motivated, consider leaving stars on GitHub and endorsing me for relevant skills on LinkedIn.

Till the next one, stay safe and keep learning.
