An Open Data Stack
But First!
Looking at my poor blog, it’s been years... YEARS since I’ve posted. A lot has changed, but a lot has stayed the same since then.
Firstly, I’ve taken on a new role in my company’s Architecture group. Specifically, I’ve been working on Web and Cloud applications (not new) but also our big data platform (very new). This has been an interesting transition for me, since my passion for computing science began in my Compute 391 class, where my Database instructor explained the beer-and-diapers correlation and I was flabbergasted by the amazing ways data could be used.
Now, I didn’t get much of a say in designing the stack I work in. It’s heavily influenced by our enterprise partnerships, which is completely fine, but it has taken away one of the things I like to do most: play and learn with minimal up-front investment. So much of the data world is pay-to-play, built on the shoulders of open source giants, so I wanted to take some time to poke around at what I could accomplish with a few spare hours and a world of open data tools.
The Components
My naive view of the data world focuses on a few key components that I’m looking to implement using only open source tools. Based on my exploration, I’ve landed on something that looks like this:
Storage - MinIO
I fell in love with MinIO ages ago when I integrated it into a piece of ‘bridge’ software that was eventually replaced by an AWS S3 bucket as the company I was working for moved to the cloud. S3-compatible storage wherever you need it: what a dream.
Data Lake - Apache Iceberg
I’ve played with Delta tables quite enough at this point (hint hint towards which platform I use at work) that I want to see what else is out there. Iceberg looks very exciting to me, so I’m going to go ahead and make use of it here!
Visualization - Metabase
It’s not Power BI, but it does stuff. So let’s try some stuff.
ELT and Orchestration - Dagster
Truthfully, I poked around with a few solutions and felt very impeded in my ability to get real work done. I tried Airbyte and it was very opinionated, too much so for me to play with in a blog post that isn’t making it to production. I tried KNIME, but drag-and-drop tools drive me up the wall. Apache Camel was an idea, but it was confusing to set up (for the fifteen minutes I spent trying). Apache NiFi looks powerful, but I found myself going far down the rabbit hole trying to organize and structure my workflows, so I abandoned it and came back to Dagster, which felt friendly, and I had already invested some time in their essentials course last year.
Bla Bla Bla, what’s the plan for content then?
I’m going to break this blog out into a few parts to go through how I’m planning to set up the environment and integrate the tools. Then I’m going to run through a data use-case to bring in some data from the AESO (because they have a lot of cool data about the Alberta Grid and electricity data is pretty cool), and maybe some weather data from Open Weather or ACIS. I really don’t quite know yet.
At the end of this, I’m hopefully going to give you a docker-compose file that will provision our data stack, make it nice and integrated, suck in some data from a couple of sources, and make a cool Metabase dashboard correlating, perhaps, peak demand with weather patterns or the like.
Wish me luck, and nag me on my socials if you are actually reading and I’m not posting the content fast enough!
So for today, we’ll call this post 1. I feel like there was a lot of setup and not a lot of content, so we’re going to begin with a fairly easy but non-trivial part of the stack: our storage layer, MinIO.
MinIO
Environment variables
We’re not heathens here. While I may not be the greatest when it comes to ultra-security of credentials, secrets, and keys, I do try to at least practice a little bit of what I preach. I also find that too many blogs write too little in the realm of security, and it makes me fear for the inevitable moment when some zealous new developer implements a lot of awful code against an unsecured database because a tutorial led them astray.
Step 1 - Create a `.env` file
Step 2 - Add your secrets in `KEY=VALUE` format
Step 3 - Add `.env` to your `.gitignore` (a one-liner for this is sketched below)
Step 4 - Feel a little better about your setup. Never, EVER, EVER commit that file to source control, lest you want to fight forever to have it forgotten from the depths of git history, and so credential-scanning tools won’t be sniffing your secrets out of your public portfolio.
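If you prefer to do step 3 from the terminal, a quick (and slightly lazy) one-liner, assuming you’re sitting in the root of your repo:

```bash
# Append .env to .gitignore so the secrets file never gets committed
echo ".env" >> .gitignore
```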
I digress, back to the interesting stuff
Your `.env` file:

```
MINIO_USER=minio
MINIO_PASS=minio123
```
… OK. Not secure, I get it. But you can tweak those whenever you feel like it and you can feel more secure in your implementation.
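And if you want something a little stronger than minio123 for that password, one quick way to generate a random secret (assuming `openssl` is available on your machine) is:

```bash
# Print a 32-character random hex string to use as MINIO_PASS
openssl rand -hex 16
```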
Next, we’re going to actually set up a docker-compose file. We’re not going to go crazy like the official documentation, where they set up a big cluster with an nginx load balancer (great practice, overkill for what I’m trying to do here). We’re setting up single-point-of-failure S3 bucket storage with some basic default configs.
Now for your `docker-compose.yaml` file:
```yaml
# Services
services:
  # Seriously - this should be setup in a cluster for production workloads, but I'm not doing it.
  # See the official docs here: https://github.com/minio/minio/blob/master/docs/orchestration/docker-compose/docker-compose.yaml
  minio:
    image: quay.io/minio/minio
    command: server --console-address ":9001" /data
    volumes:
      - ./minio-data:/data
    expose:
      - "9000"
      - "9001"
    ports:
      - "9000:9000"
      - "9001:9001"
    environment:
      MINIO_ROOT_USER: ${MINIO_USER}
      MINIO_ROOT_PASSWORD: ${MINIO_PASS}
    healthcheck:
      test: [ "CMD", "mc", "ready", "local" ]
      interval: 5s
      timeout: 5s
      retries: 5
    networks:
      - minio

networks:
  minio:
    driver: bridge
```
So what’s happening here? We’re provisioning a network called ‘minio’ to ensure only the things I want to access S3 can access S3. Then we’re provisioning our first service in the stack (minio) and binding it to a local volume called ‘minio-data’. Make sure you run `mkdir minio-data` in the same directory as this docker-compose file so things don’t break.
NOTE that your credentials are referenced in the yaml as `${MINIO_USER}` and `${MINIO_PASS}`. If you’re using a `.env` file then you don’t need to do anything extra or special, but if you want to keep multiple env files, you can pick the one to use by passing the `--env-file` flag to your `docker-compose up` command. For example, if you create a `prod.env` file and set your secrets there, you can use it instead of the default `.env` file by running:
```bash
docker-compose --env-file prod.env up
```
Run `docker-compose up` and watch the magic happen.
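If you want to confirm the server is actually healthy (beyond watching the logs scroll by), MinIO exposes a liveness endpoint on the API port that should return a 200 once everything is up:

```bash
# Poke MinIO's liveness endpoint on the API port
curl -i http://localhost:9000/minio/health/live
```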
Congrats! You have a MinIO server set up on port 9001 (web console) and 9000 (API) for us to make use of in future steps! Feel free to play around with the GUI, create some buckets, upload some data, etc.
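If you’d rather script that than click around the console, here’s a rough sketch using MinIO’s `mc` client (assuming you’ve installed it locally; the alias name, bucket name, and sample file are just placeholders I made up):

```bash
# Point mc at the local MinIO instance, using the credentials from the .env example above
mc alias set local http://localhost:9000 minio minio123

# Create a bucket and copy a file into it (sample.csv is a placeholder)
mc mb local/raw-data
mc cp ./sample.csv local/raw-data/

# List what landed in the bucket
mc ls local/raw-data
```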
In the next post, I’ll either look at setting up Apache Iceberg or setting up our first ELT pipeline using Dagster!