An Open Data Stack
But First!
Looking at my poor blog, it’s been years... YEARS since I’ve posted. A lot has changed, but a lot has stayed the same since then.
Firstly, I’ve taken on a new role in my company’s Architecture group. Specifically, I’ve been working on Web and Cloud applications (not new) but also our big data platform (very new). This has been an interesting transition for me, since my passion for computing science began in my Compute 391 class, where my Database instructor explained the beer-and-diapers correlation and I was flabbergasted by the amazing ways data could be used.
Now, I didn’t get much of a say in designing the stack I work in. It’s heavily influenced by our enterprise partnerships, which is completely fine, but it has taken away one of the things I like to do most: play and learn with minimal up-front investment. So much of the data world is pay-to-play, built on the shoulders of open source giants, so I wanted to take some time to poke around at what I could accomplish with a few spare hours and a world of open data tools.
The Components
My naive view of the data world focuses on a few key components that I’m looking to implement using only open source tools. Based on my exploration, I’ve landed on something that looks like this:
Storage - MinIO
I fell in love with MinIO ages ago when I integrated it into a piece of ‘bridge’ software that was eventually replaced by an AWS S3 bucket as the company I was working for moved to the cloud. S3-compatible storage wherever you need it: what a dream.
Data Lake - Apache Iceberg
I’ve played with Delta tables quite enough at this point (hint hint towards which platform I use at work) that I want to see what else is out there. Iceberg looks very exciting to me, so I’m going to go ahead and make use of it here!
Visualization - Metabase
It’s not Power BI, but it does stuff. So let’s try some stuff.
ELT and Orchestration - Dagster
Truthfully, I poked around with a few solutions and felt very impeded in my ability to get real work done. I tried Airbyte and it was very opinionated, too much so for me to play with in a blog post that isn’t making it to production. I tried KNIME, but drag-and-drop tools drive me up the wall. Apache Camel was an idea, but it was confusing to set up (for the fifteen minutes I spent trying). Apache NiFi looks powerful, but I found myself going far down the rabbit hole trying to organize and structure my workflows, so I abandoned it and came back to Dagster, which felt friendly, and I had already invested some time in their essentials course last year.
Bla Bla Bla, what’s the plan for content then?
I’m going to break this blog out into a few parts to go through how I’m planning to set up the environment and integrate the tools. Then I’m going to run through a data use-case to bring in some data from the AESO (because they have a lot of cool data about the Alberta Grid and electricity data is pretty cool), and maybe some weather data from Open Weather or ACIS. I really don’t quite know yet.
At the end of this, I’m hopefully going to give you a docker-compose file that will provision our data stack, make it nice and integrated, suck in some data from a couple of sources, and make a cool Metabase dashboard correlating, perhaps, peak demand with weather patterns or the like.
Wish me luck, and nag me on my socials if you are actually reading and I’m not posting the content fast enough!
So for today, we’ll call this post 1. I feel like there was a lot of setup and not a lot of content, so we’re going to begin with a fairly easy but non-trivial part of the stack: our storage layer, MinIO.
MinIO
Environment variables
We’re not heathens here. While I may not be the greatest when it comes to ultra-security of credentials, secrets, and keys, I do try to at least practice a little bit of what I preach. I also find that too many blogs write too little in the realm of security, and it makes me fear for the inevitable moment when some zealous new developer implements a lot of awful code against an unsecured database because a tutorial led them astray.
Step 1 - Create a `.env` file
Step 2 - Add your secrets in `KEY=VALUE` format
Step 3 - Add `.env` to your `.gitignore` (a one-liner for this is sketched below)
Step 4 - Feel a little better about your setup. Never, EVER, EVER commit that file to source control, lest you want to fight forever to have it forgotten from the depths of git history, and so credential-scanning tools won’t be sniffing your secrets out of your public portfolio.
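If you prefer to do step 3 from the terminal, a quick (and slightly lazy) one-liner, assuming you’re sitting in the root of your repo:

```bash
# Append .env to .gitignore so the secrets file never gets committed
echo ".env" >> .gitignore
```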
I digress, back to the interesting stuff
Your `.env` file:

```
MINIO_USER=minio
MINIO_PASS=minio123
```
… OK. Not secure, I get it. But you can tweak those whenever you feel like it and you can feel more secure in your implementation.
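And if you want something a little stronger than minio123 for that password, one quick way to generate a random secret (assuming `openssl` is available on your machine) is:

```bash
# Print a 32-character random hex string to use as MINIO_PASS
openssl rand -hex 16
```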
Next, we’re going to actually set up a docker-compose file. We’re not going to go crazy like the official documentation, where they set up a big cluster with an nginx load balancer (great practice, overkill for what I’m trying to do here). We’re setting up single-point-of-failure S3 bucket storage with some basic default configs.
Now for your `docker-compose.yaml` file:
```yaml
# Services
services:
  # Seriously - this should be setup in a cluster for production workloads, but I'm not doing it.
  # See the official docs here: https://github.com/minio/minio/blob/master/docs/orchestration/docker-compose/docker-compose.yaml
  minio:
    image: quay.io/minio/minio
    command: server --console-address ":9001" /data
    volumes:
      - ./minio-data:/data
    expose:
      - "9000"
      - "9001"
    ports:
      - "9000:9000"
      - "9001:9001"
    environment:
      MINIO_ROOT_USER: ${MINIO_USER}
      MINIO_ROOT_PASSWORD: ${MINIO_PASS}
    healthcheck:
      test: [ "CMD", "mc", "ready", "local" ]
      interval: 5s
      timeout: 5s
      retries: 5
    networks:
      - minio

networks:
  minio:
    driver: bridge
```
So what’s happening here? We’re provisioning a network called ‘minio’ to ensure only the things I want to access S3 can access S3. Then we’re provisioning our first service in the stack (minio) and binding it to a local volume called ‘minio-data’. Make sure you run `mkdir minio-data` in the same directory as this docker-compose file so things don’t break.
NOTE that your credentials are referenced in the yaml as `${MINIO_USER}` and `${MINIO_PASS}`. If you’re using a `.env` file then you don’t need to do anything extra or special, but if you want to keep multiple env files, you can pick the one to use by passing the `--env-file` flag to your `docker-compose up` command. For example, if you create a `prod.env` file and set your secrets there, you can use it instead of the default `.env` file by running:
```bash
docker-compose --env-file prod.env up
```
Run `docker-compose up` and watch the magic happen.
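If you want to confirm the server is actually healthy (beyond watching the logs scroll by), MinIO exposes a liveness endpoint on the API port that should return a 200 once everything is up:

```bash
# Poke MinIO's liveness endpoint on the API port
curl -i http://localhost:9000/minio/health/live
```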
Congrats! You have a MinIO server set up on port 9001 (web console) and 9000 (API) for us to make use of in future steps! Feel free to play around with the GUI, create some buckets, upload some data, etc.
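If you’d rather script that than click around the console, here’s a rough sketch using MinIO’s `mc` client (assuming you’ve installed it locally; the alias name, bucket name, and sample file are just placeholders I made up):

```bash
# Point mc at the local MinIO instance, using the credentials from the .env example above
mc alias set local http://localhost:9000 minio minio123

# Create a bucket and copy a file into it (sample.csv is a placeholder)
mc mb local/raw-data
mc cp ./sample.csv local/raw-data/

# List what landed in the bucket
mc ls local/raw-data
```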
In the next post, I’ll either look at setting up Apache Iceberg or setting up our first ELT pipeline using Dagster!