Dagster
As with most initiatives in technology… it seems that my first plan is never my purest plan. Setting up Dagster in Docker was confusing for me. Honestly, I almost gave up on the blog series dream, as I just didn’t have a lot of time to sit down and troubleshoot. HOWEVER…
Here we are, to continue the good fight. I picked myself up by my shoelaces and got to work getting this thing up and running. For that reason, I’m probably going to have to break blog part 2 into a couple of parts to really get the point across.
The official docs for setting up Dagster with docker-compose can be found here: https://docs.dagster.io/guides/deploy/deployment-options/docker. The documentation leaves something to be desired, however.
The full example that exists on Github is located here: https://github.com/dagster-io/dagster/tree/master/examples/deploy_docker
As with all good troubleshooting scenarios, that example is what I ended up turning to in order to figure out why I couldn’t get my stuff up and running. So I’m going to drag you through my epic journey of setting up Dagster with Docker Compose, and the reworking of my previous docker-compose.yaml file that only contained my minio services.
After poking around for a while, I understand the Dagster Architecture to look a little like this:
---
title: Dagster Architecture
---
flowchart LR
    A[Dagster Webserver]
    B[Dagster Daemon]
    C[Daemon Workers 1..n]
    D[Dagster Code Locations]
    DB[(PostgreSQL)]
    DB1[(Dagster Volume)]
    A--GRPC---D
    A----DB
    B----DB
    C----DB
    B---C
    A--fs---DB1
    B--fs---DB1
Here, the Dagster Webserver is the main user interface of the application, deployed on its own. The Dagster Daemon is one or more services configured to handle particular jobs; in this case, we’ll set up exactly one.
The Dagster Webserver and Dagster Daemon need to communicate with each other using a shared volume plus PostgreSQL (this tripped me up for quite a while, honestly).
The Dagster Daemon will indicate that it’s running by writing files to the shared volume location, and the Dagster Webserver will know the daemons are running based on the files dropped there. Make sense so far?
Dagster Code Locations are independently deployable pipeline code repositories that you can coordinate through the Dagster Webserver. You will need to configure your workspace.yaml file to point at them (not clear in the documentation, clearer in the GitHub example).
dagster.yaml will also need to be configured according to the GitHub reference, which was also not clear from the documentation page.
Directory Structure
/opendatastack
    data/
    dagsterCode/
    .env
    dagster.yaml
    docker-compose.yaml
    Dockerfile_code_location_1
    Dockerfile_dagster
    workspace.yaml
Dockerfile_dagster
This comes right from the official documentation; essentially it creates a Dagster image for you to re-use in your docker-compose. It’s a minimal setup, with all of the configuration coming from workspace.yaml and dagster.yaml. This becomes increasingly obvious later.
FROM python:3.10-slim

RUN pip install \
    dagster \
    dagster-graphql \
    dagster-webserver \
    dagster-postgres \
    dagster-docker

# Set $DAGSTER_HOME and copy dagster.yaml and workspace.yaml there
ENV DAGSTER_HOME=/opt/dagster/dagster_home/

RUN mkdir -p $DAGSTER_HOME

COPY dagster.yaml workspace.yaml $DAGSTER_HOME

WORKDIR $DAGSTER_HOME
Dagster WebServer, Daemons, and PostgreSQL
Next, we’re going to use basically the same docker-compose file from the official documentation above, but I want to get it set up very minimally to begin. I’m taking you down this path instead of the ‘straight in’ version I went with, which had me scratching my head for a while trying to figure out what was going on.
To get the webserver and daemon running, we only need three of the services from the example docker-compose to start:
services:
  docker_example_postgresql:
    image: postgres:11
    container_name: docker_example_postgresql
    environment:
      POSTGRES_USER: ${POSTGRES_USER}
      POSTGRES_PASSWORD: ${POSTGRES_PASS}
      POSTGRES_DB: ${POSTGRES_DB}
    volumes:
      # this is a change from the official docs. I want to persist my postgres data in a known volume
      - ./data/postgres:/var/lib/postgresql/data
    networks:
      - docker_example_network
    healthcheck:
      test: ['CMD-SHELL', 'pg_isready -U ${POSTGRES_USER} -d ${POSTGRES_DB}']
      interval: 10s
      timeout: 8s
      retries: 5

  docker_example_webserver:
    build:
      context: .
      dockerfile: ./Dockerfile_dagster
    entrypoint:
      - dagster-webserver
      - -h
      - '0.0.0.0'
      - -p
      - '3000'
      - -w
      - workspace.yaml
    container_name: docker_example_webserver
    expose:
      - '3000'
    ports:
      - '3000:3000'
    environment:
      DAGSTER_POSTGRES_USER: ${POSTGRES_USER}
      DAGSTER_POSTGRES_PASSWORD: ${POSTGRES_PASS}
      DAGSTER_POSTGRES_DB: ${POSTGRES_DB}
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - /tmp/io_manager_storage:/tmp/io_manager_storage
    networks:
      - docker_example_network
    depends_on:
      docker_example_postgresql:
        condition: service_healthy

  docker_example_daemon:
    build:
      context: .
      dockerfile: ./Dockerfile_dagster
    entrypoint:
      - dagster-daemon
      - run
    container_name: docker_example_daemon
    restart: on-failure
    environment:
      DAGSTER_POSTGRES_USER: ${POSTGRES_USER}
      DAGSTER_POSTGRES_PASSWORD: ${POSTGRES_PASS}
      DAGSTER_POSTGRES_DB: ${POSTGRES_DB}
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - /tmp/io_manager_storage:/tmp/io_manager_storage
    networks:
      - docker_example_network
    depends_on:
      docker_example_postgresql:
        condition: service_healthy

networks:
  docker_example_network:
    driver: bridge
    name: docker_example_network
As a good practice, never hardcode your credentials into your Docker files. In the previous blog post I introduced the usage of .env files with Docker Compose, so go ahead and update your secrets in that .env file. They map to the ${variable_name} references in the compose file:
MINIO_USER=minio
MINIO_PASS=minio123
POSTGRES_USER=postgres_user
POSTGRES_PASS=pgpassword
POSTGRES_DB=postgres
Also, these are terrible passwords. Don’t use these passwords for real. This gets the Dagster Webserver, Dagster Daemon, and PostgreSQL configured for our setup. There is a shared volume mapped to your local machine (/tmp/io_manager_storage) where the webserver and daemon can communicate with each other. Later, I might find a better way to handle this, but for now it’s mapped to a shared folder on my local machine.
Next, according to the documentation, you need to include a dagster.yaml. You can find a full version of it at the GitHub location above, or in the snippet below:
scheduler:
  module: dagster.core.scheduler
  class: DagsterDaemonScheduler

run_coordinator:
  module: dagster.core.run_coordinator
  class: QueuedRunCoordinator
  config:
    max_concurrent_runs: 5
    tag_concurrency_limits:
      - key: "operation"
        value: "example"
        limit: 5

run_launcher:
  module: dagster_docker
  class: DockerRunLauncher
  config:
    env_vars:
      - DAGSTER_POSTGRES_USER
      - DAGSTER_POSTGRES_PASSWORD
      - DAGSTER_POSTGRES_DB
    network: docker_example_network
    container_kwargs:
      volumes: # Make docker client accessible to any launched containers as well
        - /var/run/docker.sock:/var/run/docker.sock
        - /tmp/io_manager_storage:/tmp/io_manager_storage

run_storage:
  module: dagster_postgres.run_storage
  class: PostgresRunStorage
  config:
    postgres_db:
      hostname: docker_example_postgresql
      username:
        env: DAGSTER_POSTGRES_USER
      password:
        env: DAGSTER_POSTGRES_PASSWORD
      db_name:
        env: DAGSTER_POSTGRES_DB
      port: 5432

schedule_storage:
  module: dagster_postgres.schedule_storage
  class: PostgresScheduleStorage
  config:
    postgres_db:
      hostname: docker_example_postgresql
      username:
        env: DAGSTER_POSTGRES_USER
      password:
        env: DAGSTER_POSTGRES_PASSWORD
      db_name:
        env: DAGSTER_POSTGRES_DB
      port: 5432

event_log_storage:
  module: dagster_postgres.event_log
  class: PostgresEventLogStorage
  config:
    postgres_db:
      hostname: docker_example_postgresql
      username:
        env: DAGSTER_POSTGRES_USER
      password:
        env: DAGSTER_POSTGRES_PASSWORD
      db_name:
        env: DAGSTER_POSTGRES_DB
      port: 5432
Next, a very minimal workspace.yaml, which we will completely replace in a few minutes…
load_from:
  - python_file: my_file.py
Then docker compose up and wait…
NOTE: this is probably going to give you an error in the Dagster UI - but don’t worry, it’s an expected error.
Head over to http://localhost:3000 and voila! Dagster is running and our code location is, as expected, not loading!
Because it can’t be found
Because this file doesn’t exist…
Anywhere…
So onto the next section
Dagster Code Locations
This is where I got confused for a while. Dagster Code Locations. Why do they need to be their own containers? Why indeed!
It’s because they separate projects out to make deployment easier. Your core functionality runs as a platform, while your pipeline code runs alongside it and integrates with the orchestration platform, leaving you free to have multiple teams running multiple projects on the same platform!
So what are we going to do next? Let’s write an extremely minimal Dag!
Referring back to my preferred organization, the code is going to live in the folder called dagsterCode/demo for now. I’m going to include the definitions.py from the GitHub Docker example linked above, just to show that things can (sort of) work!
At dagsterCode/demo/definitions.py, copy/paste the following:
import dagster as dg


@dg.asset(
    op_tags={"operation": "example"},
    partitions_def=dg.DailyPartitionsDefinition("2025-01-01"),
)
def example_asset(context: dg.AssetExecutionContext):
    context.log.info(context.partition_key)


partitioned_asset_job = dg.define_asset_job("partitioned_job", selection=[example_asset])

defs = dg.Definitions(assets=[example_asset], jobs=[partitioned_asset_job])
It doesn’t do a lot: it defines an asset called example_asset and configures it with daily partitions starting from 2025-01-01. It then logs the partition key.
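If you want to sanity-check the asset before containerizing anything, you can materialize a single partition in-process with Dagster’s Python API. This is a minimal sketch of my own (the local_check.py helper file is my invention, and it assumes you have dagster pip-installed in a local environment):

# dagsterCode/demo/local_check.py (hypothetical helper, not part of the deployment)
import dagster as dg

from definitions import example_asset

if __name__ == "__main__":
    # Materialize one daily partition in-process; no webserver, daemon, or Postgres needed.
    result = dg.materialize([example_asset], partition_key="2025-01-01")
    assert result.success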
That’s it. That’s all. Let’s make it work now!
In the Dockerfile called Dockerfile_code_location_1, let’s get set up:
FROM python:3.10-slim

RUN pip install \
    dagster \
    dagster-postgres \
    dagster-docker

# Add code location code
WORKDIR /opt/dagster/app
COPY dagsterCode/demo /opt/dagster/app

# Run dagster code server on port 4000
EXPOSE 4000

# CMD allows this to be overridden from run launchers or executors to execute runs and steps
CMD ["dagster", "code-server", "start", "-h", "0.0.0.0", "-p", "4000", "-f", "definitions.py"]
Next, let’s update our docker-compose.yaml to use this Dockerfile!
services:
  # ... <- Means I'm omitting stuff for brevity. Just make changes in the services.

  # new service
  docker_example_user_code:
    build:
      context: .
      dockerfile: ./Dockerfile_code_location_1
    container_name: docker_example_user_code
    image: docker_example_user_code_image
    restart: always
    environment:
      DAGSTER_POSTGRES_USER: ${POSTGRES_USER}
      DAGSTER_POSTGRES_PASSWORD: ${POSTGRES_PASS}
      DAGSTER_POSTGRES_DB: ${POSTGRES_DB}
      DAGSTER_CURRENT_IMAGE: 'docker_example_user_code_image'
    expose:
      - '4000'
    ports:
      - '4000:4000'
    networks:
      - docker_example_network

  # ...
  docker_example_webserver:
    # ... Only update depends_on here
    depends_on:
      docker_example_postgresql:
        condition: service_healthy
      docker_example_user_code:
        condition: service_started

  # ...
  docker_example_daemon:
    # ... Only update depends_on here
    depends_on:
      docker_example_postgresql:
        condition: service_healthy
      docker_example_user_code:
        condition: service_started
We created a new service called docker_example_user_code and informed both the webserver and the daemon that it needs to be running before they are allowed to boot up.
Now we make changes to our workspace.yaml again, which will instruct the Dagster webserver and daemon to connect to the gRPC server hosted at the internal Docker hostname docker_example_user_code:
load_from:
  - grpc_server:
      host: docker_example_user_code
      port: 4000
      location_name: demo
Now… docker compose up --build (the --build flag instructs Docker to rebuild the images previously built locally from Dockerfile_dagster and Dockerfile_code_location_1, since we’ve updated the workspace.yaml file that gets loaded in).
Next, I say a small prayer…
That little ‘Loaded’ means everything worked great!
Check the Daemons link to make sure all your daemons are running:
Then navigate to Assets, click example_asset, click Materialize, and choose how you want to tackle it. You could just do the latest partition, select a date range to populate, or instruct it to backfill only failed and missing partitions within the selection. Regardless, it’s just going to log out dates anyway.
Once it’s done, you can click into it by pressing the little green pill and look at the output - it’s going to show you the date you ran it for (I ran it for 2025-04-01)
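As an aside: if you’d rather not press Materialize by hand every time, the DagsterDaemonScheduler we configured in dagster.yaml can run the partitioned job for you. Below is a minimal sketch of what definitions.py could grow into; the schedule is my own addition and is not part of the example repo:

import dagster as dg


@dg.asset(
    op_tags={"operation": "example"},
    partitions_def=dg.DailyPartitionsDefinition("2025-01-01"),
)
def example_asset(context: dg.AssetExecutionContext):
    context.log.info(context.partition_key)


partitioned_asset_job = dg.define_asset_job("partitioned_job", selection=[example_asset])

# New: a daily schedule built from the partitioned job. The cron cadence is
# inferred from the DailyPartitionsDefinition on example_asset.
example_schedule = dg.build_schedule_from_partitioned_job(partitioned_asset_job)

defs = dg.Definitions(
    assets=[example_asset],
    jobs=[partitioned_asset_job],
    schedules=[example_schedule],
)

Rebuild the user code image (docker compose up --build again) and the schedule should appear in the UI, where you can toggle it on.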
There you have it! Dagster webserver is up and running! We have a good foundation to work with to start building our orchestration pipelines!
We’re missing one item though: our docker-compose file is missing the minio service we created in part 1. For continuity, I’m going to add it to the compose file and paste the result below. We’ll now have Minio AND our Dagster stack running on our local Docker setup.
services:
  # ... omitted for brevity
  minio:
    image: quay.io/minio/minio
    command: server --console-address ":9001" /data
    volumes:
      - ./data/minio:/data
    expose:
      - "9000"
      - "9001"
    ports:
      - "9000:9000"
      - "9001:9001"
    environment:
      MINIO_ROOT_USER: ${MINIO_USER}
      MINIO_ROOT_PASSWORD: ${MINIO_PASS}
    healthcheck:
      test: [ "CMD", "mc", "ready", "local" ]
      interval: 5s
      timeout: 5s
      retries: 5
    networks:
      - docker_example_network
  # ...
That’s it!
In the next blog post I’m going to talk about how to create a non-trivial Dagster project, extract some data with some partitions, develop locally using dagster dev, and finally publish it to my fancy new Dagster project!