Dagster

As with most initiatives in technology… It seems that my first plan is never my purest plan. Setting up Dagster in Docker was confusing for me. Honestly, I almost gave up the blog series dream as I just didn’t have a lot of time to sit down and troubleshoot. HOWEVER

Here we are, to continue the good fight. I picked myself up by my shoelaces and got to work getting this thing up and running. For that reason, I’m probably going to have to break blog part 2 into a couple of parts to really get the point across.

The official docs for setting up Dagster with docker-compose can be found here: https://docs.dagster.io/guides/deploy/deployment-options/docker. The documentation leaves something to be desired, however.

The full example that exists on Github is located here: https://github.com/dagster-io/dagster/tree/master/examples/deploy_docker

As with all good troubleshooting scenarios, this is what I ended up turning to in order to figure out why I couldn’t get my stuff up and running. So I’m going to drag you through my epic journey of setting up Dagster with Docker Compose, and the reworking of my previous docker-compose.yaml file that only contained my MinIO services.

After poking around for a while, I understand the Dagster Architecture to look a little like this:

---
title: Dagster Architecture
---
flowchart LR
  A[Dagster Webserver]
  B[Dagster Daemon]
  C[Daemon Workers 1..n]
  D[Dagster Code Locations]
  DB[(PostgreSQL)]
  DB1[(Dagster Volume)]

  A--GRPC---D
  A----DB
  B----DB
  C----DB
  B---C
  A--fs---DB1
  B--fs---DB1

Where the Dagster Webserver is the main user interface of the application, deployed on its own, and the Dagster Daemon is one or more services configured to handle background work (schedules, sensors, and the run queue). In this case, we’ll set up exactly one daemon.

The Dagster Webserver and Dagster Daemon need to share state with each other using a shared volume + PostgreSQL (this tripped me up for quite a while, honestly).

The Dagster Daemon indicates it’s running by writing heartbeats into the shared Dagster instance storage (PostgreSQL in this setup), and the Dagster Webserver knows the daemon is healthy by reading those heartbeats back from the same place. Make sense so far?
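
If you want to convince yourself that both containers really are pointed at the same instance, here’s a small sanity-check sketch you could run in a Python shell inside either the webserver or daemon container (the specific checks are just illustrative, not something the official setup requires):

import dagster as dg

# Loads the instance from $DAGSTER_HOME/dagster.yaml (the file we copy into the image below).
instance = dg.DagsterInstance.get()

# Both containers should report Postgres-backed run storage and see the same run history.
print(type(instance.run_storage))
print(len(instance.get_runs()))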

Dagster Code Locations are independently deployable pipeline code repositories that you coordinate through the Dagster Webserver. You will need to configure your workspace.yaml file to point at them (not clear in the documentation, clearer in the GitHub example).

dagster.yaml will also need to be configured according to the GitHub reference, which was also not clear from the documentation page.

Directory Structure

/opendatastack
  data/
  dagsterCode/
    
  .env
  dagster.yaml
  docker-compose.yaml
  Dockerfile_code_location_1
  Dockerfile_dagster
  workspace.yaml

Dockerfile_dagster

This comes right from the official documentation; essentially, it creates a Dagster image for you to reuse in your docker-compose services. It’s a minimal setup, with all the configuration coming from workspace.yaml and dagster.yaml. This becomes increasingly obvious later.

FROM python:3.10-slim

RUN pip install \
    dagster \
    dagster-graphql \
    dagster-webserver \
    dagster-postgres \
    dagster-docker

# Set $DAGSTER_HOME and copy dagster.yaml and workspace.yaml there
ENV DAGSTER_HOME=/opt/dagster/dagster_home/

RUN mkdir -p $DAGSTER_HOME

COPY dagster.yaml workspace.yaml $DAGSTER_HOME

WORKDIR $DAGSTER_HOME

Dagster Webserver, Daemon, and PostgreSQL

Next, we’re going to use basically the same docker-compose file from the official documentation above, but I want to get it set up very minimally to begin with. I’m taking you down this path instead of the ‘straight in’ version I went with, which had me scratching my head for a while trying to figure out what was going on.

For the webserver and daemon to run, we only need to set up three of the services from the example docker-compose.

services:
  docker_example_postgresql:
    image: postgres:11
    container_name: docker_example_postgresql
    environment:
      POSTGRES_USER: ${POSTGRES_USER}
      POSTGRES_PASSWORD: ${POSTGRES_PASS}
      POSTGRES_DB: ${POSTGRES_DB}
    volumes:
      # this is a change from the official docs. I want to persist my postgres data in a known volume
      - ./data/postgres:/var/lib/postgresql/data
    networks:
      - docker_example_network
    healthcheck:
      test: ['CMD-SHELL', 'pg_isready -U ${POSTGRES_USER} -d ${POSTGRES_DB}']
      interval: 10s
      timeout: 8s
      retries: 5
  docker_example_webserver:
    build:
      context: .
      dockerfile: ./Dockerfile_dagster
    entrypoint:
      - dagster-webserver
      - -h
      - '0.0.0.0'
      - -p
      - '3000'
      - -w
      - workspace.yaml
    container_name: docker_example_webserver
    expose:
      - '3000'
    ports:
      - '3000:3000'
    environment:
      DAGSTER_POSTGRES_USER: ${POSTGRES_USER}
      DAGSTER_POSTGRES_PASSWORD: ${POSTGRES_PASS}
      DAGSTER_POSTGRES_DB: ${POSTGRES_DB}
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - /tmp/io_manager_storage:/tmp/io_manager_storage
    networks:
      - docker_example_network
    depends_on:
      docker_example_postgresql:
        condition: service_healthy
  docker_example_daemon:
    build:
      context: .
      dockerfile: ./Dockerfile_dagster
    entrypoint:
      - dagster-daemon
      - run
    container_name: docker_example_daemon
    restart: on-failure
    environment:
      DAGSTER_POSTGRES_USER: ${POSTGRES_USER}
      DAGSTER_POSTGRES_PASSWORD: ${POSTGRES_PASS}
      DAGSTER_POSTGRES_DB: ${POSTGRES_DB}
    volumes: 
      - /var/run/docker.sock:/var/run/docker.sock
      - /tmp/io_manager_storage:/tmp/io_manager_storage
    networks:
      - docker_example_network
    depends_on:
      docker_example_postgresql:
        condition: service_healthy
networks:
  docker_example_network:
    driver: bridge
    name: docker_example_network

As a good practice, never hardcode your credentials into your Docker files. In the previous blog post I introduced the use of .env files with Docker Compose, so go ahead and put your secrets in that .env file. They map to the ${VARIABLE_NAME} references in the compose file.

MINIO_USER=minio
MINIO_PASS=minio123
POSTGRES_USER=postgres_user
POSTGRES_PASS=pgpassword
POSTGRES_DB=postgres

Also, these are terrible passwords. Don’t use these passwords for real. This gets the Dagster Webserver, Dagster Daemon, and PostgreSQL configured for our setup. There is a shared directory bind-mounted from your local machine (/tmp/io_manager_storage) so that the webserver, the daemon, and any launched run containers all see the same filesystem storage. Later, I might find a better way to handle this, but for now it’s mapped to a shared volume on my local machine.
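
To make what that mount is for a little more concrete, here’s a hedged sketch (not required for this post) of pointing Dagster’s built-in filesystem IO manager at that shared directory, so asset outputs written by one container are readable by the others:

import dagster as dg

# Hypothetical: wire the filesystem IO manager to the shared mount every container sees.
defs = dg.Definitions(
    assets=[],  # your assets would go here
    resources={
        "io_manager": dg.FilesystemIOManager(base_dir="/tmp/io_manager_storage"),
    },
)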

Next, according to the documentation, you need to include a dagster.yaml. You can find the full version of it at the GitHub location above, or use the snippet below:

scheduler:
  module: dagster.core.scheduler
  class: DagsterDaemonScheduler

run_coordinator:
  module: dagster.core.run_coordinator
  class: QueuedRunCoordinator
  config:
    max_concurrent_runs: 5
    tag_concurrency_limits:
      - key: "operation"
        value: "example"
        limit: 5

run_launcher:
  module: dagster_docker
  class: DockerRunLauncher
  config:
    env_vars:
      - DAGSTER_POSTGRES_USER
      - DAGSTER_POSTGRES_PASSWORD
      - DAGSTER_POSTGRES_DB
    network: docker_example_network
    container_kwargs:
      volumes: # Make docker client accessible to any launched containers as well
        - /var/run/docker.sock:/var/run/docker.sock
        - /tmp/io_manager_storage:/tmp/io_manager_storage

run_storage:
  module: dagster_postgres.run_storage
  class: PostgresRunStorage
  config:
    postgres_db:
      hostname: docker_example_postgresql
      username:
        env: DAGSTER_POSTGRES_USER
      password:
        env: DAGSTER_POSTGRES_PASSWORD
      db_name:
        env: DAGSTER_POSTGRES_DB
      port: 5432

schedule_storage:
  module: dagster_postgres.schedule_storage
  class: PostgresScheduleStorage
  config:
    postgres_db:
      hostname: docker_example_postgresql
      username:
        env: DAGSTER_POSTGRES_USER
      password:
        env: DAGSTER_POSTGRES_PASSWORD
      db_name:
        env: DAGSTER_POSTGRES_DB
      port: 5432

event_log_storage:
  module: dagster_postgres.event_log
  class: PostgresEventLogStorage
  config:
    postgres_db:
      hostname: docker_example_postgresql
      username:
        env: DAGSTER_POSTGRES_USER
      password:
        env: DAGSTER_POSTGRES_PASSWORD
      db_name:
        env: DAGSTER_POSTGRES_DB
      port: 5432

Next, a very minimal workspace.yaml, which we will completely replace in a few minutes…

load_from:
  - python_file: my_file.py

Then docker compose up and wait …

NOTE: this is probably going to give you an error in the Dagster UI, but don’t worry, it’s an expected error.

Head over to http://localhost:3000 and voila! Dagster is running and our code location is, as expected, not loading! Good Error

Because it can’t be found

Because this file doesn’t exist…

Anywhere…

So onto the next section

Dagster Code Locations

This is where I got confused for a while. Dagster Code Locations. Why do they need to be a container? Why indeed!

It’s because they separate out projects to make deployment easier. The core functionality runs as a platform, while your pipeline code runs separately and plugs into the orchestration platform, leaving you free to have multiple teams running multiple projects on the same platform!

So what are we going to do next? Let’s write an extremely minimal Dag!

Referring back to my preferred organization, the code lives in the folder called dagsterCode/demo for now. I’m going to use the definitions.py from the GitHub Docker example linked above, just to show that things can (sort of) work! At dagsterCode/demo/definitions.py, copy/paste the following:

import dagster as dg

@dg.asset(
    op_tags={"operation": "example"},
    partitions_def=dg.DailyPartitionsDefinition("2025-01-01"),
)
def example_asset(context: dg.AssetExecutionContext):
    context.log.info(context.partition_key)


partitioned_asset_job = dg.define_asset_job("partitioned_job", selection=[example_asset])

defs = dg.Definitions(assets=[example_asset], jobs=[partitioned_asset_job])

It doesn’t do a lot: it defines an asset called example_asset, configures it with daily partitions starting from 2025-01-01, and logs the partition key when it runs. Note that the op_tags={"operation": "example"} tag matches the tag_concurrency_limits entry we configured in dagster.yaml. That’s it. That’s all. Let’s make it work now!
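
Before containerizing anything, you can sanity-check the asset locally with plain Python. A minimal sketch, assuming you run it from inside dagsterCode/demo (so definitions.py is importable) with dagster installed in your local environment:

import dagster as dg

# definitions.py is the file we just wrote; this import path assumes you're in dagsterCode/demo.
from definitions import example_asset

if __name__ == "__main__":
    # Materialize a single daily partition in-process and confirm it succeeded.
    result = dg.materialize([example_asset], partition_key="2025-01-01")
    assert result.success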

In the Dockerfile called Dockerfile_code_location_1, let’s get set up:

FROM python:3.10-slim

RUN pip install \
    dagster \
    dagster-postgres \
    dagster-docker

# Add code location code
WORKDIR /opt/dagster/app
COPY dagsterCode/demo /opt/dagster/app

# Run dagster code server on port 4000
EXPOSE 4000

# CMD allows this to be overridden from run launchers or executors to execute runs and steps
CMD ["dagster", "code-server", "start", "-h", "0.0.0.0", "-p", "4000", "-f", "definitions.py"]

Next, let’s update our docker-compose.yaml to use this Dockerfile!

services:
  # ... <- Means I'm omitting stuff for brevity. Just make changes in the services.
  # new service
  docker_example_user_code:
    build:
      context: .
      dockerfile: ./Dockerfile_code_location_1
    container_name: docker_example_user_code
    image: docker_example_user_code_image
    restart: always
    environment:
      DAGSTER_POSTGRES_USER: ${POSTGRES_USER}
      DAGSTER_POSTGRES_PASSWORD: ${POSTGRES_PASS}
      DAGSTER_POSTGRES_DB: ${POSTGRES_DB}
      DAGSTER_CURRENT_IMAGE: 'docker_example_user_code_image'
    expose:
      - '4000'
    ports:
      - '4000:4000'
    networks:
      - docker_example_network
  # ...
  docker_example_webserver:
    # ...Only update depends_on here 
    depends_on:
      docker_example_postgresql:
        condition: service_healthy
      docker_example_user_code:
        condition: service_started
  # ...
  docker_example_daemon:
    # ... Only update depends_on here
    depends_on:
      docker_example_postgresql:
        condition: service_healthy
      docker_example_user_code:
        condition: service_started

We created a new service called docker_example_user_code and informed both the webserver and the daemon that it needs to be running before they are allowed to boot up.

Now we make changes to our workspace.yaml again, which instructs the Dagster webserver and daemon to connect to the gRPC server hosted at the internal Docker network address docker_example_user_code:

load_from:
  - grpc_server:
      host: docker_example_user_code
      port: 4000
      location_name: demo 

Now: docker compose up --build. The --build flag instructs Docker to rebuild the images we previously built locally (Dockerfile_dagster and Dockerfile_code_location_1), since we’ve updated the workspace.yaml file that gets copied into the Dagster image.

Next, I say a small prayer…

Great Success!

That little ‘Loaded’ means everything worked great!

Check the daemons link to make sure all your daemons are running: Daemons

Then navigate to Assets, click example_asset, click Materialize, and choose how you want to tackle it. You could materialize just the latest partition, select a date range to populate, or backfill only failed and missing partitions within the selection. Regardless, it’s just going to log out dates anyway.

Once it’s done, you can click into it by pressing the little green pill and look at the output; it’s going to show you the date you ran it for (I ran it for 2025-04-01).

Materialize Logs
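
If you’d rather drive the same kind of small backfill from code while developing (no webserver involved), here’s a hedged in-process sketch, again assuming definitions.py is importable from where you run it:

import dagster as dg

from definitions import example_asset  # assumes you run this next to definitions.py

# Materialize a short range of daily partitions in-process, one run per partition.
for day in ["2025-04-01", "2025-04-02", "2025-04-03"]:
    result = dg.materialize([example_asset], partition_key=day)
    assert result.success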

There you have it! Dagster webserver is up and running! We have a good foundation to work with to start building our orchestration pipelines!

We’re missing one item, though: our docker-compose file is missing the MinIO service we created in part 1. For continuity, I’m going to add it to the compose file and paste the result below. We’ll now have MinIO AND our Dagster stack running on our local Docker setup.

services:
  # ... omitted for brevity
  minio:
    image: quay.io/minio/minio
    command: server --console-address ":9001" /data
    volumes:
      - ./data/minio:/data
    expose:
      - "9000"
      - "9001"
    ports:
      - "9000:9000"
      - "9001:9001"
    environment:
      MINIO_ROOT_USER: ${MINIO_USER}
      MINIO_ROOT_PASSWORD: ${MINIO_PASS}
    healthcheck:
      test: [ "CMD", "mc", "ready", "local" ]
      interval: 5s
      timeout: 5s
      retries: 5
    networks:
      - docker_example_network
  # ...

That’s it!

In the next blog post I’m going to talk about how to create a non-trivial Dagster project, extract some data with partitions, develop locally using dagster dev, and finally publish it to my fancy new Dagster deployment!