Spark — Distributed Processing

Apache Spark 3.5.1 provides distributed data processing for large-scale analytics workloads. It runs alongside a Livy server, which exposes a REST API for submitting Spark jobs, so Prefect workflows can trigger Spark jobs without direct cluster access.
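
As a rough illustration of the submission path, the sketch below posts a batch job to Livy's `POST /batches` endpoint using only the Python standard library. The Livy address, the script path, and the helper names `build_batch_payload` and `submit_batch` are assumptions made for this example, not values taken from this deployment.

```python
import json
import urllib.request
from typing import Dict, List, Optional

# Hypothetical Livy address; substitute the address of your own Livy service.
LIVY_URL = "http://livy.example.internal:8998"


def build_batch_payload(
    file: str,
    args: Optional[List[str]] = None,
    conf: Optional[Dict[str, str]] = None,
    name: Optional[str] = None,
) -> dict:
    """Build the JSON body for a Livy POST /batches request.

    Only `file` is required; optional fields are omitted when not given.
    """
    payload: dict = {"file": file}
    if args:
        payload["args"] = list(args)
    if conf:
        payload["conf"] = dict(conf)
    if name:
        payload["name"] = name
    return payload


def submit_batch(payload: dict, livy_url: str = LIVY_URL, timeout: float = 30.0) -> int:
    """POST the payload to Livy and return the id of the created batch."""
    request = urllib.request.Request(
        url=f"{livy_url}/batches",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["id"]
```

With a reachable Livy server, `submit_batch(build_batch_payload("local:///opt/jobs/example_job.py", args=["--date", "2024-01-01"]))` would return the new batch id; the script path here is hypothetical. Keeping payload construction separate from the HTTP call makes the request body easy to inspect and test without a cluster.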

Distributed Compute: PySpark jobs run across a Spark cluster for large-scale data processing.

Livy REST API: Submit and monitor Spark jobs over HTTP, with no direct cluster access required.

Prefect Integration: Spark jobs are orchestrated through Prefect, inheriting the same scheduling, monitoring, and retry semantics.

Kubernetes-native: Deployed via the `spark` Helm chart in `infra/kube-cluster/`.
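
The monitoring side of the Livy API can be sketched as a polling loop over `GET /batches/{id}/state`. This is a minimal sketch under stated assumptions: the Livy address, the function names, the polling interval, and the set of states treated as final are all illustrative choices, not settings from this repository. The loop raises on failure so that an orchestrator wrapping the call (for example, a Prefect task with retries configured) can apply its own retry handling.

```python
import json
import time
import urllib.request
from typing import Callable

# Hypothetical Livy address; substitute the address of your own Livy service.
LIVY_URL = "http://livy.example.internal:8998"

# States this sketch treats as final (an assumption for illustration).
TERMINAL_STATES = {"success", "dead", "killed", "error"}


def fetch_batch_state(batch_id: int, livy_url: str = LIVY_URL, timeout: float = 30.0) -> str:
    """Return the current state string of a Livy batch."""
    url = f"{livy_url}/batches/{batch_id}/state"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["state"]


def wait_for_batch(
    batch_id: int,
    get_state: Callable[[int], str] = fetch_batch_state,
    poll_seconds: float = 10.0,
    max_polls: int = 360,
) -> str:
    """Poll until the batch reaches a final state.

    Returns "success" on success, raises RuntimeError for any other final
    state, and raises TimeoutError if max_polls is exhausted.
    """
    for _ in range(max_polls):
        state = get_state(batch_id)
        if state in TERMINAL_STATES:
            if state != "success":
                raise RuntimeError(f"Livy batch {batch_id} ended in state {state!r}")
            return state
        time.sleep(poll_seconds)
    raise TimeoutError(f"Livy batch {batch_id} did not finish after {max_polls} polls")
```

The `get_state` parameter is injectable so the loop can be exercised without a live cluster, for instance by passing a function that replays a fixed sequence of states.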

Quick Reference

| Attribute | Value |
| --- | --- |
| Spark version | 3.5.1 |
| Python | 3.9 |
| PySpark | 3.5.1 |
| Livy | REST API gateway |
| Location | `apps/analytics/spark/` |
| Helm chart | `infra/kube-cluster/charts/spark/` |