Spark — Distributed Processing
Apache Spark 3.5.1 provides distributed data processing for large-scale analytics workloads. It runs alongside a Livy server, whose REST API lets Prefect workflows submit Spark jobs without direct cluster access.
Distributed Compute: PySpark jobs run across a Spark cluster for large-scale data processing.
Livy REST API: Submit and monitor Spark jobs over HTTP, with no direct cluster access required.
Prefect Integration: Spark jobs are orchestrated through Prefect, inheriting the same scheduling, monitoring, and retry semantics.
Kubernetes-native: Deployed via the `spark` Helm chart in `infra/kube-cluster/`.
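As an illustration of submitting and monitoring a job over HTTP, the following is a minimal sketch that uses Livy's batch endpoints (`POST /batches` and `GET /batches/{id}/state`) with only the Python standard library. The service URL `http://livy:8998` and the helper names (`build_batch_payload`, `submit_batch`, `get_batch_state`, `wait_for_batch`) are assumptions made for this example, not names taken from this repository.

```python
import json
import time
import urllib.request

# Assumption: in-cluster Livy address (8998 is Livy's default port).
# Replace with the service URL of your deployment.
LIVY_URL = "http://livy:8998"

# Batch states after which Livy will not make further progress.
TERMINAL_STATES = {"success", "dead", "killed", "error"}


def build_batch_payload(file, args=None, conf=None):
    """Build the JSON body for Livy's POST /batches endpoint."""
    payload = {"file": file}
    if args:
        payload["args"] = list(args)
    if conf:
        payload["conf"] = dict(conf)
    return payload


def _request_json(method, url, body=None):
    """Send a JSON request and return the decoded JSON response."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    request = urllib.request.Request(
        url,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def submit_batch(payload, livy_url=LIVY_URL):
    """Submit a batch job and return the batch id assigned by Livy."""
    return _request_json("POST", f"{livy_url}/batches", payload)["id"]


def get_batch_state(batch_id, livy_url=LIVY_URL):
    """Return the current state string of a batch, e.g. 'running'."""
    return _request_json("GET", f"{livy_url}/batches/{batch_id}/state")["state"]


def wait_for_batch(batch_id, livy_url=LIVY_URL, poll_seconds=10,
                   fetch_state=get_batch_state):
    """Poll until the batch reaches a terminal state, then return that state."""
    while True:
        state = fetch_state(batch_id, livy_url)
        if state in TERMINAL_STATES:
            return state
        time.sleep(poll_seconds)
```

A caller would typically build a payload for a job file that the cluster can read (for example a hypothetical `local:///opt/jobs/example.py`), pass it to `submit_batch`, and then call `wait_for_batch` on the returned id. Wrapping these calls in a Prefect task is one way the orchestration layer could apply its retry and monitoring semantics to the job; the exact wiring depends on how the workflows in this project are defined.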
Quick Reference
| Attribute | Value |
|---|---|
| Spark version | 3.5.1 |
| Python | 3.9 |
| PySpark | 3.5.1 |
| Livy | REST API gateway |
| Location | apps/analytics/spark/ |
| Helm chart | infra/kube-cluster/charts/spark/ |