
Spark — Distributed Processing

Apache Spark 3.5.1 provides distributed data processing for large-scale analytics workloads. It runs alongside a Livy server, which exposes a REST API for submitting Spark jobs, so Prefect workflows can trigger Spark jobs without direct cluster access.

- Distributed Compute: PySpark jobs run across a Spark cluster for large-scale data processing.
- Livy REST API: Submit and monitor Spark jobs over HTTP, with no direct cluster access required.
- Prefect Integration: Spark jobs are orchestrated through Prefect, inheriting the same scheduling, monitoring, and retry semantics.
- Kubernetes-native: Deployed via the `spark` Helm chart in `infra/kube-cluster/`.
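
The Livy submission path described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual integration code: the Livy URL, the job file path, and all helper names (`build_batch_payload`, `submit_batch`, `wait_for_batch`) are hypothetical. The only parts taken from Livy itself are the batch endpoints `POST /batches` and `GET /batches/{id}/state`. It uses only the Python standard library so that it stays self-contained.

```python
import json
import time
import urllib.request

# Hypothetical values: neither the Livy URL nor the job path comes from this page.
LIVY_URL = "http://livy.example.internal:8998"
JOB_FILE = "local:///opt/jobs/example_job.py"

# Batch states in which polling should stop.
TERMINAL_STATES = {"success", "dead", "killed", "error"}


def build_batch_payload(file, args=None, conf=None):
    """Build the JSON body for Livy's POST /batches endpoint."""
    payload = {"file": file}
    if args:
        payload["args"] = list(args)
    if conf:
        payload["conf"] = dict(conf)
    return payload


def _request_json(method, url, body=None):
    """Send an HTTP request with an optional JSON body and decode the JSON reply."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    request = urllib.request.Request(
        url,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def submit_batch(livy_url, payload):
    """Submit a batch job and return the batch id that Livy assigns."""
    return _request_json("POST", f"{livy_url}/batches", payload)["id"]


def wait_for_batch(livy_url, batch_id, poll_seconds=5.0):
    """Poll the batch state until it reaches a terminal state, then return it."""
    while True:
        reply = _request_json("GET", f"{livy_url}/batches/{batch_id}/state")
        if reply["state"] in TERMINAL_STATES:
            return reply["state"]
        time.sleep(poll_seconds)
```

A caller, for example the body of a Prefect task, might run `batch_id = submit_batch(LIVY_URL, build_batch_payload(JOB_FILE, args=["2024-01-01"]))` followed by `wait_for_batch(LIVY_URL, batch_id)`, and raise an exception on any final state other than `success` so that the orchestrator's retry logic can take over.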

Quick Reference

| Attribute | Value |
| --- | --- |
| Spark version | 3.5.1 |
| Python | 3.9 |
| PySpark | 3.5.1 |
| Livy | REST API gateway |
| Location | `apps/analytics/spark/` |
| Helm chart | `infra/kube-cluster/charts/spark/` |