Question

How to start a Spark cluster on Kubernetes on DigitalOcean, step by step

Posted August 21, 2020 118 views
DigitalOcean, Kubernetes

How do I start a Spark cluster on Kubernetes on DigitalOcean, step by step? On the web I found the following video, which explains how to set up a Kubernetes cluster (including kubectl) on DigitalOcean: https://www.youtube.com/watch?v=_waZw9jiyhQ.
Starting from there, i.e. assuming that I have set up kubectl on my local machine, I am interested in the following questions:
1) How do I prepare a base Docker image for Spark to run on Kubernetes on DigitalOcean?
2) How do I tell Kubernetes which image to use for Spark?
3) Does this image have to be deployed to DigitalOcean somehow? If yes, how?
4) What is the best way to transfer a large CSV file (~100 GB) to the DigitalOcean cloud and enable my Spark jobs to access and read that file?
5) Can I use the Parquet data format on DigitalOcean? If yes, how can it be stored?
6) How do I save the results of the Spark jobs on the DigitalOcean cloud? As the simplest option, I would prefer the results to be written to persistent storage that is not deleted after the Kubernetes cluster has finished the jobs and ceased to exist.
Thank you!

1 answer

1) The official Spark documentation looks like a solid starting point for running Spark on Kubernetes: https://spark.apache.org/docs/latest/running-on-kubernetes.html
2) The same page has a section specifically on using Docker images: https://spark.apache.org/docs/latest/running-on-kubernetes.html#docker-images
3) Any image deployed in DOKS needs to be pushed to a container registry that the cluster can pull from, for example a publicly accessible one.
4) It sounds like you may need to scp that file over to a Droplet or pod that has a persistent volume attached, so that the file persists.
5) I am not familiar with Parquet, but chances are that if you can run it on Linux, you can run it on DigitalOcean.
6) You can use our block storage product, which is integrated into DOKS, for persistent storage. The documentation on using block storage to persist data in your pods is here: https://www.digitalocean.com/docs/kubernetes/how-to/add-volumes/
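
To make points 1) to 3) a bit more concrete, here is a minimal sketch in Python that assembles the spark-submit command described in the Spark documentation linked above. It assumes you have already built and pushed a Spark image (the Spark distribution ships bin/docker-image-tool.sh for that). The API server URL, the image name, and the helper function build_spark_submit are hypothetical placeholders for illustration, and the jar path is the placeholder used in the Spark docs, so replace all of them with your own values.

```python
import shlex


def build_spark_submit(api_server, image, app_jar, main_class, executors=2):
    """Return the spark-submit argument list for a cluster-mode run on Kubernetes.

    build_spark_submit is a hypothetical helper, not part of Spark itself.
    """
    return [
        "spark-submit",
        # The k8s:// prefix tells Spark to use the Kubernetes scheduler backend.
        "--master", f"k8s://{api_server}",
        "--deploy-mode", "cluster",
        "--name", "spark-on-doks",
        "--class", main_class,
        "--conf", f"spark.executor.instances={executors}",
        # This setting is how Spark learns which image to run in its pods.
        "--conf", f"spark.kubernetes.container.image={image}",
        # local:// means the jar is already inside the container image.
        app_jar,
    ]


command = build_spark_submit(
    api_server="https://kubernetes.example.com:443",   # assumption: your cluster's API server
    image="registry.example.com/my-team/spark:latest",  # assumption: your pushed image
    app_jar="local:///path/to/examples.jar",            # placeholder path from the Spark docs
    main_class="org.apache.spark.examples.SparkPi",
)

# Print a shell-safe version of the command that you could paste into a terminal.
print(shlex.join(command))
```

The cluster pulls the image named in spark.kubernetes.container.image from the registry at pod start, which is why the image has to live in a registry the cluster can reach rather than on your local machine.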

Hope this helps!

John
