How to start a Spark cluster on Kubernetes on DigitalOcean step by step

On the web I found a video that explains how to set up a Kubernetes cluster (including kubectl) on DigitalOcean. Starting from there, i.e. assuming that I have already set up kubectl on my local machine, I am interested in the following questions:

  1. How do I prepare a base Docker image for Spark on Kubernetes on DigitalOcean?
  2. How do I tell Kubernetes which image to use for Spark?
  3. Does this image have to be deployed to DigitalOcean somehow? If yes, how?
  4. What is the best way to transfer a large CSV file (~100 GB) to the DigitalOcean cloud and let my Spark jobs access and read that file?
  5. Can I use the Parquet data format on DigitalOcean? If yes, how can one store it?
  6. How do I save the results of the Spark jobs on the DigitalOcean cloud? As the simplest option, I would prefer the results to be written to persistent storage that is not deleted after the Kubernetes cluster has finished the jobs and ceased to exist. Thank you!
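To make questions 1 and 2 concrete, here is roughly the workflow I have in mind, based on the `docker-image-tool.sh` script and `spark-submit` Kubernetes options that ship with the Apache Spark distribution. The registry name, Spark version, and API server address below are placeholders, not working values:

```shell
# Build a Spark container image from the official Spark distribution and push
# it to a registry the cluster can pull from (registry name is a placeholder).
cd spark-3.5.1-bin-hadoop3
./bin/docker-image-tool.sh -r registry.example.com/myuser -t v3.5.1 build
./bin/docker-image-tool.sh -r registry.example.com/myuser -t v3.5.1 push

# Submit a job, telling Kubernetes which image to use via the
# spark.kubernetes.container.image setting. The API server URL can be found
# with `kubectl cluster-info`.
./bin/spark-submit \
  --master k8s://https://<k8s-api-server>:443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.container.image=registry.example.com/myuser/spark:v3.5.1 \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.5.1.jar
```

Is this the right general shape for DOKS, or is there a DigitalOcean-specific step I am missing?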

These answers are provided by our Community.

  1. Looks like there may be some solid info here on getting it started on Kubernetes:
  2. Here's a segment on using Docker images specifically:
  3. Any images deployed in DOKS would need to be pushed to a registry that is publicly accessible.
  4. This sounds like you may need to scp that file over to a Droplet or to a pod that has a persistent volume attached, to ensure it persists.
  5. I am not familiar with Parquet, but chances are that if you can run it on Linux you can run it on DO.
  6. You can use persistent storage via our Block Storage product, which is integrated into DOKS. You can find documentation on using Block Storage to persist data in your pods.
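On points 4 and 6: in DOKS a Block Storage Volume can be requested directly from Kubernetes through the built-in `do-block-storage` StorageClass, and the Volume outlives the individual pods that mount it. A minimal sketch of such a claim, with the name and size as placeholders:

```yaml
# Sketch: a PersistentVolumeClaim backed by a DigitalOcean Block Storage
# Volume via DOKS's built-in StorageClass. Name and size are placeholders.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: spark-results
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: do-block-storage
  resources:
    requests:
      storage: 200Gi
```

Note that by default the underlying Volume is deleted together with its claim, so for results that must outlive the cluster itself, copying them to Spaces is the safer option.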
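On point 5: Parquet is just a columnar file format that Spark reads and writes natively, so it works anywhere Spark runs. A common pattern is to keep Parquet output in DigitalOcean Spaces (S3-compatible object storage) through Hadoop's s3a connector. A rough sketch, assuming the `hadoop-aws` jars are on the Spark classpath; the endpoint, bucket, and credential values are placeholders:

```python
# Sketch: reading a large CSV and writing the result back as Parquet to
# DigitalOcean Spaces via the s3a connector. Endpoint, bucket names, and
# credentials below are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("parquet-to-spaces")
    .config("spark.hadoop.fs.s3a.endpoint", "https://nyc3.digitaloceanspaces.com")
    .config("spark.hadoop.fs.s3a.access.key", "SPACES_ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "SPACES_SECRET_KEY")
    .getOrCreate()
)

df = spark.read.csv("s3a://my-space/input/large-file.csv", header=True)
df.write.mode("overwrite").parquet("s3a://my-space/output/results.parquet")
```

This also answers part of point 4: uploading the 100 GB CSV to Spaces once and reading it via `s3a://` avoids copying it into every pod.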

Hope this helps!