Need to convert 1000 images in 5 seconds-- what are our options?

February 26, 2017
Clustering Ubuntu 16.04

Hi, we're looking for a server architecture that will allow for converting 1000 images in 5 seconds. As a test, we ran some benchmarks using a 48GB (16 core) droplet, using GNU Parallel to run 1,000 image conversions.

Sample command:

ls -1 *.pdf | parallel --eta convert {} {.}.png

Each image takes around 1.0 second to convert, and with 16 cores running at 100% (monitored via htop), we were able to render all 1000 images in about 60 seconds.

We'd like to, someday (as budget allows), get this down to 5 seconds. We obviously need more droplets working in a distributed environment-- we just don't know where to start. We'd like to stay with DigitalOcean, as we love the simplicity, value and performance.

What sort of server architecture, applications, tools, services, technologies, etc. would you suggest we look into?

Thank you for your help.

1 Answer

@stevetenuto

The first question I'd ask is why the 5-second mark is that important. I ask because you've not provided much to go on as far as what your use case is and why it requires such quick processing time.

When it comes to image processing, and the speed at which images are processed, CPU is going to be the limiting factor. The size of the file being processed matters -- larger files take more CPU time, so the time per file varies and is relatively difficult to plan around.

If, for example, we knew every single PDF was 5MB and that one Droplet could process 250 of them in 5 seconds, we'd then know that we could quickly spin up 4x 48GB Droplets and offload 250 images to each one. That would be a very special case, though, and not all that realistic, as I'm sure the PDFs aren't 100% identical.

What We Know

As per your test, you're processing 1,000 images in 60 seconds, or one image every 0.06 seconds.

To put conversion into perspective, at the rate from the above test:

1,000 images = 60 seconds
  500 images = 30 seconds
  250 images = 15 seconds
  125 images = 7.5 seconds
 62.5 images = 3.75 seconds

Images per Second

That means you should be able to do ~83 images in 5 seconds (4.98 seconds to be exact) on that Droplet w/ 16 CPU cores, which equates to offloading the images to roughly 12 Droplets of the same spec, all working at the same time, to reach your desired 5-second processing time.
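As a rough sketch of that arithmetic (the function names are mine, not from any tool, and it assumes every Droplet performs like the one in your test and that the work splits evenly):

```python
import math


def droplets_needed(batch_seconds_on_one: float, target_seconds: float) -> int:
    """Throughput-based estimate: identical Droplets needed to hit the target."""
    return math.ceil(batch_seconds_on_one / target_seconds)


def slowest_droplet_seconds(total_images: int, droplets: int,
                            seconds_per_image: float) -> float:
    """Time taken by the Droplet that ends up with the largest share of images."""
    largest_share = math.ceil(total_images / droplets)
    return largest_share * seconds_per_image


# Figures from the benchmark: 1,000 images in 60 seconds on one 16-core Droplet.
count = droplets_needed(batch_seconds_on_one=60, target_seconds=5)
print(count)  # 12
print(slowest_droplet_seconds(1000, count, 0.06))      # a little over 5 seconds
print(slowest_droplet_seconds(1000, count + 1, 0.06))  # comfortably under 5 seconds
```

Since 1,000 doesn't divide evenly by 12, some Droplets end up with 84 images rather than 83, which lands just past the 5-second mark. A 13th Droplet would give a little headroom.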

This, however, doesn't include the time it'll take to pull the converted images down from these servers, back to either your location or your main Droplet, so you'd need to factor that transfer time in as well.
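To get a feel for how much the transfer can eat into the budget, here's a minimal sketch. The file size and link speed are made-up numbers purely for illustration, and it assumes the pull only starts once conversion has finished and that the receiving link is the bottleneck:

```python
import math


def transfer_seconds(file_count: int, avg_bytes: int,
                     link_bits_per_second: float) -> float:
    """Time to pull all output files over a single receiving link."""
    total_bits = file_count * avg_bytes * 8
    return total_bits / link_bits_per_second


def droplets_within_budget(batch_seconds_on_one: float, target_seconds: float,
                           pull_seconds: float) -> int:
    """Droplets needed once the transfer time is taken out of the target."""
    compute_budget = target_seconds - pull_seconds
    if compute_budget <= 0:
        raise ValueError("transfer alone exceeds the target time")
    return math.ceil(batch_seconds_on_one / compute_budget)


# Hypothetical: 1,000 PNGs averaging 500 KB each, pulled over a 1 Gbit/s link.
pull = transfer_seconds(1000, 500_000, 1_000_000_000)
print(pull)                                 # 4.0
print(droplets_within_budget(60, 5, pull))  # 60
```

With those invented numbers the transfer takes most of the 5 seconds, which is why it's worth measuring your real output sizes before sizing the cluster.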

Alternative to Convert

You can install poppler-utils and use pdftoppm to convert. It writes one image per page and has no option to combine the pages, so if a PDF has more than one page, you'd need to pull multiple images down once the conversion process is complete.
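If you want to try pdftoppm, something along these lines could drive it. This is only a sketch: the helper names, the 150 DPI default and the 16-worker default are my assumptions, and it expects pdftoppm to be on the PATH:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def pdftoppm_command(pdf: Path, dpi: int = 150) -> list[str]:
    """Build the command; pdftoppm names its output <prefix>-<page>.png."""
    prefix = pdf.with_suffix("")
    return ["pdftoppm", "-png", "-r", str(dpi), str(pdf), str(prefix)]


def convert(pdf: Path, dpi: int = 150) -> list[Path]:
    """Convert one PDF and return the per-page PNGs found next to it."""
    subprocess.run(pdftoppm_command(pdf, dpi), check=True)
    return sorted(pdf.parent.glob(f"{pdf.stem}-*.png"))


def convert_all(folder: Path, workers: int = 16) -> dict[Path, list[Path]]:
    """Run conversions side by side; each one is its own pdftoppm process."""
    pdfs = sorted(folder.glob("*.pdf"))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(pdfs, pool.map(convert, pdfs)))
```

Calling convert_all(Path(".")) would convert every PDF in the current directory and give you back a mapping of each PDF to its page images, which also tells you exactly which files need pulling down afterwards.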

...

That being said, converting images at that speed is going to need multiple servers, and actually reaching that speed is going to depend on numerous factors (file size being one of the main ones). The good news is that you can keep costs in line by using the DigitalOcean API to spawn the Droplets for processing.

Connecting to the API, you can deploy on-demand, run your tasks, and then destroy the Droplet so you're not incurring recurring costs by keeping the Droplets alive for the entire month.
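As a minimal sketch of that create-then-destroy cycle against the v2 API, using only the Python standard library. The helper names are hypothetical, and the region, size and image slugs are left as parameters because you'd need to look up the values that are valid for your account:

```python
import json
import urllib.request

API = "https://api.digitalocean.com/v2/droplets"


def create_droplet_request(token: str, name: str, region: str,
                           size: str, image: str) -> urllib.request.Request:
    """Build the POST request that asks for a new Droplet."""
    body = json.dumps(
        {"name": name, "region": region, "size": size, "image": image}
    ).encode()
    return urllib.request.Request(
        API,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def destroy_droplet_request(token: str, droplet_id: int) -> urllib.request.Request:
    """Build the DELETE request that destroys a Droplet by its ID."""
    return urllib.request.Request(
        f"{API}/{droplet_id}",
        method="DELETE",
        headers={"Authorization": f"Bearer {token}"},
    )


def send(request: urllib.request.Request) -> dict:
    """Send a request and decode the JSON reply (empty for a DELETE)."""
    with urllib.request.urlopen(request) as response:
        raw = response.read()
        return json.loads(raw) if raw else {}
```

You'd pass the result of create_droplet_request(...) to send(), read the new Droplet's ID from the "droplet" object in the reply, run your conversions, and then send destroy_droplet_request(token, droplet_id) once the images have been pulled down.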

As far as how you'd connect, it really depends. You can access the API using any programming language you'd like -- PHP, NodeJS, Ruby, etc.

...

How you'd set something like this up to achieve the desired results depends on how you need to do this, so knowing more would be ideal, as right now suggestions are only that :-).
