The object storage service Amazon Simple Storage Service (Amazon S3) is a foundational storage building block powering a variety of workloads, from asset backup and serving to analytics and machine learning. In this blog post, we describe how to find an S3 download configuration optimized for your specific scenario in minutes using Syne Tune, a distributed parameter search library open sourced by AWS in November 2021.

Amazon S3 is used in machine learning (ML) to store training datasets and trained model artifacts in a scalable and cost-effective fashion. In ML workloads, Amazon Elastic Compute Cloud (Amazon EC2) virtual machines complement Amazon S3 storage by providing the compute necessary to train models and to serve model inferences. Fast downloads from Amazon S3 to Amazon EC2 virtual machines are important: during the training phase, they improve cost efficiency and speed, and during the inference phase, they give the serving fleet the agility needed for auto-scaling and model replacements. The download_file method from the AWS SDK for Python (Boto3) can be used to download files from S3 to EC2. download_file has numerous parameters that can influence download speed. Understanding and tuning those parameters requires both expert knowledge and a significant experimentation budget.

Syne Tune implements several popular parameter search techniques, such as random search, population-based training, Hyperband (Li et al.), Mobster (Klein et al.), and Bayesian optimization, in a use-case agnostic fashion. It can run tuning experiments locally or on distributed, remote workers. In our specific case, we use the Syne Tune Bayesian optimization scheduler to navigate a 4-dimensional search space and reduce the download time between S3 and EC2.

The challenge: exploring download_file parameters for reduced download time

The AWS SDK for Python (Boto3) provides the download_file function to download files from Amazon S3 storage to a local client, which is an Amazon EC2 instance in our example.

A screenshot of the boto3 manual entry for download_file, showing parameters and an example usage case

The function can be configured with a TransferConfig, itself parametrized with the eight parameters below (definitions from Boto3 documentation):

  1. multipart_threshold: The transfer size threshold for which multipart uploads, downloads, and copies will automatically be triggered.
  2. max_concurrency: The maximum number of threads that will be making requests to perform a transfer. If use_threads is set to False, the value provided is ignored and the transfer will only use the main thread.
  3. multipart_chunksize: The partition size of each part for a multipart transfer.
  4. num_download_attempts: The number of download attempts that will be retried upon errors with downloading an object in Amazon S3. Note that these retries account for errors that occur when streaming down the data from S3 (i.e. socket errors and read timeouts that occur after receiving an OK response from S3). Other retryable exceptions such as throttling errors and 5xx errors are already retried by botocore (this default is 5). This does not take into account the number of exceptions retried by botocore.
  5. max_io_queue: The maximum amount of read parts that can be queued in memory to be written for a download. The size of each of these read parts is at most the size of io_chunksize.
  6. io_chunksize: The maximum size of each chunk in the io queue. Currently, this is size used when read is called on the downloaded stream as well.
  7. use_threads: If True, threads will be used when performing Amazon S3 transfers. If False, no threads will be used in performing transfers; all logic will be run in the main thread.
  8. max_bandwidth: The maximum bandwidth that will be consumed in uploading and downloading file content. The value is an integer in terms of bytes per second.
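
As a rough illustration of how these parameters are passed in practice, the sketch below merges a few overrides into the documented defaults, builds a TransferConfig, and hands it to download_file. The helper names (BOTO3_DEFAULTS, merge_overrides, download) are hypothetical and are not part of Boto3 or of the sample repository. The default values are taken from the Boto3 documentation as we understand it and may differ between versions.

```python
KB, MB = 1024, 1024 ** 2

# TransferConfig defaults as documented by Boto3 (assumption: verify for your version).
BOTO3_DEFAULTS = {
    "multipart_threshold": 8 * MB,
    "max_concurrency": 10,
    "multipart_chunksize": 8 * MB,
    "num_download_attempts": 5,
    "max_io_queue": 100,
    "io_chunksize": 256 * KB,
    "use_threads": True,
    "max_bandwidth": None,
}


def merge_overrides(**overrides):
    """Return the defaults updated with overrides, rejecting unknown parameter names."""
    unknown = set(overrides) - set(BOTO3_DEFAULTS)
    if unknown:
        raise ValueError(f"Unknown TransferConfig parameters: {sorted(unknown)}")
    return {**BOTO3_DEFAULTS, **overrides}


def download(bucket, key, filename, **overrides):
    """Download s3://bucket/key to filename using a customized TransferConfig."""
    # Imported here so that merge_overrides can be used without Boto3 installed.
    import boto3
    from boto3.s3.transfer import TransferConfig

    config = TransferConfig(**merge_overrides(**overrides))
    boto3.client("s3").download_file(bucket, key, filename, Config=config)
```

A call could then look like download("my-bucket", "big-file.bin", "/tmp/big-file.bin", max_concurrency=32), where the bucket, key, and path are placeholders.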

In this example, we leave use_threads=True and don’t specify a max_bandwidth, and try to find a set of values for the four parameters max_concurrency, multipart_chunksize, max_io_queue and io_chunksize that reduces the download time versus the default setting. Finding appropriate values for those parameters can take time; for example, if we want to test 10 values for each of the four remaining integer parameters, that represents a total of 10,000 downloads to test.
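
To make the cost of an exhaustive search concrete, here is a small back-of-the-envelope sketch. The candidate values are placeholders (only their count matters), and the estimate assumes that every download takes about as long as the 8.3-second default-configuration download reported later in this post, which is a simplification.

```python
from itertools import product

PARAMETERS = ["max_concurrency", "multipart_chunksize", "max_io_queue", "io_chunksize"]
VALUES_PER_PARAMETER = 10

# Placeholder candidates: only the number of values per parameter matters here.
grid = {name: range(VALUES_PER_PARAMETER) for name in PARAMETERS}

# Count every combination of the four parameters.
n_configs = sum(1 for _ in product(*grid.values()))

SECONDS_PER_DOWNLOAD = 8.3  # default-configuration timing, used as a rough proxy
REPETITIONS = 3             # downloads averaged per configuration in the sample code

hours_single = n_configs * SECONDS_PER_DOWNLOAD / 3600
hours_averaged = hours_single * REPETITIONS

print(f"{n_configs} configurations")
print(f"~{hours_single:.0f} h with one download each, "
      f"~{hours_averaged:.0f} h with {REPETITIONS} downloads each")
```

Under these assumptions, a full grid takes on the order of a day even with a single download per configuration, and several days once each configuration is averaged over three downloads.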

The solution: Syne Tune and Bayesian optimization

Instead of exhaustively testing all possible combinations in the search space, we use a machine learning model to explore the parameter space in an efficient manner. We use the Bayesian optimization implemented in Syne Tune, which fits a probabilistic model of the objective (here, the download time) over the parameter space from the results observed so far, and uses that model to choose which configurations to try next. The experiment described in this blog post is open-sourced on the AWS Samples GitHub organization: https://github.com/aws-samples/syne-tune-s3-transfer

Implementation

To run an optimization with Syne Tune, two scripts are needed:

1. Single-experiment code: We write our Boto3 download attempt in a script, which reports the download time to Syne Tune using its Reporter class. In our case we call this experiment script experiment.py. Note that in the experiment script we average the download time of N downloads (by default N=3 in our sample code) to reduce the impact of outliers caused by network jitter.

2. Tuning orchestration code: We author a launcher.py script that contains the tuning orchestration code. We expose a number of script arguments to parametrize the search. The orchestration and experiment code are connected via the syne_tune.tuner.Tuner class, which launches local experiments through a custom entry point script, experiment.py in our case:

...
from syne_tune.backend.local_backend import LocalBackend
from syne_tune.tuner import Tuner
...
backend = LocalBackend(entry_point="experiment.py")
tuner = Tuner(
    backend=backend,
    scheduler=...,
    stop_criterion=...,
    n_workers=...,
)
...

The experiment and orchestration code are connected together via the backend parameter in the Tuner class
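
The sample repository contains the actual experiment.py. The following is only a minimal sketch of what such a script could look like, assuming that the backend passes each hyperparameter to the script as a --name value command-line argument and that Reporter can be imported from the top-level syne_tune package (the import path may differ between versions). The function names and the bucket, key, and filename arguments are hypothetical.

```python
import argparse
import time

TUNED_PARAMETERS = ("max_concurrency", "max_io_queue", "io_chunksize", "multipart_chunksize")


def average_duration(download_fn, repetitions=3):
    """Call download_fn `repetitions` times and return the mean wall-clock time in seconds."""
    durations = []
    for _ in range(repetitions):
        start = time.perf_counter()
        download_fn()
        durations.append(time.perf_counter() - start)
    return sum(durations) / len(durations)


def parse_args(argv=None):
    """Read the four tuned parameters from the command line as integers."""
    parser = argparse.ArgumentParser()
    for name in TUNED_PARAMETERS:
        parser.add_argument(f"--{name}", type=int, required=True)
    # Ignore any extra arguments that the backend may append.
    args, _unknown = parser.parse_known_args(argv)
    return args


def main(bucket, key, filename, argv=None):
    # Third-party imports are local so the helpers above work without AWS dependencies.
    import boto3
    from boto3.s3.transfer import TransferConfig
    from syne_tune import Reporter

    args = parse_args(argv)
    config = TransferConfig(use_threads=True, **vars(args))
    client = boto3.client("s3")

    # Each repetition overwrites the same local file.
    avg = average_duration(
        lambda: client.download_file(bucket, key, filename, Config=config)
    )
    # The metric name must match the `metric` given to the scheduler.
    Reporter()(download_time=avg)
```

Passing the download as a callable keeps the timing logic independent of Boto3, which makes it easy to check with a stub before running against S3.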

Results

We created a synthetic 3GiB file using fallocate -l 3.0GiB $file_name.

We conduct our download time optimization between Amazon S3 and an Amazon SageMaker-managed EC2 m5d.12xlarge instance, in the AWS Paris region. First, we download with the default configuration, which takes 8.3 seconds (averaged over three downloads).

from pathlib import Path
import tempfile
import time

import boto3

# bucket, file_name and file_path are assumed to be defined earlier
s3 = boto3.resource('s3')

durations = []
for _ in range(3):
    # write with tempfile to avoid overwrites
    with tempfile.TemporaryDirectory(dir=file_path) as local_path:
        t1 = time.time()
        s3.Object(bucket_name=bucket, key=file_name).download_file(
            Filename=str(Path(local_path) / file_name)
        )
        duration = time.time() - t1
        print(duration)
        durations.append(duration)

print(f"avg: {sum(durations)/len(durations)}")

We then launch the Syne Tune tuner script.

A screenshot showing the launcher python script, with options, being run

After only 125 seconds (about 2 minutes) of experimentation, Syne Tune finds a configuration that completes the download in only 4.9 seconds, a 41% improvement over the default, found in less time than it took you to read this blog post. For reference, below is the associated transfer configuration for our 3GiB file transfer to the m5d.12xlarge instance in the AWS Paris region.

max_concurrency=339
max_io_queue=122
io_chunksize=15096201
multipart_chunksize=400619332

Syne Tune can plot the best result found over time.

A visualization from Syne Tune showing the results over time in a graph, along with the Python code to generate the graph
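
A best-result-over-time curve is simply a running minimum of the reported metric. The sketch below computes it from a list of per-trial download times using only the standard library. The numbers are made up for illustration and are not measurements from this experiment.

```python
from itertools import accumulate

# Made-up per-trial download times in seconds, in completion order.
trial_times = [8.0, 9.5, 6.5, 7.0, 5.0]

# Running minimum: the best download time seen after each trial.
best_so_far = list(accumulate(trial_times, min))

print(best_so_far)
```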

The Syne Tune SDK provides all the experimental data in a Pandas dataframe.

The experimental data displayed in a Pandas dataframe, along with the Python code to generate it

Note that if Syne Tune runs longer, it may find better configurations.

Starting from a baseline with points_to_evaluate

Instead of starting the search with a random sample within the parameter space, we can assume that the Boto3 defaults are reasonably strong and have the Bayesian optimization start from them. The optimization then begins from our baseline, and the tuner tries to beat it from the second experiment on. To specify which points to evaluate first, use the Syne Tune scheduler parameter points_to_evaluate, as illustrated in the snippet below.

# Syne Tune lets us specify a search starting point. We use the Boto3 defaults
baseline = [{
    "max_concurrency": 10,
    "max_io_queue": 100,
    "io_chunksize": 262144,
    "multipart_chunksize": 8388608
}]

# We will run the various experiments on the local machine (same as the tuner)
backend = LocalBackend(entry_point='experiment.py')

scheduler = FIFOScheduler(
    config_space,
    searcher="bayesopt",
    mode="min",
    metric="download_time",
    points_to_evaluate=baseline
)
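
The snippet above references a config_space that is not shown. A minimal sketch of what it could look like follows, assuming the randint domain from syne_tune.config_space. The bounds are hypothetical: the post does not state the ranges that were searched, and these were picked only so that both the Boto3 defaults and the configuration reported in the Results section fall inside them.

```python
KB, MB = 1024, 1024 ** 2

# Hypothetical (lower, upper) bounds per parameter; not the ranges used in the post.
BOUNDS = {
    "max_concurrency": (1, 512),
    "max_io_queue": (1, 1024),
    "io_chunksize": (64 * KB, 32 * MB),
    "multipart_chunksize": (1 * MB, 512 * MB),
}


def in_bounds(config, bounds=BOUNDS):
    """Check that every value in `config` lies within its (lower, upper) bounds."""
    return all(bounds[name][0] <= value <= bounds[name][1] for name, value in config.items())


def build_config_space(bounds=BOUNDS):
    """Turn the bounds into a Syne Tune search space of integer domains."""
    # Imported here so that BOUNDS and in_bounds can be used without Syne Tune installed.
    from syne_tune.config_space import randint

    return {name: randint(lower, upper) for name, (lower, upper) in bounds.items()}
```

Checking the baseline with in_bounds before passing it as points_to_evaluate is a cheap way to catch a starting point that lies outside the search space.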

Note that initializing with the Boto3 defaults does not lead to a faster search in this specific example. We found that it accelerated the search in other, longer-running experiments with bigger files. We encourage you to experiment with the various Syne Tune options based on your optimization use case!

Conclusion

Amazon S3 and Amazon EC2 form the building blocks of several complex data processing systems. The efficient and timely movement of data within these complex systems will reduce costs and time-to-results. Parameter search libraries, such as Syne Tune, provide a pragmatic and efficient way to test and tune the movement of data in these workloads. The code sample shows how the AWS-released Syne Tune open source library can search and learn a configuration for S3 Transfer that is optimized for your use case. You can learn more about Syne Tune in its launch blog.