- You can get resumable uploads and don't have to worry about the high-stakes upload of a 5GB file that might fail after 4.9GB. Instead, you can upload in parts and know that all of the parts that have successfully uploaded are there patiently waiting for the rest of the bytes to make it to S3.
- You can parallelize your upload operation. So, not only can you break your 5GB file into 1000 5MB chunks, you can also run 20 uploader processes and get much better overall throughput to S3, as sketched below.
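To illustrate that second point, here is a rough sketch of one way to fan the part uploads out across worker processes with the multiprocessing module. The bucket name, key name, worker count, and chunk file names are placeholders invented for the example; the idea of re-attaching to an existing upload by setting `key_name` and `id` on a fresh MultiPartUpload object is just one way to let each worker contribute parts to the same transaction.

```python
# Sketch: parallel part uploads with multiprocessing.
# 'my-bucket', 'bigfile.bin', and the chunk names are placeholders;
# the chunks would come from something like "split -b5m bigfile.bin".
import boto
from boto.s3.multipart import MultiPartUpload
from multiprocessing import Pool

BUCKET = 'my-bucket'
KEY_NAME = 'bigfile.bin'

def upload_part(args):
    upload_id, part_num, filename = args
    # Each worker makes its own connection and re-attaches to the
    # same MultiPart Upload via the transaction ID.
    b = boto.connect_s3().lookup(BUCKET)
    mp = MultiPartUpload(b)
    mp.key_name = KEY_NAME
    mp.id = upload_id
    fp = open(filename, 'rb')
    mp.upload_part_from_file(fp, part_num)
    fp.close()

if __name__ == '__main__':
    b = boto.connect_s3().lookup(BUCKET)
    mp = b.initiate_multipart_upload(KEY_NAME)
    chunks = ['xaa', 'xab', 'xac', 'xad']           # output of split
    work = [(mp.id, i + 1, name) for i, name in enumerate(chunks)]
    Pool(processes=4).map(upload_part, work)        # 4 uploads in flight
    mp.complete_upload()
```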
Below is a transcript from an interactive IPython session that exercises the new features. Below that is a line-by-line commentary of what's going on.
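As a compact stand-in for the interactive session, here is a minimal sketch of the same sequence of calls in plain Python. The bucket name is made up, and the chunk file names (xaa through xad) are simply split's default output names; everything else mirrors the steps described in the commentary.

```python
# Sketch of the session described in the commentary below.
# 'mybucket' is a placeholder; xaa..xad are split's default
# output file names for "split -b5m test.pdf".
import boto

c = boto.connect_s3()                        # connect to S3
b = c.lookup('mybucket')                     # look up an existing bucket
mp = b.initiate_multipart_upload('test.pdf') # start the MultiPart Upload
print(mp.id)                                 # transaction ID assigned by S3

# upload the chunks produced by "split -b5m test.pdf"
for i, chunk in enumerate(['xaa', 'xab', 'xac', 'xad']):
    fp = open(chunk, 'rb')
    mp.upload_part_from_file(fp, i + 1)      # part numbers start at 1
    fp.close()

# list the parts that are now sitting in S3 for this key_name
for part in mp:
    print('%d %d %s' % (part.part_number, part.size, part.etag))

mp.complete_upload()                         # or mp.cancel_upload() to abort
```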
- Self-explanatory, I hope 8^)
- We create a connection to the S3 service and assign it to the variable `c`.
- We look up an existing bucket in S3 and assign that to the variable `b`.
- We initiate a MultiPart Upload to bucket `b`. We pass in the `key_name`. This `key_name` will be the name of the object in S3 once all of the parts are uploaded. This creates a new instance of a MultiPartUpload object and assigns it to the variable `mp`.
- You might want to do a bit of exploration of the new object. In particular, it has an attribute called `id` which is the upload transaction ID assigned by S3. This transaction ID must accompany all subsequent requests related to this MultiPart Upload.
- I open a local file. In this case, I had a 17MB PDF file. I split that into 5MB chunks using the split command (`split -b5m test.pdf`). This creates three 5MB chunks and one smaller chunk with the leftovers. You can use larger chunk sizes if you want, but 5MB is the minimum size (except for the last chunk, of course).
- I upload this chunk to S3 using the `upload_part_from_file` method of the MultiPartUpload object.
- Close the file pointer.
- Open the file for the second chunk.
- Upload it.
- Close it.
- Open the file for the third chunk.
- Upload it.
- Close it.
- Open the file for the fourth and final chunk (the small one).
- Upload it.
- Close it.
- I can now examine all of the parts that are currently uploaded to S3 related to this `key_name`. As you can see, I can use the MultiPartUpload object as an iterator and, in doing so, the generator object handles any pagination of results from S3 automatically. Each object in the list is an instance of the Part class and, as you can see, has attributes such as `part_number`, `size`, and `etag`.
- Now that the last part has been uploaded, I can complete the MultiPart Upload transaction by calling the `complete_upload` method of the MultiPartUpload object. If, on the other hand, I wanted to cancel the operation, I could call `cancel_upload` and all of the parts that had been uploaded would be deleted from S3 (see the sketch below).
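Because the parts persist until the transaction is completed or cancelled, you can also come back in a later session, find any outstanding uploads, and either finish them or clean them up. A hedged sketch of that, again with a made-up bucket name and assuming the bucket's `get_all_multipart_uploads` listing is available in your boto version:

```python
# Sketch: revisiting unfinished MultiPart Uploads in a later session.
# 'mybucket' is a placeholder bucket name.
import boto

b = boto.connect_s3().lookup('mybucket')
# uploads that were never completed or cancelled are still listed
for mp in b.get_all_multipart_uploads():
    print('%s %s' % (mp.key_name, mp.id))
    for part in mp:                          # parts already safely stored
        print('  part %d (%d bytes)' % (part.part_number, part.size))
    # upload any missing parts with mp.upload_part_from_file(...), then
    # either finish the transaction or throw the parts away:
    # mp.complete_upload()  or  mp.cancel_upload()
```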