Fast S3 listing for buckets with millions/billions of Objects

jboothomas
2 min readJul 19, 2023

Intro

With a steadily increasing number of businesses leveraging object-based storage services, the amount of data stored in S3 buckets has been exploding. These repositories often contain millions or even billions of items, making it an uphill task to quickly list and manage S3 objects without some prefix rules in place. This blog is about PS3, a tool written in Go that can list S3 objects quickly when you have no knowledge of the prefixes.

In order to list objects in a performant manner we need to parallelise our operations, and the only way to distribute an S3 listing operation is to provide a prefix. Given a list of prefixes we could parallelise the calls and retrieve all the objects in a short amount of time, but, as stated, we do not know the prefixes in advance.
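To illustrate the parallelisation idea, here is a minimal Go sketch that fans one goroutine out per known prefix and merges the results. The `lister` interface and `memBucket` stand-in are hypothetical illustrations (the real tool would issue paged `ListObjectsV2` calls via the AWS SDK), not PS3's actual API:

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// lister abstracts one S3 LIST call scoped to a prefix
// (hypothetical; the real tool uses the AWS SDK).
type lister interface {
	list(prefix string) []string
}

// memBucket is an in-memory stand-in bucket for illustration.
type memBucket struct{ keys []string }

func (b memBucket) list(prefix string) []string {
	var out []string
	for _, k := range b.keys {
		if strings.HasPrefix(k, prefix) {
			out = append(out, k)
		}
	}
	return out
}

// listParallel launches one goroutine per prefix and merges
// the per-prefix results under a mutex.
func listParallel(b lister, prefixes []string) []string {
	var (
		mu  sync.Mutex
		wg  sync.WaitGroup
		all []string
	)
	for _, p := range prefixes {
		wg.Add(1)
		go func(p string) {
			defer wg.Done()
			keys := b.list(p)
			mu.Lock()
			all = append(all, keys...)
			mu.Unlock()
		}(p)
	}
	wg.Wait()
	return all
}

func main() {
	b := memBucket{keys: []string{"a1", "a2", "b1", "c1"}}
	keys := listParallel(b, []string{"a", "b", "c"})
	fmt.Println(len(keys)) // 4
}
```

With disjoint prefixes each goroutine lists an independent slice of the keyspace, which is what makes the overall listing time drop so sharply.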

Solving the unknown prefix problem

In order to discover the prefixes I apply a simple ‘brute force’ methodology. For each character in a list I check whether any object keys start with that character, for example a*. If the response spans more than one page, it is considered a large-item prefix and is resubmitted for checking with each character of the list appended, for example aa*, ab*, ac*, and so on. Prefixes returning just one page are processed directly, and prefixes returning multiple pages are also checked to see whether the prefix itself exists as an object key, for example object key ‘abc’ alongside object prefixes ‘abc1*’ and ‘abc2*’.
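The discovery step above can be sketched as a recursive walk over a prefix tree. This is a simplified Go illustration against an in-memory key set, not PS3's actual implementation: the page size is shrunk to 2 to make truncation visible (real S3 pages hold up to 1,000 keys), and the character list is a tiny assumed alphabet:

```go
package main

import (
	"fmt"
	"strings"
)

// pageSize is shrunk for the demo; a real S3 LIST page
// returns up to 1000 keys.
const pageSize = 2

// charset is the assumed alphabet tried at each position;
// the real tool's character list may differ.
var charset = strings.Split("ab01", "")

// listPage mimics one paged LIST call: up to pageSize keys
// under prefix, plus whether more pages remain.
func listPage(keys []string, prefix string) (page []string, truncated bool) {
	for _, k := range keys {
		if strings.HasPrefix(k, prefix) {
			page = append(page, k)
			if len(page) == pageSize {
				return page, true // more results: drill down
			}
		}
	}
	return page, false
}

// discover collects single-page prefixes directly and expands
// multi-page prefixes one character at a time.
func discover(keys []string, prefix string, out *[]string) {
	page, truncated := listPage(keys, prefix)
	if !truncated {
		*out = append(*out, page...)
		return
	}
	// The prefix itself may be a full object key (key "abc"
	// alongside "abc1…" keys): record it before recursing.
	for _, k := range page {
		if k == prefix {
			*out = append(*out, k)
			break
		}
	}
	for _, c := range charset {
		discover(keys, prefix+c, out)
	}
}

func main() {
	keys := []string{"a", "aa1", "ab1", "b1"}
	var got []string
	discover(keys, "", &got)
	fmt.Println(got) // [a aa1 ab1 b1]
}
```

Note how the key “a” is recorded by the prefix-as-object check before the walk descends into “aa*” and “ab*”, which is exactly the edge case described above.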

Results compared to AWS s3api and s5cmd

For a local storage array S3 bucket containing 15 million objects:

  • PS3 (Go tool): 160 seconds
  • aws s3api: 1110 seconds
  • s5cmd: 733 seconds

Conclusion

PS3, the fast object-listing tool using brute-force prefix discovery, can be obtained from my GitHub repository, which also hosts the latest compiled binary for Linux amd64.
