S3, or Simple Storage Service, is one of the first services introduced by AWS. Together with Glacier, it is the core object storage offering in AWS. It provides secure, durable, and highly scalable cloud storage through a simple web-service interface that can be used to store or retrieve any amount of data. As a plus, you only pay for the storage you actually use.
Many AWS services depend on S3 as their target storage; Amazon Elastic MapReduce (EMR) and Amazon Kinesis are a few. It is also used as the storage for Amazon Elastic Block Store (EBS) snapshots and Amazon Relational Database Service snapshots, and as a data staging or loading mechanism for Amazon Redshift and DynamoDB.
Among its many uses, S3 can serve as a backup and archive facility for on-premises data. It can also store content, media, software, and data for distribution. It is useful for big-data analytics, static website hosting, and cloud-native mobile and internet application hosting, as well as disaster recovery.
S3 offers 3 storage classes:
- General Purpose (Standard)
- Infrequent Access (Standard-IA)
- Archive (Amazon Glacier)
By using lifecycle policies, you can automatically migrate stored objects to the appropriate storage class. For example, if you are the administrator of a social media website, you could write a lifecycle policy that archives all profile pictures older than 12 months. (This may not be the best example, but it shows what lifecycle policies are all about.)
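As a sketch of what such a rule looks like, here is a hypothetical lifecycle configuration in the JSON shape the S3 API expects, built as a Python dict. The prefix, rule ID, and 365-day threshold are all made-up examples:

```python
def profile_picture_lifecycle(prefix="profile-pictures/", days=365):
    """Build a lifecycle configuration in the shape the S3 API expects.

    Hypothetical example: transition objects under `prefix` to Glacier
    `days` days after creation.
    """
    return {
        "Rules": [
            {
                "ID": "archive-old-profile-pictures",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                # Move matching objects to the GLACIER storage class.
                "Transitions": [{"Days": days, "StorageClass": "GLACIER"}],
            }
        ]
    }

config = profile_picture_lifecycle()
print(config["Rules"][0]["Transitions"][0])
```

A document like this would be attached to a bucket (for example via the S3 `PutBucketLifecycleConfiguration` API); S3 then applies the transition automatically in the background.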
S3 also provides a rich set of permission, access-control, and encryption options, which we'll visit later in this post.
On the other hand, Amazon Glacier provides very cheap data archiving and backup, used to store 'cold data', i.e. rarely accessed data. The retrieval time is usually 3 to 5 hours, and Glacier can be used as a storage class for Amazon S3.
Traditional storage options
In traditional IT environments, two kinds of storage are used:
- Block storage – operates at a lower level (the raw storage device level); data is stored as bits/bytes in fixed-size blocks.
- File storage – operates at a higher level, e.g. the OS level; manages data as a hierarchy of folders and files.
S3, on the other hand, provides object storage in the cloud. It runs independently of any servers and is accessed directly over the internet via an API. An object consists of both the data stored and its metadata. Objects are stored in a bucket, and each must have a unique user-specified key. A bucket can hold an unlimited number of objects, and an object can range from 0 bytes to 5 TB in size. Data is automatically replicated across multiple devices in multiple facilities (Availability Zones) within a region. If the request rate on S3 goes up, it can partition buckets to support the higher rate, making it highly scalable.
A bucket is a container, like a web folder, for objects (files). Bucket names are global: if you have named your bucket 'thehungryfatcoder', no one else in the world can create a bucket with that name while yours exists.
A bucket name can contain up to 63 lowercase letters, numbers, hyphens, or periods. By default you can have up to 100 buckets per account; if you need more, you can raise the limit to a maximum of 1,000 buckets by submitting a service limit increase. Data in an S3 bucket is stored within a single region, and you need to explicitly copy it to another region if required.
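A rough sketch of those naming rules as a validator. Note this is a simplification: AWS applies further restrictions (for example, names must not be formatted like IP addresses), so treat the regex as illustrative:

```python
import re

# 3-63 characters; lowercase letters, digits, hyphens, periods;
# must start and end with a letter or digit. (Simplified sketch of
# the real AWS rules.)
BUCKET_NAME_RE = re.compile(r"^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$")

def is_valid_bucket_name(name: str) -> bool:
    return bool(BUCKET_NAME_RE.match(name))

print(is_valid_bucket_name("thehungryfatcoder"))   # lowercase: valid
print(is_valid_bucket_name("TheHungryFatCoder"))   # uppercase: invalid
print(is_valid_bucket_name("ab"))                  # too short: invalid
```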
An unlimited number of objects can be stored in an S3 bucket, where an object can be between 0 bytes and 5 TB in size. Objects stored in buckets consist of:
- Key – the name you give your object
- Version ID – together with the key, uniquely identifies an object
- Value – the content that's being stored
- Metadata – data about the object.
- System data: date modified, object size, MD5 checksum, HTTP content type
- User data: optional; can be specified when creating the object to tag data with meaningful attributes.
- Subresources – additional information about the object
- Access control information – controls access to the stored object.
Note: S3 supports the BitTorrent protocol; you can use it to retrieve any publicly accessible object in Amazon S3. You can only get a torrent file for objects that are less than 5 GB in size.
The data is stored as a stream of bytes; Amazon does not know the type of data being stored.
Every object saved in a bucket has a unique name called a key, which can be thought of as a filename. It must be unique within the bucket, can be up to 1,024 bytes of UTF-8, and can include dashes, slashes, backslashes, and dots.
In practice, an object is uniquely identified by the combination bucket + key + version ID.
Buckets can be tagged, but the individual objects within a bucket do not inherit the bucket's tags; objects need to be tagged separately.
Every object stored in a bucket is unique and is addressable by a URL of the form https://bucket-name.s3.amazonaws.com/key.
One can also name an object something like /my/folder/structure/object.pdf; as discussed, a key can contain dashes, slashes, backslashes, and dots.
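Putting bucket, region, and key together, an object URL can be sketched like this (the bucket name, region, and key below are made-up examples, and the virtual-hosted style shown is one of the URL styles S3 accepts):

```python
def object_url(bucket: str, key: str, region: str = "us-east-1") -> str:
    """Virtual-hosted-style URL for an S3 object (illustrative sketch)."""
    return f"https://{bucket}.s3.{region}.amazonaws.com/{key}"

# Slashes in the key give the *appearance* of a folder hierarchy,
# even though S3 has no real folders.
print(object_url("thehungryfatcoder", "my/folder/structure/object.pdf"))
```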
There's a common misconception that S3 is a file system and that you could create a bucket within a bucket. S3 is not a file system, yet you can navigate a bucket as a folder hierarchy using the Amazon Console. You can learn more about this by visiting the AWS documentation here.
Amazon S3 Operations
S3 supports the following operations:
- Create or delete bucket
- Write an object
- Read an object
- Delete an object
- List keys in bucket
Amazon S3 exposes a REST API; one can use HTTP or HTTPS to perform the operations above.
- Create – PUT or at times POST
- Read – GET
- Delete – DELETE
- Update – POST or at times PUT
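The mapping above can be expressed as a small lookup table. This is only an illustration of the CRUD-to-verb correspondence; real requests also need authentication (AWS Signature Version 4), which is omitted here:

```python
# CRUD operation -> HTTP verb used by the S3 REST API (sketch).
S3_VERBS = {
    "create": "PUT",     # PUT /bucket/key uploads an object; POST is used for browser form uploads
    "read": "GET",
    "update": "PUT",     # objects are immutable: an "update" is a full overwrite
    "delete": "DELETE",
    "list": "GET",       # GET on the bucket itself lists its keys
}

def verb_for(operation: str) -> str:
    return S3_VERBS[operation.lower()]

print(verb_for("read"))
```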
What is Durability? What is Availability?
Durability addresses the question "Will my data be there in the future?"
Availability addresses the question "Can I access the data right now?"
S3 is designed for 99.999999999% (eleven 9s) durability and 99.99% availability. This means that if you store 10,000 objects, you can expect to lose a single object once every 10 million years on average. Amazon maintains durability by automatically storing data redundantly on multiple devices across multiple locations within a region.
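The "one object per 10 million years" figure follows directly from the durability number, since eleven nines of durability correspond to an annual loss probability of about 1e-11 per object:

```python
# Expected object loss under 99.999999999% annual durability.
annual_loss_probability = 1 - 0.99999999999        # ~1e-11 per object per year
objects_stored = 10_000

expected_losses_per_year = objects_stored * annual_loss_probability  # ~1e-7
years_per_lost_object = 1 / expected_losses_per_year

print(years_per_lost_object)  # ~ 10,000,000 years
```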
If you want to store non-critical or easily reproducible data, for example thumbnail images, you can use Reduced Redundancy Storage (RRS), which offers 99.99% durability at a lower storage cost.
Despite the high level of durability provided at the infrastructure level, it is your responsibility to protect data from accidental deletion or overwriting. Features such as versioning, cross-region replication, and MFA Delete can be used to guard against such unforeseen events.
How is data consistency maintained?
Data stored in S3 automatically gets replicated to multiple servers in multiple locations within a region. However, it takes some time to propagate changes to all locations when a modification happens. If you have used S3, you might have come across instances where reading an object immediately after an update returns the old object. S3 provides 'eventual consistency' for PUTs to existing objects and for DELETEs.
For PUTs of new objects this is not a concern, as S3 provides 'read-after-write consistency'.
What is Eventual Consistency and Read after write consistency?
Read-after-write consistency guarantees that new data is immediately visible to all clients: a newly created object is visible without any delay.
Eventual consistency means that after an update, an immediate read may or may not see the change.
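A toy illustration of the idea (this is not how S3 works internally): a store with two replicas, where a write is acknowledged after reaching only one replica and propagates to the other later. A read served by the stale replica returns the old data until propagation completes:

```python
class ToyReplicatedStore:
    """Minimal sketch of eventual consistency with two replicas."""

    def __init__(self):
        self.replicas = [{}, {}]

    def put(self, key, value):
        # The write is acknowledged after reaching only the first replica.
        self.replicas[0][key] = value

    def propagate(self):
        # Background replication catches the second replica up.
        self.replicas[1].update(self.replicas[0])

    def get(self, key, replica=1):
        # A read routed to the stale replica may return old data.
        return self.replicas[replica].get(key)

store = ToyReplicatedStore()
store.put("profile.jpg", b"v2")
print(store.get("profile.jpg"))   # stale replica: change not visible yet
store.propagate()
print(store.get("profile.jpg"))   # now visible: eventually consistent
```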
S3 is secure by default: when a bucket or object is created, only its owner has access. S3 provides the following access mechanisms:
- Coarse-grained access control – Amazon S3 ACL
Allows you to grant READ, WRITE, or FULL_CONTROL at the object or bucket level. ACLs were created before IAM and are a legacy access-control mechanism.
ACLs are mostly used today to enable bucket logging or to grant world-read permission for website hosting.
- Fine-grained access control – S3 bucket policies, IAM policies, query-string authentication.
S3 bucket policies are the recommended access-control mechanism and provide much finer-grained control. They are very similar to IAM policies, but differ in that they:
- Are associated with a bucket resource
- Include an explicit reference to an IAM principal in the policy. The principal can belong to a different account, so bucket policies let you grant cross-account access to Amazon S3 resources.
- With Amazon S3 bucket policies, you can specify who can access the bucket, from where, and at what time of the day.
- One can also use IAM policies to grant permissions to Amazon S3, just as IAM grants access to other AWS services and resources.
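To make the "who, from where" idea concrete, here is a hypothetical bucket policy in the JSON document format bucket policies use. The account ID, bucket name, and IP range are all made up; the policy grants another account read access, but only from a given network:

```python
import json

# Hypothetical cross-account policy: account 111122223333 may read
# objects in the bucket, but only from the 203.0.113.0/24 network.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "CrossAccountReadFromOffice",
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::111122223333:root"},
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::thehungryfatcoder/*",
            "Condition": {"IpAddress": {"aws:SourceIp": "203.0.113.0/24"}},
        }
    ],
}

print(json.dumps(policy, indent=2))
```

A document like this would be attached to the bucket itself, which is exactly the difference from an IAM policy: the bucket policy lives on the resource and names the principal explicitly.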
Stay tuned for the second post: 7 Steps to host your static website on S3.
If you happen to be on Instagram please follow @thehungryfatcoder.
If you happen to like my blog, please subscribe and let your friends know about it.