If you pay any attention to data storage, you’ve heard of files, blocks, and objects many times, but do you really know what they mean and where you would use each of them?
What’s a file?
Short answer: duh; of course you know what a file is, as does your grandmother. Everybody uses files, in the form of songs, pictures, documents, presentations, and movies. We take it for granted that a file can be shared with others and used on multiple machines, and for the most part it can. Files can be created and managed by an application or directly by humans.
In a way, that makes file systems a basic form of a collaboration app, with a rudimentary UI (the file folder organization and browsing tools) and APIs that can be used by applications to interact with stored files.
File storage started out on user machines and dedicated file servers, with the disk resources on board, directly cabled or connected by block protocols, as with any other application server. NetApp led the NAS revolution, offering high-performance, high-scale, expandable storage systems with built-in file services. Today, clustered file storage lets multiple nodes share a single namespace, so clusters can reach petabyte scale.
With advances in computing performance, NAS protocols can deliver performance comparable to SAN, leading application vendors like Oracle, VMware, and others to support application storage over file protocols, blurring the line to some extent between SAN and NAS.
What’s a block?
Blocks are fixed-length chunks of data that are read into memory when requested by an application. Block storage is used in SAN environments to deliver high-performance application storage and is characterized by a mapping of data blocks to the underlying disk drives.
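To make "fixed-length chunks" concrete, here is a minimal sketch of how a byte range requested by an application translates into block numbers. The 4,096-byte block size and the function name are assumptions for illustration; real systems use a variety of block sizes.

```python
BLOCK_SIZE = 4096  # assumed block size in bytes; real systems vary


def blocks_for_range(offset, length, block_size=BLOCK_SIZE):
    """Return the logical block numbers covered by a byte range."""
    if length <= 0:
        return []
    first = offset // block_size
    last = (offset + length - 1) // block_size
    return list(range(first, last + 1))


# Bytes 4000 through 4199 straddle the boundary between blocks 0 and 1,
# so both blocks have to be read even though only 200 bytes were requested.
print(blocks_for_range(4000, 200))
```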
Back in the day when storage performance was genuinely hard to deliver, block storage was the only way to give applications the unvarying fast access to data they needed to avoid crashing. It required a dedicated high-performance network and significant expertise and gymnastics on the part of storage administrators to get it right. You had to map each server explicitly to the adaptors, switch ports, and disk sectors on the storage system, and then manually modify settings as the demands on the capacity and network components changed over time.
Block storage today is more virtualized, allowing system or external software to optimize data layout and placement automatically. High performance is increasingly achieved with industry-standard hardware, reducing the dependence on special-purpose, high-cost chipsets and components. Wide striping is now a common way to virtualize data layout, spreading chunks of data around all the resources in the system rather than laying it all out across a smaller number of drives. Additionally, the dependence on dedicated storage networking protocols like FICON, ESCON, and Fibre Channel is slowly giving way to Ethernet-based protocols like iSCSI and FCoE, reducing the cost and skill barrier of managing block storage environments.
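Wide striping is easier to picture with a toy layout. The sketch below assumes the simplest possible policy, round-robin placement of chunks across every drive; actual arrays use more elaborate placement, and the function name is hypothetical.

```python
def stripe_layout(num_chunks, num_drives):
    """Place chunk i on drive i % num_drives (round-robin striping)."""
    layout = {drive: [] for drive in range(num_drives)}
    for chunk in range(num_chunks):
        layout[chunk % num_drives].append(chunk)
    return layout


# Eight chunks spread over four drives: every drive holds part of the data,
# so reads and writes can be serviced by all of them in parallel.
layout = stripe_layout(num_chunks=8, num_drives=4)
for drive, chunks in layout.items():
    print(f"drive {drive}: chunks {chunks}")
```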
In the end, though, block storage is all about application data: without an application properly mapped to the storage system, there’s no metadata to provide access to the data, or context for it, the way a file system does.
What’s an object?
Objects are the newest of the data storage constructs, typically used to store files and other discrete pieces of data, but with a different rule set than that employed by file systems. Object storage was born from the need to increase the scalability and programmatic accessibility of storing files or other content, at lower cost than is available with file or block storage.
The big advance of object storage is a unique identifier for each piece of data stored. This provides several advantages:
- You can easily compare data to see if it’s an identical copy, without needing to open it: in systems that derive the identifier from the content, identical identifiers mean identical data.
- Data is more easily accessed; an application just needs to ask for the uniquely identified piece of data but doesn’t need to know exactly where it is.
- The data repository can be bigger, because it’s composed of independent nodes that hold data, without cache consistency or central knowledge of where everything is.
- Geographic dispersion of data is viable, because the lack of cache consistency eliminates the need for low-latency communication among nodes.
- Management is easier, because clusters can use metadata policies to ensure that there are enough copies spread around to meet redundancy and management policies automatically.
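The first two advantages can be sketched in a few lines. This example assumes a content-addressed design in which the identifier is a SHA-256 hash of the data; that is one common approach, not how every object store assigns identifiers, and the `object_id` and `put` helpers are hypothetical.

```python
import hashlib


def object_id(data: bytes) -> str:
    """Derive an identifier from the content itself (SHA-256 hex digest)."""
    return hashlib.sha256(data).hexdigest()


store = {}  # flat namespace: identifier -> bytes, no directory hierarchy


def put(data: bytes) -> str:
    """Store data under its content-derived identifier and return it."""
    oid = object_id(data)
    store[oid] = data  # storing the same bytes twice keeps a single copy
    return oid


a = put(b"quarterly report")
b = put(b"quarterly report")
c = put(b"quarterly report, revised")
print(a == b, a == c, len(store))
```

Comparing `a` and `b` shows that two copies are identical without opening either one, and a lookup needs only the identifier, not a path or a location.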
Object storage has been used widely for stale, archival data, which fits nicely with the fact that changes are accommodated by creation of new versions of data, rather than modifying existing data. This model is great for data that doesn’t change frequently. Many object systems are designed for low performance and low cost, although it doesn’t have to be that way. Scality is an example of an object storage system that can deliver high application performance with an object storage back-end.
Object data is accessed through APIs, most notably Amazon S3 and OpenStack Swift. An API is a rule set that allows applications to access storage with simple commands like “put” (write), “get” (read), and “delete,” along with others for management and monitoring. Proprietary APIs specific to individual storage systems have made potential customers anxious about lock-in and have hindered adoption.
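As a rough illustration of the put/get/delete model, and of changes being stored as new versions rather than in-place edits, here is a toy in-memory store. It is a sketch under stated assumptions: the class and method names are hypothetical and do not reproduce the S3 or Swift APIs.

```python
class ToyObjectStore:
    """In-memory stand-in for an object store; not the S3 or Swift API."""

    def __init__(self):
        self._versions = {}  # key -> list of byte strings, oldest first

    def put(self, key, data):
        """Write: append a new version rather than modifying in place."""
        self._versions.setdefault(key, []).append(data)
        return len(self._versions[key])  # 1-based version number

    def get(self, key, version=None):
        """Read the latest version, or a specific earlier one."""
        versions = self._versions[key]
        return versions[-1] if version is None else versions[version - 1]

    def delete(self, key):
        """Remove the key and all of its versions."""
        del self._versions[key]


bucket = ToyObjectStore()
bucket.put("reports/q1.txt", b"draft")
bucket.put("reports/q1.txt", b"final")
print(bucket.get("reports/q1.txt"))             # latest version
print(bucket.get("reports/q1.txt", version=1))  # original is still there
bucket.delete("reports/q1.txt")
```

Note that the key looks like a path but is just a string; there is no directory to create or traverse.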
Most web-scale applications are built to interact with objects instead of files or blocks. Older apps that were built for file or block access need help to take advantage of cloud repositories and their associated economics and scalability. Administrators have three options: convert the apps to address objects directly, which can be costly, or avoid the conversion by using a cloud gateway or devising an embedded translator.
Many say that object storage will take over the file landscape, but the movement of enterprises in that direction has been limited at best. Today, greenfield apps are commonly built to interact with object storage, but legacy conversion has seen limited momentum, partly because of the cost and partly because there are few practical ways to do the conversion.