    These can be single devices or multiple mirrored devices, and are fully dedicated to the type of cache designated. Cache usage and its detailed settings can be fully deleted, created and modified without limit during live use.

    A list of ZFS cache types is given later in this article. ZFS can handle devices formatted into partitions for certain purposes, but this is not common use.

    Generally caches and data pools are given complete devices or multiple complete devices. The top level of data management is a ZFS pool or zpool.

    A ZFS system can have multiple pools defined. The vdevs to be used for a pool are specified when the pool is created (others can be added later), and ZFS will use all of the specified vdevs to maximize performance when storing data, a form of striping across the vdevs.

    Therefore, it is important to ensure that each vdev is sufficiently redundant, as loss of any vdev in a pool would cause loss of the pool, as with any other striping.
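    The dependency of a striped pool on every one of its vdevs can be illustrated with a minimal sketch. The function name and the (failed, tolerated) representation are assumptions made for illustration only; they are not part of ZFS.

```python
def pool_survives(vdevs):
    """Return True if a striped pool is still usable.

    vdevs is a list of (disks_failed, failures_tolerated) pairs, one per vdev
    (a hypothetical representation). Because data is striped across all vdevs,
    the pool survives only if every single vdev survives.
    """
    return all(failed <= tolerated for failed, tolerated in vdevs)


# Two 2-way mirrors: each vdev tolerates the loss of one disk.
print(pool_survives([(1, 1), (1, 1)]))  # one disk lost in each mirror
print(pool_survives([(2, 1), (0, 1)]))  # both disks of the first mirror lost
```

    In this model the first pool remains usable, while the second is lost even though one of its vdevs is completely healthy.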

    A ZFS pool can be expanded at any time by adding new vdevs, including when the system is 'live'. However, as explained above, the individual vdevs can each be modified at any time (within stated limits), and new vdevs added at any time, since the addition or removal of mirrors, or marking of a redundant disk as offline, does not affect the ability of that vdev to store data.

    Since volumes are presented as block devices, they can also be formatted with any other file system, to add ZFS features to that file system, although this is not usual practice.

    For example, a ZFS volume can be created, and then the block device it presents can be partitioned and formatted with a file system such as ext4 or NTFS.

    This can be done either locally or over a network (using iSCSI or similar). Snapshots are an integral feature of ZFS. They provide immutable (read-only) copies of the file system at a single point in time, and even very large file systems can be snapshotted many times every hour, or sustain tens of thousands of snapshots.

    Snapshot versions of individual files, or an entire dataset or pool, can easily be accessed, searched and restored. An entire snapshot can be cloned to create a new "copy", copied to a separate server as a replicated backup, or the pool or dataset can quickly be rolled back to any specific snapshot.

    Snapshots can also be compared to each other, or to the current data, to check for modified data. Snapshots do not take much disk space, but when data is deleted, its space is not marked as free until that data is no longer referenced by the current system or by any snapshot.

    As such, snapshots are also an easy way to avoid the impact of ransomware. Generally ZFS does not expect to reduce the size of a pool, and does not have tools to reduce the set of vdevs that a pool is stored on.

    Additional capacity can be added to a pool at any time, simply by adding more devices if needed, defining the unused devices into vdevs and adding the new vdevs to the pool.

    The capacity of an individual vdev is generally fixed when it is defined. There is one exception to this rule: A pool can be expanded into unused space, and the datasets and volumes within a pool can be likewise expanded to use any unused pool space.

    Datasets do not need a fixed size and can dynamically grow as data is stored, but volumes, being block devices, need to have their size defined by the user, and must be manually resized as required (which can be done 'live').

    Next, the block pointer is checksummed, with the value being saved at its pointer. This checksumming continues all the way up the file system's data hierarchy to the root node, which is also checksummed, thus creating a Merkle tree.
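    A minimal sketch of such a checksum tree follows. The class names are hypothetical, and SHA-256 is used only as a stand-in checksum; the sketch shows the principle of storing each child's checksum in its parent, not the actual on-disk format.

```python
import hashlib


def checksum(data: bytes) -> bytes:
    """SHA-256 stands in for whichever checksum the pool is configured to use."""
    return hashlib.sha256(data).digest()


class DataBlock:
    """Leaf block holding file data (hypothetical model)."""

    def __init__(self, payload: bytes):
        self.payload = payload

    def serialize(self) -> bytes:
        return self.payload

    def verify(self) -> bool:
        return True  # a leaf has no children to check


class IndirectBlock:
    """Block of pointers; each pointer keeps the checksum of its child."""

    def __init__(self, children):
        self.children = list(children)
        self.child_checksums = [checksum(c.serialize()) for c in self.children]

    def serialize(self) -> bytes:
        # The parent's content is the list of child checksums, so the parent's
        # own checksum (stored one level up) covers the whole subtree.
        return b"".join(self.child_checksums)

    def verify(self) -> bool:
        return all(
            checksum(child.serialize()) == stored and child.verify()
            for child, stored in zip(self.children, self.child_checksums)
        )


leaves = [DataBlock(b"alpha"), DataBlock(b"beta")]
root = IndirectBlock([IndirectBlock(leaves)])
print(root.verify())          # tree is intact
leaves[0].payload = b"alphA"  # simulate silent corruption of a data block
print(root.verify())          # mismatch is detected from the root downwards
```

    Because the checksum lives in the parent rather than next to the data, a corrupted block cannot vouch for itself.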

    ZFS stores the checksum of each block in its parent block pointer so the entire pool self-validates. When a block is accessed, regardless of whether it is data or meta-data, its checksum is calculated and compared with the stored checksum value of what it "should" be.

    If the checksums match, the data are passed up the programming stack to the process that asked for it; if the values do not match, then ZFS can heal the data if the storage pool provides data redundancy (such as with internal mirroring), assuming that the copy of data is undamaged and with matching checksums.

    If other copies of the damaged data exist or can be reconstructed from checksums and parity data, ZFS will use a copy of the data (or recreate it via a RAID recovery mechanism), and recalculate the checksum, ideally resulting in the reproduction of the originally expected value.

    If the data passes this integrity check, the system can then update all faulty copies with known-good data and redundancy will be restored.
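    The read-verify-repair cycle described above can be sketched for a simple mirror. The function below is an illustrative assumption, with SHA-256 standing in for the configured checksum and a Python list standing in for the mirror's copies.

```python
import hashlib


def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def self_healing_read(copies: list, expected: str) -> bytes:
    """Return verified data from a mirror and repair any damaged copies.

    copies is a mutable list with one bytes object per mirror side
    (hypothetical model); expected is the checksum stored in the parent.
    """
    good = next((c for c in copies if checksum(c) == expected), None)
    if good is None:
        raise IOError("no intact copy: corruption detected but not repairable")
    for i, copy in enumerate(copies):
        if checksum(copy) != expected:
            copies[i] = good  # overwrite the faulty copy with known-good data
    return good


original = b"important data"
expected = checksum(original)
mirror = [b"imp0rtant data", original]  # first side is silently corrupted
data = self_healing_read(mirror, expected)
print(data == original, mirror[0] == original)
```

    With a single copy and no parity, the same check would still detect the corruption but could only report it.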

    For ZFS to be able to guarantee data integrity, it needs multiple copies of the data, usually spread across multiple disks. This is because ZFS relies on the disk for an honest view, to determine the moment data is confirmed as safely written, and it has numerous algorithms designed to optimize its use of caching, cache flushing, and disk handling.

    If a third-party device performs caching or presents drives to ZFS as a single system, or without the low level view ZFS relies upon, there is a much greater chance that the system will perform less optimally, and that a failure will not be preventable by ZFS or as quickly or fully recovered by ZFS.

    For example, if a hardware RAID card is used, ZFS may not be able to determine the condition of disks or whether the RAID array is degraded or rebuilding, and it may not know of all data corruption. It also cannot place data optimally across the disks, make selective repairs only, or control how repairs are balanced with ongoing use, and it may not be able to make repairs that it could otherwise perform, as the hardware RAID card will interfere.

    Although the data can usually be read with a compatible hardware RAID controller, this is not always possible. If the controller card develops a fault, a replacement may not be available, and other cards may not understand the manufacturer's custom data which is needed to manage and restore an array on a new card.

    Therefore, unlike most other systems, where RAID cards or similar are used to offload resources and processing and enhance performance and reliability, with ZFS it is strongly recommended these methods not be used as they typically reduce the system's performance and reliability.

    The schemes are highly flexible. This, when combined with the copy-on-write transactional semantics of ZFS, eliminates the write hole error.

    This would be impossible if the filesystem and the RAID array were separate products, whereas it becomes feasible when there is an integrated view of the logical and physical structure of the data.

    Going through the metadata means that ZFS can validate every block against its 256-bit checksum as it goes, whereas traditional RAID products usually cannot do this.

    In addition to handling whole-disk failures, RAID-Z can also detect and correct silent data corruption, offering "self-healing data": it repairs the damaged data and returns good data to the requestor.

    RAID-Z and mirroring do not require any special hardware. During those weeks, the rest of the disks in the RAID are stressed more because of the additional intensive repair process and might subsequently fail, too.

    ZFS has no tool equivalent to fsck (the standard Unix and Linux data checking and repair tool for file systems). ZFS is a 128-bit file system, [44] [45] so it can address 1.84 × 10^19 times more data than 64-bit systems.

    The maximum limits of ZFS are designed to be so large that they should never be encountered in practice. During writes, a block may be compressed, encrypted, checksummed and then deduplicated, in that order.

    The policy for encryption is set at the dataset level when datasets (file systems or ZVOLs) are created. The default behaviour is for the wrapping key to be inherited by any child datasets.

    The data encryption keys are randomly generated at dataset creation time. Only descendant datasets (snapshots and clones) share data encryption keys.

    ZFS will automatically allocate data storage across all vdevs in a pool and all devices in each vdev in a way that generally maximises the performance of the pool.

    ZFS will also update its write strategy to take account of new disks added to a pool, when they are added. As a general rule, ZFS allocates writes across vdevs based on the free space in each vdev.

    This ensures that vdevs which hold proportionately less data are given more writes when new data is to be stored. It helps to ensure that, as the pool fills, some vdevs do not become full while others remain empty, which would force writes to occur on a limited number of devices.

    It also means that when data is read (and reads are much more frequent than writes in most uses), different parts of the data can be read from as many disks as possible at the same time, giving much higher read performance.

    Therefore, as a general rule, pools and vdevs should be managed and new storage added, so that the situation does not arise that some vdevs in a pool are almost full and others almost empty, as this will make the pool less efficient.
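    The effect of allocating writes according to free space can be shown with a deliberately simplified sketch. It greedily sends each block to the vdev with the most free space; this is an assumption made for illustration, and the real allocator is more sophisticated than this. The vdev names and the block-count units are hypothetical.

```python
def allocate_writes(free_blocks: dict, n_blocks: int) -> dict:
    """Place n_blocks one at a time on the vdev with the most free space.

    free_blocks maps a vdev name to its free space in blocks. Returns the
    number of blocks placed on each vdev. The input dict is not modified.
    """
    free = dict(free_blocks)
    placed = {name: 0 for name in free}
    for _ in range(n_blocks):
        target = max(free, key=free.get)
        if free[target] == 0:
            raise RuntimeError("pool is full")
        free[target] -= 1
        placed[target] += 1
    return placed


# vdev1 was added recently and is much emptier than vdev0.
print(allocate_writes({"vdev0": 10, "vdev1": 40}, 20))
print(allocate_writes({"vdev0": 10, "vdev1": 40}, 40))
```

    In this model the emptier vdev absorbs all writes until free space has evened out, after which writes alternate between the two. It also shows why a pool with very unevenly filled vdevs temporarily loses the benefit of striping its writes.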

    Pools can have hot spares to compensate for failing disks. When mirroring, block devices can be grouped according to physical chassis, so that the filesystem can continue in the case of the failure of an entire chassis.

    Storage pool composition is not limited to similar devices, but can consist of ad-hoc, heterogeneous collections of devices, which ZFS seamlessly pools together, subsequently doling out space to the file systems in the pool as needed.

    Arbitrary storage device types can be added to existing pools to expand their size. The storage capacity of all vdevs is available to all of the file system instances in the zpool.

    A quota can be set to limit the amount of space a file system instance can occupy, and a reservation can be set to guarantee that space will be available to a file system instance.
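    The interaction of shared pool capacity, quotas and reservations can be sketched with a toy accounting model. The class and its method names are hypothetical and ignore details such as snapshots and metadata overhead.

```python
class Pool:
    """Toy space accounting for file systems sharing one pool (hypothetical)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.datasets = {}

    def create(self, name, quota=None, reservation=0):
        self.datasets[name] = {"used": 0, "quota": quota, "reservation": reservation}

    def available_to(self, name) -> int:
        ds = self.datasets[name]
        # Other datasets claim whichever is larger: what they use or reserve.
        claimed_by_others = sum(
            max(d["used"], d["reservation"])
            for other, d in self.datasets.items()
            if other != name
        )
        available = self.capacity - claimed_by_others - ds["used"]
        if ds["quota"] is not None:
            available = min(available, ds["quota"] - ds["used"])
        return max(available, 0)

    def write(self, name, size: int):
        if size > self.available_to(name):
            raise OSError("out of space or quota exceeded")
        self.datasets[name]["used"] += size


pool = Pool(capacity=100)
pool.create("home", quota=30)       # may never use more than 30
pool.create("db", reservation=50)   # is always guaranteed 50
pool.create("scratch")              # no limit, no guarantee
pool.write("home", 30)
print(pool.available_to("home"), pool.available_to("db"), pool.available_to("scratch"))
```

    Here "home" has exhausted its quota, "db" can still grow into everything that "home" has not used, and "scratch" sees only what is left after the reservation of "db" is set aside.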

    ZFS uses different layers of disk cache to speed up read and write operations. Ideally, all data should be stored in RAM, but that is usually too expensive.

    Therefore, data is automatically cached in a hierarchy to optimize performance versus cost; [51] these are often called "hybrid storage pools".

    Data that is not often accessed is not cached and left on the slow hard drives. ZFS caching mechanisms include one each for reads and writes, and in each case, two levels of caching can exist, one in computer memory (RAM) and one on fast storage (usually solid state drives, SSDs), for a total of four caches.
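    The two-level read path can be sketched as follows. This is a simplified model under stated assumptions: plain least-recently-used eviction stands in for the actual cache replacement policy, and the class, the slot counts and the dictionary used as the "disk" are all hypothetical.

```python
from collections import OrderedDict


class TwoTierReadCache:
    """Toy read cache: a small RAM tier backed by a larger SSD tier."""

    def __init__(self, ram_slots: int, ssd_slots: int, disk: dict):
        self.ram = OrderedDict()
        self.ssd = OrderedDict()
        self.ram_slots = ram_slots
        self.ssd_slots = ssd_slots
        self.disk = disk  # block id -> data, the slow main store

    def read(self, key):
        """Return (data, tier that served the read)."""
        if key in self.ram:
            self.ram.move_to_end(key)  # mark as most recently used
            return self.ram[key], "ram"
        if key in self.ssd:
            data, source = self.ssd.pop(key), "ssd"
        else:
            data, source = self.disk[key], "disk"
        self._insert_ram(key, data)
        return data, source

    def _insert_ram(self, key, data):
        self.ram[key] = data
        if len(self.ram) > self.ram_slots:
            # Demote the least recently used RAM entry to the SSD tier.
            old_key, old_data = self.ram.popitem(last=False)
            self.ssd[old_key] = old_data
            if len(self.ssd) > self.ssd_slots:
                self.ssd.popitem(last=False)  # falls out of cache entirely


cache = TwoTierReadCache(ram_slots=1, ssd_slots=1, disk={1: b"a", 2: b"b", 3: b"c"})
trace = [cache.read(k)[1] for k in (1, 1, 2, 1, 3, 2)]
print(trace)
```

    Blocks that are read repeatedly are served from RAM, recently evicted blocks from the SSD tier, and everything else from the slow disks.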

    This becomes crucial if a large number of synchronous writes take place (such as with ESXi, NFS and some databases), [53] where the client requires confirmation of successful writing before continuing its activity; the SLOG allows ZFS to confirm writing is successful much more quickly than if it had to write to the main store every time, without the risk involved in misleading the client as to the state of data storage.

    If there is no SLOG device then part of the main data pool will be used for the same purpose, although this is slower. If the log device itself is lost, it is possible to lose the latest writes, therefore the log device should be mirrored.

    In earlier versions of ZFS, loss of the log device could result in loss of the entire zpool, although this is no longer the case. Therefore, one should upgrade ZFS if planning to use a separate log device.

    A number of other caches, cache divisions, and queues also exist within ZFS. For example, each vdev has its own data cache, and the ARC cache is divided between data stored by the user and metadata used by ZFS, with control over the balance between these.

    ZFS uses a copy-on-write transactional object model. All block pointers within the filesystem contain a 256-bit checksum or 256-bit hash (currently a choice between Fletcher-2, Fletcher-4, or SHA-256) [54] of the target block, which is verified when the block is read.

    Blocks containing active data are never overwritten in place; instead, a new block is allocated, modified data is written to it, then any metadata blocks referencing it are similarly read, reallocated, and written.

    To reduce the overhead of this process, multiple updates are grouped into transaction groups, and the ZIL (intent log) write cache is used when synchronous write semantics are required.
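    The way a copy-on-write update propagates from a data block up to the root can be sketched with a small block store. The class and its methods are hypothetical, and transaction groups and the intent log are left out.

```python
import itertools


class CowStore:
    """Toy block store in which existing blocks are never overwritten."""

    def __init__(self):
        self.blocks = {}                # block id -> content
        self._ids = itertools.count()

    def alloc(self, content):
        block_id = next(self._ids)
        self.blocks[block_id] = content
        return block_id

    def update(self, root_id, path, new_data):
        """Write new_data at path (child indexes from the root); return a new root id."""
        if not path:
            return self.alloc(new_data)
        pointers = list(self.blocks[root_id])  # copy the parent's pointer list
        pointers[path[0]] = self.update(pointers[path[0]], path[1:], new_data)
        return self.alloc(tuple(pointers))     # the parent is rewritten elsewhere too

    def read(self, root_id, path):
        block_id = root_id
        for index in path:
            block_id = self.blocks[block_id][index]
        return self.blocks[block_id]


store = CowStore()
a = store.alloc(b"old A")
b = store.alloc(b"B")
old_root = store.alloc((a, b))
new_root = store.update(old_root, [0], b"new A")
print(store.read(new_root, [0]), store.read(old_root, [0]))
```

    The old root still describes a complete and consistent tree, and the unmodified block is shared between both versions.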

    The blocks are arranged in a tree, as are their checksums (see Merkle signature scheme). An advantage of copy-on-write is that, when ZFS writes new data, the blocks containing the old data can be retained, allowing a snapshot version of the file system to be maintained.

    ZFS snapshots are consistent (they reflect the entire data as it existed at a single point in time), and can be created extremely quickly, since all the data composing the snapshot is already stored, with the entire storage pool often snapshotted several times per hour.

    They are also space efficient, since any unchanged data is shared among the file system and its snapshots. Snapshots are inherently read-only, ensuring they will not be modified after creation, although they should not be relied on as a sole means of backup.

    Entire snapshots can be restored, as can individual files and directories within snapshots. Writeable snapshots ("clones") can also be created, resulting in two independent file systems that share a set of blocks.

    As changes are made to any of the clone file systems, new data blocks are created to reflect those changes, but any unchanged blocks continue to be shared, no matter how many clones exist.
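    Block sharing between a file system, its snapshots and its clones can be sketched as follows. The model is a hypothetical simplification in which a dataset is a mapping from file names to block ids in a shared store.

```python
from types import MappingProxyType


class Dataset:
    """Toy dataset: maps file names to block ids in a shared block store."""

    def __init__(self, store: dict, files=None):
        self.store = store
        self.files = dict(files or {})

    def write(self, name: str, data: bytes):
        # Copy-on-write: new data always goes into a freshly allocated block.
        block_id = len(self.store)
        self.store[block_id] = data
        self.files[name] = block_id

    def snapshot(self):
        # A read-only copy of the name -> block mapping; no data is copied.
        return MappingProxyType(dict(self.files))


def blocks_in_use(*views):
    """Block ids referenced by any of the given datasets or snapshots."""
    return set().union(*(view.values() for view in views))


store = {}
fs = Dataset(store)
fs.write("report.txt", b"draft")
snap = fs.snapshot()
fs.write("report.txt", b"final")    # the snapshot keeps the old block alive
clone = Dataset(store, snap)        # writable, starts out sharing every block
clone.write("notes.txt", b"todo")

print(store[snap["report.txt"]], store[fs.files["report.txt"]])
print(blocks_in_use(fs.files, snap, clone.files), blocks_in_use(fs.files))
```

    The block holding the draft can only be reclaimed once neither the snapshot nor the clone references it.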

    This is an implementation of the copy-on-write principle. ZFS file systems can be moved to other pools, including pools on remote hosts over the network, as the send command creates a stream representation of the file system's state.

    This stream can either describe complete contents of the file system at a given snapshot, or it can be a delta between snapshots. Computing the delta stream is very efficient, and its size depends on the number of blocks changed between the snapshots.
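    The difference between a full stream and a delta stream can be sketched with snapshots modelled as mappings from block number to content. The function names and the stream layout are assumptions for illustration; the sketch compares contents directly, which the real implementation does not need to do.

```python
def send_full(snapshot: dict) -> dict:
    """Stream describing the complete contents of one snapshot."""
    return {"write": dict(snapshot), "free": []}


def send_incremental(old: dict, new: dict) -> dict:
    """Stream describing only what changed between two snapshots."""
    return {
        "write": {k: v for k, v in new.items() if k not in old or old[k] != v},
        "free": [k for k in old if k not in new],
    }


def receive(base: dict, stream: dict) -> dict:
    """Apply a stream on the receiving side and return the resulting state."""
    result = dict(base)
    result.update(stream["write"])
    for key in stream["free"]:
        del result[key]
    return result


monday = {0: b"a", 1: b"b", 2: b"c"}
tuesday = {0: b"a", 1: b"B", 3: b"d"}

delta = send_incremental(monday, tuesday)
print(delta)
print(receive(monday, delta) == tuesday)
```

    The size of the delta depends only on the blocks that changed, not on the total size of the file system.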

    This provides an efficient strategy, e.g., for synchronizing offsite backups of a pool. Dynamic striping across all devices to maximize throughput means that as additional devices are added to the zpool, the stripe width automatically expands to include them; thus, all disks in a pool are used, which balances the write load across them.

    Available features allow the administrator to tune the maximum block size which is used, as certain workloads do not perform well with large blocks.

    If data compression is enabled, variable block sizes are used. If a block can be compressed to fit into a smaller block size, the smaller size is used on the disk to use less storage and improve IO throughput (though at the cost of increased CPU use for the compression and decompression operations).
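    The decision to store a block compressed only when it then fits a smaller on-disk size can be sketched as below. The 512-byte allocation unit and the function names are assumptions, and zlib is used merely as a readily available stand-in compressor.

```python
import zlib

SECTOR = 512  # assumed allocation unit in bytes (hypothetical)


def round_up(size: int) -> int:
    """Round a size up to a whole number of allocation units."""
    return -(-size // SECTOR) * SECTOR


def store_block(block: bytes):
    """Return (bytes to store, space allocated on disk).

    The compressed form is used only if it occupies fewer allocation units
    than the uncompressed block; otherwise the block is stored as-is.
    """
    compressed = zlib.compress(block)
    if round_up(len(compressed)) < round_up(len(block)):
        return compressed, round_up(len(compressed))
    return block, round_up(len(block))


zeros = bytes(128 * 1024)            # highly compressible 128 KiB block
stored, allocated = store_block(zeros)
print(len(zeros), allocated)         # logical size versus space allocated

tiny = b"abc"                        # compression cannot save a whole unit here
print(store_block(tiny) == (tiny, SECTOR))
```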

    In ZFS, filesystem manipulation within a storage pool is easier than volume manipulation within a traditional filesystem; the time and effort required to create or expand a ZFS filesystem is closer to that of making a new directory than it is to volume manipulation in some other systems.

    Pools and their associated ZFS file systems can be moved between different platform architectures, including systems implementing different byte orders.

    The ZFS block pointer format stores filesystem metadata in an endian-adaptive way; individual metadata blocks are written with the native byte order of the system writing the block.

    When reading, if the stored endianness does not match the endianness of the system, the metadata is byte-swapped in memory.

    This does not affect the stored data; as is usual in POSIX systems, files appear to applications as simple arrays of bytes, so applications creating and reading data remain responsible for doing so in a way independent of the underlying system's endianness.
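    The endian-adaptive approach can be sketched with Python's struct module. The one-byte order flag and the layout of 64-bit values are hypothetical and chosen for illustration; they do not reflect the real block pointer format.

```python
import struct
import sys


def write_metadata(values, byteorder=sys.byteorder) -> bytes:
    """Serialize 64-bit values in the writer's byte order, prefixed by a flag."""
    prefix = "<" if byteorder == "little" else ">"
    flag = b"L" if byteorder == "little" else b"B"  # hypothetical order marker
    return flag + struct.pack(f"{prefix}{len(values)}Q", *values)


def read_metadata(blob: bytes) -> list:
    """Decode values written on any system, swapping bytes only if needed."""
    prefix = "<" if blob[:1] == b"L" else ">"
    count = (len(blob) - 1) // 8
    return list(struct.unpack(f"{prefix}{count}Q", blob[1:]))


values = [1, 2**40]
from_little = write_metadata(values, "little")
from_big = write_metadata(values, "big")
print(from_little != from_big)
print(read_metadata(from_little) == read_metadata(from_big) == values)
```

    The writer never pays a conversion cost, and a reader pays it only when its own byte order differs from the one recorded with the block.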

    Data deduplication capabilities were added to the ZFS source repository at the end of October, [56] and relevant OpenSolaris ZFS development packages have been available since December 3. Other storage vendors use modified versions of ZFS to achieve very high data compression ratios.

    Two examples were GreenBytes [60] and Tegile. As described above, deduplication is usually not recommended due to its heavy resource requirements (especially RAM) and impact on performance (especially when writing), other than in specific circumstances where the system and data are well-suited to this space-saving technique.
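    Why deduplication saves space but needs a large lookup table can be seen from a minimal sketch. The class is hypothetical, with SHA-256 as the block hash and an in-memory dictionary as the deduplication table.

```python
import hashlib


class DedupStore:
    """Toy block-level deduplication keyed by content hash."""

    def __init__(self):
        # One entry per unique block: hash -> [data, reference count].
        # This table must be consulted on every write, which is why it
        # should fit in fast memory.
        self.table = {}

    def write(self, block: bytes) -> str:
        key = hashlib.sha256(block).hexdigest()
        entry = self.table.setdefault(key, [block, 0])
        entry[1] += 1
        return key

    def free(self, key: str):
        entry = self.table[key]
        entry[1] -= 1
        if entry[1] == 0:
            del self.table[key]  # last reference gone, space can be reclaimed

    def physical_bytes(self) -> int:
        return sum(len(data) for data, _ in self.table.values())


dedup = DedupStore()
first = dedup.write(b"x" * 100)
second = dedup.write(b"x" * 100)   # identical content, stored only once
other = dedup.write(b"y" * 100)
print(first == second, dedup.physical_bytes())
```

    Three logical blocks occupy the space of two, at the price of one table entry and one hash lookup per unique block.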

    The authors of a study that examined the ability of file systems to detect and prevent data corruption, with particular focus on ZFS, observed that ZFS itself is effective in detecting and correcting data errors on storage devices, but that it assumes data in RAM is "safe", and not prone to error.

    For ZFS to protect data against disk failure, it needs to be configured with redundant storage, either RAID-Z or mirrored, so that all data is copied to at least two disks.

    If a single disk is used, redundant copies of the data should be enabled, which duplicates the data on the same logical drive. This is far less safe, since it remains vulnerable to the failure of the single disk.

    ZFS copies are a useful feature on notebooks and desktop computers, since the disks are large and the feature at least provides some limited redundancy with just a single drive.

    There are over a dozen 3rd-party distributions, of which nearly a dozen are mentioned here. OpenIndiana and illumos are two new distributions not included on the OpenSolaris distribution reference page.

    ZFS version 28 used up to version a3. O3X implements zpool version 5000, and includes the Solaris Porting Layer (SPL) originally written for MacZFS, which has been further enhanced to include a memory management layer based on the illumos kmem and vmem allocators.

    This was derived from code included in FreeBSD 7. An update to storage pool 28 is in progress in 0. This project is a continuation of FreeNAS 7 series project.

    As of 31 January, the zpool version available is 14 for the Squeeze release, and 28 for the Wheezy-9 release. According to the Free Software Foundation, the wording of the GPL license legally prohibits redistribution of the resulting product as a derivative work, though this viewpoint has caused some controversy.

    The filesystem ran entirely in userspace instead of being integrated into the Linux kernel, and was therefore not considered a derivative work of the kernel.

    This approach was functional, but suffered from significant performance penalties when compared with integrating the filesystem as a native kernel module running in kernel space.

    This pool version is an unchanging number that is expected to never conflict with version numbers given by Oracle.

    A release supporting zpool v28 was announced in January. While the license incompatibility may arise with the distribution of compiled binaries containing ZFS code, it is generally agreed that distribution of the source code itself is not affected by this.

    In Gentoo, configuring a ZFS root filesystem is well documented and the required packages can be installed from its package repository.

    The question of the CDDL license's compatibility with the GPL license resurfaced when the Linux distribution Ubuntu announced that it intended to make precompiled OpenZFS binary kernel modules available to end-users directly from the distribution's official package repositories.

    A port of open source ZFS was attempted, but after a hiatus of over one year development ceased. List of operating systems, distributions and add-ons that support ZFS, the zpool version each supports, and the Solaris build each is based on (if any):

    It was announced on September 14, although development had started earlier. The name at one point was said to stand for "Zettabyte File System", but was later no longer considered to be an abbreviation.

    Sun counter-sued in October the same year claiming the opposite. The lawsuits were ended with an undisclosed settlement. The following is a list of events in the development of open-source ZFS implementations: The first indication of Apple Inc.

    On that site, Apple provided the source code and binaries of their port of ZFS which includes read-write access, but there was no installer available until a third-party developer created one.

    That is to say that their own hosting and involvement in ZFS was summarily discontinued. No explanation was given, just the following statement: The mailing list and repository will also be removed shortly.

    Apple's "10a" source code release, and versions of the previously released source and binaries, have been preserved and new development has been adopted by a group of enthusiasts.

    The project has an active mailing list. Features that are available in specific file system versions require a specific pool version. Distributed development of OpenZFS involves feature flags [83] and pool version 5000, an unchanging number that is expected to never conflict with version numbers given by Oracle.

    Legacy version numbers still exist for pool versions 1 to 28, implied by the version 5000. The Solaris version under development by Sun since the release of Solaris 10 was codenamed 'Nevada', and was derived from what was the OpenSolaris codebase.

