Tiered Storage (Experimental)

How to use

The RocksDB tiered storage feature can assign data to different types of storage media based on the temperature of the data (how hot the data is) within the same DB column family. For example, the user can set the temperature of the last level to cold:

AdvancedColumnFamilyOptions.last_level_temperature = Temperature::kCold

The temperature information is then passed to FileSystem APIs such as NewRandomAccessFile(), NewWritableFile(), etc. It is up to the user's own FileSystem implementation to place each file in the corresponding storage, and to use the same temperature information to find the file there later.

In general, data in the upper levels (those above the last level) was written more recently and is more likely to be hot. Upper-level data is also much more likely to go through compaction, so keeping it on faster storage media can speed up compaction.

Currently, only the last level temperature can be specified. This has limitations: for example, with a skewed data set, the hot data may be compacted frequently and end up in the last level. To prevent that, a per-key hot/cold data splitting compaction is introduced.
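As a rough illustration of the placement decision a custom FileSystem has to make, the sketch below maps a temperature hint to a storage location. It is a standalone model, not RocksDB code: the Temperature enum is a stand-in for the RocksDB one so the snippet compiles on its own, and TierDirFor(), PlaceFile(), and the /mnt/... mount points are hypothetical names chosen for the example. A real implementation would apply the same mapping inside its FileSystem methods, using the temperature RocksDB passes in.

```cpp
#include <string>

// Stand-in for RocksDB's Temperature enum (assumption: only the values
// needed for this example are modeled).
enum class Temperature { kUnknown, kHot, kWarm, kCold };

// Hypothetical mount points, one per storage tier.
std::string TierDirFor(Temperature t) {
  switch (t) {
    case Temperature::kCold:
      return "/mnt/cold";
    case Temperature::kWarm:
      return "/mnt/warm";
    case Temperature::kHot:
    case Temperature::kUnknown:
      // Files without a cold/warm hint stay on the fast tier.
      return "/mnt/hot";
  }
  return "/mnt/hot";  // Not reached; keeps compilers quiet.
}

// Maps a file name plus its temperature hint to a tier-specific path.
// The same function can be used on the read path to locate the file.
std::string PlaceFile(const std::string& fname, Temperature t) {
  return TierDirFor(t) + "/" + fname;
}
```

With this mapping, a last-level file created with a cold hint would land under /mnt/cold, while files from other levels, which carry no cold hint, stay under /mnt/hot.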

Hot Data Time Range

If the data set is skewed, or a major compaction runs (more likely with universal compaction), recently inserted data may be compacted to the last level, which is stored in the cold storage tier. To prevent that, the user can specify the hot data time range with:

AdvancedColumnFamilyOptions.preclude_last_level_data_seconds = 259200 // 3 days

Data written in the last 3 days then won't be compacted to the last level.

Internally, RocksDB compaction can split hot and cold data during a last level compaction: a per-key placement puts data older than now - preclude_last_level_data_seconds in the last level (cold tier) and all other data in the penultimate level (hot tier). RocksDB uses a key's sequence number to estimate its insertion time. Once the feature is enabled, RocksDB samples the mapping from sequence number to time and stores it with the SSTable. Based on that mapping, compaction can estimate when the data was inserted.
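The per-key placement described above can be modeled roughly as follows. This is a simplified sketch under stated assumptions, not RocksDB's implementation: SeqnoTimeSample, ColdSeqnoThreshold(), and PlaceInLastLevel() are hypothetical names, the samples are assumed to be sorted by both sequence number and time, and each sample is assumed to mean "this sequence number had been assigned by this time". A key is treated as cold only when a sample proves it is older than the cutoff, so keys that fall between samples stay in the hot tier.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <vector>

// One sampled point of the sequence-number-to-time mapping (hypothetical type).
struct SeqnoTimeSample {
  uint64_t seqno;
  uint64_t unix_time;  // seconds
};

// Returns the largest sampled seqno whose time is at or before cutoff_time,
// or 0 if no sample is that old. Samples must be sorted by seqno and time.
uint64_t ColdSeqnoThreshold(const std::vector<SeqnoTimeSample>& samples,
                            uint64_t cutoff_time) {
  auto it = std::upper_bound(
      samples.begin(), samples.end(), cutoff_time,
      [](uint64_t t, const SeqnoTimeSample& s) { return t < s.unix_time; });
  if (it == samples.begin()) return 0;
  return std::prev(it)->seqno;
}

// True if the key is known to be older than now - preclude_seconds and may
// therefore go to the last level (cold tier); false means penultimate level.
bool PlaceInLastLevel(const std::vector<SeqnoTimeSample>& samples,
                      uint64_t key_seqno, uint64_t now,
                      uint64_t preclude_seconds) {
  const uint64_t cutoff = now > preclude_seconds ? now - preclude_seconds : 0;
  return key_seqno <= ColdSeqnoThreshold(samples, cutoff);
}
```

Because the mapping is only sampled, the estimate is approximate; in this model the error always errs toward keeping data in the hot tier slightly longer than the configured range.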

Metrics

Last level vs. non-last level read/write byte counters are added to statistics: https://github.com/facebook/rocksdb/blob/72a3fb3424c6605517d7ed09bb2004589aa287c0/include/rocksdb/statistics.h#L428-L432

The IO stats context also includes per-temperature IO stats: https://github.com/facebook/rocksdb/blob/cc2099803a1de4dab8aa748cb26b2650e740d197/include/rocksdb/iostats_context.h#L78

Update File Temperature

Files may be moved between storage tiers. That information can be synced back to RocksDB (otherwise it has to be handled by the user's custom FileSystem) with:

experimental::UpdateManifestForFilesState()

Or with the ldb command:

$ ldb update_manifest --update_temperatures

Internally, the file temperature information is tracked in the MANIFEST file: https://github.com/facebook/rocksdb/wiki/MANIFEST . The temperature information is persisted there across DB open/close and backup/restore. If a file's temperature changes outside of RocksDB, for example because the user manually copied the file from cold storage to hot storage, RocksDB still thinks the file is in cold storage. The command above makes the DB re-sync the temperature information.

Limitations and Future Improvements

1. Key-range based hot/cold data

Currently, only time-based hot/cold data separation is supported, which assumes that new data is hot. That may not hold: in some cases a specific key range is hotter than the others. Key-range based separation may be supported in the future. As a workaround for now, the user can put the hot and cold key ranges into different column families, if that is possible.

2. Tiered storage only supports universal compaction

Universal compaction is more likely to compact recently inserted data to the cold tier (the last level), so the tiered compaction feature was first adapted for universal compaction. With level compaction, it may cause endless automatic compaction if the majority of the data is hot: the penultimate level grows large and its compaction score exceeds 1, but compaction cannot move the data to the last level because most of it is still hot. An improved compaction score calculation needs to be introduced for level compaction.
