CMU15445 (Fall 2019) 之 Project#2 - Hash Table 详解
2022/7/8 6:20:18
本文主要是介绍CMU15445 (Fall 2019) 之 Project#2 - Hash Table 详解,对大家解决编程问题具有一定的参考价值,需要的程序猿们随着小编来一起学习吧!
前言
该实验要求实现一个基于线性探测法的哈希表,但是与直接放在内存中的哈希表不同的是,该实验假设哈希表非常大,无法整个放入内存中,因此需要将哈希表进行分割,将多个键值对放在一个 Page 中,然后搭配上一个实验实现的 Buffer Pool Manager 一起食用。哈希表的大致结构如下图所示:
下面介绍如何实现一个线程安全的哈希表。
代码实现
Page 布局
从上图可以看出,多个键值对被放在 Page 里面,作为 Page 的数据存在磁盘中。为了更好地组织和管理这些键值对,实验任务一要求我们实现两个类:HashTableHeaderPage
和 HashTableBlockPage
,HashTableHeaderPage
保存着 block index
到 page id
的映射关系以及其他哈希表元数据,每个哈希表只有一个 HashTableHeaderPage
,而 HashTableBlockPage
可以有多个。
Hash Table Header Page
HashTableHeaderPage
有以下几个字段:
字段 | 大小 | 描述 |
---|---|---|
lsn_ |
4 bytes | Log sequence number (Used in Project 4) |
size_ |
4 bytes | Number of Key & Value pairs the hash table can hold |
page_id_ |
4 bytes | Self Page Id |
next_ind_ |
4 bytes | The next index to add a new entry to block_page_ids_ |
block_page_ids_ |
4080 bytes | Array of block page_id_t |
这些字段总共 4096 字节,正好是一个 Page 的大小,在 src/include/common/config.h
中可以修改 PAGE_SIZE
的大小。该类的实现代码如下:
复制namespace bustub { page_id_t HashTableHeaderPage::GetBlockPageId(size_t index) { assert(index < next_ind_); return block_page_ids_[index]; } page_id_t HashTableHeaderPage::GetPageId() const { return page_id_; } void HashTableHeaderPage::SetPageId(bustub::page_id_t page_id) { page_id_ = page_id; } lsn_t HashTableHeaderPage::GetLSN() const { return lsn_; } void HashTableHeaderPage::SetLSN(lsn_t lsn) { lsn_ = lsn; } void HashTableHeaderPage::AddBlockPageId(page_id_t page_id) { block_page_ids_[next_ind_++] = page_id; } size_t HashTableHeaderPage::NumBlocks() { return next_ind_; } void HashTableHeaderPage::SetSize(size_t size) { size_ = size; } size_t HashTableHeaderPage::GetSize() const { return size_; } } // namespace bustub
Hash Table Block Page
HashTableBlockPage
包含多个 slot,用于保存键值对,所以该类定义了查询、插入和删除键值对的函数。为了跟踪每个 slot 的使用情况,该类包含以下三个数据成员:
occupied_
: 第 i 位置 1 表示 Page 的第 i 个 slot 上存储了键值对或者之前存了键值对但之后被删除了(起到墓碑的作用)readable_
: 第 i 位置 1 表示 Page 的第 i 个 slot 上存储了键值对array_
: 用于保存键值对的数组
每个键值对的大小为 sizeof(std::pair<KeyType, ValueType>)
字节(记为 PS
),每个键值对对应两个 bit(occupied
和 readable
)即 1/4 个字节,所以一个 Page 最多能保存 BLOCK_ARRAY_SIZE = PAGE_SIZE / (PS + 1/4)
个键值对,即每个 Page 有 BLOCK_ARRAY_SIZE
个 slot。
由于 occupied_
和 readable_
被定义为 char
数组,所以需要两个辅助函数 GetBit
和 SetBit
来访问第 i 位的比特。
复制namespace bustub { /** * Store indexed key and and value together within block page. Supports * non-unique keys. * * Block page format (keys are stored in order): * ---------------------------------------------------------------- * | KEY(1) + VALUE(1) | KEY(2) + VALUE(2) | ... | KEY(n) + VALUE(n) * ---------------------------------------------------------------- * * Here '+' means concatenation. * */ template <typename KeyType, typename ValueType, typename KeyComparator> class HashTableBlockPage { public: // Delete all constructor / destructor to ensure memory safety HashTableBlockPage() = delete; KeyType KeyAt(slot_offset_t bucket_ind) const; ValueType ValueAt(slot_offset_t bucket_ind) const; bool Insert(slot_offset_t bucket_ind, const KeyType &key, const ValueType &value); void Remove(slot_offset_t bucket_ind); bool IsOccupied(slot_offset_t bucket_ind) const; bool IsReadable(slot_offset_t bucket_ind) const; private: bool GetBit(const std::atomic_char *array, slot_offset_t bucket_ind) const; void SetBit(std::atomic_char *array, slot_offset_t bucket_ind, bool value); std::atomic_char occupied_[(BLOCK_ARRAY_SIZE - 1) / 8 + 1]; // 0 if tombstone/brand new (never occupied), 1 otherwise. std::atomic_char readable_[(BLOCK_ARRAY_SIZE - 1) / 8 + 1]; MappingType array_[0]; }; } // namespace bustub
实现代码如下:
复制namespace bustub { template <typename KeyType, typename ValueType, typename KeyComparator> KeyType HASH_TABLE_BLOCK_TYPE::KeyAt(slot_offset_t bucket_ind) const { return array_[bucket_ind].first; } template <typename KeyType, typename ValueType, typename KeyComparator> ValueType HASH_TABLE_BLOCK_TYPE::ValueAt(slot_offset_t bucket_ind) const { return array_[bucket_ind].second; } template <typename KeyType, typename ValueType, typename KeyComparator> bool HASH_TABLE_BLOCK_TYPE::Insert(slot_offset_t bucket_ind, const KeyType &key, const ValueType &value) { if (IsReadable(bucket_ind)) { return false; } array_[bucket_ind] = {key, value}; SetBit(readable_, bucket_ind, true); SetBit(occupied_, bucket_ind, true); return true; } template <typename KeyType, typename ValueType, typename KeyComparator> void HASH_TABLE_BLOCK_TYPE::Remove(slot_offset_t bucket_ind) { SetBit(readable_, bucket_ind, false); } template <typename KeyType, typename ValueType, typename KeyComparator> bool HASH_TABLE_BLOCK_TYPE::IsOccupied(slot_offset_t bucket_ind) const { return GetBit(occupied_, bucket_ind); } template <typename KeyType, typename ValueType, typename KeyComparator> bool HASH_TABLE_BLOCK_TYPE::IsReadable(slot_offset_t bucket_ind) const { return GetBit(readable_, bucket_ind); } template <typename KeyType, typename ValueType, typename KeyComparator> bool HASH_TABLE_BLOCK_TYPE::GetBit(const std::atomic_char *array, slot_offset_t bucket_ind) const { return array[bucket_ind / 8] & (1 << bucket_ind % 8); } template <typename KeyType, typename ValueType, typename KeyComparator> void HASH_TABLE_BLOCK_TYPE::SetBit(std::atomic_char *array, slot_offset_t bucket_ind, bool value) { if (value) { array[bucket_ind / 8] |= (1 << bucket_ind % 8); } else { array[bucket_ind / 8] &= ~(1 << bucket_ind % 8); } } // DO NOT REMOVE ANYTHING BELOW THIS LINE template class HashTableBlockPage<int, int, IntComparator>; template class HashTableBlockPage<GenericKey<4>, RID, GenericComparator<4>>; template class HashTableBlockPage<GenericKey<8>, RID, GenericComparator<8>>; template class HashTableBlockPage<GenericKey<16>, RID, GenericComparator<16>>; template class HashTableBlockPage<GenericKey<32>, RID, GenericComparator<32>>; template class HashTableBlockPage<GenericKey<64>, RID, GenericComparator<64>>; } // namespace bustub
哈希表
声明
实验要求我们实现哈希表的插入、查找、删除和调整大小的的操作,对应的类声明如下,为了完成这些操作,我们多定义了几个私有的辅助函数和成员变量:
复制namespace bustub { #define HASH_TABLE_TYPE LinearProbeHashTable<KeyType, ValueType, KeyComparator> template <typename KeyType, typename ValueType, typename KeyComparator> class LinearProbeHashTable : public HashTable<KeyType, ValueType, KeyComparator> { public: explicit LinearProbeHashTable(const std::string &name, BufferPoolManager *buffer_pool_manager, const KeyComparator &comparator, size_t num_buckets, HashFunction<KeyType> hash_fn); bool Insert(Transaction *transaction, const KeyType &key, const ValueType &value) override; bool Remove(Transaction *transaction, const KeyType &key, const ValueType &value) override; bool GetValue(Transaction *transaction, const KeyType &key, std::vector<ValueType> *result) override; void Resize(size_t initial_size); size_t GetSize(); private: using slot_index_t = size_t; using block_index_t = size_t; enum class LockType { READ = 0, WRITE = 1 }; /** * initialize header page and allocate block pages for it * @param page the hash table header page */ void InitHeaderPage(HashTableHeaderPage *page); /** * get index according to key * @param key the key to be hashed * @return a tuple contains slot index, block page index and bucket index */ std::tuple<slot_index_t, block_index_t, slot_offset_t> GetIndex(const KeyType &key); /** * linear probe step forward * @param bucket_index the bucket index * @param block_index the hash table block page index * @param header_page hash table header page * @param raw_block_page raw hash table block page * @param block_page hash table block page * @param lock_type lock type of block page */ void StepForward(slot_offset_t &bucket_index, block_index_t &block_index, Page *&raw_block_page, HASH_TABLE_BLOCK_TYPE *&block_page, LockType lockType); bool InsertImpl(Transaction *transaction, const KeyType &key, const ValueType &value); inline bool IsMatch(HASH_TABLE_BLOCK_TYPE *block_page, slot_offset_t bucket_index, const KeyType &key, const ValueType &value) { return !comparator_(key, block_page->KeyAt(bucket_index)) && value == block_page->ValueAt(bucket_index); } inline HashTableHeaderPage *HeaderPageCast(Page *page) { return reinterpret_cast<HashTableHeaderPage *>(page->GetData()); } inline HASH_TABLE_BLOCK_TYPE *BlockPageCast(Page *page) { return reinterpret_cast<HASH_TABLE_BLOCK_TYPE *>(page->GetData()); } /** * get the slot number of hash table block page * @param block_index the index of hash table block page * @return the slot number of block page */ inline size_t GetBlockArraySize(block_index_t block_index){ return block_index < num_pages_ - 1 ? BLOCK_ARRAY_SIZE : last_block_array_size_; } // member variable page_id_t header_page_id_; BufferPoolManager *buffer_pool_manager_; KeyComparator comparator_; std::vector<page_id_t> page_ids_; size_t num_buckets_; size_t num_pages_; size_t last_block_array_size_; // Readers includes inserts and removes, writer is only resize ReaderWriterLatch table_latch_; // Hash function HashFunction<KeyType> hash_fn_; }; } // namespace bustub
构造函数
在构造函数中负责根据用户指定的 num_buckets
(也就是 slot 的数量)分配 Page,最后一个 Page 的 slot 数量可能少于前面的 Page。这里还将每个 HashTableBlockPage
对应的 page_id
保存到 page_ids_
成员里面了,这样之后就不需要仅仅为了知道某个 HashTableBlockPage
的 page_id
而去找 BufferPoolManager
索要 HashTableHeaderPage
。
复制template <typename KeyType, typename ValueType, typename KeyComparator> HASH_TABLE_TYPE::LinearProbeHashTable(const std::string &name, BufferPoolManager *buffer_pool_manager, const KeyComparator &comparator, size_t num_buckets, HashFunction<KeyType> hash_fn) : buffer_pool_manager_(buffer_pool_manager), comparator_(comparator), num_buckets_(num_buckets), num_pages_((num_buckets - 1) / BLOCK_ARRAY_SIZE + 1), last_block_array_size_(num_buckets - (num_pages_ - 1) * BLOCK_ARRAY_SIZE), hash_fn_(std::move(hash_fn)) { auto page = buffer_pool_manager->NewPage(&header_page_id_); page->WLatch(); InitHeaderPage(HeaderPageCast(page)); page->WUnlatch(); buffer_pool_manager_->UnpinPage(header_page_id_, true); } template <typename KeyType, typename ValueType, typename KeyComparator> void HASH_TABLE_TYPE::InitHeaderPage(HashTableHeaderPage *header_page) { header_page->SetPageId(header_page_id_); header_page->SetSize(num_buckets_); page_ids_.clear(); for (size_t i = 0; i < num_pages_; ++i) { page_id_t page_id; buffer_pool_manager_->NewPage(&page_id); buffer_pool_manager_->UnpinPage(page_id, false); header_page->AddBlockPageId(page_id); page_ids_.push_back(page_id); } }
查找
哈希表使用线性探测法进行键值对的查找,由于实验要求哈希表支持插入同键不同值的键值对,所以在线性探测过程中需要将所有相同键的值插入 result
向量中:
复制template <typename KeyType, typename ValueType, typename KeyComparator> bool HASH_TABLE_TYPE::GetValue(Transaction *transaction, const KeyType &key, std::vector<ValueType> *result) { table_latch_.RLock(); // get slot index, block page index and bucket index according to key auto [slot_index, block_index, bucket_index] = GetIndex(key); // get block page that contains the key auto raw_block_page = buffer_pool_manager_->FetchPage(page_ids_[block_index]); raw_block_page->RLatch(); auto block_page = BlockPageCast(raw_block_page); // linear probe while (block_page->IsOccupied(bucket_index)) { // find the correct position if (block_page->IsReadable(bucket_index) && !comparator_(key, block_page->KeyAt(bucket_index))) { result->push_back(block_page->ValueAt(bucket_index)); } StepForward(bucket_index, block_index, raw_block_page, block_page, LockType::READ); // break loop if we have returned to original position if (block_index * BLOCK_ARRAY_SIZE + bucket_index == slot_index) { break; } } // unlock raw_block_page->RUnlatch(); buffer_pool_manager_->UnpinPage(raw_block_page->GetPageId(), false); table_latch_.RUnlock(); return result->size() > 0; }
GetIndex
函数根据 key
计算出对应的 slot_index
、block_index
和 bucket_index
(就是 slot offset),结合上图就能理解该函数的工作原理了:
复制template <typename KeyType, typename ValueType, typename KeyComparator> auto HASH_TABLE_TYPE::GetIndex(const KeyType &key) -> std::tuple<slot_index_t, block_index_t, slot_offset_t> { slot_index_t slot_index = hash_fn_.GetHash(key) % num_buckets_; block_index_t block_index = slot_index / BLOCK_ARRAY_SIZE; slot_offset_t bucket_index = slot_index % BLOCK_ARRAY_SIZE; return {slot_index, block_index, bucket_index}; }
在线性探测过程中,我们可能从从一个 HashTableBlockPage
跳到下一个,这时候需要更新 bucket_index
和 block_index
:
复制template <typename KeyType, typename ValueType, typename KeyComparator> void HASH_TABLE_TYPE::StepForward(slot_offset_t &bucket_index, block_index_t &block_index, Page *&raw_block_page, HASH_TABLE_BLOCK_TYPE *&block_page, LockType lockType) { if (++bucket_index != GetBlockArraySize(block_index)) { return; } // move to next block page if (lockType == LockType::READ) { raw_block_page->RUnlatch(); } else { raw_block_page->WUnlatch(); } buffer_pool_manager_->UnpinPage(page_ids_[block_index], false); // update index bucket_index = 0; block_index = (block_index + 1) % num_pages_; // update page raw_block_page = buffer_pool_manager_->FetchPage(page_ids_[block_index]); if (lockType == LockType::READ) { raw_block_page->RLatch(); } else { raw_block_page->WLatch(); } block_page = BlockPageCast(raw_block_page); }
插入
实验要求哈希表不允许插入已经存在的键值对,同时插入过程中如果回到了最初的位置,说明没有可用的 slot 用于插入键值对,这时需要将哈希表的大小翻倍。由于 Resize
的函数也要用到插入操作,如果直接调用 Insert
会出现死锁,所以这里使用 InsertImpl
来实现插入:
复制template <typename KeyType, typename ValueType, typename KeyComparator> bool HASH_TABLE_TYPE::Insert(Transaction *transaction, const KeyType &key, const ValueType &value) { table_latch_.RLock(); auto success = InsertImpl(transaction, key, value); table_latch_.RUnlock(); return success; } template <typename KeyType, typename ValueType, typename KeyComparator> bool HASH_TABLE_TYPE::InsertImpl(Transaction *transaction, const KeyType &key, const ValueType &value) { // get slot index, block page index and bucket index according to key auto [slot_index, block_index, bucket_index] = GetIndex(key); // get block page that contains the key auto raw_block_page = buffer_pool_manager_->FetchPage(page_ids_[block_index]); raw_block_page->WLatch(); auto block_page = BlockPageCast(raw_block_page); bool success = true; while (!block_page->Insert(bucket_index, key, value)) { // return false if (key, value) pair already exists if (block_page->IsReadable(bucket_index) && IsMatch(block_page, bucket_index, key, value)) { success = false; break; } StepForward(bucket_index, block_index, raw_block_page, block_page, LockType::WRITE); // resize hash table if we have returned to original position if (block_index * BLOCK_ARRAY_SIZE + bucket_index == slot_index) { raw_block_page->WUnlatch(); buffer_pool_manager_->UnpinPage(raw_block_page->GetPageId(), false); Resize(num_pages_); std::tie(slot_index, block_index, bucket_index) = GetIndex(key); raw_block_page = buffer_pool_manager_->FetchPage(page_ids_[block_index]); raw_block_page->WLatch(); block_page = BlockPageCast(raw_block_page); } } raw_block_page->WUnlatch(); buffer_pool_manager_->UnpinPage(raw_block_page->GetPageId(), success); return success; }
调整大小
由于实验假设哈希表很大,所以我们不能将原本的键值对全部保存到内存中,然后调整 HashTableHeaderPage
的大小,复用 HashTableBlockPage
并创建新的 Page,再把键值对重新插入。而是应该直接创建新的 HashTableHeaderPage
和 HashTableBlockPage
,并删除旧的哈希表页:
复制template <typename KeyType, typename ValueType, typename KeyComparator> void HASH_TABLE_TYPE::Resize(size_t initial_size) { table_latch_.WLock(); num_buckets_ = 2 * initial_size; num_pages_ = (num_buckets_ - 1) / BLOCK_ARRAY_SIZE + 1; last_block_array_size_ = num_buckets_ - (num_pages_ - 1) * BLOCK_ARRAY_SIZE; // save the old header page id auto old_header_page_id = header_page_id_; std::vector<page_id_t> old_page_ids(page_ids_); // get the new header page auto raw_header_page = buffer_pool_manager_->NewPage(&header_page_id_); raw_header_page->WLatch(); InitHeaderPage(HeaderPageCast(raw_header_page)); // move (key, value) pairs to new space for (size_t block_index = 0; block_index < num_pages_; ++block_index) { auto old_page_id = old_page_ids[block_index]; auto raw_block_page = buffer_pool_manager_->FetchPage(old_page_id); raw_block_page->RLatch(); auto block_page = BlockPageCast(raw_block_page); // move (key, value) pair from each readable slot for (slot_offset_t bucket_index = 0; bucket_index < GetBlockArraySize(block_index); ++bucket_index) { if (block_page->IsReadable(bucket_index)) { InsertImpl(nullptr, block_page->KeyAt(bucket_index), block_page->ValueAt(bucket_index)); } } // delete old page raw_block_page->RUnlatch(); buffer_pool_manager_->UnpinPage(old_page_id, false); buffer_pool_manager_->DeletePage(old_page_id); } raw_header_page->WUnlatch(); buffer_pool_manager_->UnpinPage(header_page_id_, false); buffer_pool_manager_->DeletePage(old_header_page_id); table_latch_.WUnlock(); }
删除
删除操作和查找操作很像,不过是将找到的 slot 标上墓碑罢了:
复制template <typename KeyType, typename ValueType, typename KeyComparator> bool HASH_TABLE_TYPE::Remove(Transaction *transaction, const KeyType &key, const ValueType &value) { table_latch_.RLock(); // get slot index, block page index and bucket index according to key auto [slot_index, block_index, bucket_index] = GetIndex(key); // get block page that contains the key auto raw_block_page = buffer_pool_manager_->FetchPage(page_ids_[block_index]); raw_block_page->WLatch(); auto block_page = BlockPageCast(raw_block_page); bool success = false; while (block_page->IsOccupied(bucket_index)) { // remove the (key, value) pair if find the matched readable one if (IsMatch(block_page, bucket_index, key, value)) { if (block_page->IsReadable(bucket_index)) { block_page->Remove(bucket_index); success = true; } else { success = false; } break; } // step forward StepForward(bucket_index, block_index, raw_block_page, block_page, LockType::WRITE); // break loop if we have returned to original position if (block_index * BLOCK_ARRAY_SIZE + bucket_index == slot_index) { break; } } raw_block_page->WUnlatch(); buffer_pool_manager_->UnpinPage(raw_block_page->GetPageId(), success); table_latch_.RUnlock(); return success; }
获取大小
最后是获取哈希表的大小操作,直接返回 num_buckets_
就行了:
复制template <typename KeyType, typename ValueType, typename KeyComparator> size_t HASH_TABLE_TYPE::GetSize() { return num_buckets_; }
测试
对哈希表的测试结果如下,6 个测试全部通过了:
总结
该实验主要考察对线性探测哈希表、缓冲池管理器和读写锁的理解,难度相比上一个实验略有提升,但是理解了哈希表的结构图之后应该就不难完成该实验了,以上~~
转 https://www.cnblogs.com/zhiyiYo/p/16453495.html
这篇关于CMU15445 (Fall 2019) 之 Project#2 - Hash Table 详解的文章就介绍到这儿,希望我们推荐的文章对大家有所帮助,也希望大家多多支持为之网!
- 2024-11-22怎么实现ansible playbook 备份代码中命名包含时间戳功能?-icode9专业技术文章分享
- 2024-11-22ansible 的archive 参数是什么意思?-icode9专业技术文章分享
- 2024-11-22ansible 中怎么只用archive 排除某个目录?-icode9专业技术文章分享
- 2024-11-22exclude_path参数是什么作用?-icode9专业技术文章分享
- 2024-11-22微信开放平台第三方平台什么时候调用数据预拉取和数据周期性更新接口?-icode9专业技术文章分享
- 2024-11-22uniapp 实现聊天消息会话的列表功能怎么实现?-icode9专业技术文章分享
- 2024-11-22在Mac系统上将图片中的文字提取出来有哪些方法?-icode9专业技术文章分享
- 2024-11-22excel 表格中怎么固定一行显示不滚动?-icode9专业技术文章分享
- 2024-11-22怎么将 -rwxr-xr-x 修改为 drwxr-xr-x?-icode9专业技术文章分享
- 2024-11-22在Excel中怎么将小数向上取整到最接近的整数?-icode9专业技术文章分享