incubator-doris [Proposal] Bitmap Index File Format for V2 segment

ix0qys7i posted on 2022-04-22

Background and Goal

Bitmap index is very popular among OLAP systems. A bitmap index contains two parts.

  • First, there is an ordered dictionary which contains all distinct values of a column and maps each value to an id. The first value is mapped to 0, the second to 1, and the last to N-1, where N is the cardinality of the column. The ordered dictionary enables fast range search on its keys.
  • Second, there is a posting list which stores one bitmap for each value in the dictionary. The bitmap represents the list of rowids where a particular value exists. E.g., if the column contains 10 rows and value 'x' exists at rowids [1, 2, 3, 7], then the bitmap for value 'x' is [0 1 1 1 0 0 0 1 0 0], where the n-th bit is 1 if the n-th row contains value 'x' and 0 otherwise. The posting list supports fast retrieval of the n-th bitmap. (A minimal in-memory sketch follows this list.)
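To make the two parts concrete, here is a minimal in-memory sketch in C++. It is for illustration only and is not the proposed on-disk format; the struct and member names are invented for this example, and std::vector<bool> stands in for a compressed bitmap such as RoaringBitmap.

#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Illustrative in-memory model of a bitmap index (not the on-disk layout).
struct BitmapIndexSketch {
    // ordered dictionary: distinct value -> dict id (0..N-1 in sorted order)
    std::map<std::string, uint32_t> dict;
    // posting list: one bitmap of rowids per dict id
    // (std::vector<bool> stands in for a RoaringBitmap)
    std::vector<std::vector<bool>> postings;

    // rows where the column equals `key`; empty bitmap if the value is absent
    std::vector<bool> Lookup(const std::string& key) const {
        auto it = dict.find(key);
        return it == dict.end() ? std::vector<bool>() : postings[it->second];
    }

    // rows where lo <= value < hi: range search on the ordered dictionary,
    // then OR the bitmaps of all matching dict ids
    std::vector<bool> RangeLookup(const std::string& lo, const std::string& hi,
                                  size_t num_rows) const {
        std::vector<bool> result(num_rows, false);
        for (auto it = dict.lower_bound(lo); it != dict.end() && it->first < hi; ++it) {
            const std::vector<bool>& bm = postings[it->second];
            for (size_t row = 0; row < bm.size(); ++row) {
                if (bm[row]) result[row] = true;
            }
        }
        return result;
    }
};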

To support bitmap index in Doris, we first design an efficient file format for it. The goals for such file format are

1. page based format: we can reuse the page encoding, compression, and cache utilities already built for the V2 segment
2. separate from data files: a new index can be added without modifying data files, and decoding rows doesn't require reading the index (otherwise point query performance would degrade)
3. space efficient and fast range search: dictionary and bitmaps should be compressed without hurting read performance too much
4. memory efficient: searching for a particular key only needs to load a few pages, and computation can work directly on compressed pages
5. extendable: the format should be easy to extend in the future

We now present the detailed design that fulfills the above requirements.

Overall index file structure

Each segment file in Doris can have an accompanying bitmap index file. Since a segment file contains at most INT32_MAX rows, rowids range from 0 to INT32_MAX - 1. To reduce the number of files and open FDs, an index file contains the bitmap indexes of all indexed columns. The overall file structure is shown below, followed by a sketch of how a reader locates the footer.

<beginning of the file>
[Bitmap Index for Column 1]
[Bitmap Index for Column 2]
...
[Bitmap Index for Column n]
[Footer] // FooterPB, FooterPBSize(4), FooterCRC(4), MagicNumber(4)
<end of the file>
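As a rough illustration of how a reader would locate the footer in this layout, here is a sketch that reads the fixed-size trailer fields from the end of the file. The function name and error handling are invented for this example; it assumes little-endian fixed 32-bit fields and leaves CRC verification and FooterPB parsing to the caller.

#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Sketch only: read the serialized FooterPB bytes, given the layout
// [FooterPB][FooterPBSize(4)][FooterCRC(4)][MagicNumber(4)] at the end of the file.
bool ReadFooterBytes(const std::string& path, std::vector<char>* footer_pb) {
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    if (!in) return false;
    const std::streamoff file_size = in.tellg();
    if (file_size < 12) return false;

    uint32_t footer_size = 0, crc = 0, magic = 0;
    in.seekg(file_size - 12);
    in.read(reinterpret_cast<char*>(&footer_size), 4);
    in.read(reinterpret_cast<char*>(&crc), 4);
    in.read(reinterpret_cast<char*>(&magic), 4);
    // the caller should check `magic` and verify `crc` over the FooterPB bytes

    if (static_cast<std::streamoff>(footer_size) > file_size - 12) return false;
    footer_pb->resize(footer_size);
    in.seekg(file_size - 12 - static_cast<std::streamoff>(footer_size));
    in.read(footer_pb->data(), footer_size);
    return static_cast<bool>(in);
}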

To add a new bitmap index to a table, there are several approaches to consider.

The first approach (shown below) is to copy the existing indexes to a new file and then append the new index and footer to it. It's easy to implement but incurs additional I/O and invalidates the cache for existing index pages.

<beginning of the file>
[Bitmap Index for Column 1]  // copied
[Bitmap Index for Column 2]  // copied
...
[Bitmap Index for Column n]  // copied
[Bitmap Index for Column n+1]  // added
[Footer]  // added
<end of the file>

The second approach (shown below) is to hard link the old index file and append the new index and then a new footer to it. It involves less I/O but is more complicated to implement and still has the cache invalidation problem.

<beginning of the file>
[Bitmap Index for Column 1]
[Bitmap Index for Column 2]
...
[Bitmap Index for Column n]
[Footer] // old footer (ignored)
[Bitmap Index for Column n+1]  // added
[Footer]  // new footer added
<end of the file>

More approaches can be explored and discussed. For the purpose of this proposal, let's move on to the layout of a single column's bitmap index (see below).

// dictionary section
[DictDataPage 1]
...
[DictDataPage N]
[DictIndexPage] // used for locating DictDataPage for a particular key
// bitmap section
[BitmapDataPage 1]
...
[BitmapDataPage M]
[BitmapIndexPage] // used for locating BitmapDataPage for a particular bitmap (indexed by dict id)

Each bitmap index contains two sections. The "dictionary section" stores the ordered dictionary and is composed of several DictDataPages and one DictIndexPage. The encoding of DictDataPage and DictIndexPage depends on the type of the indexed column. The "bitmap section" stores the posting list and is composed of several BitmapDataPages and one BitmapIndexPage. The encoding of BitmapDataPage and BitmapIndexPage is the same for all column types.

A search operation usually requires two I/Os: one for the index page and one for a data page. There is an optimization, though: when there is only one data page, the index page is not needed. E.g., a bitmap index with one DictDataPage and one BitmapDataPage contains only two pages, so we don't need to store a DictIndexPage or BitmapIndexPage. A sketch of this read path follows.
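The read path could look roughly like the following sketch. The types and helper functions (PagePointer, SectionMeta, Page, ReadPage, LocateDataPage) are placeholders invented for this example, not actual Doris classes; only the num_pages branch mirrors the proposal.

#include <cstdint>
#include <string>

// Minimal stand-in types for illustration only.
struct PagePointer { uint64_t offset = 0; uint32_t size = 0; };
struct SectionMeta {                 // mirrors the per-section metadata in the footer
    uint32_t num_pages = 0;
    PagePointer page_pointer;        // data page if num_pages == 1, otherwise index page
};
struct Page { /* decoded page contents */ };

// Assumed helpers, declared only:
Page ReadPage(const PagePointer& pp);                                  // read + decompress a page
PagePointer LocateDataPage(const Page& index_page, const std::string& key);

// A search touches at most two pages: the index page, then one data page.
Page ReadEntryPage(const SectionMeta& section, const std::string& key) {
    if (section.num_pages == 1) {
        return ReadPage(section.page_pointer);             // single data page, no index page I/O
    }
    Page index_page = ReadPage(section.page_pointer);      // first I/O: index page
    return ReadPage(LocateDataPage(index_page, key));      // second I/O: data page
}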

Encoding for bitmap section

To achieve good space and time efficiency, RoaringBitmap is used as the main bitmap structure.

BinaryPlainPage is used to encode BitmapDataPage. Each entry is a serialized RoaringBitmap. Since bitmaps are already highly compressed, we don't need to further compress BitmapDataPage.

Bitmaps are stored sequentially according to their dict IDs. To efficiently locate the data page that contains the bitmap for a particular ID, BitmapIndexPage stores the first ID and file pointer of each BitmapDataPage. OrdinalIndexPage can be used to encode BitmapIndexPage. A lookup sketch follows.
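For illustration, locating the bitmap of a given dict ID could work as sketched below. BitmapIndexEntry and FindBitmapPage are invented names; the entries mirror the (first ID, file pointer) pairs stored in BitmapIndexPage.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// One entry per BitmapDataPage, sorted by first_id.
struct BitmapIndexEntry { uint32_t first_id; uint64_t page_offset; uint32_t page_size; };

// Returns the index of the BitmapDataPage that contains `dict_id`.
// Assumes dict_id >= index.front().first_id (the first page starts at id 0).
size_t FindBitmapPage(const std::vector<BitmapIndexEntry>& index, uint32_t dict_id) {
    auto it = std::upper_bound(
        index.begin(), index.end(), dict_id,
        [](uint32_t id, const BitmapIndexEntry& e) { return id < e.first_id; });
    return static_cast<size_t>(it - index.begin()) - 1;   // last entry with first_id <= dict_id
}

// Within that page, the bitmap is the (dict_id - first_id)-th BinaryPlainPage entry;
// its bytes are then deserialized into a RoaringBitmap.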

Encoding for string dictionary

The string dictionary uses prefix encoding and a format similar to LevelDB's DataBlock and IndexBlock. The difference is that we only store the key part, because the value (dict ID) in an ordered dictionary can be derived as the ordinal of the key.

A new encoding called DeltaBinaryPage is added to encode the string DictDataPage (a builder sketch is given after the notes below). Besides, we can also leverage DeltaBinaryPage to encode the short key index, thanks to its good compression and search performance on sorted keys.

DeltaBinaryPage := Entry^EntryNum, Trailer
Entry := SharedPrefixLength(vint), UnsharedLength(vint), Byte^UnsharedLength
Trailer := RestartPointStartOffset(4)^NumRestartPoints, NumRestartPoints(4)

Note that
- In delta encoding, each entry stores the length of the prefix shared with the previous entry, the length of the unshared suffix, and the unshared suffix bytes.
- Every 16 elements we create a restart point. The element at a restart point is stored in full (SharedPrefixLength == 0).
- Since elements in a DictDataPage are sorted, to search for an element we first binary search the restart points, then sequentially decode and compare the elements after the chosen restart point.
- EntryNum is not stored because the start offset of the Trailer is known once `NumRestartPoints` is read, and that offset is exactly where the last element ends.
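Here is a minimal builder sketch for the entry stream of a DeltaBinaryPage, assuming the layout and restart interval described above. The class and function names are invented for this example, and writing the Trailer is left as a comment.

#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// LEB128-style varint, matching the vint fields in the layout above.
static void PutVarint32(std::string* dst, uint32_t v) {
    while (v >= 0x80) { dst->push_back(static_cast<char>(v | 0x80)); v >>= 7; }
    dst->push_back(static_cast<char>(v));
}

// Illustrative builder for the Entry stream of a DeltaBinaryPage.
struct DeltaBinaryPageBuilder {
    std::string buffer;                        // concatenated entries
    std::vector<uint32_t> restarts{0};         // the first entry is always a restart point
    std::string last_key;
    int count_since_restart = 0;

    void Add(const std::string& key) {         // keys must be added in sorted order
        size_t shared = 0;
        if (count_since_restart == 16) {       // new restart point every 16 entries
            restarts.push_back(static_cast<uint32_t>(buffer.size()));
            count_since_restart = 0;           // key is stored in full (shared == 0)
        } else {
            size_t limit = std::min(last_key.size(), key.size());
            while (shared < limit && last_key[shared] == key[shared]) ++shared;
        }
        PutVarint32(&buffer, static_cast<uint32_t>(shared));              // SharedPrefixLength
        PutVarint32(&buffer, static_cast<uint32_t>(key.size() - shared)); // UnsharedLength
        buffer.append(key, shared, std::string::npos);                    // unshared suffix
        last_key = key;
        ++count_since_restart;
    }

    // Finish() would append each restart offset and NumRestartPoints as fixed
    // 32-bit integers to form the Trailer.
};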

We use BinaryPlainPage to encode DictIndexPage; it contains one entry for each DictDataPage. For the string type, the entry format is

DictIndexPage(string)'s entry := SplitKeyLength(vint), SplitKey, PageStartId(4), PageOffset(vlong), PageSize(vint)

It contains the split key, start ID, and location of each data page. The start ID is used to calculate the ordinal of a particular key. The split key is a string >= the last key in the corresponding data page and < the first key in the next data page. For simplicity, the last key of the data page can be used, but a shorter key can be derived with an optimization similar to LevelDB's, as sketched below.
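For illustration, a split-key shortening similar in spirit to LevelDB's FindShortestSeparator could look like the sketch below. The function name is invented for this example; returning the full last key is always a correct fallback.

#include <algorithm>
#include <string>

// Returns a key s with last_key <= s < next_key that is no longer than last_key.
std::string ShortSplitKey(const std::string& last_key, const std::string& next_key) {
    size_t shared = 0;
    size_t limit = std::min(last_key.size(), next_key.size());
    while (shared < limit && last_key[shared] == next_key[shared]) ++shared;
    if (shared < limit) {
        unsigned char b = static_cast<unsigned char>(last_key[shared]);
        if (b + 1 < static_cast<unsigned char>(next_key[shared])) {
            // bump the first differing byte: the result is > last_key and < next_key
            std::string s = last_key.substr(0, shared + 1);
            s[shared] = static_cast<char>(b + 1);
            return s;
        }
    }
    return last_key;   // fall back to the full last key
}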

Encoding for integer dictionary

An integer dictionary contains a sorted list of integers, which can be efficiently encoded using frame-of-reference encoding.

A new encoding called OrderedIntPage is added to encode the integer DictDataPage.

OrderedInt32Page := BitPackingFrame^(FrameCount-1), VarIntFrame, FrameOffset(4)^FrameCount, FrameCount(4), NumValues(4)
BitPackingFrame := FirstValue(4), Delta(BitWidth)^128
VarIntFrame := FirstValue(4), Delta(vint)^(NumValues - FrameCount * 128), Padding

Note that
- Each frame except the last contains 128 values and uses bit packing to store deltas. The last frame may contain fewer than 128 values, and in that case it switches to storing deltas as vint.
- The frame size must be a multiple of 4. BitPackingFrame guarantees this property; VarIntFrame inserts padding to ensure it.
- Each frame can be decoded independently. The first (also minimum) value of each frame is stored. When searching for a particular value, we first use FrameOffset to binary search the first values of the frames, then decode the chosen frame and search it sequentially.
- The BitWidth of a BitPackingFrame is not stored because it can be calculated from FrameOffset (see the helper sketch below):
BitWidth[i] = (FrameOffset[i+1] - FrameOffset[i] - 4) * 8 / 128
- For 64-bit integers, the only differences are that FirstValue is 8 bytes, Delta in VarIntFrame uses vlong instead of vint, and the BitWidth of a BitPackingFrame ranges from 0 to 64.
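As a small illustration of the BitWidth derivation for the 32-bit layout (the function name is invented for this example): a 52-byte frame, for instance, yields BitWidth = (52 - 4) * 8 / 128 = 3.

#include <cstddef>
#include <cstdint>
#include <vector>

// Recover the BitWidth of every BitPackingFrame from the FrameOffset array.
// frame_offset has FrameCount entries; the last frame (VarIntFrame) is skipped.
std::vector<uint32_t> BitPackingWidths(const std::vector<uint32_t>& frame_offset) {
    std::vector<uint32_t> widths;
    for (size_t i = 0; i + 1 < frame_offset.size(); ++i) {
        uint32_t frame_bytes = frame_offset[i + 1] - frame_offset[i];
        widths.push_back((frame_bytes - 4) * 8 / 128);   // 4-byte FirstValue, 128 deltas
    }
    return widths;
}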

We use BinaryPlainPage to encode the integer DictIndexPage. The entry formats are

DictIndexPage(int32)'s entry := LastValue(4), PageStartId(4), PageOffset(vlong), PageSize(vint)
DictIndexPage(int64)'s entry := LastValue(8), PageStartId(4), PageOffset(vlong), PageSize(vint)

FooterPB contains the encoding, compression, and index page location of each column's bitmap index. It allows us to choose different encoding/compression schemes in the future.

enum PostingsTypePB {
    UNKNOWN_ENCODING = 0;
    ROARING_BITMAP = 1;
}

message BitmapIndexSectionPB {
    optional uint32 num_pages = 1;
    optional EncodingTypePB data_page_encoding = 2;
    optional EncodingTypePB index_page_encoding = 3;
    // point to data page when `num_pages == 1`, otherwise point to index page
    optional PagePointerPB page_pointer = 4;
    optional uint64 data_page_size = 5;
    optional uint64 index_page_size = 6;
    optional CompressionTypePB data_page_compression = 7;
    optional CompressionTypePB index_page_compression = 8;
}

message BitmapIndexColumnPB {
    optional uint32 column_id = 1;
    optional uint32 unique_id = 2;
    optional int32 type = 3;
    optional BitmapIndexSectionPB dict = 4;
    optional BitmapIndexSectionPB postings = 5;
    optional PostingsTypePB postings_type = 6;
}

message BitmapIndexFileFooterPB {
    optional uint32 version = 1 [default = 1];
    repeated BitmapIndexColumnPB columns = 2;
}
xbp102n0 #1

Why should the bitmap index be separated from the data file? Is there any situation where the index changes without the data changing?

oipij1gg #2

@gaodayue

Good proposal, I have some questions about this proposal.

  1. If we support other types of index, for example B+Tree, do we need to store all index types in one index file?
  2. I think we should keep files immutable, so I don't agree with the second option for adding a new index. What about having multiple index files for one data file? If we add a new index, we can create a new index file for it, and we can merge all index files into one when we do compaction.
  3. Now we store the dictionary in two levels; how much data can this support? Why not abstract this into a B-tree that can support multiple layers? Here is Kudu's implementation, which we can reference.
xyhw6mcr #3

As for the encoding of the integer dictionary, I think it's OK to use one byte to store BitWidth. If we store BitWidth, the last frame can also be encoded in the BitPackingFrame format.

7ivaypg9 #4

@imay thanks for the comments.
If we support other types of index, for example B+Tree, do we need to store all index types in one index file?

No, we don't. Different types of index will have different structures and metadata; I don't think it's a good idea to write them into a single file.

I think we should keep files immutable, so I don't agree with the second option for adding a new index. What about having multiple index files for one data file? If we add a new index, we can create a new index file for it, and we can merge all index files into one when we do compaction.

Agree. In this way the I/O cost of adding a new index is minimal, but the reader now needs to know which indexes are available for the segment and where each index is located. Where do you think is the best place for such information? Perhaps in RowsetMetaPB?

Now we store the dictionary in two levels; how much data can this support? Why not abstract this into a B-tree that can support multiple layers? Here is Kudu's implementation, which we can reference.

Since the dictionary is per segment, it won't be very large, so I think two levels are enough for most datasets. It's also a lot simpler and easier to implement than a B-tree. I think we should start with the simpler approach and switch to the more complicated one only when it proves really necessary. Thanks for the link, I'll take a look.

jtoj6r0c #5

@imay thanks for the comments.
If we support other types of index, for example B+Tree, do we need to store all index types in one index file?

No, we don't. Different types of index will have different structures and metadata; I don't think it's a good idea to write them into a single file.

I think we should keep files immutable, so I don't agree with the second option for adding a new index. What about having multiple index files for one data file? If we add a new index, we can create a new index file for it, and we can merge all index files into one when we do compaction.

Agree. In this way the I/O cost of adding a new index is minimal, but the reader now needs to know which indexes are available for the segment and where each index is located. Where do you think is the best place for such information? Perhaps in RowsetMetaPB?

I think it's OK to save this information in RowsetMetaPB. RowsetMetaPB is treated as immutable: when we create a new index for a rowset, we create a new RowsetMetaPB for the new rowset and leverage link-schema-change to hard link the old files. Then we can generate the index for this rowset and record it in the new RowsetMetaPB.

Now we store the dictionary in two levels; how much data can this support? Why not abstract this into a B-tree that can support multiple layers? Here is Kudu's implementation, which we can reference.

Since the dictionary is per segment, it won't be very large, so I think two levels are enough for most datasets. It's also a lot simpler and easier to implement than a B-tree. I think we should start with the simpler approach and switch to the more complicated one only when it proves really necessary. Thanks for the link, I'll take a look.

If we can make it easy to change, I think it's OK.

zdwk9cvp #6

Can we write the bitmap index of each column into an independent file, named with the column_id in its name, e.g. segment_name.1.bitmap is the bitmap index file for column 1? That way we can add an index for a column easily and delete the index file directly when dropping an index, and we can keep the generated files immutable.
The shortcoming is that it opens more files and incurs more I/O when loading multiple columns simultaneously, but compared to the data size, and with the page cache, maybe that can be ignored.

suzh9iv8 #7

Agree. In this way the I/O cost of adding a new index is minimal, but the reader now needs to know which indexes are available for the segment and where each index is located. Where do you think is the best place for such information? Perhaps in RowsetMetaPB?

I think we can use TabletMetaPB's ColumnPB.is_bitmap_column to indicate whether a column has a bitmap index in Doris. But for the segment itself, I have no idea where to put this field.

btqmn9zl #8

@kangpinghuang
Can we write the bitmap index of each column into an independent file, named with the column_id in its name, e.g. segment_name.1.bitmap is the bitmap index file for column 1? That way we can add an index for a column easily and delete the index file directly when dropping an index, and we can keep the generated files immutable.
The shortcoming is that it opens more files and incurs more I/O when loading multiple columns simultaneously, but compared to the data size, and with the page cache, maybe that can be ignored.

I think the independent-file approach brings more trouble than benefit. If you have a table with many indexed columns (which is very common in OLAP), you end up with many small index files to open and read, which will hurt read performance. The biggest advantage is that a drop index operation can delete the index file right away, but that operation is less frequent than add index IMO, and a marked delete in the other approaches isn't necessarily bad.

dpiehjr4 #9

@gaodayue good proposal!
May I know how you deal with the delete rows operation? If some rows are removed, what happens to the posting list and the ordered dictionary?

qacovj5a #10

@FANLONGFANLONG The V2 segment is designed to be immutable. AFAIK, Doris currently implements delete by persisting delete conditions in the metadata and filtering out deleted rows in the query path. The actual data deletion happens at compaction time.
