Releases: donalshijan/Database-Indexing
A Tree that grows in proportion in all direction is a good tree.
This version only improves upon BTreeIndex's performance over pervious version. In this version we introduce a max Entries limit on index files of BTree Index, where upon crossing this limit, the index file itself splits into two, along with adding a new entry to the original leaf node, whose child pointer pointed to that index file. Doing so, not only allows the tree to grow proportionally and predictably, it also makes operations faster by the virtue of better structure and organisation of entries in the tree including the entries in the index files.
This limit is set as 10 x max_entries_allowed_in_internal_node, and max_entries_allowed_in_internal_node is specific for a branching factor as 2*b-1 where b is branching factor, so that means the max entries limit that we impose on index files is directly proportional to the branching factor. Which also means that imposing this condition makes indexing capacity also directly proportional to branching factor, as branching factor will decide when to split the index file and add and fill up leaf node with a new entry and cause the tree to grow, and as tree grows bigger it reaches memory usage limit.
We use a clever approach here to speed up the process of splitting the index file contents, when the need arises, and that is the use of the approach used to solve "The median of a running data stream" problem on leetcode, using two heaps, a min heap and a max heap, here we use it to split the contents of an index file with max entries into two halves, one half will have all values lesser than other half and the other will have all values greater than the other, we do it to maintain the tree invariance, because we are splitting the original index file and taking "what should have been" the sorted median entry from that index file, and moving it into the leaf node entries at the right position in sorted order and then the two new split index files become the left and right pointing index files of that entry, so they must satisfy the condition where left index file has all entries smaller than leaf node entry and right index file has all entries greater than leaf node entry, and splitting using this approach helps us achieve that in a very clever and efficient way. After that, we delete the original index file as it's contents are preserved into two new index files and as an entry in the tree.
The reason for improvement in performance is that, now any operation involving index file access, would have a constant worst case time complexity, as every index file has a max limit on entries, whereas in the previous approach there were some index files which were holding massive amounts of entries compared to the others which often held relatively less, some even held almost nothing, and this uneven distribution of entries was bogging the performance down. In previous approach operations often times had to look deep down in a single index file holding large amounts of entries, adding huge number of disk seek costs, now the index files do not grow too large, as it splits upon crossing max limit, so operations only have to look less far into the index file.
This splitting of index file where we also move an entry to the leaf node, causes the tree to expand and grow more rapidly and because of that, this approach technically indexes less entries than previous one did for the same branching factor, still way more than hash index , but we do loose a lot of indexing capacity which is totally expected as we are forcing the tree to grow more aggressively and maintain evenly distributed and proportionate BTreeIndex, but you gotta remember that, in the previous approach performance only got progressively worse of as more entries were added, that scales very poorly, as a large index would take proportionally larger time for every operation. In this approach performance remains mostly steady albeit by compromising a little bit over indexing capacity, which is totally worth it for the performance gains it achieves.
Here are the test results from running performance tests:
Results are the same for hash index as they were in the last version, since it has not gotten any upgrades.
Here are the results for BTree Index with the default test of upto 2000 operations
2025/03/09 16:56:41 =======================================
2025/03/09 16:56:41 TEST RESULTS FOR BTREE INDEXING (Branching factor 4)
2025/03/09 16:56:41 =======================================
2025/03/09 16:56:41 Average Insert Time: 352.421µs
2025/03/09 16:56:41 Average Search Time: 17.872µs
2025/03/09 16:56:42 Average Update Time: 510.694µs
2025/03/09 16:56:43 Average Delete Time: 285.128µs
2025/03/09 16:57:01 Indexing capacity : 32161 entries indexed in 0.10 MB memory
2025/03/09 17:00:11 =======================================
2025/03/09 17:00:11 TEST RESULTS FOR BTREE INDEXING (Branching factor 8)
2025/03/09 17:00:11 =======================================
2025/03/09 17:00:11 Average Insert Time: 191.803µs
2025/03/09 17:00:11 Average Search Time: 22.897µs
2025/03/09 17:00:13 Average Update Time: 597.697µs
2025/03/09 17:00:13 Average Delete Time: 343.406µs
2025/03/09 17:00:57 Indexing capacity : 98958 entries indexed in 0.10 MB memory
2025/03/09 17:02:13 =======================================
2025/03/09 17:02:13 TEST RESULTS FOR BTREE INDEXING (Branching factor 32)
2025/03/09 17:02:13 =======================================
2025/03/09 17:02:13 Average Insert Time: 185.57µs
2025/03/09 17:02:13 Average Search Time: 31.922µs
2025/03/09 17:02:16 Average Update Time: 1.343546ms
2025/03/09 17:02:18 Average Delete Time: 700.768µs
2025/03/09 17:04:27 Indexing capacity : 520911 entries indexed in 0.10 MB memory
Here are the test results after bumping up the number of test operations to 8000 from 2000.
2025/03/09 17:05:51 =======================================
2025/03/09 17:05:51 TEST RESULTS FOR BTREE INDEXING (Branching factor 4)
2025/03/09 17:05:51 =======================================
2025/03/09 17:05:53 Average Insert Time: 244.978µs
2025/03/09 17:05:53 Average Search Time: 24.618µs
2025/03/09 17:05:56 Average Update Time: 427.652µs
2025/03/09 17:05:59 Average Delete Time: 292.146µs
2025/03/09 17:06:08 Indexing capacity : 33625 entries indexed in 0.10 MB memory
2025/03/09 17:07:24 =======================================
2025/03/09 17:07:24 TEST RESULTS FOR BTREE INDEXING (Branching factor 8)
2025/03/09 17:07:24 =======================================
2025/03/09 17:07:25 Average Insert Time: 165.978µs
2025/03/09 17:07:25 Average Search Time: 19.715µs
2025/03/09 17:07:30 Average Update Time: 548.932µs
2025/03/09 17:07:32 Average Delete Time: 323.523µs
2025/03/09 17:07:52 Indexing capacity : 98975 entries indexed in 0.10 MB memory
2025/03/09 17:09:34 =======================================
2025/03/09 17:09:34 TEST RESULTS FOR BTREE INDEXING (Branching factor 32)
2025/03/09 17:09:34 =======================================
2025/03/09 17:09:35 Average Insert Time: 178.2µs
2025/03/09 17:09:36 Average Search Time: 34.221µs
2025/03/09 17:09:47 Average Update Time: 1.406415ms
2025/03/09 17:09:53 Average Delete Time: 735.495µs
2025/03/09 17:12:01 Indexing capacity : 523698 entries indexed in 0.10 MB memory
Here are the test results after bumping number of test operations to 80000
2025/03/09 17:48:06 =======================================
2025/03/09 17:48:06 TEST RESULTS FOR BTREE INDEXING (Branching Factor 16)
2025/03/09 17:48:06 =======================================
2025/03/09 17:48:20 Average Insert Time: 168.256µs
2025/03/09 17:48:22 Average Search Time: 25.832µs
2025/03/09 17:49:42 Average Update Time: 1.009256ms
2025/03/09 17:50:27 Average Delete Time: 562.778µs
2025/03/09 17:51:16 Indexing capacity : 218256 entries indexed in 0.10 MB memory
2025/03/09 17:52:32 =======================================
2025/03/09 17:52:32 TEST RESULTS FOR BTREE INDEXING (Branching Factor 32)
2025/03/09 17:52:32 =======================================
2025/03/09 17:52:49 Average Insert Time: 211.363µs
2025/03/09 17:52:52 Average Search Time: 32.642µs
2025/03/09 17:54:50 Average Update Time: 1.476961ms
2025/03/09 17:55:56 Average Delete Time: 828.031µs
2025/03/09 17:58:50 Indexing capacity : 545367 entries indexed in 0.10 MB memory
Insertion seems to get faster with higher branching factor whereas deletion seems to get faster with lower branching factors.
Reason for that is, since writes are slower than reads, and writes in the case of insertion, when entry is supposed to go into an index file, that entry gets appended at the end of that index file, and split operations take place less frequently as they need to get filled up to max entries by writes first, and we know the split is the expensive operation which involves splitting the index file entries by writing all it's contents to two separate files. The fact that higher branching factor will hold more entries in index files and take longer to fill up before split happens, results in a scenario, where expensive split operations get hit less frequently in the case of higher branching factors, which manifests itself in the test results as we can see, insertion times with higher branching factors were always less than lower branching factor test cases with same number of operations. In the case of delete operation, every deletion of an index file entry is very expensive as it involves writing all the file entries to a new file skipping over the entry to be deleted and then renaming it as the new respective index file. With lower branching factors, maximum number of entries in index files is lower compared to higher branching factors, which results in having the delete from index file operation to do less number of writes and hence it manifests itself in the test results as we can see deletion times for ...
Initial Release
This is the initial version release of this project and it includes an implementation of a Hash Indexed and BTree Indexed Database, and we compare their performance, with scope of improvements in the future iterations.
Here are the Results from running performance tests on Both:
2025/03/09 08:34:32 =======================================
2025/03/09 08:34:32 TEST RESULTS FOR HASH INDEXING
2025/03/09 08:34:32 =======================================
2025/03/09 08:34:32 Average Insert Time: 67.327µs
2025/03/09 08:34:32 Average Search Time: 7.804µs
2025/03/09 08:34:33 Average Update Time: 470.587µs
2025/03/09 08:34:34 Average Delete Time: 297.896µs
2025/03/09 08:34:34 Indexing capacity : 2560 entries indexed in 0.10 MB memory
2025/03/09 08:25:31 =======================================
2025/03/09 08:25:31 TEST RESULTS FOR BTREE INDEXING (Branching Factor: 8)
2025/03/09 08:25:31 =======================================
2025/03/09 08:25:31 Average Insert Time: 103.253µs
2025/03/09 08:25:31 Average Search Time: 62.982µs
2025/03/09 08:25:35 Average Update Time: 2.137692ms
2025/03/09 08:25:38 Average Delete Time: 1.15976ms
⠦ Entries Indexed: 82977^Csignal: interrupt
2025/03/09 08:36:25 =======================================
2025/03/09 08:36:25 TEST RESULTS FOR BTREE INDEXING (Branching Factor: 4)
2025/03/09 08:36:25 =======================================
2025/03/09 08:36:25 Average Insert Time: 92.498µs
2025/03/09 08:36:25 Average Search Time: 45.349µs
2025/03/09 08:36:28 Average Update Time: 1.629123ms
2025/03/09 08:36:30 Average Delete Time: 904.175µs
⠙ Entries Indexed: 15129^Csignal: interrupt
2025/03/09 08:38:35 =======================================
2025/03/09 08:38:35 TEST RESULTS FOR BTREE INDEXING (Branching Factor: 3)
2025/03/09 08:38:35 =======================================
2025/03/09 08:38:36 Average Insert Time: 82.494µs
2025/03/09 08:38:36 Average Search Time: 31.373µs
2025/03/09 08:38:38 Average Update Time: 1.004831ms
2025/03/09 08:38:39 Average Delete Time: 585.257µs
⠹ Entries Indexed: 7732^Csignal: interrupt
2025/03/09 08:43:27 =======================================
2025/03/09 08:43:27 TEST RESULTS FOR BTREE INDEXING (Branching Factor:32)
2025/03/09 08:43:27 =======================================
2025/03/09 08:43:27 Average Insert Time: 76.649µs
2025/03/09 08:43:27 Average Search Time: 30.104µs
2025/03/09 08:43:30 Average Update Time: 1.254582ms
2025/03/09 08:43:31 Average Delete Time: 691.912µs
⠼ Entries Indexed: 23844^Csignal: interrupt
Running Performance tests on Btree Index after bumping up the number of test operations from 2000 to 8000
2025/03/09 08:46:35 =======================================
2025/03/09 08:46:35 TEST RESULTS FOR BTREE INDEXING (Branching Factor:32)
2025/03/09 08:46:35 =======================================
2025/03/09 08:46:37 Average Insert Time: 200.899µs
2025/03/09 08:46:39 Average Search Time: 191.382µs
2025/03/09 08:47:43 Average Update Time: 7.99702ms
2025/03/09 08:48:20 Average Delete Time: 4.652212ms
⠹ Entries Indexed: 66888^Csignal: interrupt
2025/03/09 08:51:35 =======================================
2025/03/09 08:51:35 TEST RESULTS FOR BTREE INDEXING (Branching Factor:4)
2025/03/09 08:51:35 =======================================
2025/03/09 08:51:36 Average Insert Time: 131.54µs
2025/03/09 08:51:37 Average Search Time: 83.961µs
2025/03/09 08:52:04 Average Update Time: 3.471825ms
2025/03/09 08:52:20 Average Delete Time: 1.982814ms
⠦ Entries Indexed: 9717^Csignal: interrupt
From these test results, it is clear that Hash Index always outperforms BTree Index in terms of performance (Insert, Search, Update , Delete), in pretty much every scenario, which is expected as it keeps the entire index in a hashmap which supports constant time lookups. But when it comes to Indexing capacity, BTree Index completely dominates, hash index can barely index a fraction of the entries that BTree index managed to index in the same amount of memory usage.
Speaking of performance of BTree Index, from default tests, which were ran with upto 2000 test operations, a limit which we had to have set, because currently set memory usage limit (0.1 MB) gets used up for hash index when it reaches 2560 insert operations, and since we are trying to make a fair comparison, by making same amount of operations in both the indexes while testing, so the number was decided to be a little less than minimum of (maximum number of insert operations possible for each(HashIndex,BTreeIndex)). The results from the default test scenario for BTree Index slightly indicated that if we reduced the branching factor, the performance could improve in terms of all operations that is insert, search , update and delete. To be fair, nothing seemed really convincing or definitive as running the tests multiple times even with the same branching factor gave results which were mixed, overlapping and deviating either ways.
Things start to become way more clear and evident when we increase the number of test operations from 2000 to 8000 and results in this case definitively conclude that we gain significant improvements in performance for insert, search, update and delete with lower branching factor over higher. The reason why this improvement was not surfacing as evidently as it did with higher number of test operations, is that at lower number of test operations the tree is not branched out enough, with higher number of operations the tree grows in size and there are more index files at leaf nodes and more entries in each one of those files on average than it did for lower number of test operations, lower branching factors further widens that gap forcing the tree to branch more aggressively, with more index files being created for entries to distribute and spread over more evenly, which causes the index files to hold relatively fewer entries as a side effect, whereas higher branching factor takes longer to branch and results in having fewer number of index files but dense with whole lot more entries on average, so pretty much any operation in that case would have to be carried out with more frequent disk access as most entries are stored in those index files on disk. Which explains the toll it seems to take with higher branching factor over lower, in the results above, as search, updates and deletes got twice as fast for lower branching factor compared to higher.
The indexing capacity for BTreeIndex, is definitely way higher than hash index and in our tests it went all the way upto 400,000 entries indexed and still kept going without hitting the memory usage limit, which is due to the fact that, index files hold entries in a range and there is no hard cap limit on how many entries a index file should hold , most key entries which are randomly generated fall in one of the ranges mapped to one of the index files, so most entries get stored into index files, which does not cause the tree to branch and grow as often, because for branching, leaf entries need to overflow. This is something we improve in the next iteration where we impose a max number of entries limit on index files and beyond which they are forced to split into two index files and make an entry to leaf node, enabling branching and growth of the tree, which ends up building a beautiful well balanced Tree, resulting in improved performance as things get way more structured and organised.