December 7, 2009

Hash Join

Filed under: Uncategorized — srivenu @ 8:18 am

I wrote this note while running some test cases to understand the hash join as implemented by Oracle. I was running 11.1.0.6.0 on Windows XP 32-bit and used events 10046 & 10104 to gather traces while running the test cases. For reference I used Jonathan Lewis's excellent book Cost-Based Oracle Fundamentals, Steve Adams' notes on hash joins, and an Oracle technical document on the hash join.

I was using this simple test case.
(I played with the data volume and values while testing out various scenarios)

*******************************************
drop table x1;
drop table x2;
create table x1 (a char(…)) tablespace data1;

begin
for i in 1…. loop
insert into x1 values(i);
end loop;
end;
/

commit;

create table x2 (a char(…)) tablespace data2;

begin
for i in 1…. loop
insert into x2 values(i);
end loop;
end;
/

commit;

exec dbms_stats.gather_table_stats('TEST','X1');
exec dbms_stats.gather_table_stats('TEST','X2');

alter session set workarea_size_policy = manual;
alter session set hash_area_size=….;
alter session set "_hash_multiblock_io_count"=..;
alter system flush shared_pool;
alter system flush buffer_cache;

alter session set events '10104 trace name context forever';
alter session set events '10046 trace name context forever, level 12';

set pause on

select /*+leading(x2) use_hash(x1)*/
*
from x1, x2
where x1.a=x2.a
/

*******************************************

In the trace, kxhfInit() seems to be the entry point for the 10104 event.

Hash Initialization
*************
*** RowSrcId: 1 HASH JOIN STATISTICS (INITIALIZATION) ***
Here we get to see the memory allocation details.

Join Type: INNER join
Original hash-area size: 971776
Memory for slot table: 884736
Calculated overhead for partitions and row/slot managers: 87040
Hash-join fanout: 8
Number of partitions: 8
Number of slots: 12
Multiblock IO: 9
Block size(KB): 8
Cluster (slot) size(KB): 72
Minimum number of bytes per block: 8160
Bit vector memory allocation(KB): 32
Per partition bit vector length(KB): 4
Maximum possible row length: 413
Estimated build size (KB): 1028
Estimated Build Row Length (includes overhead): 117
# Immutable Flags:
Not BUFFER(execution) output of the join for PQ
Evaluate Left Input Row Vector
Evaluate Right Input Row Vector
# Mutable Flags:
IO sync

Out of the total allocated hash area, some is used as overhead and appears as "Calculated overhead for partitions and row/slot managers". (I think this includes, among other things, the bitmap vectors and, as the name suggests, the memory allocated for the row/slot managers, which increases with the row count.) The remaining hash area after overhead ("Memory for slot table") is split into clusters, each of size "_hash_multiblock_io_count" * block size. Most of these clusters are used for partition data and some for asynchronous I/O.
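Just to check the arithmetic, the figures above reproduce each other. A minimal sketch (the numbers are copied from the trace; the derivation of the slot count is my reading of it):

*******************************************
# Figures copied from the 10104 trace above (all sizes in bytes).
hash_area_size = 971776          # "Original hash-area size"
overhead       = 87040           # "Calculated overhead for partitions and row/slot managers"
block_size     = 8 * 1024        # "Block size(KB): 8"
multiblock_io  = 9               # "Multiblock IO: 9"

slot_table   = hash_area_size - overhead    # 884736 = "Memory for slot table"
cluster_size = multiblock_io * block_size   # 73728 bytes = 72K = "Cluster (slot) size(KB)"
num_slots    = slot_table // cluster_size   # 12 = "Number of slots"

print(slot_table, cluster_size // 1024, num_slots)   # 884736 72 12
*******************************************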

Build table scan
************
kxhfSetPhase: phase=BUILD
The above line in the trace file seems to mark the scan of the build table.

One of the decisions made early on is the number of partitions. This, I think, is determined by the cluster size, the hash area size and the estimated build table size. If the build table fits in memory there is, strictly speaking, no need to partition the data, but there is no way to know that for sure up front, and it would not be optimal to start partitioning the build table only after reading it once and finding that it doesn't fit in memory. So I think the build table is always partitioned, whether it fits in memory or not. For each row read from the build table, a hash function (hash function 1) is applied to the hash join key and the row is placed in the appropriate partition. At the same time another hash function, hash function 2, is also applied to the join key; this determines the hash bucket the row will go into when the hash table is built. Steve Adams suggests that this hash value is obtained by applying the dbms_utility.get_hash_value function. The value is stored along with the row. (I tested this by making the build table large enough to spill to disk and dumping the tempfile blocks; an 8-byte string, which I think is the hash key, was attached to each row.) I think the bit vector (separate for each partition?) is also built at this time. I couldn't find information about the internals of the bit vector; maybe it's a Bloom filter(?). The basic purpose of the bit vector is to provide a cost-effective way to eliminate as many rows as possible from the probe table during its partitioning phase.
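Here is a rough Python sketch of how I picture the build phase under the assumptions above. The function names, the use of Python's built-in hash() and the set-based bit vector are purely illustrative stand-ins, not Oracle's actual internals:

*******************************************
# Illustrative sketch of the build-side partitioning described above.
FANOUT = 8                                   # "Hash-join fanout: 8" from the trace

def hash1(key):                              # hash function 1: decides the partition
    return hash(("h1", key))

def hash2(key):                              # hash function 2: decides the hash bucket later on
    return hash(("h2", key))

partitions  = [[] for _ in range(FANOUT)]    # in-memory clusters, spilled to temp when full
bit_vectors = [set() for _ in range(FANOUT)] # stand-in for the per-partition bit vector

def build_row(key, row):
    p  = hash1(key) % FANOUT
    h2 = hash2(key)
    # h2 is stored with the row - the extra value I saw attached to each row
    # when dumping the spilled tempfile blocks.
    partitions[p].append((h2, key, row))
    # Set a bit so that probe rows without a possible match can be discarded later.
    bit_vectors[p].add(h2 % 32768)
*******************************************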

kxhfWrite: hash-join is spilling to disk
If the build table is too big to fit in memory, it starts spilling onto disk and you see the above line in the trace (as a piece of trivia, this line actually appears one row before it starts to spill). At this stage not all partitions are spilled to disk. Initially it spills only one partition and keeps the rest completely in memory. If more data keeps coming in, it continues to spill more and more partitions, but it tries to keep at least one partition completely in memory as far as possible. Say your hash area (slot memory) is 160K, your cluster size is 16K and the partition count is 4. It keeps 4 clusters (one for each partition) in memory plus at least one for async I/O, so 5 of the 10 clusters are used up. Say it decided to retain the clusters of p2 in memory: it will retain p2's clusters in memory until they exceed 4 clusters. If more data from the build table keeps going into that partition, it spills that partition to disk as well.
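Working through the numbers of that example (the 160K / 16K / 4-partition figures are the ones used in the paragraph above, not from my trace):

*******************************************
# Cluster budget for the example above.
slot_memory     = 160 * 1024
cluster_size    = 16 * 1024
partition_count = 4

total_clusters = slot_memory // cluster_size   # 10
reserved       = partition_count + 1           # one cluster per partition + at least one for async I/O
free_clusters  = total_clusters - reserved     # 5 left over to grow the partition(s) kept in memory
print(total_clusters, reserved, free_clusters) # 10 5 5
*******************************************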

kxhfSetPhase: phase=PROBE_1
This line appears just after the end of the build table input. It reads the header and some blocks from the probe table at this stage; I wonder why it's reading some of the probe table now? It then marks the end of build phase 1.

*** RowSrcId: 1 END OF BUILD (PHASE 1) ***
This marks the end of data input from build table.

It prints out new stats of the build table
Revised row length: 115
Revised build size: 1009KB

If any of the partitions spilled to disk, you can see that info in the trace
kxhfFlush(): pid=0 nRows=… build= topQ= (this means partition 0 spilled to disk)

*** RowSrcId: 1 HASH JOIN RESIZE BUILD (PHASE 1) ***
I wonder what RESIZE BUILD means? Does it adjust the memory requirements as per the partition sizes?

You see the following lines in the trace file:
Total number of partitions: 8
Number of partitions which could fit in memory: 3
Number of partitions left in memory: 3
Total number of slots in in-memory partitions: 6

If the build table fits completely in memory (the optimal case), the number of partitions left in memory is the same as the total number of partitions. If all partitions spilled to disk, you will see "Number of partitions left in memory: 0".

*** RowSrcId: 1 HASH JOIN BUILD HASH TABLE (PHASE 1) ***
The partitioning phase of the build table is complete. Now it starts building a hash table using the hash values generated for each row (with hash function 2) during the partitioning phase.

You see it dumping the following info
Total number of partitions: 8
Number of partitions left in memory: 3
Total number of rows in in-memory partitions: 3365

The number of rows determines the bucket count for the hash table. The bucket count is the next power of 2 greater than the row count; in the above case that would be 4096 (the power of 2 just above 3365). But I saw it sometimes go for the next one, i.e. 8192 (under memory pressure I saw it getting reduced, but I don't see the point of having more than 4096).
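My reading of that rule, as a trivial sketch (not Oracle's code):

*******************************************
# Bucket count = next power of two above the in-memory row count (per the observation above).
def bucket_count(rows):
    n = 1
    while n <= rows:
        n *= 2
    return n

print(bucket_count(3365))   # 4096; the trace above actually chose 8192, one step higher
*******************************************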

Estimated max # of build rows that can fit in avail memory: 15336
I did not understand this line. In my test the row length was 115 bytes, so 15336 * 115 is around 1850K. How could that fit in my slot memory of around 860K?

It then shows the distribution of data across the in-memory partitions:

### Partition Distribution ###
Partition:0 rows:0 clusters:0 slots:0 kept=0
Partition:1 rows:0 clusters:0 slots:0 kept=0
Partition:2 rows:0 clusters:0 slots:0 kept=0
Partition:3 rows:0 clusters:0 slots:0 kept=0
Partition:4 rows:0 clusters:0 slots:0 kept=0
Partition:5 rows:1128 clusters:2 slots:2 kept=1
Partition:6 rows:1120 clusters:2 slots:2 kept=1
Partition:7 rows:1117 clusters:2 slots:2 kept=1
*** (continued) HASH JOIN BUILD HASH TABLE (PHASE 1) ***

Revised number of hash buckets (after flushing): 3365
Allocating new hash table.
*** (continued) HASH JOIN BUILD HASH TABLE (PHASE 1) ***

Requested size of hash table: 1024
Actual size of hash table: 1024
Number of buckets: 8192
Match bit vector allocated: FALSE
*** (continued) HASH JOIN BUILD HASH TABLE (PHASE 1) ***
Total number of rows (may have changed): 3365
Number of in-memory partitions (may have changed): 3
Final number of hash buckets: 8192
Size (in bytes) of hash table: 32768

The size of the hash table is 4 * the number of hash buckets on my 32-bit Windows system (8192 buckets * 4 bytes = 32768 bytes, matching the "Size (in bytes) of hash table" line).
It is at this point that I see it spill the remaining clusters of the non-memory-resident partitions to disk.
I don't know why it says "may have changed" against the total number of rows and the number of in-memory partitions. Maybe this is one more place where it could resize the workarea (in automatic PGA mode), in which case the number of in-memory partitions might change?

Now, the hash table is built at this point if and only if at least one partition of the build input is completely in memory. If more than one partition is completely in memory, it builds the hash table for all of the memory-resident partitions (in the above example it built the hash table for partitions 5, 6 & 7). If there are no memory-resident partitions, it does not build a hash table now.

If a hash table is built, you will see the following lines
kxhfIterate(end_iterate): numAlloc=12, maxSlots=12
*** (continued) HASH JOIN BUILD HASH TABLE (PHASE 1) ***
### Hash table ###
# NOTE: The calculated number of rows in non-empty buckets may be smaller
# than the true number.
Number of buckets with 0 rows: 5437
Number of buckets with 1 rows: 2217
Number of buckets with 2 rows: 472
Number of buckets with 3 rows: 61
Number of buckets with 4 rows: 4
Number of buckets with 5 rows: 1
Number of buckets with 6 rows: 0
Number of buckets with 7 rows: 0
Number of buckets with 8 rows: 0
Number of buckets with 9 rows: 0
Number of buckets with between 10 and 19 rows: 0
Number of buckets with between 20 and 29 rows: 0
Number of buckets with between 30 and 39 rows: 0
Number of buckets with between 40 and 49 rows: 0
Number of buckets with between 50 and 59 rows: 0
Number of buckets with between 60 and 69 rows: 0
Number of buckets with between 70 and 79 rows: 0
Number of buckets with between 80 and 89 rows: 0
Number of buckets with between 90 and 99 rows: 0
Number of buckets with 100 or more rows: 0
### Hash table overall statistics ###
Total buckets: 8192 Empty buckets: 5437 Non-empty buckets: 2755
Total number of rows: 3365
Maximum number of rows in a bucket: 5
Average number of rows in non-empty buckets: 1.221416

This shows the distribution of rows across the hash table buckets. This is the place to check for skew in the data distribution, which can cause excessive CPU consumption during the join.

It now starts reading the probe table. What it does with the probe data depends on whether a hash table has been built or not.
If there is no hash table: it partitions the probe table data into the same number of partitions as the build table, using the same hash function 1. But it now uses the bit vector (built during the scan of the build table) to decide whether the row is retained or thrown away. The bit vector doesn't do a complete elimination, but it does give a very good indication of whether a matching key could exist. A hash value is then generated (using hash function 2) and the row is added to the corresponding partition's cluster.
If there is a hash table: I think the next step depends on whether the hash table covers all of the build table data or only part of it.
If the hash table covers one or more (but not all) partitions of the build table, then I think the probe row is first partitioned; if it maps to a memory-resident partition, a hash value is generated (using hash function 2) and the row is checked against the matching hash bucket in memory (no need for bit-vector filtering in that case). If a matching row is found, the data is passed to the next step in the plan. If the probe row maps to a partition that is not memory resident, it is evaluated against that partition's bit vector to see whether it can be retained.
If the hash table covers all partitions (i.e. the build table fit completely in memory), then the partitioning and bit-vector steps could be skipped (I couldn't confirm this detail for Oracle, but I saw a MySQL patent for this optimization). Hash function 2 is applied to the row's key values and the corresponding hash bucket is searched for matching rows.
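Continuing the earlier build-phase sketch, this is roughly how I picture the probe logic for the first two cases above (all names and structures are mine, stand-ins for whatever Oracle really keeps):

*******************************************
# Probe-side logic, continuing the build-phase sketch (reuses hash1, hash2, FANOUT, bit_vectors).
in_memory_hash_table = {}                            # partition -> {h2 -> [(key, row), ...]}
probe_partitions     = [[] for _ in range(FANOUT)]   # spilled probe rows, matched later in phase 2
matches              = []

def probe_row(key, row):
    p, h2 = hash1(key) % FANOUT, hash2(key)
    if p in in_memory_hash_table:
        # Build partition is memory resident: probe its hash bucket directly,
        # no bit-vector check is needed.
        for bkey, brow in in_memory_hash_table[p].get(h2, []):
            if bkey == key:                          # re-check the real join key; h2 can collide
                matches.append((brow, row))
    elif (h2 % 32768) in bit_vectors[p]:
        # Build partition spilled: keep the probe row only if the bit vector
        # says a matching key could exist in that partition.
        probe_partitions[p].append((h2, key, row))
    # else: the bit vector rules out a match, so the probe row is discarded here.
*******************************************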

You can see data being fetched during this phase if a hash table exists for the build table.
An optimal hash join ends here; no further steps are necessary.

After the end of probe input, you can see it dumping the probe partitions to disk
kxhfSetPhase: phase=PROBE_2
kxhfFlush(): pid=0 nRows=131 build=0 topQ=0
kxhfFlush(): pid=1 nRows=135 build=0 topQ=1
kxhfFlush(): pid=2 nRows=124 build=0 topQ=2
kxhfFlush(): pid=3 nRows=119 build=0 topQ=3
kxhfFlush(): pid=4 nRows=124 build=0 topQ=4

So in the above lines it has evaluated and fetched all rows corresponding to partitions 5, 6 & 7 (a hash table exists in memory for those partitions). Probe table data for partitions 0, 1, 2, 3 and 4 is written to disk because no hash table exists for those build table partitions.

qerhjFetchPhase2(): building a hash table
It now starts iterating over the remaining partitions on disk (partitions 0, 1, 2, 3 & 4 in this case). It picks the partitions based on size, smaller ones first (it has this information in the partition histogram). It picks matching partitions from the probe and build tables, takes the smaller of the two partitions to build a hash table in memory, and uses the other one to probe into this hash table. For example, if partition 0 of the build table fills 2 clusters and partition 0 of the probe table fills only 1 cluster, it will build the hash table using the probe table data. This is called Dynamic Role Reversal.
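A sketch of the role-reversal decision as I understand it from the trace (the names are mine; the trace compares partition sizes in slots, but the effect here is the same as comparing row counts):

*******************************************
# Phase 2: for each pair of flushed partitions, build the hash table from the smaller side.
def phase2_pair(build_part, probe_part):
    # build_part / probe_part: lists of (h2, key, row) read back from the temp segments.
    if len(probe_part) < len(build_part):
        # The probe side is smaller, so it takes over the build role: ROLE REVERSAL OCCURRED.
        build_part, probe_part = probe_part, build_part
    table = {}
    for h2, key, row in build_part:
        table.setdefault(h2, []).append((key, row))
    for h2, key, row in probe_part:
        for bkey, brow in table.get(h2, []):
            if bkey == key:
                yield (brow, row)
*******************************************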

*** RowSrcId: 1 HASH JOIN GET FLUSHED PARTITIONS (PHASE 2) ***
Getting a pair of flushed partions.
BUILD PARTION: nrows:1169 size=(2 slots, 144K)
PROBE PARTION: nrows:131 size=(1 slots, 72K)
ROLE REVERSAL OCCURRED

You can see it reading the data that it previously spilled to disk. It builds the hash table first
*** RowSrcId: 1 HASH JOIN BUILD HASH TABLE (PHASE 2) ***

Number of blocks that may be used to build the hash hable 63
Number of rows left to be iterated over (start of function): 131
Number of rows iterated over this function call: 131
Number of rows left to be iterated over (end of function): 0

You see here that "Number of rows left to be iterated over (end of function)" is 0. This is the next best case (the first being the optimal join, where the build table fits completely in memory): the smaller of the two partitions fits in memory, so the other table's partition needs to be scanned only once.

At each stage of this hash table build and probe, you will see the hash table stats in the trace like this
### Hash table ###
# NOTE: The calculated number of rows in non-empty buckets may be smaller
# than the true number.
Number of buckets with 0 rows: 8062
Number of buckets with 1 rows: 129
Number of buckets with 2 rows: 1
Number of buckets with 3 rows: 0
………….
### Hash table overall statistics ###
Total buckets: 8192 Empty buckets: 8062 Non-empty buckets: 130
Total number of rows: 131
Maximum number of rows in a bucket: 2
Average number of rows in non-empty buckets: 1.007692

After each partition is processed, iteration starts on the next one:
kxhfResetIter(0C7F9700)

It then picks another pair of partitions:
*** RowSrcId: 1 HASH JOIN GET FLUSHED PARTITIONS (PHASE 2) ***
Getting a pair of flushed partions.
BUILD PARTION: nrows:1071 size=(2 slots, 144K)
PROBE PARTION: nrows:124 size=(1 slots, 72K)
ROLE REVERSAL OCCURRED

The worst-case scenario for the hash join is a nested-loops hash join. This occurs when, even after partitioning, a probe or build table partition is too big to fit in memory.

In that case, for each partition pair, the smaller of the two inputs is picked, a hash table is built with as many of its rows as fit in memory, and the other input is scanned to probe into this hash table. The process is repeated until all the data has been processed. In such a nested-loops hash join you will see lines like these in the trace:

*** RowSrcId: 1 HASH JOIN GET FLUSHED PARTITIONS (PHASE 2) ***
Getting a pair of flushed partions.
BUILD PARTION: nrows:2984 size=(20 slots, 320K)
PROBE PARTION: nrows:1492 size=(10 slots, 160K)
ROLE REVERSAL OCCURRED

*** RowSrcId: 1 HASH JOIN BUILD HASH TABLE (PHASE 2) ***
Number of blocks that may be used to build the hash hable 2
Number of rows left to be iterated over (start of function): 1492
Number of rows iterated over this function call: 151
Number of rows left to be iterated over (end of function): 1341

So for this partition pair it has picked the probe table side to build the hash table (ROLE REVERSAL OCCURRED), and it was able to build the hash table for only 151 rows (out of the total 1492). It now scans all of the build table data for that partition to find the matching rows. Then you see the following lines.

kxhfResetIter(0CA476F8)
qerhjFetchPhase2(): building a hash table

*** RowSrcId: 1 HASH JOIN BUILD HASH TABLE (PHASE 2) ***
Number of blocks that may be used to build the hash hable 2
Number of rows left to be iterated over (start of function): 1341
Number of rows iterated over this function call: 151
Number of rows left to be iterated over (end of function): 1190

It picks up the next 151 rows of the probe table to build the hash table and again iterates over the build table data for that partition to find matching rows. This is the nested-loops hash join, the worst-case scenario for a hash join.
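A sketch of what I understand the nested-loops hash join to be doing: build a hash table from as many rows of the smaller input as fit in memory, scan the whole other input against it, then repeat with the next chunk. The 151-row chunk size just mirrors the trace above; everything else is illustrative:

*******************************************
# Worst case: even the smaller side of a partition pair does not fit in memory.
def nested_loops_hash_join(smaller, larger, rows_per_chunk=151):
    for start in range(0, len(smaller), rows_per_chunk):
        table = {}
        for h2, key, row in smaller[start:start + rows_per_chunk]:
            table.setdefault(h2, []).append((key, row))
        for h2, key, row in larger:                  # the other side is re-scanned for every chunk
            for bkey, brow in table.get(h2, []):
                if bkey == key:
                    yield (brow, row)
*******************************************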

The following line, I think, marks the end of the hash join:
*** RowSrcId: 1, qerhjFreeSpace(): free hash-join memory

Significance of table order in a Hash Join

One of the important decisions taken by the CBO in a hash join is the choice of the build table: which of the two tables should be used to build the hash table. I was testing the impact of this decision on the resource usage and performance of the join. Leaving other contention aside, the major cost components to look at are CPU and I/O.

Case 1) Either of the join tables fits in the available memory

Even if you have a hash area large enough to hold either of the tables in memory, it is wiser to choose the smaller one as the build table. The I/O cost is the same either way, but the cost of building a hash table is proportional to the input size: the hash join cost is the cost of building the hash table plus the cost of probing it, and building is more expensive than probing for inputs of the same size. Things could turn around, though, if the collision chains get long, which increases the probe cost.
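A back-of-the-envelope way to put that trade-off (a toy cost model with made-up per-row constants, not Oracle's costing):

*******************************************
# Toy model: building costs more per row than probing, so hash the smaller input.
def hash_join_cpu_cost(build_rows, probe_rows,
                       build_cost_per_row=2.0,     # arbitrary illustrative constants
                       probe_cost_per_row=1.0):
    return build_rows * build_cost_per_row + probe_rows * probe_cost_per_row

small, large = 10_000, 1_000_000
print(hash_join_cpu_cost(small, large))   # small table as build input: cheaper
print(hash_join_cpu_cost(large, small))   # large table as build input: more expensive
*******************************************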

Case 2) Only the smaller of the two inputs fits in memory

It is ideal to pick the smaller one to build the hash table in memory and use the larger one to probe it. This has the minimum CPU and I/O cost.

Case 3) Neither of the join tables fits in the available memory

I think several factors come into play here. Dynamic Role Reversal, which automatically kicks in in this case, will choose the smaller partition of each pair to build the hash table. I think the one major factor to consider here is bit-vector filtering (I'm assuming no major data skew).

Choosing the smaller table as the probe table and applying bit-vector filtering to it could reduce the size of its spilled partitions. This would reduce the number of passes in a nested-loops hash join; in some borderline cases the join might even complete in one pass.

At the same time, taking advantage of bit-vector filtering on the larger table could reduce the probe cost.

The cost difference between the two choices might not be significant.
