/* Innobase relational database engine; Copyright (C) 2001 Innobase Oy

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License 2
as published by the Free Software Foundation in June 1991.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License 2
along with this program (in file COPYING); if not, write to the Free
Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. */
/******************************************************
The database buffer buf_pool

(c) 1995 Innobase Oy

Created 11/5/1995 Heikki Tuuri
*******************************************************/

#include "buf0buf.h"

#ifdef UNIV_NONINL
#include "buf0buf.ic"
#endif

#include "mem0mem.h"
#include "btr0btr.h"
#include "fil0fil.h"
#include "lock0lock.h"
#include "btr0sea.h"
#include "ibuf0ibuf.h"
#include "dict0dict.h"
#include "log0recv.h"
#include "log0log.h"
#include "trx0undo.h"
#include "srv0srv.h"

/*
		IMPLEMENTATION OF THE BUFFER POOL
		=================================

Performance improvement:
------------------------
Thread scheduling in NT may be so slow that the OS wait mechanism should
not be used even in waiting for disk reads to complete.
Rather, we should put waiting query threads to the queue of
waiting jobs, and let the OS thread do something useful while the i/o
is processed. In this way we could remove most OS thread switches in
an i/o-intensive benchmark like TPC-C.

A possibility is to put a user space thread library between the database
and NT. User space thread libraries might be very fast.

SQL Server 7.0 can be configured to use 'fibers' which are lightweight
threads in NT. These should be studied.

		Buffer frames and blocks
		------------------------
Following the terminology of Gray and Reuter, we call the memory
blocks where file pages are loaded buffer frames. For each buffer
frame there is a control block, or shortly, a block, in the buffer
control array. The control info which does not need to be stored
in the file along with the file page, resides in the control block.

		Buffer pool struct
		------------------
The buffer buf_pool contains a single mutex which protects all the
control data structures of the buf_pool. The content of a buffer frame is
protected by a separate read-write lock in its control block, though.
These locks can be locked and unlocked without owning the buf_pool mutex.
The OS events in the buf_pool struct can be waited for without owning the
buf_pool mutex.

The buf_pool mutex is a hot-spot in main memory, causing a lot of
memory bus traffic on multiprocessor systems when processors
alternately access the mutex. On our Pentium, the mutex is accessed
maybe every 10 microseconds. We gave up the solution to have mutexes
for each control block, for instance, because it seemed to be
complicated.

A solution to reduce mutex contention of the buf_pool mutex is to
create a separate mutex for the page hash table. On Pentium,
accessing the hash table takes 2 microseconds, about half
of the total buf_pool mutex hold time.
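
Illustration only, not part of InnoDB: the pairing of buffer frames and
control blocks described above can be sketched as two parallel arrays,
with an address inside a frame mapped back to its control block by index
arithmetic. All names (sketch_pool_t, sketch_block_of, SKETCH_PAGE_SHIFT)
and the 16 kB page size are assumptions made for this sketch. Only '//'
comments are used so that the enclosing block comment stays intact.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_PAGE_SHIFT 14	// hypothetical page size: 16 kB
#define SKETCH_PAGE_SIZE ((size_t)1 << SKETCH_PAGE_SHIFT)

typedef struct {
	unsigned char *frame;	// frame that holds the file page
	unsigned buf_fix_count;	// control info kept outside the page
} sketch_block_t;

typedef struct {
	unsigned char *frame_mem;	// raw, unaligned allocation
	unsigned char *frame_zero;	// first page-aligned frame
	sketch_block_t *blocks;	// control array, one block per frame
	size_t n_frames;
} sketch_pool_t;

// Allocates n_frames aligned frames plus the control array.
// Returns 0 on success, -1 if memory runs out.
static int sketch_pool_init(sketch_pool_t *pool, size_t n_frames)
{
	size_t i;
	size_t pad;

	assert(pool != NULL && n_frames > 0);

	// One spare page leaves room to round the start address up
	pool->frame_mem = malloc(SKETCH_PAGE_SIZE * (n_frames + 1));
	pool->blocks = calloc(n_frames, sizeof(sketch_block_t));
	if (pool->frame_mem == NULL || pool->blocks == NULL) {
		free(pool->frame_mem);
		free(pool->blocks);
		return -1;
	}

	// Distance from the raw address to the next page boundary
	pad = (SKETCH_PAGE_SIZE
	       - ((uintptr_t)pool->frame_mem & (SKETCH_PAGE_SIZE - 1)))
	      & (SKETCH_PAGE_SIZE - 1);
	pool->frame_zero = pool->frame_mem + pad;
	pool->n_frames = n_frames;

	for (i = 0; i < n_frames; i++) {
		pool->blocks[i].frame = pool->frame_zero + i * SKETCH_PAGE_SIZE;
	}
	return 0;
}

// Maps any address inside a frame back to the frame's control block
static sketch_block_t *sketch_block_of(const sketch_pool_t *pool,
				       const unsigned char *ptr)
{
	size_t i;

	if (ptr < pool->frame_zero) {
		return NULL;
	}
	i = (size_t)(ptr - pool->frame_zero) >> SKETCH_PAGE_SHIFT;
	return i < pool->n_frames ? &pool->blocks[i] : NULL;
}
```

Because every frame starts on a page boundary, the lookup needs only a
subtraction and a shift, and no per-frame back pointer has to be stored
in the page itself.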

		Control blocks
		--------------

The control block contains, for instance, the bufferfix count
which is incremented when a thread wants a file page to be fixed
in a buffer frame. The bufferfix operation does not lock the
contents of the frame, however. For this purpose, the control
block contains a read-write lock.

The buffer frames have to be aligned so that the start memory
address of a frame is divisible by the universal page size, which
is a power of two.

We intend to make the buffer buf_pool size on-line reconfigurable,
that is, the buf_pool size can be changed without closing the database.
Then the database administrator may adjust it to be bigger
at night, for example. The control block array must
contain enough control blocks for the maximum buffer buf_pool size
which is used in the particular database.
If the buf_pool size is cut, we exploit the virtual memory mechanism of
the OS, and just refrain from using frames at high addresses. Then the OS
can swap them to disk.

The control blocks containing file pages are put to a hash table
according to the file address of the page.
We could speed up the access to an individual page by using
"pointer swizzling": we could replace the page references on
non-leaf index pages by direct pointers to the page, if it exists
in the buf_pool. We could make a separate hash table where we could
chain all the page references in non-leaf pages residing in the buf_pool,
using the page reference as the hash key,
and at the time of reading of a page update the pointers accordingly.
Drawbacks of this solution are added complexity and,
possibly, extra space required on non-leaf pages for memory pointers.
A simpler solution is just to speed up the hash table mechanism
in the database, using tables whose size is a power of 2.

		Lists of blocks
		---------------

There are several lists of control blocks. The free list contains
blocks which are currently not used.

The LRU-list contains all the blocks holding a file page
except those for which the bufferfix count is non-zero.
The pages are in the LRU list roughly in the order of the last
access to the page, so that the oldest pages are at the end of the
list. We also keep a pointer to near the end of the LRU list,
which we can use when we want to artificially age a page in the
buf_pool. This is used if we know that some page is not needed
again for some time: we insert the block right after the pointer,
causing it to be replaced sooner than would normally be the case.
Currently this aging mechanism is used for the read-ahead mechanism
of pages, and it can also be used when there is a scan of a full
table which cannot fit in the memory. By putting the pages near the
end of the LRU list, we make sure that most of the buf_pool stays in the
main memory, undisturbed.

The chain of modified blocks contains the blocks
holding file pages that have been modified in the memory
but not written to disk yet. The block with the oldest modification
which has not yet been written to disk is at the end of the chain.

		Loading a file page
		-------------------

First, a victim block for replacement has to be found in the
buf_pool. It is taken from the free list or searched for from the
end of the LRU-list. An exclusive lock is reserved for the frame,
the io_fix field is set in the block fixing the block in buf_pool,
and the io-operation for loading the page is queued.
The io-handler thread
releases the X-lock on the frame and resets the io_fix field
when the io operation completes.

A thread may request the above operation using the buf_page_get-
function. It may then continue to request a lock on the frame.
The lock is granted when the io-handler releases the x-lock.

		Read-ahead
		----------

The read-ahead mechanism is intended to be intelligent and
isolated from the semantically higher levels of the database
index management. From the higher level we only need the
information if a file page has a natural successor or
predecessor page. On the leaf level of a B-tree index,
these are the next and previous pages in the natural
order of the pages.

Let us first explain the read-ahead mechanism when the leaves
of a B-tree are scanned in an ascending or descending order.
When a read page is referenced for the first time in the buf_pool,
the buffer manager checks if it is at the border of a so-called
linear read-ahead area. The tablespace is divided into these
areas of size 64 blocks, for example. So if the page is at the
border of such an area, the read-ahead mechanism checks if
all the other blocks in the area have been accessed in an
ascending or descending order. If this is the case, the system
looks at the natural successor or predecessor of the page,
checks if that is at the border of another area, and in this case
issues read-requests for all the pages in that area. Maybe
we could relax the condition that all the pages in the area
have to be accessed: if data is deleted from a table, there may
appear holes of unused pages in the area.

A different read-ahead mechanism is used when there appears
to be a random access pattern to a file.
If a new page is referenced in the buf_pool, and several pages
of its random access area (for instance, 32 consecutive pages
in a tablespace) have recently been referenced, we may predict
that the whole area may be needed in the near future, and issue
the read requests for the whole area.

		AWE implementation
		------------------

By a 'block' we mean the buffer header of type buf_block_t. By a 'page'
we mean the physical 16 kB memory area allocated from RAM for that block.
By a 'frame' we mean a 16 kB area in the virtual address space of the
process, in the frame_mem of buf_pool.

We can map pages to the frames of the buffer pool.

1) A buffer block allocated to use as a non-data page, e.g., to the lock
table, is always mapped to a frame.
2) A bufferfixed or io-fixed data page is always mapped to a frame.
3) When we need to map a block to frame, we look from the list
awe_LRU_free_mapped and try to unmap its last block, but note that
bufferfixed or io-fixed pages cannot be unmapped.
4) For every frame in the buffer pool there is always a block whose page is
mapped to it. When we create the buffer pool, we map the first elements
in the free list to the frames.
5) When we have AWE enabled, we disable adaptive hash indexes.
*/

buf_pool_t*	buf_pool = NULL; /* The buffer buf_pool of the database */

#ifdef UNIV_DEBUG
ulint		buf_dbg_counter	= 0; /* This is used to insert validation
					operations in execution in the
					debug version */
ibool		buf_debug_prints = FALSE; /* If this is set TRUE,
					the program prints info whenever
					read-ahead or flush occurs */
#endif /* UNIV_DEBUG */
/************************************************************************
Calculates a page checksum which is stored to the page when it is written
to a file.
Note that we must be careful to calculate the same value on
32-bit and 64-bit architectures. */

ulint
buf_calc_page_new_checksum(
/*=======================*/
			/* out: checksum */
	byte*	page)	/* in: buffer page */
{
	ulint	checksum;

	/* Since the field FIL_PAGE_FILE_FLUSH_LSN, and in versions <= 4.1.x
	..._ARCH_LOG_NO, are written outside the buffer pool to the first
	pages of data files, we have to skip them in the page checksum
	calculation.
	We must also skip the field FIL_PAGE_SPACE_OR_CHKSUM where the
	checksum is stored, and also the last 8 bytes of page because
	there we store the old formula checksum. */

	checksum = ut_fold_binary(page + FIL_PAGE_OFFSET,
				  FIL_PAGE_FILE_FLUSH_LSN - FIL_PAGE_OFFSET)
		   + ut_fold_binary(page + FIL_PAGE_DATA,
				    UNIV_PAGE_SIZE - FIL_PAGE_DATA
				    - FIL_PAGE_END_LSN_OLD_CHKSUM);
	checksum = checksum & 0xFFFFFFFFUL;

	return(checksum);
}

/************************************************************************
In versions < 4.0.14 and < 4.1.1 there was a bug that the checksum only
looked at the first few bytes of the page. This calculates that old
checksum.
NOTE: we must first store the new formula checksum to
FIL_PAGE_SPACE_OR_CHKSUM before calculating and storing this old checksum
because this takes that field as an input! */

ulint
buf_calc_page_old_checksum(
/*=======================*/
			/* out: checksum */
	byte*	page)	/* in: buffer page */
{
	ulint	checksum;

	checksum = ut_fold_binary(page, FIL_PAGE_FILE_FLUSH_LSN);

	checksum = checksum & 0xFFFFFFFFUL;

	return(checksum);
}

/************************************************************************
Checks if a page is corrupt.
*/

ibool
buf_page_is_corrupted(
/*==================*/
				/* out: TRUE if corrupted */
	byte*	read_buf)	/* in: a database page */
{
	ulint	checksum;
	ulint	old_checksum;
	ulint	checksum_field;
	ulint	old_checksum_field;
#ifndef UNIV_HOTBACKUP
	dulint	current_lsn;
#endif
	if (mach_read_from_4(read_buf + FIL_PAGE_LSN + 4)
	    != mach_read_from_4(read_buf + UNIV_PAGE_SIZE
				- FIL_PAGE_END_LSN_OLD_CHKSUM + 4)) {

		/* Stored log sequence numbers at the start and the end
		of page do not match */

		return(TRUE);
	}

#ifndef UNIV_HOTBACKUP
	if (recv_lsn_checks_on && log_peek_lsn(&current_lsn)) {
		if (ut_dulint_cmp(current_lsn,
				  mach_read_from_8(read_buf + FIL_PAGE_LSN))
		    < 0) {
			ut_print_timestamp(stderr);

			fprintf(stderr,
" InnoDB: Error: page %lu log sequence number %lu %lu\n"
"InnoDB: is in the future! Current system log sequence number %lu %lu.\n"
"InnoDB: Your database may be corrupt or you may have copied the InnoDB\n"
"InnoDB: tablespace but not the InnoDB log files. See\n"
"http://dev.mysql.com/doc/mysql/en/backing-up.html for more information.\n",
			(ulong) mach_read_from_4(read_buf + FIL_PAGE_OFFSET),
			(ulong) ut_dulint_get_high(
				mach_read_from_8(read_buf + FIL_PAGE_LSN)),
			(ulong) ut_dulint_get_low(
				mach_read_from_8(read_buf + FIL_PAGE_LSN)),
			(ulong) ut_dulint_get_high(current_lsn),
			(ulong) ut_dulint_get_low(current_lsn));
		}
	}
#endif

	/* If we use checksums validation, make additional check before
	returning TRUE to ensure that the checksum is not equal to
	BUF_NO_CHECKSUM_MAGIC which might be stored by InnoDB with checksums
	disabled.
	Otherwise, skip checksum calculation and return FALSE */

	if (srv_use_checksums) {
		old_checksum = buf_calc_page_old_checksum(read_buf);

		old_checksum_field = mach_read_from_4(read_buf + UNIV_PAGE_SIZE
						- FIL_PAGE_END_LSN_OLD_CHKSUM);

		/* There are 2 valid formulas for old_checksum_field:

		1. Very old versions of InnoDB only stored 8 byte lsn to the
		start and the end of the page.

		2. Newer InnoDB versions store the old formula checksum
		there. */

		if (old_checksum_field != mach_read_from_4(read_buf
							   + FIL_PAGE_LSN)
		    && old_checksum_field != old_checksum
		    && old_checksum_field != BUF_NO_CHECKSUM_MAGIC) {

			return(TRUE);
		}

		checksum = buf_calc_page_new_checksum(read_buf);
		checksum_field = mach_read_from_4(read_buf +
						  FIL_PAGE_SPACE_OR_CHKSUM);

		/* InnoDB versions < 4.0.14 and < 4.1.1 stored the space id
		(always equal to 0), to FIL_PAGE_SPACE_OR_CHKSUM */

		if (checksum_field != 0 && checksum_field != checksum
		    && checksum_field != BUF_NO_CHECKSUM_MAGIC) {

			return(TRUE);
		}
	}

	return(FALSE);
}

/************************************************************************
Prints a page to stderr. */

void
buf_page_print(
/*===========*/
	byte*	read_buf)	/* in: a database page */
{
	dict_index_t*	index;
	ulint		checksum;
	ulint		old_checksum;

	ut_print_timestamp(stderr);
	/* Cast to ulong so that the argument matches the %lu conversion */
	fprintf(stderr, " InnoDB: Page dump in ascii and hex (%lu bytes):\n",
		(ulong) UNIV_PAGE_SIZE);
	ut_print_buf(stderr, read_buf, UNIV_PAGE_SIZE);
	fputs("InnoDB: End of page dump\n", stderr);

	checksum = srv_use_checksums ?
		buf_calc_page_new_checksum(read_buf) : BUF_NO_CHECKSUM_MAGIC;
	old_checksum = srv_use_checksums ?
		buf_calc_page_old_checksum(read_buf) : BUF_NO_CHECKSUM_MAGIC;

	ut_print_timestamp(stderr);
	fprintf(stderr,
" InnoDB: Page checksum %lu, prior-to-4.0.14-form checksum %lu\n"
"InnoDB: stored checksum %lu, prior-to-4.0.14-form stored checksum %lu\n",
		(ulong) checksum, (ulong) old_checksum,
		(ulong) mach_read_from_4(read_buf + FIL_PAGE_SPACE_OR_CHKSUM),
		(ulong) mach_read_from_4(read_buf + UNIV_PAGE_SIZE
					 - FIL_PAGE_END_LSN_OLD_CHKSUM));
	fprintf(stderr,
"InnoDB: Page lsn %lu %lu, low 4 bytes of lsn at page end %lu\n"
"InnoDB: Page number (if stored to page already) %lu,\n"
"InnoDB: space id (if created with >= MySQL-4.1.1 and stored already) %lu\n",
		(ulong) mach_read_from_4(read_buf + FIL_PAGE_LSN),
		(ulong) mach_read_from_4(read_buf + FIL_PAGE_LSN + 4),
		(ulong) mach_read_from_4(read_buf + UNIV_PAGE_SIZE
					 - FIL_PAGE_END_LSN_OLD_CHKSUM + 4),
		(ulong) mach_read_from_4(read_buf + FIL_PAGE_OFFSET),
		(ulong) mach_read_from_4(read_buf + FIL_PAGE_ARCH_LOG_NO_OR_SPACE_ID));

	if (mach_read_from_2(read_buf + TRX_UNDO_PAGE_HDR + TRX_UNDO_PAGE_TYPE)
	    == TRX_UNDO_INSERT) {
		fprintf(stderr,
			"InnoDB: Page may be an insert undo log page\n");
	} else if (mach_read_from_2(read_buf + TRX_UNDO_PAGE_HDR
				    + TRX_UNDO_PAGE_TYPE)
		   == TRX_UNDO_UPDATE) {
		fprintf(stderr,
			"InnoDB: Page may be an update undo log page\n");
	}

	switch (fil_page_get_type(read_buf)) {
	case FIL_PAGE_INDEX:
		fprintf(stderr,
"InnoDB: Page may be an index page where index id is %lu %lu\n",
			(ulong) ut_dulint_get_high(btr_page_get_index_id(read_buf)),
			(ulong) ut_dulint_get_low(btr_page_get_index_id(read_buf)));

		/* If the code is in ibbackup, dict_sys may be uninitialized,
		i.e., NULL */

		if (dict_sys != NULL) {

			index = dict_index_find_on_id_low(
					btr_page_get_index_id(read_buf));
			if (index) {
				fputs("InnoDB: (", stderr);
				dict_index_name_print(stderr, NULL, index);
				fputs(")\n", stderr);
			}
		}
		break;
	case FIL_PAGE_INODE:
		fputs("InnoDB: Page may be an 'inode' page\n", stderr);
		break;
	case FIL_PAGE_IBUF_FREE_LIST:
		fputs("InnoDB: Page may be an insert buffer free list page\n",
			stderr);
		break;
	case FIL_PAGE_TYPE_ALLOCATED:
		fputs("InnoDB: Page may be a freshly allocated page\n",
			stderr);
		break;
	case FIL_PAGE_IBUF_BITMAP:
		fputs("InnoDB: Page may be an insert buffer bitmap page\n",
			stderr);
		break;
	case FIL_PAGE_TYPE_SYS:
		fputs("InnoDB: Page may be a system page\n",
			stderr);
		break;
	case FIL_PAGE_TYPE_TRX_SYS:
		fputs("InnoDB: Page may be a transaction system page\n",
			stderr);
		break;
	case FIL_PAGE_TYPE_FSP_HDR:
		fputs("InnoDB: Page may be a file space header page\n",
			stderr);
		break;
	case FIL_PAGE_TYPE_XDES:
		fputs("InnoDB: Page may be an extent descriptor page\n",
			stderr);
		break;
	case FIL_PAGE_TYPE_BLOB:
		fputs("InnoDB: Page may be a BLOB page\n",
			stderr);
		break;
	}
}

/************************************************************************
Initializes a buffer control block when the buf_pool is created.
*/
static
void
buf_block_init(
/*===========*/
	buf_block_t*	block,	/* in: pointer to control block */
	byte*		frame)	/* in: pointer to buffer frame, or NULL if in
				the case of AWE there is no frame */
{
	block->magic_n = 0;

	block->state = BUF_BLOCK_NOT_USED;

	block->frame = frame;

	block->awe_info = NULL;

	block->buf_fix_count = 0;
	block->io_fix = 0;

	block->modify_clock = ut_dulint_zero;

	block->file_page_was_freed = FALSE;

	block->check_index_page_at_flush = FALSE;
	block->index = NULL;

	block->in_free_list = FALSE;
	block->in_LRU_list = FALSE;

	block->n_pointers = 0;

	rw_lock_create(&block->lock, SYNC_LEVEL_VARYING);
	ut_ad(rw_lock_validate(&(block->lock)));

#ifdef UNIV_SYNC_DEBUG
	rw_lock_create(&block->debug_latch, SYNC_NO_ORDER_CHECK);
#endif /* UNIV_SYNC_DEBUG */
}

/************************************************************************
Creates the buffer pool.
*/

buf_pool_t*
buf_pool_init(
/*==========*/
				/* out, own: buf_pool object, NULL if not
				enough memory or error */
	ulint	max_size,	/* in: maximum size of the buf_pool in
				blocks */
	ulint	curr_size,	/* in: current size to use, must be <=
				max_size, currently must be equal to
				max_size */
	ulint	n_frames)	/* in: number of frames; if AWE is used,
				this is the size of the address space window
				where physical memory pages are mapped; if
				AWE is not used then this must be the same
				as max_size */
{
	byte*		frame;
	ulint		i;
	buf_block_t*	block;

	ut_a(max_size == curr_size);
	ut_a(srv_use_awe || n_frames == max_size);

	if (n_frames > curr_size) {
		fprintf(stderr,
"InnoDB: AWE: Error: you must specify in my.cnf .._awe_mem_mb larger\n"
"InnoDB: than .._buffer_pool_size. Now the former is %lu pages,\n"
"InnoDB: the latter %lu pages.\n", (ulong) curr_size, (ulong) n_frames);

		return(NULL);
	}

	buf_pool = mem_alloc(sizeof(buf_pool_t));

	/* 1.
	Initialize general fields
	---------------------------- */
	mutex_create(&buf_pool->mutex, SYNC_BUF_POOL);

	mutex_enter(&(buf_pool->mutex));

	if (srv_use_awe) {
		/*----------------------------------------*/
		/* Allocate the virtual address space window, i.e., the
		buffer pool frames */

		buf_pool->frame_mem = os_awe_allocate_virtual_mem_window(
					UNIV_PAGE_SIZE * (n_frames + 1));

		/* Allocate the physical memory for AWE and the AWE info array
		for buf_pool */

		if ((curr_size % ((1024 * 1024) / UNIV_PAGE_SIZE)) != 0) {

			fprintf(stderr,
"InnoDB: AWE: Error: physical memory must be allocated in full megabytes.\n"
"InnoDB: Trying to allocate %lu database pages.\n",
				(ulong) curr_size);

			return(NULL);
		}

		if (!os_awe_allocate_physical_mem(&(buf_pool->awe_info),
			curr_size / ((1024 * 1024) / UNIV_PAGE_SIZE))) {

			return(NULL);
		}
		/*----------------------------------------*/
	} else {
		buf_pool->frame_mem = os_mem_alloc_large(
					UNIV_PAGE_SIZE * (n_frames + 1),
					TRUE, FALSE);
	}

	if (buf_pool->frame_mem == NULL) {

		return(NULL);
	}

	buf_pool->blocks = ut_malloc(sizeof(buf_block_t) * max_size);

	if (buf_pool->blocks == NULL) {

		return(NULL);
	}

	buf_pool->max_size = max_size;
	buf_pool->curr_size = curr_size;

	buf_pool->n_frames = n_frames;

	/* Align pointer to the first frame */

	frame = ut_align(buf_pool->frame_mem, UNIV_PAGE_SIZE);

	buf_pool->frame_zero = frame;
	buf_pool->high_end = frame + UNIV_PAGE_SIZE * n_frames;

	if (srv_use_awe) {
		/*----------------------------------------*/
		/* Map an initial part of the allocated physical memory to
		the window */

		os_awe_map_physical_mem_to_window(buf_pool->frame_zero,
				n_frames *
				(UNIV_PAGE_SIZE / OS_AWE_X86_PAGE_SIZE),
				buf_pool->awe_info);
		/*----------------------------------------*/
	}

	buf_pool->blocks_of_frames = ut_malloc(sizeof(void*) * n_frames);

	if (buf_pool->blocks_of_frames == NULL) {

		return(NULL);
	}

	/* Init block structs and assign frames for them; in the case of
	AWE there are fewer frames than blocks. Then we assign the frames
	to the first blocks (we already mapped the memory above). We also
	init the awe_info for every block. */

	for (i = 0; i < max_size; i++) {

		block = buf_pool_get_nth_block(buf_pool, i);

		if (i < n_frames) {
			frame = buf_pool->frame_zero + i * UNIV_PAGE_SIZE;
			*(buf_pool->blocks_of_frames + i) = block;
		} else {
			frame = NULL;
		}

		buf_block_init(block, frame);

		if (srv_use_awe) {
			/*----------------------------------------*/
			block->awe_info = buf_pool->awe_info
				+ i * (UNIV_PAGE_SIZE / OS_AWE_X86_PAGE_SIZE);
			/*----------------------------------------*/
		}
	}

	buf_pool->page_hash = hash_create(2 * max_size);

	buf_pool->n_pend_reads = 0;

	buf_pool->last_printout_time = time(NULL);

	buf_pool->n_pages_read = 0;
	buf_pool->n_pages_written = 0;
	buf_pool->n_pages_created = 0;
	buf_pool->n_pages_awe_remapped = 0;

	buf_pool->n_page_gets = 0;
	buf_pool->n_page_gets_old = 0;
	buf_pool->n_pages_read_old = 0;
	buf_pool->n_pages_written_old = 0;
	buf_pool->n_pages_created_old = 0;
	buf_pool->n_pages_awe_remapped_old = 0;

	/* 2.
	Initialize flushing fields
	---------------------------- */
	UT_LIST_INIT(buf_pool->flush_list);

	for (i = BUF_FLUSH_LRU; i <= BUF_FLUSH_LIST; i++) {
		buf_pool->n_flush[i] = 0;
		buf_pool->init_flush[i] = FALSE;
		buf_pool->no_flush[i] = os_event_create(NULL);
	}

	buf_pool->LRU_flush_ended = 0;

	buf_pool->ulint_clock = 1;
	buf_pool->freed_page_clock = 0;

	/* 3. Initialize LRU fields
	---------------------------- */
	UT_LIST_INIT(buf_pool->LRU);

	buf_pool->LRU_old = NULL;

	UT_LIST_INIT(buf_pool->awe_LRU_free_mapped);

	/* Add control blocks to the free list */
	UT_LIST_INIT(buf_pool->free);

	for (i = 0; i < curr_size; i++) {

		block = buf_pool_get_nth_block(buf_pool, i);

		if (block->frame) {
			/* Wipe contents of frame to eliminate a Purify
			warning */

#ifdef HAVE_purify
			memset(block->frame, '\0', UNIV_PAGE_SIZE);
#endif
			if (srv_use_awe) {
				/* Add to the list of blocks mapped to
				frames */

				UT_LIST_ADD_LAST(awe_LRU_free_mapped,
					buf_pool->awe_LRU_free_mapped, block);
			}
		}

		UT_LIST_ADD_LAST(free, buf_pool->free, block);
		block->in_free_list = TRUE;
	}

	mutex_exit(&(buf_pool->mutex));

	if (srv_use_adaptive_hash_indexes) {
		btr_search_sys_create(
			curr_size * UNIV_PAGE_SIZE / sizeof(void*) / 64);
	} else {
		/* Create only a small dummy system */
		btr_search_sys_create(1000);
	}

	return(buf_pool);
}

/************************************************************************
Maps the page of block to a frame, if not mapped yet. Unmaps some page
from the end of the awe_LRU_free_mapped.
*/

void
buf_awe_map_page_to_frame(
/*======================*/
	buf_block_t*	block,		/* in: block whose page should be
					mapped to a frame */
	ibool		add_to_mapped_list) /* in: TRUE if, in the case
					we need to map the page, we should
					also add the block to the
					awe_LRU_free_mapped list */
{
	buf_block_t*	bck;

#ifdef UNIV_SYNC_DEBUG
	ut_ad(mutex_own(&(buf_pool->mutex)));
#endif /* UNIV_SYNC_DEBUG */
	ut_ad(block);

	if (block->frame) {

		return;
	}

	/* Scan awe_LRU_free_mapped from the end and try to find a block
	which is not bufferfixed or io-fixed */

	bck = UT_LIST_GET_LAST(buf_pool->awe_LRU_free_mapped);

	while (bck) {
		if (bck->state == BUF_BLOCK_FILE_PAGE
		    && (bck->buf_fix_count != 0 || bck->io_fix != 0)) {

			/* We have to skip this */
			bck = UT_LIST_GET_PREV(awe_LRU_free_mapped, bck);
		} else {
			/* We can map block to the frame of bck */

			os_awe_map_physical_mem_to_window(
				bck->frame,
				UNIV_PAGE_SIZE / OS_AWE_X86_PAGE_SIZE,
				block->awe_info);

			block->frame = bck->frame;

			*(buf_pool->blocks_of_frames
			  + (((ulint)(block->frame
				      - buf_pool->frame_zero))
			     >> UNIV_PAGE_SIZE_SHIFT))
				= block;

			bck->frame = NULL;
			UT_LIST_REMOVE(awe_LRU_free_mapped,
				       buf_pool->awe_LRU_free_mapped,
				       bck);

			if (add_to_mapped_list) {
				UT_LIST_ADD_FIRST(awe_LRU_free_mapped,
						  buf_pool->awe_LRU_free_mapped,
						  block);
			}

			buf_pool->n_pages_awe_remapped++;

			return;
		}
	}

	fprintf(stderr,
"InnoDB: AWE: Fatal error: cannot find a page to unmap\n"
"InnoDB: awe_LRU_free_mapped list length %lu\n",
		(ulong) UT_LIST_GET_LEN(buf_pool->awe_LRU_free_mapped));

	ut_a(0);
}

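/************************************************************************
Illustration only, not part of InnoDB: a minimal standalone sketch of the
victim scan performed by buf_awe_map_page_to_frame() above. It walks a list
of mapped blocks from the tail, skips blocks that are bufferfixed or
io-fixed, and hands the first eligible frame over to the requesting block.
All names (sketch_blk_t, sketch_list_t, sketch_map_block) are hypothetical.
Unlike the real function, the sketch ignores the block state, does not call
the OS to remap memory, always adds the block to the mapped list, and
returns 0 instead of asserting when no victim exists. */

```c
#include <assert.h>
#include <stddef.h>

typedef struct sketch_blk {
	struct sketch_blk *prev;
	struct sketch_blk *next;
	void *frame;		/* NULL while the page is not mapped */
	unsigned buf_fix_count;	/* non-zero: a thread has the page fixed */
	unsigned io_fix;	/* non-zero: an i/o operation is pending */
} sketch_blk_t;

typedef struct {
	sketch_blk_t *first;
	sketch_blk_t *last;
} sketch_list_t;

/* Inserts b at the head of the list */
static void sketch_list_add_first(sketch_list_t *list, sketch_blk_t *b)
{
	b->prev = NULL;
	b->next = list->first;
	if (list->first != NULL) {
		list->first->prev = b;
	} else {
		list->last = b;
	}
	list->first = b;
}

/* Unlinks b from the list */
static void sketch_list_remove(sketch_list_t *list, sketch_blk_t *b)
{
	if (b->prev != NULL) {
		b->prev->next = b->next;
	} else {
		list->first = b->next;
	}
	if (b->next != NULL) {
		b->next->prev = b->prev;
	} else {
		list->last = b->prev;
	}
	b->prev = b->next = NULL;
}

/* Returns 1 if block has a frame on return, 0 if every mapped block
is fixed and no frame could be taken over */
static int sketch_map_block(sketch_list_t *mapped, sketch_blk_t *block)
{
	sketch_blk_t *victim;

	assert(mapped != NULL && block != NULL);

	if (block->frame != NULL) {
		return 1;	/* already mapped */
	}

	for (victim = mapped->last; victim != NULL; victim = victim->prev) {
		if (victim->buf_fix_count != 0 || victim->io_fix != 0) {
			continue;	/* fixed blocks must stay mapped */
		}

		/* Take over the frame and swap the list membership */
		block->frame = victim->frame;
		victim->frame = NULL;
		sketch_list_remove(mapped, victim);
		sketch_list_add_first(mapped, block);
		return 1;
	}

	return 0;
}
```
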
/************************************************************************
Allocates a buffer block. */
UNIV_INLINE
buf_block_t*
buf_block_alloc(void)
/*=================*/
				/* out, own: the allocated block; also if AWE
				is used it is guaranteed that the page is
				mapped to a frame */
{
	buf_block_t*	block;

	block = buf_LRU_get_free_block();

	return(block);
}

/************************************************************************
Moves the block to the start of the LRU list if there is a danger
that the block would drift out of the buffer pool. */
UNIV_INLINE
void
buf_block_make_young(
/*=================*/
	buf_block_t*	block)	/* in: block to make younger */
{
	if (buf_pool->freed_page_clock >= block->freed_page_clock
				+ 1 + (buf_pool->curr_size / 1024)) {

		/* There has been freeing activity in the LRU list:
		best to move to the head of the LRU list */

		buf_LRU_make_block_young(block);
	}
}

/************************************************************************
Moves a page to the start of the buffer pool LRU list. This high-level
function can be used to prevent an important page from slipping out of
the buffer pool. */

void
buf_page_make_young(
/*================*/
	buf_frame_t*	frame)	/* in: buffer frame of a file page */
{
	buf_block_t*	block;

	mutex_enter(&(buf_pool->mutex));

	block = buf_block_align(frame);

	ut_a(block->state == BUF_BLOCK_FILE_PAGE);

	buf_LRU_make_block_young(block);

	mutex_exit(&(buf_pool->mutex));
}

/************************************************************************
Frees a buffer block which does not contain a file page.
*/
UNIV_INLINE
void
buf_block_free(
/*===========*/
	buf_block_t*	block)	/* in, own: block to be freed */
{
	ut_a(block->state != BUF_BLOCK_FILE_PAGE);

	mutex_enter(&(buf_pool->mutex));

	buf_LRU_block_free_non_file_page(block);

	mutex_exit(&(buf_pool->mutex));
}

/*************************************************************************
Allocates a buffer frame. */

buf_frame_t*
buf_frame_alloc(void)
/*=================*/
				/* out: buffer frame */
{
	return(buf_block_alloc()->frame);
}

/*************************************************************************
Frees a buffer frame which does not contain a file page. */

void
buf_frame_free(
/*===========*/
	buf_frame_t*	frame)	/* in: buffer frame */
{
	buf_block_free(buf_block_align(frame));
}

/************************************************************************
Returns the buffer control block if the page can be found in the buffer
pool. NOTE that it is possible that the page is not yet read
from disk, though. This is a very low-level function: use with care! */

buf_block_t*
buf_page_peek_block(
/*================*/
			/* out: control block if found from page hash table,
			otherwise NULL; NOTE that the page is not necessarily
			yet read from disk! */
	ulint	space,	/* in: space id */
	ulint	offset)	/* in: page number */
{
	buf_block_t*	block;

	mutex_enter_fast(&(buf_pool->mutex));

	block = buf_page_hash_get(space, offset);

	mutex_exit(&(buf_pool->mutex));

	return(block);
}

/************************************************************************
Resets the check_index_page_at_flush field of a page if found in the buffer
pool.
*/

void
buf_reset_check_index_page_at_flush(
/*================================*/
	ulint	space,	/* in: space id */
	ulint	offset)	/* in: page number */
{
	buf_block_t*	block;

	mutex_enter_fast(&(buf_pool->mutex));

	block = buf_page_hash_get(space, offset);

	if (block) {
		block->check_index_page_at_flush = FALSE;
	}

	mutex_exit(&(buf_pool->mutex));
}

/************************************************************************
Returns the current state of is_hashed of a page. FALSE if the page is
not in the pool. NOTE that this operation does not fix the page in the
pool if it is found there. */

ibool
buf_page_peek_if_search_hashed(
/*===========================*/
			/* out: TRUE if page hash index is built in search

