
The GNU SQL Server is a portable multi-user DBMS, which supports the SQL89 standard with some extension from SQL92. It implements highly isolated transactions, and static & dynamic query compilation. Currently both the client & server sides of the system work on UNIX-like systems. Client/server interaction is based on an RPC mechanism. The server subprocesses facility requires message passing and memory sharing facilities.
GNU SQL is based on a client-server architecture for DBMS organisations. In fact, we support two kinds of “clients and servers” according to a way of mutual communications. The set of processes operating on a data base may be divided into several layers depending on their distance from the data base. The closest or innermost layer is the SDTM layer. These processes provide the facilities of Synchronisation (two-phase predicate locks), Logical Log (the log of updates of logical level), Micro Log (the log for microoperations), Buffer (main memory bufferization, exchanges with disks, locks of pages), and Sorter (to make external sorts). SDTM also includes utilities to restore data base after failures and crashes and a run-time library for transactions. This library implements the external interface of SDTM. All processes of this layer communicate with message queues and shared memory.
The next layer consists the compiler and interpreter for SQL statements. It implements an interface based on remote procedure calls. It uses the transaction library from the innermost layer to access the data base. The server part of the system also contains administration process that monitors a state of all other processes of the server and starts compiler and interpreter processes on demand. It can also unload all other parts of the server if they aren’t used for a long time. This process communicates with other processes of the internal layer with signals and messages and is accessible for external invocations via remote procedure calls. The Administrator may be registered as an internel server.
Client parts of the system may reside on any computer of a local network. The client part consists of the client part of the SQL compiler and the client part of the interpreter which are linked together with a user’s program. The only requirement for an environment where a client part functions is that it should support the mechanism of remote procedure calls.
The client part of the compiler (the precompiler) is responsible for selection of pure SQL-text and its transmission to the main (server) part of the compiler to convert into a procedural representation. After such processing, user gets a pure C program, in which SQL code is replaced by calls of the runtime-library of the client part of the query interpreter. The corresponded procedural translation of the SQL statements is stored within a data base.
When executing an SQL program, the server part of the interpreter is provided with reference to previously stored compiled code together with parameters for this invocation. The server part of the interpreter retrieves the code from data base, and executes it with appropriate invocation of the SDTM layer.
Engine realises a set of internal facilities for DB handling. These facilities must provide for external storage data management, DB storage reliability and the synchronisation required for multi-access support. The DB storage reliability is secured by logging the DB updates while the transaction synchronisation is made via support of two-phase synchrolock protocol. During regular run mode Engine consists of six resident processes:
The number of Transaction-processes is equal to the number of simultaneously running transactions in the system.
The DB locates in one or a few data segments - external storage files with a page organisation. Today one segment is used. All files-segments of one DB are placed in one directory (its name is name of the DB) and have standard predefined names (numbers of segments). There are three types of segments in the system: log segments, temporary segment and permanent segments. Each type of segments has its own storage structure.
The basic part of the DB consisting of base tables and indexes resides in permanent segments. Each table resides in one segment, its descriptor and all indexes defined on this table are located in the same segment.
Tables’ rows hereafter called tuples, index information and auxiliary management information on the given segment are also placed in pages of this segment. In particular auxiliary pages support a list of page number with this table tuples in an ascending order. Pages related to one index are connected by the tree references.
Description of free memory in a segment is also supported, in particular sizes of free parts in each table pages are known. Each tuple of every table has an identifier unique for its life (tid) providing a direct access to this tuple.
Log segments store information about DB updates. Two logs are supported: Logical Journal which contains records about execution of operations updating DB and Microjournal containing records about page updates. Logical Journal exists until the moment of copying the DB state (it is copied together with other segments). Microjournal begins to fill up afresh after the system buffers content is fixed in external storage.
There is a special auxiliary table in each segment. This is a catalogue of tables and indexes residing in the given segment. Every tuple of this table is a base table descriptor and contains information about the number and types of columns and indexes defined on this table.
At Engine level the table is identified by segment number, unique number of persistent object in the system and tid of descriptor in the table-catalogue. The table- catalogue in each segment has a fixed number (0). An index is identified by table identifier and unique number of persistent object. Table columns are identified by numbers from zero on.
Information about free segment pages stores as bit scale in the first two segment pages (numbered 0 and 1). For each table page a descriptor is supported which consists of a page number in the segment and size of free memory in this page. Such descriptor exists only for those segment pages in which table tuples were placed. Access to this descriptor is enabled by an hierarchical structure organised as a B-tree. A full descriptor key consists of a table number and a page number in segment. The B-tree structure of page descriptors is very similar to structure of indexes. For interaction with these structures a common set of programs is used.
The B-tree root of page descriptors is placed into a fixed segment page which goes next after the page occupancy scale in the segment.
Table page is destined for tuples storage and consists of four parts: heading, records pointers region, free region and records region. There are four types of records: ordinary tuple, displaced tuple, indirect reference (to a tuple), identifier of a transaction (which removed a tuple). Record type is placed in the first byte of each record.
A tuple is placed in one page only. The unique tuple identifier (tid) consists of the segment page number - N (2 bytes) and a tuple identifier in a page - i (2 bytes). Tid identifies a tuple within the given segment and occupies 4 bytes. Internal tuple identifier (within the page) is the tuple pointer number. This pointer stores in the same page. It is important that i is not an offset in bytes of tuple pointer from the beginning of the page, but a number (beginning from zero) of pointers region element containing a pointer to a given tuple or to an indirect reference to this tuple, if the tuple was displaced to another page, or to transaction identifier which removed the tuple, if the tuple was removed. Tuple tid may not be modified during the tuple life.
All records in a page occupy a continuous region in the end of the page and there are no gaps between them.
Records about an indirect reference and transaction identifier have fixed sizes of 5 bytes each. A tuple may have an arbitrary size but not less than 5 bytes. If the real tuple size is less then the size is automatically set to 5 bytes and the last bytes of the record are considered as not essential.
A page header is placed in the beginning of the page. Records pointers region follows the page header. A records pointers region element has a fix size (2 bytes) and contains a record pointer (an offset of start of record from the beginning of page) or zero code.
Nonzero pointers to records are placed in pointers region inversely to arrangement of these records, that is in descending order of the pointers values (records offsets from the beginning of the page).
The first record on which the first nonzero pointer points is placed in the end of a page. Its size is calculates as a difference between the page size and pointer content (record beginning offset from the page beginning). Size of any other record is calculated as a difference between content of previous nonzero pointer and content of pointer to given record.
The last nonzero pointer is placed in the last element of the pointers region and contains offset of the records region beginning from the page beginning. The region between the pointers region and the records region is the free page storage. Its location (initial address) and size are not fixed.
If size of an already existing record is increasing, then all abovelying records shift up (to lesser addresses). If the place is not enough for the tuple size increase, the tuple is displaced to another page with “displaced tuple” record type and in the given page, instead of the tuple, an indirect reference record is made which contains the tid of a new tuple location.
If size of already existing record is decreasing, then all above lying records sgift down. Pointers of shifting records are corrected accordingly.
At insertion of new record time (it is always a tuple) the first null element in pointers region is to be found. If such element is absent a new element is created from the first two bytes of free region. Thus a location for record is defined and all above lying records (indicated by the down lying pointers) are shifted up for the size of the inserted record.
At tuple deletion the tuple is replaced by a record 5 bytes long which contains record type “IDTR” (1 byte) and identifier of the transaction which removed the tuple (4 bytes). This is made in order that at rollback of transaction which removed the tuple this tuple would get same tid which it already had. If the size of the removed tuple is more than 5 bytes its size is decreased down to 5 bytes and all the above lying records are shifted down and pointers of the shifting records are adjusted accordingly.
In all above mentioned shifts addresses of beginning of shifted region coincide with the address of beginning of records region (which is equal to content of the last nonzero pointer). The end of shifted region is defined by beginning of updated record which is decreased or increased, but at insertion of a new tuple it is defined by beginning of record before which the tuple is inserted and on which the last nonzero pointer points above selected pointer to the inserted record.
A tuple consists of two parts: a scale and columns values. A scale occupies an integer of bytes, minimum 1. The 1-st byte of a tuple contains record type: “CORT” or “CREM”. “CORT” record type means that tuple tid points to pointer of the tuple itself. “CREM” record type means the tuple was displaced from another page, tuple tid points to pointer of indirect reference to the displaced tuple but not to pointer of the tuple itself. A displaced tuple will be read through an indirect reference as tid points to its pointer.
The 7-th bit of every byte of a scale contains “0” if given byte is not the last scale byte, and “1” if given byte is the last scale byte. The table descriptor indicates n which is the number of the first table columns which cannot be indefinite. For remaining columns beginning from (n+1)-st the scale contains “bit of definition”: “0” - if the column value is not defined, “1” - if the column value is defined. NULL-values don’t occupy place in a tuple.
Dependence on the type there are two columns classes: column of a fixed size (its size stores in the table descriptor while the tuple stores only the column value) and column of a variable size (a value and its size store in a tuple).
Column values together with sizes are stored in a tuple one after another without gaps. If NULL-value becomes defined or variable length value size is increased a tuple is extended. If a value becomes NULL-value or variable value size is decreased a tuple is decreased. Accordingly the whole page records region above the given tuple is extended or decreased.
Memory in a table page is being released in result of running of a tuple deletion operation during transaction. If transaction has deleted a tuple then the tid of this tuple may not be occupied till the fixation of this transaction. To solve the problem the memory occupied by the tuple is partially released during the tuple deletion process. There remains not a large cut (of 5 bytes), where an unique in time transaction identifier is recorded (i.e. the identifier of deletion transaction). The pointer, the address of which is determined by the tid of the tuple being deleted, contains reference to this shortened record. The size of the released cut is being added to the free memory size which is contained in a descriptor of the given page (including still occupied 5 bytes). Thus the page descriptor contains not an exact current information about the size of the free memory on a page but the information which will become accurate after fixation of the transaction removed tuples.
Accordingly, page descriptors and information within pages are being used during memory allocation for a tuple insertion. At first, of course, it’s necessary to be guided by a page descriptor. Once a page was selected which presumably has enough free memory, information within the page is analysed. If it was found out that a page hasn’t enough free memory it means that some part of the memory is occupied by transactions’ identifiers. Then it’s necessary to check whether they had finished or not, and for all finished transactions to make cleaning (to set pointers to null and to release data memory). If after that there is still a lack of the memory on the page, then it’s necessary to try to make use of a descriptor of another page.
The transaction carrying out an insertion can find out at cleaning time that a page or a set of pages were released entirely. In this case that very transaction produces the return of the released pages to the pool of free pages of a segment. If it’s necessary to rollback transaction which removed a tuple from a page, then it is guaranteed that tuple tid is not occupied by another tuple. If memory in the page isn’t enough for the repeated accommodation of the tuple itself (because another transaction has inserted a tuple on this page), then the recovery tuple is to be placed on some other page, on which an indirect reference is set to (the tid of recovery tuple is not altered). The indirect reference is placed within 5 bytes which were occupied before that by transaction identifier.
It should be noted that a displaced tuple differs from ordinary tuples (their tids indicate the page on which the tuple is being stored) to avoid duplicate reading of the displaced tuples during successive of the table.
SMDT support buffer pool (buffers) – shared memory segments. The main role of the pool – reducing the number of changes with the external memory by the keeping copies of segment pages in operation memory. That means the buffer is program organized cache for DB account. The buffer consists the pages only. The special process – buffer is responsible for buffer pool management. From the other process point of view, buffer offered account to the DB pages in external memory. It realized buffer changes in the pool and executes exchanges with the external memory throw the file system instruments. Also buffer responsible for the synchronization of the pages exchange with the external memory. The buffer supports double-phase protocol of page locks. The information about all current locks saves in special table – ptid table.
The information about current buffer state, including external page address, modification indicator and buffer status. Status may be one of the three the following:
Every buffer is a separated segment, its size is equal to page size. Every buffer has its own definer. Definers form massive in the local buffer field. It includes next information:
The main definer attitude is finding the buffer to replace it contents with a new page. If a page is locked the last field is not used.
In order to check the page presence of any page in buffers the ptid table is used. The ptid consists the information:
The last two fields have a deal with a page definer table structure. In order to guarantee fast account by the full page number to the table hash method is used to table organization.
In order to execute the fixation correctly, the buffer should control the number of operations is executing and prevent new operations beginning.
The finished operation number counter is supported by the buffer. It is equal to null after the fixation and enlarges on 1 after command “End Operation” executing, which comes from transaction processes. The full number of refresh operations in the system after the last fixation come from the Logical Journal as a parameter of the command “Fixation”. When the counters meaning coincides (that means all operations were finished) real buffer fixation begins. The Micro-Journal process clears micro- journal.
The next command group has a deal with the requires and locks:
Pay attention that the part of commands ( 'end operation', 'PULL BUFFER', 'Tact beginning', 'Opt number') doesn't send answer message.
The low level memory manager is modifiying now in order to increase performance. The basic ideas described here