WL#4739: Physical Structure of Server

Affects: Server-9.x — Status: Assigned


In order to simplify working with the server code, both as a developer and when
packaging distributions, we want to have a physical structure of the code that
makes the code easy to work with.

We are also aiming at a long-term solution that will allow the server code base
to grow significantly, without making the code unmaintainable.

In order to support this, we need a structure that supports:

- Clear and simple conventions for creating structure for the code

- Easy adding and removing of features to allow:

  - Features to be added late with a minimal risk of ripple effects into
    unrelated parts of the code. Such effects can be introduced by merges
    causing unintended code changes, as well as by logical dependencies that
    are not clear.

  - It shall be easy to remove features, should it be necessary for some reason
   (that might not strictly be technical).

- Having a structure that allows the creation of various distribution packages
  from the same source, such as:

  - Client development distributions for application programmers

  - Storage engine development distributions for storage engine writers

  - Plug-in development distributions for plug-in writers (who may or may not
    be storage engine developers)

- Having a structure that supports working with the code using scripts to
  perform common tasks, like building special distributions, release testing,
  and packaging.


Working practice
================

There are some working practices that we need to support in this structure. These
practices are central to how we work with the code and not supporting them will
introduce severe problems for developers.

- Bug fixes and features are introduced as a sequence of patches, where each
  patch is a change to one or more files.

- A single patch should not cause a build failure and the server should
  still pass all tests.  If a bug fix or feature requires several
  patches, each patch should still leave the server in a stable state
  in the sense that it should still build and still pass all tests.

- A patch should not require unwarranted changes in other packages. We
  should discourage practices that may require a developer to make
  changes in other packages than the one that he/she is working on.

  Forcing a developer to make changes in code that he/she is not familiar
  with, however small the changes are, increases the risk of introducing
  bugs and may go against design principles originally intended for a
  component or package.

- A patch is normally targeted for a single package only: features
  affecting several packages should be split into separate patches,
  committed in the right order, and preferably pushed together (but
  this is not a requirement).


Notes
=====

- This worklog needs to be split up into several worklogs, at least:

  - one for the actual design (this one),

  - one for implementing the build frame (WL#4875)

  - one for fixing the current include file header mess (WL#4877)


Continuing work
===============

- In order to not stall the change of the structure for too long, it is
  necessary to set a bar for when the code should be changed. If that is not
  done, we will have to maintain two structures in parallel, which will not
  offer any improvements to the development practice and will instead solidify
  the current situation.

WL#4875: Server Build Frame
WL#4877: Fix server header files
WL#5030: Split and remove mysql_priv.h

Open Issues
===========

- What names shall we use for the packages? We already have storage/, and
  server/ and client/ (which already exists) have been suggested.

Resolved issues
===============

- Shall each package have a unique prefix for the files? Also consider the
  exported header files.

  The reason for having different prefixes for header files is to be able to
  separate header files with same names in different packages when including
  them. However, by using the package directory name as prefix, a header file
  prefix is not needed. It would be either:

     #include "pkg_table.h"

  or

     #include "pkg/table.h"

  The reason for using prefixes for source files would be that linkers have
  problems distinguishing between files with the same name, but some tests
  indicate that is not the case on some common platforms (Linux and Solaris).

  In short, there seems to be no good reason to use file prefixes together
  with a package structure.

- Shall a dynamically loadable module be a separate package or not?

  There might be reasons why a loadable component may consist of several
  packages, so we should not require that each loadable component is a package.

Decisions
=========

2009-02-26: We agreed on going for approach 2 when handling header files.
            The basis was later questioned and clarification of the document
            was asked for.

2009-05-27: It was agreed that we should not impose a structure on the
            packages from the build system and represent meta-data for a
            package separately (typically as a manifest or configuration file).
            Structure might still be mandated by coding styles and/or practical
            issues.


High-level structure
====================

We envision that the system consists of a number of *packages* that together
make up the code of the system. In order to build the server, and associated
components, we have a *build frame* (or just *frame*) that is used to manage
and, especially, build the system.

In order to support the easy addition and removal of features, we assume that
each feature is contained in a separate package (see below) and a minimum of
changes shall be required (preferably none) to code outside this package to
introduce the feature. To support this convention, the build frame has to be
independent of the number and type of packages that are available, and use
generic methods for deciding what packages are to be included in the build. 
This in turn requires the packages to provide the necessary information so that
the build frame can do its job.


Components
==========

A component consists of a set of header files and a set of associated
C/C++ files. The component is the smallest unit of the physical
design.

Typically, each component consists of a header file and a C/C++ file
with a common base name, for example "parser.h" and
"parser.cc". However, there are some cases where it makes sense to
have multiple header files for a component and cases when it makes
sense to have multiple source files.

- Several header files can be used to present multiple
  interfaces into a single component.

- Several source files could be mandated when the linker is
  file-based, and will just map symbols on file-level (loading/linking
  entire files, not individual functions).

In these cases, the files of each component shall have a common
prefix distinguishable from other components.


    =========== =================================================
    Component   Files
    =========== =================================================
    rpl_filter  rpl_filter.h rpl_filter.cc
    reg_main    reg_main_internal.h reg_main_public.h reg_main.cc
    =========== =================================================


Packages
========

Packages are collections of components that serve a common purpose.  This
formulation is deliberately not exact since what actually makes sense to turn
into a package varies from case to case. However, the following issues should be
considered when deciding whether a candidate package makes sense as a package:

- Can the candidate package be released independently of the rest of the
  server? If not, i.e., changes to this package are likely to require changes
  to other packages, then maybe it should not be a package.

  Releasing here does *not* mean distributing the code in isolation, it means
  releasing, e.g., a new version of the package for use with the rest of the
  server.

- Is the candidate package very small, e.g., a single component? In this case
  it might make sense to group several such candidate packages with similar
  purpose into a single package.

  A typical example would be support for individual character sets: it does
  not make sense to place each in a package of its own, but a single
  package of "character set information" is sensible.


Package naming and structure
----------------------------

Each package is represented as a directory. The basic assumption is
that everything related to a package should be placed in the
directory. This includes, but is not limited to: header files, source
files, and unit tests.

Basic goals and assumptions are:

- Changes in the package internals should not inadvertently affect
  other packages that use the package

- It shall be possible to support third-party solutions as packages in
  the package structure and shall not require re-organization to fit
  the package structure

The package directories will be placed in a *subsystem directory* alongside the
``sql/`` directory. Apart from that, all package directories are placed at the
same level. We are placing the packages in a new subsystem directory instead of
re-using ``sql/`` to be able to easily distinguish between "unorganized" and
"organized" code.

The following subsystem directories are proposed (some directories
already exist and almost have the basic structure proposed):

    ========== ==============================================
    Directory  Purpose
    ========== ==============================================
    storage/   Storage engines
    server/    Server modules
    common/    Common utilities
    ========== ==============================================

There are some other directories that are being considered, such as
``mysys/``, and the above list will be extended as needed.

Package names shall be small letters only, with underscore to separate
individual words in the package name. Note that the package name may
not start with an underscore.  This choice of name is used to allow
the package name to be used both as a file name, a C/C++ symbol, and
as identifier in other tools (such as Doxygen).

Examples: registry, query_model
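
As a sketch of how a checking tool could enforce this rule, the function below
accepts exactly the names described above. The function name
``is_valid_package_name`` is hypothetical. The rule as stated does not say
whether trailing or repeated underscores are allowed, so the sketch does not
reject them.

```cpp
#include <string>

// Hypothetical helper: returns true if 'name' consists only of lower-case
// ASCII letters and underscores and does not start with an underscore.
inline bool is_valid_package_name(const std::string &name) {
  if (name.empty() || name.front() == '_') return false;
  for (const char c : name) {
    const bool lower = (c >= 'a' && c <= 'z');
    if (!lower && c != '_') return false;
  }
  return true;
}
```

With this sketch, "registry" and "query_model" are accepted, while a name such
as "_registry" or "QueryModel" is rejected.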


File names
~~~~~~~~~~

The choice of, and restrictions on, file names are governed by the current coding
style.

The coding style takes into account operating system restrictions and
restrictions imposed by tools such as the compiler, linker, and other processing
tools. However, the physical structure itself does not impose any special
requirements on the file names.


Package namespace
~~~~~~~~~~~~~~~~~

All symbols of a package shall be placed in a single namespace, and the
namespace name shall be the same as the name of the package.  Since package
names as specified above are legal C/C++ symbol names, this will always be possible.
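
A minimal sketch of this rule, assuming a hypothetical package called
``query_model``. The class ``Table_ref`` and the function ``same_table`` are
invented for illustration and are not part of the server.

```cpp
#include <string>
#include <utility>

// Hypothetical package "query_model": all of its symbols live in a namespace
// with the same name as the package directory.
namespace query_model {

class Table_ref {
 public:
  explicit Table_ref(std::string name) : m_name(std::move(name)) {}
  const std::string &name() const { return m_name; }

 private:
  std::string m_name;
};

// Free functions of the package go into the same namespace.
inline bool same_table(const Table_ref &a, const Table_ref &b) {
  return a.name() == b.name();
}

}  // namespace query_model
```

Code in other packages would then refer to these symbols with the package name
as qualifier, for example ``query_model::Table_ref``.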


Package interfaces
------------------

For each package, there is a set of interfaces into the package. Each interface
is represented physically as a header file, meaning that each package has one
or more interfaces, but may also have header files that are not package
interfaces.

The package owner shall be able to decide what header files are available for
users of the package. Initially we will not be able to do this for practical
reasons since it requires the build frame to support that. Instead, we will
assume that every header file in a package is available as a package interface.


Interface usage
~~~~~~~~~~~~~~~

In order to use an interface of a package, the header file is included using the
form:

    #include "package_name/interface.h"

The include path is set up by the build system so that this is possible. Note
that it is an error to include a file that is neither a package interface nor a
header file of the same package.  Ideally, the build frame will not allow this,
but until that feature is implemented in the build frame, it will be possible
to do by mistake.

Header files of the same package are included using the form:

    #include "header.h"


Coding requirements
===================

This section outlines some basic rules that are meant to avoid common problems
associated with developing for a package structure as well as allowing
tool-support for checking and manipulating components and packages.  Tool
support is necessary to allow the system to grow, since manually resolving
issues will unnecessarily waste effort on maintaining inconsistencies.

The aim is to keep the rules to a bare minimum and specifically only consider
issues that (potentially can) traverse package boundaries or that cause problems
when maintaining or operating the build frame.  Issues of what is "good coding
style" are maintained separately and are not part of this worklog.  This is done to
restrict the scope of the worklog and be able to close it.


Every header file should be self-sufficient
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For every header file "header.h", the following program shall compile without
errors:

    #include "header.h"

The reason is that when using a header file "header.h", it should be sufficient
to include "header.h" holding the functionality sought after. If it is necessary
to include any other files before "header.h" because there are definitions
required by "header.h", we have two problems:

1. It is hard to find out what dependencies are needed, and it will eventually
   lead to a trial and error approach that we are now seeing.

2. If the dependencies change, the file might include more files than
   necessary.
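
As an illustration, the contents of a self-sufficient header could look like
the sketch below. The file name "db_filter.h" and the class ``Db_filter`` are
hypothetical. The point is only that the header includes everything its own
declarations use.

```cpp
// Contents of a hypothetical header "db_filter.h". It uses std::string,
// std::vector and std::find, so it includes the corresponding standard
// headers itself instead of relying on the including file to have done so.
#include <algorithm>
#include <string>
#include <vector>

class Db_filter {
 public:
  void add_ignored_db(const std::string &db) { m_ignored.push_back(db); }

  bool is_ignored(const std::string &db) const {
    return std::find(m_ignored.begin(), m_ignored.end(), db) !=
           m_ignored.end();
  }

 private:
  std::vector<std::string> m_ignored;
};
```

A source file containing nothing but an #include of this header compiles,
which is the property the rule asks for.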


Every header file should have an include guard
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For a header file "header.h" in package "package", the include guard should have
the name PACKAGE_HEADER_INCLUDED. We choose to standardize the include guard so
that we can use external include guards if the need should arise. We omit the
extension from the name, since header files may have a number of different
extensions and we do not want to standardize on any one of them.

Existing include guards that are not violating the standard will not be changed
initially, but developers are encouraged to make the change if they are changing
the header file.
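
Since the guard name follows mechanically from the package and file name, a
checking tool could derive it. The sketch below shows one way to do that. The
function ``include_guard_name`` is hypothetical and assumes that the extension
is everything from the last dot in the file name.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper: derive the guard name PACKAGE_HEADER_INCLUDED for a
// header file in a package. The extension is dropped, as described above.
inline std::string include_guard_name(const std::string &package,
                                      const std::string &header_file) {
  assert(!package.empty() && !header_file.empty());
  // Strip the extension; if there is no dot, the whole name is kept.
  const std::string base = header_file.substr(0, header_file.rfind('.'));
  std::string guard = package + "_" + base + "_INCLUDED";
  for (char &c : guard)
    c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
  return guard;
}
```

For "header.h" in package "package" this yields PACKAGE_HEADER_INCLUDED, and
the same name results for "header.hh" or "header.hpp".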


Source and header files should only include definitions it needs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For header files, it is critical to use forward declarations when that suffices.
The problems with including definitions that are not needed are twofold:

1. It introduces additional dependencies that are not necessary since
   definitions contain references to stuff that *it* needs. Note that
   dependencies may not only be on header files, but that unintended symbols
   may be pulled into the system.

2. It unnecessarily increases the compile time since it requires opening *at
   least* one more file (but usually several).  This problem is, however,
   secondary.
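
A minimal sketch of a header that gets by with a forward declaration. The
names ``storage::Handler`` and ``server::Table_share`` are used here purely
for illustration and do not describe the actual server classes.

```cpp
// Contents of a hypothetical header "table_share.h" in a package "server".
// Only a pointer to storage::Handler is stored, so a forward declaration
// suffices and the header defining Handler need not be included here.
namespace storage {
class Handler;  // forward declaration, not a definition
}

namespace server {

class Table_share {
 public:
  explicit Table_share(storage::Handler *handler) : m_handler(handler) {}
  bool has_handler() const { return m_handler != nullptr; }
  storage::Handler *handler() const { return m_handler; }

 private:
  storage::Handler *m_handler;  // pointer to an incomplete type is allowed
};

}  // namespace server
```

Only the source files that actually call member functions of the handler need
to include the header that defines it.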


There shall be no convenience include files inside the server
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Convenience include files are include files whose only purpose is to bundle
other include files.

The reason why we want to avoid this *inside* the server is that it
introduces unnecessary dependencies between packages (recall that dependencies
between components are represented as an #include directive). Should some
include file be added to the convenience include file because *one* component
needs it, *all* components that include this convenience include will be
affected. To avoid introducing unnecessary dependencies in this way we could:

1. Have a rule stating that a convenience include may only hold includes that
   are used by *all* components including this convenience include. This adds
   an additional burden on developers wanting to add an include to the
   convenience include to locate each user of the convenience include and
   decide if they need it. Since the includers of the convenience include are
   not easily visible in the file, it means searching all packages.

   Furthermore: with this approach it can be expected that over time, the set
   of includes in the file will shrink and the purpose of having a convenience
   include will diminish.

2. Have a rule stating that convenience includes shall not be used, which
   requires all necessary include files to be mentioned. This is a minor
   problem from a development perspective and makes dependencies between
   components explicit, hence clear.

We chose the latter.

However, convenience include files that serve a purpose for maintaining
interfaces *into* the server are accepted (for example, to make it easier to
work with the client interface). For these files it is, however, critical that
they are pure convenience includes and do not contain separate definitions.


No ``using`` directives (``using namespace``)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Placing using directives at namespace level in header files will force any file
that includes the header file to resolve symbols in a namespace they have no
control over. This can lead to subtle and hard to find bugs, and should
therefore not be used.

Placing ``using`` directives at namespace level in source files will inject all
symbols of that namespace (say ``pkga``) into another namespace (say ``pkgb``).
If changes are made to ``pkga``, they may conflict with definitions in ``pkgb``
and since a developer has to ensure the system builds for each patch, he/she
would be forced to make changes in ``pkgb`` despite the fact that the change
itself is localized to ``pkga``.
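
A sketch of the alternative, using the hypothetical packages ``pkga`` and
``pkgb`` from the text: names from the other package are written with an
explicit qualifier, so symbols added to ``pkga`` later cannot change what an
unqualified name in ``pkgb`` refers to. The functions are invented for
illustration.

```cpp
#include <string>

namespace pkga {
inline std::string describe() { return "pkga"; }
}  // namespace pkga

namespace pkgb {

inline std::string describe() { return "pkgb"; }

// No "using namespace pkga;" here. The call into the other package is
// qualified, so it is visible in the source which package is used.
inline std::string report() { return describe() + " uses " + pkga::describe(); }

}  // namespace pkgb
```

Here the unqualified call always means ``pkgb::describe``, and the dependency
on ``pkga`` is visible at the point of use.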


No ``using`` declarations before #include directives
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Placing a ``using`` declaration before an #include directive makes the declared
names visible to all code in the included file, which can change how names in
that file are resolved. It should therefore not be used.


Entities declared in a component shall be defined in the component
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

An object or function declared in the header file of a component shall be
defined in the same component (usually in an implementation file). The reason
for this rule is that it shall be easy to know which components need to be
linked in order to use the component. If some definition is in another file, it
will be hard to find and manage the right dependencies between components in the
system.


No gratuitous link-time dependencies between components
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Such dependencies can occur if a component, for example, declares an ``extern``
variable and does not include the proper header file. All dependencies shall be
explicit in the sense that they shall be visible in the file as an ``#include``
directive. This will allow dependencies to be detected and tracked automatically.
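
The sketch below puts three hypothetical files into one listing to show the
intended pattern. The component ``thread_pool`` and all names in it are
invented for illustration. The variable is declared once in the header of the
component that defines it, and a user includes that header instead of
repeating the ``extern`` declaration.

```cpp
// --- thread_pool.h: hypothetical header of the component "thread_pool" ---
namespace server {
extern int thread_pool_size;  // declared once, by the owning component
int thread_pool_capacity();
}  // namespace server

// --- thread_pool.cc: the definitions live in the same component ---
namespace server {
int thread_pool_size = 8;
int thread_pool_capacity() { return thread_pool_size * 2; }
}  // namespace server

// --- scheduler.cc: a user of the component ---
// In a real source file this part would start with
//   #include "server/thread_pool.h"
// rather than with its own "extern int thread_pool_size;".
namespace server {
inline int scheduler_slots() { return thread_pool_capacity() + 1; }
}  // namespace server
```

With the #include in place, a tool scanning "scheduler.cc" can see the
dependency on ``thread_pool`` without having to analyze symbols.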


#######################################
Appendix A. Definitions and discussions
#######################################


Packages
========

Packages are collections of components organized as a cohesive unit (that is,
serve a common purpose).

Each package has one or more (exported) interfaces, which are represented by one
or more header files.

Defining files of a package
---------------------------

In order to define what files are part of a package, there are basically two
options: either supply a file for each package that lists the files of that
package, or put all the files of a package into a subdirectory.

The advantages of using the file system to define packages by putting all the
files of a package in a separate directory suggests that this approach should be
used.


Interfaces into a package
-------------------------

Each package has one or more interfaces represented physically as one or more
header files. The header files contain objects and definitions necessary to
interface with the package, so we have no restrictions on the structure (but do
have some recommendations for how to structure the interfaces for maintainability).

Each interface is normally defined with some strategic objective, i.e., it is
created for an intended set of users. We use *export target* to denote such a
set of users of an interface. Library functions usually have only one export
target, but many of our packages have several export targets such as "client
developers" who write application clients to the server, "storage engine
developers" who are creating a storage engine for the server, and "plugin
developers" who are writing a plug-in for the server.

For each export target, we should ensure that the header files holding the
interfaces are defined in such a manner that only the parts needed by that export
target are included when that header file is included. Gratuitous definitions are
a problem since they might clash with the names defined by the user, and also
introduce an unnecessary dependency on parts of the server that the user does
not in reality depend on.

In short, interfaces into packages are represented as one or more header files,
and we have two basic methods to identify the interface files: by naming
convention (for example, placing the interface file in a separate directory) or
by using one or more configuration files that explicitly list the interface
files.


Using naming conventions
~~~~~~~~~~~~~~~~~~~~~~~~

For this discussion, we assume that the exported interface header files are put
into the export/ directory.  However, the same arguments apply to other schemes
for using naming conventions.

Note that each header file might correspond to a source file placed in the main
package directory, like this:

      goobar/
        export/
          goo_interface.h
        goo_impl.cc
           .
           .
           .

The advantage of this approach is:

- Simplicity: normal file commands can be used to work with files. For
  example, copying all files needed by a plugin SDK could be as simple as:

   cp package/export/*.h /distro/include

The disadvantages are:

- Changing the status of a file from, e.g., internal to public requires
  moving the file and not all version control systems support that well.

- Having multiple "export targets" (users of the interface) requires separate
  directories. For example, a package could export an interface for
  third-party users and one towards the rest of the server packages.


Configuration file
~~~~~~~~~~~~~~~~~~

We somehow add extra configuration file(s) in the package to denote if the
header file is exported. For this approach, we have two alternatives:

a) Add a file parallel to the header file, e.g., the fact that "foo.h.export"
   exists could mean that the header file "foo.h" is an exported file.

b) We introduce a "manifest" file for each package, containing information
   about the files in the package.

The advantages of this approach are [incomplete list]:

- Changing the properties of a file (e.g., from "internal" to "exported") does
  not require any changes to the file itself.

- It allows header files to be marked with other properties, such as header
  files that are supposed to be exported to third-party developers.

The disadvantages are:

- Working with files is not trivial, e.g., copying all header files that go
  into the plugin SDK could be:

     cp `grep plugin-sdk package/manifest | cut -f1` /distro/include
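
The format of the manifest file is not defined in this worklog. As a sketch,
assume one line per file with the file name first, then a tab, then a
space-separated list of export targets (the tab matches the default field
separator of ``cut`` in the command above). A tool in the build frame could
then select the files for one export target as follows. The function name
``files_for_target`` is hypothetical.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: return the files in 'manifest' that are tagged with
// the export target 'target'.
// Assumed line format: "<file>\t<target> <target> ..."
inline std::vector<std::string> files_for_target(const std::string &manifest,
                                                 const std::string &target) {
  std::vector<std::string> result;
  std::istringstream lines(manifest);
  std::string line;
  while (std::getline(lines, line)) {
    const std::string::size_type tab = line.find('\t');
    if (tab == std::string::npos) continue;  // skip blank or malformed lines
    std::istringstream tags(line.substr(tab + 1));
    std::string tag;
    while (tags >> tag) {
      if (tag == target) {
        result.push_back(line.substr(0, tab));
        break;
      }
    }
  }
  return result;
}
```

Unlike the ``grep`` command, this sketch compares whole target names, so a
file name that happens to contain the text "plugin-sdk" is not selected by
accident.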


Include file and path management
--------------------------------

In order to manage the include path and the include files, it is necessary to
ensure that all the header files that are exported are available for every
package in the system, and *only* those files.

To handle this, we basically have two approaches:

1. We have an include path containing the directory where the exported header
   files for each package are stored.  This requires the header files to be
   placed in a dedicated "export/" directory inside the package: otherwise,
   all header files of a package will be exportable, which is not the intention.

   So, for example, the include path could be set to

       pkg_a/export;pkg_b/export;pkg_c/export

   Whenever a package is added or removed, this would mean that the include
   path would have to be updated to match the actual packages available.

   The advantages of this approach are:

   - Simple model

   - No need to generate or copy files

   The disadvantages of this approach are:

   - If a package is added or removed, the include path has to be updated.
     Since every package depends on the include path, it might trigger a
     re-build.

   - If a header file with the same name is in multiple places, it will not be
     detected.

   - In most cases, the source control systems will generate a conflict for the
     addition and removal of a directory to the include path in the build file
     (e.g., configure.ac or Makefile.am).

2. We have a dedicated include directory for, e.g., the server where exported
   headers are available, and let the manifest file contain information on what
   files are to be made available in the central include directory.

   This would mean that the path stays the same regardless of what packages are
   available.

   For this approach, we have two "sub-approaches" on how to make the header
   files available from the include directory:

   a) Copy the files to the dedicated include directory

   b) Generate a header file holding only an #include directive referring to
      the correct header file.

   The advantages of this approach are:

   - There is no need to maintain an extensive include path to be able to
     compile a package (which might have dependencies on other packages).

   - Package maintenance is very easy. For example, adding a package does not
     require changing any include paths or anything at all in the build frame.

   - Conflicting header files will be detected during the build process (e.g.,
     when copying header files to the include directory).

   The disadvantages are:

   - Requires more work in the build frame.

   - It requires a "staging" phase, where header files are made available in
     the dedicated include directory, either by copying or generating files.

   - In the copy approach (2a), it is necessary to build a dependencies
     Makefile for the include directory, to trigger a copy whenever the
     original header file changes.

   - In the copy approach (2a), it is possible that a developer starts editing
     the wrong file, which will then be overwritten at some later point, which
     will be hard to discover.
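
For sub-approach 2b, the generated file is very small. The sketch below shows
what a build frame could write into the dedicated include directory. The
function ``forwarding_header`` and the parameter ``package_root`` (the path
from the include directory to the directory holding the packages) are
assumptions made for illustration and are not part of any existing build
frame.

```cpp
#include <string>

// Hypothetical helper for approach 2b: returns the text of a generated
// header that only forwards to the real header in the package directory.
inline std::string forwarding_header(const std::string &package_root,
                                     const std::string &package,
                                     const std::string &header) {
  return "/* Generated file, do not edit. */\n"
         "#include \"" + package_root + "/" + package + "/" + header + "\"\n";
}
```

Because the generated file holds no definitions of its own, editing it by
mistake loses little, which avoids the last disadvantage listed for the copy
approach (2a).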

Implementation
==============

In order to implement the structure described in the high-level specification,
we should approach it in well-contained steps that lead us to the goal. For
example, since we need to develop a build frame for supporting this, we need an
intermediate solution that does not cause problems for the final deployment of
the build structure and allows developers to work on creating packages without
introducing problems for the build frame.


Stage 1: Create the directory structure
---------------------------------------

Introduce the package directories and move the existing packages we have into
that directory. At this stage we will keep the existing autotools-based build
system and just do the minimal changes necessary to have a fully functional system.

We assume that the original "unstructured" sql/ code is dependent on the
packages, but that we have control over the dependencies between packages in the
"structured" directories.

In order to add a package, it will be necessary to:

- Create a Makefile.am for the package

- Add a reference under "SUBDIRS" in the parent directory

Note that in this stage, all header files in a package will be available as
package interface files, so care should be used when including header files from
other packages.

After this stage, developers will be able to create packages properly without
affecting the following stages.


Stage 2: Evaluate and optionally change to use CMake
----------------------------------------------------

It has been discussed if we should use CMake to build the server on all
platforms and not just Windows, since it seems to be a portable alternative.
However, concerns have been raised about the portability of CMake to the
platforms that we need to support, so this alternative needs to be evaluated
before implementation starts.

The goal is to have a build system that is as simple as the existing
one, which also includes being able to handle the system for defining pluggable
storage engines.

If the evaluation does not show problems with doing the switch, the replacement
should be done in two steps, the first being to just switch the build system
while otherwise maintaining the structure and build order of the old system.

We do this step separately since it will require merging the build process on
Windows with the existing autotools-based build frame and still maintain the
same functionality.

After this stage, we will have a single build frame for all platforms, but there
will still be problems such as that package interfaces are not distinguished
from other header files.


Stage 3: Streamline and consolidate build frame
-----------------------------------------------

At this stage, the build frame will be consolidated by ensuring that there is
support for easily working with the code.