Charset Migration Tool

Running the Charset Migration Tool

In this section:

WebFOCUS Hyperstage Charset Migration Tool Parameters

CHMT requires a text file containing a mapping between collations used for conversion.

chmt --help // help message

Executing CHMT:

chmt --datadir=/absolute/path/to/data/directory [other parameters]

Top of page

WebFOCUS Hyperstage Charset Migration Tool Parameters

Parameter

Type

Description

Details

datadir

Mandatory

Absolute path to data directory.

conv-map

Optional

Absolute path to file with collations conversions.

If not specified, CHMT would try to use file: chmt-binary-folder/.../support-files/collations.txt; if not found there, it would search for: chmt-binary-folder/collations.txt

database

Optional

Name of database for migrating.

If specified, tables from no other databases would be migrated.

table

Optional

Name of table for migrating.

If specified, database must be also specified; no other tables but specified will be migrated.

log-file

Optional

Absolute path to output log file.

If not specified, logs will be printed to the console.

Parameter	Type	Description	Details
datadir	Mandatory	Absolute path to data directory.
conv-map	Optional	Absolute path to file with collations conversions.	If not specified, CHMT would try to use file: chmt-binary-folder/.../support-files/collations.txt; if not found there, it would search for: chmt-binary-folder/collations.txt
database	Optional	Name of database for migrating.	If specified, tables from no other databases would be migrated.
table	Optional	Name of table for migrating.	If specified, database must be also specified; no other tables but specified will be migrated.
log-file	Optional	Absolute path to output log file.	If not specified, logs will be printed to the console.

Log Structure

The logs detail information about every considered table found in a specified datadir. Each conversion finishes with [NOT NEEDED], [PASS] or [FAILED] status.

Collations-Conversion-File Structure

Each conversion directive is stored in one line of file:

collation_from_name;collation_from_id;collation_to_name;collation_to_id

For example:

big5_chinese_ci;1;binary;63

where both fields containing names are only informative (all conversions will be done using only IDs).

Example collations-conversion-file (with conversion directives described above) can be obtained by running the following SQL:

use information_schema; 
select a.collation_name a_n, a.id a_id, b.collation_name b_n, b.id b_id from information_schema.collations a, information_schema.collations b, character_sets c where substr(a.collation_name, 1,  locate('_',a.collation_name)-1)=c.character_set_name and substr(a.collation_name, 1, locate('_',a.collation_name)) = substr(b.collation_name, 1, locate('_',b.collation_name)) and b.collation_name like '%bin' and c.maxlen=1 UNION select a.collation_name a_n, a.id a_id, 'binary' b_n, 63 b_id from information_schema.collations a, character_sets c where ((substr(a.collation_name, 1,  locate('_',a.collation_name)-1)=c.character_set_name) or (locate('_',a.collation_name)=0)) and (c.maxlen>1 or c.character_set_name='binary') order by a_id into outfile '/some/path/my_collations.txt' fields terminated by ';';

WebFOCUS Hyperstage DomainExpert

In this section:

The Hyperstage DomainExpert improves data compression and the performance of import, queries, and export. The DomainExpert allows you to define the composition of data, particularly columns. The database then uses this information to optimize the storage of the data and to reduce query processing time.

DomainExpert metadata is maintained in the system tables of the database sys_hyperstage and should be managed only with the use of stored procedures.

Top of page

Decomposition Rules

Decomposition rules are the main DomainExpert objects. Each rule describes the composition structure of values of a selected column expressed in a simple language. You can create, modify and delete rules using the following stored procedures from the system database sys_hyperstage:

CREATE_RULE(id, rule, comment)
UPDATE_RULE(id, rule)
CHANGE_RULE_COMMENT(id, comment)
DELETE_RULE(id)

where:

id: Is a unique identifier or name of a rule.
rule: Defines the structure of values.
comment: Is a free description of the rule.

For example, to create a simple rule for email addresses, you would run the following command:

CALL SYS_HYPERSTAGE.CREATE_RULE('EMAIL', '%s@%s', 'Rule for email addresses');

The rules are stored in the system table DECOMPOSITION_DICTIONARY. The list of all rules defined in the system can be obtained with the following query:

SELECT * FROM SYS_HYPERSTAGE.DECOMPOSITION_DICTIONARY;

Top of page

Decomposition Rules Language

The language to define the structure of values accepts three types of primitives:

Non-negative integer number, denoted as %d.
Arbitrary character sequence, denoted as %s.
Literal, a sequence of characters that are to be matched exactly.

Examples:

%d.%d.%d.%d: Decomposes an IP address (4-byte version) in four 1-byte numerical components.
%s@%s: Decomposes an email address into the user name and the domain name.
%s://%s?%s: Decomposes a simple URL with a query string into the scheme, the address, and the query string.

As the percent sign (%) is a special character, to match it literally you can use a double percent sign (%%). For example, to match exactly the text 10% humidity, the rule can be defined as 10%% humidity. However, the percent sign only has a special meaning if it is followed by the letter s or d. Otherwise, the percent sign has the literal meaning, so in the above example the unmodified text 10% humidity is also a correct syntax of the exact rule.

There are two constraints on the rule syntax-the following ambiguous subsequences of symbols are not allowed in rules:

```
%s%s
```
```
%d&d
```

The matching algorithm for rules is LAZY. The algorithm moves to the next primitive in the rule as soon as possible. For example, for the text aa.bb.cc and the rule %s.%s, the first %s is matched to aa and the second %s is matched to bb.cc. However, if the most lazy approach fails, the algorithm searches back until the correct match is found or all the cases are traced. For example, for the text aa.bb.11 and the rule %s.%d, the string %s is matched to aa.bb and the number %d is matched to 11.

The current language is a simple, limited language that will be replaced with a much more powerful language in the future. The current language does not support the following regular expression constructs (these will be added in future releases):

Grouping. For example, (%s.%s).%s@(%d%s).%s
Type classes. For example, [%s|%d]@%s
Repetition. For example, %s{5,10}
Optional inclusion. For example, (%s.)?%d. This currently matches (string.)?1000 whereas it might more reasonably match string.1000 and 1000.
Sub-expressions
Word boundaries
Back-references. For example, each group has a reference, $1 for the match of the first group, $2 for the match of the second group, and so on.

Building recursive rules using the following operations is also not yet available:

Concatenation: r1r2 where r1 and r2 are any pair of already defined rules, matches any value that is concatenation of any pair of values, with v1 matching r1 and v2 matching r2.
Union (alternative): r1|r2 matches each value that matches one of r1 and r2.
Closure: r* matches each value which is any repetition of any value matching r.

Top of page

Predefined IPv4 Rule

Besides user-defined rules, Hyperstage provides a built-in rule that is not expressible in the above language. This is the IPv4 rule that is defined and added to the DomainExpert metadata at installation. IPv4 converts the text representation of an IP address into a single 32-bit number as used in network hardware and low-level network handling software.

If you have data with IP addresses, this allows you to compare the performance of the predefined IPv4 with IP decompositions expressible in the language. For example, with the rule %d.%d.%d.%d.

Top of page

Other Predefined Rules

The following predefined rules are provided with the DomainExpert.

Rule ID

Rule Content

Comments

EMAIL

%d.%d.%d.%d

Similar to IPv4 but uses generic numeric compression.

EMAIL

%s@%s

Username/domain split of an email address.

URL

%s:://%s?%s

Protocol, domain, and query parameters based rule.

Rule ID	Rule Content	Comments
EMAIL	%d.%d.%d.%d	Similar to IPv4 but uses generic numeric compression.
EMAIL	%s@%s	Username/domain split of an email address.
URL	%s:://%s?%s	Protocol, domain, and query parameters based rule.

These rules can be improved if the user data matches more specific criteria (for example, the domain always contains a suffix, such as .com). Using specific criteria may improve both the compression ratio and the response time. If you want to use more specific rules, create new rules (instead of replacing the predefined ones).

Assigning Rules to Columns

The stored procedure SET_DECOMPOSITION_RULE(database, table, column, id) from the database sys_hyperstage supports the assignment of rules to particular columns from the Hyperstage tables. For example, to apply the predefined IPv4 rule to column IP in the table CONNECTION from the database NETWORK, run the following command:

CALL SYS_HYPERSTAGE.SET_DECOMPOSITION_RULE('NETWORK', 'CONNECTION', 'IP', 'IPv4');

The decomposition rules can be applied only to columns of string types that are not lookup columns:

CHAR
VARCHAR
BINARY
VARBINARY
TEXT
TINYTEXT

Assigning a rule to a column of another type or to a lookup column is ignored.

You cannot set multiple rules on the same column. If the SET_DECOMPOSITION_RULE procedure is called for a column with an already assigned rule, the previous rule is replaced with the new rule.

To see the current decomposition rules for a particular table, use the SHOW_DECOMPOSITION procedure. For example:

CALL SYS_HYPERSTAGE.SHOW_DECOMPOSITION('NETWORK', 'CONNECTION');

If a rule is assigned to a column, you cannot change or delete the rule from the DECOMPOSITION_DICTIONARY system table.

Applying Rules to Data

After decomposition rules are assigned to columns, the rules are automatically applied to any new data coming to the tables containing these rules when using the following standard DML commands:

LOAD DATA
INSERT
UPDATE

If a rule is assigned to a column, instead of storing whole values, each value inserted into the column is decomposed into the parts matching the subsequent occurrences of %s and %d in the rule and the parts are compressed and stored in separate subcollections. Each subcollection corresponds to one occurrence of %s or %d in the rule.

A value inserted into a column with a decomposition defined does not have to match the rule. Such non-matching values are inserted into a separate subcollection. This subcollection of outliers is compressed and stored independently of other subcollections.

You can obtain the accuracy of decomposition rules by setting the ControlMessages parameter in the brighthouse.ini file to 2 (or higher):

ControlMessages = 2

If the parameter is set on each COMMIT for each column, Hyperstage reports the number of outliers among the committed values (from INSERTs and LOADs). For example:

2011-05-25 16:59:03 Decomposition of ./network/connection.ip left 15 outliers.

Hyperstage also reports the change in the number of outliers for the updated values (from UPDATEs), for example:

2011-05-25 16:59:46 The number of outliers increased by 2 after update(-s)
on network.connection.ip
2011-05-25 17:00:03 The number of outliers reduced by 3 after update(-s) 
on network.connection.ip

Note: Applying a decomposition rule DOES NOT always result in better compression ratio and time. A decomposition rule may result in a worse compression ratio or load and slower queries. To ensure decomposition improves performance. You can compare load time, compression ratio, and query time when loading the same data to a table with a decomposition rule defined and to a table without decomposition.

Modifying a Rule for an Existing Column

A rule for a column can be changed or deleted during the life of the table using the following stored procedures:

SET_DECOMPOSITION_RULE(database, table, column, id)
DELETE_DECOMPOSITION_RULE(database, table, column)

The change applies only to new data. The old data remains decomposed with the previously used rules. If the rule for a column is deleted, new values are stored without decomposition.

If a value is updated to a new value with an UPDATE command, then for the new value Hyperstage uses the original rule used to decompose the old value. The currently assigned rules are not used for UPDATEs.