See5: An Informal Tutorial

Welcome to See5, a system that extracts informative patterns from data. The following sections show how to prepare data files for See5 and illustrate the options for using the system.


Preparing Data for See5

We will illustrate See5 using a medical application -- mining a database of thyroid assays from the Garvan Institute of Medical Research, Sydney, to construct diagnostic rules for hypothyroidism. Each case concerns a single referral and contains information on the source of the referral, assays requested, patient data, and referring physician's comments. Here are two examples:

        Attribute                 Case 1    Case 2    .....

	age                       41        23
	sex                       F         F    
	on thyroxine              f         f    
	query on thyroxine        f         f
	on antithyroid medication f         f
	sick                      f         f    
	pregnant                  f         f    
	thyroid surgery           f         f    
	I131 treatment            f         f    
	query hypothyroid         f         f    
	query hyperthyroid        f         f    
	lithium                   f         f    
	tumor                     f         f    
	goitre                    f         f    
	hypopituitary             f         f    
	psych                     f         f    
	TSH                       1.3       4.1  
	T3                        2.5       2
	TT4                       125       102
	T4U                       1.14      unknown
	FTI                       109       unknown
	referral source           SVHC      other
	diagnosis                 negative  negative
	ID                        3733      1442

This is exactly the sort of task for which See5 was designed. Each case belongs to one of a small number of mutually exclusive classes (negative, primary, secondary, compensated). Properties of every case that may be relevant to its class are provided (although some cases may have unknown values for some attributes). There are 24 attributes in this example, but See5 can deal with any number of attributes.

See5's job is to find how to predict a case's class from the values of the other attributes. See5 does this by constructing a classifier that makes this prediction. As we will see, See5 can construct classifiers expressed as decision trees or as sets of rules.

Application filestem

Every See5 application has a short name called a filestem; we will use the filestem hypothyroid for this illustration. All files read or written by See5 for an application have names of the form filestem.extension, where filestem identifies the application and extension describes the contents of the file. The case of letters in both the filestem and extension is important -- file names APP.DATA, app.data, and App.Data are all different. It is important that the extensions are written exactly as shown below, otherwise See5 will not recognize the files for your application.

Names file

Two files are essential for all See5 applications and there are three further optional files, each identified by its extension. The first essential file is the names file (e.g. hypothyroid.names) that describes the attributes and classes. There are two important subgroups of attributes: explicitly-defined attributes, whose values are given directly in the data for each case, and implicitly-defined attributes, whose values are computed by a formula from the values of other attributes.

The file hypothyroid.names looks like this:

	diagnosis.			| the target attribute

	age:				continuous.
	sex:				M, F.
	on thyroxine:			f, t.
	query on thyroxine:		f, t.
	on antithyroid medication:	f, t.
	sick:				f, t.
	pregnant:			f, t.
	thyroid surgery:		f, t.
	I131 treatment:			f, t.
	query hypothyroid:		f, t.
	query hyperthyroid:		f, t.
	lithium:			f, t.
	tumor:				f, t.
	goitre:				f, t.
	hypopituitary:			f, t.
	psych:				f, t.
	TSH:				continuous.
	T3:				continuous.
	TT4:				continuous.
	T4U:				continuous.
	FTI:=				TT4 / T4U.
	referral source:		WEST, STMW, SVHC, SVI, SVHD, other.

	diagnosis:			primary, compensated, secondary, negative.

	ID:				label.
Whitespace (blank lines, spaces, and tab characters) is ignored except inside a name or value and can be used to improve legibility. The vertical bar `|' can appear anywhere in a file except inside a name or value: it causes the remainder of the line to be ignored and is handy for including comments.

The first line of the names file gives the classes, either by naming a discrete attribute (the target attribute) that contains the class value (as in this example), or by listing them explicitly. The attributes are then defined in the order that they will be given for each case.

The name of each explicitly-defined attribute is followed by a colon `:' and a description of the values taken by the attribute. There are six possibilities:

continuous
The attribute takes numeric values.
date
The attribute's values are dates in the form YYYY/MM/DD, e.g. 1999/09/30. Valid dates range from the year 1601 to the year 4000 (perhaps optimistically assuming that humanity survives as long as that!).
a comma-separated list of names
The attribute takes discrete values, and these are the allowable values. The values may be prefaced by [ordered] to indicate that they are given in a meaningful ordering, otherwise they will be taken as unordered. For instance, the values low, medium, high are ordered, while meat, poultry, fish, vegetables are not. If the attribute values have a natural order, it is better to declare them as ordered so that this information can be exploited by See5.
discrete N for some integer N
The attribute has discrete, unordered values, but the values are assembled from the data itself; N is the maximum number of such values. (This is not recommended, since the data cannot be checked, but it can be handy for unordered discrete attributes with many values.) NB: This form cannot be used for the target attribute.
ignore
The values of the attribute should be ignored.
label
This attribute contains an identifying label for each case, such as an account number or an order code. The value of the attribute is ignored when classifiers are constructed, but is used when referring to individual cases. A label attribute can make it easier to locate errors in the data and to cross-reference results to individual cases. If there are two or more label attributes, only the last is used.

The name of each implicitly-defined attribute is followed by `:=' and then a formula defining the attribute value. The formula is written in the usual way, using parentheses where needed, and may refer to any attribute defined before this one. Constants in the formula can be numbers (written in decimal notation), dates, and discrete attribute values (enclosed in string quotes `"'). The operators that can be used in formulas include the usual arithmetic operators (`+', `-', `*', `/', and `%' for modulus), comparisons (such as `=', `<', `<=', `>', and `>='), and the logical connectives `and' and `or'.

The value of such an attribute is either continuous or true/false depending on the formula. For example, the attribute FTI above is continuous, since its value is obtained by dividing one number by another. The value of a hypothetical attribute such as
	strange := referral source = "WEST" or age > 40.
would be either t or f since the value given by the formula is either true or false.

Dates are stored by See5 as the number of days since a particular starting point (1600/03/01) so some operations on dates make sense. Thus, if we have attributes

        d1: date.
        d2: date.
we could define
         interval := d2 - d1.
	 gap := d1 <= d2 - 7.
	 d1-day-of-week := (d1 + 1) % 7 + 1.
interval then represents the number of days from d1 to d2 (non-inclusive) and gap would have a true/false value signaling whether d1 is at least a week before d2. The last definition is a slightly non-obvious way of determining the day of the week on which d1 falls, with values ranging from 1 (Monday) to 7 (Sunday).
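
As an aside, the intent of these three definitions can be mirrored in ordinary Python (this fragment is not part of See5), assuming d1 and d2 are calendar dates; Python's isoweekday() happens to use the same Monday=1 to Sunday=7 convention:

        from datetime import date, timedelta

        # Illustrative sketch only, not See5 code: mirror the three
        # implicitly-defined date attributes above for two sample dates.
        d1 = date(1999, 9, 23)
        d2 = date(1999, 9, 30)

        interval = (d2 - d1).days                # days from d1 to d2
        gap = d1 <= d2 - timedelta(days=7)       # is d1 at least a week before d2?
        d1_day_of_week = d1.isoweekday()         # 1 (Monday) .. 7 (Sunday)

        print(interval, gap, d1_day_of_week)     # 7 True 4 (a Thursday)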

Finally, if the value of the formula cannot be determined for a particular case because one or more of the attributes appearing in the formula have unknown values, the value of the implicitly-defined attribute is also unknown.

Data file

The second essential file, the application's data file (e.g. hypothyroid.data) provides information on the training cases from which See5 will extract patterns. The entry for each case consists of one or more lines that give the values for all explicitly-defined attributes. If the classes are listed in the first line of the names file, the attribute values are followed by the case's class value. If an attribute value is not known, it is replaced by a question mark `?'. Values are separated by commas and the entry is optionally terminated by a period. Once again, anything on a line after a vertical bar `|' is ignored. (If the information for a case occupies more than one line, make sure that the line breaks occur after commas.)

The first two cases from file hypothyroid.data are:

	41,F,f,f,f,f,f,f,f,f,f,f,f,f,f,f,1.3,2.5,125,1.14,SVHC,negative,3733
	23,F,f,f,f,f,f,f,f,f,f,f,f,f,f,f,4.1,2,102,?,other,negative,1442
Don't forget the commas between values! If you leave them out, See5 will not be able to process your data. Notice that the cases do not contain values for the attribute FTI whose values are computed from other attribute values.
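
If you ever need to read a .data file from your own code, something along the following lines will do. This is only an illustrative Python sketch, not See5's own reader: it drops `|' comments, treats `?' as an unknown value, and strips an optional final period (entries that span several lines are not handled).

        import csv

        # Illustrative sketch, not part of See5: load the cases from a .data file.
        def read_cases(path):
            cases = []
            with open(path) as f:
                lines = (line.split("|")[0] for line in f)      # ignore comments
                for row in csv.reader(lines):
                    if not row:
                        continue                                # skip blank lines
                    values = [v.strip() for v in row]
                    if values[-1].endswith("."):                # optional final period
                        values[-1] = values[-1][:-1]
                    cases.append([None if v == "?" else v for v in values])
            return cases

        cases = read_cases("hypothyroid.data")
        print(cases[1])    # the second case above, with None in place of `?'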

Test and cases files (optional)

Of course, the value of predictive patterns lies in their ability to make accurate predictions! It is difficult to judge the accuracy of a classifier by measuring how well it does on the cases used in its construction; the performance of the classifier on new cases is much more informative. (For instance, any number of gurus tell us about patterns that `explain' the rise/fall behavior of the stock market in the past. Even though these patterns may appear plausible, they are only valuable to the extent that they make useful predictions about future rises and falls.)

The third kind of file used by See5 consists of new test cases (e.g. hypothyroid.test) on which the classifier can be evaluated. This file is optional and, if used, has exactly the same format as the data file.

Another optional file, the cases file (e.g. hypothyroid.cases), differs from a test file only in allowing the cases' classes to be unknown. The cases file is used primarily with the cross-referencing procedure and public source code, both of which are described later on.

Costs file (optional)

The last kind of file, the costs file (e.g. hypothyroid.costs), is also optional and sets out differential misclassification costs. In some applications there is a much higher penalty for certain types of mistakes. In this application, a prediction that hypothyroidism is not present could be very costly if in fact it is. On the other hand, predicting incorrectly that a patient is hypothyroid may be a less serious error. See5 allows different misclassification costs to be associated with each combination of real class and predicted class. We will return to this topic near the end of the tutorial.

User Interface

It is difficult to see what is going on in an interface without actually using it. As a simple illustration, here is the main window of See5 after the hypothyroid application has been selected.

See5 main window

The main window of See5 has six buttons on its toolbar. From left to right, they are

Locate Data
invokes a browser to find the files for your application, or to change the current application;
Construct Classifier
selects the type of classifier to be constructed and sets other options;
Stop
interrupts the classifier-generating process;
Review Output
re-displays the output from the last classifier construction (if any);
Use Classifier
interactively applies the current classifier to one or more cases; and
Cross-Reference
maps between the training data and classifiers constructed from it.
These functions can also be initiated from the File menu.

The Edit menu facilitates changes to the names and costs files after an application's files have been located. On-line help is available through the Help menu.

Constructing Classifiers

Once the names, data, and optional files have been set up, everything is ready to use See5.

The first step is to locate the data using the Locate Data button on the toolbar (or the corresponding selection from the File menu). We will assume that the hypothyroid data above has been located in this manner.

There are several options that affect the type of classifier that See5 produces and the way that it is constructed.

The Construct Classifier button on the toolbar (or selection from the File menu) displays a dialog box that sets out these classifier construction options:

Main dialog box

In this section we will examine them in turn, starting with the simpler situations.

Decision trees

When See5 is invoked with the default values of all options, it constructs a decision tree and generates output like this:
	See5 [Release 1.11]  	Tue Jul 27 16:38:51 1999
	-------------------

	Class specified by attribute `diagnosis'

	Read 2772 cases (24 attributes) from hypothyroid.data

	Decision tree:

	TSH <= 6: negative (2472/2)
	TSH > 6:
	:...FTI > 65:
	    :...on thyroxine = t: negative (37.7)
	    :   on thyroxine = f:
	    :   :...thyroid surgery = t: negative (6.8)
	    :       thyroid surgery = f:
	    :       :...TT4 > 153: negative (6/0.1)
	    :           TT4 <= 153:
	    :           :...TT4 <= 37: primary (2.5/0.2)
	    :               TT4 > 37: compensated (174.6/24.8)
	    FTI <= 65:
	    :...thyroid surgery = t:
	        :...TT4 <= 48: negative (2)
	        :   TT4 > 48: primary (2.2/0.2)
	        thyroid surgery = f:
	        :...TT4 <= 61: primary (51/3.7)
	            TT4 > 61:
	            :...referral source in {WEST,SVHD}: primary (0)
	                referral source = STMW: primary (0.1)
	                referral source = SVHC: primary (1)
	                referral source = SVI: primary (3.8/0.8)
	                referral source = other:
	                :...TSH > 22: primary (5.8/0.8)
	                    TSH <= 22:
	                    :...T3 <= 2.3: compensated (3.4/0.9)
	                        T3 > 2.3: negative (3/0.2)


	Evaluation on training data (2772 cases):

		    Decision Tree   
		  ----------------  
		  Size      Errors  

		    16    6( 0.2%)   <<


		   (a)   (b)   (c)   (d)    <-classified as
		  ----  ----  ----  ----
		    60     3                (a): class primary
		         154                (b): class compensated
		                       2    (c): class secondary
		           1        2552    (d): class negative


	Evaluation on test data (1000 cases):

		    Decision Tree   
		  ----------------  
		  Size      Errors  

		    16    4( 0.4%)   <<


		   (a)   (b)   (c)   (d)    <-classified as
		  ----  ----  ----  ----
		    31     1                (a): class primary
		     1    39                (b): class compensated
		                            (c): class secondary
		           2         926    (d): class negative


	Time: 0.3 secs
(Since hardware platforms can differ in floating point precision and rounding, the output that you see might differ very slightly from the above.)

The first line identifies the version of See5 and the run date. See5 constructs a decision tree from the 2772 training cases in the file hypothyroid.data, and this appears next. Although it may not look much like a tree, this output can be paraphrased as:

	if TSH is less than or equal to 6 then negative
	else
	if TSH is greater than 6 then
	    if FTI is greater than 65 then
	        if on thyroxine equals t then negative
		else
	        if on thyroxine equals f then
	            if thyroid surgery equals t then negative
		    else
	            if thyroid surgery equals f then
	                if TT4 is greater than 153 then negative
			else
	                if TT4 is less than or equal to 153 then
	                    if TT4 is less than or equal to 37 then primary
			    else
			    if TT4 is greater than 37 then compensated
	    else
	    if FTI is less than or equal to 65 then
	    . . . .
and so on. The tree employs a case's attribute values to map it to a leaf designating one of the classes. Every leaf of the tree is followed by a cryptic (n) or (n/m). For instance, the last leaf of the decision tree is negative (3.0/0.2), for which n is 3.0 and m is 0.2. The value of n is the number of cases in the file hypothyroid.data that are mapped to this leaf, and m (if it appears) is the number of them that are classified incorrectly by the leaf. (A non-integral number of cases can arise because, when the value of an attribute in the tree is not known, See5 splits the case and sends a fraction down each branch.)
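
The splitting of cases with unknown values follows the general C4.5 idea of weighting each branch by the fraction of known cases that took it. A rough Python sketch of that bookkeeping (the exact details inside See5 may differ) is:

        # Illustrative sketch of fractional case weights: a case whose value for
        # the tested attribute is unknown is sent down every branch, with its
        # weight divided in proportion to the known cases on each branch.
        def split_weight(case_weight, branch_counts):
            total = sum(branch_counts.values())
            return {branch: case_weight * count / total
                    for branch, count in branch_counts.items()}

        # e.g. a case with unknown TSH at the root of the tree above:
        print(split_weight(1.0, {"TSH <= 6": 2472, "TSH > 6": 300}))
        # {'TSH <= 6': 0.8917..., 'TSH > 6': 0.1082...}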

The last section of the See5 output concerns the evaluation of the decision tree, first on the cases in hypothyroid.data from which it was constructed, and then on the new cases in hypothyroid.test. The size of the tree is its number of leaves and the column headed Errors shows the number and percentage of cases misclassified. The tree, with 16 leaves, misclassifies 6 of the 2772 given cases, an error rate of 0.2%. Performance on these cases is further analyzed in a confusion matrix that pinpoints the kinds of errors made. In this example, the decision tree misclassifies three of the primary cases as compensated, both secondary cases as negative, and one negative case as compensated.

A very simple majority classifier predicts that every new case belongs to the most common class in the training data. In this example, 2553 of the 2772 training cases belong to class negative so that a majority classifier would always opt for negative. The 1000 test cases from file hypothyroid.test include 928 belonging to class negative, so a simple majority classifier would have an error rate of 7.2%. The decision tree has a lower error rate of 0.4% on the new cases, but notice that this is higher than its error rate on the training cases. The confusion matrix for the test cases again shows the detailed breakdown of correct and incorrect classifications.

Discrete value subsets

The default way in which See5 constructs tests on discrete attributes is to associate a separate branch with each value for which cases are available. Tests with a high fan-out can have the undesirable side-effect of fragmenting the data during construction of the decision tree. See5 has a Subset option that can mitigate this fragmentation to some extent: attribute values are grouped into subsets and each subtree is associated with a subset rather than with a single value.

In the hypothyroid example, invoking this option has little effect on the tree, merely simplifying the later part to

                referral source in {WEST,STMW,SVHC,SVI,SVHD}: primary (4.9/0.8)
                referral source = other:
                :...TSH > 22: primary (5.8/0.8)
                    TSH <= 22:
                    :...T3 <= 2.3: compensated (3.4/0.9)
                        T3 > 2.3: negative (3/0.2)
with no effect on classification performance. However, this option can be worth trying if your application has numerous discrete attributes with three or more values.

Rulesets

Decision trees can sometimes be very difficult to understand. An important feature of See5 is its mechanism to convert trees into collections of rules called rulesets. The Rulesets option causes rules to be derived from trees produced as above, giving the following rules:

	Rule 1: (31, lift 42.7)
	    	thyroid surgery = f
	    	TSH > 6
	    	TT4 <= 37
		->  class primary  [0.970]

	Rule 2: (63/6, lift 39.3)
	    	TSH > 6
	    	FTI <= 65
		->  class primary  [0.892]

	Rule 3: (141/1, lift 17.7)
	    	on thyroxine = f
	    	thyroid surgery = f
	    	TSH > 6
	    	TT4 <= 153
	    	FTI > 65
		->  class compensated  [0.986]

	Rule 4: (114/24, lift 14.1)
	    	thyroid surgery = f
	    	TSH > 6
	    	TSH <= 22
	    	T3 <= 2.3
	    	TT4 > 61
		->  class compensated  [0.784]

	Rule 5: (2225/2, lift 1.1)
	    	TSH <= 6
		->  class negative  [0.999]

	Rule 6: (296, lift 1.1)
	    	on thyroxine = t
	    	FTI > 65
		->  class negative  [0.997]

	Rule 7: (240, lift 1.1)
	    	TT4 > 153
		->  class negative  [0.996]

	Rule 8: (531/16, lift 1.1)
	    	TSH <= 22
	    	T3 > 2.3
	    	TT4 > 61
		->  class negative  [0.968]

	Rule 9: (38/2, lift 1.0)
	    	thyroid surgery = t
		->  class negative  [0.925]

	Default class: negative

Each rule consists of:

A rule number -- this is quite arbitrary and serves only to identify the rule.
Statistics (n, lift x) or (n/m, lift x) that summarize the performance of the rule. Here n is the number of training cases covered by the rule and m, if it appears, shows how many of them do not belong to the class predicted by the rule. The rule's accuracy is estimated by the Laplace ratio (n-m+1)/(n+2); the lift x is this estimated accuracy divided by the relative frequency of the predicted class in the training data. (For Rule 1 above, the estimate is (31+1)/(31+2) = 0.970 and, since only 63 of the 2772 training cases belong to class primary, the lift is 0.970/(63/2772) = 42.7.)
One or more conditions that must all be satisfied if the rule is to be applicable.
A class predicted by the rule.
A value between 0 and 1, shown in square brackets, that indicates the confidence with which this prediction is made.

When a ruleset like this is used to classify a case, it may happen that several of the rules are applicable (that is, all their conditions are satisfied). If the applicable rules predict different classes, there is an implicit conflict that could be resolved in two ways: we could believe the rule with the highest confidence, or we could attempt to aggregate the rules' predictions to reach a verdict. See5 adopts the latter strategy -- each applicable rule votes for its predicted class with a voting weight equal to its confidence value, the votes are totted up, and the class with the highest total vote is chosen as the final prediction. There is also a default class, here negative, that is used when none of the rules apply.
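
To make the voting scheme concrete, here is a small Python sketch. It uses a made-up representation of a rule as a list of condition functions, a predicted class, and a confidence; this is not See5's internal form.

        # Illustrative sketch of ruleset voting: every applicable rule votes for
        # its class with weight equal to its confidence, the class with the
        # largest total wins, and the default class is used when no rule applies.
        def classify(case, rules, default_class):
            votes = {}
            for conditions, predicted, confidence in rules:
                if all(test(case) for test in conditions):
                    votes[predicted] = votes.get(predicted, 0.0) + confidence
            return max(votes, key=votes.get) if votes else default_class

        # Rules 5 and 8 above, for a case described by a dictionary of values:
        rules = [
            ([lambda c: c["TSH"] <= 6], "negative", 0.999),
            ([lambda c: c["TSH"] <= 22, lambda c: c["T3"] > 2.3,
              lambda c: c["TT4"] > 61], "negative", 0.968),
        ]
        print(classify({"TSH": 1.3, "T3": 2.5, "TT4": 125}, rules, "negative"))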

Rulesets are generally much simpler to understand than trees since each rule describes a specific context associated with a class. Furthermore, a ruleset generated from a tree usually has fewer rules than the tree has leaves, another plus for comprehensibility. (In this example, the first decision tree with 16 leaves is reduced to nine rules.) Finally, rules are often more accurate predictors than decision trees -- a point not illustrated here, since both have an error rate of 0.4% on the test cases. For very large datasets, however, generating rules with the Ruleset option can require considerably more computer time.

In the example above, rules are ordered by class and sub-ordered by confidence. This is the default, but an alternative ordering by contribution to predictive accuracy can be selected using the Sort by utility option. Under this option, the rule that most reduces the error rate appears first and the rule that contributes least appears last. Furthermore, results are reported in a selected number of bands so that the predictive accuracies of the more important subsets of rules are also estimated. For example, if the Sort by utility option with 5 bands is selected, the hypothyroid rules are reordered as

	Rule 1: (2225/2, lift 1.1)
	    	TSH <= 6
		->  class negative  [0.999]

	Rule 2: (141/1, lift 17.7)
	    	on thyroxine = f
	    	thyroid surgery = f
	    	TSH > 6
	    	TT4 <= 153
	    	FTI > 65
		->  class compensated  [0.986]

	Rule 3: (63/6, lift 39.3)
	    	TSH > 6
	    	FTI <= 65
		->  class primary  [0.892]

	Rule 4: (296, lift 1.1)
	    	on thyroxine = t
	    	FTI > 65
		->  class negative  [0.997]

	Rule 5: (240, lift 1.1)
	    	TT4 > 153
		->  class negative  [0.996]

	Rule 6: (114/24, lift 14.1)
	    	thyroid surgery = f
	    	TSH > 6
	    	TSH <= 22
	    	T3 <= 2.3
	    	TT4 > 61
		->  class compensated  [0.784]

	Rule 7: (38/2, lift 1.0)
	    	thyroid surgery = t
		->  class negative  [0.925]

	Rule 8: (31, lift 42.7)
	    	thyroid surgery = f
	    	TSH > 6
	    	TT4 <= 37
		->  class primary  [0.970]

	Rule 9: (531/16, lift 1.1)
	    	TSH <= 22
	    	T3 > 2.3
	    	TT4 > 61
		->  class negative  [0.968]
and a further summary is generated for both training and test cases. Here is the output for test cases:
	Rule utility summary:

		Rules	      Errors
		-----	      ------
		1-2	   39( 3.9%)
		1-4	    9( 0.9%)
		1-5	    9( 0.9%)
		1-7	    5( 0.5%)
This shows that the error rate on the test cases is 3.9% when only the first 1/5th of the rules are used, dropping to 0.9% when the first 2/5ths of the rules are used, and so on.

Boosting

Another innovation incorporated in See5 is adaptive boosting, based on the work of Rob Schapire and Yoav Freund. The idea is to generate several classifiers (either decision trees or rulesets) rather than just one. When a new case is to be classified, each classifier votes for its predicted class and the votes are counted to determine the final class.

But how can we generate several classifiers from a single dataset? As the first step, a single decision tree or ruleset is constructed as before from the training data (e.g. hypothyroid.data). This classifier will usually make mistakes on some cases in the data; the first decision tree, for instance, gives the wrong class for 6 cases in hypothyroid.data. When the second classifier is constructed, more attention is paid to these cases in an attempt to get them right. As a consequence, the second classifier will generally be different from the first. It also will make errors on some cases, and these become the focus of attention during construction of the third classifier. This process continues for a pre-determined number of iterations.

The Boost option with x trials instructs See5 to construct up to x classifiers in this manner. Naturally, constructing multiple classifiers requires more computation than building a single classifier -- but the effort can pay dividends! Trials over numerous datasets, large and small, show that on average 10-classifier boosting reduces the error rate for test cases by about 25%.
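
For the curious, the reweighting idea can be sketched in Python as follows. This follows the published AdaBoost.M1 scheme of Freund and Schapire; See5's own boosting procedure is a variant whose details are not reproduced here, and build_classifier and classify are hypothetical stand-ins for tree or ruleset construction and use.

        import math

        # Illustrative AdaBoost.M1-style sketch, not See5's exact algorithm.
        def boost(cases, labels, trials, build_classifier, classify):
            n = len(cases)
            weights = [1.0 / n] * n
            ensemble = []                              # (classifier, vote weight)
            for _ in range(trials):
                clf = build_classifier(cases, labels, weights)
                wrong = [classify(clf, c) != y for c, y in zip(cases, labels)]
                err = sum(w for w, bad in zip(weights, wrong) if bad)
                if err == 0 or err >= 0.5:             # nothing (or too much) to fix
                    break
                beta = err / (1 - err)
                # correctly classified cases are down-weighted, so the errors
                # receive relatively more attention on the next trial
                weights = [w * (beta if not bad else 1.0)
                           for w, bad in zip(weights, wrong)]
                total = sum(weights)
                weights = [w / total for w in weights]
                ensemble.append((clf, math.log(1 / beta)))
            return ensemble

        def predict(ensemble, case, classify):
            votes = {}
            for clf, vote in ensemble:
                cls = classify(clf, case)
                votes[cls] = votes.get(cls, 0.0) + vote
            return max(votes, key=votes.get)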

Selecting the Rulesets option and the Boost option with 10 trials causes ten rulesets to be generated. The summary of the rulesets' individual and aggregated performance on the 1000 test cases is:

	Trial	    Decision Tree           Rules     
	-----	  ----------------    ----------------
		  Size      Errors      No      Errors

	   0	    16    4( 0.4%)       9    4( 0.4%)
	   1	    27   31( 3.1%)      19   20( 2.0%)
	   2	    48   21( 2.1%)      21    3( 0.3%)
	   3	    30   31( 3.1%)      19   19( 1.9%)
	   4	    35   16( 1.6%)      18    4( 0.4%)
	   5	    38   25( 2.5%)      22   21( 2.1%)
	   6	    28   13( 1.3%)      22    8( 0.8%)
	   7	    37   13( 1.3%)      20   13( 1.3%)
	   8	    52   18( 1.8%)      20    7( 0.7%)
	   9	    46   28( 2.8%)      28   14( 1.4%)
	boost	          6( 0.6%)            3( 0.3%)   <<

(Again, different floating point hardware can lead to slightly different results.) The performance of the classifier constructed at each iteration or trial is summarized on a separate line, while the line labeled boost shows the result of voting all the classifiers. The tree and ruleset constructed on Trial 0 are identical to those produced without the Boost option. Some of the subsequent trees and rulesets produced by paying more attention to certain cases have quite high overall error rates. When the ten rulesets are combined by voting, however, the final predictions have an error rate of 0.3% on the test cases.

Softening thresholds

The top of our initial decision tree tests whether the value of the attribute TSH is less than or equal to, or greater than, 6. If the former holds, we go no further and predict that the case's class is negative, while if it does not we look at other information before making a decision. Thresholds like this are sharp by default, so that a case with a hypothetical value of 5.99 for TSH is treated quite differently from one with a value of 6.01.

For some domains, this sudden change is quite appropriate -- for instance, there are hard-and-fast cutoffs for bands of the income tax table. For other applications, though, it is more reasonable to expect classification decisions to change more slowly with changes in attribute values.

See5 contains an option to `soften' thresholds such as 6 above. When this is invoked, each threshold is broken into three ranges -- let us denote them by a lower bound lb, an upper bound ub, and a central value t. If the attribute value in question is below lb or above ub, classification is carried out using the single branch corresponding to the `<=' or `>' result respectively. If the value lies between lb and ub, both branches of the tree are investigated and the results combined probabilistically. The values of lb and ub are determined by See5 based on an analysis of the apparent sensitivity of classification to small changes in the threshold. They need not be symmetric -- a fuzzy threshold can be relatively hard in one direction.
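
The tutorial does not spell out exactly how the two results are combined, but a simple way to picture it is a weight for each branch that changes smoothly between lb and ub. The Python fragment below is only that picture (assumed linear interpolation), not See5's actual computation.

        # Illustrative sketch only: assumed linear interpolation between lb and ub.
        def branch_weights(value, lb, ub):
            """Weights given to the `<= t' and `> t' subtrees."""
            if value <= lb:
                return 1.0, 0.0
            if value >= ub:
                return 0.0, 1.0
            p_gt = (value - lb) / (ub - lb)
            return 1.0 - p_gt, p_gt

        # For the softened root test shown below, lb = 6 and ub = 6.1:
        print(branch_weights(6.03, 6.0, 6.1))     # approximately (0.7, 0.3)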

Invoking the Fuzzy thresholds option gives the following decision tree:

	TSH <= 6 (6.05): negative (2472/2)
	TSH >= 6.1 (6.05):
	:...FTI >= 65.7 (65.35):
	    :...on thyroxine = t: negative (37.7)
	    :   on thyroxine = f:
	    :   :...thyroid surgery = t: negative (6.8)
	    :       thyroid surgery = f:
	    :       :...TT4 >= 158 (153): negative (6/0.1)
	    :           TT4 <= 148 (153):
	    :           :...TT4 <= 31 (37.5): primary (2.5/0.2)
	    :               TT4 >= 44 (37.5): compensated (174.6/24.8)
	    FTI <= 65 (65.35):
	    :...thyroid surgery = t:
	        :...TT4 <= 36 (49.5): negative (2)
	        :   TT4 >= 63 (49.5): primary (2.2/0.2)
	        thyroid surgery = f:
	        :...TT4 <= 60 (61.5): primary (51/3.7)
	            TT4 >= 63 (61.5):
	            :...referral source in {WEST,SVHD}: primary (0)
	                referral source = STMW: primary (0.1)
	                referral source = SVHC: primary (1)
	                referral source = SVI: primary (3.8/0.8)
	                referral source = other:
	                :...TSH >= 44 (22.5): primary (5.8/0.8)
	                    TSH <= 19 (22.5):
	                    :...T3 <= 2.3 (2.35): compensated (3.4/0.9)
	                        T3 >= 2.4 (2.35): negative (3/0.2)
Each threshold is now of the form <= lb (t) or >= ub (t). In this example, most of the thresholds are still relatively tight, but notice the asymmetric threshold values for the last test of TSH. Soft thresholds do not affect accuracy on this test set.

A final point: soft thresholds affect only decision tree classifiers -- they do not change the interpretation of rulesets.

Additional options

Two further options enable aspects of the classifier-generation process to be tweaked. These are best regarded as advanced options that should be used sparingly (if at all), so that this section can be skipped without much loss.

See5 constructs decision trees in two phases. A large tree is first grown to fit the data closely and is then `pruned' by removing parts that are predicted to have a relatively high error rate. The Pruning CF option affects the way that error rates are estimated and hence the severity of pruning; values smaller than the default (25%) cause more of the initial tree to be pruned, while larger values result in less pruning.

The Minimum cases option constrains the degree to which the initial tree can fit the data. At each branch point in the decision tree, the stated minimum number of training cases must follow at least two of the branches. Values higher than the default (2 cases) can lead to an initial tree that fits the training data only approximately -- a form of pre-pruning. (This option is complicated by the presence of missing attribute values and by the use of differential misclassification costs, discussed below. Both cause adjustments to the apparent number of cases following a branch.)

Cross-validation trials

As we saw earlier, the predictive accuracy of a classifier constructed from the cases in a data file can be estimated from its performance on new cases in a test file. Unless there are a very large number of cases in both files, this estimate can be rather erratic. If the cases in hypothyroid.data and hypothyroid.test were to be shuffled and divided into a new 2772-case training set and a 1000-case test set, See5 might construct a different classifier whose error rate on the test cases could vary considerably.

One way to get a more reliable estimate of predictive accuracy is by f-fold cross-validation. The cases in the data file are divided into f blocks of roughly the same size and class distribution. For each block in turn, a classifier is constructed from the cases in the remaining blocks and tested on the cases in the hold-out block. In this way, each case is used just once as a test case. The error rate of a classifier produced from all the cases is estimated as the ratio of the total number of errors on the hold-out cases to the total number of cases.
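
In outline, the procedure looks like the following Python sketch. This is a generic illustration rather than See5's implementation, and build_classifier and classify stand in for classifier construction and use.

        import random
        from collections import defaultdict

        # Illustrative sketch of f-fold cross-validation with blocks of roughly
        # equal size and class distribution.
        def cross_validate(cases, labels, f, build_classifier, classify):
            by_class = defaultdict(list)
            for i, y in enumerate(labels):
                by_class[y].append(i)
            folds = [[] for _ in range(f)]
            for indices in by_class.values():          # deal out each class in turn
                random.shuffle(indices)
                for j, i in enumerate(indices):
                    folds[j % f].append(i)
            errors = 0
            for held_out in folds:
                held = set(held_out)
                train = [i for i in range(len(cases)) if i not in held]
                clf = build_classifier([cases[i] for i in train],
                                       [labels[i] for i in train])
                errors += sum(classify(clf, cases[i]) != labels[i]
                              for i in held_out)
            return errors / len(cases)                 # estimated error rate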

Let's see what happens when the Cross-validation option with ten folds is chosen together with the Rulesets option. After giving details of the individual decision trees and rulesets, the output shows a summary like this:

	Fold      Decision Tree           Rules     
	----    ----------------    ----------------
	          Size    Errors        No    Errors

	  0        6.0     1.4%        6.0     1.4%   
	  1       16.0     0.7%        9.0     1.1%   
	  2       16.0     0.4%       11.0     0.4%   
	  3       16.0     0.0%        9.0     0.0%   
	  4       11.0     0.7%        8.0     0.4%   
	  5        8.0     0.7%        6.0     0.7%   
	  6        9.0     0.4%        7.0     0.7%   
	  7       10.0     1.1%        9.0     1.1%   
	  8        7.0     1.1%        7.0     1.1%   
	  9        7.0     1.4%        6.0     1.4%   

	Mean      10.6     0.8%        7.8     0.8%   
	SE         1.3     0.1%        0.5     0.2%   

This estimates the error rate of decision tree and ruleset classifiers produced from the 2772 cases in hypothyroid.data at 0.8%. The SE figures (the standard errors of the means) provide an estimate of the variability of these results. A different random partition of the training cases is used every time a cross-validation is run. If the SE values are too high for comfort, the whole cross-validation can be repeated a few times and the average of the means used as the estimate.
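
The Mean and SE rows can be checked directly from the ten per-fold figures; for the decision-tree errors above, for example:

        import math
        import statistics

        # Illustrative check of the Mean and SE rows for the decision trees.
        fold_errors = [1.4, 0.7, 0.4, 0.0, 0.7, 0.7, 0.4, 1.1, 1.1, 1.4]   # error %
        mean = statistics.mean(fold_errors)
        se = statistics.stdev(fold_errors) / math.sqrt(len(fold_errors))
        print(round(mean, 1), round(se, 1))     # 0.8 0.1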

Since every cross-validation fold uses only part of the application's data, running a cross-validation does not result in a classifier being saved. To save a classifier for later use, simply run See5 without employing cross-validation.

Sampling from large datasets

Even though See5 is relatively fast, building classifiers from large numbers of cases can take an inconveniently long time, especially when options such as boosting are employed. See5 incorporates a facility to extract a random sample from a dataset, construct a classifier from the sample, and then test the classifier on a disjoint collection of cases. By using a smaller set of training cases in this way, the process of generating a classifier is expedited, but at the cost of a possible reduction in the classifier's predictive performance.

The Sample option with x% has two consequences. Firstly, a random sample containing x% of the cases in the application's data file is used to construct the classifier. Secondly, the classifier is evaluated on a non-overlapping set of test cases consisting of another (disjoint) sample of the same size as the training set (if x is less than 50%), or all cases that were not used in the training set (if x is greater than or equal to 50%).

In the hypothyroid example, using a sample of 60% would cause a classifier to be constructed from a randomly-selected 1663 of the 2772 cases in hypothyroid.data, then tested on the remaining 1109 cases.
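
The arithmetic is easy to reproduce. The sketch below (plain Python, not part of See5) also shows why fixing the random seed gives the same sample on every run, which is essentially what the Lock sample option described next provides.

        import random

        # Illustrative sketch of the Sample option's behaviour as described above.
        def sample_split(n_cases, x, seed=None):
            rng = random.Random(seed)          # a fixed seed reproduces the sample
            order = list(range(n_cases))
            rng.shuffle(order)
            n_train = int(n_cases * x / 100)
            train = order[:n_train]
            test = order[n_train:] if x >= 50 else order[n_train:2 * n_train]
            return train, test

        train, test = sample_split(2772, 60)
        print(len(train), len(test))           # 1663 1109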

By default, the random sample changes every time that a classifier is constructed, so that successive runs of See5 with sampling will usually produce different results. This resampling can be avoided by selecting the Lock sample option that uses the current sample for constructing subsequent classifiers. If this option is selected, the sample will change only when another application is loaded, the sample percentage is altered, the option is unselected, or See5 is restarted.

Differential misclassification costs

Up to this point, all errors have been treated as equal -- we have simply counted the number of errors made by a classifier to summarize its performance. Let us now turn to the situation in which the `cost' associated with a classification error depends on the predicted and true class of the misclassified case.

See5 allows costs to be assigned to any combination of predicted and true class via entries in the optional file filestem.costs. Each entry has the form

	predicted class, true class: cost

where cost is a non-negative real number. The file may contain any number of entries; if a particular combination is not specified explicitly, its cost is taken to be 0 if the predicted class is correct and 1 otherwise.

To illustrate the idea, suppose that it was a much more serious error to classify a hypothyroid patient as negative than the converse. A hypothetical costs file hypothyroid.costs might look like this:

	negative, primary: 10
	negative, secondary: 10
	negative, compensated: 10

This specifies that the cost of misclassifying any primary, secondary, or compensated patient as negative is 10 units. Since they are not given explicitly, all other errors have cost 1 unit. In other words, the first kind of error is 10 times more costly.

A costs file is automatically read by See5 unless the system is told to ignore it. The output from the system using default options now looks like this:

	See5 [Release 1.11]  	Tue Jul 27 16:46:46 1999
	-------------------

	Class specified by attribute `diagnosis'

	Read 2772 cases (24 attributes) from hypothyroid.data
	Read misclassification costs from hypothyroid.costs

	Decision tree:

	TSH <= 6:
	:...query hypothyroid = t:
	:   :...TT4 <= 34: secondary (1.1/0.1)
	:   :   TT4 > 34: negative (122.3)
	:   query hypothyroid = f:
	:   :...TSH <= 4.5: negative (2267.4)
	:       TSH > 4.5:
	:       :...TT4 > 50: negative (78.2)
	:           TT4 <= 50:
	:           :...referral source in {WEST,STMW,SVHD}: secondary (0)
	:               referral source = SVHC: negative (0)
	:               referral source = SVI: negative (2)
	:               referral source = other: secondary (1.2/0.2)
	TSH > 6:
	:...FTI <= 65:
	    :...TT4 <= 63: primary (59.4/8)
	    :   TT4 > 63:
	    :   :...TSH <= 9.4: compensated (3.3/1.3)
	    :       TSH > 9.4:
	    :       :...TT4 > 90: compensated (1.1/0.2)
	    :           TT4 <= 90:
	    :           :...psych = f: primary (8.4/1.3)
	    :               psych = t: compensated (0.2)
	    FTI > 65:
	    :...on thyroxine = t: negative (37.7)
	        on thyroxine = f:
	        :...thyroid surgery = t: negative (6.8)
	            thyroid surgery = f:
	            :...TT4 <= 62:
	                :...TSH <= 35: compensated (4.5/0.4)
	                :   TSH > 35: primary (2.5/0.2)
	                TT4 > 62:
	                :...age > 8: compensated (170.9/28.9)
	                    age <= 8:
	                    :...TSH > 29: primary (0.7)
	                        TSH <= 29:
	                        :...referral source in {WEST,SVHC,SVI,
	                            :                   SVHD}: compensated (0)
	                            referral source = other: compensated (2.8)
	                            referral source = STMW:
	                            :...age <= 1: compensated (1)
	                                age > 1: primary (0.7)


	Evaluation on training data (2772 cases):

		       Decision Tree       
		  -----------------------  
		  Size      Errors   Cost  

		    23   17( 0.6%)   0.01   <<


		   (a)   (b)   (c)   (d)    <-classified as
		  ----  ----  ----  ----
		    62     1                (a): class primary
		         154                (b): class compensated
		                 2          (c): class secondary
		     7     9        2537    (d): class negative


	Evaluation on test data (1000 cases):

		       Decision Tree       
		  -----------------------  
		  Size      Errors   Cost  

		    23   12( 1.2%)   0.01   <<


		   (a)   (b)   (c)   (d)    <-classified as
		  ----  ----  ----  ----
		    32                      (a): class primary
		     1    39                (b): class compensated
		                            (c): class secondary
		     4     7         917    (d): class negative


	Time: 0.3 secs
This new decision tree has a higher error rate than the first decision tree for both the training and test cases, and might therefore appear entirely inferior to it. The real difference comes when we compare the total cost of misclassified training cases for the two trees. The first decision tree, which was derived without reference to the differential costs, has a total cost of 24 (4x1 + 2x10) for the misclassified cases in hypothyroid.data. The corresponding value for the new tree is 17: since none of its errors classifies a non-negative case as negative, each of its 17 misclassified training cases costs just 1 unit. That is, the total misclassification cost over the training cases is lower than that of the old tree.
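
This arithmetic can be checked mechanically. The short Python fragment below (illustrative only) totals the cost of a set of misclassifications under the hypothyroid.costs entries above, with unspecified combinations defaulting to cost 1.

        # Illustrative check of the training-set cost of the first (cost-blind) tree.
        costs = {("negative", "primary"): 10,          # (predicted, true): cost
                 ("negative", "secondary"): 10,
                 ("negative", "compensated"): 10}

        def total_cost(errors):
            # errors maps (true class, predicted class) to a number of cases
            return sum(n * costs.get((pred, true), 1)
                       for (true, pred), n in errors.items())

        first_tree_errors = {("primary", "compensated"): 3,
                             ("secondary", "negative"): 2,
                             ("negative", "compensated"): 1}
        print(total_cost(first_tree_errors))           # 24 = 4x1 + 2x10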

Using Classifiers

Once a classifier has been constructed, an interactive interpreter can be used to assign new cases to classes. The Use Classifier button invokes the interpreter, using the most recent classifier for the current application, and prompts for information about the case to be classified. Since the values of all attributes may not be needed, the attribute values requested will depend on the case itself. When all the relevant information has been entered, the most probable class (or classes) are shown, each with a certainty value. For example, this is the result of analyzing a case using the first decision tree above:

Image of interpreter window

Cross-Referencing Classifiers and Data

See5 incorporates a unique facility that links data and the relevant sections of (possibly boosted) classifiers. We will illustrate this facility using the first decision tree for the hypothyroid application and the cases in hypothyroid.data from which it was constructed.

The Cross-Reference button brings up a window showing the most recent classifier for the current application and how it relates to the cases in the data, test or cases file. (If more than one of these is present, a menu will prompt you to select the file.)

The window is divided into two panes, with the classifier on the left and a list of cases on the right. Each case has a [?] tag (that is red if the case is misclassified), an identifying number or label, and the actual class of the case. Clicking on a case's label or number shows the part(s) of the classifier(s) relevant to that case. For instance, clicking on case 3169 shows the leaf to which this case is mapped:

X-Ref window

If a case has missing values for one or more attributes, if it is covered by several rules, or if boosted classifiers are used, more than one leaf or rule may be relevant to a case. In such situations, all relevant classifier parts are shown.

Click on any leaf or rule, and all the cases that map to the leaf or rule are shown. For instance, clicking on the third leaf from the top shows all cases that are covered by the leaf:

X-Ref window

The Save button preserves the details of the displayed model and case list as an ASCII file -- the file name is selected through a dialog box. The Reset button can be used to restore the window to its initial state. Finally, clicking on the tag [?] in front of a case number or label displays that case:

X-Ref window

The values of attributes marked ignore or label are displayed in a lighter tone to indicate that they play no part in classifying the case.

Generating Classifiers in Batch Mode

The See5 distribution includes a program See5X that can be used to produce classifiers non-interactively. This console application resides in the same folder as See5 (usually C:\Program Files\See5) and is invoked from an MS-DOS Prompt window. The command to run the program is

	See5X -f filestem parameters

where the parameters enable one or more options discussed above to be selected:

	-s        use the Subset option
	-r        use the Ruleset option
	-b        use the Boosting option with 10 trials
	-t trials ditto with specified number of trials
	-S x      use the Sampling option with x%
	-I seed   set the sampling seed value
	-c CF     set the Pruning CF value
	-m cases  set the Minimum cases
	-p        use the Fuzzy thresholds option
	-e        ignore any costs file
	-h        print a summary of the batch mode options

If desired, output from See5 can be diverted to a file in the usual way.

As an example, typing the commands

	cd "C:\Program Files\See5"
	See5X -f Samples\anneal -r -b >save.txt
in an MS-DOS Prompt window will generate a boosted ruleset classifier for the anneal application in the Samples directory, leaving the output in file save.txt.

Linking to Other Programs

The classifiers generated by See5 are retained in binary files, filestem.tree for decision trees and filestem.rules for rulesets. Public C source code is available to read these classifier files and to use them to make predictions. Using this code, it is possible to call See5 classifiers from other programs. As an example, the source includes a program to read cases from a cases file, and to show how each is classified by boosted or single trees or rulesets.

A WinZip archive containing the public source can be downloaded from the RuleQuest web site.

© RULEQUEST RESEARCH 1999 Last updated August 1999

