CleanCode::SQL::Histogram

This site will work and look better in a browser that supports web standards, but it is accessible to any browser or Internet device.

Histogram

N U T S H E L L
Module	Histogram
Description	Generates dynamic histogram result sets
Environment	SQL Server 2000/2005
Author	Michael Sorens
Since	CleanCode 0.9.22

A histogram, according to Wikipedia, is "...a mapping... that counts the number of observations that fall into various disjoint categories." The Histogram stored procedure generates a histogram as either a single-row or multiple-row result set in T-SQL. Each column of the result set is a bucket that tabulates values for a given range of the data under analysis. For a multiple-row result (a detail histogram) each row provides in effect a mini-histogram for an enumerated key field.

Besides the description here, I have also published an article discussing the practical uses of this tool -- A Generic Histogram Generator for SQL Server -- available on the DevX online magazine.

Usage and Parameters

Histogram @SchemaName, @TableName,
    @BucketExpr, @BucketFilter, @IdExpr, @IdFilter, @IdAlias,
    @DateTemplate, @Mode, @LowerBoundDate, @UpperBoundDate, @Buckets,
    @LowerBoundInt, @UpperBoundInt, @Smooth, @Verbose, @Process

where:
   @SchemaName     ::= schema or owner name
   @TableName      ::= table name
   @BucketExpr     ::= field name or expression
   @BucketFilter   ::= filter predicate for @BucketExpr
   @IdExpr         ::= field name or expression
   @IdFilter       ::= filter predicate for @IdExpr
   @IdAlias        ::= column name alias for @IdExpr
   @Aggregate      ::= aggregate expression (count, min, max)
   @DateTemplate   ::= template calling out year, month, and/or day specifiers
   @Mode           ::= histogram type: 'DateRange', 'DateMonths', or 'NumberRange'
   @LowerBoundDate ::= smallest date to include for date histograms
   @UpperBoundDate ::= largest date to include for date histograms
   @Buckets        ::= number of columns for NumberRange mode
   @LowerBoundInt  ::= smallest value to include for numerical histograms
   @UpperBoundInt  ::= largest value to include for numerical histograms
   @Smooth         ::= for NumberRange mode, adjust bounds to palatable values
   @Verbose        ::= "0"-"4" for no diagnostic output or progressively more
   @Process        ::= "1" to execute; "0" to only show generated statements

@SchemaName [ Type: varchar(50) Default: 'dbo' ]: Name of the schema to which the table belongs.

@TableName [ Type: varchar(50) Default: null ]: Name of the table to analyze.

@BucketExpr [ Type: varchar(100) Default: null ]: Field name or expression to tally into the histogram's buckets. To create a histogram over the size (in dollars) of orders, you might specify an OrderTotal field. On the other hand, to check on whether users have created strong passwords by examining the lengths of passwords, you could use an expression such as len(Password).

@BucketFilter [ Type: varchar(200) Default: null ]: Filter predicate for buckets. This allows you to optionally take a cross-section of the table's data. This filter is applied when auto-selecting minimum and maximum range values and when counting values to fill the histogram buckets. Compare this to @IdFilter which filters rows. If omitted, all table data, subject to range selectors, is used. Example: If you want to examine the orders written in a particular region, where each data row contains a regionID, you might use 'regionID like ''%America'''. (Note that to embed quotes you must double them.)

@IdExpr [ Type: varchar(500) Default: null ]: This procedure generates either a summary histogram or a detail histogram. If @IdExpr is null, the result set will consist of a single row categorized by @BucketExpr. If you specify a field name for @IdExpr, then that field becomes the target of a group by clause of the query and the field itself will be the first column in the result set. If @IdExpr is a single field name and it has a date datatype it is formatted according to @DateTemplate. The format that you specify implicitly determines the row groups. That is, if you specify 'yyyy-mm' then all dates in a given month are grouped into a single row. @IdExpr is not limited to being a field name but, like @BucketExpr, may be an expression. If for instance you wish to iterate over full names you might use something like 'LastName+'',''+FirstName'.

@IdFilter [ Type: varchar(200) Default: null ]: Filter predicate for rows. This allows you to take a cross-section of the table's data. If you specify @IdExpr to enumerate all the values of a field in rows, @IdFilter lets you filter that list rather than getting all values. If omitted, all values for @IdExpr are enumerated. Compare this to @BucketFilter which filters the data going into the buckets. Example: If you want to show just new customers (those less than a year old) you might use 'DATEADD(year, 1, customerCreatedDate) > getdate()'. Note that to embed quotes (as in a string constant) you must double them.

@IdAlias [ Type: varchar(50) Default: null ]: If you use an expression rather than a single field name for @IdExpr you will usually want to alias the expression to a simple name in your result set, just as you would with a simple query. Referring to the example under @IdExpr to build a full name expression, you might choose FullName for the @IdAlias.

@Aggregate [ Type: varchar(50) Default: null ]

Certain aggregate functions (count, min, max) may be applied to obtain a second-order result, best illustrated by example. Say you have a table of ClientAccounts with a primary key of IdAccount. Each record also has a foreign key IdClient to match one or more accounts to each client. If you want to create a histogram displaying the distribution of accounts per client, you could use this:
exec histogram @TableName='ClientAccounts', @BucketExpr='count(IdAccount)', @Aggregate='IdClient'
The @Aggregate parameter specifies the grouping field, just as you would use in a GROUP BY clause.

When you use the @Aggregate parameter, the @Mode defaults to NumberRange. As such, you may supply any other parameters valid for that mode except for @IdExpr, as this always generates a summary histogram. (See the parameter table under Parameter Requirements.)

This aggregation is currently not supported in SQL Server 2000; you must use SS2005 or later.

@DateTemplate [ Type: varchar(15) Default: null ]: This template is used to customize a date in two cases. First, when the @IdExpr has a date datatype--indicating the rows will enumerate the values of that field--that date field, it is formatted according to @DateTemplate. Second, when the @Mode is DateRange--indicating the columns will be ranges of dates--the column dates are formatted according to @DateTemplate. Use 'yyyy' to represent the year, 'mm' the month, and 'dd' the day, separated by either hyphens (-) or virgules (/), as in 'yyyy-mm-dd' or 'mm/yyyy'. No other characters are allowed in @DateTemplate.

@Mode [ Type: varchar(20) Default: null ]

'DateRange'

Enumerates items in each day, month or year, depending on the @DateTemplate. The format that you specify in @DateTemplate implicitly determines the grouping of data into columns, i.e. the size of each bucket in the histogram.

For months, the range runs from the complete month of the @LowerBoundDate running consecutively through the complete month of the @UpperBoundDate. The number of buckets is the number of months between @UpperBoundDate and @LowerBoundDate inclusive and the size of each bucket is one month.
For years, the range runs from the complete year of the @LowerBoundDate, running consecutively through the complete year of the @UpperBoundDate. The number of buckets is the number of years between @UpperBoundDate and @LowerBoundDate inclusive and the size of each bucket is one year.
For days, the range runs from the @LowerBoundDate through the @UpperBoundDate. The number of buckets is the number of days between @UpperBoundDate and @LowerBoundDate inclusive and the size of each bucket is one day.

Column headings for this histogram type are comprised of the @DateTemplate with appropriate day, month, and year substititutions. Thus, if you specify a lower bound of '3/14/2009' and a @DateTemplate of 'yyyy-mm', columns are named '2009-03', '2009-04', etc.

'DateMonths'

Collects data for all months irrespective of years. If you do not specify a lower boundary date, the histogram starts from January. If you do not specify an upper boundary date, the histogram ends in December. If you specify one or both boundary dates, the represented months are determined from your specified date. The data is limited to fall within the years of your boundaries as well. That is, to get a histogram from March to May between 2005 and 2008, you could specify a lower bound of '3/1/2005' and an upper bound of '5/25/2008'. The day in these boundaries is ignored, so you could specify any day of the month. Column headings for this histogram type are the English month names.

'NumberRange'

Partitions a range of numerical values into integral buckets. If, for example, you want to generate a histogram for the quantities ordered per invoice for a given product, and you could reasonably expect up to 100 units of the item per invoice, you could specify 10 buckets to see the breakdown of such orders in deciles (1-10, 11-20, 21-30, and so forth up to 100). You may explicitly specify the number of buckets (@Buckets) and the upper/lower bounds (@UpperBoundInt, @LowerBoundInt) or use the system defaults. Column headings for this histogram type are numerical ranges separated with hyphens, with a possible trailing plus or minus sign on the final bucket. Thus, for a range of 0 to 48 with 5 buckets, the columns are named '0-9', '10-19', '20-29', '30-39', and '40-48-'. The trailing minus sign indicates the final bucket is smaller than all the others.

If @Mode is unspecified, it will be interpreted if possible from the other parameters. First, the datatype of @BucketExpr is examined. If it is a datetime or smalldatetime,DateRange, DateMonths, and NumberRange @Mode is set to DateRange. If, on the other hand, it is some type of integer or money datatype, @Mode is set to NumberRange. If @BucketExpr is not a simple field name, the system will not be able to check the datatype so it next looks at minimum/maximum values that you specify. If @LowerBoundDate or @UpperBoundDate is non-null, @Mode is set to DateRange, whereas if @LowerBoundInt or @UpperBoundInt is non-null, @Mode is set to NumberRange,

@LowerBoundDate [ Type: datetime Default: null ]: When @Mode is set to DateRange this parameter defines the lower bound of the histogram. When unspecified, the specified table (@TableName) will be scanned for the minimum value.

@UpperBoundDate [ Type: datetime Default: null ]: When @Mode is set to DateRange this parameter defines the upper bound of the histogram. When unspecified, the specified table (@TableName) will be scanned for the maximum value.

@Buckets [ Type: int Default: 10 ]

When @Mode is set to NumberRange this parameter specifies the preferred number of buckets (columns) of the histogram. Note, however, that the actual number of buckets may be smaller due to integral rounding. Say, for example, that you specify a range of 0-45 with 15 buckets. The bucket size will be three, so the first bucket will hold 0-2, the second 3-5, etc., and this yields exactly 15 buckets. If you change the upper bound to 50, then the bucket size is determined to be four, yielding 0-3, 4-7, 8-11, etc., with a total of just 13 buckets (columns). The buckets will all be the same size except for the final bucket whose size may be equal to, greater than, or less than, the other bucket sizes. As an example, start with a range of 1-22 with 4 buckets. This range of 22 values yields 3 buckets holding 6 values each and one holding 4 values. Similarly, a range of 0-25 with 5 buckets--a range of 26 values, not 25-- yields 4 buckets with 5 values each and the final bucket with 6 values. When the final bucket size differs, the column name is postfixed with either a minus sign or a plus sign to indicate that it is smaller or larger, respectively.

When @Mode is set to DateRange the number of buckets is specified implicitly by the @DateTemplate. In either case, the current implementation supports up to approximately 36 buckets.

@LowerBoundInt [ Type: int Default: null ]: When @Mode is set to NumberRange this parameter defines the lower bound of the histogram. When left at null, the specified table (@TableName) will be scanned for the minimum value subject to smoothing (by enabling @Smooth).

@UpperBoundInt [ Type: int Default: null ]: When @Mode is set to NumberRange this parameter defines the upper bound of the histogram. When left at null, the specified table (@TableName) will be scanned for the maximum value subject to smoothing (by enabling @Smooth).

@Smooth [ Type: bit Default: 1 ]: If turned on and @Mode is set to NumberRange, then the upper and lower bounds are "smoothed" to nearby round lots. For example, if the range is 105 through 279, then this is adjusted to become 100 through 300. The algorithm for smoothing these bounds takes into account how close they are together (i.e. 1013 to 1017 does not change at all since it is such a small range), how large the upper bound is relatively to the lower's proximity to zero, and other factors.

@Verbose [ Type: smallint Default: 0 ]

For diagnostic purposes, you may display the dynamic SQL generated to process your request. Each level is cumulative, producing more (rather than alternate) output.

0 = No output.
1 = Calculations.
2 = Main SQL statements.
3 = Min/Max SQL statements.
4 = Aggregate data dump.

@Process [ Type: bit Default: 1 ]: Generally, this flag is available to let you see the generated SQL without executing it. That statement applies to the main query that is generated. There are "preliminary" queries to determine upper and lower bounds if you do not supply them explicitly. These preliminary queries are run regardless of the setting of this flag. Set @Process to 1 to execute the main query or 0 to disable its execution. To be useful, however, you must also set @Verbose to a non-zero value, otherwise no output is displayed and nothing is executed! To

Histogram

Usage and Parameters

Details

Installation

General Outline of Usage

Parameter Requirements for Histogram Types

Examples

Parameter	DateRange	DateMonths	NumberRange
@SchemaName	optional	optional	optional
@TableName	required	required	required
@BucketExpr	required	required	required
@BucketFilter	optional	optional	optional
@IdExpr	optional	optional	optional (NA if @Aggregate used)
@IdFilter	optional (and only when @IdExpr is also used)
@IdAlias	optional (and only when @IdExpr is also used)
@Aggregate	NA	NA	optional
@DateTemplate	required	required if: @IdExpr is a date field
@Mode	required if: bounds are omitted and @BucketExpr is not a date field	required	required if: bounds are omitted and @BucketExpr is not an integer or money field
@LowerBoundDate	optional	optional	NA
@UpperBoundDate	optional	optional	NA
@Buckets	optional	NA	NA
@LowerBoundInt	NA	NA	optional
@UpperBoundInt	NA	NA	optional
@Smooth	NA	NA	optional
@Verbose	optional	optional	optional
@Process	optional	optional	optional