Dr. Dobb's is part of the Informa Tech Division of Informa PLC





Tag Clouds: Usability and Math


Linearization

For the purposes of illustration, I created a dataset of well-known authors in our field, with the number of hits each name scores in a Google search. When I use the raw data to create a tag cloud, I get the result in Figure 2(a). The tag cloud presents most of the names in approximately the same size; only a few names jump out, and some are nearly illegible. The reason is that the weights are not distributed evenly over the range of the source data. Most of the authors on my bookshelf have (roughly) the same number of Google hits, and only a few have either very many or very few. The data resembles a normal (Gaussian) distribution, examples of which you can see in Figure 3. To get a more evenly distributed range of font sizes in the tag cloud, it is necessary to "linearize" the original values. You get a better result when you use a linearized representation, as in Figure 2(b). Technically, linearization means that the font sizes reflect the weights less accurately. But because the tags have differing word lengths, there is already no such thing as an accurate reflection of the weights. Here, we are interested in usability, not accuracy.
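To see why raw weights bunch together, here is a minimal sketch of a purely proportional mapping. It is an assumption about how a "raw" cloud might be sized, not the code behind Figure 2(a); the function name ScaleProportionally and the hit counts are made up for illustration.

```vbnet
Imports System
Imports System.Collections.Generic

Module RawScalingSketch
    ' Assumed mapping (not from the article): the smallest weight gets minSize,
    ' the largest gets maxSize, everything else lands proportionally in between.
    Function ScaleProportionally(ByVal weights As ICollection(Of Decimal), _
            ByVal minSize As Decimal, ByVal maxSize As Decimal) _
            As List(Of Decimal)
        Dim output As New List(Of Decimal)
        If weights.Count = 0 Then Return output
        ' Find the smallest and largest weight.
        Dim min As Decimal = Decimal.MaxValue
        Dim max As Decimal = Decimal.MinValue
        For Each w As Decimal In weights
            If w < min Then min = w
            If w > max Then max = w
        Next
        For Each w As Decimal In weights
            If max = min Then
                ' All weights are equal: use the middle of the size range.
                output.Add((minSize + maxSize) / 2)
            Else
                output.Add(minSize + (maxSize - minSize) * (w - min) / (max - min))
            End If
        Next
        Return output
    End Function

    Sub Main()
        ' Made-up hit counts: four similar values plus two outliers.
        Dim hits As Decimal() = {100D, 9000D, 10000D, 11000D, 12000D, 500000D}
        For Each size As Decimal In ScaleProportionally(hits, 10D, 30D)
            Console.WriteLine(Math.Round(size, 2))
        Next
    End Sub
End Module
```

With these invented numbers, the four middle values all come out between 10.3 and 10.5 on a scale from 10 to 30, because the single large outlier stretches the range. That is the effect linearization is meant to counter.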


Figure 2: Linearization of source data.


Figure 3: Normal distributions.

The Pareto distribution, or "80-20 rule" (see Figure 4), is also frequently encountered. In this distribution, 80 percent of the weights are in the lowest 20 percent of the range, while the other 20 percent fill the remaining 80 percent of the range, or the other way around. Well-known examples of this distribution include wealth among people, popularity of websites, and the frequency of words in the English language. You need to select the right algorithm for linearization of your dataset. In Figure 2(c), my dataset (which contains a normal distribution) is linearized as if it contained a Pareto distribution. The result can be weird when you select the wrong distribution model. Strangely enough, I've noticed several authors doing exactly the opposite: they linearized datasets that contained Pareto distributions assuming (unknowingly, I suppose) that they were normal distributions. Evidently, statistical knowledge itself is not distributed evenly among software developers.


Figure 4: Pareto distributions.
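One rough way to check which model fits is to measure how many weights fall in the lowest 20 percent of the range. The sketch below does only that; the helper ShareInLowestFifth is hypothetical, and the result is an indicator rather than a statistical test. A share near 0.8 hints at a Pareto-like shape, while a symmetric bell curve puts far fewer values in that bottom slice.

```vbnet
Imports System
Imports System.Collections.Generic

Module DistributionCheck
    ' Hypothetical helper: returns the fraction of weights that fall in the
    ' lowest 20 percent of the range between the smallest and largest weight.
    Function ShareInLowestFifth(ByVal weights As ICollection(Of Decimal)) As Double
        If weights.Count = 0 Then Return 0
        Dim min As Decimal = Decimal.MaxValue
        Dim max As Decimal = Decimal.MinValue
        For Each w As Decimal In weights
            If w < min Then min = w
            If w > max Then max = w
        Next
        ' All weights equal: every weight sits at the minimum.
        If max = min Then Return 1
        ' Upper boundary of the lowest fifth of the range.
        Dim threshold As Decimal = min + (max - min) * 0.2D
        Dim count As Integer = 0
        For Each w As Decimal In weights
            If w <= threshold Then count += 1
        Next
        ' Integer / Integer with the "/" operator yields a Double in VB.
        Return count / weights.Count
    End Function
End Module
```

A single extreme outlier can also push most values into the lowest fifth, so it is worth looking at the data itself before trusting the number.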

You will need several functions when linearizing multiple types of distributions. Each function takes one collection of weights as input, together with explicit lower and upper boundaries for the desired range of output values, and returns a new (linearized) version of the collection. I suggest you work with generic interfaces for collections so that you can apply the same functions to different types of data sources. It also seems proper to work with decimal or real numbers, not integers. Rounding the values to integers should be left to the UI code, in my opinion.
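As a small illustration of both points, the sketch below accepts any ICollection(Of Decimal) and rounds only at the last moment. ToFontSizes is a hypothetical UI-side helper, not part of the listings, and the sample sizes are invented.

```vbnet
Imports System
Imports System.Collections.Generic

Module RoundingSketch
    ' Hypothetical UI-side helper: rounds linearized sizes to whole font
    ' sizes only when they are about to be rendered.
    Function ToFontSizes(ByVal sizes As ICollection(Of Decimal)) As List(Of Integer)
        Dim result As New List(Of Integer)
        For Each s As Decimal In sizes
            result.Add(CInt(Math.Round(s, MidpointRounding.AwayFromZero)))
        Next
        Return result
    End Function

    Sub Main()
        ' An array and a List(Of Decimal) both implement ICollection(Of Decimal),
        ' so the same function serves either data source.
        Dim fromArray As Decimal() = {10.4D, 17.5D, 29.6D}
        Dim fromList As New List(Of Decimal)(fromArray)
        For Each size As Integer In ToFontSizes(fromArray)
            Console.WriteLine(size)
        Next
        For Each size As Integer In ToFontSizes(fromList)
            Console.WriteLine(size)
        Next
    End Sub
End Module
```

Keeping the linearized values as decimals until this point means that rounding errors do not accumulate inside the linearization functions.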

Listing Two is my attempt at linearizing a normal distribution, which is partly based on some examples on the Internet. The function calculates the mean and the standard deviation (sd) and assumes that nearly all values (about 95 percent in a normal distribution) lie in the range from -2 * sd to +2 * sd around the mean. For each number, a new weight is calculated on a straight line through that range; values that fall outside it are clamped to the minimum or maximum size. Listing Three presents an algorithm that linearizes a Pareto distribution. This function calculates a new weight for each number using a logarithm, with e as the base number. (The choice of base does not influence the result, because the logarithms are afterwards rescaled between their minimum and maximum, and changing the base only multiplies every logarithm by the same constant.) The remainder of the function plots the new values on a straight line between the minimum and maximum values.

'Requires Imports System and Imports System.Collections.Generic.
Public Shared Function FromBellCurve( _
        ByVal weights As ICollection(Of Decimal), _
        ByVal minSize As Decimal, ByVal maxSize As Decimal) _
        As ICollection(Of Decimal)
    Dim output As New List(Of Decimal)
    'An empty collection has no mean; return an empty result
    'instead of dividing by zero.
    If weights.Count = 0 Then Return output
    'First, calculate the mean weight.
    Dim meansum As Decimal = 0
    For Each w As Decimal In weights
        meansum += w
    Next
    Dim mean As Double = meansum / weights.Count
    'Second, calculate the (population) standard deviation of the weights.
    Dim sdsum As Double = 0
    For Each w As Decimal In weights
        sdsum += (w - mean) ^ 2
    Next
    Dim sd As Double = Math.Sqrt(sdsum / weights.Count)
    'Now calculate the slope of a straight line from -2*sd to +2*sd.
    Dim slope As Double
    If sd > 0 Then
        slope = (maxSize - minSize) / (4 * sd)
    End If
    'Get the value in the middle between minSize and maxSize.
    Dim middle As Double = (minSize + maxSize) / 2
    'Calculate the result for the given deviation from mean.
    For Each w As Decimal In weights
        If (sd = 0) Then
            'With sd=0 all tags have the same weight.
            output.Add(CDec(middle))
        Else
            'Calculate the distance from mean for this weight.
            Dim distance As Double = w - mean
            'Calculate the position on the slope for this distance.
            Dim result As Double = slope * distance + middle
            'If the tag turned out too small, set minSize.
            If result < minSize Then result = minSize
            'If the tag turned out too big, set maxSize.
            If result > maxSize Then result = maxSize
            output.Add(CDec(result))
        End If
    Next
    Return output
End Function
Listing Two
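
Written as a formula, the mapping in Listing Two looks roughly as follows. The symbols are my own shorthand, not the article's: \mu is the mean of the weights, \sigma their population standard deviation, and s_min and s_max stand for minSize and maxSize.

```latex
\[
\text{size}(w) = \min\!\Bigl(s_{\max},\ \max\!\Bigl(s_{\min},\
  \frac{s_{\min} + s_{\max}}{2} + \frac{s_{\max} - s_{\min}}{4\sigma}\,(w - \mu)\Bigr)\Bigr)
\]
```

A weight at \mu + 2\sigma lands exactly on s_max and a weight at \mu - 2\sigma on s_min; anything further out is clamped to those limits.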

'Requires Imports System and Imports System.Collections.Generic.
Public Shared Function FromParetoCurve( _
        ByVal weights As ICollection(Of Decimal), _
        ByVal minSize As Decimal, ByVal maxSize As Decimal) _
        As ICollection(Of Decimal)
    'Convert each weight to its log value.
    Const BASE As Double = Math.E
    Dim logweights As New List(Of Decimal)
    For Each w As Decimal In weights
        'The logarithm is undefined for zero and negative weights.
        If w <= 0 Then
            Throw New ArgumentOutOfRangeException("weights", _
                "All weights must be greater than zero.")
        End If
        logweights.Add(CDec(Math.Log(w, BASE)))
    Next
    'First, find the min and max weight.
    Dim min As Decimal = Decimal.MaxValue
    Dim max As Decimal = Decimal.MinValue
    For Each w As Decimal In logweights
        If w < min Then min = w
        If w > max Then max = w
    Next
    'Now calculate the slope of a straight line, from min to max.
    Dim slope As Double
    If max > min Then
        slope = (maxSize - minSize) / (max - min)
    End If
    'Get the value in the middle between minSize and maxSize.
    Dim middle As Double = (minSize + maxSize) / 2
    'Calculate the result for each of the weights.
    Dim output As New List(Of Decimal)
    For Each w As Decimal In logweights
        If (max <= min) Then
            'With max=min all tags have the same weight.
            output.Add(CDec(middle))
        Else
            'Calculate the distance from the minimum for this weight.
            Dim distance As Double = w - min
            'Calculate the position on the slope for this distance.
            Dim result As Double = slope * distance + minSize
            'If the tag turned out too small, set minSize.
            If result < minSize Then result = minSize
            'If the tag turned out too big, set maxSize.
            If result > maxSize Then result = maxSize
            output.Add(CDec(result))
        End If
    Next
    Return output
End Function

Listing Three
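
The mapping in Listing Three can be summarized in one formula, and a second line shows what happens when the base b of the logarithm is changed. The notation is my own: w_min and w_max are the smallest and largest weight, and s_min and s_max stand for minSize and maxSize.

```latex
\[
\text{size}(w) = s_{\min} + (s_{\max} - s_{\min})\,
  \frac{\log_b w - \log_b w_{\min}}{\log_b w_{\max} - \log_b w_{\min}}
\]
\[
\log_b x = \frac{\ln x}{\ln b}
\quad\Longrightarrow\quad
\frac{\log_b w - \log_b w_{\min}}{\log_b w_{\max} - \log_b w_{\min}}
= \frac{\ln w - \ln w_{\min}}{\ln w_{\max} - \ln w_{\min}}
\]
```

The factor 1 / ln b appears in both numerator and denominator and cancels, so for this particular rescaling every base greater than 1 yields the same sizes.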

