Multisets

From NARS2000
Jump to navigationJump to search

Introduction

For decades the several set symbols have languished on APL keyboards either unused or underused. Sometimes the vendor has assigned no dyadic function to that symbol and sometimes the assigned function isn't very useful. All vendors have implemented Set Difference (L~R), one has implemented Set Union (L∪R) and Set Intersection (L∩R), and, up to now, no APL vendor has implemented the missing fourth function.

Part of the reason the so-called set functions in APL are in an odd state is that they are defined on sets, but implemented on non-sets.

From Wikipedia, “in computer science, a set is an abstract data structure that can store certain values, without any particular order, and no repeated values. It is a computer implementation of the mathematical concept of a finite set”.

However, APL implementations of the set functions don't enforce the “no repeated value” requirement. Moreover, allowing repeated values is quite useful giving rise to useful idioms such as L~' ' to remove all blanks from a vector.

Interestingly, sets with repeated values have a long history in mathematics and are known as Multisets. The usual set functions have identical counterparts as multiset functions with the same definitions except the multiset version takes into account the multiplicity of the unique values.

Multisets in an APL context are scalars or vectors of arbitrary items with various operations defined on them (monadically), but mostly between them (dyadically).

More formally, if L and R are multisets and the function m(x,M) returns the multiplicity of element x in the multiset M, then

  • The Union of multisets is the multiset whose unique elements are the unique elements of L,R where the multiplicity of element x in the result is the larger of m(x,L) and m(x,R),
  • The Intersection of multisets is the multiset similar to Union, but with larger replaced by smaller, and
  • The Difference (also called Asymmetric Difference and Relative Complement) of multisets is similar to Union and Intersection, but where the multiplicity calculation is max(0,m(x,L)-m(x,R)), that is if element x is in the result if it occurs more in L than in R, and it occurs with the multiplicity of the difference of the left and right multiplicities.

For multisets with no repeated elements, the multiset function and the corresponding set function produce the same results.

Notation

There are a dozen or so APL primitive functions we'd like to define on Multisets. One way to do this is to come up with a dozen new symbols to represent those Multiset functions; another is to define a single Multiset Operator that can be applied to the APL primitives which is the approach taken here, and that symbol is () which can be typed with Alt-'m', a keystroke previously used for the stile symbol (|) which was duplicated elsewhere on the keyboard. This operator is different from previous operators in APL in that it applies to a select set of primitive functions, but not system functions nor derived functions. It does, however, apply to user-defined functions via the System Label ⎕MS. In that sense it's more like an inflection such as how J uses . and :. Nonetheless, it is a (monadic) operator in the full mathematical (and APL) sense of it taking a function as an operand and returning a (derived) function.

Examples

A common example of a multiset is the decomposition of an integer into its prime factors, each of which may occur multiple times. For example, 600 may be factored into the multiset L←2 2 2 3 5 5, and 2100 into R←2 2 3 5 5 7.

Two useful properties of a multiset are the Underlying Set of Unique Elements (obtained via the usual ∪R) along with their Multiplicities (obtained via the derived function ∪⍦R), where the latter may be defined as ¯2-/⍸1,(2≠/R[⍋R⍳R]),1.

For the two multisets above the two properties are

      (∪L),[0.5] ∪⍦L                        (∪R),[0.5] ∪⍦R
2 3 5                  2 3 5 7
3 1 2                  2 1 2 1

The values (but not the order) of the original multiset may be reconstructed from ∊(∪⍦R)⍴¨∪R. That is, R≡⍦∊(∪⍦R)⍴¨∪R, but not necessarily R≡∊(∪⍦R)⍴¨∪R.

Two common operations performed between multisets are Union and Intersection.

Multiset Union on L and R is defined as the Multiset whose underlying set is the set union of the underlying sets of L and R, and whose multiplicities are the larger of the multiplicities of the corresponding elements of L and R.

For L←2 2 2 3 5 5 and R←2 2 3 5 5 7 the Multiset Union is 2 2 2 3 5 5 7. Note that, for example, there are three 2s in the result because that is the larger of the number of 2s in L (3) and R (2).

Multiset Intersection on L and R is defined the same as for Multiset Union except that the smaller of the multiplicities is taken instead of the larger. For the two Multisets above, the Multiset Intersection is 2 2 3 5 5 where there are two 2s in the result because that's the smaller of the number of 2s in L (3) and R (2), and there are no 7s in the result because that's the smaller of the number of 7s in L (0) and R (1).

Interestingly, in the context of prime factorization, Multiset Union is the direct analog of Least Common Multiple and Multiset Intersection is Greatest Common Divisor. For the two Multisets above,

L∪⍦R ←→ 2 2 2 3 5 5 7 and
L∩⍦R ←→ 2 2 3 5 5

The Least Common Multiple of two original numbers ×/L ←→ 600 and ×/R ←→ 2100 is

      (×/L)∧×/R
4200
      ×/L∪⍦R
4200

and the Greatest Common Divisor is

      (×/L)∨×/R
300
      ×/L∩⍦R
300

Moreover, using P←2 2 2 3 3 5 7 and Q←2 2 3 5 where as in this case P is a superset of Q (that is, P⊇Q) then Multiset Asymmetric Difference is the direct analog of integer division. That is,

      P~⍦Q
2 3 7
      ×/P~⍦Q
42
      (×/P)÷×/Q
42

Other examples illustrate the distinction between the multiset and non-multiset functions

      'mississippi'~'miss'
pp
      'mississippi'~⍦'miss'
issippi

Subscripts

One way to understand Multisets and the operations performed on them is to view them with equal elements having unique subscripts. That is, for the Multisets 2 2 2 3 5 5 and 2 2 3 5 5 7, write them as

21 22 23 31 51 52 and
21 22 31 51 52 71

Now Multiset Union reduces to simple union where only one copy of like elements is kept, and similarly for Multiset Intersection and Multiset Asymmetric Difference. That is,

L     21 22 23 31 51 52
R     21 22 31 51 52 71
L∪⍦R   ←→   21 22 23 31 51 52 71
L∩⍦R   ←→   21 22 31 51 52
L~⍦R   ←→   23

Missing Function

To see if we have covered all of the possible set/multiset results, look at the usual Venn diagram for two sets – all seven results (excluding the empty set) appear as follows:

L∪R     Venn123.png     Union     L,R~L
L∩R     Venn2.png     Intersection     L~L~R
L~R     Venn1.png     Asymmetric Difference Left     L~R
R~L     Venn3.png     Asymmetric Difference Right     R~L
L     Venn12.png     Left     L
R     Venn23.png     Right     R
L∆R     Venn13.png     Symmetric Difference     (L~R)∪R~L ←→ (L~R),R~L

The last diagram shows the missing function along with its name and effect. The mathematical symbol for this function is delta (), however because old APL programs use this symbol as another alphabetic character, we use the Section symbol (§, Alt-'S') instead.

Multiset Derived Functions

The APL functions defined on Multisets are as follows:

L∪⍦R     Multiset Union
L∩⍦R     Multiset Intersection
L~⍦R     Multiset Asymmetric Difference
L§⍦R     Multiset Symmetric Difference
L⍳⍦R     Multiset Index Of
L∊⍦R     Multiset Member Of
L≡⍦R     Multiset Match
L≢⍦R     Multiset Mismatch (same as ~L≡⍦R)
L⊂⍦R     Multiset Proper Subset Of
L⊆⍦R     Multiset Subset Of
L⊃⍦R     Multiset Proper Superset Of
L⊇⍦R     Multiset Superset Of
 ∪⍦R     Multiset Multiplicities (⍴∪⍦R ←→ ⍴∪R)

Multiset Member Of and Index Of

The key to understanding the meaning of the Multiset Operator as it is applied to the above APL primitive functions is Multiset Member Of (∊⍦), so we'll investigate that first.

Using the subscript approach above the definition is straightforward,

L     21 22 23 31 51 52
R     21 22 31 51 52 71
L∊⍦R   ←→   1 1 0 1 1 1

again, matching like elements with like elements between the two arguments.

We'll use this definition many times over in defining how the Multiset Operator applies to various APL primitive functions.

In a similar manner, Multiset Index Of can be understood best by writing the two arguments with subscripts. For a more detailed coverage of this see Anatomy of An Idiom.

Multiset Match

This derived function is a convenient way to determine whether or not two multisets are identical up to but not including order, as in these anagrams:

      'dynamo'≡⍦'monday'
1
      'pepsicola'≡⍦'episcopal'
1
      'the morse code'≡⍦'here come dots'
1

APL Definitions

Each of the above dyadic Multiset functions has a simple analog in a non-Multiset context:

Function
  f
    Non-Multiset
Definition
L f R
    Multiset
Definition
L f⍦ R

  ∪     L,R~L     L,R~⍦L
  ∩     (L∊R)/L     (L∊⍦R)/L
  ~     (~L∊R)/L     (~L∊⍦R)/L
  §     (L~R),R~L     (L~⍦R),R~⍦L
  ⍳     L⍳R     L⍳⍦R
  ∊     L∊R     L∊⍦R
  ≡     ((≢L)≡≢R)∧∧/L∊R     ((≢L)≡≢R)∧∧/L∊⍦R
  ≢     ~L≡R     ~L≡⍦R
  ⊂     (L⊆R)∧L≢R     (L⊆⍦R)∧L≢⍦R
  ⊆     ∧/L∊R     ∧/L∊⍦R
  ⊃     (R⊆L)∧R≢L     (R⊆⍦L)∧R≢⍦L
  ⊇     ∧/R∊L     ∧/R∊⍦L

† = This is the meaning this function would have as a set function if it didn't have another meaning in a non-Multiset context.

Note that in every case the Multiset definition may be obtained from the non-Multiset definition by introducing the Multiset Operator at the appropriate place(s). Beyond the above explanations for Multiset Member Of and Multiset Index Of, all of the other Multiset definitions can be reduced to the definition of Multiset Member Of.

Moreover, Multiset Union, Intersection, and Symmetric Difference are all symmetric up to order. That is,

(L∪⍦R)≡⍦R∪⍦L
(L∩⍦R)≡⍦R∩⍦L
(L~⍦R)≡⍦R~⍦L

Acknowledgments

The idea of defining Multisets in an APL context is due to Patrick Parks of APL2000, Inc.