Information Is What Confuses Us

This is not really a review of William Kent's book Data and Reality: Basic Assumptions in Data Processing Reconsidered [1]. It is an article inspired by Kent. The category in which to put this essay is not clear. Where it should go in the index depends upon your point of view. These may seem minor points, but they are central issues here. As Kent asserts, people not involved with computers may find the subject trivial.

The subject, in its broadest sense, is the correspondence between symbols and reality. This subject is also part of the study of linguistics, philosophy and mathematics, but here only those aspects relevant to computers are considered. Kent gives many examples of the problem in building data models of reality. However, this paper develops only one of his examples (books) to illustrate the mismatch between computer systems and reality.

Computers store prodigious quantities of data in the form of fields, records and indexes. All of this data consists of symbols. It is through symbols that computers communicate with each other and with people. The symbols used to store data can be classed as numbers or names. So consider the use of numbers and names with respect to books.

How many books are there in your library? That depends on whether "titles" or physical objects are counted. A two volume copy of Tolstoy's "War and Peace" [2] can be counted as one book or two. Let us decide to call the physical object a volume and the literary object a book. The volume on the shelf is "The Best of Jules Verne" [3]. It contains "Around the World in 80 Days", "The Clipper in the Clouds" and "Journey to the Centre of the Earth". But the book's title is "The Best of Jules Verne" by Alan K. Russel. This can be counted as one book or three books depending on who the author is deemed to be.

There are, of course, books which do not exist. In a volume of Stanislaw Lem's works "Solaris The Chain Of Chance A Perfect Vacuum" [4] the title/book/story "A Perfect Vacuum" reviews books (including itself), quoting titles, authors and publishers. None of the books exist (except itself). Then there are works of ancient Greek authors, which are referred to by their followers, but for which no copies are known to survive. They may appear in citation indexes and bibliographies. In some sense these books exist. It could be said that the titles exist but the numbers of copies are zero. Is there any difference between the non-existent books Lem reviews and the ones Plato references?

Books change. Charles Darwin's "The Origin of Species" [5] went through 6 revisions in Darwin's lifetime. (The one that most closely fits modern biological thinking is version 1.) This can be viewed as six separate books or one.

[Illustration: evolution]

All the above issues can be resolved by taking suitable points of view. The question is: can there exist a consistent view point from which the books can be counted?
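The counting problem can be made concrete with a small sketch. It is only an illustration, not anything Kent proposes: the `Volume` record, its field names and the three counting functions are assumptions made up for this example, and the shelf holds just the two-volume Tolstoy and the Jules Verne anthology mentioned earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Volume:
    """One physical object on the shelf (hypothetical record layout)."""
    spine_title: str       # what is printed on this physical volume
    catalogue_title: str   # the "book" a catalogue would list it under
    works: tuple           # the literary works contained in the volume


shelf = [
    Volume("War and Peace, Vol. 1", "War and Peace", ("War and Peace",)),
    Volume("War and Peace, Vol. 2", "War and Peace", ("War and Peace",)),
    Volume(
        "The Best of Jules Verne",
        "The Best of Jules Verne",
        (
            "Around the World in 80 Days",
            "The Clipper in the Clouds",
            "Journey to the Centre of the Earth",
        ),
    ),
]


def count_volumes(volumes):
    """Point of view 1: count physical objects."""
    return len(volumes)


def count_catalogue_titles(volumes):
    """Point of view 2: count distinct titles as a catalogue would."""
    return len({v.catalogue_title for v in volumes})


def count_works(volumes):
    """Point of view 3: count distinct literary works."""
    return len({work for v in volumes for work in v.works})


print(count_volumes(shelf), count_catalogue_titles(shelf), count_works(shelf))
```

The same shelf yields three different totals. Each function is consistent on its own; the difficulty is that none of them is "the" number of books.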

"Once more: we are not modelling reality, but the way information about reality is processed, by people." [1]

Perhaps if counting books is so difficult we can name them. Books are generally named by a combination of author and title. Poetry books, however, have a tendency to violate this convention. For example "Robert Graves, Poems Selected by Robert Graves and Anthony Thwaite" [6] is a collection of Robert Graves' poetry. It is not clear who is (are) the author(s), and what the title of the book is. "A Choice of Kipling's Verse Made By T.S. Eliot With an Essay on Rudyard Kipling" [7]. Author and title please. Returning to an earlier example, "The Best of Jules Verne" Selected and Introduced by Alan K. Russel. Since Mr J. Verne never wrote a book called The Best, Mr Russel can be considered to be the author. Maybe the convention which is used for films and records of having a title, author, director, producer and distribution company should be extended to books. But there again Jules Verne wrote in French, so in some sense he is not the author. It is tempting to bring Philip Jose Farmer's "The Other Log of Phileas Fogg" [9] into the issue, but life is complex enough without it.

"A book has a title; a Library of Congress Number; an ISBN (International Standard Book Number); not to mention various Dewey decimal identifiers in local library catalogues. And each copy of a book may have an 'accession number', assigned locally by the library for their overall inventory management. " [1]

However "some books don't have an ISBN, others don't have Library of Congress numbers, and some neither. But many books have both - and some have several ISBNs. And Library of Congress numbers apply to a larger class of entities than do ISBN's; they are also assigned to films, recordings, and other forms of publication in addition to books." [1]

Naming books is no easier than counting them. The problem of a book's name depending on the observer's point of view is common to the counting process. In addition a second problem has come to the fore. The book acquires names through the process of publication. Jules Verne's "Around the World in Eighty Days" appeared originally as "Le Tour du Monde en Quatre-vingt Jours" [8]. Book revisions, such as Darwin's, constitute another process, one which alters a book's content without changing its title. The reverse process, of altering the title without altering the contents, also occurs.

"Structure is process slowed down" [1]

The relevance of all this to the use of computers is this. Systems designers structure what they perceive as data about reality in a particular way. They do so within the technical constraints of the computer systems they are using. To use the most powerful feature of the computer, the ability to repeat processes on a large number of similar records, the designer is obliged to take a point of view. They may take several views of the data, but the more views they take, the more complex their data structures, the more convoluted their programs and the more difficult it is for the users to use the system. A "common sense" view of information involves continuous, subtle and sub-conscious shifting of view point. The loss of "common sense" that users experience on first contact with a computer based system can be a major problem in implementation. But the narrow deterministic view is essential to the efficient use of computers. For "efficient" read small programs, simple file structures, and minimum key strokes. Looked at from a wider view point, efficiency may mean something else altogether.

The view point dilemma is recursive. Not only is the "reality" outside the computer subject to the problem of shifting view point, but the internal reality of a computer system suffers the same way. As an illustration of this consider a library index system.

An index records relationships. It associates a group of facts and points to something. In the case of a library index it associates the author with the title of a book and points to the shelf where the book should be. Of course all the problems of identifying the author and the title recur in drawing up the index cards.

There are now some additional difficulties. Can the index refer to itself? This is a trivial problem in this example, but it can be a crucial problem with computer systems. After all, any citation indexes, publishers' yearbooks or book catalogues that are published in book form ought to be recorded in the index. A work may occur in several places in the index. A book ascribed to multiple authors may have multiple cards raised for it. Where the book crosses over several subject codes there might be several subject index cards raised. Where a new edition of the book is added, all the cards in the system ought to be brought up to date.

[Illustration: index]

Where a book is breaking new ground, there may be no suitable Dewey code, so it will be given the nearest available code. As a range of books is published in the subject, a new Dewey code is raised, and subsequent books are filed under it. It seems unlikely that the initial "paradigm shifting" book will be re-categorised. This implies that all important new books tend to be incorrectly coded. The new sub-code changes the meaning of the original Dewey code. The old code now, by implication, excludes the new speciality. So in practice, the meaning of a subject code depends on the time it is allocated to the book. Additionally the catalogue of Dewey numbers may not be up to date. There could be codes in the index no longer in the catalogue. This can occur because wrong or invalid codes have crept into the index or the catalogue.

[Illustration: misfile]

Another form of non-existence can make itself felt. An index card cannot be raised for a book not yet published. For example the next volume of Isaac Asimov's Foundation series is due out in hardback in October 1982 (paperback a year later). Most library indexes do not reflect this information because they only record the past. This is a nice example because it reflects the dilemma of many information systems: often their very basis prohibits them from recording data until after the event. Yet in practice information about the future is more important than information about the past.

So to return to computer systems and their relationship with some external system. The external system is viewed by its participants through a many-sided prism of common-sense points of view. The system and its participants' points of view are but the current state in a continuous process of change. The computer system is based on a smaller number of points of view, but these are also undergoing changes, not necessarily in a way that relates to the changes in the external system.

The outcome of these processes, for data processing, is the continual modification of computer programs and files. The outcome for the people using the computer is that the computer system seems to reflect yesterday's problems, not today's needs. A number of techniques have been developed to improve this situation. The development of High Level Languages is one approach, and Database a later one. The word Database has been de-based to mean anything from a floppy disc file catalogue to a full relational database. It is ironic that the word database does not have an agreed definition. Put another way, the problem of naming things extends to naming the thing a database.

Database, for this discussion, is used to mean a Relational Database. A database permits the physical storage of data, and the processes used to retrieve it, to be independent of the computer programs. It enables data to be stored in a very generalised manner so that different users of the data can have their own way of looking at it. The aim of database technology is to enable data to be structured in a way that is independent of the use to which the data is put. The advantage of database techniques is that when one user changes the data stored, then this change is reflected to all other users however they access the data.
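A minimal sketch of this idea follows, using the SQLite engine from Python's standard library. The table, the two views and the shelf label are hypothetical names invented for the illustration; they stand for one stored record seen from two users' points of view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- One stored record (hypothetical layout).
    CREATE TABLE book (
        id     INTEGER PRIMARY KEY,
        author TEXT,
        title  TEXT,
        shelf  TEXT
    );

    -- Two users, two ways of looking at the same data.
    CREATE VIEW reader_view  AS SELECT author, title FROM book;
    CREATE VIEW shelver_view AS SELECT shelf, title FROM book;

    INSERT INTO book (author, title, shelf)
    VALUES ('Charles Darwin', 'The Origin of Species', 'B2');
    """
)

# One user changes the stored data...
conn.execute(
    "UPDATE book SET title = ? WHERE id = 1",
    ("The Origin of Species (revised)",),
)

# ...and the change is visible through every view.
reader_rows = conn.execute("SELECT author, title FROM reader_view").fetchall()
shelver_rows = conn.execute("SELECT shelf, title FROM shelver_view").fetchall()
print(reader_rows)
print(shelver_rows)
```

Note that both views are still defined in terms of the one `book` table. The views differ, but the point of view built into the table itself is shared by everyone.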

It is not the purpose of this discourse to question the value of database technology as a programming tool. Far from it, there is every reason to suggest that database techniques raise computer programmer productivity. The issue is whether database techniques have any relevance outside the programming department. Is the very generalised way of holding data (called the conceptual model) applicable to the organisation which owns the computer? Does database allow an adequate description of reality to be maintained within a computer system?

The way a conceptual model works is to break down data structures to their smallest component parts. These parts can then be recombined in a variety of ways, each representing a different view of reality. This might be called the Atomic Theory of Data. It is a reductionist view that reduces data to entities and relationships. An argument against this approach is that this Atomic Theory of Data has its own unacknowledged Uncertainty Principle. This principle states that the more energy used to define the shape of a data structure, the less we know about how that structure is changing.

"Structure IS process slowed down. " [1]

The Atomic approach (unlike Physics) cannot define when to stop subdividing the data. For example, consider the two entities linked by a relationship:- "William Kent wrote Data and Reality". It might be necessary to store data about the act of writing: when did he write it (as distinct from when was it published); in what languages; where was he working when he wrote it. To accomplish this the relationship "wrote" is transformed into two relationships and an entity. The entity is the writing project, and the whole thing becomes "William Kent's work on a writing project produced Data and Reality". Were it required to record data about the method of production (did he use a word processor?) then this would be similarly sub-divided. The ancient argument of whether matter is continuous or discrete is now resurrected about information.
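The subdivision described above can be sketched as follows. All the class and field names are assumptions chosen for the illustration, and the details of the writing project are deliberately left empty because the essay does not supply them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Wrote:
    """First model: a bare relationship between two entities."""
    author: str
    book: str


@dataclass(frozen=True)
class WritingProject:
    """Second model: the relationship promoted to an entity,
    so that facts about the act of writing have somewhere to live."""
    name: str
    started: Optional[str] = None    # unknown here, left empty
    language: Optional[str] = None   # unknown here, left empty
    place: Optional[str] = None      # unknown here, left empty


@dataclass(frozen=True)
class WorkedOn:
    author: str
    project: WritingProject


@dataclass(frozen=True)
class Produced:
    project: WritingProject
    book: str


def subdivide(fact: Wrote):
    """Turn one relationship into an entity plus two relationships."""
    project = WritingProject(name="writing of " + fact.book)
    return WorkedOn(fact.author, project), Produced(project, fact.book)


simple = Wrote("William Kent", "Data and Reality")
worked_on, produced = subdivide(simple)
print(worked_on)
print(produced)
```

Nothing in the second model says it is the last one. Recording the method of production would mean applying the same step again to `Produced`, and so on without a natural stopping point.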

There is another catch. Two things can be related together in more than one way. Consider a book that gives a history of a library. Such a book (i) gives a history of, (ii) is published by, (iii) is owned by, (iv) is written by (?) the library. This is not as silly as it seems. In human organisations multiple relationships exist. Take the example of a couple working for the same company. How does this affect the data stored by the company pension scheme on the partners' entitlements? What happens if they divorce?
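As a rough sketch of why this matters to a designer, compare two hypothetical ways of storing the library-history example: one that allows a single link per pair of things, and one that stores each relationship as its own fact. The names used are invented for the illustration.

```python
# Each fact is a (subject, relationship, object) triple.
facts = [
    ("the book", "gives a history of", "the library"),
    ("the book", "is published by", "the library"),
    ("the book", "is owned by", "the library"),
    ("the book", "is written by", "the library"),
]

# Design A: one slot per pair of things. Later facts overwrite earlier ones.
single_link = {}
for subject, relationship, obj in facts:
    single_link[(subject, obj)] = relationship

# Design B: every relationship is kept as a separate fact.
def relations_between(all_facts, subject, obj):
    """Return every relationship recorded between the two things."""
    return [rel for s, rel, o in all_facts if s == subject and o == obj]

kept_by_a = len(single_link)
kept_by_b = len(relations_between(facts, "the book", "the library"))
print(kept_by_a, kept_by_b)
```

Design A silently keeps only the last relationship it was given. Design B keeps all four, at the price of leaving every program that reads the data to decide which one it meant.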

The principal objection to "the database approach" (as distinct from database programming tools) is that there can be no single objective view of social organisations. And yet the belief in the existence of such a viewpoint is implicit in the database approach. In practice the view point from which a "company database" is constructed is the view-point of the data processing department. This view-point has its own history, its own process of development. It is not a snap-shot of the company information structure; it is the product of a social process.

The original ending of this tract (although not really this write-up, but an earlier one, which changed into this in the process of being written) drew on the works of Teilhard de Chardin [10]. It pointed out that de Chardin's universal love at the Omega point represents the unified view through which the conflicts and confusion of different ways of looking at the world are resolved. Therefore database technology can only work by developing it on the basis of universal love.

In practice computers can be seen to work (without much love). The manner in which they work and the overall degree of user satisfaction may be questioned. But in practice computers are being used extensively. Therefore the question has to be raised, if transferring information from the real world to a computer is so fraught with problems, how is it being done?

There will be some to whom the answer is obvious. Reality, not data, is the myth. Reality exists only to the extent we have data about it. If this is your view point, congratulations! This is without doubt a logical and satisfying point of view.

The second way of answering the question is that the problem of communication about reality pre-dates electronic computers. Human culture has adapted to the gap between symbols and reality. The effect of using computers is to widen the gap. This is a change of degree, not a break with the past.

The third reply is that the database approach is not that widespread, yet. Where it is in use the difficulties described in this contribution have taken the form of a vast number of data items being recorded. In some senses this can be viewed as a success of the method, rather than a criticism of it. The way in which pre-database computing has avoided the problems is by using data structures designed for specific problems. By restricting the design of data structures to meet the need of particular applications, a narrow view point produces a workable system. This traditional solution to the problem creates the difficulties of obtaining from the computer the information required, because the data structures reflect someone else's needs at some point in the past.

Another objection to this critique of data processing is that the difficulties outlined stem from getting from reality to data structure through a human language. In this view reality and data structures are really all right. But getting from one to the other through language introduces the ambiguities and view-point shifts described above. (The observant reader may have noticed that I am not sure whether view-point is one word or two. I considered using a text editor to change all view point phrases to point of view. Unfortunately the software does not recognise a phrase if it goes over a line.) For this objection to stand, a method of getting from reality to data without going through language needs to be outlined. Can such a procedure be outlined in a human language?

"Incidentally one ought to be very cautious about claims of various models being based on 'the axioms of traditional set theory'. That set theory deals entirely with extensional sets: a set is determined entirely by its population. There is simply no notion of a set with changing population, each different population constitutes a different set. So, the relevance of such set theory to any model of data processing is in the least, questionable. "[1]

Fuzzy sets do not help, precisely because an item's relationship to a system is a function of time, not just probability. Perhaps we should define the word system to mean a group of objects whose population is being changed by a process over time. A closed system then is the equivalent of a mathematical set. (But not vice versa.)
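The distinction can be sketched in a few lines. This is an illustration under stated assumptions, not a formal definition: the `System` class, its method names and the years used are all invented for the example. A `frozenset` plays the part of the extensional set, determined entirely by its members; the system keeps its identity while its population changes.

```python
class System:
    """A named group of objects whose population changes over time."""

    def __init__(self, name):
        self.name = name
        self._snapshots = []  # (time, frozenset of members)

    def record(self, time, members):
        """Record the whole population as it stands at a given time."""
        self._snapshots.append((time, frozenset(members)))

    def population_at(self, time):
        """Return the most recent recorded population at or before `time`."""
        current = frozenset()
        for when, members in sorted(self._snapshots, key=lambda s: s[0]):
            if when <= time:
                current = members
        return current


library = System("library")
library.record(1981, {"War and Peace"})
library.record(1982, {"War and Peace", "Data and Reality"})

set_1981 = library.population_at(1981)
set_1982 = library.population_at(1982)

# Two different sets, but one and the same system.
print(set_1981 == set_1982, library.name)
```

A system whose population never changes has exactly one snapshot and so behaves like a set. A set, on the other hand, carries no name and no history, which is why the equivalence does not run the other way.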

So a second ending of this logomachy is a quotation from Robert Graves [6] "In Broken Images".

"He in a new confusion of his understanding
I in a new understanding of my confusion."

Roger Hill September 1982

Published in Newsletter 13 of the UK Systems Society. Art work by Richard Drydon

[Illustration: book]

[1] William Kent (IBM San Jose, California): Data and Reality, Basic Assumptions in Data Processing Reconsidered: North Holland Publishing Company 1978.
Library of Congress Cataloguing in Publications Data
Kent, William. 1936-
Data and Reality.
Bibliography p.
1. Data base management. 2. Data Structure (Computer Science)
3. Title.
QA76.9.D3K46 001.6'4 78-19130
ISBN 0-444-85187-9.

[2] L.N. Tolstoy: War and Peace (2 Vols Translated by R. Edwards): Penguin Classics. ISBN 0 14 044 062 3.

[3] Alan K. Russel: The Best Of Jules Verne: Castle Books. ISBN 0-29009-270-2.

[4] Stanislaw Lem: Solaris The Chain of Chance A Perfect Vacuum: Penguin Books 1981. ISBN 0 14 00 5539 8.

[5] Charles Darwin Edited With an introduction by J.W. Burrow: The Origin Of Species By Means Of Natural Selection OR The Preservation Of Favoured Races In The Struggle For Life: Penguin books 1972 (Copy of the first edition). 14040001 X

[6] Robert Graves Poems Selected by Robert Graves and Anthony Thwaite: Penguin Books Fifth Edition 1978. ISBN 0 14 042 039 8.

[7] A Choice of Kipling's Verse made by T S Eliot with an essay on Rudyard Kipling: Faber and Faber Ltd Reprinted 1973.
ISBN 0 571 05444 7 (Faber Paper Covered Editions).
ISBN 0 571 07007 8 (Hard Bound Edition).

[8] Jules Verne: Le Tour du Monde en Quatre-vingt Jours: Hetzel & Co. Paris 1873 Illustrated by Neuville and Bennet in 1874.

[9] Philip Jose Farmer: The Other Log of Phileas Fogg: Hamlyn Paperbacks 1979.
ISBN 0 600 36747 9

[10] Teilhard De Chardin: The Phenomenon of Man with an introduction by Sir Julian Huxley. Translated by Bernard Wall: Fount Paperbacks Revised Edition Eighth Impression. 1980.
0 00 62483 5
