# AnalyticBridge

A Data Science Central Community

Robin Gower

## Infonomics.ltd.uk

### The mechanical poetry of semantic embeddings

``Paris - France + Germany = Berlin``

This bizarre equation is a genuine result from a natural language processing technique that represents words as vectors of numbers. In this post we apply this tool to an analogical reasoning problem, then go on to explore some possible applications.

## Words as vectors of numbers

Word “embeddings” are numerical representations of words or phrases. They are created by a machine-learning process that seeks to predict the probability of a certain word occurring given the context of other co-occurring words. Thus the vectors embed a representation of the contexts in which a word arises. This means they also encode semantic and syntactic patterns. It works a bit like a form of compression. Instead of words being arbitrarily long sequences of letters, they become fixed-length sequences of decimal numbers.

The vectors themselves are just sequences of numbers. They don’t mean a lot in isolation, much like a single word in a foreign language written in an exotic alphabet. Having more words (vectors) helps - each provides context for the others. We can start to combine and compare words by doing linear algebra on the vectors. This provides a fascinating insight into the relationships between words. Comparisons also allow us to interpret the vectors in natural language terms (to extend the above metaphor - this is like having a translator for that foreign alphabet).

We can explore this a little with visualisation. It is practically impossible to display vectors with e.g. 300 dimensions graphically in a way the human eye can interpret, but there are techniques available to compress high dimensional vectors so that we can visualise a more-or-less faithful representation of them in 2-dimensional space. The t-SNE (t-Distributed Stochastic Neighbour Embedding) algorithm is one such approach that is designed to preserve the local structure of the network - arranging the vectors according to how neighbours relate to one another.

This example, from Pennington, Socher, Manning (2014) Global Vectors for Word Representation, shows comparisons of the vectors for pairs of words. On the below diagram, each word vector is displayed as a point in the 2d space, the comparison between pairs of words is displayed as a dotted-blue line. The vector difference between `man - woman`, `king - queen`, and `brother - sister` all seem broadly similar (in terms of length and orientation). This suggests there is a common structure that represents the semantic distinction between these two genders.

By combining the vectors of several words, we can calculate new aggregate vectors (using vector addition and subtraction from linear algebra):

``Paris - France + Germany = ?``

We can take the vector that represents `Paris`, subtract the one that represents `France`, and add in the vector for `Germany`. The resulting vector (represented with the `?` in the equation above) represents something like “Germany’s equivalent of France’s Paris”. These transformations don’t produce vectors that correspond exactly to other words, but we can look for the word with the closest match (closeness can be measured with cosine similarity, which is based on the angle that separates the vectors). In this case the result is `Berlin`!
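As a minimal sketch of the arithmetic described above, the Python below uses invented three-dimensional vectors (real GloVe vectors have far more dimensions and quite different values). The names `vectors`, `cosine_similarity` and `nearest_word` are illustrative, and excluding the input words from the candidates is an assumption rather than something specified in this post.

```python
import math

# Toy vectors invented for illustration only; they are not GloVe values.
vectors = {
    "paris":   [0.9, 0.1, 0.8],
    "france":  [0.9, 0.1, 0.1],
    "germany": [0.1, 0.9, 0.1],
    "berlin":  [0.1, 0.9, 0.8],
    "banana":  [0.5, 0.5, -0.7],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_word(target, exclude=()):
    """Return the vocabulary word whose vector is most similar to target."""
    candidates = (word for word in vectors if word not in exclude)
    return max(candidates, key=lambda w: cosine_similarity(vectors[w], target))

# paris - france + germany, computed element by element
query = [
    p - f + g
    for p, f, g in zip(vectors["paris"], vectors["france"], vectors["germany"])
]
answer = nearest_word(query, exclude={"paris", "france", "germany"})
print(answer)
```

With these contrived numbers the nearest remaining word is `berlin`; with real embeddings the result depends on the model and its vocabulary.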

Another way of describing these “equations” is to use the semantics of analogy: “Paris is to France as Berlin is to Germany”. There’s a ratio notation for analogies like this:

``France : Paris :: Germany : Berlin``

These analogies are used to test mental ability. In the US, the Miller Analogies Test is used to assess candidates for postgraduate education. The student is presented with a partial analogy - one in which a term is removed - and then asked to fill in the blank.

Can we use word embeddings to respond to a Miller Analogies Test?

## Analogical reasoning with word embeddings

The Miller Analogies test presents a partial analogy:

``Shard : Pottery :: ? : Wood``

And a set of 4 options:

• `acorn`
• `smoke`
• `chair`
• `splinter`

You need to choose the one which best completes the analogy. In this case, `splinter` is the correct answer.

We can think of the analogies `A : B :: C : D` working like a mathematical formula: `A - B = C - D`. This equation may be rearranged to solve for the missing element. This gives us an expression which we can use to calculate a new vector that represents the blanked-out term in the analogy. The resulting vector is then compared against the multiple-choice options to find the option that most closely resembles the analogical “idea” that results from the embeddings.
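The rearrangement can be sketched in Python as follows. This is an illustration under stated assumptions, not the Clojure implementation used for this post: the function names, the `SIGNS` tuple and the toy vectors are all invented, and the code simply rewrites `A - B = C - D` as `A - B - C + D = 0` and solves for whichever term is missing.

```python
import math

# Coefficients of A, B, C, D in the rearranged equation A - B - C + D = 0.
SIGNS = (1, -1, -1, 1)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def solve_analogy(terms, options, vectors):
    """terms is (A, B, C, D) with exactly one entry set to None."""
    missing = terms.index(None)
    dims = len(next(iter(vectors.values())))
    rest = [0.0] * dims
    for sign, term in zip(SIGNS, terms):
        if term is not None:
            rest = [r + sign * x for r, x in zip(rest, vectors[term])]
    # sign * X + rest = 0, so X = -rest / sign
    target = [-r / SIGNS[missing] for r in rest]
    return max(options, key=lambda word: cosine(vectors[word], target))

# Toy vectors invented for illustration; they are not GloVe values.
toy = {
    "shard":    [0.9, 0.8, 0.0],
    "pottery":  [0.1, 0.9, 0.0],
    "wood":     [0.1, 0.0, 0.9],
    "acorn":    [0.1, 0.0, 0.7],
    "smoke":    [0.0, 0.3, 0.3],
    "chair":    [0.0, 0.1, 0.9],
    "splinter": [0.9, 0.0, 0.8],
}

choice = solve_analogy(
    ("shard", "pottery", None, "wood"),
    ["acorn", "smoke", "chair", "splinter"],
    toy,
)
print(choice)
```

For `Shard : Pottery :: ? : Wood` the target works out as `shard - pottery + wood`, and with these contrived vectors the closest option is `splinter`.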

We can visualise this for the splinter example. We take the vector difference `shard - pottery` (dashed blue line) then add that to `wood` to yield a vector for the missing term `?`. We then choose the nearest option (choices in green).

Note that the nearness is measured with cosine similarity so it’s not exactly the same as the 2d Euclidean distance suggested by the plot (although it ought to be broadly equivalent). Indeed proximity doesn’t necessarily mean anything in t-SNE visualisations. The plot is presented to help explain how we’re solving the partial analogies. The Distill journal (which provides excellent visual explanations of machine-learning) has some insightful guidance on configuring and interpreting t-SNE effectively. In this case the plot is quite contrived - it only includes the 8 terms we’re interested in, and I generated several random variations until I found one that best demonstrated the overall shape that illustrates the idea.

The t-SNE visualisation was prepared in R using the Rtsne package (which wraps the original C++ implementation and is much faster than the pure R tsne version) and the mighty ggplot2.

## Putting the theory to the test

I implemented this partial-analogy solving algorithm using the Deep Learning for Java library to ingest and analyse the GloVe data and a handful of Clojure functions to tie all the linear algebra together.

I then applied the algorithm to some Miller Analogy tests taken from this example set with the kind permission of majortests.com. I had to remove a couple of questions that didn’t use terms present in the vocabulary of the model - e.g. “Sri Lanka” is actually two words and “-ful” etc are suffixes not words.

This approach leads to 9 correct answers of a possible 13 - a 70% success rate. This is far better than the 25% score we would get from responding at random and is even better than the 55% average score for humans reported by majortests.com.

Which then does it get wrong?

``17 : 19 :: ? : 37``

The options presented were: `39`, `36`, `34` and `31`. The algorithm picked `39` while the correct answer was `31` - the question is looking for consecutive pairs of primes. As insightful as the word embeddings are, we wouldn’t have expected the co-incidence of numbers in text (which the embeddings capture) to tell us anything about mathematical properties of those numbers, primes or otherwise (except perhaps something like Benford’s Law which describes the relative incidence of numbers!).

It’s tempting to think of these calculations we do with word embeddings as logical equations. But there’s no mathematical reasoning or understanding of physical cause and effect here. Some results taken from the GloVe dataset again:

``Five - One + Two = Four``

``Fire - Oxygen + Water = Fires``

Another incorrect answer was given to this question:

``? : puccini :: sculpture : opera``

The options presented were: `Cellini`, `Rembrandt`, `Wagner`, and `Petrarch`. The algorithm suggests `Petrarch` whereas the correct answer was `Cellini`. Ultimately the equation here becomes `Puccini - Opera + Sculpture`. It’s hard to say where things went wrong but I wonder if Cellini was penalised as there is an opera named after him - leaving Petrarch as the only Italian. Reversing the relation (`Cellini - Sculpture + Opera`) results in a vector very close to `Puccini`.

This leaves the following partial analogy that we might otherwise have expected the word embeddings to solve correctly:

``Penury : Money :: Starvation : ?``

The options presented were: `Sustenance`, `Infirmity`, `Illness` and `Care`. The algorithm suggests `Care` is the best fit, whereas we are expecting `Sustenance`; as explained by majortests.com, penury is the result of having no money, starvation is the result of having no food. The closest words found by the algorithm (i.e. an open answer, not those options specified in the test) are in the topic area we’d expect: `overwork`, `malnutrition` and `famines`.

The Oxford English Dictionary defines two meanings of `Penury`: the first is poverty, and the second is scarcity. Similarly `Sustenance` can refer to a livelihood as well as food. If we replace `Penury` with `Poverty` or `Sustenance` with `Food` the same algorithm finds the correct answer. This suggests that the multiple meanings are introducing some ambiguity. Curiously these replacements aren’t in the list of 10 words closest to the words we’re substituting them for - there is a substantive distinction to be made.

I think this example serves as a useful reminder that this approach is naively mechanistic. The human reader will automatically distinguish the appropriate meaning by re-appraising the context - the context of starvation means the nutrition-interpretation of sustenance comes to mind much more readily. Perhaps the algorithm could be further improved by doing something similar… We can use the idea of “Dog” to distinguish between “Cat as in Pet” and “Cat as in CAT scan”:

``Cat - Dog = Tomography``

## Applications of word-embeddings

As we’ve seen above, the vectors give a numerical value for the similarity of words. A simple use would be to expand search requests using synonyms and associations or even to automatically generate a thesaurus. This would be much more powerful than a keyword search as a measure of the proximity of meaning is encoded - i.e. a search can return not only the literal word you searched for but also related terms. You could even make distinctions like “please find documents about cat as in scan, not as in pet”.

Compound vectors built-up from word embeddings can be used to represent larger structures too - like sentences or whole documents. These can be used to define the sentiment of a sentence or topic of a document. They could also help to identify documents that are related to a given topic (or corpus) for grouping or retrieval. You don’t necessarily need to identify the topic by name (or a collection of names) - the topic is identified by an abstract vector of numbers - so you can start with a seed document or corpus and request some suggestions of other related ones.
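One simple way to build such a compound vector, sketched below, is to average the vectors of the words in a document and then rank other documents by cosine similarity to a seed. Averaging is an assumption chosen for brevity (there are other ways to combine word vectors), and the two-dimensional vectors and document names are invented.

```python
import math

# Toy two-dimensional word vectors, invented for illustration.
toy_vectors = {
    "cat": [1.0, 0.1], "dog": [0.9, 0.2], "pet": [0.8, 0.3],
    "tax": [0.1, 1.0], "budget": [0.2, 0.9], "spending": [0.3, 0.8],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def document_vector(text, vectors):
    """Average the vectors of the known words in a whitespace-separated text."""
    words = [w for w in text.lower().split() if w in vectors]
    if not words:
        raise ValueError("no known words in document")
    dims = len(vectors[words[0]])
    return [sum(vectors[w][i] for w in words) / len(words) for i in range(dims)]

documents = {
    "pets": "cat dog pet",
    "finance": "tax budget spending",
    "mixed": "dog budget",
}

# Rank the documents by similarity to a seed document; no topic name is needed.
seed = document_vector("cat pet", toy_vectors)
ranked = sorted(
    documents,
    key=lambda name: cosine(document_vector(documents[name], toy_vectors), seed),
    reverse=True,
)
print(ranked)
```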

From retrieval we can extend further into discovery. Swanson linking is the idea of discovering new relationships by finding connections between existing (but separate) knowledge. This is not about finding new knowledge, but a process through which we can find lessons from one area and apply them to another.

Image: by Robin Davis - Own work, CC BY-SA 3.0

More curiously this can also be extended to machine translation or the expansion of bilingual dictionaries. To achieve this feat, the embeddings are initialised using existing translations (“word alignments”) and then further trained monolingually (so they capture word context) or bilingually (so they optimise for between-language translational equivalence as well as learning within-language context). Similarities in semantic structures between two languages can become apparent once they are mapped onto the same feature space. This provides insight into new translations and incidentally makes for a novel way to explore alternative interpretations.

Perhaps the most powerful prospect of word-embeddings, though, is their potential for application in other algorithms that need a base representation of language. An important task in natural-language processing, for example, is that of named-entity recognition (i.e. being able to spot the names of people, companies or places in documents). It has been shown that word embeddings can be used as a feature (i.e. way of describing the input text) when training such systems, leading to improved accuracy.

Finally, although it’s not really a practical application, I’ve been quite entertained playing with these vectors. There is a dense network of meaning deep within the numbers. Of course many of the results are prosaic but others are very pleasing. There is only as much imagination present as you put in - the process is entirely mechanical after all - but the outputs can be surprisingly expressive…

`life - death + stone = bricks`

`productivity - economic + social = absenteeism`

`word - mechanism + creativity = poetry`

`poetry - words + sound = music`

`home - house + building = stadium`

`trust + money - banking = charitable`

`lifestyle + happy - sad = habits`

`habit + happy - sad = eating`

`eating + happy - sad = vegetarian`

`ship - water + earth = spacecraft`

`power - energy + money = bribes`

### Calculate your Bus Factor with Git and R

I’ve written a tutorial for the Linux Voice magazine explaining how you can analyse the robustness of a project from its git repository using the ‘bus factor’ metric.

The bus factor is the number of developers that would need to be hit by a bus before the project they were working on is in serious trouble. Obviously the situation doesn’t need to be that dramatic. It could be as commonplace as having people leave by choice or through sickness etc. The general idea is that the more people who have worked on some code, the more robust the development process.
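As a rough illustration (in Python rather than the R used in the tutorial), the sketch below counts the smallest number of top contributors who together account for more than half of the commits. That threshold-based definition is an assumption made for this example, not necessarily the calculation used in the tutorial, and the author names are invented. A list of authors like this could be obtained from `git log --format=%an`.

```python
from collections import Counter

def bus_factor(commit_authors, threshold=0.5):
    """Smallest number of top contributors whose combined share of commits
    exceeds the threshold. This is one heuristic among several."""
    counts = Counter(commit_authors)
    total = sum(counts.values())
    covered = 0
    for n, (_, commits) in enumerate(counts.most_common(), start=1):
        covered += commits
        if covered / total > threshold:
            return n
    return len(counts)

# Hypothetical commit history: one entry per commit, naming its author.
concentrated = ["ann"] * 6 + ["bob"] * 3 + ["cat"] * 1
shared = ["ann"] * 4 + ["bob"] * 4 + ["cat"] * 2

print(bus_factor(concentrated))
print(bus_factor(shared))
```

A higher number suggests that knowledge of the code is spread across more people.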

The tutorial provides an introduction to the RStudio editor and the popular visualisation package ggplot2. It also demonstrates the analysis with reference to some of the most popular open source projects like the Linux Kernel and OpenSSL.

### Stop Making Pie Charts

Don’t let Excel’s default settings ruin your data analysis!

I gathered together some insights from research into visual perception and interpretation (borrowed from the likes of Edward Tufte, Leland Wilkinson, and Stephen Few) and presented these in a talk which I hope will mean you never look at a pie chart quite the same way again!

The title - Stop Making Pie Charts - is polemic, but I think the idea is quite reasonable - pie charts are, generally speaking, not a good choice of visualisation for communicating quantitative information.

You can find the slides here

The central argument is that the most effective way to encode data in a graphic is with the position of the elements and their distance from a common baseline (like in a scatter plot or bar chart). By contrast, angle and areas (as in a pie chart) are harder to decode accurately.

## In defence of Pie Charts

I’ve given the talk a couple of times now and I’ve been fascinated to hear people’s defence of pie charts. Clearly there’s no single form of visualisation that is the best in every context (although I feel like, given suitably transformed data, the scatter plot comes close) and there are circumstances in which the much-maligned pie is appropriate.

Here are some of the counter-arguments - reasons why you shouldn’t stop using pie charts:

• Pie charts are easy to understand - people are used to seeing them, what they lack in decoding accuracy, they make up for in decoding simplicity and familiarity
• Some values are easy to read on a pie chart - it’s easy to compare against the quartiles (i.e. 25%, 50%, 75%, and 0/100%) even without guidelines
• The circular shape is aesthetically pleasing and can provide variety to decorate dry reports otherwise filled with dots and rectangles
• Sometimes people want to give a subjective representation of the facts - a one-sided perspective (and 3d distortions) can help support a narrative

Indeed if you’re just looking to tell a story - particularly one like “only a very small proportion of people do x” - and you don’t need your audience to decode quantitative data, then pie charts aren’t so bad after all.

Still, if you have an inquisitive audience, complex quantitative data, and find raw objective data points aesthetically pleasing, then perhaps you have no excuse but to stop making pie charts?!

### The Linked Data Mind Set

Linked Data is data that has been structured and published in such a way that it may be interlinked as part of the Semantic Web. In contrast to the traditional web, which is aimed at human readers, the semantic web is designed to be machine readable. It is built upon standard web technologies - HTTP, RDF, and URIs.

I’ve been working with Manchester-based Linked Data pioneers Swirrl to convert open data to linked data format. This experience has opened my eyes to the immense power of linked data. I thought it was simply a good, extensible structure with some nice web-oriented features. What I’ve actually found is some pretty fundamental differences that require quite a change in mind set.

If you’re already familiar with linked-data then jump down to read about the changes in perspective it’s led me to see. If you’re new to the topic or a bit rusty then you might want to read about the basic principles first.

The recently updated RDF Primer 1.1 provides an excellent introduction to RDF. A brief summary follows.

### Everything is a graph

Graphs, in the mathematical sense, are collections of nodes joined by edges. In linked-data this is described in terms of triples - statements which relate a subject to an object via a predicate:

```
<subject> <predicate> <object>
<Bob> <is a> <person>
<Bob> <is a friend of> <Alice>
<Bob> <is born on> <the 4th of July 1990>
```

These statements are typically grouped together into graphs or contexts. A quad statement has a subject, predicate, object, and context (or graph).
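To make the structure concrete, here is a small Python sketch that holds the statements above as plain tuples and groups them by graph. The graph names `people` and `friendships` are hypothetical, and a real triple store would use URIs rather than bare strings.

```python
from collections import defaultdict

# Each quad is (subject, predicate, object, graph).
quads = [
    ("Bob", "is a", "person", "people"),
    ("Bob", "is a friend of", "Alice", "friendships"),
    ("Bob", "is born on", "the 4th of July 1990", "people"),
]

# Group the triples by the graph (context) they belong to.
graphs = defaultdict(set)
for subject, predicate, obj, graph in quads:
    graphs[graph].add((subject, predicate, obj))

# In graph terms, subjects and objects are nodes and predicates label the edges.
nodes = {s for s, _, _, _ in quads} | {o for _, _, o, _ in quads}

print(sorted(graphs))
print(sorted(nodes))
```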

### URIs and Literals

The subjects and predicates are all identifiers: symbolic representations that are supposed to be globally unique, called uniform resource identifiers (URIs). URIs are much like URLs (Uniform Resource Locators) that you may be familiar with using to find web pages (this “finding” process - requesting a URL in your browser to get a web page in response - is more technically known as “dereferencing”). URIs are a superset of URLs which also include URNs (Uniform Resource Names) such as ISBNs (International Standard Book Numbers).

The objects can also be URIs or they can take the form of literal values (like strings, numbers and dates).

### Turtle and SPARQL

There are a number of serialisation formats for RDF. By far the most readable is Turtle.

```
BASE   <http://example.org/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX schema: <http://schema.org/>

<bob#me>
    a foaf:Person ;
    foaf:knows <alice#me> ;
    schema:birthDate "1990-07-04"^^xsd:date .
```

SPARQL is a query language for RDF. The query below selects Bob.

``SELECT ?person WHERE { ?person foaf:knows <http://example.org/alice#me> }``
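To show roughly what such a query does, the Python sketch below matches a single triple pattern against a list of triples, binding any term that starts with `?` as a variable. It is a toy stand-in for one basic pattern, not a SPARQL engine, and the `match` function and constant names are invented for this example.

```python
EX = "http://example.org/"
FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# The Turtle example above, expanded to full URIs (the literal is kept as a string).
triples = [
    (EX + "bob#me", RDF_TYPE, FOAF + "Person"),
    (EX + "bob#me", FOAF + "knows", EX + "alice#me"),
    (EX + "bob#me", "http://schema.org/birthDate", "1990-07-04"),
]

def match(pattern, triples):
    """Return one dict of variable bindings per triple that fits the pattern."""
    results = []
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable must bind to the same value everywhere it appears.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            results.append(binding)
    return results

people = match(("?person", FOAF + "knows", EX + "alice#me"), triples)
print(people)
```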

## Thinking with a Linked Data Mindset

Now that we’ve established the basics, we can go on to consider how this perspective can lead to a different mindset.

### There’s no distinction between data and metadata

Metadata is data that describes data. For example, the date a dataset was published. In traditional spreadsheets there’s not always an obvious place to put this information. It’s recorded in the filename or on a “miscellaneous details” sheet. This isn’t ideal as a) it’s not generally referenceable, and b) it is easily lost if it’s not copied around with the data itself.

In RDF, metadata is stored in essentially the same way as data. It’s triples all the way down! Certainly there are some vocabularies that are designed for metadata purposes (Dublin Core, VOID, etc) but the content is described using the same structures and is amenable to the same sorts of interrogation techniques.

This makes a lot of sense when you think about it. Metadata serves two purposes: to enable discovery and to allow the recording of facts that wouldn’t otherwise fit.

Discovery is the process of finding data relevant to your interests. Metadata summarises the scope of a dataset so that we can make requests like: “show me all of the datasets published since XXXX about YYYY available on a neighbourhood level”. But this question could be answered with the data itself. The distinction between metadata and data exists, in large part, because of the way we package data. That is to say we typically present data in spreadsheets where the content and scope cannot be accessed without the user first acquiring and then interpreting the data. Obviously this can’t be done in bulk unless the spreadsheets follow a common schema (some human interaction is otherwise necessary to prepare the data). If we remove the data from these packages, and allow deep inspection of its content, then discovery can be achieved without resorting to a separate metadata index (although metadata descriptions can still make the process more efficient).

The recording of facts that don’t fit is usually a problem for metadata because it doesn’t vary along the dimensions of the dataset in the traditional (tabular) way data is usually presented. This isn’t a problem for linked data.

### The entity-relationship model doesn’t (always) fit

The capacity of entity-relationship models is demonstrated by the popularity of object-oriented programming and relational-databases. Linked-data too can represent entity-relationship models very naturally. The typical problem with the ER approach is that there’s so often an exception to the rule. A given entity doesn’t fit with the others and has a few odd properties that don’t apply to everything else. Different relationships between instances of the same two types (typically recorded with primary/ foreign keys) are qualitatively different. Since in ER, information about an object is stored within it, the data model can become brittle. In linked-data, properties can be defined quite apart from objects.

### There’s no schema: arbitrary data can be added anywhere

In a traditional table representation, it’s awkward to add arbitrary data. If you want to add a datum that doesn’t fit into the schema then the schema must be modified. Adding new columns for a single datum is wasteful, and quickly leads to a bloated and confusing list of seldom-used fields.

In part, this frustration gave rise to the Schemaless/ NoSQL databases. These systems sit at the other end of the scale. Without any structure it can be complex to make queries and maintain data integrity. These problems are shifted from the database to the application layer.

In a graph representation, anything can be added anywhere. The schema is in the data itself and we can decide how much structure (like constraints and datatypes) we want to add.

### The data is self-describing

This flexibility - the ability to add arbitrary facts without the constriction of a schema - can certainly seem daunting. Without a schema what is going to prevent errors, provide guarantees, or ensure consistency? In fact linked-data does have a schema of sorts. Vocabularies are used to describe the data. A few popular ontologies are worth mentioning:

• RDFS: the RDF Schema extends the basic RDF vocabulary to include a class and property system.
• OWL: the Web Ontology Language is designed to represent rich and complex knowledge about things, groups of things, and relations between things
• SKOS: provides a model for expressing the basic structure and content of concept schemes such as thesauri, classification schemes, subject heading lists, taxonomies, folksonomies, and other similar types of controlled vocabulary.

### There’s no one right way to do things

The flexibility of the data format means that there are often several ways to model the same dataset. This can lead to a sort of options-paralysis! It often pays to make a choice for the sake of progress, then review it later once more of the pieces of the puzzle are in place. Realising that it doesn’t need to be perfect first time is certainly liberating.

### Naming is hard

Naming is one of the hardest problems in programming. Linked data modelling is 90% naming. The Linked Data Patterns book provides some useful suggestions for how to approach naming (URI design) in a range of contexts.

Identifiers have value: clarifying ambiguity, promoting consensus, providing reliability, ensuring stability, and facilitating integration.

### Vocabularies aren’t settled

When developing a linked-data model, it’s vital to understand the work done by others before you. After all, you need to adopt other vocabularies and URIs in order to link your data to the rest of the semantic web. There are lots of alternatives. The Linked Open Vocabularies site provides a way to search and compare vocabularies to help you decide which to use.

In summary:

• Metadata can be data too, don’t treat it as a second class citizen
• Use entities if it helps, but don’t get too hung-up on them
• Use the core vocabularies to bring commonly understood structure to your data
• Experiment with different models to see what works best for your data and applications
• Create identifiers - it might be hard to start with, but everybody benefits in the long-term
• Stand on the shoulders of giants - follow patterns and adopt vocabularies

### How Information Entropy teaches us to Improve Data Quality

I’m often asked by data-owners for guidance on sharing data, whether it’s with me on consulting engagements or by organisations looking to release the potential of their open data.

A great place to start is the 5 star deployment scheme which describes a maturity curve for open data:

1. ★ make your stuff available on the Web (i.e. in whatever format) under an open license
2. ★★ make it available as structured machine-readable data (e.g. Excel instead of image scan of a table)
3. ★★★ use non-proprietary formats (e.g. CSV instead of Excel)
4. ★★★★ use URIs to denote things, so that people can point at your stuff
5. ★★★★★ link your data to other data to provide context

This scheme certainly provides a strategic overview (release early/ improve later, embrace openness, aim to create linked open data) but it doesn’t say much about specific questions such as: how should the data be structured or presented and what should it include?

I have prepared the advice below based on my experience as a consumer of data, and on the common obstacles to analysis that might have been avoided had the data been prepared in the right way.

In writing this, it occurs to me that the general principle is to increase information entropy. Information entropy is a measure of the expected value of a message. It is higher when that message (once delivered) is able to resolve more uncertainty. That is to say, that the message is able to say more things, more clearly, that are novel to the recipient.
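For readers who want to see the measure itself, the sketch below computes Shannon entropy in bits, H = -Σ p·log2(p), over the observed values in a column. The function name and the sample ages are invented for illustration; the example compares a column of precise values with the same column collapsed into a single band.

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Entropy in bits of the empirical distribution of the given values."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Invented example: four precise ages versus the same people in one broad band.
precise_ages = [23, 35, 47, 68]
banded_ages = ["16-74", "16-74", "16-74", "16-74"]

print(shannon_entropy(precise_ages))
print(shannon_entropy(banded_ages))
```

Four distinct, equally likely values carry two bits each, whereas a column in which every row has the same value carries none.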

## More is usually better than less (but don’t just repeat everybody else)

While it is (comparatively) easy to ignore irrelevant or useless data, it is impossible to consider data that you don’t have. If it’s easy enough to share everything then do so. Bandwidth is cheap and it’s relatively straightforward to filter data. Those analysing your data may have a different perspective on what’s useful - you don’t know what they don’t know.

This may be inefficient, particularly if the receiver is already in possession of the data you’re sending. Where your data set includes data from a third party it may be better to provide a linking index to that data, rather than to replicate it wholesale. Indeed even if the data you have available to release is small, it may be made larger through linking it to other sources.

## Codes and Codelists allow for linking (which makes your data more valuable)

There are positive network effects to data linking - the value of data grows exponentially as not only may it be linked with other data, but that other data may be linked with it. Indeed, perhaps the most valuable data sources of all are the indices that allow for linking between datasets. This is often called reference data - sets of permissible values that ensure that two datasets refer to a common concept in the same terms. The quality of a dataset may be improved by adding reference data or codes from standard code lists. A typical example of this is the Government Statistical Service codes that the ONS use to identify geographic areas in the UK (this is much preferred over area names, which can’t be linked when differences in spelling prevent matching - “Bristol” or “Bristol, City of”, it’s all E06000023 to me!).
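The Bristol example can be sketched in a few lines of Python. The two datasets and all of their figures are invented; only the code `E06000023` comes from the text above.

```python
# Two hypothetical datasets keyed by area name; the spellings differ.
population_by_name = {"Bristol": 100}
jobs_by_name = {"Bristol, City of": 40}
name_matches = population_by_name.keys() & jobs_by_name.keys()

# The same datasets keyed by a standard area code.
population_by_code = {"E06000023": 100}
jobs_by_code = {"E06000023": 40}
linked = {
    code: (population_by_code[code], jobs_by_code[code])
    for code in population_by_code.keys() & jobs_by_code.keys()
}

print(name_matches)  # no rows can be joined on name
print(linked)        # the shared code links the two records
```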

If you’re creating your own codelist it ought to follow the C.E.M.E. principle - Comprehensively Exhaustive and Mutually Exclusive. If the codes don’t cover a significant category you’ll have lots of “other”s which will basically render the codelist useless. If the codes overlap then they can’t be compared and the offending codes will ultimately need to be combined.
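For codes that partition a numeric range (age bands, say), both properties can be checked mechanically. The sketch below is one possible check, assuming half-open bands `[start, end)`; the function name and the example bands are hypothetical, and purely categorical codelists would need a different test.

```python
def check_bands(bands, lower, upper):
    """bands is a list of (label, start, end) covering [start, end).
    Returns a list of problems; an empty list means the bands are
    exhaustive over [lower, upper) and mutually exclusive."""
    problems = []
    expected = lower
    for label, start, end in sorted(bands, key=lambda band: band[1]):
        if start > expected:
            problems.append(f"gap: nothing covers {expected} to {start}")
        elif start < expected:
            problems.append(f"overlap: {label} starts at {start}, before {expected}")
        expected = max(expected, end)
    if expected < upper:
        problems.append(f"gap: nothing covers {expected} to {upper}")
    return problems

good = [("0-15", 0, 16), ("16-64", 16, 65), ("65+", 65, 120)]
bad = [("0-17", 0, 18), ("16-64", 16, 65), ("70+", 70, 120)]

print(check_bands(good, 0, 120))
print(check_bands(bad, 0, 120))
```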

## Normalised data is more reliable and more efficient

Here I’m referring to database normalisation, rather than statistical normalisation. A normalised database is one with a minimum redundancy - the same data isn’t repeated in multiple places. Look-up tables are used, for example, so that a categorical variable doesn’t need to have its categories repeated (and possibly misspelled). If you have a table with two or more rows that need to be changed at the same time (because in some place they’re referring to the same thing) then some normalisation is required.

Database normalisation ensures integrity (otherwise if two things purporting to be the same are different then how do you know which one is right?) and efficiency (repetition is waste).
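A small sketch of the look-up table idea, using plain Python structures in place of database tables. The firms, sector names and the misspelling are invented.

```python
# Denormalised: the category text is repeated on every row, and one copy is misspelled.
flat = [
    {"firm": "A", "sector": "Manufacturing"},
    {"firm": "B", "sector": "Manufacturing"},
    {"firm": "C", "sector": "Manufacturng"},
]
flat_spellings = {row["sector"] for row in flat}

# Normalised: rows hold a key and the name lives in exactly one place.
sectors = {1: "Manufacturing"}
firms = [
    {"firm": "A", "sector_id": 1},
    {"firm": "B", "sector_id": 1},
    {"firm": "C", "sector_id": 1},
]

# Renaming the category is a single change that every row picks up.
sectors[1] = "Manufacturing and engineering"
resolved = {sectors[row["sector_id"]] for row in firms}

print(flat_spellings)
print(resolved)
```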

## Be precise, allow data users to simplify (as unsimplification isn’t possible)

Be wary about introducing codes where they’re unnecessary. It’s unfortunately quite common to see a continuous variable represented by categories. This seems to be particularly common with Age. The problem is, of course, that different datasets make different choices about the age intervals, and so can’t be compared. One might use ‘working age’ 16-74 and another ‘adult’ 15+. Unless data with the original precision can be found, then the analyst will need to apportion or interpolate values in between categories.

Categories that do not divide a continuous dimension evenly are also problematic. This is particularly common in survey data, where respondents are presented with a closed-list of intervals as options, rather than being asked to provide an estimate of the value itself. The result is often that the majority of responses fall into one category, with few in the others. Presenting a closed-list of options is sometimes to be preferred for other reasons (e.g. in questions about income, categories might elicit more responses) - if so the bounds should be chosen with reference to the expected frequencies of responses not the linear scale of the dimension (i.e. the categories should have similar numbers of observations in them, not occupy similar sized intervals along the range of the variable being categorised).

Precise data can be codified into less precise data. The reverse process is not possible (or at least not accurately).
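The one-way nature of this process can be sketched with the two age schemes mentioned earlier. The ages are invented, and the `band` helper and band labels are assumptions made for this example.

```python
import bisect

ages = [15, 16, 30, 74, 80]  # invented precise values

def band(age, cut_points, labels):
    """Label of the band that age falls into; cut_points are lower bounds."""
    return labels[bisect.bisect_right(cut_points, age)]

scheme_a = ([16, 75], ["under 16", "16-74", "75+"])  # 'working age' 16-74
scheme_b = ([15], ["under 15", "15+"])               # 'adult' 15+

# Precise data can be coded under either scheme.
banded_a = [band(age, *scheme_a) for age in ages]
banded_b = [band(age, *scheme_b) for age in ages]

print(banded_a)
print(banded_b)
```

Starting from `banded_a` alone, there is no way to tell whether someone labelled `under 16` counts as `15+` under the second scheme, so the recoding only works from the precise values.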

## Represent Nothingness accurately (be clear even when you don’t know)

It’s important to distinguish between different types of nothingness. Nothing can be:

• Not available - where no value has been provided (the value is unknown);
• Null - where the value is known to be nothing;
• Zero - which is actually a specific number (although it may sometimes be used to represent null).

A blank space or a number defaulting to 0 could be any of these types of nothingness. Not knowing which type of nothing you’re dealing with can undermine analysis.
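To illustrate how the distinction affects analysis, the sketch below keeps the three kinds of nothing separate using an explicit marker type and compares two averages. The `Nothing` enum and the survey responses are hypothetical; many tools offer their own missing-value markers instead.

```python
from enum import Enum

class Nothing(Enum):
    NOT_AVAILABLE = "not available"  # no value was provided
    NULL = "null"                    # the value is known to be nothing

# Invented responses: a number, a true zero, a null and an unknown.
responses = [10, 0, Nothing.NULL, Nothing.NOT_AVAILABLE]

# Treating every kind of nothing as 0 drags the average down.
naive = [v if isinstance(v, int) else 0 for v in responses]
naive_mean = sum(naive) / len(naive)

# Keeping the kinds apart lets us average only the real numbers
# and report how many values are simply unknown.
numbers = [v for v in responses if isinstance(v, int)]
careful_mean = sum(numbers) / len(numbers)
unknown_count = sum(1 for v in responses if v is Nothing.NOT_AVAILABLE)

print(naive_mean, careful_mean, unknown_count)
```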

Metadata is data about data. It describes provenance (how the data was collected or derived) and coverage (e.g. years, places, limits to scope, criteria for categories), and provides warnings about assumptions and their implications for interpretation.

Metadata isn’t just a descriptive narrative. It can be analysed as data itself. It can tell someone whether or not your data is relevant to their requirements without them having to download and review it.

## In summary - increase information entropy

These tips are all related to a general principle of increasing entropy. As explained above, information entropy is a measure of the expected value of a message. It is higher when that message (once delivered) is able to resolve more uncertainty. That is to say, that the message is able to say more things, more clearly, that are novel to the recipient.

• More data, whether in the original release or in the other sources that may be linked to it, means more variety, which means more uncertainty can be resolved, and thus more value provided.
• Duplication (and thus the potential for inconsistency) in the message means that it doesn’t resolve uncertainty, and thus doesn’t add value.
• Normalised data retains the same variety in a smaller, clearer message.
• Precise data can take on more possible values and thus clarify more uncertainty than codified data.
• Inaccurately represented nothingness also means that the message isn’t able to resolve uncertainty (about which type of nothing applies).

Herein lies a counter-intuitive aspect of releasing data. It seems to be sensible to reduce variety and uncertainty in the data, to make sense and interpret the raw data before it is presented. To provide more rather than less ordered data. In fact such actions make the data less informative, and make it harder to re-interpret the data in a wider range of contexts. Indeed much of the impetus behind Big Data is the recognition that unstructured, raw data has immense information potential. It is the capacity for re-interpretation that makes data valuable.

# Robin Gower's Page

## Profile Information

Short Bio:
I'm a self-employed consultant. My company Infonomics provides economic development consultancy and information services to public, private and social sectors.
My Website or LinkedIn Profile (URL):
http://infonomics.ltd.uk

## Robin Gower's Blog

### Invitation to an Introduction to GNU-R

Posted on September 24, 2010 at 7:51am

I'll be giving a talk[1] in October to introduce people to GNU-R[2] - a popular and free statistical language and computing environment.

The talk is being hosted by the Manchester Free Software group[3] and will be held at the Madlab[4] on 19/10/10 19:00-20:30.

Naturally I'll be taking questions on the day but if you can think of any particular topics that you would like me to cover then please post a comment with your suggestions.

I look forward to seeing you…

## Comment Wall (1 comment)


At 3:42pm on February 19, 2008, Robin Gower said…
As you've no doubt learned, work experience is vital. My first break came when, after many knock-backs, I applied for a job that I thought was "beneath" my skills/ salary expectations. I wasn't in the post long before my new employer realised what I was worth and I was soon promoted. It's very hard to convince people of your worth without having credentials to back-up your abilities (I'm afraid academic qualifications alone are insufficient). You might consider volunteering (charities will welcome analytical support) to build-up your CV.

The general rule for job searching is to keep your options open - keep thinking of alternatives. If you can't find work in statistical/ analytical businesses then apply for analytical jobs in other industries.

Keep your chin up, and don't let the rejections grind you down!

Let me know how you get on...