Surname Distribution Studies


 

 

 

 

Brief Checklist

Not an exhaustive list; just the basics

 

Preface: Cautionary Note
Background: Essential background reading:-
  • Colin Rogers, The Surname Detective
  • David Hey, Family Names and Family History
  • Oxford Companion to Local and Family History
Contemporary distribution: Consult
  • Telephone Directories
  • Electoral Registers
  • Survey of Contemporary Names (The distribution of 16,000 names available here)
1881 Census: Extraction from census, manually or with LDS Companion
1851 Census: No National Index, but most Family History Societies have indexed 'their' county
GRO data: Births, marriages and deaths by registration district
Hearth Tax: Many counties available in print or will be through the Roehampton program
Poll Tax: The real thing: back almost to the era when hereditary surnames were formed

 

 

 

Preface

Before plunging enthusiastically into this topic, the following preface, which sets out the case against, might be salutary.

A) Variant or Not?

In the 1960s, using the GRO birth indexes for 1850, Francis Leeson mapped the distribution of his name and what he regarded as its variants: Lee, Lees, Leigh, Lea, Ley, Leese, Leeson, Leason. The resulting plots revealed discrete areas, which remained the same even when compared with a 1960s telephone survey.

A reply was made to this article by Dr Reaney, who criticised the distributions on several fronts:-

1) A plot of the modern spelling does not necessarily equate with the original form or distribution: "Lee, Lea, Ley, Lay and Leigh are all one surname. They all go back ultimately to OE leah and both surnames and place-names have a variety of forms; the different modern spellings may be partly due to ME grammar, partly due to the local dialect or simply to mere chance...Parish Registers did not begin until long after surnames became fixed; they are not necessarily proof of the original distribution."
It would have taken maybe just 1 fertile family to migrate in 1530, to give a false impression of the home of a name. Especially if that name did not appear elsewhere in mediaeval documents.

2) A plot cannot be made comparing a root name and its variants, unless one is totally sure that the supposed variant did derive etymologically from the root name. Reaney points out, that in his opinion, Leese (from OE laes- 'pasture') and Leeson - derived from 'son of Lece'- are not variants of the root names Lee, Lea, Ley, Lay and Leigh.
Reaney has subsequently been criticised for over-reliance on etymology, but I think his general points should be borne in mind by anyone plotting any kind of surname distribution.

B) Surname corruption

George Redmonds, in his study of Yorkshire surnames, has shown the amazing variability of surnames.
"In addition to the obvious variations associated with the distortion of vowel sounds and the confusion when pronouncing consonants, the author draws attention to the remarkably high incidence of elision and truncation, as well as the introduction of so-called prosthetic consonants such as Y, W or S to preface some surnames beginning with a vowel. He also notes that the final consonant of a first name may transfer to the surname, citing Thomas Anderson alias Saunderson and John Nellis alias Ellis."
Book Review in The Escutcheon of - Surnames and Genealogy
Dramatic changes could also occur to the final syllable of surnames. For example -Whithalghe/Whitalk/Whitack and Astmough/Astmall/Asman/Asmond. Surnames such as these seem to have had very little stress on the final syllable - it was left to the listener to decide their own interpretation -often in perpetuity.

If you collect the occurrences of a name from, say, the Hearth Tax, how do you know that the name is what you think it is, unless you investigate the genealogy of each bearer?
Surname dictionaries will be of little help, because they tend to ignore local corrupted forms. Surname dictionaries concentrate on the earliest form of a name: surname corruption comes much later.

As George Redmonds says, each occurrence of a surname should be treated as being unique.

End of the cautionary preface

 

Snapshots


Part 1 - Contemporary


Some of the potentially really useful national and comprehensive sources are inaccessible to us - such as the National Health Service Central Register at Southport or the Social Security Central Register at Newcastle upon Tyne.

Plotting by postcode

  example number mailboxes households covered
All unit postcodes PO1 2ST 1.7 million 24 million 15-16 average
Postcode sector PO1 2 9,100 2,000  
Postcode districts PO1 2,900 20,000  
Postcode area PO 125 200,000  
The above figures are not exact: check how many delivery points your own postcode covers.
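
For anyone holding a list of addresses with full unit postcodes, the four plotting levels in the table above can be derived from the postcode itself. The following is only a minimal sketch: the function name `split_postcode` is my own invention, and it assumes each postcode is written with a space between its outward and inward halves (as in PO1 2ST).

```python
import re

def split_postcode(unit_postcode):
    """Break a unit postcode such as 'PO1 2ST' into its plotting levels.

    Assumes the outward and inward codes are separated by a space.
    """
    outward, inward = unit_postcode.upper().split()
    # The postcode area is the run of letters that starts the outward code.
    area = re.match(r"[A-Z]+", outward).group(0)
    return {
        "area": area,                        # e.g. 'PO'
        "district": outward,                 # e.g. 'PO1'
        "sector": f"{outward} {inward[0]}",  # e.g. 'PO1 2'
        "unit": f"{outward} {inward}",       # e.g. 'PO1 2ST'
    }

print(split_postcode("PO1 2ST"))
```

Counting surname occurrences by the "area" value gives totals that can be set against the postcode area percentages listed further down this page.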

The following is a rough guide to the percentage of the Scotland/Wales/England population in each postcode area, and the proportion aged under 18. Non-mainland postcodes not yet included are those for Belfast, Jersey, Guernsey and the Isle of Man. The figures were compiled prior to the publication of the ONS Census 2001 postcode area figures, but still appear to be in line with them.

P code P area % GB pop % aged 0-17
AB Aberdeen 0.8  
AL St Albans 0.4 23
B Birmingham 3.0 25
BA Bath 0.7 22
BB Blackburn 0.8 26
BD Bradford 0.9 26
BH Bournemouth 0.9 19
BL Bolton 0.6 24
BN Brighton 1.4 15
BR Bromley 0.5 22
BS Bristol 1.6 22
CA Carlisle 0.5 21
CB Cambridge 0.7 21
CF Cardiff 1.7 24
CH Chester 1.1 23
CM Chelmsford 1.0 23
CO Colchester 0.7 21
CR Croydon 0.6 25
CT Canterbury 0.8 22
CV Coventry 1.3 23
CW Crewe 0.5 23
DA Dartford 0.7 24
DD Dundee 0.5  
DE Derby 1.2 23
DG Dumfries 0.3  
DH Durham 0.5 21
DL Darlington 0.6 22
DN Doncaster 1.2 24
DT Dorchester 0.4 21
DY Dudley 0.7 22
E London E 1.3 26
EC London EC 0.05 16
EH Edinburgh 1.4  
EN Enfield 0.5 23
EX Exeter 0.9 20
FK Falkirk 0.4 -
FY Blackpool 0.5 21
G Glasgow 2.1  
GL Gloucester 1.0 22
GU Guildford 1.2 23
HA Harrow 0.7 23
HD Huddersfield 0.4 23
HG Harrogate 0.2 22
HP Hemel Hempstead 0.8 24
HR Hereford 0.3 22
HS Harris 0.05  
HU Hull 0.7 23
HX Halifax 0.3 24
IG Ilford 0.5 25
IP Ipswich 1.0 22
IV Inverness 0.3  
KA Kilmarnock 0.6  
KT Kingston-upon-Thames 0.9 22
KW Kirkwall 0.1  
KY Kirkcaldy 0.6  
L Liverpool 1.5 24
LA Lancaster 0.6 21
LD Llandrindod Wells 0.1 22
LE Leicester 1.6 23
LL Llandudno 0.9 22
LN Lincoln 0.5 22
LS Leeds 1.3 22
LU Luton 0.5 26
M Manchester 1.8 23
ME Medway 0.9 24
MK Milton Keynes 0.8 25
ML Motherwell 0.6  
N London N 1.3 23
NE Newcastle-upon-Tyne 2.0 22
NG Nottingham 1.9 22
NN Northampton 1.0 24
NP Newport 0.8 24
NR Norwich 1.2 21
NW London NW 0.9 21
OL Oldham 0.8 26
OX Oxford 1.0 22
PA Paisley 0.6  
PE Peterborough 1.4 22
PH Perth 0.3  
PL Plymouth 0.9 22
PO Portsmouth 1.4 21
PR Preston 0.9 22
RG Reading 1.3 23
RH Redhill 0.8 17
RM Romford 0.8 17
S Sheffield 2.3 22
SA Swansea 1.2 22
SE London SE 1.5 23
SG Stevenage 0.6 24
SK Stockport 1.0 23
SL Slough 0.6 23
SM Sutton 0.4 23
SN Swindon 0.7 23
SO Southampton 1.1 22
SP Salisbury 0.4 22
SR Sunderland 0.4 22
SS Southend-on-Sea 0.9 23
ST Stoke-on-Trent 1.1 22
SW London SW 1.5 18
SY Shrewsbury 0.6 22
TA Taunton 0.5 21
TD Galashiels 0.2 20
TF Telford 0.3 24
TN Tunbridge Wells 1.1 23
TQ Torquay 0.5 20
TR Truro 0.5 21
TS Cleveland 1.0 24
TW Twickenham 0.8 22
UB Southall 0.6 25
W London W 0.9 18
WA Warrington 1.0 23
WC London WC 0.07 15
WD Watford 0.4 23
WF Wakefield 0.8 24
WN Wigan 0.5 23
WR Worcester 0.5 22
WS Walsall 0.7 24
WV Wolverhampton 0.6 23
YO York 0.9 21
ZE Lerwick 0.04  


Scotland

The population of Scotland at the 2001 Census was 5,062,011.
The percentage of that population in each Scottish postcode area was about:-

AB 9.12   KA 7.23
DD 5.36   KW 0.99
DG 2.92   KY 6.86
EH 16.01   ML 7.23
FK 5.13   PA 6.38
G 23.01   PH 3.00
HS 0.52   TD 1.76
IV 4.05   ZE 0.43
Scottish Sector postcode populations : 2001 Census

 

Postcode Atlases
  • Geoplan Postcode Atlas -Geoplan (1997) isbn:0952761815 -also on CD-Rom
  • Postcode Atlas of Great Britain and Northern Ireland -Collins (Dec 2004) isbn 0007191979



UK Electoral Rolls on CDROM

Pluses

Minuses

Although the disk is expensive to purchase, there is a fee-based extraction service available from People Finders UK.

The Ward is a unit common both to Electoral and contemporary Census geography. To learn how the modern census is administered, plus a list of all hierarchical divisions -county, district, ward, enumeration district, visit the Census Dissemination Unit
(1) "Only 85% of those who said they did not vote in the 2001 general election were actually registered to do so and 29% of young people aged 18-24 and 19% of minority ethnic groups indicated in a sample survey that the reason for not voting was that they were not registered"
(2) "Looking at ethnic minority communities, 27% of black non-voters and 15% of Asian non-voters reported that they were not registered, although these figures were drawn from a small base-size"

UK parliamentary elections- numbers registered to vote
2001 44,403,238
1997 43,846,152
1992 43,275,316

Changes to the register tend to affect between 0.1% and 0.5% of electorate in any given month 
Sources: (1) The Electoral Registration Process : Report and Recommendations (The Electoral Commission 2003) and
(2) Election 2001: the official results (Politico's 2001)

 

UK-INFO Disk

Pluses

Minuses

Up to now, the UK-Info disk could not be recommended for surname distribution analysis, where accuracy in the totality of numbers is so important. The latest disk seems at first to have a much better coverage as a percentage of the population. This is due however to the many duplications in entries caused by Postcode changes. Ensuring that the source is one of 'clean data' is vital in our study.

Telephone Directories

These now come in a variety of formats - Online, Cd-Rom, and printed. However, the telephone directory -whatever its format- suffers from a major proviso -the increasing number of unlisted telephone numbers.

"Although the national average for ex-dir is about 37% the figures do vary enormously between counties, being lowest in northern England and *much* higher in southern England.  So for any surname you will get perhaps 80% listed if they live in a northern county, but less than 50% listed in southern counties, especially East/West Sussex, Hampshire, Surrey, Kent etc.    This imbalance in ex-dir status can be significant in surnames with small numbers, but probably less so with the more common surnames." (John Wynn)
 

The latest online version, PhoneNetUk, is extremely disappointing for our purposes. A regional qualifier is mandatory (under the terms of the licensing authority), so no national searches are possible. The inclusion of postal codes is erratic, and where they do appear they are truncated to the outward code alone.
With the CD, national searches are allowable, but only the first 200 entries are displayed (with full postcode). A tweak is possible to derive statistics of a surname by region, if the number of occurrences exceeds 200. A visit to the local library will probably be required to consult the printed telephone directories.

Colin Rogers has listed the disadvantages of using printed telephone directories:-

 

He adds:-

"British Telecom has an Archives and Historical Information Centre at 2-4 Temple Avenue, London EC4Y OHL which is open to the public...it holds an almost complete set of telephone directories from 1879 when the first publically available system was introduced into Great Britain."

Mr Rogers is sceptical about the usefulness of pre-1950 telephone directories for our purposes, their coverage of the population being so small. However, they might be useful as pointers for the study of relatively high-frequency names.

 

National Health Service Central Register [NHSCR]

This database of 60 million names is not available in its entirety - but you can look at an individual frequency. The NHS Central Register is prone to list inflation, and some of the results are surprising, so treat with extreme caution. The whole database does have linguistic possibilities. For a paraphrased potted history of the NHSCR

Survey of Contemporary Surnames

Despite these limitations, a major and significant survey was conducted of the surnames of Britain, using the printed telephone directories 1980-1996. The survey was led by Patrick Hanks and Kate Hardcastle in order to establish those names deemed to be of significance for 'A Dictionary of Surnames' OUP, 1988. The result was 16,000 surnames with a frequency of more than 20 occurrences in any particular directory.
A full listing of the distribution of all the names can be found by following this link

This is a major survey, whose results are important to anyone wishing to compare surname frequencies and distributions, especially between 1881 and today. It is of particular use in identifying homophonic surnames that have completely different distributions, e.g. Adie and Adey: one Scottish, the other West Midlands.

 



International data sources
The publication of national telephone directories on CD has been used by geneticists to study isonymic rates for individual countries. Onomastic studies based on national datasets are much rarer, but hopefully will increase.

  Format dataset size (names) Publication based on data source
Austria 1996 telephone CD 4 million Barrai I and others. 'Elements of the Surname Structure of Austria.'
Annals of Human Biology 27, no. 6(November 2000-December 2000): 607- 22.
Belgium telephone CD [future online source]   Barrai I.; Rodriguez-Larralde A.; Manni F.; Ruggiero V.; Tartari D.; Scapoli C. 'Isolation by Language and Distance in Belgium'. Annals of Human Genetics, January 2003, vol. 68, no. 1, pp. 1-16
Canada 1996 telephone CD 12 million D K Tucker 'Distribution of forenames, surnames and forename pairs in Canada' Names 50 no. 2 (June 2002), 105-132
Denmark Danish Central Civil Register 6.5+ million Sondergaard, Georg. 'Computer Databank of Danish Names' Names , no. 38(1990): 21-30.
Estonia Corpus Nominum Gentilium Estonicorum [online] c 74,000  
Finland     Poyhonen, Juhani. Suomalainen Sukunimikartasto . [Atlas of Finnish Surnames]. Helsinki: Suomalaisen Kirjallisuuden Seura, 1998.
France Insee datasets of births 1891-1915 and 1916-1940   Darlu, Pierre, Anna Degioanni, and Jacques Ruffie. 'Quelques Statistiques Sur La Distribution Des Patronymes En France.' Population [Paris]52, no. 3(1997): 607-34.
Germany telephone CD ?   Rodriguez-Larralde, A.; Barrai, I.; Scapoli, C. 'Isonymy and Isolation by Distance in Germany'. Human biology, 1998, vol. 70, no. 6, pp. 1041
Israel   4 million+ Eliassaf, Nissim. 'Names Survey in the Population Administration : State of Israel.' Names , no. 29 (1981): 273- 84
Italy telephone CD ?   Barrai, I.; Rodriguez-Larralde, A.; Scapoli, 'Isonymy and Isolation by Distance in Italy'. Human biology, 1999, vol. 71, no. 6, pp. 947
Italy- Sicily telephone CD ?   Rodriguez Larralde, A. and others. 'Isonymy and the Genetic Structure of Sicily.' Journal of Biosocial Science 26, no. 1(1994): 9-24.
Japan     Miyazima S and others. 'Power-Law Distribution of Family Names in Japanese Societies.' Physica A 278, no. 1-2(April 2000): 282-88.
Netherlands Instituut Meertens [online] 27,000 'Grinding one's teeth. Linkage of surnames in the Database of Surnames in The Netherlands' by Leendert Brouwer 21st International Congress of Onomastic Sciences Uppsala, August 19-24, 2002
Norway      
New Zealand      
Russia     Balanovsky O.P., Buzhilova A.P., and Balanovskaya E.V. 'The Russian Gene Pool: Gene Geography of Surnames.' Russian Journal of Genetics 37, no. 7 ( July 2001 )
Spain telephone CD ?   Rodriguez-Larralde, A.; Gonzales-Martin, A.; Scapoli, C.; Barrai, I. 'The Names of Spain: A Study of the Isonymy Structure of Spain'. American Journal of Physical Anthropology, 2003, vol. 121, no. 3, pp.280-292
Switzerland 1994 Helvetic Telephone Directory   Barrai, I. and others. 'Isonymy and the Genetic Structure of Switzerland .1. The Distributions of Surnames.' Annals of Human Biology 23, no. 6(1996): 431-55
USA 1997 telephone directory CD 100 million D K Tucker 'Distribution of forenames, surnames and forename pairs in the USA' Names 49, no. 2 (2001): 69-96.
Venezuela telephone CD ?   Rodriguez-Larralde, Alvaro; Morales, Jorge; Barrai, Italo 'Surname Frequency and the Isonymy Structure of Venezuela'.American Journal of Human Biology, 2000, vol. 12, no. 3, pp. 352

Isonymic tables

 

 


Part 2 - Censuses

 

1881 Distribution

The 1881 census transcription -despite its known faults- is a marvellous tool for considering the frequency and distribution of names in the late nineteenth century.

The Guild of One-Name Studies has done important work in establishing baselines upon which to commence a study of individual names. The following table of conventions is based on the work of the 1881 Project- co-ordinated by Geoff Riggs

slt The number of surname occurrences at a sub-national level local
Snt The National total of surname occurrences National
n The population size of the area under study local
N The National Population size National
     
slt/Snt The percentage of occurrences local
slt/n The frequency : usually expressed per 1,000 or per 10,000 local
Snt/N The overall frequency National
(slt/Snt)/(n/N) The Density National

The density is an important indicator. If a surname was evenly distributed it would have a density of 1.
Geoff Riggs shows in his articles that reliance merely on the number of occurrences (s) is a misleading indicator.
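
As a rough illustration of the conventions above, here is a minimal sketch in Python that works through the Herefordshire figures from the county table in this section (160 occurrences out of a national total of 2514, in a county population of 121,062). The function name is my own, and the national population N is an assumption on my part: I have used 25,974,439, the England and Wales total for the 1881 census, which is not stated on this page.

```python
def surname_measures(s_local, s_national, n_local, n_national):
    """Return (percentage, frequency per 1,000, density) for one area."""
    share = s_local / s_national                 # slt/Snt
    percentage = 100 * share
    frequency_per_1000 = 1000 * s_local / n_local
    density = share / (n_local / n_national)     # (slt/Snt)/(n/N)
    return percentage, frequency_per_1000, density

# Assumption: England and Wales population at the 1881 census.
N_1881 = 25_974_439

pct, freq, density = surname_measures(160, 2514, 121_062, N_1881)
print(f"{pct:.2f}% of all occurrences, {freq:.3f} per 1,000, density {density:.2f}")
```

A density of 1 would mean the county holds exactly the share of name-holders that its population would lead one to expect.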

For example, below are the 1881 county figures for my own name :-

County 1881 Population Number Total Occur % of 2514 Proportion significance   per 1000 Rank
  n s S = 2514 (s/S) x 100 s/S (s/S)/(n/N)   (s/n) x 1000  
HEREF 121,062 160 2514 6.36 0.06 13.65   1.322 1
BERKS 218,363 227 2514 9.03 0.09 10.10   1.040 2
WILTS 258,965 105 2514 4.18 0.04 3.94   0.405 3
GLOS 572,433 194 2514 7.72 0.08 3.29   0.339 4
HANTS 593,470 181 2514 7.20 0.07 2.96   0.305 5
WORCS 380,283 113 2514 4.49 0.04 2.89   0.297 6
SURREY 1,436,899 341 2514 13.56 0.14 2.31   0.237 7
RUTLAND 21434 4 2514 0.16 0.00 1.81   0.187 8
WARWICK 737,339 116 2514 4.61 0.05 1.53   0.157 9
NOTTS 391,815 50 2514 1.99 0.02 1.24   0.128 10
BUCKS 176,323 22 2514 0.88 0.01 1.21   0.125 11
MDSX 2,920,485 311 2514 12.37 0.12 1.03   0.106 12
OXON 179,559 19 2514 0.76 0.01 1.03   0.106 13
The figures in the tables above and below are derived from 1) counties, and 2) the smaller registration districts.
I used Steve Archer's LDS Companion to extract the number of references and their locations for both tables from the 1881 Census CD-ROM. Alternatively, I could have collected them manually from the fiche version and, using M Bryant Rosier's Index to Census Registration Districts, assigned them to their correct area. The former is far simpler. Population figures are taken from the statistics section of this site.

Surrey has the highest absolute number, but if one considers the density, then Herefordshire is the leading county, with Berkshire close behind. There is a wide margin to the next county, Wiltshire.
I found this surprising, as a contemporary survey indicates Berkshire as the main county, whilst the IGI favours Worcestershire. The name Dance seems not to have a discrete source, but to have arisen independently in several counties from Worcestershire, through Gloucestershire, Wiltshire and Berkshire, to Hampshire.

But then, in surname distribution studies, things are often not clear cut, as the following table of data arranged by registration district reveals:-

Regn Dist Regn Cnty Count 1881 Population s/n s/S Density Significance
Marlborough Wiltshire 47 9,588 0.02 0.03 4.9 51.47
Ledbury Herefordshire 53 12,691 0.01 0.02 4.2 43.85
Wokingham Berkshire 53 15,996 0.01 0.01 3.3 34.79
Bradfield Berkshire 46 16,719 0.00 0.01 2.8 28.89
Newent Gloucestershire 25 11,030 0.00 0.01 2.3 23.80
Catherington Hampshire 6 2,747 0.00 0.01 2.2 22.93
Castle Ward Northumberland 43 19,720 0.00 0.01 2.2 22.89
Andover Hampshire 47 15,700 0.00 0.02 2.1 22.07
Hartley Wintney Hampshire 44 21,326 0.00 0.01 2.1 21.66
Cirencester Gloucestershire 42 21,125 0.00 0.02 2.0 20.87
Hungerford Berkshire 33 17,802 0.00 0.02 1.9 19.46
               


The distribution is best seen as a map

This map shows the heartland of the Dance surname.
Three foci can be discerned:-
  • Marlborough (Wiltshire)
  • Ledbury (Herefordshire)
  • Bradfield/Wokingham (Berkshire)

The map was created with Genmap v2
Figures are per 10,000 people

 


 

Marlborough is the 1881 Dance hotspot. Within the Registration District, the name is located in just 2 parishes:-
  • Marlborough St Mary (174 on map): numbers 35
  • Preshute: numbers 12

Map created from the HDS Historic Parishes of England and Wales CD

 

GRO data

One-namers collect the GRO data for their name as a matter of course. An examination of members' pages on the Guild site reveals many excellent examples.

Professor David Hey has built a database from the death registrations for the years 1842-1846 for surnames beginning with the letters A, E, K and R. This has resulted in a computer database of over 220,000 surnames, covering an estimated 12.5% of the whole set of surnames. He has published his results in his book Family Names and Family History.

An examination of the GRO data for my name 1840-45 shows the main counties to be Berkshire, Hampshire, Worcestershire

 


Part 3

Spatial analysis

This section considers the geographical tools available to analyse the spatial dispersion of surnames.
Amongst those introduced are:-
  • Index Numbers
  • Mean Separation Distance
  • Nearest Neighbour Analysis
  • Lorenz Curves

    Explanation of these techniques (except for Mean Separation Distance) is based on two A Level texts -
    Martin Mowforth Statistics for Geographers (Harrap, 1979)
    Lenon and Cleves Techniques and Fieldwork in Geography 2nd ed (2000)

 

 

-Index numbers

In the section above, an example was given of the Density of a particular surname (my own). This could be applied to every surname in the national database under study, to produce an index value for each name. It is the norm, however, to express Index numbers around base 100. One can either multiply the Significance value (see above) by that factor, or use the following equation to produce the same result.

\[ S_i = \frac{S_{lt}}{(S_{nt}/N) \times n} \times 100 \]


where

Slt The local count of your name
Snt The national total: the sum of all the local counts
n The population of the local area
N The national population (of the area covered by Snt)

An index number of 200 would indicate that for that surname there are twice as many surname-holders in that area, than one would expect given the total number nationally.
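
The index number can be sketched in a few lines of Python. This is my own illustrative version, not anyone's published code: the function name is made up, and the figures in the example are invented so that the area holds exactly twice the expected number of name-holders.

```python
def index_number(s_local, s_national, n_local, n_national):
    """Surname index around base 100.

    expected = the number of name-holders the area would have if the
    name were spread evenly, i.e. (Snt/N) * n.
    """
    expected = s_national * n_local / n_national
    return 100 * s_local / expected

# Invented figures: 1,000 holders nationally in a population of 1,000,000,
# so an area of 10,000 people would be expected to hold 10 of them.
print(index_number(20, 1000, 10_000, 1_000_000))  # 20 holders against 10 expected
```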
High frequency surnames exhibit a range of index values that is very constricted. For example, in the late 1990's, the surname Smith ranged from a minimum value of 50 to a maximum value of 249.
This should be compared with low frequency names, which have ranges of 0 - 3,000.
At the extreme, some names with very small populations have very high index scores of c. 9,000.

If one looks in which areas (in this case, postcodes) the index values reach a peak, the results seem inconsistent

Postcodes with the highest number of peaks

London WC London EC Norwich York Hull Ipswich Truro Taunton
727 703 517 381 371 360 355 353

Postcodes with the lowest number of peaks

Kingston upon Thames London SW Manchester Llandudno Blackpool Cardiff Leeds Reading
45 46 60 62 67 94 98 98

 

Why do London postcodes have some of the highest and lowest numbers of surname peaks? Those who are experienced users of Surname Atlas may have noticed that some surnames seem to display unaccountably heavy concentrations in the Isle of Man or Jersey. This is a distortion that is probably introduced by the wide range of populations across geographical areas, as well as the wide range of surname frequencies. If you are working with contemporary data, and therefore postcodes, please be aware that postcode area populations vary from 3% (Birmingham) down to 0.04% of the total. (I must do a similar exercise on 1881 Registration District areas.)

  • 83 out of the 120 GB postcode areas have populations less than 1.00% of the UK total.
  • On average, a postcode area has 217 names with a count of 100+ occurrences
  • This is a factor known to geographers as MAUP (Modifiable Areal Unit Problem):
    the message conveyed can be considerably influenced by the areal units chosen, and by the scale

 

Least resident-populated Postcode areas
% UK population
KW Kirkwall 0.09
LD Llandrindod Wells 0.09
WC London WC 0.07
EC London EC 0.05
HS Harris 0.05
ZE Lerwick 0.04


In the following grid, column b represents a matrix of postcode areas and clusters of similar surnames, large and small. Thus the top lefthand cell represents frequently occurring names in highly populated postcode areas (the Smiths etc in Birmingham etc); conversely, the bottom righthand cell represents low frequency surnames in sparsely populated postcode areas (e.g. London EC).
The key represents the standard deviations. Most surnames fall within an irregular but graduated range of 40-420 standard deviations. Those in islands or 'pockets' have much wider ranges; and the standard deviations for low frequency names in low-populated areas are excessive.
In effect a 'cluster' of small names is far more likely to appear of significance than a 'cluster' of large names. The size of the postcode in which the cluster appears can also bolster this bias.

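[Grid: column a is the surname axis, running from Large to Small; column b is the matrix, with postcode areas running from Large to Small; column c lists the islands and 'pockets' (Orkney etc, Shetland etc, Outer Hebrides, London EC, London WC). Key bands, in standard deviations: 45-49, 50-99, 100-149, 150-199, 200-249, 250-299, 300-349, 350-399, 400-450, 500-549, 1000+, 1500+]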

For this reason, the index value ideally needs to be standardised. An equation has been formulated that does this, but it is not as yet in the public domain.

(This section is based upon elements from an unpublished UCL symposium paper by D Lloyd)


- Mean Separation Distance

This is a measure of how dispersed your name is.

The clearest way that I can think of understanding how to apply the formula is through the following example.

Consider 4 places (parishes, registration districts) A, B, C, D, each with holders of your surname numbering 100, 50, 20, 10 respectively. Enter these numbers into the following grid, together with the distance of each place from every other (noted here as dBA, dCA, dDA = distance of B, C and D from A). The distances are shown in the third, fourth and fifth columns below.

  Numbers distA distB distC
A 100 -    
B 50 15 -  
C 20 25 10 -
D 10 30 7 5

Formula 1 = The numerator

(B x dBA) + (C x dCA) + (D x dDA) + (C x dCB) + (D x dDB) + (D x dDC)

To enter the relevant numbers, simply start at the number 15 at the top of the distA column in the grid, and work down each column in turn

(B x 15) + (C x 25) + (D x 30) + (C x 10) + (D x 7) + (D x 5)

By substitution of the nameholders in each place this completes to
(50 x 15) + (20 x 25) + (10 x 30) + (20 x 10) + (10 x 7) + (10 x 5)

= 1870

This is known as the Total Separation Distance. This figure must now be divided by the Total of Separated Persons (the denominator) to give the Mean Separation Distance of the name. The denominator counts each place number once for every time it has been used in the first formula, but this time also includes the surname number in place A.

Formula 2= The denominator= Total of Separated Persons

(A + B + C + D) + C + (D +D) = 100 + 50 + 20 + 10 + 20 + 10 + 10= 220

The Mean Separation Distance in this case is 1870/220 = 8.5 km

You might like to copy the grid and formulae into a spreadsheet and experiment with the numbers.
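
For those who prefer a script to a spreadsheet, the worked example above can be sketched in Python as follows. This is only my own rendering of the two formulae as I have set them out here: the function name and the way the distances are laid out (one list per place, holding the distances back to the places listed before it) are assumptions of mine, not part of the published method.

```python
def mean_separation_distance(counts, distances):
    """Mean Separation Distance for one surname.

    counts[j]       name-holders in place j, in the order the places are listed
    distances[j][i] distance from place j back to an earlier place i (i < j)
    """
    total_distance = 0.0            # Formula 1: Total Separation Distance
    separated_persons = counts[0]   # Formula 2 starts with place A
    for j in range(1, len(counts)):
        for i in range(j):
            total_distance += counts[j] * distances[j][i]
            separated_persons += counts[j]
    return total_distance / separated_persons

# The grid from the example: rows A, B, C, D.
counts = [100, 50, 20, 10]
distances = [[], [15], [25, 10], [30, 7, 5]]

print(round(mean_separation_distance(counts, distances), 2))        # the worked example
print(round(mean_separation_distance([10, 10, 10, 10], distances), 2))  # equal numbers
```

Changing the numbers in `counts` or `distances` and re-running is the equivalent of experimenting with the spreadsheet.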
For example, if the distances remain the same, but the numbers in place A were much higher, say a concentration of 1000 (and not 100), the MSD would drop to 1.67.
If the numbers were equal (say 10) in each place, then the MSD= 13.14

If the numbers remain unaltered, but the distances from A are increased by 100 km each, then the resulting MSD = 44.86

And if the number of nameholders is the same in each place (say 10), and every distance is increased by 100 km, then the MSD = 98.86

The above formula and example are after Schürer (2004), but the exposition is mine.
I would also sort places first by highest surname density, and then take the raw numbers in that rank order.
It may be that one would then need to input only a relatively small selection to obtain a fairly good idea of the MSD.

A problem will be how to measure accurately (and in a consistent fashion) the distances between places, when those places are areal units, like parishes and registration districts. How does one determine where the centroid of each is?

GenMap has a tool to measure straight-line distances between registration districts (but you will still have to guesstimate where the centroid is).
Historic Parishes of England and Wales Gazetteer has the 6 figure OS grid reference of each parish. There may be freeware to measure distances between OS grid references? Otherwise the OS has a page on how to manually calculate distances between grid references.
Or use a piece of string :-))
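
Failing the string, the distance between two lettered OS grid references can be worked out with a short script. The following is a minimal sketch of my own, under stated assumptions: it expects a two-letter square followed by an even number of digits (such as the 6 figure references mentioned above), it takes the south-west corner of the square the reference names, and it treats the distance as a straight line on the flat National Grid. The function names are hypothetical.

```python
import math

def grid_ref_to_metres(ref):
    """Convert a lettered OS grid reference (e.g. 'SU 123 456') to
    full eastings and northings in metres."""
    ref = ref.replace(" ", "").upper()
    l1 = ord(ref[0]) - ord("A")
    l2 = ord(ref[1]) - ord("A")
    # The grid lettering skips the letter I.
    if l1 > 7:
        l1 -= 1
    if l2 > 7:
        l2 -= 1
    # Position of the 100 km square, counted from the grid's false origin.
    e100km = ((l1 - 2) % 5) * 5 + (l2 % 5)
    n100km = (19 - (l1 // 5) * 5) - (l2 // 5)
    digits = ref[2:]
    half = len(digits) // 2
    scale = 10 ** (5 - half)        # 100 m per unit for a 6 figure reference
    easting = e100km * 100_000 + int(digits[:half]) * scale
    northing = n100km * 100_000 + int(digits[half:]) * scale
    return easting, northing

def distance_km(ref_a, ref_b):
    """Straight-line distance in km between two grid references."""
    ea, na = grid_ref_to_metres(ref_a)
    eb, nb = grid_ref_to_metres(ref_b)
    return math.hypot(ea - eb, na - nb) / 1000

print(distance_km("SU 000 000", "SU 030 040"))
```

This does nothing to settle where the centroid of a parish lies; it only makes the measuring consistent once a reference has been chosen for each place.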

 

-Mean Separation Distance of the Place

The above allows you to compare the dispersion of 1 surname with another, but does not give one a national perspective of the relative dispersion of all names by place. To do so, one would have to feed all the MSD's back into the original locations.

For example, Parish A has a total population of 90 people, comprising just 3 surnames (p, q, r) with occurrences of 16, 8 and 3 and associated Mean Separation Distances of 19, 16 and 6.

The MSD of the whole parish would then be calculated as:

((p x 19) + (q x 16) + (r x 6)) / total parish population

by substitution

((16 x 19) + (8 x 16) + (3 x 6)) / 90 = (304 + 128 + 18) / 90 = 450 / 90 = 5 = MSD of the parish
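
The same sum, as a minimal Python sketch. The function name is my own, and it simply follows the calculation as given here, dividing by the total parish population.

```python
def parish_msd(surnames, parish_population):
    """surnames: list of (occurrences, MSD of that surname) pairs."""
    weighted_total = sum(count * msd for count, msd in surnames)
    return weighted_total / parish_population

# Parish A: surnames p, q, r
parish_a = [(16, 19), (8, 16), (3, 6)]
print(parish_msd(parish_a, 90))
```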

This is a daunting exercise for anyone without access to large computing power, but it has been done by the University of Essex for each parish of England and Wales in the 1881 census, and the output plotted onto a map, and published in Local Population Studies no 72 (2004)

This map reveals certain broad belts of low separation distances

Surname density seems to cut across these areas, except that a core area of South Lancashire seems to fit within the 1st belt. This seems to be an area of low surname density, in which the holders ramified greatly within the same region; as opposed, say, to North Wales, where high migration to England co-existed with low surname density.

 


-Nearest Neighbour Analysis

I have in mind here the comparison of the dispersal of 2 or more names in a modern context. For example:-

NNA is a measure of distribution, and not of 'pattern' and does have limitations. The higher the number of points, the higher the reliability of the result; and 30 surname plots would be considered the minimum.

Procedure

  1. Measure the distance from each surname point to its nearest neighbour, and average the results (Dobs)
  2. Derive the density of the points (d) by dividing the number of surname points by the area under consideration (e.g. ward, parish, registration district, county)
  3. Calculate the expected mean distance for a random distribution of points (re) over this area:
    \( r_e = \frac{1}{2\sqrt{d}} \)
  4. The nearest neighbour statistic (Rn) can then be obtained by dividing the observed average distance by the expected mean distance for a random distribution:
    \( R_n = D_{obs} / r_e \)

The value of the nearest neighbour statistic (Rn) can range from 0 (extremely clustered) to 2.15 (an ordered and uniform distribution). A value of 1 would suggest a random distribution.
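
The procedure can be sketched in Python as below. This is a minimal illustration rather than a finished tool: the function name is mine, the points are assumed to be plain (x, y) coordinates in the same units as the area, and Dobs is taken to be the mean distance from each point to its nearest neighbour. The test pattern is an invented 10 x 10 square grid of points, which is an ordered arrangement and so should score well above 1.

```python
import math

def nearest_neighbour_statistic(points, area):
    """Rn = Dobs / re for a list of (x, y) points within a study area."""
    nearest = []
    for i, (x1, y1) in enumerate(points):
        # Distance from this point to the closest of all the other points.
        nearest.append(min(
            math.hypot(x1 - x2, y1 - y2)
            for j, (x2, y2) in enumerate(points) if j != i
        ))
    d_obs = sum(nearest) / len(nearest)   # observed mean distance
    density = len(points) / area          # d: points per unit area
    r_e = 1 / (2 * math.sqrt(density))    # expected mean for a random pattern
    return d_obs / r_e

# Invented example: 100 points on a regular grid, 1 unit apart, in an area of 100.
grid = [(x, y) for x in range(10) for y in range(10)]
print(nearest_neighbour_statistic(grid, 100))
```

Note that the answer depends directly on the area supplied, which is the study area problem discussed below.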

It is now up to the surname analyst to explain the resulting distribution.

Be aware that the above equations are based on 2 assumptions:-

  1. The points are located within an infinite area
  2. The points are free to locate anywhere within that area

These are severe restrictions in the case of surname study, since quite a few factors come into play: propinquity of kin, economic conditions, lines of transport, geomorphology. And if the area under study changes, then this affects the density of the points.

"Because of this problem of study area delimitation, one should be very wary indeed of comparisons made between nearest neighbour analysis results from different areas."
But

"The technique provides a very useful descriptive measure of point patterns, particularly for quantifying the increase or decrease in dispersion or clustering of a pattern through time, providing the definition of the study area remains the same."

David Ebdon Statistics in Geography 2nd ed. Blackwell, 1985 p148-9

So this technique may perhaps be useful for comparing the temporal change in a surname distribution within a specified area, such as a parish or registration district (provided its boundaries have not changed in the interim); for comparing the distribution of 2 surnames in the same area; or just perhaps for describing the distribution of a widely-dispersed surname, using the area of England, Scotland or Wales as a baseline.
But two surnames with the same index number do not necessarily have the same distribution, as it is "possible for arrangement of points which are very dissimilar to have identical mean nearest-neighbour distances"

 


-Lorenz Curves

These are plots of cumulative percentage (normally used in economic and social history to plot accumulated wealth). However, they can be used here to plot accumulated name frequency (x axis) against accumulated area (y axis). A surname that aligned on the diagonal (a-d) would have a perfectly even distribution of name-holders, such that 10% of the area sampled contains 10% of the surname-holders, and 50% of the area contains 50% of the name-holder population. Any curve that tends towards corner b would illustrate a name in which a large % of the name-holders are concentrated in a small % of the land.

 

[Figure: Lorenz curves for two surnames. x axis: cumulative % of surname-holders; y axis: cumulative % of area. Corners: a (bottom left), b (bottom right), c (top left), d (top right). Key: . = Surname A; o = Surname B]

All Lorenz curves should be compared against the diagonal. It is possible, using the Gini coefficient, to measure the area between the diagonal and the curve as a fraction of the area below the diagonal. A high concentration of surname-holders in a small area would yield a high Gini coefficient.
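A minimal Python sketch of the curve and the coefficient follows, assuming the data come as (area, number of name-holders) pairs for each areal unit, e.g. each registration district. The function names, the ordering of units from densest to sparsest, and the trapezium approximation of the area under the curve are my own choices for illustration.

```python
def lorenz_points(units):
    """units: list of (area, name_holders) pairs, one per areal unit.

    Returns cumulative (share of name-holders, share of area) points,
    i.e. x = name-holders and y = area as in the plot described above.
    Every unit is assumed to have a positive area.
    """
    total_area = sum(area for area, _ in units)
    total_holders = sum(holders for _, holders in units)
    # Densest units first, so that the curve bends towards corner b
    ordered = sorted(units, key=lambda u: u[1] / u[0], reverse=True)
    points = [(0.0, 0.0)]
    cum_area = cum_holders = 0
    for area, holders in ordered:
        cum_area += area
        cum_holders += holders
        points.append((cum_holders / total_holders, cum_area / total_area))
    return points


def gini(points):
    """Area between diagonal and curve, as a fraction of the area below the diagonal."""
    under_curve = sum((x1 - x0) * (y0 + y1) / 2
                      for (x0, y0), (x1, y1) in zip(points, points[1:]))
    return (0.5 - under_curve) / 0.5


even = gini(lorenz_points([(10, 5), (10, 5)]))             # name spread evenly
concentrated = gini(lorenz_points([(1, 100), (99, 0)]))    # all holders in 1% of the land
print(even, concentrated)
```

An evenly spread name gives a coefficient of 0, while the second (invented) case, with every name-holder in 1% of the land, gives a value close to 1.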

Lorenz curves are useful for comparing the differential growth or decline of any two features over time so these curves could be used to compare :-

Comprehensive area values can be obtained from the Census Abstracts, and individual volumes of the Victoria County History,
and selected values from this site

 

Comparing Census surname distributions over time

There were 383 changes to the boundaries of registration districts from 1841 to 1911, and almost 20,000 to parishes from 1876 to 1972. "These changes mean it is very difficult to compare one census with its predecessor and make the creation of long run time series of raw data impossible"
This problem might not affect the study of a single name, but would have to be taken into consideration by those who are studying the varying distribution of a class of names: for example, if one wished to study how the age-distribution of Welsh surnames in a London registration district varied between 1841 and 1901.

Gregory and Ell consider possible ways to overcome the problem for census geographers.

Source: Ian Gregory and Paul Ell Breaking the boundaries: geographical approaches to integrating 200 years of the census
Journal of the Royal Statistical Society A 168(2) 2005, pp419-437


 

Part 4 - Socio-economic

-Geodemographics and surnames

Lots of potential for analysis here- though I am not entirely convinced of the validity of this approach on the microscale

"There is no formal proof and no "theory of geodemographics" either, only the concept that "birds of a feather flock together". All the evidence is empirical..the systems are used simply because they do work.."
R Flowerdew/ B Leventhal- Under the microscope (Market Research Society symposium paper)

"Some of the most persuasive evidence that geodemographic mapping does affect perceptions is the condemnation of this work by other researchers"
D Dorling Mapping p13


Geodemographic schemes use census and private data to create a profile of a neighbourhood. These profiles serve as a likely indication of the area's relative affluence, and the possible life-style of its inhabitants. A classification scheme is used to assign profiles into a hierarchical order.
Two well-known geodemographic products are:-

Acorn - A Classification of Residential Neighbourhoods
UK 2001 Classification

Main class          Sub-group                   Est. % of UK pop.
Wealthy achievers   Wealthy executives           8.6
                    Affluent greys               7.7
                    Flourishing families         8.8
Urban prosperity    Prosperous professionals     2.2
                    Educated urbanites           4.6
                    Aspiring singles             3.9
Comfortably off     Starting out                 2.5
                    Secure families             15.5
                    Settled suburbia             6.0
                    Prudent pensioners           2.6
Moderate means      Asian communities            1.6
                    Post industrial families     4.8
                    Blue collar roots            8.0
Hard pressed        Struggling families         14.1
                    Burdened singles             4.5
                    High rise hardship           1.6
                    Inner-city adversity         2.1
Unclassified                                     0.3

Census variables: {Age, sex, socioeconomic status, occupation, tenure}
source used for table %

Mosaic - (used by the credit agency, Experian)
Main classes; 52 sub-groups

Main class             Description
High income families   Professionals and wealthy people living in very affluent suburbs
Suburban semis         Includes satellite villages as well as suburbs
Blue collar            Least expensive owner-occupied housing; includes junior white-collar
Low rise council       Local authority or housing association tenants
Council flats          Includes municipal overspill estates
Victorian low status   Wide mix of lifestyles for mainly young families and childless elderly
Town houses/flats      Lower and middle income; typically junior admin grades
Stylish singles        Typically inner-city; well-educated occupants
Independent elders     Owner-occupiers or sheltered accommodation; low incomes
Mortgaged families     Typically newly-built private housing; young families on town peripheries
Country dwellers       Outside the commuter belt; wide range of lifestyles & affluence
Institutional areas    A catch-all category for military housing, boarding schools, hospitals etc

Mosaic has recently been revised, e.g. to accommodate changing affluence/lifestyles in the Asian community
Census variables: {Age, marital status, recent movers, household composition & size, employment type, travel to work, unemployment, car ownership, housing tenure, amenities, housing type, socioeconomic status}
Non-census variables: {County Court Judgements, Credit activity, Electoral Roll, Postcode Address File, Directors, Retail accessibility}. c 350 variables (census and non-census) in all
Photographs that illustrate areas deemed to be typical in Mosaic.
On average there are 3.1 different household-level Mosaic types in a postcode; only 22% of postcodes consist entirely of 1 Mosaic type at household level; there is a maximum of 18 different types in a postcode (Source: Richard Webber)
More detailed classification for both schemes on their websites
Useful source; Presentations to the
MRS Census and Geodemographics Group

 

However, Acorn is the more usable for the surname analyst, as the profile assigned to a unit postcode is readily available.

A One-namer could obviously tabulate the current overall socio-economic status of their name, although a minimum number of name-holders would be needed (100+?). If a name is still predominantly located within a specific region, such an analysis would divulge what percentage are rural/urban, associated with town centres, suburbs etc.

This is such a new area that I am wary: there must be pitfalls in applying a scheme to find the socio-economic value of a surname. And would such a profile have any validity?

  • Schemes such as Acorn and Mosaic are built around census data.
    In the inter-census period, an influx of new residents whose education levels, employment and ages differ from those of established residents may cause a mismatch
  • Classification schemes are one-dimensional. A unit postcode could actually encompass several lifestyles. A postcode may be split 55% : 45% between classes, and yet will be assigned to the former. Or it may be wrongly assigned. For example, the postcode of my large employer (the only building in the road) is designated as low-income tenants, rather than institutional. Almost right: the flats are 2 streets away. (Incidentally, each census output area encompasses between 5 and 10 postcodes).
  • New schemes are, though, being developed with fuzzy classifications, and a resulting contour-line representation


Perhaps of more significance would be to define a group of names, and to perform the same profiling. This has been done for the names traditionally associated with one small region (i.e. 'local' surnames). The analysis (an unpublished academic study) found that 'local' names were more associated with lower-status profiles and neighbourhoods.

To see what can be achieved in this area of name pattern analysis using geodemographics, download

Richard Webber 'Neighbourhood segregation and social mobility among the descendants of Middlesbrough's 19th century immigrants' (CASA Working Paper 88)

Note:

  • Although unstated, this approach does rely upon the correct classification of a name (and that is contentious in itself), and the classification scheme seems to be based upon an analysis of modern-day forms and distribution, and not a historical approach. On the large scale of this study, though, an occasional mis-classification is not significant
  • The MAUP (modifiable areal unit problem) could also be a distorting factor, i.e. when surname figures are being compared between different spatial areas in different periods

 


This type of approach needs to be repeated for other parts of the country

The following are possible areas for socio-economic surname studies:

Above-average concentrations of financially privileged and socially excluded, in close proximity:
  Camden, Haringey, Westminster
  Aberdeen, Edinburgh, Stirling
High proportions of elderly people living in council accommodation:
  Nottingham, Barking, Dagenham
Eclectic ethnic mixes (London postcodes):
  E7, E12, EC1N, W2, W3, W1B, N17, N15, SE15, SE8, SW9, SW5, SW7, SW8, UB1
source: J. of targeting, measurement and analysis for marketing (2001) vol 10, 1 p64

Portsmouth would be an interesting case-study. It is unique in the UK in being an island city, with a strong sense of place, and a clannishness associated with long-established families.

 


-Household Composition and surnames

It is possible to glean further information from the nature of the household entry, by analysing the possible combinations of gender and surnames.

Possible household categories

Family                     1 male: 1 female, sharing the same surname
Extended Family            Family with at least one other adult of the same surname
Pseudo Family              1 male: 1 female, but with different surnames
Single male
Male homesharers           2 or more males with 2 or more surnames
Multi-occupancy dwelling   More than 5 surnames at one address

This is a simplified version of the household analysis in Mosaic. The use of such a simplified system does have drawbacks, e.g. a brother living with a widowed sister would appear to be a pseudo family.
Given name frequencies could be used to help decide whether the additional members of an extended family are parents or offspring (Mosaic classifies your forename into 50 clusters, each with a similar age distribution).
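The household categories above can be sketched as a simple rule-based classifier in Python. The order in which the rules are tested, the "M"/"F" gender codes, the reading of 'Extended Family' as three or more adults of both sexes sharing one surname, and the fall-back label 'Unclassified' are all my own assumptions; Mosaic's actual household analysis is more elaborate.

```python
from typing import List, Tuple


def classify_household(adults: List[Tuple[str, str]]) -> str:
    """adults: list of (gender, surname) pairs for one address, gender 'M' or 'F'."""
    surnames = {surname for _, surname in adults}
    males = [surname for gender, surname in adults if gender == "M"]
    females = [surname for gender, surname in adults if gender == "F"]

    if len(surnames) > 5:
        return "Multi-occupancy dwelling"
    if len(adults) == 1 and len(males) == 1:
        return "Single male"
    if len(adults) == 2 and len(males) == 1 and len(females) == 1:
        # Same surname: family; different surnames: pseudo family
        return "Family" if len(surnames) == 1 else "Pseudo Family"
    if not females and len(males) >= 2 and len(surnames) >= 2:
        return "Male homesharers"
    if len(adults) >= 3 and males and females and len(surnames) == 1:
        return "Extended Family"
    return "Unclassified"  # hypothetical catch-all, not in the list above


# The drawback mentioned above: a brother living with his widowed sister
siblings = classify_household([("M", "Smith"), ("F", "Jones")])
print(siblings)  # Pseudo Family
```

The sibling case shows why the raw categories need checking against other evidence before any conclusions are drawn.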

.."it would seem that type of neighbourhood, age and gender represent three items of information which are 'orthogonal', ie complementary to each other in that they operate in three quite independent domains. Given that both gender and age can be inferred from a person's first name with a fair degree of reliability (especially when also using public information such as years at their current address on the electoral roll and the presence and name (if present) of a partner) then it would seem that most behaviours could be predicted for any consumer from their name and address with a fairly high degree of success "
R Webber "father of UK Geodemographics"

and in the USA

"The development process also uncovered a correlation between cluster membership and given names. In the 35 million-name database, there were many names that appeared with unusual frequency in only one cluster. For that reason, all of the clusters were given high-indexing first names, resulting in titles like "Jules & Roz" (affluent and physically active urbanites with children), "Denise" (single mothers on a tight budget), and"Elmer" (very sedentary older men)...
[Certain] people defy categorization and have been lumped into a potpourri group known as the "Omegas." Nearly 9 percent of all U.S. households are Omegas."
Source : J Bickert, 1995

 

Names 'typical' of their age groups

Core age   Female             Male
38-44      Michelle, Sharon   Kevin, Gary
44-64      Pamela, Janet      Philip, Brian
65-84      Sylvia, Brenda     Kenneth, Raymond
85+        Hilda, Ethel       Percy, Herbert

Source: 'Geodemographics, GIS and neighbourhood targeting' Wiley, 2005 p 72

Female names are more fashion-driven than male names. If they are combined with a male partner name, then geodemographers are pretty confident in their estimates of that couple's age-range. This is re-inforced by the length of residency - a statistic that is consistently lower where lower-age groups are involved.

As for surnames -

"...in Scotland the percentages of electors with self-evidently Scottish names is significantly higher among consumers in highland and island communities than among consumers in student areas, defence establishments and areas of high-incomes singles and families in inner areas of Glasgow. Indeed the percentage with Scottish names has proved a more effective indicator than the Census indicator 'speaking Gaelic' in identifying areas with the most traditionally Scottish way of life"
source: R Webber Designing geodemographic classifications to meet contemporary business needs Interactive Marketing 5(3) 2004, p 233-234

 

Example

I have just played around and collected household data for name D in the PO postcode area

Postcode  Households  Main types  Sub-types  Households by main type
a         1           1           1          1 moderate means
b         2           2           2          1 moderate means; 1 hard-pressed
c         3           1           3          3 moderate means
d         1           1           2          1 comfortably-off
e         1           1           1          1 moderate means
f         4           2           4          3 wealthy achievers; 1 comfortably-off
g         5           2           3          4 wealthy achievers; 1 comfortably-off
h         3           3           3          1 wealthy achievers; 1 comfortably-off; 1 hard-pressed
i         2           2           2          1 comfortably-off; 1 hard-pressed
j         1           1           1          1 moderate means
k         1           1           1          1 comfortably-off
l         2           2           -          1 urban prosperity; 1 hard-pressed
m         2           1           1          2 comfortably-off
n         1           1           1          1 wealthy achievers
o         9           3           3          6 comfortably-off; 1 moderate means; 2 hard-pressed
p         1           1           1          1 hard-pressed
q         1           1           1          1 urban prosperity
r         1           1           1          1 wealthy achievers

Main class          Number  %     National average  Postcodes involved  Households
Wealthy achievers   11      25.6  25.1              5                   10
Urban prosperity     2       4.7  10.7              2                    2
Comfortably off     14      32.6  26.6              8                   14
Moderate means       8      18.6  14.5              6                    8
Hard-pressed         8      18.6  22.4              6                    7

Types             Numbers
Singles (Young)    6
Singles (Mature)  17
Doubles           17
3+                 2
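For anyone repeating the exercise with their own name, the percentage column can be reproduced from the Number column, and set against the national average, with a few lines of Python. The dictionary layout and the function name are my own; the figures are those in the summary table for name D.

```python
# (number for name D, national average %) per main Acorn class
profile = {
    "Wealthy achievers": (11, 25.1),
    "Urban prosperity": (2, 10.7),
    "Comfortably off": (14, 26.6),
    "Moderate means": (8, 14.5),
    "Hard-pressed": (8, 22.4),
}


def share_table(profile):
    """Return (class, % of name-holders, difference from national average) rows."""
    total = sum(number for number, _ in profile.values())
    rows = []
    for main_class, (number, national) in profile.items():
        pct = round(100 * number / total, 1)
        rows.append((main_class, pct, round(pct - national, 1)))
    return rows


rows = share_table(profile)
for main_class, pct, diff in rows:
    print(f"{main_class:<18} {pct:>5} {diff:>+6}")
```

With only 43 entries any difference from the national figure should be treated with caution; the minimum sample of 100+ suggested earlier applies here too.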

It might be a fruitful exercise to correlate geodemographic status against household composition for a name/class of names

Another possible broader-brush classification scheme at local authority level

Does the above have implications for the philosophy of identity? The standard position is that a name has reference but no meaning, i.e. it is a label that refers to one object. If it refers to more than one, then its usage is that of a common noun. For example, if I talk about a polar bear, the mind conjures up a class of bear with all its associations: snow, whiteness, polar region. If I say 'John' there is no equivalent class of 'Johns' sharing all the same attributes, into which my one John neatly fits.
But with geodemographic clusters, people with a distinctive name might group into defined socio-economic, lifestyle groups. Not everyone, but a significant number.
So are Geodemographic clusters "common nouns", or does the lifestyle communality imply meaning ???
This is in all probability early-morning tosh. I certainly have not thought it through.


Other Spatial Analysis Tools

-Index of concentration
-Location quotient
-Cluster Analysis


Still to be written (sometime):-


Last revised: February 13, 2006