Baby Names

12 minute read

Published:

In this project I have worked with data provided by the United States Social Security Administration, which lists first names along with the number and sex of babies they were given to in each year. For processing speed purposes, we've limited the dataset to first names which were given to over 5,000 American babies in a given year. Our data spans 101 years, from 1920 through 2020.

Skills : SQL, Data Exploration, Data Cleaning and Manipulation

baby_names

columntypemeaning
yearintyear
first_namevarcharfirst name
sexvarcharsex of babies given first_name
numintnumber of babies of sex given first_name in that year

1. Classic American names

Let's get oriented to American baby name tastes by looking at the names that have stood the test of time!

%%sql
postgresql:///names

-- Select first names and the total babies with that first_name
-- Group by first_name and filter for those names that appear in all 101 years
-- Order by the total number of babies with that first_name, descending
SELECT first_name, SUM(num)
FROM baby_names
GROUP BY first_name
HAVING COUNT(year) = 101
ORDER BY SUM(num) DESC;


8 rows affected.
first_namesum
James4748138
John4510721
William3614424
David3571498
Joseph2361382
Thomas2166802
Charles2112352
Elizabeth1436286

2. Timeless or trendy?

Here I have captured the type of popularity that each name in the dataset enjoyed. Was the name classic and popular across many years or trendy, only popular for a few years?

%%sql

-- Classify first names as 'Classic', 'Semi-classic', 'Semi-trendy', or 'Trendy'
-- Alias this column as popularity_type
-- Select first_name, the sum of babies who have ever had that name, and popularity_type
-- Order the results alphabetically by first_name

SELECT first_name, SUM(num),
    CASE WHEN COUNT(year) > 90 THEN 'Classic'
         WHEN COUNT(year) > 40 THEN 'Semi-classic'
         WHEN COUNT(year) > 10 THEN 'Semi-trendy'
        ELSE 'Trendy' END AS popularity_type
FROM baby_names
GROUP BY first_name
ORDER BY first_name
LIMIT 10;

 * postgresql:///names
10 rows affected.
first_namesumpopularity_type
Aaliyah15870Trendy
Aaron530592Semi-classic
Abigail338485Semi-trendy
Adam497293Semi-classic
Addison107433Semi-trendy
Adrian147741Semi-trendy
Aidan68566Trendy
Aiden216194Semi-trendy
Alan162041Semi-trendy
Albert260945Semi-trendy

3. Top-ranked female names since 1920

Here the search is limited to names which were given to female babies.

Window function is used by assigning a rank to female names based on the number of babies that have ever been given that name. What are the top-ranked female names since 1920?

%%sql

-- RANK names by the sum of babies who have ever had that name (descending), aliasing as name_rank
-- Select name_rank, first_name, and the sum of babies who have ever had that name
-- Filter the data for results where sex equals 'F'
-- Limit to ten results

SELECT
    RANK() OVER (ORDER BY SUM(num) DESC) AS name_rank,
    first_name,
    sum(num)
FROM baby_names
WHERE sex ='F'
GROUP BY first_name
LIMIT 10;

 * postgresql:///names
10 rows affected.
name_rankfirst_namesum
1Mary3215850
2Patricia1479802
3Elizabeth1436286
4Jennifer1404743
5Linda1361021
6Barbara1343901
7Susan1025728
8Jessica994210
9Lisa920119
10Betty893396

4. Picking a baby name

We are analyzing the data for traditionally female name ending with the letter 'a' and a name that has been popular in the years since 2015.

%%sql
-- Select only the first_name column
-- Filter for results where sex is 'F', year is greater than 2015, and first_name ends in 'a'
-- Group by first_name and order by the total number of babies given that first_name
SELECT first_name
FROM baby_names
WHERE sex = 'F' AND year > 2015 AND first_name LIKE '%a'
GROUP BY first_name
ORDER BY SUM(num) DESC;
 * postgresql:///names
19 rows affected.
first_name
Olivia
Emma
Ava
Sophia
Isabella
Mia
Amelia
Ella
Sofia
Camila
Aria
Victoria
Layla
Nora
Mila
Luna
Stella
Gianna
Aurora

5. The Olivia expansion

Based on the results in the previous task, Olivia is the most popular female name ending in 'A' since 2015. In the next step we will analyze when did the name Olivia become so popular?

The rise of the name Olivia is explored with the help of a window function.

%%sql

-- Select year, first_name, num of Olivias in that year, and cumulative_olivias
-- Sum the cumulative babies who have been named Olivia up to that year; alias as cumulative_olivias
-- Filter so that only data for the name Olivia is returned.
-- Order by year from the earliest year to most recent

SELECT year, first_name, num,
    SUM(num) OVER (ORDER BY year) AS cumulative_olivias
FROM baby_names
WHERE first_name = 'Olivia'
ORDER BY year ASC
LIMIT 10;

 * postgresql:///names
30 rows affected.
yearfirst_namenumcumulative_olivias
1991Olivia56015601
1992Olivia580911410
1993Olivia634017750
1994Olivia643424184
1995Olivia762431808
1996Olivia812439932
1997Olivia947749409
1998Olivia1061060019
1999Olivia1125571274
2000Olivia1285284126
2001Olivia1397798103
2002Olivia14630112733
2003Olivia16152128885
2004Olivia16106144991
2005Olivia15694160685
2006Olivia15501176186
2007Olivia16584192770
2008Olivia17084209854
2009Olivia17438227292
2010Olivia17029244321
2011Olivia17327261648
2012Olivia17320278968
2013Olivia18439297407
2014Olivia19823317230
2015Olivia19710336940
2016Olivia19380356320
2017Olivia18744375064
2018Olivia18011393075
2019Olivia18508411583
2020Olivia17535429118

6. Many males with the same name

In the next task, Let's take a look at traditionally male names. We saw in the first task that there are nine traditionally male names given to at least 5,000 babies every single year in our 101-year dataset!

In the next two tasks, we will build up to listing every year along with the most popular male name in that year. This presents a common problem: how do we find the greatest X in a group? Or, in the context of this problem, how do we find the male name given to the highest number of babies in a year?

Subquery can be used in the query. first query is written that selects the year and the maximum num of babies given any single male name in that year. For example, in 1989, the male name given to the highest number of babies was given to 65,339 babies. This query is written in this task.

%%sql

-- Select year and maximum number of babies given any one male name in that year, aliased as max_num
-- Filter the data to include only results where sex equals 'M'
SELECT year, MAX(num) as max_num
FROM baby_names
WHERE sex ='M'
GROUP BY year;
 * postgresql:///names
101 rows affected.
yearmax_num
197085291
200034483
194794764
196285041
197568451
198068704
193160518
198168776
201318266
197271401
195690665
200724292
194888589
198467745
195792718
196186917
200230579
192560897
199254397
200822603
195890564
197177599
198564924
192661130
198864150
192959804
196383778
192860703
200329643
193062149
195187261
194062476
198268244
192056914
199935367
195287063
202019659
194687439
196881995
199638365
200525837
192357469
200921184
192460801
195488576
200427886
193862269
194277174
196679990
199836616
197467580
194986865
199065302
199541399
197367842
192761671
194166743
197767609
200132554
199737549
201419319
196581021
193556522
194476954
199444472
201619154
196085933
198763654
197867157
201819924
200624850
192158215
199349554
196482642
194380274
193761842
198664224
195386247
195985224
197666947
198965399
201219088
201120378
201920555
195588372
193959653
197967742
199160793
201718824
196985201
201519650
201022139
193259265
196782440
192257280
193354223
193455834
193658499
194574460
195086229
198368010

7. Top male names over the years

In the previous task, we found the maximum number of babies given any one male name in each year. Incredibly, the most popular name each year varied from being given to less than 20,000 babies to being given to more than 90,000!

In this task, we find out what that top male name is for each year in our dataset.

%%sql

-- Select year, first_name given to the largest number of male babies, and num of babies given that name
-- Join baby_names to the code in the last task as a subquery
-- Order results by year descending

SELECT b.year, b.first_name, b.num
FROM baby_names AS b
INNER JOIN (
    SELECT year, MAX(num) as max_num
    FROM baby_names
    WHERE sex ='M'
    GROUP BY year) AS new_subquery
ON new_subquery.year = b.year AND new_subquery.max_num = b.num
ORDER BY year DESC;

 * postgresql:///names
101 rows affected.
yearfirst_namenum
2020Liam19659
2019Liam20555
2018Liam19924
2017Liam18824
2016Noah19154
2015Noah19650
2014Noah19319
2013Noah18266
2012Jacob19088
2011Jacob20378
2010Jacob22139
2009Jacob21184
2008Jacob22603
2007Jacob24292
2006Jacob24850
2005Jacob25837
2004Jacob27886
2003Jacob29643
2002Jacob30579
2001Jacob32554
2000Jacob34483
1999Jacob35367
1998Michael36616
1997Michael37549
1996Michael38365
1995Michael41399
1994Michael44472
1993Michael49554
1992Michael54397
1991Michael60793
1990Michael65302
1989Michael65399
1988Michael64150
1987Michael63654
1986Michael64224
1985Michael64924
1984Michael67745
1983Michael68010
1982Michael68244
1981Michael68776
1980Michael68704
1979Michael67742
1978Michael67157
1977Michael67609
1976Michael66947
1975Michael68451
1974Michael67580
1973Michael67842
1972Michael71401
1971Michael77599
1970Michael85291
1969Michael85201
1968Michael81995
1967Michael82440
1966Michael79990
1965Michael81021
1964Michael82642
1963Michael83778
1962Michael85041
1961Michael86917
1960David85933
1959Michael85224
1958Michael90564
1957Michael92718
1956Michael90665
1955Michael88372
1954Michael88576
1953Robert86247
1952James87063
1951James87261
1950James86229
1949James86865
1948James88589
1947James94764
1946James87439
1945James74460
1944James76954
1943James80274
1942James77174
1941James66743
1940James62476
1939Robert59653
1938Robert62269
1937Robert61842
1936Robert58499
1935Robert56522
1934Robert55834
1933Robert54223
1932Robert59265
1931Robert60518
1930Robert62149
1929Robert59804
1928Robert60703
1927Robert61671
1926Robert61130
1925Robert60897
1924Robert60801
1923John57469
1922John57280
1921John58215
1920John56914

8. The most years at number one

Noah and Liam have ruled the roost in the last few years, but if we scroll down in the results, it looks like Michael and Jacob have also spent a good number of years as the top name! Which name has been number one for the largest number of years? Let's use a common table expression to find out.

%%sql

-- Select first_name and a count of years it was the top name in the last task; alias as count_top_name
-- Use the code from the previous task as a common table expression
-- Group by first_name and order by count_top_name descending

WITH count_top_name_male AS (
    SELECT b.year, b.first_name, b.num
    FROM baby_names AS b
    INNER JOIN (
        SELECT year, MAX(num) as max_num
        FROM baby_names
        WHERE sex ='M'
        GROUP BY year) AS new_subquery
    ON new_subquery.year = b.year
    AND new_subquery.max_num = b.num
    ORDER BY year DESC
    )
SELECT first_name , COUNT(first_name) as count_top_name
FROM count_top_name_male
GROUP BY first_name
ORDER BY COUNT(first_name) DESC;
 * postgresql:///names
8 rows affected.
first_namecount_top_name
Michael44
Robert17
Jacob14
James13
Noah4
John4
Liam4
David1