Some in-depth analysis of letter frequency

ray
Made It and Played It

Posts: 11

Airdate: 11/30/16
Winnings: $14,350 (half of total winnings; played with sister Debra)

Post by ray on Nov 4, 2015 13:35:50 GMT -5

As I’ve been preparing to go on the show, a few questions have popped up while playing. I know what the most common letters are in general, but does letter frequency change for certain categories like Proper Name, On the Map, or Food & Drink? If a letter that is commonly paired with h (such as t, g, c, or s) is revealed, does h become a much better guess?

So I did what any self-respecting nerd would do and made a computer program to analyze puzzles! I was inspired by this post on reddit’s /r/dataisbeautiful subreddit: https://www.reddit.com/r/dataisbeautiful/comments/2r6bfp/which_letters_should_you_pick_in_wheel_of_fortune/ and scraped the same website its author did to get a bunch of puzzles to study. I have a list of over 20,000 puzzles now, with category information for maybe two-thirds of them. I have a program that takes that list of puzzles, sorts them by category, and then analyzes for 2 different things:

1. Raw letter frequency (what percentage of all letters does this letter make up?)
2. Frequency of letter presence at least once in a puzzle (in what percentage of puzzles does this letter occur?)

I find number 2 to be more useful, as this lets me know what are good guesses if I’m trying to avoid duds. I have the program sort the letters by decreasing percentage, so the best guesses are first.
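A minimal Python sketch of the two analyses described above. This is not the original program: the function names and the sample puzzles are made up for illustration, and it assumes puzzles are plain strings where only the letters a-z count.

```python
from collections import Counter

def letters_of(puzzle):
    """Yield the lowercase a-z letters of a puzzle, skipping spaces and punctuation."""
    return (ch for ch in puzzle.lower() if "a" <= ch <= "z")

def raw_frequency(puzzles):
    """Analysis 1: each letter's share of all letters, as a percentage."""
    counts = Counter(ch for p in puzzles for ch in letters_of(p))
    total = sum(counts.values())
    pairs = [(ch, 100.0 * n / total) for ch, n in counts.items()]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

def presence_frequency(puzzles):
    """Analysis 2: percentage of puzzles that contain each letter at least once."""
    counts = Counter()
    for p in puzzles:
        counts.update(set(letters_of(p)))  # a set counts each letter once per puzzle
    pairs = [(ch, 100.0 * n / len(puzzles)) for ch, n in counts.items()]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

# Hypothetical sample data, not taken from the real puzzle list.
sample = ["BANANA BREAD", "NEW YORK CITY", "TOM HANKS"]
for letter, pct in presence_frequency(sample)[:3]:
    print(letter, round(pct, 1))
```

Both functions return a list of (letter, percentage) tuples sorted from highest to lowest, which is the same shape as the results posted in this thread.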

I can also filter puzzles using regular expressions. What this means is I can ask a question such as: Given that a puzzle has a word with t as the 4th from last letter, what is the percent chance that an n is in that puzzle? (looking at the tion suffix)
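A rough sketch of that kind of regular-expression filter in Python. The pattern, the function name, and the sample puzzles are assumptions for illustration; the actual pattern used for the analysis is not shown in this thread.

```python
import re

# Assumed pattern: a "t" followed by exactly three more letters and then the
# end of the word, i.e. "t" is the 4th from last letter (as in "staTION").
T_FOURTH_FROM_LAST = r"t[a-z]{3}\b"

def presence_given(puzzles, pattern, letter):
    """Percent of the puzzles matching `pattern` that also contain `letter`.

    Returns None when no puzzle matches, since the percentage is undefined.
    """
    rx = re.compile(pattern)
    matching = [p.lower() for p in puzzles if rx.search(p.lower())]
    if not matching:
        return None
    hits = sum(1 for p in matching if letter in p)
    return 100.0 * hits / len(matching)

# Hypothetical sample data, not taken from the real puzzle list.
sample = ["SPACE STATION", "MOTHER GOOSE", "WINTER COAT", "BLUE MOON"]
print(presence_given(sample, T_FOURTH_FROM_LAST, "n"))
```

Only SPACE STATION and MOTHER GOOSE match the pattern here, and only one of those two contains an n, so the sketch reports 50.0 for this toy list.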

Some of my questions have been duds (revealing t, g, c, or s does not strongly change the percent chance of an h appearing) and some have been quite useful (t is much less common in the Proper Name and On the Map categories). I’ll post some of the more interesting results here, and would love suggestions for other things to look for. It’s fairly easy to run an analysis, so feel free to post in this thread with an idea and I’ll try to get you some results. For example: in a category that isn’t normally plural, what happens to the frequency of s?

Post by ray on Nov 4, 2015 13:36:15 GMT -5

Baseline percentages (Type 2 analysis)
All puzzles:
[('e', 86.50118410665732),
('a', 79.81317428295763),
('i', 74.41452504166301),
('t', 74.00228050171037),
('r', 72.78308920270152),
('o', 71.80949039557933),
('n', 71.77879133409351),
('s', 67.32304183843523),
('l', 56.65292518200158),
('h', 50.32014735549514),
('c', 47.3861941934918),
('d', 45.36005613542672),
('g', 41.72441013946145),
('u', 39.860538549250066),
('m', 36.3520743794404),
('p', 34.73379528111569),
('f', 29.080782387509867),
('b', 28.905359179019385),
('y', 28.6027541443733),
('w', 24.967108148408034),
('k', 21.835803876852907),
('v', 14.349618454521535),
('j', 4.3373388299272),
('x', 2.9251820015788086),
('z', 2.6445048679940357),
('q', 1.6884483817209017)]

In general, t, r, n, and to a slightly lesser degree s, are roughly equally good guesses.

Category: Proper Name
[('e', 83.95904436860067),
('a', 80.54607508532423),
('n', 79.86348122866895),
('r', 76.45051194539249),
('o', 70.64846416382252),
('s', 66.89419795221842),
('t', 65.52901023890784),
('i', 64.84641638225256),
('l', 52.55972696245734),
('c', 44.027303754266214),
('h', 37.54266211604095),
('d', 36.51877133105802),
('b', 30.034129692832767),
('m', 30.034129692832767),
('p', 30.034129692832767),
('y', 29.351535836177472),
('u', 23.208191126279864),
('j', 22.18430034129693),
('g', 21.843003412969285),
('k', 20.477815699658702),
('w', 19.453924914675767),
('f', 16.38225255972696),
('v', 8.19112627986348),
('x', 5.1194539249146755),
('z', 3.4129692832764507),
('q', 2.04778156996587)]

As I had suspected, things are different in this category. n and r are in a league of their own among consonants, and t drops quite a bit (although it is still one of the top 5 consonants).

Category: On the Map
[('a', 86.83651804670913),
('n', 79.40552016985139),
('e', 74.09766454352442),
('i', 69.85138004246284),
('o', 66.66666666666666),
('r', 63.26963906581741),
('t', 61.78343949044586),
('s', 59.23566878980891),
('l', 54.56475583864119),
('c', 45.64755838641189),
('h', 42.25053078556263),
('d', 37.57961783439491),
('u', 31.422505307855626),
('m', 31.210191082802545),
('b', 26.53927813163482),
('g', 25.902335456475583),
('y', 23.991507430997878),
('p', 21.656050955414013),
('k', 20.59447983014862),
('w', 19.32059447983015),
('v', 18.046709129511676),
('f', 17.40976645435244),
('x', 8.280254777070063),
('z', 4.45859872611465),
('j', 4.033970276008493),
('q', 1.2738853503184715)]

Similarly, n is rather prominent, and t has dropped down the list.

Category: Food & Drink
[('e', 88.78378378378379),
('a', 85.54054054054055),
('s', 78.51351351351352),
('r', 73.37837837837839),
('i', 71.08108108108108),
('t', 69.1891891891892),
('o', 69.05405405405406),
('n', 67.56756756756756),
('c', 66.75675675675676),
('l', 59.189189189189186),
('d', 57.027027027027025),
('h', 55.945945945945944),
('u', 52.43243243243243),
('p', 43.37837837837838),
('m', 41.75675675675676),
('g', 35.810810810810814),
('b', 35.0),
('f', 32.567567567567565),
('w', 28.91891891891892),
('k', 25.945945945945947),
('y', 21.62162162162162),
('v', 9.18918918918919),
('z', 5.27027027027027),
('j', 4.864864864864865),
('q', 2.027027027027027),
('x', 1.6216216216216217)]

s jumps up dramatically in this category.

Post by ray on Nov 4, 2015 13:46:11 GMT -5

An example of a dud: given that a puzzle has a t, c, g, or s that is not the last letter of a word:

Category: All puzzles
[('e', 87.4518643749413),
('a', 80.51563820794591),
('t', 78.0501549732319),
('i', 75.19958673804827),
('r', 73.67333521179675),
('n', 72.35841081994928),
('o', 72.14238752700291),
('s', 70.28740490278952),
('l', 56.48539494693341),
('h', 52.06161360007514),
('c', 50.63398140321217),
('d', 44.810744810744815),
('g', 43.35963182117028),
('u', 40.42453273222504),
('m', 36.23086315394008),
('p', 35.03803888419273),
('f', 29.036348267117496),
('b', 28.346012961397577),
('y', 28.261482107635956),
('w', 24.61726307880154),
('k', 22.189349112426036),
('v', 14.163614163614163),
('j', 4.165492627031089),
('x', 2.977364515826054),
('z', 2.5077486615948152),
('q', 1.7375786606555836)]

h only moved up about 2 percentage points (from 50.3% to 52.1%), which is not really a significant change.
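A sketch of how that comparison against the baseline could be computed, in Python. The pattern for "t, c, g, or s that is not the last letter of a word" is a guess at the filter, and the names and sample puzzles are made up for illustration.

```python
import re

# Assumed pattern: one of t, c, g, s immediately followed by another letter,
# so it cannot be the last letter of its word.
TCGS_NOT_WORD_FINAL = r"[tcgs](?=[a-z])"

def presence_pct(puzzles, letter):
    """Percent of puzzles containing `letter` at least once (0.0 for an empty list)."""
    if not puzzles:
        return 0.0
    return 100.0 * sum(letter in p.lower() for p in puzzles) / len(puzzles)

def shift_in_points(puzzles, pattern, letter):
    """Presence of `letter` in the filtered puzzles minus its presence overall."""
    rx = re.compile(pattern)
    subset = [p for p in puzzles if rx.search(p.lower())]
    return presence_pct(subset, letter) - presence_pct(puzzles, letter)

# Hypothetical sample data, not taken from the real puzzle list.
sample = ["THE BEACH", "HOT DOGS", "RED APPLE", "OLD MAN"]
print(shift_in_points(sample, TCGS_NOT_WORD_FINAL, "h"))
```

With the real numbers above, the same subtraction is 52.06 - 50.32, or roughly 1.7 percentage points.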

An example of a good analysis that doesn't really change how you would play: given that there is a word with a t as the 4th from last letter:

[('t', 100.0),
('e', 90.92984640929846),
('a', 82.6068908260689),
('i', 80.71814030718141),
('n', 77.60481527604816),
('o', 76.81610626816106),
('r', 75.05188875051888),
('s', 70.44416770444167),
('l', 55.064342050643425),
('h', 51.26608551266085),
('c', 47.55085097550851),
('g', 42.465753424657535),
('u', 42.258198422581984),
('d', 40.59775840597759),
('m', 39.91282689912827),
('p', 34.62017434620174),
('f', 29.86716479867165),
('y', 29.036944790369446),
('w', 24.63677874636779),
('b', 24.4914902449149),
('k', 21.21212121212121),
('v', 16.27231216272312),
('x', 2.926525529265255),
('j', 2.7189705271897053),
('z', 1.847239518472395),
('q', 1.432129514321295)]

n does become about 6 percentage points more likely (from 71.8% to 77.6%), but it is still roughly equal with r, and n was already a perfectly fine guess after t has been guessed anyway, so this is not the most game-changing information.

tuffy
Made It and Played It

Posts: 87

Airdate: 05/10/16
Winnings: $10,350

Post by tuffy on Nov 4, 2015 20:03:43 GMT -5

Thanks for the interesting and valuable information, Ray!

kevin
Made It and Played It
"I hate you now."

Posts: 433

Airdate: 03/28/16
Winnings: $1,950
SPIN ID: KS7272704

Post by kevin on Nov 5, 2015 3:53:05 GMT -5

Oh my gosh this is amazing. More please! I've often thought about this as well re: categories but don't have the CS background to do what you did. Awesomeness!

Post by tuffy on Nov 5, 2015 12:45:26 GMT -5

Ray, anything else that your wonderful "geeky" mind and computer prowess comes up with, please share with the rest of us! Really appreciate any helpful strategies! As Kevin said, more please.

Post by ray on Nov 5, 2015 16:05:58 GMT -5

For puzzles with no words fewer than five letters long:
[('e', 84.86030707274101),
('a', 77.28416813491064),
('r', 75.9249937075258),
('i', 72.25018877422602),
('s', 69.30531084822552),
('n', 68.71381827334508),
('t', 67.757362194815),
('o', 65.31588220488295),
('l', 56.40573873647118),
('c', 50.56632267807702),
('d', 40.83815756355399),
('u', 38.358922728416815),
('g', 37.81776994714321),
('h', 36.32016108733954),
('p', 36.13138686131387),
('m', 33.3375283161339),
('b', 25.409010823055628),
('y', 23.445758872388623),
('f', 21.356657437704506),
('w', 18.26076013088346),
('k', 17.115529826327712),
('v', 14.384596023156304),
('j', 3.3098414296501386),
('x', 2.617669267556003),
('z', 2.5044047319405993),
('q', 2.0639315378806944)]

r is now significantly better than t (75.9% versus 67.8%), and t is relegated to the next tier of letters. I'm going to clean up my code a bit and make it output things in a more readable format, but I'll keep any interesting finds coming.
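A small Python sketch of a word-length filter like the one behind the list above. It is an assumption about how such a filter might look, not the original code: here a "word" is any run of letters, so hyphens, apostrophes, and ampersands act as separators, and the sample puzzles are made up.

```python
import re

def words_of(puzzle):
    """Split a puzzle into lowercase words, treating any non-letter as a separator."""
    return re.findall(r"[a-z]+", puzzle.lower())

def no_short_words(puzzle, min_len=5):
    """True if the puzzle has at least one word and every word has min_len+ letters."""
    words = words_of(puzzle)
    return bool(words) and all(len(w) >= min_len for w in words)

# Hypothetical sample data, not taken from the real puzzle list.
sample = ["GRAND CANYON", "THE BEACH", "CHOCOLATE MILKSHAKE", "ROCK & ROLL"]
long_only = [p for p in sample if no_short_words(p)]
print(long_only)
```

The filtered list can then be fed into whatever presence-frequency count is being used, the same way the category lists are.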
