Word Clouds are all the rage these days, and here at Over the Monster, we are all about being hip, trendy and in-the-know. (Dubstep. Catfish. Sharknado. See?) One of the more interesting and entertaining toys on the internet these days is Wordle, an easy-to-use Java-based word cloud generator with some nice design options. In honor of the upcoming Futures Game, I decided to try a little experiment: I found prospect write-ups on all the top Red Sox pitching prospects from some really smart people at places like Baseball Prospectus, Minor League Ball and SoxProspects.com, edited them to isolate just the important words and phrases, and pulled them into Wordle.
The Word Cloud above comes from approximately 660 words used to describe Matt Barnes, Allen Webster, Anthony Ranaudo, Henry Owens, Rubby De La Rosa, Brian Johnson, Brandon Workman, Chris Hernandez, Chris Carpenter, Drake Britton and Ty Buttrey. Since more has been written about the top guys, the number of words is not evenly distributed, but heavily favors Barnes, Webster, Owens and De La Rosa.
The editing process proved to be more complicated than simply removing articles and conjunctions. There is a consistent language used by prospect writers, but each individual writer approaches their write-ups with a slightly different format, and this poses a serious problem if the goal here is to see what types of players the Red Sox tend to have in their system. Some editorial decisions are fairly straightforward. "His fastball is a plus" becomes "plus-fastball" without much thought, but for other recurring themes, the choice is more perilous. I have attempted to re-write as little as possible while also trying to make similar observations share a common term.
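For anyone who wants to try this at home, the kind of editing described above can be roughed out in code. This is only a sketch of the idea, not the process used for this cloud, which was done by hand. The rule list and the function names (RULES, normalize, count_terms) are made up for illustration; the example phrases are the ones quoted in this post.

```python
import re
from collections import Counter

# Hypothetical rewrite rules: each pattern maps one writer's phrasing
# to a single shared term. A real list would be much longer.
RULES = [
    (re.compile(r"\bfastball is (?:a )?plus\b", re.I), "plus-fastball"),
    (re.compile(r"\bnumber-two to number-three starter\b", re.I), "number-three starter"),
    (re.compile(r"\bmid-rotation starter\b", re.I), "number-three starter"),
]


def normalize(snippet):
    """Return the shared terms found in one write-up snippet."""
    terms = []
    for pattern, term in RULES:
        # One entry per match, so repeated phrasing counts more than once.
        terms.extend(term for _ in pattern.findall(snippet))
    return terms


def count_terms(snippets):
    """Tally shared terms across every snippet."""
    counts = Counter()
    for snippet in snippets:
        counts.update(normalize(snippet))
    return counts
```

Anything the rules do not recognize is simply dropped, so a hand review of the leftovers would still be needed.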
Certain key words like "consistency" and "command" appear over a broad range of contexts, and when they are not clearly tied to a judgment like "average command" or "lacks consistency," I have left them alone. Such words or terms are more likely to come up often (and therefore be more prominent in the word cloud) when there are concerns surrounding them.
I also chose to leave pitch types in. Since every pitcher features a fastball, that word is the most used term in the cloud. The result is that fastball becomes a kind of control for this experiment. Other terms are more easily understood in relation to the size of fastball. The other pitch types vary in their occurrence, and their size therefore tends to reflect their importance in the discussion of prospects.
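The "fastball as control" idea can be put in numbers by dividing every term's count by the count for fastball. Below is a minimal sketch; the function name relative_to_control is hypothetical, and any counts fed to it would have to come from your own tally, since the actual counts behind this cloud are not listed here.

```python
from collections import Counter


def relative_to_control(counts, control="fastball"):
    """Express each term's count as a fraction of the control term's count."""
    base = counts.get(control, 0)
    if base == 0:
        raise ValueError(f"control term {control!r} does not appear in the counts")
    return {term: n / base for term, n in counts.items()}


# Illustrative numbers only, not the real counts from the write-ups.
example = relative_to_control(Counter({"fastball": 10, "change-up": 5, "cutter": 1}))
```

A value near 1.0 means a term shows up about as often as fastball does; a value near zero means it barely registers.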
Interpreting the results
Looking at this first attempt, there are a few interesting observations we can make about the pitchers currently in the Red Sox system as a group.
One small surprise comes from the pitch types. Change-up is second only to fastball in size. Red Sox prospects are throwing this pitch regularly and it is often a plus pitch for them. Curveball is second among the secondary offerings, followed by slider. Cutters are noticeably small in comparison to other pitches and that too is a bit of a surprise. The Red Sox had developed the reputation as an organization that taught the cutter several years ago, but this current batch of top prospects doesn’t fit that image at this point.
A few key adjectives recur regularly and "projectable" is the one that jumps out at me the most. The Red Sox seem to favor players who give scouts a sense of confidence in their physicality and the ability to translate that tool as they develop. Anthony Ranaudo, Trey Ball and Henry Owens all get this tag at least once in the sample used and it would not be out of place on Matt Barnes, Brandon Workman and several others in the sample.
Command is also very prominent in the design, and that reflects something else about the current Red Sox system. The Red Sox are not afraid to draft or develop pitchers who come to them in need of help controlling the strike zone. Comments on command range from "command-issues" and "below-average command" (both of which are included separately) to ones that are more nuanced, along the lines of "needs more consistent command of breaking balls," but the prominence of the word is generally reflective of a widespread concern with command among Red Sox pitching prospects.
"Number-three starter", "reliever" and "bullpen" all catch the eye, and they reflect the most common projections of floor and ceiling in the write-ups. I standardized comments like "projects as a number-two to number-three starter" and "mid-rotation starter" to just "number-three starter" to cut through the differences in terminology as best as I could, in large part because "number-three starter" is the projection most often given to the top Red Sox pitching prospects. In some cases that is a straight projection; other times it is a statement of ceiling. Bullpen and reliever tend to be more often used in describing players’ floors.
There are some words that pertain primarily to specific pitches. "Out-pitch" is one that comes up regularly and that is certainly a good sign. "Weak contact," "tight rotation" and "late break" also bear some noticeable weight here. There seems to be a consistent pattern of Red Sox prospects possessing at least one secondary offering that is either a plus pitch now or has attainable plus-pitch potential. Unfortunately, the term "fringe-average" also jumps off the page, and often that relates to second or third pitches.
I am generally happy with the results in this first experiment with using word clouds. The end result is interesting and it offers the promise of some insights. However, several issues make this first attempt somewhat flawed as a vehicle for understanding the Red Sox current system. A more even word count for each player would eliminate the bias towards those players who have more sizable write-ups in this example.
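One way to even things out without throwing away words is to weight each player's terms so that every player contributes the same total to the cloud. This is a rough sketch under that assumption, not something done for the cloud in this post; balanced_weights and its input layout (a dict of player name to list of terms) are hypothetical.

```python
from collections import Counter


def balanced_weights(terms_by_player):
    """Combine per-player term lists so each player adds a total weight of 1."""
    combined = Counter()
    for player, terms in terms_by_player.items():
        if not terms:
            continue  # a player with no terms contributes nothing
        share = 1.0 / len(terms)
        for term in terms:
            combined[term] += share
    return combined
```

With this weighting, a pitcher with a 150-word write-up and one with a 30-word write-up pull on the final picture equally, at the cost of making each word from the shorter write-up count for more.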
In general, more is almost certainly better with respect to word count for these. I would aim for at least double the word count in the future to help make some of the common critiques jump out more. Here there are enough terms that receive repeated use for us to make some interesting observations, but it would probably be a mistake to draw too many broad conclusions from this sample.
While upping the word count is the simplest way to improve this exercise, the most important refinement would be creating a more systematic way of editing comments into simple phrases. The biggest danger in this type of infographic is the possibility that people will read the wrong thing into a word or phrase’s prominence. An important key word like "durability" could be used in a negative light ("Player X lacks durability") or in a positive one ("Player X should develop increased durability as he fills out"). Both comments need to be distilled into consistent, repeatable phrases, and a higher word count means more and more judgment calls for the author. A consistent vocabulary here would maximize the potential of word clouds for this purpose. I’ve tried to avoid allowing repeated terms that lack the correct context, but the amount of re-phrasing here is fairly minimal. Going forward, I think that needs to change. This is a broad-stroke way of viewing things and the loss of nuance is probably inevitable.
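To give a sense of what a more systematic vocabulary might look like, here is a small sketch that tags a key word by the context around it. The cue lists, the output labels ("lacks-" and "projects-") and the function name tag_keyword are all assumptions for illustration. Simple matching like this will misread plenty of sentences, so it would only be a first pass ahead of human editing.

```python
# Hypothetical cue phrases; a real list would need tuning against actual write-ups.
NEGATIVE_CUES = ("lacks", "concern", "questionable", "below-average")
POSITIVE_CUES = ("should develop", "increased", "above-average")


def tag_keyword(comment, keyword):
    """Return a context-tagged version of keyword, or None if it is absent."""
    text = comment.lower()
    if keyword.lower() not in text:
        return None
    if any(cue in text for cue in NEGATIVE_CUES):
        return f"lacks-{keyword}"
    if any(cue in text for cue in POSITIVE_CUES):
        return f"projects-{keyword}"
    # No clear judgment attached, so leave the bare key word alone.
    return keyword
```

Run on the two durability comments above, the first would come out as "lacks-durability" and the second as "projects-durability", which keeps the two meanings from piling into one oversized word.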
Please share your thoughts on this concept, its strengths and weaknesses, and ways it can be improved below.