I evolved from a regular old baseball fan into something well beyond that the same way a lot of other people do nowadays – with fantasy baseball. I’d been playing most of my life, but in 2014 I dove into my first league that involved prospects. I didn’t really know what I was doing, but I ended up doing pretty well, I think, rounding up Corey Seager, Josh Bell, Michael Conforto, and Touki Toussaint in the first 7 rounds, and even managed to scoop a 17-year-old Luiz Gohara in the 22nd. Though any success I had past Seager was blind luck, really.
As the number of available names from the public “Top 100” lists began to slowly creep to 0, I didn’t know who to pick. So I stuck with the numbers. In the 10th round, I picked a 2013 draftee who, in his first full season, pitched 120+ IP at A+, struck out 117, walked just 25, allowed only 21 ER, and gave up just 2 HR all year. This, to me at the time, was a great pick.
His name was Glenn Sparkman, pitcher for the Kansas City Royals.
Of course, there was a lot I didn’t know at the time. I didn’t know that 22-year-olds in High-A should [more or less] be pitching exactly that well, because the competition below AA is sometimes underdeveloped. I didn’t know that his 11 appearances out of the bullpen were noteworthy, as they would inflate his K totals while deflating his ERA. I didn’t know that his home park (Wilmington) suppresses HR like crazy and that his measly 1.4% HR/FB could in no way be a skill.
This worthless little anecdote always stuck with me, and that’s how Sparkman, the projection system that I’ve developed for pitching prospects, got both its name and a little bit of inspiration.
WHAT DOES SPARKMAN DO?
Sparkman projects the Major League impact of a pitcher through his 20’s using his age and stats at whatever level he pitched at. The projections come in the form of a total FanGraphs WAR (fWAR) for those years and are based on historical minor league seasons from 2007 forward.
This is the first of many times I will say that Sparkman is not intended in any way to replace or be better than actual scouting. That’s not its purpose. When analyzing a prospect, pitching or hitting, visit the actual scouting reports of Prospects Live, FanGraphs, Baseball America, or wherever else first, as the numbers alone will never come close to telling the whole story. What Sparkman can hopefully do is contextualize certain things such as age/level/risk to perhaps shed some light on pitching performances that were either underappreciated or not as impressive as they seemed on the surface.
Learning to web scrape is very much at the top of my to-do list, but until then, only the years 2007 and beyond for Minor League Baseball data are [free and] available in one nice, neat location for me to download and analyze. On one hand, that’s disappointing; more data is never a bad thing, and I would love to go back into the 90s or beyond. On the other hand, the game has changed so much over the last 10 to 20 years that I’m not sure the older data would be more helpful than what I already have. It could even make the model worse. I won’t know until I get my hands on it. For now, however, I was happy with the data I had for an initial rollout, which was almost 40,000 individual data points from Low-A to AAA over those years.
What Sparkman sets out to do is, for each pitching prospect who pitched in the last year, project the percent chance that said pitcher will reach certain career “milestones” before the age of 30. Then, based on those percentages, it calculates an expected WAR for each pitcher over that same time period. To do this, Sparkman takes a player’s stats and uses various logistic regressions that change depending on level and milestone.
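The article does not spell out how the milestone percentages are combined into an expected WAR, so the following is only a rough sketch of one way it could be done, not Sparkman's actual calculation. It treats the gap between consecutive milestone probabilities as the chance of finishing in that bucket and multiplies each by a representative WAR value. The function name `expected_war` and every value in `BUCKET_WAR` are invented for illustration.

```python
# Milestones in order: Make MLB, then 2, 4, 7, 10, 14 and 18 WAR before age 30.
# ASSUMPTION: one representative WAR value per bucket (pitchers who clear a
# milestone but not the next one). These values are invented for illustration.
BUCKET_WAR = [1.0, 3.0, 5.5, 8.5, 12.0, 16.0, 22.0]


def expected_war(milestone_probs, bucket_war=BUCKET_WAR):
    """Turn 'reaches at least this milestone' probabilities into an expected WAR."""
    if len(milestone_probs) != len(bucket_war):
        raise ValueError("need one probability per milestone")
    total = 0.0
    for i, p_reach in enumerate(milestone_probs):
        # Probability of also clearing the next milestone (0 after the last one).
        p_next = milestone_probs[i + 1] if i + 1 < len(milestone_probs) else 0.0
        # Chance of finishing in this bucket. Clamp at 0 because separately
        # fitted regressions are not guaranteed to be perfectly monotone.
        p_bucket = max(0.0, p_reach - p_next)
        total += p_bucket * bucket_war[i]
    return total


# Probabilities copied from the Simeon Woods Richardson row of the results table.
probs = [0.83, 0.59, 0.56, 0.50, 0.25, 0.13, 0.05]
print(round(expected_war(probs), 1))
```

Because the bucket values are made up, the result should not be expected to match the Expected WAR column in the results table; the point is only the mechanics of going from milestone odds to a single number.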
Logistic regressions, for those unfamiliar, model an outcome that can take only two values, and they return the probability of the second one:
0 (or “False”, “Negative”, “No”, etc.)
1 (or “True”, “Positive”, “Yes”, etc.)
All historical seasons in the Minors have a 0 or a 1 associated with them for each milestone at the Major League level. Here’s James Paxton in his 20’s (13.6 total fWAR), for example:
Plotted in red along Y=1 (“Yes”) are the players that were worth 2+ WAR in their 20’s. These are dots, as one player only has one output, but may look like a line in most places due to the high density of data (there’s only so much you can spread out thousands of data points across one line). Plotted in purple at Y=0 (“No”) are the players that didn’t hit that milestone. In light blue, we have the percent chance that the player will reach said milestone as predicted by the logistic regression given his league-adjusted K%. As you can see, there are very few players who had an 80 K%+ or below in AA who reached this 2 WAR milestone. Meanwhile, the chance of reaching 2+ WAR more than doubles from 100 K%+ to 140 K%+. It’s hard to tell from a plot such as this, but the density of Y=1 dots past a 110 K%+ is much higher than the density of dots along Y=0 at this point, which is why the curve gravitates upward towards Y=1 as the K%+ increases. This is just one example of how a stat can impact Sparkman’s projection of a pitcher.
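To make the curve described above a little more concrete, here is a minimal sketch of a one-variable logistic regression fit with plain gradient ascent. The data in `TOY_SEASONS` is invented, not Sparkman's, and the real model uses more inputs than K%+ alone; the helper names `fit_logistic`, `scale` and `predict` are likewise hypothetical.

```python
import math

# Invented toy data: (AA K%+, 1 if the pitcher reached 2+ WAR in his 20's else 0).
TOY_SEASONS = [
    (70, 0), (80, 0), (85, 0), (90, 0), (95, 1), (100, 0), (105, 0),
    (110, 1), (115, 0), (120, 1), (125, 0), (130, 1), (140, 1),
]


def sigmoid(z):
    """Logistic function: maps any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def scale(k_plus):
    """Rescale K%+ so that league average (100) sits at 0; keeps the fit stable."""
    return (k_plus - 100.0) / 20.0


def fit_logistic(xs, ys, lr=0.1, steps=5000):
    """Fit intercept b0 and slope b1 by gradient ascent on the log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(b0 + b1 * x)  # observed 0/1 minus predicted chance
            g0 += err
            g1 += err * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1


xs = [scale(k) for k, _ in TOY_SEASONS]
ys = [reached for _, reached in TOY_SEASONS]
b0, b1 = fit_logistic(xs, ys)


def predict(k_plus):
    """Predicted chance of reaching the milestone at a given K%+."""
    return sigmoid(b0 + b1 * scale(k_plus))


for k in (80, 100, 140):
    print(k, round(predict(k), 2))
```

Every training row is a 0 or a 1, yet the fitted curve returns something in between, rising as the 1s become denser relative to the 0s. That is the light blue line in the plot.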
If at this point you’re thinking to yourself, “this sounds a bit like Chris Mitchell’s KATOH”, you wouldn’t be off base. Sparkman and KATOH are built similarly. Each uses binary output models to predict milestones. KATOH was built using a probit model, whereas Sparkman was built on logit models. They’re very similar, and I stuck with logistic regression solely because I was much more familiar with it. Also, Sparkman is (from what I can tell from reading Mitchell’s work/writing) a bit more “dynamic” than KATOH in that each milestone at each level doesn’t necessarily take the same inputs as the milestone before it. For example, BB% might be helpful in predicting the “Make MLB” milestone in A-ball, but it might not be helpful whatsoever in predicting the “7 WAR” milestone. In that case, BB% is accounted for only until it begins losing predictive value.
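For a sense of how close logit and probit really are, the short sketch below compares the two link functions directly. It is a general illustration rather than anything taken from KATOH or Sparkman; the rescaling constant 1.702 is a commonly used value for lining the logistic curve up with the normal one.

```python
import math


def logit_cdf(z):
    """Logistic curve, the link function behind a logit model."""
    return 1.0 / (1.0 + math.exp(-z))


def probit_cdf(z):
    """Standard normal CDF, the link function behind a probit model."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Commonly used constant that stretches the logistic curve onto the normal one.
SCALE = 1.702

# Compare the two curves on a grid from -4 to 4 in steps of 0.1.
grid = [i / 10.0 for i in range(-40, 41)]
max_gap = max(abs(logit_cdf(SCALE * z) - probit_cdf(z)) for z in grid)
print(f"largest gap on the grid: {max_gap:.4f}")
```

Once rescaled, the two curves stay within about a percentage point of each other everywhere on the grid, which is why the choice between them rarely changes a projection much.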
The rundown of which stats at which level I found to be predictive of which milestones would honestly be an entire article in and of itself, so I’m not going to go into too much detail. For the most part, the most important factors were, unsurprisingly, age, the rate of games started vs. relief appearances (GS%), and K%+ (adjusted to league average). BB%+ helped quite a bit, but not always. HR/BBE+ (used as a proxy for “hard” contact) and (GB+IFFB)/BBE+ (used as a proxy for “weak” contact) were helpful in a few cases, but were unhelpful and thus unused more often than not.
K%+, BB%+, HR/BBE+, and (GB+IFFB)/BBE+ were (as you can tell by the “+”) centered on the average of the league in question during that year. HR/BBE+ was also adjusted using the HR park factors recently produced by Sam Dykstra at MiLB.com. Wilmington has a 53 (!!) park factor for home runs over the last 3 years, so it’s no wonder Glenn Sparkman thrived there way back when.
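As a rough illustration of what a “+” stat looks like in practice, the sketch below uses the conventional index where 100 means league average. The exact formulas Sparkman uses are not given here, so both helpers are assumptions: in particular, `park_adjusted_rate` guesses that half of a pitcher's batted balls come at home and that the park factor is on a 100 = neutral scale. All input rates are made up.

```python
def plus_stat(player_rate, league_rate):
    """Index a rate to league average, where 100 means exactly league average."""
    if league_rate <= 0:
        raise ValueError("league_rate must be positive")
    return 100.0 * player_rate / league_rate


def park_adjusted_rate(rate, park_factor, home_share=0.5):
    """Neutralise a HR rate for the pitcher's home park.

    ASSUMPTION: park_factor is on a 100 = neutral scale and only the
    home_share of batted balls allowed at home is affected by it.
    """
    blended = home_share * (park_factor / 100.0) + (1.0 - home_share)
    return rate / blended


# Hypothetical pitcher: 27% strikeout rate in a league that averages 22.5%.
k_plus = plus_stat(0.27, 0.225)

# Hypothetical HR/BBE of 1.0% in a park with a 53 HR factor, league average 2.0%.
neutral_hr = park_adjusted_rate(0.010, park_factor=53)
hr_plus = plus_stat(neutral_hr, 0.020)

print(round(k_plus), round(hr_plus))
```

Under these assumptions, a home run rate posted in a park like Wilmington gets pushed back up toward league average before it ever reaches the model.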
All 4 of these were also regressed back towards league average depending on the sample size at hand. If a pitcher faced 50 batters and walked 1 person, his BB% that went into the model wasn’t 2% (or whatever BB%+ resulted from a 2% mark). Instead, uncertainty was measured based on 50 TBF, and the rate was regressed upwards because that sample is small relative to the “reliability” point of BB% (~170 TBF).
My theory as to why HR/BBE wasn’t more impactful is the large “reliability” point of home run rate, which I estimated at around 800 batted balls based on a few things. When all the numbers were run, very few HR marks strayed far from the 90-110 range: 800 batted balls in a single year is impossible to reach, so there are no “stable” HR/BBE numbers, only regressed ones.
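The sample-size regression described above can be sketched with a common shrinkage recipe: add “reliability”-many league-average opportunities to the pitcher's line before computing the rate. Whether Sparkman regresses in exactly this way is not stated, so treat `regress_rate` and the league rates below as assumptions; only the 170 TBF and 800 batted ball reliability points come from the text.

```python
def regress_rate(events, opportunities, league_rate, reliability):
    """Shrink an observed rate toward league average.

    Equivalent to padding the sample with `reliability` extra opportunities
    that went exactly as the league average would predict.
    """
    return (events + league_rate * reliability) / (opportunities + reliability)


LEAGUE_BB = 0.085  # hypothetical league walk rate
LEAGUE_HR = 0.030  # hypothetical league HR per batted ball

# 1 walk in 50 batters faced: raw BB% is 2%, but 50 TBF is well short of ~170.
bb_regressed = regress_rate(1, 50, LEAGUE_BB, reliability=170)

# 5 HR on 300 batted balls: even a full season falls far short of 800.
hr_regressed = regress_rate(5, 300, LEAGUE_HR, reliability=800)

print(round(bb_regressed, 3), round(hr_regressed, 4))
```

With a reliability point of 800, most of the weight goes to the league average no matter what the pitcher did, which is consistent with the regressed HR numbers bunching up near 100.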
- Remember, all numbers you see below are projections before the age of 30
- FV values are loosely based on projected WAR, both historically (FG) and by the distribution of xWAR totals that the model has output in the past. Don’t take them too seriously; it’s just a way to quickly group similar outputs in a way that people understand.
- The model doesn’t know who has made the MLB or not, so there are (by design) players who have already made the MLB with a “Make MLB %” below 100%. Don’t worry too much about that.
- If on mobile, turn to view in landscape to see results in table form
|Rank|Name|Team|Age|Make MLB|2 WAR|4 WAR|7 WAR|10 WAR|14 WAR|18 WAR|Expected WAR|FV|
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|5|Simeon Woods Richardson|Blue Jays|18|0.83|0.59|0.56|0.5|0.25|0.13|0.05|6.4|55|
|25|Bryan Mata|Red Sox|20|0.81|0.48|0.29|0.2|0.13|0.07|0.04|4|50|
|33|Nate Pearson|Blue Jays|22|0.84|0.43|0.33|0.19|0.1|0.05|0.02|3.6|50|
|40|Alek Manoah|Blue Jays|21|0.61|0.34|0.26|0.15|0.1|0.07|0.02|3.2|50|