How I'm Predicting Baseball Outcomes (zachschnell.com)
42 points by zsch on May 15, 2013 | 44 comments



As someone who has written 10k+ loc on simulators, markov chaining, and done tons of analysis on baseball projections... I don't know where to start.

I honestly don't. (You are very, very far off the correct methodology.)

But maybe this is a good place, I guess.

http://www.oakton.edu/user/4/pboisver/AABaseballMath.html

I am not trying to be a jerk. But ask yourself: If it were that trivial, why wouldn't it be a snap to crush online sportsbetting?


http://espn.go.com/blog/playbook/dollars/post/_/id/2935/meet...

Just a random article. I've met him through poker. He had a radio show for a while, and he stated that if you have a 2% edge over a bookie, you can easily make a million dollars a year. Anyone who thinks they have an edge and isn't also very wealthy is simply mistaken.


2% is enormous and can't last for very long, of course.


Why (as a European, but my dad is a big DET fan) do you guys estimate WPE from high-level variables?

Baseball is basically a lineup-versus-lineup sport, right? Or, in terms of the link: Q per team varies between matches. An additional modelling level should estimate Q per team per match before estimating Qx - Qy.

Without Cabrera, DET has a harder time winning. Same for the big bearded guy.

As a sidenote: some guy posts his code (+) and gets upvoted (+). He's not aiming to change the world, or to become a 100-million firm. He wants feedback and to improve his approach. We can give him that without the "aararrrgh, but Terrence Tao is much smarter than you"? Otherwise, just don't upvote :)


Agreed. The post isn't clear at all about what his goal is, and the methodology makes no sense.


Yeah. My first thought was that the methodology is extremely questionable. I know it's arguable that successive games aren't necessarily independent trials, but analyzing a game in terms of streaks is so incredibly "un-mathematical" I don't know where to start. To top it off, there's no reasoning about why "streak" analysis might give insight into the why of things.


Thank you for the article. Believe me, I am very much aware that this is elementary. I hoped to – but apologize that I didn't – make it clear that making predictions requires incorporating many many factors. It was a fun script for me to code, and I was excited to see it work the night I wrote it. But I also understand that this will need more testing and to be based off of far more information to be anywhere close to accurate.


Yup, and his analysis does not take the strength of the team into account (Vegas calls this the spread). It's like betting that the Miami Heat with LeBron will win a high percentage of games. Well, of course; anyone can make that bet.


Thank you. I also built several baseball sims and this article is just insultingly ignorant! Maybe that's a little extreme, but that was my reaction.


I'm sorry it came off that way. I am very aware that the script in its current form does not come remotely close to doing justice to all of the data that's out there and necessary to incorporate. I meant to make it clear in the article that this script is completely elementary – it was a fun thing to code, but nothing remotely resembling what it takes to predict sports with accuracy.


Rather than waiting for data from future games, you should backtest on old data and see how it performs (http://en.wikipedia.org/wiki/Backtesting).

You might also be interested in checking out the book The Signal and The Noise by Nate Silver (He runs the FiveThirtyEight political/data blog that notably predicted last year's election results with great accuracy.)


More relevant--before he did statistical analysis of elections, he was a leading figure in sabermetrics, the statistical analysis of baseball. His program is still one of the best.


And before that, he was an excellent online gambler. Or was it during that time? ;)

PECOTA is alright, but not really his anymore. cwyers took it over at BPro. It's also tough to say if it's one of the best for any number of statistically boring reasons (read Phil Birnbaum's blog for more info).


Yeah, PECOTA isn't one of the best at all, it just has ranges for variance, which is an imprecise metric as no one knows what he's saying a percentile is under the model. What is luck and what is skill?

Birnbaum is one of the best bloggers on advanced stuff. Tango's blog is the best, period, as anyone good participating in the "open source" movement, so to speak, is there or shows up when they write good stuff. Until you're aware of the work to date, you will be spinning in circles with awful biased errors.


Excellent idea. And it will provide a much faster way to gauge its accuracy as I adjust the script to accommodate more areas of information.

I actually just started that book and so far so excellent.


My bet is that your model is no better than (weighted) chance to predict the next game. Just run the model on the years before 2012 and you'll see what I mean.
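
A minimal sketch of what such a check could look like, assuming a season is available as a string of 'W'/'L' results. The `streak_predictor` rule (the last result repeats) is a hypothetical stand-in, not the article's script, and the season below is made up rather than real data. The baseline simply picks whichever result has been more common so far.

```python
import random


def streak_predictor(history):
    """Hypothetical streak rule: predict that the most recent result repeats."""
    return history[-1] if history else "W"


def majority_baseline(history):
    """Weighted-chance baseline: pick the more common result seen so far."""
    if not history:
        return "W"
    return "W" if history.count("W") * 2 >= len(history) else "L"


def backtest(results, predictor):
    """Walk forward through the games, predicting each one from earlier games only."""
    hits = 0
    for i, actual in enumerate(results):
        if predictor(results[:i]) == actual:
            hits += 1
    return hits / len(results)


rng = random.Random(0)  # fixed seed so the made-up season is reproducible
# Made-up season: 162 independent games at a 58% win probability.
season = "".join("W" if rng.random() < 0.58 else "L" for _ in range(162))
print("streak rule:", round(backtest(season, streak_predictor), 3))
print("baseline:   ", round(backtest(season, majority_baseline), 3))
```

Swapping in real seasons from before 2012 for `season` would show whether the streak rule beats the baseline on games it has never seen.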


On my phone right now so I can't put this analysis on paper, but this feels a lot like looking at the associative property of addition.

i.e. the author regrouped the data to come up with the same result


As a guy who spent years on contract work on sports prediction, I can say this is so naive that it isn't even worth discussing.

To start with, you can't 'predict' the result of a single game. You can have some advantage (or more likely, disadvantage) over the quality of the betting line a bookie gives you.
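
To make "advantage over the line" concrete: American moneyline odds imply a break-even win probability, and a bet only has positive expected value when your own estimate beats that number. A rough sketch with made-up odds and probabilities; the function names are illustrative only.

```python
def implied_probability(moneyline):
    """Break-even win probability implied by American moneyline odds."""
    if moneyline < 0:
        return -moneyline / (-moneyline + 100)
    return 100 / (moneyline + 100)


def expected_value(moneyline, win_p, stake=100.0):
    """Expected profit on a bet of `stake`, given your own win probability."""
    # Profit if the bet wins: favorites (negative odds) pay less than the stake.
    profit = stake * 100 / -moneyline if moneyline < 0 else stake * moneyline / 100
    return win_p * profit - (1 - win_p) * stake


# Made-up example: a -150 favorite that you believe wins 62% of the time.
print("break-even probability:", implied_probability(-150))
print("expected profit per $100:", expected_value(-150, 0.62))
```

If your estimated probability is at or below the break-even figure, the bet loses money on average no matter how often the pick turns out to be right.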


Isn't this an example of the gambler's fallacy, that previous outcomes impact future outcomes?


If game results were independent, yes. But lots of things in baseball aren't independent: teams play the same opponent repeatedly in short series, runs of home or away series occur, pitchers rotate, players get injured, teams may slack when overperforming or intensify their efforts to avoid extended losing streaks, etc.

Most of all: baseball players are fairly superstitious. They believe in streaks, lucky rituals, jinxes, and the gambler's fallacy (being 'due' for a win or a loss). So some serial correlation could be a self-fulfilling prophecy.

Still, I'd expect tons of other available team stats to outperform the last N game results in predictive power.
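
One rough way to look for that kind of serial correlation is to compare the win rate right after a win with the win rate right after a loss; if games were independent, the two should be close. A sketch, assuming results come as a 'W'/'L' string. The function name and the sample sequence are hypothetical, not real game data.

```python
def win_rate_after(results, previous):
    """Fraction of games won immediately after a game whose result was `previous`."""
    followers = [cur for prev, cur in zip(results, results[1:]) if prev == previous]
    if not followers:
        return None  # no game followed that result
    return followers.count("W") / len(followers)


# Made-up sequence for illustration only.
sample = "WWLWLLWWWLWLWWLL"
print("win rate after a win: ", win_rate_after(sample, "W"))
print("win rate after a loss:", win_rate_after(sample, "L"))
```

With only 162 games per season the two rates will differ somewhat by chance alone, so a gap by itself is not evidence of real streakiness.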


Baseball players believe those things, but no one can prove they actually exist.


If I remember correctly, the gambler's fallacy is for fully independent events. I think the assumption here is that the team's performance in a game will impact their next result. After a win, they get on a roll. After a loss, the coach gives them a stern talking to and they come out playing a bit harder the next night. There's some correlation between games, but I'd say it's mostly independent...

I'm not a statistician, so I can't really speak to how 'valid' the analysis is, but I'd be curious to see how it does in different tests--unless I'm misinterpreting, the biggest check so far was done on 2012, which is exactly what was used to train it. It would be interesting to see what happens if you train with half of 2012 and test the second half. Or check 2011 (do you predict an end-of-the-year collapse, allowing my Cardinals to sneak in again? ;) ).


Playing hard has very little to do with winning in baseball - oh, I didn't try hard so I didn't hit the ball? Or didn't take a walk? Or didn't watch the ball and it hit me in the head?

I pitched 5 mph less due to mental effort? Well, now you don't get to throw again for 5 days. Or you walked 15 batters in a row to try and get taken out of the game. It's not going to make your night easier, you sit there till the game is finally over. No clock to run out. Just outs.


I understand where you're coming from. My friend's statistics teacher said that if a flipped coin results in 5 heads, it doesn't owe you a tails – this seems to support the gambler's fallacy. Though looking at something like running: say I run the first mile in a race in 7 minutes. Odds are the next mile will be a bit slower given I'm bad at pacing myself and I'm now tired. This is an extreme example, but I've been trying to look at baseball with this approach, that prior games influence the outcome of the next game. And I know that I am simplifying my prediction by just accounting for streaks. I would love to lengthen the script to look at factors like how much the team has won/lost by, who's on the lineup, where the game is (home versus away), etc.

But at its least this was good coding practice.


This is like saying I've accounted for the phase of the moon when predicting my productivity, but in the future I'd like to account for how much sleep I got and whether I've been eating right and exercising.


Sort of but not exactly. In this case a prior outcome is generally caused by the same factors that would influence the next outcome. There are also factors in baseball that make putting together a long streak harder the deeper you get into it.

That said, there are a gazillion other data points that are more granular and would provide more predictive value than simply the binary result of prior games (especially if those games took place 12+ months ago).


If you want a broader dataset, check out the data from Retrosheet (http://www.retrosheet.org/). You can get box scores and play-by-play data from many years back to dive really deep into stats. You can use Chadwick command-line tools (http://chadwick.sourceforge.net/) to parse the data into SportsML or other formats.


See also: http://cran.r-project.org/web/packages/Lahman/index.html

Baseball stats pre-molded into a nicely workable form, available from your handy R interpreter.


Hall of Fame manager Earl Weaver said: "Momentum? Momentum is the next day's starting pitcher."

Baseball fan in me says that is correct but statistician in me would like to see more models like yours to quantify it.


What, this is just streak-based? Do streaks even exist in baseball?


I'd doubt it. Streaks in general are widely believed to exist even when they don't; cf. the hot-hand fallacy: http://en.wikipedia.org/wiki/Hot-hand_fallacy

I would usually presume against streaks being a useful way of predicting the future unless significant evidence were provided.


Yeah, for now... They're most definitely a thing, though I have a document's worth of baseball elements I hope to incorporate.


Are they?

    jerf@jerfhom:~$ python
    Python 2.7.3 (default, Sep 26 2012, 21:51:14) 
    [GCC 4.7.2] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import random
    >>> 94.0/(94+68)
    0.5802469135802469
    >>> winp = 94.0/(94+68)
    >>> games = []
    >>> for x in range(50):
    ...     games.append('w' if random.random() < winp else 'L')
    ... 
    >>> ''.join(games)
    'wLwLwwwwwLwLwLwwwLLLwLwwLwLLLwwLwwwwwLwwwLwwLLwLLw'
In my full simulation of 162 games, the longest streak was a 7 game losing streak, despite the higher win percentage. Of course you'll get different results each run; my next run produced a 9 game winning streak, which some quick Googling suggests is in line with what happened in 2010.

Combine this with the fact that real play is not drawn uniformly (you may play a much worse team against which you have a much better win percentage for several games in a row) and I don't see much need for some sort of meaningful, statistically-predictive "streak" to explain game results.
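
The same experiment can be repeated many times to see how long streaks tend to get under pure chance. A sketch in the spirit of the session above, assuming independent games at the same 94-68 win rate; the helper names, seed, and trial count are arbitrary choices.

```python
import random
from collections import Counter


def longest_streak(results):
    """Length of the longest run of identical consecutive results."""
    best = run = 0
    prev = None
    for r in results:
        run = run + 1 if r == prev else 1
        best = max(best, run)
        prev = r
    return best


def simulate_season(win_p, games=162, rng=random):
    """One season of independent games with a fixed win probability."""
    return ["W" if rng.random() < win_p else "L" for _ in range(games)]


rng = random.Random(2013)  # seeded only so reruns give the same table
win_p = 94 / (94 + 68)
counts = Counter(longest_streak(simulate_season(win_p, rng=rng)) for _ in range(2_000))
# Print how often each longest-streak length occurred across the simulated seasons.
for length in sorted(counts):
    print(length, counts[length])
```

The table shows how common each longest-streak length is when nothing but a fixed win probability drives the results, which is the bar a "streak effect" would have to clear.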


The 2012 data I used as the basis of my program actually had the same thing you describe – the longest streak was an 8 game losing streak despite having more wins than losses overall.

And I understand exactly where you're coming from. This is very preliminary, and if anything it was good coding practice for me. Though I very much intend to incorporate more significant factors like the lineup, the opposing team, and their history.


First improvement: do this for every team ever. Then combine for all teams, first in an individual season, then try basing the win% iteratively based on more history.

Based on these models, you should have some good examples of selection bias, and see how the model changes based on what you are not testing for but what is implicit in the data. The data is merely a set of samples generated by one iteration of the (to some degree unknowable) true talent functions for each team (player, lineup decision, injury, close call by an ump, etc.).

If you're interested in going down the rabbit hole, there's tons of people who can show the way (and they're nice! At least tangotiger is way nicer than he should be in listening to people who have put no effort in understanding what is good and what is beginner's blind bliss)

Hot and cold streaks are just random variance. So is whether balls are hit within reach of fielders or safely out of reach, given a certain contact quality: ground ball, fly ball, infield pop-up, and line drive all have vastly different tendencies to fall for a hit (line drives ~.600-.700 BABIP if I recall, FB ~ low .200ish, GB ~ .300, pop-up 0ish?). The point is that these are all known, to some degree, given the historical data.

If anyone wants to explore this stuff further, let me know and I can point you to the right spots for a specific interest.


google tangotiger and then read his book. baseball prediction has been done to death by the sabermetrics community but this guy is one of the absolute best.


tango is awesome, but saying he is the best is misleading. He is just one point of view, albeit a fairly wise and proven one. There are many other people with other opinions, using other techniques, and they have gotten good results. He is a bit of a curmudgeon when it comes to modern statistical techniques and machine learning.

edit: sorry, misread "one of the" as "the"...


Marcel is king!


Not baseball, but for college football: http://winningformula.espn.com/



Replying to myself since I hit enter on mobile: this is a great site with data, analysis, and win expectancy charts in real time during games, based on score and run states.


I could make a better program:

- check bovada.lv for the line

- guess that as the winner

- fin.

(crowdsource ftw)

Streaks from last year's team? Seriously?



Their data is excellent – thanks for sharing. Interesting when they take into account so much more than streaks. I would love to dive into the relation of more of that data in the future.



