For anyone interested in Markov chains (aimed at language) in Python, and their relationship to the larger world of language modeling (including "modern" RNNLMs and so on), this post by Yoav Goldberg is an excellent introduction, with clear and simple code [1]. Extending the Markov chain's history even a few steps can give really impressive generated results.
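To make the idea concrete, here is a minimal sketch of an order-4 character-level model in the same spirit (this is not Goldberg's actual code, and "shakespeare.txt" is just a placeholder corpus):

    import random
    from collections import defaultdict, Counter

    def train(text, order=4):
        # Map each length-`order` history to counts of the next character.
        model = defaultdict(Counter)
        for i in range(len(text) - order):
            model[text[i:i + order]][text[i + order]] += 1
        return model

    def generate(model, order=4, length=300):
        out = random.choice(list(model))      # seed with a known history
        for _ in range(length):
            counts = model[out[-order:]]
            if not counts:                    # unseen history: dead end
                break
            chars, weights = zip(*counts.items())
            out += random.choices(chars, weights=weights)[0]
        return out

    model = train(open("shakespeare.txt").read())
    print(generate(model))

Raising `order` is the "extending a few steps" part: each extra step of history makes the output noticeably more coherent.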
Clone/download [2] if you want to run the code yourself.
It's in Italian, since it was just a hobby project and I didn't spend much time adapting it for a wider audience, but it may be interesting to someone.
I have a Twitter bot that makes up book passages combining books from the Gutenberg project: https://twitter.com/markovian_lit
Code is also on GitHub if it interests anyone - not the prettiest, but it was fun to build :-)
Along those lines, maybe also something like Fleksy or Swype, which are reportedly usable by blind people -- Fleksy in particular is very smart about determining the size and orientation of the invisible "keyboard" the user is typing on.
I don't see how a Markov chain has anything to do with PageRank. I understand that PageRank uses graphs, and a Markov chain is some kind of graph, but I don't see anything else.
PageRank assumes your browsing activity is a Markov chain on the graph defined by the hyperlinks of the web (assuming probability 1/n for each of the n outgoing links on a page). The rank of a page is the limit of (#visits to the page) / (#total visits), which is the eigenvector of the transition matrix for eigenvalue 1.
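As a toy illustration (a sketch with a made-up 3-page web; real PageRank also adds a damping factor for random jumps and dead-end pages):

    import numpy as np

    # Row-stochastic transition matrix: entry [i][j] is the probability of
    # following a link from page i to page j (1/n per outgoing link).
    P = np.array([[0.0, 0.5, 0.5],   # page 0 links to pages 1 and 2
                  [1.0, 0.0, 0.0],   # page 1 links to page 0
                  [0.5, 0.5, 0.0]])  # page 2 links to pages 0 and 1

    # Power iteration: the stationary distribution pi satisfies pi = pi P,
    # i.e. pi is a left eigenvector of P with eigenvalue 1.
    pi = np.ones(3) / 3
    for _ in range(100):
        pi = pi @ P
    print(pi)  # long-run fraction of visits to each page, i.e. the ranks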
Am I right in taking from this that Markov chains don't necessarily adjust to history?
So in the example, if a "sunny day" state has a 0.9 chance of staying in that state and a 0.1 chance of transitioning to a rainy day, and I had already had 5 sunny days, the 6th day would still have a 0.1 chance of being a rainy day?
Yes that's right. The reason they're "Markov" chains is that they make the "Markov assumption" which is usually stated as "the future is independent of the past given the present". This just means that the probability of the next timestep depends only on what happened in the current one; it has no memory as such.
As detaro says:
"A Markov chain with memory for the last m steps is sometimes called a "Markov chain of order m""
So if you wanted it to depend on the last 5 timesteps, you'd create a 5th order Markov chain, and so on.
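You can check this empirically with a quick simulation (a sketch; the example only gives the sunny-day probabilities, so the rainy-day row here is an assumption):

    import random

    P = {"sunny": {"sunny": 0.9, "rainy": 0.1},
         "rainy": {"sunny": 0.5, "rainy": 0.5}}  # rainy row is assumed

    def step(state):
        return "sunny" if random.random() < P[state]["sunny"] else "rainy"

    # Generate one long run of weather from the first-order chain.
    days = ["sunny"]
    for _ in range(1_000_000):
        days.append(step(days[-1]))

    # Condition on "the last 5 days were all sunny" and look at day 6.
    streaks = rainy = 0
    for i in range(5, len(days)):
        if all(d == "sunny" for d in days[i - 5:i]):
            streaks += 1
            rainy += days[i] == "rainy"
    print(rainy / streaks)  # ~0.1, the same as after a single sunny day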
Probability doesn't adjust to history either: if you flip coins and already have 5 heads in a row, your chance of heads on the next flip is still exactly one half.
But weather is not driven by probability, so history might matter.
I feel like I hear less and less about Markov chains.
Are they still used in research and practice?
Markov chains will always be important, because they are a fairly fundamental concept (going back to the early 20th century).
The idea is certainly still widely used: two major classes of applications are Hidden Markov Models (where you attempt to infer the hidden state of a system from observations that depend on that state) [1], and Markov chain Monte Carlo (MCMC) [2] (where you sample from a probability distribution by constructing a Markov chain whose equilibrium distribution is that distribution).
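For a flavor of the MCMC case, here is a minimal Metropolis sampler (a sketch using a standard normal as a toy target):

    import math, random

    def f(x):
        # Unnormalized target density: a standard normal.
        return math.exp(-x * x / 2)

    # Build a Markov chain whose equilibrium distribution is f: propose a
    # symmetric random step, accept with probability min(1, f(new)/f(old)).
    x, samples = 0.0, []
    for _ in range(100_000):
        proposal = x + random.uniform(-1, 1)
        if random.random() < f(proposal) / f(x):
            x = proposal
        samples.append(x)

    mean = sum(samples) / len(samples)
    var = sum((s - mean) ** 2 for s in samples) / len(samples)
    print(mean, var)  # approach 0 and 1, the target's mean and variance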
Is PageRank still the most prominent search algorithm?
PageRank is really a ranking algorithm, rather than a search algorithm. It's been applied to many things other than web search, and has many variants (e.g. see the review article "PageRank Beyond the Web" [3]). That said, Google rankings depend on many things other than PageRank.
I always felt these probabilistic models were a conflation of correlation and causation
> I always felt these probabilistic models were a conflation of correlation and causation
> What do you mean?
The example from the post of weather events is a simplified version of what I mean.
The collected data expresses that there is a 50/50 probability of rain versus shine.
The first string of possible outputs is scattered and more random than the real data, so weights are introduced to add nuance to the output, and the second string is more closely representative of the data.
But this introduces a relative pseudo-causation, what the author refers to as stickiness.
It's reminiscent of the Monty Hall problem:
altering the state of the model causes its probability to change relative to a specific potential, but only within the limited scope of the probable space.
Choosing one door causes the probabilities associated with the state to change, but the system itself is unchanged by the choice.
And what you see with the weather is a simulation of the collected data instead of the underlying system, which I feel conflates correlation with causation.
RNNs (recurrent neural networks) can be seen as generalizations of Markov chains - or, as I prefer to think of it, Markov chains are very limited RNNs. There is an enormous amount of research happening around these sequential models (and differentiable DAGs (directed acyclic graphs), a.k.a. neural networks in general) these days.
Markov chains have never (in the work I have seen at least) tried to imply causation - purely correlation/probability. There is other recent research on causality (see the work of Pearl), but just because one thing very likely follows another doesn't mean the preceding event "caused" the next one, as both events could be side effects from the same (unobserved) root cause.
But the difference between neural nets and Markov chains is that the probabilistic element of an NN is the optimal traversal, and it could be replaced with a perfect algorithm if one were found to be computationally feasible,
whereas Markov chains start from a place of probability and require it.
The construction stage of the Markov model agrees with your statement:
> just because one thing very likely follows another doesn't mean the preceding event "caused" the next one
But when in use, the current state is the cause of the following state, as informed by your model.
An NN can simulate a system, whereas an MC can only simulate data from a system.
RNNs are optimizing a probability, at least with standard training to maximize likelihood. As you say, with a different training criterion sure it could do something different - but so could a Markov chain if it was built to optimize a different objective (though it wouldn't be a Markov chain anymore I don't think).
Look at reinforcement learning - loads of stuff making the Markov assumption and doing quite well on complex tasks, though once again this is quite different from a Markov chain.
In theory (at least my opinion on theory), if we had an infinite-size "state" it would be possible to just encode all past information in every state and use Markov rules. It is just not practical, and it is much easier to extend the lookup to multiple timesteps in the past, especially with better understanding of how to optimize things like RNNs.
Choosing the maximum p(X_{t+1} | X_t) does not infer causality in any way in my mind - it only says that, given the data, when X_t happens, some event at X_{t+1} is likely to happen. Classic correlation. I don't see where you are seeing causality in that part. I also don't see a distinction between simulating a system and simulating data from a system - to me it is just that NNs can model much more complex systems than typical Markov chains.
In general I agree with you, but I find these two concepts are much, much more similar than they are different.
> but so could a Markov chain if it was built to optimize a different objective (though it wouldn't be a Markov chain anymore I don't think).
I wanted to think about this, and I thought maybe you could have a purely Markov model if it were a self-constructing, self-similar model:
a Markov fractal.
At each event the chain reconstructs itself from its new perspective in the fractal.
But even done elegantly, I think it would be more undue overhead to constantly reconstruct the model than to feed a self-similar model constantly reconstructed inputs, which I see as having a clearer affinity to neural networks, or to something wholly different from either.
Just to add a comment on PageRank: from what I've heard, it is actually a very small contributor to the overall Google algorithm. That is, at least in the way it was originally conceived by Larry Page, PageRank is now only a minuscule tuner amongst hundreds of other tuners in the overall search algorithm, but because it was the first and core part of the Google algorithm, people still continue to assume it is the predominant one.
It doesn't necessarily matter whether they're still in use. All of these tools are worth learning, so that if a problem ever comes up that can be solved with a Markov chain, you'll recognize it right away.
Same for all the other tools, though. And there are a lot of them.
This isn't quite related to your question, but I think Markov chains made their debut in the original information theory paper by Shannon
What? That paper even says: "Stochastic processes of the type described above are known mathematically as discrete Markoff processes and have been extensively studied in the literature"; the corresponding footnote says "For a detailed treatment see M. Frechet ... 1938".
They're widely used for error correcting codes and MCMC and are generally important for modeling any kind of discrete sequential process. I guess the theory is well understood though, so deep learning etc. gets more attention.
I think I understand the concept. Extrapolating the example of rainy and sunny days, how would you model this over a year, where there is seasonal variation? In winter, for instance, rainy days would be more likely to continue than sunny days. And what are the recommended methods for creating models from existing data?
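Not an answer from the thread, but one standard method is maximum-likelihood estimation: count the transitions in the data and normalize. Seasonal variation can be handled by fitting a separate transition table per season, i.e. a time-inhomogeneous chain. A minimal sketch with made-up data:

    from collections import defaultdict, Counter

    def fit(observations):
        # observations: consecutive (season, weather) pairs, one per day.
        counts = defaultdict(Counter)
        for (season, today), (_, tomorrow) in zip(observations, observations[1:]):
            counts[(season, today)][tomorrow] += 1
        # Normalize the counts into conditional probabilities.
        return {key: {w: n / sum(c.values()) for w, n in c.items()}
                for key, c in counts.items()}

    data = [("winter", "rainy"), ("winter", "rainy"), ("winter", "rainy"),
            ("winter", "sunny"), ("spring", "sunny"), ("spring", "sunny"),
            ("spring", "rainy"), ("spring", "sunny")]
    print(fit(data))  # e.g. P(tomorrow's weather | winter, rainy)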
[1] http://nbviewer.jupyter.org/urls/gist.githubusercontent.com/...
[2] https://gist.github.com/yoavg/d76121dfde2618422139