Is "Chwałek" the name of a real Polish city or village? It sounds about right, but as it turns out, it is fake: it was generated by a short Python script using a simple Markov chain built from a database of all Polish city and village names.
Can you guess whether a given place name is real or fake? You can check by playing this game.
- Run the `main.py` script and have fun!
- The `schemes.py` file contains some "strategies" to enhance the faithfulness of generated words, and it is currently tuned for the Polish language. It is possible to tune it for English, but that will be a lot harder, since Polish spelling is much more closely bound to pronunciation than English spelling.
- You can input your own dataset to play with: write it in the `data.txt` file (no commas, spaces, or tabs, just one name per line) and run the `train.py` script. By default, `data.txt` contains all Polish city and village names, and the model is pre-trained and ready to use.
The names given in the training data are split into phones using the rules found in `schemes.py`. Then a Markov chain is constructed with a pair of phones at each vertex, such that the number on the edge from (a, b) to (b, c) is the probability that phone c follows phones a, b. This Markov chain is stored (rather effortlessly) in the file `network.py`, with training being run by the `train.py` script.
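As a rough illustration of the construction described above, the sketch below builds such a chain from a short list of names. It is not the code from `train.py`: it assumes that every letter is its own phone (the real splitting rules live in `schemes.py` and are not reproduced here) and that a space marks the start and end of a word. The names `split_into_phones` and `build_chain` are hypothetical.

```python
from collections import Counter, defaultdict

END = " "  # assumed marker for word boundaries


def split_into_phones(name):
    """Placeholder splitter: treats each letter as one phone (an assumption)."""
    return list(name.lower())


def build_chain(names):
    """Map each pair of phones (a, b) to a dict {c: P(c | a, b)}."""
    counts = defaultdict(Counter)
    for name in names:
        # Pad with two boundary markers at the start and one at the end.
        phones = [END, END] + split_into_phones(name) + [END]
        for a, b, c in zip(phones, phones[1:], phones[2:]):
            counts[(a, b)][c] += 1
    chain = {}
    for pair, followers in counts.items():
        total = sum(followers.values())
        # Normalize the counts so each vertex's outgoing edges sum to 1.
        chain[pair] = {c: n / total for c, n in followers.items()}
    return chain


chain = build_chain(["kowal", "kowno", "konin"])
print(chain[("k", "o")])  # "w" follows "ko" twice, "n" once
```

Keeping raw counts in a `Counter` and normalizing at the end means each name is processed in a single pass; the resulting dictionary is one possible in-memory form of the edge probabilities.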
Generating a fake name boils down to walking through this Markov chain with a little tweak: to prevent absurdly short or long names, the probability of the next phone being a space is zeroed until the number of generated phones reaches four, and it increases progressively after the seventh generated phone.
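The walk with its length tweak can be sketched as follows. This is a minimal illustration, not the script's actual code: the text does not say how fast the end probability grows, so the linear `boost` factor is an assumption, and `adjust`, `generate`, and the tiny hand-written `toy_chain` are hypothetical.

```python
import random

END = " "  # assumed word-boundary marker


def adjust(followers, length, min_len=4, grow_after=7, boost=0.5):
    """Reweight next-phone weights according to the current name length."""
    weights = dict(followers)
    if END in weights:
        if length < min_len:
            weights[END] = 0.0  # too short: forbid ending the name
        elif length >= grow_after:
            # Assumed growth rule: linear in the number of extra phones.
            weights[END] *= 1 + boost * (length - grow_after + 1)
    return weights


def generate(chain, rng=random, max_len=30):
    """Walk the chain from the start state until the end marker is drawn."""
    a, b, phones = END, END, []
    while len(phones) < max_len:
        weights = adjust(chain.get((a, b), {}), len(phones))
        options = [c for c, w in weights.items() if w > 0]
        if not options:
            break  # dead end: no allowed continuation
        # random.choices accepts relative weights, so no renormalizing.
        c = rng.choices(options, [weights[o] for o in options])[0]
        if c == END:
            break
        phones.append(c)
        a, b = b, c
    return "".join(phones)


# Toy chain over two phones, written by hand for the example.
toy_chain = {
    (" ", " "): {"a": 1.0},
    (" ", "a"): {"b": 1.0},
    ("a", "b"): {"a": 0.5, " ": 0.5},
    ("b", "a"): {"b": 0.5, " ": 0.5},
}
print(generate(toy_chain, random.Random(0)))
```

Passing a seeded `random.Random` instance makes a run reproducible, which is handy when tuning the thresholds.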
- Create a roughly working `schemes.py` file for English.