Hacking a Paris corpus from Inside Airbnb

Inside Airbnb is an excellent resource that publishes Airbnb listing data for a number of cities, including Paris.

I hacked together a script to extract a corpus from the descriptions of the Paris listings, and then applied this code to it.
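As a rough illustration of the extraction step, here is a minimal sketch (not the actual script). It assumes the listings file is a CSV with a `description` column that may contain HTML fragments; the names `build_corpus` and `clean_description` are made up for this example.

```python
import csv
import html
import re

# Matches simple HTML tags such as <br /> or <b>.
TAG_RE = re.compile(r"<[^>]+>")


def clean_description(text):
    """Decode HTML entities, strip tags and collapse whitespace."""
    text = html.unescape(text)
    text = TAG_RE.sub(" ", text)
    return " ".join(text.split())


def build_corpus(csv_file, column="description"):
    """Return one cleaned document per listing with a non-empty description.

    `csv_file` is any open text file (or file-like object) containing CSV
    data with a header row. The column name is an assumption.
    """
    reader = csv.DictReader(csv_file)
    corpus = []
    for row in reader:
        text = clean_description(row.get(column) or "")
        if text:
            corpus.append(text)
    return corpus


# Example usage, assuming a local file called listings.csv:
# with open("listings.csv", newline="", encoding="utf-8") as f:
#     corpus = build_corpus(f)
```

Keeping one description per document means the resulting list can be handed straight to a vectorizer.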

I’ve put the code and examples up on GitHub.

One problem with this example is that the scikit-learn library I was using currently ships no French stop words. It’s quite difficult to do text analytics across multiple languages 🙂
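One possible workaround is to pass your own list to the vectorizer's `stop_words` parameter. The sketch below is only an illustration: the stop word list is hand-picked and far from complete, and the two sample documents are invented rather than taken from the Paris data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hand-picked, deliberately incomplete list of French stop words
# (an assumption for illustration, not an official list).
FRENCH_STOP_WORDS = [
    "le", "la", "les", "un", "une", "des", "du", "de", "et",
    "au", "aux", "en", "dans", "pour", "avec", "est", "ce", "sur",
]

# Invented sample documents standing in for listing descriptions.
docs = [
    "Le studio est dans le Marais",
    "Une chambre avec vue sur la Seine",
]

# CountVectorizer lowercases text by default, so the list is lowercase too.
vectorizer = CountVectorizer(stop_words=FRENCH_STOP_WORDS)
counts = vectorizer.fit_transform(docs)

# Terms that survived stop word removal, in alphabetical order.
vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
```

A fuller list from another source could be swapped in the same way, as long as it is a plain Python list of lowercase strings.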

I hope this forms a useful snippet.

It is getting increasingly easy to do topic modelling and NLP like this in Python, which is excellent 🙂


