Hacking a Paris corpus from Inside Airbnb

Inside Airbnb is an excellent resource that includes a number of data sets.

I hacked together a script to extract a corpus from the descriptions of the Paris listings, and then applied this code to it.
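For readers who want a feel for the extraction step, here is a minimal sketch of it, not the actual script. It assumes the listings come as a CSV file with a text column named description; the column name, the build_corpus helper and the sample rows are all made up for illustration.

```python
import csv
import io


def build_corpus(csv_file, column="description"):
    """Return the non-empty entries of one CSV column as a list of documents."""
    reader = csv.DictReader(csv_file)
    corpus = []
    for row in reader:
        # Missing or empty cells are skipped so they do not become empty documents.
        text = (row.get(column) or "").strip()
        if text:
            corpus.append(text)
    return corpus


# Tiny in-memory stand-in for a listings file (made-up rows, not real data).
sample = io.StringIO(
    "id,description\n"
    "1,Charmant studio près du canal\n"
    "2,\n"
    "3,Bright flat near Montmartre\n"
)
corpus = build_corpus(sample)
```

With a real download you would pass an open file handle instead of the in-memory sample, after checking which column actually holds the description text.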

I’ve put the code and some examples up on GitHub.

One problem with this example is that the scikit-learn library I was using currently has no French stop words. It’s quite difficult to do text analytics across multiple languages 🙂
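One possible workaround is to supply your own stop word list. The sketch below is only an illustration using the standard library: the FRENCH_STOP_WORDS set is a small hand-picked sample that I am assuming for the example, nowhere near a complete list, and the tokenize helper is hypothetical.

```python
import re

# Small hand-picked sample of French stop words (illustrative, far from complete).
FRENCH_STOP_WORDS = {
    "le", "la", "les", "un", "une", "des", "de", "du",
    "et", "à", "au", "en", "est", "dans", "pour",
}


def tokenize(text, stop_words=FRENCH_STOP_WORDS):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in stop_words]


tokens = tokenize("Le studio est dans le centre de Paris et proche du métro")
# tokens is ['studio', 'centre', 'paris', 'proche', 'métro']
```

The same words, as a plain list, can also be passed to the stop_words parameter of scikit-learn's CountVectorizer, which accepts a custom list as well as the built-in English one.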

I hope this forms a useful snippet.

It is getting easier and easier to do topic modelling and NLP like this in Python, which is excellent 🙂

