TS Corpus ML Tools

About TS Corpus ML Tools

TS Corpus is a free and independent project that aims to build Turkish corpora, NLP tools and linguistic datasets.

Since 2011, we have released various corpora for Turkish.

Some of these corpora required specific tools, so we developed those tools ourselves to meet our needs.

For the TS TimeLine Corpus, we had to build two machine learning models:

a language guesser, to classify the news we harvested from the sources as Turkish or English.

a machine learning model to classify news by category.
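As an illustration of what a language guesser does, here is a minimal sketch that separates Turkish from English text using character trigram counts and a naive Bayes style score. It is not the model used for the TS TimeLine Corpus, whose internals are not described here. The class name `NgramLanguageGuesser`, the labels `"tr"` and `"en"`, and the six toy training sentences are all assumptions made for the example.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramLanguageGuesser:
    """Toy character n-gram classifier with add-one smoothing (hypothetical helper)."""

    def __init__(self, n=3):
        self.n = n
        self.counts = {}    # label -> Counter of n-grams seen for that label
        self.totals = {}    # label -> total number of n-grams seen for that label
        self.vocab = set()  # all n-grams seen across every label

    def fit(self, samples):
        for text, label in samples:
            grams = char_ngrams(text, self.n)
            self.counts.setdefault(label, Counter()).update(grams)
            self.vocab.update(grams)
        self.totals = {label: sum(c.values()) for label, c in self.counts.items()}
        return self

    def predict(self, text):
        vocab_size = len(self.vocab)
        scores = {}
        for label, counter in self.counts.items():
            total = self.totals[label]
            # Sum of smoothed log-probabilities of each n-gram under this label.
            scores[label] = sum(
                math.log((counter[g] + 1) / (total + vocab_size))
                for g in char_ngrams(text, self.n)
            )
        return max(scores, key=scores.get)


# Toy training data, invented for this example only.
samples = [
    ("Hükümet yeni ekonomi paketini bugün açıkladı", "tr"),
    ("Takım dün akşam oynanan maçı kazandı", "tr"),
    ("Hava yarın yağmurlu ve soğuk olacak", "tr"),
    ("The government announced a new economic package today", "en"),
    ("The team won the match played last night", "en"),
    ("The weather will be rainy and cold tomorrow", "en"),
]

guesser = NgramLanguageGuesser(n=3).fit(samples)
print(guesser.predict("Bugün hava çok güzel"))
print(guesser.predict("The weather is nice today"))
```

A real language guesser would be trained on far more text per language; the sketch only shows the general idea of scoring a text against per-language character statistics.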

Text Level Model classifies the level of a given text as "easy" or "hard".

Text Level Model Two classifies the level of a given text as "easy", "medium" or "hard".

Author Gender Prediction Model tries to predict the gender of the author of a given newspaper column. The model is at an early stage and is not reliable yet.

About Social vs Standard Language

word2Vec

Skipgram

FastText

The first model (SkipGram_Newspaper) is trained on data taken from the TS TimeLine Corpus. A selection of news covering the same time period was extracted from the corpus, matching the data size used for the social media model.
The data consists of 65k news items and columns, about 24 million tokens.
The second model (SkipGram_Social_Media) is trained on data taken from the Kemik Natural Language Processing Group, Yıldız Technical University, which can be accessed via this page.
The data consists of 20 million tweets, more than 24 million tokens.
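
The model names suggest the skip-gram variant of Word2Vec, which learns word vectors by predicting the words surrounding each word. The following minimal sketch shows how (center, context) training pairs are derived from a tokenized sentence. It is not the training pipeline used for these models; the function name `skipgram_pairs`, the window sizes and the example sentence are assumptions made for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical example sentence, split on whitespace.
tokens = "bugün hava çok güzel".split()
pairs = skipgram_pairs(tokens, window=1)
for center, context in pairs:
    print(center, "->", context)
```

Because both corpora hold roughly 24 million tokens, they yield a comparable number of such training pairs, which keeps the two models on a similar footing when their vectors are compared.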

These tools are free to use for academic studies and research, but commercial use is restricted.

Please note that any text uploaded by users is saved for later studies and might be used to improve the accuracy of the served models.

For feedback and questions, please use this form.