# tokenizers 0.2.3

- Bug fixes and performance enhancements.

# tokenizers 0.2.1

- Add citation information to JOSS paper.

# tokenizers 0.2.0

## Features

- Add the `tokenize_ptb()` function for Penn Treebank tokenizations (@jrnold) (#12).
- Add a function `chunk_text()` to split long documents into pieces (#30).
- New functions to count words, characters, and sentences without tokenization (#36).
- New function `tokenize_tweets()` preserves usernames, hashtags, and URLs (@kbenoit) (#44).
- The `stopwords()` function has been removed in favor of using the **stopwords** package (#46).
- The package now complies with the basic recommendations of the **Text Interchange Format**. All tokenization functions are now methods, which enables them to take corpus inputs as TIF-compliant named character vectors, named lists, or data frames. All outputs are still named lists of tokens, but these can be easily coerced to data frames of tokens using the `tif` package. (See the usage sketch at the end of this file.) (#49)
- Add a new vignette, "The Text Interchange Formats and the tokenizers Package" (#49).

## Bug fixes and performance improvements

- `tokenize_skip_ngrams()` has been improved to generate unigrams and bigrams, in keeping with the skip n-gram definition (#24).
- The C++11 code used for n-gram generation has been replaced with C++98, widening the range of compilers that `tokenizers` supports (@ironholds) (#26).
- `tokenize_skip_ngrams()` now supports stopwords (#31).
- If a tokenizer fails to generate tokens for a particular entry, it consistently returns `NA` (#33).
- Keyboard interrupt checks have been added to Rcpp-backed functions so that users can terminate them before completion (#37).
- `tokenize_words()` gains arguments to preserve or strip punctuation and numbers (#48).
- `tokenize_skip_ngrams()` and `tokenize_ngrams()` now return properly marked UTF-8 strings on Windows (@patperry) (#58).
- `tokenize_tweets()` now removes stopwords prior to stripping punctuation, making its behavior more consistent with `tokenize_words()` (#76).

# tokenizers 0.1.4

- Add the `tokenize_character_shingles()` tokenizer.
- Improvements to documentation.

# tokenizers 0.1.3

- Add vignette.
- Improvements to n-gram tokenizers.

# tokenizers 0.1.2

- Add stopwords for several languages.
- New stopword options in `tokenize_words()` and `tokenize_word_stems()`.

# tokenizers 0.1.1

- Fix failing test in non-UTF-8 locales.

# tokenizers 0.1.0

- Initial release with tokenizers for characters, words, word stems, sentences, paragraphs, n-grams, skip n-grams, lines, and regular expressions.
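The Text Interchange Format change described under 0.2.0 can be illustrated with a minimal sketch. It assumes the **tokenizers** and **stopwords** packages are installed; the corpus text, document names, and option values below are invented for illustration and are not taken from the release notes.

```r
# Minimal sketch of TIF-style input: a named character vector in, a named
# list of token vectors out. The documents here are made-up examples.
library(tokenizers)

corpus <- c(
  doc1 = "The quick brown fox jumps over the lazy dog.",
  doc2 = "Tokenizers turn documents into words, n-grams, or sentences."
)

# Word tokens, stripping punctuation (strip_punct was added in 0.2.0) and
# removing English stopwords supplied by the stopwords package.
tokens <- tokenize_words(
  corpus,
  stopwords = stopwords::stopwords("en"),
  strip_punct = TRUE
)

str(tokens)  # one character vector of tokens per named document
```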