Text Datasets

Index

Documentation

MLDatasets.PTBLM - Type
PTBLM(; split=:train, dir=nothing)
PTBLM(split; [dir])

The PTBLM dataset consists of Penn Treebank sentences for language modeling, available from https://github.com/tomsercu/lstm. The unknown words are replaced with <unk> so that the total vocabulary size becomes 10000.
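
To make the effect of the `<unk>` replacement concrete, here is a minimal, self-contained sketch of capping a vocabulary at a fixed size. It does not reproduce how the PTBLM files were actually preprocessed. The helper name `cap_vocabulary` and the toy sentences are hypothetical and are not part of MLDatasets.

```julia
# Hypothetical helper (not part of MLDatasets): keep the `max_vocab - 1` most
# frequent words and map every other word to `unk`, so that the resulting
# vocabulary has at most `max_vocab` distinct entries.
function cap_vocabulary(sentences::Vector{Vector{String}}, max_vocab::Int; unk="<unk>")
    # Count how often each word occurs across all sentences.
    counts = Dict{String,Int}()
    for sentence in sentences, word in sentence
        counts[word] = get(counts, word, 0) + 1
    end

    # Rank by descending frequency; ties are broken alphabetically so the
    # result is deterministic.
    ranked = sort(collect(keys(counts)); by = w -> (-counts[w], w))
    kept = Set(ranked[1:min(max_vocab - 1, length(ranked))])

    # Replace every word outside the kept set with the unknown token.
    return [[w in kept ? w : unk for w in sentence] for sentence in sentences]
end

# Toy corpus: five distinct words, capped to a vocabulary of four.
toy = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
capped = cap_vocabulary(toy, 4)
vocab = unique(reduce(vcat, capped))
```

In this toy run the two rarest words ("dog" and "ran") are replaced, leaving the vocabulary `the`, `cat`, `sat`, and `<unk>`. The same idea at scale yields a fixed-size vocabulary such as the 10000 entries described above.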

MLDatasets.SMSSpamCollection - Type
SMSSpamCollection(; dir=nothing)

The SMS Spam Collection v.1 (hereafter the corpus) is a set of tagged SMS messages collected for SMS spam research. It contains a single set of 5,574 English SMS messages, each tagged as either ham (legitimate) or spam. The corpus contains 4,827 legitimate messages (86.6%) and 747 spam messages (13.4%).
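
The class proportions quoted above can be checked with a few lines of Julia. This sketch uses a stand-in label vector built from the stated counts rather than the downloaded dataset, so the variable `targets` here is an assumption for illustration only.

```julia
# Stand-in label vector built from the counts quoted in the text; the real
# labels would come from the dataset itself.
targets = vcat(fill("ham", 4827), fill("spam", 747))

n_total = length(targets)
n_ham = count(==("ham"), targets)
n_spam = count(==("spam"), targets)

# Percentages rounded to one decimal place, as in the description.
ham_pct = round(100 * n_ham / n_total; digits=1)
spam_pct = round(100 * n_spam / n_total; digits=1)

println("total = $n_total, ham = $ham_pct%, spam = $spam_pct%")
# prints: total = 5574, ham = 86.6%, spam = 13.4%
```

Because of this imbalance, a classifier that always predicts "ham" would already reach 86.6% accuracy on the corpus, so accuracy alone is a weak measure of spam-detection quality here.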

The corpus was collected by Tiago Agostinho de Almeida (http://www.dt.fee.unicamp.br/~tiago) and José María Gómez Hidalgo (http://www.esp.uem.es/jmgomez).

```julia-repl
julia> using MLDatasets: SMSSpamCollection

julia> targets = SMSSpamCollection.targets();

julia> summary(targets)
"5574-element Vector{Any}"

julia> targets[1]
"ham"

julia> summary(features)  # assumes `features` was loaded analogously to `targets`
"5574-element Vector{Any}"
```

MLDatasets.UD_English - Type
UD_English(; split=:train, dir=nothing)
UD_English(split; [dir])

A Gold Standard Universal Dependencies Corpus for English, built over the source material of the English Web Treebank LDC2012T13 (https://catalog.ldc.upenn.edu/LDC2012T13).

The corpus comprises 254,825 words and 16,621 sentences, taken from five genres of web media: weblogs, newsgroups, emails, reviews, and Yahoo! answers. See the LDC2012T13 documentation for more details on the sources of the sentences. The trees were automatically converted into Stanford Dependencies and then hand-corrected to Universal Dependencies. All the basic dependency annotations have been single-annotated; a limited portion of them have been double-annotated, and subsequent correction has been done to improve consistency. Other aspects of the treebank, such as Universal POS, features, and enhanced dependencies, have mainly been produced automatically, with very limited hand-correction.

Authors: Natalia Silveira, Timothy Dozat, Marie-Catherine de Marneffe, Samuel Bowman, Miriam Connor, John Bauer, and Christopher D. Manning

Website: https://github.com/UniversalDependencies/UD_English-EWT
