.. _20newsgroups_dataset:

The 20 newsgroups text dataset
------------------------------

The 20 newsgroups dataset comprises around 18000 newsgroup posts on
20 topics split in two subsets: one for training (or development)
and the other one for testing (or for performance evaluation). The split
between the train and test set is based upon messages posted before
and after a specific date.

This module contains two loaders. The first one,
:func:`sklearn.datasets.fetch_20newsgroups`,
returns a list of the raw texts that can be fed to text feature
extractors such as :class:`sklearn.feature_extraction.text.CountVectorizer`
with custom parameters so as to extract feature vectors.
The second one, :func:`sklearn.datasets.fetch_20newsgroups_vectorized`,
returns ready-to-use features, i.e., it is not necessary to use a feature
extractor.
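
For instance, the raw texts returned by the first loader can be turned into a
matrix of token counts (a minimal sketch, assuming the vectorizer's default
parameters rather than any particular custom ones)::

  >>> from sklearn.datasets import fetch_20newsgroups
  >>> from sklearn.feature_extraction.text import CountVectorizer
  >>> raw_train = fetch_20newsgroups(subset='train')
  >>> count_vect = CountVectorizer()  # default parameters; customize as needed
  >>> X_counts = count_vect.fit_transform(raw_train.data)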

**Data Set Characteristics:**

=================   ==========
Classes             20
Samples total       18846
Dimensionality      1
Features            text
=================   ==========

Usage
~~~~~

The :func:`sklearn.datasets.fetch_20newsgroups` function is a data
fetching / caching function that downloads the data archive from
the original `20 newsgroups website`_, extracts the archive contents
in the ``~/scikit_learn_data/20news_home`` folder and calls
:func:`sklearn.datasets.load_files` on either the training or
testing set folder, or both of them::

  >>> from sklearn.datasets import fetch_20newsgroups
  >>> newsgroups_train = fetch_20newsgroups(subset='train')

  >>> from pprint import pprint
  >>> pprint(list(newsgroups_train.target_names))
  ['alt.atheism',
   'comp.graphics',
   'comp.os.ms-windows.misc',
   'comp.sys.ibm.pc.hardware',
   'comp.sys.mac.hardware',
   'comp.windows.x',
   'misc.forsale',
   'rec.autos',
   'rec.motorcycles',
   'rec.sport.baseball',
   'rec.sport.hockey',
   'sci.crypt',
   'sci.electronics',
   'sci.med',
   'sci.space',
   'soc.religion.christian',
   'talk.politics.guns',
   'talk.politics.mideast',
   'talk.politics.misc',
   'talk.religion.misc']

The real data lies in the ``filenames`` and ``target`` attributes. The target
attribute is the integer index of the category::

  >>> newsgroups_train.filenames.shape
  (11314,)
  >>> newsgroups_train.target.shape
  (11314,)
  >>> newsgroups_train.target[:10]
  array([ 7,  4,  4,  1, 14, 16, 13,  3,  2,  4])
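
The category name behind each integer label can be looked up in
``target_names`` (a small sketch, not part of the original example; the first
training post has target ``7``, which maps to ``rec.autos`` in the list
above)::

  >>> newsgroups_train.target_names[newsgroups_train.target[0]]
  'rec.autos'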

It is possible to load only a sub-selection of the categories by passing the
list of the categories to load to the
:func:`sklearn.datasets.fetch_20newsgroups` function::

  >>> cats = ['alt.atheism', 'sci.space']
  >>> newsgroups_train = fetch_20newsgroups(subset='train', categories=cats)

  >>> list(newsgroups_train.target_names)
  ['alt.atheism', 'sci.space']
  >>> newsgroups_train.filenames.shape
  (1073,)
  >>> newsgroups_train.target.shape
  (1073,)
  >>> newsgroups_train.target[:10]
  array([0, 1, 1, 1, 0, 1, 1, 0, 0, 0])

Converting text to vectors
~~~~~~~~~~~~~~~~~~~~~~~~~~

In order to feed predictive or clustering models with the text data,
one first needs to turn the text into vectors of numerical values suitable
for statistical analysis. This can be achieved with the utilities of the
``sklearn.feature_extraction.text`` module, as demonstrated in the following
example that extracts `TF-IDF`_ vectors of unigram tokens
from a subset of 20news::

  >>> from sklearn.feature_extraction.text import TfidfVectorizer
  >>> categories = ['alt.atheism', 'talk.religion.misc',
  ...               'comp.graphics', 'sci.space']
  >>> newsgroups_train = fetch_20newsgroups(subset='train',
  ...                                       categories=categories)
  >>> vectorizer = TfidfVectorizer()
  >>> vectors = vectorizer.fit_transform(newsgroups_train.data)
  >>> vectors.shape
  (2034, 34118)

The extracted TF-IDF vectors are very sparse, with an average of 159 non-zero
components per sample in a more than 30000-dimensional space
(less than .5% non-zero features)::

  >>> vectors.nnz / float(vectors.shape[0])
  159.01327...
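
The same statistic can be turned into the fraction of non-zero cells quoted
above, by dividing the number of stored non-zeros by the total number of
matrix cells (a small sketch, not part of the original example)::

  >>> vectors.nnz / float(vectors.shape[0] * vectors.shape[1])
  0.00466...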

:func:`sklearn.datasets.fetch_20newsgroups_vectorized` is a function which
returns ready-to-use token count features instead of file names.
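
A minimal sketch of that second loader, assuming its default arguments::

  >>> from sklearn.datasets import fetch_20newsgroups_vectorized
  >>> newsgroups_vec = fetch_20newsgroups_vectorized(subset='train')
  >>> X_vec = newsgroups_vec.data    # already a sparse feature matrix
  >>> y_vec = newsgroups_vec.target  # integer category labels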

.. _`20 newsgroups website`: http://people.csail.mit.edu/jrennie/20Newsgroups/
.. _`TF-IDF`: https://en.wikipedia.org/wiki/Tf-idf


Filtering text for more realistic training
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It is easy for a classifier to overfit on particular things that appear in the
20 Newsgroups data, such as newsgroup headers. Many classifiers achieve very
high F-scores, but their results would not generalize to other documents that
aren't from this window of time.

For example, let's look at the results of a multinomial Naive Bayes classifier,
which is fast to train and achieves a decent F-score::

  >>> from sklearn.naive_bayes import MultinomialNB
  >>> from sklearn import metrics
  >>> newsgroups_test = fetch_20newsgroups(subset='test',
  ...                                      categories=categories)
  >>> vectors_test = vectorizer.transform(newsgroups_test.data)
  >>> clf = MultinomialNB(alpha=.01)
  >>> clf.fit(vectors, newsgroups_train.target)
  MultinomialNB(alpha=0.01, class_prior=None, fit_prior=True)

  >>> pred = clf.predict(vectors_test)
  >>> metrics.f1_score(newsgroups_test.target, pred, average='macro')
  0.88213...

(The example :ref:`sphx_glr_auto_examples_text_plot_document_classification_20newsgroups.py` shuffles
the training and test data, instead of segmenting by time, and in that case
multinomial Naive Bayes gets an even higher F-score. Are you suspicious
yet of what's going on inside this classifier?)

Let's take a look at what the most informative features are:

  >>> import numpy as np
  >>> def show_top10(classifier, vectorizer, categories):
  ...     feature_names = np.asarray(vectorizer.get_feature_names())
  ...     for i, category in enumerate(categories):
  ...         # the 10 largest coefficients for this class index the
  ...         # most informative features
  ...         top10 = np.argsort(classifier.coef_[i])[-10:]
  ...         print("%s: %s" % (category, " ".join(feature_names[top10])))
  ...
  >>> show_top10(clf, vectorizer, newsgroups_train.target_names)
  alt.atheism: edu it and in you that is of to the
  comp.graphics: edu in graphics it is for and of to the
  sci.space: edu it that is in and space to of the
  talk.religion.misc: not it you in is that and to of the

You can now see many things that these features have overfit to:

- Almost every group is distinguished by whether headers such as
  ``NNTP-Posting-Host:`` and ``Distribution:`` appear more or less often.
- Another significant feature involves whether the sender is affiliated with
  a university, as indicated either by their headers or their signature.
- The word "article" is a significant feature, based on how often people quote
  previous posts like this: "In article [article ID], [name] <[e-mail address]>
  wrote:"
- Other features match the names and e-mail addresses of particular people who
  were posting at the time.

With such an abundance of clues that distinguish newsgroups, the classifiers
barely have to identify topics from text at all, and they all perform at the
same high level.

For this reason, the functions that load 20 Newsgroups data provide a
parameter called **remove**, telling it what kinds of information to strip out
of each file. **remove** should be a tuple containing any subset of
``('headers', 'footers', 'quotes')``, instructing the loader to remove
headers, signature blocks, and quotation blocks respectively.

  >>> newsgroups_test = fetch_20newsgroups(subset='test',
  ...                                      remove=('headers', 'footers', 'quotes'),
  ...                                      categories=categories)
  >>> vectors_test = vectorizer.transform(newsgroups_test.data)
  >>> pred = clf.predict(vectors_test)
  >>> metrics.f1_score(newsgroups_test.target, pred, average='macro')
  0.77310...

This classifier lost a lot of its F-score, just because we removed
metadata that has little to do with topic classification.
It loses even more if we also strip this metadata from the training data:

  >>> newsgroups_train = fetch_20newsgroups(subset='train',
  ...                                       remove=('headers', 'footers', 'quotes'),
  ...                                       categories=categories)
  >>> vectors = vectorizer.fit_transform(newsgroups_train.data)
  >>> clf = MultinomialNB(alpha=.01)
  >>> clf.fit(vectors, newsgroups_train.target)
  MultinomialNB(alpha=0.01, class_prior=None, fit_prior=True)

  >>> vectors_test = vectorizer.transform(newsgroups_test.data)
  >>> pred = clf.predict(vectors_test)
  >>> metrics.f1_score(newsgroups_test.target, pred, average='macro')
  0.76995...

Some other classifiers cope better with this harder version of the task. Try
running :ref:`sphx_glr_auto_examples_model_selection_grid_search_text_feature_extraction.py` with and without
the ``--filter`` option to compare the results.

.. topic:: Recommendation

  When evaluating text classifiers on the 20 Newsgroups data, you
  should strip newsgroup-related metadata. In scikit-learn, you can do this by
  setting ``remove=('headers', 'footers', 'quotes')``. The F-score will be
  lower because the task is more realistic.

.. topic:: Examples

  * :ref:`sphx_glr_auto_examples_model_selection_grid_search_text_feature_extraction.py`

  * :ref:`sphx_glr_auto_examples_text_plot_document_classification_20newsgroups.py`