Dataset loaders:
- huggingface/datasets -
- tensorflow/datasets -
- pytorch/text -
The benchmarks section lists all benchmarks using a given dataset or any of its variants. We use variants to distinguish between results evaluated on slightly different versions of the same dataset. For example, ImageNet 32⨉32 and ImageNet 64⨉64 are variants of the ImageNet dataset.
IMDb Movie Reviews

The IMDb Movie Reviews dataset is a binary sentiment analysis dataset consisting of 50,000 reviews from the Internet Movie Database (IMDb) labeled as positive or negative. The dataset contains an even number of positive and negative reviews. Only highly polarizing reviews are considered. A negative review has a score ≤ 4 out of 10, and a positive review has a score ≥ 7 out of 10. No more than 30 reviews are included per movie. The dataset contains additional unlabeled data.
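The polarity rule described above (score ≤ 4 is negative, score ≥ 7 is positive, mid-range scores excluded) can be sketched as a small helper. The function name `polarity_label` is illustrative only, not part of any dataset API:

```python
def polarity_label(score):
    """Map an IMDb user score (1-10) to the dataset's polarity classes.

    Reviews scoring <= 4 are labeled negative, >= 7 positive; mid-range
    scores (5 and 6) were excluded from the dataset entirely because
    only highly polarizing reviews are considered.
    """
    if score <= 4:
        return "neg"
    if score >= 7:
        return "pos"
    return None  # 5-6: not polarizing enough, not in the dataset
```

So `polarity_label(3)` yields `"neg"`, `polarity_label(9)` yields `"pos"`, and a score of 5 falls outside the dataset.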

IMDB Movie Reviews Large Dataset - 50k Reviews (GitHub: laxmimerit/IMDB-Movie-Reviews-Large-Dataset-50k)
This dataset is taken from https://ai.stanford.edu/~amaas/data/sentiment/ and preprocessed so that all positive and negative reviews sit in a single file each for training and testing. This lets you spend your effort on the algorithm instead of data collection.
imdb_reviews (TensorFlow Datasets)
Description:
Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.
Additional Documentation: Explore on Papers With Code
Homepage : http://ai.stanford.edu/~amaas/data/sentiment/
Source code : tfds.datasets.imdb_reviews.Builder
- 1.0.0 (default): New split API (https://tensorflow.org/datasets/splits)
Download size: 80.23 MiB
Auto-cached: Yes
Supervised keys (see the as_supervised doc): ('text', 'label')
Figure (tfds.show_examples): Not supported.
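With `as_supervised=True`, tfds yields `(text, label)` tuples built from the supervised keys above instead of feature dicts. A plain-Python sketch of that transformation (the example records are made up, and real tfds elements are tensors, not Python strings):

```python
# Each TFDS example is normally a dict of features.
examples = [
    {"text": "a wonderful film", "label": 1},
    {"text": "dull and overlong", "label": 0},
]

# as_supervised=True applies the supervised keys ('text', 'label'),
# turning each feature dict into an (input, target) tuple.
supervised = [(ex["text"], ex["label"]) for ex in examples]
```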
imdb_reviews/plain_text (default config)
Config description: Plain text
Dataset size: 129.83 MiB
Feature structure: text (string), label (binary class label)

imdb_reviews/bytes
Config description: Uses byte-level text encoding with tfds.deprecated.text.ByteTextEncoder
Dataset size: 129.88 MiB
imdb_reviews/subwords8k
Config description: Uses tfds.deprecated.text.SubwordTextEncoder with 8k vocab size
Dataset size: 54.72 MiB
imdb_reviews/subwords32k
Config description: Uses tfds.deprecated.text.SubwordTextEncoder with 32k vocab size
Dataset size: 50.33 MiB
Last updated 2022-12-10 UTC.
Hugging Face dataset: imdb
Dataset Card for "imdb". Sections: dataset summary; dataset structure (data instances, data fields, data splits); dataset creation (curation rationale, source data, annotations, personal and sensitive information); considerations for using the data (social impact of dataset, discussion of biases, other known limitations); additional information (dataset curators, licensing information, citation information, contributions).
Supported Tasks and Leaderboards
More Information Needed
- Size of downloaded dataset files: 80.23 MB
- Size of the generated dataset: 127.06 MB
- Total amount of disk used: 207.28 MB
An example of 'train' looks as follows.
The data fields are the same among all splits.
- text : a string feature.
- label : a classification label, with possible values including neg (0), pos (1).
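A minimal sketch of a single train record with these two fields (the review text is a made-up placeholder, not taken from the dataset):

```python
# Schema: "text" is a free-form string, "label" is a class id (0 = neg, 1 = pos).
example = {
    "text": "A placeholder review praising the film's pacing and cast.",
    "label": 1,  # pos
}

# Mapping class ids back to the class names listed above.
label_names = ["neg", "pos"]
```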
Dataset Creation: initial data collection and normalization; source language producers; annotation process; annotators.
Thanks to @ghazi-f , @patrickvonplaten , @lhoestq , @thomwolf for adding this dataset.
Models trained or fine-tuned on imdb

- lvwerra/distilbert-imdb
- sileod/deberta-v3-base-tasksource-nli
- mrm8488/t5-base-finetuned-imdb-sentiment
- fabriceyhc/bert-base-uncased-imdb
- edbeeching/gpt-neo-125M-imdb
- federicopascual/finetuning-sentiment-model-3000-samples

IMDB movie review sentiment classification dataset
load_data function
Loads the IMDB dataset.
This is a dataset of 25,000 movie reviews from IMDB, labeled by sentiment (positive/negative). Reviews have been preprocessed, and each review is encoded as a list of word indexes (integers). For convenience, words are indexed by overall frequency in the dataset, so that for instance the integer "3" encodes the 3rd most frequent word in the data. This allows for quick filtering operations such as: "only consider the top 10,000 most common words, but eliminate the top 20 most common words".
As a convention, "0" does not stand for a specific word, but instead is used to encode the pad token.
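The frequency-rank encoding can be sketched from scratch. This toy encoder is not the Keras implementation; it just illustrates the convention above, reserving index 0 for the pad token and ranking words by training-set frequency:

```python
from collections import Counter

def build_rank_index(texts):
    """Rank words by frequency: most common word -> 1, next -> 2, ...

    Index 0 is reserved for the pad token, matching the convention
    described above. Ties keep first-seen order (sorted() is stable).
    """
    counts = Counter(word for text in texts for word in text.split())
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

texts = ["the movie was great", "the movie was bad", "the plot was thin"]
index = build_rank_index(texts)
# 'the' and 'was' appear 3 times each, 'movie' twice, so:
# 'the' -> 1, 'was' -> 2, 'movie' -> 3, the rest -> 4..7.
encoded = [index[w] for w in "the movie was bad".split()]  # [1, 3, 2, 5]
```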
- path : where to cache the data (relative to ~/.keras/dataset ).
- num_words : integer or None. Words are ranked by how often they occur (in the training set) and only the num_words most frequent words are kept. Any less frequent word will appear as oov_char value in the sequence data. If None, all words are kept. Defaults to None, so all words are kept.
- skip_top : skip the top N most frequently occurring words (which may not be informative). These words will appear as oov_char value in the dataset. Defaults to 0, so no words are skipped.
- maxlen : int or None. Maximum sequence length. Any longer sequence will be truncated. Defaults to None, which means no truncation.
- seed : int. Seed for reproducible data shuffling.
- start_char : int. The start of a sequence will be marked with this character. Defaults to 1 because 0 is usually the padding character.
- oov_char : int. The out-of-vocabulary character. Words that were cut out because of the num_words or skip_top limits will be replaced with this character.
- index_from : int. Index actual words with this index and higher.
- **kwargs : Used for backwards compatibility.
- Tuple of Numpy arrays : (x_train, y_train), (x_test, y_test) .
x_train, x_test : lists of sequences, which are lists of indexes (integers). If the num_words argument was specified, the maximum possible index value is num_words - 1 . If the maxlen argument was specified, the largest possible sequence length is maxlen .
y_train, y_test : lists of integer labels (1 or 0).
- ValueError : in case maxlen is so low that no input sequence could be kept.
Note that the 'out of vocabulary' character is only used for words that were present in the training set but are not included because they're not making the num_words cut here. Words that were not seen in the training set but are in the test set have simply been skipped.
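The interaction of skip_top, num_words, and oov_char described above can be sketched as a filter over already-encoded sequences. This is toy code mirroring the documented semantics, not the Keras source:

```python
def apply_vocab_limits(sequences, num_words=None, skip_top=0, oov_char=2):
    """Replace indexes outside [skip_top, num_words) with oov_char.

    Mirrors how load_data treats words that are too frequent (below
    skip_top) or too rare (at or above num_words): both become oov_char.
    """
    def keep(i):
        return i >= skip_top and (num_words is None or i < num_words)
    return [[i if keep(i) else oov_char for i in seq] for seq in sequences]

# Keep only ranks 2..9; rank 1 (too common) and 10+ (too rare) become oov (2).
filtered = apply_vocab_limits([[1, 5, 10, 3]], num_words=10, skip_top=2)
# -> [[2, 5, 2, 3]]
```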
get_word_index function
Retrieves a dict mapping words to their index in the IMDB dataset.
The word index dictionary. Keys are word strings, values are their index.
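Decoding a sequence back to text combines this word index with the index_from offset used by load_data. A sketch with a hypothetical three-word index (the real dictionary has tens of thousands of entries, and the marker strings `<pad>`, `<start>`, `<oov>` are illustrative names, not part of the API):

```python
# Hypothetical miniature of the dict returned by get_word_index().
word_index = {"the": 1, "movie": 2, "great": 3}

# load_data shifts real word ranks up by index_from (default 3), so the
# indexes 0 (padding), 1 (start_char) and 2 (oov_char) stay reserved.
index_from = 3
inverted = {rank + index_from: word for word, rank in word_index.items()}
inverted.update({0: "<pad>", 1: "<start>", 2: "<oov>"})

# A sequence as load_data would emit it: start marker, then shifted ranks.
decoded = " ".join(inverted[i] for i in [1, 4, 5, 6])
# -> "<start> the movie great"
```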
