forked from orionw/RedditHumorDetection
create_data.sh
11 lines (11 loc) · 828 Bytes
# This script will take the full data and regenerate the TSV files needed for the model.
# Unfortunately, due to how new I was at this, I didn't set random seeds in all the places that needed them. Fortunately, the original data splits are saved in the `data` folder. Regenerating the data with these scripts produces results within ~1% in either direction of, and consistent with, the results reported in the paper.
pip3 install -r requirements.txt
# process the data
python3 full_datasets/reddit_jokes/reddit_cleaning/GetSplitFiles.py
python3 full_datasets/reddit_jokes/reddit_cleaning/GetTSVFileForBERT.py
cp full_datasets/reddit_jokes/reddit_cleaning/output/output_for_bert/full/*.tsv data/
# remove the validation set
rm data/dev.tsv
# rename the test set to dev.tsv, the filename the model reads at evaluation time
mv data/test.tsv data/dev.tsv
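# Optional sanity check (not part of the original pipeline; paths assume the
# copies made above): confirm the regenerated TSVs exist and that every row
# in each file has the same number of tab-separated fields as its first row.
for f in data/train.tsv data/dev.tsv; do
    [ -f "$f" ] || { echo "missing $f" >&2; exit 1; }
    awk -F'\t' 'NR==1{n=NF} NF!=n{print FILENAME ": line " NR " has " NF " fields, expected " n; exit 1}' "$f"
done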