Classification of textual data with LSTM and BERT on the Web of Science dataset, implemented in Python
.github Added the mirrors. 2026-01-12 09:13:50 -05:00
logs Updated the code. 2025-12-03 14:08:46 -05:00
.gitignore Removed Pycharm config files. 2026-01-25 20:11:20 -05:00
attention_plot_correct_0.png Finished the code. 2026-01-02 08:00:49 -05:00
attention_plot_correct_1.png Finished the code. 2026-01-02 08:00:49 -05:00
attention_plot_incorrect_2.png Finished the code. 2026-01-02 08:00:49 -05:00
attention_plot_incorrect_3.png Finished the code. 2026-01-02 08:00:49 -05:00
attention_plot_incorrect_4.png Fixed the cuda issue. 2025-12-10 15:12:54 -05:00
BERT.png Fixed the cuda issue. 2025-12-10 15:12:54 -05:00
BERT.py Finished the code. 2026-01-02 08:00:49 -05:00
BERT_comparaison.png Fixed the cuda issue. 2025-12-10 15:12:54 -05:00
dataset-acquisition.png Added the project files. 2025-12-02 21:27:50 -05:00
dataset-acquisition.py Updated the code. 2025-12-04 17:23:22 -05:00
Experiment-1-LSTM-Standardized-Split.py Finished the code. 2026-01-02 08:00:49 -05:00
Experiment-2-BERT-Standardized-Split.py Finished the code. 2026-01-02 08:00:49 -05:00
Experiment-3-LSTM-Single.py Finished the code. 2026-01-02 08:00:49 -05:00
Experiment-4-BERT-Single.py Finished the code. 2026-01-02 08:00:49 -05:00
Experiment-5-BERT-Attention-Matrix.py Finished the code. 2026-01-02 08:00:49 -05:00
LICENSE.md Added the mirrors. 2026-01-12 09:13:50 -05:00
LSTM.png Finished the code. 2026-01-02 08:00:49 -05:00
LSTM.py Finished the code. 2026-01-02 08:00:49 -05:00
LSTM_accuracies.png Finished the code. 2026-01-02 08:00:49 -05:00
LSTM_for_comparaison.png Finished the code. 2026-01-02 08:00:49 -05:00
README.md Updated the code. 2025-12-02 21:54:09 -05:00
X.txt Added the project files. 2025-12-02 21:27:50 -05:00
Y.txt Added the project files. 2025-12-02 21:27:50 -05:00
YL1.txt Added the project files. 2025-12-02 21:27:50 -05:00
YL2.txt Added the project files. 2025-12-02 21:27:50 -05:00

Classification of Textual Data

In this repository, we explore two models, LSTM and BERT, on the Web of Science dataset. The project involves building a custom LSTM model from scratch and fine-tuning a pre-trained BERT model for text classification. We compare how well these models classify scientific paper abstracts into their corresponding fields and sub-fields. The main objectives were to pre-process the raw text data, implement the LSTM and the fine-tuned BERT classifier, run the experiments, and analyze the results in terms of accuracy. The final report includes a detailed comparison of the two models, insights on the impact of pretraining, and a discussion of performance based on our findings.
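As a rough illustration of the pre-processing and accuracy steps mentioned above, the sketch below turns raw abstracts into fixed-length sequences of token indices (the kind of input an LSTM consumes) and scores predictions against labels. It is a minimal sketch using only the standard library, not the code in `LSTM.py` or `BERT.py`: the function names, the `PAD`/`UNK` indices, the vocabulary size, the sequence length, and the two sample abstracts are all assumptions made for the example.

```python
import re
from collections import Counter

# Reserved indices; an illustrative choice, not taken from the repository.
PAD, UNK = 0, 1


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocab(texts, max_size=20000):
    """Map the most frequent tokens to integer indices."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    vocab = {"<pad>": PAD, "<unk>": UNK}
    for tok, _ in counts.most_common(max_size - len(vocab)):
        vocab[tok] = len(vocab)
    return vocab


def encode(text, vocab, max_len=8):
    """Convert text to exactly max_len indices, truncating or padding."""
    ids = [vocab.get(tok, UNK) for tok in tokenize(text)][:max_len]
    return ids + [PAD] * (max_len - len(ids))


def accuracy(predicted, expected):
    """Fraction of predictions that match the expected labels."""
    if not expected or len(predicted) != len(expected):
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(p == e for p, e in zip(predicted, expected)) / len(expected)


# Hypothetical sample abstracts, standing in for lines of X.txt.
abstracts = [
    "Deep learning for protein folding.",
    "Protein structure and folding kinetics.",
]
vocab = build_vocab(abstracts)
encoded = [encode(a, vocab) for a in abstracts]
```

In practice the abstracts and labels would be read from the dataset files, and the encoded sequences would be fed to an embedding layer; the BERT experiments would instead rely on the tokenizer that ships with the pre-trained model.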

Authors

  • Batuhan Berk Başoğlu, 260768350 - batuhan-basoglu
  • Jared Tritt, 260763506 - Jaredtritt
  • Alys Pisani-Houze, 261093153