The National High School Graduation Examination (NHSGE) is considered the most important exam for high school students in Vietnam. It serves two purposes: high school graduation and university entrance. The subjects are Maths, Literature, English, Physics, Chemistry, Biology, History, Geography, and Civic Education. This project helps journalists analyze the 2020 NHSGE results and draw insights from them.
The raw data, scraped from diemthi.hcm.edu.vn, the official website of the Vietnam Ministry of Education & Training, is processed into a clean dataset.
The clean dataset includes the Registration Number, the Date of Birth, and the score for each subject. The maximum score is 10, and the minimum is 0. If a student did not take or register for a subject, the score is recorded as -1.
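Because -1 marks a missing score rather than a real one, it has to be filtered out before any statistic is computed, or averages will be pulled down. A minimal sketch of that filtering, where the function name `subject_mean` and the sample scores are made up for illustration:

```python
from statistics import mean

MISSING = -1  # sentinel the dataset uses for "subject not taken or not registered"


def subject_mean(scores):
    """Average the valid scores (0 to 10), skipping the -1 sentinel.

    Returns None when no student has a valid score for the subject.
    """
    valid = [s for s in scores if s != MISSING]
    return mean(valid) if valid else None


# Illustrative scores only, not taken from the real dataset.
maths = [8.5, -1, 6.0, 10, -1]
print(subject_mean(maths))
```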
In the web scraping phase, some Registration Numbers turned out to be invalid. A simple try-except solved the problem: I collected the invalid Registration Numbers in a list and then excluded them from the rest of the process.
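The try-except approach could look roughly like the sketch below. It is not the project's actual scraper: `scrape_all`, `fake_fetch`, and the sample Registration Numbers are hypothetical, and the real HTTP request is replaced by a stand-in function so the example runs without network access.

```python
def scrape_all(registration_numbers, fetch):
    """Fetch each record, collecting the numbers that fail instead of stopping."""
    records, invalid = {}, []
    for number in registration_numbers:
        try:
            records[number] = fetch(number)
        except Exception:  # broad on purpose in this sketch; narrow it in real code
            invalid.append(number)
    return records, invalid


def fake_fetch(number):
    # Hypothetical stand-in for the real request and HTML parsing step.
    if number.endswith("0"):
        raise ValueError(f"no result page for {number}")
    return {"maths": 8.0}


records, invalid = scrape_all(["02000001", "02000010", "02000002"], fake_fetch)
print(invalid)  # the numbers that raised and were skipped
```

Passing the fetch function in as an argument keeps the error handling separate from the request logic, so the invalid list can be saved and reused on later runs.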
In the cleaning phase, because the raw data came straight from web scraping, I cleaned it using knowledge of the HTML structure. Records of students with maximum scores needed some special handling.
The students' names are in Vietnamese, so I decoded them for readability.
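The source does not say which encoding the names arrived in. One common case when scraping HTML is that accented characters come through as numeric character references, which the standard library can decode. The sketch below assumes that case, and the sample name is invented:

```python
import html

# Assumed input form: Vietnamese letters escaped as HTML numeric references.
raw_name = "Nguy&#7877;n V&#259;n An"

# html.unescape converts character references back to Unicode characters.
decoded_name = html.unescape(raw_name)
print(decoded_name)
```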
To analyze the NHSGE data and generate insights from it, I produced the following charts:
Bar chart showing the year of birth distribution
Average Score by Age Group
WordCloud of Students' Last Names
Stacked Distribution of Maths and English Scores
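The aggregation behind a chart of average score by age group can be sketched with the standard library alone. This is an illustration under assumptions, not the project's code: the `dd/mm/yyyy` date format, the field names `dob` and `maths`, and the sample rows are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Illustrative rows only; the real dataset has one row per Registration Number.
rows = [
    {"dob": "15/03/2002", "maths": 8.0},
    {"dob": "02/11/2002", "maths": 6.0},
    {"dob": "20/07/2001", "maths": -1},   # -1 means the subject was not taken
    {"dob": "09/01/2001", "maths": 7.5},
]


def mean_by_birth_year(rows, subject):
    """Group valid scores by year of birth and average each group."""
    by_year = defaultdict(list)
    for row in rows:
        score = row[subject]
        if score == -1:
            continue  # skip missing scores so they do not distort the average
        year = int(row["dob"].rsplit("/", 1)[1])  # assumes dd/mm/yyyy
        by_year[year].append(score)
    return {year: mean(scores) for year, scores in sorted(by_year.items())}


result = mean_by_birth_year(rows, "maths")
print(result)
```

The resulting dictionary maps each birth year to an average, which can then be passed to any plotting library to draw the bar chart.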