
Introducing the projects!

I’ve set up this place to document my university and personal projects: their histories, learnings and progress. I’m going to start with pages for the university projects I did in 2016 and continue with the ongoing technology projects I’m working on.

In my spare time, I am also working on a personal database and Python project to help students choose subjects at UQ.

Improving code with RegEx

Background

Because of the way the prerequisites were entered into the course websites, I needed to write a little function to parse strings such as “INFS1200 + GEOM1100 + 1200 + 1300”. Reading this as a human, we know that 1200 and 1300 refer to GEOM1200 and GEOM1300, but the letters were assumed and so left off. However, I needed to create a list object containing the full course codes. So after using str.split() on the string, I iterated over the list and prepended the letters of the previous element to any current element that consists only of digits.

My little function was fine while I could write codes = string.split(“ + ”), but this breaks down when you discover that sometimes the prerequisites are entered as “INFS1200 or GEOM1100 and 1200 & 1300”, or something equally inconsistent.
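
Here’s a minimal sketch of that original approach (a simplified illustration, not the actual function):

def expand_codes(prereq_string):
    """Expand shorthand codes like '1200' using the previous code's letters."""
    result = []
    for code in prereq_string.split(" + "):
        if code.isdigit() and result:
            # Prepend the letter prefix of the previous full course code.
            code = "".join(ch for ch in result[-1] if ch.isalpha()) + code
        result.append(code)
    return result

# expand_codes("INFS1200 + GEOM1100 + 1200 + 1300")
# -> ['INFS1200', 'GEOM1100', 'GEOM1200', 'GEOM1300']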

Improvements

So it was time to learn RegEx, and rewrite the function in a more generic way! I needed to find any substring consisting of four uppercase letters followed by four digits, two uppercase letters followed by four digits, or four digits on their own, putting each match into the list.

After some searching and experimentation, I found that (simple) RegEx wasn’t as difficult as I thought, and was actually kind of fun. I used the following pattern:

[A-Z]{4}[0-9]{4}|[A-Z]{2}[0-9]{4}|[0-9]{4}

This works well and means I don’t have to worry about how the course codes are separated. After adding the course codes to the list, I then iterate over it to fix the missing subject-area letters (INFS, GEOM, etc.).
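
Rewritten around the pattern, the function looks something like this sketch (again simplified, not the actual code):

import re

COURSE_CODE = re.compile(r"[A-Z]{4}[0-9]{4}|[A-Z]{2}[0-9]{4}|[0-9]{4}")

def parse_prerequisites(prereq_string):
    """Extract course codes regardless of the separators used."""
    result = []
    for code in COURSE_CODE.findall(prereq_string):
        if code.isdigit() and result:
            # Restore the subject-area letters dropped in the shorthand form.
            code = "".join(ch for ch in result[-1] if ch.isalpha()) + code
        result.append(code)
    return result

# parse_prerequisites("INFS1200 or GEOM1100 and 1200 & 1300")
# -> ['INFS1200', 'GEOM1100', 'GEOM1200', 'GEOM1300']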

Database Correction Interlude

This is the third post in a series documenting the progression of the project and the challenges I faced at each stage. I intend these posts to be almost reflective in nature, rather than very technical, though I will also lightly justify some technical choices. Once I feel that I can move the project out of the prototype stage, I’ll put up a more formal page documenting the technology and some overall reflections.

Because the state of the database kept bugging me, I went back to the original ER diagrams with my friend, and we came to quite a simple solution.

Simply remove the Faculty relation, adding instead a ‘faculty’ attribute to both the School relation and the Plan/Program relation.

A few reasons this is better and still valid:

  • I wouldn’t be saving any space with a dedicated relation anyway (unless I used ID numbers, but then there’s extra querying to join the Faculty name with its ID and with the School).
  • There are only around 20–30 unique schools at UQ, so querying them with DISTINCT isn’t really an issue.
  • Because Plan/Program would have a 1:M relationship with Faculty anyway, you’d still have to search every row in the Plan/Program table to determine all the study plans offered by a Faculty.
  • Because I’m not actually storing any further data about the faculty, there’s no point in devoting a whole table to it.

This really simplifies the data creation process, keeping the faculty against both the school and the plan whilst still allowing me to run interesting queries.

I also took the opportunity to update the database schema so the columns were of a decent length.
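
For illustration, here’s a minimal sketch of what the simplified schema might look like (the table names, column names and lengths are my assumptions here, not the actual definitions):

# Hypothetical sketch of the simplified schema: 'faculty' is now a plain
# column on both tables instead of a foreign key to a Faculty relation.
SIMPLIFIED_SCHEMA = """
CREATE TABLE school (
    school_name VARCHAR(100) PRIMARY KEY,
    faculty     VARCHAR(100) NOT NULL
);

CREATE TABLE program (
    program_code VARCHAR(10)  PRIMARY KEY,
    title        VARCHAR(200) NOT NULL,
    faculty      VARCHAR(100) NOT NULL
);
"""

# With only ~20-30 schools, listing the faculties stays cheap:
#   SELECT DISTINCT faculty FROM school;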


UQ Course Chooser Stage 2 – Database Planning and Evolution

This is the second post in a series documenting the progression of the project and the challenges I faced at each stage. I intend these posts to be almost reflective in nature, rather than very technical, though I will also lightly justify some technical choices. Once I feel that I can move the project out of the prototype stage, I’ll put up a more formal page documenting the technology and some overall reflections.

Following on from part one, I have a Python script to scrape UQ’s course and program websites and now need to design and build a database to store the scraped data.

ER Diagrams and Database planning

I used Draw.io for the initial diagramming, in both Chen notation, as I was taught at university, and Martin notation, as is more commonly used in industry. Many of us used Draw.io during our university studies for its collaborative functionality, as part of the team-based database design project: we could discuss the diagram and easily move objects around in real time, either completely remotely or just on separate computers sitting next to each other. I kept using Draw.io for my own project since I was already experienced with it; and because it saves to Google Drive, my diagrams were automatically saved and versioned, as well as being accessible from any Internet-connected PC.

I like that Chen notation shows the relationships between entities without needing to include the tables created during the mapping steps.

Design Rationale and Mapping

I needed to store information about the following entities:

  • Course
  • Plan/Program
  • School
  • Faculty

A Course entity contains attributes such as CourseCode (the primary key), Title and Coordinator, plus the important multi-valued attributes Semesters and Prerequisites. Both multi-valued attributes would be mapped to new tables; Prerequisites ends up being a correlation table that simply maps CourseCodes to other CourseCodes.
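
To make the mapping concrete, here’s a rough sketch of what those new tables could look like (names and lengths are mine, not the real schema):

# Hypothetical mapping of the multi-valued attributes to their own tables.
MULTIVALUED_TABLES = """
CREATE TABLE course_semester (
    course_code VARCHAR(8)  NOT NULL,  -- references course(course_code)
    semester    VARCHAR(20) NOT NULL,
    PRIMARY KEY (course_code, semester)
);

-- A correlation table: each row says 'course X requires course Y'.
CREATE TABLE prerequisite (
    course_code        VARCHAR(8) NOT NULL,
    prereq_course_code VARCHAR(8) NOT NULL,
    PRIMARY KEY (course_code, prereq_course_code)
);
"""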

I figured that a Course is related to a Plan/Program, since Courses make up Programs (of study). A Course is offered by a School. Schools have a title and a Faculty (entity) that manages the School. A Faculty also offers Plans, so Faculty needs to be an entity in itself so that both relationships are available.

My original Chen Notation diagram was complex and had many relations and relationships:

Unfortunately, I lost some of that detail when I decided to simplify the design during the mapping stage. Even diagrammed in Martin notation, it seemed quite hard to comprehend.

Originally, I wanted to be able to correlate faculties against schools and programs, and schools and faculties against individual courses. This way, as well as having a database to help choose subjects, I could also run interesting statistical analyses on the relationships between courses, schools, faculties and programs, to determine things like which faculty has the most programs, or which school has the most individual courses.

My colleague suggested (separately from my unplanned redesigns) that I use the ER diagramming tool built into MySQL Workbench. It uses Martin notation and has the nice feature of being able to generate an SQL script from the diagram.

MySQL Workbench

Following on from my Martin notation diagrams in Draw.io, below is one of the early ER diagrams I made in MySQL Workbench:

After discussions with my friend, I further simplified the ER diagram and model:

Clearly, the model and diagram are now much simpler and easier to understand, with the Course table still taking centre stage. However, Program is no longer linked to Faculty. (Having said that, an equijoin with a nested query would likely work and still keep the database simple.)

Recently I removed the Faculty table, since it didn’t really need to exist; I figured I should just store the faculty in a column against each School instance (row), given that each school only has one faculty.

Final Thoughts

Of course, while it’s now very easy to insert the course data into the database, I have also lost all the faculty–school–program relationships by removing that table; more sub-queries are now required to cross-reference them. I decided that, since there would only be 20–30 schools across five faculties, the time saved in queries didn’t warrant the extra complexity. There are only about 3,000 courses in total and probably a few hundred programs, so I didn’t think speed was really an issue.
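
For example, a query along these lines could still recover the faculty-to-program relationship via School and Course, at the cost of the extra joins (every table and column name here is an assumption on my part, including the program_course correlation table):

# Hypothetical cross-referencing query: which programs include courses
# offered by schools belonging to a given faculty?
PROGRAMS_BY_FACULTY = """
SELECT DISTINCT p.program_code, p.title
FROM program p
JOIN program_course pc ON pc.program_code = p.program_code
JOIN course c          ON c.course_code   = pc.course_code
JOIN school s          ON s.school_name   = c.school_name
WHERE s.faculty = %s;
"""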

Also, I really wanted to get the database created so I could at least prototype the Python database saving code.

Now that I actually have the Python script saving things to the database, I’ll probably go back and either change the database again, or scrape the data first and then reprocess the database to rebuild the Faculty table once the entire dataset is scraped. Having redesigned the database a few times, I feel I’m in a better place to go back and create a database structure similar to what I originally wanted.

Building the database itself presented further technical challenges, as I was also on holidays at the time…

Project: UQ Course Chooser

A project I’m currently working on in my spare time is a Python script/database/web app to help UQ students choose their subjects for the next semester.

This is the first in a series of blog posts, written almost in diary format, documenting the progression of the project and the challenges I faced. I intend these posts to be almost reflective in nature, rather than very technical, though I will also lightly justify some technical choices. Once I feel that I can move the project out of the prototype stage, I’ll put up a more formal page documenting the technology and some overall reflections.

Rationale

The idea stemmed from my own frustrations with choosing subjects for my Graduate Certificate in Information Technology. We have a list of courses (at UQ, a subject is referred to as a “course”; I’ll use the terms interchangeably) we can choose from for a particular Program (qualification). Unfortunately, this list just divides the subjects into categories related to the program, such as “Part A – Compulsory” and “Part B – Introductory Electives”. So when I was choosing my own subjects, I would open up each course in the list in a new browser tab, then remove the subjects (close the browser tab) that were only offered in a different semester or that I didn’t have the prerequisites for.

I devised three main “stages” and components for my project: a Python script to scrape UQ’s course websites, a database to store the course and program information from the scrape, and a web app providing the actual interface and UI to show the course information. (I envisage a page where the user selects their program of study and the semester, and enters any other subjects they’ve studied; the app then shows a list of available subjects, highlighting the ones that meet the criteria.)

Stage one – Scraping script

The first course of action was to determine whether it was even possible to programmatically extract the course information (semester, prerequisites, course title, etc.). A quick look at the HTML source of the course websites proved that it was. I was pleased to discover that the web pages were formatted in a consistent way, with HTML elements given sensible ids and classes, such as <h1 id="course-title">Introduction to Software Engineering (CSSE7030)</h1> for the course title, or <p id="course-incompatible">COMP1502 or CSSE1001</p> for the courses that another course is incompatible with. (You wouldn’t get credit for a course if you have passed the courses it is incompatible with.) I also found a list of all programs offered by the university. These facts proved that it would be possible (maybe even not too difficult) to scrape UQ’s course and program websites to build up a database of subjects and programs.
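
As a rough illustration (a sketch only, using requests and BeautifulSoup, and not the script’s actual code), pulling those fields out looks something like this:

import requests
from bs4 import BeautifulSoup

def fetch_course_fields(url):
    """Grab a couple of the consistently-labelled fields from a course page."""
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    title = soup.find(id="course-title")
    incompatible = soup.find(id="course-incompatible")
    return {
        "title": title.get_text(strip=True) if title else None,
        "incompatible": incompatible.get_text(strip=True) if incompatible else None,
    }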

Finally, I was able to get back to the project after the semester finished, so I set about finishing the Python script.

My script is made up of three main components so far:

  • A Course class to store various attributes about each course, using appropriate data types (though Python is dynamically typed).
  • A find_links(url) function, which searches through a web page and returns a list of all the courses listed, based on URLs in UQ’s course-URL format (a rough sketch follows this list).
  • A database_access.py module, which contains some helper methods for database querying and writing, plus the main course-writer method, which receives a Course object and writes its attributes to the database.
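
As a sketch of the find_links(url) idea (the URL pattern here is illustrative rather than the script’s real matching logic):

import re
import requests

def find_links(url):
    """Return the course codes linked from a listing page.

    A rough sketch only: the regex below assumes the course code
    appears in each link's query string.
    """
    html = requests.get(url).text
    # e.g. href="...course.html?course_code=CSSE7030"
    return re.findall(r"course_code=([A-Z]{4}[0-9]{4})", html)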

In my main script, I loop over the result of find_links() on a given URL, using each result to create a new Course object from that URL and appending each new Course object to a list. Then I loop over that list, passing each Course object to the database_access.create_course(course) method.

courses = []

# Build a Course object from each course link found on the listing page
# (course_page_url is the path prefix for individual course pages).
for each_course in find_links(url_to_run):
    courses.append(Course("http://www.uq.edu.au" +
                          course_page_url + each_course))

print(courses)

# Write each scraped course to the database.
for each_course in courses:
    database_access.create_course(each_course)

database_access.connection.close()


Once I had enough working that I could scrape individual course data, I peeled away from this and set about creating the database to actually store the scraped information. I’ll discuss my experiences with MySQL and Python in a later post.