Chat Log Parser – Now with Facebook support!

Continuing on from my original post about this project, I’ll say a few words about what I learned while adding Facebook chat support.

Once I found the “fbchat-archive-parser” project on GitHub, I could easily parse the CSV it generates using code similar to what I use for the Viber backup. There are a few differences in the output though: one made things easier, and another needed more research. There is a third issue, caused by Facebook itself, for which I haven’t yet found a proper solution.

First, it’s prudent to explain how the process works:

  1. fbchat-archive-parser parses the messages.html file of the Facebook Archive.
    • The Facebook Archive is an archive of everything you’ve done on Facebook – posts, photos, chats, friends lists, etc. You can request it from Facebook and they’ll gather up everything, compiling it into an archive with multiple files and folders inside. One of these is messages.html. The messages aren’t in any particular order or grouping for some reason, but they all seem to be there. (Also, Facebook doesn’t always retrieve the friend names for the archive, sometimes using the member ID instead, which causes my third issue.) So, the open source community created that script to parse the file into separate chats and various output formats.
  2. By selecting the options in the script to create separate files per conversation, my script can then read a conversation similarly to how it does for Viber.
  3. Once I have the messages, I can send them to Jinja2 for HTML output.

Easy difference

Kindly enough, fbchat-archive-parser creates valid CSV, enclosing messages containing commas or spanning multiple lines in quotation marks. This allows me to dramatically simplify the code, not having to check the previous line or join multiple message fragments together with comma separation.

import codecs
import csv

from dateutil.parser import parse


def messenger(filename, messenger_chat):
    with codecs.open(filename, "r") as chatfile:
        chat = csv.reader(chatfile, delimiter=",")
        # The Facebook chat log parser includes a header row.
        # Maybe I'll parse it later. For now, just skip over it.
        # Skip through the reader rather than the file, so the CSV parser
        # (not a raw line read) decides where the first row ends.
        next(chat, None)
        for line in chat:
            if len(line) > 1:
                try:
                    timestamp = parse(line[2])
                    m = Message(line[1], 0, timestamp, line[3])
                    messenger_chat.add_message(m)
                except ValueError:
                    # this must be a continuation of the previous message,
                    # so append it rather than overwrite what is already there
                    rest_content = "\n" + ", ".join(line)
                    messenger_chat.get_most_recently_found_msg().contents += rest_content

Actually, I don’t need that entire except block since chats won’t continue over lines in a way that breaks the CSV. But I left it in just in case; I’ll probably take it out soon.
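
A minimal sketch of why that fallback shouldn’t be needed: Python’s csv module already keeps a quoted field together, whether it contains commas or line breaks. The sample rows and the header names below are assumptions made up for illustration (only the column order matches the function above); they are not output copied from fbchat-archive-parser.

```python
import csv
import io

# Hypothetical sample using the column order the function above reads
# (index 1 = sender, 2 = timestamp, 3 = message). Header names are assumed.
SAMPLE = (
    "thread,sender,date,message\n"
    'Alice,Alice,2016-06-17T19:41+10:00,"Hello, world"\n'
    'Alice,Bob,2016-06-17T19:42+10:00,"line one\nline two"\n'
)


def read_rows(text):
    """Return all data rows, skipping the header through the reader."""
    reader = csv.reader(io.StringIO(text))
    next(reader, None)  # skip the header row
    return list(reader)


rows = read_rows(SAMPLE)
for row in rows:
    # Every row has four fields, even the one whose message spans two lines.
    print(len(row), repr(row[3]))
```

Each row comes back with exactly four fields, so the message is always at index 3 and nothing has to be stitched back together afterwards.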

Harder difference (with learnings)

The date format in Viber’s chat export was easy to parse into a datetime object. The CSV export of Messenger, however, uses the ISO 8601 standard, which is good:

2016-06-17T19:41+10:00

but I couldn’t easily parse it using strptime. I think the problem was the timezone offset: as far as I can tell, strptime’s %z directive was either unsupported or didn’t accept the colon in +10:00 in the Python version I was using. I did some searching and eventually found the Python library “dateutil”. The dateutil library has a parser feature that intelligently parses dates in almost any format without having to manually specify the format mask. Using this library I could easily parse the dates in the Facebook chat archive. In truth, I don’t like to rely on third-party libraries for simple tasks (or what should be a simple task) but for now at least I have something working.
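
If dropping the third-party dependency ever matters, the standard library may be enough for this particular format. The sketch below assumes Python 3.7 or later, where datetime.fromisoformat exists, and falls back to strptime after removing the colon from the offset on older Python 3 versions. parse_fb_timestamp is a hypothetical helper name, not something in the project.

```python
from datetime import datetime


def parse_fb_timestamp(text):
    """Parse a timestamp such as 2016-06-17T19:41+10:00 into an aware datetime."""
    try:
        # Available from Python 3.7; accepts an offset written as +HH:MM.
        return datetime.fromisoformat(text)
    except AttributeError:
        # Older Python 3: %z expects +HHMM, so drop the colon in the offset first.
        if len(text) >= 6 and text[-6] in "+-" and text[-3] == ":":
            text = text[:-3] + text[-2:]
        return datetime.strptime(text, "%Y-%m-%dT%H:%M%z")


stamp = parse_fb_timestamp("2016-06-17T19:41+10:00")
print(stamp.isoformat())
```

Like dateutil’s parse, both paths raise ValueError on input they can’t read, so the existing except ValueError handling would still apply.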

Third issue

The third issue is the display of contact names in the archive. Sometimes Facebook provides the contact’s unique Facebook ID instead of their name. From the fbchat-archive-parser readme.md file:

Why do some names appear as <some number>@facebook.com?

For some reason, Facebook seems to randomly swap names for IDs. In recent times, it has gotten worse. You can have the parser resolve the names via Facebook itself with the --resolve flag. Keep in mind, this is a beta feature and may not work perfectly.

$ fbcap ./messages.htm -t second --resolve
Facebook username/email: facebook_username
Facebook password:

This requires your Facebook credentials to get accurate results. This does not relay your credentials through any servers and is a direct connection from your computer to Facebook. Please look at the code if you are feeling paranoid or skeptical 🙂

Because of this, I can’t reliably determine who the sender and receiver are. In fact, I can’t determine the receiver at all, since the archive gives no indication, unlike Viber which literally says “you” in the chat log.

I could easily ask the user who “they” are so I can update the is_user property, but this would break down if Facebook swaps names for IDs at some point.

For now I’m not trying to lay out the messages on either side of the page for Facebook like I do for Viber, so I just show each message one after the other.
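
One possible way around the name/ID swap, sketched below, is to ask the user for every label they might appear under (their display name and, if they know it, their <some number>@facebook.com form) and compare each sender against that whole set. This is only an idea, not what the project does: is_user_sender, normalise and the sample aliases are all hypothetical, and it does nothing for other contacts whose names were replaced by IDs.

```python
def normalise(label):
    """Lower-case and trim a sender label so cosmetic differences don't matter."""
    return label.strip().lower()


def is_user_sender(sender, user_aliases):
    """Return True if the sender matches any label the user says is theirs."""
    known = {normalise(alias) for alias in user_aliases}
    return normalise(sender) in known


# Hypothetical aliases supplied by the user: a display name plus an ID form.
aliases = ["Anthony Example", "100000000000001@facebook.com"]

for sender in ["Anthony Example", "100000000000001@facebook.com", "Someone Else"]:
    print(sender, is_user_sender(sender, aliases))
```

The result could feed the is_user property, so the two-sided layout would keep working even when a sender shows up under the ID form.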

Next…

The code is up on GitHub anyway. If you have some ideas, please suggest them or send me a PR. My next plan is to add support for KakaoTalk, the Korean instant messenger, which also provides text-based chat log exports.

Posted by Anthony