A curious case of a hairy flash drive (Part 2)

Okay, the title changed a bit; that was intentional. The goal of this episode is to explore how flash drives corrupt documents. A fantastic idea has been brewing in my mind: use machine learning techniques to predict whether a flash drive is corrupt.

I Lied

I did not think up this problem. I was contacted by a lady who had it, and I opted to find a solution. I asked for credit for the work and she agreed, so my challenge begins. To get started, I will first generate a dataset that I can use to explore the problem.

The Plan

This is less of a plan and more of a summary of the progress we have made. The first step was to generate 10k Microsoft Word documents. I didn't know how to do this, so I got the help of my friend John Oke, who completed the task, like a bad guy, in under 3 hours. I was really impressed. Now I have 10k documents on my laptop containing random text, which sets up the next stage of understanding the problem.
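For the curious, here is a rough sketch of one way to mass-produce small Word documents using only Python's standard library. This is not necessarily how John's generator works; it relies on the fact that a .docx file is a zip archive of XML parts. The word list, file names, paragraph counts, and the helper names (`random_paragraphs`, `write_docx`, `generate_corpus`) are all my own assumptions for illustration.

```python
import random
import zipfile
from pathlib import Path
from xml.sax.saxutils import escape

# Minimal package parts that a .docx needs besides the document body.
CONTENT_TYPES = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
    '<Default Extension="rels" '
    'ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
    '<Default Extension="xml" ContentType="application/xml"/>'
    '<Override PartName="/word/document.xml" ContentType="application/'
    'vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>'
    '</Types>'
)
RELS = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/'
    'officeDocument/2006/relationships/officeDocument" Target="word/document.xml"/>'
    '</Relationships>'
)
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical vocabulary; any word list would do.
WORDS = ["flash", "drive", "hairy", "curious", "document", "random", "text", "data"]


def random_paragraphs(rng, n_paragraphs=5, words_per_paragraph=40):
    """Return a list of paragraphs made of randomly chosen words."""
    return [
        " ".join(rng.choice(WORDS) for _ in range(words_per_paragraph))
        for _ in range(n_paragraphs)
    ]


def write_docx(path, paragraphs):
    """Write the paragraphs into a minimal .docx (a zip of XML parts)."""
    body = "".join(
        f"<w:p><w:r><w:t>{escape(p)}</w:t></w:r></w:p>" for p in paragraphs
    )
    document = (
        '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
        f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    )
    with zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("[Content_Types].xml", CONTENT_TYPES)
        zf.writestr("_rels/.rels", RELS)
        zf.writestr("word/document.xml", document)


def generate_corpus(out_dir, count, seed=0):
    """Create `count` random documents in `out_dir` and return their paths."""
    rng = random.Random(seed)  # fixed seed so the corpus is reproducible
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i in range(count):
        path = out / f"doc_{i:05d}.docx"
        write_docx(path, random_paragraphs(rng))
        paths.append(path)
    return paths
```

Calling `generate_corpus("clean_docs", 10_000)` would produce the full set. Files made this way are much smaller than what Word itself saves, since they carry no styles or metadata.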

The next thing I will do is get these files into the HDF5 format, which data scientists have been preaching about for a while and which Python can read and write through libraries such as h5py. Each file I generated is around 12 KB, making the set manageable to work with. These clean, working documents will serve as my positive set. I will also copy them to the corrupting flash drive so that they go bad. I have a theory that not all of them will become unreadable, but hey, that's how things work in real life: data does not usually come back super clean.
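Since I expect a mix of outcomes, each copy that comes back from the flash drive will need a label. Below is a minimal sketch of how that labelling could work, assuming the documents are .docx files (zip archives) and that each copy keeps the same file name as its clean original. The label names and the helpers `sha256_of`, `is_readable_docx`, and `label_copy` are hypothetical, and "readable" here only means the zip container and its main document part pass their integrity checks, not that Word would open the file happily.

```python
import hashlib
import zipfile
import zlib
from pathlib import Path


def sha256_of(path):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_readable_docx(path):
    """Return True if the zip container and word/document.xml pass CRC checks."""
    try:
        with zipfile.ZipFile(path) as zf:
            if zf.testzip() is not None:  # name of the first damaged member
                return False
            zf.read("word/document.xml")  # KeyError if the part is missing
        return True
    except (zipfile.BadZipFile, zlib.error, KeyError, EOFError, OSError):
        return False


def label_copy(clean_path, copy_path):
    """Compare a copy from the flash drive against its clean original."""
    copy_path = Path(copy_path)
    if not copy_path.exists():
        return "missing"
    if sha256_of(clean_path) == sha256_of(copy_path):
        return "identical"
    if is_readable_docx(copy_path):
        return "changed_but_readable"
    return "unreadable"
```

Running `label_copy` over every pair would give one label per document, which can then sit next to the raw bytes in whatever storage format I end up using.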

Quick answers:

Is this code on GitHub: Well, yes it is here: https://github.com/e911miri/word-generator

Who is the mystery lady behind this project: Well, that's classified :p

Do you intend to complete this: I don't know, it sounds like fun, so I am taking it up.

Where can I learn more about machine learning: There is an amazing course on Udacity.