A curious case of a hairy flash drive (Part 2)
Okay, the title changed a bit; that was intentional. The goal of this episode is to explore how flash drives corrupt documents. A fantastic idea has been brewing in my mind about using machine learning techniques to predict whether a flash drive is corrupt.
I Lied
I did not think up this problem. A lady contacted me with it, and I opted to find a solution, asking only for credit in this work. She agreed, so my challenge begins. To get started, I will first generate a dataset that I can use to explore the problem.
The Plan
This is less of a plan and more a record of the progress we have made. The first step here is to generate 10k Microsoft Word documents. I didn't know how to do this, so I got the help of my friend John Oke, who completed the task like a bad guy in under 3 hrs. I was really impressed. Now I have 10k documents on my laptop containing random text, which sets up the next step toward understanding the problem.
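For readers curious what generating such a corpus might look like, here is a minimal sketch using only Python's standard library. It is not the code from the repository linked further down, and it is not necessarily how John did it. The helper names (`random_paragraph`, `write_docx`, `generate_corpus`), the file naming scheme, and the word and paragraph counts are all assumptions made for illustration. It relies on the fact that a .docx file is a ZIP archive of XML parts and writes only the bare minimum of those parts.

```python
import random
import string
import zipfile
from pathlib import Path
from xml.sax.saxutils import escape

# The three parts a bare-bones .docx package needs (illustrative minimum).
CONTENT_TYPES = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
    '<Default Extension="rels" '
    'ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
    '<Default Extension="xml" ContentType="application/xml"/>'
    '<Override PartName="/word/document.xml" '
    'ContentType="application/vnd.openxmlformats-officedocument.'
    'wordprocessingml.document.main+xml"/>'
    '</Types>'
)
RELS = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Relationships '
    'xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" '
    'Type="http://schemas.openxmlformats.org/officeDocument/2006/'
    'relationships/officeDocument" Target="word/document.xml"/>'
    '</Relationships>'
)
DOCUMENT = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<w:document '
    'xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<w:body>{body}</w:body></w:document>'
)


def random_paragraph(rng, n_words=60):
    """Return n_words made-up lowercase 'words' joined by spaces."""
    words = (
        "".join(rng.choices(string.ascii_lowercase, k=rng.randint(2, 10)))
        for _ in range(n_words)
    )
    return " ".join(words)


def write_docx(path, paragraphs):
    """Write the paragraphs into a minimal .docx package at path."""
    body = "".join(
        '<w:p><w:r><w:t xml:space="preserve">' + escape(p) + "</w:t></w:r></w:p>"
        for p in paragraphs
    )
    with zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED) as z:
        z.writestr("[Content_Types].xml", CONTENT_TYPES)
        z.writestr("_rels/.rels", RELS)
        z.writestr("word/document.xml", DOCUMENT.format(body=body))


def generate_corpus(out_dir, count, seed=0, n_paragraphs=5):
    """Create count random documents in out_dir and return their paths."""
    rng = random.Random(seed)  # fixed seed so the corpus is reproducible
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i in range(count):
        path = out / f"doc_{i:05d}.docx"
        write_docx(path, [random_paragraph(rng) for _ in range(n_paragraphs)])
        paths.append(path)
    return paths
```

Calling `generate_corpus("clean_docs", 10_000)` would produce a set of the size described above (the folder name is made up). The files this sketch writes will be much smaller than the ones in the real dataset, since they carry no styles or metadata.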
The next thing I will do is get these files into the HDF5 format from Python, which data scientists have been preaching about for a while. Each file I generated is around 12 KB, making it manageable for me to work with. These clean working documents will serve as my positive set. I will also copy them to the corrupting flash drive to ensure that they go bad. I have a theory that not all of them will come back unreadable, but hey, that's how things work in real life: data does not usually come back super clean.
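Since some copies may survive the trip to the flash drive, each one needs a label before any learning can happen. One way to get those labels, sketched here with the standard library only, is to compare a checksum of each clean file against its copy read back from the drive. This is an assumption about how the labelling could be done, not something already built. The function names, the `"intact"` and `"corrupt"` labels, and the idea that both folders hold files with matching names are all hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def label_copies(clean_dir, flash_dir, pattern="*.docx"):
    """Map each file name to 'intact' or 'corrupt'.

    A copy counts as corrupt when it is missing, cannot be read,
    or its bytes differ from the clean original.
    """
    labels = {}
    for clean in sorted(Path(clean_dir).glob(pattern)):
        expected = sha256_of(clean)
        copy = Path(flash_dir) / clean.name
        try:
            same = sha256_of(copy) == expected
        except OSError:
            same = False  # missing or unreadable copy
        labels[clean.name] = "intact" if same else "corrupt"
    return labels
```

A byte-for-byte comparison is stricter than asking whether Word can still open the file, so it may flag copies that are damaged but still readable. Whether that is the right definition of "corrupt" for this project is an open choice.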
Quick answers:
Is this code on GitHub?
: Yes, it is here: https://github.com/e911miri/word-generator
Who is the mystery lady behind this project?
: Well, that's classified :p
Do you intend to complete this?
: I don't know. It sounds like fun, so I am taking it up.
Where can I learn more about machine learning?
: There is an amazing course on Udacity.