Known Unknowns

As our circle of knowledge expands, so does the circumference of darkness surrounding it.
-Albert Einstein

Quantifying Ignorance

Ever since I was a child I have had a dream, albeit a naïve one, to know everything. Obviously, I am not the most brilliant person imaginable, but what it has resulted in is an extreme generalist skill set. I have a BA as well as an MSc, I have worked in public and the private sector, I have done very manual jobs and new millennial jobs like quantitative research. I like to think of myself as a jack of all trades and master of none.

Marshall McLuhan once wrote that violence is a quest for identify. He predicted, quite rightly, that as the ‘global village’ grows people will realise that the little things that they feel make them unique are not so unique, resulting in a need to redefine or destroy others to make themselves ‘personalised.’ I mention this because I feel I am on that same quest, sans destorying people. Every time I complete an analysis, learn a skill, get a job or promotion social media confronts me with people with much more success, knowledge, experience who are achieving even greater things.

At this stage I have two options: Curl up in my bed and say “what’s the point, I will never be the best.” Or I can get up every day and keep trying to make myself better and painfully recognize my limitations and try to overcome them. Iperfer to do the later.

I am also a long term planner. I like to work in five year plans that include personal, professional and financial goals. It helps keep me focused and is usually a five-year projection with a slight ‘push’ goal. If I say I will normally save £100 a month, I force myself to save £110. If I say I will get a pay increase of 2% I fight for 3%. So I am always pushing, but realistically so I both achieve and avoid disappointment. It has been the same with my personal goals. For the past 10 years, becoming an intelligence analyst was that goal. In the process, I acquired more skills and seniority than the traditional role requires. Because of my own past success I have been forced to re-evaluate my future. The conclusion I reached is to pursue a career in Data Science. To achieve that goal I need a road map.

The name of this blog known unknowns, is a infamously mocked, but poorly understood intelligence concept. However known unknowns are at the heart of what the intelligence community does. They are important because we need to understand what we do not know, which is just as important to decision making as what we do know. A blindly confident decision on half data is still a failure even if you trip into success becasue we have learned nothing and cannot replicate that result. So a known unknown implies two states of knowledge; a theoretical level of information necessary to know EVERYTHING about situation, and what you do know. In the gap is uncertainty that compromises our ability to achieve goals within the that situation. We want that risk of failure to be as low as possible.

So here I am, a dream to be a data scientist and an intelligence problem of creating a road map to get there in an uncertain environment. 500 words later I get to the point of the post. Is there a way I can quantify what I don’t know about data science that will help me making a proper learning plan? Can I quantify my own ignorance? Let’s try!

Method

As above, I need an external and objective criterion to know what the entirety of the problem is. For this I selected Uducity’s Data Career Skills Guidance. For those unfamiliar, Uducity a skills training platform for the digital economy. They put out great stuff on what employers of digital and tech industry are doing and looking for in employees. Udcity provides 28 skills in 10 categories they recommend a data scientist should have. I separated some of the skills out such as R/Python to R and Python. So I ended up with 36 skills.

Category Skill
Programming and Tools Spreadsheets
Programming and Tools SQL
Programming and Tools R
Programming and Tools Python
Programming and Tools Jupyter
Statistics Descriptive
Statistics Inferential
Statistics A/B Testing
Mathematics Algebra
Mathematics Notation
Data Wrangling Data Cleaning
Data Wrangling Data Blending
Data Wrangling Data Transforming
Data Wrangling Data Formatting
Data Wrangling Relational Databases (SQL)
Communication and Data Visual Programming
Communication and Data Visual Dashboarding
Data Intuition Asking Questions
Data Intuition Becoming Subject Matter Experts
Data Intuition Business and product things
Data Intuition Practice!
Machine Learning Supervised Learning
Machine Learning Unsupervised Learning
Software Skills Coding Testing. Debugging
Software Skills Version Control
Software Skills Model Deployment
Software Skills Data at Scale
Advanced Mathematics Linear Algebra
Advanced Mathematics Calculus
Experiment Design Controlling variables and choosing good control and testing groups
Experiment Design Sample Size and Power law
Experiment Design Confidence level
Experiment Design SMART experiments: Specific, Measurable, Actionable, Realistic, Timely
Experiment Design Bayesian Statistics
Experiment Design Bootstrapping
Experiment Design Simulation

Now I need to apply myself to these areas with some kind of grading option. I am considering three solutions:

Binary 1/0- You have the skill or not. But this doesn’t give any nuances

Continuous Scale – 1 to n with the numbers providing a level of expertise. More nuances without a great deal of complexity

I have arbitrarily chosen a continues scale of 1 to 5. My fear of using binary or a 1 to 3 scale is that there will not be enough nuances to make decisions about my level knowledge and most items would either be 1 in binary or 2 in continuous scale. Anything higher than five will be too nuanced to draw comparative insight from. For defining the scores I tried to create a non-bias system for grading. I choose to use the amount of help I would need to apply the skill. Such as:

1- Could not perform on my own even with help

2- Would require assistance or training to implement

3- With self study/ guidance, could understand and apply

4- Likely apply with minimal assistance

5- Comfortable to perform on my own

I also added a second criteria which is my assessment of confidence in that skill as per the guidance of Applied Information Economics Estimative Calibration by Douglas Hubbard. This gives some check to my subjective skills assessment.

1- 1% to 20% Confidence

2- 21% to 40% Confidence

3- 41% to 60% Confidence

4- 61% to 80% Confidence

5- 81% to 100% Confidence

The intersection between the two points will also help me in understanding any over or under confidence in my skill set that will affect a learning plan. My completed dataset looks like:

Category Skill Knowledge Confidence
Programming and Tools Spreadsheets 5 5
Programming and Tools SQL 3 4
Programming and Tools R 4 3
Programming and Tools Python 4 3
Programming and Tools Jupyter 2 4
Statistics Descriptive 4 3
Statistics Inferential 4 3
Statistics A/B Testing 1 4
Mathematics Algebra 4 4
Mathematics Notation 4 4
Data Wrangling Data Cleaning 4 5
Data Wrangling Data Blending 4 5
Data Wrangling Data Transforming 4 4
Data Wrangling Data Formatting 4 4
Data Wrangling Relational Databases (SQL) 3 4
Communication and Data Visual Programming 4 3
Communication and Data Visual Dashboarding 5 5
Data Intuition Asking Questions 5 4
Data Intuition Becoming Subject Matter Experts 5 4
Data Intuition Business and product things 4 4
Data Intuition Practice! 4 2
Machine Learning Supervised Learning 4 4
Machine Learning Unsupervised Learning 3 4
Software Skills Coding Testing. Debugging 3 5
Software Skills Version Control 3 5
Software Skills Model Deployment 2 5
Software Skills Data at Scale 1 5
Advanced Mathematics Linear Algebra 3 3
Advanced Mathematics Calculus 4 3
Experiment Design Controlling variables and choosing good control and testing groups 3 5
Experiment Design Sample Size and Power law 3 5
Experiment Design Confidence level 4 3
Experiment Design SMART experiments: Specific, Measurable, Actionable, Realistic, Timely 4 5
Experiment Design Bayesian Statistics 4 4
Experiment Design Bootstrapping 2 4
Experiment Design Simulation 3 4

Analysis

The first thing we can look at is a radar graph that compares about my skill to the ‘idyllic’. To do this I first aggregated my scores board category. I then calculated the maximum score I could achieve in each category using our 1 to 5 scale and overlaid that on the graph.

While this graph is interesting we need to hone in on areas of weakness for an individual learning plan. We can do that by looking at the percentage I scored against the total points available.

Category My Score Total Score Percent
Advanced Mathematics 7 10 70.00
Communication and Data Visual 9 10 90.00
Data Intuition 18 20 90.00
Data Wrangling 19 25 76.00
Experiment Design 23 35 65.71
Machine Learning 7 10 70.00
Mathematics 8 10 80.00
Programming and Tools 18 25 72.00
Software Skills 9 20 45.00
Statistics 9 15 60.00

So in what board categories am I under developed? It looks like Statistics and Software Skills and experimental design are under 70%. I do find Statistics is a bit surprising. Let’s go deeper into the skills again.

Category Skill Knowledge Confidence
8 Statistics A/B Testing 1 4
27 Software Skills Data at Scale 1 5
26 Software Skills Model Deployment 2 5
35 Experiment Design Bootstrapping 2 4
24 Software Skills Coding Testing. Debugging 3 5
25 Software Skills Version Control 3 5
30 Experiment Design Controlling variables and choosing good control and testing groups 3 5
31 Experiment Design Sample Size and Power law 3 5
36 Experiment Design Simulation 3 4
6 Statistics Descriptive 4 3
7 Statistics Inferential 4 3
32 Experiment Design Confidence level 4 3
33 Experiment Design SMART experiments: Specific, Measurable, Actionable, Realistic, Timely 4 5
34 Experiment Design Bayesian Statistics 4 4

Okay, so statistics is lowered by my lack of experience in A/B Testing (A topic I know on paper, but have never had to do), but in terms of more traditional statistics I am alright. For software skills, I am being let down by my lack of ‘scale’ experience. Point taken! Any company I have worked for was desktop only. And for experimental design it is a mixed bag. In some areas, I look great and in others, like bootstrapping and simulation I need some work.Great so we now have a priority list of items to tackle in our learning plan!

Let’s do one more check though. I want to make sure that I am not being to over confident about my skills. So what do the scores look like if we scatter plot them and add a linear trend line?

Okay, this is interesting. There appears to be a slight negative relationship between my knowledge and confidence. If I believe I have little knowledge of something I am confident in that assessment, but as I feel my knowledge increases my confidence in that assessment goes down. So if anything it appears I am under confident in my own abilities. #humble

Conclusion

The tagline for this blog is Albert Einstein’s quote;

As our circle of knowledge expands, so does the circumference of darkness surrounding it.

Being under confident means I am at least internally consistent with that belief. Obviously there are problems with my analysis. Having a more objective knowledge score would be a big help and would negate the need for a confidence score. There are also many skills I do possess that are necessary for a data scientist that do not appear in this list such as; communication and presentation skills, and my experience in intelligence that is currently summed up here as ‘Asking Questions.’ This is as far as I have gotten. What do you think of my approach? Is there any other way I am quantify what I don’t know? My plan is to make is a project which I return to from time to time, update the data and keep exploring.