In my last post, I lamented the potential for advanced technologies to reinforce existing biases more quickly and at a larger scale, without our awareness or consent.

The issue of algorithmically detecting gender, which I briefly mentioned, is one I’ve been struggling with this semester. Inferring gender can be crucial for revealing gaps in representation at publications and harmful disparities in behavior where diversity data is not provided, yet the act of classification itself can be morally and socially harmful.

Nathan Matias is a Ph.D. student at the neighboring Civic Media Lab who has been a great ally, both in softening the steep learning curve of grad school and in tackling gender- and identity-related projects. His master’s thesis work on Open Gender Tracker has led to many great posts on the Guardian’s Datablog.

Like a truly great friend, he not only answered my questions about ethically detecting gender, he wrote a comprehensive blog post about it. He highlights the practical realities of keeping work both computationally effective and ethically responsible. It’s a must-read for anyone considering work in this area. Thank you, Nathan!


· · · ◊ ◊ ◊ · · ·

At the beginning of the seminar I’m taking on Race and Racism, we examined the historical origins of race, smirking at 19th-century attempts to neatly classify and digest mankind into square boxes of color: things like Blumenbach’s five categories (Caucasian, Mongolian, Ethiopian, American, Malay) and his offensive descriptions of the capabilities of each, and marveling at how misguided they were, how clearly politically incorrect they seem to us now.

One image stands out in particular: a chart with the races along the left axis and, for each category of human competence, a marking, such that by summing up the checked boxes one might literally rank humans against one another by racial category. The White man wins, in every grouping. It’s a glaring embodiment of early racist ideology, printed on paper and long since dismissed. It’s hard to imagine a chart like that printed in a 21st-century publication; the author would be fired, no doubt.

Yet the chart seemed eerily familiar to me. At first I couldn’t place it; I hadn’t seen it before. Then I realized: what it reminded me of was my own education, not in sociology or anthropology, but in Computer Science. As a graduate student focusing on Machine Learning and its social applications, the very core of many of the algorithms I study and write is statistical classification, a major topic of research. What we are doing now, what Google does with its personalized search and what Facebook does with its unsettlingly accurate ads, is automating the same thought process as those 19th-century anthropologists. We look at a person, or rather her technological imprint on the web: the traces she leaves, her purchase history, friends, and log-ins. Then we ask, “What sort of human being are you?” And then, using our charts and tables, no longer printed ones but weights on variables in our algorithms and databases of records, we classify her, and afterwards, yes, we rank her (sometimes we rank her first and classify her second).

Users are commonly classified by gender, or by what a machine predicts their gender to be, often quite accurately, and yes, different genders are ranked very differently. In the goods-driven world of online advertising, a woman is valued differently than a man depending on the product, sometimes more, sometimes less. This entire pipeline is problematic, to say the least: first in the binary classification of gender, and then in the ranking of differently gendered individuals against one another. But our algorithms only reflect the constructs already driving these approaches.

Algorithms don’t quite yet classify users by race, or at least not out loud, because race is such a charged issue in most countries. But an algorithm that doesn’t label a category outright can still profile users. Algorithms, which learn from history much like humans do, reinforce existing social constructs wherever they are used, because that is what they do: digest data, find a pattern, and make predictions according to that pattern. Harvard Professor Latanya Sweeney discovered that searches for racially associated names disproportionately triggered targeted ads for criminal background checks and records. These ads go beyond merely offensive, because targeting is reinforced by user behavior: if I click on that ad, I tell the machine that its targeting was effective, reinforcing prejudiced thinking. Automated selection processes that are beginning to gain popularity, such as automated school admissions and credit ratings, could have real, harmful, physical ramifications.

This isn’t to say there is anything inherently “evil” in Machine Learning itself; it’s a fascinating field of study, and it could become a major tool in public health, disaster relief, and poverty alleviation. The machine is doing nothing new; it is merely a force multiplier of human behavior. I believe classification is core to human cognition: we label others upon contact, always. Whether we like it or not, there is no way to escape being classified and classifying others. It’s impossible to meet someone without assigning some kind of underlying worth to them; it sounds ugly said out loud, but our classifications are essentially value-laden in order to be useful.

Ultimately, our machines only reflect ourselves: it is vital to realize that computers are human, raised on human values, and that there is no such thing as objective computation. The question that remains is: what kind of value systems will we feed our algorithms?


· · · ◊ ◊ ◊ · · ·

twitter X MIT

01 Oct 2014

we’re launched!


· · · ◊ ◊ ◊ · · ·

The slides from Democratizing Data Science, the vision paper that William, Ramesh, and I presented at KDD @Bloomberg on Sunday, are now available online.

What a great first conference experience! Really interesting speakers and projects all around.

Take part in the conversation by tweeting at us (@mpetitchou, @tweetsbyramesh, @williampli) or putting your own opinions and experiences out there.


· · · ◊ ◊ ◊ · · ·

Guys! Guys! Guess what. Even though I’m practicing my April Ludgate glare in real life, today I’m going to be more like this. Why?

I co-wrote my first paper with two cool cats at MIT CSAIL, William Li and Ramesh Sridharan, and it got accepted to the KDD Conference as a highlight talk!

That means next Sunday, August 24th, you can hear me taco ‘bout it in real life at 11am in the Bloomberg Building, 731 Lexington Avenue, NY, NY.

The theme of this year’s conference is “Data Mining for Social Good”, and our paper is a short vision statement on effecting positive social change with data science. We briefly define “Data Science”, ask what it means to democratize the field, and consider to what end that might be achieved. In other words, current applications of Data Science, a new but growing field, have the potential for great social impact in both research and industry, but in reality, resources are rarely distributed in a way that optimizes the social good.

The conference on Sunday at Bloomberg is free, and the line-up looks promising. There are three “tracks” going on that morning: “Data Science & Policy”, “Urban Computing”, and “Data Frameworks”. Ours is in the third track. Sign up here!

For the full text of the paper, click here.


· · · ◊ ◊ ◊ · · ·

I’ve compiled a short list of resources on Sentiment Analysis, especially as applied to (political) debates. Check it out on the Govlab blog.


· · · ◊ ◊ ◊ · · ·


Hey nerds!
Check out this cool model my friend Andy and I developed at Knewton last summer!

 


· · · ◊ ◊ ◊ · · ·

One should usually not take advice on modeling from a 5-foot-tall nerdy Asian girl, that is, unless it’s Data Modeling we’re talking about. (Whether or not you should take my advice then is up to your own discretion.) This summer, I’m interning at an education technology company, Knewton, where I have the great opportunity to model student behavior with real data. I’m learning a ton about what it takes to be a Data Scientist, although what concerns me more is the Scientist part, since the term is a bit of a buzzword anyway. Along the way, I figured I’d share some tidbits of knowledge with you. This post is specifically targeted at non-technical people: my goal is to explain things clearly enough that anyone with a healthy curiosity can follow along. I’ll focus on examples and fun demos, since that’s how I learn best. Feedback appreciated!

WHAT IS A MARKOV CHAIN?
A Markov chain, named after this great moustached man, is a mathematical system composed of a finite number of discrete states and transitions between those states, described by their transition probabilities. The most important thing about a Markov Chain is that it satisfies the Markov Property: each state depends only on the state directly preceding it* and no others. This independence assumption makes a Markov Chain easy to manipulate mathematically. (*This is a Markov Chain of degree 1, but you could also have a Markov Chain of degree n, where we look at only the past n states.) A Markov Chain is a specific kind of Markov Process, one with discrete states.

A VISUAL
That’s a lot of words for a concept that is in fact very simple. Here’s a picturesque example instead:

Imagine that you are a small frog in a pond of lily pads. The pond is big but there are a countable (discrete) number of lily pads (states). You start on one lily pad (start state) and jump to the next with a certain probability (transition probability). When you’re on one lily pad, you only think of the next one to jump to, and you don’t really care about what lily pads you’ve jumped on in the past (memoryless).

That’s all!
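To make the lily-pad picture concrete, here is a minimal Python sketch. It’s not from the original post, and the pond and its transition probabilities are made up; it just simulates a frog whose every hop depends only on the pad it is currently sitting on (the Markov Property).

import random

# Transition probabilities for a tiny pond with three lily pads (made-up numbers).
# transitions[pad] maps each possible next pad to the probability of hopping there.
transitions = {
    'A': {'A': 0.1, 'B': 0.6, 'C': 0.3},
    'B': {'A': 0.4, 'B': 0.2, 'C': 0.4},
    'C': {'A': 0.5, 'B': 0.5},
}

def hop(pad):
    # The next pad depends only on the current pad: the Markov Property.
    pads = list(transitions[pad])
    weights = list(transitions[pad].values())
    return random.choices(pads, weights=weights)[0]

pad = 'A'                  # start state
path = [pad]
for _ in range(10):        # ten hops
    pad = hop(pad)
    path.append(pad)
print(' -> '.join(path))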

WHY DO WE USE IT?
Markov Chains have many, many applications. (Check out this Wikipedia page for a long list.) They’re useful whenever we have a chain of events, or a discrete set of possible states. A good example is a time series: at time 1, perhaps student S answers question A; at time 2, student S answers question B, and so on.
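As a rough sketch of how you might fit such a chain to a sequence of events (to be clear, this is not Knewton’s model, and the sequence below is invented), you can estimate the transition probabilities simply by counting which state follows which in the observed data:

from collections import Counter, defaultdict

# An invented sequence of question topics a student answered, in order.
sequence = ['algebra', 'algebra', 'geometry', 'algebra', 'geometry', 'geometry', 'algebra']

# Count how often each state is immediately followed by each other state.
counts = defaultdict(Counter)
for current, nxt in zip(sequence, sequence[1:]):
    counts[current][nxt] += 1

# Turn the counts into transition probabilities.
probabilities = {
    state: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
    for state, followers in counts.items()
}
print(probabilities)  # estimated transition probabilities for each state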

A RANDOM TEXT GENERATOR
Now, for the fun part!

For Knewton’s company hack day, I built a text analysis “funkit” that can perform a variety of fun things given an input text file (corpus). You can clone the source code here. Don’t worry if the word “cloning” sounds very sci-fi; the README I’ve written (residing at that link) has detailed instructions on how to use the code. As long as you have Python installed on your computer (Macs come with it pre-installed), you should be fine and dandy.

What we’re most interested in is the parrot() function. This is the “Markov Chain Babbler”, or Random Text Generator, that mimics an input text. (Markov Chain Babblers are used to generate Lorem Ipsums, i.e. text fillers, such as this wonderful Samuel L. Ipsum example.)

Included are a few of my favorite sample “corpuses” (a scary word for sample texts), taken from Project Gutenberg:

“memshl.txt”, the complete Memoirs of Sherlock Holmes
“kerouac.txt”, an excerpt from On the Road
“aurelius.txt”, Marcus Aurelius’ Meditations
and finally, “nietzsche.txt”, Nietzsche’s Beyond Good and Evil.

Here’s a prime snippet of text generated using the Nietzsche corpus, of length 100, one of my favorites:

“CONTEMPT. The moral physiologists. Do not find it broadens and a RIGHT OF RANK, says with other work is much further than a fog, so thinks every sense of its surface or good taste! For my own arts of morals in the influence of life at the weakening and distribution of disguise is himself has been enjoyed by way THERETO is thereby. The very narrow, let others, especially among things generally acknowledged to Me?.. Most people is a philosophy depended nevertheless a living crystallizations as well as perhaps in”

Despite being “nonsense”, it captures the essence of the German philosopher quite well. If you squint a little, it doesn’t take much imagination to see this arise from the mouth of Nietzsche himself.

Here’s some Kerouac text, too:

“Flat on a lot of becoming a wonderful night. I knew I wrote a young fellow in the next door, he comes in Frisco. That’s rights. A western plateau, deep one and almost agreed to Denver whatever, look at exactly what he followed me at the sleeping. He woke up its bad effects, cooked, a cousin of its proud tradition. Well, strangest moment; into the night, grand, get that he was sad ride with a brunette. You reckon if I bought my big smile.”

HOW IT WORKS
Parakeet generates text using a simple degree-1 Markov Chain, just like the one described above. Let’s break it down:

1. We read the input file and “tokenize” it, in other words, break it up into words and punctuation.
2. Now, for each word in the text, we store every next word that follows it. We do this using a Python dictionary, a.k.a. a hash table.

For example, if we have the following sentence,
“the only people for me are the mad ones, the ones who are mad to live, mad to talk, mad to be saved, desirous of everything at the same time”

We have the following dependencies:

{',': ['the', 'mad', 'mad', 'desirous'],
 'are': ['the', 'mad'],
 'at': ['the'],
 'be': ['saved'],
 'desirous': ['of'],
 'everything': ['at'],
 'for': ['me'],
 'live': [','],
 'mad': ['ones', 'to', 'to', 'to'],
 'me': ['are'],
 'of': ['everything'],
 'ones': [',', 'who'],
 'only': ['people'],
 'people': ['for'],
 'same': ['time'],
 'saved': [','],
 'talk': [','],
 'the': ['only', 'mad', 'ones', 'same'],
 'to': ['live', 'talk', 'be'],
 'who': ['are']}

Note that in this naive (not space-efficient) implementation of the text generator, when next words occur multiple times, for example
'mad': ['ones', 'to', 'to', 'to'], we store one entry per occurrence rather than a count.

3. Now the fun part. Say we want to generate a paragraph of 100 words. First, we randomly choose a start word, that is, a capitalized word. Then we randomly choose a next word from its list of next words (since frequent next words appear as many duplicates, they get chosen more often), and from that word we continue the process until we have a paragraph of length 100.
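Here is a minimal sketch of all three steps in Python. To be clear, this is my simplified reconstruction, not the actual parrot() code: the tokenizer regex, the fallback when a word has no successors, and the fallback when no capitalized start word exists are my own assumptions.

import random
import re

def tokenize(text):
    # Step 1: break the text into words and punctuation marks.
    return re.findall(r"[\w']+|[.,!?;]", text)

def build_chain(tokens):
    # Step 2: for every word, store each word that ever follows it.
    # Duplicates are kept on purpose, so frequent successors get picked more often.
    chain = {}
    for current, nxt in zip(tokens, tokens[1:]):
        chain.setdefault(current, []).append(nxt)
    return chain

def babble(chain, length=100):
    # Step 3: pick a random capitalized start word (falling back to any word),
    # then repeatedly hop to a random successor of the current word.
    starts = [w for w in chain if w[:1].isupper()] or list(chain)
    word = random.choice(starts)
    output = [word]
    while len(output) < length:
        followers = chain.get(word)
        word = random.choice(followers) if followers else random.choice(starts)
        output.append(word)
    return ' '.join(output)

corpus = ("the only people for me are the mad ones, the ones who are mad to live, "
          "mad to talk, mad to be saved, desirous of everything at the same time")
print(babble(build_chain(tokenize(corpus)), length=30))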

WHY IS THIS A MARKOV PROCESS?
Well, when we build up our paragraphs, we choose each next word based only on the options stored for our current word. In doing so, we ignore the history of words we have already chosen (which is why many of the sentences are nonsensical), yet since each next word is a plausible continuation of the current one, we end up with something that emulates the original writing style (chaining).

SOURCE CODE:
Check out my code on GitHub, located here. Simply fire up your terminal and type “git clone the-url”, and it will copy the repo into a directory on your local machine. Further instructions are in the README.


· · · ◊ ◊ ◊ · · ·
