SELECT *
FROM (
  -- Word counts in top-level comments on the 2017 "Who is hiring?" threads
  SELECT word, COUNT(*) c
  FROM (
    SELECT SPLIT(REGEXP_REPLACE(LOWER(text), r'[^a-z]', ' '), ' ') words
    FROM `bigquery-public-data.hacker_news.full`
    WHERE parent IN (
      SELECT id
      FROM `bigquery-public-data.hacker_news.full`
      WHERE title LIKE 'Ask HN: Who is hiring?%2017%'
    )
  ), UNNEST(words) word
  -- Drop one-letter tokens, except 'r' (the language)
  WHERE LENGTH(word)>1 OR (word='r')
  GROUP BY 1
  HAVING c>30
) a JOIN (
  -- Per desired language: respondent count and % who answered 'Female'
  SELECT LOWER(WantWorkLanguage) language, COUNT(*) responses,
    ROUND(100*COUNTIF(v='Female')/COUNT(*), 2) perc_female
  FROM (
    SELECT SPLIT(WantWorkLanguage , '; ') WantWorkLanguage, Gender v
    FROM `fh-bigquery.stackoverflow.survey_results_public_2017`
    WHERE WantWorkLanguage!='NA' AND Gender!='NA'
  ), UNNEST(WantWorkLanguage) WantWorkLanguage
  GROUP BY 1
  HAVING responses>2000
) b
ON a.word=b.language
It would be even better if you could run it against a summary of appearances in job postings at major job sites. That would probably represent more of the market women are actually working in. If they don't publish the actual postings, they might have summary data on the languages that you can use.
Let's see. What if I take all mentions of each language in HN's "Who is hiring?" threads vs. the % of women interested in each language?
There is a correlation!
Chart:
- http://i.imgur.com/mcN6Ghz.png
If we take out the 2 outliers (r, go), the correlation is 0.61.
(caveat: "go" is an overloaded word)
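For anyone who wants to reproduce the "take out the outliers" step: the query above only joins the two tables, so the correlation has to be computed separately. Here is a minimal Python sketch, assuming the joined rows have been exported as (word, mentions, perc_female) tuples. The function names `pearson` and `correlation_excluding` are made up for illustration, and I'm assuming the 0.61 figure is a plain Pearson coefficient, which the comment doesn't actually state.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def correlation_excluding(rows, exclude=()):
    """Correlate mentions with perc_female, skipping the words in `exclude`.

    `rows` is an iterable of (word, mentions, perc_female) tuples, i.e. the
    assumed shape of the joined query output (columns word, c, perc_female).
    """
    kept = [(mentions, perc) for word, mentions, perc in rows
            if word not in exclude]
    return pearson([m for m, _ in kept], [p for _, p in kept])
```

Calling `correlation_excluding(rows, exclude={"r", "go"})` on the exported rows would give the outlier-free number, and calling it with no `exclude` gives the figure for all languages. BigQuery's `CORR()` aggregate could probably do the same thing inside the query itself.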