Behind the title – Using Word Embeddings to Measure Cultural Associations of Occupations

Click here to see interactive visualization of results

Motivation

Engineers, hairdressers, doctors, cleaners: individuals encounter a wide variety of occupations in their everyday lives. When we come across an occupation, certain images may come to mind. For instance, we may think of some occupations, such as nurses, as feminine and of others, such as mechanics, as masculine, or we may think of consultants as selfish and teachers as kind. These images may depend on our own experiences with members of those occupations, but crucially they are also informed by the culture around us. According to Patterson (2014), culture can be understood as a shared structure of meanings. We can rely on this structure to understand the world around us, categorize others, and inform our actions (Cerulo et al., 2021; DiMaggio, 1997; Lizardo, 2017). Thus, we may think of nurses as feminine because the association between nurses and femininity is embedded in our culture. However, nurses are not simply characterized by their femininity, but also by their cultural association with other concepts such as ability, likeability, ethnicity, or social status. Similarly, other occupations also have a wide set of cultural associations. Since these cultural associations relate to how occupations are perceived, they also have wider implications for inequality. Examples of how cultural associations of occupations contribute to inequality include the devaluation of more feminine occupations (Busch, 2018; Grönlund & Magnusson, 2013), the justification of unequal earnings between occupations via associations with merit and success (Valentino & Vaisey, 2022), and the way cultural associations with status and gender contribute to the segregation of occupations along gender and class lines (Cejka & Eagly, 1999; Cochran et al., 2011; Fang & Tilcsik, 2022). It is therefore relevant to capture these cultural associations to better understand social inequalities. However, successfully capturing them has proven difficult for previous research.

Some studies rely on proxies such as the share of female workers in a given occupation (Busch, 2018; Busch-Heizmann, 2015); however, such proxies may be only partially indicative of an occupation’s cultural associations. Other researchers have relied on surveys to derive what is associated with a given occupation (Fang & Tilcsik, 2022; Friehs et al., 2022; He et al., 2019). While this is a more direct approach, it is limited because it may only capture the cultural associations respondents are willing to express, and only for a restricted set of occupations. Furthermore, survey samples may not be representative and thus may not accurately depict culture as embedded in wider society. To overcome these issues, I propose to use word embeddings. This method uses large text corpora and machine learning to represent words and their semantic relations numerically (Mikolov et al., 2013). As proposed by Stoltz and Taylor (2021), word embeddings can be used as cultural maps. This project therefore made use of them and sought to determine how successfully they can capture the cultural associations of occupations.

Word embeddings as cultural maps

Cultural sociologists have long been interested in quantifying meanings and culture (Ghaziani & Baldassarri, 2011; Hoffman, 2019; Mohr, 1998; Mohr & Duquenne, 1997). This line of research is based on structuralist theory. According to structuralism, words do not have an internal meaning, but rather their meaning emerges through the relations they have to other words, i.e. how they are used together with other words (Arseniev-Koehler, 2022; Mohr, 1998). This means that two words will have a similar meaning if their context (the words they co-occur with) is similar.

Word embeddings build on this idea and can thus serve as a powerful tool to quantify meanings and, by extension, culture. Concretely, in word embeddings, each word is assigned a multidimensional vector with the help of large textual datasets and machine learning algorithms. The algorithms optimize the word vectors so that a missing word can be predicted correctly given a set of surrounding words, that is, given its context (Arseniev-Koehler, 2022; Mikolov et al., 2013). As a result, two words will have similar vectors if they appear in similar contexts. Since vectors can also be understood as coordinates, word embeddings can be understood as cultural maps in which words are positioned based on their semantic relationships to each other (Stoltz & Taylor, 2021). The advantage of such a map is that we can use vector geometry to derive the relationship between words in terms of their meanings. A typical example of this is analogy solving. Let’s say we wanted to solve the analogy ‘man is to woman as king is to what?’ (see Kozlowski et al., 2019). In a word embedding space, we can capture the juxtaposition between man and woman by drawing a line from man to woman, which corresponds to the operation \(\vec{woman}-\vec{man}\). To solve the analogy, we need to find the word that, together with king, forms the same juxtaposition as woman and man. We can therefore add our juxtaposition line to king, and the word whose vector lies closest to the result is queen: \(\vec{king}+(\vec{woman}-\vec{man})\approx\vec{queen}\).
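The analogy arithmetic can be illustrated with a minimal sketch. The vectors below are hand-made toy values, not real embeddings, and the function names `cosine` and `solve_analogy` are hypothetical helpers introduced only for this illustration. The candidate closest to the computed target is picked by cosine similarity, with the three query words excluded, as is common practice.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product of u and v divided by their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def solve_analogy(vectors, a, b, c):
    """Return the word d that best completes 'a is to b as c is to d'."""
    # Offset from a to b, added to c: c + (b - a)
    target = [vc + (vb - va)
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the three query words from the candidates
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Hand-made toy vectors (NOT real embeddings).
# Axes loosely stand for: gender, royalty, personhood.
toy_vectors = {
    "man":   [1.0, 0.0, 1.0],
    "woman": [-1.0, 0.0, 1.0],
    "king":  [1.0, 1.0, 1.0],
    "queen": [-1.0, 1.0, 1.0],
    "apple": [0.0, -0.2, -1.0],
}

answer = solve_analogy(toy_vectors, "man", "woman", "king")
print(answer)  # queen
```

In a real embedding space the sum rarely lands exactly on a word vector, which is why the nearest word is searched for rather than an exact match.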

Beyond analogy solving, word embeddings can also be used to determine the similarity between two words. Concretely, we can look at the angle between two vectors. This information is captured by the cosine similarity, i.e. the dot product of the two vectors after each has been scaled to unit length. The cosine similarity equals 1 if the two vectors point in the same direction, suggesting identical meanings; -1 if they point in opposite directions, suggesting opposed meanings; and 0 if they are perpendicular to each other, suggesting no relationship in meaning. Based on this, we may be interested in capturing gendered connotations of the nurse occupation, and so we may calculate the similarity between woman and nurse. The resulting value would give us some information on how close the two words are in meaning. However, we cannot use it to assess whether this similarity is due to gendered connotations of nurses, because it captures any possible (dis)similarities that exist between two words. Thus, a simple similarity between woman and nurse would also reflect the dissimilarities between the two words, such as one being a medical occupation and the other not. Instead, if we want to measure gendered associations, we need to see whether the word nurse is closer to woman or to man. For this, we can make use of the vector \((\vec{woman}-\vec{man})\). Since it is a line pointing from man to woman, we can also understand it as a gender dimension. By assessing the similarity between nurse and this gender dimension, we can assess whether the term nurse is closer to woman or to man and thus what its gendered connotations are (Garg et al., 2018). This idea of constructing a dimension between two vectors and assessing its similarity to other words is what allows me to assess the different cultural connotations of occupations, as I will outline below.
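As a rough sketch of this idea, the snippet below scores two words against a gender dimension using cosine similarity (the dot product of length-normalized vectors). All vectors are invented toy values, and the sign convention follows the ordering woman minus man used in the text, so positive scores lean toward woman and negative scores toward man.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product of u and v divided by their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hand-made toy vectors (NOT real embeddings).
# Axes loosely stand for: gender, medical, technical.
toy_vectors = {
    "woman":    [1.0, 0.1, 0.1],
    "man":      [-1.0, 0.1, 0.1],
    "nurse":    [0.4, 1.0, 0.0],
    "mechanic": [-0.4, 0.0, 1.0],
}

# Gender dimension: the line pointing from man to woman
gender_dim = [w - m for w, m in zip(toy_vectors["woman"], toy_vectors["man"])]

# Positive = closer to woman, negative = closer to man (under this ordering)
nurse_score = cosine(toy_vectors["nurse"], gender_dim)
mechanic_score = cosine(toy_vectors["mechanic"], gender_dim)

print(round(nurse_score, 3), round(mechanic_score, 3))
```

Reversing the subtraction (man minus woman) flips the sign of every score without changing the ranking, so the choice of positive pole is a matter of convention.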

Mapping Occupations and Their Cultural Associations

Approach

Based on the approach above, this project tried to capture a range of cultural associations that occupations have. For this, I mapped a wide range of detailed occupational titles in a word embedding space. I used the list of female and male occupational titles provided by the Swiss Federal Statistical Office. This list has the advantage of providing very detailed occupational titles, specifically the titles people report in surveys. At the same time, the list links the detailed titles to the International Standard Classification of Occupations (ISCO), allowing me to easily aggregate the titles by occupational group. I located these titles in a pre-trained word embedding space built from internet data gathered as part of the Common Crawl project as well as Wikipedia data (Grave et al., 2018).

While there are many dimensions one can focus on, I restrict myself to a set of potentially relevant dimensions:

  • Gender: Occupations are segregated along gender lines, and this may result in occupations being associated with masculinity or femininity.

  • Ethnicity: Similar to gender, occupations are also segregated based on ethnicity and thus have ethnicized associations.

  • Warmth and Competence: Psychological research has argued that stereotypes people hold about other groups can be characterized by dimensions of warmth (likeability, sociability) and competence (ability) (Cuddy et al., 2008; Fiske, 2018). This has also been argued to be the case for occupations (Friehs et al., 2022; He et al., 2019).

  • Excitement: Current research has increasingly pointed out that individuals wish to pursue occupations that spark joy in them and about which they are passionate (Cech, 2021). It may therefore follow that some occupations have stronger associations with being enjoyable and exciting than others.

  • Affluence, Cultivation, and Education: Occupations are strongly linked with one’s social class and social position. Thus, it is relevant to see how far this link translates into cultural associations. For this, I consider affluence, cultivation, and education, as they have been argued to be elements of one’s social class (Kozlowski et al., 2019; Savage et al., 2013).

To accurately capture the dimensions, I follow previous studies (Boutyline, 2017; Kozlowski et al., 2019) and use sets of word pairs for constructing them. Thus, rather than relying only on the word pair woman-man to capture gender, I also use additional pairs such as feminine-masculine or girl-boy, all of which aim to capture the same underlying gender distinction. By constructing multiple vectors that ideally describe the same dimension and averaging them, I aim to construct more accurate dimensions (Stoltz et al., 2023). To assess the adequacy of the chosen word pairs, I use the PairDist and 3Cosadd measures proposed by Boutyline and Johnston (2023). The two measures capture the extent to which the word pairs reflect the same underlying dimension. The word pairs were selected so that these values were as good as possible.
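The averaging step can be sketched as follows. This is only an illustration under stated assumptions: the vectors are invented two-dimensional toy values, `build_dimension` is a hypothetical helper name, and the difference vectors are averaged without any prior normalization, which may differ from the exact procedure used in the project.

```python
def build_dimension(vectors, word_pairs):
    """Average the difference vectors (first word minus second word) of all pairs."""
    diffs = [
        [a - b for a, b in zip(vectors[first], vectors[second])]
        for first, second in word_pairs
    ]
    n_pairs = len(diffs)
    # zip(*diffs) groups the i-th component of every difference vector
    return [sum(component) / n_pairs for component in zip(*diffs)]

# Hand-made toy vectors (NOT real embeddings)
toy_vectors = {
    "woman": [1.0, 0.2],
    "man":   [-1.0, 0.2],
    "girl":  [0.8, -0.3],
    "boy":   [-0.8, -0.1],
}

# Each pair is ordered the same way so that all differences point to the same pole
gender_pairs = [("woman", "man"), ("girl", "boy")]

gender_dim = build_dimension(toy_vectors, gender_pairs)
print([round(x, 2) for x in gender_dim])
```

Averaging several pairs dampens the idiosyncrasies of any single pair: here the second component, which is unrelated to gender, shrinks relative to the first.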


Occupations and individual dimensions

Once I had computed the association of all detailed titles with all dimensions, I averaged the male and female scores and aggregated them by ISCO unit group. ISCO codes consist of multiple hierarchical levels, of which the unit group is the lowest (see table). For example, detailed titles such as Graphic Designer, Illustrator, or Multimedia Designer were subsumed under the unit group ‘2166 Graphic and Multimedia designers’. Furthermore, following Boutyline et al. (2023), I z-standardized the results. The standardized scores show which occupations are more feminine, boring, warm, etc. when compared to other occupations.
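A minimal sketch of this aggregation pipeline is shown below. The scores are made-up numbers, the function name `aggregate_and_standardize` is hypothetical, and using the sample standard deviation for the z-scores is an assumption rather than a documented detail of the project.

```python
from collections import defaultdict
from statistics import mean, stdev

# (ISCO unit group, score of female title form, score of male title form)
# All numbers are invented for illustration.
title_scores = [
    ("2166", 0.10, 0.14),
    ("2166", 0.06, 0.10),
    ("3222", -0.30, -0.22),
    ("7110", 0.25, 0.31),
    ("7110", 0.19, 0.25),
]

def aggregate_and_standardize(rows):
    """Average female/male scores per title, then per unit group, then z-standardize."""
    by_group = defaultdict(list)
    for group, female_score, male_score in rows:
        by_group[group].append((female_score + male_score) / 2)
    group_means = {group: mean(scores) for group, scores in by_group.items()}
    overall_mean = mean(list(group_means.values()))
    overall_sd = stdev(list(group_means.values()))  # sample SD (assumption)
    return {g: (m - overall_mean) / overall_sd for g, m in group_means.items()}

z_scores = aggregate_and_standardize(title_scores)
for group, z in sorted(z_scores.items()):
    print(group, round(z, 2))
```

After standardization, a score only has meaning relative to the other unit groups: a value of 2 means two standard deviations above the average unit group, not a high association in absolute terms.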

To get an overview of what is being captured by the dimensions and the word embeddings, it makes sense to consider the occupations with the highest and lowest scores for each dimension. In what follows, I highlight some relevant points for selected dimensions.

Gender

Table 1 - Man (+) vs. Woman (-)

| Unit Group | Z-Score |
|---|---|
| Top 5 | |
| 3150 Ship and aircraft controllers and technicians | 3.64 |
| 7110 Building frame and related trades workers | 3.54 |
| 7319 Handicraft workers not elsewhere classified | 3.09 |
| 9613 Sweepers and related labourers | 2.63 |
| 5244 Contact centre salespersons | 2.47 |
| Bottom 5 | |
| 3222 Midwifery associate professionals | -4.21 |
| 1341 Child care services managers | -2.34 |
| 9111 Domestic cleaners and helpers | -2.33 |
| 2222 Midwifery professionals | -2.28 |
| 3251 Dental assistants and therapists | -2.19 |

The top and bottom 5 occupations seem to mostly align with expectations of what occupations we would consider feminine and masculine. Midwifery associate professionals score particularly high in terms of femininity. This seems reasonable considering that it is an occupation centered around women. Similarly, occupations pertaining to care work also have high femininity scores.

Cultivation and Affluence

Table 2 - Cultivated (+) vs. Uncultivated (-)

| Unit Group | Z-Score |
|---|---|
| Top 5 | |
| 8155 Fur and leather preparing machine operators | 3.56 |
| 9321 Hand packers | 2.69 |
| 4414 Scribes and related workers | 2.52 |
| 9310 Mining and construction labourers | 2.52 |
| 8159 Textile, fur and leather products machine operators not elsewhere classified | 2.34 |
| Bottom 5 | |
| 9610 Refuse workers | -3.13 |
| 9622 Odd job persons | -2.84 |
| 9311 Mining and quarrying labourers | -2.62 |
| 9510 Street and related service workers | -2.50 |
| 9611 Garbage and recycling collectors | -2.38 |

Table 3 - Rich (+) vs. Poor (-)

| Unit Group | Z-Score |
|---|---|
| Top 5 | |
| 1323 Construction managers | 2.51 |
| 3434 Chefs | 2.49 |
| 3132 Incinerator and water treatment plant operators | 2.01 |
| 7232 Aircraft engine mechanics and repairers | 1.99 |
| 3310 Financial and mathematical associate professionals | 1.95 |
| Bottom 5 | |
| 9622 Odd job persons | -5.76 |
| 5312 Teachers' aides | -3.59 |
| 9510 Street and related service workers | -3.51 |
| 7220 Blacksmiths, toolmakers and related trades workers | -3.40 |
| 9311 Mining and quarrying labourers | -2.99 |

Regarding these two dimensions of social class, the results show that the occupations scoring highest on cultivation and affluence are not necessarily occupations that we would expect to be highly cultivated or very affluent. There are two possible explanations. It could be that the dimensions I constructed for affluence and cultivation do not adequately capture those concepts. However, it could also be that occupations such as ‘Fur and leather preparing machine operators’ or ‘Construction managers’ score high on cultivation or affluence because they deal with matters associated with cultivation or affluence. Similar to how I argued that midwives have a strong association with femininity partly because theirs is a job concerned with women, ‘Fur and leather preparing machine operators’ may score high on cultivation because fur and leather goods have a strong association with cultivation.

Table 4 - Warm (+) vs. Cold (-)

| Unit Group | Z-Score |
|---|---|
| Top 5 | |
| 4229 Client information workers not elsewhere classified | 3.34 |
| 3222 Midwifery associate professionals | 2.89 |
| 9332 Drivers of animal-drawn vehicles and machinery | 2.87 |
| 5322 Home-based personal care workers | 2.63 |
| 5113 Travel guides | 2.55 |
| Bottom 5 | |
| 7544 Fumigators and other pest and weed controllers | -3.45 |
| 4213 Pawnbrokers and money-lenders | -3.10 |
| 9123 Window cleaners | -2.99 |
| 7124 Insulation workers | -2.89 |
| 9613 Sweepers and related labourers | -2.69 |

Table 5 - Competent (+) vs. Incompetent (-)

| Unit Group | Z-Score |
|---|---|
| Top 5 | |
| 9310 Mining and construction labourers | 3.23 |
| 9331 Hand and pedal vehicle drivers | 2.90 |
| 2650 Creative and performing artists | 2.53 |
| 2210 Medical doctors | 2.44 |
| 4414 Scribes and related workers | 2.34 |
| Bottom 5 | |
| 7215 Riggers and cable splicers | -3.43 |
| 9123 Window cleaners | -3.07 |
| 5312 Teachers' aides | -2.80 |
| 5244 Contact centre salespersons | -2.64 |
| 9121 Hand launderers and pressers | -2.57 |

Table 6 - Interesting (+) vs. Boring (-)

| Unit Group | Z-Score |
|---|---|
| Top 5 | |
| 7323 Print finishing and binding workers | 3.31 |
| 3222 Midwifery associate professionals | 2.22 |
| 2650 Creative and performing artists | 2.18 |
| 2424 Training and staff development professionals | 2.10 |
| 9331 Hand and pedal vehicle drivers | 2.08 |
| Bottom 5 | |
| 9510 Street and related service workers | -4.08 |
| 9622 Odd job persons | -3.96 |
| 5312 Teachers' aides | -3.42 |
| 9613 Sweepers and related labourers | -2.89 |
| 4410 Other clerical support workers | -2.85 |

Table 7 - Swiss (+) vs. Foreign (-)

| Unit Group | Z-Score |
|---|---|
| Top 5 | |
| 9331 Hand and pedal vehicle drivers | 3.77 |
| 2519 Software and applications developers and analysts not elsewhere classified | 2.40 |
| 4225 Enquiry clerks | 2.24 |
| 7311 Precision-instrument makers and repairers | 2.15 |
| 1111 Legislators | 1.97 |
| Bottom 5 | |
| 9311 Mining and quarrying labourers | -3.35 |
| 9622 Odd job persons | -3.31 |
| 9613 Sweepers and related labourers | -3.26 |
| 8112 Mineral and stone processing plant operators | -3.24 |
| 5169 Personal services workers not elsewhere classified | -3.21 |

Table 8 - Educated (+) vs. Uneducated (-)

| Unit Group | Z-Score |
|---|---|
| Top 5 | |
| 2341 Primary school teachers | 3.05 |
| 2300 Teaching professionals | 2.45 |
| 2352 Special needs teachers | 2.37 |
| 2210 Medical doctors | 2.35 |
| 5411 Fire-fighters | 2.31 |
| Bottom 5 | |
| 9123 Window cleaners | -3.54 |
| 9510 Street and related service workers | -3.31 |
| 9622 Odd job persons | -3.27 |
| 3150 Ship and aircraft controllers and technicians | -3.16 |
| 5244 Contact centre salespersons | -3.15 |

Distribution of Cultural Associations across Different Classification Levels

While some interesting insights can be revealed by looking at the scores of each occupation individually, a relevant question is whether cultural associations as measured here also serve to distinguish bigger occupational groups from each other. To assess this, I conduct ANOVAs for each dimension to assess whether there are significant differences between ISCO minor, sub-major and major groups.

| Dimension | DFn | DFd | F-value |
|---|---|---|---|
| Differences between Major Groups | | | |
| Gender | 8 | 450 | 11.207*** |
| Warmth | 8 | 450 | 9.731*** |
| Competence | 8 | 450 | 12.374*** |
| Excitement | 8 | 450 | 11.205*** |
| Ethnicity | 8 | 450 | 15.836*** |
| Culture | 8 | 450 | 5.599*** |
| Affluence | 8 | 450 | 10.744*** |
| Education | 8 | 450 | 20.553*** |
| Differences between Sub-Major Groups | | | |
| Gender | 44 | 414 | 4.376*** |
| Warmth | 44 | 414 | 4.615*** |
| Competence | 44 | 414 | 4.052*** |
| Excitement | 44 | 414 | 4.711*** |
| Ethnicity | 44 | 414 | 4.286*** |
| Culture | 44 | 414 | 3.569*** |
| Affluence | 44 | 414 | 4.627*** |
| Education | 44 | 414 | 7.844*** |
| Differences between Minor Groups | | | |
| Gender | 139 | 319 | 2.591*** |
| Warmth | 139 | 319 | 2.627*** |
| Competence | 139 | 319 | 1.834*** |
| Excitement | 139 | 319 | 2.361*** |
| Ethnicity | 139 | 319 | 2.159*** |
| Culture | 139 | 319 | 2.053*** |
| Affluence | 139 | 319 | 2.557*** |
| Education | 139 | 319 | 2.91*** |

Note: * p<0.05 ** p<0.01 *** p<0.001
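For readers unfamiliar with the statistic, the F-value of a one-way ANOVA can be sketched as below. The function name `one_way_anova` and the input numbers are illustrative assumptions, not the project's code or data. The degrees of freedom follow from the group counts: with k groups and N observations, DFn = k - 1 and DFd = N - k, so 9 major groups and 459 unit groups give 8 and 450, as reported in the table.

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of groups of numeric scores."""
    all_values = [x for group in groups for x in group]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(group) / len(group) for group in groups]

    # Variation of group means around the grand mean, weighted by group size
    ss_between = sum(
        len(group) * (g_mean - grand_mean) ** 2
        for group, g_mean in zip(groups, group_means)
    )
    # Variation of individual scores around their own group mean
    ss_within = sum(
        (x - g_mean) ** 2
        for group, g_mean in zip(groups, group_means)
        for x in group
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within

# Invented scores for three hypothetical occupational groups
toy_groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_value, df_between, df_within = one_way_anova(toy_groups)
print(round(f_value, 3), df_between, df_within)  # 13.0 2 6
```

A large F-value means that the group means differ by more than the scatter within groups would lead us to expect; it says nothing about which groups differ or how homogeneous each group is.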

The results show that all dimensions differ significantly across minor, sub-major, and major groups. This indicates that these cultural associations can potentially be used to distinguish between occupational groups. It does not mean, however, that all occupations within the same group are homogeneous in terms of their cultural associations. Taking the gender dimension as an example, we can see how the scores are distributed across sub-major groups.

Figure 1

This figure reveals that for some sub-major groups, such as ‘53 Personal Care Workers’, the scores are tightly distributed around the mean, while in other groups, such as ‘21 Science and Engineering Professionals’, occupations are more widely spread around the mean and across both sides of the gender dimension. Results from random intercept models further illustrate the heterogeneity within occupational groups.

| Dimension | (Intercept) | σ² | τ00 Minor_Group | τ00 Sub_Major_Group | τ00 Major_Group |
|---|---|---|---|---|---|
| Woman (-) vs. Man (+) | 0.05 | 0.66 | 0.12 | 0.11 | 0.19 |
| Cold (-) vs. Warm (+) | 0.07 | 0.65 | 0.13 | 0.14 | 0.08 |
| Incompetent (-) vs. Competent (+) | -0.01 | 0.77 | 0.00 | 0.10 | 0.16 |
| Boring (-) vs. Interesting (+) | -0.10 | 0.72 | 0.02 | 0.15 | 0.20 |
| Foreign (-) vs. Swiss (+) | -0.06 | 0.72 | 0.05 | 0.04 | 0.30 |
| Uncultivated (-) vs. Cultivated (+) | -0.01 | 0.75 | 0.06 | 0.16 | 0.08 |
| Poor (-) vs. Rich (+) | -0.05 | 0.69 | 0.09 | 0.14 | 0.22 |
| Uneducated (-) vs. Educated (+) | -0.07 | 0.60 | 0.00 | 0.18 | 0.32 |

N (all models): 9 Major_Group, 45 Sub_Major_Group, 140 Minor_Group; Observations: 459
Note: * p<0.05 ** p<0.01 *** p<0.001

Taking all three levels into account simultaneously, we can see that for some dimensions, such as warmth, there is variance at all three levels. For other dimensions, such as education, the group-level variance is confined to the sub-major and major groups, while the minor-group variance is estimated at 0.00, indicating that minor groups add no further differentiation in terms of education within their sub-major group.
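One way to read these variance components is to express each as a share of the total variance. The sketch below does this for three of the dimensions, using the σ² and τ00 values reported in the random intercept table; treating the simple sum of the four components as the total variance is a standard but simplifying assumption, and the helper name `variance_shares` is hypothetical.

```python
# Variance components copied from the random intercept table:
# residual = σ², the others are the τ00 values of the three grouping levels.
components = {
    "gender":    {"residual": 0.66, "minor": 0.12, "sub_major": 0.11, "major": 0.19},
    "warmth":    {"residual": 0.65, "minor": 0.13, "sub_major": 0.14, "major": 0.08},
    "education": {"residual": 0.60, "minor": 0.00, "sub_major": 0.18, "major": 0.32},
}

def variance_shares(parts):
    """Divide each variance component by the sum of all components."""
    total = sum(parts.values())
    return {level: value / total for level, value in parts.items()}

shares = {dim: variance_shares(parts) for dim, parts in components.items()}

for dim, dim_shares in shares.items():
    print(dim, {level: round(value, 2) for level, value in dim_shares.items()})
```

Under this reading, the residual component, which reflects differences between unit groups within the same minor group, makes up more than half of the total variance for each of the three dimensions shown.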

Interrelationship between Cultural Associations

While individual cultural associations of occupations may offer interesting insights on their own, it is also important to see how different associations are related to each other. Thus, I jointly visualize the cultural associations of all occupations at the ISCO unit group level. In this plot, the x-axis denotes the different cultural dimensions, and the y-axis describes how much a given occupation leans toward one of the poles of the dimension. The individual points have been connected to represent the different cultural profiles of each occupation.

Figure 2 - Exemplary Occupational Profiles

Using hierarchical clustering, I also visualize how occupations can be grouped based on these occupational profiles and how much these groupings coincide with the conventional ISCO classification. Figure 3, which shows 15 clusters and uses color to denote the nine ISCO major groups, reveals that the ISCO major groups only partially coincide with the derived clusters. Some clusters are dominated by one color (i.e. consist mainly of one major group), whereas others contain a wide mix of colors and thus of major groups. ISCO groups are therefore also heterogeneous in terms of the joint distribution of their different cultural associations.
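To make the clustering step concrete, here is a naive sketch of agglomerative hierarchical clustering on cultural profiles. The linkage method and distance metric used in the project are not stated, so average linkage with Euclidean distance is an assumption here, and the profiles and the labels A to D are invented placeholders rather than real occupations.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def average_linkage(cluster_a, cluster_b, profiles):
    """Mean pairwise distance between the members of two clusters."""
    distances = [
        euclidean(profiles[a], profiles[b]) for a in cluster_a for b in cluster_b
    ]
    return sum(distances) / len(distances)

def agglomerative(profiles, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    clusters = [[name] for name in profiles]
    while len(clusters) > n_clusters:
        best = None  # (distance, index_i, index_j) of the closest pair so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average_linkage(clusters[i], clusters[j], profiles)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Invented two-dimensional profiles (e.g. z-scores on two cultural dimensions)
toy_profiles = {
    "A": [2.0, 1.0],
    "B": [1.8, 1.2],
    "C": [-1.5, -0.5],
    "D": [-1.7, -0.3],
}

result = agglomerative(toy_profiles, n_clusters=2)
print(result)  # [['A', 'B'], ['C', 'D']]
```

Comparing such clusters with the ISCO major group of each member, for instance in a simple cross-tabulation, is one way to quantify how far the data-driven groupings and the official classification overlap.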