Word Embeddings in qhChina

This page documents the word embeddings functionality in the qhChina package, with a focus on the customized Word2Vec implementation.

Word2Vec Implementation

qhChina provides a custom implementation of Word2Vec with both CBOW (Continuous Bag of Words) and Skip-gram architectures, designed for research in the humanities and social sciences working with Chinese text.

Basic Usage

from qhchina.analytics.word2vec import Word2Vec

# Initialize a Word2Vec model
model = Word2Vec(
    vector_size=100,  # Dimensionality of word vectors
    window=5,         # Context window size
    min_count=5,      # Minimum word frequency threshold
    sg=1,             # 1 for Skip-gram; 0 for CBOW
    negative=5,       # Number of negative samples
    alpha=0.025,      # Initial learning rate
    seed=42           # Random seed for reproducibility
)

# Prepare tokenized sentences
sentences = [
    ["我", "喜欢", "这部", "电影"],
    ["这", "是", "一个", "有趣", "的", "故事"],
    # More sentences...
]

# Train the model
model.train(sentences, epochs=5)

# Get word vector
vector = model.get_vector("电影")

# Find similar words
similar_words = model.most_similar("电影", topn=10)
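
Each entry returned by most_similar is a (word, similarity) pair (the same structure unpacked in the semantic change example later on), so results can be printed directly:

for word, score in similar_words:
    print(f"{word}: {score:.4f}")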

Key Features

Architecture Options

  • CBOW (Continuous Bag of Words): Predicts the target word from context words
  • Skip-gram: Predicts context words from the target word

# CBOW model (default)
cbow_model = Word2Vec(sg=0)

# Skip-gram model
skipgram_model = Word2Vec(sg=1)

Training Parameters

Parameter             Description
vector_size           Dimensionality of word vectors (default: 100)
window                Maximum distance between target and context words (default: 5)
min_count             Ignores words with frequency below this threshold (default: 5)
alpha                 Initial learning rate (default: 0.025)
min_alpha             Final learning rate (default: None)
negative              Number of negative samples per positive sample (default: 5)
ns_exponent           Exponent for the negative sampling distribution (default: 0.75)
max_vocab_size        Maximum vocabulary size (default: None)
sample                Threshold for downsampling frequent words (default: 1e-3)
shrink_windows        Whether to use dynamic window sizes (default: True)
seed                  Random seed for reproducibility (default: 1)
cbow_mean             Whether to average (True) or sum (False) context vectors in CBOW (default: True)
use_double_precision  Whether to use double precision for calculations (default: False)
use_cython            Whether to use Cython for performance-critical operations (default: False)
gradient_clip         Clipping value for gradients (default: 1.0)
exp_table_size        Size of the precomputed sigmoid table (default: 1000)
max_exp               Maximum value in the precomputed sigmoid table (default: 6.0)
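
The less common parameters from the table are passed to the same constructor shown in Basic Usage. A brief sketch with illustrative values, not recommendations:

# Extended configuration (illustrative values)
model = Word2Vec(
    vector_size=100,
    sg=0,                  # CBOW
    cbow_mean=True,        # average context vectors rather than summing them
    alpha=0.025,
    min_alpha=0.0001,      # decay the learning rate towards this value
    ns_exponent=0.75,      # smooth the negative sampling distribution
    gradient_clip=1.0,     # clip gradients for numerical stability
    use_cython=True,       # use Cython for performance-critical operations
)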

Batch Training

The Word2Vec implementation supports batch-based training for better performance:

# Train with batching
model.train(sentences, epochs=5, batch_size=64)

Advanced Methods

# Save and load models
model.save("my_model.model")
loaded_model = Word2Vec.load("my_model.model")

# Update model with new sentences
model.build_vocab(new_sentences, update=True)
model.train(new_sentences, epochs=3)

# Word similarity
similarity = model.similarity("电影", "电视")

# Get most similar words
similar_words = model.most_similar("中国", topn=10)

Temporal Reference Word2Vec

qhChina provides a specialized implementation called TempRefWord2Vec for tracking semantic change over time. This model does not require training separate models for each time period. Instead, it creates temporal variants of target words in a single vector space using a specialized training approach.

Basic Usage

from qhchina.analytics.word2vec import TempRefWord2Vec

# Prepare corpus data from different time periods
time_labels = ["1980", "1990", "2000", "2010"]
corpora = [corpus_1980, corpus_1990, corpus_2000, corpus_2010]

# Target words to track for semantic change
target_words = ["改革", "经济", "科技", "人民"]

# Initialize and train the model in one step
model = TempRefWord2Vec(
    corpora=corpora,          # List of corpora for different time periods
    labels=time_labels,       # Labels for each time period
    targets=target_words,     # Words to track for semantic change
    vector_size=256,
    window=5,
    min_count=5,
    sg=1,                     # Use Skip-gram model
    negative=10,
    seed=42
)

# Train the model
model.train(calculate_loss=True, batch_size=64)

# Access temporal variants of words
reform_1980 = model.get_vector("改革_1980")
reform_2010 = model.get_vector("改革_2010")

# Find words similar to a target in a specific time period
similar_to_reform_1980s = model.most_similar("改革_1980", topn=10)
similar_to_reform_2010s = model.most_similar("改革_2010", topn=10)
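
Because all temporal variants live in one vector space, a word can also be compared with itself across periods using the regular similarity method:

# Cosine similarity of 改革 between the 1980 and 2010 periods
drift = model.similarity("改革_1980", "改革_2010")
print(f"Similarity between the 1980 and 2010 senses: {drift:.4f}")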

How It Works

The TempRefWord2Vec model works by:

  1. Creating temporal variants of target words by appending time period labels (e.g., "改革_1980", "改革_2010")
  2. Training a single Word2Vec model on all corpora combined, with each occurrence of a target word replaced by its period-specific variant (see the sketch below)
  3. Maintaining a shared vector space for all non-target words across all time periods

Because the non-target words anchor a common space, the temporal variants of a target word can be compared directly, revealing how its semantic associations change over time.
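
A minimal sketch of the substitution in step 2, purely illustrative and independent of TempRefWord2Vec's actual internals:

# Illustrative only: merge corpora, replacing target tokens with
# period-specific variants (e.g., 改革 -> 改革_1980)
def substitute_variants(corpora, labels, targets):
    target_set = set(targets)
    merged = []
    for corpus, label in zip(corpora, labels):
        for sentence in corpus:
            merged.append([f"{token}_{label}" if token in target_set else token
                           for token in sentence])
    return merged

Training an ordinary Word2Vec model on the merged corpus then places all variants in a single shared space.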

Analyzing Semantic Change

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def calculate_semantic_change(model, target_word, labels, limit_top_similar=200, min_length=2):
    """
    Calculate semantic change by comparing cosine similarities across time periods.
    
    Parameters:
    -----------
    model: Trained TempRefWord2Vec model
    target_word: Target word to analyze
    labels: Time period labels
    limit_top_similar: Number of most similar words to consider
    min_length: Minimum word length to include
    
    Returns:
    --------
    Dict mapping transition names to lists of (word, change) tuples
    """
    results = {}
    
    # Get all words in vocabulary (excluding temporal variants)
    all_words = [word for word in model.vocab.keys()
                 if word not in model.reverse_temporal_map]
    
    # Get embeddings for all words
    all_word_vectors = np.array([model.get_vector(word) for word in all_words])

    # For each adjacent pair of time periods
    for i in range(len(labels) - 1):
        from_period = labels[i]
        to_period = labels[i+1]
        transition = f"{from_period}_to_{to_period}"
        
        # Get temporal variants for the target word
        from_variant = f"{target_word}_{from_period}"
        to_variant = f"{target_word}_{to_period}"
        
        # Get vectors for the target word in each period
        from_vector = model.get_vector(from_variant).reshape(1, -1)
        to_vector = model.get_vector(to_variant).reshape(1, -1)
        
        # Calculate cosine similarity for all words with the target word in each period
        from_sims = cosine_similarity(from_vector, all_word_vectors)[0]
        to_sims = cosine_similarity(to_vector, all_word_vectors)[0]
        
        # Calculate differences in similarity
        sim_diffs = to_sims - from_sims
        
        # Create (word, change) pairs and sort by change, descending
        word_changes = [(word, float(diff)) for word, diff in zip(all_words, sim_diffs)]
        word_changes.sort(key=lambda x: x[1], reverse=True)
        
        # Consider only words that were among the most similar in either period
        most_similar_from = model.most_similar(from_variant, topn=limit_top_similar)
        most_similar_to = model.most_similar(to_variant, topn=limit_top_similar)
        
        considered_words = set(word for word, _ in most_similar_from) | set(word for word, _ in most_similar_to)
        
        # Filter results based on considered words and length
        word_changes = [change for change in word_changes 
                      if change[0] in considered_words and len(change[0]) >= min_length]
        
        results[transition] = word_changes
    
    return results

# Example usage
target_word = "人民"
changes = calculate_semantic_change(model, target_word, time_labels)

# Display words that moved towards and away from "人民" in each transition
for transition, word_changes in changes.items():
    print(f"\nTransition: {transition}")
    
    # Words with increased similarity (moved towards)
    print("Words moved towards:")
    for word, change in word_changes[:10]:
        print(f"  {word}: {change:.4f}")
    
    # Words with decreased similarity (moved away)
    print("\nWords moved away from:")
    for word, change in word_changes[-10:]:
        print(f"  {word}: {change:.4f}")

Visualization Examples

You can visualize the semantic change using the standard vector projection tools:

from qhchina.analytics.vectors import project_2d

# Get vectors for target word across all time periods
target_word = "改革"
vectors = {}
for period in time_labels:
    temporal_variant = f"{target_word}_{period}"
    vectors[temporal_variant] = model.get_vector(temporal_variant)

# Add common words to the visualization
common_words = ["政策", "开放", "经济", "发展", "市场"]
for word in common_words:
    vectors[word] = model.get_vector(word)

# Project to 2D
project_2d(
    vectors=vectors,
    method="pca",
    title=f"Semantic Change of '{target_word}' Over Time",
    adjust_text_labels=True
)

Vector Analysis

qhChina provides tools for analyzing and visualizing word embeddings.

Vector Projection

from qhchina.analytics.vectors import project_2d

# Project vectors to 2D space using PCA
project_2d(
    vectors={word: model.get_vector(word) for word in ["中国", "美国", "俄罗斯", "日本", "德国"]},
    method="pca",
    title="Countries in Vector Space"
)

# Using t-SNE for better clustering visualization
# (words_list can be any list of words in the model's vocabulary)
project_2d(
    vectors={word: model.get_vector(word) for word in words_list},
    method="tsne",
    perplexity=5,
    title="t-SNE Projection of Word Vectors"
)

Bias Analysis

from qhchina.analytics.vectors import calculate_bias, project_bias

# Define gender dimension
gender_pairs = [("男人", "女人"), ("他", "她"), ("父亲", "母亲")]

# Calculate bias scores along gender dimension
target_words = ["医生", "护士", "工程师", "教师", "科学家"]
bias_scores = calculate_bias(gender_pairs, target_words, model)

# Project words on the gender dimension
project_bias(
    x=gender_pairs,
    y=None,
    targets=target_words,
    word_vectors=model,
    title="Gender Bias in Profession Words"
)

Vector Alignment

When comparing word vectors across different models (e.g., from different training runs), you can align them to enable direct comparison:

from qhchina.analytics.vectors import align_vectors

# Align model2's vectors to model1's vector space
align_vectors(model1, model2)

# Now you can directly compare vectors
vector1 = model1.get_vector("电影")
vector2 = model2.get_vector("电影")
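
Alignment of this kind is typically an orthogonal Procrustes rotation (the approach used by Hamilton et al., cited below). If you ever need to align raw embedding matrices yourself, a minimal numpy sketch (assuming two arrays whose rows correspond to the same shared vocabulary; not necessarily how qhChina implements align_vectors):

import numpy as np

def procrustes_align(source, reference):
    """Rotate source (n x d) onto reference (n x d) with the orthogonal
    matrix that minimizes the Frobenius distance between them."""
    # SVD of the cross-covariance matrix gives the optimal rotation
    u, _, vt = np.linalg.svd(source.T @ reference)
    return source @ (u @ vt)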

Practical Examples

Analyzing Conceptual Change

# Initialize model with specific parameters for historical analysis
model = Word2Vec(
    vector_size=200,
    window=10,
    min_count=10,
    sg=1,
    negative=10
)

# Train on early period corpus
model.build_vocab(early_period_texts)
model.train(early_period_texts, epochs=5)
early_model = model.copy()

# Update model with later period corpus
model.build_vocab(later_period_texts, update=True)
model.train(later_period_texts, epochs=5)

# Compare semantic neighborhoods
early_neighbors = early_model.most_similar("革命", topn=20)
modern_neighbors = model.most_similar("革命", topn=20)
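
Since most_similar returns (word, score) pairs, set operations give a quick view of what entered and left the word's neighborhood:

early_set = {word for word, _ in early_neighbors}
modern_set = {word for word, _ in modern_neighbors}
print("Entered the neighborhood:", modern_set - early_set)
print("Left the neighborhood:", early_set - modern_set)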

Creating Semantic Fields

# Get all words similar to a concept
economy_terms = model.most_similar("经济", topn=50)

# Find clusters within a semantic field
import numpy as np
from sklearn.cluster import KMeans

# Get vectors for economy-related terms
vectors = np.array([model.get_vector(word) for word, _ in economy_terms])

# Cluster the vectors
kmeans = KMeans(n_clusters=5, random_state=42)
clusters = kmeans.fit_predict(vectors)

# Group words by cluster
semantic_fields = {}
for (word, _), cluster in zip(economy_terms, clusters):
    semantic_fields.setdefault(cluster, []).append(word)
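
To inspect the resulting fields:

for cluster, words in semantic_fields.items():
    print(f"Cluster {cluster}: {', '.join(words)}")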

Performance Considerations

  • For large corpora, set max_vocab_size to cap the vocabulary and limit memory usage
  • Use the sample parameter to downsample very frequent words, which often improves vector quality
  • For very large vocabularies, consider filtering rare or uninformative words before training
  • Keep shrink_windows=True (the default) so window sizes vary dynamically, exposing the model to more diverse contexts (see the example below)
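
A hedged starting configuration combining these settings (values are illustrative, not recommendations):

# Memory-conscious setup for a large corpus (illustrative values)
model = Word2Vec(
    vector_size=100,
    min_count=10,           # drop rare words up front
    max_vocab_size=100000,  # cap the vocabulary to bound memory use
    sample=1e-4,            # downsample very frequent words more aggressively
    shrink_windows=True,    # dynamic windows for more diverse contexts
)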

References

  1. Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
  2. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems, 26.
  3. Hamilton, W. L., Leskovec, J., & Jurafsky, D. (2016). Diachronic word embeddings reveal statistical laws of semantic change. arXiv preprint arXiv:1605.09096.