Corpus Analysis in qhChina

qhChina provides a suite of tools for analyzing corpus data, with a focus on comparing corpora and identifying linguistic patterns.

Comparing Corpora

The compare_corpora function allows you to identify statistically significant differences in word usage between two corpora. This is particularly useful for studying language variation across different text collections, such as texts from different time periods, regions, or sources.

Basic Usage

from qhchina.analytics import compare_corpora

# Example data
corpus_a = ["中国", "经济", "发展", "改革", "经济", "政策", "中国", "市场"]
corpus_b = ["美国", "经济", "市场", "金融", "美国", "贸易", "进口", "出口"]

# Compare the corpora
results = compare_corpora(
    corpusA=corpus_a, 
    corpusB=corpus_b, 
    method="fisher",
    filters={"min_count": 10},
    as_dataframe=True
)

# Sort by statistical significance
results = results.sort_values("p_value")

Parameters

Parameter     Description
------------  -----------------------------------------------------------------
corpusA       List of tokens from the first corpus
corpusB       List of tokens from the second corpus
method        Statistical test to use: 'fisher' (default), 'chi2', or 'chi2_corrected'
filters       Dictionary of filters to apply to results (see Filtering Options)
as_dataframe  Whether to return results as a pandas DataFrame (default: True; see the sketch below for the list-of-dicts form)
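
With as_dataframe=False, the same results come back as a list of plain dicts. A minimal sketch, assuming each dict is keyed by the field names listed under Results Interpretation below:

results = compare_corpora(
    corpusA=corpus_a,
    corpusB=corpus_b,
    method="fisher",
    as_dataframe=False,  # list of dicts instead of a DataFrame
)

# Sort by p-value, most significant differences first
for row in sorted(results, key=lambda r: r["p_value"])[:5]:
    print(row["word"], row["p_value"], row["rel_ratio"])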

Filtering Options

The filters parameter accepts a dictionary with the following options:

Filter      Description
----------  -----------------------------------------------------------------
min_count   Minimum count for a word to be included: an int, or a tuple of two ints (see the sketch below)
max_p       Maximum p-value threshold for statistical significance
stopwords   List of words to exclude from results
min_length  Minimum character length for words
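
The tuple form of min_count presumably sets separate thresholds per corpus; the (corpusA, corpusB) ordering below is an assumption, so verify it against your qhchina version:

results = compare_corpora(
    corpusA=corpus_a,
    corpusB=corpus_b,
    filters={
        "min_count": (5, 2),  # assumed: at least 5 occurrences in A, 2 in B
        "max_p": 0.05,
    },
)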

Results Interpretation

The function returns a DataFrame (or, with as_dataframe=False, a list of dictionaries), with each entry containing:

  • word: The word being compared
  • abs_freqA: Absolute frequency in corpus A
  • abs_freqB: Absolute frequency in corpus B
  • rel_freqA: Relative frequency in corpus A
  • rel_freqB: Relative frequency in corpus B
  • rel_ratio: Ratio of relative frequencies (rel_freqA / rel_freqB)
  • p_value: Statistical significance of the difference

A small p-value indicates that the difference in word frequency between the two corpora is statistically significant.
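
For intuition, method="fisher" applies Fisher's exact test to a 2x2 contingency table built for each word. A hand-rolled sketch with scipy (an illustration of the statistic, not qhChina's internal code):

from scipy.stats import fisher_exact

def fisher_p(count_a, total_a, count_b, total_b):
    """P-value for one word. Rows: corpus A / corpus B;
    columns: the word / all other tokens."""
    table = [
        [count_a, total_a - count_a],
        [count_b, total_b - count_b],
    ]
    _, p = fisher_exact(table)
    return p

# "中国" occurs 2 times in the 8-token corpus_a and 0 times in corpus_b
print(fisher_p(2, 8, 0, 8))  # ~0.47: samples this small carry little evidence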

Example Analysis

# Apply multiple filters during corpus comparison
filtered_results = compare_corpora(
    corpusA=corpus_a,
    corpusB=corpus_b,
    method="fisher",
    filters={
        "min_count": 3,       # Minimum count in both corpora
        "max_p": 0.05,        # Only statistically significant differences
        "stopwords": ["的", "了", "和"],  # Exclude common words
        "min_length": 2       # Only include words with at least 2 characters
    }
)

# Identify words that are significantly more common in corpus A
words_overrepresented_in_A = filtered_results[
    (filtered_results["p_value"] < 0.05) & 
    (filtered_results["rel_ratio"] > 1)
]

# Identify words that are significantly more common in corpus B
words_overrepresented_in_B = filtered_results[
    (filtered_results["p_value"] < 0.05) & 
    (filtered_results["rel_ratio"] < 1)
]

# Visualize the most significant differences
import matplotlib.pyplot as plt
import numpy as np

top_words = filtered_results.sort_values("p_value").head(10)
plt.figure(figsize=(10, 6))
plt.barh(
    top_words["word"],
    np.log2(top_words["rel_ratio"]),
    color=[("blue" if ratio > 1 else "red") for ratio in top_words["rel_ratio"]]
)
plt.axvline(x=0, color="black", linestyle="-")
plt.xlabel("Log2 Ratio (Corpus A / Corpus B)")
plt.title("Most Significant Word Frequency Differences")
plt.tight_layout()
plt.show()

Co-occurrence Matrix

The cooc_matrix function allows you to create a co-occurrence matrix from a collection of documents, capturing how frequently words occur together within a context.

Basic Usage

from qhchina.analytics.collocations import cooc_matrix

# Example data - list of tokenized documents
documents = [
    ["中国", "经济", "发展", "改革"],
    ["美国", "经济", "市场", "金融"],
    ["中国", "市场", "贸易", "改革"],
    # More documents...
]

# Create co-occurrence matrix using window method
cooc = cooc_matrix(
    documents=documents,
    method="window",
    horizon=2,           # Context window size
    min_abs_count=2,     # Minimum word frequency
    as_dataframe=True    # Return as pandas DataFrame
)

Parameters

Parameter      Description
-------------  -----------------------------------------------------------------
documents      List of tokenized documents, where each document is a list of tokens
method         Co-occurrence method: 'window' (default) or 'document' (see the sketch below)
horizon        Size of the context window (only used if method='window')
min_abs_count  Minimum absolute count for a word to be included
min_doc_count  Minimum number of documents a word must appear in
vocab_size     Maximum vocabulary size (optional)
binary         Count co-occurrences as binary (0/1) rather than frequencies
as_dataframe   Return the matrix as a pandas DataFrame
vocab          Predefined vocabulary to use (optional)
use_sparse     Use a sparse matrix for better memory efficiency with large vocabularies
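
To see what the two methods count, here is an independent illustration (not qhChina's internal implementation): 'window' pairs each token with neighbors up to horizon positions away, while 'document' pairs every two distinct words that share a document.

from collections import Counter
from itertools import combinations

doc = ["中国", "经济", "发展", "改革"]
horizon = 2

# 'window' counting: pairs within `horizon` positions of each other
window_counts = Counter()
for i, word in enumerate(doc):
    for context in doc[i + 1 : i + 1 + horizon]:
        window_counts[(word, context)] += 1
        window_counts[(context, word)] += 1

# 'document' counting: every unordered pair of distinct words in the document
doc_counts = Counter()
for w1, w2 in combinations(set(doc), 2):
    doc_counts[(w1, w2)] += 1
    doc_counts[(w2, w1)] += 1

print(window_counts[("中国", "发展")])  # 1: within two positions
print(window_counts[("中国", "改革")])  # 0: three positions apart
print(doc_counts[("中国", "改革")])     # 1: same document, any distance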

Word Co-occurrence Analysis

# Find words that frequently co-occur with a target word
target_word = "经济"
if target_word in cooc.index:
    cooc_with_target = cooc[target_word].sort_values(ascending=False)
    print(f"Words co-occurring with '{target_word}':")
    print(cooc_with_target.head(10))

# Visualize co-occurrence network
import networkx as nx
import matplotlib.pyplot as plt

# Create a graph from the co-occurrence matrix
G = nx.Graph()

# Add nodes for each word
for word in cooc.index:
    G.add_node(word)

# Add edges for co-occurrences above a threshold
threshold = 3  # Minimum co-occurrence count
for word1 in cooc.index:
    for word2 in cooc.columns:
        if word1 != word2 and cooc.loc[word1, word2] >= threshold:
            G.add_edge(word1, word2, weight=cooc.loc[word1, word2])

# Draw the network
plt.figure(figsize=(12, 12))
pos = nx.spring_layout(G, seed=42)
nx.draw_networkx_nodes(G, pos, node_size=100)
nx.draw_networkx_edges(G, pos, width=0.5, alpha=0.5)
nx.draw_networkx_labels(G, pos, font_size=10)
plt.axis("off")
plt.title("Word Co-occurrence Network")
plt.tight_layout()
plt.show()

Temporal Analysis with TempRefWord2Vec

For analyzing corpus data over time, qhChina provides the TempRefWord2Vec model and supporting functions. This allows you to track semantic change in specific words across different time periods.

from qhchina.analytics import TempRefWord2Vec

# Prepare corpus data from different time periods
time_labels = ["1980s", "1990s", "2000s", "2010s"]
corpora = [corpus_1980s, corpus_1990s, corpus_2000s, corpus_2010s]

# Target words to track for semantic change
target_words = ["改革", "经济", "科技", "人民"]

# Initialize and train the model
model = TempRefWord2Vec(
    corpora=corpora,          # List of corpora for different time periods
    labels=time_labels,       # Labels for each time period
    targets=target_words,     # Words to track for semantic change
    balance=True,             # Balance corpus sizes
    vector_size=100,
    window=5,
    min_count=5,
    sg=1                      # Use Skip-gram model
)

# Train the model
model.train(epochs=5)

# Access temporal variants of words
reform_1980s = model.get_vector("改革_1980s")
reform_2010s = model.get_vector("改革_2010s")

# Find similar words for a target in different time periods
similar_to_reform_1980s = model.most_similar("改革_1980s", topn=10)
similar_to_reform_2010s = model.most_similar("改革_2010s", topn=10)
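
A quick way to quantify how far a word has drifted is the cosine similarity between its temporal variants. A minimal sketch, assuming get_vector returns a numpy array:

import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Lower similarity between variants suggests greater semantic change
print(cosine(reform_1980s, reform_2010s))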

Combining with Word Embeddings

You can use the corpus analysis tools in combination with word embeddings for more sophisticated analyses:

from qhchina.analytics.word2vec import Word2Vec
from qhchina.analytics.corpora import compare_corpora

# Train Word2Vec model on your corpus
model = Word2Vec(vector_size=100, window=5, min_count=5)
model.build_vocab(all_sentences)
model.train(all_sentences, epochs=5)

# Analyze differences between two corpora
comparison = compare_corpora(
    corpusA=corpus_a, 
    corpusB=corpus_b,
    filters={"min_count": 5, "max_p": 0.05}
)
significant_words = comparison[comparison["p_value"] < 0.05]["word"].tolist()

# Examine the semantic relationships between significant words
from qhchina.analytics.vectors import project_2d

# Get vectors for significant words that appear in the model
significant_vectors = {}
for word in significant_words:
    if word in model.vocab:
        significant_vectors[word] = model.get_vector(word)

# Visualize the semantic space
if significant_vectors:
    project_2d(
        vectors=significant_vectors,
        method="tsne",
        perplexity=5,
        title="Semantic Space of Statistically Significant Words"
    )

Practical Examples

Comparative Analysis of Historical Texts

# Comparing language use in different historical periods
modern_corpus = ["现代", "科技", "发展", "技术", "电脑", "系统", ...]
classical_corpus = ["古代", "诗词", "文学", "礼仪", "制度", ...]

# Find distinctive vocabulary in each period
period_comparison = compare_corpora(
    modern_corpus,
    classical_corpus,
    filters={"min_count": 5}  # min_count is a filter, not a top-level argument
)

Topic-Focused Analysis

# Extract all sentences containing a specific term
from qhchina.helpers import texts

economy_sentences = texts.extract_sentences_with_term(
    all_sentences, 
    "经济"
)

# Compare sentences with the term to the broader corpus
economy_comparison = compare_corpora(
    [token for sentence in economy_sentences for token in sentence],
    [token for sentence in all_sentences for token in sentence],
    filters={"min_count": 5}  # as above, min_count goes inside filters
)

Performance Considerations

  • For very large corpora, consider using min_count to filter out rare words
  • When creating co-occurrence matrices for large vocabularies, use use_sparse=True
  • The compare_corpora function holds its full results in memory, so for extremely large corpora, consider pre-filtering rare tokens or processing in batches (see the sketch below)
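
One pragmatic way to bound memory is to drop rare tokens before the comparison. A minimal sketch using collections.Counter (illustrative; not part of qhChina's API):

from collections import Counter

def prefilter(tokens, min_count=5):
    """Drop tokens rarer than min_count, shrinking the vocabulary
    compare_corpora has to track. Note that this also changes the
    corpus totals, so relative frequencies and p-values shift slightly."""
    counts = Counter(tokens)
    return [t for t in tokens if counts[t] >= min_count]

comparison = compare_corpora(
    corpusA=prefilter(corpus_a),
    corpusB=prefilter(corpus_b),
    method="fisher",
)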
