# Collocation Analysis in qhChina
This page documents the collocation analysis functionality in the qhChina package, which allows you to identify and analyze words that frequently occur together in text.
## Finding Collocates

The `find_collocates` function allows you to identify words that co-occur with specific target words more frequently than would be expected by chance.
### Basic Usage

```python
from qhchina.analytics.collocations import find_collocates

# Find collocates of "经济"
collocates = find_collocates(
    sentences=sentences,
    target_words=["经济"],
    method="window",
    horizon=3,
    as_dataframe=True
)
```
### Parameters

| Parameter | Description |
|---|---|
| `sentences` | List of tokenized sentences (each a list of tokens) |
| `target_words` | List of target words (or a single word) to find collocates for |
| `method` | Method to use: `'window'` (default) or `'sentence'` |
| `horizon` | Context window size (only used with `method='window'`) |
| `filters` | Dictionary of filters to apply to the results (see Filtering Results below) |
| `as_dataframe` | Whether to return the results as a pandas DataFrame (default: `True`) |
### Results Interpretation

The function returns a list of dictionaries (or a DataFrame), with each entry containing:

- `target`: The target word
- `collocate`: A word that co-occurs with the target
- `exp_local`: Expected frequency of co-occurrence (if the two words were independent)
- `obs_local`: Observed frequency of co-occurrence
- `ratio_local`: Ratio of observed to expected frequency
- `obs_global`: Total frequency of the collocate in the corpus
- `p_value`: Statistical significance of the association

A small p-value indicates that the association between the target word and the collocate is statistically significant. The `ratio_local` value measures the strength of the association: higher values indicate stronger associations.
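If you set `as_dataframe=False`, you can still move the results into pandas yourself. A minimal sketch, assuming the output is a list of dictionaries with the keys listed above:

```python
import pandas as pd

# With as_dataframe=False, find_collocates returns a list of dicts
results = find_collocates(
    sentences=sentences,
    target_words=["经济"],
    as_dataframe=False
)

# Build a DataFrame and order by statistical significance
df = pd.DataFrame(results).sort_values("p_value")

# ratio_local is, per the description above, obs_local / exp_local;
# recomputing it is a handy sanity check
df["ratio_check"] = df["obs_local"] / df["exp_local"]
print(df.head())
```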
### Example Analysis

```python
# Get the top 10 most significant collocates (sort by p-value first)
top_collocates = collocates.sort_values("p_value").head(10)

print("Top collocates of target word:")
for _, row in top_collocates.iterrows():
    print(f"{row['collocate']}: observed={row['obs_local']}, expected={row['exp_local']:.2f}, ratio={row['ratio_local']:.2f}")
```
## Collocation Methods
qhChina provides two methods for finding collocates:
### Window-based Collocation

The window-based method (`method='window'`) looks for words that appear within a specified distance (the horizon) of the target word. This method is better suited to identifying words that have a close syntactic relationship with the target word.
```python
# Find words that appear within 3 words of "中国"
window_collocates = find_collocates(
    sentences=sentences,
    target_words=["中国"],
    method="window",
    horizon=3
)
```
### Sentence-based Collocation

The sentence-based method (`method='sentence'`) looks for words that appear in the same sentence as the target word. This method is better suited to identifying broader thematic associations.
```python
# Find words that appear in the same sentences as "改革"
sentence_collocates = find_collocates(
    sentences=sentences,
    target_words=["改革"],
    method="sentence"
)
```
## Multiple Target Words
You can analyze collocates for multiple target words simultaneously:
```python
# Find collocates for multiple target words
multi_collocates = find_collocates(
    sentences=sentences,
    target_words=["中国", "美国", "日本"],
    method="window",
    horizon=3
)

# Filter the combined results down to a single target word
china_collocates = multi_collocates[multi_collocates["target"] == "中国"]
```
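To work with each target's collocates in turn, you can also group the combined DataFrame by its `target` column; a short sketch using the columns described above:

```python
# Inspect each target word's collocates separately
for target, group in multi_collocates.groupby("target"):
    strongest = group.sort_values("ratio_local", ascending=False).head(5)
    print(f"Top collocates of {target}: {', '.join(strongest['collocate'])}")
```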
## Filtering Results

The `filters` parameter allows you to apply multiple filters to the results:
```python
# Define filters
filters = {
    'max_p': 0.05,  # Maximum p-value threshold
    'stopwords': ["的", "了", "在", "是", "和", "有", "被"],  # Words to exclude
    'min_length': 2  # Minimum character length
}

# Find collocates with filters
filtered_collocates = find_collocates(
    sentences=sentences,
    target_words=["经济"],
    filters=filters
)
```
### Available Filters

| Filter | Description |
|---|---|
| `max_p` | Maximum p-value threshold for statistical significance |
| `stopwords` | List of words to exclude from the results |
| `min_length` | Minimum character length for collocates |
### Filtering After Results

You can also apply filters after obtaining the results:

```python
# Keep only statistically significant collocates (p < 0.05)
significant_collocates = collocates[collocates["p_value"] < 0.05]

# Sort by strength of association
significant_collocates = significant_collocates.sort_values("ratio_local", ascending=False)
```
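The built-in filters shown earlier can also be replicated after the fact with ordinary pandas operations. A sketch, assuming the `collocate` column holds plain strings:

```python
stopwords = {"的", "了", "在", "是"}

# Exclude stopwords and keep collocates of at least two characters
filtered = significant_collocates[
    ~significant_collocates["collocate"].isin(stopwords)
    & (significant_collocates["collocate"].str.len() >= 2)
]
```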
## Visualizing Collocations

### Bar Chart of Top Collocates

```python
import matplotlib.pyplot as plt

# Get the top 10 collocates by significance
top10 = collocates.sort_values("p_value").head(10)

plt.figure(figsize=(10, 6))
plt.barh(
    top10["collocate"][::-1],  # Reverse for bottom-to-top display
    top10["ratio_local"][::-1]
)
plt.xlabel("Observed/Expected Ratio")
plt.title(f"Top Collocates of '{top10['target'].iloc[0]}'")
plt.tight_layout()
plt.show()
```
### Network Visualization

```python
import networkx as nx
import matplotlib.pyplot as plt

# Create a network of collocations
G = nx.Graph()

# Add the target word as a central node
target = "经济"
G.add_node(target, size=20)

# Add edges to significant collocates
significant = collocates[collocates["p_value"] < 0.01]
significant = significant.sort_values("ratio_local", ascending=False).head(15)

for _, row in significant.iterrows():
    collocate = row["collocate"]
    weight = row["ratio_local"]
    G.add_node(collocate, size=10)
    G.add_edge(target, collocate, weight=weight)

# Draw the network
plt.figure(figsize=(12, 12))
pos = nx.spring_layout(G)

# Draw nodes with different sizes
node_sizes = [G.nodes[node]["size"] * 50 for node in G.nodes()]
nx.draw_networkx_nodes(G, pos, node_size=node_sizes)

# Draw edges with weights affecting width
edge_weights = [G[u][v]["weight"] / 5 for u, v in G.edges()]
nx.draw_networkx_edges(G, pos, width=edge_weights, alpha=0.7)

# Add labels
nx.draw_networkx_labels(G, pos, font_size=12)

plt.axis("off")
plt.title(f"Collocation Network for '{target}'")
plt.tight_layout()
plt.show()
```
## Practical Examples

### Comparing Collocations Across Corpora

```python
import pandas as pd

# Find collocates in two different corpora
collocates_corpus1 = find_collocates(
    sentences=corpus1_sentences,
    target_words=["经济"],
    as_dataframe=True
)
collocates_corpus2 = find_collocates(
    sentences=corpus2_sentences,
    target_words=["经济"],
    as_dataframe=True
)

# Label and merge the dataframes to compare
collocates_corpus1["corpus"] = "Corpus 1"
collocates_corpus2["corpus"] = "Corpus 2"
combined = pd.concat([collocates_corpus1, collocates_corpus2])

# Find collocates that appear in both corpora
collocates1 = set(collocates_corpus1["collocate"])
collocates2 = set(collocates_corpus2["collocate"])
common_collocates = collocates1.intersection(collocates2)

# Compare the strength of association in both corpora
comparison = combined[combined["collocate"].isin(common_collocates)]
pivot = comparison.pivot(index="collocate", columns="corpus", values="ratio_local")
```
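From here you can rank the shared collocates by how much their association strength differs between the two corpora. A sketch building on the `pivot` table above:

```python
import numpy as np

# Log2 ratio: positive values mean a stronger association in Corpus 1,
# negative values a stronger association in Corpus 2
pivot["log2_diff"] = np.log2(pivot["Corpus 1"] / pivot["Corpus 2"])

# Show the collocates whose association strength diverges most
most_divergent = pivot["log2_diff"].abs().sort_values(ascending=False).head(10)
print(pivot.loc[most_divergent.index])
```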
### Tracking Collocations Over Time

```python
# Assume we have corpora from different time periods
periods = ["1980s", "1990s", "2000s", "2010s"]
period_data = dict(zip(periods, all_period_sentences))

# Track collocations of a term over time
target_word = "改革"
collocations_over_time = {}

for period, sentences in period_data.items():
    collocates = find_collocates(
        sentences=sentences,
        target_words=[target_word],
        as_dataframe=True
    )
    collocations_over_time[period] = collocates.sort_values("p_value").head(10)["collocate"].tolist()

# See which collocates appear in multiple periods
all_collocates = set()
for period_collocates in collocations_over_time.values():
    all_collocates.update(period_collocates)

presence_matrix = {collocate: [] for collocate in all_collocates}
for period in periods:
    for collocate in all_collocates:
        presence_matrix[collocate].append(collocate in collocations_over_time[period])

# Convert to a DataFrame for easy viewing (collocates as rows, periods as columns)
collocate_tracking = pd.DataFrame(presence_matrix, index=periods).T
```
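A quick way to take in the presence matrix at a glance is a heatmap. A minimal sketch with matplotlib, assuming the `collocate_tracking` DataFrame built above:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 8))
# Dark cells mark periods in which the collocate appears among the top 10
ax.imshow(collocate_tracking.astype(int).to_numpy(), cmap="Greys", aspect="auto")
ax.set_xticks(range(len(collocate_tracking.columns)))
ax.set_xticklabels(collocate_tracking.columns)
ax.set_yticks(range(len(collocate_tracking.index)))
ax.set_yticklabels(collocate_tracking.index)
ax.set_title(f"Top collocates of '{target_word}' by period")
plt.tight_layout()
plt.show()
```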
## Performance Considerations

- For large corpora, consider processing sentences in batches, as sketched below
- When analyzing multiple target words, the computation time increases linearly with the number of targets
- The sentence-based method is generally faster than the window-based method
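One way to batch the work is to run `find_collocates` on slices of the corpus and pool the observed counts afterwards. A minimal sketch, assuming the column names described earlier; note that p-values are only meaningful per batch, so only the raw counts are pooled here:

```python
import pandas as pd

def batched(seq, size):
    """Yield consecutive slices of `seq` containing at most `size` items."""
    for start in range(0, len(seq), size):
        yield seq[start:start + size]

# Run the analysis batch by batch, then pool the observed counts
parts = [
    find_collocates(sentences=batch, target_words=["经济"], as_dataframe=True)
    for batch in batched(sentences, 10_000)
]
pooled = (
    pd.concat(parts)
    .groupby(["target", "collocate"], as_index=False)[["obs_local", "obs_global"]]
    .sum()
)
```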