Title: | The Reinert Method for Textual Data Clustering |
---|---|
Description: | An R implementation of the Reinert text clustering method. For more details about the algorithm see the included vignettes or Reinert (1990) <doi:10.1177/075910639002600103>. |
Authors: | Julien Barnier [aut, cre], Florian Privé [ctb] |
Maintainer: | Julien Barnier <[email protected]> |
License: | GPL (>= 3) |
Version: | 0.3.1.9000 |
Built: | 2024-11-01 04:55:57 UTC |
Source: | https://github.com/juba/rainette |
Split a dtm into two clusters with the Reinert algorithm
cluster_tab(dtm, cc_test = 0.3, tsj = 3)
dtm |
dtm to be split, passed by rainette() |
cc_test |
maximum contingency coefficient value for the feature to be kept in both groups. |
tsj |
minimum feature frequency in the dtm |
Internal function, not to be used directly
An object of class hclust and rainette.
Returns the number of segments of each cluster for each source document
clusters_by_doc_table(obj, clust_var = NULL, doc_id = NULL, prop = FALSE)
obj |
a corpus, tokens or dtm object |
clust_var |
name of the docvar with the clusters |
doc_id |
docvar identifying the source document |
prop |
if TRUE, returns the percentage of each cluster by document |
This function is only useful for previously segmented corpora. If doc_id
is NULL and there is a segment_source
docvar, it will be used instead.
require(quanteda)
corpus <- data_corpus_inaugural
corpus <- head(corpus, n = 10)
corpus <- split_segments(corpus)
tok <- tokens(corpus, remove_punct = TRUE)
tok <- tokens_remove(tok, stopwords("en"))
dtm <- dfm(tok, tolower = TRUE)
dtm <- dfm_trim(dtm, min_docfreq = 2)
res <- rainette(dtm, k = 3, min_segment_size = 15)
corpus$cluster <- cutree(res, k = 3)
clusters_by_doc_table(corpus, clust_var = "cluster", prop = TRUE)
Cut a tree into groups
cutree(tree, ...)
tree |
the hclust tree object to be cut |
... |
arguments passed to other methods |
If tree is of class rainette, invokes cutree_rainette(). Otherwise, runs stats::cutree().
A vector with group membership.
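For instance, following the preprocessing pipeline used throughout these examples, cutree() can be applied directly to a rainette result (a sketch, not part of the original examples):

```r
require(quanteda)
require(rainette)

# Segment and tokenize a small corpus, then cluster it
corpus <- split_segments(head(data_corpus_inaugural, n = 10))
tok <- tokens_remove(tokens(corpus, remove_punct = TRUE), stopwords("en"))
dtm <- dfm_trim(dfm(tok, tolower = TRUE), min_docfreq = 2)
res <- rainette(dtm, k = 3, min_segment_size = 15)

# Dispatches to cutree_rainette() because res is of class rainette
groups <- cutree(res, k = 3)
table(groups)
```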
Cut a rainette result tree into groups of documents
cutree_rainette(hres, k = NULL, h = NULL, ...)
hres |
the rainette result object |
k |
the desired number of clusters |
h |
unsupported |
... |
arguments passed to other methods |
A vector with group membership.
Cut a rainette2 result object into groups of documents
cutree_rainette2(res, k, criterion = c("chi2", "n"), ...)
res |
the rainette2 result object |
k |
the desired number of clusters |
criterion |
criterion to use to choose the best partition. |
... |
arguments passed to other methods |
A vector with group membership.
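A sketch, assuming res is a rainette2 result such as the one computed in the rainette2() example elsewhere in this manual:

```r
# Choose the best partition in 3 groups by total chi-squared value
groups <- cutree_rainette2(res, k = 3, criterion = "chi2")
# Documents that belong to no crossed cluster are set to NA
table(groups, useNA = "ifany")
```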
Returns, for each cluster, the number of source documents with at least n segments of this cluster
docs_by_cluster_table(obj, clust_var = NULL, doc_id = NULL, threshold = 1)
obj |
a corpus, tokens or dtm object |
clust_var |
name of the docvar with the clusters |
doc_id |
docvar identifying the source document |
threshold |
the minimal number of segments of a given cluster that a document must include to be counted |
This function is only useful for previously segmented corpora. If doc_id
is NULL and there is a segment_source
docvar, it will be used instead.
require(quanteda)
corpus <- data_corpus_inaugural
corpus <- head(corpus, n = 10)
corpus <- split_segments(corpus)
tok <- tokens(corpus, remove_punct = TRUE)
tok <- tokens_remove(tok, stopwords("en"))
dtm <- dfm(tok, tolower = TRUE)
dtm <- dfm_trim(dtm, min_docfreq = 2)
res <- rainette(dtm, k = 3, min_segment_size = 15)
corpus$cluster <- cutree(res, k = 3)
docs_by_cluster_table(corpus, clust_var = "cluster")
Import a corpus in Iramuteq format
import_corpus_iramuteq(f, id_var = NULL, thematics = c("remove", "split"), ...)
f |
a file name or a connection |
id_var |
name of metadata variable to be used as documents id |
thematics |
if "remove", thematics lines are removed. If "split", texts are split at each thematic, and metadata duplicated accordingly |
... |
arguments passed to |
A description of the Iramuteq corpus format can be found here: http://www.iramuteq.org/documentation/html/2-2-2-les-regles-de-formatages
A quanteda corpus object. Note that metadata variables in docvars are all imported as characters.
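A minimal sketch of a call (the file name "corpus.txt" and the metadata variable "sujet" below are hypothetical, for illustration only):

```r
# "corpus.txt" is assumed to contain Iramuteq metadata lines such as:
# **** *sujet_01 *annee_1990
corp <- import_corpus_iramuteq("corpus.txt", id_var = "sujet", thematics = "remove")
# Metadata variables are imported as character docvars
quanteda::docvars(corp)
```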
Merge small segments of a dtm until each reaches min_segment_size, adding a rainette_uc_id docvar identifying the merged units
merge_segments(dtm, min_segment_size = 10, doc_id = NULL)
dtm |
dtm of segments |
min_segment_size |
minimum number of forms by segment |
doc_id |
character name of a dtm docvar which identifies source documents. |
If min_segment_size == 0, no segments are merged together.
If min_segment_size > 0, then doc_id must be provided, unless the corpus comes from split_segments(), in which case segment_source is used by default.
the original dtm with a new rainette_uc_id docvar.
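A sketch of a typical call, assuming the dtm was built from a split_segments() corpus:

```r
require(quanteda)
require(rainette)

corpus <- split_segments(head(data_corpus_inaugural, n = 10))
dtm <- dfm(tokens(corpus, remove_punct = TRUE))

# Merge consecutive segments until each unit has at least 20 forms
dtm_merged <- merge_segments(dtm, min_segment_size = 20)
head(docvars(dtm_merged)$rainette_uc_id)
```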
Return document indices ordered by their coordinates on the first axis of a correspondence analysis (CA)
order_docs(m)
m |
dtm on which to compute the CA and order documents, converted to an integer matrix. |
Internal function, not to be used directly
ordered list of document indices
Corpus clustering based on the Reinert method - Simple clustering
rainette(
  dtm,
  k = 10,
  min_segment_size = 0,
  doc_id = NULL,
  min_split_members = 5,
  cc_test = 0.3,
  tsj = 3,
  min_members,
  min_uc_size
)
dtm |
quanteda dfm object of documents to cluster, usually the
result of dfm() |
k |
maximum number of clusters to compute |
min_segment_size |
minimum number of forms by document |
doc_id |
character name of a dtm docvar which identifies source documents. |
min_split_members |
don't try to split groups with fewer members than this value |
cc_test |
contingency coefficient value for feature selection |
tsj |
minimum frequency value for feature selection |
min_members |
deprecated, use min_split_members instead |
min_uc_size |
deprecated, use min_segment_size instead |
See the references for original articles on the method. Computations and results may differ quite a bit, see the package vignettes for more details.
The dtm object is automatically converted to boolean.
If min_segment_size > 0, then doc_id must be provided unless the corpus comes from split_segments(), in which case segment_source is used by default.
The result is a list of both class hclust and rainette. Besides the elements of an hclust object, two more results are available:
uce_groups
gives the group of each document for each value of k
group
gives the group of each document for the maximum value of k available
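A sketch of accessing these elements (following the pipeline used in the examples):

```r
require(quanteda)
require(rainette)

corpus <- split_segments(head(data_corpus_inaugural, n = 10))
dtm <- dfm(tokens(corpus, remove_punct = TRUE))
res <- rainette(dtm, k = 5, min_segment_size = 15)

res$group            # group of each document at the maximum k
str(res$uce_groups)  # group memberships for each computed value of k
cutree(res, k = 3)   # equivalent way to extract a given partition
```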
Reinert M, Une méthode de classification descendante hiérarchique : application à l'analyse lexicale par contexte, Cahiers de l'analyse des données, Volume 8, Numéro 2, 1983. http://www.numdam.org/item/?id=CAD_1983__8_2_187_0
Reinert M., Alceste une méthodologie d'analyse des données textuelles et une application: Aurelia De Gerard De Nerval, Bulletin de Méthodologie Sociologique, Volume 26, Numéro 1, 1990. doi:10.1177/075910639002600103
split_segments()
, rainette2()
, cutree_rainette()
, rainette_plot()
, rainette_explor()
require(quanteda)
corpus <- data_corpus_inaugural
corpus <- head(corpus, n = 10)
corpus <- split_segments(corpus)
tok <- tokens(corpus, remove_punct = TRUE)
tok <- tokens_remove(tok, stopwords("en"))
dtm <- dfm(tok, tolower = TRUE)
dtm <- dfm_trim(dtm, min_docfreq = 3)
res <- rainette(dtm, k = 3, min_segment_size = 15)
Shiny gadget for rainette clustering exploration
rainette_explor(res, dtm = NULL, corpus_src = NULL)
res |
result object of a rainette() clustering |
dtm |
the dfm object used to compute the clustering |
corpus_src |
the quanteda corpus object used to compute the dtm |
No return value, called for side effects.
rainette_plot
## Not run:
require(quanteda)
corpus <- data_corpus_inaugural
corpus <- head(corpus, n = 10)
corpus <- split_segments(corpus)
tok <- tokens(corpus, remove_punct = TRUE)
tok <- tokens_remove(tok, stopwords("en"))
dtm <- dfm(tok, tolower = TRUE)
dtm <- dfm_trim(dtm, min_docfreq = 3)
res <- rainette(dtm, k = 3, min_segment_size = 15)
rainette_explor(res, dtm, corpus)
## End(Not run)
Generate a clustering description plot from a rainette result
rainette_plot(
  res,
  dtm,
  k = NULL,
  type = c("bar", "cloud"),
  n_terms = 15,
  free_scales = FALSE,
  measure = c("chi2", "lr", "frequency", "docprop"),
  show_negative = FALSE,
  text_size = NULL,
  show_na_title = TRUE,
  cluster_label = NULL,
  keyness_plot_xlab = NULL,
  colors = NULL
)
res |
result object of a rainette() clustering |
dtm |
the dfm object used to compute the clustering |
k |
number of groups. If NULL, use the biggest number possible |
type |
type of term plots: barplot or wordcloud |
n_terms |
number of terms to display in keyness plots |
free_scales |
if TRUE, all the keyness plots will have the same scale |
measure |
statistics to compute |
show_negative |
if TRUE, show negative keyness features |
text_size |
font size for barplots, max word size for wordclouds |
show_na_title |
if TRUE, show number of NA as plot title |
cluster_label |
define a specific term for clusters identification in keyness plots. Default is "Cluster" or "Cl." depending on the number of groups. If a vector of length > 1, define the cluster labels manually. |
keyness_plot_xlab |
define a specific x label for keyness plots. |
colors |
vector of custom colors for cluster titles and branches (in the order of the clusters) |
A gtable object.
quanteda.textstats::textstat_keyness()
, rainette_explor()
, rainette_stats()
require(quanteda)
corpus <- data_corpus_inaugural
corpus <- head(corpus, n = 10)
corpus <- split_segments(corpus)
tok <- tokens(corpus, remove_punct = TRUE)
tok <- tokens_remove(tok, stopwords("en"))
dtm <- dfm(tok, tolower = TRUE)
dtm <- dfm_trim(dtm, min_docfreq = 3)
res <- rainette(dtm, k = 3, min_segment_size = 15)
rainette_plot(res, dtm)
rainette_plot(
  res, dtm,
  cluster_label = c("Assets", "Future", "Values"),
  colors = c("red", "slateblue", "forestgreen")
)
Generate cluster keyness statistics from a rainette result
rainette_stats(
  groups,
  dtm,
  measure = c("chi2", "lr", "frequency", "docprop"),
  n_terms = 15,
  show_negative = TRUE,
  max_p = 0.05
)
groups |
group membership computed by cutree() |
dtm |
the dfm object used to compute the clustering |
measure |
statistics to compute |
n_terms |
number of terms to display in keyness plots |
show_negative |
if TRUE, show negative keyness features |
max_p |
maximum keyness statistic p-value |
A list with, for each group, a data.frame of keyness statistics for the most specific n_terms features.
quanteda.textstats::textstat_keyness()
, rainette_explor()
, rainette_plot()
require(quanteda)
corpus <- data_corpus_inaugural
corpus <- head(corpus, n = 10)
corpus <- split_segments(corpus)
tok <- tokens(corpus, remove_punct = TRUE)
tok <- tokens_remove(tok, stopwords("en"))
dtm <- dfm(tok, tolower = TRUE)
dtm <- dfm_trim(dtm, min_docfreq = 3)
res <- rainette(dtm, k = 3, min_segment_size = 15)
groups <- cutree_rainette(res, k = 3)
rainette_stats(groups, dtm)
Corpus clustering based on the Reinert method - Double clustering
rainette2(
  x,
  y = NULL,
  max_k = 5,
  min_segment_size1 = 10,
  min_segment_size2 = 15,
  doc_id = NULL,
  min_members = 10,
  min_chi2 = 3.84,
  parallel = FALSE,
  full = TRUE,
  uc_size1,
  uc_size2,
  ...
)
x |
either a quanteda dfm object or the result of rainette() |
y |
if x is a rainette() result, another rainette() result computed on the same dfm with a different min_segment_size |
max_k |
maximum number of clusters to compute |
min_segment_size1 |
if x is a dfm, minimum segment size for the first clustering |
min_segment_size2 |
if x is a dfm, minimum segment size for the second clustering |
doc_id |
character name of a dtm docvar which identifies source documents. |
min_members |
minimum members of each cluster |
min_chi2 |
minimum chi2 for each cluster |
parallel |
if TRUE, use parallel::mclapply to compute partitions (not available on Windows) |
full |
if TRUE, all crossed groups are kept to compute optimal partitions, otherwise only the most mutually associated groups are kept. |
uc_size1 |
deprecated, use min_segment_size1 instead |
uc_size2 |
deprecated, use min_segment_size2 instead |
... |
if x is a dfm, parameters passed to rainette() for both clusterings |
You can pass a quanteda dfm as the x argument: the function then performs two simple clusterings with varying minimum segment sizes, and then proceeds to find optimal partitions based on the results of both clusterings.
If both clusterings have already been computed, you can pass them as the x and y arguments, and the function will only look for optimal partitions.
doc_id must be provided unless the corpus comes from split_segments(), in which case segment_source is used by default.
If full = FALSE
, computation may be much faster, but the chi2 criterion will be the only
one available for best partition detection, and the result may not be optimal.
For more details on optimal partitions search algorithm, please see package vignettes.
A tibble with optimal partitions found for each available value of k as rows, and the following columns:
clusters
list of the crossed original clusters used in the partition
k
the number of clusters
chi2
sum of the chi2 value of each cluster
n
sum of the size of each cluster
groups
group membership of each document for this partition (NA
if not assigned)
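A sketch of inspecting this tibble, assuming res comes from the rainette2() example below:

```r
res[, c("k", "chi2", "n")]              # one optimal partition per value of k
groups <- cutree_rainette2(res, k = 3)  # extract document memberships for k = 3
table(groups, useNA = "ifany")          # NA = documents not assigned
```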
Reinert M, Une méthode de classification descendante hiérarchique : application à l'analyse lexicale par contexte, Cahiers de l'analyse des données, Volume 8, Numéro 2, 1983. http://www.numdam.org/item/?id=CAD_1983__8_2_187_0
Reinert M., Alceste une méthodologie d'analyse des données textuelles et une application: Aurelia De Gerard De Nerval, Bulletin de Méthodologie Sociologique, Volume 26, Numéro 1, 1990. doi:10.1177/075910639002600103
rainette()
, cutree_rainette2()
, rainette2_plot()
, rainette2_explor()
require(quanteda)
corpus <- data_corpus_inaugural
corpus <- head(corpus, n = 10)
corpus <- split_segments(corpus)
tok <- tokens(corpus, remove_punct = TRUE)
tok <- tokens_remove(tok, stopwords("en"))
dtm <- dfm(tok, tolower = TRUE)
dtm <- dfm_trim(dtm, min_docfreq = 3)
res1 <- rainette(dtm, k = 5, min_segment_size = 10)
res2 <- rainette(dtm, k = 5, min_segment_size = 15)
res <- rainette2(res1, res2, max_k = 4)
Starting with group memberships computed from a rainette2 clustering, every document not assigned to a cluster is reassigned using a k-nearest neighbour classification.
rainette2_complete_groups(dfm, groups, k = 1, ...)
dfm |
dfm object used for the rainette2 clustering |
groups |
group membership computed by cutree_rainette2() |
k |
number of neighbours considered. |
... |
other arguments passed to FNN::knn() |
Completed group membership vector.
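A sketch, assuming dtm and res as computed in the rainette2() example:

```r
groups <- cutree_rainette2(res, k = 3)
# Reassign NA documents to the cluster of their nearest neighbour
full_groups <- rainette2_complete_groups(dtm, groups, k = 1)
sum(is.na(full_groups))  # should now be 0
```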
cutree_rainette2()
, FNN::knn()
Shiny gadget for rainette2 clustering exploration
rainette2_explor(res, dtm = NULL, corpus_src = NULL)
res |
result object of a rainette2() clustering |
dtm |
the dfm object used to compute the clustering |
corpus_src |
the quanteda corpus object used to compute the dtm |
No return value, called for side effects.
Generate a clustering description plot from a rainette2 result
rainette2_plot(
  res,
  dtm,
  k = NULL,
  criterion = c("chi2", "n"),
  complete_groups = FALSE,
  type = c("bar", "cloud"),
  n_terms = 15,
  free_scales = FALSE,
  measure = c("chi2", "lr", "frequency", "docprop"),
  show_negative = FALSE,
  text_size = 10
)
res |
result object of a rainette2() clustering |
dtm |
the dfm object used to compute the clustering |
k |
number of groups. If NULL, use the biggest number possible |
criterion |
criterion to use to choose the best partition. |
complete_groups |
if TRUE, documents with an NA cluster are reassigned by k-means clustering initialised with the current group centers. |
type |
type of term plots: barplot or wordcloud |
n_terms |
number of terms to display in keyness plots |
free_scales |
if TRUE, all the keyness plots will have the same scale |
measure |
statistics to compute |
show_negative |
if TRUE, show negative keyness features |
text_size |
font size for barplots, max word size for wordclouds |
A gtable object.
quanteda.textstats::textstat_keyness()
, rainette2_explor()
, rainette2_complete_groups()
Remove features from the dtm of each group based on cc_test and tsj
select_features(m, indices1, indices2, cc_test = 0.3, tsj = 3)
m |
global dtm |
indices1 |
indices of documents of group 1 |
indices2 |
indices of documents of group 2 |
cc_test |
maximum contingency coefficient value for the feature to be kept in both groups. |
tsj |
minimum feature frequency in the dtm |
Internal function, not to be used directly
a list of two character vectors: cols1 contains the names of the features to keep in group 1, cols2 the names of the features to keep in group 2
Split a character string or corpus into segments, taking into account punctuation where possible
split_segments(obj, segment_size = 40, segment_size_window = NULL)

## S3 method for class 'character'
split_segments(obj, segment_size = 40, segment_size_window = NULL)

## S3 method for class 'Corpus'
split_segments(obj, segment_size = 40, segment_size_window = NULL)

## S3 method for class 'corpus'
split_segments(obj, segment_size = 40, segment_size_window = NULL)

## S3 method for class 'tokens'
split_segments(obj, segment_size = 40, segment_size_window = NULL)
obj |
character string, quanteda or tm corpus object |
segment_size |
segment size (in words) |
segment_size_window |
window around segment size to look for best splitting point |
If obj is a tm or quanteda corpus object, the result is a quanteda corpus.
require(quanteda)
split_segments(data_corpus_inaugural)
Switch documents between two groups to maximize chi-square value
switch_docs(m, indices, max_index, max_chisq)
m |
original dtm |
indices |
document indices ordered by their first CA axis coordinates |
max_index |
document index where the split is maximum |
max_chisq |
maximum chi-square value |
Internal function, not to be used directly
a list of two vectors, indices1 and indices2, containing the document indices of each group after document switching, and chisq, the new chi-square value after switching