TextGraphs
Introduction
TextGraphs.jl offers graph representations of text, along with natural language processing (NLP) functionalities.
It is inspired by SpeechGraphs (https://repositorio.ufrn.br/jspui/handle/123456789/23273), which transforms text into graphs. Novel features of TextGraphs.jl include graph properties (e.g. centrality) and latent space embeddings (adding latent semantic information to graphs).
Julia uses multiple dispatch, which favors modular functions and high-performance computing. There is a previous object-oriented Python implementation by github/facuzeta.
Quick introduction
Check the documentation for further information.
Install
Install with Pkg.
pkg> add TextGraphs
You also need R and the R package udpipe installed.
$ sudo apt install r-base
$ sudo Rscript -e 'install.packages("udpipe")'
Features
Graph types
You can build the following graphs from text (AbstractString):
Raw
- Naive (naive_graph): uses the original sequence of words.
- Phrases (phrases_graph): uses the original sequence of phrases.
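As a rough illustration of what "uses the original sequence of words" means, the sketch below links consecutive tokens into a directed edge list using only Base Julia. It is not the TextGraphs.jl implementation: the helper name sketch_token_edges is hypothetical, and plain whitespace tokenization (no case folding, no punctuation handling) is an assumption.

```julia
# Hypothetical helper, not part of TextGraphs.jl: one node per unique token,
# one directed edge for each distinct pair of consecutive tokens.
function sketch_token_edges(text::AbstractString)
    tokens = split(text)                       # assumption: whitespace tokenization
    labels = unique(tokens)                    # node i carries labels[i]
    index = Dict(tok => i for (i, tok) in enumerate(labels))
    edges = Set{Tuple{Int,Int}}()              # repeated transitions collapse into one edge
    for k in 1:length(tokens)-1
        push!(edges, (index[tokens[k]], index[tokens[k+1]]))
    end
    return labels, edges
end

labels, edges = sketch_token_edges("Sample for graph")
println(length(labels), " nodes, ", length(edges), " edges")  # 3 nodes, 2 edges
```

Three nodes and two edges is the same shape that the {3, 2} summary under Usage reports for this sentence.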
POS, Stems and Lemmas
- Stem (stem_graph): uses stemmed words.
- Lemma (lemma_graph): uses lemmatized words.
- Part of Speech (POS, pos_graph): uses syntactical functions.
Latent space embeddings
- Latent space embedding (LSE, latent_space_graph) graphs.
Properties
You can obtain several properties of the graphs:
Direct measures: graph_props returns density, number of self loops, number of strongly connected components (SCCs), size of the largest SCC, and mean centrality (betweenness, closeness and eigenvector methods).
Erdős–Rényi ratios: rand_erdos_ratio_props returns the ratios of density and mean centrality between the graph and a random Erdős–Rényi graph with the same number of vertices and edges.
Usage
julia> using TextGraphs
julia> naive_graph("Sample for graph")
{3, 2} directed Int64 metagraph with Float64 weights defined by :weight (default weight 1.0)
julia> stem_graph("Sample for graph"; snowball_language="english") # optional keyword argument
{3, 2} directed Int64 metagraph with Float64 weights defined by :weight (default weight 1.0)
julia> graph_props(naive_graph("Sample for graph"))
Dict{String, Real} with 7 entries:
  "mean_close_centr" => 0.388889
  "size_largest_scc" => 1
  "num_strong_connect_comp" => 3
  "density" => 0.333333
  "num_self_loops" => 0
  "mean_between_centr" => 0.166667
  "mean_eig_centr" => 0.333335
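The density value reported above is consistent with the usual directed-graph definition: the number of edges divided by the n(n-1) possible ordered pairs of distinct vertices. The sketch below only checks that arithmetic. The helper name sketch_density is hypothetical, and that graph_props uses exactly this definition is an assumption.

```julia
# Hypothetical helper: density of a directed graph, with self loops excluded
# from the denominator, i.e. edges / (n * (n - 1)).
sketch_density(n_vertices::Integer, n_edges::Integer) =
    n_vertices < 2 ? 0.0 : n_edges / (n_vertices * (n_vertices - 1))

# The {3, 2} graph built from "Sample for graph" has 3 vertices and 2 edges.
println(round(sketch_density(3, 2); digits=6))  # 0.333333
```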
Plot
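using TextGraphs
using Graphs, MetaGraphs # nv and get_prop; needed unless TextGraphs re-exports them
using GraphMakie, GLMakie
using GraphMakie.NetworkLayout # provides the Spectral layout
# Build one graph per text; the variable names must match the ones used below
naive_g = naive_graph("Colorless green ideas sleep furiously")
stem_g = stem_graph("No meio do caminho tinha uma pedra tinha uma pedra no meio do caminho")
# Collect the :token property of every vertex to use as node labels
g_labels = map(x -> get_prop(naive_g, x, :token), 1:nv(naive_g))
stem_g_labels = map(x -> get_prop(stem_g, x, :token), 1:nv(stem_g))
graphplot(naive_g, nlabels=g_labels)
graphplot(stem_g, nlabels=stem_g_labels)
# Same graph with a three-dimensional spectral layout
spec3_layout = Spectral(dim=3)
graphplot(naive_g, node_size=30, nlabels=g_labels, layout=spec3_layout)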
<!–- Commenting out to debug doc CI on github
TextGraphs.add_prop_label_tokens — Method
add_prop_label_tokens(metagraph, metagraph_unique_tokens)
Add tokens as properties of nodes in a MetaGraph.
This function is used internally to attach word labels to each node. The number of unique tokens must equal the number of vertices.
TextGraphs.build_labelled_graph — Method
build_labelled_graph(x::AbstractArray)
This function is used internally to build a graph labelled with unique tokens.
TextGraphs.bypass_eigenvector_centrality — Method
bypass_eigenvector_centrality(g::Union{MetaDiGraph,SimpleGraph})
Calculate eigenvector centrality for each node in g.
This function returns an Array with either the eigenvector centrality values or missing. It is needed because LinAlg.jl.eigs seems to behave erratically, sometimes throwing a vector bounds error.
TextGraphs.erdos_graph_short — Method
erdos_graph_short(g::MetaDiGraph)
Generate a random Erdős–Rényi graph from a MetaDiGraph.
Short version of the erdos_renyi function that takes a MetaDiGraph instead of the number of vertices and edges.
TextGraphs.get_graph_labels — Method
get_graph_labels(g::MetaDiGraph)
Return graph labels.
TextGraphs.graph_props — Method
graph_props(g::MetaDiGraph)
Calculate several properties for a MetaDiGraph.
This function returns a Dict with numeric values for density, number of self loops, number of SCCs, size of the largest SCC, and mean centrality (betweenness, closeness and eigenvector methods).
TextGraphs.lemma_graph — Method
lemma_graph(my_text::AbstractString; text_language="english")
Build a lemmatized graph from text (AbstractString) using the R package udpipe.
Currently supports Portuguese and English corpora. The language defaults to "english"; any other value sets it to "portuguese".
TextGraphs.link_consecutive — Method
link_consecutive(array_with_tokens)
Transform serialized tokens into a directed graph.
This function is used internally to build graphs from text. Each unique token has a single node in the graph.
TextGraphs.mean_graph_centrs — Method
mean_graph_centrs(g::Union{MetaDiGraph,SimpleGraph})
Calculate mean values for centrality (betweenness, closeness and eigenvector methods).
This function returns a Dict with numeric values for each centrality method.
TextGraphs.naive_graph — Method
naive_graph(raw_text::AbstractString)
Build a graph from text (AbstractString) with unprocessed words.
TextGraphs.node_props — Method
node_props(g::Union{MetaDiGraph,SimpleGraph})
Calculate betweenness, closeness and eigenvector centralities for each node.
This function returns a Dict with vectors of values for each centrality method.
TextGraphs.phrases_graph — Method
phrases_graph(raw_text::AbstractString)
Build a graph from text (AbstractString) using sentences as unique tokens.
TextGraphs.pos_graph — Method
pos_graph(my_text)
Build a part-of-speech (POS) tagging graph from text (AbstractString) using the R package udpipe.
Currently supports Portuguese and English corpora.
TextGraphs.rand_erdos_props — Method
rand_erdos_props(g::MetaDiGraph; eval_method="z_score", n_samples=1000)
Calculate ratios between a given MetaDiGraph and corresponding random Erdős–Rényi graphs.
This function returns a Dict with numeric values for density, connected components and mean centralities (betweenness, closeness and eigenvector methods). It currently returns an error for some samples. eval_method must be either 'z_score' or 'ratio'.
TextGraphs.stem_graph — Method
stem_graph(my_text)
Build a graph from text (AbstractString) using stemmed words.
Stemming is performed with the Snowball.jl stemmer. The default language is "english".
TextGraphs.udp_import_annotations — Method
udp_import_annotations(raw_text)
Get an annotated DataFrame by importing the annotation object created by the R package udpipe.
This function is used internally.
TextGraphs.window_props — Function
window_props(raw_text, nwindow=5, txt_stepsize=1, graph_function=naive_graph; prop_type="raw", rnd_eval_method="ratio")
Calculate average properties from windowed subsets of text.
The user must provide the source text, window length, step size and graph building function (e.g. naive_graph). Set prop_type to 'random' to obtain properties from Erdős–Rényi graphs. If prop_type is 'random', set rnd_eval_method to either 'ratio' (default) or 'z_score'.
TextGraphs.window_props_lemma — Function
window_props_lemma(raw_text, nwindow=5, txt_stepsize=1, text_language="english")
Calculate average properties from windowed subsets of lemmatized text.
The user must provide the source text, window length, step size and text language. This function is faster than using window_props with graph_function=lemma_graph.
–>