TextGraphs


Introduction

TextGraphs.jl offers graph representations of text, along with natural language processing (NLP) functionalities.

It is inspired by SpeechGraphs (https://repositorio.ufrn.br/jspui/handle/123456789/23273), which transforms text into graphs. Novel features of TextGraphs.jl include graph properties (e.g. centrality) and latent space embeddings, which add latent semantic information to graphs.

Julia relies on multiple dispatch, which favors modular functions and high-performance computing. A previous object-oriented Python implementation is available by github/facuzeta.


Quick introduction

Check the documentation for further information.

Install

Install with Pkg.

pkg> add TextGraphs

You also need R and the R package udpipe installed.

$ sudo apt install r-base
$ sudo Rscript -e 'install.packages("udpipe")'

Features

Graph types

You can build the following graphs from text (AbstractString):

Raw

  • Naive (naive_graph) uses the original sequence of words.
  • Phrases (phrases_graph) uses the original sequence of phrases.
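
As a rough illustration of what "uses the original sequence of words" means, the sketch below gives each unique word one node and links every word to the word that follows it. It uses only Julia's standard library. The helper `link_words` is hypothetical and written for this example; it is not the TextGraphs.jl implementation, and the whitespace tokenization is an assumption.

```julia
# Hypothetical helper for illustration only: not part of TextGraphs.jl.
# Gives each unique word a node id and links consecutive words.
function link_words(text::AbstractString)
    tokens = split(text)                       # whitespace tokenization (assumption)
    ids = Dict{String,Int}()                   # unique word => node id
    for t in tokens
        get!(ids, String(t), length(ids) + 1)  # assign the next id on first sight
    end
    edges = Set{Tuple{Int,Int}}()              # directed edges; repeats collapse
    for i in 1:length(tokens)-1
        push!(edges, (ids[String(tokens[i])], ids[String(tokens[i+1])]))
    end
    return ids, edges
end

ids, edges = link_words("Sample for graph")
println(length(ids), " nodes, ", length(edges), " edges")   # 3 nodes, 2 edges
```

The three nodes and two edges agree with the `{3, 2}` summary that `naive_graph("Sample for graph")` prints in the Usage section.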

POS, Stems and Lemmas

  • Stem (stem_graph) uses stemmed words.
  • Lemma (lemma_graph) uses lemmatized words.
  • Part of Speech (POS, pos_graph) uses the syntactical function of each word.

Latent space embeddings

  • Latent space embedding (LSE, latent_space_graph) graphs.

Properties

You can obtain several properties of the graphs:

Direct measures: graph_props returns the density, number of self loops, number of strongly connected components (SCCs), size of the largest SCC, and mean centrality (betweenness, closeness and eigenvector methods).
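
To make two of these measures concrete, here is a minimal sketch in plain Julia that computes density and the number of self loops for a small directed graph given as an edge list. It assumes the usual definition of directed density, edges divided by n(n - 1), and is not taken from the package source.

```julia
# Directed path 1 -> 2 -> 3, e.g. the words of "Sample for graph".
n_vertices = 3
edge_list = [(1, 2), (2, 3)]

# Density of a directed graph: number of edges over n * (n - 1) possible edges.
density = length(edge_list) / (n_vertices * (n_vertices - 1))

# A self loop is an edge whose source and target are the same node.
num_self_loops = count(e -> e[1] == e[2], edge_list)

println(round(density, digits=6))   # 0.333333
println(num_self_loops)             # 0
```

Both values match the "density" and "num_self_loops" entries shown for the same sentence in the Usage section.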

Erdős–Rényi ratios: rand_erdos_ratio_props returns the ratios of density and mean centrality between the graph and a random Erdős–Rényi graph with the same number of vertices and edges.
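
The idea behind such a ratio can be sketched directly with Graphs.jl, assuming that package is installed. The example compares the mean closeness centrality of a small directed graph against a single random Erdős–Rényi graph of the same size. Using only one random sample is a simplification made for this example, and the code is not the package's own implementation.

```julia
using Graphs                 # assumed to be installed
using Statistics: mean

# Directed path 1 -> 2 -> 3, standing in for a small text graph.
g = SimpleDiGraph(3)
add_edge!(g, 1, 2)
add_edge!(g, 2, 3)

# Random directed Erdős–Rényi graph with the same number of vertices and edges.
r = erdos_renyi(nv(g), ne(g); is_directed=true)

# Ratio of mean closeness centrality: text graph over random graph.
ratio = mean(closeness_centrality(g)) / mean(closeness_centrality(r))
println(ratio)
```

Because the random graph changes on every run, the printed ratio varies; averaging over many random samples gives a more stable estimate.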

Usage

julia> using TextGraphs
julia> naive_graph("Sample for graph")
{3, 2} directed Int64 metagraph with Float64 weights defined by :weight (default weight 1.0)
julia> stem_graph("Sample for graph"; snowball_language="english") # optional keyword argument
{3, 2} directed Int64 metagraph with Float64 weights defined by :weight (default weight 1.0)
julia> graph_props(naive_graph("Sample for graph"))
Dict{String, Real} with 7 entries:
  "mean_close_centr"        => 0.388889
  "size_largest_scc"        => 1
  "num_strong_connect_comp" => 3
  "density"                 => 0.333333
  "num_self_loops"          => 0
  "mean_between_centr"      => 0.166667
  "mean_eig_centr"          => 0.333335

Plot

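using TextGraphs
using Graphs, MetaGraphs          # nv and get_prop, in case TextGraphs does not re-export them
using GraphMakie, GLMakie
using GraphMakie.NetworkLayout    # provides the Spectral layout

# Build the graphs to plot
naive_g = naive_graph("Colorless green ideas sleep furiously")
stem_g = stem_graph("No meio do caminho tinha uma pedra tinha uma pedra no meio do caminho")

# Label each node with its :token property
g_labels = [get_prop(naive_g, v, :token) for v in 1:nv(naive_g)]
stem_g_labels = [get_prop(stem_g, v, :token) for v in 1:nv(stem_g)]
graphplot(naive_g, nlabels=g_labels)
graphplot(stem_g, nlabels=stem_g_labels)

# Same naive graph with a 3D spectral layout
spec3_layout = Spectral(dim=3)
graphplot(naive_g, node_size=30, nlabels=g_labels, layout=spec3_layout)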

<!–- Commenting out to debug doc CI on github

TextGraphs.add_prop_label_tokens (Method)
add_prop_label_tokens(metagraph,metagraph_unique_tokens)

Add tokens as properties of nodes in a MetaGraph.

This function is used internally to attach word labels to each node. The number of unique tokens must equal the number of vertices.

TextGraphs.bypass_eigenvector_centrality (Method)
bypass_eigenvector_centrality(g::Union{MetaDiGraph,SimpleGraph})

Calculate eigenvector centrality for each node in g.

This function returns an Array with either the eigenvector centrality values or missing. It is needed because LinAlg.jl.eigs seems to behave erratically, sometimes returning a vector bounds error.

TextGraphs.erdos_graph_short (Method)
erdos_graph_short(g::MetaDiGraph)

Generate a random Erdős–Rényi graph from a MetaDiGraph.

Short version of the erdos_renyi function that takes a MetaDiGraph instead of the number of vertices and edges.

TextGraphs.graph_props (Method)
graph_props(g::MetaDiGraph)

Calculate several properties for a MetaDiGraph.

This function returns a Dict with numeric values for density, number of self loops, number of SCCs, size of the largest SCC, and mean centrality (betweenness, closeness and eigenvector methods).

TextGraphs.lemma_graph (Method)
lemma_graph(my_text::AbstractString;text_language="english")

Build a lemmatized graph from text (AbstractString) using the R package udpipe.

Currently supports Portuguese and English corpora. The language defaults to "english"; any other value sets it to "portuguese".

TextGraphs.link_consecutive (Method)
link_consecutive(array_with_tokens)

Transform serialized tokens into a directed graph.

This function is used internally to build graphs from text. Each token has a unique node in the graph.

TextGraphs.mean_graph_centrs (Method)
mean_graph_centrs(g::Union{MetaDiGraph,SimpleGraph})

Calculate mean values for centrality (betweenness, closeness and eigenvector methods).

This function returns a Dict with numeric values for each centrality method.

TextGraphs.node_props (Method)
node_props(g::Union{MetaDiGraph,SimpleGraph})

Calculate betweenness, closeness and eigenvector centralities for each node.

This function returns a Dict with vectors of values for each centrality method.

TextGraphs.phrases_graph (Method)
phrases_graph(raw_text::AbstractString)

Build graph from text (AbstractString) using sentences as unique tokens.

TextGraphs.pos_graph (Method)
pos_graph(my_text)

Build a POS tagging graph from text (AbstractString) using the R package udpipe.

Currently supports Portuguese and English corpora.

TextGraphs.rand_erdos_props (Method)
rand_erdos_props(g::MetaDiGraph;eval_method="z_score",n_samples=1000)

Calculate ratios between a given MetaDiGraph and corresponding random Erdős–Rényi graphs.

This function returns a Dict with numeric values for density, connected components and mean centralities (betweenness, closeness and eigenvector methods). It currently returns an error for some samples. eval_method must be either 'z_score' or 'ratio'.

TextGraphs.stem_graph (Method)
stem_graph(my_text)

Build graph from text (AbstractString) using stemmed words.

Stemming is performed with the Snowball.jl stemmer. The default language is "english".

TextGraphs.udp_import_annotations (Method)
udp_import_annotations(raw_text)

Get an annotated DataFrame by importing the R::udpipe object created with udpipe::annotate.

This function is used internally.

TextGraphs.window_props (Function)
window_props(raw_text,nwindow=5,txt_stepsize=1,graph_function=naive_graph;prop_type="raw",rnd_eval_method="ratio")

Calculate average properties from windowed subsets of text.

The user must provide the source text, window length, step size and graph building function (e.g. naive_graph). Set prop_type to 'random' to obtain properties from Erdős–Rényi graphs. If prop_type is 'random', set rnd_eval_method to either 'ratio' (default) or 'z_score'.

TextGraphs.window_props_lemma (Function)
window_props_lemma(raw_text,nwindow=5,txt_stepsize=1,text_language="english")

Calculate average properties from windowed subsets of lemmatized text.

The user must provide the source text, window length, step size and text language. This function is faster than using window_props with graph_function=lemma_graph.


–>