Is a Knowledge Graph capable of capturing human knowledge?
In recent years, Knowledge Graphs have been used to tackle one of the biggest problems not only in machine learning but also in computer science in general: how to represent knowledge.
“Knowledge representation and reasoning is the area of Artificial Intelligence (AI) concerned with how knowledge can be represented symbolically and manipulated in an automated way by reasoning programs. More informally, it is the part of AI that is concerned with thinking, and how thinking contributes to intelligent behavior.” [Brachman and Levesque, 2004]
This aspect is critical since any “agent” — human, animal, electronic, or mechanical — requires knowledge to behave intelligently. Think about us as humans: for a very wide range of activities, we make decisions based on what we effortlessly and unconsciously know (or believe) about the world. Our [intelligent] behaviour is clearly conditioned, if not dominated, by knowledge.
Knowledge representation and reasoning focuses on the knowledge, not the knower. In this context, a graph-based representation is becoming one of the most prominent approaches, thanks to its flexibility in representing concepts, and the relationships among them, in a simple and generic data structure.
Despite this broad adoption, limited work has been done to prove formally that such a representation is really able to capture knowledge properly. If demonstrated, it would be the foundation for extracting relevant insights and for making autonomous agents (machine learning agents in this case) capable of “making decisions” or helping humans do so.
This blog post aims to prove this thesis in a concrete and formal way by answering the following (simple) question:
Is a knowledge graph capable of capturing human knowledge?
Starting from a formal definition of what a knowledge graph is, and considering a concrete example in a relevant domain, we will move to the formal proof, leveraging a graph embedding approach to encode graph structures in a multidimensional space where mathematical operations are possible.
What is a Knowledge Graph?
There is no gold-standard, universally accepted answer to this question, but my favorite definition is the one given by Gomez-Perez et al. [Gomez-Perez et al., 2020]:
“A knowledge graph consists of a set of interconnected typed entities and their attributes.”
According to this definition, the basic unit of a Knowledge Graph is the representation of an entity, such as a person, organization, or location, or perhaps a sporting event or a book or movie. Each entity might have various attributes. For a person, those attributes would include the name, address, birth date, and so on. Entities are connected to each other by relations: for example, a person works for a company, and a user likes a page or follows another user. Relations can also be used to bridge two separate Knowledge Graphs [Negro, 2021].
Even though the previous definition is one of the simplest available, I prefer to make it even simpler:
“A knowledge graph is a graph containing knowledge.”
I like this definition because it is in line with the definitions of other, similar graphs. What is a social network, for example? It is a graph where nodes represent people and relationships represent social connections among them. By analogy, a Knowledge Graph is a graph where each node contains a relevant piece of information (a concept, which can be a person, a gene, a disease — whatever is relevant enough to persist as an entity in the specific domain) and each relationship captures a “connection”, logical or real, between two related concepts.
Each of these “specialized” graphs contains latent features that reflect inner statistical properties, such as the tendency of some entities to be related to others by similar characteristics. For example, in a social network, homophily is the driving force behind these latent features. It boils down to the expression “Birds of a feather flock together.” Friendships are mostly based on similar interests, origins or experiences. Relationships determine which people are influenced by whom and the extent to which information is exchanged among them. In a Knowledge Graph these relationships are driven by domain-specific connections, such as the relationship between a gene and the protein it regulates.
It is worth noting here that both definitions are completely agnostic about the database used to store the graph. In this blog post there is no discussion of the RDF format versus the Labeled Property Graph model. I personally prefer the latter and, hence, it is used throughout the examples and the proof. In particular, I use Neo4j to store and process my Knowledge Graph. Nevertheless, the results are replicable on any other database.
The use case
As is clear from the discussion above, in any Knowledge Graph the domain is king. At the time of writing, the coronavirus disease is spreading across the entire globe. At GraphAware, we have created one of the biggest Knowledge Graphs containing all the information related to this and related diseases. Moreover, we have partnered with the University of South Florida to develop a digital tool that gives government agencies, researchers and health professionals unparalleled insight into COVID-19. The project, funded by the National Science Foundation (NSF), will collect, synthesize and visualize comprehensive coronavirus data from around the world.
One of the datasets included in our COVID-19 Knowledge Graph is the Drug Repurposing Knowledge Graph (DRKG). This big graph is “a comprehensive biological knowledge graph relating genes, compounds, diseases, biological processes, side effects and symptoms. DRKG includes information from six existing databases including DrugBank, Hetionet, GNBR, String, IntAct and DGIdb, and data collected from recent publications particularly related to Covid19. It includes 97,238 entities belonging to 13 entity-types; and 5,874,261 triplets belonging to 107 edge-types.” [Ioannidis et al., 2020]
This portion of our Knowledge Graph is definitely big and information-rich enough for the purposes of our demonstration, so the rest will be ignored. I imported it using Hume Orchestra, the GraphAware ecosystem for creating and working with Knowledge Graphs. Nevertheless, in the code available for this blog post you will find the Python code for importing it into Neo4j.
What insights are we looking for?
The driving force behind this blog post is extracting insights out of a Knowledge Graph and, of course, we would like this process to be supported by machine learning. Ideally, we seek to build models that can learn from data in order to solve particular problems. Among the different problems we could consider, we chose classification.
Imagine that you would like to classify a new disease using the information you have about other (potentially similar) diseases. Analyzing all the diseases manually to determine how to classify the new one properly would be prohibitively expensive. So, ideally, we would like a model that can classify diseases given only a small number of manually labelled examples.
When this process happens in a graph — where the predictors are the structure and content of the graph itself (nodes, properties and relationships) and the targets of the classification are node labels — it is called node classification. In this case the goal is to predict the label C — which could be a type, category, or attribute — associated with all the nodes in the graph (or part of it), given the true labels on a limited set of nodes: the training set.
Node classification is perhaps the most popular machine learning task on graph data. At first glance, it appears to be a simple variation of classic supervised classification, but there are important differences. In classical classification we assume that each data point is statistically independent of every other, and that the data points are identically distributed. The combination of these two assumptions is called independent and identically distributed (i.i.d.) [Hamilton, 2020]. The most successful node classification techniques deliberately break the i.i.d. assumption by modeling an interconnected set of nodes and explicitly leveraging the connections among them.
The DRKG dataset doesn’t contain much detailed information — just node identifiers and the relationships among nodes. On one hand, this is a perfect use case for leveraging just the graph structure and node connections. On the other hand, there isn’t much to predict other than the node class (Gene, Compound, Biological Process, and so on).
The approach
The DRKG has many entities of multiple types and many relationship types that connect those entities. The goal is to create a machine learning model that is trained on a subset of nodes and is capable of predicting the types of unseen nodes. The overall approach used here can be decomposed into the following key points:
- Graph embedding: The complexity of the problem is reduced using an embedding technique. This operation translates the graph space into a multidimensional space that captures, fully or partially (depending on the algorithm used), the structure of the graph and the connections among entities.
- Sampling: Since the graph is quite large, for validation purposes a subset of the nodes is used. This subset is extracted by randomly selecting 10k nodes.
- Training set and testing set: The sampled subset is split into a training set, used for training the model, and a testing set, used for evaluating the accuracy of the model. A simple 80/20 split is used in this case. Other techniques, like cross-validation, could be used, but the purpose here is not to find the best model.
- Train a classification model: We use a one-vs-rest logistic regression model to train a classifier. The goal — simple, but powerful enough to show that the approach works — is to show that a model trained on the node embeddings can correctly classify the nodes back to their original labels.
- Test the classification model: To evaluate the model and, hence, check the quality of our classification, we use the testing set (the 20% held out from the sampled dataset). As a single measure of result quality we use a value that combines precision and recall: the F1 score. The F1 score can be interpreted as the harmonic mean of precision and recall, reaching its best value at 1 and its worst at 0.
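As a quick illustration of the metric itself, here is a minimal sketch of how the F1 score combines the two quantities (the example values are illustrative, not results from DRKG):

```python
# F1 is the harmonic mean of precision and recall.
def f1(precision: float, recall: float) -> float:
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A classifier with precision 0.75 and recall 0.86 scores roughly 0.80.
print(round(f1(0.75, 0.86), 2))  # ≈ 0.80
```

Note that a model only scores well when both precision and recall are high: `f1(1.0, 0.0)` is 0, not 0.5.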
Node Embedding
Node embedding is becoming a prominent technique in the graph analytics space. Embedding methods learn low-dimensional distributed representations of the nodes in a graph. These learned representations can serve as latent features for a variety of inference tasks on graphs, such as node classification (our goal), link prediction and network reconstruction, among others.
In Knowledge Graphs, embedding techniques assume even more value, since each relationship denotes a fact between a source node and a target node. Embeddings allow us to translate this symbolic representation into a format that simplifies manipulation while preserving the inherent structure. Knowledge graphs adhere to some deterministic rules, e.g. constraints or transitivity. Moreover, as mentioned in the introduction, there are also latent features that reflect inner statistical properties, such as the tendency of some entities to be related by similar characteristics (genes in different organisms encoding similar proteins) or entities that can be divided into different groups (block structure). These latent features may emerge through the application of statistical relational learning techniques [Gomez-Perez et al., 2020].
Among the different techniques that have been proposed for generic graph embedding, and specifically for knowledge graph embedding, we will consider one of the simplest and most performant, yet still very accurate: Fast Random Projection (FastRP) [Chen et al., 2019].
FastRP is a scalable and performant algorithm for learning distributed node representations in a graph. Even though it is over 4,000 times faster than state-of-the-art methods such as node2vec and similar, it achieves comparable accuracy in many real-world scenarios. If you’re interested in the details, I would recommend reading the related article.
Neo4j provides the Graph Data Science (GDS) library, which implements FastRP in a very efficient way; that is the implementation used in the current proof. Graph algorithms in the GDS don’t run directly on the stored graph. Instead they run on a virtual graph, which is a projection of the underlying property graph data model. A graph projection can be seen as a view over the stored graph, containing only analytically relevant, potentially aggregated, topological and property information. Graph projections are stored entirely in-memory using compressed data structures optimized for topology and property lookup operations. In our case the projection has been created using the following query (the full version is available in the source code):
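A sketch of the projection query, assuming GDS 2.x syntax (on GDS 1.x the procedure is `gds.graph.create`); the graph name `drkg` is illustrative and the full version is in the source code:

```cypher
// Project every node under a single catch-all projection and
// every relationship type as UNDIRECTED into an in-memory graph.
CALL gds.graph.project(
  'drkg',
  '*',
  { ALL: { type: '*', orientation: 'UNDIRECTED' } }
)
YIELD graphName, nodeCount, relationshipCount
```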
In the query above there are two very important aspects:
- All nodes are treated the same: we use a single label that covers every node in the DRKG.
- The relationships are considered UNDIRECTED, so the resulting graph is undirected. In my own experience this is generally more effective for this type of embedding computation.
Furthermore, FastRP doesn’t take relationship types into account either. The query above extracts the proper projection, but the relationship types will be ignored by the algorithm.
Once the graph has been built in memory (it shouldn’t take much time, though that depends on the horsepower of your machine), the next simple query computes the embedding for the full graph (all the nodes in the graph).
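A sketch of that query, using hyperparameter values in line with the suggestions discussed below (the property name `embedding` and the graph name `drkg` are my choices):

```cypher
// Compute FastRP embeddings on the in-memory projection and
// write them back to the stored nodes as a property.
CALL gds.fastRP.write('drkg', {
  embeddingDimension: 512,
  iterationWeights: [0.0, 0.0, 1.0, 1.0],
  normalizationStrength: -0.9,
  writeProperty: 'embedding'
})
YIELD nodePropertiesWritten
```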
You will be surprised by the speed. If not, you have never trained other embeddings :-) The hardest part of FastRP is the fine-tuning of its parameters. There are different techniques for optimizing them; I like a library called Optuna for finding the best combination. Thanks to the speed of execution, it is possible even for big graphs to test multiple combinations, and Optuna finds the best combination of hyperparameters quite quickly. This optimization aspect is outside the scope of this blog post, so you will have to trust me that these are reasonably good hyperparameters.
To reduce the search space for the best combination, I have a couple of suggestions that will allow you to use FastRP quite easily (at least in this type of scenario):
- Don’t use too many iterations: generally 4 is enough, so experiment with plus or minus 1.
- It is generally better to set the iteration weights of the first and second iterations to 0, or a value close to it.
- The vector size (the embedding dimension) should be at least 128. I used 512 to keep it close to the needs of other embedding algorithms I used for testing.
- The normalization strength should be negative, and generally lower is better (though preferably not less than -1).
With the embeddings computed, run the following query to immediately test the quality of the embedding:
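A sketch of such a similarity query, assuming the embeddings were written to an `embedding` property and that genes carry a `symbol` property (`gds.similarity.cosine` is the GDS cosine similarity function):

```cypher
// Find the genes whose embeddings are closest to TLR9's.
MATCH (g:Gene {symbol: 'TLR9'})
MATCH (other:Gene)
WHERE other <> g
RETURN other.symbol AS gene,
       gds.similarity.cosine(g.embedding, other.embedding) AS similarity
ORDER BY similarity DESC
LIMIT 10
```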
The query looks for genes similar to “TLR9”, one of the genes involved in coronavirus-related processes. You should get something similar to the following results:
The results of the query show that TLR9 is close to a lot of related TLR* genes. I’m not an expert geneticist but it looks reasonable to me.
Evaluating the quality of the results
The result of the query is a first validation of what we are trying to show here. Remember that the Knowledge Graph contains only nodes with IDs and the relationships among them. There is no information about the type of genes, the diseases and so on. The symbol and chromosome data, visualized in the screen above, were imported only to help evaluate the results. The different relationships among the nodes contain enough latent information that the embedding process can transpose it into a multidimensional space, where things (like genes or diseases) that are close in the real world (because they belong to the same family, or encode the same set of proteins, for example) are also close in this space.
The heuristic evaluation above wouldn’t be enough for an objective measure of the quality of the Knowledge Graph and the embedding technique. As mentioned above, the task selected to help us in this evaluation is node classification. Since DRKG contains no detail about the nodes other than their type (Gene, Disease, Compound…), the classification will use this type for training and evaluating the model.
The evaluation will follow the next steps:
- Extract a sample of 10k random nodes from DRKG (we could use the full dataset, but it would not change the results much; it would just take longer)
- Train a one-vs-rest classification model using 80% of the sampled dataset
- Evaluate the quality on the remaining 20%, using the F1 score, which combines precision and recall.
The following code snippet performs all the tasks above and returns a single number telling us how good our Knowledge Graph is at capturing domain knowledge and (of course) how good the embedding technique is at encoding, into vectors, the underlying meaning of the knowledge it represents.
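The core of that snippet can be sketched as follows. To keep the sketch self-contained, it replaces the Neo4j export with synthetic stand-in data (clustered random vectors, one cluster per node type); in the real pipeline, `X` holds the `embedding` property and `y` the node labels of the 10k sampled nodes, and the F1 value printed depends on the real data, not on this synthetic setup:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(42)

# Stand-in for the DRKG sample: 13 node types, 64-dim embeddings.
# Each type's embeddings cluster around a random centroid.
n_nodes, n_types, dim = 2_000, 13, 64
y = rng.integers(0, n_types, size=n_nodes)
centroids = rng.normal(size=(n_types, dim))
X = centroids[y] + rng.normal(scale=0.8, size=(n_nodes, dim))

# 80/20 split, one-vs-rest logistic regression, micro-averaged F1.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
f1 = f1_score(y_test, clf.predict(X_test), average='micro')
print(f"F1: {f1:.2f}")
```

The same `fit`/`predict`/`f1_score` sequence is what produces the single quality number discussed below.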
The value printed is the F1 score we are looking for. The exact value depends on the embeddings and the random selection, but you should get something close to 0.80, i.e. 80%.
This is the second, more quantitative, confirmation of the knowledge graph’s capacity to represent pieces of human knowledge. It demonstrates that such a representation is useful not only for humans to analyze, via graph exploration for example, but also that it is suitable for autonomous agents. Node classification — though it could have been something else, like link prediction or graph completion — is a clear example of how to leverage such a graph-based representation, and its derivative forms like embeddings, to extract insights from it automatically.
Counter-proof
So far so good, but I was not completely happy with the results. I wanted to show that this wasn’t a lucky coincidence — that it happens only when we are processing a real knowledge graph. For this reason, to further strengthen my thesis, I ran a counter-proof.
If a Knowledge Graph can do what it is supposed to do because it contains human knowledge, let’s generate a graph that looks like a Knowledge Graph but doesn’t contain any real knowledge, and see whether node classification performs as well as before. Of course, I expect it won’t.
I followed a very simple process:
- Generate a random graph using apoc.generate.ba of the same size in terms of nodes and relationships as our DRKG and with the same average degree;
- Mimic the real graph by assigning random labels to the generated nodes, respecting the label counts of the real knowledge graph;
- Run the same process as before and check the quality of the classification.
You’ll find some guidance in the code repository provided as part of this blog post.
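The generation step can be sketched as follows. The node count matches DRKG, and the edges-per-node value approximates its density (≈5.87M relationships over ≈97k nodes, i.e. roughly 60 per node); the label, the relationship type, and the uniform type assignment are illustrative simplifications (the real script samples types according to DRKG's label frequencies):

```cypher
// Generate a Barabási–Albert random graph roughly matching DRKG's size.
CALL apoc.generate.ba(97238, 60, 'RandomEntity', 'RANDOM_REL');

// Assign a type to every generated node.
// (Sketch: uniform random choice over a few types; the real script
// respects the label proportions observed in DRKG.)
MATCH (n:RandomEntity)
WITH n, ['Gene', 'Compound', 'Disease', 'Anatomy', 'SideEffect'] AS types
SET n.type = types[toInteger(rand() * size(types))];
```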
As expected, even after adjusting the hyperparameters to optimize the results of the Fast Random Projection, I was never able to get a result better than 30%. Considering the size of the dominant types — genes and compounds, which together make up more than 2/3 of the dataset — this value is in line with a random selection.
Conclusion
This blog post demonstrates, in a concrete and formal way, that Knowledge Graphs are capable of representing the knowledge available in multiple domains, and not only in a way that facilitates exploration and navigation by analysts. The inherent structures, and the domain-specific forces that drive the connections among the entities in the graph (in our example, biological rules), can also be captured and analyzed by artificial and autonomous agents. The classification presented here is just one example of how machine learning algorithms can be fed by a graph in a manner that would be impossible, or very hard, otherwise: to obtain the same accuracy we would have had to collect many common features for each of the entities we wanted to classify.
It is worth noting that this effort doesn’t go in the direction of replacing the human capability to analyze this knowledge; it empowers it. The capability to process enormous amounts of data goes beyond human possibilities, which is why machine learning was introduced in the first place. In any case, at the end of these processes it is always a human responsibility to evaluate the insights and, more generally, the results of this analysis, and to make informed and wiser decisions based on them.
References
[Brachman and Levesque, 2004] Ronald Brachman and Hector Levesque. 2004. Knowledge Representation and Reasoning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
[Ioannidis et al., 2020] Vassilis N. Ioannidis, Xiang Song, Saurav Manchanda, Mufei Li, Xiaoqin Pan, Da Zheng, Xia Ning, Xiangxiang Zeng, and George Karypis. 2020. DRKG — Drug Repurposing Knowledge Graph for Covid-19. https://github.com/gnn4dr/DRKG/
[Negro, 2021] Alessandro Negro. 2021. Graph-Powered Machine learning. Manning Publications.
[Gomez-Perez et al., 2020] Jose Manuel Gomez-Perez, Ronald Denaux and Andres Garcia-Silva. 2020. A Practical Guide to Hybrid Natural Language Processing. Combining Neural Models and Knowledge Graphs for NLP. Springer International Publishing.
[Hamilton, 2020] William L. Hamilton. 2020. Graph Representation Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan and Claypool.
[Chen et al., 2019] Haochen Chen, Syed Fahad Sultan, Yingtao Tian, Muhao Chen, and Steven Skiena. 2019. Fast and Accurate Network Embeddings via Very Sparse Random Projection. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management (CIKM ‘19). Association for Computing Machinery, New York, NY, USA, 399–408.