On clustering (please comment)


(Victor Nițu) #1

I’m gonna talk a bit about clustering, the issues we are facing now and questions regarding those.

  1. Clusters are nodes, too?
    Are we going to treat clusters as individual nodes or groups of nodes? This has some impact in the next points.
    Or, to put it differently, does it matter from a user’s point of view if we treat the clusters as entities or groups of entities?

  2. Cluster details
    If we have a cluster, it should be listed as an individual entity in the sidebar. How are we going to treat it, both visually and from a backend point of view? We can, of course, mark it as a cluster and be completely opaque on its members, so you’d have to open the clustering UI to check its members. Or we can list the members and group the cluster details by member nodes.

  3. Cluster figures
    We can’t just add (+) all the figures on all members of a cluster to come up with the cluster details. While this technique holds for amounts and counts, it won’t be reliable for averages, for example.
    We probably need some system of equations in which node + node = cluster would reveal the operations we should apply to specific properties, or which fields to ignore completely in the cluster context (or add?).

  4. Cluster usefulness
    Our long term goal is to use the user provided information on how they clustered things to improve our own deduplication mechanism and maybe train some part of the processing pipeline to analyze data better.
    However, if we do it now (i.e. on the current, Rails/Mongo based backend) we’d likely just be able to persist them and read them back, as any specific cluster treatment is made very hard by the way Mongo works. That’s why we’re going for graph DBs. I think @georgiana_b might be able to warn us about more specific issues with the Rails backend and clustering.
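
To illustrate point 3, here is a minimal sketch of why sums work for amounts and counts but a plain average of averages does not. The `NodeFigures` class and its field names are hypothetical, not taken from the backend. A median would be harder still, since it can only be recomputed from the underlying per-contract values.

```python
from dataclasses import dataclass


@dataclass
class NodeFigures:
    # Hypothetical per-node figures; the field names are illustrative only.
    total_amount: float
    contract_count: int
    avg_bids_per_contract: float


def combine(members: list[NodeFigures]) -> NodeFigures:
    """Combine the figures of member nodes into figures for the cluster."""
    # Amounts and counts can simply be added up.
    total_amount = sum(m.total_amount for m in members)
    contract_count = sum(m.contract_count for m in members)
    # An average has to be weighted by the number of contracts behind it.
    weighted = sum(m.avg_bids_per_contract * m.contract_count for m in members)
    avg = weighted / contract_count if contract_count else 0.0
    return NodeFigures(total_amount, contract_count, avg)


a = NodeFigures(total_amount=1000.0, contract_count=1, avg_bids_per_contract=10.0)
b = NodeFigures(total_amount=9000.0, contract_count=9, avg_bids_per_contract=2.0)

cluster = combine([a, b])
# The naive approach: averaging the two averages ignores how many contracts each covers.
naive_avg = (a.avg_bids_per_contract + b.avg_bids_per_contract) / 2
```

Here the weighted average over the 10 contracts is 2.8, while the naive average of the two averages would be 6.0.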

These are some points collected following calls with both @ca1yps0 and @georgiana_b. We should talk more about them before reaching a dead end, while taking care not to fall down rabbit holes. What are the minimal use cases we can afford to build now that would leave enough flexibility to build on afterwards? What’s our maximum implementation scenario? We can go either very basic or very advanced with this feature.

And, more importantly: let’s make it doable for parallel backend/frontend work in the following week(s). Minimal changes on the current backend, if possible, but forward-compatible with the new one. That’s why I’m asking so many things in advance :slight_smile:

So @elvis people, let’s exchange some ideas!


P.S.: This post is pieced together following phone calls and one to one interactions. Hence putting it here and avoiding me acting as a (crappy) proxy.


#2

Clusters are nodes, too?

yes

If we have a cluster, it should be listed as an individual entity in the sidebar.

yes. I propose being opaque on the members and trying to see what info we can merge from all the members.

There is actually only one use case I can see in practice now and that is the cluster being the same company spelled differently or under a completely different name. You therefore won’t need the names. (I can think of more but they are speculative, such as checking the market domination of a few companies by grouping them)

I would stick to basic usage and see whether it makes sense in the long run, or whether the Digiwhist data would fix this.


(Victor Nițu) #3

Also, it just occurred to me that flags are not cumulative. So once clustered, the newly created cluster should have no flags, as they are computed in the context and with the values of the component nodes…

Food for thought :slight_smile:


(Georgiana B) #4

Considering the above, here is a humble proposal.

Ideally, when actor A and actor B are clustered, this cluster should be saved on the backend where the whole analysis would be run on their combined contracts as if it was a single entity.
That is possible in the current backend but:

  1. it would require quite a lot of work both on frontend and on backend
  2. it would turn out quite slow due to infrastructure limitations
  3. it would be changed in the new backend anyway to solve 2

However, if we are willing to drop the red flags and average competition (median tenders) from the cluster details for this first stage, we can have clusters without too many changes. In this scenario we need functionality to make the clusters and a way to save the clusters in the graph of a network without losing info about their constituents.
Then to show the cluster details you request the details of all its constituent suppliers/procurers (already possible) and

  • add their amount of money
  • add their number of contracts
  • combine their contracts into a set and show them in the list/timeline
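
The three steps above can be sketched as follows. This is only an illustration, assuming each constituent's details arrive as a dict with `id`, `amount`, `contract_count` and `contracts` keys; those names are made up for the example and are not the actual API response. Red flags and median tenders are left out on purpose, as proposed.

```python
def cluster_details(members: list[dict]) -> dict:
    """Build first-stage cluster details from the details of its constituents."""
    # Combine contracts into a set, keyed by contract id.
    contracts = {}
    for member in members:
        for contract in member["contracts"]:
            contracts[contract["id"]] = contract
    return {
        # Keep the constituents so no info about them is lost.
        "member_ids": [m["id"] for m in members],
        "amount": sum(m["amount"] for m in members),
        "contract_count": sum(m["contract_count"] for m in members),
        # Sorted by date so the result can feed the list/timeline.
        "contracts": sorted(contracts.values(), key=lambda c: c["date"]),
    }


supplier_a = {
    "id": "a",
    "amount": 1500.0,
    "contract_count": 2,
    "contracts": [
        {"id": "c1", "date": "2016-03-01"},
        {"id": "c3", "date": "2016-07-15"},
    ],
}
supplier_b = {
    "id": "b",
    "amount": 500.0,
    "contract_count": 1,
    "contracts": [{"id": "c2", "date": "2016-05-20"}],
}

details = cluster_details([supplier_a, supplier_b])
```
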

I think this is a fine first take on clusters, especially since the point of clusters in the first stage is visualization.
What do you think?


(Chut Ko) #5

the original idea about clusters was grouping nodes which we suspect to be the same but which have slightly different names (ABC, aBc, ABC co etc). if this still applies, i would keep it simple and wait for elvis users to suggest the next features. maybe people will start using clustering in a different way, but until then we don’t really know.

yes, i would treat a cluster as a node and as a fully functional entity (procurer/supplier). i think of a cluster as just a special type of node with 2 or more child nodes. and that is how i think about the database structure as well - all clusters and nodes stored in a single table and another table storing parent-child relations between them. then, to get the data displayed in the visualisation, create a projection/view (sorry if i say bullshit, i don’t remember the terminology anymore) of all nodes which do not have a parent (meaning they are either common (unclustered) nodes or clusters).
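
A rough sketch of that structure, using SQLite from the Python standard library purely for illustration (the real backend is Rails/Mongo, and the table, column and view names here are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Clusters and common nodes live in the same table.
CREATE TABLE nodes (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    is_cluster INTEGER NOT NULL DEFAULT 0
);
-- Parent-child relations; a node can belong to at most one cluster.
CREATE TABLE cluster_members (
    parent_id INTEGER NOT NULL REFERENCES nodes(id),
    child_id INTEGER NOT NULL UNIQUE REFERENCES nodes(id)
);
-- Everything without a parent: unclustered nodes and clusters.
CREATE VIEW visible_nodes AS
SELECT n.id, n.name, n.is_cluster,
       (SELECT COUNT(*) FROM cluster_members m
         WHERE m.parent_id = n.id) AS member_count
FROM nodes n
WHERE n.id NOT IN (SELECT child_id FROM cluster_members);
""")

conn.executemany(
    "INSERT INTO nodes (id, name, is_cluster) VALUES (?, ?, ?)",
    [(1, "ABC", 0), (2, "aBc", 0), (3, "ABC co", 0), (4, "XYZ", 0),
     (5, "ABC (cluster)", 1)],
)
conn.executemany(
    "INSERT INTO cluster_members (parent_id, child_id) VALUES (?, ?)",
    [(5, 1), (5, 2), (5, 3)],
)

visible = conn.execute(
    "SELECT name, is_cluster, member_count FROM visible_nodes ORDER BY id"
).fetchall()
```

With this data the view returns only "XYZ" and "ABC (cluster)", the latter with a member count of 3. Declustering would then amount to deleting the cluster's rows from `cluster_members` (and the cluster node itself), after which the three members show up in the view again.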

considering the above, for a user, every cluster should have exactly the same behavior and properties as any common node. it should just be visually clear that it is a cluster and that its contents can be manipulated.

every cluster node has to have a number on it. the number is clickable and displays the clustering dialogue. and that’s it, i would not go further with functionality than this. i would not even allow displaying data of a specific clustered node, only of the whole cluster.


(Georgiana B) #6

What about the details view for a cluster node?


(Chut Ko) #7

I say screw em :slight_smile: if the clustering is REALLY just about fixing inconsistencies in names, there will not be any need for displaying details of a single node belonging to a cluster.

however, if this functionality IS totally necessary, here are a few ideas:

  1. for each clustered node inside a cluster dialog, display a button triggering the procurer/supplier detail view
  2. in the sidebar, cluster-type nodes are collapsible - clicking on an arrow shows the child nodes of the cluster under it. the node detail view triggers when the node name is clicked.
  3. display all types of nodes in the sidebar (nodes, clusters, nodes inside clusters). add a filtering option for these types to the sidebar. for the child node type, also display the parent (cluster) name. the child node detail view triggers when the child node name is clicked.

problem: for options 2 and 3, it’s confusing if a click on a normal node shows the subnetwork and a click on a child node shows the detail.

note: when displaying (any) detailed view, there are often links to other entities displayed in the table. if the entity is inside a cluster, what should the behaviour be? should we rather display the name of, and link to, the cluster?
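
If the answer to the note above turns out to be "link to the cluster", the lookup could be as small as this sketch. The `parent_of` mapping (child id to cluster id) and the ids are hypothetical, and the mapping is assumed to contain no cycles.

```python
def link_target(entity_id: str, parent_of: dict[str, str]) -> str:
    """Return the id a link should point to: the entity itself, or its cluster."""
    current = entity_id
    # Follow parent links until reaching a node that has no parent.
    while current in parent_of:
        current = parent_of[current]
    return current


parent_of = {"abc-1": "cluster-abc", "abc-2": "cluster-abc"}

clustered_target = link_target("abc-1", parent_of)    # inside a cluster
unclustered_target = link_target("xyz", parent_of)    # common node
```
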


#8

What about the details view for a cluster node?
Can it be the same as for the single node, just aggregated?

I don’t think Georgiana meant the details of a single node within the cluster; for that, they can also just decluster :slight_smile:


(Georgiana B) #9

@Chut_Ko

I say screw em :slight_smile: if the clustering is REALLY just about fixing inconsistencies in names, there will not be any need for displaying details of a single node belonging to a cluster.

What you say makes sense: if a user makes a cluster he/she probably wants to view it as a singular entity.
This means we have to show the actor details view (below) for a node that is a cluster:

In this comment I pointed out some of the challenges that occur with :point_up: view if the node is a cluster.

@zufanka

they can also just decluster

If I got it right you are proposing an alternative solution: we don’t show actor details for clusters at all; if people want to see details they uncluster it and check out the details of the individual nodes. True?
If so I think that might annoy people because it would render their effort to cluster things useless.


#10

no no, I mean we don’t have to show details on individual nodes within a cluster.
On a cluster itself we show the aggregated info, as you pointed out above.
But if they want to see details of only one node, they need to decluster again.


#11

However, if we are willing to drop the red flags and average competition (median tenders) from the cluster details for this first stage, we can have clusters without too many changes.

Sounds reasonable to me!


(Chut Ko) #12

nice, looks like we agreed on something :slight_smile: if there are no more problems to solve, we could move on.
@zufanka, can you summarize pls? you are the one who understands everything here :slight_smile:


#13

haha, ok I will try. I am sorry to be absent a bit, but I have a huge ass deadline on Friday.

Clusters are to be treated like regular nodes, with just the little twist of a small white cloud on the cluster showing the number of nodes it contains.
As soon as I cluster nodes, all of them disappear and merge into one monster node. This means they won’t be available on the network, in the side bar or in the detailed view.
As far as I understood the back end issues, we can merge every detail except the red flags. Then we display the regular node view (not the relationship view) with aggregated info, the same as for any regular node (again, except for the red flags).

@georgiana_b @victor @Chut_Ko and (where is?) Ioana, all clear or did I miss anything?


(Georgiana B) #14

Sounds good to me!

I also noticed I put a picture above with relationship details but I actually meant actor details.


(Chut Ko) #15

and what can we do about red flags? can we trigger some reevaluation on the whole cluster? because that is one big important feature of this system


#16

you are right. But as it is tricky to do on the back end for now, we should take it into account but do it in a later version, as @georgiana_b suggested.


#17

I think we need to rethink the clustered node icon. @Chut_Ko
The reason is that after you cluster, you are unable to compare the size of the node to the rest of the network.
I also think that clustered nodes don’t have to be differentiated on the front end anyway. It’s the same entity after all.

Can we also have just a node of size X? Is this possible on the backend after clustering? @georgiana_b
Will the user then still be able to decluster?
@victor @ca1yps0


(Chut Ko) #18

oh, didn’t notice this topic. i already proposed a solution in an issue: https://github.com/tenders-exposed/elvis-ember/issues/309

but as i’m reading your comment, maybe we don’t need such an obvious design for clustered nodes (as proposed in issue #309). but there should be some indicator… maybe a tiny icon before the node’s label

maybe the proposed design could rather be used for consortia