Find the distance between friends/connections - LinkedIn System Design with

Опубликовано: 08 Апрель 2024
на канале: Gaurav Sen
10,346
185

Have you noticed a small icon on the right of LinkedIn profiles?

It tells you how closely you are related to a user (1st, 2nd or 3rd-degree connection).

These distances are calculated using graph algorithms. Specifically, LinkedIn uses Bi-directional BFS.

This is a two-way algorithm where a search is started from both source and destination, and terminated midway. The total distance traveled from source+destination is the degree of the connection.

But at scale, this solution brings challenges. Let's limit ourselves to finding 3rd-degree connections:

1. Every query needs a graph traversal on our social network.
2. A naive BFS would take O(n^3) for search. A bi-directional BFS needs O(n^2) for searching users + O(n^2) for merging results.
3. To avoid performing these queries repeatedly, we cache the second-degree connections of users. This saves up on the first O(n^2) factor for subsequent queries.

Great, but how do we store all this data in a single cache?

Well, we can't. LinkedIn is forced to scale out, with multiple cache nodes storing second-degree connections. User-Connection data is sharded into caches according to user_id.

But what if a machine crashes? All queries to the shards hosted by that machine will fail.

To avoid this, we replicate shards into different machines. Even if one of the machines fails, the shard can be found on another machine.

And here we meet our main villain: High Latency.

We don't want to hit all the machines with our shards. We want to hit the SMALLEST possible number of machines, that contain all our shards.

For the mathematically inclined, this is a set-cover problem. LinkedIn uses a modified greedy version of the problem to reduce its compute costs in half!

The basic idea is to find machines that host most of the users who are your 2nd-degree connections, and avoid those that have few 2nd-degree connections.

Less effort, more work done!

00:00 Introduction
00:35 Capacity Estimation
01:20 Bruteforce SQL query
04:00 Bidirectional BFS
06:37 Caching connection data
10:00 Potential databases
11:11 Redundancy for fault tolerance
12:12 Cold starts
13:48 Removing connections
14:37 Replicated Caches
16:03 Ideal replication factor?
17:25 Summary

Here is the paper: https://www.usenix.org/system/files/c...

References:
Sharding:    • What is DATABASE SHARDING?  
System Design at InterviewReady: https://interviewready.io/
Designing Data-Intensive Applications Book: https://amzn.to/3SyNAOy

You can follow me on:
Github: https://github.com/InterviewReady/sys...
Twitter:   / gkcs_  

#SystemDesign #InterviewReady #SocialNetworks