LinkedIn Architecture
The Stack
Environment
Java, Groovy, Scala (newest, 2%), Ruby, C++
Containers
Tomcat, Jetty
Data Layer
Oracle, MySQL, Voldemort, Lucene, memcached
Offline Processing
Hadoop and Splunk
Data Collection
the bulk of the challenge
the majority of the collectors' data comes from the data store, with extensive use of memcached
the requirement is speed
Option 1: Push Architecture (inbox; old, but still used)
each member has an inbox of notifications received from their connections/followees
N writes per update (where N may be very large)
Very fast to read
difficult to scale, but used for private or targeted systems
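The push model above can be sketched as follows. This is a minimal in-memory illustration, not LinkedIn's implementation; the class name `PushFeed` and its methods are hypothetical.

```python
from collections import defaultdict


class PushFeed:
    """Fan-out on write: each update is copied into every follower's inbox."""

    def __init__(self):
        self.followers = defaultdict(set)  # member -> members who follow them
        self.inboxes = defaultdict(list)   # member -> updates received

    def follow(self, follower, followee):
        self.followers[followee].add(follower)

    def publish(self, author, update):
        # N writes per update, where N is the author's follower count.
        for follower in self.followers[author]:
            self.inboxes[follower].append((author, update))
        return len(self.followers[author])

    def read(self, member):
        # One lookup: the inbox is already materialized, so reads are fast.
        return list(self.inboxes[member])
```

The write cost grows with the follower count, which is why this approach gets harder to scale as N grows.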
Option 2: Pull Architecture (new architecture)
each member has an "activity space" that contains their actions on LinkedIn
1 write per update (no broadcast)
requires up to N reads to collect N streams
can we optimize to minimize the number of reads?
not all N members have updates that satisfy the query
not all updates can or need to be displayed on the screen
some updates/members are more important than others
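A rough sketch of the pull model, for comparison with the push model. The names (`PullFeed`, `limit`) and the integer timestamps are assumptions made for illustration; the notes do not describe the actual data structures.

```python
import heapq
from collections import defaultdict
from itertools import chain


class PullFeed:
    """Fan-out on read: one write per update, streams merged at query time."""

    def __init__(self):
        self.following = defaultdict(set)  # member -> members they follow
        self.activity = {}                 # member -> [(timestamp, author, update)]

    def follow(self, follower, followee):
        self.following[follower].add(followee)

    def publish(self, author, timestamp, update):
        # 1 write per update, no broadcast.
        self.activity.setdefault(author, []).append((timestamp, author, update))

    def read(self, member, limit=10):
        # Up to N reads, one per followed member; members with no activity
        # contribute nothing, and only `limit` updates fit on the screen.
        streams = (self.activity.get(m, []) for m in self.following[member])
        return heapq.nlargest(limit, chain.from_iterable(streams))
```

Here `read` still touches every followed member; the optimizations listed above are about skipping members who cannot contribute to the result.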
Queuing
ActiveMQ
Frameworks
Spring
Capacity
35M/week updates
14M/week emails
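For a sense of scale, the weekly figures can be converted to average per-second rates. These are averages only; the notes give no peak figures.

```python
SECONDS_PER_WEEK = 7 * 24 * 60 * 60  # 604,800


def per_second(per_week):
    """Average events per second for a given weekly volume."""
    return per_week / SECONDS_PER_WEEK


updates_per_second = per_second(35_000_000)  # roughly 58 updates/s
emails_per_second = per_second(14_000_000)   # roughly 23 emails/s
```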
Storage Model
L1: Temporal
Oracle
combined CLOB/varchar storage
optimistic locking
1 read to update, 1 write (merge) to update
size bound by the number of updates and the retention policy
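The "1 read, 1 write (merge)" pattern under optimistic locking might look like the sketch below. It uses an in-memory dictionary with a version counter in place of Oracle; `VersionedStore`, `StaleVersionError`, and `append_update` are hypothetical names.

```python
class StaleVersionError(Exception):
    """Raised when a row changed between the read and the write."""


class VersionedStore:
    def __init__(self):
        self._rows = {}  # key -> (version, list of updates)

    def read(self, key):
        return self._rows.get(key, (0, []))

    def write(self, key, expected_version, value):
        current_version, _ = self._rows.get(key, (0, []))
        if current_version != expected_version:
            raise StaleVersionError(key)  # someone else wrote first
        self._rows[key] = (current_version + 1, value)


def append_update(store, key, update, max_retries=3):
    for _ in range(max_retries):
        version, updates = store.read(key)     # 1 read
        merged = updates + [update]
        try:
            store.write(key, version, merged)  # 1 write (merge)
            return version + 1
        except StaleVersionError:
            continue                           # re-read and retry
    raise StaleVersionError(key)
```

No lock is held between the read and the write; a conflicting writer is detected by the version check and the operation is retried.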
L2: Tenured
accessed less frequently
simple key-value storage is sufficient (each update has unique id)
Oracle today, transitioning to Voldemort
Member Filter
need to avoid fetching N feeds (too expensive)
filter contains an in-memory summary of user activity
filter only returns false positives, never false negatives
easy-to-measure heuristic: for N members, how many had good content
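One structure with exactly this property (false positives possible, false negatives impossible) is a Bloom filter. The notes do not say which structure LinkedIn uses, so the sketch below is only an assumption; `ActivityFilter` and its parameters are hypothetical.

```python
import hashlib


class ActivityFilter:
    """Bloom-filter-style in-memory summary of which members have activity."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # Python int used as a bit array

    def _positions(self, member_id):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{member_id}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def mark_active(self, member_id):
        for pos in self._positions(member_id):
            self.bits |= 1 << pos

    def might_be_active(self, member_id):
        # True may be a false positive; False is always correct.
        return all((self.bits >> pos) & 1 for pos in self._positions(member_id))


def members_worth_fetching(activity_filter, connections):
    # Only fetch feeds for members the filter cannot rule out.
    return [m for m in connections if activity_filter.might_be_active(m)]
```

The heuristic mentioned above can then be measured directly: of the members the filter let through, count how many actually had content worth showing.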
Commenting
users can create discussions around updates
leverage the existing forum service
denormalize a discussion summary onto the tenured update; resolve first/last comments on retrieval
the full discussion can be retrieved dynamically
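The denormalized summary could be sketched like this. The `ForumService` class is a stand-in with a made-up interface, and `DiscussionSummary` is a hypothetical shape for what gets stored on the tenured update.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DiscussionSummary:
    comment_count: int = 0
    first_comment_id: Optional[str] = None
    last_comment_id: Optional[str] = None


class ForumService:
    """Stand-in for the existing forum service (hypothetical interface)."""

    def __init__(self):
        self._comments = {}  # comment_id -> text
        self._threads = {}   # update_id -> [comment_id, ...]

    def post(self, update_id, comment_id, text):
        self._comments[comment_id] = text
        self._threads.setdefault(update_id, []).append(comment_id)

    def get(self, comment_id):
        return self._comments[comment_id]

    def thread(self, update_id):
        return [self._comments[c] for c in self._threads.get(update_id, [])]


def add_comment(forum, summaries, update_id, comment_id, text):
    forum.post(update_id, comment_id, text)
    summary = summaries.setdefault(update_id, DiscussionSummary())
    summary.comment_count += 1
    if summary.first_comment_id is None:
        summary.first_comment_id = comment_id
    summary.last_comment_id = comment_id


def render_update(forum, summaries, update_id):
    # Only the first and last comments are resolved on retrieval.
    s = summaries.get(update_id, DiscussionSummary())
    first = forum.get(s.first_comment_id) if s.first_comment_id else None
    last = forum.get(s.last_comment_id) if s.last_comment_id else None
    return {"count": s.comment_count, "first": first, "last": last}
```

Rendering a feed item needs at most two comment lookups; `forum.thread(update_id)` is called only when the user opens the full discussion.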
Twitter Sync
partnership with Twitter
bi-directional flow of status updates
export status updates, import tweets
users register their Twitter account
authorize via OAuth
Email Delivery
multiple concurrent mail-generating tasks
each task has non-overlapping ID range generators to avoid overlap and allow parallelization
controlled by the task scheduler
sets delivery time
controls task execution status, suspend/resume, etc.
common content is cached for the Notifier, which packages the email
user priority JSP framework
David Henke, now LinkedIn's head of engineering/operations, came from Yahoo
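The non-overlapping ID ranges can be illustrated with a small sketch. The function names and the use of a thread pool are assumptions; the notes do not describe how the real tasks or scheduler are implemented.

```python
from concurrent.futures import ThreadPoolExecutor


def partition_id_ranges(min_id, max_id, num_tasks):
    """Split [min_id, max_id] into contiguous, non-overlapping inclusive ranges."""
    total = max_id - min_id + 1
    base, extra = divmod(total, num_tasks)
    ranges = []
    start = min_id
    for i in range(num_tasks):
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue  # more tasks than IDs
        ranges.append((start, start + size - 1))
        start += size
    return ranges


def generate_mail(id_range):
    # Placeholder for a mail-generating task working on its own ID range.
    start, end = id_range
    return [f"email for member {member_id}" for member_id in range(start, end + 1)]


def run_mail_tasks(min_id, max_id, num_tasks):
    ranges = partition_id_ranges(min_id, max_id, num_tasks)
    with ThreadPoolExecutor(max_workers=num_tasks) as pool:
        batches = list(pool.map(generate_mail, ranges))
    return [mail for batch in batches for mail in batch]
```

Because no two ranges share an ID, the tasks need no coordination with each other and no member is mailed twice.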