Facebook’s Hadoop and Hive Data Mining

The second cloud computing track was on massive data processing in clusters and clouds, including a presentation by Facebook on their use of Hadoop and their custom development of Hive to support their own operations.

Data mining the 180 TB of Facebook data, which grows by 2 TB a day, is no small task. The team uses a cluster of 350 8-core machines to crunch the data and work out which popular features deserve further investment, what the user demographics look like, and whatever else they can fish out of the sea of personal information users provide.

In fact, just querying the data that came out of Hadoop proved challenging enough that they wrote their own SQL-like interface, Hive, to meet their needs and then open-sourced it. Giving open source software back to the community really puts them above the many companies that merely consume it.
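To give a rough sense of what a SQL-like layer saves you from writing by hand, here is a toy Python sketch of the map, shuffle, and reduce steps behind a query along the lines of `SELECT feature, COUNT(*) FROM usage_log GROUP BY feature`. The table name, the `feature` field, and the sample records are all made up for illustration; this is not Facebook's code, and it is not how Hive actually plans or runs a query, just the general MapReduce idea in miniature.

```python
from collections import defaultdict


def map_phase(records):
    """Emit a (key, 1) pair for every log record, keyed by the feature used."""
    for record in records:
        yield record["feature"], 1


def shuffle(pairs):
    """Group emitted values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values for each key to get a per-feature count."""
    return {key: sum(values) for key, values in groups.items()}


def count_by_feature(records):
    """Toy, single-process equivalent of a GROUP BY feature / COUNT(*) query."""
    return reduce_phase(shuffle(map_phase(records)))


# Hypothetical usage log; real data would be spread across the cluster.
sample_log = [
    {"user": 1, "feature": "photos"},
    {"user": 2, "feature": "events"},
    {"user": 3, "feature": "photos"},
    {"user": 1, "feature": "groups"},
    {"user": 2, "feature": "photos"},
]

if __name__ == "__main__":
    print(count_by_feature(sample_log))
```

With a SQL-like interface, all of the above collapses into a single line of query text, which is presumably a big part of the appeal for analysts who don't want to write a MapReduce job for every question.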

I was curious whether this data fed back into the live site, but since it’s still under development, they only use it for offline analysis. That makes me wonder what kind of software is driving the real-time recommendations. You know, the one that starts serving you ads for singles sites as soon as you change your relationship status ;)
