<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.developer.yahoo.net/~d/styles/itemcontent.css"?><rss xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0">
   <channel>
      <title>Yahoo! Hadoop Blog</title>
      <link>http://developer.yahoo.com/blogs/hadoop/</link>
      <description>News and information about Hadoop and related distributed computing work going on at Yahoo!</description>
      <language>en</language>
      <copyright>Copyright 2010</copyright>
      <lastBuildDate>Wed, 01 Sep 2010 11:00:00 -0800</lastBuildDate>
      <generator>http://www.sixapart.com/movabletype/</generator>
      <docs>http://blogs.law.harvard.edu/tech/rss</docs> 

            <atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/rss+xml" href="http://feeds.developer.yahoo.net/YDNHadoop" /><feedburner:info uri="ydnhadoop" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><item>
         <title>August HUG Recap</title>
         <description>&lt;p&gt;Thanks to the around 175 developers who came to Yahoo! recently for our monthly Hadoop User Group meeting. The energy in the packed room was phenomenal, and conversations continued long after the formal sessions.&lt;/p&gt;

&lt;p&gt;&lt;img alt="IMG00068.jpg" src="http://developer.yahoo.com/blogs/hadoop/IMG00068.jpg" width="400" height="300" /&gt;
&lt;br&gt;
&lt;em&gt;Hundreds of Hadoop Fans Flock to Yahoo! for the Hadoop User Group&lt;/em&gt;&lt;/P&gt;

&lt;p&gt;The event started with Arun Murthy from Yahoo! describing the best practices for developing MapReduce applications. Arun introduced the concept of a Grid Pattern which, similar to Design Pattern, represents a general reusable solution for applications running on the Grid. Finally, Arun talked about the anti-patterns of applications running on the Apache Hadoop clusters.&lt;/p&gt;

&lt;p&gt;&lt;object id="__sse5084658" width="425" height="355"&gt;&lt;param name="movie" value="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=bestpracticeshug20100818-100829163143-phpapp01&amp;stripped_title=hug-august-2010-best-practices" /&gt;&lt;param name="allowFullScreen" value="true"/&gt;&lt;param name="allowScriptAccess" value="always"/&gt;&lt;embed name="__sse5084658" src="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=bestpracticeshug20100818-100829163143-phpapp01&amp;stripped_title=hug-august-2010-best-practices" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="425" height="355"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;/p&gt;

&lt;p&gt;Part 1:&lt;br&gt;&lt;object width="425" height="344"&gt;&lt;param name="movie" value="http://www.youtube.com/v/ygpBTPX8bVs&amp;hl=en&amp;fs=1"&gt;&lt;/param&gt; &lt;http://www.youtube.com/v/ygpBTPX8bVs&amp;hl=en&amp;fs=1%22%3e%3c/param%3e&gt; &lt;param name="allowFullScreen" value="true"&gt;&lt;/param&gt;&lt;param name="allowscriptaccess" value="always"&gt;&lt;/param&gt;&lt;embed src="http://www.youtube.com/v/ygpBTPX8bVs&amp;hl=en&amp;fs=1" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="425" height="344"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;/p&gt;

&lt;p&gt;Part 2:&lt;br&gt;&lt;object width="425" height="344"&gt;&lt;param name="movie" value="http://www.youtube.com/v/ap3IkD-pHMI&amp;hl=en&amp;fs=1"&gt;&lt;/param&gt; &lt;http://www.youtube.com/v/ap3IkD-pHMI&amp;hl=en&amp;fs=1%22%3e%3c/param%3e&gt; &lt;param name="allowFullScreen" value="true"&gt;&lt;/param&gt;&lt;param name="allowscriptaccess" value="always"&gt;&lt;/param&gt;&lt;embed src="http://www.youtube.com/v/ap3IkD-pHMI&amp;hl=en&amp;fs=1" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="425" height="344"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;/p&gt;

&lt;p&gt;Next, Stefan Groschupf, the co-founder and CTO of Datameer, discussed the challenges in social media analytics and how to overcome these using big data analytics built on Hadoop in his “Social Media: What’s Really the Buzz?” talk. The demo was very helpful in visualizing the true thought leads and influencers in social media conversations. These leaders and influences are becoming increasingly important, so that companies can better understand who is having an impact on their customers' buying decisions. This talk gave a very good perspective of how the power of Hadoop can be used to crunch large amounts of data and then visually rendered. &lt;/p&gt;

&lt;p&gt;&lt;object id="__sse5084664" width="425" height="355"&gt;&lt;param name="movie" value="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=whatsthebuzzbayarea-100829163359-phpapp01&amp;stripped_title=hug-august-2010whats-the-buzzbayarea" /&gt;&lt;param name="allowFullScreen" value="true"/&gt;&lt;param name="allowScriptAccess" value="always"/&gt;&lt;embed name="__sse5084664" src="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=whatsthebuzzbayarea-100829163359-phpapp01&amp;stripped_title=hug-august-2010whats-the-buzzbayarea" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="425" height="355"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;/p&gt;

&lt;p&gt;Part 1:&lt;br&gt;&lt;object width="425" height="344"&gt;&lt;param name="movie" value="http://www.youtube.com/v/ou5octe53iw&amp;hl=en&amp;fs=1"&gt;&lt;/param&gt; &lt;http://www.youtube.com/v/ou5octe53iw&amp;hl=en&amp;fs=1%22%3e%3c/param%3e&gt; &lt;param name="allowFullScreen" value="true"&gt;&lt;/param&gt;&lt;param name="allowscriptaccess" value="always"&gt;&lt;/param&gt;&lt;embed src="http://www.youtube.com/v/ou5octe53iw&amp;hl=en&amp;fs=1" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="425" height="344"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;/p&gt;

&lt;p&gt;Part 2:&lt;br&gt;&lt;object width="425" height="344"&gt;&lt;param name="movie" value="http://www.youtube.com/v/oguC-pax41o&amp;hl=en&amp;fs=1"&gt;&lt;/param&gt; &lt;http://www.youtube.com/v/oguC-pax41o&amp;hl=en&amp;fs=1%22%3e%3c/param%3e&gt; &lt;param name="allowFullScreen" value="true"&gt;&lt;/param&gt;&lt;param name="allowscriptaccess" value="always"&gt;&lt;/param&gt;&lt;embed src="http://www.youtube.com/v/oguC-pax41o&amp;hl=en&amp;fs=1" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="425" height="344"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;/p&gt;

&lt;p&gt;Part 3:&lt;br&gt;&lt;object width="425" height="344"&gt;&lt;param name="movie" value="http://www.youtube.com/v/i92ezFQxxp0&amp;hl=en&amp;fs=1"&gt;&lt;/param&gt; &lt;http://www.youtube.com/v/i92ezFQxxp0&amp;hl=en&amp;fs=1%22%3e%3c/param%3e&gt; &lt;param name="allowFullScreen" value="true"&gt;&lt;/param&gt;&lt;param name="allowscriptaccess" value="always"&gt;&lt;/param&gt;&lt;embed src="http://www.youtube.com/v/i92ezFQxxp0&amp;hl=en&amp;fs=1" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="425" height="344"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;/p&gt;

&lt;p&gt;Finally, Matei Zaharia from UC Berkeley talked about Mesos: A Flexible Cluster Resource Manager.  The talk highlighted Mesos features and how organizations can consolidate multiple application workloads into a single cluster. The demo showed off the benefits of Mesos and highlighted its ability to run multiple isolated instances of Hadoop on the same cluster. The fault tollerance of Mesos was successfully demonstrated too. Subsequent to the mail session, Matei and team talked about Spark, MapReduce-like framework that adds support for iterative jobs. Spark functional programming model similar to MapReduce was demonstrated capable of caching data between iterations making it very efficient for interactive analysis of big datasets. Spark in addition was demonstrated on the same cluster running alongside Hadoop.&lt;/p&gt;

&lt;p&gt;&lt;object id="__sse5084659" width="425" height="355"&gt;&lt;param name="movie" value="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=mesos-hug-aug-2010-100829163141-phpapp01&amp;stripped_title=hug-august-2010-mesos" /&gt;&lt;param name="allowFullScreen" value="true"/&gt;&lt;param name="allowScriptAccess" value="always"/&gt;&lt;embed name="__sse5084659" src="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=mesos-hug-aug-2010-100829163141-phpapp01&amp;stripped_title=hug-august-2010-mesos" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="425" height="355"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;/p&gt;

&lt;p&gt;Part 1:&lt;br&gt;&lt;object width="425" height="344"&gt;&lt;param name="movie" value="http://www.youtube.com/v/lE3jR6nM3bw&amp;hl=en&amp;fs=1"&gt;&lt;/param&gt; &lt;http://www.youtube.com/v/lE3jR6nM3bw&amp;hl=en&amp;fs=1%22%3e%3c/param%3e&gt; &lt;param name="allowFullScreen" value="true"&gt;&lt;/param&gt;&lt;param name="allowscriptaccess" value="always"&gt;&lt;/param&gt;&lt;embed src="http://www.youtube.com/v/lE3jR6nM3bw&amp;hl=en&amp;fs=1" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="425" height="344"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;/p&gt;

&lt;p&gt;Part 2:&lt;br&gt;&lt;object width="425" height="344"&gt;&lt;param name="movie" value="http://www.youtube.com/v/V-MiIFEGSdE&amp;hl=en&amp;fs=1"&gt;&lt;/param&gt; &lt;http://www.youtube.com/v/V-MiIFEGSdE&amp;hl=en&amp;fs=1%22%3e%3c/param%3e&gt; &lt;param name="allowFullScreen" value="true"&gt;&lt;/param&gt;&lt;param name="allowscriptaccess" value="always"&gt;&lt;/param&gt;&lt;embed src="http://www.youtube.com/v/V-MiIFEGSdE&amp;hl=en&amp;fs=1" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="425" height="344"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;/p&gt;

&lt;p&gt;Part 3:&lt;br&gt;&lt;object width="425" height="344"&gt;&lt;param name="movie" value="http://www.youtube.com/v/g7mOFM3Oe8Q&amp;hl=en&amp;fs=1"&gt;&lt;/param&gt; &lt;http://www.youtube.com/v/g7mOFM3Oe8Q&amp;hl=en&amp;fs=1%22%3e%3c/param%3e&gt; &lt;param name="allowFullScreen" value="true"&gt;&lt;/param&gt;&lt;param name="allowscriptaccess" value="always"&gt;&lt;/param&gt;&lt;embed src="http://www.youtube.com/v/g7mOFM3Oe8Q&amp;hl=en&amp;fs=1" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="425" height="344"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;/p&gt;


&lt;p&gt;We at Yahoo! embrace Hadoop, and are looking for exciting technologies and experiences you want to share. Please contact me via the &lt;a href="http://www.meetup.com/hadoop/"&gt;Hadoop Bay Area User Group Meetup page&lt;/a&gt;.&lt;/p&gt;


&lt;p style="overflow:hidden;background:#eee;padding-top:10px;margin:.5em 0;padding:15px 10px 5px 10px;border-top:1px solid #ccc;"&gt;&lt;img height="50" width="50" style="margin-top:-10px;float: left; margin-right: 6px;border:1px solid #999;background:#fff;padding:5px;" src='http://developer.yahoo.com/blogs/hadoop/susheel_kaushik.jpg' alt="Susheel Kaushik"&gt;
Susheel Kaushik&lt;br/&gt;
Director, Product Management&lt;BR&gt;
Cloud Computing at Yahoo!&lt;/p&gt;&lt;div style="clear:both"&gt;&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=d94wQsU4Z3k:k9YvPk6hW7E:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=d94wQsU4Z3k:k9YvPk6hW7E:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?i=d94wQsU4Z3k:k9YvPk6hW7E:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=d94wQsU4Z3k:k9YvPk6hW7E:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=d94wQsU4Z3k:k9YvPk6hW7E:PhkjNP4BSzk"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?d=PhkjNP4BSzk" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=d94wQsU4Z3k:k9YvPk6hW7E:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?i=d94wQsU4Z3k:k9YvPk6hW7E:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/YDNHadoop/~4/d94wQsU4Z3k" height="1" width="1"/&gt;</description>
         <link>http://feeds.developer.yahoo.net/~r/YDNHadoop/~3/d94wQsU4Z3k/august_hug_recap.html</link>
         <guid isPermaLink="false">http://developer.yahoo.com/blogs/hadoop/2010/09/august_hug_recap.html</guid>
                  <category domain="http://www.sixapart.com/ns/types#category">Hadoop User Group</category>
        
                  <category domain="http://www.sixapart.com/ns/types#tag">hadoop user group</category>
        
         <pubDate>Wed, 01 Sep 2010 11:00:00 -0800</pubDate>
      <feedburner:origLink>http://developer.yahoo.com/blogs/hadoop/2010/09/august_hug_recap.html</feedburner:origLink></item>
            <item>
         <title>Apache Hadoop: Best Practices and Anti-Patterns</title>
         <description>&lt;p&gt;Apache Hadoop is a software framework to build large-scale, shared storage and computing infrastructures. Hadoop clusters are used for a variety of research and development projects, and for a growing number of production processes at Yahoo!, EBay, Facebook, LinkedIn, Twitter, and other companies in the industry. It is a key component in several business critical endeavors representing a very significant investment and technology component. Thus, appropriate usage of the clusters and Hadoop is critical in ensuring that we reap the best possible return on this investment.&lt;/p&gt;
    
&lt;p&gt;This blog post represents compendium of &lt;em&gt;best practices&lt;/em&gt; for applications running on Apache Hadoop. In fact, we introduce the notion of a&lt;em&gt;Grid Pattern&lt;/em&gt; which, similar to a &lt;a href="http://redirect.corp.yahoo.com/?url=http%3A%2F%2Fen.wikipedia.org%2Fwiki%2FDesign_pattern_%28computer_science%29"&gt;Design Pattern&lt;/a&gt;, represents a general reusable solution for applications running on the Grid.&lt;/p&gt;
    
&lt;p&gt;This blog post enumerates characteristics of &lt;em&gt;well behaved&lt;/em&gt; applications and provides guidance on appropriate uses of various features and capabilities of the Hadoop framework. It is largely prescriptive in its nature; a useful way to look at this document is to understand that applications that follow, in spirit, the best practices prescribed here are very likely to be efficient, &lt;em&gt;well-behaved&lt;/em&gt; in the multi-tenant environment of the Apache Hadoop clusters, and unlikely to fall afoul of most policies and limits.&lt;/p&gt;
    
&lt;p&gt;This blog post also attempts to highlight some of the &lt;em&gt;anti-patterns&lt;/em&gt; for applications running on the Apache Hadoop clusters.&lt;/p&gt;
    
&lt;h3&gt;Overview&lt;/h3&gt;
    
&lt;p&gt;Applications processing data on Hadoop are written using the Map-Reduce paradigm.&lt;/p&gt;
    
&lt;p&gt;A Map-Reduce job usually splits the input data-set into independent chunks, which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system. The framework takes care of scheduling tasks, monitoring them and re-executes the failed tasks.&lt;/p&gt;
    
&lt;p&gt;Map-Reduce applications specify the input/output locations and supply &lt;em&gt;map&lt;/em&gt; and &lt;em&gt;reduce&lt;/em&gt; functions via implementations of appropriate Hadoop interfaces, such as &lt;em&gt;Mapper&lt;/em&gt; and &lt;em&gt;Reducer&lt;/em&gt;. These, and other job parameters, comprise the job configuration. The Hadoop job client then submits the job (jar/executable, etc.) and configuration to the JobTracker, which then assumes the responsibility of distributing the software/configuration to the slaves, scheduling tasks and monitoring them, providing status and diagnostic information to the job-client.&lt;/p&gt;
 
&lt;p&gt;The Map/Reduce framework operates exclusively on &lt;em&gt;&amp;lt;key, value&amp;gt;&lt;/em&gt; pairs &amp;mdash; that is, the framework views the input to the job as a set of &lt;em&gt;&amp;lt;key, value&amp;gt;&lt;/em&gt; pairs and produces a set of &lt;em&gt;&amp;lt;key, value&amp;gt;&lt;/em&gt; pairs as the output of the job, conceivably of different types.&lt;/p&gt;
    
&lt;p&gt;Here is the typical data-flow in a Map-Reduce application:&lt;/p&gt;

&lt;img alt="Map Reduce data flow" src="http://developer.yahoo.com/blogs/hadoop/MapRed.png" width="282" height="227" /&gt;

 &lt;p&gt;The vast majority of Map-Reduce applications executed on the Grid do not directly implement the low-level Map-Reduce interfaces; rather they are implemented in a higher-level language, such as &lt;a href="http://redirect.corp.yahoo.com/?url=http%3A%2F%2Fhadoop.apache.org%2Fpig%2F"&gt;Pig&lt;/a&gt;.&lt;/p&gt;
    
&lt;p&gt;&lt;a href="http://twiki.corp.yahoo.com/view/CCDI/Oozie"&gt;Oozie&lt;/a&gt; is the preferred workflow management and scheduling solution on the Grid. Oozie supports multiple interfaces for applications (Hadoop Map-Reduce, Pig, Hadoop Streaming, Hadoop Pipes, etc.) and supports scheduling of applications based on either time or data-availability.&lt;/p&gt;
    
&lt;h3&gt;Grid Patterns&lt;/h3&gt;
    
&lt;p&gt;This section covers the best practices for Map-Reduce applications running on the Grid.&lt;/p&gt;
    
&lt;p&gt;&lt;strong&gt;Input&lt;/strong&gt;
    
&lt;p&gt;Hadoop Map-Reduce is optimized to process large amounts of data. The &lt;em&gt;maps&lt;/em&gt; typically process data in an embarrassingly parallel manner, typically at least 1 HDFS block of data, usually 128MB.&lt;/p&gt;
 &lt;ul&gt;&lt;li class="bullist"&gt; By default, the framework processes at most 1 HDFS file per-map. This means that if an application needs to processes a very large number of input  files, it is better to process multiple files per-map via a special input-format such as &lt;i&gt;MultiFileInputFormat&lt;/i&gt;. This is true even for applications processing a small number of tiny input files, processing multiple files per map is significantly more efficient.&lt;/li&gt;&lt;li class="bullist"&gt; If the application needs to process a very large amount of data, even if they are present in large-sized files, it is more efficient to process more  than 128MB of data per-map (see section on &lt;a href="http://twiki.corp.yahoo.com/view/Hadoop/BestPractices#Maps"&gt;Maps&lt;/a&gt;).&lt;/li&gt;&lt;/ul&gt;
    
&lt;p&gt;&lt;em&gt;Grid Pattern: Coalesce processing of multiple small input files into smaller number of maps and use larger HDFS block-sizes for processing very large data-sets.&lt;/em&gt;
&lt;/p&gt;
    
&lt;p&gt;&lt;strong&gt;Maps&lt;/strong&gt;
    
&lt;p&gt;The number of maps is usually driven by the total size of the inputs, that is, the total number of blocks of the input files. Thus, if you expect 10TB of input data and have a block-size of 128MB, you'll end up with 82,000 maps.&lt;/p&gt;
    
&lt;p&gt;Task setup takes awhile, so it is best if the maps take at least a minute to execute for large jobs.&lt;/p&gt;
    
&lt;p&gt;As explained in the section above on &lt;a href="http://twiki.corp.yahoo.com/view/Hadoop/BestPractices#Input"&gt;input&lt;/a&gt; of applications, it is more efficient to process multiple-files per map for jobs with very large number of small input files.&lt;/p&gt;
    
&lt;p&gt;Even if an application is processing large input files, such that each map is processing a whole HDFS block of data, it is more efficient to process large chunks of data per-map. For example, one way to process more data per map is to have the application process input data with larger HDFS block size, e.g., 512M or even higher, if appropriate.
 &lt;/p&gt;
   
&lt;p&gt;As an extreme example the Map-Reduce development team used ~66,000 maps to accomplish the  &lt;a href="http://developer.yahoo.net/blogs/hadoop/2009/05/hadoop_sorts_a_petabyte_in_162.html"&gt;PetaSort&lt;/a&gt;, that is, 66,000 maps to process 1PB of data (12.5G per map).&lt;/p&gt;
    
&lt;p&gt;The bottom-line is that having too many maps or lots of maps with very short run-time is anti-productive.&lt;/p&gt;
    
&lt;p&gt;&lt;em&gt;Grid Pattern: Unless the application's maps are heavily CPU bound, there is almost no reason to ever require more than 60,000-70,000 maps for a single application.&lt;/em&gt; &lt;/p&gt;

&lt;p&gt;Also, when processing larger blocks per-map, it is important ensure they have sufficient memory for the sort-buffer to speed up the map-side sort (please see the documentation for &lt;code&gt;io.sort.mb&lt;/code&gt; and &lt;code&gt;io.sort.record.percent&lt;/code&gt;). The performance of the application can improve dramatically if it can be arranged such that the majority of the map-output can be held in the map's sort-buffer, this will entail larger heap-sizes for the map JVM. It is important to remember that the in-memory footprint of deserialized input might significantly vary from the on-disk footprint; for example, certain class of Pig applications result in 3x-4x blow up of on-disk data in-memory. In such cases, applications might need significantly large heap-sizes for the JVM to ensure the map-input-records and map-output-records can be kept in memory.&lt;/p&gt;
    
&lt;p&gt;&lt;em&gt;Grid Pattern: Ensure maps are sized so that all of map-outputs can be sorted in one pass by keeping all of them in the sort-buffer.&lt;/em&gt;&lt;/p&gt;
    
&lt;p&gt;Having the right number of maps has the following advantages for applications:&lt;/p&gt;
&lt;ul&gt;&lt;li class="bullist"&gt;It reduces the scheduling overhead; having fewer maps means task-scheduling is easier and availability of free-slots in the cluster is higher.&lt;/li&gt;&lt;li class="bullist"&gt;It means the map-side is more efficient; provided there is &lt;em&gt;sufficient&lt;/em&gt; memory to accommodate the map-outputs in the sort-buffer in the map.&lt;/li&gt;&lt;li class="bullist"&gt;It reduces the number of seeks required to &lt;em&gt;shuffle&lt;/em&gt; the map-outputs from the maps to the reduces &amp;mdash; remember that each map produces output for each reduce, thus the number of seeks is &lt;code&gt;m * r&lt;/code&gt; where m is #maps and r is #reduces.&lt;/li&gt;&lt;li class="bullist"&gt;Each shuffled segment is larger, resulting in reducing the overhead of connection-establishment when compared to the 'real' work done, that is, moving bytes across the network.&lt;/li&gt;&lt;li class="bullist"&gt;It means that the reduce-side merge of the sorted map-outputs is more efficient, since the branch-factor for the merge is lesser, that is, fewer merges are needed since there are fewer sorted segments of map-outputs to merge.&lt;/li&gt;&lt;/ul&gt;
    
&lt;p&gt;The caveat to the above guidelines is that processing too much data per-map is bad for failure recovery, a single failed map might hurt the latency of the application.&lt;/p&gt;
    
&lt;p&gt;&lt;em&gt;Grid Pattern: Applications should use fewer maps to process data in parallel, as few as possible without having really bad failure recovery cases.&lt;/em&gt;&lt;/p&gt;
    
&lt;p&gt;&lt;strong&gt;Combiner&lt;/strong&gt;

&lt;p&gt;Applications, which use Combiners appropriately, reap benefits of the map-side aggregation effected by them. The primary benefit of the Combiner is that, when used appropriately, it significantly cuts down the amount of data shuffled from the maps to the reduces.&lt;/p&gt;
    
&lt;p&gt;&lt;strong&gt;Shuffle&lt;/strong&gt;
    
&lt;p&gt;Applications that use Combiners appropriately reap benefits of the map-side aggregation effected by them. The primary benefit of the Combiner is that, when used appropriately, it significantly cuts down the amount of data shuffled from the maps to the reduces.&lt;/p&gt;
    
&lt;p&gt;It is important to remember that Combiner has a performance penalty since it entails an extra serialization/de-serialization of map-output records. Applications that cannot aggregate the map-output bytes by 20-30% should not be using combiners. Applications can use the combiner input/output records counters to measure the efficacy of the Combiner.&lt;/p&gt;
   
&lt;p&gt;&lt;em&gt;Grid Pattern: Combiners help the shuffle phase of the applications by reducing network traffic. However, it is important to ensure that the Combiner does provide sufficient aggregation.&lt;/em&gt;&lt;/p&gt;
    
&lt;p&gt;&lt;strong&gt;Reduces&lt;/strong&gt;
    
&lt;p&gt;The efficiency of reduces is driven by a large extent by the performance of the &lt;a href="http://twiki.corp.yahoo.com/view/Hadoop/BestPractices#Shuffle"&gt;shuffle&lt;/a&gt;.&lt;/p&gt;
    
&lt;p&gt;The number of reduces configured for the application (r) is, obviously, a crucial factor.&lt;/p&gt;
    
&lt;p&gt;Having too many or too few reduces is anti-productive:&lt;/p&gt;
&lt;ul&gt;&lt;li class="bullist"&gt;Too few reduces cause undue load on the node on which the reduce is scheduled &amp;mdash; in extreme cases we have seen reduces processing over 100GB per-reduce. This also leads to very bad failure-recovery scenarios since a single failed reduce has a significant, adverse, impact on the latency of the job.&lt;/li&gt;&lt;li class="bullist"&gt;Too many reduces adversely affects the shuffle crossbar. Also, in extreme cases it results in too many small files created as the output of the job &amp;mdash; this hurts both the NameNode and performance of subsequent Map-Reduce applications who need to process lots of small files.&lt;/li&gt;&lt;/ul&gt;
    
&lt;p&gt;&lt;em&gt;Grid Pattern: Applications should ensure that each reduce should process at least 1-2 GB of data, and at most 5-10GB of data, in most scenarios.&lt;/em&gt;&lt;/p&gt;
   
&lt;p&gt;&lt;strong&gt;Output&lt;/strong&gt;
    
&lt;p&gt;A key factor to remember is that the number of output artifacts of an application is linear w.r.t the number of configured reduces. As discussed in the section on &lt;a href="http://twiki.corp.yahoo.com/view/Hadoop/BestPractices#Reduces"&gt;reduces&lt;/a&gt;, picking the right number of reduces is very important.&lt;/p&gt;
    
&lt;p&gt;Some other factors to consider:&lt;/p&gt;
&lt;ul&gt;&lt;li class="bullist"&gt; Consider compressing the application's output with an appropriate compressor (compression speed v/s efficiency) to improve HDFS write-performance.&lt;/li&gt;&lt;li class="bullist"&gt;Do not write out more than one output file per-reduce, using side-files is usually avoidable. Typically applications write small side-files to capture statistics and the like; counters might be more appropriate if the number of statistics collected is small.&lt;/li&gt;&lt;li class="bullist"&gt;Use an appropriate file-format for the output of the reduces. Writing out large amounts of compressed textual data with a codec such as  zlib/gzip/lzo is counter-productive for downstream consumers. This is because zlib/gzip/lzo files cannot be split and processed and the Map-Reduce framework is forced to process the entire file in a single map, in the downstream consumer applications. This results in a bad load imbalance and failure recover scenarios on the maps. Using file-formats such as SequenceFile or TFile alleviates these problems since they are both compressed and splittable.&lt;/li&gt;&lt;li class="bullist"&gt;Consider using a larger output block size (&lt;code&gt;dfs.block.size&lt;/code&gt;) when the individual output files are large (multiple GBs).&lt;/li&gt;&lt;/ul&gt;
    
&lt;p&gt;&lt;em&gt;Grid Pattern: Application outputs to be few large files, with each file spanning multiple HDFS blocks and appropriately compressed.&lt;/em&gt;&lt;/p&gt;
    
&lt;p&gt;&lt;strong&gt;Distributed Cache&lt;/strong&gt;

&lt;p&gt;DistributedCache distributes application-specific, large, read-only files efficiently. DistributedCache is a facility provided by the Map/Reduce framework to cache files (text, archives, jars and so on) needed by applications. The framework will copy the necessary files to the slave node before any tasks for the job are executed on that node. Its efficiency stems from the fact that the files are only copied once per job and the ability to cache archives which are un-archived on the slaves. It can also be used as a rudimentary software distribution mechanism for use in the map and/or reduce tasks. It can be used to distribute both jars and native libraries and they can be put on the classpath or native library path for the map/reduce tasks.&lt;/p&gt;
    
&lt;p&gt;The DistributedCache is designed to distribute a small number of medium-sized artifacts, ranging from a few MBs to few tens of MBs. One  &lt;a href="http://redirect.corp.yahoo.com/?url=http%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FMAPREDUCE-989"&gt;drawback&lt;/a&gt; of the current implementation of the DistributedCache is that there is no way to specify map or reduce specific artifacts.&lt;/p&gt;
    
&lt;p&gt;In rare cases, it might be more appropriate for the tasks themselves to do the HDFS i/o to copy the artifacts than rely on the DistributedCache, for example, if an application has a small number of reduces and need very large artifacts (e.g. greater than 512M) in the distributed-cache.&lt;/p&gt;
    
&lt;p&gt;&lt;em&gt;Grid Pattern: Applications should ensure that artifacts in the distributed-cache should not require more i/o than the actual input to the application tasks.&lt;/em&gt;&lt;/p&gt;
    
&lt;p&gt;&lt;strong&gt;Counters&lt;/strong&gt;
    
&lt;p&gt;Counters represent global counters, defined either by the Map/Reduce framework or applications. Applications can define arbitrary Counters and update them in the map and/or reduce methods. These counters are then globally aggregated by the framework.&lt;/p&gt;
    
&lt;p&gt;Counters are appropriate for tracking few, important, global bits of information. They are definitely not meant to aggregate very fine-grained statistics of applications.&lt;/p&gt;
    
&lt;p&gt;Counters are very expensive since the JobTracker has to maintain every counter of every map/reduce task for the entire duration of the application.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Grid Pattern: Applications should not use more than 10, 15 or 25 custom counters.&lt;/em&gt;&lt;/p&gt;
 
&lt;p&gt;&lt;strong&gt;Compression&lt;/strong&gt;
    
&lt;p&gt;Hadoop Map-Reduce provides facilities for the application-writer to specify compression for both intermediate map-outputs and the output of the application, that is, output of the reduces.&lt;/p&gt;
 &lt;ul&gt;&lt;li class="bullist"&gt;Intermediate Output Compression: As explained in the section on &lt;a href="http://twiki.corp.yahoo.com/view/Hadoop/BestPractices#Shuffle"&gt;shuffle&lt;/a&gt;, compression of the intermediate map-outputs with an appropriate compression codec yields better performance by saving on network traffic between the maps and the reduces. Lzo is a reasonably optimal choice for compressing map-outputs since it provides reasonable compression at very high CPU efficiencies.&lt;/li&gt;&lt;li class="bullist"&gt;Application Output Compression: As explained in the section on &lt;a href="http://twiki.corp.yahoo.com/view/Hadoop/BestPractices#Output"&gt;application output&lt;/a&gt;, compression of the outputs with an appropriate compression codec and file-format yields better latency for application. Zlib/Gzip might be an appropriate choice in a majority of cases since it provides high compression at reasonable speeds; bzip2 is usually too slow to be used.&lt;/li&gt;&lt;/ul&gt;
    
&lt;p&gt;&lt;strong&gt;Total Order Outputs&lt;/strong&gt;
    
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Sampling&lt;/em&gt;&lt;/strong&gt;
    
&lt;p&gt;Occasionally, applications need to produce totally ordered output, that is, fully-sorted. In such cases, a common anti-pattern is for applications is to use a single-reducer, forcing a single, global aggregation. Clearly, it is very inefficient - this not only puts a significant amount of load on the single node on which the reduce task is executing, but also has very bad failure recovery.&lt;/p&gt;
    
&lt;p&gt;A much better approach is to sample the input and use that to drive a &lt;em&gt;sampling partitioner&lt;/em&gt; rather than the default &lt;em&gt;hash partitioner&lt;/em&gt;. Thus, one can derive benefits of better load balancing and failure recovery.&lt;/p&gt;
    
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Joining Fully Sorted Data-Sets&lt;/em&gt;&lt;/strong&gt;
    
&lt;p&gt;Another design pattern on the Grid concerns the join of two fully-sorted data-sets whose cardinality is not an exact multiple of the other; for example, one data-set has 512 buckets while the other has 200 buckets. &lt;/p&gt;
    
&lt;p&gt;In such cases, ensuring the input data-sets have a total-order (as described in the previous section) means that the application can use the cardinality of either of the data-sets i.e. 512 or 200 buckets in the above example. Pig handles these joins in the efficient manner described here.&lt;/p&gt;
    
&lt;p&gt;&lt;strong&gt;HDFS Operations &amp;amp; JobTracker Operations&lt;/strong&gt;
    
&lt;p&gt;The NameNode is a precious resource, applications need to be careful about performing HDFS operations in the Grid. In particular, applications are discouraged from doing non-I/O operations, that is, metadata operations such as stat'ing large directories, recursive stats, and more, from the map/reduce tasks at runtime.&lt;/p&gt;
 
&lt;p&gt;Similarly, applications should not contact the JobTracker for cluster statistics, etc., from the backend.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Grid Pattern: Applications should not perform any metadata operations on the file-system from the backend, they should be confined to the job-client during job-submission. Furthermore, applications should be careful not to contact the JobTracker from the backend.&lt;/em&gt;&lt;/p&gt;
    
&lt;p&gt;&lt;strong&gt;User Logs&lt;/strong&gt;

&lt;p&gt;The user task-logs, that is, &lt;code&gt;stdout&lt;/code&gt; and &lt;code&gt;stderr&lt;/code&gt;, of map/reduce tasks are stored on the local-disk of the compute node on which the task is executed. &lt;/p&gt;
    
&lt;p&gt;Since the nodes are part of the shared infrastructure the Map-Reduce framework implements limits on the amount of task-logs stored on the node.&lt;/p&gt;
    
&lt;p&gt;&lt;strong&gt;Web-UI&lt;/strong&gt;

&lt;p&gt;The Hadoop Map-Reduce framework provides a rudimentary web-ui to track the running jobs, their progress, history of completed jobs, and so on, via the JobTracker.&lt;/p&gt;
    
&lt;p&gt;It is important to remember that the web-ui is meant to be used for humans and not for automated processes.&lt;/p&gt;
    
&lt;p&gt;Implementing automated processes to screen-scrape the web-ui is strictly prohibited. Some parts of the web-ui, such as browsing of job-history, are very resource-intensive on the JobTracker and could lead to severe performance problems when they are screen-scraped.&lt;/p&gt;
    
&lt;p&gt;If there is a need for automated statistics gathering it is better to consult the Grid Solutions, Grid SE, or the Map-Reduce development teams.&lt;/p&gt;
    
&lt;p&gt;&lt;strong&gt;Workflows&lt;/strong&gt;

&lt;p&gt;&lt;a href="http://twiki.corp.yahoo.com/view/CCDI/Oozie"&gt;Oozie&lt;/a&gt; is the preferred workflow-management and scheduling system for the Grid. Oozie manages workflows and provides scheduling either based on time or availability of data. Increasingly, latency sensitive production job &lt;em&gt;pipelines&lt;/em&gt; are being scheduled and managed through Oozie. &lt;/p&gt;
    
&lt;p&gt;A key factor to keep in mind when designing Oozie workflows is that Hadoop is better suited for batch processing of very large amounts of data. As such, it is advisable for workflows to be comprise of fewer number of medium-to-large sized Map-Reduce jobs, in terms of processing, rather than large number of small Map-Reduce jobs. As an extreme case we have seen single workflows consisting of hundreds and thousands of jobs. This is an obvious &lt;em&gt;anti-pattern&lt;/em&gt;. The Hadoop framework, currenty, is not really suited for pipelines of this nature. It would be better to &lt;em&gt;collapse&lt;/em&gt; these hundreds/thousands of Map-Reduce jobs into fewer jobs crunching more data &amp;mdash; this will help both performance and latency of the workflows.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Grid Pattern: A single Map-Reduce job in a workflow should process at least a few tens of GB of data.&lt;/em&gt;&lt;/p&gt;
    
&lt;h3&gt;Anti-Patterns&lt;/h3&gt;

&lt;p&gt;This section attempts to cover some of the common &lt;b&gt;anti-patterns&lt;/b&gt; of applications running on the Grid. These are, usually, not in keeping with the &lt;em&gt;spirit&lt;/em&gt; of a large-scale, distributed, batch, data processing system.&lt;/p&gt;
    
&lt;p&gt;This is meant to be a warning to the application developers since the Grid software stack is being &lt;em&gt;hardened&lt;/em&gt;, particularly in the upcoming 20.Fred release, and the Grid stack will be less forgiving of transgressions to the point of rejecting applications which exhibit some of the anti-patterns described here:&lt;/p&gt;
 &lt;ul&gt;&lt;li class="bullist"&gt;Applications not using a higher-level interface such as Pig unless really necessary.&lt;/li&gt;&lt;li class="bullist"&gt;Processing thousands of small files (sized less than 1 HDFS block, typically 128MB) with one map processing a single small file.&lt;/li&gt;&lt;li class="bullist"&gt;Processing very large data-sets with small HDFS block size, that is, 128MB, resulting in tens of thousands of maps. &lt;/li&gt;&lt;li class="bullist"&gt;Applications with a large number (thousands) of maps with a very small runtime (e.g., 5s).&lt;/li&gt;&lt;li class="bullist"&gt;Straightforward aggregations without the use of the Combiner. &lt;/li&gt;&lt;li class="bullist"&gt;Applications with greater than 60,000-70,000 maps.&lt;/li&gt;&lt;li class="bullist"&gt;Applications processing large data-sets with very few reduces (e.g., 1). &lt;/li&gt;&lt;ul&gt;&lt;li class="bullist"&gt;Pig scripts processing large data-sets without using the PARALLEL keyword&lt;/li&gt;&lt;li class="bullist"&gt; Applications using a single reduce for total-order amount the output records&lt;/li&gt;&lt;/ul&gt; &lt;li class="bullist"&gt;Applications processing data with very large number of reduces, such that each reduce processes less than 1-2GB of data. &lt;/li&gt;&lt;li class="bullist"&gt;Applications writing out multiple, small, output files from each reduce. &lt;/li&gt;
&lt;li class="bullist"&gt;Applications using the DistributedCache to distribute a large number of artifacts and/or very large artifacts (hundreds of MBs each).&lt;/li&gt;&lt;li class="bullist"&gt;Applications using tens or hundreds of counters per task.&lt;/li&gt;&lt;li class="bullist"&gt; Applications performing metadata operations (e.g. listStatus) on the file-system from the map/reduce tasks.&lt;/li&gt; &lt;li class="bullist"&gt; Applications doing screen scraping of JobTracker web-ui for status of queues/jobs or worse, job-history of completed jobs.&lt;/li&gt;&lt;li class="bullist"&gt;Workflows comprising hundreds or thousands of small jobs processing small amounts of data.&lt;/li&gt;&lt;/ul&gt;


&lt;p&gt;Arun Murthy, Architect, Hadoop Team at Yahoo!&lt;/p&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=B40vEh_7dAA:EDLDk_bVK2A:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=B40vEh_7dAA:EDLDk_bVK2A:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?i=B40vEh_7dAA:EDLDk_bVK2A:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=B40vEh_7dAA:EDLDk_bVK2A:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=B40vEh_7dAA:EDLDk_bVK2A:PhkjNP4BSzk"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?d=PhkjNP4BSzk" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=B40vEh_7dAA:EDLDk_bVK2A:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?i=B40vEh_7dAA:EDLDk_bVK2A:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/YDNHadoop/~4/B40vEh_7dAA" height="1" width="1"/&gt;</description>
         <link>http://feeds.developer.yahoo.net/~r/YDNHadoop/~3/B40vEh_7dAA/apache_hadoop_best_practices_a.html</link>
         <guid isPermaLink="false">http://developer.yahoo.com/blogs/hadoop/2010/08/apache_hadoop_best_practices_a.html</guid>
                  <category domain="http://www.sixapart.com/ns/types#category">Developer Note</category>
        
                  <category domain="http://www.sixapart.com/ns/types#tag">hadoop</category>
        
         <pubDate>Wed, 18 Aug 2010 17:00:00 -0800</pubDate>
      <feedburner:origLink>http://developer.yahoo.com/blogs/hadoop/2010/08/apache_hadoop_best_practices_a.html</feedburner:origLink></item>
            <item>
         <title>Pig and Hive at Yahoo!</title>
         <description>&lt;p&gt;Yahoo! has begun evaluating Hive for use as part of its Hadoop stack.  Since, in many peoples' minds, Hive and Pig are roughly equivalent and Pig Latin is very close to SQL, this has led to some confusion.  Why are we interested in using both technologies?  

&lt;p&gt;As we have looked at our workloads and analyzed our use cases, we have come to the conclusion that the different use cases require different tools. In this post, I will walk through our thinking on why both of these tools belong in our toolkit, and when each is appropriate.

&lt;h3&gt;Data preparation and presentation&lt;/h3&gt;

&lt;p&gt;Let me begin with a little background on processing and using large data. Data processing often splits into three separate tasks:  data collection, data preparation, and data presentation. I will not discuss the data collection phase, because I want to focus on Pig and Hive, neither of which play a role in that phase.

&lt;p&gt;The data preparation phase is often known as ETL (Extract Transform Load) or the &lt;em&gt;data factory&lt;/em&gt;. "Factory" is a good analogy because it captures the essence of what is being done in this stage: Just as a physical factory brings in raw materials and outputs products ready for consumers, so a data factory brings in raw data and produces data sets ready for data users to consume. Raw data is loaded in, cleaned up, conformed to the selected data model, joined with other data sources, and so on. Users in this phase are generally engineers, data specialists, or researchers.

&lt;p&gt;The data presentation phase is usually referred to as the &lt;em&gt;data warehouse&lt;/em&gt;. A warehouse stores products ready for consumers; they need only come and select the proper products off of the shelves.  In this phase, users may be engineers using the data for their systems, analysts, or decisionmakers.

&lt;p&gt;Given the different workloads and different users for each phase, we have found that different tools work best in each phase.  Pig (combined with a workflow system such as Oozie) is best suited for the data factory, and Hive for the data warehouse.

&lt;h3&gt;Data factory use cases&lt;/h3&gt;

&lt;p&gt;In our data factory we have observed three distinct workloads: pipelines, iterative processing, and research.

&lt;p&gt;&lt;strong&gt;Pipelines&lt;/strong

&lt;p&gt;The classical data pipelines bring in a data feed, and clean and transform it.  A common example of such a feed is logs from Yahoo!'s web servers.  These logs undergo a cleaning step where bots, company internal views, and clicks are removed.  We also do transformations such as, for each click,
finding the page view that preceded that click.

&lt;p&gt;See my previous &lt;a href="http://developer.yahoo.net/blogs/hadoop/2010/01/comparing_pig_latin_and_sql_fo.html"&gt;post&lt;/a&gt; that discusses why we have found Pig to be the best tool for implementing these pipelines. We use it together with our workflow tool, &lt;a href="http://yahoo.github.com/oozie/"&gt;Oozie&lt;/a&gt;, to construct pipelines, some of which run tens of thousands of Pig jobs a day.

&lt;p&gt;&lt;strong&gt;Iterative processing&lt;/strong

&lt;p&gt;A closely related use case is iterative processing.  In this case, there is usually one very large data set that is maintained. Typical processing on
that data set involves bringing in small new pieces of data that will change the state of the large data set.  

&lt;p&gt;For example, consider a data set that contained all the news stories currently known to Yahoo! News.  You can envision this as a huge graph, where each story is a node.  In this graph, there are links between stories that reference the same events.  Every few minutes a new set of stories comes in, and the tools need to add these to the graph, find related stories and create links, and remove old stories that these new stories supersede.

&lt;p&gt;What sets this off from the standard pipeline case is the constant inflow of small changes. These require the use of an incremental processing model to process this data in a reasonable amount of time. 

&lt;p&gt;For example, if the process has already done a join against the graph of all news stories, and a small set of new stories arrives, re-running the join across the whole set will not be desirable. It will take hours or days.

&lt;p&gt;Instead, joining against the new incremental data and using the results together with the results from the previous full join is the correct approach.  This will take only a few minutes.  Standard database operations can be implemented in this incremental way in Pig Latin, making Pig a good tool for this use case.

&lt;p&gt;&lt;strong&gt;Research&lt;/strong&gt;

&lt;p&gt;A third use case is research. Yahoo! has many scientists who use our grid tools to comb through the petabytes of data we have. Many of these researchers want to quickly write a script to test a theory or gain deeper insight.  

&lt;p&gt;But, in the data factory, data may not be in a nice, standardized state yet. This makes Pig a good fit for this use case as well, since it supports data with partial or unknown schemas, and semi-structured or unstructured data.

&lt;p&gt;Pig integration with streaming also makes it easy for researchers to take a Perl or Python script they have already debugged on a small data set and run it against a huge data set.

&lt;h3&gt;Data warehouse use cases&lt;/h3&gt;

&lt;p&gt;In the data warehouse phase of processing, we see two dominant use cases: business-intelligence analysis and ad-hoc queries.

&lt;p&gt;In the first case, users connect the data to business intelligence (BI) tools &amp;mdash; such as MicroStrategy  &amp;mdash; to generate reports or do further analysis.

&lt;p&gt;In the second case, users run ad-hoc queries issued by data analysts or decisionmakers.  

&lt;p&gt;In both cases, the relational model and SQL are the best fit. Indeed, data warehousing has been one of the core use cases for SQL through much of its history. It has the right constructs to support the types of queries and tools that analysts want to use. And it is already in use by both the tools and users in the field.  

&lt;p&gt;The Hadoop subproject &lt;a href="http://hadoop.apache.org/hive/"&gt;Hive&lt;/a&gt; provides a SQL interface and relational model for Hadoop. The Hive team has begun work to integrate with BI tools via interfaces such as &lt;a href="http://wiki.apache.org/hadoop/Hive/HiveODBC"&gt;ODBC&lt;/a&gt;.

&lt;h3&gt;Pig + Hive = Hadoop toolkit&lt;/h3&gt;

&lt;p&gt;The widespread use of Pig at Yahoo! has enabled the migration of our data factory processing to Hadoop. With the adoption of Hive, we will be able to move much of our data warehousing to Hadoop as well.  

&lt;p&gt;Having the data factory and the data warehouse on the same system will lower data-loading time into the warehouse &amp;mdash; as soon as the factory is finished, it is available in the warehouse.  

&lt;p&gt;It will also enable us to share &amp;mdash; across both the factory and the warehouse &amp;mdash; metadata, monitoring, and management tools; support and operations teams; and hardware. 

&lt;p&gt;So we are excited to add Hive to our toolkit, and look forward to using both these tools together as we lean on Hadoop to do more and more of our data processing.

&lt;p&gt;Alan Gates
Architect, Yahoo! Grid Team&lt;/p&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=kjtPcXDYn-E:TRX2a-A-y-A:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=kjtPcXDYn-E:TRX2a-A-y-A:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?i=kjtPcXDYn-E:TRX2a-A-y-A:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=kjtPcXDYn-E:TRX2a-A-y-A:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=kjtPcXDYn-E:TRX2a-A-y-A:PhkjNP4BSzk"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?d=PhkjNP4BSzk" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=kjtPcXDYn-E:TRX2a-A-y-A:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?i=kjtPcXDYn-E:TRX2a-A-y-A:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/YDNHadoop/~4/kjtPcXDYn-E" height="1" width="1"/&gt;</description>
         <link>http://feeds.developer.yahoo.net/~r/YDNHadoop/~3/kjtPcXDYn-E/pig_and_hive_at_yahoo.html</link>
         <guid isPermaLink="false">http://developer.yahoo.com/blogs/hadoop/2010/08/pig_and_hive_at_yahoo.html</guid>
                  <category domain="http://www.sixapart.com/ns/types#category">Developer Note</category>
        
                  <category domain="http://www.sixapart.com/ns/types#tag">hadoop</category>
        
         <pubDate>Wed, 18 Aug 2010 11:45:00 -0800</pubDate>
      <feedburner:origLink>http://developer.yahoo.com/blogs/hadoop/2010/08/pig_and_hive_at_yahoo.html</feedburner:origLink></item>
            <item>
         <title>Hadoop Bay Area User Group – August 18, Yahoo! HQ</title>
         <description>Hi Hadoopers,
&lt;p&gt;
I'm excited to invite you to the next Hadoop Bay Area User Group, August 18th, 6 p.m., at the Yahoo! Sunnyvale campus.
&lt;/p&gt;
&lt;p&gt;
The Hadoop community is growing &amp;mdash; we had more than 200 attendees at the &lt;a href="http://yhoo.it/dA69r9 "&gt;last meetup&lt;/a&gt;.
&lt;/p&gt;
&lt;p&gt;
We invite you to attend whether you are an active submitter, developing Hadoop-based applications, or completely new to the &lt;a href="http://hadoop.apache.org/"&gt;Apache Hadoop&lt;/a&gt; world. In addition to interesting presentations, you will enjoy food, beer, and great networking.
&lt;/p&gt;
&lt;p&gt;
The August 18 event comprises two sessions:
&lt;ul&gt;&lt;li class="bullist"&gt;Arun Murthy, from the Hadoop team at Yahoo! will introduce compendium of best practices for applications running on Apache Hadoop. In fact, we introduce the notion of a Grid Pattern which, similar to &lt;a href=http://bit.ly/B7t&gt;Design Pattern&lt;/a&gt;, represents a general reusable solution for applications running on the Grid. He will even cover the anti-patterns of applications running on the Apache Hadoop clusters. Arun will enumerate characteristics of well-behaved applications and provide guidance on appropriate uses of various features and capabilities of the Hadoop framework. It is largely prescriptive in its nature; a useful way to look at the presention is to understand that applications that follow, in spirit, the best practices prescribed here are very likely to be efficient, well-behaved in the multi-tenant environment of the Apache Hadoop clusters and unlikely to fall afoul of most policies and limits.&lt;/li&gt;
&lt;li class="bullist"&gt;Stefan Groschupf, the co-founder and CTO of &lt;strong&gt;&lt;a href="http://www.datameer.com/"&gt;Datameer&lt;/a&gt;&lt;/strong&gt;, will discuss challenges in social media analytics and how to overcome these using big data analytics built on Hadoop, in his “Social Media:  What’s Really the Buzz?” talk. Identifying true thought leads and influencers in social media conversations are becoming increasingly important, so that companies can better understand who is having an impact on their customers' buying decisions. Rather than counting mentions in limited subsets of social media data, organizations need a solution that can uncover complex relationships buried in massive volumes of social media data and a way to bring in data from multiple online data sources to determine the quality and effectiveness of user commentary.&lt;/li&gt;&lt;/ul&gt;
&lt;/p&gt;
For those of you who are not able to attend in person the session's slides and video recording will be posted on this blog after the event. Stay tuned!
&lt;/p&gt;

&lt;p&gt;
&lt;strong&gt;Please RSVP at the &lt;a href="http://www.meetup.com/hadoop/calendar/14081410/"&gt;Hadoop Bay Area Meetup page&lt;/a&gt;&lt;/strong&gt;
&lt;/p&gt;
&lt;p&gt;
Looking forward to see you all soon.
&lt;/p&gt;
&lt;p&gt;
Susheel Kaushik&lt;br&gt;
Director, Product Management&lt;br&gt;
Cloud Computing at Yahoo! 
&lt;/p&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=iBoO1EgNSYo:kV3JJwN-tc0:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=iBoO1EgNSYo:kV3JJwN-tc0:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?i=iBoO1EgNSYo:kV3JJwN-tc0:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=iBoO1EgNSYo:kV3JJwN-tc0:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=iBoO1EgNSYo:kV3JJwN-tc0:PhkjNP4BSzk"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?d=PhkjNP4BSzk" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=iBoO1EgNSYo:kV3JJwN-tc0:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?i=iBoO1EgNSYo:kV3JJwN-tc0:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/YDNHadoop/~4/iBoO1EgNSYo" height="1" width="1"/&gt;</description>
         <link>http://feeds.developer.yahoo.net/~r/YDNHadoop/~3/iBoO1EgNSYo/hug_aug2010.html</link>
         <guid isPermaLink="false">http://developer.yahoo.com/blogs/hadoop/2010/08/hug_aug2010.html</guid>
                  <category domain="http://www.sixapart.com/ns/types#category">announcements</category>
        
                  <category domain="http://www.sixapart.com/ns/types#tag">hadoop user group</category>
        
         <pubDate>Wed, 11 Aug 2010 17:43:32 -0800</pubDate>
      <feedburner:origLink>http://developer.yahoo.com/blogs/hadoop/2010/08/hug_aug2010.html</feedburner:origLink></item>
            <item>
         <title>July HUG Recap</title>
         <description>&lt;p&gt;Thanks to the over 200 developers who came to Yahoo! recently for our monthly Hadoop User Group meeting. The energy in the packed room was phenomenal, and conversations continued long after the formal sessions.

&lt;p&gt;&lt;img alt="Hundreds of Hadoop Fans Flock to Yahoo! for the Hadoop User Group" src='http://developer.yahoo.com/blogs/hadoop/JulyHug.jpg' width="425" height="355" /&gt;&lt;br&gt;
&lt;em&gt;Hundreds of Hadoop Fans Flock to Yahoo! for the Hadoop User Group&lt;/em&gt;

&lt;p&gt;The event started with Nitin Motgi from Yahoo! describing the challenge of content optimization at scale and how Yahoo! is leveraging the power of Hadoop Stack to conquer this challenge. Hadoop, along with Hive and HBase, is the technology that enables mass content personalization for Yahoo! users. Nitin talked about the high-level architecture and modeling challenges. He concluded with a summary on the learnings so far.

&lt;p&gt;&lt;object id="__sse4819529" width="425" height="355"&gt;&lt;param name="movie" value='http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=1content-optimization-hug-2010-07-21-100722173047-phpapp02&amp;stripped_title=1-content-optimizationhug20100721' /&gt;&lt;param name="allowFullScreen" value="true"/&gt;&lt;param name="allowScriptAccess" value="always"/&gt;&lt;embed name="__sse4819529" src='http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=1content-optimization-hug-2010-07-21-100722173047-phpapp02&amp;stripped_title=1-content-optimizationhug20100721' type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="425" height="355"&gt;&lt;/embed&gt;&lt;/object&gt;

&lt;p&gt;&lt;object width="480" height="385"&gt;&lt;param name="movie" value='http://www.youtube.com/v/pZ0Hh8Mae98&amp;amp;hl=en_US&amp;amp;fs=1'&gt;&lt;/param&gt;&lt;param name="allowFullScreen" value="true"&gt;&lt;/param&gt;&lt;param name="allowscriptaccess" value="always"&gt;&lt;/param&gt;&lt;embed src='http://www.youtube.com/v/pZ0Hh8Mae98&amp;amp;hl=en_US&amp;amp;fs=1' type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="480" height="385"&gt;&lt;/embed&gt;&lt;/object&gt;

&lt;p&gt;&lt;img src='http://us.i1.yimg.com/us.yimg.com/i/nt/ic/ut/bsc/vidcam12_1.gif' border='0' hspace='10'&gt;&lt;a href='http://ydn.zenfs.com/video/hug_7-18-10_ContentOptimization.mp4'&gt;Download video&lt;/a&gt;&lt;br&gt;

&lt;p&gt;Anil Madan from &lt;a href="http://www.eBay.com/"&gt;eBay&lt;/a&gt; discussed the adoption of the Hadoop Stack at eBay and their future adoption plans. He also touched upon how eBay is leveraging the Hadoop clusters to enhance search relevance and extend catalog coverage for eBay. He described in detail eBay's data-sourcing patters. Anil concluded with the opportunities for the road ahead. 

&lt;p&gt;&lt;object id="__sse4819530" width="425" height="355"&gt;&lt;param name="movie" value='http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=2hadoopebay-hug-2010-07-21-100722173042-phpapp02&amp;stripped_title=2-hadoope-bayhug20100721' /&gt;&lt;param name="allowFullScreen" value="true"/&gt;&lt;param name="allowScriptAccess" value="always"/&gt;&lt;embed name="__sse4819530" src='http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=2hadoopebay-hug-2010-07-21-100722173042-phpapp02&amp;stripped_title=2-hadoope-bayhug20100721' type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="425" height="355"&gt;&lt;/embed&gt;&lt;/object&gt;

&lt;p&gt;&lt;object width="640" height="385"&gt;&lt;param name="movie" value="http://www.youtube.com/v/Kc1UnZ05GWs&amp;amp;hl=en_US&amp;amp;fs=1"&gt;&lt;/param&gt;&lt;param name="allowFullScreen" value="true"&gt;&lt;/param&gt;&lt;param name="allowscriptaccess" value="always"&gt;&lt;/param&gt;&lt;embed src="http://www.youtube.com/v/Kc1UnZ05GWs&amp;amp;hl=en_US&amp;amp;fs=1" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="640" height="385"&gt;&lt;/embed&gt;&lt;/object&gt;

&lt;p&gt;&lt;img src='http://us.i1.yimg.com/us.yimg.com/i/nt/ic/ut/bsc/vidcam12_1.gif' border='0' hspace='10'&gt;&lt;a href='http://ydn.zenfs.com/video/hug_7-18-10_eBay.mp4'&gt;Download video&lt;/a&gt;&lt;br&gt;

&lt;p&gt;Next was Doug Cutting of &lt;a href="http://cloudera.com/"&gt;Cloudera&lt;/a&gt;, who introduced Avro, a generic serialization system. He talked about the salient features of the system, contrasted the existing serialization system, and highlighted how Avro can help solve the backward-compatibility issues with Hadoop. He walked us through samples on Avro schemas and Avro Map Reduce APIs. Do view the presentation and video for a quick overview and summary of Avro. 

&lt;p&gt;&lt;object id="__sse4819527" width="425" height="355"&gt;&lt;param name="movie" value='http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=3avro-hug-2010-07-21-100722173037-phpapp02&amp;stripped_title=3-avro-hug20100721' /&gt;&lt;param name="allowFullScreen" value="true"/&gt;&lt;param name="allowScriptAccess" value="always"/&gt;&lt;embed name="__sse4819527" src='http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=3avro-hug-2010-07-21-100722173037-phpapp02&amp;stripped_title=3-avro-hug20100721' type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="425" height="355"&gt;&lt;/embed&gt;&lt;/object&gt;

&lt;p&gt;&lt;object width="480" height="385"&gt;&lt;param name="movie" value='http://www.youtube.com/v/EBV4C-P3G94&amp;amp;hl=en_US&amp;amp;fs=1'&gt;&lt;/param&gt;&lt;param name="allowFullScreen" value="true"&gt;&lt;/param&gt;&lt;param name="allowscriptaccess" value="always"&gt;&lt;/param&gt;&lt;embed src='http://www.youtube.com/v/EBV4C-P3G94&amp;amp;hl=en_US&amp;amp;fs=1' type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="480" height="385"&gt;&lt;/embed&gt;&lt;/object&gt;

&lt;p&gt;&lt;img src='http://us.i1.yimg.com/us.yimg.com/i/nt/ic/ut/bsc/vidcam12_1.gif' border='0' hspace='10'&gt;&lt;a href='http://ydn.zenfs.com/video/hug_7-18-10_avro.mp4'&gt;Download video&lt;/a&gt;&lt;br&gt;

&lt;p&gt;We at Yahoo! embrace Hadoop, and we understand and relate to the hurdles eBay is facing, and stumble upon the issues that Avro promises to solve. As always, we are looking for exciting technologies and experiences you want to share. Please contact me via the &lt;a href="http://www.meetup.com/hadoop/"&gt;Hadoop Bay Area User Group Meetup page&lt;/a&gt;.


&lt;p style="overflow:hidden;background:#eee;padding-top:10px;margin:.5em 0;padding:15px 10px 5px 10px;border-top:1px solid #ccc;"&gt;&lt;img height="50" width="50" style="margin-top:-10px;float: left; margin-right: 6px;border:1px solid #999;background:#fff;padding:5px;" src='http://developer.yahoo.com/blogs/hadoop/susheel_kaushik.jpg' alt="Susheel Kaushik"&gt;
Susheel Kaushik&lt;br/&gt;
Director, Product Management&lt;BR&gt;
Cloud Computing at Yahoo!&lt;/p&gt;&lt;div style="clear:both"&gt;&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=uLlu-JUZ3UI:D9Gn2Ar0TD0:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=uLlu-JUZ3UI:D9Gn2Ar0TD0:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?i=uLlu-JUZ3UI:D9Gn2Ar0TD0:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=uLlu-JUZ3UI:D9Gn2Ar0TD0:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=uLlu-JUZ3UI:D9Gn2Ar0TD0:PhkjNP4BSzk"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?d=PhkjNP4BSzk" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=uLlu-JUZ3UI:D9Gn2Ar0TD0:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?i=uLlu-JUZ3UI:D9Gn2Ar0TD0:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/YDNHadoop/~4/uLlu-JUZ3UI" height="1" width="1"/&gt;</description>
         <link>http://feeds.developer.yahoo.net/~r/YDNHadoop/~3/uLlu-JUZ3UI/july_hug_recap.html</link>
         <guid isPermaLink="false">http://developer.yahoo.com/blogs/hadoop/2010/08/july_hug_recap.html</guid>
                  <category domain="http://www.sixapart.com/ns/types#category">Hadoop User Group</category>
        
                  <category domain="http://www.sixapart.com/ns/types#tag">hadoop</category>
                  <category domain="http://www.sixapart.com/ns/types#tag">hadoop user group</category>
        
         <pubDate>Wed, 04 Aug 2010 14:00:00 -0800</pubDate>
      <feedburner:origLink>http://developer.yahoo.com/blogs/hadoop/2010/08/july_hug_recap.html</feedburner:origLink></item>
            <item>
         <title>Hadoop Archive: File Compaction for HDFS</title>
         <description>&lt;p&gt;Thanks to &lt;a href="mailto:tsz@yahoo-inc.com"&gt;Tsz-Wo Nicholas Sze&lt;/a&gt; and &lt;a href="mailto:mahadev@yahoo-inc.com"&gt;Mahadev Konar&lt;/a&gt; for this article.&lt;/p&gt;
    
&lt;h2&gt;The Problem of Many Small Files&lt;/h2&gt;
    
&lt;p&gt;The &lt;a href="http://developer.yahoo.com/hadoop/tutorial/module2.html"&gt;Hadoop Distributed File System (HDFS)&lt;/a&gt;  is designed to store and process large (terabytes) data sets. At Yahoo!, for example, a large production cluster may have 14 PB disk spaces and store 60 millions of files.&lt;/p&gt;
    
&lt;p&gt;However, storing a large number of small files in HDFS is inefficient. We call a file &lt;em&gt;small&lt;/em&gt; when its size is substantially less than the HDFS block size, which is 128 MB by default. Files and blocks are &lt;em&gt;name objects&lt;/em&gt; in HDFS and they occupy namespace. The namespace capacity of the system is naturally limited by the physical memory in the NameNode.&lt;/p&gt; 

&lt;p&gt;When there are many small files stored in the system, these small files occupy a large portion of the namespace. As a consequence, the disk space is underutilized because of the namespace limitation. In one of our production clusters, there are 57 millions files of sizes less than 128 MB, which means that these files contain only one block. These small files use up 95% of the namespace but only occupy 30% of the disk space.&lt;/p&gt;
    
&lt;p&gt;&lt;em&gt;Hadoop Archive (HAR)&lt;/em&gt; is an effective solution to the problem of many small files. HAR packs a number of small files into large files so that the original files can be accessed in parallel transparently (without expanding the files) and efficiently. &lt;/p&gt;

&lt;p&gt;HAR increases the scalability of the system by reducing the namespace usage and decreasing the operation load in the NameNode. This improvement is orthogonal to memory optimization in NameNode and distributing namespace management across multiple NameNodes.&lt;/p&gt; 

&lt;p&gt;Hadoop Archive is also MapReduce-friendly — it allows parallel access to the original files by MapReduce jobs.&lt;/p&gt;

 &lt;h2&gt;What is Hadoop Archive?&lt;/h2&gt;
    
&lt;p&gt;Hadoop Archive has three components: a data model that defines the archive format, a FileSystem interface that allows transparent access, and a tool for creating archives with MapReduce jobs.&lt;/p&gt;
    
&lt;h3&gt;The Data Model: har format&lt;/h3&gt;

&lt;img alt="har.png" src='http://developer.yahoo.com/blogs/hadoop/har.png' width="600" /&gt;

&lt;p&gt;Figure 1: Archiving small files&lt;/p&gt;

&lt;p&gt;The Hadoop Archive's data format is called &lt;em&gt;har&lt;/em&gt;, with the following layout:&lt;/p&gt;
    
&lt;p&gt;&lt;code&gt;foo.har/_masterindex    //stores hashes and offsets&lt;br /&gt;
 foo.har/_index                   //stores file statuses&lt;br /&gt;
 foo.har/part-[1..n]     //stores actual file data&lt;/code&gt;&lt;/p&gt;
    
&lt;p&gt;The file data is stored in multiple part files, which are indexed for keeping the original separation of data intact.  Moreover, the part files can be accessed by MapReduce programs in parallel.  The index files also record the original directory tree structures and the file statuses.  In Figure 1, a directory containing many small files is archived into a directory with large files and indexes.&lt;/p&gt;
    
&lt;h3&gt;HarFileSystem – A first-class FileSystem providing transparent access&lt;/h3&gt;

&lt;p&gt;Most archival systems, such as tar, are tools for archiving and de-archiving. Generally, they do not fit into the actual file system layer and hence are not transparent to the application writer in that the user had to de-archive the archive before use.&lt;/p&gt;

&lt;p&gt;Hadoop Archive is integrated in the Hadoop’s &lt;em&gt;FileSystem&lt;/em&gt; interface.  The &lt;em&gt;HarFileSystem&lt;/em&gt; implements the FileSystem interface and provides access via the &lt;code&gt;har://&lt;/code&gt; scheme.  This exposes the archived files and directory tree structures transparently to the users.  Files in a har can be accessed directly without expanding it.  For example, we have the following command to copy a HDFS file to a local directory:&lt;/p&gt;
    
&lt;p&gt;&lt;code&gt;hadoop fs –get hdfs://namenode/foo/file-1 localdir&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Suppose an archive &lt;code&gt;bar.har&lt;/code&gt; is created from the foo directory.  Then, the command to copy the original file becomes&lt;/p&gt;
    
&lt;p&gt;&lt;code&gt;hadoop fs –get har://namenode/bar.har#foo/file-1 localdir&lt;/code&gt;&lt;/p&gt;
    
&lt;p&gt;Users only have to change the URI paths.  Alternatively, users may choose to create a symbolic link (from &lt;code&gt;hdfs://namenode/foo&lt;/code&gt; to &lt;code&gt;har://namenode/bar.har#foo&lt;/code&gt; in the example above), then even the URIs do not need to be changed. In either case, &lt;code&gt;HarFileSystem&lt;/code&gt; will be invoked automatically for providing access to the files in the har.  Because of this transparent layer, har is compatible with the Hadoop APIs, MapReduce, the shell command -ine interface, and higher-level applications like Pig, Zebra, Streaming, Pipes, and DistCp.&lt;/p&gt;
    
&lt;h3&gt;The Archiving Tool: A MapReduce program for creating har&lt;/h3&gt;
    
&lt;p&gt;To create har efficiently in parallel, we implemented an archiving tool using MapReduce.  The tool can be invoked by the command&lt;/p&gt;
    
&lt;p&gt;&lt;code&gt;hadoop archive -archiveName &amp;lt;name&amp;gt; &amp;lt;src&amp;gt;* &amp;lt;dest&amp;gt;&lt;/code&gt;&lt;/p&gt;
    
&lt;p&gt;A list of files is generated by traversing the source directories recursively, and then the list is split into map task inputs.  Each map task creates a part file (about 2 GB, configurable) from a subset of the source files and outputs the metadata.  Finally, a reduce task collects metadata and generates the index files.&lt;/p&gt;
    
&lt;h3&gt;Future tasks&lt;/h3&gt;
    
&lt;p&gt;Currently, a har cannot be modified after it has been created.  Supporting modifiable har is one candidate for future work.  Once HFDS supports variable length blocks, har could possibly be created by moving the blocks metadata without copying the actual data.  Then, har creation would be nearly instantaneous.&lt;/p&gt;

&lt;p style="overflow:hidden;background:#eee;padding-top:10px;margin:.5em 0;padding:15px 10px 5px 10px;border-top:1px solid #ccc;"&gt;
Thanks to &lt;a href="mailto:tsz@yahoo-inc.com"&gt;Tsz-Wo Nicholas Sze&lt;/a&gt; and &lt;a href="mailto:mahadev@yahoo-inc.com"&gt;Mahadev Konar&lt;/a&gt; for this article.&lt;/p&gt;&lt;div style="clear:both"&gt;&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=x52CnnYOwRE:lMcCLSWSEBw:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=x52CnnYOwRE:lMcCLSWSEBw:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?i=x52CnnYOwRE:lMcCLSWSEBw:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=x52CnnYOwRE:lMcCLSWSEBw:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=x52CnnYOwRE:lMcCLSWSEBw:PhkjNP4BSzk"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?d=PhkjNP4BSzk" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=x52CnnYOwRE:lMcCLSWSEBw:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?i=x52CnnYOwRE:lMcCLSWSEBw:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/YDNHadoop/~4/x52CnnYOwRE" height="1" width="1"/&gt;</description>
         <link>http://feeds.developer.yahoo.net/~r/YDNHadoop/~3/x52CnnYOwRE/hadoop_archive_file_compaction.html</link>
         <guid isPermaLink="false">http://developer.yahoo.com/blogs/hadoop/2010/07/hadoop_archive_file_compaction.html</guid>
                  <category domain="http://www.sixapart.com/ns/types#category">Developer Note</category>
        
                  <category domain="http://www.sixapart.com/ns/types#tag">hadoop</category>
                  <category domain="http://www.sixapart.com/ns/types#tag">HDFS</category>
        
         <pubDate>Tue, 27 Jul 2010 13:00:00 -0800</pubDate>
      <feedburner:origLink>http://developer.yahoo.com/blogs/hadoop/2010/07/hadoop_archive_file_compaction.html</feedburner:origLink></item>
            <item>
         <title>Hadoop 0.20.S Virtual Machine Appliance</title>
         <description>At Yahoo!, we recently implemented a stronger notion of security for the Hadoop platform, based on Kerberos as underlying authentication system. We also successfully enabled this feature within Yahoo! on our internal data processing&amp;nbsp; clusters. I am sure many Hadoop developers and enterprise users are looking forward to get hands-on experience with this enterprise-class Hadoop Security feature.
&lt;br/&gt;&lt;br/&gt;
In the past, we've aided developers and users get started with Hadoop by hosting a comprehensive &lt;a href="http://developer.yahoo.com/hadoop/tutorial/"&gt;Hadoop tutorial on YDN&lt;/a&gt;, along with a pre-configured single node &lt;a href="http://p.yimg.com/c/ycs/ydn/hadoop/hadoop-vm-appliance-0-18-0_v1.zip"&gt;Hadoop (0.18.0) Virtual Machine appliance&lt;/a&gt;.
&lt;br/&gt;&lt;br/&gt;
This time, we decided to upgrade this Hadoop VM with a pre-configured single node Hadoop 0.20.S cluster, along with required Kerberos system components. We have also included &lt;a href="http://hadoop.apache.org/pig/docs/r0.7.0/"&gt;Pig (version 0.7.0)&lt;/a&gt;, a high level SQL-like data processing language used at Yahoo!.
&lt;br/&gt;&lt;br/&gt;
This blog post describes how to get started with the Hadoop 20.S VM appliance. The basic information about downloading, setting up VM Player, and using the Hadoop VM is same as described in the &lt;a href="http://developer.yahoo.com/hadoop/tutorial/module3.html"&gt;tutorial
module-3&lt;/a&gt; &amp;mdash; except the user has to use the following information and links to download the latest VM Player and&amp;nbsp; Hadoop 0.20.S VM Image. You should also review the following information for security-specific commands that need to be performed before running M/R or Pig jobs.
&lt;br/&gt;&lt;br/&gt;
For more details on deploying and configuring Yahoo! Hadoop 0.20.S security distribution, look for continuing announcements and details on &lt;a href="http://developer.yahoo.com/hadoop/"&gt;Hadoop-YDN&lt;/a&gt;.
&lt;br/&gt;&lt;br/&gt;
&lt;h3&gt;Installing and Running the Hadoop 0.20.S Virtual Machine:&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;big style="font-weight: bold;"&gt;&lt;span
 style="color: rgb(255, 102, 0);"&gt;Virtual Machine and Hadoop
environment&lt;/span&gt;:&lt;/big&gt; See &lt;a
 href="http://developer.yahoo.com/hadoop/tutorial/module3.html#vm"&gt;details&lt;/a&gt;
here. &lt;/li&gt;
&lt;/ul&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;big style="font-weight: bold;"&gt;&lt;span
 style="color: rgb(255, 102, 0);"&gt;Install VMware Player&lt;/span&gt;:&lt;/big&gt;&lt;span
 style="font-weight: bold;"&gt; &lt;/span&gt;See &lt;a
 href="http://developer.yahoo.com/hadoop/tutorial/module3.html#vmware-install"&gt;details&lt;/a&gt;
here. To download latest VMware Player for Windows/Linux, go to &lt;a
 href="http://www.vmware.com/products/player/"&gt;Vmware&lt;/a&gt; site&lt;/li&gt;&lt;/ul&gt;
&lt;ul style="font-weight: bold;"&gt;&lt;li&gt;&lt;big&gt;&lt;span style="color: rgb(255, 102, 0);"&gt;Setting up the Virtual Environment for Hadoop 0.20.S&lt;/span&gt;:&lt;/big&gt;&lt;/li&gt;&lt;/ul&gt;

&lt;div style="margin-left: 50px;"&gt;Copy the [&lt;a
 href="http://shared.zenfs.com/hadoop-vm-appliance-0-20-S.zip"&gt;Hadoop
0.20.S Virtual Machine&lt;/a&gt;] into a location on your hard drive.
&lt;BR&gt;It is a zipped vmware folder (hadoop-vm-appliance-0-20-S,
appriox ~400MB), which includes a few files: a .vmdk file that is a
snapshot of the virtual machine's hard drive, and a .vmx file that 
contains the configuration information to start the virtual machine.
After unzipping the vmware folder zip file, to start the virtual
machine, double-click on the hadoop-appliance-0.20.S.vmx file.&amp;nbsp; Note: Uncompressed Size of
hadoop-vm-appliance-0-20-S folder is ~2GB. Also, based on that data you upload
for testing,
VM disk is configured to grow up to 20GB).&lt;br&gt;
&lt;br&gt;
When you start
the virtual machine for the first time, VMware Player will recognize
that the virtual machine image is not in the same location it
used to be. You should inform VMware Player that you copied this
virtual machine image (choose "I copied it"). VMware Player will then generate new session
identifiers for this instance of the virtual machine. If you later move
the VM image to a different location on your own hard drive, you
should tell VMware Player that you have moved the image.&lt;br&gt;
&lt;br&gt;

After you select this option and click OK, the virtual machine
should begin booting normally. You will see it perform the standard
boot procedure for a Linux system. It will bind itself to an IP address
on an unused network segment, and then display a prompt allowing a user
to log in.&lt;br&gt;
&lt;br&gt;
&lt;span style="font-weight: bold;"&gt;Note: &lt;/span&gt;The IP address
displayed on the login screen can be used to connect to VM instance over SSH. The Login
screen also displays information about starting/stopping
Hadoop daemons, users/passwords, and how to shutdown the VM.&lt;br&gt;
&lt;br&gt;
&lt;span style="font-weight: bold;"&gt;Note: &lt;/span&gt;It is much more convenient to
access the VM via SSH. See &lt;a
 href="http://developer.yahoo.com/hadoop/tutorial/module3.html#vm-ssh"&gt;details&lt;/a&gt;
here.&lt;br&gt;
&lt;/div&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;big&gt;&lt;span style="color: rgb(255, 102, 0);"&gt;&lt;/span&gt;&lt;span
 style="font-weight: bold;"&gt;&lt;span style="color: rgb(255, 102, 0);"&gt;Virtual
Machine User
Accounts&lt;/span&gt;:&lt;/span&gt;&lt;/big&gt;&lt;/li&gt;

&lt;/ul&gt;
&lt;div style="margin-left: 40px;"&gt;The virtual machine comes pre-configured
with two user accounts:
"root" and&amp;nbsp; "hadoop-user". The hadoop-user account has sudo
permissions to perform system-management functions, such as shutting
down the virtual machine. The vast majority of your interaction with
the virtual machine will be as hadoop-user. To log in as hadoop-user,
first click inside the virtual machine's display. The virtual machine
will take control of your keyboard and mouse. To escape back into
Windows at any time, press CTRL+ALT at the same time. The hadoop-user
user's password is &lt;code&gt;hadoop&lt;/code&gt;. To log in as root, the password is &lt;code&gt;root&lt;/code&gt;.&lt;br&gt;
&lt;/div&gt;
&lt;ul style="font-weight: bold;"&gt;
  &lt;li&gt;&lt;big&gt;&lt;span style="color: rgb(255, 102, 0);"&gt;Hadoop Environment&lt;/span&gt;:&lt;/big&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;div style="margin-left: 40px;"&gt;
Linux&amp;nbsp;&amp;nbsp;&amp;nbsp; : Ubuntu 8.04&lt;br&gt;

&lt;/div&gt;
&lt;div style="margin-left: 40px;"&gt;Java&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;
: JRE 6 Update 7 (See License info @ /usr/jre16/)&lt;br&gt;
Hadoop : 0.20.S&amp;nbsp; (installed @ /usr/local/hadoop,&amp;nbsp;
/home/hadoop-user/hadoop is symlink to install directory)&lt;br&gt;
Pig&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; : 0.7.0 (pig jar is
installed @ /usr/local/pig,&amp;nbsp;
/home/hadoop-user/pig-tutorial/pig.jar&amp;nbsp; is&amp;nbsp; symlink to&amp;nbsp;

one in install directory)&lt;br&gt;
&lt;/div&gt;
&lt;br&gt;
&lt;div style="margin-left: 40px;"&gt;Login: hadoop-user, Passwd: hadoop
(sudo privileges are granted for
hadoop-user). The other usrers are hdfs and mapred (passwd: hadoop).&lt;br&gt;
&lt;br&gt;
Hadoop VM starts all the required hadoop and Kerberos daemons during
the boot-up process, but in case the user needs to stop/restart, &lt;br&gt;
&lt;ul&gt;
  &lt;li&gt;To start/stop/restart hadoop: login as hadoop-user and run 'sudo
/etc/init.d/hadoop [start | stop | restart]' ('sudo /etc/init.d/hadoop'
gives the usage)&lt;/li&gt;
  &lt;li&gt;To format the HDFS &amp;amp; clean all state/logs: login as
hadoop-user
and run 'sudo reinit-hadoop'&lt;/li&gt;

  &lt;li&gt;To start/stop/restart Kerberos KDC Server: login as hadoop-user
and
run 'sudo
/etc/init.d/krb5-kdc [start | stop | restart]'&lt;/li&gt;
  &lt;li&gt;To start/stop/restart Kerberos ADMIN Server: login as hadoop-user
and run 'sudo
/etc/init.d/krb5-admin-server [start | stop | restart]'&lt;br&gt;
  &lt;/li&gt;
&lt;/ul&gt;
To shut down the Virtual Machine: login as hadoop-user and run command 'sudo
poweroff'&lt;br&gt;
&lt;br&gt;
Environment for 'hadoop-user' (set in /home/hadoop-user/.profile)&lt;br&gt;
&amp;nbsp; $HADOOP_HOME=/usr/local/hadoop&lt;br&gt;

&amp;nbsp; $HADOOP_CONF_DIR=/usr/local/etc/hadoop-conf&lt;br&gt;
&amp;nbsp; $PATH=/usr/local/hadoop/bin:$PATH&lt;br&gt;
&lt;/div&gt;
&lt;ul style="font-weight: bold;"&gt;
  &lt;li&gt;&lt;big&gt;&lt;span style="color: rgb(255, 102, 0);"&gt;Running M/R Jobs&lt;/span&gt;:&lt;/big&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;div style="margin-left: 40px;"&gt;Running M/R jobs in Hadoop 0.20.S is
pretty much same as running
them in non-secure version of Hadoop. Except before running any Hadoop
Jobs or HDFS commands, the hadoop-user needs to get
the Kerberos authentication token using the command 'kinit'; the password
is
&lt;code&gt;hadoopYahoo1234&lt;/code&gt;.&lt;br&gt;

&lt;/div&gt;
&lt;br&gt;
&lt;div style="margin-left: 40px;"&gt;For example:&lt;br&gt;
hadoop-user@hadoop-desk:~$ cd hadoop&lt;br&gt;
hadoop-user@hadoop-desk:~$ kinit&lt;br&gt;
Password for hadoop-user@LOCALDOMAIN:&amp;nbsp; hadoopYahoo1234&lt;br&gt;
hadoop-user@hadoop-desk:~/hadoop$ bin/hadoop jar
hadoop-examples-0.20.104.1.1006042001.jar  pi 10 1000000&lt;br&gt;
&lt;/div&gt;
&lt;br&gt;
&lt;div style="margin-left: 40px;"&gt;For automated runs of hadoop jobs, a
keytab file is created under the hadoop-user's home directory (/home/hadoop-user/hadoop-user.keytab).
This will allow user to execute the "kinit" without having to manually enter the password. So
for automated runs of hadoop commands or M/R, Pig jobs through the cron
daemon, users can invoke the following command to get the Kerberos
ticket. Use command 'klist' to view the Kerberos ticket and its validity. &lt;br&gt;

&lt;/div&gt;
&lt;br&gt;
&lt;div style="margin-left: 40px;"&gt;For example:&lt;br&gt;
hadoop-user@hadoop-desk:~$ cd hadoop&lt;br&gt;
hadoop-user@hadoop-desk:~$ kinit -k -t
/home/hadoop-user/hadoop-user.keytab hadoop-user/localhost@LOCALDOMAIN&lt;br&gt;
hadoop-user@hadoop-desk:~/hadoop$ bin/hadoop jar
hadoop-examples-0.20.104.1.1006042001.jar pi 10 1000000&lt;br&gt;
&lt;/div&gt;
&lt;br&gt;
&lt;ul style="font-weight: bold;"&gt;
  &lt;li&gt;&lt;big&gt;&lt;span style="color: rgb(255, 102, 0);"&gt;Running Pig Tutorial&lt;/span&gt;:&lt;/big&gt;&lt;/li&gt;

&lt;/ul&gt;
&lt;div style="margin-left: 40px;"&gt;The Pig tutorial is installed at 
"/home/hadoop-user/pig-tutorial". Example commands to run the Pig
script are given in "example.run.cmd.sh". The Data needed for Pig scripts
are already copied to HDFS. See more details about the &lt;a
 href="http://hadoop.apache.org/pig/docs/r0.7.0/tutorial.html"&gt;Pig
Tutorial &lt;/a&gt;at Pig@Apache&lt;br&gt;
&lt;ul&gt;
  &lt;li&gt;hadoop-user@hadoop-desk:~$ cd pig-tutorial&lt;/li&gt;
  &lt;li&gt;hadoop-user@hadoop-desk:~$ sh example.run.cmd.sh&lt;/li&gt;
&lt;/ul&gt;

&lt;/div&gt;
&lt;ul style="font-weight: bold;"&gt;
  &lt;li&gt;&lt;big&gt;&lt;span style="color: rgb(255, 102, 0);"&gt;Shutting down the VM&lt;/span&gt;:&lt;/big&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;div style="margin-left: 40px;"&gt;When you are done with the virtual
machine, you can turn it off by logging in as the hadoop-user and running
the command 'sudo poweroff'. The virtual machine will shut itself down
in an orderly fashion and the window it runs in will disappear. &lt;br&gt;
&lt;/div&gt;
&lt;br/&gt;
&lt;p&gt;
Last but not least, I would like to thank Devaraj Das and Jianyong Dai from the Yahoo! Hadoop &amp;amp; Pig Develoment team for their help in setting up and configuring Hadoop 0.20.S and Pig respectively.
&lt;/p&gt;
&lt;p&gt;
&lt;em&gt;
&lt;span style="font-weight: bold;"&gt;Notice: &lt;/span&gt;
Yahoo! does not offer any support for the Hadoop Virtual Machine.&lt;BR&gt;
The software include cryptographic software that is subject to U.S. export control laws and applicable export and import laws of other countries. BEFORE using any software made available from this site, it is your responsibility to understand and comply with these laws. This software is being exported in accordance with the Export Administration Regulations. As of June 2009, you are prohibited from exporting and re-exporting this software to Cuba, Iran, North Korea, Sudan, Syria and any other countries specified by regulatory update to the U.S. export control laws and regulations. Diversion contrary to U.S. law is prohibited.
&lt;/em&gt;
&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style="overflow:hidden;background:#eee;padding-top:10px;margin:.5em 0;padding:15px 10px 5px 10px;border-top:1px solid #ccc;"&gt;&lt;img height="60" width="45" style="margin-top:-10px;float: left; margin-right: 6px;border:1px solid #999;background:#fff;padding:5px;" src="http://developer.yahoo.com/blogs/hadoop/gogate.jpg "alt="Suhas Gogate"&gt;
Suhas Gogate&lt;br/&gt;
Technical Yahoo!, Cloud Solutions Team, Yahoo!&lt;/p&gt;&lt;div style="clear:both"&gt;&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=xm8uFE1QUO4:ofwca13JcfA:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=xm8uFE1QUO4:ofwca13JcfA:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?i=xm8uFE1QUO4:ofwca13JcfA:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=xm8uFE1QUO4:ofwca13JcfA:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=xm8uFE1QUO4:ofwca13JcfA:PhkjNP4BSzk"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?d=PhkjNP4BSzk" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=xm8uFE1QUO4:ofwca13JcfA:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?i=xm8uFE1QUO4:ofwca13JcfA:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/YDNHadoop/~4/xm8uFE1QUO4" height="1" width="1"/&gt;</description>
         <link>http://feeds.developer.yahoo.net/~r/YDNHadoop/~3/xm8uFE1QUO4/hadoop_020s_virtualmachine.html</link>
         <guid isPermaLink="false">http://developer.yahoo.com/blogs/hadoop/2010/06/hadoop_020s_virtualmachine.html</guid>
                  <category domain="http://www.sixapart.com/ns/types#category">announcements</category>
        
                  <category domain="http://www.sixapart.com/ns/types#tag">hadoop</category>
                  <category domain="http://www.sixapart.com/ns/types#tag">pig</category>
                  <category domain="http://www.sixapart.com/ns/types#tag">sandbox</category>
                  <category domain="http://www.sixapart.com/ns/types#tag">vm appliance</category>
        
         <pubDate>Tue, 29 Jun 2010 06:00:00 -0800</pubDate>
      <feedburner:origLink>http://developer.yahoo.com/blogs/hadoop/2010/06/hadoop_020s_virtualmachine.html</feedburner:origLink></item>
            <item>
         <title>Managing Big Data: Architectural Approaches for making batch data available online</title>
         <description>&lt;p&gt;
This is the beginning of an ongoing series of blog posts on “Managing Big Data”.  This series will focus on techniques that Yahoo uses to process large volumes of data, ranging from initial collection of data to the end usage of that data.
&lt;/p&gt;
&lt;h3&gt;Introduction&lt;/h3&gt;
&lt;p&gt;
Over  the last several years there are two important trends that require additional thought when putting together an architecture for a hosted service.  At Yahoo!, the ability to analyze and process enormous amounts of data is increasingly important. It’s a foundational layer for improving our consumer experiences and for sharing audience insights with advertisers. 
&lt;/p&gt;
&lt;p&gt;
From a technology perspective, the two trends I'd like to focus on are:
&lt;/p&gt;
&lt;p&gt;
1. &lt;b&gt;Batch processing&lt;/b&gt; -- the increasing awareness of batch processing and the recent uptick   in use of the map/reduce paradigm for that purpose.
&lt;/p&gt;
&lt;p&gt;
2.  &lt;b&gt;NoSQL stores&lt;/b&gt; – The rise of so called "NoSQL" stores and their use to serve up data to online users (typically inside of the user's request/response cycle).
&lt;/p&gt;
&lt;p&gt;
Both of these trends represent significant advances in the way that hosted systems are developed.  But in order to derive the most value for an entire system, developers must think about how these two areas will work together in some holistic manner.
&lt;/p&gt;
Let's look at a specific scenario to make this more concrete:

&lt;h3&gt;Making batch data available to the online system&lt;/h3&gt;
&lt;p&gt;
&lt;img alt="Data Available Online" src="http://developer.yahoo.com/blogs/hadoop/dataavailableonline2.png" width="600" height="151" /&gt;

&lt;/p&gt; 
&lt;p&gt;
Let's assume for the moment that you're building a new e-commerce site.   And let's assume that one of the significant features of this e-commerce site is to provide user recommendations for which items a user may be interested in purchasing.  This overall feature will decompose into batch component (to determine the recommendations to give to a user) and an online component (that will present the recommendations to end user).
&lt;/p&gt;
&lt;p&gt;
For the batch system we may choose to use a map/reduce framework like &lt;a href="http://hadoop.apache.org/"&gt;Hadoop&lt;/a&gt; .  The batch recommendation component will rely on various types of raw event data.  Some examples include: the products the end user has viewed, the products they have purchased and types of searches the user has performed, etc.  We will create a model using this data; perhaps based on a simple behavior model (e.g. if the user looks at a significant amount of sports equipment than recommend products in the sports equipment category) or a collaborative filtering model (e.g. if other users purchased the same products as this end user, recommend other products that they purchased).
&lt;/p&gt;
&lt;p&gt;
Once we have decided on the data inputs and the model for making recommendations, we'll like produce the output of the batch processing as a set of recommendations for each user on a Hadoop cluster.  
&lt;/p&gt;
&lt;p&gt;
There are two approaches we can then make for making the batch data available online in a NoSQL store.
&lt;/p&gt;
&lt;p&gt;
&lt;p&gt;
&lt;h4&gt;1.Full updates&lt;/h4&gt;
&lt;/p&gt;
&lt;p&gt;
In this approach as part of the batch processing, we will recreate the set of recommendations for all users.  It also may be possible to create a native version of the online store format in the batch system.  This is generally only possible if there is no scenario where the data (user recommendations for products in this case) is updated in the online store.  There must also exist a library to write the into the online store's file format (not all NoSQL stores have such as library).   
&lt;/p&gt;
&lt;p&gt;
This approach is attractive for a couple of reasons:  
&lt;/p&gt;
&lt;p&gt;
&lt;ul&gt;
&lt;li&gt;There should not be any consistency issues between your offline and online representation of the data as you are always creating an entirely new copy of the online data at some interval (for instance once a day).&lt;/li&gt;
&lt;li&gt;This approach should have the least impact of your online store performance if you can perform a "swap" with new data set.  How this would work is that the newly created set of recommendations are pushed in the native NoSQL store format to the online stores physical boxes.  Once there each of the NoSQL stores are "upgraded" to start using the new copy of the recommendations and stop using the old version.&lt;/li&gt;
&lt;/ul&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;h4&gt;2. Incremental/Delta updates&lt;/h4&gt;
&lt;/p&gt;
&lt;p&gt;
Taking an incremental update approach between your offline and online store entails creating a new set incremental data changes in your batch system.  For instance, creating a new set of user recommendations for newly joined users or users that have had some recent activity (which would change the recommendations).  This incremental data must then be pushed to the online store.  
&lt;/p&gt;
&lt;p&gt;
This approach is attractive for a couple of reasons:  
&lt;/p&gt;
&lt;p&gt;
&lt;ul&gt;
&lt;li&gt;Latency.  By processing just the incremental updates of your recommendations is possible to update the online stores on a frequent basis.  For instance, it may be possible to run a batch job on Hadoop every 30 minutes that produces a new set of product recommendations.  These product recommendations can then be pushed to the online store at a 30 minute interval.  The full update approach to moving the offline data online will more likely have an update frequency of several hours or even once a day (due to the large amount of data that needs to be processed and transferred to the online stores).&lt;/li&gt;
&lt;li&gt;Size of updates.  Depending on the size of the data, it's also possible that incremental updates maybe the only viable solution.  For instance if the number of users we would like to create recommendations for is in the 10s of millions and the number of recommendations for each user is large, the data set may be too large to recalculate and push the entire set. &lt;/li&gt; 
&lt;/ul&gt;
&lt;/p&gt;
&lt;p&gt;
One weakness of the incremental update approach is that it can have a performance impact on your online store.  Therefore, when you apply updates to the online store you may need to consider some form of throttling of the updates as they are applied to the online store. 
&lt;/p&gt;
&lt;h3&gt;Key takeaways&lt;/h3&gt;
&lt;p&gt;
&lt;ul&gt;
&lt;li&gt;Consider the type of data you have and whether pushing full updates or delta updates is more appropriate for your type of data as this fundamentally affects the architecture for making your batch data available online. At Yahoo we use both approaches depending on the specific scenario. &lt;/li&gt;
&lt;li&gt;Throttling your updates to your online store is important consideration to maintain your online stores latency and availability.&lt;/li&gt;
&lt;/ul&gt;
&lt;/p&gt;
&lt;p&gt;
In a future post I'll also address taking data that's available in an online NoSQL store and making that data available in batch system like Hadoop.
&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style="overflow:hidden;background:#eee;padding-top:10px;margin:.5em 0;padding:15px 10px 5px 10px;border-top:1px solid #ccc;"&gt;&lt;img height="60" width="45" style="margin-top:-10px;float: left; margin-right: 6px;border:1px solid #999;background:#fff;padding:5px;" src="http://developer.yahoo.com/blogs/hadoop/dirkr.jpg" alt="Dirk Reinshagen"&gt;
Dirk Reinshagen&lt;br/&gt;
Cloud Architect&lt;BR&gt;
Cloud Computing at Yahoo!&lt;/p&gt;&lt;div style="clear:both"&gt;&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=hIIeoIwz8yM:PCwaCyxFxjU:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=hIIeoIwz8yM:PCwaCyxFxjU:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?i=hIIeoIwz8yM:PCwaCyxFxjU:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=hIIeoIwz8yM:PCwaCyxFxjU:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=hIIeoIwz8yM:PCwaCyxFxjU:PhkjNP4BSzk"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?d=PhkjNP4BSzk" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=hIIeoIwz8yM:PCwaCyxFxjU:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?i=hIIeoIwz8yM:PCwaCyxFxjU:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/YDNHadoop/~4/hIIeoIwz8yM" height="1" width="1"/&gt;</description>
         <link>http://feeds.developer.yahoo.net/~r/YDNHadoop/~3/hIIeoIwz8yM/managing_big_data_architectura.html</link>
         <guid isPermaLink="false">http://developer.yahoo.com/blogs/hadoop/2010/06/managing_big_data_architectura.html</guid>
        
        
         <pubDate>Thu, 24 Jun 2010 10:05:16 -0800</pubDate>
      <feedburner:origLink>http://developer.yahoo.com/blogs/hadoop/2010/06/managing_big_data_architectura.html</feedburner:origLink></item>
            <item>
         <title>Hadoop and the fight against shape-shifting spam</title>
         <description>&lt;p&gt;
At a recent &lt;a href="http://www.meetup.com/hadoop/"&gt;Hadoop User Group&lt;/a&gt; meeting, I made a presentation on &lt;a href="http://bit.ly/dpq67T"&gt;how we leverage hadoop for spam mitigation in Yahoo! Mail&lt;/a&gt;. A number of people followed up requesting additional details of our architecture and engineering strategy. 
&lt;BR&gt;
In this post, I am going to try and capture our antispam engineering story, how it came to be hadoop centric and how well the new architecture has worked. I will also highlight the results we have been able to achieve. Finally, I will provide an update on when we will be releasing these updates to wide production. 
&lt;/p&gt;
&lt;p&gt;
At the &lt;a href="http://www.slideshare.net/hadoopusergroup/mail-antispam?from=ss_embed"&gt;Hadoop User group presentation&lt;/a&gt;, I had delved into the details of two interesting antispam algorithms. The first was "frequent itemset mining", the second was what we called the "connected components" algorithm. Both these algorithms are implemented as part of our tools portfolio. They are used by engineers, product managers and operations analysts to get a compact summary of the major trends in spam. Both these tools were implemented as part of a new engineering strategy we put in place second quarter of last year. 
&lt;/p&gt;

&lt;p&gt;
Our new strategy called for pointed improvements in the ability of systems to digest massive amounts of data. The first portion of the strategy, implemented by the end of 2009, targeted our reporting systems, tool chains and existing abuse reputation algorithms. The proposal was to increase the granularity of the data being handled, increase the response time to detect an attack and do to do more early detection of spam attacks. In our analysis, we quickly realized that even the small changes we were proposing to our reporting systems, tools and algorithms required us to scale our existing systems well beyond the limits that they were meant to scale. 
&lt;/p&gt;
&lt;p&gt;
Also, we found that engineering, product and customer advocacy teams were all hungry for data and it would be great to support additional requirements around ad hoc joins across data streams and support a general "slice" and "dice" approach to data engineering. Our first revisions to the sender and content classifiers also made it abundantly clear that we needed massive storage and massive compute.
&lt;/p&gt;
&lt;p&gt;
We took the simple approach of putting ALL the data that we would possibly want to query or develop algorithms on, on a hadoop grid and let the grid scale to the storage and compute requirements. To give you some idea of the scale involved here, let me provide some ball park metrics. 
&lt;BR&gt;
By the end of this quarter, we will be loading close to 4TB of antispam data on our hadoop grids every day and we will be querying several days of data for report generation and running automated classifiers and algorithms at a frequency of a few minutes. We have not run into any scalability problems so far. In general we have found that with proper data organization, hadoop is able to scale linearly with data and compute requirements.
&lt;/p&gt;
&lt;p&gt;
I will complete this section by saying that this strategy has had a huge impact on spam complaints. See for you self; I am enclosing the graph of our spam complaints from last year. The big dip corresponds to when we shifted our reports, algorithms and filters to the hadoop grid. Need I say more?
&lt;/p&gt;
&lt;p align="center"&gt;&lt;img src="http://developer.yahoo.com/blogs/hadoop/mail_pic.png" width="500" height="300"/&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;
While making changes to how an existing system works is interesting and clearly the first step, the second and more interesting step is the development of brand new, distributed reputation algorithms using hadoop. Once again, our new engineering strategy called for the rewiring of all algorithms to run in parallel and increase the level of feature engineering across heuristic, statistical and machine learning systems. We needed to do this across reputation algorithms for IPs, domains, from addresses(senders), receivers(users) and content. Once again, we realized that much of the complexity was in massive data engineering. We needed to ensure that we used every bit of data that would help us make a spam/notspam decision. We also had to choose an appropriate model that could interpret this vast amount of data without getting overwhelmed.
&lt;/p&gt;
&lt;p&gt;
The term "massive feature engineering" should be familiar to people in the area of machine learning. In more common engineering terms, we needed to associate several pieces of meta data to every entity we needed to classify and we needed to choose algorithms that would parallelize well. We have been hard at work the last 5 months doing this new "hadoop engineering". By the end of this month, we hope to release our first hadoop based, massively feature engineered, distributed sender classification algorithm. Code named zeroB, our initial tests make this a very compelling replacement for our current sender management system. It is 25% more accurate while being faster and cheaper to run and maintain than the current version. 
&lt;/p&gt;
&lt;p&gt;
Indeed I have now come to believe that Hadoop has tremendous applicability to the abuse and security domains as a whole. Both these domains have the proverbial problem of finding the needle in the hay stack and hadoop is well equipped for this task. With the amount of spam that large mail systems like yahoo see, it is truly important to employ powerful frameworks like Hadoop to ensure the problem remains tractable.
&lt;/p&gt;
&lt;p&gt;
Yahoo mail was recently voted by the renowned &lt;a href="http://www.fraunhofer.de/en/
"&gt;fraunhaufer institute&lt;/a&gt; as the &lt;a href="http://www.ymailblog.com/blog/2010/05/it%E2%80%99s-official-no-one-fights-spam-harder-smarter-or-better-than-yahoo-mail/"&gt;best free mail service for spam management&lt;/a&gt;. This recognition clearly demonstrates that our new hadoop based strategy is working and working very well. This is just the tip of the success though. In the next 3 months, we are rolling out many of these new systems to our wide install base in the United States and I am eagerly waiting to see the effect this has on spam. It has been fun and exciting building these systems on top of hadoop but it has been even more exciting to see us winning the war on spam. 
&lt;/p&gt;
&lt;p&gt;
Join us next week for the &lt;a href="http://developer.yahoo.com/events/hadoopsummit2010/agenda.html#77"&gt;Yahoo! keynote&lt;/a&gt;  at &lt;a href="http://www.hadoopsummit.org"&gt;Hadoop Summit 2010&lt;/a&gt; to hear more about Hadoop and Mail Antispam.

&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style="overflow:hidden;background:#eee;padding-top:10px;margin:.5em 0;padding:15px 10px 5px 10px;border-top:1px solid #ccc;"&gt;&lt;img height="60" width="55" style="margin-top:-10px;float: left; margin-right: 6px;border:1px solid #999;background:#fff;padding:5px;" src="http://developer.yahoo.com/blogs/hadoop/vishr.jpg" alt="Vishwanath Ramarao"&gt;
Vishwanath Ramarao&lt;BR&gt;Director of Anti-Spam Engineering&lt;BR&gt;Yahoo! Mail&lt;/p&gt;&lt;div style="clear:both"&gt;&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=jE_tedTWxA0:aMFVEZ2Ej1o:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=jE_tedTWxA0:aMFVEZ2Ej1o:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?i=jE_tedTWxA0:aMFVEZ2Ej1o:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=jE_tedTWxA0:aMFVEZ2Ej1o:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=jE_tedTWxA0:aMFVEZ2Ej1o:PhkjNP4BSzk"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?d=PhkjNP4BSzk" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=jE_tedTWxA0:aMFVEZ2Ej1o:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?i=jE_tedTWxA0:aMFVEZ2Ej1o:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/YDNHadoop/~4/jE_tedTWxA0" height="1" width="1"/&gt;</description>
         <link>http://feeds.developer.yahoo.net/~r/YDNHadoop/~3/jE_tedTWxA0/hadoop_and_yahoo_maill_anti_sp.html</link>
         <guid isPermaLink="false">http://developer.yahoo.com/blogs/hadoop/2010/06/hadoop_and_yahoo_maill_anti_sp.html</guid>
        
        
         <pubDate>Tue, 15 Jun 2010 13:58:40 -0800</pubDate>
      <feedburner:origLink>http://developer.yahoo.com/blogs/hadoop/2010/06/hadoop_and_yahoo_maill_anti_sp.html</feedburner:origLink></item>
            <item>
         <title>Enabling Hadoop Batch Processing Systems to Consume Streaming Data</title>
         <description>&lt;p&gt;At Yahoo!, the ability to analyze and process enormous amounts of data is increasingly important. It’s a foundational layer for improving our consumer experiences and for sharing audience insights with advertisers. 
&lt;/p&gt;
&lt;p&gt;
In the last few years, I have been a part of a project to design, build, and run a low-latency, large-scale, distributed event data collection system at Yahoo!. When we started off, the goal seemed relatively unambitious, to collect web-access event data across all of the web-servers across all the data centers and bring it to a central location for processing. This perception soon changed after we realized that this involved around 20000 machines and over 20 data centers across the world amounting to over 40 billion events per day that helped fill-up over 10 TB of disk space. To add to the mix, the data had to be available within 15 minutes with an expected completeness of 99% across trans-oceanic fiber optic cable.
&lt;/p&gt;
&lt;p&gt;
We decided to collect the data in a streaming fashion. This enabled us to feed the data at very low latencies to stream processing applications. However, there was an existing batch processing application that required all the data for the entire day to be available with near 100% completeness.
&lt;/p&gt;
&lt;p&gt;
In order to achieve this, the data was collected in a streaming fashion and put into files that contained events belonging to that particular time period, the default being a minute and hence called minute files. Once the data was collected for the minute, the minute files were closed and the data was made available to the consuming application using a queue. Now, this worked reasonably well when there was one application but had problems when the consuming application wanted to reprocess or perform partial updates. In addition to this, the queue essentially kept the consuming application state making the collection and processing systems tightly coupled. This made is increasingly hard when it came to supporting multiple applications because the state for each batch for each application needed to be kept.
&lt;/p&gt;
&lt;p&gt;
This made it even more interesting when the Hadoop initiative at Yahoo! began. All batch processing application were now running on the grid while the data was still consumed by legacy applications. How would we feed the old and the new systems with the same data without duplication?
&lt;/p&gt;
&lt;p&gt;
What we needed was a loosely coupled or completely decoupled method of communicating the files to be processed to downstream batch processing applications.  The solution we came up with was a simple but elegant one called a List of Files (LoF) repository.
&lt;/p&gt;
&lt;p&gt;
The LoF repository contains an entry for each minute file collected and its associated attributes such as the start time, the end time, the size in bytes, the number of events, the collection pipeline instance name, and other relevant data. An API to access this data called the LoF API was provided to be able to query the repository for a set of files that satisfied certain attribute constraints. For example, a query might request “all the files that belong to the period 12:00 to 12:05 collected from the sports web servers”. The repository did not need to keep state of which files this application had processed or maintain any queue. This allowed the application to maintain its state and multiple applications were as simple as the single application case. To simplify the usage the API was made available in the form of a RESTful web-service.
&lt;/p&gt;
&lt;p align="center"&gt;
&lt;img src="http://farm3.static.flickr.com/2172/4515679033_224eb7c26e_o.png" width="720" height="540"/&gt;&lt;/a&gt;
&lt;/p&gt;
&lt;p&gt;
Different applications had different completeness requirements from the data collected. For example, a low-latency behavioral targeting application would typically be happy with 95% of the data within 1 minute of the data, while a revenue realization or tracking application would need 100% of the data within 15 minutes. In order to support this, the API returned a completeness metric along with the list of files returned to indicate the percentage of data the list represented. The application could use this information to commence processing based on its own completeness requirements.
&lt;/p&gt;
&lt;p&gt;
Given the distributed nature of the web-servers, data was often delayed or unavailable due to network outages or temporary host unavailability. This meant that applications requiring higher levels of completeness were routinely delayed beyond their SLAs. To solve this we provided a simple timestamp based cursor facility to enable incremental processing. The cursor was essentially returned with the list to indicate the timestamp at which the list was generated. The subsequent query would provide the previously returned cursor along with the subsequent query to indicate the time of the last fetch and the query would return all the files later that that timestamp.
&lt;/p&gt;
&lt;p&gt;
This is what the web-service request looks like:
&lt;/p&gt;
&lt;p style="overflow:hidden;background:#eee;padding-top:10px;margin:.5em 0;padding:5px 10px 5px 10px;"&gt;
The response to this is of the form: 
&lt;BR&gt;
collector1.yahoo.com  1234388520  1234 /col1/1200.gz&lt;br&gt;
collector2.yahoo.com  1234388520  3232 /col2/1200.gz 
&lt;p&gt;
A subsequent request to get incremental data would use the cursor timestamp returned to fetch additional files as follows:
&lt;/p&gt;
&lt;p style="overflow:hidden;background:#eee;padding-top:10px;margin:.5em 0;padding:5px 10px 5px 10px;"&gt;
which would get a response similar to:
&lt;BR&gt;
collector1.yahoo.com  1234388580  3232 /col1/1201.gz
&lt;/p&gt;
I would like to conclude by saying that the LoF API has enabled the same data to be made available to different application with varying completeness and latency requirements in a simple and elegant manner. Moreover, it has enabled the collection system, which uses a stream-based paradigm, to easily feed multiple largely batch-oriented systems in a relatively seamless manner. Keeping the design simple enabled us to solve a reasonably complex problem.
&lt;/p&gt;

&lt;p style="overflow:hidden;background:#eee;padding-top:10px;margin:.5em 0;padding:15px 10px 5px 10px;border-top:1px solid #ccc;"&gt;&lt;img height="50" width="50" style="margin-top:-10px;float: left; margin-right: 6px;border:1px solid #999;background:#fff;padding:5px;" src="http://developer.yahoo.com/blogs/hadoop/akon_dey.jpg" alt="Akon Dey"&gt;
Akon Dey&lt;BR&gt;
Architect,  Event Data Collection System at Yahoo!&lt;/p&gt;&lt;div style="clear:both"&gt;&lt;/div&gt;
&lt;/p&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=5GkFh2s2ZX4:V0IrCi60u34:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=5GkFh2s2ZX4:V0IrCi60u34:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?i=5GkFh2s2ZX4:V0IrCi60u34:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=5GkFh2s2ZX4:V0IrCi60u34:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=5GkFh2s2ZX4:V0IrCi60u34:PhkjNP4BSzk"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?d=PhkjNP4BSzk" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=5GkFh2s2ZX4:V0IrCi60u34:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?i=5GkFh2s2ZX4:V0IrCi60u34:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/YDNHadoop/~4/5GkFh2s2ZX4" height="1" width="1"/&gt;</description>
         <link>http://feeds.developer.yahoo.net/~r/YDNHadoop/~3/5GkFh2s2ZX4/enabling_hadoop_batch_processi_1.html</link>
         <guid isPermaLink="false">http://developer.yahoo.com/blogs/hadoop/2010/06/enabling_hadoop_batch_processi_1.html</guid>
                  <category domain="http://www.sixapart.com/ns/types#category">Developer Note</category>
        
        
         <pubDate>Wed, 09 Jun 2010 12:40:47 -0800</pubDate>
      <feedburner:origLink>http://developer.yahoo.com/blogs/hadoop/2010/06/enabling_hadoop_batch_processi_1.html</feedburner:origLink></item>
            <item>
         <title>Hadoop Summit 2010 - Agenda is available!</title>
         <description>&lt;p&gt; 
I’m happy to share the agenda for the upcoming &lt;a href="http://www.hadoopsummit.org/"&gt;Hadoop Summit&lt;/a&gt; – June 29th, Hyatt, Santa Clara.
&lt;/p&gt;
 &lt;p&gt;
We received over 70 great submissions for talks. It was a very impressive combination of development tool overviews, application case studies and innovative research. 
&lt;/p&gt;
&lt;p&gt;
We had the difficult task of selecting just a handful of presentations from this overwhelming collection of great quality abstracts and speakers. The variety of topics across numerous industry verticals, served as a clear evidence of how far this technology has evolved over the past year. Hadoop is really going mainstream!
&lt;/p&gt;
&lt;p&gt;
 Our goal was to create a diverse agenda that covers topics for experienced Hadoop users as well as people who recently began to explore this technology. We wanted to focus on the Hadoop eco-system of tools and solutions as well as “real life” users experience. 
&lt;/p&gt;
 &lt;p&gt;
I want to thank all the people who submitted talks and encourage speakers that were not selected to submit their great presentations to our monthly &lt;a href="http://www.meetup.com/hadoop/"&gt;Hadoop User Groups&lt;/a&gt;.
&lt;/p&gt; 
&lt;p style="overflow:hidden;background:#eee;padding-top:10px;margin:.5em 0;padding:5px 10px 5px 10px;"&gt;
&lt;a href="http://www.hadoopsummit.org/agenda.html" target="_blank"&gt;Detailed agenda and abstracts available at http://www.hadoopsummit.org/agenda.html&lt;/a&gt;
&lt;/p&gt;
&lt;p&gt;
At  Yahoo!, we are embracing Hadoop at the very core of our business. As you will hear at the summit, we are continuing to invest heavily in both the technology and the community to make it even better. We love being at the center of the discussion and debates around Hadoop and learning from other’s experiences.
&lt;/p&gt;
&lt;p&gt;
We are planning to go bigger next year, with a broader event that will allow more opportunities for speakers as well as sponsors.
&lt;/p&gt;
&lt;p&gt; 
We hope you all join us at the summit – if you have not registered yet, please &lt;a href="http://hadoopsummit2010.eventbrite.com/"&gt; REGISTER TODAY!&lt;/a&gt;.  Space is limited and we don’t want you to miss the opportunity to see the great variety of talks outlined above first-hand.
&lt;/p&gt; 
&lt;p&gt; 
Special thanks to our Sponsors:
&lt;ul&gt;
&lt;li&gt;Evening reception sponsor: &lt;a href="http://www.lightspeedvp.com/"&gt;LightSpeed Venture Partners&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Breakfast and breaks sponsor: &lt;a href="http://www.ibm.com/us/en/"&gt;IBM&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Lunch co-sponsors: &lt;a href="http://aws.amazon.com/elasticmapreduce/"&gt;Amazon Web Services&lt;/a&gt;, &lt;a href="http://www.cloudera.com/"&gt;Cloudera&lt;/a&gt;, &lt;a href="http://datameer.com/"&gt;Datameer&lt;/a&gt;, &lt;a href="http://www.karmasphere.com/"&gt;Karmasphere&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/p&gt;
 
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style="overflow:hidden;background:#eee;padding-top:10px;margin:.5em 0;padding:15px 10px 5px 10px;border-top:1px solid #ccc;"&gt;&lt;img height="60" width="45" style="margin-top:-10px;float: left; margin-right: 6px;border:1px solid #999;background:#fff;padding:5px;" src="http://developer.yahoo.com/blogs/hadoop/dekel.jpg" alt="Dekel Tankel"&gt;
Dekel Tankel&lt;br/&gt;
Director, Product Management&lt;BR&gt;
Cloud Computing at Yahoo!&lt;/p&gt;&lt;div style="clear:both"&gt;&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=GxOQDwM9S9E:gj7JPljIFCA:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=GxOQDwM9S9E:gj7JPljIFCA:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?i=GxOQDwM9S9E:gj7JPljIFCA:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=GxOQDwM9S9E:gj7JPljIFCA:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=GxOQDwM9S9E:gj7JPljIFCA:PhkjNP4BSzk"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?d=PhkjNP4BSzk" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=GxOQDwM9S9E:gj7JPljIFCA:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?i=GxOQDwM9S9E:gj7JPljIFCA:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/YDNHadoop/~4/GxOQDwM9S9E" height="1" width="1"/&gt;</description>
         <link>http://feeds.developer.yahoo.net/~r/YDNHadoop/~3/GxOQDwM9S9E/hadoop_summit_2010_agenda_is_a.html</link>
         <guid isPermaLink="false">http://developer.yahoo.com/blogs/hadoop/2010/05/hadoop_summit_2010_agenda_is_a.html</guid>
        
        
         <pubDate>Thu, 27 May 2010 12:49:17 -0800</pubDate>
      <feedburner:origLink>http://developer.yahoo.com/blogs/hadoop/2010/05/hadoop_summit_2010_agenda_is_a.html</feedburner:origLink></item>
            <item>
         <title>Pig, Cascalog &amp; HBase Among Highlights of May Hadoop Meet-Up</title>
         <description>&lt;p&gt;
Hi Hadoopers
&lt;/p&gt;
&lt;p&gt;
Thanks to close to 300 developers who came this week to Yahoo! for our monthly Hadoop User Group meeting. The energy in the packed room was phenomenal and conversations continued long after the formal sessions.
&lt;/p&gt;
&lt;p&gt;
&lt;img alt="&gt;Hundreds of Hadoop Fans Flock to Yahoo! for  the May Hadoop User Group" src="http://developer.yahoo.com/blogs/hadoop/pic1.jpg" width="425" height="355" /&gt;&lt;br&gt;
&lt;em&gt;Hundreds of Hadoop Fans Flock to Yahoo! for  the May Hadoop User Group&lt;/em&gt;
&lt;/p&gt;
&lt;p&gt;
A few lucky winners received  free tickets to the upcoming  &lt;a href="http://bit.ly/hadoopsummit"&gt;Hadoop Summit 2010&lt;/a&gt; (June 29th, at the Hyatt Regency, Santa Clara). Congratulations to those winners – everyone else please &lt;b&gt;&lt;a href="http://hadoopsummit2010.eventbrite.com/"&gt;register here&lt;/a&gt;&lt;/b&gt;
&lt;/p&gt;
&lt;p&gt;
The event started with Alan Gates from Yahoo! who described the new features and work done in &lt;a href="http://hadoop.apache.org/pig/"&gt;Pig&lt;/a&gt; 0.6 and 0.7 including the Hadoop’s compatibility plan, described in more details in &lt;a href="http://bit.ly/9yRDlH"&gt;this post&lt;/a&gt;.
&lt;/p&gt;
&lt;object id="__sse4180319" width="425" height="355"&gt;&lt;param name="movie" value="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=pig0607-100520145518-phpapp01&amp;stripped_title=pig-06-07" /&gt;&lt;param name="allowFullScreen" value="true"/&gt;&lt;param name="allowScriptAccess" value="always"/&gt;&lt;embed name="__sse4180319" src="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=pig0607-100520145518-phpapp01&amp;stripped_title=pig-06-07" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="425" height="355"&gt;&lt;/embed&gt;&lt;/object&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;object width="640" height="385"&gt;&lt;param name="movie" value="http://www.youtube.com/v/OSQCNtdCmZo&amp;hl=en_US&amp;fs=1&amp;"&gt;&lt;/param&gt;&lt;param name="allowFullScreen" value="true"&gt;&lt;/param&gt;&lt;param name="allowscriptaccess" value="always"&gt;&lt;/param&gt;&lt;embed src="http://www.youtube.com/v/OSQCNtdCmZo&amp;hl=en_US&amp;fs=1&amp;" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="640" height="385"&gt;&lt;/embed&gt;&lt;/object&gt;
&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;
&lt;a href="http://nathanmarz.com/blog/"&gt;Nathan Marz&lt;/a&gt; from &lt;a href="http://www.backtype.com/"&gt;BackType&lt;/a&gt; presented a cool demo of how easy it is to query existing data stores using &lt;a href="http://github.com/nathanmarz/cascalog"&gt;Cascalog&lt;/a&gt;, a query language for Hadoop. Nathan described how queries can be written as regular &lt;a href="http://clojure.org/"&gt;Clojure&lt;/a&gt; code and combined with &lt;a href="http://www.cascading.org/"&gt;Cascading&lt;/a&gt;. Be sure to watch the demo as part of the video below.
&lt;/p&gt;
&lt;p&gt;
&lt;object id="__sse4180321" width="425" height="355"&gt;&lt;param name="movie" value="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=cascalog-hug-ppt-100520145523-phpapp02&amp;stripped_title=hadoop-user-group-yahoo-extraordinarily-rapid-and-robust-data-analysis-with-cascalog-nathan-marz-backtype" /&gt;&lt;param name="allowFullScreen" value="true"/&gt;&lt;param name="allowScriptAccess" value="always"/&gt;&lt;embed name="__sse4180321" src="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=cascalog-hug-ppt-100520145523-phpapp02&amp;stripped_title=hadoop-user-group-yahoo-extraordinarily-rapid-and-robust-data-analysis-with-cascalog-nathan-marz-backtype" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="425" height="355"&gt;&lt;/embed&gt;&lt;/object&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;object width="640" height="385"&gt;&lt;param name="movie" value="http://www.youtube.com/v/pmtSlfX-JnA&amp;hl=en_US&amp;fs=1&amp;"&gt;&lt;/param&gt;&lt;param name="allowFullScreen" value="true"&gt;&lt;/param&gt;&lt;param name="allowscriptaccess" value="always"&gt;&lt;/param&gt;&lt;embed src="http://www.youtube.com/v/pmtSlfX-JnA&amp;hl=en_US&amp;fs=1&amp;" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="640" height="385"&gt;&lt;/embed&gt;&lt;/object&gt;
&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;
Next was Dmitriy Ryaboy, an engineer at &lt;a href="http://twitter.com/"&gt;Twitter&lt;/a&gt; and a &lt;a href="http://hadoop.apache.org/pig/"&gt;Pig&lt;/a&gt; committer. Dmitriy walked us through the extensive use of Hadoop eco-system at Twitter. He explained what are the challenges they face in processing 55 million tweets a day and why they chose to use Hadoop, Pig and &lt;a href="http://hadoop.apache.org/hbase/"&gt;HBase&lt;/a&gt;. Dmitriy introduced the &lt;a href="http://www.github.com/kevinweil/elephant-bird"&gt;Elephant Bird&lt;/a&gt; libraries and shared interesting tips for dealing with Big Data.
&lt;/p&gt;
&lt;p&gt;
&lt;object id="__sse4180890" width="425" height="355"&gt;&lt;param name="movie" value="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=pig-hbase-twitter-hug-05-2010-100520160540-phpapp01&amp;stripped_title=yahoo-hadoop-user-group-may-meetup-hbase-and-pig-the-hadoop-ecosystem-at-twitter-dmitriy-ryaboy-twitter" /&gt;&lt;param name="allowFullScreen" value="true"/&gt;&lt;param name="allowScriptAccess" value="always"/&gt;&lt;embed name="__sse4180890" src="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=pig-hbase-twitter-hug-05-2010-100520160540-phpapp01&amp;stripped_title=yahoo-hadoop-user-group-may-meetup-hbase-and-pig-the-hadoop-ecosystem-at-twitter-dmitriy-ryaboy-twitter" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="425" height="355"&gt;&lt;/embed&gt;&lt;/object&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;object width="640" height="385"&gt;&lt;param name="movie" value="http://www.youtube.com/v/ID4walexE28&amp;hl=en_US&amp;fs=1&amp;"&gt;&lt;/param&gt;&lt;param name="allowFullScreen" value="true"&gt;&lt;/param&gt;&lt;param name="allowscriptaccess" value="always"&gt;&lt;/param&gt;&lt;embed src="http://www.youtube.com/v/ID4walexE28&amp;hl=en_US&amp;fs=1&amp;" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="640" height="385"&gt;&lt;/embed&gt;&lt;/object&gt;
&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;
We concluded with Tom White from &lt;a href="http://www.cloudera.com/"&gt;Cloudera&lt;/a&gt; who walked us through the release plans for Apache Hadoop 0.21 including the Source Compatibility project described in the &lt;a href="http://bit.ly/9yRDlH"&gt;Yahoo! hadoop blog&lt;/a&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;object id="__sse4180320" width="425" height="355"&gt;&lt;param name="movie" value="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=apachehadoop21hug-100520145531-phpapp02&amp;stripped_title=apache-hadoop21hug" /&gt;&lt;param name="allowFullScreen" value="true"/&gt;&lt;param name="allowScriptAccess" value="always"/&gt;&lt;embed name="__sse4180320" src="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=apachehadoop21hug-100520145531-phpapp02&amp;stripped_title=apache-hadoop21hug" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="425" height="355"&gt;&lt;/embed&gt;&lt;/object&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;object width="640" height="385"&gt;&lt;param name="movie" value="http://www.youtube.com/v/vxEEMMv4ehM&amp;hl=en_US&amp;fs=1&amp;"&gt;&lt;/param&gt;&lt;param name="allowFullScreen" value="true"&gt;&lt;/param&gt;&lt;param name="allowscriptaccess" value="always"&gt;&lt;/param&gt;&lt;embed src="http://www.youtube.com/v/vxEEMMv4ehM&amp;hl=en_US&amp;fs=1&amp;" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="640" height="385"&gt;&lt;/embed&gt;&lt;/object&gt;
&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;
We at Yahoo! are embracing Hadoop – we share the challenges presented by Twitter for  processing massive data sets and continue to invest heavily in the technology and the community. We love to hear about the growing ecosystem and solutions like &lt;a href="http://github.com/nathanmarz/cascalog"&gt;Cascalog&lt;/a&gt;. 
&lt;/p&gt;
&lt;p&gt;
Please join us at the &lt;a href="http://bit.ly/hadoopsummit"&gt;Hadoop Summit &lt;/a&gt; to continue the conversation.
&lt;/p&gt;
&lt;p&gt;
As always, we are looking for exciting technologies and experiences you want to share.
Please contact me via the &lt;a href="http://www.meetup.com/hadoop/"&gt;Hadoop Bay Area User Group Meetup page&lt;/a&gt;.
&lt;/p&gt;
&lt;p&gt;
Note that we will not have a meetup in June due to the &lt;a href="http://bit.ly/hadoopsummit"&gt;Hadoop Summit &lt;/a&gt;. See you all on &lt;b&gt;July 21st, 2010&lt;/b&gt;. Registration is available &lt;a href="http://www.meetup.com/hadoop/calendar/13546804/"&gt;here&lt;/a&gt;, agenda will be published soon&lt;/p&gt;

&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style="overflow:hidden;background:#eee;padding-top:10px;margin:.5em 0;padding:15px 10px 5px 10px;border-top:1px solid #ccc;"&gt;&lt;img height="60" width="45" style="margin-top:-10px;float: left; margin-right: 6px;border:1px solid #999;background:#fff;padding:5px;" src="http://developer.yahoo.com/blogs/hadoop/dekel.jpg" alt="Dekel Tankel"&gt;
Dekel Tankel&lt;br/&gt;
Director, Product Management&lt;BR&gt;
Cloud Computing at Yahoo!&lt;/p&gt;&lt;div style="clear:both"&gt;&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=iLGZwkyGLm8:HIQ2h9BqvpQ:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=iLGZwkyGLm8:HIQ2h9BqvpQ:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?i=iLGZwkyGLm8:HIQ2h9BqvpQ:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=iLGZwkyGLm8:HIQ2h9BqvpQ:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=iLGZwkyGLm8:HIQ2h9BqvpQ:PhkjNP4BSzk"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?d=PhkjNP4BSzk" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=iLGZwkyGLm8:HIQ2h9BqvpQ:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?i=iLGZwkyGLm8:HIQ2h9BqvpQ:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/YDNHadoop/~4/iLGZwkyGLm8" height="1" width="1"/&gt;</description>
         <link>http://feeds.developer.yahoo.net/~r/YDNHadoop/~3/iLGZwkyGLm8/sss.html</link>
         <guid isPermaLink="false">http://developer.yahoo.com/blogs/hadoop/2010/05/sss.html</guid>
                  <category domain="http://www.sixapart.com/ns/types#category">announcements</category>
        
        
         <pubDate>Fri, 21 May 2010 18:20:00 -0800</pubDate>
      <feedburner:origLink>http://developer.yahoo.com/blogs/hadoop/2010/05/sss.html</feedburner:origLink></item>
            <item>
         <title>Towards Enterprise-Class Compatibility for Apache Hadoop</title>
         <description>&lt;p&gt;
At Yahoo!, the first users of Apache Hadoop were researchers developing new algorithms or manually shifting through huge data sets.  These users threw away most of their code after a few weeks or months, and the little code they carried forward was not subject to rigorous quality procedures.  Thus, these early users cared more about new features and scalability improvements in Hadoop than they did about backward compatibility.
&lt;/p&gt;
&lt;p&gt;
This early focus on bigger-and-better helped Hadoop become the powerful platform it is today.  However, over the years, both inside and outside of Yahoo!, Hadoop is increasingly being used to run large, long-lived, enterprise-class applications.  Porting these applications to non-compatible upgrades of Hadoop is an arduous, expensive task that distracts teams from finding new and better ways of using Hadoop to bring value to their companies.  Today, Hadoop users are demanding backwards compatibility and interface stability; these features are necessary for the next growth phase of Hadoop, as it gains wider enterprise adoption.
&lt;/p&gt;
&lt;h3&gt;Interface Classification&lt;/h3&gt;
&lt;p&gt;
Over the last year, as part of our plan to provide stronger backward compatibility, we have tagged interfaces in Hadoop to denote their compatibility contract for future releases. An interface can be a Java API, a configuration variable, the parameters or output of a command, metrics variables, and so on. Java APIs are tagged using Java Annotations; other types of interfaces, such as configuration options and output formats, are tagged using informal documentation conventions.  The upcoming release 0.21 of Hadoop will be the first to expose this classification.
&lt;/p&gt;
&lt;p&gt;
The classification system is derived from &lt;a href="http://www.opensolaris.org/os/community/arc/policies/interface-taxonomy/#Advice"&gt;OpenSolaris&lt;/a&gt;  and from our own internal system at Yahoo.  The system distinguishes two important aspects of an interface from the perspective of backward compatibility, the audience of the interface, and the stability of the interface:
&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
The &lt;b&gt;audience&lt;/b&gt; (or scope or visibility) of an interface denotes the potential customers of the interface. In addition to the more obvious public and private designations, the audience taxonomy also includes a &lt;em&gt;limited-private&lt;/em&gt; category for hooks exposed to peer frameworks or systems.
&lt;/li&gt;
&lt;li&gt;
The &lt;b&gt;stability&lt;/b&gt; of an interface denotes when changes can be made to the interface that break compatibility.  Again, a binary choice between &lt;em&gt;stable&lt;/em&gt; (guaranteed not to change incompatibly) and &lt;em&gt;unstable&lt;/em&gt; (guaranteed to change) is not sufficient.  Interfaces may also be marked as evolving, intended for use by early adopters validating their suitability. An &lt;em&gt;evolving&lt;/em&gt; interface is marked as &lt;em&gt;public&lt;/em&gt; only after it’s  been used internally in the system and is close to being stable. These dimensions allow early adopters to gauge the risk in relying on a new interface.
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;
The following chart should help you understand what this tagging system means to you:
&lt;/p&gt;
&lt;center&gt;
    &lt;table border=0 cellspacing=0 cellpadding=0&gt;
      &lt;tr&gt;&lt;td&gt;
&lt;img alt="What this tagging system means to you" src="http://developer.yahoo.com/blogs/hadoop/Sanjaytable1.GIF" width="400" height="90" /&gt;

      &lt;/td&gt;&lt;/tr&gt;
    &lt;/table&gt;
  &lt;/center&gt;
&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
If you are an &lt;b&gt;application developer&lt;/b&gt;, stick to &lt;em&gt;public-stable&lt;/em&gt; interfaces. If you are early adopter, you may use a &lt;em&gt;public-evolving&lt;/em&gt; interface, but be aware that the interface may change slightly in the future, forcing a change to your application.
&lt;/li&gt;
&lt;li&gt;
If you are a &lt;b&gt;framework developer&lt;/b&gt; on Hadoop, you can of course safely use any of the public interfaces, but can also use &lt;em&gt;limited-private&lt;/em&gt; interfaces targeted to your framework. For example the Hadoop RPC layer provides &lt;em&gt;limited-private&lt;/em&gt; interfaces for HDFS and MapReduce.
&lt;/li&gt;
&lt;li&gt;
Stay away from the private interfaces unless you are an &lt;b&gt;implementer&lt;/b&gt; of the actual sub-component.  Pay attention to any private interfaces that are stable.  While the vast majority of private interfaces are marked as unstable, a few are marked as stable to warn that breaking their compatibility should be done after only serious consideration (and community discussion). For example the internal HDFS and MapReduce protocols are marked stable to support rolling-upgrades (a feature we would like to add to Hadoop in the near future).
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;
&lt;p&gt;
More details are provided in the &lt;a href="https://issues.apache.org/jira/secure/attachment/12445002/api_classification.pdf"&gt;Apache Hadoop API classification document&lt;/a&gt;.
&lt;/p&gt;
&lt;h3&gt;The Larger Plan for Compatibility&lt;/h3&gt;
&lt;p&gt;
This classification system for interfaces is part of a larger, multi-step plan for backwards compatibility, which has been in work for the last two years (&lt;a href="https://issues.apache.org/jira/browse/HADOOP-5071"&gt;HADOOP-5071&lt;/a&gt;).
&lt;/p&gt;
&lt;p&gt;
The first step was to clean up the package structure of the Hadoop source base (&lt;a href="https://issues.apache.org/jira/browse/HADOOP-2884"&gt;HADOOP-2884&lt;/a&gt;). While the high level structure was fine, finer-grained packages better reflect the abstractions of the underlying system architecture. For example, the HDFS package structure did not distinguish the internal abstraction of block storage and namespace, and there were several cases of layer violations within and across packages.  In cleaning-up the package structure, we not only provided a better foundation for future work, but also started the process of identifying and separating public interfaces from the private and limited-private ones. We also changed the documentation to reflect this separation.
&lt;/p&gt;
&lt;p&gt;
At the same time, key Hadoop interfaces were redesigned to allow them to evolve compatibly without limiting innovation in the framework. These key interfaces, which also happened to be where Hadoop has faced its greatest compatibility issues, were the APIs to the HDFS file system (&lt;a href="https://issues.apache.org/jira/browse/HADOOP-4952"&gt;HADOOP-4952&lt;/a&gt;) ) and the MapReduce framework (&lt;a href="https://issues.apache.org/jira/browse/HADOOP-1230"&gt;HADOOP-1230&lt;/a&gt;).We will discuss the redesign of these interfaces in a future blog.
&lt;/p&gt;
&lt;p&gt;
Finally, we have also started to address the wire compatibility problem, i.e. the compatibility of Hadoop components across RPC boundaries and versions.  Hadoop, surprisingly, does not offer wire compatibility.  To address this limitation, Yahoo started the &lt;a href="http://avro.apache.org/"&gt;Avro project&lt;/a&gt; (led by Doug Cutting), which was open-sourced in Apache last year and will be incorporated into Hadoop’s RPC system over time..
&lt;/p&gt;
&lt;h3&gt;Acknowledgments&lt;/h3&gt;
&lt;p&gt;
The source re-factoring was done by Sanjay Radia and Raghu Angadi. The interface classification project was initiated by Sanjay Radia and derived mostly from OpenSolaris’ classification scheme.  Jakob Homan designed the use of annotations to tag the interfaces.  Suresh Srinivas, Tom White, Arun Murthy, Owen O'Malley, and Sanjay Radia defined the annotations for the various parts of the Hadoop system.  Tom White helped drive the labeling effort to its conclusion in 0.21, particularly its integration with Hadoop’s javadoc. Doug Cutting was the lead for the Avro project.
&lt;/p&gt;

&lt;p style="overflow:hidden;background:#eee;padding-top:10px;margin:.5em 0;padding:15px 10px 5px 10px;border-top:1px solid #ccc;"&gt;&lt;img height="50" width="50" style="margin-top:-10px;float: left; margin-right: 6px;border:1px solid #999;background:#fff;padding:5px;" src="http://developer.yahoo.com/blogs/hadoop/sradia.jpg" alt="Sanjay Radia"&gt;
Sanjay Radia&lt;BR&gt;
Hadoop Team, Yahoo!&lt;/p&gt;&lt;div style="clear:both"&gt;&lt;/div&gt;
&lt;/p&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=W8bhXlxHJuQ:dHTo0vcE7zA:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=W8bhXlxHJuQ:dHTo0vcE7zA:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?i=W8bhXlxHJuQ:dHTo0vcE7zA:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=W8bhXlxHJuQ:dHTo0vcE7zA:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=W8bhXlxHJuQ:dHTo0vcE7zA:PhkjNP4BSzk"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?d=PhkjNP4BSzk" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=W8bhXlxHJuQ:dHTo0vcE7zA:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?i=W8bhXlxHJuQ:dHTo0vcE7zA:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/YDNHadoop/~4/W8bhXlxHJuQ" height="1" width="1"/&gt;</description>
         <link>http://feeds.developer.yahoo.net/~r/YDNHadoop/~3/W8bhXlxHJuQ/towards_enterpriseclass_compat.html</link>
         <guid isPermaLink="false">http://developer.yahoo.com/blogs/hadoop/2010/05/towards_enterpriseclass_compat.html</guid>
                  <category domain="http://www.sixapart.com/ns/types#category">Developer Note</category>
        
        
         <pubDate>Wed, 19 May 2010 11:13:24 -0800</pubDate>
      <feedburner:origLink>http://developer.yahoo.com/blogs/hadoop/2010/05/towards_enterpriseclass_compat.html</feedburner:origLink></item>
            <item>
         <title>Scalability of the Hadoop Distributed File System</title>
         <description>&lt;p&gt;In his fictional story &lt;a href="http://jubal.westnet.com/hyperdiscordia/library_of_babel.html"&gt;"The Library of Babel"&lt;/a&gt;, &lt;a href="http://en.wikipedia.org/wiki/Jorge_Luis_Borges"&gt;Jorge Luis Borges&lt;/a&gt; describes a vast storage universe composed of all possible manuscripts uniformly formatted as 410-page books. Most are random meaningless sequences of symbols. But the rest excitingly forms a complete and an indestructible knowledge system, which stores any text written in the past or to be written in the future, thus providing solutions to all problems in the world. Just find the right book.&lt;/p&gt;

&lt;p&gt;The same characteristic fascinates us in modern storage growth: The aggregation of information directly leads to proportional growth of new knowledge discovered out of it. A skeptic may doubt that further reward in knowledge mining will not justify the effort in information aggregation. What if by building sophisticated storage systems we are chasing the 19th century’s &lt;a href="http://www.uctc.net/access/30/Access%2030%20-%2002%20-%20Horse%20Power.pdf"&gt;horse manure problem&lt;/a&gt;, when at the dawn of the automobile era the scientific world was preoccupied with the growth of the horse population that threatened to bury the streets of London nine feet deep in manure? &lt;/p&gt;

&lt;p&gt;The historical judgment will turn one way or another. In the meantime, we want to know how far we can go with existing systems even if only out of pure curiosity. 
In my  &lt;em&gt;&lt;a href="http://www.usenix.org/publications/login/2010-04/index.html"&gt;USENIX ;login:&lt;/a&gt;&lt;/em&gt; article &lt;a href="http://www.usenix.org/publications/login/2010-04/openpdfs/shvachko.pdf"&gt;“HDFS Scalability: the Limits to Growth,”&lt;/a&gt; I studied scalability and performance limitations imposed on a distributed file system by the single-node namespace server architecture. The study is based on experience with largest deployments of the &lt;a href="http://hadoop.apache.org/hdfs/index.html"&gt;Hadoop Distributed File System (HDFS)&lt;/a&gt; currently in production at Yahoo!.&lt;/p&gt;

&lt;p&gt;The first part of the study focuses on how the single name-server architecture limits the number of namespace objects (files and blocks) and how this translates to the limitation of the physical storage capacity growth.&lt;/p&gt;

&lt;p&gt;The second part explores the limits for linear performance growth of HDFS clusters bound by natural performance restrictions of a single namespace server. As the cluster grows, the linear increase in the demand for processing resources puts a higher workload on the single namespace server. When the workload exceeds a certain threshold,  it saturates the server, turning it into a single-node bottleneck for linear scaling of the entire cluster.&lt;/p&gt;

&lt;p&gt;In 2006, the Hadoop group at Yahoo! formulated long-term target &lt;a href="http://wiki.apache.org/hadoop/DFS_requirements"&gt;requirements&lt;/a&gt; for HDFS and outlined a list of projects intended to bring the requirements to life.
Table 1 summarizes the targets and compares them with the current achievements:&lt;/p&gt;

&lt;center&gt;
    &lt;table border=1 cellspacing=1 cellpadding=4&gt;
      &lt;tr&gt;
         &lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;Target&lt;/td&gt;&lt;td&gt;Deployed&lt;/td&gt; 
      &lt;/tr&gt;
      &lt;tr&gt;
         &lt;td&gt;Capacity&lt;/td&gt;&lt;td&gt;10 PB&lt;/td&gt;&lt;td&gt;14 PB&lt;/td&gt; 
      &lt;/tr&gt;
      &lt;tr&gt;
         &lt;td&gt;Nodes&lt;/td&gt;&lt;td&gt;10,000&lt;/td&gt;&lt;td&gt;4,000&lt;/td&gt; 
      &lt;/tr&gt;
      &lt;tr&gt;
         &lt;td&gt;Clients&lt;/td&gt;&lt;td&gt;100,000&lt;/td&gt;&lt;td&gt;15,000&lt;/td&gt; 
      &lt;/tr&gt;
      &lt;tr&gt;
         &lt;td&gt;Files&lt;/td&gt;&lt;td&gt;100,000,000&lt;/td&gt;&lt;td&gt;60,000,000&lt;/td&gt; 
      &lt;/tr&gt;
    &lt;/table&gt;

&lt;p&gt; &lt;b&gt;Table 1: Targets for HDFS vs. actual deployed values as of 2009&lt;/b&gt;&lt;/p&gt;
 &lt;/center&gt;

&lt;p&gt;The bottom line is that we achieved the target in Petabytes and got close to the target in the number of files. But this is done with a smaller number of nodes and the need to support a workload close to 100,000 clients has not yet materialized. The question now is whether the goals are feasible with the current system architecture.&lt;/p&gt;

&lt;h3&gt;Namespace Limitations&lt;/h3&gt;

&lt;p&gt;HDFS is based on an architecture where the namespace is decoupled from the data. The namespace forms the file system metadata, which is maintained by a dedicated server called the &lt;em&gt;name-node&lt;/em&gt;. The data itself resides on other servers called &lt;em&gt;data-nodes&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The namespace consists of files and directories. Files are divided into large (128 MB) blocks. To provide data reliability, HDFS uses block replication. Each block by default is replicated to three data-nodes. Once the block is created its replication is maintained by the system automatically. The block copies are called &lt;em&gt;replicas&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The name-node keeps the entire namespace in RAM. This architecture has a natural limiting factor: the memory size; that is, the number of namespace objects (files and blocks) the single namespace server can handle.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://issues.apache.org/jira/browse/HADOOP-1687"&gt;Estimates&lt;/a&gt; show that the name-node uses less than 200 bytes to store a single metadata object (a file inode or a block). According to statistics on Y! clusters, a file on average consists of 1.5 blocks. Which means that it takes 600 bytes (1 file object + 2 block objects) to store an average file in name-node’s RAM&lt;/p&gt;

&lt;p style="overflow:hidden;background:#eee;padding-top:10px;margin:.5em 0;padding:5px 10px 5px 10px;"&gt;To store 100 million files (referencing 200 million blocks), a name-node should have at least 60 GB of RAM.&lt;/p&gt;

&lt;h3&gt;Storage Capacity vs. Namespace Size&lt;/h3&gt;

&lt;p&gt;With 100 million files, each having an average of 1.5 blocks, we will have 200 million blocks in the file system. If the maximal block size is 128 MB and every block is replicated 3 times, then the total disk space required to store these blocks is about 60 PB.&lt;/p&gt;

&lt;p style="overflow:hidden;background:#eee;padding-top:10px;margin:.5em 0;padding:5px 10px 5px 10px;"&gt;100 million file namespace needs 60 PB of total storage capacity on the cluster&lt;/p&gt;
&lt;p&gt;
As a rule of thumb, the correlation between the representation of the metadata in RAM and physical storage space required to store data referenced by this namespace is:
&lt;/p&gt;
&lt;p style="overflow:hidden;background:#eee;padding-top:10px;margin:.5em 0;padding:5px 10px 5px 10px;"&gt;
 1 GB metadata ≈ 1 PB physical storage
&lt;/p&gt;

&lt;p&gt;The rule should not be treated the same as, say, the &lt;a href="http://mathworld.wolfram.com/PythagoreanTheorem.html"&gt;Pythagorean Theorem&lt;/a&gt;, because the correlation depends on cluster parameters (the block to file ratio and the block size). But it can be used as a practical estimate for configuring cluster resources.&lt;/p&gt;

&lt;h3&gt;Cluster Size and Node Configuration&lt;/h3&gt;

&lt;p&gt;Next we can estimate the number of data-nodes the cluster should have to accommodate the namespace of a certain size. On Yahoo’s clusters, data-nodes are usually equipped with four disk drives of size 0.75 – 1 TB, and configured to use 2.5 – 3.5 TB of that space per node. The remaining space is allocated for MapReduce transient data, system logs, and the OS.&lt;/p&gt;

&lt;p&gt;If we assume that an average data-node capacity is 3 TB, then we will need on the order of 20,000 nodes to store 60 PB of data. To keep it consistent with the target requirement of 10,000 nodes, each data-node should be configured with eight 1TB hard drives.&lt;/p&gt;

&lt;p style="overflow:hidden;background:#eee;padding-top:10px;margin:.5em 0;padding:5px 10px 5px 10px;"&gt;To accommodate data referenced by a 100 million file namespace, a HDFS cluster needs 10,000 nodes equipped with 8TB of total hard drive capacity per node&lt;/p&gt;

&lt;p&gt;Note that these estimates are true under the assumption that the block per file ratio of 1.5 and the block size remain the same. If the ratio or the block size increases, a gigabyte of RAM will support more petabytes of physical storage and vice versa. Sadly, based on practical observations, the block to file ratio tends to decrease during the lifetime of a file system, meaning that the object count (and therefore the memory footprint) of a single namespace server grows faster than the physical data storage. That makes the object-count problem &amp;mdash; which becomes file-count problem when the average file to block ratio is close to 1 &amp;mdash; the real bottleneck for cluster scalability.&lt;/p&gt;

&lt;h3&gt;The Internal Load&lt;/h3&gt;

&lt;p&gt;Apart from the hierarchical namespace the name-node’s metadata includes a list of registered data-nodes, and a block to data-node mapping, which determines physical block locations.&lt;/p&gt;

&lt;p&gt;A data-node identifies block replicas in its possession to the name-node by sending block reports. The first block report is sent at the startup. It reveals the block locations, which otherwise are not known to the name-node at startup time. Subsequently, block reports are sent periodically every 1 hour by default and serve as a sanity check providing that the name-node has an up-to-date view of block replica distribution on the cluster.&lt;/p&gt;

&lt;p&gt;During normal operation, data-nodes periodically send heartbeats to the name-node to indicate that the data-node is alive. The default heartbeat interval is 3 seconds. If the name-node does not receive a heartbeat from a data-node in 10 minutes, it pronounces the data-node dead and schedules its blocks for replication on other nodes.&lt;/p&gt;

&lt;p&gt;The block reports and heartbeats form the internal load of the cluster. If the internal load is too high, the cluster becomes dysfunctional, able to process only a few, if any, external client operations such as ls, read, or write. The internal load depends on the number of data-nodes. Assuming that the cluster is built of 10,000 data-nodes having 8 hard drives with 6 TB of effective storage capacity each, we estimate that the name-node will need to handle
&lt;ul&gt;&lt;li class="bullist"&gt;3 block reports per second, each reporting 60,000 replicas&lt;/li&gt;&lt;li class="bullist"&gt;10,000 heartbeats per second&lt;/li&gt;&lt;/ul&gt;&lt;/p&gt;

&lt;p&gt;Using the standard HDFS benchmark called NNThroughputBenchmark, we measure the actual name-node performance with respect to the two internal operations. Table 2 summarizes the results. Note that the block report throughput is measured in the number of blocks processed by the name-node per second.&lt;/p&gt;

&lt;center&gt;
    &lt;table border=1 cellspacing=1 cellpadding=4&gt;
      &lt;tr&gt;
         &lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;Throughput&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
         &lt;td&gt;Number of blocks processed in block reports per second&lt;/td&gt;&lt;td&gt;639,713&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
         &lt;td&gt;Number of heartbeats per second&lt;/td&gt;&lt;td&gt;300,000&lt;/td&gt;
      &lt;/tr&gt;
    &lt;/table&gt;

&lt;p&gt;&lt;b&gt;Table 2: Block report and heartbeat throughput&lt;/b&gt;&lt;/p&gt;
&lt;/center&gt;

&lt;p&gt;The implication of these results is:&lt;/p&gt;

&lt;p style="overflow:hidden;background:#eee;padding-top:10px;margin:.5em 0;padding:5px 10px 5px 10px;"&gt;The internal load for block reports and heartbeat processing on a 10,000 node HDFS cluster with the total storage capacity of 60 PB will consume 30% of the total name-node processing capacity.&lt;/p&gt;

&lt;p&gt;The internal load is proportional to the number of nodes in the cluster and the average number of blocks on a node. Thus, if a node had only 30,000 blocks &amp;mdash; half of the estimated amount &amp;mdash; then the name-node will dedicate only 15% of its processing resources to the internal load. Vice versa, if the average number of blocks per node grows, then the internal load will grow proportionally. The latter means that the decrease in block to file ratio (more small files with the same file system size) increases the internal load and therefore negatively affects the performance of the system.&lt;/p&gt;

&lt;h3&gt;Reasonable Load Expectations&lt;/h3&gt;

&lt;p&gt;We have learned by now that the name-node can use 70% of its time to process external client requests. Even with a handful of clients one can saturate the name-node performance by letting the clients send requests to the name-node with very high frequency. The name-node most probably would become unresponsive, potentially sending the whole cluster into a tailspin because internal load requests do not have priority over regular client requests.&lt;/p&gt;

&lt;p&gt;In practice, the extreme load bursts are uncommon. Regular Hadoop clusters run MapReduce jobs, and jobs perform conventional file reads or writes. To get or put data from or to HDFS, a client first accesses the name-node and receives block locations, and then directly talks to data-nodes to transfer file data. Thus the frequency of name-node requests is bound by the rate of data transfer from data-nodes.&lt;/p&gt;

&lt;p&gt;Typically a map task of a MapReduce jobs reads one block of data. We estimate how much time it takes for a client to retrieve a block replica and, based on that, evaluate the expected read load on the name-node  &amp;mdash; namely, how many “get block location” requests the name-node should expect per second from 100,000 clients. We apply the same technique to evaluate the write load on the cluster.&lt;/p&gt;

&lt;h3&gt;Performance Limitations&lt;/h3&gt;

&lt;p&gt;The read and write throughputs have been &lt;a href="http://developer.yahoo.net/blogs/hadoop/2008/09/scaling_hadoop_to_4000_nodes_a.html"&gt;measured&lt;/a&gt; by the DFSIO benchmark and are summarized in Table 3.&lt;/p&gt;

&lt;center&gt;
    &lt;table border=1 cellspacing=1 cellpadding=4&gt;
      &lt;tr&gt;
         &lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;Throughput&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
         &lt;td&gt;Average read throughput &lt;/td&gt;&lt;td&gt;66 MB/s&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
         &lt;td&gt;Average write throughput&lt;/td&gt;&lt;td&gt;40 MB/s&lt;/td&gt;
      &lt;/tr&gt;
    &lt;/table&gt;
&lt;p&gt;&lt;b&gt;Table 3: HDFS read and write throughput&lt;/b&gt;&lt;/p&gt;
&lt;/center&gt;

&lt;p&gt;Another series of throughput results produced by NNThroughputBenchmark (Table 4) measures the number of open (the same as get block location) and create operations processed by the name-node per second:&lt;/p&gt;

&lt;center&gt;
    &lt;table border=1 cellspacing=1 cellpadding=4&gt;
      &lt;tr&gt;
         &lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;Throughput&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
         &lt;td&gt;Get block locations &lt;/td&gt;&lt;td&gt;126,119 ops/s&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
         &lt;td&gt;Create new block&lt;/td&gt;&lt;td&gt;5,600 ops/s&lt;/td&gt;
      &lt;/tr&gt;
    &lt;/table&gt;
&lt;p&gt; &lt;b&gt;Table 4: Open and create throughput&lt;/b&gt;&lt;/p&gt;
&lt;/center&gt;

&lt;p&gt;The read throughput in Table 3 indicates that 100,000 readers are expected to produce 68,750 get-block-location requests to the name-node per second. And Table 4 confirms that the name-node is able to process that many requests even if only 70% of its processing capacity is dedicated to this task.&lt;/p&gt;

&lt;p style="overflow:hidden;background:#eee;padding-top:10px;margin:.5em 0;padding:5px 10px 5px 10px;"&gt;A 10,000 node HDFS cluster with the internal load at 30% will be able to handle an expected read-only load produced by 100,000 HDFS clients.&lt;/p&gt;

&lt;p&gt;The write performance is not as optimistic. According to Table 3, 100,000 writers are expected to provide an average load of 41,667 create block requests per second on the name-node. This is way above 3,920 creates per second &amp;mdash; 70% of possible processing capacity of the name-node (see Table 4).&lt;/p&gt;

&lt;p style="overflow:hidden;background:#eee;padding-top:10px;margin:.5em 0;padding:5px 10px 5px 10px;"&gt;A reasonably expected write-only load produced by 100,000 HDFS clients on a 10,000 node HDFS cluster will exceed the throughput capacity of a single name-node.&lt;/p&gt;

&lt;p&gt;We have seen that a 10,000-node HDFS cluster with single name-node is expected to handle a workload of 100,000 readers well. However, even 10,000 writers can produce enough workload to saturate the name-node, making it a bottleneck for linear scaling.&lt;/p&gt;

&lt;p&gt;Such a large difference in performance is attributed to get-block-locations (read workload) being a memory-only operation, while creates (write workload) requires journaling, which is bounded by the local hard drive performance.&lt;/p&gt;

&lt;p&gt;There are ways to improve the single name-node performance. But any solution intended for single namespace server optimization lacks scalability.&lt;/p&gt;

&lt;p&gt;Looking into the future &amp;mdash; especially taking into account that the ratio of small files tend to grow &amp;mdash; the most promising solutions seem to be based on distributing the namespace server itself both for workload balancing and for reducing the single server memory footprint.&lt;/p&gt;

&lt;p&gt;I hope you will benefit from the information provided above. Please refer to the &lt;a href="http://www.usenix.org/publications/login/2010-04/openpdfs/shvachko.pdf"&gt;USENIX ;login:&lt;/a&gt; article for more details.&lt;/p&gt;

&lt;p style="overflow:hidden;background:#eee;padding-top:10px;margin:.5em 0;padding:15px 10px 5px 10px;border-top:1px solid #ccc;"&gt;&lt;img height="50" width="50" style="margin-top:-10px;float: left; margin-right: 6px;border:1px solid #999;background:#fff;padding:5px;" src="http://developer.yahoo.com/blogs/hadoop/konstantin_shvachko.jpeg" alt="Konstantin Shvachko"&gt;
Konstantin V. Shvachko&lt;BR&gt;
Hadoop Distributed File System (HDFS) Team, Yahoo!&lt;/p&gt;&lt;div style="clear:both"&gt;&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=kZjz5IcyZuI:bPhdZFWH6Lw:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=kZjz5IcyZuI:bPhdZFWH6Lw:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?i=kZjz5IcyZuI:bPhdZFWH6Lw:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=kZjz5IcyZuI:bPhdZFWH6Lw:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=kZjz5IcyZuI:bPhdZFWH6Lw:PhkjNP4BSzk"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?d=PhkjNP4BSzk" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=kZjz5IcyZuI:bPhdZFWH6Lw:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?i=kZjz5IcyZuI:bPhdZFWH6Lw:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/YDNHadoop/~4/kZjz5IcyZuI" height="1" width="1"/&gt;</description>
         <link>http://feeds.developer.yahoo.net/~r/YDNHadoop/~3/kZjz5IcyZuI/scalability_of_the_hadoop_dist.html</link>
         <guid isPermaLink="false">http://developer.yahoo.com/blogs/hadoop/2010/05/scalability_of_the_hadoop_dist.html</guid>
                  <category domain="http://www.sixapart.com/ns/types#category">Developer Note</category>
        
        
         <pubDate>Wed, 05 May 2010 16:00:00 -0800</pubDate>
      <feedburner:origLink>http://developer.yahoo.com/blogs/hadoop/2010/05/scalability_of_the_hadoop_dist.html</feedburner:origLink></item>
            <item>
         <title>Hundreds of Hadoop Fans Flock to Yahoo! for  the April Hadoop User Group Meet-Up</title>
         <description>&lt;p&gt;
Hi Hadoopers
&lt;/p&gt;
&lt;p&gt;
Thanks to more than 250 developers who came tonight to Yahoo! for our monthly Hadoop User Group meeting. With &lt;a href="http://www.facebook.com/f8"&gt;Facebook's F8&lt;/a&gt;  developer conference and the downpour of April showers it was nice to see such turnout. 
&lt;/p&gt;
&lt;p&gt;
A few lucky winners received  free tickets to the upcoming  &lt;a href="http://bit.ly/hadoopsummit"&gt;Hadoop Summit 2010&lt;/a&gt; (June 29th, at the Hyatt Regency, Santa Clara). Congratulations to those winners – everyone else please &lt;b&gt;&lt;a href="http://hadoopsummit2010.eventbrite.com/"&gt;register here&lt;/a&gt;&lt;/b&gt;
&lt;/p&gt;
&lt;p&gt;
The event started with Vishwanath Ramarao, Director of anti-spam engineering for &lt;a href="http://mail.yahoo.com"&gt;Yahoo! Mail&lt;/a&gt;. Vish described the intricate cat-and-mouse games played with spammers, and how Yahoo! uses Hadoop to abstract away the complexity of large scale data analysis and provide deep insight into spammer campaigns.
&lt;/p&gt;
&lt;p&gt;
&lt;object width="425" height="355"&gt;&lt;param name="movie" value="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=mailantispam-100421192035-phpapp02&amp;stripped_title=mail-antispam" /&gt;&lt;param name="allowFullScreen" value="true"/&gt;&lt;param name="allowScriptAccess" value="always"/&gt;&lt;embed src="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=mailantispam-100421192035-phpapp02&amp;stripped_title=mail-antispam" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="425" height="355"&gt;&lt;/embed&gt;&lt;/object&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;object width="640" height="385"&gt;&lt;param name="movie" value="http://www.youtube.com/v/fVL93-OF1gc&amp;hl=en_US&amp;fs=1&amp;"&gt;&lt;/param&gt;&lt;param name="allowFullScreen" value="true"&gt;&lt;/param&gt;&lt;param name="allowscriptaccess" value="always"&gt;&lt;/param&gt;&lt;embed src="http://www.youtube.com/v/fVL93-OF1gc&amp;hl=en_US&amp;fs=1&amp;" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="640" height="385"&gt;&lt;/embed&gt;&lt;/object&gt;
&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;
Next was a presentation from John Sichi, lead engineer for &lt;a href="http://www.facebook.com"&gt;Facebook's&lt;/a&gt; data infrastructure team. John provided an overview of Facebook's recent integration between &lt;a href="http://hadoop.apache.org/"&gt;Hadoop&lt;/a&gt;, &lt;a href="http://hadoop.apache.org/hbase/"&gt;HBase&lt;/a&gt; and &lt;a href="http://hadoop.apache.org/hive/"&gt;Hive&lt;/a&gt; and the motivation for it - "Data, data, and more data".
&lt;/p&gt;
&lt;p&gt;
&lt;object width="425" height="355"&gt;&lt;param name="movie" value="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=hivehbase-hadoopapr2010-100421192116-phpapp02&amp;stripped_title=hive-h-basehadoopapr2010" /&gt;&lt;param name="allowFullScreen" value="true"/&gt;&lt;param name="allowScriptAccess" value="always"/&gt;&lt;embed src="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=hivehbase-hadoopapr2010-100421192116-phpapp02&amp;stripped_title=hive-h-basehadoopapr2010" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="425" height="355"&gt;&lt;/embed&gt;&lt;/object&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;object width="640" height="385"&gt;&lt;param name="movie" value="http://www.youtube.com/v/0EFrXf_rgBg&amp;hl=en_US&amp;fs=1&amp;"&gt;&lt;/param&gt;&lt;param name="allowFullScreen" value="true"&gt;&lt;/param&gt;&lt;param name="allowscriptaccess" value="always"&gt;&lt;/param&gt;&lt;embed src="http://www.youtube.com/v/0EFrXf_rgBg&amp;hl=en_US&amp;fs=1&amp;" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="640" height="385"&gt;&lt;/embed&gt;&lt;/object&gt;
&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;
We concluded with Ken Krugler, the founder of &lt;a href="http://bixolabs.com/"&gt;Bixo Labs&lt;/a&gt;.  Ken described the Public Terabyte Dataset project - a large-scale web crawl that uses SimpleDB, Hadoop, Cascading and Bixo in the &lt;a href="http://aws.amazon.com/elasticmapreduce/"&gt;Amazon's EMR cloud&lt;/a&gt;. 
&lt;p&gt;
&lt;object width="425" height="355"&gt;&lt;param name="movie" value="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=bixohugtalk-100421230117-phpapp01&amp;stripped_title=bixo-hug-talk" /&gt;&lt;param name="allowFullScreen" value="true"/&gt;&lt;param name="allowScriptAccess" value="always"/&gt;&lt;embed src="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=bixohugtalk-100421230117-phpapp01&amp;stripped_title=bixo-hug-talk" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="425" height="355"&gt;&lt;/embed&gt;&lt;/object&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;object width="640" height="385"&gt;&lt;param name="movie" value="http://www.youtube.com/v/VIIi8DjQbzI&amp;hl=en_US&amp;fs=1&amp;"&gt;&lt;/param&gt;&lt;param name="allowFullScreen" value="true"&gt;&lt;/param&gt;&lt;param name="allowscriptaccess" value="always"&gt;&lt;/param&gt;&lt;embed src="http://www.youtube.com/v/VIIi8DjQbzI&amp;hl=en_US&amp;fs=1&amp;" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="640" height="385"&gt;&lt;/embed&gt;&lt;/object&gt;
&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;
&lt;p&gt;
We will publish shortly video recordings of the sessions  on this blog. Stay tuned!
&lt;/p&gt;
&lt;p&gt;
We at Yahoo! are embracing Hadoop – as illustrated by the Yahoo! Mail case study. The ability to process massive data sets is core to our business and we are continuing to invest heavily in the technology and the community. We love to hear about the growing ecosystem of solutions and frameworks built around Hadoop. 
&lt;/p&gt;
&lt;p&gt;
Please join us at the &lt;a href="http://bit.ly/hadoopsummit"&gt;Hadoop Summit &lt;/a&gt; to continue the conversation.
&lt;/p&gt;
&lt;p&gt;
As always, we are looking for exciting technologies and experiences you want to share.
Please add presentation requests at the &lt;a href="http://www.meetup.com/hadoop/"&gt;Hadoop Bay Area User Group Meetup page&lt;/a&gt;.
&lt;/p&gt;
&lt;p&gt;
See you all on &lt;b&gt;May 19th, 2010&lt;/b&gt;. Registration is available &lt;a href="http://www.meetup.com/hadoop/calendar/13048582/"&gt;here&lt;/a&gt;, agenda will be published soon&lt;/p&gt;

&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;
Dekel Tankel&lt;br/&gt;
Director, Product Management&lt;BR&gt;
Cloud Computing at Yahoo!
&lt;/p&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=DzrwuneSQUg:ZZfLgr5YwUw:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=DzrwuneSQUg:ZZfLgr5YwUw:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?i=DzrwuneSQUg:ZZfLgr5YwUw:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=DzrwuneSQUg:ZZfLgr5YwUw:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=DzrwuneSQUg:ZZfLgr5YwUw:PhkjNP4BSzk"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?d=PhkjNP4BSzk" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.developer.yahoo.net/~ff/YDNHadoop?a=DzrwuneSQUg:ZZfLgr5YwUw:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/YDNHadoop?i=DzrwuneSQUg:ZZfLgr5YwUw:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/YDNHadoop/~4/DzrwuneSQUg" height="1" width="1"/&gt;</description>
         <link>http://feeds.developer.yahoo.net/~r/YDNHadoop/~3/DzrwuneSQUg/hundreds_of_hadoop_fans_at_the.html</link>
         <guid isPermaLink="false">http://developer.yahoo.com/blogs/hadoop/2010/04/hundreds_of_hadoop_fans_at_the.html</guid>
                  <category domain="http://www.sixapart.com/ns/types#category">Hadoop User Group</category>
        
        
         <pubDate>Wed, 21 Apr 2010 12:48:13 -0800</pubDate>
      <feedburner:origLink>http://developer.yahoo.com/blogs/hadoop/2010/04/hundreds_of_hadoop_fans_at_the.html</feedburner:origLink></item>
      
   </channel>
</rss>
