Changes

4,812 bytes added ,  16:59, 10 April 2019
This article is part of the [[Advanced User's Guide]]. It is about the usage of BaseX for processing and storing the live data stream of Twitter. We illustrate some statistics about the Twitter data and the performance of BaseX.
 
As [http://twitter.com Twitter] attracts more and more users (over 140 million active users in 2012) and generates large amounts of data (over 340 million short messages ('tweets') daily), it has become a really exciting data source for all kinds of analytics. Twitter provides the developer community with a set of [https://dev.twitter.com/start APIs] for retrieving the data about its users and their communication, including the [https://dev.twitter.com/docs/streaming-apis Streaming API] for data-intensive applications, the [https://dev.twitter.com/docs/using-search Search API] for querying and filtering the messaging content, and the [https://dev.twitter.com/docs/api REST API] for accessing the core primitives of the Twitter platform.
=BaseX as Twitter Storage=

For retrieving the Twitter stream, we connect with the Streaming API to the endpoint of Twitter and receive a never-ending tweet stream. As Twitter delivers the tweets as [http://www.json.org/ JSON] objects, the objects have to be converted into XML fragments. For this purpose, the parse function of the [[JSON Module|XQuery JSON Module]] is used. In the examples section both versions are shown ([[#Example Tweet (JSON)|tweet as JSON]] and [[#Example Tweet (XML)|tweet as XML]]). For storing the tweets including the meta-data, we use the standard ''insert'' function of [[Updates|XQuery Update]].
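The convert-then-insert flow described above can be illustrated outside of BaseX. The following is a minimal Python sketch, not the XQuery JSON Module itself: it only mimics the naming convention visible in the XML example of this article (underscores in JSON keys are doubled, array items become <code>value</code> elements) and omits the type attributes on the root element. The function name <code>to_xml</code> and the shortened sample tweet are assumptions made for illustration.

```python
import json
import xml.etree.ElementTree as ET


def to_xml(name, value):
    """Convert a parsed JSON value into an XML element (illustrative only)."""
    # Underscores are doubled, as in the XML example of this article.
    elem = ET.Element(name.replace("_", "__"))
    if isinstance(value, dict):
        for key, child in value.items():
            elem.append(to_xml(key, child))
    elif isinstance(value, list):
        # Array items are wrapped in <value> elements.
        for item in value:
            elem.append(to_xml("value", item))
    elif isinstance(value, bool):
        # bool must be checked before numbers: bool is a subclass of int.
        elem.text = "true" if value else "false"
    elif value is None:
        pass  # null becomes an empty element
    else:
        elem.text = str(value)
    return elem


# Shortened, hypothetical excerpt of a tweet object.
raw = ('{"text": "Using BaseX for storing the Twitter Stream", "geo": null, '
       '"retweeted": false, "retweet_count": 0, '
       '"entities": {"urls": [], "hashtags": []}}')

root = to_xml("json", json.loads(raw))

# Stand-in for the database root node: each converted tweet is appended to it.
tweets = ET.Element("tweets")
tweets.append(root)

print(ET.tostring(tweets, encoding="unicode"))
```

In BaseX itself, both steps are carried out by the JSON Module and XQuery Update, as described above; the sketch is only meant to make the shape of the transformation easier to follow.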
=Twitter’s Streaming Data=

Each tweet object in the data stream contains the tweet message itself and over 60 data fields (for further information see the [https://dev.twitter.com/docs/platform-objects fields description]). The following section shows the amount of data that is delivered per hour by the Twitter Streaming API to the connected endpoints with the 10% gardenhose access on the 6th of February, March, April and May. It is the pure public live stream without any filtering applied.
 
==Statistics==
[[File:Tweets.png]]
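The per-hour, per-minute and per-second averages in the statistics table of this section are consistent with dividing each daily total by the number of hours, minutes and seconds in a day and dropping the fractional part. The following Python sketch reproduces that arithmetic. It is not part of the original measurement setup, and the function name <code>averages</code> is an assumption made for illustration.

```python
HOURS_PER_DAY = 24
MINUTES_PER_DAY = 24 * 60
SECONDS_PER_DAY = 24 * 60 * 60

# Daily totals as listed in the statistics table.
daily_totals = {
    "Mon, 6-Feb-2012": 30_824_976,
    "Tue, 6-Mar-2012": 31_823_776,
    "Fri, 6-Apr-2012": 34_638_976,
    "Sun, 6-May-2012": 35_982_976,
}


def averages(total):
    """Return truncated average tweet counts per hour, minute and second."""
    return {
        "per_hour": total // HOURS_PER_DAY,
        "per_minute": total // MINUTES_PER_DAY,
        "per_second": total // SECONDS_PER_DAY,
    }


for day, total in daily_totals.items():
    print(day, averages(total))
```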
Statistics about the data:

{| class="wikitable" width="50%"
|-
! Day
! Description
! Amount
|-
| Mon, 6-Feb-2012
| Total tweets
| 30,824,976
|-
| 
| Average tweets per hour
| 1,284,374
|-
| 
| Average tweets per minute
| 21,406
|-
| 
| Average tweets per second
| 356
|-
| Tue, 6-Mar-2012
| Total tweets
| 31,823,776
|-
| 
| Average tweets per hour
| 1,325,990
|-
| 
| Average tweets per minute
| 22,099
|-
| 
| Average tweets per second
| 368
|-
| Fri, 6-Apr-2012
| Total tweets
| 34,638,976
|-
| 
| Average tweets per hour
| 1,443,290
|-
| 
| Average tweets per minute
| 24,054
|-
| 
| Average tweets per second
| 400
|-
| Sun, 6-May-2012
| Total tweets
| 35,982,976
|-
| 
| Average tweets per hour
| 1,499,290
|-
| 
| Average tweets per minute
| 24,988
|-
| 
| Average tweets per second
| 416
|}

==Example Tweet (JSON)==
<pre>
{
  "contributors": null,
  "text": "Using BaseX for storing the Twitter Stream",
  "geo": null,
  "retweeted": false,
  "in_reply_to_screen_name": null,
  "possibly_sensitive": false,
  "truncated": false,
  "entities": {
    "urls": [ ],
    "hashtags": [ ],
    "user_mentions": [ ]
  },
  "in_reply_to_status_id_str": null,
  "id": 1984009055807*****,
  "in_reply_to_user_id_str": null,
  "source": "&lt;a href=\"http:\/\/twitterfeed.com\" rel=\"nofollow\"&gt;twitterfeed&lt;\/a&gt;",
  "favorited": false,
  "in_reply_to_status_id": null,
  "retweet_count": 0,
  "created_at": "Fri May 04 13:17:16 +0000 2012",
  "in_reply_to_user_id": null,
  "possibly_sensitive_editable": true,
  "id_str": "1984009055807*****",
  "place": null,
  "user": {
    "location": "",
    "default_profile": true,
    "statuses_count": 9096,
    "profile_background_tile": false,
    "lang": "en",
    "profile_link_color": "0084B4",
    "id": 5024566**,
    "following": null,
    "protected": false,
    "favourites_count": 0,
    "profile_text_color": "333333",
    "contributors_enabled": false,
    "verified": false,
    "description": "http:\/\/basex.org",
    "profile_sidebar_border_color": "C0DEED",
    "name": "BaseX",
    "profile_background_color": "C0DEED",
    "created_at": "Sat Feb 25 04:05:30 +0000 2012",
    "default_profile_image": true,
    "followers_count": 860,
    "geo_enabled": false,
    "profile_image_url_https": "https:\/\/si0.twimg.com\/sticky\/default_profile_images\/default_profile_0_normal.png",
    "profile_background_image_url": "http:\/\/a0.twimg.com\/images\/themes\/theme1\/bg.png",
    "profile_background_image_url_https": "https:\/\/si0.twimg.com\/images\/themes\/theme1\/bg.png",
    "follow_request_sent": null,
    "url": "http:\/\/adf.ly\/5ktAf",
    "utc_offset": null,
    "time_zone": null,
    "notifications": null,
    "friends_count": 2004,
    "profile_use_background_image": true,
    "profile_sidebar_fill_color": "DDEEF6",
    "screen_name": "BaseX",
    "id_str": "5024566**",
    "show_all_inline_media": false,
    "profile_image_url": "http:\/\/a0.twimg.com\/sticky\/default_profile_images\/default_profile_0_normal.png",
    "is_translator": false,
    "listed_count": 0
  },
  "coordinates": null
}
</pre>
==Example Tweet (XML)==

<pre class="brush:xml">&lt;json booleans="retweeted possibly__sensitive truncated favorited possibly__sensitive__editable default__profile profile__background__tile protected contributors__enabled verified default__profile__image geo__enabled profile__use__background__image show__all__inline__media is__translator" numbers="id retweet__count statuses__count favourites__count followers__count friends__count listed__count" nulls="contributors geo in__reply__to__screen__name in__reply__to__status__id__str in__reply__to__user__id__str in__reply__to__status__id in__reply__to__user__id place following follow__request__sent utc__offset time__zone notifications coordinates" arrays="urls indices hashtags user__mentions" objects="json entities user"&gt;
&lt;contributors/&gt;
&lt;text&gt;Using BaseX for storing the Twitter Stream&lt;/text&gt;
&lt;geo/&gt;
&lt;retweeted&gt;false&lt;/retweeted&gt;
&lt;truncated&gt;false&lt;/truncated&gt;
&lt;entities&gt;
&lt;urls&gt; &lt;value type="object"&gt; &lt;expanded__url&gt;http://adf.ly/88khx&lt;/expanded__url&gt; &lt;indices&gt; &lt;value type="number"&gt;50&lt;/value&gt; &lt;value type="number"&gt;70&lt;/value&gt; &lt;/indices&gt; &lt;display__url&gt;adf.ly/88khx&lt;/display__url&gt; &lt;url&gt;http://t.co/8y4sZGXn&lt;/url&gt; &lt;/value&gt; &lt;/urls&gt;
&lt;hashtags/&gt;
&lt;user__mentions/&gt;
&lt;/entities&gt;
&lt;in__reply__to__status__id__str/&gt;
&lt;id&gt;1984009055807*****&lt;/id&gt;
&lt;in__reply__to__user__id__str/&gt;
&lt;source&gt;&lt;a href="http://twitterfeed.com" rel="nofollow"&gt;twitterfeed&lt;/a&gt;&lt;/source&gt;
&lt;in__reply__to__user__id/&gt;
&lt;possibly__sensitive__editable&gt;true&lt;/possibly__sensitive__editable&gt;
&lt;id__str&gt;1984009055807*****&lt;/id__str&gt;
&lt;place/&gt;
&lt;user&gt;
&lt;lang&gt;en&lt;/lang&gt;
&lt;profile__link__color&gt;0084B4&lt;/profile__link__color&gt;
&lt;id&gt;5024566**&lt;/id&gt;
&lt;following/&gt;
&lt;protected&gt;false&lt;/protected&gt;
&lt;contributors__enabled&gt;false&lt;/contributors__enabled&gt;
&lt;verified&gt;false&lt;/verified&gt;
&lt;description&gt;http://basex.org&lt;/description&gt;
&lt;profile__sidebar__border__color&gt;C0DEED&lt;/profile__sidebar__border__color&gt;
&lt;name&gt;BaseX&lt;/name&gt;
&lt;profile__background__color&gt;C0DEED&lt;/profile__background__color&gt;
&lt;created__at&gt;Sat Feb 25 04:05:30 +0000 2012&lt;/created__at&gt;
&lt;profile__use__background__image&gt;true&lt;/profile__use__background__image&gt;
&lt;profile__sidebar__fill__color&gt;DDEEF6&lt;/profile__sidebar__fill__color&gt;
&lt;screen__name&gt;BaseX&lt;/screen__name&gt;
&lt;id__str&gt;5024566**&lt;/id__str&gt;
&lt;show__all__inline__media&gt;false&lt;/show__all__inline__media&gt;
&lt;profile__image__url&gt;http://a0.twimg.com/sticky/default_profile_images/default_profile_0_normal.png&lt;/profile__image__url&gt;
&lt;/user&gt;
&lt;/json&gt;</pre>
=BaseX Performance=

The tests show the time BaseX needs to insert large amounts of real tweets into a database. From the results we can derive that BaseX scales very well and can keep up with the incoming amount of tweets in the stream. Some lower values can occur because the size of the tweets differs according to the meta-data contained in the tweet object.<br />
Note: The {{Option|AUTOFLUSH}} option is set to <code>FALSE</code>.

System Setup: Mac OS X 10.6.8, 3.2 GHz Intel Core i3, 8 GB 1333 MHz DDR3 RAM<br/>
BaseX Version: BaseX 7.3 beta

== Insert with XQuery Update ==

These tests show the performance of BaseX performing inserts with XQuery Update, either as single updates per tweet or as bulk updates with different amounts of tweets. The initial database just contained a root node <code><tweets/></code>, and all incoming tweets are inserted into the root node after being converted from JSON to XML. The time needed for the inserts includes the conversion time.

=== Single Updates ===

{| class="wikitable" width="50%"
|-
! Amount of tweets
! Time in seconds
! Time in minutes
! Database Size (without indexes)
|-
| 1,000,000
| 492.26346
| 8.2
| 3396 MB
|-
| 2,000,000
| 461.87326
| 7.6
| 6997 MB
|-
| 3,000,000
| 470.7054
| 7.8
| 10452 MB
|}

[[File:insertTweets.png]]
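As a rough plausibility check of the statement that BaseX can keep up with the stream, the following Python sketch compares the insert rate implied by the first row of the single-update table (1,000,000 tweets in 492.26346 seconds) with the highest average stream rate reported in the statistics table (416 tweets per second on the 10% gardenhose stream). Only the first row is used, because the article does not state whether the timings of the later rows are cumulative. The script is not part of the original test setup, and all names in it are assumptions made for illustration.

```python
# Figures copied from the tables in this article.
TWEETS_INSERTED = 1_000_000   # first row of the single-update table
INSERT_SECONDS = 492.26346    # time reported for that row (includes conversion)
PEAK_STREAM_RATE = 416        # highest average tweets per second (6 May 2012)


def insert_rate(tweets, seconds):
    """Return the average number of tweets inserted per second."""
    return tweets / seconds


rate = insert_rate(TWEETS_INSERTED, INSERT_SECONDS)
headroom = rate / PEAK_STREAM_RATE

print(f"insert rate: {rate:.0f} tweets/s")
print(f"headroom over the stream rate: {headroom:.1f}x")
```

Under these assumptions the insert rate is roughly 2,000 tweets per second, several times the average rate of the incoming stream, which fits the conclusion drawn above.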