In this talk I discuss how we built an open-source Spark distribution, http://insightedge.io, that runs on top of an in-memory database.
The agenda of the talk:
Video recording (in Russian):
Slides:
We take the fundamental supply-and-demand economic model of price determination in a market and compute the price in real time based on the current supply and demand.
To make our demo even more fun, we will create a taxi price surge use case. We will consider the transportation business domain, and taxi companies like Uber or Lyft in particular.
In taxi services, order requests and available drivers represent the demand and supply data respectively. Interestingly, this data is bound to geographical location, which introduces additional complexity. Compared to business areas like retail, where product demand is linked to either an offline store or a well-known list of warehouses, the order requests are geographically distributed.
With services like Uber, fare rates automatically increase when taxi demand is higher than the number of available drivers around you. Uber prices surge to ensure reliability and availability for those who agree to pay a bit more.
The following diagram illustrates the application architecture:
With InsightEdge Geospatial API we are able to efficiently find nearby orders and, therefore, minimize the time required to compute the price. The efficiency comes from the ability to index order request location in the datagrid.
Kafka allows us to handle a high throughput of incoming raw events. Even if the computation layer starts processing more slowly (say, during peak hours), all the events will be reliably buffered in Kafka. Its seamless and proven integration with Spark makes it a good choice for streaming applications.
InsightEdge Data Grid also plays a role of a serving layer handling any operational/transactional queries from web/mobile apps.
All the components (Kafka and InsightEdge) can scale out almost linearly. To scale to many cities, we can leverage the data locality principle by partitioning the full pipeline (Kafka, Spark, Data Grid) by city, or even by more granular geographical units. In this case the geospatial search query will be limited to a single Data Grid partition. We leave this enhancement out of the scope of the demo.
To simulate the taxi orders we took a CSV dataset with Uber pickups in New York City. The demo application consists of the following components:
Let's see how the InsightEdge API can be used to calculate the price:
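The original listing is not preserved in this export. As an illustration only, here is a minimal sketch of the surge-pricing rule: the fare multiplier grows when nearby demand (order requests) exceeds nearby supply (available drivers). The function names and the 1x-2x clamp are assumptions, not the demo's actual code.

// Sketch only: surge pricing from nearby demand and supply counts.
def surgeMultiplier(nearbyOrders: Int, nearbyDrivers: Int): Double =
  if (nearbyDrivers == 0) 2.0                          // cap when there is no supply at all
  else math.min(2.0, math.max(1.0, nearbyOrders.toDouble / nearbyDrivers))

def price(baseFare: Double, nearbyOrders: Int, nearbyDrivers: Int): Double =
  baseFare * surgeMultiplier(nearbyOrders, nearbyDrivers)

In the real application the two counts come from geospatial queries against the data grid, which is exactly what the indexed order locations make cheap.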
The original listing consumes events from the orders topic and uses the zipWithGridSql() and saveToGrid() functions. The full source of the application is available on github.
In this blog post we created a demo application that processes the data stream using InsightEdge geospatial features.
An alternative approach to implementing dynamic price surging is to use machine learning clustering algorithms to split order requests into clusters and calculate whether the demand within a cluster is higher than the supply. In that case the streaming application saves the cluster details in the datagrid, and to determine the price we execute a geospatial datagrid query to find which cluster the given location belongs to.
There are several compensation models in the online advertising industry; probably the most notable is CPC (Cost Per Click), in which an advertiser pays a publisher when the ad is clicked. Search engine advertising is one of the most popular forms of CPC. It allows advertisers to bid for ad placement in a search engine’s sponsored links when someone searches on a keyword that is related to their business offering.
For the search engines like Google, advertising is one of the main sources of their revenue. The challenge for the advertising system is to determine what ad should be displayed for each query that the search engine receives.
The revenue search engine can get is essentially:
revenue = bid * probability_of_click
The goal is to maximize the revenue for every search engine query. Whereas the bid is a known value, the probability_of_click is not. Thus predicting the probability of a click becomes the key task.
Working on a machine learning problem involves a lot of experiments with feature selection, feature transformation, training different models and tuning parameters. While there are a few excellent machine learning libraries for Python and R, like scikit-learn, their capabilities are typically limited to relatively small datasets that fit on a single machine.
With the large datasets and/or CPU intensive workloads you may want to scale out beyond a single machine. This is one of the key benefits of InsightEdge, since it’s able to scale the computation and data storage layers across many machines under one unified cluster.
The dataset consists of:
At first, we want to launch InsightEdge.
To get the first data insights quickly, one can launch InsightEdge on a laptop. Though for the big datasets or compute-intensive tasks the resources of a single machine might not be enough.
For this problem we will set up a cluster with four workers and place the downloaded files on HDFS.
Let’s open the interactive Web Notebook and start exploring our dataset.
The dataset is in CSV format, so we will use the Databricks spark-csv library to load it from HDFS into a Spark dataframe:
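The original two-line listing is not preserved in this export. A typical spark-csv load on Spark 1.x looks roughly like this; the HDFS path and variable names are placeholders:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")          // the Avazu file has a header row
  .option("inferSchema", "true")
  .load("hdfs:///data/avazu/train")  // placeholder path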
Load the dataframe into Spark memory and cache:
Now that we have the dataset in Spark memory, we can read the first rows:
The data fields are:
Let’s see how many rows are in the training dataset:
There are about 40M+ rows in the dataset.
Let’s now calculate the CTR (click-through rate) of the dataset. The click-through rate is the number of times a click is made on the advertisement divided by the total impressions (the number of times an advertisement was served):
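The computation itself is a one-liner; a sketch (the click column name comes from the dataset description, df is the cached dataframe):

val clicks = df.filter("click = 1").count()
val impressions = df.count()
val ctr = clicks.toDouble / impressions
println(f"CTR = $ctr%.3f")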
The CTR is 0.169 (or 16.9%), which is quite high; the common value in the industry is about 0.2-0.3%. Such a high value is probably because non-clicks and clicks were subsampled according to different strategies, as stated by Avazu.
Now, the question is which features should we use to create a predictive model? This is a difficult question that requires a deep knowledge of the problem domain. Let’s try to learn it from the dataset we have.
For example, let’s explore the device_conn_type
feature. Our assumption might be that this is a categorical variable like Wi-Fi, 2G, 3G or LTE. This might be a relevant feature since clicking on an ad with a slow connection is not something common.
At first, we register the dataframe as a SQL table:
and run the SQL query:
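A sketch of the kind of query used here (the registered table name is an assumption): impressions and CTR per connection type.

val byConnType = sqlContext.sql(
  """SELECT device_conn_type,
    |       COUNT(*) AS impressions,
    |       AVG(CAST(click AS DOUBLE)) AS ctr
    |FROM training
    |GROUP BY device_conn_type
    |ORDER BY impressions DESC""".stripMargin)
byConnType.show()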
We see that there are four connection type categories. Two categories with CTR 18% and 13%, and the first one is almost 90% of the whole dataset. The other two categories have significantly lower CTR.
Another observation we may notice is that features C15 and C16 look like the ad size:
We can notice some correlation between the ad size and its performance. The most common one appears to be 320x50px known as “mobile leaderboard” in Google AdSense.
What about the other features? All of them represent categorical values; how many unique categories does each feature have?
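A simple way to get these numbers is to count distinct values per column (a sketch, not the original listing):

val cardinalities = df.columns.map(c => c -> df.select(c).distinct().count())
cardinalities.sortBy(-_._2).foreach { case (name, n) => println(s"$name: $n") }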
We see that there are some features with a lot of unique values, for example, device_ip
has 6M+ different values.
Machine learning algorithms are typically defined in terms of numerical vectors rather than categorical values. Converting such categorical features will result in a high dimensional vector which might be very expensive.
We will need to deal with this later.
Looking further at the dataset, we can see that the hour feature is in YYMMDDHH format.
To allow the predictive model to learn from this feature effectively, it makes sense to split it into four features: year, month, day and hour.
Let’s develop the function to transform the dataframe:
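The original 29-line listing is not preserved here; below is a condensed sketch of the transformation. The output column names mirror those used later in the post (time_day, time_hour); treat the details as an approximation rather than the exact notebook code.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

def transformHour(df: DataFrame): DataFrame = {
  val h = col("hour").cast("string")               // YYMMDDHH
  df.withColumn("time_year",  substring(h, 1, 2).cast("int"))
    .withColumn("time_month", substring(h, 3, 2).cast("int"))
    .withColumn("time_day",   substring(h, 5, 2).cast("int"))
    .withColumn("time_hour",  substring(h, 7, 2).cast("int"))
    .drop("hour")
}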
We can now apply this transformation to our dataframe and see the result:
It looks like the year and month have only one value, let’s verify it:
We can safely drop these columns as they don’t bring any knowledge to our model:
Let’s also convert click
from String to Double type.
The entire training dataset contains 40M+ rows, so it takes quite a long time to experiment with different algorithms and approaches even in a clustered environment. We want to sample the dataset and checkpoint it to the in-memory data grid running collocated with Spark. This way we can:
* quickly iterate through different approaches
* restart the Zeppelin session or launch other Spark applications and pick up the dataset from memory much more quickly
Since the training dataset contains data for 10 days, we can pick any day and sample it:
There are 4M+ rows for this day, which is about 10% of the entire dataset.
Now let’s save it to the data grid. This can be done with two lines of code:
Any time later in another Spark context we can bring the collection to the Spark memory with:
Also, we want to transform the test
dataset that we will use for prediction in a similar way.
The complete listing of the notebook can be found on github. You can import it into Zeppelin and play with it on your own.
Now that we have training and test datasets sampled, initially preprocessed and available in the data grid, we can close Web Notebook and start experimenting with different techniques and algorithms by submitting Spark applications.
For our first baseline approach let’s take a single feature device_conn_type
and logistic regression algorithm:
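The 75-line listing itself is not preserved in this export; the condensed sketch below follows the steps described in the next paragraphs, using Spark 1.x APIs. Variable and column names are assumptions, and a held-out split is used here just to illustrate the metric computation.

import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint

// index the categorical feature and one-hot encode it into a vector column
val indexer = new StringIndexer()
  .setInputCol("device_conn_type").setOutputCol("device_conn_type_index")
val encoder = new OneHotEncoder()
  .setInputCol("device_conn_type_index").setOutputCol("features")

val indexed = indexer.fit(trainDf).transform(trainDf)
val encoded = encoder.transform(indexed)

// LogisticRegressionWithLBFGS expects RDD[LabeledPoint]
val points = encoded.map { row =>
  LabeledPoint(row.getAs[Double]("click"), row.getAs[Vector]("features"))
}
val Array(train, validation) = points.randomSplit(Array(0.8, 0.2))

val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(train)
model.clearThreshold()                                 // raw scores for AUROC

val scoreAndLabels = validation.map(p => (model.predict(p.features), p.label))
println(s"AUROC = ${new BinaryClassificationMetrics(scoreAndLabels).areaUnderROC()}")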
We will explain a little bit more what happens here.
At first, we load the training dataset from the data grid, which we prepared and saved earlier with Web Notebook.
Then we use StringIndexer and OneHotEncoder to map a column of categories to a column of binary vectors. For example, with 4 categories of device_conn_type, an input value of the second category would map to an output vector of [0.0, 1.0, 0.0, 0.0].
Then we convert a dataframe to an RDD[LabeledPoint]
since the LogisticRegressionWithLBFGS expects RDD as a training parameter.
We train the logistic regression and use it to predict the click for the test dataset. Finally we compute the metrics of our classifier comparing the predicted labels with actual ones.
To build this application and submit it to the InsightEdge cluster:
It takes about 2 minutes for the application to complete and output the following:
We get AUROC slightly better than a random guess (AUROC = 0.5), which is not so bad for our first approach, but we can definitely do better.
Let’s try to select more features and see how it affects our metrics.
For this we created a new version of our app, CtrDemo2, where we can easily select the features we want to include. We use VectorAssembler to assemble multiple feature vectors into a single features column:
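A sketch of the assembly step (the input column names are assumptions for the encoded feature columns):

import org.apache.spark.ml.feature.VectorAssembler

val assembler = new VectorAssembler()
  .setInputCols(Array("device_conn_type_vec", "device_type_vec", "time_day", "time_hour"))
  .setOutputCol("features")
val assembled = assembler.transform(encoded)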
The results are the following:
+ device_type: AUROC = 0.531015564807053
+ time_day and time_hour: AUROC = 0.5555488992624483
+ C15, C16, C17, C18, C19, C20, C21: AUROC = 0.7000630113145946
You can notice how the AUROC improves as we add more and more features. This comes at the cost of the training time:
We didn’t include high-cardinality features such as device_ip
and device_id
as they will blow up the feature vector size. One may consider applying techniques such as feature hashing
to reduce the dimension. We will leave it out of this blog post’s scope.
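Just to illustrate the idea (the original post leaves it out of scope), the hashing trick maps a high-cardinality categorical value into a fixed-size index space instead of giving every distinct value its own one-hot dimension:

// Map "feature=value" into one of numBuckets indices; collisions are accepted
// as the price for a bounded vector size.
def hashIndex(feature: String, value: String, numBuckets: Int = 1 << 18): Int = {
  val h = (feature + "=" + value).hashCode
  ((h % numBuckets) + numBuckets) % numBuckets
}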
Tuning algorithm parameters is a search problem. We will use Spark Pipeline API with a Grid Search technique. Grid search evaluates a model for each combination of algorithm parameters specified in a grid (do not confuse with data grid).
Pipeline API supports model selection using cross-validation technique. For each set of parameters it trains the given Estimator
and evaluates it using the given Evaluator
.
We will use BinaryClassificationEvaluator
that has AUROC as a metric by default.
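The 16-line listing is not preserved in this export; here is a condensed sketch of the setup described above (Spark 1.x Pipeline API; stage, variable and column names are assumptions):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val lr = new LogisticRegression().setLabelCol("click").setFeaturesCol("features")
val pipeline = new Pipeline().setStages(Array(indexer, encoder, assembler, lr))

val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1))
  // .addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0))  // more parameters can be added later
  .build()

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator().setLabelCol("click"))
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

val bestModel = cv.fit(trainDf).bestModel

Each extra grid value multiplies the number of models trained by the number of folds, which is why the training time grows quickly.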
We included two regularization parameters, 0.01 and 0.1, in our search grid; others are commented out for now.
Output the best set of parameters:
Use the best model to predict test
dataset loaded from the data grid:
The results are then saved back to CSV on HDFS so we can submit them to Kaggle; see the complete listing in CtrDemo3.
It takes about 27 mins to train and compare models for two regularization parameters 0.01 and 0.1. The results are:
This simple logistic regression model ranks 1109th out of 1603 competitors on Kaggle.
The future improvements are only limited by data science skills and creativity. One may consider:
The following diagram demonstrates the design of machine learning application with InsightEdge.
The key design advantages are:
In this blog post we demonstrated how to use machine learning algorithms with InsightEdge. We went through typical stages:
We didn’t have a goal to build a perfect predictive model, so there is great room for improvement.
In the architecture section we discussed what a typical design may look like and what the benefits of using InsightEdge for machine learning are.
The Zeppelin notebook can be found here and submittable spark apps here
How is hashCode() of java.lang.Enum implemented?
Surprisingly or not, it is:

public final int hashCode() {
    return super.hashCode();
}
It returns Object’s hashCode, which is, to a certain extent, an internal address of the object. At first glance this makes total sense, since Enum values are singletons.
Now imagine you are building a distributed system. Distributed systems use hashCode to partition data across nodes and to route put/lookup operations.
The same Enum
instance would give you a different hashCode
value in different JVMs/hosts, screwing up your Hadoop job or put/lookup in distributed storage. Just something I faced recently.
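A hypothetical illustration of the remedy: when an enum value participates in partitioning or routing, derive the hash from a stable property such as its name rather than from Enum.hashCode().

import java.time.DayOfWeek

val partitions = 16

// stable: String.hashCode is specified by the Java API spec, so it is identical on every JVM
def stablePartition(day: DayOfWeek): Int =
  ((day.name.hashCode % partitions) + partitions) % partitions

// unstable: identity-based hash code, may differ between JVMs/hosts
def unstablePartition(day: DayOfWeek): Int =
  ((day.hashCode % partitions) + partitions) % partitions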
We will gradually build an embedded domain-specific language (DSL) for specifying grammars in an EBNF-like notation in Scala, using the monadic parser combinators approach. As a result, we should be able to parse a JSON document using our library.
Sources used in the slides can be found on github
Before considering any architecture, let's refresh which requirements are essential for a typical SaaS application:
(we skip other fundamental requirements such as security as they are not the primary focus of this article)
Okay, how do we usually deal with scalability concerns?
When it comes to application services (or application servers), we prefer making them stateless, so we can simply run multiple copies and distribute the load between them, achieving scale-out (horizontal) capability. And what about availability? Pretty much the same: run redundant copies of your service. So far so good.
Okay, but hundreds of customers (tenants) multiplied by the number of application servers per customer requires unreasonable overhead in memory, CPU and other hardware resources. Customers usually differ in size, usage patterns and timezones. Under these conditions the generated load is spread unequally and leads to suboptimal resource utilization. Further cost overhead comes from licenses for the underlying software (databases, application servers, operating systems, etc.).
Multitenancy comes to the rescue. Multitenancy implies the ability to serve multiple tenants with a single application instance, thus spreading the load more equally and amortizing infrastructure overhead. Multitenancy is not free, though: the downside is increased engineering complexity that requires additional development effort. On the other hand, some of these issues can be partially addressed with virtualization, which looks attractive since it doesn't require any significant architecture redesign.
Having scalable application server layer is only part of the problem, with growing amount of customers the data storage has to be scalable as well. Designing multitenant data storage is another huge topic, we will not dig into consideration details. While NoSQL solutions are able to scale out of the box, with relational databases we have to scale them manually allocating database instance per one or several customers, thus application server instance may talk to multiple databases.
Remember, in dynamic scalable SaaS environment application servers, databases and load balancer instances come and go while relying on each other combining a distributed cluster. The load balancer should know about application server instances, and application server instance should talk to databases. So when the database for new customer is provisioned, some or all application servers have to be notified about database layer changes and load balancer has to be notified of all changes in a farm of application servers. In short, we need an ability to link the pieces of multi-tier application together in realtime.
Configuration management tools like Chef, Puppet, etc are able to configure a node based on centralized configuration, though they are not designed to be responsive and propagate configuration changes quickly. Additionally they are not designed to detect failures or tolerate network partitions.
If you think about multi-tier configuration in a more abstract way, you will easily recognize the service registry and service discovery patterns that people have been using for many years to build distributed systems.
Moreover, with the rise of container-based virtualization, and Docker in particular, service discovery becomes a very important part. Containers need the ability to discover each other, adapting to the current environment.
Let’s see how SaaS configuration will look from service discovery perspective.
Okay, but can we just use simple database with inserts and selects to register and discovery? Well, there are a few concerns with that:
Now that we have discussed the high level design and requirements of configuration store, we can briefly mention open source tools: Zookeeper, Consul, etcd, Eureka, SmartStack, Doozer, Serf, etc.
I would highlight three of them definitely worth checking:
Let's build a simple demo application to prove the concept described above. We will try to simplify things as much as possible, sometimes sacrificing correctness and error handling, while still keeping it suitable for illustration purposes. Note that one should leverage the existing Curator service discovery recipes when building production-quality applications. The source code is available on github
For this PoC we choose tenant aware model rather than multitenancy to demonstrate how to incorporate custom logic with service discovery. In this model client (tenant) is routed to a configurable number of application services while they use common database.
Beware that this model has its drawbacks such as weaker scale-in (reducing the quantity of servers) and cost saving capabilities since the load is not equally spread comparing to true multitenancy. On the other hand this can be compensated with container virtualization in some sort. Also this model implies support of multiple release versions of application and has better support of application level caching. Again, the tenancy model itself is not a subject here rather than a centralized configuration.
Our stack is:
We define Zookeeper data model as a following hierarchical structure.
/app
  /client-{id}
    /db
    /app-server-slots
Znode /db
contains database details such as connection url. Znode /app-server-slots
defines the maximum number of application server instances we want to run for given client.
Here is an example with 3 clients, the value of znode follows =
sign.
/app
  /client-1
    /db = jdbc://db-client-1:5555
    /app-server-slots = 2
  /client-2
    /db = jdbc://db-client-2:5555
    /app-server-slots = 2
  /client-3
    /db = jdbc://db-client-3:5555
    /app-server-slots = 1
Service registration is implemented using so-called ephemeral znodes. Unlike standard znodes, they exist only as long as the session that created them is active.
When an application server starts, it registers itself by creating an ephemeral znode under the corresponding client. Conversely, when the application server is brought down or fails for some reason, the ephemeral znode is automatically deleted. The value of the znode contains the HTTP service location.
/app
  /client-1
    /db = jdbc://db-client-1:5555
    /app-server-slots = 2
    /app-server#0000000001 = host1:56003
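A hedged sketch of this registration step with the plain ZooKeeper client (the demo's own code is on github; connection string, paths and the service location below are placeholders, and error handling is omitted):

import org.apache.zookeeper.{CreateMode, WatchedEvent, Watcher, ZooDefs, ZooKeeper}

val zk = new ZooKeeper("localhost:2181", 5000, new Watcher {
  override def process(event: WatchedEvent): Unit = ()   // no-op watcher for the sketch
})
val serviceLocation = "host1:56003"                       // placeholder http service address

// Ephemeral + sequential: ZooKeeper appends a growing suffix and deletes the
// znode automatically when this session dies.
zk.create(
  "/app/client-1/app-server#",
  serviceLocation.getBytes("UTF-8"),
  ZooDefs.Ids.OPEN_ACL_UNSAFE,
  CreateMode.EPHEMERAL_SEQUENTIAL)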
We use HAProxy as a load balancer. To reconfigure HAProxy at runtime we created a simple agent that runs alongside the HAProxy process and watches for configuration changes. Once it detects any changes in Zookeeper, it rewrites the HAProxy config and sends a command to reload it.
At first we start Zookeeper zkServer.sh start
Then we create some data model to play with by running ZkSchemaBuilder.scala
. You can browse zookeeper data with zkCli.sh
tool.
Start HAProxy /haproxy/start.sh
and HAProxy agent running HAProxyAgent.scala
Starting HAProxy agent Thread[main,5,main]
/client-1
/client-2
/client-3
Rewriting HAProxy config /Users/fe2s/Projects/zk-tenant/haproxy/haproxy.conf
Reloading HAProxy config
At this point we should be able to hit http://localhost:8080/
though it will return 503 since there are no actual backend services running. Let’s fix it.
Run Boot.scala
to start application server with http service.
[INFO] [06/01/2015 20:22:16.084] [on-spray-can-akka.actor.default-dispatcher-2] [akka://on-spray-can/user/IO-HTTP/listener-0] Bound to Oleksiis-Mac-mini.local/192.168.0.100:57581
registering service available at Oleksiis-Mac-mini.local:57581
looking for a client with free slots
client client-1 slots: 2 occupied: 0
Found client client-1 with available slot(s) ... registering
Configuring HttpService with dbUrl jdbc://db-client-1:5555
We see that http service was brought up for client-1 that had 2 available slots.
Now run Boot.scala
several more times to start more servers and check haproxy/haproxy.conf
defaults
mode http
timeout connect 5000ms
timeout client 50000ms
timeout server 50000ms
frontend http-in
bind *:8080
acl client-1-path path_beg /client-1
acl client-2-path path_beg /client-2
acl client-3-path path_beg /client-3
use_backend client-1-backend if client-1-path
use_backend client-2-backend if client-2-path
use_backend client-3-backend if client-3-path
backend client-1-backend
balance roundrobin
server app Oleksiis-Mac-mini.local:57581
server app Oleksiis-Mac-mini.local:57588
backend client-2-backend
balance roundrobin
server app Oleksiis-Mac-mini.local:57591
server app Oleksiis-Mac-mini.local:57595
backend client-3-backend
balance roundrobin
server app Oleksiis-Mac-mini.local:57598
As we see HAProxy agent observed and propagated all configuration changes to haproxy.conf. We use HAProxy acl feature to route http requests by url prefix, i.e. requests starting with /client-1
will be routed to a farm of application servers serving for client-1.
Now if we hit http://localhost:8080/client-1/test we should get a response “OK. Http service configured with db url: jdbc://db-client-1:5555”. Voila!
You should also notice that killing an application server process will result in immediate reconfiguration of HAProxy.
]]>@Configuration
classes or XML namespace, filters supportfindByNameAndAge
)Sort
and Pageable
In this article we describe how we use GigaSpaces XAP in-memory datagrid to address this challenge. Code sources are available on github
Real-time processing is becoming more and more popular. Spark Streaming is an extension of the core Spark API that allows scalable, high-throughput, fault-tolerant stream processing of live data streams.
Spark Streaming has many use cases: user activity analytics on the web, recommendation systems, sensor data analytics, fraud detection, sentiment analytics and more.
Data can be ingested into the Spark cluster from many sources like HDFS, Kafka, Flume, etc., and can be processed using complex algorithms expressed with high-level functions like map
, reduce
, join
and window
. Finally, processed data can be pushed out to filesystems or databases.
Spark cluster keeps intermediate chunks of data (RDD) in memory and, if required, rarely touches HDFS to checkpoint stateful computation, therefore it is able to process huge volumes of data at in-memory speed. However, in many cases the overall performance is limited by slow input and output data sources that are not able to stream and store data with in-memory speed.
In this pattern we address performance challenge by integrating Spark Streaming with XAP. XAP is used as a stream data source and a scalable, fast, reliable persistent storage.
Let’s discuss this in more details.
On XAP side we introduce the concept of stream. Please find XAPStream
– an implementation that supports writing data in single and batch modes and reading in batch mode. XAPStream
leverages XAP’s FIFO
(First In, First Out) capabilities.
Here is an example how one can write data to XAPStream
. Let’s consider we are building a Word Counter application and would like to write sentences of text to the stream.
At first we create a data model that represents a sentence. Note, that the space class should be annotated with FIFO
support.
Complete sources of Sentence.java can be found here
In order to ingest data from XAP to Spark, we implemented a custom ReceiverInputDStream
that starts the XAPReceiver
on Spark worker nodes to receive the data.
XAPReceiver
is a stream consumer that reads batches of data in multiple threads in parallel to achieve the maximum throughput.
XAPInputDStream
can be created using the following function in XAPUtils
object.
Here is an example of creating XAP Input stream. At first we set XAP space url in Spark config:
And then we create a stream by merging two parallel sub-streams:
Once the stream is created, we can apply any Spark functions like map
, filter
, reduce
, transform
, etc.
For instance, to compute a word counter of five-letter words over a sliding window, one can do the following:
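A sketch of that computation (the XAP-specific stream creation is shown above; here sentences is assumed to be a DStream of sentence texts, and the window lengths are illustrative):

import org.apache.spark.streaming.Seconds

// requires ssc.checkpoint(...) because of the inverse-reduce function
val wordCounts = sentences
  .flatMap(_.split(" "))
  .filter(_.length == 5)
  .map(word => (word, 1))
  .reduceByKeyAndWindow(_ + _, _ - _, Seconds(30), Seconds(1))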
Output operations allow the DStream
’s data to be pushed out to external systems. Please refer to Spark documentation for the details.
To minimize the cost of creating XAP connection for each RDD
, we created a connection pool named GigaSpaceFactory
. Here is an example how to output RDD
to XAP:
Please, note that in this example a XAP connection is created and data is written from Spark driver. In some cases, one may want to write data from the Spark worker. Please, refer to Spark documentation - it explains different design patterns using
foreachRDD
.
As a part of this integration pattern, we demonstrate how to build an application that consumes live stream of text and displays top 10 five-letter words over a sliding window in real-time. The user interface consists of a simple single page web application displaying a table of top 10 words and a word cloud. The data on UI is updated every second.
The high-level design diagram of the Word Counter Demo is below:
mvn clean install
spark
by adding export LOOKUPGROUPS=spark
line to <XAP_HOME>/bin/setenv.sh/bat
gs-agent.sh/bat
scriptmvn os:deploy -Dgroups=spark
from <project_root>/word-counter-demo
directoryThis is the simplest option that doesn’t require downloading and installing Spark distributive, which is useful for the development purposes. Spark runs in the embedded mode with as many worker threads as logical cores on your machine.
<project_root>/word-counter-demo/spark/target
directoryjava -jar spark-wordcounter.jar -s jini://*/*/space?groups=spark -m local[*]
In this option Spark runs a cluster in the standalone mode (as an alternative to running on a Mesos or YARN cluster managers).
fe2s
(remember to substitute it with yours)1 2 3 |
|
<project_root>/word-counter-demo/spark/target
directoryjava -jar spark-wordcounter.jar -s jini://*/*/space?groups=spark -m spark://fe2s:7077 -j ./spark-wordcounter.jar
<project_root>/word-counter-demo/feeder/target
java -jar feeder.jar -g spark -n 500
At this point all components should be up and running. The application is available at http://localhost:8090/web/
Storm has many use cases: realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.
This pattern integrates XAP with Storm. XAP is used as stream data source and fast reliable persistent storage, whereas Storm is in charge of data processing. We support both pure Storm and Trident framework.
As part of this integration we provide classic Word Counter and Twitter Reach implementations on top of XAP and Trident.
Also, we demonstrate how to build highly available, scalable equivalent of Realtime Google Analytics application with XAP and Storm. Application can be deployed to cloud with one click using Cloudify.
Sources are available on github
Storm is a real time, open source data streaming framework that functions entirely in memory. It constructs a processing graph that feeds data from an input source through processing nodes. The processing graph is called a “topology”. The input data sources are called “spouts”, and the processing nodes are called “bolts”. The data model consists of tuples. Tuples flow from Spouts to the bolts, which execute user code. Besides simply being locations where data is transformed or accumulated, bolts can also join streams and branch streams.
Storm is designed to be run on several machines to provide parallelism. Storm topologies are deployed in a manner somewhat similar to a webapp or a XAP processing unit; a jar file is presented to a deployer which distributes it around the cluster where it is loaded and executed. A topology runs until it is killed.
Beside Storm, there is a Trident – a high-level abstraction for doing realtime computing on top of Storm. Trident adds primitives like groupBy, filter, merge, aggregation to simplify common computation routines. Trident has consistent, exactly-once semantics, so it is easy to reason about Trident topologies.
The capability to guarantee exactly-once semantics comes at an additional cost. To guarantee it, incremental processing has to be done on top of a persistent data source, and Trident has to ensure that all updates are idempotent. Usually this leads to lower throughput and higher latency than a similar topology built with pure Storm.
Basically, Spouts provide the source of tuples for Storm processing. For spouts to be maximally performant and reliable, they need to provide tuples in batches, and be able to replay failed batches when necessary. Of course, in order to have batches, you need storage, and to be able to replay batches, you need reliable storage. XAP is about the highest performing, reliable source of data out there, so a spout that serves tuples from XAP is a natural combination.
Depending on domain model and level of guarantees you want to provide, you choose either pure Storm or Trident. We provide Spout implementations for both – XAPSimpleSpout
and XAPTranscationalTridentSpout
respectively.
XAPSimpleSpout
is a spout implementation for pure Storm that reads data in batches from XAP. On XAP side we introduce conception of stream. Please find SimpleStream
– a stream implementation that supports writing data in single and batch modes and reading in batch mode. SimpleStream
leverages XAP’s FIFO(First In, First Out) capabilities.
SimpleStream
works with arbitrary space class that has FifoSupport.OPERATION
annotation and implements Serializable
.
Here is an example how one may write data to SimpleStream
and process it in Storm topology. Let’s consider we would like to build an application to analyze the stream of page views (user clicks) on website. At first, we create a data model that represents a page view
Now we would like to create a reference to stream instance and write some data.
The second argument of SimpleStream
is a template used to match objects during reading.
If you want to have several streams with the same type, template objects should differentiate your streams.
Now let’s create a spout for PageView
stream.
To create a spout, we have to specify how we want our space class be converted to Storm tuple. That is exactly what TupleConverter
knows about.
At this point we have everything ready to build Storm topology with PageViewSpout
.
ConfigConstants.XAP_SPACE_URL_KEY is the space URL. ConfigConstants.XAP_STREAM_BATCH_SIZE is the maximum number of items that the spout reads from XAP in one hit.
XAPTranscationalTridentSpout
is a scalable, fault-tolerant, transactional spout for Trident, supports pipelining. Let’s discuss all its properties in details.
For spout to be maximally performant, we want an ability to scale the number of instances to control the parallelism of reader threads.
There are several spout APIs available that we could potentially use for our XAPTranscationalTridentSpout implementation:
IPartitionedTridentSpout
: A transactional spout that reads from a partitioned data source. The problem with this API is that it doesn’t acknowledge when batch is successfully processed which is critical for in memory solutions since we want to remove items from the grid as soon as they have been processed. Another option would be to use XAP’s lease capability to remove items by time out. This might be unsafe, if we keep items too long, we might consume all available memory.ITridentSpout
: The most general API. Setting parallelism hint for this spout to N will create N spout instances, single coordinator and N emitters. When coordinator issues new transaction id, it passes this id to all emitters. Emitter reads its portion of transaction by given transaction id. Merged data from all emitters forms transaction.For our implementation we choose ITridentSpout
API.
There is one to one mapping between XAP partitions and emitters.
The Storm framework guarantees that the topology is highly available: if some component fails, Storm restarts it. That means our spout implementation should be stateless or able to recover its state after a failure.
When emitter is created, it calls remote service ConsumerRegistryService
to register itself. ConsumerRegistryService
knows the number of XAP partitions and keeps track of the last allocated partition. This information is reliably stored in the space, see ConsumerRegistry.java
.
Remember that parallelism hint for XAPTranscationalTridentSpout
should equal to the number of XAP partitions.
The property of being transactional is defined in Trident as follows:
- batches for a given txid are always the same; replays of batches for a txid will emit the exact same set of tuples as the first time that batch was emitted for that txid
- there is no overlap between batches of tuples (tuples are in one batch or another, never multiple)
- every tuple is in a batch (no tuples are skipped)
XAPTranscationalTridentSpout
works with PartitionedStream
that wraps stream elements into Item class and keeps items ordered by ‘offset’ property. There is one PartitionStream
instance per XAP partition.
The stream’s WriterHead holds the last offset in the stream. Any time a batch of elements (or a single element) is written to the stream, WriterHead is incremented by the number of elements, and the allocated numbers are used to populate the offset property of the Items. The WriterHead object is kept on the heap; there is no need to keep it in the space. If the primary partition fails, WriterHead is reinitialized to the maximum offset value for the given stream.
ReaderHead
points to the last read item. We have to keep this value in the space, otherwise if partition fails we won’t be able to infer this value.
When spout request new batch, we take ReaderHead
, read data from that point and update ReaderHead
. New BatchMetadata
object is placed to the space, it keeps start offset and number of items in the batch. In case Storm requests transaction replaying, we are able to reread exactly the same items by given batchId. Finally, once Storm acknowledges that batch successfully processed, we delete BatchMetadata
and corresponding items from the space.
By default, Trident processes a single batch at a time, waiting for the batch to succeed or fail before trying another batch. We can get significantly higher throughput and lower latency of processing of each batch – by pipelining the batches. You configure the maximum amount of batches to be processed simultaneously with the “topology.max.spout.pending” property.
Operations with PartitionedStream
are encapsulated in remote service – PartitionedStreamService
.
Here is an example how to use XAPTransactionalTridentSpout
:
The full example that demonstrates usage of XAPTransactionalTridentSpout
to address classic Word Counter problem can be found in XAPTransactionalTridentSpoutTest
.
Trident has first-class abstractions for reading from and writing to stateful sources. Details are available on the Storm wiki site.
In Trident topology that is persisting state via this mechanism, the overall throughput is almost certainly constrained by the performance of the state persistence. This is a good place where XAP can step in and provide extremely high performance persistence for stream processing state.
XAP Trident state implementation supports all state types – non-transactional, transactional and opaque. All you need to create a Trident state is configure space url and choose appropriate factory method of XAPStateFactory
class:
The full example can be found in TridentWordCountTest
.
Trident Read-Only state allows to lookup persistent data during the computation.
Consider the Twitter Reach example. Reach is the number of unique people exposed to a URL on Twitter. To compute reach, you need to fetch all the people who ever tweeted a URL, fetch all the followers of all those people, deduplicate that set of followers, and then count the deduplicated set.
XAP is a good candidate to store reference data such as tweeted url and followers. You can easily create XAP read-only state with XAPReadOnlyStateFactory
. The following example demonstrates how to create a read-only state for TweeterUrl
and Followers
classes. The input arguments that Trident pass to stateQuery()
are used as space ids.
The full example can be found in TridentReachTest
.
Another option to create XAP read-only state is to use SQL query. In this case stateQuery’s
input arguments are used as SQL parameters:
The full example can be found in SqlQueryReadOnlyStateTest
.
If pure Storm suits better your needs, most likely you will want to read/write data from bolts to persistent storage. For instance, imagine you are processing stream of data and would like to present computation result on UI. So the final bolt in your topology pipeline should write result to XAP which can then be accessed from anywhere. For this purpose we created XAPAwareRichBolt
and XAPAwareBasicBolt
that have a reference to space proxy. All you need is to configure space url and extend XAP aware bolt.
Example:
In this section we demonstrate how to build highly available, scalable equivalent of Real-time Google Analytics application and deploy it to cloud with one click using Cloudify.
Real-Time Google Analytics allows you to monitor activity as it happens on your site. The reports are updated continuously and each page view is reported seconds after it occurs on your site. For example, you can see:
The PageView feeder is a standalone Java application that simulates users on the site. It continuously sends PageView JSON to REST service endpoints deployed in the XAP web PU. A PageView looks like this:
Rest service converts JSON documents to space object and writes them to the stream. Stream is consumed by Storm topology which performs all necessary processing in memory and stores results in XAP space. End user is browsing web page hosted in Web PU that continuously updates reports with AJAX requests backed by another rest service. Rest service reads report from XAP space.
We use pure Storm to build the topology. There are several reasons why we don't use Trident for this application: we can tolerate the loss of some page views if a Storm node fails, and we don't need exactly-once processing semantics. Instead, we want to maximize throughput and minimize latency.
PageView spout forks five branches, each branch calculates its report and can be scaled independently. The final bolt in the branch writes data to XAP space. In the next sections we take a closer look at branches design.
Top urls report displays top 10 visited urls for the last ten seconds. Topology implements distributed rolling count algorithm. The report is updated every second.
Tuples flow from spout to UrlRollingCountBolt
grouped by ‘url’. UrlRollingCountBolt
calculates rolling count with sliding windows of 10 seconds for every url. Sliding windows is basically a cyclic buffer with a head pointing to current slot. When bolt receives new tuple, it finds a sliding window for this tuple and increments the number in current slot. Every two seconds UrlRollingCountBolt
emits the sum of sliding window for every url, then sliding windows advance and head points to the next slot.
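For illustration (not the demo's exact code), the sliding window just described can be modeled as a cyclic buffer of slots:

class SlidingWindowCounter(numSlots: Int) {
  private val slots = Array.fill(numSlots)(0L)
  private var head = 0

  def increment(): Unit = slots(head) += 1   // count an event in the current slot
  def total: Long = slots.sum                // value emitted for the whole window
  def advance(): Unit = {                    // move to the next slot, dropping its old value
    head = (head + 1) % numSlots
    slots(head) = 0
  }
}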
The url and its rolling count flow to IntermediateRankingsBolt
which maintains pair of (url, count) in sorted by count order and emits its top 10 urls to the final stage. TotalUrlRankingBolt
calculates the global top 10 urls and writes report object to XAP space. The primitives to implement rolling count algorithm can be found in storm-starter project.
The top referrals topology branch is identical to the top URLs one. The only difference is that we count by the ‘referral’ rather than the ‘url’ tuple field.
Active users report displays how many people on the site right now. We assume that if user hasn’t opened any page for the last N seconds, then user has left the site. Users are uniquely identified by ‘sessionId’ tuple field. For demo purpose N is configured to 5 seconds, though it should be much longer in real life application.
Tuples flow from spout to PartitionedActiveUsersBolt
grouped by ‘sessionId’. For every sessionId PartitionedActiveUsersBolt
keeps track of the last seen time. Every second it removes sessions seen last time earlier than N seconds before and then emits the number of remaining ones.
TotalActiveUsersBolt
maintains a map of [source_task, count] and emits the total count for all sources. Report is written to XAP.
Page view time series report displays the dynamic of visited pages for last minute. The chart is updated every second.
PageViewCountBolt
calculates the number of page views and passes local count to PageViewTimeSeriesBolt
every second. PageViewTimeSeriesBolt
maintains a sliding window counter and writes report to XAP space.
Geo report displays a map of users’ geographical location. Depending on the volume of traffic from particular country, country is filled with different colors on the map.
IP address converted to country using MaxMind GeoIP database. The database is a binary file loaded into GeoIPBolt’s
heap. GeoIpLookupService
ensures that it’s loaded only once per JVM.
mvn clean install
gs-agent.sh/bat
script1 2 |
|
apache-storm-0.9.2-incubating/bin
to your $PATH
storm jar ./storm-topology/target/storm-topology-1.0-SNAPSHOT.jar com.gigaspaces.storm.googleanalytics.topology.GoogleAnalyticsTopology google-analytics 127.0.0.1
java -jar ./feeder/target/feeder-1.0-SNAPSHOT.jar 127.0.0.1
storm kill google-analytics
google-analytics/storm-topology/pom.xml
and change scope of storm-core artifact from ‘provided’ to ‘compile’.java -jar ./storm-topology/target/storm-topology-1.0-SNAPSHOT.jar
Alternatively, you can run GoogleAnalyticsTopology from your IDE. Please note that the recipes were tested with CentOS 6 only.
<project_root>/cloudify/apps/storm-demo/deployer/files
contains up-to-date version of space-1.0.-SNAPSHOT.jar
, web.war
and feeder-1.0-SNAPSHOT.jar
. As well as <project_root>/cloudify/apps/storm-demo/storm-nimbus/commands
contains storm-topology-1.0-SNAPSHOT.jar
(you can copy them from maven’s target directories using <project_root>/dev-scripts/copy-artifacts-to-cloudify.sh
script)<project_root>/cloudify
recipes to <cloudify_install>/recipes
directory<cloudify_install>/bin/cloudify.sh
bootstrap-localcloud
)install-application storm-demo
xap-management
service. Google Analytics UI should be available at http://<xap_management_service_ip>:8090/web
This pattern integrates GigaSpaces with Apache Kafka. GigaSpaces write-behind pushes IMDG operations to Kafka, making them available to subscribers. These could be Hadoop or other data warehousing systems that use the data for reporting and processing. Sources are available on github
The XAP Kafka integration is done via the SpaceSynchronizationEndpoint
interface deployed as a Mirror service PU. It consumes a batch of IMDG operations, converts them to custom Kafka messages and sends these to the Kafka server using the Kafka Producer API.
GigaSpace-Kafka protocol is simple and represents the data and its IMDG operation. The message consists of the IMDG operation type (Write, Update , remove, etc.) and the actual data object. The Data object itself could be represented either as a single object or as a Space Document with key/values pairs (SpaceDocument
).
Since a Kafka message is sent over the wire, it should be serialized to bytes in some way.
The default encoder utilizes Java serialization mechanism which implies Space classes (domain model) to be Serializable
.
By default Kafka messages are uniformly distributed across Kafka partitions. Please note, even though IMDG operations appear ordered in SpaceSynchronizationEndpoint
, it doesn’t imply correct ordering of data processing in Kafka consumers. See below diagram:
You can download the example code from github.
The example located under <project_root>/example
. It demonstrates how to configure Kafka persistence and implements a simple Kafka consumer pulling data from Kafka and store in HsqlDB.
In order to run an example, please follow the instruction below:
Step 1: Install Kafka
Step 2: Start Zookeeper and Kafka server
Step 3: Build project
Step 4: Deploy example to GigaSpaces
Step 5: Check GigaSpaces log files, there should be messages from the Feeder and Consumer.
The following Maven dependency needs to be included in your project in order to use Kafka persistence. This artifact is built from the <project_root>/kafka-persistence source directory.
Here is an example of the Kafka Space Synchronization Endpoint configuration:
Please consult Kafka documentation for the full list of available producer properties. You can override the default properties if there is a need to customize GigaSpace-Kafka protocol. See Customization section below for details.
In order to associate a Kafka topic with the domain model class, the class needs to be annotated with the @KafkaTopic
annotation and declared as Serializable
. Here is an example
To configure a Kafka topic for a SpaceDocuments or Extended SpaceDocument, the property KafkaPersistenceConstants.SPACE_DOCUMENT_KAFKA_TOPIC_PROPERTY_NAME
should be added to document. Here is an example
It’s also possible to configure the name of the property which defines the Kafka topic for SpaceDocuments. Set spaceDocumentKafkaTopicName
to the desired value as shown below.
The Kafka persistence library provides a wrapper around the native Kafka Consumer API for the GigaSpace-Kafka protocol serialization. Please see com.epam.openspaces.persistency.kafka.consumer.KafkaConsumer
, example of how to use it under <project_root>/example module
.
AbstractKafkaMessage
, AbstractKafkaMessageKey
, AbstractKafkaMessageFactory
.KafkaMessageDecoder
and KafkaMessageKeyDecoder
.For those who haven’t heard about Storm, in short it’s a distributed realtime computation system. Scalable, fault tolerant and guarantees that every message is fully processed.
The incoming tuple (message in Storm terminology) is processed by a bolt (processing node). A bolt can spawn more tuples, which in turn are further processed by succeeding bolt(s). So you end up with a tuple tree (actually a directed acyclic graph).
The question is how to guarantee that every tuple is fully processed, i.e if some bolt fails, the tuple is replayed. We have to acknowledge when intermediate tuple created and when it’s processed. With huge tuple trees, tracking the entire tree state is memory expensive.
Storm's designers decided to use an elegant trick which makes it possible to know when a tuple is fully processed with O(1) memory.
Each tuple is associated with a random 64-bit id. The tree state is also a 64-bit value called the ‘ack val’. Whenever a tuple is created or acknowledged, its id is simply XORed into the ack val. When the ack val becomes 0, the tree is fully processed. Yes, there is a small probability of a mistake: at 10K acks per second, it will take 50,000,000 years until a mistake is made. And even then, it will only cause data loss if that tuple happens to fail.
Now for the algorithmic problem: given an array of integers in which every number except one appears exactly twice, find the number that appears only once using a single pass and O(1) memory. Easy, yay!
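The same XOR property solves the puzzle: every value that appears twice cancels itself out, leaving the single one.

def findSingle(xs: Array[Int]): Int = xs.foldLeft(0)(_ ^ _)

findSingle(Array(3, 7, 3, 1, 7))   // 1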
Step 1. Find the obfuscated password in jetty.xml; it should start with the OBF: prefix. Run it through the following deobfuscation function, which I found in the Jetty sources.
Step 2. Now you should have the password for the keystore. The location of the keystore should be listed in jetty.xml. Import the keys into the intermediate PKCS12 format:
$ /usr/java/jdk1.6.0_13/bin/keytool -importkeystore -srckeystore $YOUR_PATH_HERE/keystore -destkeystore intermediate.p12 -deststoretype PKCS12
Step 3. Extract RSA key from PKCS12
$ openssl pkcs12 -in intermediate.p12 -nocerts -nodes -passin pass:$YOUR_PASS_HERE | openssl rsa -out privateRSAKey.pem
Step 4. Now you are good to feed Wireshark or another preferred tool with the RSA key.
Let's see where the insert and lookup speed comes from compared to a classic B-tree.
Note that we consider the case when the index is stored on disk rather than in memory. Therefore we are interested in estimating the number of IO operations, not CPU operations: CPU operations take negligible time compared to IO.
Insertion:
B-tree: O((log N)/(log B))
Fractal tree: O((log N)/(B^(1-k)))
Search:
B-tree: O((log N)/(log B))
Fractal tree: O((log N)/(log(k*B^(1-k))))
where B is the size of an IO block
So, a simplified version of a Fractal Tree consists of log N arrays, where the i-th array has size 2^i.
Insertion: we start at the top and look at the topmost array of size 1. If it is empty, we put the element there. Otherwise we take its element out and sort it together with the new one in a temporary array. We then go down to the array of size 2. If it is empty, we put our pair there; if not, we merge it with the temporary array. Since both arrays are sorted, this takes O(X), where X is the length of the array; essentially it is the merge step of merge sort.
The amortized insertion time is O((log N)/B), although in the worst case inserting a single element can cause an enormous amount of data to be rewritten down the chain. To avoid long-latency spikes, TokuDB spawns a separate thread for the insertion and returns a response to the client immediately.
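For illustration (not from the original post), here is a sketch of the simplified structure: inserting an element merges full levels downward, much like incrementing a binary counter.

import scala.collection.mutable.ArrayBuffer

class SimplifiedFractalTree {
  // level i holds either a sorted Vector of 2^i keys or nothing
  private val levels = ArrayBuffer[Option[Vector[Int]]]()

  def insert(key: Int): Unit = {
    var carry = Vector(key)
    var i = 0
    while (i < levels.length && levels(i).isDefined) {
      carry = merge(levels(i).get, carry)   // both inputs are sorted
      levels(i) = None
      i += 1
    }
    if (i == levels.length) levels += Some(carry) else levels(i) = Some(carry)
  }

  // naive search: binary search in every level, O((log N)^2) without fractional cascading
  def contains(key: Int): Boolean =
    levels.exists(_.exists(v => java.util.Arrays.binarySearch(v.toArray, key) >= 0))

  // simplified; a real implementation would do a linear merge of the two sorted inputs
  private def merge(a: Vector[Int], b: Vector[Int]): Vector[Int] = (a ++ b).sorted
}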
Search: we go through the arrays, using binary search within each one. In total this gives O((log N)^2), which is slower than a B-tree. How can it be sped up?
The idea is to reuse, while searching the next array, some information from the search in the previous one, and so on down the chain. The information is the following: each element of an array stores a pointer to its 'virtual' position in the next array. This is called fractional cascading.
In total, log N arrays with constant time per array gives O(log N).
Overall, the folks at Tokutek believe that in the future everyone will switch to fractal trees as a replacement for B-trees. The details of the algorithm are patented.
Suppose you are developing an application and decided to cache data to improve performance. You also decided to use horizontal scaling and spread the data across N servers.
So, there are N servers and we need to implement two functions:
Given a key k, how do we find out which server holds the corresponding item?
The first thing that comes to mind is to use an ordinary hash table: take the key k, apply a hash function to it and take the remainder of division by the number of servers N, i.e. hash(k) mod N. Yes, this will work, but what happens when we want to add one more server? We would have to rehash all the data, and most of it would have to be moved to different servers. This is an expensive operation. It is also unclear what to do when an existing server goes down.
This is where consistent hashing comes in. The idea is simple: take a circle and treat it as the interval on which the hash function is defined. By applying the hash function to the set of keys (blue dots) and servers (green dots) we can place them on the circle.
To determine which server a key lives on, find the key on the circle and move clockwise to the nearest server.
Now, when a server goes down (becomes unavailable), only the unavailable data has to be loaded onto a new server. All the other hashes keep their location, i.e. they stay consistent.
When a new server is added, its neighbour shares part of its data with it.
That's basically it. In practice the following trick is also used: a server can be marked on the circle not with a single point but with several.
What does this give us?
- a more even distribution of data across the servers
- when a server goes down, its data is redistributed not to one neighbour but to several servers, spreading the load
- when a new server is added, its points can be made 'active' gradually, one after another, preventing a surge of load on the server
- if the servers' configurations differ, for example in disk size, the amount of data per server can be controlled by its number of points: more points means a larger share of the circle belongs to that server and therefore more data
We store the server hashes in some kind of tree, for example a Red-Black tree. Looking up the server for a key then takes O(log n).
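For illustration (not from the original post), a minimal ring with virtual points per server could look like this:

import scala.collection.immutable.TreeMap
import scala.util.hashing.MurmurHash3

class HashRing(servers: Seq[String], pointsPerServer: Int = 100) {
  private def h(s: String): Int = MurmurHash3.stringHash(s)

  // each server is placed on the ring with several virtual points
  private val ring: TreeMap[Int, String] = TreeMap(
    (for (s <- servers; i <- 0 until pointsPerServer) yield h(s"$s#$i") -> s): _*)

  // the key is served by the first server clockwise from its hash
  def serverFor(key: String): String = {
    val clockwise = ring.iteratorFrom(h(key))
    if (clockwise.hasNext) clockwise.next()._2 else ring.head._2   // wrap around
  }
}

Increasing pointsPerServer smooths the distribution at the cost of a larger ring structure.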
Imagine you are reviewing code changes or doing a 3-way merge where, among other things, the name of some method has changed. Personally, I don't want to go through dozens of files and check that all usages of the method were changed accordingly.
I would like to see it in a more declarative way. One phrase saying ‘method Foo.bar() has changed to Foo.bar2()’ would be enough. Imagine you could accept this particular change so that the tool then ignores it, making the whole picture clearer.
For statically typed object-oriented languages I can imagine a number of refactoring types where this could be useful: method extraction, all kinds of renames, replacing inheritance with delegation, replacing constructors with factory methods, and so on.
How difficult would it be to implement a semantics-aware diff on top of IntelliJ IDEA?