block: Adding ROW scheduling algorithm
This patch adds the implementation of a new scheduling algorithm - ROW.
The policy of this algorithm is to prioritize READ requests over WRITE
requests as much as possible, without starving the WRITE requests.
Change-Id: I4ed52ea21d43b0e7c0769b2599779a3d3869c519
Signed-off-by: Tatyana Brokhman <tlinder@codeaurora.org>
block: ROW: Correct minimum values of ROW tunable parameters
The ROW scheduling algorithm exposes several tunable parameters.
This patch updates the minimum allowed values for those parameters.
Change-Id: I5ec19d54b694e2e83ad5376bd99cc91f084967f5
Signed-off-by: Tatyana Brokhman <tlinder@codeaurora.org>
block: ROW: Fix forced dispatch
This patch fixes forced dispatch in the ROW scheduling algorithm.
When the dispatch function is called with the forced flag set, we
can't delay the dispatch of the requests that are in the scheduler queues.
Thus, when dispatch is called with force turned on, we need to cancel
any pending idling, or not idle at all.
Change-Id: I3aa0da33ad7b59c0731c696f1392b48525b52ddc
Signed-off-by: Tatyana Brokhman <tlinder@codeaurora.org>
block: Add support for reinsert a dispatched req
Add support for reinserting a dispatched request back to the
scheduler's internal data structures.
This capability is used by the device driver when it chooses to
interrupt the current request transmission and execute another (more
urgent) pending request. For example: interrupting a long write in order
to handle a pending read. The device driver re-inserts the
remaining write request back to the scheduler, to be rescheduled
for transmission later on.
Also add an API for verifying whether the current scheduler
supports the reinsert mechanism. If the reinsert mechanism isn't
supported by the scheduler, this code path is never activated.
Change-Id: I5c982a66b651ebf544aae60063ac8a340d79e67f
Signed-off-by: Tatyana Brokhman <tlinder@codeaurora.org>
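For illustration, a minimal driver-side sketch of how this API might be
used (my_drv_preempt_current and the surrounding logic are hypothetical;
only blk_reinsert_req_sup() and blk_reinsert_request() are introduced by
this patch):

#include <linux/blkdev.h>

/* Hypothetical driver helper: stop the ongoing transfer and hand the
 * partially served request back to the scheduler, if it supports it. */
static void my_drv_preempt_current(struct request_queue *q,
                                   struct request *cur_rq)
{
        unsigned long flags;

        if (!blk_reinsert_req_sup(q))
                return; /* scheduler can't take the request back */

        /* driver-specific: abort the ongoing transfer in hardware here */

        spin_lock_irqsave(q->queue_lock, flags);
        if (blk_reinsert_request(q, cur_rq))
                pr_err("my_drv: failed to reinsert request\n");
        spin_unlock_irqrestore(q->queue_lock, flags);
}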
block: Add API for urgent request handling
This patch adds support in the block & elevator layers for handling
urgent requests. The decision whether a request is urgent or not is taken
by the scheduler. The urgent request notification is passed to the underlying
block device driver (eMMC for example). The block device driver may decide to
interrupt the currently running low priority request in order to serve the new
urgent request. By doing so, READ latency is greatly reduced in read & write
collision scenarios.
Note that if the current scheduler doesn't implement the urgent request
mechanism, this code path is never activated.
Change-Id: I8aa74b9b45c0d3a2221bd4e82ea76eb4103e7cfa
Signed-off-by: Tatyana Brokhman <tlinder@codeaurora.org>
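For illustration, a sketch of how an underlying driver might hook into
this mechanism (the my_drv_* names are hypothetical; blk_urgent_request()
and the request_fn_proc prototype are the pieces added/used by this
series):

#include <linux/blkdev.h>

/* Hypothetical callback: called by __blk_run_queue() instead of
 * request_fn when the scheduler reports an urgent request pending. */
static void my_drv_urgent_request(struct request_queue *q)
{
        /* e.g. wake the driver thread so it can preempt the current
         * low-priority transfer and fetch the urgent request */
}

static void my_drv_setup_queue(struct request_queue *q)
{
        blk_urgent_request(q, my_drv_urgent_request);
}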
row: Add support for reinserting an already dispatched req
Add support for reinserting an already dispatched request back to the
scheduler's internal data structures.
The request will be reinserted at the head of the queue it was
dispatched from, as if it had never been dispatched.
Change-Id: I70954df300774409c25b5821465fb3aa33d8feb5
Signed-off-by: Tatyana Brokhman <tlinder@codeaurora.org>
block:row: fix idling mechanism in ROW
This patch addresses the following issues found in the ROW idling
mechanism:
1. Fix the delay passed to queue_delayed_work (pass the actual delay
and not the absolute time at which to start the work)
2. Change the idle time and the idling-trigger frequency to be
HZ dependent (instead of using msecs_to_jiffies())
3. Destroy idle_workqueue() in queue_exit
Change-Id: If86513ad6b4be44fb7a860f29bd2127197d8d5bf
Signed-off-by: Tatyana Brokhman <tlinder@codeaurora.org>
cfq-iosched: Fix null pointer dereference
NULL pointer dereference can happen in cfq_choose_cfqg()
when there are no cfq groups to select other than the
current serving group. Prevent this by adding a NULL
check before dereferencing.
Unable to handle kernel NULL pointer dereference at virtual address
[<c02502cc>] (cfq_dispatch_requests+0x368/0x8c0) from
[<c0243f30>] (blk_peek_request+0x220/0x25c)
[<c0243f30>] (blk_peek_request+0x220/0x25c) from
[<c0243f74>] (blk_fetch_request+0x8/0x1c)
[<c0243f74>] (blk_fetch_request+0x8/0x1c) from
[<c041cedc>] (mmc_queue_thread+0x58/0x120)
[<c041cedc>] (mmc_queue_thread+0x58/0x120) from
[<c00ad310>] (kthread+0x84/0x90)
[<c00ad310>] (kthread+0x84/0x90) from
[<c000eeac>] (kernel_thread_exit+0x0/0x8)
CRs-Fixed: 416466
Change-Id: I1fab93a4334b53e1d7c5dcc8f93cff174bae0d5e
Signed-off-by: Sujit Reddy Thumma <sthumma@codeaurora.org>
row: Add support for urgent request handling
This patch adds support for handling urgent requests.
A ROW queue can be marked as "urgent": if it was un-served in the last
dispatch cycle and a request is added to it, an urgent-request
notification is issued to the block device driver.
The block device driver may choose to stop the transmission of the
current ongoing request in order to handle the urgent one. For example: a long
WRITE may be stopped to handle an urgent READ. This decreases READ latency.
Change-Id: I84954c13f5e3b1b5caeadc9fe1f9aa21208cb35e
Signed-off-by: Tatyana Brokhman <tlinder@codeaurora.org>
block: row: Add some debug information on ROW queues
1. Add a counter for the number of requests on each queue.
2. Add a function to print the queues' status (number of requests
currently on each queue and number of requests already dispatched
in the current dispatch cycle).
Change-Id: I1e98b9ca33853e6e6a8ddc53240f6cd6981e6024
Signed-off-by: Tatyana Brokhman <tlinder@codeaurora.org>
block: row: Insert dispatch_quantum into struct row_queue
There is really no point in keeping the dispatch quantum
of a queue outside of it. By moving it into the row_queue
structure we spare an extra level of indirection when accessing it.
Change-Id: Ic77571818b643e71f9aafbb2ca93d0a92158b199
Signed-off-by: Tatyana Brokhman <tlinder@codeaurora.org>
block: row: fix sysfs functions - idle_time conversion
idle_time was updated to be stored in msec instead of jiffies,
so there is no need to convert the value when reading it from the
user or displaying it back.
Change-Id: I58e074b204e90a90536d32199ac668112966e9cf
Signed-off-by: Tatyana Brokhman <tlinder@codeaurora.org>
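For illustration, a user-space sketch of updating the idle time through
sysfs once it is msec based (the device name mmcblk0 is an assumption;
the attribute name matches rd_idle_data exposed by row_attrs in
row-iosched.c):

#include <stdio.h>

int main(void)
{
        /* Assumes ROW is the active scheduler for mmcblk0. */
        FILE *f = fopen("/sys/block/mmcblk0/queue/iosched/rd_idle_data", "w");

        if (!f)
                return 1;
        fprintf(f, "10\n");     /* value is interpreted as msec, not jiffies */
        fclose(f);
        return 0;
}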
block: row: Aggregate row_queue parameters to one structure
Each ROW queue has several parameters whose default values are defined
in separate arrays. This patch aggregates all the default values into
one array.
The values in question are:
- whether idling is enabled for the queue
- the queue's dispatch quantum
- whether the queue can notify on urgent requests
Change-Id: I3821b0a042542295069b340406a16b1000873ec6
Signed-off-by: Tatyana Brokhman <tlinder@codeaurora.org>
block: row: Dispatch requests according to their io-priority
This patch implements "application hints", a way for the issuing
application to notify the scheduler of the priority of its request.
This is done by setting the io-priority of the request.
This patch reuses the already existing io-priority mechanism developed
for CFQ. Please refer to Documentation/block/ioprio.txt for usage
examples and explanations.
Change-Id: I228ec8e52161b424242bb7bb133418dc8b73925a
Signed-off-by: Tatyana Brokhman <tlinder@codeaurora.org>
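For illustration, a user-space sketch of setting the RT io-priority class
for the calling process, so that its requests land in the ROW high
priority queues (constants follow Documentation/block/ioprio.txt; glibc
has no wrapper, so the raw syscall is used):

#include <unistd.h>
#include <sys/syscall.h>

#define IOPRIO_WHO_PROCESS      1
#define IOPRIO_CLASS_RT         1
#define IOPRIO_CLASS_SHIFT      13
#define IOPRIO_PRIO_VALUE(class, data)  (((class) << IOPRIO_CLASS_SHIFT) | (data))

int main(void)
{
        /* Requests issued by this process after the call carry the RT
         * io-priority class and are queued on the high priority queues. */
        return syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0 /* self */,
                       IOPRIO_PRIO_VALUE(IOPRIO_CLASS_RT, 0));
}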
block: row: Idling mechanism re-factoring
At the moment, idling in ROW is implemented by delayed work that uses
jiffies granularity, which is not very accurate. This patch replaces
the current idling mechanism implementation with the hrtimer API, which
gives nanosecond resolution (instead of jiffies).
Change-Id: I86c7b1776d035e1d81571894b300228c8b8f2d92
Signed-off-by: Tatyana Brokhman <tlinder@codeaurora.org>
block: row: Don't notify URGENT if there are un-completed urgent req
When the ROW scheduler reports to the block layer that there is an urgent
request pending, the device driver may decide to stop the transmission
of the current request in order to handle the urgent one. If the currently
transmitted request is itself an urgent request - we don't want it to be
stopped.
Due to the above, the ROW scheduler won't notify of an urgent request if
there are urgent requests in flight.
Change-Id: I2fa186d911b908ec7611682b378b9cdc48637ac7
Signed-off-by: Tatyana Brokhman <tlinder@codeaurora.org>
block: row: Update initial values of ROW data structures
This patch sets the initial values of internal ROW
parameters.
Change-Id: I38132062a7fcbe2e58b9cc757e55caac64d013dc
Signed-off-by: Tatyana Brokhman <tlinder@codeaurora.org>
[smuckle@codeaurora.org: ported from msm-3.7]
Signed-off-by: Steve Muckle <smuckle@codeaurora.org>
block: add REQ_URGENT to request flags
This patch adds a new flag, to be used in the cmd_flags field of struct
request, for marking a request as urgent.
An urgent request is one that should be given priority over the currently
handled (regular) request by the device driver. The decision on a request's
urgency is taken by the scheduler.
Change-Id: Ic20470987ef23410f1d0324f96f00578f7df8717
Signed-off-by: Tatyana Brokhman <tlinder@codeaurora.org>
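For illustration, a sketch of how a driver could react to the new flag on
a fetched request (the driver logic is hypothetical; REQ_URGENT and the
urgent marking in blk_fetch_request() come from this series):

#include <linux/blkdev.h>

/* Hypothetical fetch path in a driver thread; called with
 * q->queue_lock held, as blk_fetch_request() requires. */
static void my_drv_fetch_and_issue(struct request_queue *q)
{
        struct request *rq = blk_fetch_request(q);

        if (!rq)
                return;
        if (rq->cmd_flags & REQ_URGENT) {
                /* flag was set by the scheduler: preempt the currently
                 * running low-priority transfer before issuing this one */
        }
        /* ... issue the transfer ... */
}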
diff --git a/Documentation/block/row-iosched.txt b/Documentation/block/row-iosched.txt
new file mode 100644
index 0000000..987bd88
--- /dev/null
+++ b/Documentation/block/row-iosched.txt
@@ -0,0 +1,117 @@
+Introduction
+============
+
+The ROW scheduling algorithm will be used in mobile devices as the default
+block layer IO scheduling algorithm. ROW stands for "READ Over WRITE",
+which is the main request dispatch policy of this algorithm.
+
+The ROW IO scheduler was developed with the needs of mobile devices in
+mind. In mobile devices we favor user experience over everything else,
+thus we want to give READ IO requests as much priority as possible.
+The main idea of the ROW scheduling policy is:
+if there are READ requests in the pipe - dispatch them, but don't starve
+the WRITE requests too much.
+
+Software description
+====================
+The requests are kept in queues according to their priority. The
+dispatching of requests is done in a Round Robin manner with a
+different slice for each queue. The dispatch quantum for a specific
+queue is defined according to the queue's priority. READ queues are
+given a bigger dispatch quantum than the WRITE queues within a dispatch
+cycle.
+
+At the moment there are 7 types of queues the requests are
+distributed to:
+- High priority READ queue
+- High priority Synchronous WRITE queue
+- Regular priority READ queue
+- Regular priority Synchronous WRITE queue
+- Regular priority WRITE queue
+- Low priority READ queue
+- Low priority Synchronous WRITE queue
+
+If in a certain dispatch cycle one of the queues was empty and didn't
+use its quantum, that queue will be marked as "un-served". If we're in
+the middle of a dispatch cycle dispatching from queue Y and a request
+arrives for queue X that was un-served in the previous cycle, and X's
+priority is higher than Y's, queue Y will be preempted in favor of
+queue X. This doesn't mean that the cycle is restarted: the "dispatched"
+counter of queue Y will remain unchanged. Once queue X uses up its quantum
+(or there are no more requests left on it) we'll switch back to queue Y
+and allow it to finish its quantum.
+
+For READ request queues we allow idling within a dispatch quantum in
+order to give the application a chance to insert more requests. Idling
+means adding some extra time for serving a certain queue even if the
+queue is empty. Idling is enabled if we identify that the application is
+inserting requests at a high frequency.
+
+For idling on READ queues we use a timer mechanism. When the timer expires,
+if there are requests in the scheduler, we signal the underlying driver
+(for example the MMC driver) to fetch another request for dispatch.
+
+The ROW algorithm takes the scheduling policy one step further, making
+it a bit more "user-needs oriented", by allowing the application to
+hint at the urgency of its requests. For example: even among the READ
+requests, several may be more urgent for completion than others. These
+will go to the High priority READ queue, which is given a bigger
+dispatch quantum than any other queue.
+
+The ROW scheduler will support special services for block devices that
+support High Priority Requests. That is, the scheduler may inform the
+device of urgent requests using a new callback, urgent_request_fn.
+In addition it will support rescheduling of requests that were
+interrupted. For example, if the device issues a long write request and
+a sudden high priority read interrupt pops in, the scheduler will
+inform the device about the urgent request, so the device can stop the
+current write request and serve the high priority read request. In such
+a case the device may also send back to the scheduler the remainder of
+the interrupted write request, such that the scheduler may continue
+sending high priority requests without the need to interrupt the
+ongoing write again and again. The write remainder will be sent later on
+according to the scheduler policy.
+
+Design
+======
+Existing algorithms (cfq, deadline) sort the io requests according to LBA.
+When deciding on the next request to dispatch they choose the request
+closest to the current disk head position (from handling the last
+dispatched request). This is done in order to reduce disk head
+movement to a minimum.
+We feel that this functionality isn't really needed in mobile devices.
+Usually applications that write/read large chunks of data insert the
+requests in already-sorted LBA order. Thus dealing with sort trees adds
+unnecessary complexity.
+
+We're planning to try this enhancement in the future to check whether
+performance is influenced by it.
+
+SMP/multi-core
+==============
+At the moment the code is accessed from 2 contexts:
+- Application context (from block/elevator layer): adding the requests.
+- Underlying driver context (for example the mmc driver thread): dispatching
+ the requests and notifying on completion.
+
+One lock is used to synchronize between the two. This lock is provided
+by the underlying driver along with the dispatch queue.
+
+Config options
+==============
+1. hp_read_quantum: dispatch quantum for the high priority READ queue
+2. rp_read_quantum: dispatch quantum for the regular priority READ queue
+3. hp_swrite_quantum: dispatch quantum for the high priority Synchronous
+ WRITE queue
+4. rp_swrite_quantum: dispatch quantum for the regular priority
+ Synchronous WRITE queue
+5. rp_write_quantum: dispatch quantum for the regular priority WRITE
+ queue
+6. lp_read_quantum: dispatch quantum for the low priority READ queue
+7. lp_swrite_quantum: dispatch quantum for the low priority Synchronous
+ WRITE queue
+8. rd_idle_data: how long to idle on the read queues, in msec (in case
+   idling is enabled on that queue).
+9. rd_idle_data_freq: frequency of inserting READ requests that will
+   trigger idling. This is the time in msec between inserting two READ
+   requests
+
diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 4142d91..5fd98ea 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -32,6 +32,17 @@
a new point in the service tree and doing a batch of IO from there
in case of expiry.
+config IOSCHED_ROW
+ tristate "ROW I/O scheduler"
+ default y
+ ---help---
+ The ROW I/O scheduler gives priority to READ requests over the
+ WRITE requests when dispatching, without starving WRITE requests.
+ Requests are kept in priority queues. Dispatching is done in a RR
+ manner, where the dispatch quantum for each queue is calculated
+ according to the queue's priority.
+ Most suitable for mobile devices.
+
config IOSCHED_CFQ
tristate "CFQ I/O scheduler"
# If BLK_CGROUP is a module, CFQ has to be built as module.
@@ -64,6 +75,16 @@
config DEFAULT_DEADLINE
bool "Deadline" if IOSCHED_DEADLINE=y
+ config DEFAULT_ROW
+ bool "ROW" if IOSCHED_ROW=y
+ help
+ The ROW I/O scheduler gives priority to READ requests
+ over the WRITE requests when dispatching, without starving
+ WRITE requests. Requests are kept in priority queues.
+ Dispatching is done in a RR manner, where the dispatch quantum
+ for each queue is defined according to the queue's priority.
+ Most suitable for mobile devices.
+
config DEFAULT_CFQ
bool "CFQ" if IOSCHED_CFQ=y
@@ -75,6 +96,7 @@
config DEFAULT_IOSCHED
string
default "deadline" if DEFAULT_DEADLINE
+ default "row" if DEFAULT_ROW
default "cfq" if DEFAULT_CFQ
default "noop" if DEFAULT_NOOP
diff --git a/block/Makefile b/block/Makefile
index 436b220..b5e6637 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -14,6 +14,7 @@
obj-$(CONFIG_BLK_DEV_THROTTLING) += blk-throttle.o
obj-$(CONFIG_IOSCHED_NOOP) += noop-iosched.o
obj-$(CONFIG_IOSCHED_DEADLINE) += deadline-iosched.o
+obj-$(CONFIG_IOSCHED_ROW) += row-iosched.o
obj-$(CONFIG_IOSCHED_CFQ) += cfq-iosched.o
obj-$(CONFIG_IOSCHED_TEST) += test-iosched.o
diff --git a/block/blk-core.c b/block/blk-core.c
index fcb2a1b..f51b05a 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -298,13 +298,26 @@
* Description:
* See @blk_run_queue. This variant must be called with the queue lock
* held and interrupts disabled.
+ * Device driver will be notified of an urgent request
+ * pending under the following conditions:
+ * 1. The driver and the current scheduler support urgent request handling
+ * 2. There is an urgent request pending in the scheduler
+ * 3. There isn't already an urgent request in flight, meaning previously
+ * notified urgent request completed (!q->notified_urgent)
*/
void __blk_run_queue(struct request_queue *q)
{
if (unlikely(blk_queue_stopped(q)))
return;
- q->request_fn(q);
+ if (!q->notified_urgent &&
+ q->elevator->type->ops.elevator_is_urgent_fn &&
+ q->urgent_request_fn &&
+ q->elevator->type->ops.elevator_is_urgent_fn(q)) {
+ q->notified_urgent = true;
+ q->urgent_request_fn(q);
+ } else
+ q->request_fn(q);
}
EXPORT_SYMBOL(__blk_run_queue);
@@ -1071,6 +1084,50 @@
}
EXPORT_SYMBOL(blk_requeue_request);
+/**
+ * blk_reinsert_request() - Insert a request back to the scheduler
+ * @q: request queue
+ * @rq: request to be inserted
+ *
+ * This function inserts the request back to the scheduler as if
+ * it was never dispatched.
+ *
+ * Return: 0 on success, error code on failure
+ */
+int blk_reinsert_request(struct request_queue *q, struct request *rq)
+{
+ if (unlikely(!rq) || unlikely(!q))
+ return -EIO;
+
+ blk_delete_timer(rq);
+ blk_clear_rq_complete(rq);
+ trace_block_rq_requeue(q, rq);
+
+ if (blk_rq_tagged(rq))
+ blk_queue_end_tag(q, rq);
+
+ BUG_ON(blk_queued_rq(rq));
+
+ return elv_reinsert_request(q, rq);
+}
+EXPORT_SYMBOL(blk_reinsert_request);
+
+/**
+ * blk_reinsert_req_sup() - check whether the scheduler supports
+ * reinsertion of requests
+ * @q: request queue
+ *
+ * Returns true if the current scheduler supports reinserting
+ * requests, false otherwise.
+ */
+bool blk_reinsert_req_sup(struct request_queue *q)
+{
+ if (unlikely(!q))
+ return false;
+ return q->elevator->type->ops.elevator_reinsert_req_fn ? true : false;
+}
+EXPORT_SYMBOL(blk_reinsert_req_sup);
+
static void add_acct_request(struct request_queue *q, struct request *rq,
int where)
{
@@ -2074,8 +2131,17 @@
struct request *rq;
rq = blk_peek_request(q);
- if (rq)
+ if (rq) {
+ /*
+ * Assumption: the next request fetched from scheduler after we
+ * notified "urgent request pending" - will be the urgent one
+ */
+ if (q->notified_urgent && !q->dispatched_urgent) {
+ q->dispatched_urgent = true;
+ (void)blk_mark_rq_urgent(rq);
+ }
blk_start_request(rq);
+ }
return rq;
}
EXPORT_SYMBOL(blk_fetch_request);
diff --git a/block/blk-settings.c b/block/blk-settings.c
index d3234fc..579328c 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -100,6 +100,18 @@
EXPORT_SYMBOL_GPL(blk_queue_lld_busy);
/**
+ * blk_urgent_request() - Set an urgent_request handler function for queue
+ * @q: queue
+ * @fn: handler for urgent requests
+ *
+ */
+void blk_urgent_request(struct request_queue *q, request_fn_proc *fn)
+{
+ q->urgent_request_fn = fn;
+}
+EXPORT_SYMBOL(blk_urgent_request);
+
+/**
* blk_set_default_limits - reset limits to default values
* @lim: the queue_limits structure to reset
*
diff --git a/block/blk.h b/block/blk.h
index d45be87..a52209f 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -39,6 +39,7 @@
*/
enum rq_atomic_flags {
REQ_ATOM_COMPLETE = 0,
+ REQ_ATOM_URGENT = 1,
};
/*
@@ -55,6 +56,16 @@
clear_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags);
}
+static inline int blk_mark_rq_urgent(struct request *rq)
+{
+ return test_and_set_bit(REQ_ATOM_URGENT, &rq->atomic_flags);
+}
+
+static inline void blk_clear_rq_urgent(struct request *rq)
+{
+ clear_bit(REQ_ATOM_URGENT, &rq->atomic_flags);
+}
+
/*
* Internal elevator interface
*/
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 3c38536..32629e2 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -2305,6 +2305,9 @@
{
struct cfq_group *cfqg = cfq_get_next_cfqg(cfqd);
+ if (!cfqg)
+ return;
+
cfqd->serving_group = cfqg;
/* Restore the workload type data */
diff --git a/block/elevator.c b/block/elevator.c
index 74fd51b..efec457 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -585,6 +585,41 @@
__elv_add_request(q, rq, ELEVATOR_INSERT_REQUEUE);
}
+/**
+ * elv_reinsert_request() - Insert a request back to the scheduler
+ * @q: request queue where request should be inserted
+ * @rq: request to be inserted
+ *
+ * This function returns the request back to the scheduler to be
+ * inserted as if it was never dispatched
+ *
+ * Return: 0 on success, error code on failure
+ */
+int elv_reinsert_request(struct request_queue *q, struct request *rq)
+{
+ int res;
+
+ if (!q->elevator->type->ops.elevator_reinsert_req_fn)
+ return -EPERM;
+
+ res = q->elevator->type->ops.elevator_reinsert_req_fn(q, rq);
+ if (!res) {
+ /*
+ * it already went through dequeue, we need to decrement the
+ * in_flight count again
+ */
+ if (blk_account_rq(rq)) {
+ q->in_flight[rq_is_sync(rq)]--;
+ if (rq->cmd_flags & REQ_SORTED)
+ elv_deactivate_rq(q, rq);
+ }
+ rq->cmd_flags &= ~REQ_STARTED;
+ q->nr_sorted++;
+ }
+
+ return res;
+}
+
void elv_drain_elevator(struct request_queue *q)
{
static int printed;
@@ -779,6 +814,11 @@
{
struct elevator_queue *e = q->elevator;
+ if (test_bit(REQ_ATOM_URGENT, &rq->atomic_flags)) {
+ q->notified_urgent = false;
+ q->dispatched_urgent = false;
+ blk_clear_rq_urgent(rq);
+ }
/*
* request is released from the driver, io must be done
*/
diff --git a/block/row-iosched.c b/block/row-iosched.c
new file mode 100644
index 0000000..bdb6abd
--- /dev/null
+++ b/block/row-iosched.c
@@ -0,0 +1,929 @@
+/*
+ * ROW (Read Over Write) I/O scheduler.
+ *
+ * Copyright (c) 2012-2013, The Linux Foundation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 and
+ * only version 2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+/* See Documentation/block/row-iosched.txt */
+
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/blkdev.h>
+#include <linux/elevator.h>
+#include <linux/bio.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/init.h>
+#include <linux/compiler.h>
+#include <linux/blktrace_api.h>
+#include <linux/hrtimer.h>
+
+/*
+ * enum row_queue_prio - Priorities of the ROW queues
+ *
+ * This enum defines the priorities (and the number of queues)
+ * the requests will be distributed to. The higher the priority -
+ * the bigger the "bus time" (or the dispatch quantum) given
+ * to that queue.
+ * ROWQ_PRIO_HIGH_READ is the highest priority queue.
+ *
+ */
+enum row_queue_prio {
+ ROWQ_PRIO_HIGH_READ = 0,
+ ROWQ_PRIO_HIGH_SWRITE,
+ ROWQ_PRIO_REG_READ,
+ ROWQ_PRIO_REG_SWRITE,
+ ROWQ_PRIO_REG_WRITE,
+ ROWQ_PRIO_LOW_READ,
+ ROWQ_PRIO_LOW_SWRITE,
+ ROWQ_MAX_PRIO,
+};
+
+/*
+ * The following indexes define the distribution of ROW queues according to
+ * priorities. Each index defines the first queue in that priority group.
+ */
+#define ROWQ_HIGH_PRIO_IDX ROWQ_PRIO_HIGH_READ
+#define ROWQ_REG_PRIO_IDX ROWQ_PRIO_REG_READ
+#define ROWQ_LOW_PRIO_IDX ROWQ_PRIO_LOW_READ
+
+/**
+ * struct row_queue_params - ROW queue parameters
+ * @idling_enabled: Flag indicating whether idling is enabled on
+ * the queue
+ * @quantum: Number of requests to be dispatched from this queue
+ * in a dispatch cycle
+ * @is_urgent: Flag indicating whether the queue can notify on
+ * urgent requests
+ *
+ */
+struct row_queue_params {
+ bool idling_enabled;
+ int quantum;
+ bool is_urgent;
+};
+
+/*
+ * This array holds the default values of the different configurables
+ * for each ROW queue. Each row of the array holds the following values:
+ * {idling_enabled, quantum, is_urgent}
+ * Each row corresponds to a queue with the same index (according to
+ * enum row_queue_prio)
+ * Note: The quantum values are relative within each priority class. For example:
+ * for every 10 high priority read requests, 1 high priority sync
+ * write will be dispatched.
+ * For every 100 regular read requests, 1 regular write request will
+ * be dispatched.
+ */
+static const struct row_queue_params row_queues_def[] = {
+/* idling_enabled, quantum, is_urgent */
+ {true, 10, true}, /* ROWQ_PRIO_HIGH_READ */
+ {false, 1, true}, /* ROWQ_PRIO_HIGH_SWRITE */
+ {true, 100, true}, /* ROWQ_PRIO_REG_READ */
+ {false, 1, false}, /* ROWQ_PRIO_REG_SWRITE */
+ {false, 1, false}, /* ROWQ_PRIO_REG_WRITE */
+ {false, 1, false}, /* ROWQ_PRIO_LOW_READ */
+ {false, 1, false} /* ROWQ_PRIO_LOW_SWRITE */
+};
+
+/* Default values for idling on read queues (in msec) */
+#define ROW_IDLE_TIME_MSEC 5
+#define ROW_READ_FREQ_MSEC 20
+
+/**
+ * struct rowq_idling_data - parameters for idling on the queue
+ * @last_insert_time: time the last request was inserted
+ * to the queue
+ * @begin_idling: flag indicating whether we should idle
+ *
+ */
+struct rowq_idling_data {
+ ktime_t last_insert_time;
+ bool begin_idling;
+};
+
+/**
+ * struct row_queue - requests grouping structure
+ * @rdata: parent row_data structure
+ * @fifo: fifo of requests
+ * @prio: queue priority (enum row_queue_prio)
+ * @nr_dispatched: number of requests already dispatched in
+ * the current dispatch cycle
+ * @nr_req: number of requests in queue
+ * @disp_quantum: number of requests this queue may
+ * dispatch in a dispatch cycle
+ * @idle_data: data for idling on queues
+ *
+ */
+struct row_queue {
+ struct row_data *rdata;
+ struct list_head fifo;
+ enum row_queue_prio prio;
+
+ unsigned int nr_dispatched;
+
+ unsigned int nr_req;
+ int disp_quantum;
+
+ /* used only for READ queues */
+ struct rowq_idling_data idle_data;
+};
+
+/**
+ * struct idling_data - data for idling on empty rqueue
+ * @idle_time_ms: idling duration (msec)
+ * @freq_ms: min time between two requests that
+ * trigger idling (msec)
+ * @hr_timer: idling timer
+ * @idle_work: the work to be scheduled when idling timer expires
+ * @idling_queue_idx: index of the queues we're idling on
+ *
+ */
+struct idling_data {
+ s64 idle_time_ms;
+ s64 freq_ms;
+
+ struct hrtimer hr_timer;
+ struct work_struct idle_work;
+ enum row_queue_prio idling_queue_idx;
+};
+
+/**
+ * struct row_data - Per block device rqueue structure
+ * @dispatch_queue: dispatch rqueue
+ * @row_queues: array of priority request queues
+ * @rd_idle_data: data for idling after READ request
+ * @nr_reqs: nr_reqs[0] holds the number of all READ requests in
+ * scheduler, nr_reqs[1] holds the number of all WRITE
+ * requests in scheduler
+ * @nr_urgent_in_flight: number of uncompleted urgent requests
+ * (both reads and writes)
+ * @cycle_flags: used for marking unserved queues
+ *
+ */
+struct row_data {
+ struct request_queue *dispatch_queue;
+
+ struct row_queue row_queues[ROWQ_MAX_PRIO];
+
+ struct idling_data rd_idle_data;
+ unsigned int nr_reqs[2];
+ unsigned int nr_urgent_in_flight;
+
+ unsigned int cycle_flags;
+};
+
+#define RQ_ROWQ(rq) ((struct row_queue *) ((rq)->elv.priv[0]))
+
+#define row_log(q, fmt, args...) \
+ blk_add_trace_msg(q, "%s():" fmt , __func__, ##args)
+#define row_log_rowq(rdata, rowq_id, fmt, args...) \
+ blk_add_trace_msg(rdata->dispatch_queue, "rowq%d " fmt, \
+ rowq_id, ##args)
+
+static inline void row_mark_rowq_unserved(struct row_data *rd,
+ enum row_queue_prio qnum)
+{
+ rd->cycle_flags |= (1 << qnum);
+}
+
+static inline void row_clear_rowq_unserved(struct row_data *rd,
+ enum row_queue_prio qnum)
+{
+ rd->cycle_flags &= ~(1 << qnum);
+}
+
+static inline int row_rowq_unserved(struct row_data *rd,
+ enum row_queue_prio qnum)
+{
+ return rd->cycle_flags & (1 << qnum);
+}
+
+static inline void __maybe_unused row_dump_queues_stat(struct row_data *rd)
+{
+ int i;
+
+ row_log(rd->dispatch_queue, " Queues status:");
+ for (i = 0; i < ROWQ_MAX_PRIO; i++)
+ row_log(rd->dispatch_queue,
+ "queue%d: dispatched= %d, nr_req=%d", i,
+ rd->row_queues[i].nr_dispatched,
+ rd->row_queues[i].nr_req);
+}
+
+/******************** Static helper functions ***********************/
+static void kick_queue(struct work_struct *work)
+{
+ struct idling_data *read_data =
+ container_of(work, struct idling_data, idle_work);
+ struct row_data *rd =
+ container_of(read_data, struct row_data, rd_idle_data);
+
+ blk_run_queue(rd->dispatch_queue);
+}
+
+
+static enum hrtimer_restart row_idle_hrtimer_fn(struct hrtimer *hr_timer)
+{
+ struct idling_data *read_data =
+ container_of(hr_timer, struct idling_data, hr_timer);
+ struct row_data *rd =
+ container_of(read_data, struct row_data, rd_idle_data);
+
+ row_log_rowq(rd, rd->rd_idle_data.idling_queue_idx,
+ "Performing delayed work");
+ /* Mark idling process as done */
+ rd->row_queues[rd->rd_idle_data.idling_queue_idx].
+ idle_data.begin_idling = false;
+ rd->rd_idle_data.idling_queue_idx = ROWQ_MAX_PRIO;
+
+ if (!rd->nr_reqs[READ] && !rd->nr_reqs[WRITE])
+ row_log(rd->dispatch_queue, "No requests in scheduler");
+ else
+ kblockd_schedule_work(rd->dispatch_queue,
+ &read_data->idle_work);
+ return HRTIMER_NORESTART;
+}
+
+/******************* Elevator callback functions *********************/
+
+/*
+ * row_add_request() - Add request to the scheduler
+ * @q: requests queue
+ * @rq: request to add
+ *
+ */
+static void row_add_request(struct request_queue *q,
+ struct request *rq)
+{
+ struct row_data *rd = (struct row_data *)q->elevator->elevator_data;
+ struct row_queue *rqueue = RQ_ROWQ(rq);
+ s64 diff_ms;
+
+ list_add_tail(&rq->queuelist, &rqueue->fifo);
+ rd->nr_reqs[rq_data_dir(rq)]++;
+ rqueue->nr_req++;
+ rq_set_fifo_time(rq, jiffies); /* for statistics*/
+
+ if (row_queues_def[rqueue->prio].idling_enabled) {
+ if (rd->rd_idle_data.idling_queue_idx == rqueue->prio &&
+ hrtimer_active(&rd->rd_idle_data.hr_timer)) {
+ (void)hrtimer_cancel(&rd->rd_idle_data.hr_timer);
+ row_log_rowq(rd, rqueue->prio,
+ "Canceled delayed work on %d",
+ rd->rd_idle_data.idling_queue_idx);
+ rd->rd_idle_data.idling_queue_idx = ROWQ_MAX_PRIO;
+ }
+ diff_ms = ktime_to_ms(ktime_sub(ktime_get(),
+ rqueue->idle_data.last_insert_time));
+ if (unlikely(diff_ms < 0)) {
+ pr_err("ROW BUG: %s diff_ms < 0", __func__);
+ rqueue->idle_data.begin_idling = false;
+ return;
+ }
+ if (diff_ms < rd->rd_idle_data.freq_ms) {
+ rqueue->idle_data.begin_idling = true;
+ row_log_rowq(rd, rqueue->prio, "Enable idling");
+ } else {
+ rqueue->idle_data.begin_idling = false;
+ row_log_rowq(rd, rqueue->prio, "Disable idling (%ldms)",
+ (long)diff_ms);
+ }
+
+ rqueue->idle_data.last_insert_time = ktime_get();
+ }
+ if (row_queues_def[rqueue->prio].is_urgent &&
+ row_rowq_unserved(rd, rqueue->prio)) {
+ row_log_rowq(rd, rqueue->prio,
+ "added urgent request (total on queue=%d)",
+ rqueue->nr_req);
+ rq->cmd_flags |= REQ_URGENT;
+ } else
+ row_log_rowq(rd, rqueue->prio,
+ "added request (total on queue=%d)", rqueue->nr_req);
+}
+
+/**
+ * row_reinsert_req() - Reinsert request back to the scheduler
+ * @q: requests queue
+ * @rq: request to add
+ *
+ * Reinsert the given request back to the queue it was
+ * dispatched from as if it was never dispatched.
+ *
+ * Returns 0 on success, error code otherwise
+ */
+static int row_reinsert_req(struct request_queue *q,
+ struct request *rq)
+{
+ struct row_data *rd = q->elevator->elevator_data;
+ struct row_queue *rqueue = RQ_ROWQ(rq);
+
+ if (rqueue->prio >= ROWQ_MAX_PRIO) {
+ pr_err("\n\n%s:ROW BUG: row_reinsert_req() rqueue->prio = %d\n",
+ rq->rq_disk->disk_name, rqueue->prio);
+ blk_dump_rq_flags(rq, "");
+ return -EIO;
+ }
+
+ list_add(&rq->queuelist, &rqueue->fifo);
+ rd->nr_reqs[rq_data_dir(rq)]++;
+ rqueue->nr_req++;
+
+ row_log_rowq(rd, rqueue->prio,
+ "request reinserted (total on queue=%d)", rqueue->nr_req);
+
+ return 0;
+}
+
+static void row_completed_req(struct request_queue *q, struct request *rq)
+{
+ struct row_data *rd = q->elevator->elevator_data;
+
+ if (rq->cmd_flags & REQ_URGENT) {
+ if (!rd->nr_urgent_in_flight) {
+ pr_err("ROW BUG: %s() nr_urgent_in_flight = 0",
+ __func__);
+ return;
+ }
+ rd->nr_urgent_in_flight--;
+ }
+}
+
+/**
+ * row_urgent_pending() - Return TRUE if there is an urgent
+ * request on scheduler
+ * @q: requests queue
+ */
+static bool row_urgent_pending(struct request_queue *q)
+{
+ struct row_data *rd = q->elevator->elevator_data;
+ int i;
+
+ if (rd->nr_urgent_in_flight) {
+ row_log(rd->dispatch_queue, "%d urgent requests in flight",
+ rd->nr_urgent_in_flight);
+ return false;
+ }
+
+ for (i = ROWQ_HIGH_PRIO_IDX; i < ROWQ_REG_PRIO_IDX; i++)
+ if (!list_empty(&rd->row_queues[i].fifo)) {
+ row_log_rowq(rd, i,
+ "Urgent (high prio) request pending");
+ return true;
+ }
+
+ for (i = ROWQ_REG_PRIO_IDX; i < ROWQ_MAX_PRIO; i++)
+ if (row_queues_def[i].is_urgent && row_rowq_unserved(rd, i) &&
+ !list_empty(&rd->row_queues[i].fifo)) {
+ row_log_rowq(rd, i, "Urgent request pending");
+ return true;
+ }
+
+ return false;
+}
+
+/**
+ * row_remove_request() - Remove given request from scheduler
+ * @q: requests queue
+ * @rq: request to remove
+ *
+ */
+static void row_remove_request(struct request_queue *q,
+ struct request *rq)
+{
+ struct row_data *rd = (struct row_data *)q->elevator->elevator_data;
+ struct row_queue *rqueue = RQ_ROWQ(rq);
+
+ rq_fifo_clear(rq);
+ rqueue->nr_req--;
+ rd->nr_reqs[rq_data_dir(rq)]--;
+}
+
+/*
+ * row_dispatch_insert() - move request to dispatch queue
+ * @rd: pointer to struct row_data
+ * @queue_idx: index of the row_queue to dispatch from
+ *
+ * This function moves the next request to dispatch from
+ * the given queue (row_queues[queue_idx]) to the dispatch queue
+ *
+ */
+static void row_dispatch_insert(struct row_data *rd, int queue_idx)
+{
+ struct request *rq;
+
+ rq = rq_entry_fifo(rd->row_queues[queue_idx].fifo.next);
+ row_remove_request(rd->dispatch_queue, rq);
+ elv_dispatch_add_tail(rd->dispatch_queue, rq);
+ rd->row_queues[queue_idx].nr_dispatched++;
+ row_clear_rowq_unserved(rd, queue_idx);
+ row_log_rowq(rd, queue_idx, " Dispatched request nr_disp = %d",
+ rd->row_queues[queue_idx].nr_dispatched);
+ if (rq->cmd_flags & REQ_URGENT)
+ rd->nr_urgent_in_flight++;
+}
+
+/*
+ * row_get_ioprio_class_to_serve() - Return the next I/O priority
+ * class to dispatch requests from
+ * @rd: pointer to struct row_data
+ * @force: flag indicating if forced dispatch
+ *
+ * This function returns the next I/O priority class to serve
+ * {IOPRIO_CLASS_NONE, IOPRIO_CLASS_RT, IOPRIO_CLASS_BE, IOPRIO_CLASS_IDLE}.
+ * If there are no more requests in scheduler or if we're idling on some queue
+ * IOPRIO_CLASS_NONE will be returned.
+ * If idling is scheduled on a lower priority queue than the one that needs
+ * to be served, it will be canceled.
+ *
+ */
+static int row_get_ioprio_class_to_serve(struct row_data *rd, int force)
+{
+ int i;
+ int ret = IOPRIO_CLASS_NONE;
+
+ if (!rd->nr_reqs[READ] && !rd->nr_reqs[WRITE]) {
+ row_log(rd->dispatch_queue, "No more requests in scheduler");
+ goto check_idling;
+ }
+
+ /* First, go over the high priority queues */
+ for (i = 0; i < ROWQ_REG_PRIO_IDX; i++) {
+ if (!list_empty(&rd->row_queues[i].fifo)) {
+ if (hrtimer_active(&rd->rd_idle_data.hr_timer)) {
+ (void)hrtimer_cancel(
+ &rd->rd_idle_data.hr_timer);
+ row_log_rowq(rd,
+ rd->rd_idle_data.idling_queue_idx,
+ "Canceling delayed work on %d. RT pending",
+ rd->rd_idle_data.idling_queue_idx);
+ rd->rd_idle_data.idling_queue_idx =
+ ROWQ_MAX_PRIO;
+ }
+ ret = IOPRIO_CLASS_RT;
+ goto done;
+ }
+ }
+
+ /*
+ * At the moment idling is implemented only for READ queues.
+ * If enabled on WRITE, this needs updating
+ */
+ if (hrtimer_active(&rd->rd_idle_data.hr_timer)) {
+ row_log(rd->dispatch_queue, "Delayed work pending. Exiting");
+ goto done;
+ }
+check_idling:
+ /* Check for (high priority) idling and enable if needed */
+ for (i = 0; i < ROWQ_REG_PRIO_IDX && !force; i++) {
+ if (rd->row_queues[i].idle_data.begin_idling &&
+ row_queues_def[i].idling_enabled)
+ goto initiate_idling;
+ }
+
+ /* Regular priority queues */
+ for (i = ROWQ_REG_PRIO_IDX; i < ROWQ_LOW_PRIO_IDX; i++) {
+ if (list_empty(&rd->row_queues[i].fifo)) {
+ /* We can idle only if this is not a forced dispatch */
+ if (rd->row_queues[i].idle_data.begin_idling &&
+ !force && row_queues_def[i].idling_enabled)
+ goto initiate_idling;
+ } else {
+ ret = IOPRIO_CLASS_BE;
+ goto done;
+ }
+ }
+
+ if (rd->nr_reqs[READ] || rd->nr_reqs[WRITE])
+ ret = IOPRIO_CLASS_IDLE;
+ goto done;
+
+initiate_idling:
+ hrtimer_start(&rd->rd_idle_data.hr_timer,
+ ktime_set(0, rd->rd_idle_data.idle_time_ms * NSEC_PER_MSEC),
+ HRTIMER_MODE_REL);
+
+ rd->rd_idle_data.idling_queue_idx = i;
+ row_log_rowq(rd, i, "Scheduled delayed work on %d. exiting", i);
+
+done:
+ return ret;
+}
+
+static void row_restart_cycle(struct row_data *rd,
+ int start_idx, int end_idx)
+{
+ int i;
+
+ row_dump_queues_stat(rd);
+ for (i = start_idx; i < end_idx; i++) {
+ if (rd->row_queues[i].nr_dispatched <
+ rd->row_queues[i].disp_quantum)
+ row_mark_rowq_unserved(rd, i);
+ rd->row_queues[i].nr_dispatched = 0;
+ }
+ row_log(rd->dispatch_queue, "Restarting cycle for class @ %d-%d",
+ start_idx, end_idx);
+}
+
+/*
+ * row_get_next_queue() - selects the next queue to dispatch from
+ * @q: requests queue
+ * @rd: pointer to struct row_data
+ * @start_idx/end_idx: indexes in the row_queues array to select a queue
+ * from.
+ *
+ * Return the index of the queue to dispatch from, or an error code on failure.
+ *
+ */
+static int row_get_next_queue(struct request_queue *q, struct row_data *rd,
+ int start_idx, int end_idx)
+{
+ int i = start_idx;
+ bool restart = true;
+ int ret = -EIO;
+
+ do {
+ if (list_empty(&rd->row_queues[i].fifo) ||
+ rd->row_queues[i].nr_dispatched >=
+ rd->row_queues[i].disp_quantum) {
+ i++;
+ if (i == end_idx && restart) {
+ /* Restart cycle for this priority class */
+ row_restart_cycle(rd, start_idx, end_idx);
+ i = start_idx;
+ restart = false;
+ }
+ } else {
+ ret = i;
+ break;
+ }
+ } while (i < end_idx);
+
+ return ret;
+}
+
+/*
+ * row_dispatch_requests() - selects the next request to dispatch
+ * @q: requests queue
+ * @force: flag indicating if forced dispatch
+ *
+ * Return 0 if no requests were moved to the dispatch queue.
+ * 1 otherwise
+ *
+ */
+static int row_dispatch_requests(struct request_queue *q, int force)
+{
+ struct row_data *rd = (struct row_data *)q->elevator->elevator_data;
+ int ret = 0, currq, ioprio_class_to_serve, start_idx, end_idx;
+
+ if (force && hrtimer_active(&rd->rd_idle_data.hr_timer)) {
+ (void)hrtimer_cancel(&rd->rd_idle_data.hr_timer);
+ row_log_rowq(rd, rd->rd_idle_data.idling_queue_idx,
+ "Canceled delayed work on %d - forced dispatch",
+ rd->rd_idle_data.idling_queue_idx);
+ rd->rd_idle_data.idling_queue_idx = ROWQ_MAX_PRIO;
+ }
+
+ ioprio_class_to_serve = row_get_ioprio_class_to_serve(rd, force);
+ row_log(rd->dispatch_queue, "Dispatching from %d priority class",
+ ioprio_class_to_serve);
+
+ switch (ioprio_class_to_serve) {
+ case IOPRIO_CLASS_NONE:
+ goto done;
+ case IOPRIO_CLASS_RT:
+ start_idx = ROWQ_HIGH_PRIO_IDX;
+ end_idx = ROWQ_REG_PRIO_IDX;
+ break;
+ case IOPRIO_CLASS_BE:
+ start_idx = ROWQ_REG_PRIO_IDX;
+ end_idx = ROWQ_LOW_PRIO_IDX;
+ break;
+ case IOPRIO_CLASS_IDLE:
+ start_idx = ROWQ_LOW_PRIO_IDX;
+ end_idx = ROWQ_MAX_PRIO;
+ break;
+ default:
+ pr_err("%s(): Invalid I/O priority class", __func__);
+ goto done;
+ }
+
+ currq = row_get_next_queue(q, rd, start_idx, end_idx);
+
+ /* Dispatch */
+ if (currq >= 0) {
+ row_dispatch_insert(rd, currq);
+ ret = 1;
+ }
+done:
+ return ret;
+}
+
+/*
+ * row_init_queue() - Init scheduler data structures
+ * @q: requests queue
+ *
+ * Return pointer to struct row_data to be saved in elevator for
+ * this dispatch queue
+ *
+ */
+static void *row_init_queue(struct request_queue *q)
+{
+
+ struct row_data *rdata;
+ int i;
+
+ rdata = kmalloc_node(sizeof(*rdata),
+ GFP_KERNEL | __GFP_ZERO, q->node);
+ if (!rdata)
+ return NULL;
+
+ memset(rdata, 0, sizeof(*rdata));
+ for (i = 0; i < ROWQ_MAX_PRIO; i++) {
+ INIT_LIST_HEAD(&rdata->row_queues[i].fifo);
+ rdata->row_queues[i].disp_quantum = row_queues_def[i].quantum;
+ rdata->row_queues[i].rdata = rdata;
+ rdata->row_queues[i].prio = i;
+ rdata->row_queues[i].idle_data.begin_idling = false;
+ rdata->row_queues[i].idle_data.last_insert_time =
+ ktime_set(0, 0);
+ }
+
+ /*
+ * Currently idling is enabled only for READ queues. If we want to
+ * enable it for write queues also, note that idling frequency will
+ * be the same in both cases
+ */
+ rdata->rd_idle_data.idle_time_ms = ROW_IDLE_TIME_MSEC;
+ rdata->rd_idle_data.freq_ms = ROW_READ_FREQ_MSEC;
+ hrtimer_init(&rdata->rd_idle_data.hr_timer,
+ CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+ rdata->rd_idle_data.hr_timer.function = &row_idle_hrtimer_fn;
+
+ INIT_WORK(&rdata->rd_idle_data.idle_work, kick_queue);
+
+ rdata->rd_idle_data.idling_queue_idx = ROWQ_MAX_PRIO;
+ rdata->dispatch_queue = q;
+
+ return rdata;
+}
+
+/*
+ * row_exit_queue() - called on unloading the ROW scheduler
+ * @e: pointer to struct elevator_queue
+ *
+ */
+static void row_exit_queue(struct elevator_queue *e)
+{
+ struct row_data *rd = (struct row_data *)e->elevator_data;
+ int i;
+
+ for (i = 0; i < ROWQ_MAX_PRIO; i++)
+ BUG_ON(!list_empty(&rd->row_queues[i].fifo));
+ if (hrtimer_cancel(&rd->rd_idle_data.hr_timer))
+ pr_err("ROW BUG: idle timer was active!");
+ rd->rd_idle_data.idling_queue_idx = ROWQ_MAX_PRIO;
+ kfree(rd);
+}
+
+/*
+ * row_merged_requests() - Called when 2 requests are merged
+ * @q: requests queue
+ * @rq: request the two requests were merged into
+ * @next: request that was merged
+ */
+static void row_merged_requests(struct request_queue *q, struct request *rq,
+ struct request *next)
+{
+ struct row_queue *rqueue = RQ_ROWQ(next);
+
+ list_del_init(&next->queuelist);
+ rqueue->nr_req--;
+
+ rqueue->rdata->nr_reqs[rq_data_dir(rq)]--;
+}
+
+/*
+ * row_get_queue_prio() - Get queue priority for a given request
+ *
+ * This is a helper function whose purpose is to determine which
+ * ROW queue the given request should be added to (and
+ * dispatched from later on)
+ *
+ */
+static enum row_queue_prio row_get_queue_prio(struct request *rq)
+{
+ const int data_dir = rq_data_dir(rq);
+ const bool is_sync = rq_is_sync(rq);
+ enum row_queue_prio q_type = ROWQ_MAX_PRIO;
+ int ioprio_class = IOPRIO_PRIO_CLASS(rq->elv.icq->ioc->ioprio);
+
+ switch (ioprio_class) {
+ case IOPRIO_CLASS_RT:
+ if (data_dir == READ)
+ q_type = ROWQ_PRIO_HIGH_READ;
+ else if (is_sync)
+ q_type = ROWQ_PRIO_HIGH_SWRITE;
+ else {
+ pr_err("%s:%s(): got a simple write from RT_CLASS. How???",
+ rq->rq_disk->disk_name, __func__);
+ q_type = ROWQ_PRIO_REG_WRITE;
+ }
+ rq->cmd_flags |= REQ_URGENT;
+ break;
+ case IOPRIO_CLASS_IDLE:
+ if (data_dir == READ)
+ q_type = ROWQ_PRIO_LOW_READ;
+ else if (is_sync)
+ q_type = ROWQ_PRIO_LOW_SWRITE;
+ else {
+ pr_err("%s:%s(): got a simple write from IDLE_CLASS. How???",
+ rq->rq_disk->disk_name, __func__);
+ q_type = ROWQ_PRIO_REG_WRITE;
+ }
+ break;
+ case IOPRIO_CLASS_NONE:
+ case IOPRIO_CLASS_BE:
+ default:
+ if (data_dir == READ)
+ q_type = ROWQ_PRIO_REG_READ;
+ else if (is_sync)
+ q_type = ROWQ_PRIO_REG_SWRITE;
+ else
+ q_type = ROWQ_PRIO_REG_WRITE;
+ break;
+ }
+
+ return q_type;
+}
+
+/*
+ * row_set_request() - Set ROW data structures associated with this request.
+ * @q: requests queue
+ * @rq: pointer to the request
+ * @gfp_mask: ignored
+ *
+ */
+static int
+row_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
+{
+ struct row_data *rd = (struct row_data *)q->elevator->elevator_data;
+ unsigned long flags;
+
+ spin_lock_irqsave(q->queue_lock, flags);
+ rq->elv.priv[0] =
+ (void *)(&rd->row_queues[row_get_queue_prio(rq)]);
+ spin_unlock_irqrestore(q->queue_lock, flags);
+
+ return 0;
+}
+
+/********** Helper sysfs functions/definitions for ROW attributes ******/
+static ssize_t row_var_show(int var, char *page)
+{
+ return snprintf(page, 100, "%d\n", var);
+}
+
+static ssize_t row_var_store(int *var, const char *page, size_t count)
+{
+ int err;
+ err = kstrtoul(page, 10, (unsigned long *)var);
+
+ return count;
+}
+
+#define SHOW_FUNCTION(__FUNC, __VAR, __CONV) \
+static ssize_t __FUNC(struct elevator_queue *e, char *page) \
+{ \
+ struct row_data *rowd = e->elevator_data; \
+ int __data = __VAR; \
+ if (__CONV) \
+ __data = jiffies_to_msecs(__data); \
+ return row_var_show(__data, (page)); \
+}
+SHOW_FUNCTION(row_hp_read_quantum_show,
+ rowd->row_queues[ROWQ_PRIO_HIGH_READ].disp_quantum, 0);
+SHOW_FUNCTION(row_rp_read_quantum_show,
+ rowd->row_queues[ROWQ_PRIO_REG_READ].disp_quantum, 0);
+SHOW_FUNCTION(row_hp_swrite_quantum_show,
+ rowd->row_queues[ROWQ_PRIO_HIGH_SWRITE].disp_quantum, 0);
+SHOW_FUNCTION(row_rp_swrite_quantum_show,
+ rowd->row_queues[ROWQ_PRIO_REG_SWRITE].disp_quantum, 0);
+SHOW_FUNCTION(row_rp_write_quantum_show,
+ rowd->row_queues[ROWQ_PRIO_REG_WRITE].disp_quantum, 0);
+SHOW_FUNCTION(row_lp_read_quantum_show,
+ rowd->row_queues[ROWQ_PRIO_LOW_READ].disp_quantum, 0);
+SHOW_FUNCTION(row_lp_swrite_quantum_show,
+ rowd->row_queues[ROWQ_PRIO_LOW_SWRITE].disp_quantum, 0);
+SHOW_FUNCTION(row_rd_idle_data_show, rowd->rd_idle_data.idle_time_ms, 0);
+SHOW_FUNCTION(row_rd_idle_data_freq_show, rowd->rd_idle_data.freq_ms, 0);
+#undef SHOW_FUNCTION
+
+#define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV) \
+static ssize_t __FUNC(struct elevator_queue *e, \
+ const char *page, size_t count) \
+{ \
+ struct row_data *rowd = e->elevator_data; \
+ int __data; \
+ int ret = row_var_store(&__data, (page), count); \
+ if (__CONV) \
+ __data = (int)msecs_to_jiffies(__data); \
+ if (__data < (MIN)) \
+ __data = (MIN); \
+ else if (__data > (MAX)) \
+ __data = (MAX); \
+ *(__PTR) = __data; \
+ return ret; \
+}
+STORE_FUNCTION(row_hp_read_quantum_store,
+&rowd->row_queues[ROWQ_PRIO_HIGH_READ].disp_quantum, 1, INT_MAX, 0);
+STORE_FUNCTION(row_rp_read_quantum_store,
+ &rowd->row_queues[ROWQ_PRIO_REG_READ].disp_quantum,
+ 1, INT_MAX, 0);
+STORE_FUNCTION(row_hp_swrite_quantum_store,
+ &rowd->row_queues[ROWQ_PRIO_HIGH_SWRITE].disp_quantum,
+ 1, INT_MAX, 0);
+STORE_FUNCTION(row_rp_swrite_quantum_store,
+ &rowd->row_queues[ROWQ_PRIO_REG_SWRITE].disp_quantum,
+ 1, INT_MAX, 0);
+STORE_FUNCTION(row_rp_write_quantum_store,
+ &rowd->row_queues[ROWQ_PRIO_REG_WRITE].disp_quantum,
+ 1, INT_MAX, 0);
+STORE_FUNCTION(row_lp_read_quantum_store,
+ &rowd->row_queues[ROWQ_PRIO_LOW_READ].disp_quantum,
+ 1, INT_MAX, 0);
+STORE_FUNCTION(row_lp_swrite_quantum_store,
+ &rowd->row_queues[ROWQ_PRIO_LOW_SWRITE].disp_quantum,
+ 1, INT_MAX, 0);
+STORE_FUNCTION(row_rd_idle_data_store, &rowd->rd_idle_data.idle_time_ms,
+ 1, INT_MAX, 0);
+STORE_FUNCTION(row_rd_idle_data_freq_store, &rowd->rd_idle_data.freq_ms,
+ 1, INT_MAX, 0);
+
+#undef STORE_FUNCTION
+
+#define ROW_ATTR(name) \
+ __ATTR(name, S_IRUGO|S_IWUSR, row_##name##_show, \
+ row_##name##_store)
+
+static struct elv_fs_entry row_attrs[] = {
+ ROW_ATTR(hp_read_quantum),
+ ROW_ATTR(rp_read_quantum),
+ ROW_ATTR(hp_swrite_quantum),
+ ROW_ATTR(rp_swrite_quantum),
+ ROW_ATTR(rp_write_quantum),
+ ROW_ATTR(lp_read_quantum),
+ ROW_ATTR(lp_swrite_quantum),
+ ROW_ATTR(rd_idle_data),
+ ROW_ATTR(rd_idle_data_freq),
+ __ATTR_NULL
+};
+
+static struct elevator_type iosched_row = {
+ .ops = {
+ .elevator_merge_req_fn = row_merged_requests,
+ .elevator_dispatch_fn = row_dispatch_requests,
+ .elevator_add_req_fn = row_add_request,
+ .elevator_reinsert_req_fn = row_reinsert_req,
+ .elevator_is_urgent_fn = row_urgent_pending,
+ .elevator_completed_req_fn = row_completed_req,
+ .elevator_former_req_fn = elv_rb_former_request,
+ .elevator_latter_req_fn = elv_rb_latter_request,
+ .elevator_set_req_fn = row_set_request,
+ .elevator_init_fn = row_init_queue,
+ .elevator_exit_fn = row_exit_queue,
+ },
+ .icq_size = sizeof(struct io_cq),
+ .icq_align = __alignof__(struct io_cq),
+ .elevator_attrs = row_attrs,
+ .elevator_name = "row",
+ .elevator_owner = THIS_MODULE,
+};
+
+static int __init row_init(void)
+{
+ elv_register(&iosched_row);
+ return 0;
+}
+
+static void __exit row_exit(void)
+{
+ elv_unregister(&iosched_row);
+}
+
+module_init(row_init);
+module_exit(row_exit);
+
+MODULE_LICENSE("GPLv2");
+MODULE_DESCRIPTION("Read Over Write IO scheduler");
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 9c49d17..2b4542a 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -135,7 +135,7 @@
* throttling rules. Don't do it again. */
/* request only flags */
- __REQ_SORTED, /* elevator knows about this request */
+ __REQ_SORTED = __REQ_RAHEAD, /* elevator knows about this request */
__REQ_SOFTBARRIER, /* may not be passed by ioscheduler */
__REQ_NOMERGE, /* don't touch this for merging */
__REQ_STARTED, /* drive already may have started this one */
@@ -151,6 +151,7 @@
__REQ_IO_STAT, /* account I/O stat */
__REQ_MIXED_MERGE, /* merge of different types, fail separately */
__REQ_SANITIZE, /* sanitize */
+ __REQ_URGENT, /* urgent request */
__REQ_NR_BITS, /* stops here */
};
@@ -163,6 +164,7 @@
#define REQ_PRIO (1 << __REQ_PRIO)
#define REQ_DISCARD (1 << __REQ_DISCARD)
#define REQ_SANITIZE (1 << __REQ_SANITIZE)
+#define REQ_URGENT (1 << __REQ_URGENT)
#define REQ_NOIDLE (1 << __REQ_NOIDLE)
#define REQ_FAILFAST_MASK \
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index e76b0ae..6502841 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -282,6 +282,7 @@
struct request_list rq;
request_fn_proc *request_fn;
+ request_fn_proc *urgent_request_fn;
make_request_fn *make_request_fn;
prep_rq_fn *prep_rq_fn;
unprep_rq_fn *unprep_rq_fn;
@@ -365,6 +366,8 @@
struct list_head icq_list;
struct queue_limits limits;
+ bool notified_urgent;
+ bool dispatched_urgent;
/*
* sg stuff
@@ -673,6 +676,8 @@
extern struct request *blk_make_request(struct request_queue *, struct bio *,
gfp_t);
extern void blk_requeue_request(struct request_queue *, struct request *);
+extern int blk_reinsert_request(struct request_queue *q, struct request *rq);
+extern bool blk_reinsert_req_sup(struct request_queue *q);
extern void blk_add_request_payload(struct request *rq, struct page *page,
unsigned int len);
extern int blk_rq_check_limits(struct request_queue *q, struct request *rq);
@@ -822,6 +827,7 @@
extern struct request_queue *blk_init_queue(request_fn_proc *, spinlock_t *);
extern struct request_queue *blk_init_allocated_queue(struct request_queue *,
request_fn_proc *, spinlock_t *);
+extern void blk_urgent_request(struct request_queue *q, request_fn_proc *fn);
extern void blk_cleanup_queue(struct request_queue *);
extern void blk_queue_make_request(struct request_queue *, make_request_fn *);
extern void blk_queue_bounce_limit(struct request_queue *, u64);
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 7d4e035..b36b28f 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -22,6 +22,9 @@
typedef int (elevator_dispatch_fn) (struct request_queue *, int);
typedef void (elevator_add_req_fn) (struct request_queue *, struct request *);
+typedef int (elevator_reinsert_req_fn) (struct request_queue *,
+ struct request *);
+typedef bool (elevator_is_urgent_fn) (struct request_queue *);
typedef struct request *(elevator_request_list_fn) (struct request_queue *, struct request *);
typedef void (elevator_completed_req_fn) (struct request_queue *, struct request *);
typedef int (elevator_may_queue_fn) (struct request_queue *, int);
@@ -46,6 +49,9 @@
elevator_dispatch_fn *elevator_dispatch_fn;
elevator_add_req_fn *elevator_add_req_fn;
+ elevator_reinsert_req_fn *elevator_reinsert_req_fn;
+ elevator_is_urgent_fn *elevator_is_urgent_fn;
+
elevator_activate_req_fn *elevator_activate_req_fn;
elevator_deactivate_req_fn *elevator_deactivate_req_fn;
@@ -122,6 +128,7 @@
extern void elv_bio_merged(struct request_queue *q, struct request *,
struct bio *);
extern void elv_requeue_request(struct request_queue *, struct request *);
+extern int elv_reinsert_request(struct request_queue *, struct request *);
extern struct request *elv_former_request(struct request_queue *, struct request *);
extern struct request *elv_latter_request(struct request_queue *, struct request *);
extern int elv_register_queue(struct request_queue *q);