linux-stable.git/include/linux, branch v3.12.4

take anon inode allocation to libfs.c

2013-12-08T15:29:16+00:00

commit 6987843ff7e836ea65b554905aec34d2fad05c94 upstream.

Signed-off-by: Al Viro 
Cc: Benjamin LaHaise 
Signed-off-by: Greg Kroah-Hartman

mm: numa: return the number of base pages altered by protection changes

2013-12-08T15:29:15+00:00

commit 72403b4a0fbdf433c1fe0127e49864658f6f6468 upstream.

Commit 0255d4918480 ("mm: Account for a THP NUMA hinting update as one
PTE update") was added to account for the number of PTE updates when
marking pages prot_numa.  task_numa_work was using the old return value
to track how much address space had been updated.  Altering the return
value causes the scanner to do more work than it is configured or
documented to in a single unit of work.

This patch reverts that commit and accounts for the number of THP
updates separately in vmstat.  It is up to the administrator to
interpret the pair of values correctly.  This is a straight-forward
operation and likely to only be of interest when actively debugging NUMA
balancing problems.

The impact of this patch is that the NUMA PTE scanner will scan slower
when THP is enabled and workloads may converge slower as a result.  On
the flip size system CPU usage should be lower than recent tests
reported.  This is an illustrative example of a short single JVM specjbb
test

specjbb
                       3.12.0                3.12.0
                      vanilla      acctupdates
TPut 1      26143.00 (  0.00%)     25747.00 ( -1.51%)
TPut 7     185257.00 (  0.00%)    183202.00 ( -1.11%)
TPut 13    329760.00 (  0.00%)    346577.00 (  5.10%)
TPut 19    442502.00 (  0.00%)    460146.00 (  3.99%)
TPut 25    540634.00 (  0.00%)    549053.00 (  1.56%)
TPut 31    512098.00 (  0.00%)    519611.00 (  1.47%)
TPut 37    461276.00 (  0.00%)    474973.00 (  2.97%)
TPut 43    403089.00 (  0.00%)    414172.00 (  2.75%)

              3.12.0      3.12.0
             vanillaacctupdates
User         5169.64     5184.14
System        100.45       80.02
Elapsed       252.75      251.85

Performance is similar but note the reduction in system CPU time.  While
this showed a performance gain, it will not be universal but at least
it'll be behaving as documented.  The vmstats are obviously different but
here is an obvious interpretation of them from mmtests.

                                3.12.0      3.12.0
                               vanillaacctupdates
NUMA page range updates        1408326    11043064
NUMA huge PMD updates                0       21040
NUMA PTE updates               1408326      291624

"NUMA page range updates" == nr_pte_updates and is the value returned to
the NUMA pte scanner.  NUMA huge PMD updates were the number of THP
updates which in combination can be used to calculate how many ptes were
updated from userspace.

Signed-off-by: Mel Gorman 
Reported-by: Alex Thorlton 
Reviewed-by: Rik van Riel 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Mel Gorman 
Signed-off-by: Greg Kroah-Hartman

netfilter: push reasm skb through instead of original frag skbs

2013-12-08T15:29:13+00:00

[ Upstream commit 6aafeef03b9d9ecf255f3a80ed85ee070260e1ae ]

Pushing original fragments through causes several problems. For example
for matching, frags may not be matched correctly. Take following
example:


On HOSTA do:
ip6tables -I INPUT -p icmpv6 -j DROP
ip6tables -I INPUT -p icmpv6 -m icmp6 --icmpv6-type 128 -j ACCEPT

and on HOSTB you do:
ping6 HOSTA -s2000    (MTU is 1500)

Incoming echo requests will be filtered out on HOSTA. This issue does
not occur with smaller packets than MTU (where fragmentation does not happen)


As was discussed previously, the only correct solution seems to be to use
reassembled skb instead of separete frags. Doing this has positive side
effects in reducing sk_buff by one pointer (nfct_reasm) and also the reams
dances in ipvs and conntrack can be removed.

Future plan is to remove net/ipv6/netfilter/nf_conntrack_reasm.c
entirely and use code in net/ipv6/reassembly.c instead.

Signed-off-by: Jiri Pirko 
Acked-by: Julian Anastasov 
Signed-off-by: Marcelo Ricardo Leitner 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman

net: rework recvmsg handler msg_name and msg_namelen logic

2013-12-08T15:29:13+00:00

[ Upstream commit f3d3342602f8bcbf37d7c46641cb9bca7618eb1c ]

This patch now always passes msg->msg_namelen as 0. recvmsg handlers must
set msg_namelen to the proper size <= sizeof(struct sockaddr_storage)
to return msg_name to the user.

This prevents numerous uninitialized memory leaks we had in the
recvmsg handlers and makes it harder for new code to accidentally leak
uninitialized memory.

Optimize for the case recvfrom is called with NULL as address. We don't
need to copy the address at all, so set it to NULL before invoking the
recvmsg handler. We can do so, because all the recvmsg handlers must
cope with the case a plain read() is called on them. read() also sets
msg_name to NULL.

Also document these changes in include/linux/net.h as suggested by David
Miller.

Changes since RFC:

Set msg->msg_name = NULL if user specified a NULL in msg_name but had a
non-null msg_namelen in verify_iovec/verify_compat_iovec. This doesn't
affect sendto as it would bail out earlier while trying to copy-in the
address. It also more naturally reflects the logic by the callers of
verify_iovec.

With this change in place I could remove "
if (!uaddr || msg_sys->msg_namelen == 0)
	msg->msg_name = NULL
".

This change does not alter the user visible error logic as we ignore
msg_namelen as long as msg_name is NULL.

Also remove two unnecessary curly brackets in ___sys_recvmsg and change
comments to netdev style.

Cc: David Miller 
Suggested-by: Eric Dumazet 
Signed-off-by: Hannes Frederic Sowa 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman

random32: fix off-by-one in seeding requirement

2013-12-08T15:29:11+00:00

[ Upstream commit 51c37a70aaa3f95773af560e6db3073520513912 ]

For properly initialising the Tausworthe generator [1], we have
a strict seeding requirement, that is, s1 > 1, s2 > 7, s3 > 15.

Commit 697f8d0348 ("random32: seeding improvement") introduced
a __seed() function that imposes boundary checks proposed by the
errata paper [2] to properly ensure above conditions.

However, we're off by one, as the function is implemented as:
"return (x < m) ? x + m : x;", and called with __seed(X, 1),
__seed(X, 7), __seed(X, 15). Thus, an unwanted seed of 1, 7, 15
would be possible, whereas the lower boundary should actually
be of at least 2, 8, 16, just as GSL does. Fix this, as otherwise
an initialization with an unwanted seed could have the effect
that Tausworthe's PRNG properties cannot not be ensured.

Note that this PRNG is *not* used for cryptography in the kernel.

 [1] http://www.iro.umontreal.ca/~lecuyer/myftp/papers/tausme.ps
 [2] http://www.iro.umontreal.ca/~lecuyer/myftp/papers/tausme2.ps

Joint work with Hannes Frederic Sowa.

Fixes: 697f8d0348a6 ("random32: seeding improvement")
Cc: Stephen Hemminger 
Cc: Florian Weimer 
Cc: Theodore Ts'o 
Signed-off-by: Daniel Borkmann 
Signed-off-by: Hannes Frederic Sowa 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman

mfd: rtsx: Modify rts5249_optimize_phy

2013-12-04T19:06:13+00:00

commit 26b818511c6562ce372566c219a2ef1afea35fe6 upstream.

In some platforms, specially Thinkpad series, rts5249 won't be
initialized properly. So we need adjust some phy parameters to
improve the compatibility issue.

It is a little different between simulation and real chip. We have
no idea about which configuration is better before tape-out. We set
default settings according to simulation, but need to tune these
parameters after getting the real chip.

I can't explain every change in detail here. The below information is
just a rough description:

PHY_REG_REV: Disable internal clkreq_tx, enable rx_pwst
PHY_BPCR: No change, just turn the magic number to macro definitions
PHY_PCR: Change OOBS sensitivity, from 60mV to 90mV
PHY_RCR2: Control charge-pump current automatically
PHY_FLD4: Use TX cmu reference clock
PHY_RDR: Change RXDSEL from 30nF to 1.9nF
PHY_RCR1: Change the duration between adp_st and asserting cp_en from
0.32 us to 0.64us
PHY_FLD3: Adjust internal timers
PHY_TUNE: Fine tune the regulator12 output voltage

Signed-off-by: Wei WANG 
Signed-off-by: Lee Jones 
Cc: Chris Ball 
Signed-off-by: Greg Kroah-Hartman

mtd: map: fixed bug in 64-bit systems

2013-12-04T19:05:25+00:00

commit a4d62babf988fe5dfde24437fa135ef147bc7aa0 upstream.

Hardware:
	CPU: XLP832,the 64-bit OS
	NOR Flash:S29GL128S 128M
Software:
	Kernel:2.6.32.41
	Filesystem:JFFS2
When writing files, errors appear:
	Write len 182  but return retlen 180
	Write of 182 bytes at 0x072c815c failed. returned -5, retlen 180
	Write len 186  but return retlen 184
	Write of 186 bytes at 0x072caff4 failed. returned -5, retlen 184
These errors exist only in 64-bit systems,not in 32-bit systems. After analysis, we
found that the left shift operation is wrong in map_word_load_partial. For instance:
	unsigned char buf[3] ={0x9e,0x3a,0xea};
	map_bankwidth(map) is 4;
	for (i=0; i < 3; i++) {
		int bitpos;
		bitpos = (map_bankwidth(map)-1-i)*8;
		orig.x[0] &= ~(0xff << bitpos);
		orig.x[0] |= buf[i] << bitpos;
	}

The value of orig.x[0] is expected to be 0x9e3aeaff, but in this situation(64-bit
System) we'll get the wrong value of 0xffffffff9e3aeaff due to the 64-bit sign
extension:
buf[i] is defined as "unsigned char" and the left-shift operation will convert it
to the type of "signed int", so when left-shift buf[i] by 24 bits, the final result
will get the wrong value: 0xffffffff9e3aeaff.

If the left-shift bits are less than 24, then sign extension will not occur. Whereas
the bankwidth of the nor flash we used is 4, therefore this BUG emerges.

Signed-off-by: Pang Xunlei 
Signed-off-by: Zhang Yi 
Signed-off-by: Lu Zhongjun 
Signed-off-by: Brian Norris 
Signed-off-by: Greg Kroah-Hartman

ipc, msg: fix message length check for negative values

2013-12-04T19:05:22+00:00

commit 4e9b45a19241354daec281d7a785739829b52359 upstream.

On 64 bit systems the test for negative message sizes is bogus as the
size, which may be positive when evaluated as a long, will get truncated
to an int when passed to load_msg().  So a long might very well contain a
positive value but when truncated to an int it would become negative.

That in combination with a small negative value of msg_ctlmax (which will
be promoted to an unsigned type for the comparison against msgsz, making
it a big positive value and therefore make it pass the check) will lead to
two problems: 1/ The kmalloc() call in alloc_msg() will allocate a too
small buffer as the addition of alen is effectively a subtraction.  2/ The
copy_from_user() call in load_msg() will first overflow the buffer with
userland data and then, when the userland access generates an access
violation, the fixup handler copy_user_handle_tail() will try to fill the
remainder with zeros -- roughly 4GB.  That almost instantly results in a
system crash or reset.

  ,-[ Reproducer (needs to be run as root) ]--
  | #include 
  | #include 
  | #include 
  | #include 
  |
  | int main(void) {
  |     long msg = 1;
  |     int fd;
  |
  |     fd = open("/proc/sys/kernel/msgmax", O_WRONLY);
  |     write(fd, "-1", 2);
  |     close(fd);
  |
  |     msgsnd(0, &msg, 0xfffffff0, IPC_NOWAIT);
  |
  |     return 0;
  | }
  '---

Fix the issue by preventing msgsz from getting truncated by consistently
using size_t for the message length.  This way the size checks in
do_msgsnd() could still be passed with a negative value for msg_ctlmax but
we would fail on the buffer allocation in that case and error out.

Also change the type of m_ts from int to size_t to avoid similar nastiness
in other code paths -- it is used in similar constructs, i.e.  signed vs.
unsigned checks.  It should never become negative under normal
circumstances, though.

Setting msg_ctlmax to a negative value is an odd configuration and should
be prevented.  As that might break existing userland, it will be handled
in a separate commit so it could easily be reverted and reworked without
reintroducing the above described bug.

Hardening mechanisms for user copy operations would have catched that bug
early -- e.g.  checking slab object sizes on user copy operations as the
usercopy feature of the PaX patch does.  Or, for that matter, detect the
long vs.  int sign change due to truncation, as the size overflow plugin
of the very same patch does.

[akpm@linux-foundation.org: fix i386 min() warnings]
Signed-off-by: Mathias Krause 
Cc: Pax Team 
Cc: Davidlohr Bueso 
Cc: Brad Spengler 
Cc: Manfred Spraul 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

rbtree: fix rbtree_postorder_for_each_entry_safe() iterator

2013-12-04T19:05:10+00:00

commit 1310a5a99d900ee30b9f171146406bde0c6c2bd4 upstream.

The iterator rbtree_postorder_for_each_entry_safe() relies on pointer
underflow behavior when testing for loop termination.  In particular it
expects that

  &rb_entry(NULL, type, field)->field

is NULL.  But the result of this expression is not defined by a C standard
and some gcc versions (e.g.  4.3.4) assume the above expression can never
be equal to NULL.  The net result is an oops because the iteration is not
properly terminated.

Fix the problem by modifying the iterator to avoid pointer underflows.

Signed-off-by: Jan Kara 
Signed-off-by: Cody P Schafer 
Cc: Michel Lespinasse 
Cc: "David S. Miller" 
Cc: Adrian Hunter 
Cc: Artem Bityutskiy 
Cc: David Woodhouse 
Cc: Jozsef Kadlecsik 
Cc: Pablo Neira Ayuso 
Cc: Patrick McHardy 
Cc: Paul Mundt 
Cc: Theodore Ts'o 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

usb: Don't enable USB 2.0 Link PM by default.

2013-11-29T19:28:09+00:00

commit de68bab4fa96014cfaa6fcbcdb9750e32969fb86 upstream.

How it's supposed to work:
--------------------------

USB 2.0 Link PM is a lower power state that some newer USB 2.0 devices
support.  USB 3.0 devices certified by the USB-IF are required to
support it if they are plugged into a USB 2.0 only port, or a USB 2.0
cable is used.  USB 2.0 Link PM requires both a USB device and a host
controller that supports USB 2.0 hardware-enabled LPM.

USB 2.0 Link PM is designed to be enabled once by software, and the host
hardware handles transitions to the L1 state automatically.  The premise
of USB 2.0 Link PM is to be able to put the device into a lower power
link state when the bus is idle or the device NAKs USB IN transfers for
a specified amount of time.

...but hardware is broken:
--------------------------

It turns out many USB 3.0 devices claim to support USB 2.0 Link PM (by
setting the LPM bit in their USB 2.0 BOS descriptor), but they don't
actually implement it correctly.  This manifests as the USB device
refusing to respond to transfers when it is plugged into a USB 2.0 only
port under the Haswell-ULT/Lynx Point LP xHCI host.

These devices pass the xHCI driver's simple test to enable USB 2.0 Link
PM, wait for the port to enter L1, and then bring it back into L0.  They
only start to break when L1 entry is interleaved with transfers.

Some devices then fail to respond to the next control transfer (usually
a Set Configuration).  This results in devices never enumerating.

Other mass storage devices (such as a later model Western Digital My
Passport USB 3.0 hard drive) respond fine to going into L1 between
control transfers.  They ACK the entry, come out of L1 when the host
needs to send a control transfer, and respond properly to those control
transfers.  However, when the first READ10 SCSI command is sent, the
device NAKs the data phase while it's reading from the spinning disk.
Eventually, the host requests to put the link into L1, and the device
ACKs that request.  Then it never responds to the data phase of the
READ10 command.  This results in not being able to read from the drive.

Some mass storage devices (like the Corsair Survivor USB 3.0 flash
drive) are well behaved.  They ACK the entry into L1 during control
transfers, and when SCSI commands start coming in, they NAK the requests
to go into L1, because they need to be at full power.

Not all USB 3.0 devices advertise USB 2.0 link PM support.  My Point
Grey USB 3.0 webcam advertises itself as a USB 2.1 device, but doesn't
have a USB 2.0 BOS descriptor, so we don't enable USB 2.0 Link PM.  I
suspect that means the device isn't certified.

What do we do about it?
-----------------------

There's really no good way for the kernel to test these devices.
Therefore, the kernel needs to disable USB 2.0 Link PM by default, and
distros will have to enable it by writing 1 to the sysfs file
/sys/bus/usb/devices/../power/usb2_hardware_lpm.  Rip out the xHCI Link
PM test, since it's not sufficient to detect these buggy devices, and
don't automatically enable LPM after the device is addressed.

This patch should be backported to kernels as old as 3.11, that
contain the commit a558ccdcc71c7770c5e80c926a31cfe8a3892a09 "usb: xhci:
add USB2 Link power management BESL support".  Without this fix, some
USB 3.0 devices will not enumerate or work properly under USB 2.0 ports
on Haswell-ULT systems.

Signed-off-by: Sarah Sharp 
Signed-off-by: Greg Kroah-Hartman