1 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 1437 1438 1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 1500 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1533 1534 1535 1536 1537 1538 1539 1540 1541 1542 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553 1554 1555 1556 1557 1558 1559 1560 1561 1562 1563 1564 1565 1566 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 1590 1591 1592 1593 1594 1595 1596 1597 1598 1599 1600 1601 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1622 1623 1624 1625 1626 1627 1628 1629 1630 1631 1632 1633 1634 1635 1636 1637 1638 1639 1640 1641 1642 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 1665 1666 1667 1668 1669 1670 1671 1672 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682 1683 1684 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1695 1696 1697 1698 1699 1700 1701 1702 1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 1713 1714 1715 1716 1717 1718 1719 1720 1721 1722 1723 1724 1725 1726 1727 1728 1729 1730 1731 1732 1733 1734 1735 1736 1737 1738 1739 1740 1741 1742 1743 1744 1745 1746 1747 1748 1749 1750 1751 1752 1753 1754 1755 1756 1757 1758 1759 1760 1761 1762 1763 1764 1765 1766 1767 1768 1769 1770 1771 1772 1773 1774 1775 1776 1777 1778 1779 1780 1781 1782 1783 1784 1785 1786 1787 1788 1789 1790 1791 1792 1793 1794 1795 1796 1797 1798 1799 1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 1810 1811 1812 1813 1814 1815 1816 1817 1818 1819 1820 1821 1822 1823 1824 1825 1826 1827 1828 1829 1830 1831 1832 1833 1834 1835 1836 1837 1838 1839 1840 1841 1842 1843 1844 1845 1846 1847 1848 1849 1850 1851 1852 1853 1854 1855 1856 1857 1858 1859 1860 1861 1862 1863 1864 1865 1866 1867 1868 1869 1870 1871 1872 1873 1874 1875 1876 1877 1878 1879 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 1890 1891 1892 1893 1894 1895 1896 1897 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 2026 2027 2028 2029 2030 2031 2032 2033 2034 2035 2036 2037 2038 2039 2040 2041 2042 2043 2044 2045 2046 2047 2048 2049 2050 2051 2052 2053 2054 2055 2056 2057 2058 2059 2060 2061 2062 2063 2064 2065 2066 2067 2068 2069 2070 2071 2072 2073 2074 2075 2076 2077 2078 2079 2080 2081 2082 2083 2084 2085 2086 2087 2088 2089 2090 2091 2092 2093 2094 2095 2096 2097 2098 2099 2100 2101 2102 2103 2104 2105 2106 2107 2108 2109 2110 2111 2112 2113 2114 2115 2116 2117 2118 2119 2120 2121 2122 2123 2124 2125 2126 2127 2128 2129 2130 2131 2132 2133 2134 2135 2136 2137 2138 2139 2140 2141 2142 2143 2144 2145 2146 2147 2148 2149 2150 2151 2152 2153 2154 2155 2156 2157 2158 2159 2160 2161 2162 2163 2164 2165 2166 2167 2168 2169 2170 2171 2172 2173 2174 2175 2176 2177 2178 2179 2180 2181 2182 2183 2184 2185 2186 2187 2188 2189 2190 2191 2192 2193 2194 2195 2196 2197 2198 2199 2200 2201 2202 2203 2204 2205 2206 2207 2208 2209 2210 2211 2212 2213 2214 2215 2216 2217 2218 2219 2220 2221 2222 2223 2224 2225 2226 2227 2228 2229 2230 2231 2232 2233 2234 2235 2236 2237 2238 2239 2240 2241 2242 2243 2244 2245 2246 2247 2248 2249 2250 2251 2252 2253 2254 2255 2256 2257 2258 2259 2260 2261 2262 2263 2264 2265 2266 2267 2268 2269 2270 2271 2272 2273 2274 2275 2276 2277 2278 2279 2280 2281 2282 2283 2284 2285 2286 2287 2288 2289 2290 2291 2292 2293 2294 2295 2296 2297 2298 2299 2300 2301 2302 2303 2304 2305 2306 2307 2308 2309 2310 2311 2312 2313 2314 2315 2316 2317 2318 2319 2320 2321 2322 2323 2324 2325 2326 2327 2328 2329 2330 2331 2332 2333 2334 2335 2336 2337 2338 2339 2340 2341 2342 2343 2344 2345 2346 2347 2348 2349 2350 2351 2352 2353 2354 2355 2356 2357 2358 2359 2360 2361 2362 2363 2364 2365 2366 2367 2368 2369 2370 2371 2372 2373 2374 2375 2376 2377 2378 2379 2380 2381 2382 2383 2384 2385 2386 2387 2388 2389 2390 2391 2392 2393 2394 2395 2396 2397 2398 2399 2400 2401 2402 2403 2404 2405 2406 2407 2408 2409 2410 2411 2412 2413 2414 2415 2416 2417 2418 2419 2420 2421 2422 2423 2424 2425 2426 2427 2428 2429 2430 2431 2432 2433 2434 2435 2436 2437 2438 2439 2440 2441 2442 2443 2444 2445 2446 2447 2448 2449 2450 2451 2452 2453 2454 2455 2456 2457 2458 2459 2460 2461 2462 2463 2464 2465 2466 2467 2468 2469 2470 2471 2472 2473 2474 2475 2476 2477 2478 2479 2480 2481 2482 2483 2484 2485 2486 2487 2488 2489 2490 2491 2492 2493 2494 2495 2496 2497 2498 2499 2500 2501 2502 2503 2504 2505 2506 2507 2508 2509 2510 2511 2512 2513 2514 2515 2516 2517 2518 2519 2520 2521 2522 2523 2524 2525 2526 2527 2528 2529 2530 2531 2532 2533 2534 2535 2536 2537 2538 2539 2540 2541 2542 2543 2544 2545 2546 2547 2548 2549 2550 2551 2552 2553 2554 2555 2556 2557 2558 2559 2560 2561 2562 2563 2564 2565 2566 2567 2568 2569 2570 2571 2572 2573 2574 2575 2576 2577 2578 2579 2580 2581 2582 2583 2584 2585 2586 2587 2588 2589 2590 2591 2592 2593 2594 2595 2596 2597 2598 2599 2600 2601 2602 2603 2604 2605 2606 2607 2608 2609 2610 2611 2612 2613 2614 2615 2616 2617 2618 2619 2620 2621 2622 2623 2624 2625 2626 2627 2628 2629 2630 2631 2632 2633 2634 2635 2636 2637 2638 2639 2640 2641 2642 2643 2644 2645 2646 2647 2648 2649 2650 2651 2652 2653 2654 2655 2656 2657 2658 2659 2660 2661 2662 2663 2664 2665 2666 2667 2668 2669 2670 2671 2672 2673 2674 2675 2676 2677 2678 2679 2680 2681 2682 2683 2684 2685 2686 2687 2688 2689 2690 2691 2692 2693 2694 2695 2696 2697 2698 2699 2700 2701 2702 2703 2704 2705 2706 2707 2708 2709 2710 2711 2712 2713 2714 2715 2716 2717 2718 2719 2720 2721 2722 2723 2724 2725 2726 2727 2728 2729 2730 2731 2732 2733 2734 2735 2736 2737 2738 2739 2740 2741 2742 2743 2744 2745 2746 2747 2748 2749 2750 2751 2752 2753 2754 2755 2756 2757 2758 2759 2760 2761 2762 2763 2764 2765 2766 2767 2768 2769 2770 2771 2772 2773 2774 2775 2776 2777 2778 2779 2780 2781 2782 2783 2784 2785 2786 2787 2788 2789 2790 2791 2792 2793 2794 2795 2796 2797 2798 2799 2800 2801 2802 2803 2804 2805 2806 2807 2808 2809 2810 2811 2812 2813 2814 2815 2816 2817 2818 2819 2820 2821 2822 2823 2824 2825 2826 2827 2828 2829 2830 2831 2832 2833 2834 2835 2836 2837 2838 2839 2840 2841 2842 2843 2844 2845 2846 2847 2848 2849 2850 2851 2852 2853 2854 2855 2856 2857 2858 2859 2860 2861 2862 2863 2864 2865 2866 2867 2868 2869 2870 2871 2872 2873 2874 2875 2876 2877 2878 2879 2880 2881 2882 2883 2884 2885 2886 2887 2888 2889 2890 2891 2892 2893 2894 2895 2896 2897 2898 2899 2900 2901 2902 2903 2904 2905 2906 2907 2908 2909 2910 2911 2912 2913 2914 2915 2916 2917 2918 2919 2920 2921 2922 2923 2924 2925 2926 2927 2928 2929 2930 2931 2932 2933 2934 2935 2936 2937 2938 2939 2940 2941 2942 2943 2944 2945 2946 2947 2948 2949 2950 2951 2952 2953 2954 2955 2956 2957 2958 2959 2960 2961 2962 2963 2964 2965 2966 2967 2968 2969 2970 2971 2972 2973 2974 2975 2976 2977 2978 2979 2980 2981 2982 2983 2984 2985 2986 2987 2988 2989 2990 2991 2992 2993 2994 2995 2996 2997 2998 2999 3000 3001 3002 3003 3004 3005 3006 3007 3008 3009 3010 3011 3012 3013 3014 3015 3016 3017 3018 3019 3020 3021 3022 3023 3024 3025 3026 3027 3028 3029 3030 3031 3032 3033 3034 3035 3036 3037 3038 3039 3040 3041 3042 3043 3044 3045 3046 3047 3048 3049 3050 3051 3052 3053 3054 3055 3056 3057 3058 3059 3060 3061 3062 3063 3064 3065 3066 3067 3068 3069 3070 3071 3072 3073 3074 3075 3076 3077 3078 3079 3080 3081 3082 3083 3084 3085 3086 3087 3088 3089 3090 3091 3092 3093 3094 3095 3096 3097 3098 3099 3100 3101 3102 3103 3104 3105 3106 3107 3108 3109 3110 3111 3112 3113 3114 3115 3116 3117 3118 3119 3120 3121 3122 3123 3124 3125 3126 3127 3128 3129 3130 3131 3132 3133 3134 3135 3136 3137 3138 3139 3140 3141 3142 3143 3144 3145 3146 3147 3148 3149 3150 3151 3152 3153 3154 3155 3156 3157 3158 3159 3160 3161 3162 3163 3164 3165 3166 3167 3168 3169 3170 3171 3172 3173 3174 3175 3176 3177 3178 3179 3180 3181 3182 3183 3184 3185 3186 3187 3188 3189 3190 3191 3192 3193 3194 3195 3196 3197 3198 3199 3200 3201 3202 3203 3204 3205 3206 3207 3208 3209 3210 3211 3212 3213 3214 3215 3216 3217 3218 3219 3220 3221 3222 3223 3224 3225 3226 3227 3228 3229 3230 3231 3232 3233 3234 3235 3236 3237 3238 3239 3240 3241 3242 3243 3244 3245 3246 3247 3248 3249 3250 3251 3252 3253 3254 3255 3256 3257 3258 3259 3260 3261 3262 3263 3264 3265 3266 3267 3268 3269 3270 3271 3272 3273 3274 3275 3276 3277 3278 3279 3280 3281 3282 3283 3284 3285 3286 3287 3288 3289 3290 3291 3292 3293 3294 3295 3296 3297 3298 3299 3300 3301 3302 3303 3304 3305 3306 3307 3308 3309 3310 3311 3312 3313 3314 3315 3316 3317 3318 3319 3320 3321 3322 3323 3324 3325 3326 3327 3328 3329 3330 3331 3332 3333 3334 3335 3336 3337 3338 3339 3340 3341 3342 3343 3344 3345 3346 3347 3348 3349 3350 3351 3352 3353 3354 3355 3356 3357 3358 3359 3360 3361 3362 3363 3364 3365 3366 3367 3368 3369 3370 3371 3372 3373 3374 3375 3376 3377 3378 3379 3380 3381 3382 3383 3384 3385 3386 3387 3388 3389 3390 3391 3392 3393 3394 3395 3396 3397 3398 3399 3400 3401 3402 3403 3404 3405 3406 3407 3408 3409 3410 3411 3412 3413 3414 3415 3416 3417 3418 3419 3420 3421 3422 3423 3424 3425 3426 3427 3428 3429 3430 3431 3432 3433 3434 3435 3436 3437 3438 3439 3440 3441 3442 3443 3444 3445 3446 3447 3448 3449 3450 3451 3452 3453 3454 3455 3456 3457 3458 3459 3460 3461 3462 3463 3464 3465 3466 3467 3468 3469 3470 3471 3472 3473 3474 3475 3476 3477 3478 3479 3480 3481 3482 3483 3484 3485 3486 3487 3488 3489 3490 3491 3492 3493 3494 3495 3496 3497 3498 3499 3500 3501 3502 3503 3504 3505 3506 3507 3508 3509 3510 3511 3512 3513 3514 3515 3516 3517 3518 3519 3520 3521 3522 3523 3524 3525 3526 3527 3528 3529 3530 3531 3532 3533 3534 3535 3536 3537 3538 3539 3540 3541 3542 3543 3544 3545 3546 3547 3548 3549 3550 3551 3552 3553 3554 3555 3556 3557 3558 3559 3560 3561 3562 3563 3564 3565 3566 3567 3568 3569 3570 3571 3572 3573 3574 3575 3576 3577 3578 3579 3580 3581 3582 3583 3584 3585 3586 3587 3588 3589 3590 3591 3592 3593 3594 3595 3596 3597 3598 3599 3600 3601 3602 3603 3604 3605 3606 3607 3608 3609 3610 3611 3612 3613 3614 3615 3616 3617 3618 3619 3620 3621 3622 3623 3624 3625 3626 3627 3628 3629 3630 3631 3632 3633 3634 3635 3636 3637 3638 3639 3640 3641 3642 3643 3644 3645 3646 3647 3648 3649 3650 3651 3652 3653 3654 3655 3656 3657 3658 3659 3660 3661 3662 3663 /* * kernel/cpuset.c * * Processor and Memory placement constraints for sets of tasks. * * Copyright (C) 2003 BULL SA. * Copyright (C) 2004-2007 Silicon Graphics, Inc. * Copyright (C) 2006 Google, Inc * * Portions derived from Patrick Mochel's sysfs code. * sysfs is Copyright (c) 2001-3 Patrick Mochel * * 2003-10-10 Written by Simon Derr. * 2003-10-22 Updates by Stephen Hemminger. * 2004 May-July Rework by Paul Jackson. * 2006 Rework by Paul Menage to use generic cgroups * 2008 Rework of the scheduler domains and CPU hotplug handling * by Max Krasnyansky * * This file is subject to the terms and conditions of the GNU General Public * License. See the file COPYING in the main directory of the Linux * distribution for more details. */ #include <linux/cpu.h> #include <linux/cpumask.h> #include <linux/cpuset.h> #include <linux/err.h> #include <linux/errno.h> #include <linux/file.h> #include <linux/fs.h> #include <linux/init.h> #include <linux/interrupt.h> #include <linux/kernel.h> #include <linux/kmod.h> #include <linux/list.h> #include <linux/mempolicy.h> #include <linux/mm.h> #include <linux/memory.h> #include <linux/export.h> #include <linux/mount.h> #include <linux/fs_context.h> #include <linux/namei.h> #include <linux/pagemap.h> #include <linux/proc_fs.h> #include <linux/rcupdate.h> #include <linux/sched.h> #include <linux/sched/deadline.h> #include <linux/sched/mm.h> #include <linux/sched/task.h> #include <linux/seq_file.h> #include <linux/security.h> #include <linux/slab.h> #include <linux/spinlock.h> #include <linux/stat.h> #include <linux/string.h> #include <linux/time.h> #include <linux/time64.h> #include <linux/backing-dev.h> #include <linux/sort.h> #include <linux/oom.h> #include <linux/sched/isolation.h> #include <linux/uaccess.h> #include <linux/atomic.h> #include <linux/mutex.h> #include <linux/cgroup.h> #include <linux/wait.h> DEFINE_STATIC_KEY_FALSE(cpusets_pre_enable_key); DEFINE_STATIC_KEY_FALSE(cpusets_enabled_key); /* See "Frequency meter" comments, below. */ struct fmeter { int cnt; /* unprocessed events count */ int val; /* most recent output value */ time64_t time; /* clock (secs) when val computed */ spinlock_t lock; /* guards read or write of above */ }; struct cpuset { struct cgroup_subsys_state css; unsigned long flags; /* "unsigned long" so bitops work */ /* * On default hierarchy: * * The user-configured masks can only be changed by writing to * cpuset.cpus and cpuset.mems, and won't be limited by the * parent masks. * * The effective masks is the real masks that apply to the tasks * in the cpuset. They may be changed if the configured masks are * changed or hotplug happens. * * effective_mask == configured_mask & parent's effective_mask, * and if it ends up empty, it will inherit the parent's mask. * * * On legacy hierachy: * * The user-configured masks are always the same with effective masks. */ /* user-configured CPUs and Memory Nodes allow to tasks */ cpumask_var_t cpus_allowed; nodemask_t mems_allowed; /* effective CPUs and Memory Nodes allow to tasks */ cpumask_var_t effective_cpus; nodemask_t effective_mems; /* * CPUs allocated to child sub-partitions (default hierarchy only) * - CPUs granted by the parent = effective_cpus U subparts_cpus * - effective_cpus and subparts_cpus are mutually exclusive. * * effective_cpus contains only onlined CPUs, but subparts_cpus * may have offlined ones. */ cpumask_var_t subparts_cpus; /* * This is old Memory Nodes tasks took on. * * - top_cpuset.old_mems_allowed is initialized to mems_allowed. * - A new cpuset's old_mems_allowed is initialized when some * task is moved into it. * - old_mems_allowed is used in cpuset_migrate_mm() when we change * cpuset.mems_allowed and have tasks' nodemask updated, and * then old_mems_allowed is updated to mems_allowed. */ nodemask_t old_mems_allowed; struct fmeter fmeter; /* memory_pressure filter */ /* * Tasks are being attached to this cpuset. Used to prevent * zeroing cpus/mems_allowed between ->can_attach() and ->attach(). */ int attach_in_progress; /* partition number for rebuild_sched_domains() */ int pn; /* for custom sched domain */ int relax_domain_level; /* number of CPUs in subparts_cpus */ int nr_subparts_cpus; /* partition root state */ int partition_root_state; /* * Default hierarchy only: * use_parent_ecpus - set if using parent's effective_cpus * child_ecpus_count - # of children with use_parent_ecpus set */ int use_parent_ecpus; int child_ecpus_count; }; /* * Partition root states: * * 0 - not a partition root * * 1 - partition root * * -1 - invalid partition root * None of the cpus in cpus_allowed can be put into the parent's * subparts_cpus. In this case, the cpuset is not a real partition * root anymore. However, the CPU_EXCLUSIVE bit will still be set * and the cpuset can be restored back to a partition root if the * parent cpuset can give more CPUs back to this child cpuset. */ #define PRS_DISABLED 0 #define PRS_ENABLED 1 #define PRS_ERROR -1 /* * Temporary cpumasks for working with partitions that are passed among * functions to avoid memory allocation in inner functions. */ struct tmpmasks { cpumask_var_t addmask, delmask; /* For partition root */ cpumask_var_t new_cpus; /* For update_cpumasks_hier() */ }; static inline struct cpuset *css_cs(struct cgroup_subsys_state *css) { return css ? container_of(css, struct cpuset, css) : NULL; } /* Retrieve the cpuset for a task */ static inline struct cpuset *task_cs(struct task_struct *task) { return css_cs(task_css(task, cpuset_cgrp_id)); } static inline struct cpuset *parent_cs(struct cpuset *cs) { return css_cs(cs->css.parent); } /* bits in struct cpuset flags field */ typedef enum { CS_ONLINE, CS_CPU_EXCLUSIVE, CS_MEM_EXCLUSIVE, CS_MEM_HARDWALL, CS_MEMORY_MIGRATE, CS_SCHED_LOAD_BALANCE, CS_SPREAD_PAGE, CS_SPREAD_SLAB, } cpuset_flagbits_t; /* convenient tests for these bits */ static inline bool is_cpuset_online(struct cpuset *cs) { return test_bit(CS_ONLINE, &cs->flags) && !css_is_dying(&cs->css); } static inline int is_cpu_exclusive(const struct cpuset *cs) { return test_bit(CS_CPU_EXCLUSIVE, &cs->flags); } static inline int is_mem_exclusive(const struct cpuset *cs) { return test_bit(CS_MEM_EXCLUSIVE, &cs->flags); } static inline int is_mem_hardwall(const struct cpuset *cs) { return test_bit(CS_MEM_HARDWALL, &cs->flags); } static inline int is_sched_load_balance(const struct cpuset *cs) { return test_bit(CS_SCHED_LOAD_BALANCE, &cs->flags); } static inline int is_memory_migrate(const struct cpuset *cs) { return test_bit(CS_MEMORY_MIGRATE, &cs->flags); } static inline int is_spread_page(const struct cpuset *cs) { return test_bit(CS_SPREAD_PAGE, &cs->flags); } static inline int is_spread_slab(const struct cpuset *cs) { return test_bit(CS_SPREAD_SLAB, &cs->flags); } static inline int is_partition_root(const struct cpuset *cs) { return cs->partition_root_state > 0; } static struct cpuset top_cpuset = { .flags = ((1 << CS_ONLINE) | (1 << CS_CPU_EXCLUSIVE) | (1 << CS_MEM_EXCLUSIVE)), .partition_root_state = PRS_ENABLED, }; /** * cpuset_for_each_child - traverse online children of a cpuset * @child_cs: loop cursor pointing to the current child * @pos_css: used for iteration * @parent_cs: target cpuset to walk children of * * Walk @child_cs through the online children of @parent_cs. Must be used * with RCU read locked. */ #define cpuset_for_each_child(child_cs, pos_css, parent_cs) \ css_for_each_child((pos_css), &(parent_cs)->css) \ if (is_cpuset_online(((child_cs) = css_cs((pos_css))))) /** * cpuset_for_each_descendant_pre - pre-order walk of a cpuset's descendants * @des_cs: loop cursor pointing to the current descendant * @pos_css: used for iteration * @root_cs: target cpuset to walk ancestor of * * Walk @des_cs through the online descendants of @root_cs. Must be used * with RCU read locked. The caller may modify @pos_css by calling * css_rightmost_descendant() to skip subtree. @root_cs is included in the * iteration and the first node to be visited. */ #define cpuset_for_each_descendant_pre(des_cs, pos_css, root_cs) \ css_for_each_descendant_pre((pos_css), &(root_cs)->css) \ if (is_cpuset_online(((des_cs) = css_cs((pos_css))))) /* * There are two global locks guarding cpuset structures - cpuset_mutex and * callback_lock. We also require taking task_lock() when dereferencing a * task's cpuset pointer. See "The task_lock() exception", at the end of this * comment. * * A task must hold both locks to modify cpusets. If a task holds * cpuset_mutex, then it blocks others wanting that mutex, ensuring that it * is the only task able to also acquire callback_lock and be able to * modify cpusets. It can perform various checks on the cpuset structure * first, knowing nothing will change. It can also allocate memory while * just holding cpuset_mutex. While it is performing these checks, various * callback routines can briefly acquire callback_lock to query cpusets. * Once it is ready to make the changes, it takes callback_lock, blocking * everyone else. * * Calls to the kernel memory allocator can not be made while holding * callback_lock, as that would risk double tripping on callback_lock * from one of the callbacks into the cpuset code from within * __alloc_pages(). * * If a task is only holding callback_lock, then it has read-only * access to cpusets. * * Now, the task_struct fields mems_allowed and mempolicy may be changed * by other task, we use alloc_lock in the task_struct fields to protect * them. * * The cpuset_common_file_read() handlers only hold callback_lock across * small pieces of code, such as when reading out possibly multi-word * cpumasks and nodemasks. * * Accessing a task's cpuset should be done in accordance with the * guidelines for accessing subsystem state in kernel/cgroup.c */ DEFINE_STATIC_PERCPU_RWSEM(cpuset_rwsem); void cpuset_read_lock(void) { percpu_down_read(&cpuset_rwsem); } void cpuset_read_unlock(void) { percpu_up_read(&cpuset_rwsem); } static DEFINE_SPINLOCK(callback_lock); static struct workqueue_struct *cpuset_migrate_mm_wq; /* * CPU / memory hotplug is handled asynchronously. */ static void cpuset_hotplug_workfn(struct work_struct *work); static DECLARE_WORK(cpuset_hotplug_work, cpuset_hotplug_workfn); static DECLARE_WAIT_QUEUE_HEAD(cpuset_attach_wq); /* * Cgroup v2 behavior is used on the "cpus" and "mems" control files when * on default hierarchy or when the cpuset_v2_mode flag is set by mounting * the v1 cpuset cgroup filesystem with the "cpuset_v2_mode" mount option. * With v2 behavior, "cpus" and "mems" are always what the users have * requested and won't be changed by hotplug events. Only the effective * cpus or mems will be affected. */ static inline bool is_in_v2_mode(void) { return cgroup_subsys_on_dfl(cpuset_cgrp_subsys) || (cpuset_cgrp_subsys.root->flags & CGRP_ROOT_CPUSET_V2_MODE); } /* * Return in pmask the portion of a cpusets's cpus_allowed that * are online. If none are online, walk up the cpuset hierarchy * until we find one that does have some online cpus. * * One way or another, we guarantee to return some non-empty subset * of cpu_online_mask. * * Call with callback_lock or cpuset_mutex held. */ static void guarantee_online_cpus(struct cpuset *cs, struct cpumask *pmask) { while (!cpumask_intersects(cs->effective_cpus, cpu_online_mask)) { cs = parent_cs(cs); if (unlikely(!cs)) { /* * The top cpuset doesn't have any online cpu as a * consequence of a race between cpuset_hotplug_work * and cpu hotplug notifier. But we know the top * cpuset's effective_cpus is on its way to be * identical to cpu_online_mask. */ cpumask_copy(pmask, cpu_online_mask); return; } } cpumask_and(pmask, cs->effective_cpus, cpu_online_mask); } /* * Return in *pmask the portion of a cpusets's mems_allowed that * are online, with memory. If none are online with memory, walk * up the cpuset hierarchy until we find one that does have some * online mems. The top cpuset always has some mems online. * * One way or another, we guarantee to return some non-empty subset * of node_states[N_MEMORY]. * * Call with callback_lock or cpuset_mutex held. */ static void guarantee_online_mems(struct cpuset *cs, nodemask_t *pmask) { while (!nodes_intersects(cs->effective_mems, node_states[N_MEMORY])) cs = parent_cs(cs); nodes_and(*pmask, cs->effective_mems, node_states[N_MEMORY]); } /* * update task's spread flag if cpuset's page/slab spread flag is set * * Call with callback_lock or cpuset_mutex held. */ static void cpuset_update_task_spread_flag(struct cpuset *cs, struct task_struct *tsk) { if (is_spread_page(cs)) task_set_spread_page(tsk); else task_clear_spread_page(tsk); if (is_spread_slab(cs)) task_set_spread_slab(tsk); else task_clear_spread_slab(tsk); } /* * is_cpuset_subset(p, q) - Is cpuset p a subset of cpuset q? * * One cpuset is a subset of another if all its allowed CPUs and * Memory Nodes are a subset of the other, and its exclusive flags * are only set if the other's are set. Call holding cpuset_mutex. */ static int is_cpuset_subset(const struct cpuset *p, const struct cpuset *q) { return cpumask_subset(p->cpus_allowed, q->cpus_allowed) && nodes_subset(p->mems_allowed, q->mems_allowed) && is_cpu_exclusive(p) <= is_cpu_exclusive(q) && is_mem_exclusive(p) <= is_mem_exclusive(q); } /** * alloc_cpumasks - allocate three cpumasks for cpuset * @cs: the cpuset that have cpumasks to be allocated. * @tmp: the tmpmasks structure pointer * Return: 0 if successful, -ENOMEM otherwise. * * Only one of the two input arguments should be non-NULL. */ static inline int alloc_cpumasks(struct cpuset *cs, struct tmpmasks *tmp) { cpumask_var_t *pmask1, *pmask2, *pmask3; if (cs) { pmask1 = &cs->cpus_allowed; pmask2 = &cs->effective_cpus; pmask3 = &cs->subparts_cpus; } else { pmask1 = &tmp->new_cpus; pmask2 = &tmp->addmask; pmask3 = &tmp->delmask; } if (!zalloc_cpumask_var(pmask1, GFP_KERNEL)) return -ENOMEM; if (!zalloc_cpumask_var(pmask2, GFP_KERNEL)) goto free_one; if (!zalloc_cpumask_var(pmask3, GFP_KERNEL)) goto free_two; return 0; free_two: free_cpumask_var(*pmask2); free_one: free_cpumask_var(*pmask1); return -ENOMEM; } /** * free_cpumasks - free cpumasks in a tmpmasks structure * @cs: the cpuset that have cpumasks to be free. * @tmp: the tmpmasks structure pointer */ static inline void free_cpumasks(struct cpuset *cs, struct tmpmasks *tmp) { if (cs) { free_cpumask_var(cs->cpus_allowed); free_cpumask_var(cs->effective_cpus); free_cpumask_var(cs->subparts_cpus); } if (tmp) { free_cpumask_var(tmp->new_cpus); free_cpumask_var(tmp->addmask); free_cpumask_var(tmp->delmask); } } /** * alloc_trial_cpuset - allocate a trial cpuset * @cs: the cpuset that the trial cpuset duplicates */ static struct cpuset *alloc_trial_cpuset(struct cpuset *cs) { struct cpuset *trial; trial = kmemdup(cs, sizeof(*cs), GFP_KERNEL); if (!trial) return NULL; if (alloc_cpumasks(trial, NULL)) { kfree(trial); return NULL; } cpumask_copy(trial->cpus_allowed, cs->cpus_allowed); cpumask_copy(trial->effective_cpus, cs->effective_cpus); return trial; } /** * free_cpuset - free the cpuset * @cs: the cpuset to be freed */ static inline void free_cpuset(struct cpuset *cs) { free_cpumasks(cs, NULL); kfree(cs); } /* * validate_change() - Used to validate that any proposed cpuset change * follows the structural rules for cpusets. * * If we replaced the flag and mask values of the current cpuset * (cur) with those values in the trial cpuset (trial), would * our various subset and exclusive rules still be valid? Presumes * cpuset_mutex held. * * 'cur' is the address of an actual, in-use cpuset. Operations * such as list traversal that depend on the actual address of the * cpuset in the list must use cur below, not trial. * * 'trial' is the address of bulk structure copy of cur, with * perhaps one or more of the fields cpus_allowed, mems_allowed, * or flags changed to new, trial values. * * Return 0 if valid, -errno if not. */ static int validate_change(struct cpuset *cur, struct cpuset *trial) { struct cgroup_subsys_state *css; struct cpuset *c, *par; int ret; rcu_read_lock(); /* Each of our child cpusets must be a subset of us */ ret = -EBUSY; cpuset_for_each_child(c, css, cur) if (!is_cpuset_subset(c, trial)) goto out; /* Remaining checks don't apply to root cpuset */ ret = 0; if (cur == &top_cpuset) goto out; par = parent_cs(cur); /* On legacy hiearchy, we must be a subset of our parent cpuset. */ ret = -EACCES; if (!is_in_v2_mode() && !is_cpuset_subset(trial, par)) goto out; /* * If either I or some sibling (!= me) is exclusive, we can't * overlap */ ret = -EINVAL; cpuset_for_each_child(c, css, par) { if ((is_cpu_exclusive(trial) || is_cpu_exclusive(c)) && c != cur && cpumask_intersects(trial->cpus_allowed, c->cpus_allowed)) goto out; if ((is_mem_exclusive(trial) || is_mem_exclusive(c)) && c != cur && nodes_intersects(trial->mems_allowed, c->mems_allowed)) goto out; } /* * Cpusets with tasks - existing or newly being attached - can't * be changed to have empty cpus_allowed or mems_allowed. */ ret = -ENOSPC; if ((cgroup_is_populated(cur->css.cgroup) || cur->attach_in_progress)) { if (!cpumask_empty(cur->cpus_allowed) && cpumask_empty(trial->cpus_allowed)) goto out; if (!nodes_empty(cur->mems_allowed) && nodes_empty(trial->mems_allowed)) goto out; } /* * We can't shrink if we won't have enough room for SCHED_DEADLINE * tasks. */ ret = -EBUSY; if (is_cpu_exclusive(cur) && !cpuset_cpumask_can_shrink(cur->cpus_allowed, trial->cpus_allowed)) goto out; ret = 0; out: rcu_read_unlock(); return ret; } #ifdef CONFIG_SMP /* * Helper routine for generate_sched_domains(). * Do cpusets a, b have overlapping effective cpus_allowed masks? */ static int cpusets_overlap(struct cpuset *a, struct cpuset *b) { return cpumask_intersects(a->effective_cpus, b->effective_cpus); } static void update_domain_attr(struct sched_domain_attr *dattr, struct cpuset *c) { if (dattr->relax_domain_level < c->relax_domain_level) dattr->relax_domain_level = c->relax_domain_level; return; } static void update_domain_attr_tree(struct sched_domain_attr *dattr, struct cpuset *root_cs) { struct cpuset *cp; struct cgroup_subsys_state *pos_css; rcu_read_lock(); cpuset_for_each_descendant_pre(cp, pos_css, root_cs) { /* skip the whole subtree if @cp doesn't have any CPU */ if (cpumask_empty(cp->cpus_allowed)) { pos_css = css_rightmost_descendant(pos_css); continue; } if (is_sched_load_balance(cp)) update_domain_attr(dattr, cp); } rcu_read_unlock(); } /* Must be called with cpuset_mutex held. */ static inline int nr_cpusets(void) { /* jump label reference count + the top-level cpuset */ return static_key_count(&cpusets_enabled_key.key) + 1; } /* * generate_sched_domains() * * This function builds a partial partition of the systems CPUs * A 'partial partition' is a set of non-overlapping subsets whose * union is a subset of that set. * The output of this function needs to be passed to kernel/sched/core.c * partition_sched_domains() routine, which will rebuild the scheduler's * load balancing domains (sched domains) as specified by that partial * partition. * * See "What is sched_load_balance" in Documentation/admin-guide/cgroup-v1/cpusets.rst * for a background explanation of this. * * Does not return errors, on the theory that the callers of this * routine would rather not worry about failures to rebuild sched * domains when operating in the severe memory shortage situations * that could cause allocation failures below. * * Must be called with cpuset_mutex held. * * The three key local variables below are: * cp - cpuset pointer, used (together with pos_css) to perform a * top-down scan of all cpusets. For our purposes, rebuilding * the schedulers sched domains, we can ignore !is_sched_load_ * balance cpusets. * csa - (for CpuSet Array) Array of pointers to all the cpusets * that need to be load balanced, for convenient iterative * access by the subsequent code that finds the best partition, * i.e the set of domains (subsets) of CPUs such that the * cpus_allowed of every cpuset marked is_sched_load_balance * is a subset of one of these domains, while there are as * many such domains as possible, each as small as possible. * doms - Conversion of 'csa' to an array of cpumasks, for passing to * the kernel/sched/core.c routine partition_sched_domains() in a * convenient format, that can be easily compared to the prior * value to determine what partition elements (sched domains) * were changed (added or removed.) * * Finding the best partition (set of domains): * The triple nested loops below over i, j, k scan over the * load balanced cpusets (using the array of cpuset pointers in * csa[]) looking for pairs of cpusets that have overlapping * cpus_allowed, but which don't have the same 'pn' partition * number and gives them in the same partition number. It keeps * looping on the 'restart' label until it can no longer find * any such pairs. * * The union of the cpus_allowed masks from the set of * all cpusets having the same 'pn' value then form the one * element of the partition (one sched domain) to be passed to * partition_sched_domains(). */ static int generate_sched_domains(cpumask_var_t **domains, struct sched_domain_attr **attributes) { struct cpuset *cp; /* top-down scan of cpusets */ struct cpuset **csa; /* array of all cpuset ptrs */ int csn; /* how many cpuset ptrs in csa so far */ int i, j, k; /* indices for partition finding loops */ cpumask_var_t *doms; /* resulting partition; i.e. sched domains */ struct sched_domain_attr *dattr; /* attributes for custom domains */ int ndoms = 0; /* number of sched domains in result */ int nslot; /* next empty doms[] struct cpumask slot */ struct cgroup_subsys_state *pos_css; bool root_load_balance = is_sched_load_balance(&top_cpuset); doms = NULL; dattr = NULL; csa = NULL; /* Special case for the 99% of systems with one, full, sched domain */ if (root_load_balance && !top_cpuset.nr_subparts_cpus) { ndoms = 1; doms = alloc_sched_domains(ndoms); if (!doms) goto done; dattr = kmalloc(sizeof(struct sched_domain_attr), GFP_KERNEL); if (dattr) { *dattr = SD_ATTR_INIT; update_domain_attr_tree(dattr, &top_cpuset); } cpumask_and(doms[0], top_cpuset.effective_cpus, housekeeping_cpumask(HK_FLAG_DOMAIN)); goto done; } csa = kmalloc_array(nr_cpusets(), sizeof(cp), GFP_KERNEL); if (!csa) goto done; csn = 0; rcu_read_lock(); if (root_load_balance) csa[csn++] = &top_cpuset; cpuset_for_each_descendant_pre(cp, pos_css, &top_cpuset) { if (cp == &top_cpuset) continue; /* * Continue traversing beyond @cp iff @cp has some CPUs and * isn't load balancing. The former is obvious. The * latter: All child cpusets contain a subset of the * parent's cpus, so just skip them, and then we call * update_domain_attr_tree() to calc relax_domain_level of * the corresponding sched domain. * * If root is load-balancing, we can skip @cp if it * is a subset of the root's effective_cpus. */ if (!cpumask_empty(cp->cpus_allowed) && !(is_sched_load_balance(cp) && cpumask_intersects(cp->cpus_allowed, housekeeping_cpumask(HK_FLAG_DOMAIN)))) continue; if (root_load_balance && cpumask_subset(cp->cpus_allowed, top_cpuset.effective_cpus)) continue; if (is_sched_load_balance(cp) && !cpumask_empty(cp->effective_cpus)) csa[csn++] = cp; /* skip @cp's subtree if not a partition root */ if (!is_partition_root(cp)) pos_css = css_rightmost_descendant(pos_css); } rcu_read_unlock(); for (i = 0; i < csn; i++) csa[i]->pn = i; ndoms = csn; restart: /* Find the best partition (set of sched domains) */ for (i = 0; i < csn; i++) { struct cpuset *a = csa[i]; int apn = a->pn; for (j = 0; j < csn; j++) { struct cpuset *b = csa[j]; int bpn = b->pn; if (apn != bpn && cpusets_overlap(a, b)) { for (k = 0; k < csn; k++) { struct cpuset *c = csa[k]; if (c->pn == bpn) c->pn = apn; } ndoms--; /* one less element */ goto restart; } } } /* * Now we know how many domains to create. * Convert <csn, csa> to <ndoms, doms> and populate cpu masks. */ doms = alloc_sched_domains(ndoms); if (!doms) goto done; /* * The rest of the code, including the scheduler, can deal with * dattr==NULL case. No need to abort if alloc fails. */ dattr = kmalloc_array(ndoms, sizeof(struct sched_domain_attr), GFP_KERNEL); for (nslot = 0, i = 0; i < csn; i++) { struct cpuset *a = csa[i]; struct cpumask *dp; int apn = a->pn; if (apn < 0) { /* Skip completed partitions */ continue; } dp = doms[nslot]; if (nslot == ndoms) { static int warnings = 10; if (warnings) { pr_warn("rebuild_sched_domains confused: nslot %d, ndoms %d, csn %d, i %d, apn %d\n", nslot, ndoms, csn, i, apn); warnings--; } continue; } cpumask_clear(dp); if (dattr) *(dattr + nslot) = SD_ATTR_INIT; for (j = i; j < csn; j++) { struct cpuset *b = csa[j]; if (apn == b->pn) { cpumask_or(dp, dp, b->effective_cpus); cpumask_and(dp, dp, housekeeping_cpumask(HK_FLAG_DOMAIN)); if (dattr) update_domain_attr_tree(dattr + nslot, b); /* Done with this partition */ b->pn = -1; } } nslot++; } BUG_ON(nslot != ndoms); done: kfree(csa); /* * Fallback to the default domain if kmalloc() failed. * See comments in partition_sched_domains(). */ if (doms == NULL) ndoms = 1; *domains = doms; *attributes = dattr; return ndoms; } static void update_tasks_root_domain(struct cpuset *cs) { struct css_task_iter it; struct task_struct *task; css_task_iter_start(&cs->css, 0, &it); while ((task = css_task_iter_next(&it))) dl_add_task_root_domain(task); css_task_iter_end(&it); } static void rebuild_root_domains(void) { struct cpuset *cs = NULL; struct cgroup_subsys_state *pos_css; percpu_rwsem_assert_held(&cpuset_rwsem); lockdep_assert_cpus_held(); lockdep_assert_held(&sched_domains_mutex); rcu_read_lock(); /* * Clear default root domain DL accounting, it will be computed again * if a task belongs to it. */ dl_clear_root_domain(&def_root_domain); cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) { if (cpumask_empty(cs->effective_cpus)) { pos_css = css_rightmost_descendant(pos_css); continue; } css_get(&cs->css); rcu_read_unlock(); update_tasks_root_domain(cs); rcu_read_lock(); css_put(&cs->css); } rcu_read_unlock(); } static void partition_and_rebuild_sched_domains(int ndoms_new, cpumask_var_t doms_new[], struct sched_domain_attr *dattr_new) { mutex_lock(&sched_domains_mutex); partition_sched_domains_locked(ndoms_new, doms_new, dattr_new); rebuild_root_domains(); mutex_unlock(&sched_domains_mutex); } /* * Rebuild scheduler domains. * * If the flag 'sched_load_balance' of any cpuset with non-empty * 'cpus' changes, or if the 'cpus' allowed changes in any cpuset * which has that flag enabled, or if any cpuset with a non-empty * 'cpus' is removed, then call this routine to rebuild the * scheduler's dynamic sched domains. * * Call with cpuset_mutex held. Takes get_online_cpus(). */ static void rebuild_sched_domains_locked(void) { struct cgroup_subsys_state *pos_css; struct sched_domain_attr *attr; cpumask_var_t *doms; struct cpuset *cs; int ndoms; lockdep_assert_cpus_held(); percpu_rwsem_assert_held(&cpuset_rwsem); /* * If we have raced with CPU hotplug, return early to avoid * passing doms with offlined cpu to partition_sched_domains(). * Anyways, cpuset_hotplug_workfn() will rebuild sched domains. * * With no CPUs in any subpartitions, top_cpuset's effective CPUs * should be the same as the active CPUs, so checking only top_cpuset * is enough to detect racing CPU offlines. */ if (!top_cpuset.nr_subparts_cpus && !cpumask_equal(top_cpuset.effective_cpus, cpu_active_mask)) return; /* * With subpartition CPUs, however, the effective CPUs of a partition * root should be only a subset of the active CPUs. Since a CPU in any * partition root could be offlined, all must be checked. */ if (top_cpuset.nr_subparts_cpus) { rcu_read_lock(); cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) { if (!is_partition_root(cs)) { pos_css = css_rightmost_descendant(pos_css); continue; } if (!cpumask_subset(cs->effective_cpus, cpu_active_mask)) { rcu_read_unlock(); return; } } rcu_read_unlock(); } /* Generate domain masks and attrs */ ndoms = generate_sched_domains(&doms, &attr); /* Have scheduler rebuild the domains */ partition_and_rebuild_sched_domains(ndoms, doms, attr); } #else /* !CONFIG_SMP */ static void rebuild_sched_domains_locked(void) { } #endif /* CONFIG_SMP */ void rebuild_sched_domains(void) { get_online_cpus(); percpu_down_write(&cpuset_rwsem); rebuild_sched_domains_locked(); percpu_up_write(&cpuset_rwsem); put_online_cpus(); } /** * update_tasks_cpumask - Update the cpumasks of tasks in the cpuset. * @cs: the cpuset in which each task's cpus_allowed mask needs to be changed * * Iterate through each task of @cs updating its cpus_allowed to the * effective cpuset's. As this function is called with cpuset_mutex held, * cpuset membership stays stable. */ static void update_tasks_cpumask(struct cpuset *cs) { struct css_task_iter it; struct task_struct *task; css_task_iter_start(&cs->css, 0, &it); while ((task = css_task_iter_next(&it))) set_cpus_allowed_ptr(task, cs->effective_cpus); css_task_iter_end(&it); } /** * compute_effective_cpumask - Compute the effective cpumask of the cpuset * @new_cpus: the temp variable for the new effective_cpus mask * @cs: the cpuset the need to recompute the new effective_cpus mask * @parent: the parent cpuset * * If the parent has subpartition CPUs, include them in the list of * allowable CPUs in computing the new effective_cpus mask. Since offlined * CPUs are not removed from subparts_cpus, we have to use cpu_active_mask * to mask those out. */ static void compute_effective_cpumask(struct cpumask *new_cpus, struct cpuset *cs, struct cpuset *parent) { if (parent->nr_subparts_cpus) { cpumask_or(new_cpus, parent->effective_cpus, parent->subparts_cpus); cpumask_and(new_cpus, new_cpus, cs->cpus_allowed); cpumask_and(new_cpus, new_cpus, cpu_active_mask); } else { cpumask_and(new_cpus, cs->cpus_allowed, parent->effective_cpus); } } /* * Commands for update_parent_subparts_cpumask */ enum subparts_cmd { partcmd_enable, /* Enable partition root */ partcmd_disable, /* Disable partition root */ partcmd_update, /* Update parent's subparts_cpus */ }; /** * update_parent_subparts_cpumask - update subparts_cpus mask of parent cpuset * @cpuset: The cpuset that requests change in partition root state * @cmd: Partition root state change command * @newmask: Optional new cpumask for partcmd_update * @tmp: Temporary addmask and delmask * Return: 0, 1 or an error code * * For partcmd_enable, the cpuset is being transformed from a non-partition * root to a partition root. The cpus_allowed mask of the given cpuset will * be put into parent's subparts_cpus and taken away from parent's * effective_cpus. The function will return 0 if all the CPUs listed in * cpus_allowed can be granted or an error code will be returned. * * For partcmd_disable, the cpuset is being transofrmed from a partition * root back to a non-partition root. Any CPUs in cpus_allowed that are in * parent's subparts_cpus will be taken away from that cpumask and put back * into parent's effective_cpus. 0 should always be returned. * * For partcmd_update, if the optional newmask is specified, the cpu * list is to be changed from cpus_allowed to newmask. Otherwise, * cpus_allowed is assumed to remain the same. The cpuset should either * be a partition root or an invalid partition root. The partition root * state may change if newmask is NULL and none of the requested CPUs can * be granted by the parent. The function will return 1 if changes to * parent's subparts_cpus and effective_cpus happen or 0 otherwise. * Error code should only be returned when newmask is non-NULL. * * The partcmd_enable and partcmd_disable commands are used by * update_prstate(). The partcmd_update command is used by * update_cpumasks_hier() with newmask NULL and update_cpumask() with * newmask set. * * The checking is more strict when enabling partition root than the * other two commands. * * Because of the implicit cpu exclusive nature of a partition root, * cpumask changes that violates the cpu exclusivity rule will not be * permitted when checked by validate_change(). The validate_change() * function will also prevent any changes to the cpu list if it is not * a superset of children's cpu lists. */ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd, struct cpumask *newmask, struct tmpmasks *tmp) { struct cpuset *parent = parent_cs(cpuset); int adding; /* Moving cpus from effective_cpus to subparts_cpus */ int deleting; /* Moving cpus from subparts_cpus to effective_cpus */ int new_prs; bool part_error = false; /* Partition error? */ percpu_rwsem_assert_held(&cpuset_rwsem); /* * The parent must be a partition root. * The new cpumask, if present, or the current cpus_allowed must * not be empty. */ if (!is_partition_root(parent) || (newmask && cpumask_empty(newmask)) || (!newmask && cpumask_empty(cpuset->cpus_allowed))) return -EINVAL; /* * Enabling/disabling partition root is not allowed if there are * online children. */ if ((cmd != partcmd_update) && css_has_online_children(&cpuset->css)) return -EBUSY; /* * Enabling partition root is not allowed if not all the CPUs * can be granted from parent's effective_cpus or at least one * CPU will be left after that. */ if ((cmd == partcmd_enable) && (!cpumask_subset(cpuset->cpus_allowed, parent->effective_cpus) || cpumask_equal(cpuset->cpus_allowed, parent->effective_cpus))) return -EINVAL; /* * A cpumask update cannot make parent's effective_cpus become empty. */ adding = deleting = false; new_prs = cpuset->partition_root_state; if (cmd == partcmd_enable) { cpumask_copy(tmp->addmask, cpuset->cpus_allowed); adding = true; } else if (cmd == partcmd_disable) { deleting = cpumask_and(tmp->delmask, cpuset->cpus_allowed, parent->subparts_cpus); } else if (newmask) { /* * partcmd_update with newmask: * * delmask = cpus_allowed & ~newmask & parent->subparts_cpus * addmask = newmask & parent->effective_cpus * & ~parent->subparts_cpus */ cpumask_andnot(tmp->delmask, cpuset->cpus_allowed, newmask); deleting = cpumask_and(tmp->delmask, tmp->delmask, parent->subparts_cpus); cpumask_and(tmp->addmask, newmask, parent->effective_cpus); adding = cpumask_andnot(tmp->addmask, tmp->addmask, parent->subparts_cpus); /* * Return error if the new effective_cpus could become empty. */ if (adding && cpumask_equal(parent->effective_cpus, tmp->addmask)) { if (!deleting) return -EINVAL; /* * As some of the CPUs in subparts_cpus might have * been offlined, we need to compute the real delmask * to confirm that. */ if (!cpumask_and(tmp->addmask, tmp->delmask, cpu_active_mask)) return -EINVAL; cpumask_copy(tmp->addmask, parent->effective_cpus); } } else { /* * partcmd_update w/o newmask: * * addmask = cpus_allowed & parent->effective_cpus * * Note that parent's subparts_cpus may have been * pre-shrunk in case there is a change in the cpu list. * So no deletion is needed. */ adding = cpumask_and(tmp->addmask, cpuset->cpus_allowed, parent->effective_cpus); part_error = cpumask_equal(tmp->addmask, parent->effective_cpus); } if (cmd == partcmd_update) { int prev_prs = cpuset->partition_root_state; /* * Check for possible transition between PRS_ENABLED * and PRS_ERROR. */ switch (cpuset->partition_root_state) { case PRS_ENABLED: if (part_error) new_prs = PRS_ERROR; break; case PRS_ERROR: if (!part_error) new_prs = PRS_ENABLED; break; } /* * Set part_error if previously in invalid state. */ part_error = (prev_prs == PRS_ERROR); } if (!part_error && (new_prs == PRS_ERROR)) return 0; /* Nothing need to be done */ if (new_prs == PRS_ERROR) { /* * Remove all its cpus from parent's subparts_cpus. */ adding = false; deleting = cpumask_and(tmp->delmask, cpuset->cpus_allowed, parent->subparts_cpus); } if (!adding && !deleting && (new_prs == cpuset->partition_root_state)) return 0; /* * Change the parent's subparts_cpus. * Newly added CPUs will be removed from effective_cpus and * newly deleted ones will be added back to effective_cpus. */ spin_lock_irq(&callback_lock); if (adding) { cpumask_or(parent->subparts_cpus, parent->subparts_cpus, tmp->addmask); cpumask_andnot(parent->effective_cpus, parent->effective_cpus, tmp->addmask); } if (deleting) { cpumask_andnot(parent->subparts_cpus, parent->subparts_cpus, tmp->delmask); /* * Some of the CPUs in subparts_cpus might have been offlined. */ cpumask_and(tmp->delmask, tmp->delmask, cpu_active_mask); cpumask_or(parent->effective_cpus, parent->effective_cpus, tmp->delmask); } parent->nr_subparts_cpus = cpumask_weight(parent->subparts_cpus); if (cpuset->partition_root_state != new_prs) cpuset->partition_root_state = new_prs; spin_unlock_irq(&callback_lock); return cmd == partcmd_update; } /* * update_cpumasks_hier - Update effective cpumasks and tasks in the subtree * @cs: the cpuset to consider * @tmp: temp variables for calculating effective_cpus & partition setup * * When congifured cpumask is changed, the effective cpumasks of this cpuset * and all its descendants need to be updated. * * On legacy hierachy, effective_cpus will be the same with cpu_allowed. * * Called with cpuset_mutex held */ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp) { struct cpuset *cp; struct cgroup_subsys_state *pos_css; bool need_rebuild_sched_domains = false; int new_prs; rcu_read_lock(); cpuset_for_each_descendant_pre(cp, pos_css, cs) { struct cpuset *parent = parent_cs(cp); compute_effective_cpumask(tmp->new_cpus, cp, parent); /* * If it becomes empty, inherit the effective mask of the * parent, which is guaranteed to have some CPUs. */ if (is_in_v2_mode() && cpumask_empty(tmp->new_cpus)) { cpumask_copy(tmp->new_cpus, parent->effective_cpus); if (!cp->use_parent_ecpus) { cp->use_parent_ecpus = true; parent->child_ecpus_count++; } } else if (cp->use_parent_ecpus) { cp->use_parent_ecpus = false; WARN_ON_ONCE(!parent->child_ecpus_count); parent->child_ecpus_count--; } /* * Skip the whole subtree if the cpumask remains the same * and has no partition root state. */ if (!cp->partition_root_state && cpumask_equal(tmp->new_cpus, cp->effective_cpus)) { pos_css = css_rightmost_descendant(pos_css); continue; } /* * update_parent_subparts_cpumask() should have been called * for cs already in update_cpumask(). We should also call * update_tasks_cpumask() again for tasks in the parent * cpuset if the parent's subparts_cpus changes. */ new_prs = cp->partition_root_state; if ((cp != cs) && new_prs) { switch (parent->partition_root_state) { case PRS_DISABLED: /* * If parent is not a partition root or an * invalid partition root, clear its state * and its CS_CPU_EXCLUSIVE flag. */ WARN_ON_ONCE(cp->partition_root_state != PRS_ERROR); new_prs = PRS_DISABLED; /* * clear_bit() is an atomic operation and * readers aren't interested in the state * of CS_CPU_EXCLUSIVE anyway. So we can * just update the flag without holding * the callback_lock. */ clear_bit(CS_CPU_EXCLUSIVE, &cp->flags); break; case PRS_ENABLED: if (update_parent_subparts_cpumask(cp, partcmd_update, NULL, tmp)) update_tasks_cpumask(parent); break; case PRS_ERROR: /* * When parent is invalid, it has to be too. */ new_prs = PRS_ERROR; break; } } if (!css_tryget_online(&cp->css)) continue; rcu_read_unlock(); spin_lock_irq(&callback_lock); cpumask_copy(cp->effective_cpus, tmp->new_cpus); if (cp->nr_subparts_cpus && (new_prs != PRS_ENABLED)) { cp->nr_subparts_cpus = 0; cpumask_clear(cp->subparts_cpus); } else if (cp->nr_subparts_cpus) { /* * Make sure that effective_cpus & subparts_cpus * are mutually exclusive. * * In the unlikely event that effective_cpus * becomes empty. we clear cp->nr_subparts_cpus and * let its child partition roots to compete for * CPUs again. */ cpumask_andnot(cp->effective_cpus, cp->effective_cpus, cp->subparts_cpus); if (cpumask_empty(cp->effective_cpus)) { cpumask_copy(cp->effective_cpus, tmp->new_cpus); cpumask_clear(cp->subparts_cpus); cp->nr_subparts_cpus = 0; } else if (!cpumask_subset(cp->subparts_cpus, tmp->new_cpus)) { cpumask_andnot(cp->subparts_cpus, cp->subparts_cpus, tmp->new_cpus); cp->nr_subparts_cpus = cpumask_weight(cp->subparts_cpus); } } if (new_prs != cp->partition_root_state) cp->partition_root_state = new_prs; spin_unlock_irq(&callback_lock); WARN_ON(!is_in_v2_mode() && !cpumask_equal(cp->cpus_allowed, cp->effective_cpus)); update_tasks_cpumask(cp); /* * On legacy hierarchy, if the effective cpumask of any non- * empty cpuset is changed, we need to rebuild sched domains. * On default hierarchy, the cpuset needs to be a partition * root as well. */ if (!cpumask_empty(cp->cpus_allowed) && is_sched_load_balance(cp) && (!cgroup_subsys_on_dfl(cpuset_cgrp_subsys) || is_partition_root(cp))) need_rebuild_sched_domains = true; rcu_read_lock(); css_put(&cp->css); } rcu_read_unlock(); if (need_rebuild_sched_domains) rebuild_sched_domains_locked(); } /** * update_sibling_cpumasks - Update siblings cpumasks * @parent: Parent cpuset * @cs: Current cpuset * @tmp: Temp variables */ static void update_sibling_cpumasks(struct cpuset *parent, struct cpuset *cs, struct tmpmasks *tmp) { struct cpuset *sibling; struct cgroup_subsys_state *pos_css; /* * Check all its siblings and call update_cpumasks_hier() * if their use_parent_ecpus flag is set in order for them * to use the right effective_cpus value. */ rcu_read_lock(); cpuset_for_each_child(sibling, pos_css, parent) { if (sibling == cs) continue; if (!sibling->use_parent_ecpus) continue; update_cpumasks_hier(sibling, tmp); } rcu_read_unlock(); } /** * update_cpumask - update the cpus_allowed mask of a cpuset and all tasks in it * @cs: the cpuset to consider * @trialcs: trial cpuset * @buf: buffer of cpu numbers written to this cpuset */ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs, const char *buf) { int retval; struct tmpmasks tmp; /* top_cpuset.cpus_allowed tracks cpu_online_mask; it's read-only */ if (cs == &top_cpuset) return -EACCES; /* * An empty cpus_allowed is ok only if the cpuset has no tasks. * Since cpulist_parse() fails on an empty mask, we special case * that parsing. The validate_change() call ensures that cpusets * with tasks have cpus. */ if (!*buf) { cpumask_clear(trialcs->cpus_allowed); } else { retval = cpulist_parse(buf, trialcs->cpus_allowed); if (retval < 0) return retval; if (!cpumask_subset(trialcs->cpus_allowed, top_cpuset.cpus_allowed)) return -EINVAL; } /* Nothing to do if the cpus didn't change */ if (cpumask_equal(cs->cpus_allowed, trialcs->cpus_allowed)) return 0; retval = validate_change(cs, trialcs); if (retval < 0) return retval; #ifdef CONFIG_CPUMASK_OFFSTACK /* * Use the cpumasks in trialcs for tmpmasks when they are pointers * to allocated cpumasks. */ tmp.addmask = trialcs->subparts_cpus; tmp.delmask = trialcs->effective_cpus; tmp.new_cpus = trialcs->cpus_allowed; #endif if (cs->partition_root_state) { /* Cpumask of a partition root cannot be empty */ if (cpumask_empty(trialcs->cpus_allowed)) return -EINVAL; if (update_parent_subparts_cpumask(cs, partcmd_update, trialcs->cpus_allowed, &tmp) < 0) return -EINVAL; } spin_lock_irq(&callback_lock); cpumask_copy(cs->cpus_allowed, trialcs->cpus_allowed); /* * Make sure that subparts_cpus is a subset of cpus_allowed. */ if (cs->nr_subparts_cpus) { cpumask_andnot(cs->subparts_cpus, cs->subparts_cpus, cs->cpus_allowed); cs->nr_subparts_cpus = cpumask_weight(cs->subparts_cpus); } spin_unlock_irq(&callback_lock); update_cpumasks_hier(cs, &tmp); if (cs->partition_root_state) { struct cpuset *parent = parent_cs(cs); /* * For partition root, update the cpumasks of sibling * cpusets if they use parent's effective_cpus. */ if (parent->child_ecpus_count) update_sibling_cpumasks(parent, cs, &tmp); } return 0; } /* * Migrate memory region from one set of nodes to another. This is * performed asynchronously as it can be called from process migration path * holding locks involved in process management. All mm migrations are * performed in the queued order and can be waited for by flushing * cpuset_migrate_mm_wq. */ struct cpuset_migrate_mm_work { struct work_struct work; struct mm_struct *mm; nodemask_t from; nodemask_t to; }; static void cpuset_migrate_mm_workfn(struct work_struct *work) { struct cpuset_migrate_mm_work *mwork = container_of(work, struct cpuset_migrate_mm_work, work); /* on a wq worker, no need to worry about %current's mems_allowed */ do_migrate_pages(mwork->mm, &mwork->from, &mwork->to, MPOL_MF_MOVE_ALL); mmput(mwork->mm); kfree(mwork); } static void cpuset_migrate_mm(struct mm_struct *mm, const nodemask_t *from, const nodemask_t *to) { struct cpuset_migrate_mm_work *mwork; mwork = kzalloc(sizeof(*mwork), GFP_KERNEL); if (mwork) { mwork->mm = mm; mwork->from = *from; mwork->to = *to; INIT_WORK(&mwork->work, cpuset_migrate_mm_workfn); queue_work(cpuset_migrate_mm_wq, &mwork->work); } else { mmput(mm); } } static void cpuset_post_attach(void) { flush_workqueue(cpuset_migrate_mm_wq); } /* * cpuset_change_task_nodemask - change task's mems_allowed and mempolicy * @tsk: the task to change * @newmems: new nodes that the task will be set * * We use the mems_allowed_seq seqlock to safely update both tsk->mems_allowed * and rebind an eventual tasks' mempolicy. If the task is allocating in * parallel, it might temporarily see an empty intersection, which results in * a seqlock check and retry before OOM or allocation failure. */ static void cpuset_change_task_nodemask(struct task_struct *tsk, nodemask_t *newmems) { task_lock(tsk); local_irq_disable(); write_seqcount_begin(&tsk->mems_allowed_seq); nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems); mpol_rebind_task(tsk, newmems); tsk->mems_allowed = *newmems; write_seqcount_end(&tsk->mems_allowed_seq); local_irq_enable(); task_unlock(tsk); } static void *cpuset_being_rebound; /** * update_tasks_nodemask - Update the nodemasks of tasks in the cpuset. * @cs: the cpuset in which each task's mems_allowed mask needs to be changed * * Iterate through each task of @cs updating its mems_allowed to the * effective cpuset's. As this function is called with cpuset_mutex held, * cpuset membership stays stable. */ static void update_tasks_nodemask(struct cpuset *cs) { static nodemask_t newmems; /* protected by cpuset_mutex */ struct css_task_iter it; struct task_struct *task; cpuset_being_rebound = cs; /* causes mpol_dup() rebind */ guarantee_online_mems(cs, &newmems); /* * The mpol_rebind_mm() call takes mmap_lock, which we couldn't * take while holding tasklist_lock. Forks can happen - the * mpol_dup() cpuset_being_rebound check will catch such forks, * and rebind their vma mempolicies too. Because we still hold * the global cpuset_mutex, we know that no other rebind effort * will be contending for the global variable cpuset_being_rebound. * It's ok if we rebind the same mm twice; mpol_rebind_mm() * is idempotent. Also migrate pages in each mm to new nodes. */ css_task_iter_start(&cs->css, 0, &it); while ((task = css_task_iter_next(&it))) { struct mm_struct *mm; bool migrate; cpuset_change_task_nodemask(task, &newmems); mm = get_task_mm(task); if (!mm) continue; migrate = is_memory_migrate(cs); mpol_rebind_mm(mm, &cs->mems_allowed); if (migrate) cpuset_migrate_mm(mm, &cs->old_mems_allowed, &newmems); else mmput(mm); } css_task_iter_end(&it); /* * All the tasks' nodemasks have been updated, update * cs->old_mems_allowed. */ cs->old_mems_allowed = newmems; /* We're done rebinding vmas to this cpuset's new mems_allowed. */ cpuset_being_rebound = NULL; } /* * update_nodemasks_hier - Update effective nodemasks and tasks in the subtree * @cs: the cpuset to consider * @new_mems: a temp variable for calculating new effective_mems * * When configured nodemask is changed, the effective nodemasks of this cpuset * and all its descendants need to be updated. * * On legacy hiearchy, effective_mems will be the same with mems_allowed. * * Called with cpuset_mutex held */ static void update_nodemasks_hier(struct cpuset *cs, nodemask_t *new_mems) { struct cpuset *cp; struct cgroup_subsys_state *pos_css; rcu_read_lock(); cpuset_for_each_descendant_pre(cp, pos_css, cs) { struct cpuset *parent = parent_cs(cp); nodes_and(*new_mems, cp->mems_allowed, parent->effective_mems); /* * If it becomes empty, inherit the effective mask of the * parent, which is guaranteed to have some MEMs. */ if (is_in_v2_mode() && nodes_empty(*new_mems)) *new_mems = parent->effective_mems; /* Skip the whole subtree if the nodemask remains the same. */ if (nodes_equal(*new_mems, cp->effective_mems)) { pos_css = css_rightmost_descendant(pos_css); continue; } if (!css_tryget_online(&cp->css)) continue; rcu_read_unlock(); spin_lock_irq(&callback_lock); cp->effective_mems = *new_mems; spin_unlock_irq(&callback_lock); WARN_ON(!is_in_v2_mode() && !nodes_equal(cp->mems_allowed, cp->effective_mems)); update_tasks_nodemask(cp); rcu_read_lock(); css_put(&cp->css); } rcu_read_unlock(); } /* * Handle user request to change the 'mems' memory placement * of a cpuset. Needs to validate the request, update the * cpusets mems_allowed, and for each task in the cpuset, * update mems_allowed and rebind task's mempolicy and any vma * mempolicies and if the cpuset is marked 'memory_migrate', * migrate the tasks pages to the new memory. * * Call with cpuset_mutex held. May take callback_lock during call. * Will take tasklist_lock, scan tasklist for tasks in cpuset cs, * lock each such tasks mm->mmap_lock, scan its vma's and rebind * their mempolicies to the cpusets new mems_allowed. */ static int update_nodemask(struct cpuset *cs, struct cpuset *trialcs, const char *buf) { int retval; /* * top_cpuset.mems_allowed tracks node_stats[N_MEMORY]; * it's read-only */ if (cs == &top_cpuset) { retval = -EACCES; goto done; } /* * An empty mems_allowed is ok iff there are no tasks in the cpuset. * Since nodelist_parse() fails on an empty mask, we special case * that parsing. The validate_change() call ensures that cpusets * with tasks have memory. */ if (!*buf) { nodes_clear(trialcs->mems_allowed); } else { retval = nodelist_parse(buf, trialcs->mems_allowed); if (retval < 0) goto done; if (!nodes_subset(trialcs->mems_allowed, top_cpuset.mems_allowed)) { retval = -EINVAL; goto done; } } if (nodes_equal(cs->mems_allowed, trialcs->mems_allowed)) { retval = 0; /* Too easy - nothing to do */ goto done; } retval = validate_change(cs, trialcs); if (retval < 0) goto done; spin_lock_irq(&callback_lock); cs->mems_allowed = trialcs->mems_allowed; spin_unlock_irq(&callback_lock); /* use trialcs->mems_allowed as a temp variable */ update_nodemasks_hier(cs, &trialcs->mems_allowed); done: return retval; } bool current_cpuset_is_being_rebound(void) { bool ret; rcu_read_lock(); ret = task_cs(current) == cpuset_being_rebound; rcu_read_unlock(); return ret; } static int update_relax_domain_level(struct cpuset *cs, s64 val) { #ifdef CONFIG_SMP if (val < -1 || val >= sched_domain_level_max) return -EINVAL; #endif if (val != cs->relax_domain_level) { cs->relax_domain_level = val; if (!cpumask_empty(cs->cpus_allowed) && is_sched_load_balance(cs)) rebuild_sched_domains_locked(); } return 0; } /** * update_tasks_flags - update the spread flags of tasks in the cpuset. * @cs: the cpuset in which each task's spread flags needs to be changed * * Iterate through each task of @cs updating its spread flags. As this * function is called with cpuset_mutex held, cpuset membership stays * stable. */ static void update_tasks_flags(struct cpuset *cs) { struct css_task_iter it; struct task_struct *task; css_task_iter_start(&cs->css, 0, &it); while ((task = css_task_iter_next(&it))) cpuset_update_task_spread_flag(cs, task); css_task_iter_end(&it); } /* * update_flag - read a 0 or a 1 in a file and update associated flag * bit: the bit to update (see cpuset_flagbits_t) * cs: the cpuset to update * turning_on: whether the flag is being set or cleared * * Call with cpuset_mutex held. */ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs, int turning_on) { struct cpuset *trialcs; int balance_flag_changed; int spread_flag_changed; int err; trialcs = alloc_trial_cpuset(cs); if (!trialcs) return -ENOMEM; if (turning_on) set_bit(bit, &trialcs->flags); else clear_bit(bit, &trialcs->flags); err = validate_change(cs, trialcs); if (err < 0) goto out; balance_flag_changed = (is_sched_load_balance(cs) != is_sched_load_balance(trialcs)); spread_flag_changed = ((is_spread_slab(cs) != is_spread_slab(trialcs)) || (is_spread_page(cs) != is_spread_page(trialcs))); spin_lock_irq(&callback_lock); cs->flags = trialcs->flags; spin_unlock_irq(&callback_lock); if (!cpumask_empty(trialcs->cpus_allowed) && balance_flag_changed) rebuild_sched_domains_locked(); if (spread_flag_changed) update_tasks_flags(cs); out: free_cpuset(trialcs); return err; } /* * update_prstate - update partititon_root_state * cs: the cpuset to update * new_prs: new partition root state * * Call with cpuset_mutex held. */ static int update_prstate(struct cpuset *cs, int new_prs) { int err, old_prs = cs->partition_root_state; struct cpuset *parent = parent_cs(cs); struct tmpmasks tmpmask; if (old_prs == new_prs) return 0; /* * Cannot force a partial or invalid partition root to a full * partition root. */ if (new_prs && (old_prs == PRS_ERROR)) return -EINVAL; if (alloc_cpumasks(NULL, &tmpmask)) return -ENOMEM; err = -EINVAL; if (!old_prs) { /* * Turning on partition root requires setting the * CS_CPU_EXCLUSIVE bit implicitly as well and cpus_allowed * cannot be NULL. */ if (cpumask_empty(cs->cpus_allowed)) goto out; err = update_flag(CS_CPU_EXCLUSIVE, cs, 1); if (err) goto out; err = update_parent_subparts_cpumask(cs, partcmd_enable, NULL, &tmpmask); if (err) { update_flag(CS_CPU_EXCLUSIVE, cs, 0); goto out; } } else { /* * Turning off partition root will clear the * CS_CPU_EXCLUSIVE bit. */ if (old_prs == PRS_ERROR) { update_flag(CS_CPU_EXCLUSIVE, cs, 0); err = 0; goto out; } err = update_parent_subparts_cpumask(cs, partcmd_disable, NULL, &tmpmask); if (err) goto out; /* Turning off CS_CPU_EXCLUSIVE will not return error */ update_flag(CS_CPU_EXCLUSIVE, cs, 0); } /* * Update cpumask of parent's tasks except when it is the top * cpuset as some system daemons cannot be mapped to other CPUs. */ if (parent != &top_cpuset) update_tasks_cpumask(parent); if (parent->child_ecpus_count) update_sibling_cpumasks(parent, cs, &tmpmask); rebuild_sched_domains_locked(); out: if (!err) { spin_lock_irq(&callback_lock); cs->partition_root_state = new_prs; spin_unlock_irq(&callback_lock); } free_cpumasks(NULL, &tmpmask); return err; } /* * Frequency meter - How fast is some event occurring? * * These routines manage a digitally filtered, constant time based, * event frequency meter. There are four routines: * fmeter_init() - initialize a frequency meter. * fmeter_markevent() - called each time the event happens. * fmeter_getrate() - returns the recent rate of such events. * fmeter_update() - internal routine used to update fmeter. * * A common data structure is passed to each of these routines, * which is used to keep track of the state required to manage the * frequency meter and its digital filter. * * The filter works on the number of events marked per unit time. * The filter is single-pole low-pass recursive (IIR). The time unit * is 1 second. Arithmetic is done using 32-bit integers scaled to * simulate 3 decimal digits of precision (multiplied by 1000). * * With an FM_COEF of 933, and a time base of 1 second, the filter * has a half-life of 10 seconds, meaning that if the events quit * happening, then the rate returned from the fmeter_getrate() * will be cut in half each 10 seconds, until it converges to zero. * * It is not worth doing a real infinitely recursive filter. If more * than FM_MAXTICKS ticks have elapsed since the last filter event, * just compute FM_MAXTICKS ticks worth, by which point the level * will be stable. * * Limit the count of unprocessed events to FM_MAXCNT, so as to avoid * arithmetic overflow in the fmeter_update() routine. * * Given the simple 32 bit integer arithmetic used, this meter works * best for reporting rates between one per millisecond (msec) and * one per 32 (approx) seconds. At constant rates faster than one * per msec it maxes out at values just under 1,000,000. At constant * rates between one per msec, and one per second it will stabilize * to a value N*1000, where N is the rate of events per second. * At constant rates between one per second and one per 32 seconds, * it will be choppy, moving up on the seconds that have an event, * and then decaying until the next event. At rates slower than * about one in 32 seconds, it decays all the way back to zero between * each event. */ #define FM_COEF 933 /* coefficient for half-life of 10 secs */ #define FM_MAXTICKS ((u32)99) /* useless computing more ticks than this */ #define FM_MAXCNT 1000000 /* limit cnt to avoid overflow */ #define FM_SCALE 1000 /* faux fixed point scale */ /* Initialize a frequency meter */ static void fmeter_init(struct fmeter *fmp) { fmp->cnt = 0; fmp->val = 0; fmp->time = 0; spin_lock_init(&fmp->lock); } /* Internal meter update - process cnt events and update value */ static void fmeter_update(struct fmeter *fmp) { time64_t now; u32 ticks; now = ktime_get_seconds(); ticks = now - fmp->time; if (ticks == 0) return; ticks = min(FM_MAXTICKS, ticks); while (ticks-- > 0) fmp->val = (FM_COEF * fmp->val) / FM_SCALE; fmp->time = now; fmp->val += ((FM_SCALE - FM_COEF) * fmp->cnt) / FM_SCALE; fmp->cnt = 0; } /* Process any previous ticks, then bump cnt by one (times scale). */ static void fmeter_markevent(struct fmeter *fmp) { spin_lock(&fmp->lock); fmeter_update(fmp); fmp->cnt = min(FM_MAXCNT, fmp->cnt + FM_SCALE); spin_unlock(&fmp->lock); } /* Process any previous ticks, then return current value. */ static int fmeter_getrate(struct fmeter *fmp) { int val; spin_lock(&fmp->lock); fmeter_update(fmp); val = fmp->val; spin_unlock(&fmp->lock); return val; } static struct cpuset *cpuset_attach_old_cs; /* Called by cgroups to determine if a cpuset is usable; cpuset_mutex held */ static int cpuset_can_attach(struct cgroup_taskset *tset) { struct cgroup_subsys_state *css; struct cpuset *cs; struct task_struct *task; int ret; /* used later by cpuset_attach() */ cpuset_attach_old_cs = task_cs(cgroup_taskset_first(tset, &css)); cs = css_cs(css); percpu_down_write(&cpuset_rwsem); /* allow moving tasks into an empty cpuset if on default hierarchy */ ret = -ENOSPC; if (!is_in_v2_mode() && (cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed))) goto out_unlock; cgroup_taskset_for_each(task, css, tset) { ret = task_can_attach(task, cs->cpus_allowed); if (ret) goto out_unlock; ret = security_task_setscheduler(task); if (ret) goto out_unlock; } /* * Mark attach is in progress. This makes validate_change() fail * changes which zero cpus/mems_allowed. */ cs->attach_in_progress++; ret = 0; out_unlock: percpu_up_write(&cpuset_rwsem); return ret; } static void cpuset_cancel_attach(struct cgroup_taskset *tset) { struct cgroup_subsys_state *css; cgroup_taskset_first(tset, &css); percpu_down_write(&cpuset_rwsem); css_cs(css)->attach_in_progress--; percpu_up_write(&cpuset_rwsem); } /* * Protected by cpuset_mutex. cpus_attach is used only by cpuset_attach() * but we can't allocate it dynamically there. Define it global and * allocate from cpuset_init(). */ static cpumask_var_t cpus_attach; static void cpuset_attach(struct cgroup_taskset *tset) { /* static buf protected by cpuset_mutex */ static nodemask_t cpuset_attach_nodemask_to; struct task_struct *task; struct task_struct *leader; struct cgroup_subsys_state *css; struct cpuset *cs; struct cpuset *oldcs = cpuset_attach_old_cs; cgroup_taskset_first(tset, &css); cs = css_cs(css); percpu_down_write(&cpuset_rwsem); /* prepare for attach */ if (cs == &top_cpuset) cpumask_copy(cpus_attach, cpu_possible_mask); else guarantee_online_cpus(cs, cpus_attach); guarantee_online_mems(cs, &cpuset_attach_nodemask_to); cgroup_taskset_for_each(task, css, tset) { /* * can_attach beforehand should guarantee that this doesn't * fail. TODO: have a better way to handle failure here */ WARN_ON_ONCE(set_cpus_allowed_ptr(task, cpus_attach)); cpuset_change_task_nodemask(task, &cpuset_attach_nodemask_to); cpuset_update_task_spread_flag(cs, task); } /* * Change mm for all threadgroup leaders. This is expensive and may * sleep and should be moved outside migration path proper. */ cpuset_attach_nodemask_to = cs->effective_mems; cgroup_taskset_for_each_leader(leader, css, tset) { struct mm_struct *mm = get_task_mm(leader); if (mm) { mpol_rebind_mm(mm, &cpuset_attach_nodemask_to); /* * old_mems_allowed is the same with mems_allowed * here, except if this task is being moved * automatically due to hotplug. In that case * @mems_allowed has been updated and is empty, so * @old_mems_allowed is the right nodesets that we * migrate mm from. */ if (is_memory_migrate(cs)) cpuset_migrate_mm(mm, &oldcs->old_mems_allowed, &cpuset_attach_nodemask_to); else mmput(mm); } } cs->old_mems_allowed = cpuset_attach_nodemask_to; cs->attach_in_progress--; if (!cs->attach_in_progress) wake_up(&cpuset_attach_wq); percpu_up_write(&cpuset_rwsem); } /* The various types of files and directories in a cpuset file system */ typedef enum { FILE_MEMORY_MIGRATE, FILE_CPULIST, FILE_MEMLIST, FILE_EFFECTIVE_CPULIST, FILE_EFFECTIVE_MEMLIST, FILE_SUBPARTS_CPULIST, FILE_CPU_EXCLUSIVE, FILE_MEM_EXCLUSIVE, FILE_MEM_HARDWALL, FILE_SCHED_LOAD_BALANCE, FILE_PARTITION_ROOT, FILE_SCHED_RELAX_DOMAIN_LEVEL, FILE_MEMORY_PRESSURE_ENABLED, FILE_MEMORY_PRESSURE, FILE_SPREAD_PAGE, FILE_SPREAD_SLAB, } cpuset_filetype_t; static int cpuset_write_u64(struct cgroup_subsys_state *css, struct cftype *cft, u64 val) { struct cpuset *cs = css_cs(css); cpuset_filetype_t type = cft->private; int retval = 0; get_online_cpus(); percpu_down_write(&cpuset_rwsem); if (!is_cpuset_online(cs)) { retval = -ENODEV; goto out_unlock; } switch (type) { case FILE_CPU_EXCLUSIVE: retval = update_flag(CS_CPU_EXCLUSIVE, cs, val); break; case FILE_MEM_EXCLUSIVE: retval = update_flag(CS_MEM_EXCLUSIVE, cs, val); break; case FILE_MEM_HARDWALL: retval = update_flag(CS_MEM_HARDWALL, cs, val); break; case FILE_SCHED_LOAD_BALANCE: retval = update_flag(CS_SCHED_LOAD_BALANCE, cs, val); break; case FILE_MEMORY_MIGRATE: retval = update_flag(CS_MEMORY_MIGRATE, cs, val); break; case FILE_MEMORY_PRESSURE_ENABLED: cpuset_memory_pressure_enabled = !!val; break; case FILE_SPREAD_PAGE: retval = update_flag(CS_SPREAD_PAGE, cs, val); break; case FILE_SPREAD_SLAB: retval = update_flag(CS_SPREAD_SLAB, cs, val); break; default: retval = -EINVAL; break; } out_unlock: percpu_up_write(&cpuset_rwsem); put_online_cpus(); return retval; } static int cpuset_write_s64(struct cgroup_subsys_state *css, struct cftype *cft, s64 val) { struct cpuset *cs = css_cs(css); cpuset_filetype_t type = cft->private; int retval = -ENODEV; get_online_cpus(); percpu_down_write(&cpuset_rwsem); if (!is_cpuset_online(cs)) goto out_unlock; switch (type) { case FILE_SCHED_RELAX_DOMAIN_LEVEL: retval = update_relax_domain_level(cs, val); break; default: retval = -EINVAL; break; } out_unlock: percpu_up_write(&cpuset_rwsem); put_online_cpus(); return retval; } /* * Common handling for a write to a "cpus" or "mems" file. */ static ssize_t cpuset_write_resmask(struct kernfs_open_file *of, char *buf, size_t nbytes, loff_t off) { struct cpuset *cs = css_cs(of_css(of)); struct cpuset *trialcs; int retval = -ENODEV; buf = strstrip(buf); /* * CPU or memory hotunplug may leave @cs w/o any execution * resources, in which case the hotplug code asynchronously updates * configuration and transfers all tasks to the nearest ancestor * which can execute. * * As writes to "cpus" or "mems" may restore @cs's execution * resources, wait for the previously scheduled operations before * proceeding, so that we don't end up keep removing tasks added * after execution capability is restored. * * cpuset_hotplug_work calls back into cgroup core via * cgroup_transfer_tasks() and waiting for it from a cgroupfs * operation like this one can lead to a deadlock through kernfs * active_ref protection. Let's break the protection. Losing the * protection is okay as we check whether @cs is online after * grabbing cpuset_mutex anyway. This only happens on the legacy * hierarchies. */ css_get(&cs->css); kernfs_break_active_protection(of->kn); flush_work(&cpuset_hotplug_work); get_online_cpus(); percpu_down_write(&cpuset_rwsem); if (!is_cpuset_online(cs)) goto out_unlock; trialcs = alloc_trial_cpuset(cs); if (!trialcs) { retval = -ENOMEM; goto out_unlock; } switch (of_cft(of)->private) { case FILE_CPULIST: retval = update_cpumask(cs, trialcs, buf); break; case FILE_MEMLIST: retval = update_nodemask(cs, trialcs, buf); break; default: retval = -EINVAL; break; } free_cpuset(trialcs); out_unlock: percpu_up_write(&cpuset_rwsem); put_online_cpus(); kernfs_unbreak_active_protection(of->kn); css_put(&cs->css); flush_workqueue(cpuset_migrate_mm_wq); return retval ?: nbytes; } /* * These ascii lists should be read in a single call, by using a user * buffer large enough to hold the entire map. If read in smaller * chunks, there is no guarantee of atomicity. Since the display format * used, list of ranges of sequential numbers, is variable length, * and since these maps can change value dynamically, one could read * gibberish by doing partial reads while a list was changing. */ static int cpuset_common_seq_show(struct seq_file *sf, void *v) { struct cpuset *cs = css_cs(seq_css(sf)); cpuset_filetype_t type = seq_cft(sf)->private; int ret = 0; spin_lock_irq(&callback_lock); switch (type) { case FILE_CPULIST: seq_printf(sf, "%*pbl\n", cpumask_pr_args(cs->cpus_allowed)); break; case FILE_MEMLIST: seq_printf(sf, "%*pbl\n", nodemask_pr_args(&cs->mems_allowed)); break; case FILE_EFFECTIVE_CPULIST: seq_printf(sf, "%*pbl\n", cpumask_pr_args(cs->effective_cpus)); break; case FILE_EFFECTIVE_MEMLIST: seq_printf(sf, "%*pbl\n", nodemask_pr_args(&cs->effective_mems)); break; case FILE_SUBPARTS_CPULIST: seq_printf(sf, "%*pbl\n", cpumask_pr_args(cs->subparts_cpus)); break; default: ret = -EINVAL; } spin_unlock_irq(&callback_lock); return ret; } static u64 cpuset_read_u64(struct cgroup_subsys_state *css, struct cftype *cft) { struct cpuset *cs = css_cs(css); cpuset_filetype_t type = cft->private; switch (type) { case FILE_CPU_EXCLUSIVE: return is_cpu_exclusive(cs); case FILE_MEM_EXCLUSIVE: return is_mem_exclusive(cs); case FILE_MEM_HARDWALL: return is_mem_hardwall(cs); case FILE_SCHED_LOAD_BALANCE: return is_sched_load_balance(cs); case FILE_MEMORY_MIGRATE: return is_memory_migrate(cs); case FILE_MEMORY_PRESSURE_ENABLED: return cpuset_memory_pressure_enabled; case FILE_MEMORY_PRESSURE: return fmeter_getrate(&cs->fmeter); case FILE_SPREAD_PAGE: return is_spread_page(cs); case FILE_SPREAD_SLAB: return is_spread_slab(cs); default: BUG(); } /* Unreachable but makes gcc happy */ return 0; } static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft) { struct cpuset *cs = css_cs(css); cpuset_filetype_t type = cft->private; switch (type) { case FILE_SCHED_RELAX_DOMAIN_LEVEL: return cs->relax_domain_level; default: BUG(); } /* Unrechable but makes gcc happy */ return 0; } static int sched_partition_show(struct seq_file *seq, void *v) { struct cpuset *cs = css_cs(seq_css(seq)); switch (cs->partition_root_state) { case PRS_ENABLED: seq_puts(seq, "root\n"); break; case PRS_DISABLED: seq_puts(seq, "member\n"); break; case PRS_ERROR: seq_puts(seq, "root invalid\n"); break; } return 0; } static ssize_t sched_partition_write(struct kernfs_open_file *of, char *buf, size_t nbytes, loff_t off) { struct cpuset *cs = css_cs(of_css(of)); int val; int retval = -ENODEV; buf = strstrip(buf); /* * Convert "root" to ENABLED, and convert "member" to DISABLED. */ if (!strcmp(buf, "root")) val = PRS_ENABLED; else if (!strcmp(buf, "member")) val = PRS_DISABLED; else return -EINVAL; css_get(&cs->css); get_online_cpus(); percpu_down_write(&cpuset_rwsem); if (!is_cpuset_online(cs)) goto out_unlock; retval = update_prstate(cs, val); out_unlock: percpu_up_write(&cpuset_rwsem); put_online_cpus(); css_put(&cs->css); return retval ?: nbytes; } /* * for the common functions, 'private' gives the type of file */ static struct cftype legacy_files[] = { { .name = "cpus", .seq_show = cpuset_common_seq_show, .write = cpuset_write_resmask, .max_write_len = (100U + 6 * NR_CPUS), .private = FILE_CPULIST, }, { .name = "mems", .seq_show = cpuset_common_seq_show, .write = cpuset_write_resmask, .max_write_len = (100U + 6 * MAX_NUMNODES), .private = FILE_MEMLIST, }, { .name = "effective_cpus", .seq_show = cpuset_common_seq_show, .private = FILE_EFFECTIVE_CPULIST, }, { .name = "effective_mems", .seq_show = cpuset_common_seq_show, .private = FILE_EFFECTIVE_MEMLIST, }, { .name = "cpu_exclusive", .read_u64 = cpuset_read_u64, .write_u64 = cpuset_write_u64, .private = FILE_CPU_EXCLUSIVE, }, { .name = "mem_exclusive", .read_u64 = cpuset_read_u64, .write_u64 = cpuset_write_u64, .private = FILE_MEM_EXCLUSIVE, }, { .name = "mem_hardwall", .read_u64 = cpuset_read_u64, .write_u64 = cpuset_write_u64, .private = FILE_MEM_HARDWALL, }, { .name = "sched_load_balance", .read_u64 = cpuset_read_u64, .write_u64 = cpuset_write_u64, .private = FILE_SCHED_LOAD_BALANCE, }, { .name = "sched_relax_domain_level", .read_s64 = cpuset_read_s64, .write_s64 = cpuset_write_s64, .private = FILE_SCHED_RELAX_DOMAIN_LEVEL, }, { .name = "memory_migrate", .read_u64 = cpuset_read_u64, .write_u64 = cpuset_write_u64, .private = FILE_MEMORY_MIGRATE, }, { .name = "memory_pressure", .read_u64 = cpuset_read_u64, .private = FILE_MEMORY_PRESSURE, }, { .name = "memory_spread_page", .read_u64 = cpuset_read_u64, .write_u64 = cpuset_write_u64, .private = FILE_SPREAD_PAGE, }, { .name = "memory_spread_slab", .read_u64 = cpuset_read_u64, .write_u64 = cpuset_write_u64, .private = FILE_SPREAD_SLAB, }, { .name = "memory_pressure_enabled", .flags = CFTYPE_ONLY_ON_ROOT, .read_u64 = cpuset_read_u64, .write_u64 = cpuset_write_u64, .private = FILE_MEMORY_PRESSURE_ENABLED, }, { } /* terminate */ }; /* * This is currently a minimal set for the default hierarchy. It can be * expanded later on by migrating more features and control files from v1. */ static struct cftype dfl_files[] = { { .name = "cpus", .seq_show = cpuset_common_seq_show, .write = cpuset_write_resmask, .max_write_len = (100U + 6 * NR_CPUS), .private = FILE_CPULIST, .flags = CFTYPE_NOT_ON_ROOT, }, { .name = "mems", .seq_show = cpuset_common_seq_show, .write = cpuset_write_resmask, .max_write_len = (100U + 6 * MAX_NUMNODES), .private = FILE_MEMLIST, .flags = CFTYPE_NOT_ON_ROOT, }, { .name = "cpus.effective", .seq_show = cpuset_common_seq_show, .private = FILE_EFFECTIVE_CPULIST, }, { .name = "mems.effective", .seq_show = cpuset_common_seq_show, .private = FILE_EFFECTIVE_MEMLIST, }, { .name = "cpus.partition", .seq_show = sched_partition_show, .write = sched_partition_write, .private = FILE_PARTITION_ROOT, .flags = CFTYPE_NOT_ON_ROOT, }, { .name = "cpus.subpartitions", .seq_show = cpuset_common_seq_show, .private = FILE_SUBPARTS_CPULIST, .flags = CFTYPE_DEBUG, }, { } /* terminate */ }; /* * cpuset_css_alloc - allocate a cpuset css * cgrp: control group that the new cpuset will be part of */ static struct cgroup_subsys_state * cpuset_css_alloc(struct cgroup_subsys_state *parent_css) { struct cpuset *cs; if (!parent_css) return &top_cpuset.css; cs = kzalloc(sizeof(*cs), GFP_KERNEL); if (!cs) return ERR_PTR(-ENOMEM); if (alloc_cpumasks(cs, NULL)) { kfree(cs); return ERR_PTR(-ENOMEM); } set_bit(CS_SCHED_LOAD_BALANCE, &cs->flags); nodes_clear(cs->mems_allowed); nodes_clear(cs->effective_mems); fmeter_init(&cs->fmeter); cs->relax_domain_level = -1; return &cs->css; } static int cpuset_css_online(struct cgroup_subsys_state *css) { struct cpuset *cs = css_cs(css); struct cpuset *parent = parent_cs(cs); struct cpuset *tmp_cs; struct cgroup_subsys_state *pos_css; if (!parent) return 0; get_online_cpus(); percpu_down_write(&cpuset_rwsem); set_bit(CS_ONLINE, &cs->flags); if (is_spread_page(parent)) set_bit(CS_SPREAD_PAGE, &cs->flags); if (is_spread_slab(parent)) set_bit(CS_SPREAD_SLAB, &cs->flags); cpuset_inc(); spin_lock_irq(&callback_lock); if (is_in_v2_mode()) { cpumask_copy(cs->effective_cpus, parent->effective_cpus); cs->effective_mems = parent->effective_mems; cs->use_parent_ecpus = true; parent->child_ecpus_count++; } spin_unlock_irq(&callback_lock); if (!test_bit(CGRP_CPUSET_CLONE_CHILDREN, &css->cgroup->flags)) goto out_unlock; /* * Clone @parent's configuration if CGRP_CPUSET_CLONE_CHILDREN is * set. This flag handling is implemented in cgroup core for * histrical reasons - the flag may be specified during mount. * * Currently, if any sibling cpusets have exclusive cpus or mem, we * refuse to clone the configuration - thereby refusing the task to * be entered, and as a result refusing the sys_unshare() or * clone() which initiated it. If this becomes a problem for some * users who wish to allow that scenario, then this could be * changed to grant parent->cpus_allowed-sibling_cpus_exclusive * (and likewise for mems) to the new cgroup. */ rcu_read_lock(); cpuset_for_each_child(tmp_cs, pos_css, parent) { if (is_mem_exclusive(tmp_cs) || is_cpu_exclusive(tmp_cs)) { rcu_read_unlock(); goto out_unlock; } } rcu_read_unlock(); spin_lock_irq(&callback_lock); cs->mems_allowed = parent->mems_allowed; cs->effective_mems = parent->mems_allowed; cpumask_copy(cs->cpus_allowed, parent->cpus_allowed); cpumask_copy(cs->effective_cpus, parent->cpus_allowed); spin_unlock_irq(&callback_lock); out_unlock: percpu_up_write(&cpuset_rwsem); put_online_cpus(); return 0; } /* * If the cpuset being removed has its flag 'sched_load_balance' * enabled, then simulate turning sched_load_balance off, which * will call rebuild_sched_domains_locked(). That is not needed * in the default hierarchy where only changes in partition * will cause repartitioning. * * If the cpuset has the 'sched.partition' flag enabled, simulate * turning 'sched.partition" off. */ static void cpuset_css_offline(struct cgroup_subsys_state *css) { struct cpuset *cs = css_cs(css); get_online_cpus(); percpu_down_write(&cpuset_rwsem); if (is_partition_root(cs)) update_prstate(cs, 0); if (!cgroup_subsys_on_dfl(cpuset_cgrp_subsys) && is_sched_load_balance(cs)) update_flag(CS_SCHED_LOAD_BALANCE, cs, 0); if (cs->use_parent_ecpus) { struct cpuset *parent = parent_cs(cs); cs->use_parent_ecpus = false; parent->child_ecpus_count--; } cpuset_dec(); clear_bit(CS_ONLINE, &cs->flags); percpu_up_write(&cpuset_rwsem); put_online_cpus(); } static void cpuset_css_free(struct cgroup_subsys_state *css) { struct cpuset *cs = css_cs(css); free_cpuset(cs); } static void cpuset_bind(struct cgroup_subsys_state *root_css) { percpu_down_write(&cpuset_rwsem); spin_lock_irq(&callback_lock); if (is_in_v2_mode()) { cpumask_copy(top_cpuset.cpus_allowed, cpu_possible_mask); top_cpuset.mems_allowed = node_possible_map; } else { cpumask_copy(top_cpuset.cpus_allowed, top_cpuset.effective_cpus); top_cpuset.mems_allowed = top_cpuset.effective_mems; } spin_unlock_irq(&callback_lock); percpu_up_write(&cpuset_rwsem); } /* * Make sure the new task conform to the current state of its parent, * which could have been changed by cpuset just after it inherits the * state from the parent and before it sits on the cgroup's task list. */ static void cpuset_fork(struct task_struct *task) { if (task_css_is_root(task, cpuset_cgrp_id)) return; set_cpus_allowed_ptr(task, current->cpus_ptr); task->mems_allowed = current->mems_allowed; } struct cgroup_subsys cpuset_cgrp_subsys = { .css_alloc = cpuset_css_alloc, .css_online = cpuset_css_online, .css_offline = cpuset_css_offline, .css_free = cpuset_css_free, .can_attach = cpuset_can_attach, .cancel_attach = cpuset_cancel_attach, .attach = cpuset_attach, .post_attach = cpuset_post_attach, .bind = cpuset_bind, .fork = cpuset_fork, .legacy_cftypes = legacy_files, .dfl_cftypes = dfl_files, .early_init = true, .threaded = true, }; /** * cpuset_init - initialize cpusets at system boot * * Description: Initialize top_cpuset **/ int __init cpuset_init(void) { BUG_ON(percpu_init_rwsem(&cpuset_rwsem)); BUG_ON(!alloc_cpumask_var(&top_cpuset.cpus_allowed, GFP_KERNEL)); BUG_ON(!alloc_cpumask_var(&top_cpuset.effective_cpus, GFP_KERNEL)); BUG_ON(!zalloc_cpumask_var(&top_cpuset.subparts_cpus, GFP_KERNEL)); cpumask_setall(top_cpuset.cpus_allowed); nodes_setall(top_cpuset.mems_allowed); cpumask_setall(top_cpuset.effective_cpus); nodes_setall(top_cpuset.effective_mems); fmeter_init(&top_cpuset.fmeter); set_bit(CS_SCHED_LOAD_BALANCE, &top_cpuset.flags); top_cpuset.relax_domain_level = -1; BUG_ON(!alloc_cpumask_var(&cpus_attach, GFP_KERNEL)); return 0; } /* * If CPU and/or memory hotplug handlers, below, unplug any CPUs * or memory nodes, we need to walk over the cpuset hierarchy, * removing that CPU or node from all cpusets. If this removes the * last CPU or node from a cpuset, then move the tasks in the empty * cpuset to its next-highest non-empty parent. */ static void remove_tasks_in_empty_cpuset(struct cpuset *cs) { struct cpuset *parent; /* * Find its next-highest non-empty parent, (top cpuset * has online cpus, so can't be empty). */ parent = parent_cs(cs); while (cpumask_empty(parent->cpus_allowed) || nodes_empty(parent->mems_allowed)) parent = parent_cs(parent); if (cgroup_transfer_tasks(parent->css.cgroup, cs->css.cgroup)) { pr_err("cpuset: failed to transfer tasks out of empty cpuset "); pr_cont_cgroup_name(cs->css.cgroup); pr_cont("\n"); } } static void hotplug_update_tasks_legacy(struct cpuset *cs, struct cpumask *new_cpus, nodemask_t *new_mems, bool cpus_updated, bool mems_updated) { bool is_empty; spin_lock_irq(&callback_lock); cpumask_copy(cs->cpus_allowed, new_cpus); cpumask_copy(cs->effective_cpus, new_cpus); cs->mems_allowed = *new_mems; cs->effective_mems = *new_mems; spin_unlock_irq(&callback_lock); /* * Don't call update_tasks_cpumask() if the cpuset becomes empty, * as the tasks will be migratecd to an ancestor. */ if (cpus_updated && !cpumask_empty(cs->cpus_allowed)) update_tasks_cpumask(cs); if (mems_updated && !nodes_empty(cs->mems_allowed)) update_tasks_nodemask(cs); is_empty = cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed); percpu_up_write(&cpuset_rwsem); /* * Move tasks to the nearest ancestor with execution resources, * This is full cgroup operation which will also call back into * cpuset. Should be done outside any lock. */ if (is_empty) remove_tasks_in_empty_cpuset(cs); percpu_down_write(&cpuset_rwsem); } static void hotplug_update_tasks(struct cpuset *cs, struct cpumask *new_cpus, nodemask_t *new_mems, bool cpus_updated, bool mems_updated) { if (cpumask_empty(new_cpus)) cpumask_copy(new_cpus, parent_cs(cs)->effective_cpus); if (nodes_empty(*new_mems)) *new_mems = parent_cs(cs)->effective_mems; spin_lock_irq(&callback_lock); cpumask_copy(cs->effective_cpus, new_cpus); cs->effective_mems = *new_mems; spin_unlock_irq(&callback_lock); if (cpus_updated) update_tasks_cpumask(cs); if (mems_updated) update_tasks_nodemask(cs); } static bool force_rebuild; void cpuset_force_rebuild(void) { force_rebuild = true; } /** * cpuset_hotplug_update_tasks - update tasks in a cpuset for hotunplug * @cs: cpuset in interest * @tmp: the tmpmasks structure pointer * * Compare @cs's cpu and mem masks against top_cpuset and if some have gone * offline, update @cs accordingly. If @cs ends up with no CPU or memory, * all its tasks are moved to the nearest ancestor with both resources. */ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp) { static cpumask_t new_cpus; static nodemask_t new_mems; bool cpus_updated; bool mems_updated; struct cpuset *parent; retry: wait_event(cpuset_attach_wq, cs->attach_in_progress == 0); percpu_down_write(&cpuset_rwsem); /* * We have raced with task attaching. We wait until attaching * is finished, so we won't attach a task to an empty cpuset. */ if (cs->attach_in_progress) { percpu_up_write(&cpuset_rwsem); goto retry; } parent = parent_cs(cs); compute_effective_cpumask(&new_cpus, cs, parent); nodes_and(new_mems, cs->mems_allowed, parent->effective_mems); if (cs->nr_subparts_cpus) /* * Make sure that CPUs allocated to child partitions * do not show up in effective_cpus. */ cpumask_andnot(&new_cpus, &new_cpus, cs->subparts_cpus); if (!tmp || !cs->partition_root_state) goto update_tasks; /* * In the unlikely event that a partition root has empty * effective_cpus or its parent becomes erroneous, we have to * transition it to the erroneous state. */ if (is_partition_root(cs) && (cpumask_empty(&new_cpus) || (parent->partition_root_state == PRS_ERROR))) { if (cs->nr_subparts_cpus) { spin_lock_irq(&callback_lock); cs->nr_subparts_cpus = 0; cpumask_clear(cs->subparts_cpus); spin_unlock_irq(&callback_lock); compute_effective_cpumask(&new_cpus, cs, parent); } /* * If the effective_cpus is empty because the child * partitions take away all the CPUs, we can keep * the current partition and let the child partitions * fight for available CPUs. */ if ((parent->partition_root_state == PRS_ERROR) || cpumask_empty(&new_cpus)) { update_parent_subparts_cpumask(cs, partcmd_disable, NULL, tmp); spin_lock_irq(&callback_lock); cs->partition_root_state = PRS_ERROR; spin_unlock_irq(&callback_lock); } cpuset_force_rebuild(); } /* * On the other hand, an erroneous partition root may be transitioned * back to a regular one or a partition root with no CPU allocated * from the parent may change to erroneous. */ if (is_partition_root(parent) && ((cs->partition_root_state == PRS_ERROR) || !cpumask_intersects(&new_cpus, parent->subparts_cpus)) && update_parent_subparts_cpumask(cs, partcmd_update, NULL, tmp)) cpuset_force_rebuild(); update_tasks: cpus_updated = !cpumask_equal(&new_cpus, cs->effective_cpus); mems_updated = !nodes_equal(new_mems, cs->effective_mems); if (is_in_v2_mode()) hotplug_update_tasks(cs, &new_cpus, &new_mems, cpus_updated, mems_updated); else hotplug_update_tasks_legacy(cs, &new_cpus, &new_mems, cpus_updated, mems_updated); percpu_up_write(&cpuset_rwsem); } /** * cpuset_hotplug_workfn - handle CPU/memory hotunplug for a cpuset * * This function is called after either CPU or memory configuration has * changed and updates cpuset accordingly. The top_cpuset is always * synchronized to cpu_active_mask and N_MEMORY, which is necessary in * order to make cpusets transparent (of no affect) on systems that are * actively using CPU hotplug but making no active use of cpusets. * * Non-root cpusets are only affected by offlining. If any CPUs or memory * nodes have been taken down, cpuset_hotplug_update_tasks() is invoked on * all descendants. * * Note that CPU offlining during suspend is ignored. We don't modify * cpusets across suspend/resume cycles at all. */ static void cpuset_hotplug_workfn(struct work_struct *work) { static cpumask_t new_cpus; static nodemask_t new_mems; bool cpus_updated, mems_updated; bool on_dfl = is_in_v2_mode(); struct tmpmasks tmp, *ptmp = NULL; if (on_dfl && !alloc_cpumasks(NULL, &tmp)) ptmp = &tmp; percpu_down_write(&cpuset_rwsem); /* fetch the available cpus/mems and find out which changed how */ cpumask_copy(&new_cpus, cpu_active_mask); new_mems = node_states[N_MEMORY]; /* * If subparts_cpus is populated, it is likely that the check below * will produce a false positive on cpus_updated when the cpu list * isn't changed. It is extra work, but it is better to be safe. */ cpus_updated = !cpumask_equal(top_cpuset.effective_cpus, &new_cpus); mems_updated = !nodes_equal(top_cpuset.effective_mems, new_mems); /* * In the rare case that hotplug removes all the cpus in subparts_cpus, * we assumed that cpus are updated. */ if (!cpus_updated && top_cpuset.nr_subparts_cpus) cpus_updated = true; /* synchronize cpus_allowed to cpu_active_mask */ if (cpus_updated) { spin_lock_irq(&callback_lock); if (!on_dfl) cpumask_copy(top_cpuset.cpus_allowed, &new_cpus); /* * Make sure that CPUs allocated to child partitions * do not show up in effective_cpus. If no CPU is left, * we clear the subparts_cpus & let the child partitions * fight for the CPUs again. */ if (top_cpuset.nr_subparts_cpus) { if (cpumask_subset(&new_cpus, top_cpuset.subparts_cpus)) { top_cpuset.nr_subparts_cpus = 0; cpumask_clear(top_cpuset.subparts_cpus); } else { cpumask_andnot(&new_cpus, &new_cpus, top_cpuset.subparts_cpus); } } cpumask_copy(top_cpuset.effective_cpus, &new_cpus); spin_unlock_irq(&callback_lock); /* we don't mess with cpumasks of tasks in top_cpuset */ } /* synchronize mems_allowed to N_MEMORY */ if (mems_updated) { spin_lock_irq(&callback_lock); if (!on_dfl) top_cpuset.mems_allowed = new_mems; top_cpuset.effective_mems = new_mems; spin_unlock_irq(&callback_lock); update_tasks_nodemask(&top_cpuset); } percpu_up_write(&cpuset_rwsem); /* if cpus or mems changed, we need to propagate to descendants */ if (cpus_updated || mems_updated) { struct cpuset *cs; struct cgroup_subsys_state *pos_css; rcu_read_lock(); cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) { if (cs == &top_cpuset || !css_tryget_online(&cs->css)) continue; rcu_read_unlock(); cpuset_hotplug_update_tasks(cs, ptmp); rcu_read_lock(); css_put(&cs->css); } rcu_read_unlock(); } /* rebuild sched domains if cpus_allowed has changed */ if (cpus_updated || force_rebuild) { force_rebuild = false; rebuild_sched_domains(); } free_cpumasks(NULL, ptmp); } void cpuset_update_active_cpus(void) { /* * We're inside cpu hotplug critical region which usually nests * inside cgroup synchronization. Bounce actual hotplug processing * to a work item to avoid reverse locking order. */ schedule_work(&cpuset_hotplug_work); } void cpuset_wait_for_hotplug(void) { flush_work(&cpuset_hotplug_work); } /* * Keep top_cpuset.mems_allowed tracking node_states[N_MEMORY]. * Call this routine anytime after node_states[N_MEMORY] changes. * See cpuset_update_active_cpus() for CPU hotplug handling. */ static int cpuset_track_online_nodes(struct notifier_block *self, unsigned long action, void *arg) { schedule_work(&cpuset_hotplug_work); return NOTIFY_OK; } static struct notifier_block cpuset_track_online_nodes_nb = { .notifier_call = cpuset_track_online_nodes, .priority = 10, /* ??! */ }; /** * cpuset_init_smp - initialize cpus_allowed * * Description: Finish top cpuset after cpu, node maps are initialized */ void __init cpuset_init_smp(void) { cpumask_copy(top_cpuset.cpus_allowed, cpu_active_mask); top_cpuset.mems_allowed = node_states[N_MEMORY]; top_cpuset.old_mems_allowed = top_cpuset.mems_allowed; cpumask_copy(top_cpuset.effective_cpus, cpu_active_mask); top_cpuset.effective_mems = node_states[N_MEMORY]; register_hotmemory_notifier(&cpuset_track_online_nodes_nb); cpuset_migrate_mm_wq = alloc_ordered_workqueue("cpuset_migrate_mm", 0); BUG_ON(!cpuset_migrate_mm_wq); } /** * cpuset_cpus_allowed - return cpus_allowed mask from a tasks cpuset. * @tsk: pointer to task_struct from which to obtain cpuset->cpus_allowed. * @pmask: pointer to struct cpumask variable to receive cpus_allowed set. * * Description: Returns the cpumask_var_t cpus_allowed of the cpuset * attached to the specified @tsk. Guaranteed to return some non-empty * subset of cpu_online_mask, even if this means going outside the * tasks cpuset. **/ void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask) { unsigned long flags; spin_lock_irqsave(&callback_lock, flags); rcu_read_lock(); guarantee_online_cpus(task_cs(tsk), pmask); rcu_read_unlock(); spin_unlock_irqrestore(&callback_lock, flags); } /** * cpuset_cpus_allowed_fallback - final fallback before complete catastrophe. * @tsk: pointer to task_struct with which the scheduler is struggling * * Description: In the case that the scheduler cannot find an allowed cpu in * tsk->cpus_allowed, we fall back to task_cs(tsk)->cpus_allowed. In legacy * mode however, this value is the same as task_cs(tsk)->effective_cpus, * which will not contain a sane cpumask during cases such as cpu hotplugging. * This is the absolute last resort for the scheduler and it is only used if * _every_ other avenue has been traveled. **/ void cpuset_cpus_allowed_fallback(struct task_struct *tsk) { rcu_read_lock(); do_set_cpus_allowed(tsk, is_in_v2_mode() ? task_cs(tsk)->cpus_allowed : cpu_possible_mask); rcu_read_unlock(); /* * We own tsk->cpus_allowed, nobody can change it under us. * * But we used cs && cs->cpus_allowed lockless and thus can * race with cgroup_attach_task() or update_cpumask() and get * the wrong tsk->cpus_allowed. However, both cases imply the * subsequent cpuset_change_cpumask()->set_cpus_allowed_ptr() * which takes task_rq_lock(). * * If we are called after it dropped the lock we must see all * changes in tsk_cs()->cpus_allowed. Otherwise we can temporary * set any mask even if it is not right from task_cs() pov, * the pending set_cpus_allowed_ptr() will fix things. * * select_fallback_rq() will fix things ups and set cpu_possible_mask * if required. */ } void __init cpuset_init_current_mems_allowed(void) { nodes_setall(current->mems_allowed); } /** * cpuset_mems_allowed - return mems_allowed mask from a tasks cpuset. * @tsk: pointer to task_struct from which to obtain cpuset->mems_allowed. * * Description: Returns the nodemask_t mems_allowed of the cpuset * attached to the specified @tsk. Guaranteed to return some non-empty * subset of node_states[N_MEMORY], even if this means going outside the * tasks cpuset. **/ nodemask_t cpuset_mems_allowed(struct task_struct *tsk) { nodemask_t mask; unsigned long flags; spin_lock_irqsave(&callback_lock, flags); rcu_read_lock(); guarantee_online_mems(task_cs(tsk), &mask); rcu_read_unlock(); spin_unlock_irqrestore(&callback_lock, flags); return mask; } /** * cpuset_nodemask_valid_mems_allowed - check nodemask vs. curremt mems_allowed * @nodemask: the nodemask to be checked * * Are any of the nodes in the nodemask allowed in current->mems_allowed? */ int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask) { return nodes_intersects(*nodemask, current->mems_allowed); } /* * nearest_hardwall_ancestor() - Returns the nearest mem_exclusive or * mem_hardwall ancestor to the specified cpuset. Call holding * callback_lock. If no ancestor is mem_exclusive or mem_hardwall * (an unusual configuration), then returns the root cpuset. */ static struct cpuset *nearest_hardwall_ancestor(struct cpuset *cs) { while (!(is_mem_exclusive(cs) || is_mem_hardwall(cs)) && parent_cs(cs)) cs = parent_cs(cs); return cs; } /** * cpuset_node_allowed - Can we allocate on a memory node? * @node: is this an allowed node? * @gfp_mask: memory allocation flags * * If we're in interrupt, yes, we can always allocate. If @node is set in * current's mems_allowed, yes. If it's not a __GFP_HARDWALL request and this * node is set in the nearest hardwalled cpuset ancestor to current's cpuset, * yes. If current has access to memory reserves as an oom victim, yes. * Otherwise, no. * * GFP_USER allocations are marked with the __GFP_HARDWALL bit, * and do not allow allocations outside the current tasks cpuset * unless the task has been OOM killed. * GFP_KERNEL allocations are not so marked, so can escape to the * nearest enclosing hardwalled ancestor cpuset. * * Scanning up parent cpusets requires callback_lock. The * __alloc_pages() routine only calls here with __GFP_HARDWALL bit * _not_ set if it's a GFP_KERNEL allocation, and all nodes in the * current tasks mems_allowed came up empty on the first pass over * the zonelist. So only GFP_KERNEL allocations, if all nodes in the * cpuset are short of memory, might require taking the callback_lock. * * The first call here from mm/page_alloc:get_page_from_freelist() * has __GFP_HARDWALL set in gfp_mask, enforcing hardwall cpusets, * so no allocation on a node outside the cpuset is allowed (unless * in interrupt, of course). * * The second pass through get_page_from_freelist() doesn't even call * here for GFP_ATOMIC calls. For those calls, the __alloc_pages() * variable 'wait' is not set, and the bit ALLOC_CPUSET is not set * in alloc_flags. That logic and the checks below have the combined * affect that: * in_interrupt - any node ok (current task context irrelevant) * GFP_ATOMIC - any node ok * tsk_is_oom_victim - any node ok * GFP_KERNEL - any node in enclosing hardwalled cpuset ok * GFP_USER - only nodes in current tasks mems allowed ok. */ bool __cpuset_node_allowed(int node, gfp_t gfp_mask) { struct cpuset *cs; /* current cpuset ancestors */ int allowed; /* is allocation in zone z allowed? */ unsigned long flags; if (in_interrupt()) return true; if (node_isset(node, current->mems_allowed)) return true; /* * Allow tasks that have access to memory reserves because they have * been OOM killed to get memory anywhere. */ if (unlikely(tsk_is_oom_victim(current))) return true; if (gfp_mask & __GFP_HARDWALL) /* If hardwall request, stop here */ return false; if (current->flags & PF_EXITING) /* Let dying task have memory */ return true; /* Not hardwall and node outside mems_allowed: scan up cpusets */ spin_lock_irqsave(&callback_lock, flags); rcu_read_lock(); cs = nearest_hardwall_ancestor(task_cs(current)); allowed = node_isset(node, cs->mems_allowed); rcu_read_unlock(); spin_unlock_irqrestore(&callback_lock, flags); return allowed; } /** * cpuset_mem_spread_node() - On which node to begin search for a file page * cpuset_slab_spread_node() - On which node to begin search for a slab page * * If a task is marked PF_SPREAD_PAGE or PF_SPREAD_SLAB (as for * tasks in a cpuset with is_spread_page or is_spread_slab set), * and if the memory allocation used cpuset_mem_spread_node() * to determine on which node to start looking, as it will for * certain page cache or slab cache pages such as used for file * system buffers and inode caches, then instead of starting on the * local node to look for a free page, rather spread the starting * node around the tasks mems_allowed nodes. * * We don't have to worry about the returned node being offline * because "it can't happen", and even if it did, it would be ok. * * The routines calling guarantee_online_mems() are careful to * only set nodes in task->mems_allowed that are online. So it * should not be possible for the following code to return an * offline node. But if it did, that would be ok, as this routine * is not returning the node where the allocation must be, only * the node where the search should start. The zonelist passed to * __alloc_pages() will include all nodes. If the slab allocator * is passed an offline node, it will fall back to the local node. * See kmem_cache_alloc_node(). */ static int cpuset_spread_node(int *rotor) { return *rotor = next_node_in(*rotor, current->mems_allowed); } int cpuset_mem_spread_node(void) { if (current->cpuset_mem_spread_rotor == NUMA_NO_NODE) current->cpuset_mem_spread_rotor = node_random(&current->mems_allowed); return cpuset_spread_node(&current->cpuset_mem_spread_rotor); } int cpuset_slab_spread_node(void) { if (current->cpuset_slab_spread_rotor == NUMA_NO_NODE) current->cpuset_slab_spread_rotor = node_random(&current->mems_allowed); return cpuset_spread_node(&current->cpuset_slab_spread_rotor); } EXPORT_SYMBOL_GPL(cpuset_mem_spread_node); /** * cpuset_mems_allowed_intersects - Does @tsk1's mems_allowed intersect @tsk2's? * @tsk1: pointer to task_struct of some task. * @tsk2: pointer to task_struct of some other task. * * Description: Return true if @tsk1's mems_allowed intersects the * mems_allowed of @tsk2. Used by the OOM killer to determine if * one of the task's memory usage might impact the memory available * to the other. **/ int cpuset_mems_allowed_intersects(const struct task_struct *tsk1, const struct task_struct *tsk2) { return nodes_intersects(tsk1->mems_allowed, tsk2->mems_allowed); } /** * cpuset_print_current_mems_allowed - prints current's cpuset and mems_allowed * * Description: Prints current's name, cpuset name, and cached copy of its * mems_allowed to the kernel log. */ void cpuset_print_current_mems_allowed(void) { struct cgroup *cgrp; rcu_read_lock(); cgrp = task_cs(current)->css.cgroup; pr_cont(",cpuset="); pr_cont_cgroup_name(cgrp); pr_cont(",mems_allowed=%*pbl", nodemask_pr_args(&current->mems_allowed)); rcu_read_unlock(); } /* * Collection of memory_pressure is suppressed unless * this flag is enabled by writing "1" to the special * cpuset file 'memory_pressure_enabled' in the root cpuset. */ int cpuset_memory_pressure_enabled __read_mostly; /** * cpuset_memory_pressure_bump - keep stats of per-cpuset reclaims. * * Keep a running average of the rate of synchronous (direct) * page reclaim efforts initiated by tasks in each cpuset. * * This represents the rate at which some task in the cpuset * ran low on memory on all nodes it was allowed to use, and * had to enter the kernels page reclaim code in an effort to * create more free memory by tossing clean pages or swapping * or writing dirty pages. * * Display to user space in the per-cpuset read-only file * "memory_pressure". Value displayed is an integer * representing the recent rate of entry into the synchronous * (direct) page reclaim by any task attached to the cpuset. **/ void __cpuset_memory_pressure_bump(void) { rcu_read_lock(); fmeter_markevent(&task_cs(current)->fmeter); rcu_read_unlock(); } #ifdef CONFIG_PROC_PID_CPUSET /* * proc_cpuset_show() * - Print tasks cpuset path into seq_file. * - Used for /proc/<pid>/cpuset. * - No need to task_lock(tsk) on this tsk->cpuset reference, as it * doesn't really matter if tsk->cpuset changes after we read it, * and we take cpuset_mutex, keeping cpuset_attach() from changing it * anyway. */ int proc_cpuset_show(struct seq_file *m, struct pid_namespace *ns, struct pid *pid, struct task_struct *tsk) { char *buf; struct cgroup_subsys_state *css; int retval; retval = -ENOMEM; buf = kmalloc(PATH_MAX, GFP_KERNEL); if (!buf) goto out; css = task_get_css(tsk, cpuset_cgrp_id); retval = cgroup_path_ns(css->cgroup, buf, PATH_MAX, current->nsproxy->cgroup_ns); css_put(css); if (retval >= PATH_MAX) retval = -ENAMETOOLONG; if (retval < 0) goto out_free; seq_puts(m, buf); seq_putc(m, '\n'); retval = 0; out_free: kfree(buf); out: return retval; } #endif /* CONFIG_PROC_PID_CPUSET */ /* Display task mems_allowed in /proc/<pid>/status file. */ void cpuset_task_status_allowed(struct seq_file *m, struct task_struct *task) { seq_printf(m, "Mems_allowed:\t%*pb\n", nodemask_pr_args(&task->mems_allowed)); seq_printf(m, "Mems_allowed_list:\t%*pbl\n", nodemask_pr_args(&task->mems_allowed)); }
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 1437 1438 1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 1500 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1533 1534 1535 1536 1537 1538 1539 1540 1541 1542 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553 1554 1555 1556 1557 1558 1559 1560 1561 1562 1563 1564 1565 1566 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 1590 1591 1592 1593 1594 1595 1596 1597 1598 1599 1600 1601 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1622 1623 1624 1625 1626 1627 1628 1629 1630 1631 1632 1633 1634 1635 1636 1637 1638 1639 1640 1641 1642 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 1665 1666 1667 1668 1669 1670 1671 1672 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682 1683 1684 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1695 1696 1697 1698 1699 1700 1701 1702 1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 1713 1714 1715 1716 1717 1718 1719 1720 1721 1722 1723 1724 1725 1726 1727 1728 1729 1730 1731 1732 1733 1734 1735 1736 1737 1738 1739 1740 1741 1742 1743 1744 1745 1746 1747 1748 1749 1750 1751 1752 1753 1754 1755 1756 1757 1758 1759 1760 1761 1762 1763 1764 1765 1766 1767 1768 1769 1770 1771 1772 1773 1774 1775 1776 1777 1778 1779 1780 1781 1782 1783 1784 1785 1786 1787 1788 1789 1790 1791 1792 1793 1794 1795 1796 1797 1798 1799 1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 1810 1811 1812 1813 1814 1815 1816 1817 1818 1819 1820 1821 1822 1823 1824 1825 1826 1827 1828 1829 1830 1831 1832 1833 1834 1835 1836 1837 1838 1839 1840 1841 1842 1843 1844 1845 1846 1847 1848 1849 1850 1851 1852 1853 1854 1855 1856 1857 1858 1859 1860 1861 1862 1863 1864 1865 1866 1867 1868 1869 1870 1871 1872 1873 1874 1875 1876 1877 1878 1879 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 1890 1891 1892 1893 1894 1895 1896 1897 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 2026 2027 2028 2029 2030 2031 2032 2033 2034 2035 2036 2037 2038 2039 2040 2041 2042 2043 2044 2045 2046 2047 2048 2049 2050 2051 2052 2053 2054 2055 2056 2057 2058 2059 2060 2061 2062 2063 2064 2065 2066 2067 2068 2069 2070 2071 2072 2073 2074 2075 2076 2077 2078 2079 2080 2081 2082 2083 2084 2085 2086 2087 2088 2089 2090 2091 2092 2093 2094 2095 2096 2097 2098 2099 2100 2101 2102 2103 2104 2105 2106 2107 2108 2109 2110 2111 2112 2113 2114 2115 2116 2117 2118 2119 2120 2121 2122 2123 2124 2125 2126 2127 2128 2129 2130 2131 2132 2133 2134 2135 2136 2137 2138 2139 2140 2141 2142 2143 2144 2145 2146 2147 2148 2149 2150 2151 2152 2153 2154 2155 2156 2157 2158 2159 2160 2161 2162 2163 2164 2165 2166 2167 2168 2169 2170 2171 2172 2173 2174 2175 2176 2177 2178 2179 2180 2181 2182 2183 2184 2185 2186 2187 2188 2189 2190 2191 2192 2193 2194 2195 2196 2197 2198 2199 2200 2201 2202 2203 2204 2205 2206 2207 2208 2209 2210 2211 2212 2213 2214 2215 2216 2217 2218 2219 2220 2221 2222 2223 2224 2225 2226 2227 2228 2229 2230 2231 2232 2233 2234 2235 2236 2237 2238 2239 2240 2241 2242 2243 2244 2245 2246 2247 2248 2249 2250 2251 2252 2253 2254 2255 2256 2257 2258 2259 2260 2261 2262 2263 2264 2265 2266 2267 2268 2269 2270 2271 2272 2273 2274 2275 2276 2277 2278 2279 2280 2281 2282 2283 2284 2285 2286 2287 2288 2289 2290 2291 2292 2293 2294 2295 2296 2297 2298 2299 2300 2301 2302 2303 2304 2305 2306 2307 2308 2309 2310 2311 2312 2313 2314 2315 2316 2317 2318 2319 2320 2321 2322 2323 2324 2325 2326 2327 2328 2329 2330 2331 2332 2333 2334 2335 2336 2337 2338 2339 2340 2341 2342 2343 2344 2345 2346 2347 2348 2349 2350 2351 2352 2353 2354 2355 2356 2357 2358 2359 2360 2361 2362 2363 2364 2365 2366 2367 2368 2369 2370 2371 2372 2373 2374 2375 2376 2377 2378 2379 2380 2381 2382 2383 2384 2385 2386 2387 2388 2389 2390 2391 2392 2393 2394 2395 2396 2397 2398 2399 2400 2401 2402 2403 2404 2405 2406 2407 2408 2409 2410 2411 2412 2413 2414 2415 2416 2417 2418 2419 2420 2421 2422 2423 2424 2425 2426 2427 2428 2429 2430 2431 2432 2433 2434 2435 2436 2437 2438 2439 2440 2441 2442 2443 2444 2445 2446 2447 2448 2449 2450 2451 2452 2453 2454 2455 2456 2457 2458 2459 2460 2461 2462 2463 2464 2465 2466 2467 2468 2469 2470 2471 2472 2473 2474 2475 2476 2477 2478 2479 2480 2481 2482 2483 2484 2485 2486 2487 2488 2489 2490 2491 2492 2493 2494 2495 2496 2497 2498 2499 2500 2501 2502 2503 2504 2505 2506 2507 2508 2509 2510 2511 2512 2513 2514 2515 2516 2517 2518 2519 2520 2521 2522 2523 2524 2525 2526 2527 2528 2529 2530 2531 2532 2533 2534 2535 2536 2537 2538 2539 2540 2541 2542 2543 2544 2545 2546 2547 2548 2549 2550 2551 2552 2553 2554 2555 2556 2557 2558 2559 2560 2561 2562 2563 2564 2565 2566 2567 2568 2569 2570 2571 2572 2573 2574 2575 2576 2577 2578 2579 2580 2581 2582 2583 2584 2585 2586 2587 2588 2589 2590 2591 2592 2593 2594 2595 2596 2597 2598 2599 2600 2601 2602 2603 2604 2605 2606 2607 2608 2609 2610 2611 2612 2613 2614 2615 2616 2617 2618 2619 2620 2621 2622 2623 2624 2625 2626 2627 2628 2629 2630 2631 2632 2633 2634 2635 2636 2637 2638 2639 2640 2641 2642 2643 2644 2645 2646 2647 2648 2649 2650 2651 2652 2653 2654 2655 2656 2657 2658 2659 2660 2661 2662 2663 2664 2665 2666 2667 2668 2669 2670 2671 2672 2673 2674 2675 2676 2677 2678 2679 2680 2681 2682 2683 2684 2685 2686 2687 2688 2689 2690 2691 2692 2693 2694 2695 2696 2697 2698 2699 2700 2701 2702 2703 2704 2705 2706 2707 2708 2709 2710 2711 2712 2713 2714 2715 2716 2717 2718 2719 2720 2721 2722 2723 2724 2725 2726 2727 2728 2729 2730 2731 2732 2733 2734 2735 2736 2737 2738 2739 2740 2741 2742 2743 2744 2745 2746 2747 2748 2749 2750 2751 2752 2753 2754 2755 2756 2757 2758 2759 2760 2761 2762 2763 2764 2765 2766 2767 2768 2769 2770 2771 2772 2773 2774 2775 2776 2777 2778 2779 2780 2781 2782 2783 2784 2785 2786 2787 2788 2789 2790 2791 2792 2793 2794 2795 2796 2797 2798 2799 2800 2801 2802 2803 2804 2805 2806 2807 2808 2809 2810 2811 2812 2813 2814 2815 2816 2817 2818 2819 2820 2821 2822 2823 2824 2825 2826 2827 2828 2829 2830 2831 2832 2833 2834 2835 2836 2837 2838 2839 2840 2841 2842 2843 2844 2845 2846 2847 2848 2849 2850 2851 2852 2853 2854 2855 2856 2857 2858 2859 2860 2861 2862 2863 2864 2865 2866 2867 2868 2869 2870 2871 2872 2873 2874 2875 2876 2877 2878 2879 2880 2881 2882 2883 2884 2885 2886 2887 2888 2889 2890 2891 2892 2893 2894 2895 2896 2897 2898 2899 2900 2901 2902 2903 2904 2905 2906 2907 2908 2909 2910 2911 2912 2913 2914 2915 2916 2917 2918 2919 2920 2921 2922 2923 2924 2925 2926 2927 2928 2929 2930 2931 2932 2933 2934 2935 2936 2937 2938 2939 2940 2941 2942 2943 2944 2945 2946 2947 2948 2949 2950 2951 2952 2953 2954 2955 2956 2957 2958 2959 2960 2961 2962 2963 2964 2965 2966 2967 2968 2969 2970 2971 2972 2973 2974 2975 2976 2977 2978 2979 2980 2981 2982 2983 2984 2985 2986 2987 2988 2989 2990 2991 2992 2993 2994 2995 2996 2997 2998 2999 3000 3001 3002 3003 3004 3005 3006 3007 3008 3009 3010 3011 3012 3013 3014 3015 3016 3017 3018 3019 3020 3021 3022 3023 3024 3025 3026 3027 3028 3029 3030 3031 3032 3033 3034 3035 3036 3037 3038 3039 3040 3041 3042 3043 3044 3045 3046 3047 3048 3049 3050 3051 3052 3053 3054 3055 3056 3057 3058 3059 3060 3061 3062 3063 3064 3065 3066 3067 3068 3069 3070 3071 3072 3073 3074 3075 3076 3077 3078 3079 3080 3081 3082 3083 3084 3085 3086 3087 3088 3089 3090 3091 3092 3093 3094 3095 3096 3097 3098 3099 3100 3101 3102 3103 3104 3105 3106 3107 3108 3109 3110 3111 3112 3113 3114 3115 3116 3117 3118 3119 3120 3121 3122 3123 3124 3125 3126 3127 3128 3129 3130 3131 3132 3133 3134 3135 3136 3137 3138 3139 3140 3141 3142 3143 3144 3145 3146 3147 3148 3149 3150 3151 3152 3153 3154 3155 3156 3157 3158 3159 3160 3161 3162 3163 3164 3165 3166 3167 3168 3169 3170 3171 3172 3173 3174 3175 3176 3177 3178 3179 3180 3181 3182 3183 3184 3185 3186 3187 3188 3189 3190 3191 3192 3193 3194 3195 3196 3197 3198 3199 3200 3201 3202 3203 3204 3205 3206 3207 3208 3209 3210 3211 3212 3213 3214 3215 3216 3217 3218 3219 3220 3221 3222 3223 3224 3225 3226 3227 3228 3229 3230 3231 3232 3233 3234 3235 3236 3237 3238 3239 3240 3241 3242 3243 3244 3245 3246 3247 3248 3249 3250 3251 3252 3253 3254 3255 3256 3257 3258 3259 3260 3261 3262 3263 3264 3265 3266 3267 3268 3269 3270 3271 3272 3273 3274 3275 3276 3277 3278 3279 3280 3281 3282 3283 3284 3285 3286 3287 3288 3289 3290 3291 3292 3293 3294 3295 3296 3297 3298 3299 3300 3301 3302 3303 3304 3305 3306 3307 3308 3309 3310 3311 3312 3313 3314 3315 3316 3317 3318 3319 3320 3321 3322 3323 3324 3325 3326 3327 3328 3329 3330 3331 3332 3333 3334 3335 3336 3337 3338 3339 3340 3341 3342 3343 3344 3345 3346 3347 3348 3349 3350 3351 3352 3353 3354 3355 3356 3357 3358 3359 3360 3361 3362 3363 3364 3365 3366 3367 3368 3369 3370 3371 3372 3373 3374 3375 3376 3377 3378 3379 3380 3381 3382 3383 3384 3385 3386 3387 3388 3389 3390 3391 3392 3393 3394 3395 3396 3397 3398 3399 3400 3401 3402 3403 3404 3405 3406 3407 3408 3409 3410 3411 3412 3413 3414 3415 3416 3417 3418 3419 3420 3421 3422 3423 3424 3425 3426 3427 3428 3429 3430 3431 3432 3433 3434 3435 3436 3437 3438 3439 3440 3441 3442 3443 3444 3445 3446 3447 3448 3449 3450 3451 3452 3453 3454 3455 3456 3457 3458 3459 3460 3461 3462 3463 3464 3465 3466 3467 3468 3469 3470 3471 3472 3473 3474 3475 3476 3477 3478 3479 3480 3481 3482 3483 3484 3485 3486 3487 3488 3489 3490 3491 3492 3493 3494 3495 3496 3497 3498 3499 3500 3501 3502 3503 3504 3505 3506 3507 3508 3509 3510 3511 3512 3513 3514 3515 3516 3517 3518 3519 3520 3521 3522 3523 3524 3525 3526 3527 3528 3529 3530 3531 3532 3533 3534 3535 3536 3537 3538 3539 3540 3541 3542 3543 3544 3545 3546 3547 3548 3549 3550 3551 3552 3553 3554 3555 3556 3557 3558 3559 3560 3561 3562 3563 3564 3565 3566 3567 3568 3569 3570 3571 3572 3573 3574 3575 3576 3577 3578 3579 3580 3581 3582 3583 3584 3585 3586 3587 3588 3589 3590 3591 3592 3593 3594 3595 3596 3597 3598 3599 3600 3601 3602 3603 3604 3605 3606 3607 3608 3609 3610 3611 3612 3613 3614 3615 3616 3617 3618 3619 3620 3621 3622 3623 3624 3625 3626 3627 3628 3629 3630 3631 3632 3633 3634 3635 3636 3637 3638 3639 3640 3641 3642 3643 3644 3645 3646 3647 3648 3649 3650 3651 3652 3653 3654 3655 3656 3657 3658 3659 3660 3661 3662 3663 3664 3665 3666 3667 3668 3669 3670 3671 3672 3673 3674 3675 3676 3677 3678 3679 3680 3681 3682 3683 3684 3685 3686 3687 3688 3689 3690 3691 3692 3693 3694 3695 3696 3697 3698 3699 3700 3701 3702 3703 3704 3705 3706 3707 3708 3709 3710 3711 3712 3713 3714 3715 3716 3717 3718 3719 3720 3721 3722 3723 3724 3725 3726 3727 3728 3729 3730 3731 3732 3733 3734 3735 3736 3737 3738 3739 3740 3741 3742 3743 3744 3745 3746 3747 3748 3749 3750 3751 3752 3753 3754 3755 3756 3757 3758 3759 3760 3761 3762 3763 3764 3765 3766 3767 3768 3769 3770 3771 3772 3773 3774 3775 3776 3777 3778 3779 3780 3781 3782 3783 3784 3785 3786 3787 3788 3789 3790 3791 3792 3793 3794 3795 3796 3797 3798 3799 3800 3801 3802 3803 3804 3805 3806 3807 3808 3809 3810 3811 3812 3813 3814 3815 3816 3817 3818 3819 3820 3821 3822 3823 3824 3825 3826 3827 3828 3829 3830 3831 3832 3833 3834 3835 3836 3837 3838 3839 3840 3841 3842 3843 3844 3845 3846 3847 3848 3849 3850 3851 3852 3853 3854 3855 3856 3857 3858 3859 3860 3861 3862 3863 3864 3865 3866 3867 3868 3869 3870 3871 3872 3873 3874 3875 3876 3877 3878 3879 3880 3881 3882 3883 3884 3885 3886 3887 3888 3889 3890 3891 3892 3893 3894 3895 3896 3897 3898 3899 3900 3901 3902 3903 3904 3905 3906 3907 3908 3909 3910 3911 3912 3913 3914 3915 3916 3917 3918 3919 3920 3921 3922 3923 3924 3925 3926 3927 3928 3929 3930 3931 3932 3933 3934 3935 3936 3937 3938 3939 3940 3941 3942 3943 3944 3945 3946 3947 3948 3949 3950 3951 3952 3953 3954 3955 3956 3957 3958 3959 3960 3961 3962 3963 3964 3965 3966 3967 3968 3969 3970 3971 3972 3973 3974 3975 3976 3977 3978 3979 3980 3981 3982 3983 3984 3985 3986 3987 3988 3989 3990 3991 3992 3993 3994 3995 3996 3997 3998 3999 4000 4001 4002 4003 4004 4005 4006 4007 4008 4009 // SPDX-License-Identifier: GPL-2.0-only /* * Implementation of the security services. * * Authors : Stephen Smalley, <sds@tycho.nsa.gov> * James Morris <jmorris@redhat.com> * * Updated: Trusted Computer Solutions, Inc. <dgoeddel@trustedcs.com> * * Support for enhanced MLS infrastructure. * Support for context based audit filters. * * Updated: Frank Mayer <mayerf@tresys.com> and Karl MacMillan <kmacmillan@tresys.com> * * Added conditional policy language extensions * * Updated: Hewlett-Packard <paul@paul-moore.com> * * Added support for NetLabel * Added support for the policy capability bitmap * * Updated: Chad Sellers <csellers@tresys.com> * * Added validation of kernel classes and permissions * * Updated: KaiGai Kohei <kaigai@ak.jp.nec.com> * * Added support for bounds domain and audit messaged on masked permissions * * Updated: Guido Trentalancia <guido@trentalancia.com> * * Added support for runtime switching of the policy type * * Copyright (C) 2008, 2009 NEC Corporation * Copyright (C) 2006, 2007 Hewlett-Packard Development Company, L.P. * Copyright (C) 2004-2006 Trusted Computer Solutions, Inc. * Copyright (C) 2003 - 2004, 2006 Tresys Technology, LLC * Copyright (C) 2003 Red Hat, Inc., James Morris <jmorris@redhat.com> */ #include <linux/kernel.h> #include <linux/slab.h> #include <linux/string.h> #include <linux/spinlock.h> #include <linux/rcupdate.h> #include <linux/errno.h> #include <linux/in.h> #include <linux/sched.h> #include <linux/audit.h> #include <linux/vmalloc.h> #include <net/netlabel.h> #include "flask.h" #include "avc.h" #include "avc_ss.h" #include "security.h" #include "context.h" #include "policydb.h" #include "sidtab.h" #include "services.h" #include "conditional.h" #include "mls.h" #include "objsec.h" #include "netlabel.h" #include "xfrm.h" #include "ebitmap.h" #include "audit.h" #include "policycap_names.h" struct convert_context_args { struct selinux_state *state; struct policydb *oldp; struct policydb *newp; }; struct selinux_policy_convert_data { struct convert_context_args args; struct sidtab_convert_params sidtab_params; }; /* Forward declaration. */ static int context_struct_to_string(struct policydb *policydb, struct context *context, char **scontext, u32 *scontext_len); static int sidtab_entry_to_string(struct policydb *policydb, struct sidtab *sidtab, struct sidtab_entry *entry, char **scontext, u32 *scontext_len); static void context_struct_compute_av(struct policydb *policydb, struct context *scontext, struct context *tcontext, u16 tclass, struct av_decision *avd, struct extended_perms *xperms); static int selinux_set_mapping(struct policydb *pol, struct security_class_mapping *map, struct selinux_map *out_map) { u16 i, j; unsigned k; bool print_unknown_handle = false; /* Find number of classes in the input mapping */ if (!map) return -EINVAL; i = 0; while (map[i].name) i++; /* Allocate space for the class records, plus one for class zero */ out_map->mapping = kcalloc(++i, sizeof(*out_map->mapping), GFP_ATOMIC); if (!out_map->mapping) return -ENOMEM; /* Store the raw class and permission values */ j = 0; while (map[j].name) { struct security_class_mapping *p_in = map + (j++); struct selinux_mapping *p_out = out_map->mapping + j; /* An empty class string skips ahead */ if (!strcmp(p_in->name, "")) { p_out->num_perms = 0; continue; } p_out->value = string_to_security_class(pol, p_in->name); if (!p_out->value) { pr_info("SELinux: Class %s not defined in policy.\n", p_in->name); if (pol->reject_unknown) goto err; p_out->num_perms = 0; print_unknown_handle = true; continue; } k = 0; while (p_in->perms[k]) { /* An empty permission string skips ahead */ if (!*p_in->perms[k]) { k++; continue; } p_out->perms[k] = string_to_av_perm(pol, p_out->value, p_in->perms[k]); if (!p_out->perms[k]) { pr_info("SELinux: Permission %s in class %s not defined in policy.\n", p_in->perms[k], p_in->name); if (pol->reject_unknown) goto err; print_unknown_handle = true; } k++; } p_out->num_perms = k; } if (print_unknown_handle) pr_info("SELinux: the above unknown classes and permissions will be %s\n", pol->allow_unknown ? "allowed" : "denied"); out_map->size = i; return 0; err: kfree(out_map->mapping); out_map->mapping = NULL; return -EINVAL; } /* * Get real, policy values from mapped values */ static u16 unmap_class(struct selinux_map *map, u16 tclass) { if (tclass < map->size) return map->mapping[tclass].value; return tclass; } /* * Get kernel value for class from its policy value */ static u16 map_class(struct selinux_map *map, u16 pol_value) { u16 i; for (i = 1; i < map->size; i++) { if (map->mapping[i].value == pol_value) return i; } return SECCLASS_NULL; } static void map_decision(struct selinux_map *map, u16 tclass, struct av_decision *avd, int allow_unknown) { if (tclass < map->size) { struct selinux_mapping *mapping = &map->mapping[tclass]; unsigned int i, n = mapping->num_perms; u32 result; for (i = 0, result = 0; i < n; i++) { if (avd->allowed & mapping->perms[i]) result |= 1<<i; if (allow_unknown && !mapping->perms[i]) result |= 1<<i; } avd->allowed = result; for (i = 0, result = 0; i < n; i++) if (avd->auditallow & mapping->perms[i]) result |= 1<<i; avd->auditallow = result; for (i = 0, result = 0; i < n; i++) { if (avd->auditdeny & mapping->perms[i]) result |= 1<<i; if (!allow_unknown && !mapping->perms[i]) result |= 1<<i; } /* * In case the kernel has a bug and requests a permission * between num_perms and the maximum permission number, we * should audit that denial */ for (; i < (sizeof(u32)*8); i++) result |= 1<<i; avd->auditdeny = result; } } int security_mls_enabled(struct selinux_state *state) { int mls_enabled; struct selinux_policy *policy; if (!selinux_initialized(state)) return 0; rcu_read_lock(); policy = rcu_dereference(state->policy); mls_enabled = policy->policydb.mls_enabled; rcu_read_unlock(); return mls_enabled; } /* * Return the boolean value of a constraint expression * when it is applied to the specified source and target * security contexts. * * xcontext is a special beast... It is used by the validatetrans rules * only. For these rules, scontext is the context before the transition, * tcontext is the context after the transition, and xcontext is the context * of the process performing the transition. All other callers of * constraint_expr_eval should pass in NULL for xcontext. */ static int constraint_expr_eval(struct policydb *policydb, struct context *scontext, struct context *tcontext, struct context *xcontext, struct constraint_expr *cexpr) { u32 val1, val2; struct context *c; struct role_datum *r1, *r2; struct mls_level *l1, *l2; struct constraint_expr *e; int s[CEXPR_MAXDEPTH]; int sp = -1; for (e = cexpr; e; e = e->next) { switch (e->expr_type) { case CEXPR_NOT: BUG_ON(sp < 0); s[sp] = !s[sp]; break; case CEXPR_AND: BUG_ON(sp < 1); sp--; s[sp] &= s[sp + 1]; break; case CEXPR_OR: BUG_ON(sp < 1); sp--; s[sp] |= s[sp + 1]; break; case CEXPR_ATTR: if (sp == (CEXPR_MAXDEPTH - 1)) return 0; switch (e->attr) { case CEXPR_USER: val1 = scontext->user; val2 = tcontext->user; break; case CEXPR_TYPE: val1 = scontext->type; val2 = tcontext->type; break; case CEXPR_ROLE: val1 = scontext->role; val2 = tcontext->role; r1 = policydb->role_val_to_struct[val1 - 1]; r2 = policydb->role_val_to_struct[val2 - 1]; switch (e->op) { case CEXPR_DOM: s[++sp] = ebitmap_get_bit(&r1->dominates, val2 - 1); continue; case CEXPR_DOMBY: s[++sp] = ebitmap_get_bit(&r2->dominates, val1 - 1); continue; case CEXPR_INCOMP: s[++sp] = (!ebitmap_get_bit(&r1->dominates, val2 - 1) && !ebitmap_get_bit(&r2->dominates, val1 - 1)); continue; default: break; } break; case CEXPR_L1L2: l1 = &(scontext->range.level[0]); l2 = &(tcontext->range.level[0]); goto mls_ops; case CEXPR_L1H2: l1 = &(scontext->range.level[0]); l2 = &(tcontext->range.level[1]); goto mls_ops; case CEXPR_H1L2: l1 = &(scontext->range.level[1]); l2 = &(tcontext->range.level[0]); goto mls_ops; case CEXPR_H1H2: l1 = &(scontext->range.level[1]); l2 = &(tcontext->range.level[1]); goto mls_ops; case CEXPR_L1H1: l1 = &(scontext->range.level[0]); l2 = &(scontext->range.level[1]); goto mls_ops; case CEXPR_L2H2: l1 = &(tcontext->range.level[0]); l2 = &(tcontext->range.level[1]); goto mls_ops; mls_ops: switch (e->op) { case CEXPR_EQ: s[++sp] = mls_level_eq(l1, l2); continue; case CEXPR_NEQ: s[++sp] = !mls_level_eq(l1, l2); continue; case CEXPR_DOM: s[++sp] = mls_level_dom(l1, l2); continue; case CEXPR_DOMBY: s[++sp] = mls_level_dom(l2, l1); continue; case CEXPR_INCOMP: s[++sp] = mls_level_incomp(l2, l1); continue; default: BUG(); return 0; } break; default: BUG(); return 0; } switch (e->op) { case CEXPR_EQ: s[++sp] = (val1 == val2); break; case CEXPR_NEQ: s[++sp] = (val1 != val2); break; default: BUG(); return 0; } break; case CEXPR_NAMES: if (sp == (CEXPR_MAXDEPTH-1)) return 0; c = scontext; if (e->attr & CEXPR_TARGET) c = tcontext; else if (e->attr & CEXPR_XTARGET) { c = xcontext; if (!c) { BUG(); return 0; } } if (e->attr & CEXPR_USER) val1 = c->user; else if (e->attr & CEXPR_ROLE) val1 = c->role; else if (e->attr & CEXPR_TYPE) val1 = c->type; else { BUG(); return 0; } switch (e->op) { case CEXPR_EQ: s[++sp] = ebitmap_get_bit(&e->names, val1 - 1); break; case CEXPR_NEQ: s[++sp] = !ebitmap_get_bit(&e->names, val1 - 1); break; default: BUG(); return 0; } break; default: BUG(); return 0; } } BUG_ON(sp != 0); return s[0]; } /* * security_dump_masked_av - dumps masked permissions during * security_compute_av due to RBAC, MLS/Constraint and Type bounds. */ static int dump_masked_av_helper(void *k, void *d, void *args) { struct perm_datum *pdatum = d; char **permission_names = args; BUG_ON(pdatum->value < 1 || pdatum->value > 32); permission_names[pdatum->value - 1] = (char *)k; return 0; } static void security_dump_masked_av(struct policydb *policydb, struct context *scontext, struct context *tcontext, u16 tclass, u32 permissions, const char *reason) { struct common_datum *common_dat; struct class_datum *tclass_dat; struct audit_buffer *ab; char *tclass_name; char *scontext_name = NULL; char *tcontext_name = NULL; char *permission_names[32]; int index; u32 length; bool need_comma = false; if (!permissions) return; tclass_name = sym_name(policydb, SYM_CLASSES, tclass - 1); tclass_dat = policydb->class_val_to_struct[tclass - 1]; common_dat = tclass_dat->comdatum; /* init permission_names */ if (common_dat && hashtab_map(&common_dat->permissions.table, dump_masked_av_helper, permission_names) < 0) goto out; if (hashtab_map(&tclass_dat->permissions.table, dump_masked_av_helper, permission_names) < 0) goto out; /* get scontext/tcontext in text form */ if (context_struct_to_string(policydb, scontext, &scontext_name, &length) < 0) goto out; if (context_struct_to_string(policydb, tcontext, &tcontext_name, &length) < 0) goto out; /* audit a message */ ab = audit_log_start(audit_context(), GFP_ATOMIC, AUDIT_SELINUX_ERR); if (!ab) goto out; audit_log_format(ab, "op=security_compute_av reason=%s " "scontext=%s tcontext=%s tclass=%s perms=", reason, scontext_name, tcontext_name, tclass_name); for (index = 0; index < 32; index++) { u32 mask = (1 << index); if ((mask & permissions) == 0) continue; audit_log_format(ab, "%s%s", need_comma ? "," : "", permission_names[index] ? permission_names[index] : "????"); need_comma = true; } audit_log_end(ab); out: /* release scontext/tcontext */ kfree(tcontext_name); kfree(scontext_name); return; } /* * security_boundary_permission - drops violated permissions * on boundary constraint. */ static void type_attribute_bounds_av(struct policydb *policydb, struct context *scontext, struct context *tcontext, u16 tclass, struct av_decision *avd) { struct context lo_scontext; struct context lo_tcontext, *tcontextp = tcontext; struct av_decision lo_avd; struct type_datum *source; struct type_datum *target; u32 masked = 0; source = policydb->type_val_to_struct[scontext->type - 1]; BUG_ON(!source); if (!source->bounds) return; target = policydb->type_val_to_struct[tcontext->type - 1]; BUG_ON(!target); memset(&lo_avd, 0, sizeof(lo_avd)); memcpy(&lo_scontext, scontext, sizeof(lo_scontext)); lo_scontext.type = source->bounds; if (target->bounds) { memcpy(&lo_tcontext, tcontext, sizeof(lo_tcontext)); lo_tcontext.type = target->bounds; tcontextp = &lo_tcontext; } context_struct_compute_av(policydb, &lo_scontext, tcontextp, tclass, &lo_avd, NULL); masked = ~lo_avd.allowed & avd->allowed; if (likely(!masked)) return; /* no masked permission */ /* mask violated permissions */ avd->allowed &= ~masked; /* audit masked permissions */ security_dump_masked_av(policydb, scontext, tcontext, tclass, masked, "bounds"); } /* * flag which drivers have permissions * only looking for ioctl based extended permssions */ void services_compute_xperms_drivers( struct extended_perms *xperms, struct avtab_node *node) { unsigned int i; if (node->datum.u.xperms->specified == AVTAB_XPERMS_IOCTLDRIVER) { /* if one or more driver has all permissions allowed */ for (i = 0; i < ARRAY_SIZE(xperms->drivers.p); i++) xperms->drivers.p[i] |= node->datum.u.xperms->perms.p[i]; } else if (node->datum.u.xperms->specified == AVTAB_XPERMS_IOCTLFUNCTION) { /* if allowing permissions within a driver */ security_xperm_set(xperms->drivers.p, node->datum.u.xperms->driver); } /* If no ioctl commands are allowed, ignore auditallow and auditdeny */ if (node->key.specified & AVTAB_XPERMS_ALLOWED) xperms->len = 1; } /* * Compute access vectors and extended permissions based on a context * structure pair for the permissions in a particular class. */ static void context_struct_compute_av(struct policydb *policydb, struct context *scontext, struct context *tcontext, u16 tclass, struct av_decision *avd, struct extended_perms *xperms) { struct constraint_node *constraint; struct role_allow *ra; struct avtab_key avkey; struct avtab_node *node; struct class_datum *tclass_datum; struct ebitmap *sattr, *tattr; struct ebitmap_node *snode, *tnode; unsigned int i, j; avd->allowed = 0; avd->auditallow = 0; avd->auditdeny = 0xffffffff; if (xperms) { memset(&xperms->drivers, 0, sizeof(xperms->drivers)); xperms->len = 0; } if (unlikely(!tclass || tclass > policydb->p_classes.nprim)) { if (printk_ratelimit()) pr_warn("SELinux: Invalid class %hu\n", tclass); return; } tclass_datum = policydb->class_val_to_struct[tclass - 1]; /* * If a specific type enforcement rule was defined for * this permission check, then use it. */ avkey.target_class = tclass; avkey.specified = AVTAB_AV | AVTAB_XPERMS; sattr = &policydb->type_attr_map_array[scontext->type - 1]; tattr = &policydb->type_attr_map_array[tcontext->type - 1]; ebitmap_for_each_positive_bit(sattr, snode, i) { ebitmap_for_each_positive_bit(tattr, tnode, j) { avkey.source_type = i + 1; avkey.target_type = j + 1; for (node = avtab_search_node(&policydb->te_avtab, &avkey); node; node = avtab_search_node_next(node, avkey.specified)) { if (node->key.specified == AVTAB_ALLOWED) avd->allowed |= node->datum.u.data; else if (node->key.specified == AVTAB_AUDITALLOW) avd->auditallow |= node->datum.u.data; else if (node->key.specified == AVTAB_AUDITDENY) avd->auditdeny &= node->datum.u.data; else if (xperms && (node->key.specified & AVTAB_XPERMS)) services_compute_xperms_drivers(xperms, node); } /* Check conditional av table for additional permissions */ cond_compute_av(&policydb->te_cond_avtab, &avkey, avd, xperms); } } /* * Remove any permissions prohibited by a constraint (this includes * the MLS policy). */ constraint = tclass_datum->constraints; while (constraint) { if ((constraint->permissions & (avd->allowed)) && !constraint_expr_eval(policydb, scontext, tcontext, NULL, constraint->expr)) { avd->allowed &= ~(constraint->permissions); } constraint = constraint->next; } /* * If checking process transition permission and the * role is changing, then check the (current_role, new_role) * pair. */ if (tclass == policydb->process_class && (avd->allowed & policydb->process_trans_perms) && scontext->role != tcontext->role) { for (ra = policydb->role_allow; ra; ra = ra->next) { if (scontext->role == ra->role && tcontext->role == ra->new_role) break; } if (!ra) avd->allowed &= ~policydb->process_trans_perms; } /* * If the given source and target types have boundary * constraint, lazy checks have to mask any violated * permission and notice it to userspace via audit. */ type_attribute_bounds_av(policydb, scontext, tcontext, tclass, avd); } static int security_validtrans_handle_fail(struct selinux_state *state, struct selinux_policy *policy, struct sidtab_entry *oentry, struct sidtab_entry *nentry, struct sidtab_entry *tentry, u16 tclass) { struct policydb *p = &policy->policydb; struct sidtab *sidtab = policy->sidtab; char *o = NULL, *n = NULL, *t = NULL; u32 olen, nlen, tlen; if (sidtab_entry_to_string(p, sidtab, oentry, &o, &olen)) goto out; if (sidtab_entry_to_string(p, sidtab, nentry, &n, &nlen)) goto out; if (sidtab_entry_to_string(p, sidtab, tentry, &t, &tlen)) goto out; audit_log(audit_context(), GFP_ATOMIC, AUDIT_SELINUX_ERR, "op=security_validate_transition seresult=denied" " oldcontext=%s newcontext=%s taskcontext=%s tclass=%s", o, n, t, sym_name(p, SYM_CLASSES, tclass-1)); out: kfree(o); kfree(n); kfree(t); if (!enforcing_enabled(state)) return 0; return -EPERM; } static int security_compute_validatetrans(struct selinux_state *state, u32 oldsid, u32 newsid, u32 tasksid, u16 orig_tclass, bool user) { struct selinux_policy *policy; struct policydb *policydb; struct sidtab *sidtab; struct sidtab_entry *oentry; struct sidtab_entry *nentry; struct sidtab_entry *tentry; struct class_datum *tclass_datum; struct constraint_node *constraint; u16 tclass; int rc = 0; if (!selinux_initialized(state)) return 0; rcu_read_lock(); policy = rcu_dereference(state->policy); policydb = &policy->policydb; sidtab = policy->sidtab; if (!user) tclass = unmap_class(&policy->map, orig_tclass); else tclass = orig_tclass; if (!tclass || tclass > policydb->p_classes.nprim) { rc = -EINVAL; goto out; } tclass_datum = policydb->class_val_to_struct[tclass - 1]; oentry = sidtab_search_entry(sidtab, oldsid); if (!oentry) { pr_err("SELinux: %s: unrecognized SID %d\n", __func__, oldsid); rc = -EINVAL; goto out; } nentry = sidtab_search_entry(sidtab, newsid); if (!nentry) { pr_err("SELinux: %s: unrecognized SID %d\n", __func__, newsid); rc = -EINVAL; goto out; } tentry = sidtab_search_entry(sidtab, tasksid); if (!tentry) { pr_err("SELinux: %s: unrecognized SID %d\n", __func__, tasksid); rc = -EINVAL; goto out; } constraint = tclass_datum->validatetrans; while (constraint) { if (!constraint_expr_eval(policydb, &oentry->context, &nentry->context, &tentry->context, constraint->expr)) { if (user) rc = -EPERM; else rc = security_validtrans_handle_fail(state, policy, oentry, nentry, tentry, tclass); goto out; } constraint = constraint->next; } out: rcu_read_unlock(); return rc; } int security_validate_transition_user(struct selinux_state *state, u32 oldsid, u32 newsid, u32 tasksid, u16 tclass) { return security_compute_validatetrans(state, oldsid, newsid, tasksid, tclass, true); } int security_validate_transition(struct selinux_state *state, u32 oldsid, u32 newsid, u32 tasksid, u16 orig_tclass) { return security_compute_validatetrans(state, oldsid, newsid, tasksid, orig_tclass, false); } /* * security_bounded_transition - check whether the given * transition is directed to bounded, or not. * It returns 0, if @newsid is bounded by @oldsid. * Otherwise, it returns error code. * * @oldsid : current security identifier * @newsid : destinated security identifier */ int security_bounded_transition(struct selinux_state *state, u32 old_sid, u32 new_sid) { struct selinux_policy *policy; struct policydb *policydb; struct sidtab *sidtab; struct sidtab_entry *old_entry, *new_entry; struct type_datum *type; int index; int rc; if (!selinux_initialized(state)) return 0; rcu_read_lock(); policy = rcu_dereference(state->policy); policydb = &policy->policydb; sidtab = policy->sidtab; rc = -EINVAL; old_entry = sidtab_search_entry(sidtab, old_sid); if (!old_entry) { pr_err("SELinux: %s: unrecognized SID %u\n", __func__, old_sid); goto out; } rc = -EINVAL; new_entry = sidtab_search_entry(sidtab, new_sid); if (!new_entry) { pr_err("SELinux: %s: unrecognized SID %u\n", __func__, new_sid); goto out; } rc = 0; /* type/domain unchanged */ if (old_entry->context.type == new_entry->context.type) goto out; index = new_entry->context.type; while (true) { type = policydb->type_val_to_struct[index - 1]; BUG_ON(!type); /* not bounded anymore */ rc = -EPERM; if (!type->bounds) break; /* @newsid is bounded by @oldsid */ rc = 0; if (type->bounds == old_entry->context.type) break; index = type->bounds; } if (rc) { char *old_name = NULL; char *new_name = NULL; u32 length; if (!sidtab_entry_to_string(policydb, sidtab, old_entry, &old_name, &length) && !sidtab_entry_to_string(policydb, sidtab, new_entry, &new_name, &length)) { audit_log(audit_context(), GFP_ATOMIC, AUDIT_SELINUX_ERR, "op=security_bounded_transition " "seresult=denied " "oldcontext=%s newcontext=%s", old_name, new_name); } kfree(new_name); kfree(old_name); } out: rcu_read_unlock(); return rc; } static void avd_init(struct selinux_policy *policy, struct av_decision *avd) { avd->allowed = 0; avd->auditallow = 0; avd->auditdeny = 0xffffffff; if (policy) avd->seqno = policy->latest_granting; else avd->seqno = 0; avd->flags = 0; } void services_compute_xperms_decision(struct extended_perms_decision *xpermd, struct avtab_node *node) { unsigned int i; if (node->datum.u.xperms->specified == AVTAB_XPERMS_IOCTLFUNCTION) { if (xpermd->driver != node->datum.u.xperms->driver) return; } else if (node->datum.u.xperms->specified == AVTAB_XPERMS_IOCTLDRIVER) { if (!security_xperm_test(node->datum.u.xperms->perms.p, xpermd->driver)) return; } else { BUG(); } if (node->key.specified == AVTAB_XPERMS_ALLOWED) { xpermd->used |= XPERMS_ALLOWED; if (node->datum.u.xperms->specified == AVTAB_XPERMS_IOCTLDRIVER) { memset(xpermd->allowed->p, 0xff, sizeof(xpermd->allowed->p)); } if (node->datum.u.xperms->specified == AVTAB_XPERMS_IOCTLFUNCTION) { for (i = 0; i < ARRAY_SIZE(xpermd->allowed->p); i++) xpermd->allowed->p[i] |= node->datum.u.xperms->perms.p[i]; } } else if (node->key.specified == AVTAB_XPERMS_AUDITALLOW) { xpermd->used |= XPERMS_AUDITALLOW; if (node->datum.u.xperms->specified == AVTAB_XPERMS_IOCTLDRIVER) { memset(xpermd->auditallow->p, 0xff, sizeof(xpermd->auditallow->p)); } if (node->datum.u.xperms->specified == AVTAB_XPERMS_IOCTLFUNCTION) { for (i = 0; i < ARRAY_SIZE(xpermd->auditallow->p); i++) xpermd->auditallow->p[i] |= node->datum.u.xperms->perms.p[i]; } } else if (node->key.specified == AVTAB_XPERMS_DONTAUDIT) { xpermd->used |= XPERMS_DONTAUDIT; if (node->datum.u.xperms->specified == AVTAB_XPERMS_IOCTLDRIVER) { memset(xpermd->dontaudit->p, 0xff, sizeof(xpermd->dontaudit->p)); } if (node->datum.u.xperms->specified == AVTAB_XPERMS_IOCTLFUNCTION) { for (i = 0; i < ARRAY_SIZE(xpermd->dontaudit->p); i++) xpermd->dontaudit->p[i] |= node->datum.u.xperms->perms.p[i]; } } else { BUG(); } } void security_compute_xperms_decision(struct selinux_state *state, u32 ssid, u32 tsid, u16 orig_tclass, u8 driver, struct extended_perms_decision *xpermd) { struct selinux_policy *policy; struct policydb *policydb; struct sidtab *sidtab; u16 tclass; struct context *scontext, *tcontext; struct avtab_key avkey; struct avtab_node *node; struct ebitmap *sattr, *tattr; struct ebitmap_node *snode, *tnode; unsigned int i, j; xpermd->driver = driver; xpermd->used = 0; memset(xpermd->allowed->p, 0, sizeof(xpermd->allowed->p)); memset(xpermd->auditallow->p, 0, sizeof(xpermd->auditallow->p)); memset(xpermd->dontaudit->p, 0, sizeof(xpermd->dontaudit->p)); rcu_read_lock(); if (!selinux_initialized(state)) goto allow; policy = rcu_dereference(state->policy); policydb = &policy->policydb; sidtab = policy->sidtab; scontext = sidtab_search(sidtab, ssid); if (!scontext) { pr_err("SELinux: %s: unrecognized SID %d\n", __func__, ssid); goto out; } tcontext = sidtab_search(sidtab, tsid); if (!tcontext) { pr_err("SELinux: %s: unrecognized SID %d\n", __func__, tsid); goto out; } tclass = unmap_class(&policy->map, orig_tclass); if (unlikely(orig_tclass && !tclass)) { if (policydb->allow_unknown) goto allow; goto out; } if (unlikely(!tclass || tclass > policydb->p_classes.nprim)) { pr_warn_ratelimited("SELinux: Invalid class %hu\n", tclass); goto out; } avkey.target_class = tclass; avkey.specified = AVTAB_XPERMS; sattr = &policydb->type_attr_map_array[scontext->type - 1]; tattr = &policydb->type_attr_map_array[tcontext->type - 1]; ebitmap_for_each_positive_bit(sattr, snode, i) { ebitmap_for_each_positive_bit(tattr, tnode, j) { avkey.source_type = i + 1; avkey.target_type = j + 1; for (node = avtab_search_node(&policydb->te_avtab, &avkey); node; node = avtab_search_node_next(node, avkey.specified)) services_compute_xperms_decision(xpermd, node); cond_compute_xperms(&policydb->te_cond_avtab, &avkey, xpermd); } } out: rcu_read_unlock(); return; allow: memset(xpermd->allowed->p, 0xff, sizeof(xpermd->allowed->p)); goto out; } /** * security_compute_av - Compute access vector decisions. * @ssid: source security identifier * @tsid: target security identifier * @tclass: target security class * @avd: access vector decisions * @xperms: extended permissions * * Compute a set of access vector decisions based on the * SID pair (@ssid, @tsid) for the permissions in @tclass. */ void security_compute_av(struct selinux_state *state, u32 ssid, u32 tsid, u16 orig_tclass, struct av_decision *avd, struct extended_perms *xperms) { struct selinux_policy *policy; struct policydb *policydb; struct sidtab *sidtab; u16 tclass; struct context *scontext = NULL, *tcontext = NULL; rcu_read_lock(); policy = rcu_dereference(state->policy); avd_init(policy, avd); xperms->len = 0; if (!selinux_initialized(state)) goto allow; policydb = &policy->policydb; sidtab = policy->sidtab; scontext = sidtab_search(sidtab, ssid); if (!scontext) { pr_err("SELinux: %s: unrecognized SID %d\n", __func__, ssid); goto out; } /* permissive domain? */ if (ebitmap_get_bit(&policydb->permissive_map, scontext->type)) avd->flags |= AVD_FLAGS_PERMISSIVE; tcontext = sidtab_search(sidtab, tsid); if (!tcontext) { pr_err("SELinux: %s: unrecognized SID %d\n", __func__, tsid); goto out; } tclass = unmap_class(&policy->map, orig_tclass); if (unlikely(orig_tclass && !tclass)) { if (policydb->allow_unknown) goto allow; goto out; } context_struct_compute_av(policydb, scontext, tcontext, tclass, avd, xperms); map_decision(&policy->map, orig_tclass, avd, policydb->allow_unknown); out: rcu_read_unlock(); return; allow: avd->allowed = 0xffffffff; goto out; } void security_compute_av_user(struct selinux_state *state, u32 ssid, u32 tsid, u16 tclass, struct av_decision *avd) { struct selinux_policy *policy; struct policydb *policydb; struct sidtab *sidtab; struct context *scontext = NULL, *tcontext = NULL; rcu_read_lock(); policy = rcu_dereference(state->policy); avd_init(policy, avd); if (!selinux_initialized(state)) goto allow; policydb = &policy->policydb; sidtab = policy->sidtab; scontext = sidtab_search(sidtab, ssid); if (!scontext) { pr_err("SELinux: %s: unrecognized SID %d\n", __func__, ssid); goto out; } /* permissive domain? */ if (ebitmap_get_bit(&policydb->permissive_map, scontext->type)) avd->flags |= AVD_FLAGS_PERMISSIVE; tcontext = sidtab_search(sidtab, tsid); if (!tcontext) { pr_err("SELinux: %s: unrecognized SID %d\n", __func__, tsid); goto out; } if (unlikely(!tclass)) { if (policydb->allow_unknown) goto allow; goto out; } context_struct_compute_av(policydb, scontext, tcontext, tclass, avd, NULL); out: rcu_read_unlock(); return; allow: avd->allowed = 0xffffffff; goto out; } /* * Write the security context string representation of * the context structure `context' into a dynamically * allocated string of the correct size. Set `*scontext' * to point to this string and set `*scontext_len' to * the length of the string. */ static int context_struct_to_string(struct policydb *p, struct context *context, char **scontext, u32 *scontext_len) { char *scontextp; if (scontext) *scontext = NULL; *scontext_len = 0; if (context->len) { *scontext_len = context->len; if (scontext) { *scontext = kstrdup(context->str, GFP_ATOMIC); if (!(*scontext)) return -ENOMEM; } return 0; } /* Compute the size of the context. */ *scontext_len += strlen(sym_name(p, SYM_USERS, context->user - 1)) + 1; *scontext_len += strlen(sym_name(p, SYM_ROLES, context->role - 1)) + 1; *scontext_len += strlen(sym_name(p, SYM_TYPES, context->type - 1)) + 1; *scontext_len += mls_compute_context_len(p, context); if (!scontext) return 0; /* Allocate space for the context; caller must free this space. */ scontextp = kmalloc(*scontext_len, GFP_ATOMIC); if (!scontextp) return -ENOMEM; *scontext = scontextp; /* * Copy the user name, role name and type name into the context. */ scontextp += sprintf(scontextp, "%s:%s:%s", sym_name(p, SYM_USERS, context->user - 1), sym_name(p, SYM_ROLES, context->role - 1), sym_name(p, SYM_TYPES, context->type - 1)); mls_sid_to_context(p, context, &scontextp); *scontextp = 0; return 0; } static int sidtab_entry_to_string(struct policydb *p, struct sidtab *sidtab, struct sidtab_entry *entry, char **scontext, u32 *scontext_len) { int rc = sidtab_sid2str_get(sidtab, entry, scontext, scontext_len); if (rc != -ENOENT) return rc; rc = context_struct_to_string(p, &entry->context, scontext, scontext_len); if (!rc && scontext) sidtab_sid2str_put(sidtab, entry, *scontext, *scontext_len); return rc; } #include "initial_sid_to_string.h" int security_sidtab_hash_stats(struct selinux_state *state, char *page) { struct selinux_policy *policy; int rc; if (!selinux_initialized(state)) { pr_err("SELinux: %s: called before initial load_policy\n", __func__); return -EINVAL; } rcu_read_lock(); policy = rcu_dereference(state->policy); rc = sidtab_hash_stats(policy->sidtab, page); rcu_read_unlock(); return rc; } const char *security_get_initial_sid_context(u32 sid) { if (unlikely(sid > SECINITSID_NUM)) return NULL; return initial_sid_to_string[sid]; } static int security_sid_to_context_core(struct selinux_state *state, u32 sid, char **scontext, u32 *scontext_len, int force, int only_invalid) { struct selinux_policy *policy; struct policydb *policydb; struct sidtab *sidtab; struct sidtab_entry *entry; int rc = 0; if (scontext) *scontext = NULL; *scontext_len = 0; if (!selinux_initialized(state)) { if (sid <= SECINITSID_NUM) { char *scontextp; const char *s = initial_sid_to_string[sid]; if (!s) return -EINVAL; *scontext_len = strlen(s) + 1; if (!scontext) return 0; scontextp = kmemdup(s, *scontext_len, GFP_ATOMIC); if (!scontextp) return -ENOMEM; *scontext = scontextp; return 0; } pr_err("SELinux: %s: called before initial " "load_policy on unknown SID %d\n", __func__, sid); return -EINVAL; } rcu_read_lock(); policy = rcu_dereference(state->policy); policydb = &policy->policydb; sidtab = policy->sidtab; if (force) entry = sidtab_search_entry_force(sidtab, sid); else entry = sidtab_search_entry(sidtab, sid); if (!entry) { pr_err("SELinux: %s: unrecognized SID %d\n", __func__, sid); rc = -EINVAL; goto out_unlock; } if (only_invalid && !entry->context.len) goto out_unlock; rc = sidtab_entry_to_string(policydb, sidtab, entry, scontext, scontext_len); out_unlock: rcu_read_unlock(); return rc; } /** * security_sid_to_context - Obtain a context for a given SID. * @sid: security identifier, SID * @scontext: security context * @scontext_len: length in bytes * * Write the string representation of the context associated with @sid * into a dynamically allocated string of the correct size. Set @scontext * to point to this string and set @scontext_len to the length of the string. */ int security_sid_to_context(struct selinux_state *state, u32 sid, char **scontext, u32 *scontext_len) { return security_sid_to_context_core(state, sid, scontext, scontext_len, 0, 0); } int security_sid_to_context_force(struct selinux_state *state, u32 sid, char **scontext, u32 *scontext_len) { return security_sid_to_context_core(state, sid, scontext, scontext_len, 1, 0); } /** * security_sid_to_context_inval - Obtain a context for a given SID if it * is invalid. * @sid: security identifier, SID * @scontext: security context * @scontext_len: length in bytes * * Write the string representation of the context associated with @sid * into a dynamically allocated string of the correct size, but only if the * context is invalid in the current policy. Set @scontext to point to * this string (or NULL if the context is valid) and set @scontext_len to * the length of the string (or 0 if the context is valid). */ int security_sid_to_context_inval(struct selinux_state *state, u32 sid, char **scontext, u32 *scontext_len) { return security_sid_to_context_core(state, sid, scontext, scontext_len, 1, 1); } /* * Caveat: Mutates scontext. */ static int string_to_context_struct(struct policydb *pol, struct sidtab *sidtabp, char *scontext, struct context *ctx, u32 def_sid) { struct role_datum *role; struct type_datum *typdatum; struct user_datum *usrdatum; char *scontextp, *p, oldc; int rc = 0; context_init(ctx); /* Parse the security context. */ rc = -EINVAL; scontextp = (char *) scontext; /* Extract the user. */ p = scontextp; while (*p && *p != ':') p++; if (*p == 0) goto out; *p++ = 0; usrdatum = symtab_search(&pol->p_users, scontextp); if (!usrdatum) goto out; ctx->user = usrdatum->value; /* Extract role. */ scontextp = p; while (*p && *p != ':') p++; if (*p == 0) goto out; *p++ = 0; role = symtab_search(&pol->p_roles, scontextp); if (!role) goto out; ctx->role = role->value; /* Extract type. */ scontextp = p; while (*p && *p != ':') p++; oldc = *p; *p++ = 0; typdatum = symtab_search(&pol->p_types, scontextp); if (!typdatum || typdatum->attribute) goto out; ctx->type = typdatum->value; rc = mls_context_to_sid(pol, oldc, p, ctx, sidtabp, def_sid); if (rc) goto out; /* Check the validity of the new context. */ rc = -EINVAL; if (!policydb_context_isvalid(pol, ctx)) goto out; rc = 0; out: if (rc) context_destroy(ctx); return rc; } static int security_context_to_sid_core(struct selinux_state *state, const char *scontext, u32 scontext_len, u32 *sid, u32 def_sid, gfp_t gfp_flags, int force) { struct selinux_policy *policy; struct policydb *policydb; struct sidtab *sidtab; char *scontext2, *str = NULL; struct context context; int rc = 0; /* An empty security context is never valid. */ if (!scontext_len) return -EINVAL; /* Copy the string to allow changes and ensure a NUL terminator */ scontext2 = kmemdup_nul(scontext, scontext_len, gfp_flags); if (!scontext2) return -ENOMEM; if (!selinux_initialized(state)) { int i; for (i = 1; i < SECINITSID_NUM; i++) { const char *s = initial_sid_to_string[i]; if (s && !strcmp(s, scontext2)) { *sid = i; goto out; } } *sid = SECINITSID_KERNEL; goto out; } *sid = SECSID_NULL; if (force) { /* Save another copy for storing in uninterpreted form */ rc = -ENOMEM; str = kstrdup(scontext2, gfp_flags); if (!str) goto out; } retry: rcu_read_lock(); policy = rcu_dereference(state->policy); policydb = &policy->policydb; sidtab = policy->sidtab; rc = string_to_context_struct(policydb, sidtab, scontext2, &context, def_sid); if (rc == -EINVAL && force) { context.str = str; context.len = strlen(str) + 1; str = NULL; } else if (rc) goto out_unlock; rc = sidtab_context_to_sid(sidtab, &context, sid); if (rc == -ESTALE) { rcu_read_unlock(); if (context.str) { str = context.str; context.str = NULL; } context_destroy(&context); goto retry; } context_destroy(&context); out_unlock: rcu_read_unlock(); out: kfree(scontext2); kfree(str); return rc; } /** * security_context_to_sid - Obtain a SID for a given security context. * @scontext: security context * @scontext_len: length in bytes * @sid: security identifier, SID * @gfp: context for the allocation * * Obtains a SID associated with the security context that * has the string representation specified by @scontext. * Returns -%EINVAL if the context is invalid, -%ENOMEM if insufficient * memory is available, or 0 on success. */ int security_context_to_sid(struct selinux_state *state, const char *scontext, u32 scontext_len, u32 *sid, gfp_t gfp) { return security_context_to_sid_core(state, scontext, scontext_len, sid, SECSID_NULL, gfp, 0); } int security_context_str_to_sid(struct selinux_state *state, const char *scontext, u32 *sid, gfp_t gfp) { return security_context_to_sid(state, scontext, strlen(scontext), sid, gfp); } /** * security_context_to_sid_default - Obtain a SID for a given security context, * falling back to specified default if needed. * * @scontext: security context * @scontext_len: length in bytes * @sid: security identifier, SID * @def_sid: default SID to assign on error * * Obtains a SID associated with the security context that * has the string representation specified by @scontext. * The default SID is passed to the MLS layer to be used to allow * kernel labeling of the MLS field if the MLS field is not present * (for upgrading to MLS without full relabel). * Implicitly forces adding of the context even if it cannot be mapped yet. * Returns -%EINVAL if the context is invalid, -%ENOMEM if insufficient * memory is available, or 0 on success. */ int security_context_to_sid_default(struct selinux_state *state, const char *scontext, u32 scontext_len, u32 *sid, u32 def_sid, gfp_t gfp_flags) { return security_context_to_sid_core(state, scontext, scontext_len, sid, def_sid, gfp_flags, 1); } int security_context_to_sid_force(struct selinux_state *state, const char *scontext, u32 scontext_len, u32 *sid) { return security_context_to_sid_core(state, scontext, scontext_len, sid, SECSID_NULL, GFP_KERNEL, 1); } static int compute_sid_handle_invalid_context( struct selinux_state *state, struct selinux_policy *policy, struct sidtab_entry *sentry, struct sidtab_entry *tentry, u16 tclass, struct context *newcontext) { struct policydb *policydb = &policy->policydb; struct sidtab *sidtab = policy->sidtab; char *s = NULL, *t = NULL, *n = NULL; u32 slen, tlen, nlen; struct audit_buffer *ab; if (sidtab_entry_to_string(policydb, sidtab, sentry, &s, &slen)) goto out; if (sidtab_entry_to_string(policydb, sidtab, tentry, &t, &tlen)) goto out; if (context_struct_to_string(policydb, newcontext, &n, &nlen)) goto out; ab = audit_log_start(audit_context(), GFP_ATOMIC, AUDIT_SELINUX_ERR); audit_log_format(ab, "op=security_compute_sid invalid_context="); /* no need to record the NUL with untrusted strings */ audit_log_n_untrustedstring(ab, n, nlen - 1); audit_log_format(ab, " scontext=%s tcontext=%s tclass=%s", s, t, sym_name(policydb, SYM_CLASSES, tclass-1)); audit_log_end(ab); out: kfree(s); kfree(t); kfree(n); if (!enforcing_enabled(state)) return 0; return -EACCES; } static void filename_compute_type(struct policydb *policydb, struct context *newcontext, u32 stype, u32 ttype, u16 tclass, const char *objname) { struct filename_trans_key ft; struct filename_trans_datum *datum; /* * Most filename trans rules are going to live in specific directories * like /dev or /var/run. This bitmap will quickly skip rule searches * if the ttype does not contain any rules. */ if (!ebitmap_get_bit(&policydb->filename_trans_ttypes, ttype)) return; ft.ttype = ttype; ft.tclass = tclass; ft.name = objname; datum = policydb_filenametr_search(policydb, &ft); while (datum) { if (ebitmap_get_bit(&datum->stypes, stype - 1)) { newcontext->type = datum->otype; return; } datum = datum->next; } } static int security_compute_sid(struct selinux_state *state, u32 ssid, u32 tsid, u16 orig_tclass, u32 specified, const char *objname, u32 *out_sid, bool kern) { struct selinux_policy *policy; struct policydb *policydb; struct sidtab *sidtab; struct class_datum *cladatum; struct context *scontext, *tcontext, newcontext; struct sidtab_entry *sentry, *tentry; struct avtab_key avkey; struct avtab_datum *avdatum; struct avtab_node *node; u16 tclass; int rc = 0; bool sock; if (!selinux_initialized(state)) { switch (orig_tclass) { case SECCLASS_PROCESS: /* kernel value */ *out_sid = ssid; break; default: *out_sid = tsid; break; } goto out; } retry: cladatum = NULL; context_init(&newcontext); rcu_read_lock(); policy = rcu_dereference(state->policy); if (kern) { tclass = unmap_class(&policy->map, orig_tclass); sock = security_is_socket_class(orig_tclass); } else { tclass = orig_tclass; sock = security_is_socket_class(map_class(&policy->map, tclass)); } policydb = &policy->policydb; sidtab = policy->sidtab; sentry = sidtab_search_entry(sidtab, ssid); if (!sentry) { pr_err("SELinux: %s: unrecognized SID %d\n", __func__, ssid); rc = -EINVAL; goto out_unlock; } tentry = sidtab_search_entry(sidtab, tsid); if (!tentry) { pr_err("SELinux: %s: unrecognized SID %d\n", __func__, tsid); rc = -EINVAL; goto out_unlock; } scontext = &sentry->context; tcontext = &tentry->context; if (tclass && tclass <= policydb->p_classes.nprim) cladatum = policydb->class_val_to_struct[tclass - 1]; /* Set the user identity. */ switch (specified) { case AVTAB_TRANSITION: case AVTAB_CHANGE: if (cladatum && cladatum->default_user == DEFAULT_TARGET) { newcontext.user = tcontext->user; } else { /* notice this gets both DEFAULT_SOURCE and unset */ /* Use the process user identity. */ newcontext.user = scontext->user; } break; case AVTAB_MEMBER: /* Use the related object owner. */ newcontext.user = tcontext->user; break; } /* Set the role to default values. */ if (cladatum && cladatum->default_role == DEFAULT_SOURCE) { newcontext.role = scontext->role; } else if (cladatum && cladatum->default_role == DEFAULT_TARGET) { newcontext.role = tcontext->role; } else { if ((tclass == policydb->process_class) || sock) newcontext.role = scontext->role; else newcontext.role = OBJECT_R_VAL; } /* Set the type to default values. */ if (cladatum && cladatum->default_type == DEFAULT_SOURCE) { newcontext.type = scontext->type; } else if (cladatum && cladatum->default_type == DEFAULT_TARGET) { newcontext.type = tcontext->type; } else { if ((tclass == policydb->process_class) || sock) { /* Use the type of process. */ newcontext.type = scontext->type; } else { /* Use the type of the related object. */ newcontext.type = tcontext->type; } } /* Look for a type transition/member/change rule. */ avkey.source_type = scontext->type; avkey.target_type = tcontext->type; avkey.target_class = tclass; avkey.specified = specified; avdatum = avtab_search(&policydb->te_avtab, &avkey); /* If no permanent rule, also check for enabled conditional rules */ if (!avdatum) { node = avtab_search_node(&policydb->te_cond_avtab, &avkey); for (; node; node = avtab_search_node_next(node, specified)) { if (node->key.specified & AVTAB_ENABLED) { avdatum = &node->datum; break; } } } if (avdatum) { /* Use the type from the type transition/member/change rule. */ newcontext.type = avdatum->u.data; } /* if we have a objname this is a file trans check so check those rules */ if (objname) filename_compute_type(policydb, &newcontext, scontext->type, tcontext->type, tclass, objname); /* Check for class-specific changes. */ if (specified & AVTAB_TRANSITION) { /* Look for a role transition rule. */ struct role_trans_datum *rtd; struct role_trans_key rtk = { .role = scontext->role, .type = tcontext->type, .tclass = tclass, }; rtd = policydb_roletr_search(policydb, &rtk); if (rtd) newcontext.role = rtd->new_role; } /* Set the MLS attributes. This is done last because it may allocate memory. */ rc = mls_compute_sid(policydb, scontext, tcontext, tclass, specified, &newcontext, sock); if (rc) goto out_unlock; /* Check the validity of the context. */ if (!policydb_context_isvalid(policydb, &newcontext)) { rc = compute_sid_handle_invalid_context(state, policy, sentry, tentry, tclass, &newcontext); if (rc) goto out_unlock; } /* Obtain the sid for the context. */ rc = sidtab_context_to_sid(sidtab, &newcontext, out_sid); if (rc == -ESTALE) { rcu_read_unlock(); context_destroy(&newcontext); goto retry; } out_unlock: rcu_read_unlock(); context_destroy(&newcontext); out: return rc; } /** * security_transition_sid - Compute the SID for a new subject/object. * @ssid: source security identifier * @tsid: target security identifier * @tclass: target security class * @out_sid: security identifier for new subject/object * * Compute a SID to use for labeling a new subject or object in the * class @tclass based on a SID pair (@ssid, @tsid). * Return -%EINVAL if any of the parameters are invalid, -%ENOMEM * if insufficient memory is available, or %0 if the new SID was * computed successfully. */ int security_transition_sid(struct selinux_state *state, u32 ssid, u32 tsid, u16 tclass, const struct qstr *qstr, u32 *out_sid) { return security_compute_sid(state, ssid, tsid, tclass, AVTAB_TRANSITION, qstr ? qstr->name : NULL, out_sid, true); } int security_transition_sid_user(struct selinux_state *state, u32 ssid, u32 tsid, u16 tclass, const char *objname, u32 *out_sid) { return security_compute_sid(state, ssid, tsid, tclass, AVTAB_TRANSITION, objname, out_sid, false); } /** * security_member_sid - Compute the SID for member selection. * @ssid: source security identifier * @tsid: target security identifier * @tclass: target security class * @out_sid: security identifier for selected member * * Compute a SID to use when selecting a member of a polyinstantiated * object of class @tclass based on a SID pair (@ssid, @tsid). * Return -%EINVAL if any of the parameters are invalid, -%ENOMEM * if insufficient memory is available, or %0 if the SID was * computed successfully. */ int security_member_sid(struct selinux_state *state, u32 ssid, u32 tsid, u16 tclass, u32 *out_sid) { return security_compute_sid(state, ssid, tsid, tclass, AVTAB_MEMBER, NULL, out_sid, false); } /** * security_change_sid - Compute the SID for object relabeling. * @ssid: source security identifier * @tsid: target security identifier * @tclass: target security class * @out_sid: security identifier for selected member * * Compute a SID to use for relabeling an object of class @tclass * based on a SID pair (@ssid, @tsid). * Return -%EINVAL if any of the parameters are invalid, -%ENOMEM * if insufficient memory is available, or %0 if the SID was * computed successfully. */ int security_change_sid(struct selinux_state *state, u32 ssid, u32 tsid, u16 tclass, u32 *out_sid) { return security_compute_sid(state, ssid, tsid, tclass, AVTAB_CHANGE, NULL, out_sid, false); } static inline int convert_context_handle_invalid_context( struct selinux_state *state, struct policydb *policydb, struct context *context) { char *s; u32 len; if (enforcing_enabled(state)) return -EINVAL; if (!context_struct_to_string(policydb, context, &s, &len)) { pr_warn("SELinux: Context %s would be invalid if enforcing\n", s); kfree(s); } return 0; } /* * Convert the values in the security context * structure `oldc' from the values specified * in the policy `p->oldp' to the values specified * in the policy `p->newp', storing the new context * in `newc'. Verify that the context is valid * under the new policy. */ static int convert_context(struct context *oldc, struct context *newc, void *p) { struct convert_context_args *args; struct ocontext *oc; struct role_datum *role; struct type_datum *typdatum; struct user_datum *usrdatum; char *s; u32 len; int rc; args = p; if (oldc->str) { s = kstrdup(oldc->str, GFP_KERNEL); if (!s) return -ENOMEM; rc = string_to_context_struct(args->newp, NULL, s, newc, SECSID_NULL); if (rc == -EINVAL) { /* * Retain string representation for later mapping. * * IMPORTANT: We need to copy the contents of oldc->str * back into s again because string_to_context_struct() * may have garbled it. */ memcpy(s, oldc->str, oldc->len); context_init(newc); newc->str = s; newc->len = oldc->len; return 0; } kfree(s); if (rc) { /* Other error condition, e.g. ENOMEM. */ pr_err("SELinux: Unable to map context %s, rc = %d.\n", oldc->str, -rc); return rc; } pr_info("SELinux: Context %s became valid (mapped).\n", oldc->str); return 0; } context_init(newc); /* Convert the user. */ rc = -EINVAL; usrdatum = symtab_search(&args->newp->p_users, sym_name(args->oldp, SYM_USERS, oldc->user - 1)); if (!usrdatum) goto bad; newc->user = usrdatum->value; /* Convert the role. */ rc = -EINVAL; role = symtab_search(&args->newp->p_roles, sym_name(args->oldp, SYM_ROLES, oldc->role - 1)); if (!role) goto bad; newc->role = role->value; /* Convert the type. */ rc = -EINVAL; typdatum = symtab_search(&args->newp->p_types, sym_name(args->oldp, SYM_TYPES, oldc->type - 1)); if (!typdatum) goto bad; newc->type = typdatum->value; /* Convert the MLS fields if dealing with MLS policies */ if (args->oldp->mls_enabled && args->newp->mls_enabled) { rc = mls_convert_context(args->oldp, args->newp, oldc, newc); if (rc) goto bad; } else if (!args->oldp->mls_enabled && args->newp->mls_enabled) { /* * Switching between non-MLS and MLS policy: * ensure that the MLS fields of the context for all * existing entries in the sidtab are filled in with a * suitable default value, likely taken from one of the * initial SIDs. */ oc = args->newp->ocontexts[OCON_ISID]; while (oc && oc->sid[0] != SECINITSID_UNLABELED) oc = oc->next; rc = -EINVAL; if (!oc) { pr_err("SELinux: unable to look up" " the initial SIDs list\n"); goto bad; } rc = mls_range_set(newc, &oc->context[0].range); if (rc) goto bad; } /* Check the validity of the new context. */ if (!policydb_context_isvalid(args->newp, newc)) { rc = convert_context_handle_invalid_context(args->state, args->oldp, oldc); if (rc) goto bad; } return 0; bad: /* Map old representation to string and save it. */ rc = context_struct_to_string(args->oldp, oldc, &s, &len); if (rc) return rc; context_destroy(newc); newc->str = s; newc->len = len; pr_info("SELinux: Context %s became invalid (unmapped).\n", newc->str); return 0; } static void security_load_policycaps(struct selinux_state *state, struct selinux_policy *policy) { struct policydb *p; unsigned int i; struct ebitmap_node *node; p = &policy->policydb; for (i = 0; i < ARRAY_SIZE(state->policycap); i++) WRITE_ONCE(state->policycap[i], ebitmap_get_bit(&p->policycaps, i)); for (i = 0; i < ARRAY_SIZE(selinux_policycap_names); i++) pr_info("SELinux: policy capability %s=%d\n", selinux_policycap_names[i], ebitmap_get_bit(&p->policycaps, i)); ebitmap_for_each_positive_bit(&p->policycaps, node, i) { if (i >= ARRAY_SIZE(selinux_policycap_names)) pr_info("SELinux: unknown policy capability %u\n", i); } } static int security_preserve_bools(struct selinux_policy *oldpolicy, struct selinux_policy *newpolicy); static void selinux_policy_free(struct selinux_policy *policy) { if (!policy) return; sidtab_destroy(policy->sidtab); kfree(policy->map.mapping); policydb_destroy(&policy->policydb); kfree(policy->sidtab); kfree(policy); } static void selinux_policy_cond_free(struct selinux_policy *policy) { cond_policydb_destroy_dup(&policy->policydb); kfree(policy); } void selinux_policy_cancel(struct selinux_state *state, struct selinux_load_state *load_state) { struct selinux_policy *oldpolicy; oldpolicy = rcu_dereference_protected(state->policy, lockdep_is_held(&state->policy_mutex)); sidtab_cancel_convert(oldpolicy->sidtab); selinux_policy_free(load_state->policy); kfree(load_state->convert_data); } static void selinux_notify_policy_change(struct selinux_state *state, u32 seqno) { /* Flush external caches and notify userspace of policy load */ avc_ss_reset(state->avc, seqno); selnl_notify_policyload(seqno); selinux_status_update_policyload(state, seqno); selinux_netlbl_cache_invalidate(); selinux_xfrm_notify_policyload(); } void selinux_policy_commit(struct selinux_state *state, struct selinux_load_state *load_state) { struct selinux_policy *oldpolicy, *newpolicy = load_state->policy; unsigned long flags; u32 seqno; oldpolicy = rcu_dereference_protected(state->policy, lockdep_is_held(&state->policy_mutex)); /* If switching between different policy types, log MLS status */ if (oldpolicy) { if (oldpolicy->policydb.mls_enabled && !newpolicy->policydb.mls_enabled) pr_info("SELinux: Disabling MLS support...\n"); else if (!oldpolicy->policydb.mls_enabled && newpolicy->policydb.mls_enabled) pr_info("SELinux: Enabling MLS support...\n"); } /* Set latest granting seqno for new policy. */ if (oldpolicy) newpolicy->latest_granting = oldpolicy->latest_granting + 1; else newpolicy->latest_granting = 1; seqno = newpolicy->latest_granting; /* Install the new policy. */ if (oldpolicy) { sidtab_freeze_begin(oldpolicy->sidtab, &flags); rcu_assign_pointer(state->policy, newpolicy); sidtab_freeze_end(oldpolicy->sidtab, &flags); } else { rcu_assign_pointer(state->policy, newpolicy); } /* Load the policycaps from the new policy */ security_load_policycaps(state, newpolicy); if (!selinux_initialized(state)) { /* * After first policy load, the security server is * marked as initialized and ready to handle requests and * any objects created prior to policy load are then labeled. */ selinux_mark_initialized(state); selinux_complete_init(); } /* Free the old policy */ synchronize_rcu(); selinux_policy_free(oldpolicy); kfree(load_state->convert_data); /* Notify others of the policy change */ selinux_notify_policy_change(state, seqno); } /** * security_load_policy - Load a security policy configuration. * @data: binary policy data * @len: length of data in bytes * * Load a new set of security policy configuration data, * validate it and convert the SID table as necessary. * This function will flush the access vector cache after * loading the new policy. */ int security_load_policy(struct selinux_state *state, void *data, size_t len, struct selinux_load_state *load_state) { struct selinux_policy *newpolicy, *oldpolicy; struct selinux_policy_convert_data *convert_data; int rc = 0; struct policy_file file = { data, len }, *fp = &file; newpolicy = kzalloc(sizeof(*newpolicy), GFP_KERNEL); if (!newpolicy) return -ENOMEM; newpolicy->sidtab = kzalloc(sizeof(*newpolicy->sidtab), GFP_KERNEL); if (!newpolicy->sidtab) { rc = -ENOMEM; goto err_policy; } rc = policydb_read(&newpolicy->policydb, fp); if (rc) goto err_sidtab; newpolicy->policydb.len = len; rc = selinux_set_mapping(&newpolicy->policydb, secclass_map, &newpolicy->map); if (rc) goto err_policydb; rc = policydb_load_isids(&newpolicy->policydb, newpolicy->sidtab); if (rc) { pr_err("SELinux: unable to load the initial SIDs\n"); goto err_mapping; } if (!selinux_initialized(state)) { /* First policy load, so no need to preserve state from old policy */ load_state->policy = newpolicy; load_state->convert_data = NULL; return 0; } oldpolicy = rcu_dereference_protected(state->policy, lockdep_is_held(&state->policy_mutex)); /* Preserve active boolean values from the old policy */ rc = security_preserve_bools(oldpolicy, newpolicy); if (rc) { pr_err("SELinux: unable to preserve booleans\n"); goto err_free_isids; } convert_data = kmalloc(sizeof(*convert_data), GFP_KERNEL); if (!convert_data) { rc = -ENOMEM; goto err_free_isids; } /* * Convert the internal representations of contexts * in the new SID table. */ convert_data->args.state = state; convert_data->args.oldp = &oldpolicy->policydb; convert_data->args.newp = &newpolicy->policydb; convert_data->sidtab_params.func = convert_context; convert_data->sidtab_params.args = &convert_data->args; convert_data->sidtab_params.target = newpolicy->sidtab; rc = sidtab_convert(oldpolicy->sidtab, &convert_data->sidtab_params); if (rc) { pr_err("SELinux: unable to convert the internal" " representation of contexts in the new SID" " table\n"); goto err_free_convert_data; } load_state->policy = newpolicy; load_state->convert_data = convert_data; return 0; err_free_convert_data: kfree(convert_data); err_free_isids: sidtab_destroy(newpolicy->sidtab); err_mapping: kfree(newpolicy->map.mapping); err_policydb: policydb_destroy(&newpolicy->policydb); err_sidtab: kfree(newpolicy->sidtab); err_policy: kfree(newpolicy); return rc; } /** * security_port_sid - Obtain the SID for a port. * @protocol: protocol number * @port: port number * @out_sid: security identifier */ int security_port_sid(struct selinux_state *state, u8 protocol, u16 port, u32 *out_sid) { struct selinux_policy *policy; struct policydb *policydb; struct sidtab *sidtab; struct ocontext *c; int rc; if (!selinux_initialized(state)) { *out_sid = SECINITSID_PORT; return 0; } retry: rc = 0; rcu_read_lock(); policy = rcu_dereference(state->policy); policydb = &policy->policydb; sidtab = policy->sidtab; c = policydb->ocontexts[OCON_PORT]; while (c) { if (c->u.port.protocol == protocol && c->u.port.low_port <= port && c->u.port.high_port >= port) break; c = c->next; } if (c) { if (!c->sid[0]) { rc = sidtab_context_to_sid(sidtab, &c->context[0], &c->sid[0]); if (rc == -ESTALE) { rcu_read_unlock(); goto retry; } if (rc) goto out; } *out_sid = c->sid[0]; } else { *out_sid = SECINITSID_PORT; } out: rcu_read_unlock(); return rc; } /** * security_pkey_sid - Obtain the SID for a pkey. * @subnet_prefix: Subnet Prefix * @pkey_num: pkey number * @out_sid: security identifier */ int security_ib_pkey_sid(struct selinux_state *state, u64 subnet_prefix, u16 pkey_num, u32 *out_sid) { struct selinux_policy *policy; struct policydb *policydb; struct sidtab *sidtab; struct ocontext *c; int rc; if (!selinux_initialized(state)) { *out_sid = SECINITSID_UNLABELED; return 0; } retry: rc = 0; rcu_read_lock(); policy = rcu_dereference(state->policy); policydb = &policy->policydb; sidtab = policy->sidtab; c = policydb->ocontexts[OCON_IBPKEY]; while (c) { if (c->u.ibpkey.low_pkey <= pkey_num && c->u.ibpkey.high_pkey >= pkey_num && c->u.ibpkey.subnet_prefix == subnet_prefix) break; c = c->next; } if (c) { if (!c->sid[0]) { rc = sidtab_context_to_sid(sidtab, &c->context[0], &c->sid[0]); if (rc == -ESTALE) { rcu_read_unlock(); goto retry; } if (rc) goto out; } *out_sid = c->sid[0]; } else *out_sid = SECINITSID_UNLABELED; out: rcu_read_unlock(); return rc; } /** * security_ib_endport_sid - Obtain the SID for a subnet management interface. * @dev_name: device name * @port: port number * @out_sid: security identifier */ int security_ib_endport_sid(struct selinux_state *state, const char *dev_name, u8 port_num, u32 *out_sid) { struct selinux_policy *policy; struct policydb *policydb; struct sidtab *sidtab; struct ocontext *c; int rc; if (!selinux_initialized(state)) { *out_sid = SECINITSID_UNLABELED; return 0; } retry: rc = 0; rcu_read_lock(); policy = rcu_dereference(state->policy); policydb = &policy->policydb; sidtab = policy->sidtab; c = policydb->ocontexts[OCON_IBENDPORT]; while (c) { if (c->u.ibendport.port == port_num && !strncmp(c->u.ibendport.dev_name, dev_name, IB_DEVICE_NAME_MAX)) break; c = c->next; } if (c) { if (!c->sid[0]) { rc = sidtab_context_to_sid(sidtab, &c->context[0], &c->sid[0]); if (rc == -ESTALE) { rcu_read_unlock(); goto retry; } if (rc) goto out; } *out_sid = c->sid[0]; } else *out_sid = SECINITSID_UNLABELED; out: rcu_read_unlock(); return rc; } /** * security_netif_sid - Obtain the SID for a network interface. * @name: interface name * @if_sid: interface SID */ int security_netif_sid(struct selinux_state *state, char *name, u32 *if_sid) { struct selinux_policy *policy; struct policydb *policydb; struct sidtab *sidtab; int rc; struct ocontext *c; if (!selinux_initialized(state)) { *if_sid = SECINITSID_NETIF; return 0; } retry: rc = 0; rcu_read_lock(); policy = rcu_dereference(state->policy); policydb = &policy->policydb; sidtab = policy->sidtab; c = policydb->ocontexts[OCON_NETIF]; while (c) { if (strcmp(name, c->u.name) == 0) break; c = c->next; } if (c) { if (!c->sid[0] || !c->sid[1]) { rc = sidtab_context_to_sid(sidtab, &c->context[0], &c->sid[0]); if (rc == -ESTALE) { rcu_read_unlock(); goto retry; } if (rc) goto out; rc = sidtab_context_to_sid(sidtab, &c->context[1], &c->sid[1]); if (rc == -ESTALE) { rcu_read_unlock(); goto retry; } if (rc) goto out; } *if_sid = c->sid[0]; } else *if_sid = SECINITSID_NETIF; out: rcu_read_unlock(); return rc; } static int match_ipv6_addrmask(u32 *input, u32 *addr, u32 *mask) { int i, fail = 0; for (i = 0; i < 4; i++) if (addr[i] != (input[i] & mask[i])) { fail = 1; break; } return !fail; } /** * security_node_sid - Obtain the SID for a node (host). * @domain: communication domain aka address family * @addrp: address * @addrlen: address length in bytes * @out_sid: security identifier */ int security_node_sid(struct selinux_state *state, u16 domain, void *addrp, u32 addrlen, u32 *out_sid) { struct selinux_policy *policy; struct policydb *policydb; struct sidtab *sidtab; int rc; struct ocontext *c; if (!selinux_initialized(state)) { *out_sid = SECINITSID_NODE; return 0; } retry: rcu_read_lock(); policy = rcu_dereference(state->policy); policydb = &policy->policydb; sidtab = policy->sidtab; switch (domain) { case AF_INET: { u32 addr; rc = -EINVAL; if (addrlen != sizeof(u32)) goto out; addr = *((u32 *)addrp); c = policydb->ocontexts[OCON_NODE]; while (c) { if (c->u.node.addr == (addr & c->u.node.mask)) break; c = c->next; } break; } case AF_INET6: rc = -EINVAL; if (addrlen != sizeof(u64) * 2) goto out; c = policydb->ocontexts[OCON_NODE6]; while (c) { if (match_ipv6_addrmask(addrp, c->u.node6.addr, c->u.node6.mask)) break; c = c->next; } break; default: rc = 0; *out_sid = SECINITSID_NODE; goto out; } if (c) { if (!c->sid[0]) { rc = sidtab_context_to_sid(sidtab, &c->context[0], &c->sid[0]); if (rc == -ESTALE) { rcu_read_unlock(); goto retry; } if (rc) goto out; } *out_sid = c->sid[0]; } else { *out_sid = SECINITSID_NODE; } rc = 0; out: rcu_read_unlock(); return rc; } #define SIDS_NEL 25 /** * security_get_user_sids - Obtain reachable SIDs for a user. * @fromsid: starting SID * @username: username * @sids: array of reachable SIDs for user * @nel: number of elements in @sids * * Generate the set of SIDs for legal security contexts * for a given user that can be reached by @fromsid. * Set *@sids to point to a dynamically allocated * array containing the set of SIDs. Set *@nel to the * number of elements in the array. */ int security_get_user_sids(struct selinux_state *state, u32 fromsid, char *username, u32 **sids, u32 *nel) { struct selinux_policy *policy; struct policydb *policydb; struct sidtab *sidtab; struct context *fromcon, usercon; u32 *mysids = NULL, *mysids2, sid; u32 i, j, mynel, maxnel = SIDS_NEL; struct user_datum *user; struct role_datum *role; struct ebitmap_node *rnode, *tnode; int rc; *sids = NULL; *nel = 0; if (!selinux_initialized(state)) return 0; mysids = kcalloc(maxnel, sizeof(*mysids), GFP_KERNEL); if (!mysids) return -ENOMEM; retry: mynel = 0; rcu_read_lock(); policy = rcu_dereference(state->policy); policydb = &policy->policydb; sidtab = policy->sidtab; context_init(&usercon); rc = -EINVAL; fromcon = sidtab_search(sidtab, fromsid); if (!fromcon) goto out_unlock; rc = -EINVAL; user = symtab_search(&policydb->p_users, username); if (!user) goto out_unlock; usercon.user = user->value; ebitmap_for_each_positive_bit(&user->roles, rnode, i) { role = policydb->role_val_to_struct[i]; usercon.role = i + 1; ebitmap_for_each_positive_bit(&role->types, tnode, j) { usercon.type = j + 1; if (mls_setup_user_range(policydb, fromcon, user, &usercon)) continue; rc = sidtab_context_to_sid(sidtab, &usercon, &sid); if (rc == -ESTALE) { rcu_read_unlock(); goto retry; } if (rc) goto out_unlock; if (mynel < maxnel) { mysids[mynel++] = sid; } else { rc = -ENOMEM; maxnel += SIDS_NEL; mysids2 = kcalloc(maxnel, sizeof(*mysids2), GFP_ATOMIC); if (!mysids2) goto out_unlock; memcpy(mysids2, mysids, mynel * sizeof(*mysids2)); kfree(mysids); mysids = mysids2; mysids[mynel++] = sid; } } } rc = 0; out_unlock: rcu_read_unlock(); if (rc || !mynel) { kfree(mysids); return rc; } rc = -ENOMEM; mysids2 = kcalloc(mynel, sizeof(*mysids2), GFP_KERNEL); if (!mysids2) { kfree(mysids); return rc; } for (i = 0, j = 0; i < mynel; i++) { struct av_decision dummy_avd; rc = avc_has_perm_noaudit(state, fromsid, mysids[i], SECCLASS_PROCESS, /* kernel value */ PROCESS__TRANSITION, AVC_STRICT, &dummy_avd); if (!rc) mysids2[j++] = mysids[i]; cond_resched(); } kfree(mysids); *sids = mysids2; *nel = j; return 0; } /** * __security_genfs_sid - Helper to obtain a SID for a file in a filesystem * @fstype: filesystem type * @path: path from root of mount * @sclass: file security class * @sid: SID for path * * Obtain a SID to use for a file in a filesystem that * cannot support xattr or use a fixed labeling behavior like * transition SIDs or task SIDs. * * WARNING: This function may return -ESTALE, indicating that the caller * must retry the operation after re-acquiring the policy pointer! */ static inline int __security_genfs_sid(struct selinux_policy *policy, const char *fstype, char *path, u16 orig_sclass, u32 *sid) { struct policydb *policydb = &policy->policydb; struct sidtab *sidtab = policy->sidtab; int len; u16 sclass; struct genfs *genfs; struct ocontext *c; int rc, cmp = 0; while (path[0] == '/' && path[1] == '/') path++; sclass = unmap_class(&policy->map, orig_sclass); *sid = SECINITSID_UNLABELED; for (genfs = policydb->genfs; genfs; genfs = genfs->next) { cmp = strcmp(fstype, genfs->fstype); if (cmp <= 0) break; } rc = -ENOENT; if (!genfs || cmp) goto out; for (c = genfs->head; c; c = c->next) { len = strlen(c->u.name); if ((!c->v.sclass || sclass == c->v.sclass) && (strncmp(c->u.name, path, len) == 0)) break; } rc = -ENOENT; if (!c) goto out; if (!c->sid[0]) { rc = sidtab_context_to_sid(sidtab, &c->context[0], &c->sid[0]); if (rc) goto out; } *sid = c->sid[0]; rc = 0; out: return rc; } /** * security_genfs_sid - Obtain a SID for a file in a filesystem * @fstype: filesystem type * @path: path from root of mount * @sclass: file security class * @sid: SID for path * * Acquire policy_rwlock before calling __security_genfs_sid() and release * it afterward. */ int security_genfs_sid(struct selinux_state *state, const char *fstype, char *path, u16 orig_sclass, u32 *sid) { struct selinux_policy *policy; int retval; if (!selinux_initialized(state)) { *sid = SECINITSID_UNLABELED; return 0; } do { rcu_read_lock(); policy = rcu_dereference(state->policy); retval = __security_genfs_sid(policy, fstype, path, orig_sclass, sid); rcu_read_unlock(); } while (retval == -ESTALE); return retval; } int selinux_policy_genfs_sid(struct selinux_policy *policy, const char *fstype, char *path, u16 orig_sclass, u32 *sid) { /* no lock required, policy is not yet accessible by other threads */ return __security_genfs_sid(policy, fstype, path, orig_sclass, sid); } /** * security_fs_use - Determine how to handle labeling for a filesystem. * @sb: superblock in question */ int security_fs_use(struct selinux_state *state, struct super_block *sb) { struct selinux_policy *policy; struct policydb *policydb; struct sidtab *sidtab; int rc; struct ocontext *c; struct superblock_security_struct *sbsec = sb->s_security; const char *fstype = sb->s_type->name; if (!selinux_initialized(state)) { sbsec->behavior = SECURITY_FS_USE_NONE; sbsec->sid = SECINITSID_UNLABELED; return 0; } retry: rc = 0; rcu_read_lock(); policy = rcu_dereference(state->policy); policydb = &policy->policydb; sidtab = policy->sidtab; c = policydb->ocontexts[OCON_FSUSE]; while (c) { if (strcmp(fstype, c->u.name) == 0) break; c = c->next; } if (c) { sbsec->behavior = c->v.behavior; if (!c->sid[0]) { rc = sidtab_context_to_sid(sidtab, &c->context[0], &c->sid[0]); if (rc == -ESTALE) { rcu_read_unlock(); goto retry; } if (rc) goto out; } sbsec->sid = c->sid[0]; } else { rc = __security_genfs_sid(policy, fstype, "/", SECCLASS_DIR, &sbsec->sid); if (rc == -ESTALE) { rcu_read_unlock(); goto retry; } if (rc) { sbsec->behavior = SECURITY_FS_USE_NONE; rc = 0; } else { sbsec->behavior = SECURITY_FS_USE_GENFS; } } out: rcu_read_unlock(); return rc; } int security_get_bools(struct selinux_policy *policy, u32 *len, char ***names, int **values) { struct policydb *policydb; u32 i; int rc; policydb = &policy->policydb; *names = NULL; *values = NULL; rc = 0; *len = policydb->p_bools.nprim; if (!*len) goto out; rc = -ENOMEM; *names = kcalloc(*len, sizeof(char *), GFP_ATOMIC); if (!*names) goto err; rc = -ENOMEM; *values = kcalloc(*len, sizeof(int), GFP_ATOMIC); if (!*values) goto err; for (i = 0; i < *len; i++) { (*values)[i] = policydb->bool_val_to_struct[i]->state; rc = -ENOMEM; (*names)[i] = kstrdup(sym_name(policydb, SYM_BOOLS, i), GFP_ATOMIC); if (!(*names)[i]) goto err; } rc = 0; out: return rc; err: if (*names) { for (i = 0; i < *len; i++) kfree((*names)[i]); kfree(*names); } kfree(*values); *len = 0; *names = NULL; *values = NULL; goto out; } int security_set_bools(struct selinux_state *state, u32 len, int *values) { struct selinux_policy *newpolicy, *oldpolicy; int rc; u32 i, seqno = 0; if (!selinux_initialized(state)) return -EINVAL; oldpolicy = rcu_dereference_protected(state->policy, lockdep_is_held(&state->policy_mutex)); /* Consistency check on number of booleans, should never fail */ if (WARN_ON(len != oldpolicy->policydb.p_bools.nprim)) return -EINVAL; newpolicy = kmemdup(oldpolicy, sizeof(*newpolicy), GFP_KERNEL); if (!newpolicy) return -ENOMEM; /* * Deep copy only the parts of the policydb that might be * modified as a result of changing booleans. */ rc = cond_policydb_dup(&newpolicy->policydb, &oldpolicy->policydb); if (rc) { kfree(newpolicy); return -ENOMEM; } /* Update the boolean states in the copy */ for (i = 0; i < len; i++) { int new_state = !!values[i]; int old_state = newpolicy->policydb.bool_val_to_struct[i]->state; if (new_state != old_state) { audit_log(audit_context(), GFP_ATOMIC, AUDIT_MAC_CONFIG_CHANGE, "bool=%s val=%d old_val=%d auid=%u ses=%u", sym_name(&newpolicy->policydb, SYM_BOOLS, i), new_state, old_state, from_kuid(&init_user_ns, audit_get_loginuid(current)), audit_get_sessionid(current)); newpolicy->policydb.bool_val_to_struct[i]->state = new_state; } } /* Re-evaluate the conditional rules in the copy */ evaluate_cond_nodes(&newpolicy->policydb); /* Set latest granting seqno for new policy */ newpolicy->latest_granting = oldpolicy->latest_granting + 1; seqno = newpolicy->latest_granting; /* Install the new policy */ rcu_assign_pointer(state->policy, newpolicy); /* * Free the conditional portions of the old policydb * that were copied for the new policy, and the oldpolicy * structure itself but not what it references. */ synchronize_rcu(); selinux_policy_cond_free(oldpolicy); /* Notify others of the policy change */ selinux_notify_policy_change(state, seqno); return 0; } int security_get_bool_value(struct selinux_state *state, u32 index) { struct selinux_policy *policy; struct policydb *policydb; int rc; u32 len; if (!selinux_initialized(state)) return 0; rcu_read_lock(); policy = rcu_dereference(state->policy); policydb = &policy->policydb; rc = -EFAULT; len = policydb->p_bools.nprim; if (index >= len) goto out; rc = policydb->bool_val_to_struct[index]->state; out: rcu_read_unlock(); return rc; } static int security_preserve_bools(struct selinux_policy *oldpolicy, struct selinux_policy *newpolicy) { int rc, *bvalues = NULL; char **bnames = NULL; struct cond_bool_datum *booldatum; u32 i, nbools = 0; rc = security_get_bools(oldpolicy, &nbools, &bnames, &bvalues); if (rc) goto out; for (i = 0; i < nbools; i++) { booldatum = symtab_search(&newpolicy->policydb.p_bools, bnames[i]); if (booldatum) booldatum->state = bvalues[i]; } evaluate_cond_nodes(&newpolicy->policydb); out: if (bnames) { for (i = 0; i < nbools; i++) kfree(bnames[i]); } kfree(bnames); kfree(bvalues); return rc; } /* * security_sid_mls_copy() - computes a new sid based on the given * sid and the mls portion of mls_sid. */ int security_sid_mls_copy(struct selinux_state *state, u32 sid, u32 mls_sid, u32 *new_sid) { struct selinux_policy *policy; struct policydb *policydb; struct sidtab *sidtab; struct context *context1; struct context *context2; struct context newcon; char *s; u32 len; int rc; if (!selinux_initialized(state)) { *new_sid = sid; return 0; } retry: rc = 0; context_init(&newcon); rcu_read_lock(); policy = rcu_dereference(state->policy); policydb = &policy->policydb; sidtab = policy->sidtab; if (!policydb->mls_enabled) { *new_sid = sid; goto out_unlock; } rc = -EINVAL; context1 = sidtab_search(sidtab, sid); if (!context1) { pr_err("SELinux: %s: unrecognized SID %d\n", __func__, sid); goto out_unlock; } rc = -EINVAL; context2 = sidtab_search(sidtab, mls_sid); if (!context2) { pr_err("SELinux: %s: unrecognized SID %d\n", __func__, mls_sid); goto out_unlock; } newcon.user = context1->user; newcon.role = context1->role; newcon.type = context1->type; rc = mls_context_cpy(&newcon, context2); if (rc) goto out_unlock; /* Check the validity of the new context. */ if (!policydb_context_isvalid(policydb, &newcon)) { rc = convert_context_handle_invalid_context(state, policydb, &newcon); if (rc) { if (!context_struct_to_string(policydb, &newcon, &s, &len)) { struct audit_buffer *ab; ab = audit_log_start(audit_context(), GFP_ATOMIC, AUDIT_SELINUX_ERR); audit_log_format(ab, "op=security_sid_mls_copy invalid_context="); /* don't record NUL with untrusted strings */ audit_log_n_untrustedstring(ab, s, len - 1); audit_log_end(ab); kfree(s); } goto out_unlock; } } rc = sidtab_context_to_sid(sidtab, &newcon, new_sid); if (rc == -ESTALE) { rcu_read_unlock(); context_destroy(&newcon); goto retry; } out_unlock: rcu_read_unlock(); context_destroy(&newcon); return rc; } /** * security_net_peersid_resolve - Compare and resolve two network peer SIDs * @nlbl_sid: NetLabel SID * @nlbl_type: NetLabel labeling protocol type * @xfrm_sid: XFRM SID * * Description: * Compare the @nlbl_sid and @xfrm_sid values and if the two SIDs can be * resolved into a single SID it is returned via @peer_sid and the function * returns zero. Otherwise @peer_sid is set to SECSID_NULL and the function * returns a negative value. A table summarizing the behavior is below: * * | function return | @sid * ------------------------------+-----------------+----------------- * no peer labels | 0 | SECSID_NULL * single peer label | 0 | <peer_label> * multiple, consistent labels | 0 | <peer_label> * multiple, inconsistent labels | -<errno> | SECSID_NULL * */ int security_net_peersid_resolve(struct selinux_state *state, u32 nlbl_sid, u32 nlbl_type, u32 xfrm_sid, u32 *peer_sid) { struct selinux_policy *policy; struct policydb *policydb; struct sidtab *sidtab; int rc; struct context *nlbl_ctx; struct context *xfrm_ctx; *peer_sid = SECSID_NULL; /* handle the common (which also happens to be the set of easy) cases * right away, these two if statements catch everything involving a * single or absent peer SID/label */ if (xfrm_sid == SECSID_NULL) { *peer_sid = nlbl_sid; return 0; } /* NOTE: an nlbl_type == NETLBL_NLTYPE_UNLABELED is a "fallback" label * and is treated as if nlbl_sid == SECSID_NULL when a XFRM SID/label * is present */ if (nlbl_sid == SECSID_NULL || nlbl_type == NETLBL_NLTYPE_UNLABELED) { *peer_sid = xfrm_sid; return 0; } if (!selinux_initialized(state)) return 0; rcu_read_lock(); policy = rcu_dereference(state->policy); policydb = &policy->policydb; sidtab = policy->sidtab; /* * We don't need to check initialized here since the only way both * nlbl_sid and xfrm_sid are not equal to SECSID_NULL would be if the * security server was initialized and state->initialized was true. */ if (!policydb->mls_enabled) { rc = 0; goto out; } rc = -EINVAL; nlbl_ctx = sidtab_search(sidtab, nlbl_sid); if (!nlbl_ctx) { pr_err("SELinux: %s: unrecognized SID %d\n", __func__, nlbl_sid); goto out; } rc = -EINVAL; xfrm_ctx = sidtab_search(sidtab, xfrm_sid); if (!xfrm_ctx) { pr_err("SELinux: %s: unrecognized SID %d\n", __func__, xfrm_sid); goto out; } rc = (mls_context_cmp(nlbl_ctx, xfrm_ctx) ? 0 : -EACCES); if (rc) goto out; /* at present NetLabel SIDs/labels really only carry MLS * information so if the MLS portion of the NetLabel SID * matches the MLS portion of the labeled XFRM SID/label * then pass along the XFRM SID as it is the most * expressive */ *peer_sid = xfrm_sid; out: rcu_read_unlock(); return rc; } static int get_classes_callback(void *k, void *d, void *args) { struct class_datum *datum = d; char *name = k, **classes = args; int value = datum->value - 1; classes[value] = kstrdup(name, GFP_ATOMIC); if (!classes[value]) return -ENOMEM; return 0; } int security_get_classes(struct selinux_policy *policy, char ***classes, int *nclasses) { struct policydb *policydb; int rc; policydb = &policy->policydb; rc = -ENOMEM; *nclasses = policydb->p_classes.nprim; *classes = kcalloc(*nclasses, sizeof(**classes), GFP_ATOMIC); if (!*classes) goto out; rc = hashtab_map(&policydb->p_classes.table, get_classes_callback, *classes); if (rc) { int i; for (i = 0; i < *nclasses; i++) kfree((*classes)[i]); kfree(*classes); } out: return rc; } static int get_permissions_callback(void *k, void *d, void *args) { struct perm_datum *datum = d; char *name = k, **perms = args; int value = datum->value - 1; perms[value] = kstrdup(name, GFP_ATOMIC); if (!perms[value]) return -ENOMEM; return 0; } int security_get_permissions(struct selinux_policy *policy, char *class, char ***perms, int *nperms) { struct policydb *policydb; int rc, i; struct class_datum *match; policydb = &policy->policydb; rc = -EINVAL; match = symtab_search(&policydb->p_classes, class); if (!match) { pr_err("SELinux: %s: unrecognized class %s\n", __func__, class); goto out; } rc = -ENOMEM; *nperms = match->permissions.nprim; *perms = kcalloc(*nperms, sizeof(**perms), GFP_ATOMIC); if (!*perms) goto out; if (match->comdatum) { rc = hashtab_map(&match->comdatum->permissions.table, get_permissions_callback, *perms); if (rc) goto err; } rc = hashtab_map(&match->permissions.table, get_permissions_callback, *perms); if (rc) goto err; out: return rc; err: for (i = 0; i < *nperms; i++) kfree((*perms)[i]); kfree(*perms); return rc; } int security_get_reject_unknown(struct selinux_state *state) { struct selinux_policy *policy; int value; if (!selinux_initialized(state)) return 0; rcu_read_lock(); policy = rcu_dereference(state->policy); value = policy->policydb.reject_unknown; rcu_read_unlock(); return value; } int security_get_allow_unknown(struct selinux_state *state) { struct selinux_policy *policy; int value; if (!selinux_initialized(state)) return 0; rcu_read_lock(); policy = rcu_dereference(state->policy); value = policy->policydb.allow_unknown; rcu_read_unlock(); return value; } /** * security_policycap_supported - Check for a specific policy capability * @req_cap: capability * * Description: * This function queries the currently loaded policy to see if it supports the * capability specified by @req_cap. Returns true (1) if the capability is * supported, false (0) if it isn't supported. * */ int security_policycap_supported(struct selinux_state *state, unsigned int req_cap) { struct selinux_policy *policy; int rc; if (!selinux_initialized(state)) return 0; rcu_read_lock(); policy = rcu_dereference(state->policy); rc = ebitmap_get_bit(&policy->policydb.policycaps, req_cap); rcu_read_unlock(); return rc; } struct selinux_audit_rule { u32 au_seqno; struct context au_ctxt; }; void selinux_audit_rule_free(void *vrule) { struct selinux_audit_rule *rule = vrule; if (rule) { context_destroy(&rule->au_ctxt); kfree(rule); } } int selinux_audit_rule_init(u32 field, u32 op, char *rulestr, void **vrule) { struct selinux_state *state = &selinux_state; struct selinux_policy *policy; struct policydb *policydb; struct selinux_audit_rule *tmprule; struct role_datum *roledatum; struct type_datum *typedatum; struct user_datum *userdatum; struct selinux_audit_rule **rule = (struct selinux_audit_rule **)vrule; int rc = 0; *rule = NULL; if (!selinux_initialized(state)) return -EOPNOTSUPP; switch (field) { case AUDIT_SUBJ_USER: case AUDIT_SUBJ_ROLE: case AUDIT_SUBJ_TYPE: case AUDIT_OBJ_USER: case AUDIT_OBJ_ROLE: case AUDIT_OBJ_TYPE: /* only 'equals' and 'not equals' fit user, role, and type */ if (op != Audit_equal && op != Audit_not_equal) return -EINVAL; break; case AUDIT_SUBJ_SEN: case AUDIT_SUBJ_CLR: case AUDIT_OBJ_LEV_LOW: case AUDIT_OBJ_LEV_HIGH: /* we do not allow a range, indicated by the presence of '-' */ if (strchr(rulestr, '-')) return -EINVAL; break; default: /* only the above fields are valid */ return -EINVAL; } tmprule = kzalloc(sizeof(struct selinux_audit_rule), GFP_KERNEL); if (!tmprule) return -ENOMEM; context_init(&tmprule->au_ctxt); rcu_read_lock(); policy = rcu_dereference(state->policy); policydb = &policy->policydb; tmprule->au_seqno = policy->latest_granting; switch (field) { case AUDIT_SUBJ_USER: case AUDIT_OBJ_USER: rc = -EINVAL; userdatum = symtab_search(&policydb->p_users, rulestr); if (!userdatum) goto out; tmprule->au_ctxt.user = userdatum->value; break; case AUDIT_SUBJ_ROLE: case AUDIT_OBJ_ROLE: rc = -EINVAL; roledatum = symtab_search(&policydb->p_roles, rulestr); if (!roledatum) goto out; tmprule->au_ctxt.role = roledatum->value; break; case AUDIT_SUBJ_TYPE: case AUDIT_OBJ_TYPE: rc = -EINVAL; typedatum = symtab_search(&policydb->p_types, rulestr); if (!typedatum) goto out; tmprule->au_ctxt.type = typedatum->value; break; case AUDIT_SUBJ_SEN: case AUDIT_SUBJ_CLR: case AUDIT_OBJ_LEV_LOW: case AUDIT_OBJ_LEV_HIGH: rc = mls_from_string(policydb, rulestr, &tmprule->au_ctxt, GFP_ATOMIC); if (rc) goto out; break; } rc = 0; out: rcu_read_unlock(); if (rc) { selinux_audit_rule_free(tmprule); tmprule = NULL; } *rule = tmprule; return rc; } /* Check to see if the rule contains any selinux fields */ int selinux_audit_rule_known(struct audit_krule *rule) { int i; for (i = 0; i < rule->field_count; i++) { struct audit_field *f = &rule->fields[i]; switch (f->type) { case AUDIT_SUBJ_USER: case AUDIT_SUBJ_ROLE: case AUDIT_SUBJ_TYPE: case AUDIT_SUBJ_SEN: case AUDIT_SUBJ_CLR: case AUDIT_OBJ_USER: case AUDIT_OBJ_ROLE: case AUDIT_OBJ_TYPE: case AUDIT_OBJ_LEV_LOW: case AUDIT_OBJ_LEV_HIGH: return 1; } } return 0; } int selinux_audit_rule_match(u32 sid, u32 field, u32 op, void *vrule) { struct selinux_state *state = &selinux_state; struct selinux_policy *policy; struct context *ctxt; struct mls_level *level; struct selinux_audit_rule *rule = vrule; int match = 0; if (unlikely(!rule)) { WARN_ONCE(1, "selinux_audit_rule_match: missing rule\n"); return -ENOENT; } if (!selinux_initialized(state)) return 0; rcu_read_lock(); policy = rcu_dereference(state->policy); if (rule->au_seqno < policy->latest_granting) { match = -ESTALE; goto out; } ctxt = sidtab_search(policy->sidtab, sid); if (unlikely(!ctxt)) { WARN_ONCE(1, "selinux_audit_rule_match: unrecognized SID %d\n", sid); match = -ENOENT; goto out; } /* a field/op pair that is not caught here will simply fall through without a match */ switch (field) { case AUDIT_SUBJ_USER: case AUDIT_OBJ_USER: switch (op) { case Audit_equal: match = (ctxt->user == rule->au_ctxt.user); break; case Audit_not_equal: match = (ctxt->user != rule->au_ctxt.user); break; } break; case AUDIT_SUBJ_ROLE: case AUDIT_OBJ_ROLE: switch (op) { case Audit_equal: match = (ctxt->role == rule->au_ctxt.role); break; case Audit_not_equal: match = (ctxt->role != rule->au_ctxt.role); break; } break; case AUDIT_SUBJ_TYPE: case AUDIT_OBJ_TYPE: switch (op) { case Audit_equal: match = (ctxt->type == rule->au_ctxt.type); break; case Audit_not_equal: match = (ctxt->type != rule->au_ctxt.type); break; } break; case AUDIT_SUBJ_SEN: case AUDIT_SUBJ_CLR: case AUDIT_OBJ_LEV_LOW: case AUDIT_OBJ_LEV_HIGH: level = ((field == AUDIT_SUBJ_SEN || field == AUDIT_OBJ_LEV_LOW) ? &ctxt->range.level[0] : &ctxt->range.level[1]); switch (op) { case Audit_equal: match = mls_level_eq(&rule->au_ctxt.range.level[0], level); break; case Audit_not_equal: match = !mls_level_eq(&rule->au_ctxt.range.level[0], level); break; case Audit_lt: match = (mls_level_dom(&rule->au_ctxt.range.level[0], level) && !mls_level_eq(&rule->au_ctxt.range.level[0], level)); break; case Audit_le: match = mls_level_dom(&rule->au_ctxt.range.level[0], level); break; case Audit_gt: match = (mls_level_dom(level, &rule->au_ctxt.range.level[0]) && !mls_level_eq(level, &rule->au_ctxt.range.level[0])); break; case Audit_ge: match = mls_level_dom(level, &rule->au_ctxt.range.level[0]); break; } } out: rcu_read_unlock(); return match; } static int (*aurule_callback)(void) = audit_update_lsm_rules; static int aurule_avc_callback(u32 event) { int err = 0; if (event == AVC_CALLBACK_RESET && aurule_callback) err = aurule_callback(); return err; } static int __init aurule_init(void) { int err; err = avc_add_callback(aurule_avc_callback, AVC_CALLBACK_RESET); if (err) panic("avc_add_callback() failed, error %d\n", err); return err; } __initcall(aurule_init); #ifdef CONFIG_NETLABEL /** * security_netlbl_cache_add - Add an entry to the NetLabel cache * @secattr: the NetLabel packet security attributes * @sid: the SELinux SID * * Description: * Attempt to cache the context in @ctx, which was derived from the packet in * @skb, in the NetLabel subsystem cache. This function assumes @secattr has * already been initialized. * */ static void security_netlbl_cache_add(struct netlbl_lsm_secattr *secattr, u32 sid) { u32 *sid_cache; sid_cache = kmalloc(sizeof(*sid_cache), GFP_ATOMIC); if (sid_cache == NULL) return; secattr->cache = netlbl_secattr_cache_alloc(GFP_ATOMIC); if (secattr->cache == NULL) { kfree(sid_cache); return; } *sid_cache = sid; secattr->cache->free = kfree; secattr->cache->data = sid_cache; secattr->flags |= NETLBL_SECATTR_CACHE; } /** * security_netlbl_secattr_to_sid - Convert a NetLabel secattr to a SELinux SID * @secattr: the NetLabel packet security attributes * @sid: the SELinux SID * * Description: * Convert the given NetLabel security attributes in @secattr into a * SELinux SID. If the @secattr field does not contain a full SELinux * SID/context then use SECINITSID_NETMSG as the foundation. If possible the * 'cache' field of @secattr is set and the CACHE flag is set; this is to * allow the @secattr to be used by NetLabel to cache the secattr to SID * conversion for future lookups. Returns zero on success, negative values on * failure. * */ int security_netlbl_secattr_to_sid(struct selinux_state *state, struct netlbl_lsm_secattr *secattr, u32 *sid) { struct selinux_policy *policy; struct policydb *policydb; struct sidtab *sidtab; int rc; struct context *ctx; struct context ctx_new; if (!selinux_initialized(state)) { *sid = SECSID_NULL; return 0; } retry: rc = 0; rcu_read_lock(); policy = rcu_dereference(state->policy); policydb = &policy->policydb; sidtab = policy->sidtab; if (secattr->flags & NETLBL_SECATTR_CACHE) *sid = *(u32 *)secattr->cache->data; else if (secattr->flags & NETLBL_SECATTR_SECID) *sid = secattr->attr.secid; else if (secattr->flags & NETLBL_SECATTR_MLS_LVL) { rc = -EIDRM; ctx = sidtab_search(sidtab, SECINITSID_NETMSG); if (ctx == NULL) goto out; context_init(&ctx_new); ctx_new.user = ctx->user; ctx_new.role = ctx->role; ctx_new.type = ctx->type; mls_import_netlbl_lvl(policydb, &ctx_new, secattr); if (secattr->flags & NETLBL_SECATTR_MLS_CAT) { rc = mls_import_netlbl_cat(policydb, &ctx_new, secattr); if (rc) goto out; } rc = -EIDRM; if (!mls_context_isvalid(policydb, &ctx_new)) { ebitmap_destroy(&ctx_new.range.level[0].cat); goto out; } rc = sidtab_context_to_sid(sidtab, &ctx_new, sid); ebitmap_destroy(&ctx_new.range.level[0].cat); if (rc == -ESTALE) { rcu_read_unlock(); goto retry; } if (rc) goto out; security_netlbl_cache_add(secattr, *sid); } else *sid = SECSID_NULL; out: rcu_read_unlock(); return rc; } /** * security_netlbl_sid_to_secattr - Convert a SELinux SID to a NetLabel secattr * @sid: the SELinux SID * @secattr: the NetLabel packet security attributes * * Description: * Convert the given SELinux SID in @sid into a NetLabel security attribute. * Returns zero on success, negative values on failure. * */ int security_netlbl_sid_to_secattr(struct selinux_state *state, u32 sid, struct netlbl_lsm_secattr *secattr) { struct selinux_policy *policy; struct policydb *policydb; int rc; struct context *ctx; if (!selinux_initialized(state)) return 0; rcu_read_lock(); policy = rcu_dereference(state->policy); policydb = &policy->policydb; rc = -ENOENT; ctx = sidtab_search(policy->sidtab, sid); if (ctx == NULL) goto out; rc = -ENOMEM; secattr->domain = kstrdup(sym_name(policydb, SYM_TYPES, ctx->type - 1), GFP_ATOMIC); if (secattr->domain == NULL) goto out; secattr->attr.secid = sid; secattr->flags |= NETLBL_SECATTR_DOMAIN_CPY | NETLBL_SECATTR_SECID; mls_export_netlbl_lvl(policydb, ctx, secattr); rc = mls_export_netlbl_cat(policydb, ctx, secattr); out: rcu_read_unlock(); return rc; } #endif /* CONFIG_NETLABEL */ /** * security_read_policy - read the policy. * @data: binary policy data * @len: length of data in bytes * */ int security_read_policy(struct selinux_state *state, void **data, size_t *len) { struct selinux_policy *policy; int rc; struct policy_file fp; policy = rcu_dereference_protected( state->policy, lockdep_is_held(&state->policy_mutex)); if (!policy) return -EINVAL; *len = policy->policydb.len; *data = vmalloc_user(*len); if (!*data) return -ENOMEM; fp.data = *data; fp.len = *len; rc = policydb_write(&policy->policydb, &fp); if (rc) return rc; *len = (unsigned long)fp.data - (unsigned long)*data; return 0; }
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 1437 1438 1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 1500 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1533 1534 1535 1536 1537 1538 1539 1540 1541 1542 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553 1554 1555 1556 1557 1558 1559 1560 1561 1562 1563 1564 1565 1566 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 1590 1591 1592 1593 1594 1595 1596 1597 1598 1599 1600 1601 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1622 // SPDX-License-Identifier: GPL-2.0 /* * linux/fs/ext4/ialloc.c * * Copyright (C) 1992, 1993, 1994, 1995 * Remy Card (card@masi.ibp.fr) * Laboratoire MASI - Institut Blaise Pascal * Universite Pierre et Marie Curie (Paris VI) * * BSD ufs-inspired inode and directory allocation by * Stephen Tweedie (sct@redhat.com), 1993 * Big-endian to little-endian byte-swapping/bitmaps by * David S. Miller (davem@caip.rutgers.edu), 1995 */ #include <linux/time.h> #include <linux/fs.h> #include <linux/stat.h> #include <linux/string.h> #include <linux/quotaops.h> #include <linux/buffer_head.h> #include <linux/random.h> #include <linux/bitops.h> #include <linux/blkdev.h> #include <linux/cred.h> #include <asm/byteorder.h> #include "ext4.h" #include "ext4_jbd2.h" #include "xattr.h" #include "acl.h" #include <trace/events/ext4.h> /* * ialloc.c contains the inodes allocation and deallocation routines */ /* * The free inodes are managed by bitmaps. A file system contains several * blocks groups. Each group contains 1 bitmap block for blocks, 1 bitmap * block for inodes, N blocks for the inode table and data blocks. * * The file system contains group descriptors which are located after the * super block. Each descriptor contains the number of the bitmap block and * the free blocks count in the block. */ /* * To avoid calling the atomic setbit hundreds or thousands of times, we only * need to use it within a single byte (to ensure we get endianness right). * We can use memset for the rest of the bitmap as there are no other users. */ void ext4_mark_bitmap_end(int start_bit, int end_bit, char *bitmap) { int i; if (start_bit >= end_bit) return; ext4_debug("mark end bits +%d through +%d used\n", start_bit, end_bit); for (i = start_bit; i < ((start_bit + 7) & ~7UL); i++) ext4_set_bit(i, bitmap); if (i < end_bit) memset(bitmap + (i >> 3), 0xff, (end_bit - i) >> 3); } void ext4_end_bitmap_read(struct buffer_head *bh, int uptodate) { if (uptodate) { set_buffer_uptodate(bh); set_bitmap_uptodate(bh); } unlock_buffer(bh); put_bh(bh); } static int ext4_validate_inode_bitmap(struct super_block *sb, struct ext4_group_desc *desc, ext4_group_t block_group, struct buffer_head *bh) { ext4_fsblk_t blk; struct ext4_group_info *grp; if (EXT4_SB(sb)->s_mount_state & EXT4_FC_REPLAY) return 0; grp = ext4_get_group_info(sb, block_group); if (buffer_verified(bh)) return 0; if (EXT4_MB_GRP_IBITMAP_CORRUPT(grp)) return -EFSCORRUPTED; ext4_lock_group(sb, block_group); if (buffer_verified(bh)) goto verified; blk = ext4_inode_bitmap(sb, desc); if (!ext4_inode_bitmap_csum_verify(sb, block_group, desc, bh, EXT4_INODES_PER_GROUP(sb) / 8) || ext4_simulate_fail(sb, EXT4_SIM_IBITMAP_CRC)) { ext4_unlock_group(sb, block_group); ext4_error(sb, "Corrupt inode bitmap - block_group = %u, " "inode_bitmap = %llu", block_group, blk); ext4_mark_group_bitmap_corrupted(sb, block_group, EXT4_GROUP_INFO_IBITMAP_CORRUPT); return -EFSBADCRC; } set_buffer_verified(bh); verified: ext4_unlock_group(sb, block_group); return 0; } /* * Read the inode allocation bitmap for a given block_group, reading * into the specified slot in the superblock's bitmap cache. * * Return buffer_head of bitmap on success, or an ERR_PTR on error. */ static struct buffer_head * ext4_read_inode_bitmap(struct super_block *sb, ext4_group_t block_group) { struct ext4_group_desc *desc; struct ext4_sb_info *sbi = EXT4_SB(sb); struct buffer_head *bh = NULL; ext4_fsblk_t bitmap_blk; int err; desc = ext4_get_group_desc(sb, block_group, NULL); if (!desc) return ERR_PTR(-EFSCORRUPTED); bitmap_blk = ext4_inode_bitmap(sb, desc); if ((bitmap_blk <= le32_to_cpu(sbi->s_es->s_first_data_block)) || (bitmap_blk >= ext4_blocks_count(sbi->s_es))) { ext4_error(sb, "Invalid inode bitmap blk %llu in " "block_group %u", bitmap_blk, block_group); ext4_mark_group_bitmap_corrupted(sb, block_group, EXT4_GROUP_INFO_IBITMAP_CORRUPT); return ERR_PTR(-EFSCORRUPTED); } bh = sb_getblk(sb, bitmap_blk); if (unlikely(!bh)) { ext4_warning(sb, "Cannot read inode bitmap - " "block_group = %u, inode_bitmap = %llu", block_group, bitmap_blk); return ERR_PTR(-ENOMEM); } if (bitmap_uptodate(bh)) goto verify; lock_buffer(bh); if (bitmap_uptodate(bh)) { unlock_buffer(bh); goto verify; } ext4_lock_group(sb, block_group); if (ext4_has_group_desc_csum(sb) && (desc->bg_flags & cpu_to_le16(EXT4_BG_INODE_UNINIT))) { if (block_group == 0) { ext4_unlock_group(sb, block_group); unlock_buffer(bh); ext4_error(sb, "Inode bitmap for bg 0 marked " "uninitialized"); err = -EFSCORRUPTED; goto out; } memset(bh->b_data, 0, (EXT4_INODES_PER_GROUP(sb) + 7) / 8); ext4_mark_bitmap_end(EXT4_INODES_PER_GROUP(sb), sb->s_blocksize * 8, bh->b_data); set_bitmap_uptodate(bh); set_buffer_uptodate(bh); set_buffer_verified(bh); ext4_unlock_group(sb, block_group); unlock_buffer(bh); return bh; } ext4_unlock_group(sb, block_group); if (buffer_uptodate(bh)) { /* * if not uninit if bh is uptodate, * bitmap is also uptodate */ set_bitmap_uptodate(bh); unlock_buffer(bh); goto verify; } /* * submit the buffer_head for reading */ trace_ext4_load_inode_bitmap(sb, block_group); ext4_read_bh(bh, REQ_META | REQ_PRIO, ext4_end_bitmap_read); ext4_simulate_fail_bh(sb, bh, EXT4_SIM_IBITMAP_EIO); if (!buffer_uptodate(bh)) { put_bh(bh); ext4_error_err(sb, EIO, "Cannot read inode bitmap - " "block_group = %u, inode_bitmap = %llu", block_group, bitmap_blk); ext4_mark_group_bitmap_corrupted(sb, block_group, EXT4_GROUP_INFO_IBITMAP_CORRUPT); return ERR_PTR(-EIO); } verify: err = ext4_validate_inode_bitmap(sb, desc, block_group, bh); if (err) goto out; return bh; out: put_bh(bh); return ERR_PTR(err); } /* * NOTE! When we get the inode, we're the only people * that have access to it, and as such there are no * race conditions we have to worry about. The inode * is not on the hash-lists, and it cannot be reached * through the filesystem because the directory entry * has been deleted earlier. * * HOWEVER: we must make sure that we get no aliases, * which means that we have to call "clear_inode()" * _before_ we mark the inode not in use in the inode * bitmaps. Otherwise a newly created file might use * the same inode number (not actually the same pointer * though), and then we'd have two inodes sharing the * same inode number and space on the harddisk. */ void ext4_free_inode(handle_t *handle, struct inode *inode) { struct super_block *sb = inode->i_sb; int is_directory; unsigned long ino; struct buffer_head *bitmap_bh = NULL; struct buffer_head *bh2; ext4_group_t block_group; unsigned long bit; struct ext4_group_desc *gdp; struct ext4_super_block *es; struct ext4_sb_info *sbi; int fatal = 0, err, count, cleared; struct ext4_group_info *grp; if (!sb) { printk(KERN_ERR "EXT4-fs: %s:%d: inode on " "nonexistent device\n", __func__, __LINE__); return; } if (atomic_read(&inode->i_count) > 1) { ext4_msg(sb, KERN_ERR, "%s:%d: inode #%lu: count=%d", __func__, __LINE__, inode->i_ino, atomic_read(&inode->i_count)); return; } if (inode->i_nlink) { ext4_msg(sb, KERN_ERR, "%s:%d: inode #%lu: nlink=%d\n", __func__, __LINE__, inode->i_ino, inode->i_nlink); return; } sbi = EXT4_SB(sb); ino = inode->i_ino; ext4_debug("freeing inode %lu\n", ino); trace_ext4_free_inode(inode); dquot_initialize(inode); dquot_free_inode(inode); is_directory = S_ISDIR(inode->i_mode); /* Do this BEFORE marking the inode not in use or returning an error */ ext4_clear_inode(inode); es = sbi->s_es; if (ino < EXT4_FIRST_INO(sb) || ino > le32_to_cpu(es->s_inodes_count)) { ext4_error(sb, "reserved or nonexistent inode %lu", ino); goto error_return; } block_group = (ino - 1) / EXT4_INODES_PER_GROUP(sb); bit = (ino - 1) % EXT4_INODES_PER_GROUP(sb); bitmap_bh = ext4_read_inode_bitmap(sb, block_group); /* Don't bother if the inode bitmap is corrupt. */ if (IS_ERR(bitmap_bh)) { fatal = PTR_ERR(bitmap_bh); bitmap_bh = NULL; goto error_return; } if (!(sbi->s_mount_state & EXT4_FC_REPLAY)) { grp = ext4_get_group_info(sb, block_group); if (unlikely(EXT4_MB_GRP_IBITMAP_CORRUPT(grp))) { fatal = -EFSCORRUPTED; goto error_return; } } BUFFER_TRACE(bitmap_bh, "get_write_access"); fatal = ext4_journal_get_write_access(handle, bitmap_bh); if (fatal) goto error_return; fatal = -ESRCH; gdp = ext4_get_group_desc(sb, block_group, &bh2); if (gdp) { BUFFER_TRACE(bh2, "get_write_access"); fatal = ext4_journal_get_write_access(handle, bh2); } ext4_lock_group(sb, block_group); cleared = ext4_test_and_clear_bit(bit, bitmap_bh->b_data); if (fatal || !cleared) { ext4_unlock_group(sb, block_group); goto out; } count = ext4_free_inodes_count(sb, gdp) + 1; ext4_free_inodes_set(sb, gdp, count); if (is_directory) { count = ext4_used_dirs_count(sb, gdp) - 1; ext4_used_dirs_set(sb, gdp, count); if (percpu_counter_initialized(&sbi->s_dirs_counter)) percpu_counter_dec(&sbi->s_dirs_counter); } ext4_inode_bitmap_csum_set(sb, block_group, gdp, bitmap_bh, EXT4_INODES_PER_GROUP(sb) / 8); ext4_group_desc_csum_set(sb, block_group, gdp); ext4_unlock_group(sb, block_group); if (percpu_counter_initialized(&sbi->s_freeinodes_counter)) percpu_counter_inc(&sbi->s_freeinodes_counter); if (sbi->s_log_groups_per_flex) { struct flex_groups *fg; fg = sbi_array_rcu_deref(sbi, s_flex_groups, ext4_flex_group(sbi, block_group)); atomic_inc(&fg->free_inodes); if (is_directory) atomic_dec(&fg->used_dirs); } BUFFER_TRACE(bh2, "call ext4_handle_dirty_metadata"); fatal = ext4_handle_dirty_metadata(handle, NULL, bh2); out: if (cleared) { BUFFER_TRACE(bitmap_bh, "call ext4_handle_dirty_metadata"); err = ext4_handle_dirty_metadata(handle, NULL, bitmap_bh); if (!fatal) fatal = err; } else { ext4_error(sb, "bit already cleared for inode %lu", ino); ext4_mark_group_bitmap_corrupted(sb, block_group, EXT4_GROUP_INFO_IBITMAP_CORRUPT); } error_return: brelse(bitmap_bh); ext4_std_error(sb, fatal); } struct orlov_stats { __u64 free_clusters; __u32 free_inodes; __u32 used_dirs; }; /* * Helper function for Orlov's allocator; returns critical information * for a particular block group or flex_bg. If flex_size is 1, then g * is a block group number; otherwise it is flex_bg number. */ static void get_orlov_stats(struct super_block *sb, ext4_group_t g, int flex_size, struct orlov_stats *stats) { struct ext4_group_desc *desc; if (flex_size > 1) { struct flex_groups *fg = sbi_array_rcu_deref(EXT4_SB(sb), s_flex_groups, g); stats->free_inodes = atomic_read(&fg->free_inodes); stats->free_clusters = atomic64_read(&fg->free_clusters); stats->used_dirs = atomic_read(&fg->used_dirs); return; } desc = ext4_get_group_desc(sb, g, NULL); if (desc) { stats->free_inodes = ext4_free_inodes_count(sb, desc); stats->free_clusters = ext4_free_group_clusters(sb, desc); stats->used_dirs = ext4_used_dirs_count(sb, desc); } else { stats->free_inodes = 0; stats->free_clusters = 0; stats->used_dirs = 0; } } /* * Orlov's allocator for directories. * * We always try to spread first-level directories. * * If there are blockgroups with both free inodes and free clusters counts * not worse than average we return one with smallest directory count. * Otherwise we simply return a random group. * * For the rest rules look so: * * It's OK to put directory into a group unless * it has too many directories already (max_dirs) or * it has too few free inodes left (min_inodes) or * it has too few free clusters left (min_clusters) or * Parent's group is preferred, if it doesn't satisfy these * conditions we search cyclically through the rest. If none * of the groups look good we just look for a group with more * free inodes than average (starting at parent's group). */ static int find_group_orlov(struct super_block *sb, struct inode *parent, ext4_group_t *group, umode_t mode, const struct qstr *qstr) { ext4_group_t parent_group = EXT4_I(parent)->i_block_group; struct ext4_sb_info *sbi = EXT4_SB(sb); ext4_group_t real_ngroups = ext4_get_groups_count(sb); int inodes_per_group = EXT4_INODES_PER_GROUP(sb); unsigned int freei, avefreei, grp_free; ext4_fsblk_t freec, avefreec; unsigned int ndirs; int max_dirs, min_inodes; ext4_grpblk_t min_clusters; ext4_group_t i, grp, g, ngroups; struct ext4_group_desc *desc; struct orlov_stats stats; int flex_size = ext4_flex_bg_size(sbi); struct dx_hash_info hinfo; ngroups = real_ngroups; if (flex_size > 1) { ngroups = (real_ngroups + flex_size - 1) >> sbi->s_log_groups_per_flex; parent_group >>= sbi->s_log_groups_per_flex; } freei = percpu_counter_read_positive(&sbi->s_freeinodes_counter); avefreei = freei / ngroups; freec = percpu_counter_read_positive(&sbi->s_freeclusters_counter); avefreec = freec; do_div(avefreec, ngroups); ndirs = percpu_counter_read_positive(&sbi->s_dirs_counter); if (S_ISDIR(mode) && ((parent == d_inode(sb->s_root)) || (ext4_test_inode_flag(parent, EXT4_INODE_TOPDIR)))) { int best_ndir = inodes_per_group; int ret = -1; if (qstr) { hinfo.hash_version = DX_HASH_HALF_MD4; hinfo.seed = sbi->s_hash_seed; ext4fs_dirhash(parent, qstr->name, qstr->len, &hinfo); grp = hinfo.hash; } else grp = prandom_u32(); parent_group = (unsigned)grp % ngroups; for (i = 0; i < ngroups; i++) { g = (parent_group + i) % ngroups; get_orlov_stats(sb, g, flex_size, &stats); if (!stats.free_inodes) continue; if (stats.used_dirs >= best_ndir) continue; if (stats.free_inodes < avefreei) continue; if (stats.free_clusters < avefreec) continue; grp = g; ret = 0; best_ndir = stats.used_dirs; } if (ret) goto fallback; found_flex_bg: if (flex_size == 1) { *group = grp; return 0; } /* * We pack inodes at the beginning of the flexgroup's * inode tables. Block allocation decisions will do * something similar, although regular files will * start at 2nd block group of the flexgroup. See * ext4_ext_find_goal() and ext4_find_near(). */ grp *= flex_size; for (i = 0; i < flex_size; i++) { if (grp+i >= real_ngroups) break; desc = ext4_get_group_desc(sb, grp+i, NULL); if (desc && ext4_free_inodes_count(sb, desc)) { *group = grp+i; return 0; } } goto fallback; } max_dirs = ndirs / ngroups + inodes_per_group / 16; min_inodes = avefreei - inodes_per_group*flex_size / 4; if (min_inodes < 1) min_inodes = 1; min_clusters = avefreec - EXT4_CLUSTERS_PER_GROUP(sb)*flex_size / 4; /* * Start looking in the flex group where we last allocated an * inode for this parent directory */ if (EXT4_I(parent)->i_last_alloc_group != ~0) { parent_group = EXT4_I(parent)->i_last_alloc_group; if (flex_size > 1) parent_group >>= sbi->s_log_groups_per_flex; } for (i = 0; i < ngroups; i++) { grp = (parent_group + i) % ngroups; get_orlov_stats(sb, grp, flex_size, &stats); if (stats.used_dirs >= max_dirs) continue; if (stats.free_inodes < min_inodes) continue; if (stats.free_clusters < min_clusters) continue; goto found_flex_bg; } fallback: ngroups = real_ngroups; avefreei = freei / ngroups; fallback_retry: parent_group = EXT4_I(parent)->i_block_group; for (i = 0; i < ngroups; i++) { grp = (parent_group + i) % ngroups; desc = ext4_get_group_desc(sb, grp, NULL); if (desc) { grp_free = ext4_free_inodes_count(sb, desc); if (grp_free && grp_free >= avefreei) { *group = grp; return 0; } } } if (avefreei) { /* * The free-inodes counter is approximate, and for really small * filesystems the above test can fail to find any blockgroups */ avefreei = 0; goto fallback_retry; } return -1; } static int find_group_other(struct super_block *sb, struct inode *parent, ext4_group_t *group, umode_t mode) { ext4_group_t parent_group = EXT4_I(parent)->i_block_group; ext4_group_t i, last, ngroups = ext4_get_groups_count(sb); struct ext4_group_desc *desc; int flex_size = ext4_flex_bg_size(EXT4_SB(sb)); /* * Try to place the inode is the same flex group as its * parent. If we can't find space, use the Orlov algorithm to * find another flex group, and store that information in the * parent directory's inode information so that use that flex * group for future allocations. */ if (flex_size > 1) { int retry = 0; try_again: parent_group &= ~(flex_size-1); last = parent_group + flex_size; if (last > ngroups) last = ngroups; for (i = parent_group; i < last; i++) { desc = ext4_get_group_desc(sb, i, NULL); if (desc && ext4_free_inodes_count(sb, desc)) { *group = i; return 0; } } if (!retry && EXT4_I(parent)->i_last_alloc_group != ~0) { retry = 1; parent_group = EXT4_I(parent)->i_last_alloc_group; goto try_again; } /* * If this didn't work, use the Orlov search algorithm * to find a new flex group; we pass in the mode to * avoid the topdir algorithms. */ *group = parent_group + flex_size; if (*group > ngroups) *group = 0; return find_group_orlov(sb, parent, group, mode, NULL); } /* * Try to place the inode in its parent directory */ *group = parent_group; desc = ext4_get_group_desc(sb, *group, NULL); if (desc && ext4_free_inodes_count(sb, desc) && ext4_free_group_clusters(sb, desc)) return 0; /* * We're going to place this inode in a different blockgroup from its * parent. We want to cause files in a common directory to all land in * the same blockgroup. But we want files which are in a different * directory which shares a blockgroup with our parent to land in a * different blockgroup. * * So add our directory's i_ino into the starting point for the hash. */ *group = (*group + parent->i_ino) % ngroups; /* * Use a quadratic hash to find a group with a free inode and some free * blocks. */ for (i = 1; i < ngroups; i <<= 1) { *group += i; if (*group >= ngroups) *group -= ngroups; desc = ext4_get_group_desc(sb, *group, NULL); if (desc && ext4_free_inodes_count(sb, desc) && ext4_free_group_clusters(sb, desc)) return 0; } /* * That failed: try linear search for a free inode, even if that group * has no free blocks. */ *group = parent_group; for (i = 0; i < ngroups; i++) { if (++*group >= ngroups) *group = 0; desc = ext4_get_group_desc(sb, *group, NULL); if (desc && ext4_free_inodes_count(sb, desc)) return 0; } return -1; } /* * In no journal mode, if an inode has recently been deleted, we want * to avoid reusing it until we're reasonably sure the inode table * block has been written back to disk. (Yes, these values are * somewhat arbitrary...) */ #define RECENTCY_MIN 60 #define RECENTCY_DIRTY 300 static int recently_deleted(struct super_block *sb, ext4_group_t group, int ino) { struct ext4_group_desc *gdp; struct ext4_inode *raw_inode; struct buffer_head *bh; int inodes_per_block = EXT4_SB(sb)->s_inodes_per_block; int offset, ret = 0; int recentcy = RECENTCY_MIN; u32 dtime, now; gdp = ext4_get_group_desc(sb, group, NULL); if (unlikely(!gdp)) return 0; bh = sb_find_get_block(sb, ext4_inode_table(sb, gdp) + (ino / inodes_per_block)); if (!bh || !buffer_uptodate(bh)) /* * If the block is not in the buffer cache, then it * must have been written out. */ goto out; offset = (ino % inodes_per_block) * EXT4_INODE_SIZE(sb); raw_inode = (struct ext4_inode *) (bh->b_data + offset); /* i_dtime is only 32 bits on disk, but we only care about relative * times in the range of a few minutes (i.e. long enough to sync a * recently-deleted inode to disk), so using the low 32 bits of the * clock (a 68 year range) is enough, see time_before32() */ dtime = le32_to_cpu(raw_inode->i_dtime); now = ktime_get_real_seconds(); if (buffer_dirty(bh)) recentcy += RECENTCY_DIRTY; if (dtime && time_before32(dtime, now) && time_before32(now, dtime + recentcy)) ret = 1; out: brelse(bh); return ret; } static int find_inode_bit(struct super_block *sb, ext4_group_t group, struct buffer_head *bitmap, unsigned long *ino) { bool check_recently_deleted = EXT4_SB(sb)->s_journal == NULL; unsigned long recently_deleted_ino = EXT4_INODES_PER_GROUP(sb); next: *ino = ext4_find_next_zero_bit((unsigned long *) bitmap->b_data, EXT4_INODES_PER_GROUP(sb), *ino); if (*ino >= EXT4_INODES_PER_GROUP(sb)) goto not_found; if (check_recently_deleted && recently_deleted(sb, group, *ino)) { recently_deleted_ino = *ino; *ino = *ino + 1; if (*ino < EXT4_INODES_PER_GROUP(sb)) goto next; goto not_found; } return 1; not_found: if (recently_deleted_ino >= EXT4_INODES_PER_GROUP(sb)) return 0; /* * Not reusing recently deleted inodes is mostly a preference. We don't * want to report ENOSPC or skew allocation patterns because of that. * So return even recently deleted inode if we could find better in the * given range. */ *ino = recently_deleted_ino; return 1; } int ext4_mark_inode_used(struct super_block *sb, int ino) { unsigned long max_ino = le32_to_cpu(EXT4_SB(sb)->s_es->s_inodes_count); struct buffer_head *inode_bitmap_bh = NULL, *group_desc_bh = NULL; struct ext4_group_desc *gdp; ext4_group_t group; int bit; int err = -EFSCORRUPTED; if (ino < EXT4_FIRST_INO(sb) || ino > max_ino) goto out; group = (ino - 1) / EXT4_INODES_PER_GROUP(sb); bit = (ino - 1) % EXT4_INODES_PER_GROUP(sb); inode_bitmap_bh = ext4_read_inode_bitmap(sb, group); if (IS_ERR(inode_bitmap_bh)) return PTR_ERR(inode_bitmap_bh); if (ext4_test_bit(bit, inode_bitmap_bh->b_data)) { err = 0; goto out; } gdp = ext4_get_group_desc(sb, group, &group_desc_bh); if (!gdp || !group_desc_bh) { err = -EINVAL; goto out; } ext4_set_bit(bit, inode_bitmap_bh->b_data); BUFFER_TRACE(inode_bitmap_bh, "call ext4_handle_dirty_metadata"); err = ext4_handle_dirty_metadata(NULL, NULL, inode_bitmap_bh); if (err) { ext4_std_error(sb, err); goto out; } err = sync_dirty_buffer(inode_bitmap_bh); if (err) { ext4_std_error(sb, err); goto out; } /* We may have to initialize the block bitmap if it isn't already */ if (ext4_has_group_desc_csum(sb) && gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) { struct buffer_head *block_bitmap_bh; block_bitmap_bh = ext4_read_block_bitmap(sb, group); if (IS_ERR(block_bitmap_bh)) { err = PTR_ERR(block_bitmap_bh); goto out; } BUFFER_TRACE(block_bitmap_bh, "dirty block bitmap"); err = ext4_handle_dirty_metadata(NULL, NULL, block_bitmap_bh); sync_dirty_buffer(block_bitmap_bh); /* recheck and clear flag under lock if we still need to */ ext4_lock_group(sb, group); if (ext4_has_group_desc_csum(sb) && (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT))) { gdp->bg_flags &= cpu_to_le16(~EXT4_BG_BLOCK_UNINIT); ext4_free_group_clusters_set(sb, gdp, ext4_free_clusters_after_init(sb, group, gdp)); ext4_block_bitmap_csum_set(sb, group, gdp, block_bitmap_bh); ext4_group_desc_csum_set(sb, group, gdp); } ext4_unlock_group(sb, group); brelse(block_bitmap_bh); if (err) { ext4_std_error(sb, err); goto out; } } /* Update the relevant bg descriptor fields */ if (ext4_has_group_desc_csum(sb)) { int free; ext4_lock_group(sb, group); /* while we modify the bg desc */ free = EXT4_INODES_PER_GROUP(sb) - ext4_itable_unused_count(sb, gdp); if (gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_UNINIT)) { gdp->bg_flags &= cpu_to_le16(~EXT4_BG_INODE_UNINIT); free = 0; } /* * Check the relative inode number against the last used * relative inode number in this group. if it is greater * we need to update the bg_itable_unused count */ if (bit >= free) ext4_itable_unused_set(sb, gdp, (EXT4_INODES_PER_GROUP(sb) - bit - 1)); } else { ext4_lock_group(sb, group); } ext4_free_inodes_set(sb, gdp, ext4_free_inodes_count(sb, gdp) - 1); if (ext4_has_group_desc_csum(sb)) { ext4_inode_bitmap_csum_set(sb, group, gdp, inode_bitmap_bh, EXT4_INODES_PER_GROUP(sb) / 8); ext4_group_desc_csum_set(sb, group, gdp); } ext4_unlock_group(sb, group); err = ext4_handle_dirty_metadata(NULL, NULL, group_desc_bh); sync_dirty_buffer(group_desc_bh); out: return err; } static int ext4_xattr_credits_for_new_inode(struct inode *dir, mode_t mode, bool encrypt) { struct super_block *sb = dir->i_sb; int nblocks = 0; #ifdef CONFIG_EXT4_FS_POSIX_ACL struct posix_acl *p = get_acl(dir, ACL_TYPE_DEFAULT); if (IS_ERR(p)) return PTR_ERR(p); if (p) { int acl_size = p->a_count * sizeof(ext4_acl_entry); nblocks += (S_ISDIR(mode) ? 2 : 1) * __ext4_xattr_set_credits(sb, NULL /* inode */, NULL /* block_bh */, acl_size, true /* is_create */); posix_acl_release(p); } #endif #ifdef CONFIG_SECURITY { int num_security_xattrs = 1; #ifdef CONFIG_INTEGRITY num_security_xattrs++; #endif /* * We assume that security xattrs are never more than 1k. * In practice they are under 128 bytes. */ nblocks += num_security_xattrs * __ext4_xattr_set_credits(sb, NULL /* inode */, NULL /* block_bh */, 1024, true /* is_create */); } #endif if (encrypt) nblocks += __ext4_xattr_set_credits(sb, NULL /* inode */, NULL /* block_bh */, FSCRYPT_SET_CONTEXT_MAX_SIZE, true /* is_create */); return nblocks; } /* * There are two policies for allocating an inode. If the new inode is * a directory, then a forward search is made for a block group with both * free space and a low directory-to-inode ratio; if that fails, then of * the groups with above-average free space, that group with the fewest * directories already is chosen. * * For other inodes, search forward from the parent directory's block * group to find a free inode. */ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir, umode_t mode, const struct qstr *qstr, __u32 goal, uid_t *owner, __u32 i_flags, int handle_type, unsigned int line_no, int nblocks) { struct super_block *sb; struct buffer_head *inode_bitmap_bh = NULL; struct buffer_head *group_desc_bh; ext4_group_t ngroups, group = 0; unsigned long ino = 0; struct inode *inode; struct ext4_group_desc *gdp = NULL; struct ext4_inode_info *ei; struct ext4_sb_info *sbi; int ret2, err; struct inode *ret; ext4_group_t i; ext4_group_t flex_group; struct ext4_group_info *grp = NULL; bool encrypt = false; /* Cannot create files in a deleted directory */ if (!dir || !dir->i_nlink) return ERR_PTR(-EPERM); sb = dir->i_sb; sbi = EXT4_SB(sb); if (unlikely(ext4_forced_shutdown(sbi))) return ERR_PTR(-EIO); ngroups = ext4_get_groups_count(sb); trace_ext4_request_inode(dir, mode); inode = new_inode(sb); if (!inode) return ERR_PTR(-ENOMEM); ei = EXT4_I(inode); /* * Initialize owners and quota early so that we don't have to account * for quota initialization worst case in standard inode creating * transaction */ if (owner) { inode->i_mode = mode; i_uid_write(inode, owner[0]); i_gid_write(inode, owner[1]); } else if (test_opt(sb, GRPID)) { inode->i_mode = mode; inode->i_uid = current_fsuid(); inode->i_gid = dir->i_gid; } else inode_init_owner(inode, dir, mode); if (ext4_has_feature_project(sb) && ext4_test_inode_flag(dir, EXT4_INODE_PROJINHERIT)) ei->i_projid = EXT4_I(dir)->i_projid; else ei->i_projid = make_kprojid(&init_user_ns, EXT4_DEF_PROJID); if (!(i_flags & EXT4_EA_INODE_FL)) { err = fscrypt_prepare_new_inode(dir, inode, &encrypt); if (err) goto out; } err = dquot_initialize(inode); if (err) goto out; if (!handle && sbi->s_journal && !(i_flags & EXT4_EA_INODE_FL)) { ret2 = ext4_xattr_credits_for_new_inode(dir, mode, encrypt); if (ret2 < 0) { err = ret2; goto out; } nblocks += ret2; } if (!goal) goal = sbi->s_inode_goal; if (goal && goal <= le32_to_cpu(sbi->s_es->s_inodes_count)) { group = (goal - 1) / EXT4_INODES_PER_GROUP(sb); ino = (goal - 1) % EXT4_INODES_PER_GROUP(sb); ret2 = 0; goto got_group; } if (S_ISDIR(mode)) ret2 = find_group_orlov(sb, dir, &group, mode, qstr); else ret2 = find_group_other(sb, dir, &group, mode); got_group: EXT4_I(dir)->i_last_alloc_group = group; err = -ENOSPC; if (ret2 == -1) goto out; /* * Normally we will only go through one pass of this loop, * unless we get unlucky and it turns out the group we selected * had its last inode grabbed by someone else. */ for (i = 0; i < ngroups; i++, ino = 0) { err = -EIO; gdp = ext4_get_group_desc(sb, group, &group_desc_bh); if (!gdp) goto out; /* * Check free inodes count before loading bitmap. */ if (ext4_free_inodes_count(sb, gdp) == 0) goto next_group; if (!(sbi->s_mount_state & EXT4_FC_REPLAY)) { grp = ext4_get_group_info(sb, group); /* * Skip groups with already-known suspicious inode * tables */ if (EXT4_MB_GRP_IBITMAP_CORRUPT(grp)) goto next_group; } brelse(inode_bitmap_bh); inode_bitmap_bh = ext4_read_inode_bitmap(sb, group); /* Skip groups with suspicious inode tables */ if (((!(sbi->s_mount_state & EXT4_FC_REPLAY)) && EXT4_MB_GRP_IBITMAP_CORRUPT(grp)) || IS_ERR(inode_bitmap_bh)) { inode_bitmap_bh = NULL; goto next_group; } repeat_in_this_group: ret2 = find_inode_bit(sb, group, inode_bitmap_bh, &ino); if (!ret2) goto next_group; if (group == 0 && (ino + 1) < EXT4_FIRST_INO(sb)) { ext4_error(sb, "reserved inode found cleared - " "inode=%lu", ino + 1); ext4_mark_group_bitmap_corrupted(sb, group, EXT4_GROUP_INFO_IBITMAP_CORRUPT); goto next_group; } if ((!(sbi->s_mount_state & EXT4_FC_REPLAY)) && !handle) { BUG_ON(nblocks <= 0); handle = __ext4_journal_start_sb(dir->i_sb, line_no, handle_type, nblocks, 0, ext4_trans_default_revoke_credits(sb)); if (IS_ERR(handle)) { err = PTR_ERR(handle); ext4_std_error(sb, err); goto out; } } BUFFER_TRACE(inode_bitmap_bh, "get_write_access"); err = ext4_journal_get_write_access(handle, inode_bitmap_bh); if (err) { ext4_std_error(sb, err); goto out; } ext4_lock_group(sb, group); ret2 = ext4_test_and_set_bit(ino, inode_bitmap_bh->b_data); if (ret2) { /* Someone already took the bit. Repeat the search * with lock held. */ ret2 = find_inode_bit(sb, group, inode_bitmap_bh, &ino); if (ret2) { ext4_set_bit(ino, inode_bitmap_bh->b_data); ret2 = 0; } else { ret2 = 1; /* we didn't grab the inode */ } } ext4_unlock_group(sb, group); ino++; /* the inode bitmap is zero-based */ if (!ret2) goto got; /* we grabbed the inode! */ if (ino < EXT4_INODES_PER_GROUP(sb)) goto repeat_in_this_group; next_group: if (++group == ngroups) group = 0; } err = -ENOSPC; goto out; got: BUFFER_TRACE(inode_bitmap_bh, "call ext4_handle_dirty_metadata"); err = ext4_handle_dirty_metadata(handle, NULL, inode_bitmap_bh); if (err) { ext4_std_error(sb, err); goto out; } BUFFER_TRACE(group_desc_bh, "get_write_access"); err = ext4_journal_get_write_access(handle, group_desc_bh); if (err) { ext4_std_error(sb, err); goto out; } /* We may have to initialize the block bitmap if it isn't already */ if (ext4_has_group_desc_csum(sb) && gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) { struct buffer_head *block_bitmap_bh; block_bitmap_bh = ext4_read_block_bitmap(sb, group); if (IS_ERR(block_bitmap_bh)) { err = PTR_ERR(block_bitmap_bh); goto out; } BUFFER_TRACE(block_bitmap_bh, "get block bitmap access"); err = ext4_journal_get_write_access(handle, block_bitmap_bh); if (err) { brelse(block_bitmap_bh); ext4_std_error(sb, err); goto out; } BUFFER_TRACE(block_bitmap_bh, "dirty block bitmap"); err = ext4_handle_dirty_metadata(handle, NULL, block_bitmap_bh); /* recheck and clear flag under lock if we still need to */ ext4_lock_group(sb, group); if (ext4_has_group_desc_csum(sb) && (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT))) { gdp->bg_flags &= cpu_to_le16(~EXT4_BG_BLOCK_UNINIT); ext4_free_group_clusters_set(sb, gdp, ext4_free_clusters_after_init(sb, group, gdp)); ext4_block_bitmap_csum_set(sb, group, gdp, block_bitmap_bh); ext4_group_desc_csum_set(sb, group, gdp); } ext4_unlock_group(sb, group); brelse(block_bitmap_bh); if (err) { ext4_std_error(sb, err); goto out; } } /* Update the relevant bg descriptor fields */ if (ext4_has_group_desc_csum(sb)) { int free; struct ext4_group_info *grp = NULL; if (!(sbi->s_mount_state & EXT4_FC_REPLAY)) { grp = ext4_get_group_info(sb, group); down_read(&grp->alloc_sem); /* * protect vs itable * lazyinit */ } ext4_lock_group(sb, group); /* while we modify the bg desc */ free = EXT4_INODES_PER_GROUP(sb) - ext4_itable_unused_count(sb, gdp); if (gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_UNINIT)) { gdp->bg_flags &= cpu_to_le16(~EXT4_BG_INODE_UNINIT); free = 0; } /* * Check the relative inode number against the last used * relative inode number in this group. if it is greater * we need to update the bg_itable_unused count */ if (ino > free) ext4_itable_unused_set(sb, gdp, (EXT4_INODES_PER_GROUP(sb) - ino)); if (!(sbi->s_mount_state & EXT4_FC_REPLAY)) up_read(&grp->alloc_sem); } else { ext4_lock_group(sb, group); } ext4_free_inodes_set(sb, gdp, ext4_free_inodes_count(sb, gdp) - 1); if (S_ISDIR(mode)) { ext4_used_dirs_set(sb, gdp, ext4_used_dirs_count(sb, gdp) + 1); if (sbi->s_log_groups_per_flex) { ext4_group_t f = ext4_flex_group(sbi, group); atomic_inc(&sbi_array_rcu_deref(sbi, s_flex_groups, f)->used_dirs); } } if (ext4_has_group_desc_csum(sb)) { ext4_inode_bitmap_csum_set(sb, group, gdp, inode_bitmap_bh, EXT4_INODES_PER_GROUP(sb) / 8); ext4_group_desc_csum_set(sb, group, gdp); } ext4_unlock_group(sb, group); BUFFER_TRACE(group_desc_bh, "call ext4_handle_dirty_metadata"); err = ext4_handle_dirty_metadata(handle, NULL, group_desc_bh); if (err) { ext4_std_error(sb, err); goto out; } percpu_counter_dec(&sbi->s_freeinodes_counter); if (S_ISDIR(mode)) percpu_counter_inc(&sbi->s_dirs_counter); if (sbi->s_log_groups_per_flex) { flex_group = ext4_flex_group(sbi, group); atomic_dec(&sbi_array_rcu_deref(sbi, s_flex_groups, flex_group)->free_inodes); } inode->i_ino = ino + group * EXT4_INODES_PER_GROUP(sb); /* This is the optimal IO size (for stat), not the fs block size */ inode->i_blocks = 0; inode->i_mtime = inode->i_atime = inode->i_ctime = current_time(inode); ei->i_crtime = inode->i_mtime; memset(ei->i_data, 0, sizeof(ei->i_data)); ei->i_dir_start_lookup = 0; ei->i_disksize = 0; /* Don't inherit extent flag from directory, amongst others. */ ei->i_flags = ext4_mask_flags(mode, EXT4_I(dir)->i_flags & EXT4_FL_INHERITED); ei->i_flags |= i_flags; ei->i_file_acl = 0; ei->i_dtime = 0; ei->i_block_group = group; ei->i_last_alloc_group = ~0; ext4_set_inode_flags(inode, true); if (IS_DIRSYNC(inode)) ext4_handle_sync(handle); if (insert_inode_locked(inode) < 0) { /* * Likely a bitmap corruption causing inode to be allocated * twice. */ err = -EIO; ext4_error(sb, "failed to insert inode %lu: doubly allocated?", inode->i_ino); ext4_mark_group_bitmap_corrupted(sb, group, EXT4_GROUP_INFO_IBITMAP_CORRUPT); goto out; } inode->i_generation = prandom_u32(); /* Precompute checksum seed for inode metadata */ if (ext4_has_metadata_csum(sb)) { __u32 csum; __le32 inum = cpu_to_le32(inode->i_ino); __le32 gen = cpu_to_le32(inode->i_generation); csum = ext4_chksum(sbi, sbi->s_csum_seed, (__u8 *)&inum, sizeof(inum)); ei->i_csum_seed = ext4_chksum(sbi, csum, (__u8 *)&gen, sizeof(gen)); } ext4_clear_state_flags(ei); /* Only relevant on 32-bit archs */ ext4_set_inode_state(inode, EXT4_STATE_NEW); ei->i_extra_isize = sbi->s_want_extra_isize; ei->i_inline_off = 0; if (ext4_has_feature_inline_data(sb) && (!(ei->i_flags & EXT4_DAX_FL) || S_ISDIR(mode))) ext4_set_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA); ret = inode; err = dquot_alloc_inode(inode); if (err) goto fail_drop; /* * Since the encryption xattr will always be unique, create it first so * that it's less likely to end up in an external xattr block and * prevent its deduplication. */ if (encrypt) { err = fscrypt_set_context(inode, handle); if (err) goto fail_free_drop; } if (!(ei->i_flags & EXT4_EA_INODE_FL)) { err = ext4_init_acl(handle, inode, dir); if (err) goto fail_free_drop; err = ext4_init_security(handle, inode, dir, qstr); if (err) goto fail_free_drop; } if (ext4_has_feature_extents(sb)) { /* set extent flag only for directory, file and normal symlink*/ if (S_ISDIR(mode) || S_ISREG(mode) || S_ISLNK(mode)) { ext4_set_inode_flag(inode, EXT4_INODE_EXTENTS); ext4_ext_tree_init(handle, inode); } } if (ext4_handle_valid(handle)) { ei->i_sync_tid = handle->h_transaction->t_tid; ei->i_datasync_tid = handle->h_transaction->t_tid; } err = ext4_mark_inode_dirty(handle, inode); if (err) { ext4_std_error(sb, err); goto fail_free_drop; } ext4_debug("allocating inode %lu\n", inode->i_ino); trace_ext4_allocate_inode(inode, dir, mode); brelse(inode_bitmap_bh); return ret; fail_free_drop: dquot_free_inode(inode); fail_drop: clear_nlink(inode); unlock_new_inode(inode); out: dquot_drop(inode); inode->i_flags |= S_NOQUOTA; iput(inode); brelse(inode_bitmap_bh); return ERR_PTR(err); } /* Verify that we are loading a valid orphan from disk */ struct inode *ext4_orphan_get(struct super_block *sb, unsigned long ino) { unsigned long max_ino = le32_to_cpu(EXT4_SB(sb)->s_es->s_inodes_count); ext4_group_t block_group; int bit; struct buffer_head *bitmap_bh = NULL; struct inode *inode = NULL; int err = -EFSCORRUPTED; if (ino < EXT4_FIRST_INO(sb) || ino > max_ino) goto bad_orphan; block_group = (ino - 1) / EXT4_INODES_PER_GROUP(sb); bit = (ino - 1) % EXT4_INODES_PER_GROUP(sb); bitmap_bh = ext4_read_inode_bitmap(sb, block_group); if (IS_ERR(bitmap_bh)) return ERR_CAST(bitmap_bh); /* Having the inode bit set should be a 100% indicator that this * is a valid orphan (no e2fsck run on fs). Orphans also include * inodes that were being truncated, so we can't check i_nlink==0. */ if (!ext4_test_bit(bit, bitmap_bh->b_data)) goto bad_orphan; inode = ext4_iget(sb, ino, EXT4_IGET_NORMAL); if (IS_ERR(inode)) { err = PTR_ERR(inode); ext4_error_err(sb, -err, "couldn't read orphan inode %lu (err %d)", ino, err); brelse(bitmap_bh); return inode; } /* * If the orphans has i_nlinks > 0 then it should be able to * be truncated, otherwise it won't be removed from the orphan * list during processing and an infinite loop will result. * Similarly, it must not be a bad inode. */ if ((inode->i_nlink && !ext4_can_truncate(inode)) || is_bad_inode(inode)) goto bad_orphan; if (NEXT_ORPHAN(inode) > max_ino) goto bad_orphan; brelse(bitmap_bh); return inode; bad_orphan: ext4_error(sb, "bad orphan inode %lu", ino); if (bitmap_bh) printk(KERN_ERR "ext4_test_bit(bit=%d, block=%llu) = %d\n", bit, (unsigned long long)bitmap_bh->b_blocknr, ext4_test_bit(bit, bitmap_bh->b_data)); if (inode) { printk(KERN_ERR "is_bad_inode(inode)=%d\n", is_bad_inode(inode)); printk(KERN_ERR "NEXT_ORPHAN(inode)=%u\n", NEXT_ORPHAN(inode)); printk(KERN_ERR "max_ino=%lu\n", max_ino); printk(KERN_ERR "i_nlink=%u\n", inode->i_nlink); /* Avoid freeing blocks if we got a bad deleted inode */ if (inode->i_nlink == 0) inode->i_blocks = 0; iput(inode); } brelse(bitmap_bh); return ERR_PTR(err); } unsigned long ext4_count_free_inodes(struct super_block *sb) { unsigned long desc_count; struct ext4_group_desc *gdp; ext4_group_t i, ngroups = ext4_get_groups_count(sb); #ifdef EXT4FS_DEBUG struct ext4_super_block *es; unsigned long bitmap_count, x; struct buffer_head *bitmap_bh = NULL; es = EXT4_SB(sb)->s_es; desc_count = 0; bitmap_count = 0; gdp = NULL; for (i = 0; i < ngroups; i++) { gdp = ext4_get_group_desc(sb, i, NULL); if (!gdp) continue; desc_count += ext4_free_inodes_count(sb, gdp); brelse(bitmap_bh); bitmap_bh = ext4_read_inode_bitmap(sb, i); if (IS_ERR(bitmap_bh)) { bitmap_bh = NULL; continue; } x = ext4_count_free(bitmap_bh->b_data, EXT4_INODES_PER_GROUP(sb) / 8); printk(KERN_DEBUG "group %lu: stored = %d, counted = %lu\n", (unsigned long) i, ext4_free_inodes_count(sb, gdp), x); bitmap_count += x; } brelse(bitmap_bh); printk(KERN_DEBUG "ext4_count_free_inodes: " "stored = %u, computed = %lu, %lu\n", le32_to_cpu(es->s_free_inodes_count), desc_count, bitmap_count); return desc_count; #else desc_count = 0; for (i = 0; i < ngroups; i++) { gdp = ext4_get_group_desc(sb, i, NULL); if (!gdp) continue; desc_count += ext4_free_inodes_count(sb, gdp); cond_resched(); } return desc_count; #endif } /* Called at mount-time, super-block is locked */ unsigned long ext4_count_dirs(struct super_block * sb) { unsigned long count = 0; ext4_group_t i, ngroups = ext4_get_groups_count(sb); for (i = 0; i < ngroups; i++) { struct ext4_group_desc *gdp = ext4_get_group_desc(sb, i, NULL); if (!gdp) continue; count += ext4_used_dirs_count(sb, gdp); } return count; } /* * Zeroes not yet zeroed inode table - just write zeroes through the whole * inode table. Must be called without any spinlock held. The only place * where it is called from on active part of filesystem is ext4lazyinit * thread, so we do not need any special locks, however we have to prevent * inode allocation from the current group, so we take alloc_sem lock, to * block ext4_new_inode() until we are finished. */ int ext4_init_inode_table(struct super_block *sb, ext4_group_t group, int barrier) { struct ext4_group_info *grp = ext4_get_group_info(sb, group); struct ext4_sb_info *sbi = EXT4_SB(sb); struct ext4_group_desc *gdp = NULL; struct buffer_head *group_desc_bh; handle_t *handle; ext4_fsblk_t blk; int num, ret = 0, used_blks = 0; unsigned long used_inos = 0; /* This should not happen, but just to be sure check this */ if (sb_rdonly(sb)) { ret = 1; goto out; } gdp = ext4_get_group_desc(sb, group, &group_desc_bh); if (!gdp) goto out; /* * We do not need to lock this, because we are the only one * handling this flag. */ if (gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_ZEROED)) goto out; handle = ext4_journal_start_sb(sb, EXT4_HT_MISC, 1); if (IS_ERR(handle)) { ret = PTR_ERR(handle); goto out; } down_write(&grp->alloc_sem); /* * If inode bitmap was already initialized there may be some * used inodes so we need to skip blocks with used inodes in * inode table. */ if (!(gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_UNINIT))) { used_inos = EXT4_INODES_PER_GROUP(sb) - ext4_itable_unused_count(sb, gdp); used_blks = DIV_ROUND_UP(used_inos, sbi->s_inodes_per_block); /* Bogus inode unused count? */ if (used_blks < 0 || used_blks > sbi->s_itb_per_group) { ext4_error(sb, "Something is wrong with group %u: " "used itable blocks: %d; " "itable unused count: %u", group, used_blks, ext4_itable_unused_count(sb, gdp)); ret = 1; goto err_out; } used_inos += group * EXT4_INODES_PER_GROUP(sb); /* * Are there some uninitialized inodes in the inode table * before the first normal inode? */ if ((used_blks != sbi->s_itb_per_group) && (used_inos < EXT4_FIRST_INO(sb))) { ext4_error(sb, "Something is wrong with group %u: " "itable unused count: %u; " "itables initialized count: %ld", group, ext4_itable_unused_count(sb, gdp), used_inos); ret = 1; goto err_out; } } blk = ext4_inode_table(sb, gdp) + used_blks; num = sbi->s_itb_per_group - used_blks; BUFFER_TRACE(group_desc_bh, "get_write_access"); ret = ext4_journal_get_write_access(handle, group_desc_bh); if (ret) goto err_out; /* * Skip zeroout if the inode table is full. But we set the ZEROED * flag anyway, because obviously, when it is full it does not need * further zeroing. */ if (unlikely(num == 0)) goto skip_zeroout; ext4_debug("going to zero out inode table in group %d\n", group); ret = sb_issue_zeroout(sb, blk, num, GFP_NOFS); if (ret < 0) goto err_out; if (barrier) blkdev_issue_flush(sb->s_bdev, GFP_NOFS); skip_zeroout: ext4_lock_group(sb, group); gdp->bg_flags |= cpu_to_le16(EXT4_BG_INODE_ZEROED); ext4_group_desc_csum_set(sb, group, gdp); ext4_unlock_group(sb, group); BUFFER_TRACE(group_desc_bh, "call ext4_handle_dirty_metadata"); ret = ext4_handle_dirty_metadata(handle, NULL, group_desc_bh); err_out: up_write(&grp->alloc_sem); ext4_journal_stop(handle); out: return ret; }
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 /* SPDX-License-Identifier: GPL-2.0 */ #ifndef __LINUX_BIT_SPINLOCK_H #define __LINUX_BIT_SPINLOCK_H #include <linux/kernel.h> #include <linux/preempt.h> #include <linux/atomic.h> #include <linux/bug.h> /* * bit-based spin_lock() * * Don't use this unless you really need to: spin_lock() and spin_unlock() * are significantly faster. */ static inline void bit_spin_lock(int bitnum, unsigned long *addr) { /* * Assuming the lock is uncontended, this never enters * the body of the outer loop. If it is contended, then * within the inner loop a non-atomic test is used to * busywait with less bus contention for a good time to * attempt to acquire the lock bit. */ preempt_disable(); #if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK) while (unlikely(test_and_set_bit_lock(bitnum, addr))) { preempt_enable(); do { cpu_relax(); } while (test_bit(bitnum, addr)); preempt_disable(); } #endif __acquire(bitlock); } /* * Return true if it was acquired */ static inline int bit_spin_trylock(int bitnum, unsigned long *addr) { preempt_disable(); #if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK) if (unlikely(test_and_set_bit_lock(bitnum, addr))) { preempt_enable(); return 0; } #endif __acquire(bitlock); return 1; } /* * bit-based spin_unlock() */ static inline void bit_spin_unlock(int bitnum, unsigned long *addr) { #ifdef CONFIG_DEBUG_SPINLOCK BUG_ON(!test_bit(bitnum, addr)); #endif #if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK) clear_bit_unlock(bitnum, addr); #endif preempt_enable(); __release(bitlock); } /* * bit-based spin_unlock() * non-atomic version, which can be used eg. if the bit lock itself is * protecting the rest of the flags in the word. */ static inline void __bit_spin_unlock(int bitnum, unsigned long *addr) { #ifdef CONFIG_DEBUG_SPINLOCK BUG_ON(!test_bit(bitnum, addr)); #endif #if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK) __clear_bit_unlock(bitnum, addr); #endif preempt_enable(); __release(bitlock); } /* * Return true if the lock is held. */ static inline int bit_spin_is_locked(int bitnum, unsigned long *addr) { #if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK) return test_bit(bitnum, addr); #elif defined CONFIG_PREEMPT_COUNT return preempt_count(); #else return 1; #endif } #endif /* __LINUX_BIT_SPINLOCK_H */
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 /* SPDX-License-Identifier: GPL-2.0 */ /* * Runtime locking correctness validator * * Copyright (C) 2006,2007 Red Hat, Inc., Ingo Molnar <mingo@redhat.com> * Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra * * see Documentation/locking/lockdep-design.rst for more details. */ #ifndef __LINUX_LOCKDEP_H #define __LINUX_LOCKDEP_H #include <linux/lockdep_types.h> #include <linux/smp.h> #include <asm/percpu.h> struct task_struct; /* for sysctl */ extern int prove_locking; extern int lock_stat; #ifdef CONFIG_LOCKDEP #include <linux/linkage.h> #include <linux/list.h> #include <linux/debug_locks.h> #include <linux/stacktrace.h> static inline void lockdep_copy_map(struct lockdep_map *to, struct lockdep_map *from) { int i; *to = *from; /* * Since the class cache can be modified concurrently we could observe * half pointers (64bit arch using 32bit copy insns). Therefore clear * the caches and take the performance hit. * * XXX it doesn't work well with lockdep_set_class_and_subclass(), since * that relies on cache abuse. */ for (i = 0; i < NR_LOCKDEP_CACHING_CLASSES; i++) to->class_cache[i] = NULL; } /* * Every lock has a list of other locks that were taken after it. * We only grow the list, never remove from it: */ struct lock_list { struct list_head entry; struct lock_class *class; struct lock_class *links_to; const struct lock_trace *trace; u16 distance; /* bitmap of different dependencies from head to this */ u8 dep; /* used by BFS to record whether "prev -> this" only has -(*R)-> */ u8 only_xr; /* * The parent field is used to implement breadth-first search, and the * bit 0 is reused to indicate if the lock has been accessed in BFS. */ struct lock_list *parent; }; /** * struct lock_chain - lock dependency chain record * * @irq_context: the same as irq_context in held_lock below * @depth: the number of held locks in this chain * @base: the index in chain_hlocks for this chain * @entry: the collided lock chains in lock_chain hash list * @chain_key: the hash key of this lock_chain */ struct lock_chain { /* see BUILD_BUG_ON()s in add_chain_cache() */ unsigned int irq_context : 2, depth : 6, base : 24; /* 4 byte hole */ struct hlist_node entry; u64 chain_key; }; #define MAX_LOCKDEP_KEYS_BITS 13 #define MAX_LOCKDEP_KEYS (1UL << MAX_LOCKDEP_KEYS_BITS) #define INITIAL_CHAIN_KEY -1 struct held_lock { /* * One-way hash of the dependency chain up to this point. We * hash the hashes step by step as the dependency chain grows. * * We use it for dependency-caching and we skip detection * passes and dependency-updates if there is a cache-hit, so * it is absolutely critical for 100% coverage of the validator * to have a unique key value for every unique dependency path * that can occur in the system, to make a unique hash value * as likely as possible - hence the 64-bit width. * * The task struct holds the current hash value (initialized * with zero), here we store the previous hash value: */ u64 prev_chain_key; unsigned long acquire_ip; struct lockdep_map *instance; struct lockdep_map *nest_lock; #ifdef CONFIG_LOCK_STAT u64 waittime_stamp; u64 holdtime_stamp; #endif /* * class_idx is zero-indexed; it points to the element in * lock_classes this held lock instance belongs to. class_idx is in * the range from 0 to (MAX_LOCKDEP_KEYS-1) inclusive. */ unsigned int class_idx:MAX_LOCKDEP_KEYS_BITS; /* * The lock-stack is unified in that the lock chains of interrupt * contexts nest ontop of process context chains, but we 'separate' * the hashes by starting with 0 if we cross into an interrupt * context, and we also keep do not add cross-context lock * dependencies - the lock usage graph walking covers that area * anyway, and we'd just unnecessarily increase the number of * dependencies otherwise. [Note: hardirq and softirq contexts * are separated from each other too.] * * The following field is used to detect when we cross into an * interrupt context: */ unsigned int irq_context:2; /* bit 0 - soft, bit 1 - hard */ unsigned int trylock:1; /* 16 bits */ unsigned int read:2; /* see lock_acquire() comment */ unsigned int check:1; /* see lock_acquire() comment */ unsigned int hardirqs_off:1; unsigned int references:12; /* 32 bits */ unsigned int pin_count; }; /* * Initialization, self-test and debugging-output methods: */ extern void lockdep_init(void); extern void lockdep_reset(void); extern void lockdep_reset_lock(struct lockdep_map *lock); extern void lockdep_free_key_range(void *start, unsigned long size); extern asmlinkage void lockdep_sys_exit(void); extern void lockdep_set_selftest_task(struct task_struct *task); extern void lockdep_init_task(struct task_struct *task); /* * Split the recrursion counter in two to readily detect 'off' vs recursion. */ #define LOCKDEP_RECURSION_BITS 16 #define LOCKDEP_OFF (1U << LOCKDEP_RECURSION_BITS) #define LOCKDEP_RECURSION_MASK (LOCKDEP_OFF - 1) /* * lockdep_{off,on}() are macros to avoid tracing and kprobes; not inlines due * to header dependencies. */ #define lockdep_off() \ do { \ current->lockdep_recursion += LOCKDEP_OFF; \ } while (0) #define lockdep_on() \ do { \ current->lockdep_recursion -= LOCKDEP_OFF; \ } while (0) extern void lockdep_register_key(struct lock_class_key *key); extern void lockdep_unregister_key(struct lock_class_key *key); /* * These methods are used by specific locking variants (spinlocks, * rwlocks, mutexes and rwsems) to pass init/acquire/release events * to lockdep: */ extern void lockdep_init_map_type(struct lockdep_map *lock, const char *name, struct lock_class_key *key, int subclass, u8 inner, u8 outer, u8 lock_type); static inline void lockdep_init_map_waits(struct lockdep_map *lock, const char *name, struct lock_class_key *key, int subclass, u8 inner, u8 outer) { lockdep_init_map_type(lock, name, key, subclass, inner, LD_WAIT_INV, LD_LOCK_NORMAL); } static inline void lockdep_init_map_wait(struct lockdep_map *lock, const char *name, struct lock_class_key *key, int subclass, u8 inner) { lockdep_init_map_waits(lock, name, key, subclass, inner, LD_WAIT_INV); } static inline void lockdep_init_map(struct lockdep_map *lock, const char *name, struct lock_class_key *key, int subclass) { lockdep_init_map_wait(lock, name, key, subclass, LD_WAIT_INV); } /* * Reinitialize a lock key - for cases where there is special locking or * special initialization of locks so that the validator gets the scope * of dependencies wrong: they are either too broad (they need a class-split) * or they are too narrow (they suffer from a false class-split): */ #define lockdep_set_class(lock, key) \ lockdep_init_map_waits(&(lock)->dep_map, #key, key, 0, \ (lock)->dep_map.wait_type_inner, \ (lock)->dep_map.wait_type_outer) #define lockdep_set_class_and_name(lock, key, name) \ lockdep_init_map_waits(&(lock)->dep_map, name, key, 0, \ (lock)->dep_map.wait_type_inner, \ (lock)->dep_map.wait_type_outer) #define lockdep_set_class_and_subclass(lock, key, sub) \ lockdep_init_map_waits(&(lock)->dep_map, #key, key, sub,\ (lock)->dep_map.wait_type_inner, \ (lock)->dep_map.wait_type_outer) #define lockdep_set_subclass(lock, sub) \ lockdep_init_map_waits(&(lock)->dep_map, #lock, (lock)->dep_map.key, sub,\ (lock)->dep_map.wait_type_inner, \ (lock)->dep_map.wait_type_outer) #define lockdep_set_novalidate_class(lock) \ lockdep_set_class_and_name(lock, &__lockdep_no_validate__, #lock) /* * Compare locking classes */ #define lockdep_match_class(lock, key) lockdep_match_key(&(lock)->dep_map, key) static inline int lockdep_match_key(struct lockdep_map *lock, struct lock_class_key *key) { return lock->key == key; } /* * Acquire a lock. * * Values for "read": * * 0: exclusive (write) acquire * 1: read-acquire (no recursion allowed) * 2: read-acquire with same-instance recursion allowed * * Values for check: * * 0: simple checks (freeing, held-at-exit-time, etc.) * 1: full validation */ extern void lock_acquire(struct lockdep_map *lock, unsigned int subclass, int trylock, int read, int check, struct lockdep_map *nest_lock, unsigned long ip); extern void lock_release(struct lockdep_map *lock, unsigned long ip); /* * Same "read" as for lock_acquire(), except -1 means any. */ extern int lock_is_held_type(const struct lockdep_map *lock, int read); static inline int lock_is_held(const struct lockdep_map *lock) { return lock_is_held_type(lock, -1); } #define lockdep_is_held(lock) lock_is_held(&(lock)->dep_map) #define lockdep_is_held_type(lock, r) lock_is_held_type(&(lock)->dep_map, (r)) extern void lock_set_class(struct lockdep_map *lock, const char *name, struct lock_class_key *key, unsigned int subclass, unsigned long ip); static inline void lock_set_subclass(struct lockdep_map *lock, unsigned int subclass, unsigned long ip) { lock_set_class(lock, lock->name, lock->key, subclass, ip); } extern void lock_downgrade(struct lockdep_map *lock, unsigned long ip); #define NIL_COOKIE (struct pin_cookie){ .val = 0U, } extern struct pin_cookie lock_pin_lock(struct lockdep_map *lock); extern void lock_repin_lock(struct lockdep_map *lock, struct pin_cookie); extern void lock_unpin_lock(struct lockdep_map *lock, struct pin_cookie); #define lockdep_depth(tsk) (debug_locks ? (tsk)->lockdep_depth : 0) #define lockdep_assert_held(l) do { \ WARN_ON(debug_locks && !lockdep_is_held(l)); \ } while (0) #define lockdep_assert_held_write(l) do { \ WARN_ON(debug_locks && !lockdep_is_held_type(l, 0)); \ } while (0) #define lockdep_assert_held_read(l) do { \ WARN_ON(debug_locks && !lockdep_is_held_type(l, 1)); \ } while (0) #define lockdep_assert_held_once(l) do { \ WARN_ON_ONCE(debug_locks && !lockdep_is_held(l)); \ } while (0) #define lockdep_recursing(tsk) ((tsk)->lockdep_recursion) #define lockdep_pin_lock(l) lock_pin_lock(&(l)->dep_map) #define lockdep_repin_lock(l,c) lock_repin_lock(&(l)->dep_map, (c)) #define lockdep_unpin_lock(l,c) lock_unpin_lock(&(l)->dep_map, (c)) #else /* !CONFIG_LOCKDEP */ static inline void lockdep_init_task(struct task_struct *task) { } static inline void lockdep_off(void) { } static inline void lockdep_on(void) { } static inline void lockdep_set_selftest_task(struct task_struct *task) { } # define lock_acquire(l, s, t, r, c, n, i) do { } while (0) # define lock_release(l, i) do { } while (0) # define lock_downgrade(l, i) do { } while (0) # define lock_set_class(l, n, k, s, i) do { } while (0) # define lock_set_subclass(l, s, i) do { } while (0) # define lockdep_init() do { } while (0) # define lockdep_init_map_type(lock, name, key, sub, inner, outer, type) \ do { (void)(name); (void)(key); } while (0) # define lockdep_init_map_waits(lock, name, key, sub, inner, outer) \ do { (void)(name); (void)(key); } while (0) # define lockdep_init_map_wait(lock, name, key, sub, inner) \ do { (void)(name); (void)(key); } while (0) # define lockdep_init_map(lock, name, key, sub) \ do { (void)(name); (void)(key); } while (0) # define lockdep_set_class(lock, key) do { (void)(key); } while (0) # define lockdep_set_class_and_name(lock, key, name) \ do { (void)(key); (void)(name); } while (0) #define lockdep_set_class_and_subclass(lock, key, sub) \ do { (void)(key); } while (0) #define lockdep_set_subclass(lock, sub) do { } while (0) #define lockdep_set_novalidate_class(lock) do { } while (0) /* * We don't define lockdep_match_class() and lockdep_match_key() for !LOCKDEP * case since the result is not well defined and the caller should rather * #ifdef the call himself. */ # define lockdep_reset() do { debug_locks = 1; } while (0) # define lockdep_free_key_range(start, size) do { } while (0) # define lockdep_sys_exit() do { } while (0) static inline void lockdep_register_key(struct lock_class_key *key) { } static inline void lockdep_unregister_key(struct lock_class_key *key) { } #define lockdep_depth(tsk) (0) #define lockdep_is_held_type(l, r) (1) #define lockdep_assert_held(l) do { (void)(l); } while (0) #define lockdep_assert_held_write(l) do { (void)(l); } while (0) #define lockdep_assert_held_read(l) do { (void)(l); } while (0) #define lockdep_assert_held_once(l) do { (void)(l); } while (0) #define lockdep_recursing(tsk) (0) #define NIL_COOKIE (struct pin_cookie){ } #define lockdep_pin_lock(l) ({ struct pin_cookie cookie = { }; cookie; }) #define lockdep_repin_lock(l, c) do { (void)(l); (void)(c); } while (0) #define lockdep_unpin_lock(l, c) do { (void)(l); (void)(c); } while (0) #endif /* !LOCKDEP */ enum xhlock_context_t { XHLOCK_HARD, XHLOCK_SOFT, XHLOCK_CTX_NR, }; #define lockdep_init_map_crosslock(m, n, k, s) do {} while (0) /* * To initialize a lockdep_map statically use this macro. * Note that _name must not be NULL. */ #define STATIC_LOCKDEP_MAP_INIT(_name, _key) \ { .name = (_name), .key = (void *)(_key), } static inline void lockdep_invariant_state(bool force) {} static inline void lockdep_free_task(struct task_struct *task) {} #ifdef CONFIG_LOCK_STAT extern void lock_contended(struct lockdep_map *lock, unsigned long ip); extern void lock_acquired(struct lockdep_map *lock, unsigned long ip); #define LOCK_CONTENDED(_lock, try, lock) \ do { \ if (!try(_lock)) { \ lock_contended(&(_lock)->dep_map, _RET_IP_); \ lock(_lock); \ } \ lock_acquired(&(_lock)->dep_map, _RET_IP_); \ } while (0) #define LOCK_CONTENDED_RETURN(_lock, try, lock) \ ({ \ int ____err = 0; \ if (!try(_lock)) { \ lock_contended(&(_lock)->dep_map, _RET_IP_); \ ____err = lock(_lock); \ } \ if (!____err) \ lock_acquired(&(_lock)->dep_map, _RET_IP_); \ ____err; \ }) #else /* CONFIG_LOCK_STAT */ #define lock_contended(lockdep_map, ip) do {} while (0) #define lock_acquired(lockdep_map, ip) do {} while (0) #define LOCK_CONTENDED(_lock, try, lock) \ lock(_lock) #define LOCK_CONTENDED_RETURN(_lock, try, lock) \ lock(_lock) #endif /* CONFIG_LOCK_STAT */ #ifdef CONFIG_LOCKDEP /* * On lockdep we dont want the hand-coded irq-enable of * _raw_*_lock_flags() code, because lockdep assumes * that interrupts are not re-enabled during lock-acquire: */ #define LOCK_CONTENDED_FLAGS(_lock, try, lock, lockfl, flags) \ LOCK_CONTENDED((_lock), (try), (lock)) #else /* CONFIG_LOCKDEP */ #define LOCK_CONTENDED_FLAGS(_lock, try, lock, lockfl, flags) \ lockfl((_lock), (flags)) #endif /* CONFIG_LOCKDEP */ #ifdef CONFIG_PROVE_LOCKING extern void print_irqtrace_events(struct task_struct *curr); #else static inline void print_irqtrace_events(struct task_struct *curr) { } #endif /* Variable used to make lockdep treat read_lock() as recursive in selftests */ #ifdef CONFIG_DEBUG_LOCKING_API_SELFTESTS extern unsigned int force_read_lock_recursive; #else /* CONFIG_DEBUG_LOCKING_API_SELFTESTS */ #define force_read_lock_recursive 0 #endif /* CONFIG_DEBUG_LOCKING_API_SELFTESTS */ #ifdef CONFIG_LOCKDEP extern bool read_lock_is_recursive(void); #else /* CONFIG_LOCKDEP */ /* If !LOCKDEP, the value is meaningless */ #define read_lock_is_recursive() 0 #endif /* * For trivial one-depth nesting of a lock-class, the following * global define can be used. (Subsystems with multiple levels * of nesting should define their own lock-nesting subclasses.) */ #define SINGLE_DEPTH_NESTING 1 /* * Map the dependency ops to NOP or to real lockdep ops, depending * on the per lock-class debug mode: */ #define lock_acquire_exclusive(l, s, t, n, i) lock_acquire(l, s, t, 0, 1, n, i) #define lock_acquire_shared(l, s, t, n, i) lock_acquire(l, s, t, 1, 1, n, i) #define lock_acquire_shared_recursive(l, s, t, n, i) lock_acquire(l, s, t, 2, 1, n, i) #define spin_acquire(l, s, t, i) lock_acquire_exclusive(l, s, t, NULL, i) #define spin_acquire_nest(l, s, t, n, i) lock_acquire_exclusive(l, s, t, n, i) #define spin_release(l, i) lock_release(l, i) #define rwlock_acquire(l, s, t, i) lock_acquire_exclusive(l, s, t, NULL, i) #define rwlock_acquire_read(l, s, t, i) \ do { \ if (read_lock_is_recursive()) \ lock_acquire_shared_recursive(l, s, t, NULL, i); \ else \ lock_acquire_shared(l, s, t, NULL, i); \ } while (0) #define rwlock_release(l, i) lock_release(l, i) #define seqcount_acquire(l, s, t, i) lock_acquire_exclusive(l, s, t, NULL, i) #define seqcount_acquire_read(l, s, t, i) lock_acquire_shared_recursive(l, s, t, NULL, i) #define seqcount_release(l, i) lock_release(l, i) #define mutex_acquire(l, s, t, i) lock_acquire_exclusive(l, s, t, NULL, i) #define mutex_acquire_nest(l, s, t, n, i) lock_acquire_exclusive(l, s, t, n, i) #define mutex_release(l, i) lock_release(l, i) #define rwsem_acquire(l, s, t, i) lock_acquire_exclusive(l, s, t, NULL, i) #define rwsem_acquire_nest(l, s, t, n, i) lock_acquire_exclusive(l, s, t, n, i) #define rwsem_acquire_read(l, s, t, i) lock_acquire_shared(l, s, t, NULL, i) #define rwsem_release(l, i) lock_release(l, i) #define lock_map_acquire(l) lock_acquire_exclusive(l, 0, 0, NULL, _THIS_IP_) #define lock_map_acquire_read(l) lock_acquire_shared_recursive(l, 0, 0, NULL, _THIS_IP_) #define lock_map_acquire_tryread(l) lock_acquire_shared_recursive(l, 0, 1, NULL, _THIS_IP_) #define lock_map_release(l) lock_release(l, _THIS_IP_) #ifdef CONFIG_PROVE_LOCKING # define might_lock(lock) \ do { \ typecheck(struct lockdep_map *, &(lock)->dep_map); \ lock_acquire(&(lock)->dep_map, 0, 0, 0, 1, NULL, _THIS_IP_); \ lock_release(&(lock)->dep_map, _THIS_IP_); \ } while (0) # define might_lock_read(lock) \ do { \ typecheck(struct lockdep_map *, &(lock)->dep_map); \ lock_acquire(&(lock)->dep_map, 0, 0, 1, 1, NULL, _THIS_IP_); \ lock_release(&(lock)->dep_map, _THIS_IP_); \ } while (0) # define might_lock_nested(lock, subclass) \ do { \ typecheck(struct lockdep_map *, &(lock)->dep_map); \ lock_acquire(&(lock)->dep_map, subclass, 0, 1, 1, NULL, \ _THIS_IP_); \ lock_release(&(lock)->dep_map, _THIS_IP_); \ } while (0) DECLARE_PER_CPU(int, hardirqs_enabled); DECLARE_PER_CPU(int, hardirq_context); DECLARE_PER_CPU(unsigned int, lockdep_recursion); #define __lockdep_enabled (debug_locks && !this_cpu_read(lockdep_recursion)) #define lockdep_assert_irqs_enabled() \ do { \ WARN_ON_ONCE(__lockdep_enabled && !this_cpu_read(hardirqs_enabled)); \ } while (0) #define lockdep_assert_irqs_disabled() \ do { \ WARN_ON_ONCE(__lockdep_enabled && this_cpu_read(hardirqs_enabled)); \ } while (0) #define lockdep_assert_in_irq() \ do { \ WARN_ON_ONCE(__lockdep_enabled && !this_cpu_read(hardirq_context)); \ } while (0) #define lockdep_assert_preemption_enabled() \ do { \ WARN_ON_ONCE(IS_ENABLED(CONFIG_PREEMPT_COUNT) && \ __lockdep_enabled && \ (preempt_count() != 0 || \ !this_cpu_read(hardirqs_enabled))); \ } while (0) #define lockdep_assert_preemption_disabled() \ do { \ WARN_ON_ONCE(IS_ENABLED(CONFIG_PREEMPT_COUNT) && \ __lockdep_enabled && \ (preempt_count() == 0 && \ this_cpu_read(hardirqs_enabled))); \ } while (0) #else # define might_lock(lock) do { } while (0) # define might_lock_read(lock) do { } while (0) # define might_lock_nested(lock, subclass) do { } while (0) # define lockdep_assert_irqs_enabled() do { } while (0) # define lockdep_assert_irqs_disabled() do { } while (0) # define lockdep_assert_in_irq() do { } while (0) # define lockdep_assert_preemption_enabled() do { } while (0) # define lockdep_assert_preemption_disabled() do { } while (0) #endif #ifdef CONFIG_PROVE_RAW_LOCK_NESTING # define lockdep_assert_RT_in_threaded_ctx() do { \ WARN_ONCE(debug_locks && !current->lockdep_recursion && \ lockdep_hardirq_context() && \ !(current->hardirq_threaded || current->irq_config), \ "Not in threaded context on PREEMPT_RT as expected\n"); \ } while (0) #else # define lockdep_assert_RT_in_threaded_ctx() do { } while (0) #endif #ifdef CONFIG_LOCKDEP void lockdep_rcu_suspicious(const char *file, const int line, const char *s); #else static inline void lockdep_rcu_suspicious(const char *file, const int line, const char *s) { } #endif #endif /* __LINUX_LOCKDEP_H */
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 // SPDX-License-Identifier: GPL-2.0 /* File: fs/ext4/acl.h (C) 2001 Andreas Gruenbacher, <a.gruenbacher@computer.org> */ #include <linux/posix_acl_xattr.h> #define EXT4_ACL_VERSION 0x0001 typedef struct { __le16 e_tag; __le16 e_perm; __le32 e_id; } ext4_acl_entry; typedef struct { __le16 e_tag; __le16 e_perm; } ext4_acl_entry_short; typedef struct { __le32 a_version; } ext4_acl_header; static inline size_t ext4_acl_size(int count) { if (count <= 4) { return sizeof(ext4_acl_header) + count * sizeof(ext4_acl_entry_short); } else { return sizeof(ext4_acl_header) + 4 * sizeof(ext4_acl_entry_short) + (count - 4) * sizeof(ext4_acl_entry); } } static inline int ext4_acl_count(size_t size) { ssize_t s; size -= sizeof(ext4_acl_header); s = size - 4 * sizeof(ext4_acl_entry_short); if (s < 0) { if (size % sizeof(ext4_acl_entry_short)) return -1; return size / sizeof(ext4_acl_entry_short); } else { if (s % sizeof(ext4_acl_entry)) return -1; return s / sizeof(ext4_acl_entry) + 4; } } #ifdef CONFIG_EXT4_FS_POSIX_ACL /* acl.c */ struct posix_acl *ext4_get_acl(struct inode *inode, int type); int ext4_set_acl(struct inode *inode, struct posix_acl *acl, int type); extern int ext4_init_acl(handle_t *, struct inode *, struct inode *); #else /* CONFIG_EXT4_FS_POSIX_ACL */ #include <linux/sched.h> #define ext4_get_acl NULL #define ext4_set_acl NULL static inline int ext4_init_acl(handle_t *handle, struct inode *inode, struct inode *dir) { return 0; } #endif /* CONFIG_EXT4_FS_POSIX_ACL */
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 /* SPDX-License-Identifier: GPL-2.0 */ #ifndef _NETFILTER_INGRESS_H_ #define _NETFILTER_INGRESS_H_ #include <linux/netfilter.h> #include <linux/netdevice.h> #ifdef CONFIG_NETFILTER_INGRESS static inline bool nf_hook_ingress_active(const struct sk_buff *skb) { #ifdef CONFIG_JUMP_LABEL if (!static_key_false(&nf_hooks_needed[NFPROTO_NETDEV][NF_NETDEV_INGRESS])) return false; #endif return rcu_access_pointer(skb->dev->nf_hooks_ingress); } /* caller must hold rcu_read_lock */ static inline int nf_hook_ingress(struct sk_buff *skb) { struct nf_hook_entries *e = rcu_dereference(skb->dev->nf_hooks_ingress); struct nf_hook_state state; int ret; /* Must recheck the ingress hook head, in the event it became NULL * after the check in nf_hook_ingress_active evaluated to true. */ if (unlikely(!e)) return 0; nf_hook_state_init(&state, NF_NETDEV_INGRESS, NFPROTO_NETDEV, skb->dev, NULL, NULL, dev_net(skb->dev), NULL); ret = nf_hook_slow(skb, &state, e, 0); if (ret == 0) return -1; return ret; } static inline void nf_hook_ingress_init(struct net_device *dev) { RCU_INIT_POINTER(dev->nf_hooks_ingress, NULL); } #else /* CONFIG_NETFILTER_INGRESS */ static inline int nf_hook_ingress_active(struct sk_buff *skb) { return 0; } static inline int nf_hook_ingress(struct sk_buff *skb) { return 0; } static inline void nf_hook_ingress_init(struct net_device *dev) {} #endif /* CONFIG_NETFILTER_INGRESS */ #endif /* _NETFILTER_INGRESS_H_ */
1 1 1 1 1 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 1437 1438 1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 1500 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1533 1534 1535 1536 1537 1538 1539 1540 1541 1542 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553 1554 1555 1556 1557 1558 1559 1560 1561 1562 1563 1564 1565 1566 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 1590 1591 1592 1593 1594 1595 1596 1597 1598 1599 1600 1601 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1622 1623 1624 1625 1626 1627 1628 1629 1630 1631 1632 1633 1634 1635 1636 1637 1638 1639 1640 1641 1642 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 1665 1666 1667 1668 1669 1670 1671 1672 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682 1683 1684 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1695 1696 1697 1698 1699 1700 1701 1702 1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 1713 1714 1715 1716 1717 1718 1719 1720 1721 1722 1723 1724 1725 1726 1727 1728 1729 1730 1731 1732 1733 1734 1735 1736 1737 1738 1739 1740 1741 1742 1743 1744 1745 1746 1747 1748 1749 1750 1751 1752 1753 1754 1755 1756 1757 1758 1759 1760 1761 1762 1763 1764 1765 1766 1767 1768 1769 1770 1771 1772 1773 1774 1775 1776 1777 1778 1779 1780 1781 1782 1783 1784 1785 1786 1787 1788 1789 1790 1791 1792 1793 1794 1795 1796 1797 1798 1799 1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 1810 1811 1812 1813 1814 1815 1816 1817 1818 1819 1820 1821 1822 1823 1824 1825 1826 1827 1828 1829 1830 1831 1832 1833 1834 1835 1836 1837 1838 1839 1840 1841 1842 1843 1844 1845 1846 1847 1848 1849 1850 1851 1852 1853 1854 1855 1856 1857 1858 1859 1860 1861 1862 1863 1864 1865 1866 1867 1868 1869 1870 1871 1872 1873 1874 1875 1876 1877 1878 1879 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 1890 1891 1892 1893 1894 1895 1896 1897 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 2026 2027 2028 2029 2030 2031 2032 2033 2034 2035 2036 2037 2038 2039 2040 2041 2042 2043 2044 2045 2046 2047 2048 2049 2050 2051 2052 2053 2054 2055 2056 2057 2058 2059 2060 2061 2062 2063 2064 2065 2066 2067 2068 2069 2070 2071 2072 2073 2074 2075 2076 2077 2078 2079 2080 2081 2082 2083 2084 2085 2086 2087 2088 2089 2090 2091 2092 2093 2094 2095 2096 2097 2098 2099 2100 2101 2102 2103 2104 2105 2106 2107 2108 2109 2110 2111 2112 2113 2114 2115 2116 2117 2118 2119 2120 2121 2122 2123 2124 2125 2126 2127 2128 2129 2130 2131 2132 2133 2134 2135 2136 2137 2138 2139 2140 2141 2142 2143 2144 2145 2146 2147 2148 2149 2150 2151 2152 2153 2154 2155 2156 2157 2158 2159 2160 2161 2162 2163 2164 2165 2166 2167 2168 2169 2170 2171 2172 2173 2174 2175 2176 2177 2178 2179 2180 2181 2182 2183 2184 2185 2186 2187 2188 2189 2190 2191 2192 2193 2194 2195 2196 2197 2198 2199 2200 2201 2202 2203 2204 2205 2206 2207 2208 2209 2210 2211 2212 2213 2214 2215 2216 2217 2218 2219 2220 2221 2222 2223 2224 2225 2226 2227 2228 2229 2230 2231 2232 2233 2234 2235 2236 2237 2238 2239 2240 2241 2242 2243 2244 2245 2246 2247 2248 2249 2250 2251 2252 2253 2254 2255 2256 2257 2258 2259 2260 2261 2262 2263 2264 2265 2266 2267 2268 2269 2270 2271 2272 2273 2274 2275 2276 2277 2278 2279 2280 2281 2282 2283 2284 2285 2286 2287 2288 2289 2290 2291 2292 2293 2294 2295 2296 2297 2298 2299 2300 2301 2302 2303 2304 2305 2306 2307 2308 2309 2310 2311 2312 2313 2314 2315 2316 2317 2318 2319 2320 2321 2322 2323 2324 2325 2326 2327 2328 2329 2330 2331 2332 2333 2334 2335 2336 2337 2338 2339 2340 2341 2342 2343 2344 2345 2346 2347 2348 2349 2350 2351 2352 2353 2354 2355 2356 2357 2358 2359 2360 2361 2362 2363 2364 2365 2366 2367 2368 2369 2370 2371 2372 2373 2374 2375 2376 2377 2378 2379 2380 2381 2382 2383 2384 2385 2386 2387 2388 2389 2390 2391 2392 2393 2394 2395 2396 2397 2398 2399 2400 2401 2402 2403 2404 2405 2406 2407 2408 2409 2410 2411 2412 2413 2414 2415 2416 2417 2418 2419 2420 2421 2422 2423 2424 2425 2426 2427 2428 2429 2430 2431 2432 2433 2434 2435 2436 2437 2438 2439 2440 /* * linux/drivers/block/loop.c * * Written by Theodore Ts'o, 3/29/93 * * Copyright 1993 by Theodore Ts'o. Redistribution of this file is * permitted under the GNU General Public License. * * DES encryption plus some minor changes by Werner Almesberger, 30-MAY-1993 * more DES encryption plus IDEA encryption by Nicholas J. Leon, June 20, 1996 * * Modularized and updated for 1.1.16 kernel - Mitch Dsouza 28th May 1994 * Adapted for 1.3.59 kernel - Andries Brouwer, 1 Feb 1996 * * Fixed do_loop_request() re-entrancy - Vincent.Renardias@waw.com Mar 20, 1997 * * Added devfs support - Richard Gooch <rgooch@atnf.csiro.au> 16-Jan-1998 * * Handle sparse backing files correctly - Kenn Humborg, Jun 28, 1998 * * Loadable modules and other fixes by AK, 1998 * * Make real block number available to downstream transfer functions, enables * CBC (and relatives) mode encryption requiring unique IVs per data block. * Reed H. Petty, rhp@draper.net * * Maximum number of loop devices now dynamic via max_loop module parameter. * Russell Kroll <rkroll@exploits.org> 19990701 * * Maximum number of loop devices when compiled-in now selectable by passing * max_loop=<1-255> to the kernel on boot. * Erik I. Bolsø, <eriki@himolde.no>, Oct 31, 1999 * * Completely rewrite request handling to be make_request_fn style and * non blocking, pushing work to a helper thread. Lots of fixes from * Al Viro too. * Jens Axboe <axboe@suse.de>, Nov 2000 * * Support up to 256 loop devices * Heinz Mauelshagen <mge@sistina.com>, Feb 2002 * * Support for falling back on the write file operation when the address space * operations write_begin is not available on the backing filesystem. * Anton Altaparmakov, 16 Feb 2005 * * Still To Fix: * - Advisory locking is ignored here. * - Should use an own CAP_* category instead of CAP_SYS_ADMIN * */ #include <linux/module.h> #include <linux/moduleparam.h> #include <linux/sched.h> #include <linux/fs.h> #include <linux/file.h> #include <linux/stat.h> #include <linux/errno.h> #include <linux/major.h> #include <linux/wait.h> #include <linux/blkdev.h> #include <linux/blkpg.h> #include <linux/init.h> #include <linux/swap.h> #include <linux/slab.h> #include <linux/compat.h> #include <linux/suspend.h> #include <linux/freezer.h> #include <linux/mutex.h> #include <linux/writeback.h> #include <linux/completion.h> #include <linux/highmem.h> #include <linux/kthread.h> #include <linux/splice.h> #include <linux/sysfs.h> #include <linux/miscdevice.h> #include <linux/falloc.h> #include <linux/uio.h> #include <linux/ioprio.h> #include <linux/blk-cgroup.h> #include "loop.h" #include <linux/uaccess.h> static DEFINE_IDR(loop_index_idr); static DEFINE_MUTEX(loop_ctl_mutex); static int max_part; static int part_shift; static int transfer_xor(struct loop_device *lo, int cmd, struct page *raw_page, unsigned raw_off, struct page *loop_page, unsigned loop_off, int size, sector_t real_block) { char *raw_buf = kmap_atomic(raw_page) + raw_off; char *loop_buf = kmap_atomic(loop_page) + loop_off; char *in, *out, *key; int i, keysize; if (cmd == READ) { in = raw_buf; out = loop_buf; } else { in = loop_buf; out = raw_buf; } key = lo->lo_encrypt_key; keysize = lo->lo_encrypt_key_size; for (i = 0; i < size; i++) *out++ = *in++ ^ key[(i & 511) % keysize]; kunmap_atomic(loop_buf); kunmap_atomic(raw_buf); cond_resched(); return 0; } static int xor_init(struct loop_device *lo, const struct loop_info64 *info) { if (unlikely(info->lo_encrypt_key_size <= 0)) return -EINVAL; return 0; } static struct loop_func_table none_funcs = { .number = LO_CRYPT_NONE, }; static struct loop_func_table xor_funcs = { .number = LO_CRYPT_XOR, .transfer = transfer_xor, .init = xor_init }; /* xfer_funcs[0] is special - its release function is never called */ static struct loop_func_table *xfer_funcs[MAX_LO_CRYPT] = { &none_funcs, &xor_funcs }; static loff_t get_size(loff_t offset, loff_t sizelimit, struct file *file) { loff_t loopsize; /* Compute loopsize in bytes */ loopsize = i_size_read(file->f_mapping->host); if (offset > 0) loopsize -= offset; /* offset is beyond i_size, weird but possible */ if (loopsize < 0) return 0; if (sizelimit > 0 && sizelimit < loopsize) loopsize = sizelimit; /* * Unfortunately, if we want to do I/O on the device, * the number of 512-byte sectors has to fit into a sector_t. */ return loopsize >> 9; } static loff_t get_loop_size(struct loop_device *lo, struct file *file) { return get_size(lo->lo_offset, lo->lo_sizelimit, file); } static void __loop_update_dio(struct loop_device *lo, bool dio) { struct file *file = lo->lo_backing_file; struct address_space *mapping = file->f_mapping; struct inode *inode = mapping->host; unsigned short sb_bsize = 0; unsigned dio_align = 0; bool use_dio; if (inode->i_sb->s_bdev) { sb_bsize = bdev_logical_block_size(inode->i_sb->s_bdev); dio_align = sb_bsize - 1; } /* * We support direct I/O only if lo_offset is aligned with the * logical I/O size of backing device, and the logical block * size of loop is bigger than the backing device's and the loop * needn't transform transfer. * * TODO: the above condition may be loosed in the future, and * direct I/O may be switched runtime at that time because most * of requests in sane applications should be PAGE_SIZE aligned */ if (dio) { if (queue_logical_block_size(lo->lo_queue) >= sb_bsize && !(lo->lo_offset & dio_align) && mapping->a_ops->direct_IO && !lo->transfer) use_dio = true; else use_dio = false; } else { use_dio = false; } if (lo->use_dio == use_dio) return; /* flush dirty pages before changing direct IO */ vfs_fsync(file, 0); /* * The flag of LO_FLAGS_DIRECT_IO is handled similarly with * LO_FLAGS_READ_ONLY, both are set from kernel, and losetup * will get updated by ioctl(LOOP_GET_STATUS) */ if (lo->lo_state == Lo_bound) blk_mq_freeze_queue(lo->lo_queue); lo->use_dio = use_dio; if (use_dio) { blk_queue_flag_clear(QUEUE_FLAG_NOMERGES, lo->lo_queue); lo->lo_flags |= LO_FLAGS_DIRECT_IO; } else { blk_queue_flag_set(QUEUE_FLAG_NOMERGES, lo->lo_queue); lo->lo_flags &= ~LO_FLAGS_DIRECT_IO; } if (lo->lo_state == Lo_bound) blk_mq_unfreeze_queue(lo->lo_queue); } /** * loop_validate_block_size() - validates the passed in block size * @bsize: size to validate */ static int loop_validate_block_size(unsigned short bsize) { if (bsize < 512 || bsize > PAGE_SIZE || !is_power_of_2(bsize)) return -EINVAL; return 0; } /** * loop_set_size() - sets device size and notifies userspace * @lo: struct loop_device to set the size for * @size: new size of the loop device * * Callers must validate that the size passed into this function fits into * a sector_t, eg using loop_validate_size() */ static void loop_set_size(struct loop_device *lo, loff_t size) { struct block_device *bdev = lo->lo_device; bd_set_nr_sectors(bdev, size); if (!set_capacity_revalidate_and_notify(lo->lo_disk, size, false)) kobject_uevent(&disk_to_dev(bdev->bd_disk)->kobj, KOBJ_CHANGE); } static inline int lo_do_transfer(struct loop_device *lo, int cmd, struct page *rpage, unsigned roffs, struct page *lpage, unsigned loffs, int size, sector_t rblock) { int ret; ret = lo->transfer(lo, cmd, rpage, roffs, lpage, loffs, size, rblock); if (likely(!ret)) return 0; printk_ratelimited(KERN_ERR "loop: Transfer error at byte offset %llu, length %i.\n", (unsigned long long)rblock << 9, size); return ret; } static int lo_write_bvec(struct file *file, struct bio_vec *bvec, loff_t *ppos) { struct iov_iter i; ssize_t bw; iov_iter_bvec(&i, WRITE, bvec, 1, bvec->bv_len); file_start_write(file); bw = vfs_iter_write(file, &i, ppos, 0); file_end_write(file); if (likely(bw == bvec->bv_len)) return 0; printk_ratelimited(KERN_ERR "loop: Write error at byte offset %llu, length %i.\n", (unsigned long long)*ppos, bvec->bv_len); if (bw >= 0) bw = -EIO; return bw; } static int lo_write_simple(struct loop_device *lo, struct request *rq, loff_t pos) { struct bio_vec bvec; struct req_iterator iter; int ret = 0; rq_for_each_segment(bvec, rq, iter) { ret = lo_write_bvec(lo->lo_backing_file, &bvec, &pos); if (ret < 0) break; cond_resched(); } return ret; } /* * This is the slow, transforming version that needs to double buffer the * data as it cannot do the transformations in place without having direct * access to the destination pages of the backing file. */ static int lo_write_transfer(struct loop_device *lo, struct request *rq, loff_t pos) { struct bio_vec bvec, b; struct req_iterator iter; struct page *page; int ret = 0; page = alloc_page(GFP_NOIO); if (unlikely(!page)) return -ENOMEM; rq_for_each_segment(bvec, rq, iter) { ret = lo_do_transfer(lo, WRITE, page, 0, bvec.bv_page, bvec.bv_offset, bvec.bv_len, pos >> 9); if (unlikely(ret)) break; b.bv_page = page; b.bv_offset = 0; b.bv_len = bvec.bv_len; ret = lo_write_bvec(lo->lo_backing_file, &b, &pos); if (ret < 0) break; } __free_page(page); return ret; } static int lo_read_simple(struct loop_device *lo, struct request *rq, loff_t pos) { struct bio_vec bvec; struct req_iterator iter; struct iov_iter i; ssize_t len; rq_for_each_segment(bvec, rq, iter) { iov_iter_bvec(&i, READ, &bvec, 1, bvec.bv_len); len = vfs_iter_read(lo->lo_backing_file, &i, &pos, 0); if (len < 0) return len; flush_dcache_page(bvec.bv_page); if (len != bvec.bv_len) { struct bio *bio; __rq_for_each_bio(bio, rq) zero_fill_bio(bio); break; } cond_resched(); } return 0; } static int lo_read_transfer(struct loop_device *lo, struct request *rq, loff_t pos) { struct bio_vec bvec, b; struct req_iterator iter; struct iov_iter i; struct page *page; ssize_t len; int ret = 0; page = alloc_page(GFP_NOIO); if (unlikely(!page)) return -ENOMEM; rq_for_each_segment(bvec, rq, iter) { loff_t offset = pos; b.bv_page = page; b.bv_offset = 0; b.bv_len = bvec.bv_len; iov_iter_bvec(&i, READ, &b, 1, b.bv_len); len = vfs_iter_read(lo->lo_backing_file, &i, &pos, 0); if (len < 0) { ret = len; goto out_free_page; } ret = lo_do_transfer(lo, READ, page, 0, bvec.bv_page, bvec.bv_offset, len, offset >> 9); if (ret) goto out_free_page; flush_dcache_page(bvec.bv_page); if (len != bvec.bv_len) { struct bio *bio; __rq_for_each_bio(bio, rq) zero_fill_bio(bio); break; } } ret = 0; out_free_page: __free_page(page); return ret; } static int lo_fallocate(struct loop_device *lo, struct request *rq, loff_t pos, int mode) { /* * We use fallocate to manipulate the space mappings used by the image * a.k.a. discard/zerorange. However we do not support this if * encryption is enabled, because it may give an attacker useful * information. */ struct file *file = lo->lo_backing_file; struct request_queue *q = lo->lo_queue; int ret; mode |= FALLOC_FL_KEEP_SIZE; if (!blk_queue_discard(q)) { ret = -EOPNOTSUPP; goto out; } ret = file->f_op->fallocate(file, mode, pos, blk_rq_bytes(rq)); if (unlikely(ret && ret != -EINVAL && ret != -EOPNOTSUPP)) ret = -EIO; out: return ret; } static int lo_req_flush(struct loop_device *lo, struct request *rq) { struct file *file = lo->lo_backing_file; int ret = vfs_fsync(file, 0); if (unlikely(ret && ret != -EINVAL)) ret = -EIO; return ret; } static void lo_complete_rq(struct request *rq) { struct loop_cmd *cmd = blk_mq_rq_to_pdu(rq); blk_status_t ret = BLK_STS_OK; if (!cmd->use_aio || cmd->ret < 0 || cmd->ret == blk_rq_bytes(rq) || req_op(rq) != REQ_OP_READ) { if (cmd->ret < 0) ret = errno_to_blk_status(cmd->ret); goto end_io; } /* * Short READ - if we got some data, advance our request and * retry it. If we got no data, end the rest with EIO. */ if (cmd->ret) { blk_update_request(rq, BLK_STS_OK, cmd->ret); cmd->ret = 0; blk_mq_requeue_request(rq, true); } else { if (cmd->use_aio) { struct bio *bio = rq->bio; while (bio) { zero_fill_bio(bio); bio = bio->bi_next; } } ret = BLK_STS_IOERR; end_io: blk_mq_end_request(rq, ret); } } static void lo_rw_aio_do_completion(struct loop_cmd *cmd) { struct request *rq = blk_mq_rq_from_pdu(cmd); if (!atomic_dec_and_test(&cmd->ref)) return; kfree(cmd->bvec); cmd->bvec = NULL; if (likely(!blk_should_fake_timeout(rq->q))) blk_mq_complete_request(rq); } static void lo_rw_aio_complete(struct kiocb *iocb, long ret, long ret2) { struct loop_cmd *cmd = container_of(iocb, struct loop_cmd, iocb); if (cmd->css) css_put(cmd->css); cmd->ret = ret; lo_rw_aio_do_completion(cmd); } static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd, loff_t pos, bool rw) { struct iov_iter iter; struct req_iterator rq_iter; struct bio_vec *bvec; struct request *rq = blk_mq_rq_from_pdu(cmd); struct bio *bio = rq->bio; struct file *file = lo->lo_backing_file; struct bio_vec tmp; unsigned int offset; int nr_bvec = 0; int ret; rq_for_each_bvec(tmp, rq, rq_iter) nr_bvec++; if (rq->bio != rq->biotail) { bvec = kmalloc_array(nr_bvec, sizeof(struct bio_vec), GFP_NOIO); if (!bvec) return -EIO; cmd->bvec = bvec; /* * The bios of the request may be started from the middle of * the 'bvec' because of bio splitting, so we can't directly * copy bio->bi_iov_vec to new bvec. The rq_for_each_bvec * API will take care of all details for us. */ rq_for_each_bvec(tmp, rq, rq_iter) { *bvec = tmp; bvec++; } bvec = cmd->bvec; offset = 0; } else { /* * Same here, this bio may be started from the middle of the * 'bvec' because of bio splitting, so offset from the bvec * must be passed to iov iterator */ offset = bio->bi_iter.bi_bvec_done; bvec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter); } atomic_set(&cmd->ref, 2); iov_iter_bvec(&iter, rw, bvec, nr_bvec, blk_rq_bytes(rq)); iter.iov_offset = offset; cmd->iocb.ki_pos = pos; cmd->iocb.ki_filp = file; cmd->iocb.ki_complete = lo_rw_aio_complete; cmd->iocb.ki_flags = IOCB_DIRECT; cmd->iocb.ki_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0); if (cmd->css) kthread_associate_blkcg(cmd->css); if (rw == WRITE) ret = call_write_iter(file, &cmd->iocb, &iter); else ret = call_read_iter(file, &cmd->iocb, &iter); lo_rw_aio_do_completion(cmd); kthread_associate_blkcg(NULL); if (ret != -EIOCBQUEUED) cmd->iocb.ki_complete(&cmd->iocb, ret, 0); return 0; } static int do_req_filebacked(struct loop_device *lo, struct request *rq) { struct loop_cmd *cmd = blk_mq_rq_to_pdu(rq); loff_t pos = ((loff_t) blk_rq_pos(rq) << 9) + lo->lo_offset; /* * lo_write_simple and lo_read_simple should have been covered * by io submit style function like lo_rw_aio(), one blocker * is that lo_read_simple() need to call flush_dcache_page after * the page is written from kernel, and it isn't easy to handle * this in io submit style function which submits all segments * of the req at one time. And direct read IO doesn't need to * run flush_dcache_page(). */ switch (req_op(rq)) { case REQ_OP_FLUSH: return lo_req_flush(lo, rq); case REQ_OP_WRITE_ZEROES: /* * If the caller doesn't want deallocation, call zeroout to * write zeroes the range. Otherwise, punch them out. */ return lo_fallocate(lo, rq, pos, (rq->cmd_flags & REQ_NOUNMAP) ? FALLOC_FL_ZERO_RANGE : FALLOC_FL_PUNCH_HOLE); case REQ_OP_DISCARD: return lo_fallocate(lo, rq, pos, FALLOC_FL_PUNCH_HOLE); case REQ_OP_WRITE: if (lo->transfer) return lo_write_transfer(lo, rq, pos); else if (cmd->use_aio) return lo_rw_aio(lo, cmd, pos, WRITE); else return lo_write_simple(lo, rq, pos); case REQ_OP_READ: if (lo->transfer) return lo_read_transfer(lo, rq, pos); else if (cmd->use_aio) return lo_rw_aio(lo, cmd, pos, READ); else return lo_read_simple(lo, rq, pos); default: WARN_ON_ONCE(1); return -EIO; } } static inline void loop_update_dio(struct loop_device *lo) { __loop_update_dio(lo, (lo->lo_backing_file->f_flags & O_DIRECT) | lo->use_dio); } static void loop_reread_partitions(struct loop_device *lo, struct block_device *bdev) { int rc; mutex_lock(&bdev->bd_mutex); rc = bdev_disk_changed(bdev, false); mutex_unlock(&bdev->bd_mutex); if (rc) pr_warn("%s: partition scan of loop%d (%s) failed (rc=%d)\n", __func__, lo->lo_number, lo->lo_file_name, rc); } static inline int is_loop_device(struct file *file) { struct inode *i = file->f_mapping->host; return i && S_ISBLK(i->i_mode) && MAJOR(i->i_rdev) == LOOP_MAJOR; } static int loop_validate_file(struct file *file, struct block_device *bdev) { struct inode *inode = file->f_mapping->host; struct file *f = file; /* Avoid recursion */ while (is_loop_device(f)) { struct loop_device *l; if (f->f_mapping->host->i_bdev == bdev) return -EBADF; l = f->f_mapping->host->i_bdev->bd_disk->private_data; if (l->lo_state != Lo_bound) { return -EINVAL; } f = l->lo_backing_file; } if (!S_ISREG(inode->i_mode) && !S_ISBLK(inode->i_mode)) return -EINVAL; return 0; } /* * loop_change_fd switched the backing store of a loopback device to * a new file. This is useful for operating system installers to free up * the original file and in High Availability environments to switch to * an alternative location for the content in case of server meltdown. * This can only work if the loop device is used read-only, and if the * new backing store is the same size and type as the old backing store. */ static int loop_change_fd(struct loop_device *lo, struct block_device *bdev, unsigned int arg) { struct file *file = NULL, *old_file; int error; bool partscan; error = mutex_lock_killable(&loop_ctl_mutex); if (error) return error; error = -ENXIO; if (lo->lo_state != Lo_bound) goto out_err; /* the loop device has to be read-only */ error = -EINVAL; if (!(lo->lo_flags & LO_FLAGS_READ_ONLY)) goto out_err; error = -EBADF; file = fget(arg); if (!file) goto out_err; error = loop_validate_file(file, bdev); if (error) goto out_err; old_file = lo->lo_backing_file; error = -EINVAL; /* size of the new backing store needs to be the same */ if (get_loop_size(lo, file) != get_loop_size(lo, old_file)) goto out_err; /* and ... switch */ blk_mq_freeze_queue(lo->lo_queue); mapping_set_gfp_mask(old_file->f_mapping, lo->old_gfp_mask); lo->lo_backing_file = file; lo->old_gfp_mask = mapping_gfp_mask(file->f_mapping); mapping_set_gfp_mask(file->f_mapping, lo->old_gfp_mask & ~(__GFP_IO|__GFP_FS)); loop_update_dio(lo); blk_mq_unfreeze_queue(lo->lo_queue); partscan = lo->lo_flags & LO_FLAGS_PARTSCAN; mutex_unlock(&loop_ctl_mutex); /* * We must drop file reference outside of loop_ctl_mutex as dropping * the file ref can take bd_mutex which creates circular locking * dependency. */ fput(old_file); if (partscan) loop_reread_partitions(lo, bdev); return 0; out_err: mutex_unlock(&loop_ctl_mutex); if (file) fput(file); return error; } /* loop sysfs attributes */ static ssize_t loop_attr_show(struct device *dev, char *page, ssize_t (*callback)(struct loop_device *, char *)) { struct gendisk *disk = dev_to_disk(dev); struct loop_device *lo = disk->private_data; return callback(lo, page); } #define LOOP_ATTR_RO(_name) \ static ssize_t loop_attr_##_name##_show(struct loop_device *, char *); \ static ssize_t loop_attr_do_show_##_name(struct device *d, \ struct device_attribute *attr, char *b) \ { \ return loop_attr_show(d, b, loop_attr_##_name##_show); \ } \ static struct device_attribute loop_attr_##_name = \ __ATTR(_name, 0444, loop_attr_do_show_##_name, NULL); static ssize_t loop_attr_backing_file_show(struct loop_device *lo, char *buf) { ssize_t ret; char *p = NULL; spin_lock_irq(&lo->lo_lock); if (lo->lo_backing_file) p = file_path(lo->lo_backing_file, buf, PAGE_SIZE - 1); spin_unlock_irq(&lo->lo_lock); if (IS_ERR_OR_NULL(p)) ret = PTR_ERR(p); else { ret = strlen(p); memmove(buf, p, ret); buf[ret++] = '\n'; buf[ret] = 0; } return ret; } static ssize_t loop_attr_offset_show(struct loop_device *lo, char *buf) { return sprintf(buf, "%llu\n", (unsigned long long)lo->lo_offset); } static ssize_t loop_attr_sizelimit_show(struct loop_device *lo, char *buf) { return sprintf(buf, "%llu\n", (unsigned long long)lo->lo_sizelimit); } static ssize_t loop_attr_autoclear_show(struct loop_device *lo, char *buf) { int autoclear = (lo->lo_flags & LO_FLAGS_AUTOCLEAR); return sprintf(buf, "%s\n", autoclear ? "1" : "0"); } static ssize_t loop_attr_partscan_show(struct loop_device *lo, char *buf) { int partscan = (lo->lo_flags & LO_FLAGS_PARTSCAN); return sprintf(buf, "%s\n", partscan ? "1" : "0"); } static ssize_t loop_attr_dio_show(struct loop_device *lo, char *buf) { int dio = (lo->lo_flags & LO_FLAGS_DIRECT_IO); return sprintf(buf, "%s\n", dio ? "1" : "0"); } LOOP_ATTR_RO(backing_file); LOOP_ATTR_RO(offset); LOOP_ATTR_RO(sizelimit); LOOP_ATTR_RO(autoclear); LOOP_ATTR_RO(partscan); LOOP_ATTR_RO(dio); static struct attribute *loop_attrs[] = { &loop_attr_backing_file.attr, &loop_attr_offset.attr, &loop_attr_sizelimit.attr, &loop_attr_autoclear.attr, &loop_attr_partscan.attr, &loop_attr_dio.attr, NULL, }; static struct attribute_group loop_attribute_group = { .name = "loop", .attrs= loop_attrs, }; static void loop_sysfs_init(struct loop_device *lo) { lo->sysfs_inited = !sysfs_create_group(&disk_to_dev(lo->lo_disk)->kobj, &loop_attribute_group); } static void loop_sysfs_exit(struct loop_device *lo) { if (lo->sysfs_inited) sysfs_remove_group(&disk_to_dev(lo->lo_disk)->kobj, &loop_attribute_group); } static void loop_config_discard(struct loop_device *lo) { struct file *file = lo->lo_backing_file; struct inode *inode = file->f_mapping->host; struct request_queue *q = lo->lo_queue; u32 granularity, max_discard_sectors; /* * If the backing device is a block device, mirror its zeroing * capability. Set the discard sectors to the block device's zeroing * capabilities because loop discards result in blkdev_issue_zeroout(), * not blkdev_issue_discard(). This maintains consistent behavior with * file-backed loop devices: discarded regions read back as zero. */ if (S_ISBLK(inode->i_mode) && !lo->lo_encrypt_key_size) { struct request_queue *backingq; backingq = bdev_get_queue(inode->i_bdev); max_discard_sectors = backingq->limits.max_write_zeroes_sectors; granularity = backingq->limits.discard_granularity ?: queue_physical_block_size(backingq); /* * We use punch hole to reclaim the free space used by the * image a.k.a. discard. However we do not support discard if * encryption is enabled, because it may give an attacker * useful information. */ } else if (!file->f_op->fallocate || lo->lo_encrypt_key_size) { max_discard_sectors = 0; granularity = 0; } else { max_discard_sectors = UINT_MAX >> 9; granularity = inode->i_sb->s_blocksize; } if (max_discard_sectors) { q->limits.discard_granularity = granularity; blk_queue_max_discard_sectors(q, max_discard_sectors); blk_queue_max_write_zeroes_sectors(q, max_discard_sectors); blk_queue_flag_set(QUEUE_FLAG_DISCARD, q); } else { q->limits.discard_granularity = 0; blk_queue_max_discard_sectors(q, 0); blk_queue_max_write_zeroes_sectors(q, 0); blk_queue_flag_clear(QUEUE_FLAG_DISCARD, q); } q->limits.discard_alignment = 0; } static void loop_unprepare_queue(struct loop_device *lo) { kthread_flush_worker(&lo->worker); kthread_stop(lo->worker_task); } static int loop_kthread_worker_fn(void *worker_ptr) { current->flags |= PF_LOCAL_THROTTLE | PF_MEMALLOC_NOIO; return kthread_worker_fn(worker_ptr); } static int loop_prepare_queue(struct loop_device *lo) { kthread_init_worker(&lo->worker); lo->worker_task = kthread_run(loop_kthread_worker_fn, &lo->worker, "loop%d", lo->lo_number); if (IS_ERR(lo->worker_task)) return -ENOMEM; set_user_nice(lo->worker_task, MIN_NICE); return 0; } static void loop_update_rotational(struct loop_device *lo) { struct file *file = lo->lo_backing_file; struct inode *file_inode = file->f_mapping->host; struct block_device *file_bdev = file_inode->i_sb->s_bdev; struct request_queue *q = lo->lo_queue; bool nonrot = true; /* not all filesystems (e.g. tmpfs) have a sb->s_bdev */ if (file_bdev) nonrot = blk_queue_nonrot(bdev_get_queue(file_bdev)); if (nonrot) blk_queue_flag_set(QUEUE_FLAG_NONROT, q); else blk_queue_flag_clear(QUEUE_FLAG_NONROT, q); } static int loop_release_xfer(struct loop_device *lo) { int err = 0; struct loop_func_table *xfer = lo->lo_encryption; if (xfer) { if (xfer->release) err = xfer->release(lo); lo->transfer = NULL; lo->lo_encryption = NULL; module_put(xfer->owner); } return err; } static int loop_init_xfer(struct loop_device *lo, struct loop_func_table *xfer, const struct loop_info64 *i) { int err = 0; if (xfer) { struct module *owner = xfer->owner; if (!try_module_get(owner)) return -EINVAL; if (xfer->init) err = xfer->init(lo, i); if (err) module_put(owner); else lo->lo_encryption = xfer; } return err; } /** * loop_set_status_from_info - configure device from loop_info * @lo: struct loop_device to configure * @info: struct loop_info64 to configure the device with * * Configures the loop device parameters according to the passed * in loop_info64 configuration. */ static int loop_set_status_from_info(struct loop_device *lo, const struct loop_info64 *info) { int err; struct loop_func_table *xfer; kuid_t uid = current_uid(); if ((unsigned int) info->lo_encrypt_key_size > LO_KEY_SIZE) return -EINVAL; err = loop_release_xfer(lo); if (err) return err; if (info->lo_encrypt_type) { unsigned int type = info->lo_encrypt_type; if (type >= MAX_LO_CRYPT) return -EINVAL; xfer = xfer_funcs[type]; if (xfer == NULL) return -EINVAL; } else xfer = NULL; err = loop_init_xfer(lo, xfer, info); if (err) return err; lo->lo_offset = info->lo_offset; lo->lo_sizelimit = info->lo_sizelimit; memcpy(lo->lo_file_name, info->lo_file_name, LO_NAME_SIZE); memcpy(lo->lo_crypt_name, info->lo_crypt_name, LO_NAME_SIZE); lo->lo_file_name[LO_NAME_SIZE-1] = 0; lo->lo_crypt_name[LO_NAME_SIZE-1] = 0; if (!xfer) xfer = &none_funcs; lo->transfer = xfer->transfer; lo->ioctl = xfer->ioctl; lo->lo_flags = info->lo_flags; lo->lo_encrypt_key_size = info->lo_encrypt_key_size; lo->lo_init[0] = info->lo_init[0]; lo->lo_init[1] = info->lo_init[1]; if (info->lo_encrypt_key_size) { memcpy(lo->lo_encrypt_key, info->lo_encrypt_key, info->lo_encrypt_key_size); lo->lo_key_owner = uid; } return 0; } static int loop_configure(struct loop_device *lo, fmode_t mode, struct block_device *bdev, const struct loop_config *config) { struct file *file; struct inode *inode; struct address_space *mapping; struct block_device *claimed_bdev = NULL; int error; loff_t size; bool partscan; unsigned short bsize; /* This is safe, since we have a reference from open(). */ __module_get(THIS_MODULE); error = -EBADF; file = fget(config->fd); if (!file) goto out; /* * If we don't hold exclusive handle for the device, upgrade to it * here to avoid changing device under exclusive owner. */ if (!(mode & FMODE_EXCL)) { claimed_bdev = bdev->bd_contains; error = bd_prepare_to_claim(bdev, claimed_bdev, loop_configure); if (error) goto out_putf; } error = mutex_lock_killable(&loop_ctl_mutex); if (error) goto out_bdev; error = -EBUSY; if (lo->lo_state != Lo_unbound) goto out_unlock; error = loop_validate_file(file, bdev); if (error) goto out_unlock; mapping = file->f_mapping; inode = mapping->host; if ((config->info.lo_flags & ~LOOP_CONFIGURE_SETTABLE_FLAGS) != 0) { error = -EINVAL; goto out_unlock; } if (config->block_size) { error = loop_validate_block_size(config->block_size); if (error) goto out_unlock; } error = loop_set_status_from_info(lo, &config->info); if (error) goto out_unlock; if (!(file->f_mode & FMODE_WRITE) || !(mode & FMODE_WRITE) || !file->f_op->write_iter) lo->lo_flags |= LO_FLAGS_READ_ONLY; error = loop_prepare_queue(lo); if (error) goto out_unlock; set_device_ro(bdev, (lo->lo_flags & LO_FLAGS_READ_ONLY) != 0); lo->use_dio = lo->lo_flags & LO_FLAGS_DIRECT_IO; lo->lo_device = bdev; lo->lo_backing_file = file; lo->old_gfp_mask = mapping_gfp_mask(mapping); mapping_set_gfp_mask(mapping, lo->old_gfp_mask & ~(__GFP_IO|__GFP_FS)); if (!(lo->lo_flags & LO_FLAGS_READ_ONLY) && file->f_op->fsync) blk_queue_write_cache(lo->lo_queue, true, false); if (config->block_size) bsize = config->block_size; else if ((lo->lo_backing_file->f_flags & O_DIRECT) && inode->i_sb->s_bdev) /* In case of direct I/O, match underlying block size */ bsize = bdev_logical_block_size(inode->i_sb->s_bdev); else bsize = 512; blk_queue_logical_block_size(lo->lo_queue, bsize); blk_queue_physical_block_size(lo->lo_queue, bsize); blk_queue_io_min(lo->lo_queue, bsize); loop_config_discard(lo); loop_update_rotational(lo); loop_update_dio(lo); loop_sysfs_init(lo); size = get_loop_size(lo, file); loop_set_size(lo, size); set_blocksize(bdev, S_ISBLK(inode->i_mode) ? block_size(inode->i_bdev) : PAGE_SIZE); lo->lo_state = Lo_bound; if (part_shift) lo->lo_flags |= LO_FLAGS_PARTSCAN; partscan = lo->lo_flags & LO_FLAGS_PARTSCAN; if (partscan) lo->lo_disk->flags &= ~GENHD_FL_NO_PART_SCAN; /* Grab the block_device to prevent its destruction after we * put /dev/loopXX inode. Later in __loop_clr_fd() we bdput(bdev). */ bdgrab(bdev); mutex_unlock(&loop_ctl_mutex); if (partscan) loop_reread_partitions(lo, bdev); if (claimed_bdev) bd_abort_claiming(bdev, claimed_bdev, loop_configure); return 0; out_unlock: mutex_unlock(&loop_ctl_mutex); out_bdev: if (claimed_bdev) bd_abort_claiming(bdev, claimed_bdev, loop_configure); out_putf: fput(file); out: /* This is safe: open() is still holding a reference. */ module_put(THIS_MODULE); return error; } static int __loop_clr_fd(struct loop_device *lo, bool release) { struct file *filp = NULL; gfp_t gfp = lo->old_gfp_mask; struct block_device *bdev = lo->lo_device; int err = 0; bool partscan = false; int lo_number; mutex_lock(&loop_ctl_mutex); if (WARN_ON_ONCE(lo->lo_state != Lo_rundown)) { err = -ENXIO; goto out_unlock; } filp = lo->lo_backing_file; if (filp == NULL) { err = -EINVAL; goto out_unlock; } if (test_bit(QUEUE_FLAG_WC, &lo->lo_queue->queue_flags)) blk_queue_write_cache(lo->lo_queue, false, false); /* freeze request queue during the transition */ blk_mq_freeze_queue(lo->lo_queue); spin_lock_irq(&lo->lo_lock); lo->lo_backing_file = NULL; spin_unlock_irq(&lo->lo_lock); loop_release_xfer(lo); lo->transfer = NULL; lo->ioctl = NULL; lo->lo_device = NULL; lo->lo_encryption = NULL; lo->lo_offset = 0; lo->lo_sizelimit = 0; lo->lo_encrypt_key_size = 0; memset(lo->lo_encrypt_key, 0, LO_KEY_SIZE); memset(lo->lo_crypt_name, 0, LO_NAME_SIZE); memset(lo->lo_file_name, 0, LO_NAME_SIZE); blk_queue_logical_block_size(lo->lo_queue, 512); blk_queue_physical_block_size(lo->lo_queue, 512); blk_queue_io_min(lo->lo_queue, 512); if (bdev) { bdput(bdev); invalidate_bdev(bdev); bdev->bd_inode->i_mapping->wb_err = 0; } set_capacity(lo->lo_disk, 0); loop_sysfs_exit(lo); if (bdev) { bd_set_nr_sectors(bdev, 0); /* let user-space know about this change */ kobject_uevent(&disk_to_dev(bdev->bd_disk)->kobj, KOBJ_CHANGE); } mapping_set_gfp_mask(filp->f_mapping, gfp); /* This is safe: open() is still holding a reference. */ module_put(THIS_MODULE); blk_mq_unfreeze_queue(lo->lo_queue); partscan = lo->lo_flags & LO_FLAGS_PARTSCAN && bdev; lo_number = lo->lo_number; loop_unprepare_queue(lo); out_unlock: mutex_unlock(&loop_ctl_mutex); if (partscan) { /* * bd_mutex has been held already in release path, so don't * acquire it if this function is called in such case. * * If the reread partition isn't from release path, lo_refcnt * must be at least one and it can only become zero when the * current holder is released. */ if (!release) mutex_lock(&bdev->bd_mutex); err = bdev_disk_changed(bdev, false); if (!release) mutex_unlock(&bdev->bd_mutex); if (err) pr_warn("%s: partition scan of loop%d failed (rc=%d)\n", __func__, lo_number, err); /* Device is gone, no point in returning error */ err = 0; } /* * lo->lo_state is set to Lo_unbound here after above partscan has * finished. * * There cannot be anybody else entering __loop_clr_fd() as * lo->lo_backing_file is already cleared and Lo_rundown state * protects us from all the other places trying to change the 'lo' * device. */ mutex_lock(&loop_ctl_mutex); lo->lo_flags = 0; if (!part_shift) lo->lo_disk->flags |= GENHD_FL_NO_PART_SCAN; lo->lo_state = Lo_unbound; mutex_unlock(&loop_ctl_mutex); /* * Need not hold loop_ctl_mutex to fput backing file. * Calling fput holding loop_ctl_mutex triggers a circular * lock dependency possibility warning as fput can take * bd_mutex which is usually taken before loop_ctl_mutex. */ if (filp) fput(filp); return err; } static int loop_clr_fd(struct loop_device *lo) { int err; err = mutex_lock_killable(&loop_ctl_mutex); if (err) return err; if (lo->lo_state != Lo_bound) { mutex_unlock(&loop_ctl_mutex); return -ENXIO; } /* * If we've explicitly asked to tear down the loop device, * and it has an elevated reference count, set it for auto-teardown when * the last reference goes away. This stops $!~#$@ udev from * preventing teardown because it decided that it needs to run blkid on * the loopback device whenever they appear. xfstests is notorious for * failing tests because blkid via udev races with a losetup * <dev>/do something like mkfs/losetup -d <dev> causing the losetup -d * command to fail with EBUSY. */ if (atomic_read(&lo->lo_refcnt) > 1) { lo->lo_flags |= LO_FLAGS_AUTOCLEAR; mutex_unlock(&loop_ctl_mutex); return 0; } lo->lo_state = Lo_rundown; mutex_unlock(&loop_ctl_mutex); return __loop_clr_fd(lo, false); } static int loop_set_status(struct loop_device *lo, const struct loop_info64 *info) { int err; struct block_device *bdev; kuid_t uid = current_uid(); int prev_lo_flags; bool partscan = false; bool size_changed = false; err = mutex_lock_killable(&loop_ctl_mutex); if (err) return err; if (lo->lo_encrypt_key_size && !uid_eq(lo->lo_key_owner, uid) && !capable(CAP_SYS_ADMIN)) { err = -EPERM; goto out_unlock; } if (lo->lo_state != Lo_bound) { err = -ENXIO; goto out_unlock; } if (lo->lo_offset != info->lo_offset || lo->lo_sizelimit != info->lo_sizelimit) { size_changed = true; sync_blockdev(lo->lo_device); invalidate_bdev(lo->lo_device); } /* I/O need to be drained during transfer transition */ blk_mq_freeze_queue(lo->lo_queue); if (size_changed && lo->lo_device->bd_inode->i_mapping->nrpages) { /* If any pages were dirtied after invalidate_bdev(), try again */ err = -EAGAIN; pr_warn("%s: loop%d (%s) has still dirty pages (nrpages=%lu)\n", __func__, lo->lo_number, lo->lo_file_name, lo->lo_device->bd_inode->i_mapping->nrpages); goto out_unfreeze; } prev_lo_flags = lo->lo_flags; err = loop_set_status_from_info(lo, info); if (err) goto out_unfreeze; /* Mask out flags that can't be set using LOOP_SET_STATUS. */ lo->lo_flags &= LOOP_SET_STATUS_SETTABLE_FLAGS; /* For those flags, use the previous values instead */ lo->lo_flags |= prev_lo_flags & ~LOOP_SET_STATUS_SETTABLE_FLAGS; /* For flags that can't be cleared, use previous values too */ lo->lo_flags |= prev_lo_flags & ~LOOP_SET_STATUS_CLEARABLE_FLAGS; if (size_changed) { loff_t new_size = get_size(lo->lo_offset, lo->lo_sizelimit, lo->lo_backing_file); loop_set_size(lo, new_size); } loop_config_discard(lo); /* update dio if lo_offset or transfer is changed */ __loop_update_dio(lo, lo->use_dio); out_unfreeze: blk_mq_unfreeze_queue(lo->lo_queue); if (!err && (lo->lo_flags & LO_FLAGS_PARTSCAN) && !(prev_lo_flags & LO_FLAGS_PARTSCAN)) { lo->lo_disk->flags &= ~GENHD_FL_NO_PART_SCAN; bdev = lo->lo_device; partscan = true; } out_unlock: mutex_unlock(&loop_ctl_mutex); if (partscan) loop_reread_partitions(lo, bdev); return err; } static int loop_get_status(struct loop_device *lo, struct loop_info64 *info) { struct path path; struct kstat stat; int ret; ret = mutex_lock_killable(&loop_ctl_mutex); if (ret) return ret; if (lo->lo_state != Lo_bound) { mutex_unlock(&loop_ctl_mutex); return -ENXIO; } memset(info, 0, sizeof(*info)); info->lo_number = lo->lo_number; info->lo_offset = lo->lo_offset; info->lo_sizelimit = lo->lo_sizelimit; info->lo_flags = lo->lo_flags; memcpy(info->lo_file_name, lo->lo_file_name, LO_NAME_SIZE); memcpy(info->lo_crypt_name, lo->lo_crypt_name, LO_NAME_SIZE); info->lo_encrypt_type = lo->lo_encryption ? lo->lo_encryption->number : 0; if (lo->lo_encrypt_key_size && capable(CAP_SYS_ADMIN)) { info->lo_encrypt_key_size = lo->lo_encrypt_key_size; memcpy(info->lo_encrypt_key, lo->lo_encrypt_key, lo->lo_encrypt_key_size); } /* Drop loop_ctl_mutex while we call into the filesystem. */ path = lo->lo_backing_file->f_path; path_get(&path); mutex_unlock(&loop_ctl_mutex); ret = vfs_getattr(&path, &stat, STATX_INO, AT_STATX_SYNC_AS_STAT); if (!ret) { info->lo_device = huge_encode_dev(stat.dev); info->lo_inode = stat.ino; info->lo_rdevice = huge_encode_dev(stat.rdev); } path_put(&path); return ret; } static void loop_info64_from_old(const struct loop_info *info, struct loop_info64 *info64) { memset(info64, 0, sizeof(*info64)); info64->lo_number = info->lo_number; info64->lo_device = info->lo_device; info64->lo_inode = info->lo_inode; info64->lo_rdevice = info->lo_rdevice; info64->lo_offset = info->lo_offset; info64->lo_sizelimit = 0; info64->lo_encrypt_type = info->lo_encrypt_type; info64->lo_encrypt_key_size = info->lo_encrypt_key_size; info64->lo_flags = info->lo_flags; info64->lo_init[0] = info->lo_init[0]; info64->lo_init[1] = info->lo_init[1]; if (info->lo_encrypt_type == LO_CRYPT_CRYPTOAPI) memcpy(info64->lo_crypt_name, info->lo_name, LO_NAME_SIZE); else memcpy(info64->lo_file_name, info->lo_name, LO_NAME_SIZE); memcpy(info64->lo_encrypt_key, info->lo_encrypt_key, LO_KEY_SIZE); } static int loop_info64_to_old(const struct loop_info64 *info64, struct loop_info *info) { memset(info, 0, sizeof(*info)); info->lo_number = info64->lo_number; info->lo_device = info64->lo_device; info->lo_inode = info64->lo_inode; info->lo_rdevice = info64->lo_rdevice; info->lo_offset = info64->lo_offset; info->lo_encrypt_type = info64->lo_encrypt_type; info->lo_encrypt_key_size = info64->lo_encrypt_key_size; info->lo_flags = info64->lo_flags; info->lo_init[0] = info64->lo_init[0]; info->lo_init[1] = info64->lo_init[1]; if (info->lo_encrypt_type == LO_CRYPT_CRYPTOAPI) memcpy(info->lo_name, info64->lo_crypt_name, LO_NAME_SIZE); else memcpy(info->lo_name, info64->lo_file_name, LO_NAME_SIZE); memcpy(info->lo_encrypt_key, info64->lo_encrypt_key, LO_KEY_SIZE); /* error in case values were truncated */ if (info->lo_device != info64->lo_device || info->lo_rdevice != info64->lo_rdevice || info->lo_inode != info64->lo_inode || info->lo_offset != info64->lo_offset) return -EOVERFLOW; return 0; } static int loop_set_status_old(struct loop_device *lo, const struct loop_info __user *arg) { struct loop_info info; struct loop_info64 info64; if (copy_from_user(&info, arg, sizeof (struct loop_info))) return -EFAULT; loop_info64_from_old(&info, &info64); return loop_set_status(lo, &info64); } static int loop_set_status64(struct loop_device *lo, const struct loop_info64 __user *arg) { struct loop_info64 info64; if (copy_from_user(&info64, arg, sizeof (struct loop_info64))) return -EFAULT; return loop_set_status(lo, &info64); } static int loop_get_status_old(struct loop_device *lo, struct loop_info __user *arg) { struct loop_info info; struct loop_info64 info64; int err; if (!arg) return -EINVAL; err = loop_get_status(lo, &info64); if (!err) err = loop_info64_to_old(&info64, &info); if (!err && copy_to_user(arg, &info, sizeof(info))) err = -EFAULT; return err; } static int loop_get_status64(struct loop_device *lo, struct loop_info64 __user *arg) { struct loop_info64 info64; int err; if (!arg) return -EINVAL; err = loop_get_status(lo, &info64); if (!err && copy_to_user(arg, &info64, sizeof(info64))) err = -EFAULT; return err; } static int loop_set_capacity(struct loop_device *lo) { loff_t size; if (unlikely(lo->lo_state != Lo_bound)) return -ENXIO; size = get_loop_size(lo, lo->lo_backing_file); loop_set_size(lo, size); return 0; } static int loop_set_dio(struct loop_device *lo, unsigned long arg) { int error = -ENXIO; if (lo->lo_state != Lo_bound) goto out; __loop_update_dio(lo, !!arg); if (lo->use_dio == !!arg) return 0; error = -EINVAL; out: return error; } static int loop_set_block_size(struct loop_device *lo, unsigned long arg) { int err = 0; if (lo->lo_state != Lo_bound) return -ENXIO; err = loop_validate_block_size(arg); if (err) return err; if (lo->lo_queue->limits.logical_block_size == arg) return 0; sync_blockdev(lo->lo_device); invalidate_bdev(lo->lo_device); blk_mq_freeze_queue(lo->lo_queue); /* invalidate_bdev should have truncated all the pages */ if (lo->lo_device->bd_inode->i_mapping->nrpages) { err = -EAGAIN; pr_warn("%s: loop%d (%s) has still dirty pages (nrpages=%lu)\n", __func__, lo->lo_number, lo->lo_file_name, lo->lo_device->bd_inode->i_mapping->nrpages); goto out_unfreeze; } blk_queue_logical_block_size(lo->lo_queue, arg); blk_queue_physical_block_size(lo->lo_queue, arg); blk_queue_io_min(lo->lo_queue, arg); loop_update_dio(lo); out_unfreeze: blk_mq_unfreeze_queue(lo->lo_queue); return err; } static int lo_simple_ioctl(struct loop_device *lo, unsigned int cmd, unsigned long arg) { int err; err = mutex_lock_killable(&loop_ctl_mutex); if (err) return err; switch (cmd) { case LOOP_SET_CAPACITY: err = loop_set_capacity(lo); break; case LOOP_SET_DIRECT_IO: err = loop_set_dio(lo, arg); break; case LOOP_SET_BLOCK_SIZE: err = loop_set_block_size(lo, arg); break; default: err = lo->ioctl ? lo->ioctl(lo, cmd, arg) : -EINVAL; } mutex_unlock(&loop_ctl_mutex); return err; } static int lo_ioctl(struct block_device *bdev, fmode_t mode, unsigned int cmd, unsigned long arg) { struct loop_device *lo = bdev->bd_disk->private_data; void __user *argp = (void __user *) arg; int err; switch (cmd) { case LOOP_SET_FD: { /* * Legacy case - pass in a zeroed out struct loop_config with * only the file descriptor set , which corresponds with the * default parameters we'd have used otherwise. */ struct loop_config config; memset(&config, 0, sizeof(config)); config.fd = arg; return loop_configure(lo, mode, bdev, &config); } case LOOP_CONFIGURE: { struct loop_config config; if (copy_from_user(&config, argp, sizeof(config))) return -EFAULT; return loop_configure(lo, mode, bdev, &config); } case LOOP_CHANGE_FD: return loop_change_fd(lo, bdev, arg); case LOOP_CLR_FD: return loop_clr_fd(lo); case LOOP_SET_STATUS: err = -EPERM; if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN)) { err = loop_set_status_old(lo, argp); } break; case LOOP_GET_STATUS: return loop_get_status_old(lo, argp); case LOOP_SET_STATUS64: err = -EPERM; if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN)) { err = loop_set_status64(lo, argp); } break; case LOOP_GET_STATUS64: return loop_get_status64(lo, argp); case LOOP_SET_CAPACITY: case LOOP_SET_DIRECT_IO: case LOOP_SET_BLOCK_SIZE: if (!(mode & FMODE_WRITE) && !capable(CAP_SYS_ADMIN)) return -EPERM; fallthrough; default: err = lo_simple_ioctl(lo, cmd, arg); break; } return err; } #ifdef CONFIG_COMPAT struct compat_loop_info { compat_int_t lo_number; /* ioctl r/o */ compat_dev_t lo_device; /* ioctl r/o */ compat_ulong_t lo_inode; /* ioctl r/o */ compat_dev_t lo_rdevice; /* ioctl r/o */ compat_int_t lo_offset; compat_int_t lo_encrypt_type; compat_int_t lo_encrypt_key_size; /* ioctl w/o */ compat_int_t lo_flags; /* ioctl r/o */ char lo_name[LO_NAME_SIZE]; unsigned char lo_encrypt_key[LO_KEY_SIZE]; /* ioctl w/o */ compat_ulong_t lo_init[2]; char reserved[4]; }; /* * Transfer 32-bit compatibility structure in userspace to 64-bit loop info * - noinlined to reduce stack space usage in main part of driver */ static noinline int loop_info64_from_compat(const struct compat_loop_info __user *arg, struct loop_info64 *info64) { struct compat_loop_info info; if (copy_from_user(&info, arg, sizeof(info))) return -EFAULT; memset(info64, 0, sizeof(*info64)); info64->lo_number = info.lo_number; info64->lo_device = info.lo_device; info64->lo_inode = info.lo_inode; info64->lo_rdevice = info.lo_rdevice; info64->lo_offset = info.lo_offset; info64->lo_sizelimit = 0; info64->lo_encrypt_type = info.lo_encrypt_type; info64->lo_encrypt_key_size = info.lo_encrypt_key_size; info64->lo_flags = info.lo_flags; info64->lo_init[0] = info.lo_init[0]; info64->lo_init[1] = info.lo_init[1]; if (info.lo_encrypt_type == LO_CRYPT_CRYPTOAPI) memcpy(info64->lo_crypt_name, info.lo_name, LO_NAME_SIZE); else memcpy(info64->lo_file_name, info.lo_name, LO_NAME_SIZE); memcpy(info64->lo_encrypt_key, info.lo_encrypt_key, LO_KEY_SIZE); return 0; } /* * Transfer 64-bit loop info to 32-bit compatibility structure in userspace * - noinlined to reduce stack space usage in main part of driver */ static noinline int loop_info64_to_compat(const struct loop_info64 *info64, struct compat_loop_info __user *arg) { struct compat_loop_info info; memset(&info, 0, sizeof(info)); info.lo_number = info64->lo_number; info.lo_device = info64->lo_device; info.lo_inode = info64->lo_inode; info.lo_rdevice = info64->lo_rdevice; info.lo_offset = info64->lo_offset; info.lo_encrypt_type = info64->lo_encrypt_type; info.lo_encrypt_key_size = info64->lo_encrypt_key_size; info.lo_flags = info64->lo_flags; info.lo_init[0] = info64->lo_init[0]; info.lo_init[1] = info64->lo_init[1]; if (info.lo_encrypt_type == LO_CRYPT_CRYPTOAPI) memcpy(info.lo_name, info64->lo_crypt_name, LO_NAME_SIZE); else memcpy(info.lo_name, info64->lo_file_name, LO_NAME_SIZE); memcpy(info.lo_encrypt_key, info64->lo_encrypt_key, LO_KEY_SIZE); /* error in case values were truncated */ if (info.lo_device != info64->lo_device || info.lo_rdevice != info64->lo_rdevice || info.lo_inode != info64->lo_inode || info.lo_offset != info64->lo_offset || info.lo_init[0] != info64->lo_init[0] || info.lo_init[1] != info64->lo_init[1]) return -EOVERFLOW; if (copy_to_user(arg, &info, sizeof(info))) return -EFAULT; return 0; } static int loop_set_status_compat(struct loop_device *lo, const struct compat_loop_info __user *arg) { struct loop_info64 info64; int ret; ret = loop_info64_from_compat(arg, &info64); if (ret < 0) return ret; return loop_set_status(lo, &info64); } static int loop_get_status_compat(struct loop_device *lo, struct compat_loop_info __user *arg) { struct loop_info64 info64; int err; if (!arg) return -EINVAL; err = loop_get_status(lo, &info64); if (!err) err = loop_info64_to_compat(&info64, arg); return err; } static int lo_compat_ioctl(struct block_device *bdev, fmode_t mode, unsigned int cmd, unsigned long arg) { struct loop_device *lo = bdev->bd_disk->private_data; int err; switch(cmd) { case LOOP_SET_STATUS: err = loop_set_status_compat(lo, (const struct compat_loop_info __user *)arg); break; case LOOP_GET_STATUS: err = loop_get_status_compat(lo, (struct compat_loop_info __user *)arg); break; case LOOP_SET_CAPACITY: case LOOP_CLR_FD: case LOOP_GET_STATUS64: case LOOP_SET_STATUS64: case LOOP_CONFIGURE: arg = (unsigned long) compat_ptr(arg); fallthrough; case LOOP_SET_FD: case LOOP_CHANGE_FD: case LOOP_SET_BLOCK_SIZE: case LOOP_SET_DIRECT_IO: err = lo_ioctl(bdev, mode, cmd, arg); break; default: err = -ENOIOCTLCMD; break; } return err; } #endif static int lo_open(struct block_device *bdev, fmode_t mode) { struct loop_device *lo; int err; err = mutex_lock_killable(&loop_ctl_mutex); if (err) return err; lo = bdev->bd_disk->private_data; if (!lo) { err = -ENXIO; goto out; } atomic_inc(&lo->lo_refcnt); out: mutex_unlock(&loop_ctl_mutex); return err; } static void lo_release(struct gendisk *disk, fmode_t mode) { struct loop_device *lo; mutex_lock(&loop_ctl_mutex); lo = disk->private_data; if (atomic_dec_return(&lo->lo_refcnt)) goto out_unlock; if (lo->lo_flags & LO_FLAGS_AUTOCLEAR) { if (lo->lo_state != Lo_bound) goto out_unlock; lo->lo_state = Lo_rundown; mutex_unlock(&loop_ctl_mutex); /* * In autoclear mode, stop the loop thread * and remove configuration after last close. */ __loop_clr_fd(lo, true); return; } else if (lo->lo_state == Lo_bound) { /* * Otherwise keep thread (if running) and config, * but flush possible ongoing bios in thread. */ blk_mq_freeze_queue(lo->lo_queue); blk_mq_unfreeze_queue(lo->lo_queue); } out_unlock: mutex_unlock(&loop_ctl_mutex); } static const struct block_device_operations lo_fops = { .owner = THIS_MODULE, .open = lo_open, .release = lo_release, .ioctl = lo_ioctl, #ifdef CONFIG_COMPAT .compat_ioctl = lo_compat_ioctl, #endif }; /* * And now the modules code and kernel interface. */ static int max_loop; module_param(max_loop, int, 0444); MODULE_PARM_DESC(max_loop, "Maximum number of loop devices"); module_param(max_part, int, 0444); MODULE_PARM_DESC(max_part, "Maximum number of partitions per loop device"); MODULE_LICENSE("GPL"); MODULE_ALIAS_BLOCKDEV_MAJOR(LOOP_MAJOR); int loop_register_transfer(struct loop_func_table *funcs) { unsigned int n = funcs->number; if (n >= MAX_LO_CRYPT || xfer_funcs[n]) return -EINVAL; xfer_funcs[n] = funcs; return 0; } static int unregister_transfer_cb(int id, void *ptr, void *data) { struct loop_device *lo = ptr; struct loop_func_table *xfer = data; mutex_lock(&loop_ctl_mutex); if (lo->lo_encryption == xfer) loop_release_xfer(lo); mutex_unlock(&loop_ctl_mutex); return 0; } int loop_unregister_transfer(int number) { unsigned int n = number; struct loop_func_table *xfer; if (n == 0 || n >= MAX_LO_CRYPT || (xfer = xfer_funcs[n]) == NULL) return -EINVAL; xfer_funcs[n] = NULL; idr_for_each(&loop_index_idr, &unregister_transfer_cb, xfer); return 0; } EXPORT_SYMBOL(loop_register_transfer); EXPORT_SYMBOL(loop_unregister_transfer); static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx *hctx, const struct blk_mq_queue_data *bd) { struct request *rq = bd->rq; struct loop_cmd *cmd = blk_mq_rq_to_pdu(rq); struct loop_device *lo = rq->q->queuedata; blk_mq_start_request(rq); if (lo->lo_state != Lo_bound) return BLK_STS_IOERR; switch (req_op(rq)) { case REQ_OP_FLUSH: case REQ_OP_DISCARD: case REQ_OP_WRITE_ZEROES: cmd->use_aio = false; break; default: cmd->use_aio = lo->use_dio; break; } /* always use the first bio's css */ #ifdef CONFIG_BLK_CGROUP if (cmd->use_aio && rq->bio && rq->bio->bi_blkg) { cmd->css = &bio_blkcg(rq->bio)->css; css_get(cmd->css); } else #endif cmd->css = NULL; kthread_queue_work(&lo->worker, &cmd->work); return BLK_STS_OK; } static void loop_handle_cmd(struct loop_cmd *cmd) { struct request *rq = blk_mq_rq_from_pdu(cmd); const bool write = op_is_write(req_op(rq)); struct loop_device *lo = rq->q->queuedata; int ret = 0; if (write && (lo->lo_flags & LO_FLAGS_READ_ONLY)) { ret = -EIO; goto failed; } ret = do_req_filebacked(lo, rq); failed: /* complete non-aio request */ if (!cmd->use_aio || ret) { if (ret == -EOPNOTSUPP) cmd->ret = ret; else cmd->ret = ret ? -EIO : 0; if (likely(!blk_should_fake_timeout(rq->q))) blk_mq_complete_request(rq); } } static void loop_queue_work(struct kthread_work *work) { struct loop_cmd *cmd = container_of(work, struct loop_cmd, work); loop_handle_cmd(cmd); } static int loop_init_request(struct blk_mq_tag_set *set, struct request *rq, unsigned int hctx_idx, unsigned int numa_node) { struct loop_cmd *cmd = blk_mq_rq_to_pdu(rq); kthread_init_work(&cmd->work, loop_queue_work); return 0; } static const struct blk_mq_ops loop_mq_ops = { .queue_rq = loop_queue_rq, .init_request = loop_init_request, .complete = lo_complete_rq, }; static int loop_add(struct loop_device **l, int i) { struct loop_device *lo; struct gendisk *disk; int err; err = -ENOMEM; lo = kzalloc(sizeof(*lo), GFP_KERNEL); if (!lo) goto out; lo->lo_state = Lo_unbound; /* allocate id, if @id >= 0, we're requesting that specific id */ if (i >= 0) { err = idr_alloc(&loop_index_idr, lo, i, i + 1, GFP_KERNEL); if (err == -ENOSPC) err = -EEXIST; } else { err = idr_alloc(&loop_index_idr, lo, 0, 0, GFP_KERNEL); } if (err < 0) goto out_free_dev; i = err; err = -ENOMEM; lo->tag_set.ops = &loop_mq_ops; lo->tag_set.nr_hw_queues = 1; lo->tag_set.queue_depth = 128; lo->tag_set.numa_node = NUMA_NO_NODE; lo->tag_set.cmd_size = sizeof(struct loop_cmd); lo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_STACKING; lo->tag_set.driver_data = lo; err = blk_mq_alloc_tag_set(&lo->tag_set); if (err) goto out_free_idr; lo->lo_queue = blk_mq_init_queue(&lo->tag_set); if (IS_ERR(lo->lo_queue)) { err = PTR_ERR(lo->lo_queue); goto out_cleanup_tags; } lo->lo_queue->queuedata = lo; blk_queue_max_hw_sectors(lo->lo_queue, BLK_DEF_MAX_SECTORS); /* * By default, we do buffer IO, so it doesn't make sense to enable * merge because the I/O submitted to backing file is handled page by * page. For directio mode, merge does help to dispatch bigger request * to underlayer disk. We will enable merge once directio is enabled. */ blk_queue_flag_set(QUEUE_FLAG_NOMERGES, lo->lo_queue); err = -ENOMEM; disk = lo->lo_disk = alloc_disk(1 << part_shift); if (!disk) goto out_free_queue; /* * Disable partition scanning by default. The in-kernel partition * scanning can be requested individually per-device during its * setup. Userspace can always add and remove partitions from all * devices. The needed partition minors are allocated from the * extended minor space, the main loop device numbers will continue * to match the loop minors, regardless of the number of partitions * used. * * If max_part is given, partition scanning is globally enabled for * all loop devices. The minors for the main loop devices will be * multiples of max_part. * * Note: Global-for-all-devices, set-only-at-init, read-only module * parameteters like 'max_loop' and 'max_part' make things needlessly * complicated, are too static, inflexible and may surprise * userspace tools. Parameters like this in general should be avoided. */ if (!part_shift) disk->flags |= GENHD_FL_NO_PART_SCAN; disk->flags |= GENHD_FL_EXT_DEVT; atomic_set(&lo->lo_refcnt, 0); lo->lo_number = i; spin_lock_init(&lo->lo_lock); disk->major = LOOP_MAJOR; disk->first_minor = i << part_shift; disk->fops = &lo_fops; disk->private_data = lo; disk->queue = lo->lo_queue; sprintf(disk->disk_name, "loop%d", i); add_disk(disk); *l = lo; return lo->lo_number; out_free_queue: blk_cleanup_queue(lo->lo_queue); out_cleanup_tags: blk_mq_free_tag_set(&lo->tag_set); out_free_idr: idr_remove(&loop_index_idr, i); out_free_dev: kfree(lo); out: return err; } static void loop_remove(struct loop_device *lo) { del_gendisk(lo->lo_disk); blk_cleanup_queue(lo->lo_queue); blk_mq_free_tag_set(&lo->tag_set); put_disk(lo->lo_disk); kfree(lo); } static int find_free_cb(int id, void *ptr, void *data) { struct loop_device *lo = ptr; struct loop_device **l = data; if (lo->lo_state == Lo_unbound) { *l = lo; return 1; } return 0; } static int loop_lookup(struct loop_device **l, int i) { struct loop_device *lo; int ret = -ENODEV; if (i < 0) { int err; err = idr_for_each(&loop_index_idr, &find_free_cb, &lo); if (err == 1) { *l = lo; ret = lo->lo_number; } goto out; } /* lookup and return a specific i */ lo = idr_find(&loop_index_idr, i); if (lo) { *l = lo; ret = lo->lo_number; } out: return ret; } static struct kobject *loop_probe(dev_t dev, int *part, void *data) { struct loop_device *lo; struct kobject *kobj; int err; mutex_lock(&loop_ctl_mutex); err = loop_lookup(&lo, MINOR(dev) >> part_shift); if (err < 0) err = loop_add(&lo, MINOR(dev) >> part_shift); if (err < 0) kobj = NULL; else kobj = get_disk_and_module(lo->lo_disk); mutex_unlock(&loop_ctl_mutex); *part = 0; return kobj; } static long loop_control_ioctl(struct file *file, unsigned int cmd, unsigned long parm) { struct loop_device *lo; int ret; ret = mutex_lock_killable(&loop_ctl_mutex); if (ret) return ret; ret = -ENOSYS; switch (cmd) { case LOOP_CTL_ADD: ret = loop_lookup(&lo, parm); if (ret >= 0) { ret = -EEXIST; break; } ret = loop_add(&lo, parm); break; case LOOP_CTL_REMOVE: ret = loop_lookup(&lo, parm); if (ret < 0) break; if (lo->lo_state != Lo_unbound) { ret = -EBUSY; break; } if (atomic_read(&lo->lo_refcnt) > 0) { ret = -EBUSY; break; } lo->lo_disk->private_data = NULL; idr_remove(&loop_index_idr, lo->lo_number); loop_remove(lo); break; case LOOP_CTL_GET_FREE: ret = loop_lookup(&lo, -1); if (ret >= 0) break; ret = loop_add(&lo, -1); } mutex_unlock(&loop_ctl_mutex); return ret; } static const struct file_operations loop_ctl_fops = { .open = nonseekable_open, .unlocked_ioctl = loop_control_ioctl, .compat_ioctl = loop_control_ioctl, .owner = THIS_MODULE, .llseek = noop_llseek, }; static struct miscdevice loop_misc = { .minor = LOOP_CTRL_MINOR, .name = "loop-control", .fops = &loop_ctl_fops, }; MODULE_ALIAS_MISCDEV(LOOP_CTRL_MINOR); MODULE_ALIAS("devname:loop-control"); static int __init loop_init(void) { int i, nr; unsigned long range; struct loop_device *lo; int err; part_shift = 0; if (max_part > 0) { part_shift = fls(max_part); /* * Adjust max_part according to part_shift as it is exported * to user space so that user can decide correct minor number * if [s]he want to create more devices. * * Note that -1 is required because partition 0 is reserved * for the whole disk. */ max_part = (1UL << part_shift) - 1; } if ((1UL << part_shift) > DISK_MAX_PARTS) { err = -EINVAL; goto err_out; } if (max_loop > 1UL << (MINORBITS - part_shift)) { err = -EINVAL; goto err_out; } /* * If max_loop is specified, create that many devices upfront. * This also becomes a hard limit. If max_loop is not specified, * create CONFIG_BLK_DEV_LOOP_MIN_COUNT loop devices at module * init time. Loop devices can be requested on-demand with the * /dev/loop-control interface, or be instantiated by accessing * a 'dead' device node. */ if (max_loop) { nr = max_loop; range = max_loop << part_shift; } else { nr = CONFIG_BLK_DEV_LOOP_MIN_COUNT; range = 1UL << MINORBITS; } err = misc_register(&loop_misc); if (err < 0) goto err_out; if (register_blkdev(LOOP_MAJOR, "loop")) { err = -EIO; goto misc_out; } blk_register_region(MKDEV(LOOP_MAJOR, 0), range, THIS_MODULE, loop_probe, NULL, NULL); /* pre-create number of devices given by config or max_loop */ mutex_lock(&loop_ctl_mutex); for (i = 0; i < nr; i++) loop_add(&lo, i); mutex_unlock(&loop_ctl_mutex); printk(KERN_INFO "loop: module loaded\n"); return 0; misc_out: misc_deregister(&loop_misc); err_out: return err; } static int loop_exit_cb(int id, void *ptr, void *data) { struct loop_device *lo = ptr; loop_remove(lo); return 0; } static void __exit loop_exit(void) { unsigned long range; range = max_loop ? max_loop << part_shift : 1UL << MINORBITS; mutex_lock(&loop_ctl_mutex); idr_for_each(&loop_index_idr, &loop_exit_cb, NULL); idr_destroy(&loop_index_idr); blk_unregister_region(MKDEV(LOOP_MAJOR, 0), range); unregister_blkdev(LOOP_MAJOR, "loop"); misc_deregister(&loop_misc); mutex_unlock(&loop_ctl_mutex); } module_init(loop_init); module_exit(loop_exit); #ifndef MODULE static int __init max_loop_setup(char *str) { max_loop = simple_strtol(str, NULL, 0); return 1; } __setup("max_loop=", max_loop_setup); #endif
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 /* SPDX-License-Identifier: GPL-2.0 */ #ifndef _BLOCK_BLK_PM_H_ #define _BLOCK_BLK_PM_H_ #include <linux/pm_runtime.h> #ifdef CONFIG_PM static inline int blk_pm_resume_queue(const bool pm, struct request_queue *q) { if (!q->dev || !blk_queue_pm_only(q)) return 1; /* Nothing to do */ if (pm && q->rpm_status != RPM_SUSPENDED) return 1; /* Request allowed */ pm_request_resume(q->dev); return 0; } static inline void blk_pm_mark_last_busy(struct request *rq) { if (rq->q->dev && !(rq->rq_flags & RQF_PM)) pm_runtime_mark_last_busy(rq->q->dev); } static inline void blk_pm_requeue_request(struct request *rq) { lockdep_assert_held(&rq->q->queue_lock); if (rq->q->dev && !(rq->rq_flags & RQF_PM)) rq->q->nr_pending--; } static inline void blk_pm_add_request(struct request_queue *q, struct request *rq) { lockdep_assert_held(&q->queue_lock); if (q->dev && !(rq->rq_flags & RQF_PM)) q->nr_pending++; } static inline void blk_pm_put_request(struct request *rq) { lockdep_assert_held(&rq->q->queue_lock); if (rq->q->dev && !(rq->rq_flags & RQF_PM)) --rq->q->nr_pending; } #else static inline int blk_pm_resume_queue(const bool pm, struct request_queue *q) { return 1; } static inline void blk_pm_mark_last_busy(struct request *rq) { } static inline void blk_pm_requeue_request(struct request *rq) { } static inline void blk_pm_add_request(struct request_queue *q, struct request *rq) { } static inline void blk_pm_put_request(struct request *rq) { } #endif #endif /* _BLOCK_BLK_PM_H_ */
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 4